Command-line interface

commonnexus comes with a shell command of the same name, commonnexus. It is a multi-command CLI (a “git-like multi-tool command”), i.e. the actual functionality to manipulate NEXUS files is implemented as sub-commands. To get an overview and a list of sub-commands, run

$ commonnexus -h
usage: commonnexus [-h] [--log-level LOG_LEVEL] COMMAND ...

commonnexus 1.9.3.dev0 is a set of commands to manipulate of files in the
NEXUS file format.

options:
  -h, --help            show this help message and exit
  --log-level LOG_LEVEL
                        log level [ERROR 40|WARNING 30|INFO 20|DEBUG 10]
                        (default: 20)

available commands:
  Run "commonnexus COMAMND -h" to get help for a specific command.

  COMMAND
    characters          Manipulate the CHARACTERS (or DATA) block of a NEXUS
                        file.
    combine             Combine data from multiple NEXUS files and put it in a
                        new one.
    help                Get help on subcommands.
    normalise           Normalise a NEXUS file.
    split               Split a Mesquite multi-block NEXUS into individual
                        NEXUS files per CHARACTERS/TREES block.
    taxa                Manipulate the list of TAXA used in a NEXUS file.
    trees               Manipulate a TREES block in a NEXUS file.

See https://github.com/dlce-eva/commonnexus for details.

Most commands can read input from stdin and print results to stdout. Thus, these commands can easily be chained together with other shell commands:

$ echo "#nexus begin trees; tree 1 = ((a,b)c); end;" | commonnexus normalise - | grep TREE | grep -v TREES
TREE 1 = ((a,b)c);

Warning

While commonnexus.Nexus can read multiple blocks with the same name just fine, most of the commands listed below assume just one block per block type in their input (i.e. only act on the first occurrence of each block type).

In the following we describe the available sub-commands.

commonnexus normalise

Arguably the most important sub-command is normalise, because it removes quite a few complexities of the NEXUS format (e.g. different TRIANGLE options for DISTANCES, or EQUATE mappings for CHARACTERS), and thus makes downstream NEXUS reading a lot more reliable.

$ commonnexus normalise -h
usage: commonnexus normalise [-h] [--strip-comments] nexus

Normalise a NEXUS file.

Normalisation includes

 - converting CHARACTERS/DATA matrices to non-transposed, non-interleaved representation with
   taxon labels (and resolved EQUATEs), extracting taxon labels into a TAXA block;
 - converting a DISTANCES matrix to non-interleaved matrices with diagonal and both triangles
   and taxon labels;
 - translating all TREEs in a TREES block (such that the TRANSLATE command becomes superfluous).

In addition, after normalisation, the following assumptions hold:

- All commands start on a new line.
- All command names (**not** block names) are in uppercase with no "in-name-comment",
  like "MA[c]TRiX"
- The ";" terminating MATRIX commands is on a separate line, allowing more simplistic parsing
  of matrix rows.

positional arguments:
  nexus

options:
  -h, --help        show this help message and exit
  --strip-comments  Remove non-command comments. (default: False)

For examples of running commonnexus normalise, refer to the documentation of the underlying function commonnexus.tools.normalise.normalise().

Normalising CHARACTERS

$ commonnexus normalise '#nexus begin d[c]ata; dimensions nchar=3; format missing=x nolabels; matrix x01 100 010; end;'
#NEXUS
BEGIN TAXA;
DIMENSIONS NTAX=3;
TAXLABELS 1 2 3;
END;
BEGIN DATA;
DIMENSIONS NCHAR=3;
FORMAT DATATYPE=STANDARD MISSING=? GAP=- SYMBOLS="01";
MATRIX 
1 ?01
2 100
3 010
;
END;
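Because normalised output puts every command on a new line and the ";" terminating MATRIX on its own line, matrix rows can be read with a very small line-based reader. The following is a minimal sketch in plain Python, not part of commonnexus; the function name read_matrix_rows is hypothetical, and the sketch assumes taxon labels contain no whitespace.

```python
def read_matrix_rows(normalised_text):
    """Collect (taxon, states) pairs from the MATRIX command of normalised NEXUS text."""
    rows, in_matrix = [], False
    for line in normalised_text.splitlines():
        line = line.strip()
        if not in_matrix:
            # After normalisation, command names are uppercase and start a line.
            in_matrix = line == "MATRIX"
        elif line == ";":
            # The ";" terminating MATRIX sits on a line of its own.
            break
        elif line:
            # Assumption: labels contain no whitespace (no quoted labels).
            taxon, states = line.split(None, 1)
            rows.append((taxon, states))
    return rows


NORMALISED = """#NEXUS
BEGIN DATA;
DIMENSIONS NCHAR=3;
FORMAT DATATYPE=STANDARD MISSING=? GAP=- SYMBOLS="01";
MATRIX
1 ?01
2 100
3 010
;
END;"""

print(read_matrix_rows(NORMALISED))
```

Such a reader would not work on arbitrary NEXUS input (interleaved or transposed matrices, comments inside names), which is the reason to normalise first.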

Normalising DISTANCES

$ commonnexus normalise '#nexus begin distances; dimensions ntax=3; format missing=x nodiagonal; matrix t1 t2 x t3 1.0 2.1; end;'
#NEXUS
BEGIN TAXA;
DIMENSIONS NTAX=3;
TAXLABELS t1 t2 t3;
END;
BEGIN DISTANCES;
DIMENSIONS NTAX=3;
FORMAT TRIANGLE=BOTH MISSING=?;
MATRIX 
t1 0 ? 1.0
t2 ? 0 2.1
t3 1.0 2.1 0
;
END;
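To illustrate what the conversion to TRIANGLE=BOTH involves, here is a minimal sketch in plain Python that expands a lower triangle without diagonal into a full symmetric matrix. It is not the commonnexus implementation; the function name expand_lower_triangle and the input layout (one list of values per taxon) are assumptions made for the example.

```python
def expand_lower_triangle(rows, missing="x"):
    """Expand a strict lower triangle into a full matrix with diagonal.

    rows: list of (label, values); row i holds the i distances to the
    preceding taxa, so the first row is empty.
    """
    n = len(rows)
    # Start with zeros on the diagonal and placeholders elsewhere.
    full = [["0" if i == j else None for j in range(n)] for i in range(n)]
    for i, (_, values) in enumerate(rows):
        for j, value in enumerate(values):
            # Map the input's missing symbol to "?" as in the output above.
            cell = "?" if value == missing else value
            full[i][j] = cell
            full[j][i] = cell  # mirror into the upper triangle
    return [(label, full[i]) for i, (label, _) in enumerate(rows)]


triangle = [("t1", []), ("t2", ["x"]), ("t3", ["1.0", "2.1"])]
for label, values in expand_lower_triangle(triangle):
    print(label, " ".join(values))
```

With the input from the example above, the printed rows correspond to the MATRIX rows of the normalised DISTANCES block.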

Normalising TREES

$ commonnexus normalise '#nexus begin trees; translate a t1, b t2, c t3; tree 1 = ((a,b)c); end;'
#NEXUS
BEGIN TREES;
TREE 1 = ((t1,t2)t3);
END;
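The effect of resolving a TRANSLATE mapping can be sketched with a naive token replacement over the Newick string. This is only an illustration in plain Python, not how commonnexus parses trees: the helper translate_newick is hypothetical, and it ignores quoted labels and comments.

```python
import re

# A token is a run of characters that are not Newick punctuation or whitespace,
# optionally preceded by ":" (which marks a branch length).
TOKEN = re.compile(r"(:?)([^\s(),:;\[\]]+)")


def translate_newick(newick, mapping):
    """Replace node labels found in mapping; leave branch lengths untouched."""
    def replace(match):
        colon, token = match.groups()
        if colon:
            return match.group(0)  # a branch length, not a label
        return mapping.get(token, token)

    return TOKEN.sub(replace, newick)


mapping = {"a": "t1", "b": "t2", "c": "t3"}
print(translate_newick("((a,b)c);", mapping))
```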

commonnexus combine

Combining data from multiple NEXUS files into a single one can be useful, e.g. to keep the data and the resulting trees of a phylogenetic analysis in a single file, or to aggregate character data for the same set of taxa.

$ commonnexus combine -h
usage: commonnexus combine [-h] [--drop-unsupported] nexus [nexus ...]

Combine data from multiple NEXUS files and put it in a new one.

The following blocks can be handled:

 - TAXA: Taxa are identified across NEXUS files based on label (not number).
 - CHARACTERS/DATA: Characters are aggregated across NEXUS files (with character labels prefixed,
   for disambiguation).
 - TREES: Trees are (translated and) aggregated across NEXUS files.

positional arguments:
  nexus               NEXUS content specified as file path or string or "-" to
                      read from stdin. Note that "-" can only be used once.
                      Content from a string or stdin will be split into
                      individual NEXUS "files" using the "#NEXUS" token as
                      separator.

options:
  -h, --help          show this help message and exit
  --drop-unsupported  Drop NEXUS blocks that cannot be combined. (default:
                      False)

Combining CHARACTERS blocks

$ cat characters.nex | commonnexus combine - characters.nex
#NEXUS
BEGIN TAXA;
DIMENSIONS NTAX=5;
TAXLABELS t1 t2 t3 t4 t5;
END;
BEGIN CHARACTERS;
DIMENSIONS NCHAR=20;
FORMAT DATATYPE=STANDARD MISSING=? GAP=- SYMBOLS="01";
CHARSTATELABELS 
    1 1.1, 
    2 1.2, 
    3 1.3, 
    4 1.4, 
    5 1.5, 
    6 1.6, 
    7 1.7, 
    8 1.8, 
    9 1.9, 
    10 1.10, 
    11 2.1, 
    12 2.2, 
    13 2.3, 
    14 2.4, 
    15 2.5, 
    16 2.6, 
    17 2.7, 
    18 2.8, 
    19 2.9, 
    20 2.10;
MATRIX 
t1 10010100001001010000
t2 01010001000101000100
t3 00111010100011101010
t4 00011000010001100001
t5 00011000010001100001
;
END;
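The aggregation shown above (taxa matched by label, character labels prefixed with the number of the source file) can be sketched in plain Python as follows. This is an illustration under assumptions, not the commonnexus implementation: the function combine_matrices is hypothetical, each input matrix is assumed to be non-empty, and filling rows for taxa absent from a file with the missing symbol is an assumption of this sketch.

```python
def combine_matrices(matrices, missing="?"):
    """Concatenate character matrices, matching taxa by label.

    matrices: list of dicts mapping taxon label -> string of states.
    Returns (character_labels, rows).
    """
    # Collect taxa in order of first appearance across all inputs.
    taxa = []
    for matrix in matrices:
        for taxon in matrix:
            if taxon not in taxa:
                taxa.append(taxon)

    labels, rows = [], {taxon: "" for taxon in taxa}
    for file_number, matrix in enumerate(matrices, start=1):
        nchar = len(next(iter(matrix.values())))
        # Prefix character numbers with the file number, e.g. "2.1".
        labels.extend(f"{file_number}.{i}" for i in range(1, nchar + 1))
        for taxon in taxa:
            # Assumption: taxa absent from this file are coded as missing.
            rows[taxon] += matrix.get(taxon, missing * nchar)
    return labels, rows


labels, rows = combine_matrices([{"t1": "10", "t2": "01"}, {"t1": "11", "t3": "00"}])
print(labels)
print(rows)
```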

commonnexus split

The Mesquite software can write multiple TAXA, CHARACTERS and TREES blocks, linked together via TITLE and LINK commands, to a single NEXUS file. Most other tools can’t handle such “multi-taxa” files, though.

Running commonnexus split will split such files into one NEXUS file per CHARACTERS or TREES block, bundled with the appropriate TAXA block.

$ commonnexus split -h
usage: commonnexus split [-h] [--stem STEM] [--outdir OUTDIR] nexus

Split a Mesquite multi-block NEXUS into individual NEXUS files per CHARACTERS/TREES block.

Output files will be named according to the pattern "<stem>_<BLOCK.TITLE>.{nex|trees}".
So a CHARACTERS block with TITLE 'x.y' will end up in 'mesquite_x.y.nex'.

positional arguments:
  nexus

options:
  -h, --help       show this help message and exit
  --stem STEM      Stem of the filenames for the individual nexus files
                   (default: mesquite)
  --outdir OUTDIR  (Existing) directory to write the output to. (default: .)
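
The naming pattern from the help text can be spelled out in a few lines of Python. This is a sketch only: the helper output_path is hypothetical, and the assumption that TREES blocks get the suffix "trees" while all other blocks get "nex" is inferred from the pattern "{nex|trees}", not taken from the implementation.

```python
from pathlib import Path


def output_path(title, block_name, stem="mesquite", outdir="."):
    """Build "<outdir>/<stem>_<BLOCK.TITLE>.{nex|trees}" for one block."""
    # Assumption: only TREES blocks use the "trees" suffix.
    suffix = "trees" if block_name.upper() == "TREES" else "nex"
    return Path(outdir) / f"{stem}_{title}.{suffix}"


print(output_path("x.y", "CHARACTERS").name)
```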

commonnexus characters

The characters sub-command provides functionality to manipulate the characters matrix in a NEXUS file.

$ commonnexus characters -h
usage: commonnexus characters [-h] [--binarise] [--multistatise GROUPKEY]
                              [--convert {fasta,phylip}]
                              [--drop {constant,polymorphic,uncertain,missing,gapped}]
                              [--drop-numbered DROP_NUMBERED]
                              [--describe {binary-setsize,binary-unique,binary-constant,states-distribution}]
                              nexus

Manipulate the CHARACTERS (or DATA) block of a NEXUS file.

Note: Only one option can be chosen at a time.

positional arguments:
  nexus

options:
  -h, --help            show this help message and exit
  --binarise            Recode a matrix such that it only contains binary
                        characters. (default: False)
  --multistatise GROUPKEY
                        Recode a matrix such that it only contains one
                        multistate characters for each group of characters as
                        determined by GROUPKEY, a Python lambda function
                        accepting character label and returning a key (or a
                        string to group all characters into one multistate
                        character labeled with this string). (default: None)
  --convert {fasta,phylip}
                        Convert a matrix to another sequence format. (default:
                        None)
  --drop {constant,polymorphic,uncertain,missing,gapped}
                        Drop specified characters from a matrix. (default:
                        None)
  --drop-numbered DROP_NUMBERED
                        Drop characters specified by (ranges of) numbers from
                        a matrix. (default: None)
  --describe {binary-setsize,binary-unique,binary-constant,states-distribution}

“Binarise” the matrix

Some tools (e.g. BEAST) offer special analysis options for binary data. To convert multistate character data to binary data, you can run characters --binarise:

$ commonnexus characters --binarise "#NEXUS BEGIN DATA; DIMENSIONS nchar=1; MATRIX t1 a t2 b t3 c t4 d t5 e; END;"
#NEXUS
BEGIN CHARACTERS;
DIMENSIONS NCHAR=5;
FORMAT DATATYPE=STANDARD MISSING=? GAP=- SYMBOLS="01";
CHARSTATELABELS 
    1 1_A, 
    2 1_B, 
    3 1_C, 
    4 1_D, 
    5 1_E;
MATRIX 
t1 10000
t2 01000
t3 00100
t4 00010
t5 00001
;
END;
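The recoding above amounts to one binary character per observed state. Here is a minimal sketch in plain Python, not the commonnexus implementation: the function binarise is hypothetical, states are assumed to be single, unambiguous symbols (no missing data, polymorphism or uncertainty), and the ordering and upper-casing of states merely mirror the labels shown in the output above.

```python
def binarise(char_number, states_by_taxon):
    """Recode one multistate character as one binary character per state.

    states_by_taxon: dict mapping taxon label -> single state symbol.
    Returns (character_labels, rows).
    """
    # Assumption: states are ordered alphabetically.
    states = sorted(set(states_by_taxon.values()))
    labels = [f"{char_number}_{state.upper()}" for state in states]
    rows = {
        taxon: "".join("1" if observed == state else "0" for state in states)
        for taxon, observed in states_by_taxon.items()
    }
    return labels, rows


binary_labels, binary_rows = binarise(1, {"t1": "a", "t2": "b", "t3": "c", "t4": "d", "t5": "e"})
print(binary_labels)
print(binary_rows)
```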

“Multistatise” the matrix

Sometimes characters which are “naturally multistate” are coded as binary data (for the above reason). E.g. cognate-coded wordlist data are often binarised for analysis with BEAST, i.e. each cognate set is considered a separate character as opposed to grouping cognate sets for the same meaning into a multistate character. Binary data is somewhat harder to inspect “manually”, though. E.g. figuring out whether languages may have words coded as cognate in two different cognate sets for the same meaning is difficult looking at data such as https://github.com/phlorest/birchall_et_al2016/blob/main/raw/Chapacuran_Swadesh207-2019-labelled.nex.

Running characters --multistatise on such data can make this easier. The --multistatise option expects a Python lambda function as argument, which converts a character label into a group key. E.g. the character labels

100_laugh_A,
100_laugh_B,
100_laugh_C,

could be merged into a multistate character passing lambda c: '_'.join(c.split('_')[:-1]).

curl https://raw.githubusercontent.com/phlorest/birchall_et_al2016/main/raw/Chapacuran_Swadesh207-2019-labelled.nex | \
commonnexus characters --multistatise "lambda c: '_'.join(c.split('_')[:-1])" -

will output a MATRIX with rows like

Cojubim  AAAAAAA??AB(AB)AECABAAAAACAABBECAAAA?A?(AB)ACAA?AA?AEACAA??CBA??AADACBB?C?(AB)...

where polymorphisms (e.g. (AB)) mean a language has a word coded as cognate with two different cognate sets for the same meaning.
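
To see what a group key does before passing it on the command line, the lambda can be tried on a few labels in plain Python. The sketch below is not the commonnexus implementation: the helpers group_labels and multistate_symbol are hypothetical, the label 101_sing_A is made up for the example, and rendering an empty set of states as "?" is an assumption.

```python
from collections import defaultdict

# The group key from the text: drop the last "_"-separated segment of a label.
groupkey = lambda c: '_'.join(c.split('_')[:-1])


def group_labels(labels, key):
    """Group character labels by the key computed from each label."""
    groups = defaultdict(list)
    for label in labels:
        groups[key(label)].append(label)
    return dict(groups)


def multistate_symbol(present_states):
    """Render the states present for one taxon in one group."""
    if not present_states:
        return "?"  # assumption: no state present is shown as missing
    if len(present_states) == 1:
        return present_states[0]
    return "(" + "".join(present_states) + ")"  # polymorphism, e.g. (AB)


groups = group_labels(["100_laugh_A", "100_laugh_B", "100_laugh_C", "101_sing_A"], groupkey)
print(groups)
print(multistate_symbol(["A", "B"]))
```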

Describing character set sizes

The output of most commands is also suitable for piping to other commands. E.g. termgraph can be used to display character set sizes:

$ commonnexus characters characters.nex --describe binary-setsize | termgraph

INFO:commonnexus:Character set sizes (for binary matrix):
: ▇▇▇▇▇▇▇▇▇▇ 1.00 
: ▇▇▇▇▇▇▇▇▇▇ 1.00 
: ▇▇▇▇▇▇▇▇▇▇ 1.00 
: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.00 
: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.00 
: ▇▇▇▇▇▇▇▇▇▇ 1.00 
: ▇▇▇▇▇▇▇▇▇▇ 1.00 
: ▇▇▇▇▇▇▇▇▇▇ 1.00 
: ▇▇▇▇▇▇▇▇▇▇ 1.00 
▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.00
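
The numbers in this chart are consistent with counting, for each character, the taxa coded as "1". The following plain-Python sketch reproduces them; it is not the commonnexus implementation, the function set_sizes is hypothetical, and it assumes that characters.nex holds the 10-character matrix that appears twice in the output of the combine example above.

```python
# Assumption: the content of characters.nex, taken from the first ten
# columns of the combined matrix shown for "commonnexus combine".
MATRIX = {
    "t1": "1001010000",
    "t2": "0101000100",
    "t3": "0011101010",
    "t4": "0001100001",
    "t5": "0001100001",
}


def set_sizes(matrix):
    """Count, per character (column), the taxa coded as "1"."""
    return [column.count("1") for column in zip(*matrix.values())]


print(set_sizes(MATRIX))
```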

commonnexus trees

The trees sub-command provides functionality to manipulate the TREES block in a NEXUS file.

$ commonnexus trees -h
usage: commonnexus trees [-h] [--drop DROP] [--sample N] [--random N]
                         [--random-seed RANDOM_SEED] [--strip-comments]
                         [--rename RENAME] [--describe]
                         nexus

Manipulate a TREES block in a NEXUS file.

Note: Only one option can be chosen at a time.

positional arguments:
  nexus

options:
  -h, --help            show this help message and exit
  --drop DROP           Remove the trees specified as comma-separated ranges
                        of 1-based indices, e.g. '1', '1-5', '1,20-30'
                        (default: [])
  --sample N            Resample the trees every Nth tree (default: 0)
  --random N            Randomly sample N trees from the treefile (default: 0)
  --random-seed RANDOM_SEED
                        Set random seed (to a number) to allow for
                        reproducible random sampling. (default: None)
  --strip-comments      Remove comments from the trees (default: False)
  --rename RENAME       Rename a tree specified as 'old,new' where 'old' is
                        the current name or number and 'new' is the new name
                        or as Python lambda function accepting a tree label as
                        input. (default: None)
  --describe            list 1-based index, names and rooting of trees
                        (default: False)
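
The format accepted by --drop (comma-separated ranges of 1-based indices) can be interpreted as in the following plain-Python sketch. It is an illustration, not the parser used by commonnexus; the helpers parse_ranges and drop_trees are hypothetical, and ranges are assumed to include both endpoints.

```python
def parse_ranges(spec):
    """Turn a spec like '1,20-30' into a list of 1-based indices."""
    indices = []
    for part in spec.split(","):
        start, _, end = part.strip().partition("-")
        # A single number is a range of length one; ranges are inclusive.
        indices.extend(range(int(start), int(end or start) + 1))
    return indices


def drop_trees(trees, spec):
    """Return the trees whose 1-based index is not selected by spec."""
    dropped = set(parse_ranges(spec))
    return [tree for index, tree in enumerate(trees, start=1) if index not in dropped]


print(parse_ranges("1,20-30"))
print(drop_trees(["tree_a", "tree_b", "tree_c", "tree_d"], "1,3-4"))
```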

commonnexus taxa

The taxa sub-command provides functionality to manipulate the set of taxa in a NEXUS file.

$ commonnexus taxa -h
usage: commonnexus taxa [-h] [--drop DROP] [--rename RENAME]
                        [--describe DESCRIBE] [--check]
                        nexus

Manipulate the list of TAXA used in a NEXUS file.

positional arguments:
  nexus

options:
  -h, --help           show this help message and exit
  --drop DROP          Comma-separated list of taxon labels or numbers to
                       remove from the NEXUS. This will remove the taxa from
                       the TAXA block, the relevant rows from a CHARACTERS (or
                       DATA) matrix, and prune the specified taxa from any
                       TREE in a TREES block. (default: [])
  --rename RENAME      Rename a taxon specified as 'old,new' where 'old' is
                       the current name or number and 'new' is the new name or
                       as Python lambda function accepting a taxon label as
                       input. (default: None)
  --describe DESCRIBE  Describe a named taxon, i.e. aggregate the data for the
                       taxon in a NEXUS file. (default: None)
  --check              Check whether taxa labels in a NEXUS file are used
                       consistently. (default: False)

Removing taxa

While removing a taxon from a NEXUS file can be as simple as deleting one line in the CHARACTERS MATRIX command, it typically isn’t, because the taxon may also appear in TREES TRANSLATE, etc. taxa --drop will remove relevant taxon references from TAXA, TREES, CHARACTERS, DATA, DISTANCES and NOTES blocks.

$ commonnexus taxa --drop t1 "#NEXUS BEGIN DATA; DIMENSIONS nchar=1; MATRIX t1 a t2 b t3 c t4 d t5 e; END;"
#NEXUS
BEGIN DATA;
DIMENSIONS NCHAR=1;
FORMAT DATATYPE=STANDARD MISSING=? GAP=- SYMBOLS="BCDE";
MATRIX 
t2 B
t3 C
t4 D
t5 E
;
END;
BEGIN TAXA;
DIMENSIONS NTAX=4;
TAXLABELS t2 t3 t4 t5;
END;

If you want to drop constant/invariant characters which might have arisen due to removing a taxon, you could pipe the result of taxa --drop into characters --drop constant.
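
Why dropping a taxon can leave constant characters behind is easy to see on a toy matrix. The sketch below is plain Python with hypothetical helpers (drop_taxon, constant_columns); treating a character as constant when at most one state remains after ignoring missing and gap symbols is an assumption of this example and may differ from the definition used by characters --drop constant.

```python
def drop_taxon(matrix, taxon):
    """Return a copy of the matrix without the row for taxon."""
    return {label: row for label, row in matrix.items() if label != taxon}


def constant_columns(matrix, missing="?", gap="-"):
    """Return 1-based numbers of characters with at most one observed state."""
    constant = []
    for number, column in enumerate(zip(*matrix.values()), start=1):
        # Assumption: missing data and gaps do not count as states.
        observed = set(column) - {missing, gap}
        if len(observed) <= 1:
            constant.append(number)
    return constant


matrix = {"t1": "10", "t2": "00", "t3": "01"}
print(constant_columns(matrix))
print(constant_columns(drop_taxon(matrix, "t1")))
```

In the toy matrix, character 1 varies only because of t1, so it becomes constant once t1 is removed.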

Describing taxa

Describing the data for a taxon in a NEXUS file is particularly useful for files with a CHARACTERS MATRIX of DATATYPE=STANDARD and labeled states - such as the files from Morphobank.

Running

commonnexus taxa ../tests/fixtures/regression/mbank_X962_11-22-2013_1534.nex --describe 1

will output a Markdown-formatted table of characters looking like

Character	State	Notes
Vomer, shape of tooth patch	Trapezoidal to ovate
Orbitosphenoid	Present
Pterotic, enclosure of lateral line canal	absent or incomplete
Frontals, midline suture	joined along entire midline
Frontoparietal crests	absent
Frontoparietal crests, sensory pore on dorsal margin	?
Supraoccipital crest, shape	long and low
Supraoccipital crest, horizontal shelf projecting laterally at mid-height	present
Supraoccipital crest, shape of dorsal margin	blade-like
Sphenotic, horizontal shelf	absent
Mesethmoid, anterolaterally facing projection	absent
Lateral ethmoid-lacrimal articulation, orientation	entirely or primarily in the horizontal plane	Waldman, 1986
…