Command-line interface
commonnexus comes with a shell command of the same name: commonnexus. commonnexus is a multi-command CLI, or a “git-like multi-tool command”, i.e. actual functionality to manipulate NEXUS files is implemented as sub-commands. To get an overview and a list of sub-commands, run
$ commonnexus -h
usage: commonnexus [-h] [--log-level LOG_LEVEL] COMMAND ...
commonnexus 1.9.3.dev0 is a set of commands to manipulate of files in the
NEXUS file format.
options:
-h, --help show this help message and exit
--log-level LOG_LEVEL
log level [ERROR 40|WARNING 30|INFO 20|DEBUG 10]
(default: 20)
available commands:
Run "commonnexus COMAMND -h" to get help for a specific command.
COMMAND
characters Manipulate the CHARACTERS (or DATA) block of a NEXUS
file.
combine Combine data from multiple NEXUS files and put it in a
new one.
help Get help on subcommands.
normalise Normalise a NEXUS file.
split Split a Mesquite multi-block NEXUS into individual
NEXUS files per CHARACTERS/TREES block.
taxa Manipulate the list of TAXA used in a NEXUS file.
trees Manipulate a TREES block in a NEXUS file.
See https://github.com/dlce-eva/commonnexus for details.
Most commands can read input from stdin and print results to stdout. Thus, these commands can easily be chained together with other shell commands:
$ echo "#nexus begin trees; tree 1 = ((a,b)c); end;" | commonnexus normalise - | grep TREE | grep -v TREES
TREE 1 = ((a,b)c);
Warning
While commonnexus.Nexus can read multiple blocks with the same name just fine, most of the commands listed below assume just one block per block type in their input (i.e. they only act on the first occurrence of each block type).
In the following we describe the available sub-commands.
commonnexus normalise
Arguably the most important sub-command is normalise, because it removes quite a few complexities of the NEXUS format (e.g. different TRIANGLE options for DISTANCES, or EQUATE mappings for CHARACTERS), and thus makes downstream NEXUS reading a lot more reliable.
$ commonnexus normalise -h
usage: commonnexus normalise [-h] [--strip-comments] nexus
Normalise a NEXUS file.
Normalisation includes
- converting CHARACTERS/DATA matrices to non-transposed, non-interleaved representation with
taxon labels (and resolved EQUATEs), extracting taxon labels into a TAXA block;
- converting a DISTANCES matrix to non-interleaved matrices with diagonal and both triangles
and taxon labels;
- translating all TREEs in a TREES block (such that the TRANSLATE command becomes superfluous).
In addition, after normalisation, the following assumptions hold:
- All commands start on a new line.
- All command names (**not** block names) are in uppercase with no "in-name-comment",
like "MA[c]TRiX"
- The ";" terminating MATRIX commands is on a separate line, allowing more simplistic parsing
of matrix rows.
positional arguments:
nexus
options:
-h, --help show this help message and exit
--strip-comments Remove non-command comments. (default: False)
For examples of running commonnexus normalise, refer to the documentation of the underlying function commonnexus.tools.normalise.normalise().
Normalising CHARACTERS
$ commonnexus normalise '#nexus begin d[c]ata; dimensions nchar=3; format missing=x nolabels; matrix x01 100 010; end;'
#NEXUS
BEGIN TAXA;
DIMENSIONS NTAX=3;
TAXLABELS 1 2 3;
END;
BEGIN DATA;
DIMENSIONS NCHAR=3;
FORMAT DATATYPE=STANDARD MISSING=? GAP=- SYMBOLS="01";
MATRIX
1 ?01
2 100
3 010
;
END;
Normalising DISTANCES
$ commonnexus normalise '#nexus begin distances; dimensions ntax=3; format missing=x nodiagonal; matrix t1 t2 x t3 1.0 2.1; end;'
#NEXUS
BEGIN TAXA;
DIMENSIONS NTAX=3;
TAXLABELS t1 t2 t3;
END;
BEGIN DISTANCES;
DIMENSIONS NTAX=3;
FORMAT TRIANGLE=BOTH MISSING=?;
MATRIX
t1 0 ? 1.0
t2 ? 0 2.1
t3 1.0 2.1 0
;
END;
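The DISTANCES normalisation shown above can be pictured with a small, self-contained Python sketch. It is not the commonnexus implementation; the function name expand_lower_triangle and the row representation are assumptions made for illustration. It expands a lower triangle without diagonal into a full matrix and swaps the custom missing symbol for "?".

```python
def expand_lower_triangle(rows, missing="x"):
    """Expand a lower triangle without diagonal into a full symmetric matrix.

    rows: list of (label, distances to all previously listed taxa).
    This is an illustrative helper, not part of commonnexus.
    """
    labels = [label for label, _ in rows]
    n = len(labels)
    # Start with zeros on the diagonal and "?" (missing) everywhere else.
    full = [["0" if i == j else "?" for j in range(n)] for i in range(n)]
    for i, (_, values) in enumerate(rows):
        for j, value in enumerate(values):
            cell = "?" if value == missing else value
            # Mirror each value into both triangles.
            full[i][j] = full[j][i] = cell
    return labels, full


# The matrix from the example above: "t1 t2 x t3 1.0 2.1" with NODIAGONAL.
rows = [("t1", []), ("t2", ["x"]), ("t3", ["1.0", "2.1"])]
labels, full = expand_lower_triangle(rows)
for label, row in zip(labels, full):
    print(label, " ".join(row))
```

The printed rows correspond to the MATRIX of the normalised DISTANCES block above.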
Normalising TREES
$ commonnexus normalise '#nexus begin trees; translate a t1, b t2, c t3; tree 1 = ((a,b)c); end;'
#NEXUS
BEGIN TREES;
TREE 1 = ((t1,t2)t3);
END;
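What "translating" a tree means can be sketched in a few lines of Python. This is a simplified illustration, not how commonnexus parses trees: the helper translate_labels is hypothetical, and the regular expression ignores quoted labels and comments.

```python
import re


def translate_labels(newick, mapping):
    """Replace node labels in a Newick string according to a TRANSLATE mapping.

    Illustrative only: quoted labels and comments are not handled.
    """
    # A label is a run of characters directly after "(", "," or ")".
    # Branch lengths follow ":" and are therefore left untouched.
    pattern = re.compile(r"(?<=[(,)])[^(),:;\s\[\]]+")
    return pattern.sub(lambda m: mapping.get(m.group(0), m.group(0)), newick)


mapping = {"a": "t1", "b": "t2", "c": "t3"}
translated = translate_labels("((a,b)c);", mapping)
print(translated)  # ((t1,t2)t3);
```

Once every label has been replaced like this, the TRANSLATE command carries no further information, which is why it can be dropped from the normalised block.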
commonnexus combine
Combining data from multiple NEXUS files into a single one can be useful to have data and resulting trees from a phylogenetic analysis in a single file or to aggregate character data for the same set of taxa.
$ commonnexus combine -h
usage: commonnexus combine [-h] [--drop-unsupported] nexus [nexus ...]
Combine data from multiple NEXUS files and put it in a new one.
The following blocks can be handled:
- TAXA: Taxa are identified across NEXUS files based on label (not number).
- CHARACTERS/DATA: Characters are aggregated across NEXUS files (with character labels prefixed,
for disambiguation).
- TREES: Trees are (translated and) aggregated across NEXUS files.
positional arguments:
nexus NEXUS content specified as file path or string or "-" to
read from stdin. Note that "-" can only be used once.
Content from a string or stdin will be split into
individual NEXUS "files" using the "#NEXUS" token as
separator.
options:
-h, --help show this help message and exit
--drop-unsupported Drop NEXUS blocks that cannot be combined. (default:
False)
Combining CHARACTERS blocks
$ cat characters.nex | commonnexus combine - characters.nex
#NEXUS
BEGIN TAXA;
DIMENSIONS NTAX=5;
TAXLABELS t1 t2 t3 t4 t5;
END;
BEGIN CHARACTERS;
DIMENSIONS NCHAR=20;
FORMAT DATATYPE=STANDARD MISSING=? GAP=- SYMBOLS="01";
CHARSTATELABELS
1 1.1,
2 1.2,
3 1.3,
4 1.4,
5 1.5,
6 1.6,
7 1.7,
8 1.8,
9 1.9,
10 1.10,
11 2.1,
12 2.2,
13 2.3,
14 2.4,
15 2.5,
16 2.6,
17 2.7,
18 2.8,
19 2.9,
20 2.10;
MATRIX
t1 10010100001001010000
t2 01010001000101000100
t3 00111010100011101010
t4 00011000010001100001
t5 00011000010001100001
;
END;
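The label prefixing visible in the CHARSTATELABELS above (1.1 to 1.10 for the first input, 2.1 to 2.10 for the second) can be sketched as follows. This is a simplified illustration, not the commonnexus code: combine_matrices is a hypothetical helper, and it assumes that all inputs contain exactly the same taxa.

```python
def combine_matrices(matrices):
    """Concatenate character matrices that share the same taxa.

    matrices: list of dicts mapping taxon label -> string of states.
    Assumes (for this sketch) identical taxon sets in all inputs.
    """
    taxa = list(matrices[0])
    labels = []
    combined = {taxon: "" for taxon in taxa}
    for file_no, matrix in enumerate(matrices, start=1):
        nchar = len(matrix[taxa[0]])
        # Prefix each character number with the number of its source file.
        labels.extend(f"{file_no}.{char_no}" for char_no in range(1, nchar + 1))
        for taxon in taxa:
            combined[taxon] += matrix[taxon]
    return labels, combined


first = {"t1": "10", "t2": "01"}
labels, combined = combine_matrices([first, first])
print(labels)    # ['1.1', '1.2', '2.1', '2.2']
print(combined)  # {'t1': '1010', 't2': '0101'}
```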
commonnexus split
The Mesquite software can write multiple TAXA, CHARACTERS and TREES blocks, linked together via TITLE and LINK commands, to a single NEXUS file. Most other tools can’t handle such “multi-taxa” files, though.
Running commonnexus split will split such files into one NEXUS file per CHARACTERS or TREES block, bundled with the appropriate TAXA block.
$ commonnexus split -h
usage: commonnexus split [-h] [--stem STEM] [--outdir OUTDIR] nexus
Split a Mesquite multi-block NEXUS into individual NEXUS files per CHARACTERS/TREES block.
Output files will be named according to the pattern "<stem>_<BLOCK.TITLE>.{nex|trees}".
So a CHARACTERS block with TITLE 'x.y' will end up in 'mesquite_x.y.nex'.
positional arguments:
nexus
options:
-h, --help show this help message and exit
--stem STEM Stem of the filenames for the individual nexus files
(default: mesquite)
--outdir OUTDIR (Existing) directory to write the output to. (default: .)
commonnexus characters
The characters sub-command provides functionality to manipulate the characters matrix in a NEXUS file.
$ commonnexus characters -h
usage: commonnexus characters [-h] [--binarise] [--multistatise GROUPKEY]
[--convert {fasta,phylip}]
[--drop {constant,polymorphic,uncertain,missing,gapped}]
[--drop-numbered DROP_NUMBERED]
[--describe {binary-setsize,binary-unique,binary-constant,states-distribution}]
nexus
Manipulate the CHARACTERS (or DATA) block of a NEXUS file.
Note: Only one option can be chosen at a time.
positional arguments:
nexus
options:
-h, --help show this help message and exit
--binarise Recode a matrix such that it only contains binary
characters. (default: False)
--multistatise GROUPKEY
Recode a matrix such that it only contains one
multistate characters for each group of characters as
determined by GROUPKEY, a Python lambda function
accepting character label and returning a key (or a
string to group all characters into one multistate
character labeled with this string). (default: None)
--convert {fasta,phylip}
Convert a matrix to another sequence format. (default:
None)
--drop {constant,polymorphic,uncertain,missing,gapped}
Drop specified characters from a matrix. (default:
None)
--drop-numbered DROP_NUMBERED
Drop characters specified by (ranges of) numbers from
a matrix. (default: None)
--describe {binary-setsize,binary-unique,binary-constant,states-distribution}
“Binarise” the matrix
Some tools (e.g. BEAST) offer special analysis options for binary data. To convert multistate character data to binary data, you can run characters --binarise:
$ commonnexus characters --binarise "#NEXUS BEGIN DATA; DIMENSIONS nchar=1; MATRIX t1 a t2 b t3 c t4 d t5 e; END;"
#NEXUS
BEGIN CHARACTERS;
DIMENSIONS NCHAR=5;
FORMAT DATATYPE=STANDARD MISSING=? GAP=- SYMBOLS="01";
CHARSTATELABELS
1 1_A,
2 1_B,
3 1_C,
4 1_D,
5 1_E;
MATRIX
t1 10000
t2 01000
t3 00100
t4 00010
t5 00001
;
END;
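The recoding in this example amounts to one binary character per observed state. A minimal Python sketch of that idea follows; it is not the commonnexus implementation, the helper binarise_character is hypothetical, and missing data, gaps and polymorphic states are ignored.

```python
def binarise_character(column, char_no=1):
    """Recode one multistate character as one binary character per state.

    column: dict mapping taxon label -> state symbol.
    Illustrative only; missing or polymorphic states are not handled.
    """
    states = sorted(set(column.values()))
    # Labels follow the "<character number>_<STATE>" pattern seen above.
    labels = [f"{char_no}_{state.upper()}" for state in states]
    rows = {
        taxon: "".join("1" if state == s else "0" for s in states)
        for taxon, state in column.items()
    }
    return labels, rows


column = {"t1": "a", "t2": "b", "t3": "c", "t4": "d", "t5": "e"}
labels, rows = binarise_character(column)
print(labels)      # ['1_A', '1_B', '1_C', '1_D', '1_E']
print(rows["t2"])  # 01000
```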
“Multistatise” the matrix
Sometimes characters which are “naturally multistate” are coded as binary data (for the above reason). E.g. cognate-coded wordlist data are often binarised for analysis with BEAST, i.e. each cognate set is considered a separate character as opposed to grouping cognate sets for the same meaning into a multistate character. Binary data is somewhat harder to inspect “manually”, though. E.g. figuring out whether languages may have words coded as cognate in two different cognate sets for the same meaning is difficult looking at data such as https://github.com/phlorest/birchall_et_al2016/blob/main/raw/Chapacuran_Swadesh207-2019-labelled.nex.
Running characters --multistatise on such data can make this easier. The --multistatise option expects a Python lambda function as argument, which converts a character label into a group key.
E.g. the character labels
1 100_laugh_A,
2 100_laugh_B,
3 100_laugh_C,
could be merged into a multistate character passing lambda c: '_'.join(c.split('_')[:-1]).
$ curl https://raw.githubusercontent.com/phlorest/birchall_et_al2016/main/raw/Chapacuran_Swadesh207-2019-labelled.nex | \
commonnexus characters --multistatise "lambda c: '_'.join(c.split('_')[:-1])" -
will output a MATRIX with rows like
Cojubim AAAAAAA??AB(AB)AECABAAAAACAABBECAAAA?A?(AB)ACAA?AA?AEACAA??CBA??AADACBB?C?(AB)...
where polymorphisms (e.g. (AB)) mean a language has a word coded as cognate with two different cognate sets for the same meaning.
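To see how the group key and the polymorphisms relate, the following Python sketch applies the lambda from above to the three example labels. It only illustrates the idea and is not the commonnexus implementation: multistatise_row is a hypothetical helper and the binary rows are made up.

```python
from collections import defaultdict

# The group key from the text: strip the trailing cognate set identifier.
group_key = lambda c: '_'.join(c.split('_')[:-1])


def multistatise_row(labels, row, key):
    """Merge the binary values of one taxon into one state per group.

    Illustrative only; groups without any "1" are simply skipped here.
    """
    groups = defaultdict(list)
    for label, value in zip(labels, row):
        if value == "1":
            # Use the last label component (e.g. "A") as state symbol.
            groups[key(label)].append(label.rsplit("_", 1)[-1])
    return {
        name: states[0] if len(states) == 1 else "(" + "".join(states) + ")"
        for name, states in groups.items()
    }


labels = ["100_laugh_A", "100_laugh_B", "100_laugh_C"]
print(group_key(labels[0]))                         # 100_laugh
print(multistatise_row(labels, "001", group_key))   # {'100_laugh': 'C'}
print(multistatise_row(labels, "110", group_key))   # {'100_laugh': '(AB)'}
```

The last call shows where a polymorphic state comes from: the hypothetical row has a 1 in two cognate sets of the same meaning.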
Describing character set sizes
The output of most commands is also suitable for piping to other commands. E.g. termgraph can be used to display character set sizes:
$ commonnexus characters characters.nex --describe binary-setsize | termgraph
INFO:commonnexus:Character set sizes (for binary matrix):
1 : ▇▇▇▇▇▇▇▇▇▇ 1.00
2 : ▇▇▇▇▇▇▇▇▇▇ 1.00
3 : ▇▇▇▇▇▇▇▇▇▇ 1.00
4 : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.00
5 : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.00
6 : ▇▇▇▇▇▇▇▇▇▇ 1.00
7 : ▇▇▇▇▇▇▇▇▇▇ 1.00
8 : ▇▇▇▇▇▇▇▇▇▇ 1.00
9 : ▇▇▇▇▇▇▇▇▇▇ 1.00
10: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.00
commonnexus trees
The trees sub-command provides functionality to manipulate the TREES block in a NEXUS file.
$ commonnexus trees -h
usage: commonnexus trees [-h] [--drop DROP] [--sample N] [--random N]
[--random-seed RANDOM_SEED] [--strip-comments]
[--rename RENAME] [--describe]
nexus
Manipulate a TREES block in a NEXUS file.
Note: Only one option can be chosen at a time.
positional arguments:
nexus
options:
-h, --help show this help message and exit
--drop DROP Remove the trees specified as comma-separated ranges
of 1-based indices, e.g. '1', '1-5', '1,20-30'
(default: [])
--sample N Resample the trees every Nth tree (default: 0)
--random N Randomly sample N trees from the treefile (default: 0)
--random-seed RANDOM_SEED
Set random seed (to a number) to allow for
reproducible random sampling. (default: None)
--strip-comments Remove comments from the trees (default: False)
--rename RENAME Rename a tree specified as 'old,new' where 'old' is
the current name or number and 'new' is the new name
or as Python lambda function accepting a tree label as
input. (default: None)
--describe list 1-based index, names and rooting of trees
(default: False)
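The range syntax accepted by --drop ('1', '1-5', '1,20-30') can be read as a set of 1-based indices. The following Python sketch shows one way to interpret such a specification; parse_ranges and drop_trees are hypothetical helpers written for illustration, not functions of commonnexus.

```python
def parse_ranges(spec):
    """Turn a spec like '1,20-30' into a set of 1-based indices."""
    indices = set()
    for part in spec.split(","):
        start, _, end = part.partition("-")
        # A single number such as '1' is treated as the range 1-1.
        indices.update(range(int(start), int(end or start) + 1))
    return indices


def drop_trees(trees, spec):
    """Return the trees whose 1-based position is not listed in spec."""
    dropped = parse_ranges(spec)
    return [tree for i, tree in enumerate(trees, start=1) if i not in dropped]


trees = ["tree%d" % i for i in range(1, 7)]
print(sorted(parse_ranges("1,3-4")))  # [1, 3, 4]
print(drop_trees(trees, "1,3-4"))     # ['tree2', 'tree5', 'tree6']
```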
commonnexus taxa
The taxa sub-command provides functionality to manipulate the set of taxa in a NEXUS file.
$ commonnexus taxa -h
usage: commonnexus taxa [-h] [--drop DROP] [--rename RENAME]
[--describe DESCRIBE] [--check]
nexus
Manipulate the list of TAXA used in a NEXUS file.
positional arguments:
nexus
options:
-h, --help show this help message and exit
--drop DROP Comma-separated list of taxon labels or numbers to
remove from the NEXUS. This will remove the taxa from
the TAXA block, the relevant rows from a CHARACTERS (or
DATA) matrix, and prune the specified taxa from any
TREE in a TREES block. (default: [])
--rename RENAME Rename a taxon specified as 'old,new' where 'old' is
the current name or number and 'new' is the new name or
as Python lambda function accepting a taxon label as
input. (default: None)
--describe DESCRIBE Describe a named taxon, i.e. aggregate the data for the
taxon in a NEXUS file. (default: None)
--check Check whether taxa labels in a NEXUS file are used
consistently. (default: False)
Removing taxa
While removing a taxon from a NEXUS file can be as simple as deleting one line in the CHARACTERS MATRIX command, it typically isn’t, because the taxon may also appear in TREES TRANSLATE, etc. taxa --drop will remove relevant taxon references from TAXA, TREES, CHARACTERS, DATA, DISTANCES and NOTES blocks.
$ commonnexus taxa --drop t1 "#NEXUS BEGIN DATA; DIMENSIONS nchar=1; MATRIX t1 a t2 b t3 c t4 d t5 e; END;"
#NEXUS
BEGIN DATA;
DIMENSIONS NCHAR=1;
FORMAT DATATYPE=STANDARD MISSING=? GAP=- SYMBOLS="BCDE";
MATRIX
t2 B
t3 C
t4 D
t5 E
;
END;
BEGIN TAXA;
DIMENSIONS NTAX=4;
TAXLABELS t2 t3 t4 t5;
END;
If you want to drop constant/invariant characters which might have arisen due to removing a taxon, you could pipe the result of taxa --drop into characters --drop constant.
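Why removing a taxon can leave constant characters behind is easiest to see on a tiny matrix. The Python sketch below is illustrative only and not the commonnexus implementation: drop_taxon and drop_constant are hypothetical helpers, and "constant" is assumed here to mean that at most one distinct non-missing state remains.

```python
def drop_taxon(matrix, taxon):
    """Remove one row from a dict mapping taxon label -> string of states."""
    return {t: row for t, row in matrix.items() if t != taxon}


def drop_constant(matrix, missing="?"):
    """Remove characters showing at most one distinct non-missing state.

    This definition of "constant" is an assumption of the sketch.
    """
    rows = list(matrix.values())
    keep = [
        i for i in range(len(rows[0]))
        if len({row[i] for row in rows} - {missing}) > 1
    ]
    return {t: "".join(row[i] for i in keep) for t, row in matrix.items()}


matrix = {"t1": "101", "t2": "011", "t3": "001"}
without_t1 = drop_taxon(matrix, "t1")
print(without_t1)                 # {'t2': '011', 't3': '001'}
print(drop_constant(without_t1))  # {'t2': '1', 't3': '0'}
```

In this made-up matrix, the first character only varied because of t1, so it becomes constant once t1 is gone.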
Describing taxa
Describing the data for a taxon in a NEXUS file is particularly useful for files with a CHARACTERS MATRIX of DATATYPE=STANDARD and labeled states - such as the files from Morphobank.
Running
commonnexus taxa ../tests/fixtures/regression/mbank_X962_11-22-2013_1534.nex --describe 1
will output a markdown formatted table of characters looking like
Character | State | Notes |
---|---|---|
Vomer, shape of tooth patch | Trapezoidal to ovate | |
Orbitosphenoid | Present | |
Pterotic, enclosure of lateral line canal | absent or incomplete | |
Frontals, midline suture | joined along entire midline | |
Frontoparietal crests | absent | |
Frontoparietal crests, sensory pore on dorsal margin | ? | |
Supraoccipital crest, shape | long and low | |
Supraoccipital crest, horizontal shelf projecting laterally at mid-height | present | |
Supraoccipital crest, shape of dorsal margin | blade-like | |
Sphenotic, horizontal shelf | absent | |
Mesethmoid, anterolaterally facing projection | absent | |
Lateral ethmoid-lacrimal articulation, orientation | entirely or primarily in the horizontal plane | Waldman, 1986 |
… |