Blocks

From the specification:

Modularity is the primary design feature of a NEXUS file. A NEXUS file is composed of a number of blocks, such as TAXA, CHARACTERS, and TREES blocks.

Other blocks can be added to the file to house various kinds of data, including discrete and continuous morphological characters, distance data, frequency data, and information about protein-coding regions, genetic codes, assumptions about weights, trees, etc. Images can be included. This modular format allows a computer program reading the file to ignore safely the unfamiliar parts of the file, permits sharing of the file by various programs, and permits future expansion to encompass new information.

The eight [sic] primary public blocks are TAXA, CHARACTERS, UNALIGNED, DISTANCES, SETS, ASSUMPTIONS, CODONS, TREES, and NOTES.

Mutability

Nexus objects are mutable, i.e. blocks can be added and removed during the “lifetime” of a Nexus instance. To make this possible, the only data kept in an instance is the list of tokens representing the parsed NEXUS content. Thus, when accessing a block in a Nexus object, this Block instance is created from the token list, and, consequently, accessing a block again will return a new instance:

>>> from commonnexus import Nexus
>>> nex = Nexus('#NEXUS\nBEGIN BLOCK;\nEND;')
>>> nex.BLOCK is nex.BLOCK
False
>>> id(nex.BLOCK)
140138642078064
>>> id(nex.BLOCK)
140138642079584

Since blocks are tuple instances, they still compare as equal as long as they are created from the same list of tokens:

>>> nex.BLOCK == nex.BLOCK
True

But to take advantage of caching that happens at block level (e.g. of TRANSLATE mappings in a TREES block), care must be taken to retain a reference to the Block instance:

>>> nex = Nexus('#NEXUS\nBEGIN TREES;\nTRANSLATE a b, c d;\nTREE tree = (a,c);\nEND;')
>>> trees = nex.TREES
>>> trees.translate(trees.TREE).newick
'(b,d)'
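The access pattern described above can be imitated in plain Python. The following is only a sketch of the idea, not commonnexus code: `TokenBackedDocument` and `Block` are hypothetical names, and the token layout is an assumption made for illustration. It shows why two accesses yield distinct objects that still compare as equal.

```python
from collections import namedtuple

# Hypothetical stand-in for a block: a tuple subclass, so equality is by value.
Block = namedtuple("Block", ["name", "tokens"])


class TokenBackedDocument:
    """Toy container that stores only tokens and builds blocks on access."""

    def __init__(self, blocks):
        # `blocks` maps block names to token lists (an assumed layout).
        self._tokens = {name.upper(): tuple(toks) for name, toks in blocks.items()}

    def __getattr__(self, name):
        # Only called for attributes that are not found normally, e.g. doc.TREES.
        tokens = self.__dict__.get("_tokens", {})
        if name in tokens:
            return Block(name, tokens[name])  # a fresh instance on every access
        raise AttributeError(name)


doc = TokenBackedDocument({"TREES": ["TREE", "tree", "=", "(a,c)", ";"]})
first, second = doc.TREES, doc.TREES
print(first is second)  # False: two separate objects
print(first == second)  # True: built from the same tokens
```

Any state cached on `first` would not be visible through `second`, which is why keeping a reference to one block instance matters.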

TAXA

class commonnexus.blocks.taxa.Taxa(nexus, cmds)[source]

The TAXA block specifies information about taxa.

BEGIN TAXA;
    DIMENSIONS NTAX=number-of-taxa;
    TAXLABELS taxon-name [taxon-name ...];
END;

DIMENSIONS must appear before TAXLABELS. Only one of each command is allowed per block.

classmethod from_data(labels, comment=None, nexus=None, TITLE=None, ID=None, LINK=None)[source]

Block implementations must override this method to implement “meaningful” NEXUS writing functionality.

Parameters:

labels (typing.Sequence) –
comment (typing.Optional[str]) –
nexus (typing.Optional[commonnexus.nexus.Nexus]) –
TITLE (typing.Optional[str]) –
ID (typing.Optional[str]) –
LINK (typing.Union[str, typing.Tuple[str, str], None]) –

Return type:

commonnexus.blocks.base.Block

TAXA Commands

class commonnexus.blocks.taxa.Dimensions(tokens, nexus=None)[source]

The NTAX subcommand of the DIMENSIONS command indicates the number of taxa. The NEXUS standard does not impose limits on number of taxa; a limit may be imposed by particular computer programs.

Variables: ntax (int) – The number of taxa.

class commonnexus.blocks.taxa.Taxlabels(tokens, nexus=None)[source]

This command defines taxa, specifies their names, and determines their order:

TAXLABELS Fungus Insect Mammal;

Taxon names are single NEXUS words. They must not correspond to another taxon name or number; thus, 1 is not a valid name for the second taxon listed. The standard defines no limit to their length, although individual programs might impose restrictions.

Taxa may also be defined in the CHARACTERS, UNALIGNED, and DISTANCES blocks if the NEWTAXA token is included in the DIMENSIONS command; see the descriptions of those blocks for details.

Variables: labels (Dict[int, str]) – Mapping of taxon number to taxon label.

The taxon number is the number of a taxon, as defined by its position in a TAXLABELS command. […] For example, the third taxon listed in TAXLABELS is taxon number 3.
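The numbering and naming rules for TAXLABELS can be sketched in plain Python. `number_taxa` is a hypothetical helper, not part of commonnexus; it assumes labels are compared exactly as written and treats a purely numeric label as invalid only when it equals the number of a different taxon.

```python
def number_taxa(labels):
    """Map 1-based taxon numbers to labels, rejecting ambiguous names."""
    labels = list(labels)
    if len(set(labels)) != len(labels):
        raise ValueError("taxon names must be unique")
    for position, label in enumerate(labels, start=1):
        # A numeric name must not collide with the number of another taxon.
        if label.isdigit() and int(label) != position and 1 <= int(label) <= len(labels):
            raise ValueError(f"{label!r} is the number of another taxon")
    return dict(enumerate(labels, start=1))


taxa = number_taxa(["Fungus", "Insect", "Mammal"])
print(taxa[3])  # Mammal
```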

CHARACTERS (and DATA)

class commonnexus.blocks.characters.Characters(nexus, cmds)[source]

A CHARACTERS block defines characters and includes character data.

Taxa are usually not defined in a CHARACTERS block; if they are not, the CHARACTERS block must be preceded by a block that defines taxon labels and ordering (e.g., TAXA).

Syntax of the CHARACTERS block is as follows:

BEGIN CHARACTERS;

DIMENSIONS [NEWTAXA NTAX=num-taxa] NCHAR=num-characters;

[FORMAT

[DATATYPE = { STANDARD| DNA | RNA | NUCLEOTIDE | PROTEIN | CONTINUOUS} ]
[RESPECTCASE]
[MISSING=symbol]
[GAP=symbol]
[SYMBOLS="symbol [symbol…]"]
[EQUATE="symbol = entry [symbol = entry… ] " ]
[MATCHCHAR= symbol ]
[[NO]LABELS]
[TRANSPOSE]
[INTERLEAVE]
[ITEMS=([MIN] [MAX] [MEDIAN] [AVERAGE] [VARIANCE] [STDERROR] [SAMPLESIZE] [STATES])]
[STATESFORMAT= {STATESPRESENT | INDIVIDUALS | COUNT | FREQUENCY}]
[[NO]TOKENS]

;]

[ELIMINATE character-set;]

[TAXLABELS taxon-name [taxon-name …];]

[CHARSTATELABELS

character-number [character-name] [/state-name [state-name…]]

[, character-number [character-name] [/state-name [state-name…]]…]

;]

[CHARLABELS character-name [character-name…];]

[STATELABELS

character-number [state-name [state-name …]]

[, character-number [state-name [state-name…]]…]

;]

MATRIX data-matrix;

END;

DIMENSIONS, FORMAT, and ELIMINATE must all precede CHARLABELS, CHARSTATELABELS, STATELABELS, and MATRIX. DIMENSIONS must precede ELIMINATE. Only one of each command is allowed per block.

is_binary()[source]

Return type: bool
Returns: Whether the matrix in the block is binary, i.e. codes items as presence/absence using symbols "01".

get_matrix(labeled_states=False)[source]

Parameters: labeled_states (bool) – Flag signaling whether state symbols should be translated to state labels (if available).
Return type: typing.OrderedDict[str, typing.OrderedDict[str, typing.Union[None, str, typing.Set[str], typing.Tuple[str]]]]
Returns: The values of the matrix, read according to FORMAT. The matrix is returned as an ordered dict, mapping taxon labels (if available, else numbers) to ordered dicts mapping character labels (if available, else numbers) to state values. State values are either atomic values (of type str), tuples (indicating polymorphism), or sets (indicating uncertainty) of atomic values. Atomic values may be None (indicating missing data), the special string GAP (indicating gaps), or state symbols or labels (if available and explicitly requested via labeled_states=True). State symbols are returned using the case given in FORMAT SYMBOLS, i.e. if a RESPECTCASE directive is missing and FORMAT SYMBOLS="ABC", a value "a" in the matrix will be returned as "A".
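To make the shape of the returned structure concrete, the sketch below builds a small matrix by hand and classifies each value. It does not call commonnexus; the matrix content is made up, and `GAP = "-"` is only a placeholder because the actual value of the special GAP string is not given here.

```python
from collections import OrderedDict

GAP = "-"  # placeholder; the real GAP marker used by the library may differ

# Hand-built data in the shape described above: taxon -> character -> value.
matrix = OrderedDict([
    ("t1", OrderedDict([("1", "0"), ("2", None), ("3", ("0", "1"))])),
    ("t2", OrderedDict([("1", GAP), ("2", {"0", "1"}), ("3", "1")])),
])


def describe(value):
    """Classify one matrix value according to the conventions listed above."""
    if value is None:
        return "missing"
    if isinstance(value, tuple):
        return "polymorphic: " + "/".join(value)
    if isinstance(value, (set, frozenset)):
        return "uncertain: " + "/".join(sorted(value))
    if value == GAP:
        return "gap"
    return "state " + value


for taxon, row in matrix.items():
    for character, value in row.items():
        print(taxon, character, describe(value))
```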

classmethod from_data(matrix, taxlabels=False, statelabels=None, datatype='STANDARD', missing='?', gap='-', comment=None, nexus=None, TITLE=None, ID=None, LINK=None)[source]

Instantiate a CHARACTERS or DATA block from a matrix.

This functionality can be used to normalize the NEXUS formatting of CHARACTERS matrices:

>>> from commonnexus.blocks.characters import Characters
>>> nex = Nexus('''#NEXUS
... BEGIN TAXA;
... DIMENSIONS NTAX=3;
... TAXLABELS t1 t2 t3;
... END;
... BEGIN CHARACTERS;
... DIMENSIONS NCHAR=3;
... FORMAT TRANSPOSE NOLABELS;
... MATRIX 100 010 001;
... END;''')
>>> matrix = nex.CHARACTERS.get_matrix()
>>> nex.replace_block(nex.CHARACTERS, Characters.from_data(matrix))
>>> print(nex)
#NEXUS
BEGIN TAXA;
DIMENSIONS NTAX=3;
TAXLABELS t1 t2 t3;
END;
BEGIN CHARACTERS;
DIMENSIONS NCHAR=3;
FORMAT DATATYPE=STANDARD MISSING=? GAP=- SYMBOLS="01";
MATRIX
t1 100
t2 010
t3 001
;
END;

Parameters:

matrix (typing.OrderedDict[str, typing.OrderedDict[str, typing.Union[None, str, typing.Set[str], typing.Tuple[str]]]]) – A matrix as returned by Characters.get_matrix(), with unlabeled states. I.e. None is used to mark missing values, and GAP to mark gapped values. These special states will be converted to the symbols passed as missing and gap upon writing.
taxlabels (bool) – If True, include a TAXLABELS command rather than relying on a TAXA block being present.
datatype (str) –
missing (str) –
gap (str) –
nexus (typing.Optional[commonnexus.nexus.Nexus]) – An optional Nexus instance to lookup global config options.
statelabels (typing.Optional[typing.Dict[str, typing.Dict[str, str]]]) –
comment (typing.Optional[str]) –
TITLE (typing.Optional[str]) –
ID (typing.Optional[str]) –
LINK (typing.Union[str, typing.Tuple[str, str], None]) –

Return type:

commonnexus.blocks.characters.Characters

class commonnexus.blocks.characters.Data(nexus, cmds)[source]

This block is equivalent to a CHARACTERS block in which the NEWTAXA subcommand is included in the DIMENSIONS command. That is, the DATA block is a CHARACTERS block that includes not only the definition of characters but also the definition of taxa.

Note

The GAPMODE subcommand of the OPTIONS command of the ASSUMPTIONS block was originally housed in an OPTIONS command in the DATA block.

CHARACTERS Commands

commonnexus.blocks.characters.INVALID_SYMBOLS = '()[]{}/\\,;:=*\'"*`<>^'

Some, but not all, punctuation is invalid as a (special) state symbol.

class commonnexus.blocks.characters.Eliminate(tokens, nexus=None)[source]

This command allows specification of a list of characters that are to be excluded from consideration. Programs are expected to ignore ELIMINATEd characters completely during reading. In avoiding allocation of memory to store character information, the programs can save a considerable amount of computer memory. (This subcommand is similar to ZAP in version 3.1.1 of PAUP.) For example,

ELIMINATE 4-100;

tells the program to skip over characters 4 through 100 in reading the matrix. Character-set names are not allowed in the character list. This command does not affect character numbers.

Warning

The ELIMINATE command is currently not supported in commonnexus.
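Since ELIMINATE is not supported, a caller who needs this behaviour would have to filter characters after reading. The sketch below is a hypothetical helper written for illustration, not library code; it assumes the character list consists only of whitespace-separated numbers and `start-end` ranges. It keeps the original character numbers as keys, reflecting that the command does not affect numbering.

```python
def parse_character_list(spec):
    """Expand a list such as '4-100' or '2 4-5' into a set of character numbers."""
    numbers = set()
    for part in spec.split():
        start, _, end = part.partition("-")
        numbers.update(range(int(start), int(end or start) + 1))
    return numbers


def eliminate(row, spec):
    """Drop eliminated characters from one row, keeping the original numbering."""
    skipped = parse_character_list(spec)
    return {n: state for n, state in enumerate(row, start=1) if n not in skipped}


kept = eliminate("ACGTAC", "2 4-5")
print(kept)  # {1: 'A', 3: 'G', 6: 'C'}
```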

class commonnexus.blocks.characters.Dimensions(tokens, nexus=None)[source]

The DIMENSIONS command specifies the number of characters. The number following NCHAR is the number of characters in the data matrix. The NEXUS standard does not impose limits on the number of characters; a limit may be imposed by particular computer programs.

It is strongly advised that new taxa not be defined in a CHARACTERS block, for the reasons discussed in the description of the DATA block. If new taxa are to be defined, this must be indicated by the NEWTAXA subcommand, specifying that new taxa are to be defined (this allows the computer program to prepare for creation of new taxa). NEWTAXA, if present, must appear before the NTAX subcommand. The NTAX subcommand, indicating the number of taxa in the MATRIX command in the block, is optional, unless NEWTAXA is specified, in which case it is required.

Variables:

newtaxa (bool) –
ntax (Optional[int]) –
nchar (int) –

class commonnexus.blocks.characters.Format(tokens, nexus=None)[source]

The FORMAT command specifies the format of the data MATRIX. This is a crucial command because misinterpretation of the format of the data matrix could lead to anything from incorrect results to spectacular crashes. The DATATYPE subcommand must appear first in the command.

The RESPECTCASE subcommand must appear before the MISSING, GAP, SYMBOLS, and MATCHCHAR subcommands.

The following are possible formatting subcommands.

DATATYPE = {STANDARD | DNA | RNA | NUCLEOTIDE | PROTEIN | CONTINUOUS}. This subcommand specifies the class of data. If present, it must be the first subcommand in the FORMAT command. Standard data consist of any general sort of discrete character data, and this class is typically used for morphological data, restriction site data, and so on. DNA, RNA, NUCLEOTIDE, and PROTEIN designate molecular sequence data. Meristic morphometric data and other information with continuous values can be housed in matrices of DATATYPE=CONTINUOUS. These DATATYPES are described in detail, with examples, at the end of the description of the CHARACTERS block.

Warning

DATATYPE=CONTINUOUS is currently not supported in commonnexus. Some programs accept (or expect) datatypes beyond the ones defined in the NEXUS spec; e.g. MrBayes has DATATYPE=RESTRICTION and Beauti may create “NEXUS” files with DATATYPE=BINARY. commonnexus does not accept these non-standard datatypes and raises an exception when trying to read the MATRIX. Thus, to make “NEXUS” files with non-standard datatypes readable for commonnexus, substituting DATATYPE=STANDARD is typically the right thing to do.
RESPECTCASE. By default, information in a MATRIX may be entered in uppercase, lowercase, or a mixture of uppercase and lowercase. If RESPECTCASE is requested, case is considered significant in SYMBOLS, MISSING, GAP, and MATCHCHAR subcommands and in subsequent references to states. For example, if RESPECTCASE is invoked, then SYMBOLS="A a B b" designates four states whose symbols are A, a, B, and b, which can then each be used in the MATRIX command and elsewhere. If RESPECTCASE is not invoked, then A and a are considered homonymous state symbols. This subcommand must appear before the SYMBOLS subcommand. This subcommand is not applicable to DATATYPE = DNA, RNA, NUCLEOTIDE, PROTEIN, and CONTINUOUS.
MISSING. This subcommand declares the symbol that designates missing data. The default is "?". For example, MISSING=X defines an X to represent missing data. Whitespace is illegal as a missing data symbol, as are the INVALID_SYMBOLS.
GAP. This subcommand declares the symbol that designates a data gap (e.g., base absent in DNA sequence because of deletion or an inapplicable character in morphological data). There is no default gap symbol; a gap symbol must be defined by the GAP subcommand before any gaps can be entered into the matrix. For example, GAP=- defines a hyphen to represent a gap. Whitespace is illegal as a gap symbol, as are the INVALID_SYMBOLS.
SYMBOLS. This subcommand specifies the symbols and their order for character states used in the file (including in the MATRIX command). For example, SYMBOLS="0 1 2 3 4 5 6 7" designates numbers 0 through 7 as acceptable symbols in a matrix. The SYMBOLS subcommand is not allowed for DATATYPE=CONTINUOUS. The default symbols list differs from one DATATYPE to another, as described under state symbol in the Appendix. Whitespace is not needed between elements: SYMBOLS="012" is equivalent to SYMBOLS="0 1 2". For STANDARD DATATYPES, a SYMBOLS subcommand will replace the default symbols list of "0 1". For DNA, RNA, NUCLEOTIDE, and PROTEIN DATATYPES, a SYMBOLS subcommand will not replace the default symbols list but will add character-state symbols to the SYMBOLS list. The NEXUS standard does not define the position of these additional symbols within the SYMBOLS list. (These additional symbols will be inserted at the beginning of the SYMBOLS list in PAUP and at the end in MacClade. MacClade will accept additional symbols for PROTEIN but not DNA, RNA, and NUCLEOTIDE matrices.)

Warning

While the specification requires the content of the SYMBOLS subcommand to be enclosed in doublequotes, commonnexus also allows unquoted content; i.e. SYMBOLS=01 is treated as equivalent to SYMBOLS="01".
EQUATE. This subcommand allows one to define symbols to represent one matrix entry. For example, EQUATE="E=(012)" means that each occurrence of E in the MATRIX command will be interpreted as meaning states 0, 1, and 2. The equate symbols cannot be any of the INVALID_SYMBOLS or any of the currently defined MISSING, GAP, MATCHCHAR, or state SYMBOLS. Case is significant in equate symbols. That is, MISSING=? EQUATE="E=(012)e=?" means that E will be interpreted as 0, 1, and 2 and e will be interpreted as missing data.
MATCHCHAR. This subcommand defines a matching character symbol. If this subcommand is included, then a matching character symbol in the MATRIX indicates that the states are equivalent to the states possessed by the first taxon listed in the matrix for that character. In the following matrix, the sequence for taxon 2 is GACTTTC:
```
BEGIN DATA;
    DIMENSIONS NCHAR = 7;
    FORMAT DATATYPE=DNA MATCHCHAR=.;
    MATRIX
        taxon_1 GACCTTA
        taxon_2 ...T..C
        taxon_3 ..T.C..;
END;
```
Whitespace is illegal as a matching character symbol, as are the INVALID_SYMBOLS.

Warning

commonnexus uses “.” as default MATCHCHAR. So if “.” is used as a regular state symbol, the NEXUS must be read using the no_default_matchchar config option.
[NO]LABELS. This subcommand declares whether taxon or character labels are to appear on the left side of the matrix. By default, they should appear. If NOLABELS is used, then no labels appear, but then all currently defined taxa must be included in the MATRIX in the order in which they were originally defined.
TRANSPOSE. This subcommand indicates that the MATRIX is in transposed format, with each row of the matrix representing the information from one character and each column representing the information from one taxon. The following is an example of a TRANSPOSEd MATRIX:
```
MATRIX
    character_1 101101
    character_2 011100
    character_3 011110;
```
INTERLEAVE. This subcommand indicates that the MATRIX is in interleaved format, i.e., it is broken up into sections. If the data are not transposed, then each section contains the information for some of the characters for all taxa. For example, the first section might contain data for characters 1-50 for all taxa, the second section contains data for characters 51-100, etc. Taxa in each section must occur in the same order. This format is especially useful for molecular sequence data, where the number of characters can be large. A small interleaved matrix follows:
```
MATRIX
    taxon_1 ACCTCGGC
    taxon_2 ACCTCGGC
    taxon_3 ACGTCGCT
    taxon_4 ACGTCGCT
    taxon_1 TTAACGA
    taxon_2 TTAACCA
    taxon_3 CTCACCA
    taxon_4 TTCACCA
```
The interleaved sections need not all be of the same length. In an interleaved matrix, newline characters are significant: they indicate that the next character information encountered applies to a different taxon (for nontransposed matrices).
ITEMS. Each entry in the matrix gives information about a character’s condition in a taxon. The ITEMS subcommand indicates what items of information are listed at each entry of the matrix. With discrete character data, the entry typically consists of the states observed in the taxon (either the single state observed or several states if the taxon is polymorphic or of uncertain state). This can be specified by the statement ITEMS=STATES, but because it is the default and the only option allowed by most current programs for discrete data, an ITEMS statement is usually unnecessary. For continuous data, however, the wealth of alternatives (average, median, variance, minimum, maximum, sample size) often requires an explicit ITEMS statement to indicate what information is represented in each data matrix entry. Some ITEMS (such as VARIANCE) would be appropriate to only some DATATYPES; other ITEMS such as SAMPLESIZE and STATES would be appropriate to most or all DATATYPES. If more than one item is indicated, parentheses must be used to surround the list of items, e.g., ITEMS=(AVERAGE VARIANCE); otherwise the parentheses are unnecessary, e.g., ITEMS=AVERAGE. More information about ITEMS options can be found in the discussion of the different DATATYPES under MATRIX; information specifically about the STATES option is given under STATESFORMAT.

Warning

Settings other than ITEMS=STATES are currently not supported in commonnexus.

STATESFORMAT. The entry in a matrix usually lists (for discrete data) or may list (for continuous data) the states observed in the taxon. The STATESFORMAT subcommand specifies what information is conveyed in that list of STATES. In most current programs for discrete data, when a taxon is polymorphic the entry of the matrix lists only what distinct states were observed, without any indication of the number or frequency of individuals sampled with each of the states. Thus, if all individuals sampled within the taxon have state A, the matrix entry would be A, whereas if some have state A and others have state B, the entry would be (AB), which corresponds to the option STATESFORMAT=STATESPRESENT. Because it is the default for discrete data, this statement is typically unnecessary with current programs. The other STATESFORMAT options can be illustrated with an example, in which two individuals of a taxon were observed to have state A and three were observed to have state B. When STATESFORMAT=INDIVIDUALS, the state of each of the individuals (or other appropriate sampling subunit) is listed exhaustively, (A A B B B); when STATESFORMAT=COUNT, the number of individuals with each observed state is indicated, e.g., (A:2 B:3); when STATESFORMAT=FREQUENCY, the frequencies of various observed states are indicated, e.g., (A:0.40 B:0.60). The STATESFORMAT command may also be used for continuous data, for which the default is STATESFORMAT=INDIVIDUALS.

Warning

Only the default setting STATESFORMAT=STATESPRESENT is currently supported in commonnexus.
[NO]TOKENS. This subcommand specifies whether data matrix entries are single symbols or whether they can be tokens. If TOKENS, then the data values must be full NEXUS tokens, separated by whitespace or punctuation as appropriate, as in the following example:
```
BEGIN CHARACTERS;
    DIMENSIONS NCHAR= 3 ;
    CHARSTATELABELS 1 hair/absent
        present, 2 color/red blue,
        3 size/small big;
    FORMAT TOKENS;
    MATRIX
        taxon_1 absent red big
        taxon_2 absent blue small
        taxon_3 present blue small ;
END;
```
TOKENS is the default (and the only allowed option) for DATATYPE=CONTINUOUS; NOTOKENS is the default for all other DATATYPES. TOKENS is not allowed for DATATYPES DNA, RNA, and NUCLEOTIDE. If TOKENS is invoked, the standard three-letter amino acid abbreviations can be used with DATATYPE=PROTEIN and defined state names can be used for DATATYPE=STANDARD.

Warning

TOKENS is currently not supported in commonnexus.
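The MATCHCHAR rule from the list above (a match symbol stands for the state of the first taxon in the same column) can be written out as a short sketch. `resolve_matchchar` is a hypothetical helper, not a commonnexus function, and it assumes every entry is a single symbol (NOTOKENS, no polymorphic entries).

```python
def resolve_matchchar(rows, matchchar="."):
    """Replace match characters with the first taxon's state in that column."""
    rows = list(rows)
    if not rows:
        return []
    _, reference = rows[0]  # the first taxon listed supplies the states
    resolved = []
    for label, sequence in rows:
        states = (ref if state == matchchar else state
                  for state, ref in zip(sequence, reference))
        resolved.append((label, "".join(states)))
    return resolved


rows = [("taxon_1", "GACCTTA"), ("taxon_2", "...T..C"), ("taxon_3", "..T.C..")]
resolved = dict(resolve_matchchar(rows))
print(resolved["taxon_2"])  # GACTTTC
```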

Variables:

datatype (str) –
respectcase (bool) –
missing (str) –
gap (Optional[str]) –
symbols (List[str]) –
equate (Dict[str, str]) –
matchchar (Optional[str]) –
labels (Optional[bool]) –
transpose (bool) –
interleave (bool) –
items (List[str]) –
statesformat (Optional[str]) –
tokens (Optional[bool]) –

Note

It’s typically not necessary to access the attributes of a Format instance from user code. Instead, the information is accessed when reading the matrix data in Characters.get_matrix().
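As an illustration of how FORMAT settings bear on reading a single matrix symbol, here is a minimal sketch. It is not how commonnexus is implemented; `interpret_symbol` and `GAP_MARKER` are hypothetical names, and EQUATE and MATCHCHAR handling are left out. The case rule follows the description of get_matrix(): without RESPECTCASE, a symbol is returned in the case given in SYMBOLS.

```python
GAP_MARKER = "GAP"  # placeholder for whatever marker a reader uses for gaps


def interpret_symbol(raw, symbols="01", missing="?", gap=None, respectcase=False):
    """Translate one raw matrix symbol into None, a gap marker, or a state symbol."""
    if raw == missing:
        return None
    if gap is not None and raw == gap:
        return GAP_MARKER
    for symbol in symbols:
        if symbol.isspace():
            continue  # whitespace between symbols is optional
        if raw == symbol or (not respectcase and raw.upper() == symbol.upper()):
            return symbol  # returned in the case used in SYMBOLS
    raise ValueError(f"unknown state symbol: {raw!r}")


print(interpret_symbol("a", symbols="ABC"))  # A
print(interpret_symbol("a", symbols="A a B b", respectcase=True))  # a
```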

class commonnexus.blocks.characters.Charstatelabels(tokens, nexus=None)[source]

This command allows specification of both the names of the characters and the names of the states. This command was developed as an alternative to the older commands CHARLABELS and STATELABELS. For example,

CHARSTATELABELS
1 eye_color/red blue green,
3 head_shape/round square,
5 pronotum_size/small medium large

A forward slash (/) separates the character name and the state names, with a comma separating the information for different characters. If no state names are to be specified, the slash may be omitted; if no character names are to be specified, the slash must be included, but no token needs to be included between the character number and the slash. If state x is the last state to be named, then subsequent states need not be named, but states 1 through x must be. If no name is to be applied to a state, enter a single underscore for its name. Character and state names are single NEXUS words. Character names must not correspond to another character name or number; thus, 1 is not a valid name for the second character listed. State names cannot be applied if DATATYPE=CONTINUOUS.

Variables: characters (List[types.SimpleNamespace]) –

>>> cmd = Charstatelabels('1 eye_color/red blue green, 3 head_shape/round square')
>>> cmd.characters[0].name
'eye_color'
>>> cmd.characters[0].states
['red', 'blue', 'green']

Warning

In strict mode (see commonnexus.nexus.Config) duplicate character names will raise a ValueError, otherwise a UserWarning will be emitted. While a matrix with duplicate character names can still be read, it will typically not be as expected, because only the values for the last character for a given name will be present.
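The syntax and the duplicate-name behaviour described above can be approximated with a small stand-alone parser. This is a sketch only, not the commonnexus implementation: `parse_charstatelabels` is a hypothetical function, it assumes unquoted single-word names, and the `number` attribute is an assumption (only `name` and `states` appear in the example above).

```python
import warnings
from types import SimpleNamespace


def parse_charstatelabels(body, strict=False):
    """Parse 'number [name] [/state ...]' items separated by commas."""
    characters, seen = [], set()
    for item in body.split(","):
        if not item.strip():
            continue  # tolerate a trailing comma
        head, _, tail = item.partition("/")
        number, *rest = head.split()
        name = rest[0] if rest else None
        if name is not None and name in seen:
            if strict:
                raise ValueError(f"duplicate character name: {name}")
            warnings.warn(f"duplicate character name: {name}", UserWarning)
        if name is not None:
            seen.add(name)
        characters.append(
            SimpleNamespace(number=int(number), name=name, states=tail.split()))
    return characters


chars = parse_charstatelabels("1 eye_color/red blue green, 3 head_shape/round square")
print(chars[1].number, chars[1].name, chars[1].states)
```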

class commonnexus.blocks.characters.Charlabels(tokens, nexus=None)[source]

This command allows specification of names of characters:

CHARLABELS
    flange microsculpture
    body_length
    hind_angles #_spines
    spine_size _ _ head_size
    pubescent_intervals head_color
    clypeal_margin;

Character labels are listed consecutively. If character x is the last character to be named, then subsequent characters need not be named, but characters 1 through x need to be. If no name is to be applied to a character, a single underscore can be used for its name. Character names are single NEXUS words. They must not correspond to another character name or number; thus, 1 is not a valid name for the second character listed. The command should be used only for nontransposed matrices (in transposed matrices, the character labels are defined in the MATRIX command). We recommend that programs abandon this command in place of the more flexible CHARSTATELABELS command when writing NEXUS files, although programs should continue to read CHARLABELS because many existing NEXUS files use CHARLABELS.

Variables: characters (List[types.SimpleNamespace]) –
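Because CHARLABELS names are listed consecutively, the character number is simply the position in the list. A minimal sketch of that rule follows; `number_charlabels` is a hypothetical helper that assumes unquoted single-word names and represents the underscore placeholder as None.

```python
def number_charlabels(body):
    """Map 1-based character numbers to names; '_' means no name is applied."""
    return {
        number: (None if label == "_" else label)
        for number, label in enumerate(body.split(), start=1)
    }


labels = number_charlabels(
    "flange microsculpture body_length hind_angles #_spines "
    "spine_size _ _ head_size pubescent_intervals head_color clypeal_margin"
)
print(labels[9])  # head_size
```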

class commonnexus.blocks.characters.Statelabels(tokens, nexus=None)[source]

This command allows specification of the names of states:

STATELABELS
absent present,
isodiametric transverse,
'4.5-6.2mm' '6.3-7.0mm' '7.7-11.0mm' '>12.0mm',
rounded subangulate angulate,
0 '1-4' '6-9' '7-9' '8-9' 7 8 9,
black rufous metallic flavous,
straight concave,

State labels need not be specified for all characters. A comma must separate state labels for each character. State labels are listed consecutively within a character. If state x is the last state to be named, then subsequent states need not be named, but states 1 through x must be. If no name is to be applied to a state, enter a single underscore for its name. State names are single NEXUS words. This command is not valid for DATATYPE=CONTINUOUS. We recommend that programs abandon this command in place of the more flexible CHARSTATELABELS command when writing NEXUS files, although programs should continue to read STATELABELS because many existing NEXUS files use STATELABELS.

Variables: characters (List[types.SimpleNamespace]) –

class commonnexus.blocks.characters.Matrix(tokens, nexus=None)[source]

In its standard format, the MATRIX command contains a sequence of taxon names and state information for that taxon. The MATRIX itself is of the form

MATRIX
    taxon-name entry entry... entry
    taxon-name entry entry... entry
    taxon-name entry entry... entry;

Each entry in the matrix is the information about a particular character for a particular taxon. For example, it might be the assignment of state 0 to taxon 1 for character 1. Thus, the entry would consist of one state symbol, 0. If the taxon were polymorphic, the entry would consist of multiple state symbols, e.g. (0 1), indicating the taxon has both states 0 and 1. More details about the nature of each entry of the matrix are given under ITEMS and under each DATATYPE. Each entry needs to be enclosed in parentheses or braces whenever more than one state symbol is given, e.g. (01) with standard data and the default NOTOKENS option, or if the information is conveyed by more than one NEXUS token, e.g., (0:100) or (2.3 4.5 6.7). Otherwise, the parentheses or braces are optional. No whitespace is needed between entries in the matrix unless the TOKENS subcommand of the FORMAT command is invoked or implied and parentheses or braces do not surround an entry. Taxa need not be in the same order as in the TAXA block, and the matrix need not contain all taxa. For interleaved matrices, all sections must have the same taxa in the same order. Examples of matrices of different DATATYPES are described below.
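The entry rules above (a bare symbol for a single state, parentheses or braces around several) can be illustrated with a small scanner. This is a sketch under the default NOTOKENS assumption, where every state is one symbol; `parse_row` is a hypothetical name and not part of commonnexus. Parenthesised entries become tuples and braced entries become sets, matching the value types returned by Characters.get_matrix().

```python
def parse_row(row):
    """Split the state data of one taxon into entries, one per character."""
    closers = {"(": (")", tuple), "{": ("}", set)}
    entries, i = [], 0
    while i < len(row):
        char = row[i]
        if char.isspace():
            i += 1  # whitespace between entries is optional
        elif char in closers:
            closer, kind = closers[char]
            end = row.index(closer, i)
            states = [c for c in row[i + 1:end] if not c.isspace()]
            entries.append(kind(states))
            i = end + 1
        else:
            entries.append(char)  # a single state symbol
            i += 1
    return entries


entries = parse_row("(-+){-+}+---+--")
print(len(entries))  # 9
```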

For STANDARD data, each entry of the matrix consists of a single state-set. Under the defaults (ITEMS=STATES and STATESFORMAT=STATESPRESENT), each entry of the matrix consists of a single state-set; if there are multiple states, then the entry must be enclosed in parentheses (indicating polymorphism) or braces (indicating uncertainty in state). For example, in the following matrix,
```
BEGIN CHARACTERS;
    DIMENSIONS NCHAR=9;
    FORMAT SYMBOLS="-+x";
    MATRIX
        taxon_1 (-+){-+}+---+--
        taxon_2 +x-++--+x
        taxon_3 -++++--+x;
END;
```
taxon_1 is polymorphic for the first character and has either state - or state + for the second character. If STATESFORMAT=COUNT or FREQUENCY, then each entry must be enclosed in parentheses because more than one token is required to convey information for even one state:
```
BEGIN CHARACTERS;
    DIMENSIONS NCHAR=3;
    FORMAT STATESFORMAT=FREQUENCY SYMBOLS = "012";
    MATRIX
        taxon_1 (0:0.25 1:0.75) (0:0.3 1:0.7) (0:0.5 1:0.3 2:0.2)
        taxon_2 (0:0.4 1:0.6) (0:0.8 1:0.2) (1:0.15 2:0.85)
        taxon_3 (0:0.0 1:1.0) (0:0.55 1:0.45) (0:0.1 1:0.9);
END;
```
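Entries of this kind can be read with a few lines of Python. Since STATESFORMAT=FREQUENCY is not supported by commonnexus, the following is a stand-alone sketch with a hypothetical function name; it assumes that the state:frequency pairs inside an entry are separated by whitespace, as in the (A:0.40 B:0.60) example given for STATESFORMAT.

```python
import re

# One parenthesised entry, without nested parentheses.
ENTRY = re.compile(r"\(([^()]*)\)")


def parse_frequency_row(row):
    """Turn '(0:0.25 1:0.75) ...' into a list of {state: frequency} dicts."""
    entries = []
    for body in ENTRY.findall(row):
        entry = {}
        for pair in body.split():
            state, _, value = pair.partition(":")
            entry[state] = float(value)
        entries.append(entry)
    return entries


row = parse_frequency_row("(0:0.25 1:0.75) (0:0.3 1:0.7) (0:0.5 1:0.3 2:0.2)")
print(row[0])  # {'0': 0.25, '1': 0.75}
```

Checking that the frequencies of each entry sum to 1 is a cheap way to detect entries whose separating whitespace was lost.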
For DNA, RNA, NUCLEOTIDE, and PROTEIN data, each entry of the matrix consists of one or more state symbols describing the state(s) at one site in a molecular sequence. If STATESFORMAT=STATESPRESENT and if an entry represents a single state, then it is represented as a single state symbol (or if DATATYPE=PROTEIN and TOKENS, as a three-letter amino acid name). If an entry represents multiple states, then it must be enclosed in parentheses (indicating polymorphism) or braces (indicating uncertainty in state). Following is a matrix of DATATYPE=DNA:
```
BEGIN CHARACTERS;
    DIMENSIONS NCHAR=12;
    FORMAT DATATYPE = DNA;
    MATRIX
        taxon_1 ACCATGGTACGT
        taxon_2 TCCATGCTACCC
        taxon_3 TCCATGGAACCC;
END;
```
For CONTINUOUS data, each entry in the matrix must be enclosed by parentheses if more than one item is specified in the ITEMS subcommand. Parentheses must also be used whenever multiple tokens are needed for an entry in the matrix. If an entry consists of a single token (e.g., 0.231), it may be written without parentheses but must then be separated from other entries by whitespace.
```
MATRIX
    A 0.453 1.43 78.6
    B 0.34 1.02 55.7
    C 0.22 1.79 69.1;
```
A matrix entry can include average, minimum, maximum, variance, standard error, sample size, and a listing of states observed in the taxon, as specified in the ITEMS subcommand. The sample size, if included, must be in the form of an integer; the other numbers can be either in English decimal (e.g., 0.00452) or in exponential form (e.g., 4.52E-3). The information listed for each taxon for a continuous character is specified in the ITEMS subcommand of the FORMAT command. For example, if the matrix contains only information about the minimum and maximum value for each taxon, the ITEMS subcommand would be ITEMS=(MIN MAX) and a small matrix might look something like this:
```
MATRIX
    taxon_1 (0.21 0.45) (0.34 0.36)
    taxon_2 (0.13 0.22) (0.45 0.55);
```
If the ITEMS include the raw measurements (states), e.g., to list a sample of measurements from individuals, then the other items must precede the listing of states. There is no restriction on the number of elements in the listing of states. This example has only one continuous character:
```
FORMAT DATATYPE=CONTINUOUS ITEMS=(AVERAGE STATES) STATESFORMAT=INDIVIDUALS;
MATRIX
    taxon_1 (1.2 2.1 1.6 0.8 1.8 0.3 0.6)
    taxon_2 (1.6 2.2 1.7 1.0 2.0 1.6 1.9 0.8);
```
in which the first value is the sample average and the subsequent values comprise the sample of observed states. Possible ITEMS to be included are MIN (minimum), MAX (maximum), AVERAGE (sample average), VARIANCE (sample variance), STDERROR (standard error), MEDIAN (sample median), SAMPLESIZE, and STATES. The manner of presentations of states can be indicated using the STATESFORMAT command. The default ITEMS for continuous data is AVERAGE.
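How the ITEMS list determines the meaning of each number in an entry can be sketched as follows. DATATYPE=CONTINUOUS is not supported by commonnexus, so this is a stand-alone illustration with a hypothetical function name. It relies on the rule above that STATES, if present, comes last and takes all remaining values, and it reads SAMPLESIZE as an integer.

```python
def parse_continuous_entry(entry, items):
    """Pair the tokens of one matrix entry with the names given in ITEMS."""
    tokens = entry.strip("()").split()
    result = {}
    for index, item in enumerate(items):
        if item == "STATES":
            # The listing of states has no fixed length and comes last.
            result[item] = [float(token) for token in tokens[index:]]
            break
        convert = int if item == "SAMPLESIZE" else float
        result[item] = convert(tokens[index])
    return result


print(parse_continuous_entry("(0.21 0.45)", ["MIN", "MAX"]))
# {'MIN': 0.21, 'MAX': 0.45}
sample = parse_continuous_entry("(1.2 2.1 1.6 0.8 1.8 0.3 0.6)", ["AVERAGE", "STATES"])
print(sample["AVERAGE"], sample["STATES"])
```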

Note

Since reading the matrix data only makes sense if information from other commands - in particular FORMAT - is considered, the Matrix object does not have any attributes for data access. Instead, the matrix data can be read via Characters.get_matrix().
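To show why FORMAT information is needed before the matrix can be interpreted, here is a sketch of joining the sections of an interleaved, nontransposed matrix. It is not the commonnexus implementation; `join_interleaved` is a hypothetical helper that assumes every line starts with a taxon label and that all sections list the same taxa.

```python
from collections import OrderedDict


def join_interleaved(lines):
    """Concatenate the per-section data of each taxon, in order of appearance."""
    sequences = OrderedDict()
    for line in lines:
        if not line.strip():
            continue  # blank lines between sections carry no data
        label, data = line.split(None, 1)
        sequences[label] = sequences.get(label, "") + "".join(data.split())
    return sequences


lines = """
taxon_1 ACCTCGGC
taxon_2 ACCTCGGC
taxon_3 ACGTCGCT
taxon_4 ACGTCGCT
taxon_1 TTAACGA
taxon_2 TTAACCA
taxon_3 CTCACCA
taxon_4 TTCACCA
""".splitlines()

joined = join_interleaved(lines)
print(joined["taxon_1"])  # ACCTCGGCTTAACGA
```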

class commonnexus.blocks.characters.Options(tokens, nexus=None)[source]

The GAPMODE subcommand of the OPTIONS command of the ASSUMPTIONS block was originally housed in an OPTIONS command in the DATA block.

Variables:: gapmode (Optional[str]) – missing or newstate.

UNALIGNED [not supported]

class commonnexus.blocks.unaligned.Unaligned(nexus, cmds)[source]

Warning

commonnexus doesn’t provide any functionality - other than parsing as generic commands - for UNALIGNED blocks yet.

The UNALIGNED block includes data that are not aligned. Its primary intent is to house unaligned DNA, RNA, NUCLEOTIDE, and PROTEIN sequence data. Taxa are usually not defined in an UNALIGNED block; if not, this block must be preceded by a block that defines taxon labels and ordering (e.g., TAXA). Syntax of the UNALIGNED block is as follows:

BEGIN UNALIGNED;

[DIMENSIONS NEWTAXA NTAX=number-of-taxa;]
[FORMAT
[DATATYPE = { STANDARD | DNA | RNA | NUCLEOTIDE | PROTEIN}]
[RESPECTCASE]
[MISSING=symbol]
[SYMBOLS="symbol [symbol…]"]
[EQUATE = "symbol=entry [symbol=entry…]"]
[[NO]LABELS]
;]
[TAXLABELS taxon-name [taxon-name…];]
MATRIX data-matrix;

END;

Commands must appear in the order listed. Only one of each command is allowed per block. The DIMENSIONS command should only be included if new taxa are being defined in this block, which is discouraged (see discussion under DATA block). The format for the DIMENSIONS command is as in the CHARACTERS block, except NCHAR is not allowed. Subcommands of the FORMAT command are described in the CHARACTERS block. The TAXLABELS command serves to define taxa and is only allowed if the NEWTAXA token is included in the DIMENSIONS statement. It follows the same form as described in the TAXA block. Here is an example of an UNALIGNED block:

BEGIN UNALIGNED;
    FORMAT DATATYPE=DNA;
    MATRIX
        taxon_1 ACTAGGACTAGATCAAGTT,
        taxon_2 ACCAGGACTAGCGGATCAAG,
        taxon_3 ACCAGGACTAGATCAAG,
        taxon_4 AGCCAGGACTAGTTC,
        taxon_5 ATCAGGACTAGATCAAGTTC;
END;

A comma must be placed at the end of each sequence (except the last, which requires a semicolon). Each sequence can occupy more than one line.
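Since commonnexus only parses UNALIGNED blocks as generic commands, reading the sequences is left to the caller. The following stdlib-only sketch shows one way the comma-separated layout described above could be split into sequences. The helper `parse_unaligned_matrix` is hypothetical, and it assumes unquoted single-token taxon labels, no comments, and that the text after the MATRIX keyword has already been extracted.

```python
def parse_unaligned_matrix(body):
    """Map taxon labels to sequences for an UNALIGNED MATRIX body (hypothetical helper)."""
    sequences = {}
    # Drop the terminating semicolon, then split on the commas between sequences.
    for entry in body.rstrip().rstrip(";").split(","):
        tokens = entry.split()
        if not tokens:
            continue
        label, chunks = tokens[0], tokens[1:]
        # A sequence may span several lines, so re-join its pieces.
        sequences[label] = "".join(chunks)
    return sequences


body = """
    taxon_1 ACTAGGACTAGATCAAGTT,
    taxon_2 ACCAGGACTAG
            CGGATCAAG,
    taxon_3 ACCAGGACTAGATCAAG;
"""
print(parse_unaligned_matrix(body))
```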

DISTANCES

class commonnexus.blocks.distances.Distances(nexus, cmds)[source]

This block contains distance matrices. Taxa are usually not defined in a DISTANCES block; if they are not, this block must be preceded by a block that defines taxon labels and ordering (e.g., TAXA). The syntax of the block is as follows:

BEGIN DISTANCES;

[DIMENSIONS [NEWTAXA] NTAX=num-taxa NCHAR=num-characters;]
[FORMAT
[TRIANGLE={LOWER | UPPER | BOTH}]
[[NO]DIAGONAL]
[[NO]LABELS]
[MISSING=SYMBOL]
[INTERLEAVE]
;]
[TAXLABELS taxon-name [taxon-name…];]
MATRIX distance-matrix;

END;

Commands must appear in the order listed. Only one of each command is allowed per block.

get_matrix()[source]

Return type:: typing.OrderedDict[str, typing.OrderedDict[str, typing.Optional[decimal.Decimal]]]
Returns:: A full distance matrix encoded as nested ordered dictionaries.

>>> from commonnexus import Nexus
>>> nex = Nexus('''#NEXUS
... BEGIN DISTANCES;
...     DIMENSIONS NEWTAXA NTAX=5;
...     TAXLABELS taxon_1 taxon_2 taxon_3 taxon_4 taxon_5;
...     FORMAT TRIANGLE=UPPER;
...     MATRIX
...         taxon_1 0.0  1.0  2.0  4.0  7.0
...         taxon_2      0.0  3.0  5.0  8.0
...         taxon_3           0.0  6.0  9.0
...         taxon_4                0.0 10.0
...         taxon_5                     0.0;
... END;''')
>>> nex.DISTANCES.get_matrix()['taxon_3']['taxon_1']
Decimal('2.0')

classmethod from_data(matrix, taxlabels=False, comment=None, nexus=None, TITLE=None, ID=None, LINK=None)[source]

Create a DISTANCES block from the distance matrix matrix.

Parameters:

matrix (typing.OrderedDict[str, typing.OrderedDict[str, typing.Union[None, float, int, decimal.Decimal]]]) – The distance matrix as dict mapping taxon labels to dicts mapping taxon labels to numbers. A “full” matrix is expected here, just like it is returned from Distances.get_matrix().
taxlabels (bool) – Whether to include a TAXLABELS command.
comment (typing.Optional[str]) –
nexus (typing.Optional[commonnexus.nexus.Nexus]) –
TITLE (typing.Optional[str]) –
ID (typing.Optional[str]) –
LINK (typing.Union[str, typing.Tuple[str, str], None]) –

Return type:

commonnexus.blocks.base.Block

DISTANCES Commands

class commonnexus.blocks.distances.Dimensions(tokens, nexus=None)[source]

The NTAX subcommand of this command is needed to process the matrix when some defined taxa are omitted from the distance matrix. The NCHAR subcommand is optional and can be used to indicate the number of characters for those analyses that need to know how many characters (if any) were used to calculate the distances. NCHAR is not required for successful reading of the matrix. As for the CHARACTERS and UNALIGNED blocks, taxa can be defined in a DISTANCES block if NEWTAXA precedes the NTAX subcommand in the DIMENSIONS command. It is advised that new taxa not be defined in a DISTANCES block, for the reasons discussed in the description of the DATA block. NEWTAXA, if present, must appear before the NTAX subcommand.

Variables:

newtaxa (bool) –
nchar (Optional[int]) –
ntax (int) –

class commonnexus.blocks.distances.Format(tokens, nexus=None)[source]

This command specifies the formatting of the MATRIX. The [NO]LABELS and MISSING subcommands are as described in the CHARACTERS block.

TRIANGLE = {LOWER | UPPER | BOTH}. This subcommand specifies whether only the lower left half of the matrix is present, or only the upper right, or both halves. Below is one example of an upper triangular matrix and one of a matrix with both halves included.

BEGIN DISTANCES;
    FORMAT TRIANGLE=UPPER;
    MATRIX
        taxon_1 0.0  1.0  2.0  4.0  7.0
        taxon_2      0.0  3.0  5.0  8.0
        taxon_3           0.0  6.0  9.0
        taxon_4                0.0 10.0
        taxon_5                     0.0;
END;

BEGIN DISTANCES;
    FORMAT TRIANGLE = BOTH;
    MATRIX
        taxon_1  0    1.0  2.0  4.0  7.0
        taxon_2  1.0  0    3.0  5.0  8.0
        taxon_3  2.0  3.0  0    6.0  9.0
        taxon_4  4.0  5.0  6.0  0   10.0
        taxon_5  7.0  8.0  9.0 10.0  0;
END;

DIAGONAL. If DIAGONAL is turned off, the diagonal elements are not included:
```
FORMAT NODIAGONAL;
MATRIX
    taxon_1
    taxon_2  1.0
    taxon_3  2.0  3.0
    taxon_4  4.0  5.0  6.0
    taxon_5  7.0  8.0  9.0 10.0;
```
If TRIANGLE is not BOTH and DIAGONAL is turned off, then there will be one row that contains only the name of a taxon. This row is required. If TRIANGLE=BOTH, then the diagonal must be included.
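To make the relation between the triangular layouts and a full matrix concrete, here is a minimal stdlib-only sketch that mirrors a lower triangle into nested dictionaries. It is not how `Distances.get_matrix()` is implemented; the helper name `expand_lower_triangle` is hypothetical, and filling an omitted diagonal with 0 is an assumption of this sketch.

```python
def expand_lower_triangle(rows, diagonal=True):
    """Expand a TRIANGLE=LOWER matrix into a full, symmetric nested dict.

    rows is a list of (taxon_label, [distances]) pairs in matrix order.
    """
    labels = [label for label, _ in rows]
    full = {a: {b: None for b in labels} for a in labels}
    for i, (label, values) in enumerate(rows):
        # Row i holds i values without the diagonal, i + 1 values with it.
        expected = i + 1 if diagonal else i
        if len(values) != expected:
            raise ValueError(
                f"row {label!r}: expected {expected} values, got {len(values)}")
        for other, value in zip(labels, values):
            full[label][other] = value
            full[other][label] = value
        if not diagonal:
            # Assumption of this sketch: an omitted diagonal means distance 0.
            full[label][label] = 0.0
    return full


rows = [
    ("taxon_1", []),  # the required row that contains only the taxon name
    ("taxon_2", [1.0]),
    ("taxon_3", [2.0, 3.0]),
    ("taxon_4", [4.0, 5.0, 6.0]),
    ("taxon_5", [7.0, 8.0, 9.0, 10.0]),
]
full = expand_lower_triangle(rows, diagonal=False)
print(full["taxon_1"]["taxon_5"])
```

The length check also shows why the name-only row is required: without it, every later row would be paired with the wrong taxa.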
INTERLEAVE. As in the CHARACTERS block, this subcommand indicates sections in the matrix, although interleaved matrices take a slightly different form for distance matrices:
```
taxon_1  0
taxon_2  1  0
taxon_3  2  3  0
taxon_4  4  5  6
taxon_5  7  8  9
taxon_6 11 12 13
taxon_4  0
taxon_5 10  0
taxon_6 14 15  0;
```
As in the CHARACTERS block, newline characters in interleaved matrices are significant, in that they indicate a switch to a new taxon.

class commonnexus.blocks.distances.Taxlabels(tokens, nexus=None)[source]: This command allows specification of the names and ordering of the taxa. It serves to define taxa and is allowed only if the NEWTAXA token is included in the DIMENSIONS statement.

class commonnexus.blocks.distances.Matrix(tokens, nexus=None)[source]: This command contains the distance data.

Note

Since reading the matrix data only makes sense if information from other commands - in particular FORMAT - is considered, the Matrix object does not have any attributes for data access. Instead, the matrix data can be read via Distances.get_matrix().

SETS

class commonnexus.blocks.sets.Sets(nexus, cmds)[source]

This block stores sets of objects (characters, states, taxa, etc.).

The general structure of the SETS block is as follows.

BEGIN SETS;

[CHARSET charset-name [({STANDARD | VECTOR})] = character-set; ]

[STATESET stateset-name [({STANDARD | VECTOR})] = state-set;]

[CHANGESET

changeset-name=state-set<-> state-set [state-set<-> state-set…];]

[TAXSET taxset-name [({ STANDARD | VECTOR})] = taxon-set; ]

[TREESET treeset-name [({STANDARD | VECTOR})] = tree-set;]

[CHARPARTITION partition-name

[([{[NO]TOKENS}] [{STANDARD | VECTOR}])]

=subset-name:character-set[, subset-name:character-set …]

;]

[TAXPARTITION partition-name

[([{[NO]TOKENS}] [{STANDARD | VECTOR}])]

=subset-name:taxon-set[, subset-name:taxon-set…]

;]

[TREEPARTITION partition-name

[([{[NO]TOKENS}] [{STANDARD | VECTOR}])]

=subset-name: tree-set[, subset-name:tree-set…]

;]

END;

An example SETS block is

BEGIN SETS;
    CHARSET larval = 1-3 5-8;
    STATESET eyeless = 0;
    STATESET eyed = 1 2 3;
    CHANGESET eyeloss = eyed -> eyeless;
    TAXSET outgroup=1-4;
    TREESET AfrNZVicariance = 3 5 9-12;
    CHARPARTITION bodyparts=head: 1-4 7, body:5 6, legs: 8-10 ;
END;

SETS Commands

class commonnexus.blocks.sets.Charset(tokens, nexus=None)[source]

This command specifies and names a set of characters; this name can then be used in subsequent CHARSET definitions or wherever a character-set is required. The VECTOR format consists of 0’s and 1’s: a 1 indicates that the character is to be included in the CHARSET; whitespace is not necessary between 0’s and 1’s. The name of a CHARSET cannot be equivalent to a character name or character number.

Predefined character-sets:

The character-set CONSTANT is predefined for all DATATYPES; it specifies all invariant characters.
The character-set REMAINDER is predefined for all DATATYPES; it specifies all characters not previously referenced in the command.
The character-set GAPPED is predefined for all DATATYPES; it specifies all characters with a gap for at least one taxon.

There are four additional predefined character-sets for characters of DATATYPE=DNA, RNA, and NUCLEOTIDE:

POS1 - All characters defined by current CODONPOSSET as first positions.
POS2 - All characters defined by current CODONPOSSET as second positions.
POS3 - All characters defined by current CODONPOSSET as third positions.
NONCODING - All characters defined by current CODONPOSSET as non-protein-coding sites.

class commonnexus.blocks.sets.Stateset(tokens, nexus=None)[source]

This command allows one to name a set of states; it is not currently supported by any program. It is not available for DATATYPE=CONTINUOUS. For STANDARD format, the state-set is described by a list of state symbols, except that it should not be enclosed in parentheses or braces. Any current state-set symbols are valid in the state-set description. The following STATESET

STATESET theSet = 2 3 4 5;

defines the set composed of states 2, 3, 4, and 5.

The VECTOR format consists of 0’s and 1’s: a 1 indicates that the state is to be included in the STATESET; whitespace is not necessary between 0’s and 1’s. For example, the state-set

STATESET theSet (VECTOR) =1001000;

designates theSet to be the set containing the first and fourth states.
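The 0/1 VECTOR format used by CHARSET, STATESET, TAXSET and EXSET can be decoded with a few lines of Python. The following is a minimal sketch, independent of commonnexus; the helper name `vector_to_members` is hypothetical. It returns 1-based positions and tolerates optional whitespace between the digits.

```python
def vector_to_members(vector):
    """Return the 1-based positions flagged with 1 in a 0/1 VECTOR definition."""
    # Whitespace between 0's and 1's is optional, so remove it first.
    digits = "".join(vector.split())
    if set(digits) - {"0", "1"}:
        raise ValueError("a 0/1 VECTOR may only contain 0's and 1's")
    return [i for i, digit in enumerate(digits, start=1) if digit == "1"]


print(vector_to_members("1001000"))   # → [1, 4]
print(vector_to_members("1 0 0 1 0 0 0"))   # → [1, 4]
```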

Warning

commonnexus can read NEXUS containing this command, but will not resolve references to state-sets anywhere.

class commonnexus.blocks.sets.Changeset(tokens, nexus=None)[source]

This command allows naming of a set of state changes; it is not currently supported by any program. It is not available for DATATYPE=CONTINUOUS. The description of the CHANGESET consists of pairs of state-sets joined by an operator. State-sets that consist of more than one token must be contained in parentheses. There are two allowed operators: -> and <-> (<- is not allowed). These operators can best be explained by example.

CHANGESET changes1 = (1 2 3) -> (4 6) ;
CHANGESET changes2 = 1 <-> 4 ;
CHANGESET transversions = (A G) <-> (C T) ;

The first CHANGESET represents any change from 1 to 4, 1 to 6, 2 to 4, 2 to 6, 3 to 4, or 3 to 6, and the second set represents changes from 1 to 4 and 4 to 1. The CHANGESET “transversions” defines the set of all changes between purines and pyrimidines as transversions.
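The meaning of the two operators can also be spelled out as sets of (from, to) pairs. The sketch below is illustrative only, since commonnexus does not resolve CHANGESET definitions; the helper name `expand_changeset` is hypothetical.

```python
from itertools import product


def expand_changeset(source, target, operator):
    """Enumerate the (from, to) state changes described by one CHANGESET pair."""
    # "->" covers every change from a source state to a target state.
    changes = set(product(source, target))
    if operator == "<->":
        # "<->" additionally covers the reverse direction.
        changes |= set(product(target, source))
    elif operator != "->":
        raise ValueError(f"operator {operator!r} is not allowed")
    return changes


print(sorted(expand_changeset(["1", "2", "3"], ["4", "6"], "->")))
print(sorted(expand_changeset(["1"], ["4"], "<->")))   # → [('1', '4'), ('4', '1')]
```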

class commonnexus.blocks.sets.Taxset(tokens, nexus=None)[source]

This command defines a set of taxa. A TAXSET name can be used in subsequent TAXSET definitions or wherever a taxon-set is required. The name of a TAXSET cannot be equivalent to a taxon name or taxon number. The taxa to be included are described in a taxon-set. For example, the following command

TAXSET beetles=Omma-.;

defines the TAXSET “beetles” to include all taxa from the taxon Omma to the last defined taxon. The VECTOR format consists of 0’s and 1’s: a 1 indicates that the taxon is to be included in the TAXSET; whitespace is not necessary between 0’s and 1’s.

class commonnexus.blocks.sets.Treeset(tokens, nexus=None)[source]: This command defines a set of trees. A TREESET name can be used in subsequent TREESET definitions or wherever a tree-set is required. It is not currently supported by any program. It follows the same general format as a TAXSET command.

Warning

commonnexus can read NEXUS containing this command, but will not resolve references to tree-sets anywhere.

class commonnexus.blocks.sets.Partition(tokens, nexus=None)[source]

[*]PARTITION commands define partitions of characters, taxa, and trees, respectively. The partition divides the objects into several (mutually exclusive) subsets. They all follow the same format. There are several formatting options. The VECTOR format consists of a list of partition names. By default, the name of each subset is a NEXUS word (this is the TOKENS option). The NOTOKENS option is only available in the VECTOR format; this allows use of single symbols for the subset names. Each value in a definition in VECTOR format must be separated by whitespace if the names are tokens but not if they are NOTOKENS. The following two examples are equivalent:

TAXPARTITION populations = 1:1-3 , 2: 4-6 , 3:7 8;
TAXPARTITION populations (VECTOR NOTOKENS) =11122233;

The following two examples are equivalent:

TAXPARTITION mypartition= Chiricahua: 1-3, Huachuca: 4-6, Galiuro: 7 8;
TAXPARTITION mypartition (VECTOR) =
    Chiricahua Chiricahua Chiricahua Huachuca Huachuca Huachuca Galiuro Galiuro;
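The equivalence of the STANDARD and VECTOR forms can be checked mechanically. The following stdlib-only sketch converts a partition given as subsets of 1-based object numbers into its VECTOR form; the helper name `partition_to_vector` and its parameters are hypothetical and not part of commonnexus.

```python
def partition_to_vector(subsets, n, notokens=False):
    """Render a partition of objects 1..n in VECTOR format.

    subsets maps each subset name to the 1-based numbers of its members.
    """
    if notokens and any(len(name) != 1 for name in subsets):
        raise ValueError("NOTOKENS requires single-symbol subset names")
    slots = [None] * n
    for name, members in subsets.items():
        for number in members:
            if slots[number - 1] is not None:
                raise ValueError("subsets must be mutually exclusive")
            slots[number - 1] = name
    if None in slots:
        raise ValueError("every object must belong to a subset")
    # Tokens need separating whitespace; NOTOKENS symbols do not.
    return ("" if notokens else " ").join(slots)


populations = {"1": [1, 2, 3], "2": [4, 5, 6], "3": [7, 8]}
print(partition_to_vector(populations, 8, notokens=True))   # → 11122233
```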

class commonnexus.blocks.sets.Charpartition(tokens, nexus=None)[source]

class commonnexus.blocks.sets.Taxpartition(tokens, nexus=None)[source]

class commonnexus.blocks.sets.Treepartition(tokens, nexus=None)[source]

ASSUMPTIONS [not supported]

class commonnexus.blocks.assumptions.Assumptions(nexus, cmds)[source]

Warning

commonnexus doesn’t provide any functionality - other than parsing as generic commands - for ASSUMPTIONS blocks.

>>> nex = Nexus('''#NEXUS
...     BEGIN ASSUMPTIONS;
...         OPTIONS DEFTYPE=ORD;
...         USERTYPE myOrd = 4
...                 0 1 2 3
...                 . 1 2 3
...                 1 . 1 2
...                 2 1 . 1
...                 3 2 1 .;
...         USERTYPE myTree (CSTREE) = ((0,1) a,(2,3)b)c;
...         TYPESET * mixed=IRREV: 1 3 10, UNORD 5-7;
...         WTSET * one = 2 : 1-3 6 11-15, 3: 7 8;
...         WTSET two = 2:4 9, 3: 1-3 5;
...         EXSET nolarval = 1-9;
...         ANCSTATES mixed = 0: 1 3 5-8 11, 1: 2 4 9-15;
...     END;''')
>>> str(nex.ASSUMPTIONS.ANCSTATES)
'mixed = 0: 1 3 5-8 11, 1: 2 4 9-15'

The ASSUMPTIONS block houses assumptions about the data or gives general directions as to how to treat them (e.g., which characters are to be excluded from consideration). The commands currently placed in this block were primarily designed for parsimony analysis. More commands, embodying assumptions useful in distance, maximum likelihood, and other sorts of analyses, will be developed in the future. For example, matrices specifying relative rates of character state change, useful for both distance and likelihood analyses, will eventually be included here. The general structure of the assumptions block is

BEGIN ASSUMPTIONS;

[OPTIONS [DEFTYPE = type-name]

[POLYTCOUNT= {MINSTEPS | MAXSTEPS}]
[GAPMODE= {MISSING | NEWSTATE}];]
[USERTYPE type-name [ ( {STEPMATRIX | CSTREE} ) ] = USERTYPE-description; ]

[TYPESET [*] typeset-name [ ({STANDARD | VECTOR} ) ] = TYPESET-definition;]

[WTSET [*] wtset-name [({STANDARD | VECTOR} {TOKENS | NOTOKENS})] = WTSET-definition;]

[EXSET [*] exset-name [( {STANDARD | VECTOR} ) ] = character-set; ]

[ANCSTATES [*] ancstates-name [({STANDARD | VECTOR} {[NO]TOKENS})] =

ANCSTATES-definition;]

END;

An example ASSUMPTIONS block follows:

BEGIN ASSUMPTIONS;
    OPTIONS DEFTYPE=ORD;
    USERTYPE myOrd = 4
        0 1 2 3
        . 1 2 3
        1 . 1 2
        2 1 . 1
        3 2 1 .;
    USERTYPE myTree (CSTREE) = ((0,1) a,(2,3)b)c;
    TYPESET * mixed=IRREV: 1 3 10, UNORD 5-7;
    WTSET * one = 2 : 1-3 6 11-15, 3: 7 8;
    WTSET two = 2:4 9, 3: 1-3 5;
    EXSET nolarval = 1-9;
    ANCSTATES mixed = 0: 1 3 5-8 11, 1: 2 4 9-15;
END;

USERTYPES must be defined before they are referred to in any TYPESET.

In earlier versions of MacClade and PAUP, TAXSET and CHARSET also appeared in the ASSUMPTIONS block. These now appear in the SETS block. There are a number of other commands in the ASSUMPTIONS block that also have SET in their name (e.g., WTSET, EXSET), but these commands assign values to objects, they do not define sets of objects, and therefore they do not belong in the SETS block. (Commands such as WTSET and EXSET are so named for historical reasons; although they might ideally be renamed CHARACTERWEIGHTS and EXCLUDEDCHARACTERS, doing so would cause existing programs to be incompatible with the file format.) We recommend that programs also accept TAXSET and CHARSET commands in the ASSUMPTIONS block so that older files can be read. In addition, the GAPMODE subcommand of the OPTIONS command of this block was originally housed in an OPTIONS command in the DATA block. Because this subcommand dictates how data are to be treated rather than providing details about the data themselves, it was moved into the ASSUMPTIONS block.

OPTIONS — This command houses a number of disparate subcommands. They are all of the form subcommand=option.

DEFTYPE. This subcommand specifies the default character type for parsimony analyses. Whenever a character’s type is not explicitly stated, its type is taken to be the default type. Default DEFTYPE is UNORD (see the Appendix, character transformation type, for a definition of UNORD).
POLYTCOUNT. Setting POLYTCOUNT to MINSTEPS specifies that trees with polytomies are to be counted (in parsimony analyses) in such a way that the number of steps for each character is the minimum number of steps for that character over any resolution of the polytomy. A tree length that is the sum of these minimum numbers of steps may be below the tree length of the most-parsimonious dichotomous resolution. Setting POLYTCOUNT to MAXSTEPS specifies that trees with polytomies are to be counted in such a way that occurrence of derived states on elements of a polytomy are to be counted as independent derivations. Such a tree length may be above the tree length of any fully dichotomous resolution. The NEXUS format does not specify a default value for POLYTCOUNT; the default value may differ from program to program.
GAPMODE. This subcommand specifies how gaps are to be treated. GAPMODE=MISSING specifies that gaps are to be treated in the same way as missing data; GAPMODE=NEWSTATE specifies that gaps are to be treated as an additional state (for DNA/RNA/NUCLEOTIDE data, as a fifth base).

USERTYPE —This command defines a character transformation type, as used in parsimony analysis to designate the cost of changes between states. There are several predefined character types (see character transformation type in the Appendix); USERTYPE allows additional character types to be created. USERTYPE is an object definition command with the exception that an asterisk cannot be used to indicate the default type (default type is stated in the OPTIONS command of the ASSUMPTIONS block). The standard defines no limit to the length of the type name, although individual programs might impose restrictions.

STEPMATRIX format is

USERTYPE myMatrix (STEPMATRIX) = n
    s s s s
    . k k k
    k . k k
    k k . k
    k k k .;

where n is the number of rows and columns in the step matrix, the s’s are state symbols, and the k’s are the cost for going between states. The n can take any value ≥2. Diagonal elements may be listed as periods. If a change is to be prohibited, then one enters an “i” for infinity. Typically, the state symbols will be in sequence, but they need not be. The following matrices assign values identically:

USERTYPE myMatrix (STEPMATRIX) =4
    0 1 2 3
    . 1 5 1
    1 . 5 1
    5 5 . 5
    1 1 5 . ;
USERTYPE myMatrix2 (STEPMATRIX) =4
    2 0 3 1
    . 5 5 5
    5 . 1 1
    5 1 . 1
    5 1 1 .;
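That the two matrices above assign the same costs can be verified with a small stdlib-only sketch. The helper `parse_stepmatrix` is hypothetical (commonnexus parses USERTYPE only as a generic command). It takes the text between `=` and the terminating semicolon and assumes that rows are "from" states and columns are "to" states.

```python
def parse_stepmatrix(text):
    """Map (from_state, to_state) to cost for a STEPMATRIX description."""
    tokens = text.split()
    n = int(tokens[0])
    symbols = tokens[1:n + 1]   # header row: the state symbols
    cells = tokens[n + 1:]      # n rows of n cells each
    if len(cells) != n * n:
        raise ValueError(f"expected {n * n} cells, got {len(cells)}")
    costs = {}
    for i, src in enumerate(symbols):
        for j, dst in enumerate(symbols):
            if i == j:
                continue  # diagonal elements, usually written as periods
            cell = cells[i * n + j]
            # "i" prohibits a change, i.e. it stands for an infinite cost.
            costs[(src, dst)] = float("inf") if cell.lower() == "i" else float(cell)
    return costs


matrix1 = parse_stepmatrix("4  0 1 2 3  . 1 5 1  1 . 5 1  5 5 . 5  1 1 5 .")
matrix2 = parse_stepmatrix("4  2 0 3 1  . 5 5 5  5 . 1 1  5 1 . 1  5 1 1 .")
print(matrix1 == matrix2)   # → True
```

Keying the costs by state symbols rather than by row and column position is what makes the order of the symbols in the header row irrelevant.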

The number of steps may be either integers or real numbers. The range of possible values will differ from program to program. Versions 3.0-3.04 of MacClade use the format name REALMATRIX rather than STEPMATRIX if the matrix contains real numbers. Future programs should treat REALMATRIX as a synonym of STEPMATRIX.

CSTREE format is very similar to the TREE format in a TREES block. That is, character state trees are described in the parenthesis notation following the rules given for TREES of taxa. Instead of taxon labels, character state symbols are used. Thus,

USERTYPE cstree-name (CSTREE) = [(list-of-subtrees)] [state-symbol];

where each subtree has the same format as the overall tree and the subtrees are separated by commas.

TYPESET — This command specifies the type assigned to each character as used in parsimony analysis. This is a standard object definition command. Any characters not listed in the character-set have the default character type. The type names to be used are either the predefined ones or those defined in a USERTYPE command. Each value in a definition in VECTOR format must be separated by whitespace. The following are equivalent type sets:

TYPESET mytypes = ORD: 1 4 6 , UNORD: 2 3 5 ;
TYPESET mytypes (VECTOR) = ORD UNORD UNORD ORD UNORD ORD;

WTSET — This command specifies the weights of each character. This is a standard object definition command. Any characters not listed in the character-set have weight 1. The weights may be either integers or real numbers. The minimum and maximum weight value will differ from program to program. Each value in a definition in VECTOR format must be separated by whitespace unless the NOTOKENS option is invoked, in which case no whitespace is needed and all weights must be integers in the range 0-9. The following are equivalent weight sets:

WTSET mywts = 3 : 1 4 6, 1: 2 3 5;
WTSET mywts (VECTOR) =3 1 1 3 1 3 ;

In earlier versions of MacClade, the formatting subcommand REAL was used to indicate that real-valued weights were included in the WTSET. This subcommand is no longer in use; programs are expected to detect the presence of integral or real-value weights while reading the WTSET command.

EXSET — This command specifies which characters are to be excluded from consideration. This is a standard object definition command. Any characters not listed in the character-set are included. The VECTOR format consists of 0’s and 1’s: a 1 indicates that the character is to be excluded; whitespace is not necessary between 0’s and 1’s. The following commands are equivalent and serve to exclude characters 5, 6, 7, 8, and 12.

EXSET * toExclude = 5-8 12;
EXSET * toExclude (VECTOR) = 000011110001;

ANCSTATES — This command allows specification of ancestral states. This is a standard object definition command. Any valid state symbol can be used in the description for discrete data, and any valid value can be used for continuous data. TOKENS is the default for DATATYPE=CONTINUOUS; NOTOKENS is the default for all other DATATYPES. TOKENS is not allowed for DATATYPES DNA, RNA, and NUCLEOTIDE. If TOKENS is invoked, the standard three-letter amino acid abbreviations can be used with DATATYPE=PROTEIN and defined state names can be used for DATATYPE=STANDARD. NOTOKENS is not allowed for DATATYPE=CONTINUOUS. The following commands are equivalent:

ANCSTATES ancestor = 0: 1-3 5-7 12, 1: 4 8-10, 2: 11;
ANCSTATES ancestor (VECTOR) = 000100011120;

CODONS [not supported]

class commonnexus.blocks.codons.Codons(nexus, cmds)[source]

Warning

commonnexus doesn’t provide any functionality - other than parsing as generic commands - for CODONS blocks yet.

The CODONS block contains information about the genetic code, the regions of DNA and RNA sequences that are protein coding, and the location of triplets coding for amino acids in nucleotide sequences.

BEGIN CODONS;

[CODONPOSSET [*] name [( {STANDARD | VECTOR}) ] =

N: character-set,
1: character-set,
2: character-set,
3: character-set; ]

[GENETICCODE code-name

[([CODEORDER=132|other] [NUCORDER = TCAG|other] [[NO]TOKENS]
[EXTENSIONS="symbols-list"])]
= genetic-code-description;]

[CODESET [*] codeset-name { (CHARACTERS | UNALIGNED | TAXA) } =

code-name:character-set or taxon-set

[,code-name:character-set or taxon-set…]; ]

END;

GENETICCODE must precede any CODESET that refers to it. There are several predefined genetic codes:

UNIVERSAL      [universal]
UNIVERSAL.EXT  [universal, extended]
MTDNA.DROS     [Drosophila mtDNA]
MTDNA.DROS.EXT [Drosophila mtDNA, extended]
MTDNA.MAM      [Mammalian mtDNA]
MTDNA.MAM.EXT  [Mammalian mtDNA, extended]
MTDNA.YEAST    [Yeast mtDNA]

For a summary of the genetic codes, see Osawa et al. (1992). Extended codes are those in which “extra” amino acids have been added to avoid disjunct amino acids (see the EXTENSIONS subcommand under GENETICCODE).

CODONPOSSET — This command stores information about protein-coding regions and the codon positions of nucleotide bases in protein-coding regions and follows the format of a standard object definition command.

Those characters designated as 1, 2, or 3 are coding bases specified as being of positions 1, 2, and 3, respectively. Those characters designated as N are considered non-protein-coding. Those characters designated as ? are of unknown nature. Any unspecified bases are considered of unknown nature (equivalent to ?). If no CODONPOSSET statement is present, all bases are presumed of unknown nature. For example, the following command

CODONPOSSET * coding = N:1-10, 1:11-.\3, 2:12-.\3, 3:13-.\3;

designates bases 1-10 as noncoding and positions the remaining bases in the order 123123123…
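The effect of the `\3` stride in the example can be reproduced with a short stdlib-only sketch. The helpers `expand_range` and `codon_positions` are hypothetical; they only cover plain numbers, `a-b` ranges, `.` for the last character and an optional `\step`, not named character-sets.

```python
def expand_range(spec, nchar):
    # Expand one range such as "1-10", "7" or "11-.\3" into character numbers.
    span, _, step = spec.partition("\\")
    start, dash, stop = span.partition("-")
    first = int(start)
    if not dash:
        last = first       # a single character number
    elif stop == ".":
        last = nchar       # "." stands for the last character
    else:
        last = int(stop)
    return list(range(first, last + 1, int(step) if step else 1))


def codon_positions(definition, nchar):
    # definition maps "N", "1", "2", "3" (or "?") to lists of range strings.
    vector = ["?"] * nchar  # unspecified bases are of unknown nature
    for position, ranges in definition.items():
        for spec in ranges:
            for number in expand_range(spec, nchar):
                vector[number - 1] = position
    return "".join(vector)


coding = {"N": ["1-10"], "1": [r"11-.\3"], "2": [r"12-.\3"], "3": [r"13-.\3"]}
print(codon_positions(coding, 19))   # → NNNNNNNNNN123123123
```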

GENETICCODE — GENETICCODE stores information about a user-defined genetic code. Multiple GENETICCODES may be defined in the block. This is a standard object definition command except that the default genetic code is not indicated by an asterisk after GENETICCODE. The genetic code description is a listing of amino acids. By default, the first amino acid listed is that coded for by the triplet TTT, and the last amino acid listed is that coded for by GGG. In between, the order of triplets follows a pattern controlled by the subcommands NUCORDER and CODEORDER. By default, the amino acids are listed in the following order: TTT, TCT, TAT, TGT, TTC, TCC, TAC, TGC, and so on. The universal genetic code can thus be written

GENETICCODE UNTITLED=
    F S Y C
    F S Y C
    L S * *
    L S * W

    L P H R
    L P H R
    L P Q R
    L P Q R

    I T N S
    I T N S
    I T K R
    M T K R

    V A D G
    V A D G
    V A E G
    V A E G;

This assigns TTT to phenylalanine, TCT to serine, TAT to tyrosine, TGT to cysteine, and so on. The following subcommands are included.

CODEORDER. The default CODEORDER is 231, indicating that the second codon nucleotide changes most quickly (i.e., codons represented by adjacent amino acids in the listing always differ at second positions), the third nucleotide changes next most quickly, and the first nucleotide changes most slowly in the list (such that the codons representing the first 16 listed amino acids all have the same first nucleotide).
NUCORDER. The default NUCORDER is TCAG, indicating that the codons with T at a given position are listed first, C next, etc. For example, if CODEORDER is 123 and the NUCORDER is ACGT, then the amino acids would be listed in order to correspond to codons in the order AAA CAA GAA TAA ACA CCA GCA TCA, etc.
[NO]TOKENS. If TOKENS, then amino acids are to be listed by their standard three-letter abbreviations for amino acids (e.g., Leu, Glu). A termination codon is designated by Ter or Stp. If NOTOKENS (the default), then the IUPAC symbols are used in the listing. A termination codon is designated by an asterisk.
EXTENSIONS. This subcommand lists the symbols for “extra” amino acids that are added to avoid disjunct amino acids. For example, serines in the universal code are coded for by two distinct groups of codons; one cannot change between these groups without going through a different amino acid. Serine is therefore disjunct. In the UNIVERSAL code, serine is kept disjunct, with all serines symbolized by S. In the UNIVERSAL.EXT code, however, one serine group is identified by the extra amino acid symbol 1 and the other group is identified by 2. To indicate this, the EXTENSIONS subcommand would read EXTENSIONS="S S", indicating that “extra” amino acids 1 and 2 are both serines. In the genetic code description, the symbols 1 and 2 would then both stand for serine. For example, the universal extended genetic code could be represented by

GENETICCODE * universal (NUCORDER=TCAG CODEORDER=213 EXTENSIONS= "S S") =
    F 1 Y C L P H R I T N 2 V A D G
    F 1 Y C L P H R I T N 2 V A D G
    L 1 * * L P Q R I T K R V A E G
    L 1 * W L P Q R M T K R V A E G ;

(Note that the CODEORDER has been changed in this example.)
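The interplay of CODEORDER and NUCORDER is easiest to see by generating the codon order explicitly. The following stdlib-only sketch does that and pairs the default order with the universal code listing given above. The function name `codon_order` is hypothetical; commonnexus itself does not interpret GENETICCODE commands.

```python
from itertools import product


def codon_order(codeorder="231", nucorder="TCAG"):
    """List all 64 codons in the order implied by CODEORDER and NUCORDER."""
    # CODEORDER names the codon positions from fastest to slowest changing.
    positions = [int(c) - 1 for c in codeorder]
    codons = []
    # product() varies its last factor fastest, so pair the factors with
    # the positions in reverse (slowest first).
    for combo in product(nucorder, repeat=3):
        codon = [None] * 3
        for nucleotide, position in zip(combo, reversed(positions)):
            codon[position] = nucleotide
        codons.append("".join(codon))
    return codons


# The universal code as listed above for the default CODEORDER and NUCORDER.
listing = ("FSYC FSYC LS** LS*W LPHR LPHR LPQR LPQR "
           "ITNS ITNS ITKR MTKR VADG VADG VAEG VAEG").replace(" ", "")
universal = dict(zip(codon_order(), listing))

print(codon_order()[:8])
print(universal["ATG"], universal["TGG"], universal["TAA"])   # → M W *
```

Calling `codon_order("123", "ACGT")` starts with AAA, CAA, GAA, TAA, matching the NUCORDER example in the text.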

CODESET — This object definition assigns genetic codes to various characters and taxa. All nucleotide sites are designated as coding, and all amino acid sites have a genetic code assigned to them. If the CHARACTERS format is used, then character-sets are to be used in the description, and the genetic code is thus applied to the characters listed. For example,

CODESET oddcodeset = customcode: 4-99;

designates the genetic code “customcode” as applying to characters 4-99 for all taxa. If UNALIGNED, then genetic code is presumed to apply to all sites in an UNALIGNED block. Such a CODESET command might look like this:

CODESET mycodeset (UNALIGNED) = universal: ALL;

For UNALIGNED, the only character-set that can be used is ALL. If the TAXA format is used, then taxon-sets are used in the description. Thus, different genetic codes can be assigned to different taxa for all characters. Current programs do not accept any character-set other than ALL nor any taxon-set other than ALL.

TREES

class commonnexus.blocks.trees.Trees(nexus, cmds)[source]

This block stores information about trees. The syntax for the TREES block is

BEGIN TREES;

[TRANSLATE arbitrary-token-used-in-tree-description

valid-taxon-name [, arbitrary-token-used-in-tree-description valid-taxon-name…];]

[TREE [*] tree-name= tree-specification;]

END;

A TRANSLATE command, if present, must precede any TREE command.

property trees: List[Tree]: Since TREE is one of the few NEXUS commands which may appear multiple times per block, we provide a shortcut to this list.

translate(tree)[source]

Translate a tree according to the mapping TREES TRANSLATE.

Return type:: newick.Node
Returns:: A Newick node where the node labels have been translated to valid taxon labels.

Note

Translating a tree does not change the tree’s representation in the containing Nexus instance. To replace un-translated trees in a NEXUS file with translated ones, the following code would work:

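>>> import pathlib
>>> from commonnexus import Nexus
>>> from commonnexus.blocks.trees import Tree
>>> path = pathlib.Path('trees.nex')  # hypothetical path to an existing NEXUS file
>>> untranslated = Nexus.from_file(path)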
>>> trees = []
>>> for tree in untranslated.TREES.trees:
...     trees.append(Tree.format(
...         tree.name,
...         untranslated.TREES.translate(tree).newick,
...         rooted=tree.rooted))
>>> untranslated.replace_block(
...     untranslated.TREES, [('TREE', tree) for tree in trees])
>>> path.write_text(str(untranslated))

Parameters:: tree (typing.Union[commonnexus.blocks.trees.Tree, newick.Node]) –

classmethod from_data(*tree_specs, nexus=None, comment=None, lowercase_command=False, TITLE=None, LINK=None, ID=None, **translate_labels)[source]

Create a TREES block from a list of tree specifications.

A tree specification is a triple (label, newick, rooted), e.g. ('t1', '(a,b)c;', False).

If translate_labels are passed in, a corresponding TRANSLATE command will be added to the block and the trees will be “de-translated” accordingly.

>>> from commonnexus.blocks.trees import Trees
>>> print(Trees.from_data(('t1', '(a,b)c;', False), comment='A consensus tree'))
[A consensus tree]
BEGIN TREES;
TREE t1 = [&U] (a,b)c;
END;

Parameters:

tree_specs (typing.Tuple[str, typing.Union[str, newick.Node], typing.Optional[bool]]) –
nexus (typing.Optional[commonnexus.nexus.Nexus]) –
comment (typing.Optional[str]) –
lowercase_command (bool) –
TITLE (typing.Optional[str]) –
LINK (typing.Optional[str]) –
ID (typing.Optional[str]) –
translate_labels (typing.Dict[str, str]) –

Return type:

commonnexus.blocks.trees.Trees

TREES Commands

class commonnexus.blocks.trees.Translate(tokens, nexus=None)[source]

The tree description requires references to the taxa defined in a TAXA, DATA, CHARACTERS, UNALIGNED, or DISTANCES block. These references can be made using the label assigned to them in the TAXA or DATA blocks, their numbers, or a token specified in the TRANSLATE command. The TRANSLATE statement maps arbitrary labels in the tree specification to valid taxon names. If the arbitrary labels are integers, they are mapped onto the valid taxon names as dictated by the TRANSLATE command without any consideration of the order of the taxa in the matrix. Thus, if an integer is encountered in the tree description, a program first checks to see if it matches one of the arbitrary labels defined in the TRANSLATE command; only if no matching label is found will the integer be presumed to refer to the taxon in that position in the matrix (e.g., if the label in the description is 15, but this is not a label defined in the TRANSLATE command, a program should take this to refer to the 15th taxon).

In the following example,

BEGIN TAXA;
    TAXLABELS Scarabaeus Drosophila Aranaeus;
END;
BEGIN TREES;
    TRANSLATE beetle Scarabaeus, fly Drosophila, spider Aranaeus;
    TREE tree1 = ((1,2),3);
    TREE tree2 = ((beetle,fly),spider);
    TREE tree3= ((Scarabaeus,Drosophila),Aranaeus);
END;

the TRANSLATE command specifies that the label “beetle” can be used in the tree description to refer to Scarabaeus, “fly” to Drosophila, and “spider” to Aranaeus. This means that Scarabaeus can be referred to in a tree description as 1, Scarabaeus, or beetle. Thus, the three trees are identical.
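The lookup order described above (TRANSLATE table first, then taxon name, then taxon number) can be sketched in plain Python. This is an illustration of the rule only, not the commonnexus implementation; resolve_label is a hypothetical helper. As simplifications, it compares labels case-sensitively and raises an error for unknown labels instead of treating them as clade names:

```python
def resolve_label(label, translate, taxlabels):
    """Resolve a label from a tree description to a taxon name.

    translate: dict mapping TRANSLATE tokens to taxon names.
    taxlabels: list of taxon names in matrix order.
    """
    # 1. The TRANSLATE table takes precedence, even for integer labels.
    if label in translate:
        return translate[label]
    # 2. A taxon's defined name refers to itself.
    if label in taxlabels:
        return label
    # 3. Otherwise an integer is read as a 1-based position in the matrix.
    if label.isdigit() and 1 <= int(label) <= len(taxlabels):
        return taxlabels[int(label) - 1]
    raise ValueError('Unknown label: {}'.format(label))


taxlabels = ['Scarabaeus', 'Drosophila', 'Aranaeus']
translate = {'beetle': 'Scarabaeus', 'fly': 'Drosophila', 'spider': 'Aranaeus'}

# The three ways of referring to the first taxon in the example above.
resolved = [resolve_label(l, translate, taxlabels) for l in ['1', 'beetle', 'Scarabaeus']]
```

All three labels resolve to Scarabaeus, which is why tree1, tree2, and tree3 describe the same tree. If the TRANSLATE command itself defined the token 1, that entry would win over the positional reading.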

Variables: mapping (Dict[str, str]) – The mapping of tokens used in the tree description to valid taxon names.

Note

The TRANSLATE data is typically not accessed directly, but just used implicitly when calling Trees.translate().

class commonnexus.blocks.trees.Tree(tokens, nexus=None)[source]

This command describes a tree. Tree descriptions are standard object definition commands. They use the familiar parenthesis notation, with node names, branch lengths, and comments following the established Newick tree standard (see Felsenstein, 1993).

The label of the node is a NEXUS token that is a taxon’s defined name, a taxon’s number, a taxon’s label from the translation table, or a clade’s defined name. The label is optional for internal nodes that are not observed taxa; it is not optional for terminal nodes. Internal nodes that have no label are represented implicitly by the parentheses containing the list of subclades. If the name of a TAXSET is used, it is interpreted as a list of the terminal taxa defined to be in the TAXSET (with commas implicitly inserted between the taxa). The length of the branch below the node is a number, positive or negative. Rooted and unrooted trees can be specified using the [&R] and [&U] comments at the start of the tree description. For example,

TREE mytree = [&R] ((1,2),(3,4));

is a rooted tree, whereas

TREE mytree = [&U] ((1,2),(3,4));

is an unrooted tree. The NEXUS standard does not specify whether rooted or unrooted is default.
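Because the standard leaves the default open, a reader of the format needs three outcomes: rooted, unrooted, or unspecified. The following plain-Python sketch shows one way to detect the leading comment; the helper rooting is hypothetical and is not how commonnexus parses trees. It assumes the input is the tree specification that follows the = sign:

```python
import re

# Matches a leading [&R] or [&U] comment, ignoring case and leading whitespace.
ROOTING_COMMENT = re.compile(r'^\s*\[&([RU])\]', re.IGNORECASE)


def rooting(tree_spec):
    """Return True for [&R], False for [&U], None if neither is present."""
    match = ROOTING_COMMENT.match(tree_spec)
    if match is None:
        return None
    return match.group(1).upper() == 'R'


rooted = rooting('[&R] ((1,2),(3,4));')
unrooted = rooting('[&U] ((1,2),(3,4));')
unspecified = rooting('((1,2),(3,4));')
```

Returning None instead of guessing a default keeps the "no information given" case distinguishable for callers.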

An example tree with branch lengths is

TREE tree4 = ((beetle:4.3,fly:1.1):1.8,spider:2.5);

If a file (and its data matrix) has four defined taxa, Crocodile, Bluebird, Archaeopteryx, and Rattlesnake, the following tree,

TREE tree4= (((Bluebird)Archaeopteryx,Crocodile)Archosauria,Rattlesnake);

would indicate that the taxon Archaeopteryx is ancestral to Bluebird and that Crocodile is their sister. Archosauria, because it does not refer to a taxon that has been defined in a TAXA or DATA block, is interpreted as the name of the clade including Archaeopteryx, Bluebird, and Crocodile. Any additional information about a clade, its ancestral node, or the branch below it is to be placed in NEXUS comment commands associated with the node. Although different programs may choose their own conventions for how to embed information in comments, the comments that begin with &N are reserved for future NEXUS comment commands. The NEXUS standard places no restrictions on the number of taxa contained in each tree.

Variables:

name (str) – The name of the tree.
rooted (Union[bool, None]) – Flag indicating whether the tree is rooted (or None if no information is given)
newick (newick.Node) – The tree description as newick.Node.

>>> from commonnexus.blocks.trees import Tree
>>> tree = Tree('tree4= (((Bluebird)Archaeopteryx,Crocodile)Archosauria,Rattlesnake)')
>>> tree.name
'tree4'
>>> print(tree.newick.ascii_art())
                                ┌─Archaeopteryx ──Bluebird
                ┌─Archosauria───┤
────────────────┤               └─Crocodile
                └─Rattlesnake

static format(name, newick_node, rooted=None)[source]

Returns a representation of a tree as a NEXUS string, suitable as the payload of a TREE command.

Parameters:

name (str) –
newick_node (newick.Node) –
rooted (typing.Optional[bool]) –

Return type:

str

property rooted: None | bool: Whether the tree is rooted (True) or not (False) or no information is given (None).

property newick_string: str

The Newick-formatted string representation of the tree.

Note

This property is intended for cases where only the string representation is of interest and the somewhat expensive construction of a newick.Node object is not necessary. Accessing the Tree.newick property will trigger node construction.

Warning

Due to some normalization (e.g. of whitespace) done by the Newick parser, newick_string may differ from newick.newick.

>>> from commonnexus import Nexus
>>> nex = Nexus('#nexus begin trees; tree 1 = (a,b)\nc; end;')
>>> nex.TREES.TREE.newick_string
'(a,b)\nc;'
>>> nex.TREES.TREE.newick.newick
'(a,b)c'

property newick: Node

A newick.Node instance parsed from the Newick representation of the tree.

>>> from commonnexus import Nexus
>>> nex = Nexus('#nexus begin trees; tree 1 = ((a,b)c,d)e; end;')
>>> print(nex.TREES.TREE.newick.ascii_art())
        ┌─a
    ┌─c─┤
──e─┤   └─b
    └─d

NOTES

class commonnexus.blocks.notes.Notes(nexus, cmds)[source]

The NOTES block stores notes about various objects in a NEXUS file, including taxa, characters, states, and trees:

BEGIN NOTES;

[TEXT

[TAXON=taxon-set]
[CHARACTER=character-set]
[STATE=state-set]
[TREE=tree-set]
SOURCE={INLINE | FILE | RESOURCE} TEXT=text-or-source-descriptor;]

[PICTURE

[TAXON=taxon-set]
[CHARACTER=character-set]
[STATE=state-set]
[TREE=tree-set]
[FORMAT={PICT | TIFF | EPS | JPEG | GIF}]
[ENCODE={NONE | UUENCODE | BINHEX}]
SOURCE={INLINE | FILE | RESOURCE}
PICTURE=picture-or-source-descriptor; ]

END;

There are no restrictions on the order of commands.

If the written description of the taxon-set, character-set, state-set, or tree-set contains more than one token, it must be enclosed in parentheses, as in the following example:

TEXT TAXON=(1-3) TEXT= 'these taxa from the far north';

If both a taxon-set and a character-set are specified, then the text or picture applies to those characters for those particular taxa. If both a character-set and a state-set are specified, then the text or picture applies to those states for those particular characters.

Warning

PICTURE and SOURCE=RESOURCE for TEXT are not supported by commonnexus.

classmethod from_data(texts, comment=None, nexus=None, TITLE=None, ID=None, LINK=None)[source]

Block implementations must override this method to implement “meaningful” NEXUS writing functionality.

Parameters:

texts (typing.List[typing.Dict[str, typing.Union[str, typing.List[str]]]]) –
comment (typing.Optional[str]) –
nexus (typing.Optional[commonnexus.nexus.Nexus]) –
TITLE (typing.Optional[str]) –
ID (typing.Optional[str]) –
LINK (typing.Union[str, typing.Tuple[str, str], None]) –

Return type:

commonnexus.blocks.base.Block

NOTES Commands

class commonnexus.blocks.notes.Text(tokens=None, nexus=None, **kw)[source]

This command allows text to be attached to various objects.

The SOURCE subcommand indicates the location of the text. The INLINE option indicates that the text is present at the end of the TEXT command; the FILE option indicates that it is in a separate file (the name of which is then specified in the TEXT subcommand); the RESOURCE option indicates that it is in the resource fork of the current file, in a resource of type TEXT (the numerical ID of which is then specified in the TEXT subcommand).

For example, in the following

TEXT TAXON=5 CHARACTER=2 TEXT='4 specimens observed';
TEXT TAXON=Pan TEXT='This genus lives in Africa';
TEXT CHARACTER=2 TEXT='Perhaps this character should be deleted';
TEXT CHARACTER=1 STATE=0 TEXT='This state is hard to detect';

the first command assigns the note “4 specimens observed” to the data entry for taxon 5, character 2; the second command assigns the note “This genus lives in Africa” to the taxon Pan; the third command assigns the note “Perhaps this character should be deleted” to character 2; and the last command assigns the note “This state is hard to detect” to state 0 of character 1.

The text or source descriptor must be a single NEXUS word. If the text contains NEXUS whitespace or punctuation, it needs to be surrounded by single quotes, with any contained single quotes converted to a pair of single quotes.
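The quoting rule can be sketched in plain Python. The helper quote_word below is hypothetical and is not the commonnexus API. The set of punctuation characters is an assumption taken from the NEXUS specification's list of special characters, and the sketch ignores the convention that underscores in unquoted words stand for blanks:

```python
# Characters assumed to count as NEXUS punctuation (taken from the specification).
NEXUS_PUNCTUATION = frozenset('()[]{}/\\,;:=*\'"`+-<>')


def quote_word(text):
    """Return text as a single NEXUS word, quoting it only if necessary."""
    needs_quotes = (not text) or any(
        char.isspace() or char in NEXUS_PUNCTUATION for char in text)
    if not needs_quotes:
        return text
    # Double any contained single quotes, then wrap in single quotes.
    return "'" + text.replace("'", "''") + "'"


note = quote_word('4 specimens observed')
possessive = quote_word("Darwin's finch")
plain = quote_word('Pan')
```

Here note is '4 specimens observed' including the surrounding single quotes, possessive has its apostrophe doubled, and plain is returned unchanged because it contains neither whitespace nor punctuation.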

Variables: taxons (List[str]) – list of taxon labels or numbers the text relates to.

class commonnexus.blocks.notes.Picture(tokens, nexus=None)[source]

This command allows a picture to be attached to an object.

The FORMAT subcommand allows specification of the graphics format of the image. The SOURCE subcommand indicates the location of the picture. The INLINE option indicates that the picture is present at the end of the PICTURE command; the FILE option indicates that it is in a separate file (the name of which is then specified in the PICTURE subcommand); the RESOURCE option indicates that it is in the resource fork of the current file, in a resource of type PICT (the numerical ID of which is then specified in the PICTURE command). The RESOURCE option is designed for Apple Macintosh® text files.

For example, the following command

PICTURE TAXON=5 CHARACTER=2 FORMAT=GIF SOURCE=file PICTURE=wide.thorax.gif;

assigns the image in the GIF-formatted file wide.thorax.gif to the data entry for taxon 5, character 2.

The picture or source descriptor must be a single NEXUS word. If the picture contains NEXUS whitespace or punctuation, it needs to be surrounded by single quotes, with any contained single quotes converted to a pair of single quotes.

Most graphics formats do not describe pictures using standard text characters. For this reason many images cannot be included INLINE in a NEXUS command unless they are converted into text characters. The ENCODE subcommand specifies the conversion mechanism used for inline images.
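To illustrate what such a conversion involves, the following sketch turns arbitrary bytes into uuencoded text lines and back, using the line-level helpers of the standard library's binascii module. It is an illustration only and says nothing about how commonnexus handles ENCODE; the function names are hypothetical, and the begin/end framing lines of a complete uuencoded file are left out:

```python
import binascii


def uuencode_lines(data: bytes) -> list:
    """Encode bytes as a list of uuencoded text lines (45 bytes per line)."""
    lines = []
    for start in range(0, len(data), 45):
        chunk = data[start:start + 45]  # b2a_uu accepts at most 45 bytes
        lines.append(binascii.b2a_uu(chunk).decode('ascii').rstrip('\n'))
    return lines


def uudecode_lines(lines) -> bytes:
    """Decode a list of uuencoded text lines back to the original bytes."""
    return b''.join(binascii.a2b_uu(line) for line in lines)


payload = bytes(range(256))  # stand-in for binary image data
encoded = uuencode_lines(payload)
decoded = uudecode_lines(encoded)
```

The encoded lines contain only printable ASCII characters, and decoding them restores the original bytes. Note that the encoded text can contain NEXUS punctuation, so it would still have to be quoted as a NEXUS word.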

Warning

Support for encoding of type UUENCODE will be removed with Python 3.13, which drops the uu module from the standard library; base64 is a modern alternative.