Source code for commonnexus.blocks.assumptions

from .base import Block


class Assumptions(Block):
    """
    .. warning::

        `commonnexus` doesn't provide any functionality - other than parsing as generic commands -
        for ``ASSUMPTIONS`` blocks.

    .. code-block:: python

        >>> nex = Nexus('''#NEXUS
        ... BEGIN ASSUMPTIONS;
        ...     OPTIONS DEFTYPE=ORD;
        ...     USERTYPE myOrd = 4
        ...         0 1 2 3
        ...         . 1 2 3
        ...         1 . 1 2
        ...         2 1 . 1
        ...         3 2 1 .;
        ...     USERTYPE myTree (CSTREE) = ((0,1) a,(2,3)b)c;
        ...     TYPESET * mixed=IRREV: 1 3 10, UNORD 5-7;
        ...     WTSET * one = 2 : 1-3 6 11-15, 3: 7 8;
        ...     WTSET two = 2:4 9, 3: 1-3 5;
        ...     EXSET nolarval = 1-9;
        ...     ANCSTATES mixed = 0: 1 3 5-8 11, 1: 2 4 9-15;
        ... END;''')
        >>> str(nex.ASSUMPTIONS.ANCSTATES)
        'mixed = 0: 1 3 5-8 11, 1: 2 4 9-15'

    The ASSUMPTIONS block houses assumptions about the data or gives general directions as to how
    to treat them (e.g., which characters are to be excluded from consideration). The commands
    currently placed in this block were primarily designed for parsimony analysis. More commands,
    embodying assumptions useful in distance, maximum likelihood, and other sorts of analyses,
    will be developed in the future. For example, matrices specifying relative rates of character
    state change, useful for both distance and likelihood analyses, will eventually be included
    here. The general structure of the assumptions block is

    .. rst-class:: nexus

        | BEGIN ASSUMPTIONS;
        |   [OPTIONS [DEFTYPE = type-name]
        |     [POLYTCOUNT= {MINSTEPS | MAXSTEPS}]
        |     [GAPMODE= {MISSING | NEWSTATE}];]
        |   [USERTYPE type-name [ ( {STEPMATRIX | CSTREE} ) ] = USERTYPE-description; ]
        |   [TYPESET [*] typeset-name [ ({STANDARD | VECTOR} ) ] = TYPESET-definition;]
        |   [WTSET [*] wtset-name [({STANDARD | VECTOR} {TOKENS | NOTOKENS})] = WTSET-definition;]
        |   [EXSET [*] exset-name [( {STANDARD | VECTOR} ) ] = character-set; ]
        |   [ANCSTATES [*] ancstates-name [({STANDARD | VECTOR} {[NO]TOKENS})] =
        |     ANCSTATES-definition;]
        | END;

    An example ASSUMPTIONS block follows:

    .. code-block::

        BEGIN ASSUMPTIONS;
            OPTIONS DEFTYPE=ORD;
            USERTYPE myOrd = 4
                0 1 2 3
                . 1 2 3
                1 . 1 2
                2 1 . 1
                3 2 1 .;
            USERTYPE myTree (CSTREE) = ((0,1) a,(2,3)b)c;
            TYPESET * mixed=IRREV: 1 3 10, UNORD 5-7;
            WTSET * one = 2 : 1-3 6 11-15, 3: 7 8;
            WTSET two = 2:4 9, 3: 1-3 5;
            EXSET nolarval = 1-9;
            ANCSTATES mixed = 0: 1 3 5-8 11, 1: 2 4 9-15;
        END;

    **USERTYPES** must be defined before they are referred to in any TYPESET.

    In earlier versions of MacClade and PAUP, TAXSET and CHARSET also appeared in the ASSUMPTIONS
    block. These now appear in the SETS block. There are a number of other commands in the
    ASSUMPTIONS block that also have SET in their name (e.g., WTSET, EXSET), but these commands
    assign values to objects; they do not define sets of objects and therefore do not belong in
    the SETS block. (Commands such as WTSET and EXSET are so named for historical reasons;
    although they might ideally be renamed CHARACTERWEIGHTS and EXCLUDEDCHARACTERS, doing so would
    cause existing programs to be incompatible with the file format.) We recommend that programs
    also accept TAXSET and CHARSET commands in the ASSUMPTIONS block so that older files can be
    read. In addition, the GAPMODE subcommand of the OPTIONS command of this block was originally
    housed in an OPTIONS command in the DATA block. Because this subcommand dictates how data are
    to be treated rather than providing details about the data themselves, it was moved into the
    ASSUMPTIONS block.

    **OPTIONS**. This command houses a number of disparate subcommands. They are all of the form
    subcommand=option.

    1. **DEFTYPE**. This subcommand specifies the default character type for parsimony analyses.
       Whenever a character's type is not explicitly stated, its type is taken to be the default
       type. Default DEFTYPE is UNORD (see the Appendix, character transformation type, for a
       definition of UNORD).
    2. **POLYTCOUNT**.
       Setting POLYTCOUNT to MINSTEPS specifies that trees with polytomies are to be counted (in
       parsimony analyses) in such a way that the number of steps for each character is the
       minimum number of steps for that character over any resolution of the polytomy. A tree
       length that is the sum of these minimum numbers of steps may be below the tree length of
       the most-parsimonious dichotomous resolution. Setting POLYTCOUNT to MAXSTEPS specifies that
       trees with polytomies are to be counted in such a way that occurrence of derived states on
       elements of a polytomy are to be counted as independent derivations. Such a tree length may
       be above the tree length of any fully dichotomous resolution. The NEXUS format does not
       specify a default value for POLYTCOUNT; the default value may differ from program to
       program.
    3. **GAPMODE**. This subcommand specifies how gaps are to be treated. GAPMODE=MISSING
       specifies that gaps are to be treated in the same way as missing data; GAPMODE=NEWSTATE
       specifies that gaps are to be treated as an additional state (for DNA/RNA/NUCLEOTIDE data,
       as a fifth base).

    **USERTYPE**. This command defines a character transformation type, as used in parsimony
    analysis to designate the cost of changes between states. There are several predefined
    character types (see character transformation type in the Appendix); USERTYPE allows
    additional character types to be created. USERTYPE is an object definition command with the
    exception that an asterisk cannot be used to indicate the default type (default type is stated
    in the OPTIONS command of the ASSUMPTIONS block). The standard defines no limit to the length
    of the type name, although individual programs might impose restrictions.

    STEPMATRIX format is

    .. code-block::

        USERTYPE myMatrix (STEPMATRIX) = n
            s s s s
            . k k k
            k . k k
            k k . k
            k k k .;

    where n is the number of rows and columns in the step matrix, the s's are state symbols, and
    the k's are the cost for going between states. The n can take any value >2.
    Diagonal elements may be listed as periods. If a change is to be prohibited, then one enters
    an "i" for infinity. Typically, the state symbols will be in sequence, but they need not be.
    The following matrices assign values identically:

    .. code-block::

        USERTYPE myMatrix (STEPMATRIX) = 4
            0 1 2 3
            . 1 5 1
            1 . 5 1
            5 5 . 5
            1 1 5 .;
        USERTYPE myMatrix2 (STEPMATRIX) = 4
            2 0 3 1
            . 5 5 5
            5 . 1 1
            5 1 . 1
            5 1 1 .;

    The number of steps may be either integers or real numbers. The range of possible values will
    differ from program to program. Versions 3.0-3.04 of MacClade use the format name REALMATRIX
    rather than STEPMATRIX if the matrix contains real numbers. Future programs should treat
    REALMATRIX as a synonym of STEPMATRIX.

    CSTREE format is very similar to the TREE format in a TREES block. That is, character state
    trees are described in the parenthesis notation following the rules given for TREES of taxa.
    Instead of taxon labels, character state symbols are used. Thus,

    .. code-block::

        USERTYPE cstree-name (CSTREE) = [(list-of-subtrees)] [state-symbol];

    where each subtree has the same format as the overall tree and the subtrees are separated by
    commas.

    **TYPESET**. This command specifies the type assigned to each character as used in parsimony
    analysis. This is a standard object definition command. Any characters not listed in the
    character-set have the default character type. The type names to be used are either the
    predefined ones or those defined in a USERTYPE command. Each value in a definition in VECTOR
    format must be separated by whitespace. The following are equivalent type sets:

    .. code-block::

        TYPESET mytypes = ORD: 1 4 6 , UNORD: 2 3 5 ;
        TYPESET mytypes (VECTOR) = ORD UNORD UNORD ORD UNORD ORD;

    **WTSET**. This command specifies the weights of each character. This is a standard object
    definition command. Any characters not listed in the character-set have weight 1. The weights
    may be either integers or real numbers.
    The minimum and maximum weight values will differ from program to program. Each value in a
    definition in VECTOR format must be separated by whitespace unless the NOTOKENS option is
    invoked, in which case no whitespace is needed and all weights must be integers in the range
    0-9. The following are equivalent weight sets:

    .. code-block::

        WTSET mywts = 3 : 1 4 6, 1: 2 3 5;
        WTSET mywts (VECTOR) = 3 1 1 3 1 3 ;

    In earlier versions of MacClade, the formatting subcommand REAL was used to indicate that
    real-valued weights were included in the WTSET. This subcommand is no longer in use; programs
    are expected to detect the presence of integral or real-valued weights while reading the WTSET
    command.

    **EXSET**. This command specifies which characters are to be excluded from consideration.
    This is a standard object definition command. Any characters not listed in the character-set
    are included. The VECTOR format consists of 0's and 1's: a 1 indicates that the character is
    to be excluded; whitespace is not necessary between 0's and 1's. The following commands are
    equivalent and serve to exclude characters 5, 6, 7, 8, and 12.

    .. code-block::

        EXSET * toExclude = 5-8 12;
        EXSET * toExclude (VECTOR) = 000011110001;

    **ANCSTATES**. This command allows specification of ancestral states. This is a standard
    object definition command. Any valid state symbol can be used in the description for discrete
    data, and any valid value can be used for continuous data. TOKENS is the default for
    DATATYPE=CONTINUOUS; NOTOKENS is the default for all other DATATYPES. TOKENS is not allowed
    for DATATYPES DNA, RNA, and NUCLEOTIDE. If TOKENS is invoked, the standard three-letter amino
    acid abbreviations can be used with DATATYPE=PROTEIN and defined state names can be used for
    DATATYPE=STANDARD. NOTOKENS is not allowed for DATATYPE=CONTINUOUS. The following commands are
    equivalent:

    .. code-block::

        ANCSTATES ancestor = 0: 1-3 5-7 12, 1: 4 8-10, 2: 11;
        ANCSTATES ancestor (VECTOR) = 000100011120;
    """
    pass
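
The docstring states that the STANDARD and VECTOR forms of TYPESET, WTSET and EXSET definitions are equivalent. Since `commonnexus` only parses these commands generically, the following is a minimal sketch of how that equivalence could be checked by hand. The helpers `expand_charset`, `standard_to_vector` and `exset_to_vector` are hypothetical names, not part of `commonnexus`, and the sketch assumes ranges are written without spaces around the hyphen and that the number of characters is known.

```python
def expand_charset(spec):
    """Expand a character-set such as '5-8 12' into a list of 1-based indices."""
    indices = []
    for token in spec.split():
        if "-" in token:
            # A range like '5-8' includes both endpoints.
            start, end = token.split("-")
            indices.extend(range(int(start), int(end) + 1))
        else:
            indices.append(int(token))
    return indices


def standard_to_vector(definition, nchar, default):
    """Turn 'value: charset, value: charset' into one value per character."""
    # Characters that are not listed keep the default value.
    vector = [default] * nchar
    for part in definition.split(","):
        value, _, charset = part.partition(":")
        for index in expand_charset(charset):
            vector[index - 1] = value.strip()
    return vector


def exset_to_vector(charset, nchar):
    """Build the 0/1 VECTOR form of an EXSET; 1 marks an excluded character."""
    excluded = set(expand_charset(charset))
    return "".join("1" if i in excluded else "0" for i in range(1, nchar + 1))


# The definitions below are the examples quoted in the docstring.
wtset = standard_to_vector("3 : 1 4 6, 1: 2 3 5", nchar=6, default="1")
typeset = standard_to_vector("ORD: 1 4 6 , UNORD: 2 3 5", nchar=6, default="UNORD")
exset = exset_to_vector("5-8 12", nchar=12)

print(" ".join(wtset))    # 3 1 1 3 1 3
print(" ".join(typeset))  # ORD UNORD UNORD ORD UNORD ORD
print(exset)              # 000011110001
```

The defaults passed in (weight 1, type UNORD) mirror the defaults the docstring gives for unlisted characters; a real reader would take the default type from the OPTIONS DEFTYPE subcommand instead.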