Source code for commonnexus.blocks.codons

from .base import Block


[docs]class Codons(Block): """ .. warning:: `commonnexus` doesn't provide any functionality - other than parsing as generic commands - for ``CODONS`` blocks yet. The CODONS block contains information about the genetic code, the regions of DNA and RNA sequences that are protein coding, and the location of triplets coding for amino adds in nucleotide sequences. .. rst-class:: nexus | BEGIN CODONS; | [CODONPOSSET [*] name [( {STANDARD | VECTOR}) ] = | N: character-set, | 1: character-set, | 2: character-set, | 3: character-set; ] | [GENETICCODE code-name | [([CODEORDER=132|other] [NUCORDER = TCAG|other] [[NO]TOKENS] | [EXTENSIONS="symbols-list"])] | = genetic code description];] | [CODESET [*] codeset-name { (CHARACTERS | UNALIGNED | TAXA) } = | code-name:character-set or taxon-set | [,code-name:character-set or taxon-set...]; ] | END; GENETICCODE must precede any CODESET that refers to it. There are several predefined genetic codes: .. code-block:: UNIVERSAL [universal] UNIVERSAL.EXT [universal, extended] MTDNA.DROS [Drosophila mtDNA] MTDNA.DROS.EXT [Drosophila | mtDNA , extended] MTDNA.MAM [Mammalian mtDNA] MTDNA.MAM.EXT [Mamma1ian mtDNA, extended] MTDNA.YEAST [Yeast mtDNA] For a summary of the genetic codes, see Osawa et al. (1992). Extended codes are those in which "extra" amino acids have been added to avoid disjunct amino adds (see the EXTENSIONS subcommand under GENETICCODE). **CODONPOSSET** — This command stores information about protein-coding regions and the codon positions of nucleotide bases in protein-coding regions and follows the format of a standard object definition command. Those characters designated as 1, 2, or 3 are coding bases specified as being of positions 1, 2, and 3, respectively. Those characters designated as N are considered non-protein-coding. Those characters designated as ? are of unknown nature. Any unspecified bases are considered of unknown nature (equivalent to ?). If no CODONPOSSET statement is present, all bases are presumed of unknown nature. For example, the following command .. code-block:: CODONPOSSET * coding = N:1-10, 1:11-.\\3, 2:12-.\\3, 3:13-.\\3; designates bases 1-10 as noncoding and positions the remaining bases in the order 123123123... **GENETICCODE** — GENETICCODE stores information about a user-defined genetic code. Multiple GENETICCODES may be defined in the block. This is a standard object definition command except that the default genetic code is not indicated by an asterisk after GENETICCODE. The genetic code description is a listing of amino acids. By default, the first amino add listed is that coded for by the triplet TTT, and the last amino add listed is that coded for by GGG. In between, the order of triplets follows a pattern controlled by the subcommands NUCORDER and CODEORDER. By default, the amino acids are listed in the following order: TTT, TCT, TAT, TGT, TTC, TCC, TAC, TGC, and so on. The universal genetic code can thus be written .. code-block:: GENETICCODE UNTITLED= F S Y C F S Y C L S * * L S * W L P H R L P H R L P Q R L P Q R I T N S I T N S I T K R M T K R V A D G V A D G V A E G V A E G This assigns TTT to phenylalanine, TCT to serine, TAT to tyrosine, TGT to cysteine, and so on. The following subcommands are included. 1. CODEORDER. The default CODEORDER is 231, indicating that the second codon nucleotide changes most quickly (i.e., codons represented by adjacent amino acids in the listing always differ at second positions), the third nucleotide changes next most quickly, and the first nucleotide changes most slowly in the list (such that the codons representing the first 16 listed amino acids all have the same first nucleotide). 2. NUCORDER. The default NUCORDER is TCAG, indicating that the codons with T at a given position are listed first, C next, etc. For example, if CODEORDER is 123 and the NUCORDER is ACGT, then the amino acids would be listed in order to correspond to codons in the order AAA CAA GAA TAA ACA CCA GCA TCA, etc. 3. [NO]TOKENS. If TOKENS, then amino acids are to be listed by their standard three-letter abbreviations for amino acids (e.g., Leu, Glu). A termination codon is designated by Ter or Stp. If NOTOKENS (the default), then the IUPAC symbols are used in the listing. A termination codon is designated by an asterisk. 4. EXTENSIONS. This command lists the symbols for "extra" amino acids that are added to avoid disjunct amino acids. For example, serines in the universal code are coded for by two distinct groups of codons; one cannot change between these groups without going through a different amino acid. Serine is therefore disjunct. In the UNIVERSAL code, serine is kept disjunct, with all serines symbolized by S. In the UNIVERSAL.EXT code, however, one serine group is identified by the extra amino acid symbol 1 and the other group is identified by 2. To indicate this, the EXTENSIONS subcommand would read EXTENSIONS="S S", indicating that "extra" amino acids 1 and 2 are both serines. In the genetic code description, the symbols 1 and 2 would then both stand for serine. For example, the universal extended genetic code could be represented by .. code-block:: GENETICCODE * universal (NUCORDER=TCAG CODEORDER=213 EXTENSIONS= "S S") = F 1 Y C L P H R I T N 2 V A D G F 1 Y C L P H R I T N 2 V A D G L I * * L P Q R I T K R V A E G L 1 * W L P Q R M T K R V A E G ; (Note that the CODEORDER has been changed in this example.) **CODESET** — This object definition assigns genetic codes to various characters and taxa. All nucleotide sites are designated as coding, and all amino acid sites have a genetic code assigned to them. If the CHARACTERS format is used, then character-sets are to be used in the description, and the genetic code is thus applied to the characters listed. For example, .. code-block:: CODESET oddcodeset = customcode: 4-99; designates the genetic code "custom code" as applying to characters 4-99 for all taxa. If UNALIGNED, then genetic code is presumed to apply to all sites in an UNALIGNED block. Such a CODESET command might look like this: .. code-block:: CODESET mycodeset (UNALIGNED) = universal: ALL; For UNALIGNED, the only character-set that can be used is ALL. If the TAXA format is used, then taxon-sets are used in the description. Thus, different genetic codes can be assigned to different taxa for all characters. Current programs do not accept any character-set other than ALL nor any taxon-set other than ALL. """