Handling DNA/RNA sequences

The core of sugar are the sequence handling classes BioSeq and BioBasket. The BioSeq class behaves like a string with useful bioinformatics methods attached. The BioBasket class is a container for multiple BioSeq objects and behaves like a list with useful methods attached. An example of such a useful method is certainly the translate() method.

To read sequences, use the powerful read() routine. It can handle glob expressions, web resources, archives, and automatically detects file formats by inspecting the file contents. To write sequences to files, use the BioBasket.write() method. When writing, the format can be automatically detected from the file extension.

Note

If you don’t have any example files on hand: All of the local example files used in the tutorial are distributed with the sugar package. You can copy them into the current folder using the sugar tutorial command.

The following code loads two local files. The format of the first file is specified explicitly, the format of the second file is autodetected (Stockholm). Gaps are removed from the alignment using a string method. The original ali object is left untouched. Then, a file containing the sequence from the FASTA file and the first two sequences from the alignment is created.

>>> from sugar import read
>>> seqs = read('AF086833.fasta', 'fasta')
>>> ali = read('pesti.stk')
>>> cleaned = ali.copy().str.replace('-', '')
>>> combined = seqs + cleaned[:2]
>>> combined.write('combined.fasta')

Calling read() without any arguments returns sample sequences.

>>> seqs = read()
>>> print(seqs)
2 seqs in basket
AB047639  9678  ACCTGCCCCTAATAGGGGCGACACTCCGCCATGAATCACTCCCCTGTGA...  GC:58.26%
AB677533  9471  GCCCGCCCCCTGATGGGGGCGACACTCCGCCATGAATCACTCCCCTGTG...  GC:57.46%
  customize output with BioBasket.tostr() method

Available file formats

Since sugar uses a plugin system, it is easy to add support for new file formats on the fly. The following sequence file formats are supported out of the box:

format

module

read

write

description

fasta

sugar._io.fasta

FASTA IO

genbank

sugar._io.genbank

GenBank reader

embl

sugar._io.embl

EMBL flat file reader for ENA and UniProt

clustal

sugar._io.clustal

Clustal IO

stockholm

sugar._io.stockholm

Stockholm IO

gff

sugar._io.gff

Generic feature format (GFF) and Gene transfer format (GTF) IO

sjson

sugar._io.sjson

SJson IO, custom lossless sugar format

The table links to the used modules and function.

Architecture of the BioBasket and BioSeq classes

In the following we explore the BioSeq and BioBasket objects.

>>> from sugar import read
>>> seqs = read()
>>> print('ids', seqs.ids)
ids ['AB047639', 'AB677533']
>>> print(seqs)
2 seqs in basket
AB047639  9678  ACCTGCCCCTAATAGGGGCGACACTCCGCCATGAATCACTCCCCTGTGA...  GC:58.26%
AB677533  9471  GCCCGCCCCCTGATGGGGGCGACACTCCGCCATGAATCACTCCCCTGTG...  GC:57.46%
  customize output with BioBasket.tostr() method
>>> type(seqs)
<class 'sugar.core.seq.BioBasket'>
>>> type(seqs[0])
<class 'sugar.core.seq.BioSeq'>
>>> seq1 = seqs[0]
>>> seq2 = seqs.d['AB677533']  # Select sequence by id
>>> print(f'First sequence {seq1.id} starts with {seq1[:10]}.')
First sequence AB047639 starts with ACCTGCCCCT.
>>> print(f'Metadata:\n{seq1.meta}')
Metadata:
      id: AB047639
    _fmt: genbank
_genbank: Attr(locus='AB047639, 9678, bp, RNA, linear, VRL, 12-NOV-2005', def...
features:
source   0+ 9_678  seqid=AB047639;_genbank=Attr(organism='Hepatitis C virus J...
   CDS 340+ 9_102  seqid=AB047639;_genbank=Attr(codon_start=1, product='polyp...
>>> print('Metadata can be accessed with keys or as attributes:',
...       seq1.meta.id, seq1.meta['id'])
Metadata can be accessed with keys or as attributes: AB047639 AB047639

As you can see, the seqs object is an instance of the BioBasket class, which is basically a list of BioSeq objects, each containing a single sequence and its metadata. Here, the sequences have been loaded from a GenBank file, which can be inferred from the value of the meta._fmt attribute. Each format plugin stores format-specific data in the private attribute of its name when reading, here _genbank. This attribute can also be used by the corresponding format plugin to write format specific data.

The BioBasket and BioSeq classes have useful methods attached to them, such as rc(), translate(), match(), or find_orfs(). Also, many string methods are accessible via the str attribute, e.g. str.find() or str.replace(). Note, that unlike the original string methods, the methods attached to the str attribute work in-place, if applicable.

The general architecture of the BioBasket and BioSeq classes is shown in the following diagram:

../_images/datamodel_seq.svg
../_images/datamodel_seq_dark.svg

Attributes marked with an asterisk are directly accessible from the sequence object.

Slicing BioBasket and BioSeq objects

Slicing is heavily overloaded for the BioBasket and BioSeq classes.

The BioSeq class can take the following slicing arguments:
  • A normal slice notation, in which case a BioSeq object with sliced data is returned.

  • A Feature, LocationTuple, or Location object, in which case the sequence will be sliced according to the given location.

  • A string, in which case the first Feature attached to the sequence of that type is used.

The BioBasket class can take the following slicing arguments:
  • A normal slice notation, in which case the underlying list is sliced.

  • For strings, locations, and features, each sequence is sliced separately with the argument.

  • For len-2 slices, the first item selects the sequences, and the second item slices the sequences according to the rules above.

Some examples:

>>> from sugar import read
>>> seqs = read()
>>> seq = seqs[0]
>>> print(seq[:10])  # prints the BioSeq object
ACCTGCCCCT
>>> print(seq.data[:10])  # use data attribute to get the string
ACCTGCCCCT
>>> loc = seq.fts[1].loc
>>> seq_of_ft = seq[loc]
>>> cds = seq['cds']
>>> print(seqs[0, :10])
ACCTGCCCCT
>>> print(seqs[:, :10])
2 seqs in basket
AB047639  10  ACCTGCCCCT  GC:70.00%
AB677533  10  GCCCGCCCCC  GC:100.00%
  customize output with BioBasket.tostr() method
>>> print(seqs['cds'])
2 seqs in basket
AB047639  9102  ATGAGCACAAATCCTAAACCTCAAAGAAAAACCAAAAGAAACACCAACC...  GC:58.68%
AB677533  9045  ATGAGCACAAATCCTAAACCTCAAAGAAAAACCAAAAGAAACACAAACC...  GC:57.63%
  customize output with BioBasket.tostr() method

It is possible to specify options for these slicing operations. This is done using the BioBasket and BioSeq.sl() methods. The syntax is as follows: seq.sl(**kw)[slice]. The gap option can be used to account for gaps. Use it when slices or features are specified relative to the sequences without gaps. This is best demonstrated with an example:

>>> seqs[:1, :5] = '-----'  # replace 5 chars with gaps for first seq
>>> print(seqs[:, :10])
2 seqs in basket
AB047639  10  -----CCCCT  GC:80.00%
AB677533  10  GCCCGCCCCC  GC:100.00%
  customize output with BioBasket.tostr() method
>>> print(seqs.sl(gap='-')[:, :10])
2 seqs in basket
AB047639  15  -----CCCCTAATAG  GC:50.00%
AB677533  10  GCCCGCCCCC  GC:100.00%
  customize output with BioBasket.tostr() method

Some methods, including .sl(), come with an update_fts option, which updates the features to match the new manipulated sequences.

An advanced example

The code is explained below.

>>> from sugar import read
>>> resource = ('https://github.com/rnajena/anchorna/'
...             'raw/refs/heads/master/anchorna/tests/data/pesti55.gff.zip')
>>> seqs = read(resource)  # 1
>>> print(seqs)  # 2
55 seqs in basket
NC_076029   12k  GTATACGAGAACTAGATAAAATACTCGTATACATATTGGACAACAGAA...  GC:45.79%
NC_001461   13k  GTATACGAGAATTAGAAAAGGCACTCGTATACGTATTGGGCAATTAAA...  GC:45.79%
KC853440    12k  GTATACGAGATTTAGAAAAGTCTCTCGTATACACATTGGACATAACAA...  GC:44.77%
JX419398    12k  AATAAGGGGGGTAGCAACAGTGGTGAGTTCGTTGGATGGCTGAAGCCC...  GC:45.47%
M96687      12k  ATGTATACGAGAATTTGCCTAACCTCGTATACATATTGGGCATTCTAA...  GC:46.11%
KT951841    12k  GAATTAGAAAAAGCTCCTGTATACAGATTGGGCAATTAAAAATTGAAG...  GC:45.33%
ON165517    12k  TGCCATGCCCTTAGTAGGACTAGCATAATGGAAGGGGTAGCAACAGTG...  GC:45.91%
KX577637    12k  AGATTTAGGAAAGCTCTCGTATGCATATTGGGCAATCTAAAATAATAA...  GC:45.86%
MW054939    12k  GGATTGGGCAAACTAAATAATAATTAGGCCTAGGGAACAAATCCTCCA...  GC:45.88%
...
MH221025    11k  GGCGGATGCCTCAGGTAAGAGTATAAAATCCGTTGTTTTTAACATGAA...  GC:46.22%
MN099165    12k  CATAATGCTTAGATTGGCTGCATTATGTGTGGGACATCCTAAATTTTT...  GC:46.13%
MN584738    11k  AATGTAAGGGACTCCTAAATTGCTATAAGCCCTGCGGTGAGTGGGGGA...  GC:46.41%
NC_077015   12k  TGCATAACCCTGATTGTAATTGGCTGGGTTATGCATGTGAGAACGCAA...  GC:43.03%
OU592965    12k  TGTAACATCACAAAGTCACCAAATTTGCTCCAGCTACGAGGTTGGCGG...  GC:44.79%
NC_077001   13k  GTATAAGGGATGGGATCCTTGTACATTTCATACCTCTACCGAACAGTA...  GC:39.23%
NC_025677   13k  GTATAAGGGTTGGGAACCTTGTACCAAGCTACCTCTGCCATTCAGTAT...  GC:40.63%
NC_077000   13k  TGTATAAGGAAGGGGTGGGAGCCTTCCCTATACTTGTTTACAAATACC...  GC:41.55%
OM030319    13k  GTTTGTGTATACCGGCTGGCCACCGTTACCTGGCTCGGAGTCTCCTAA...  GC:43.15%
  customize output with BioBasket.tostr() method
>>> print(seqs.fts[:5])  # 3
CDS 385+ 11_697  seqid=NC_076029;_gff=Attr(seqid='NC_076029')
CDS 385+ 11_967  seqid=NC_001461;_gff=Attr(seqid='NC_001461')
CDS 384+ 11_697  seqid=KC853440;_gff=Attr(seqid='KC853440')
CDS 255+ 11_697  seqid=JX419398;_gff=Attr(seqid='JX419398')
CDS 383+ 11_928  seqid=M96687;_gff=Attr(seqid='M96687')
>>> print(seqs[0].str.find('ATG'))  # 4
69
>>> print(seqs[0].matchall('start')[:8])  # 5
8 BioMatches
match=ATG span=(69, 72) rf=0 seqid=NC_076029
match=ATG span=(107, 110) rf=2 seqid=NC_076029
match=ATG span=(130, 133) rf=1 seqid=NC_076029
match=ATG span=(164, 167) rf=2 seqid=NC_076029
match=ATG span=(229, 232) rf=1 seqid=NC_076029
match=ATG span=(378, 381) rf=0 seqid=NC_076029
match=ATG span=(385, 388) rf=1 seqid=NC_076029
match=ATG span=(401, 404) rf=2 seqid=NC_076029
>>> orfs = seqs.find_orfs()  # 6
>>> print(orfs[:5])
ORF    69+  36  seqid=NC_076029;rf=0;has_start=True;has_stop=True
ORF   378+  27  seqid=NC_076029;rf=0;has_start=True;has_stop=True
ORF   783+  60  seqid=NC_076029;rf=0;has_start=True;has_stop=True
ORF   873+ 102  seqid=NC_076029;rf=0;has_start=True;has_stop=True
ORF 1_152+  30  seqid=NC_076029;rf=0;has_start=True;has_stop=True
>>> long_orfs = orfs.select(len_gt=500)
>>> print(long_orfs[:5])
ORF 385+ 11_697  seqid=NC_076029;rf=1;has_start=True;has_stop=True
ORF 385+ 11_967  seqid=NC_001461;rf=1;has_start=True;has_stop=True
ORF 384+ 11_697  seqid=KC853440;rf=0;has_start=True;has_stop=True
ORF 255+ 11_697  seqid=JX419398;rf=0;has_start=True;has_stop=True
ORF 383+ 11_928  seqid=M96687;rf=2;has_start=True;has_stop=True
>>> proteins = seqs['cds'].copy().translate()  # 7
>>> print(proteins)
55 seqs in basket
NC_076029  3898  MELITNELLYKTYKQKPVGVEEPVYDQAGNPLFGERGAIHPQSTLKLPHKRGERNVPTS...
NC_001461  3988  MELITNELLYKTYKQKPVGVEEPVYDQAGDPLFGERGAVHPQSTLKLPHKRGERDVPTN...
KC853440   3898  MELITNELLYKTYKQKPAGIEEPVYDQAGNPMFGEVGRIHPQSTLKLPHKRGECEVLTN...
JX419398   3898  MELITNELLYKTYKQKPTGVEEPVYDQAGNPLFGEKGEIHPQSTLKLPHKRGEREVPTN...
M96687     3975  MELITNELLYKTYKQKPAGVEEPVYNQAGDPLFGERGVVHPQATLKLPHKRGEREVPTN...
KT951841   3897  MELITNELLYKTYKQKPAGVEEPVYDRTGNPLFGERGPIHPQSTLKLPHKRGKDEIPTN...
ON165517   3898  MELISNELLYKTYKQKPVGVEEPVYDRGGNPLFGEKGTVHHQSTLKLPHKRGEREVPTN...
KX577637   3898  MKLITNELLYKTYKQKPFGVEEPVYDQAGEPLFGEKGMVHPQSTLKLPHKRGEREVLTN...
MW054939   3898  MELISNELLYKTYKQKPAGVEEPVYDQAGNPLFGERGAIHPQSTLKLPHKRGERNVLTN...
...
MH221025   3635  MKKQITYYLKKEKQRNGWTELVVGESHMKITTLSGRTYRGTWEMEKWSNPYGTYLPRPS...
MN099165   3635  MKKQIAYYLKKEKQKNGWTELVVGESHTKITTLSGKTYRGTWEMEKWPNPYGTYLPRPS...
MN584738   3635  MEKQISYYLKKEKQRNGWTELVVGESSTKITTLSGKTYRGTWEMEKRTNPYGTYLPRPS...
NC_077015  3663  MDTLNFYLTKEKRNNAWDELVVSETSTTITTLKGSRYRGIWEMPTKKREEEVVPKMNVV...
OU592965   3662  MDSLQSYLLKEKRNNAWDELIVSETRRTITTLKGSRYKGVWQMPTKKRKEEVIPKMDTN...
NC_077001  4023  MSCEHLLNDNNKKTIQIIKKFYNPLLRCGCRLMSKASGPPRRGSNTTLRLKEYKDNYQR...
NC_025677  3993  MSGVQGQPRRGSEKTVRMREYKDSFQRGLYVEIGPKLYSSYEGPVYDTIPIKITEEIQE...
NC_077000  3989  MRIDTTIEETKEKIFQKIKEKYDPVFRCGCRMMTKIKTRPRKGSTKTVSTDDYSRALYV...
OM030319   3779  MSEPSSCNNKSSTSASNADVKPKLRPDRLKKGKLQTRPKEHEPDVARKPPDATIVVDGT...
  customize output with BioBasket.tostr() method
>>> proteins.write('pesti_proteins.fasta')  # 8

Let us break up the code:

  1. We read an archive containing a GFF file with FASTA directive from an online resource. Sugar automatically downloads and extracts the archive (detected by file extension) and reads the file with the appropriate plugin (detected by content).

  2. The content of seqs is printed, to print more than 20 lines, use e.g. print(seqs.tostr(h=100)) to print 100 lines, use h=0 to print all lines.

  3. Features are attached to the object, we can print the features of the first sequence with print(seqs[0].fts). In the example, the first 5 features of all sequences are printed. Obviously, these mark the coding sequences.

  4. We try to find the start codon using the str.find() method, but the first ATG string is found before the start of the CDS.

  5. We use the more dedicated matchall() method to find possible start codons. In fact, one of the matches span=(385, 388) rf=1 marks the start of the CDS according to the annotation data.

  6. We use the find_orfs() method to find open reading frames. The first 5 ORFs found are printed, a CDS is not one of them. We could sort the ORFs by length, but we decided to select all ORFs which are longer than 500 nucleotides. Again, we print the first 5 ORFs longer than 500 nucleotides which coincide with the CDS of the first 5 sequences.

  7. Finally, we translate the CDS of the sequences to proteins. Since the translation is an in-place operation, we create a copy to keep the original sequences in the seqs variable. Sugar methods generally return the calling object if applicable, so we were able to use method chaining. Note that instead of using the CDS annotation for slicing, we could have used the found ORFs:

    >>> seqs.fts = long_orfs
    >>> proteins = seqs['orf'].copy().translate()
    
  8. We write the protein strings to a new FASTA file. The format is automatically detected from the file extension, if you want to be verbose, you can specify the format with proteins.write('pesti_proteins.fasta', 'fasta').