Handling DNA/RNA sequences¶
The core of sugar are the sequence handling classes BioSeq and BioBasket.
The BioSeq class behaves like a string with useful bioinformatics methods attached.
The BioBasket class is a container for multiple BioSeq objects and
behaves like a list with useful methods attached.
An example of such a useful method is certainly the translate() method.
To read sequences, use the powerful read() routine.
It can handle glob expressions, web resources, archives,
and automatically detects file formats by inspecting the file contents.
To write sequences to files, use the BioBasket.write() method.
When writing, the format can be automatically detected from the file extension.
Note
If you don’t have any example files on hand:
All of the local example files used in the tutorial are distributed with the sugar package.
You can copy them into the current folder using the sugar tutorial command.
The following code loads two local files.
The format of the first file is specified explicitly,
the format of the second file is autodetected (Stockholm).
Gaps are removed from the alignment using a string method.
The original ali object is left untouched.
Then, a file containing the sequence from the FASTA file and
the first two sequences from the alignment is created.
>>> from sugar import read
>>> seqs = read('AF086833.fasta', 'fasta')
>>> ali = read('pesti.stk')
>>> cleaned = ali.copy().str.replace('-', '')
>>> combined = seqs + cleaned[:2]
>>> combined.write('combined.fasta')
Calling read() without any arguments returns sample sequences.
>>> seqs = read()
>>> print(seqs)
2 seqs in basket
AB047639 9678 ACCTGCCCCTAATAGGGGCGACACTCCGCCATGAATCACTCCCCTGTGA... GC:58.26%
AB677533 9471 GCCCGCCCCCTGATGGGGGCGACACTCCGCCATGAATCACTCCCCTGTG... GC:57.46%
customize output with BioBasket.tostr() method
Available file formats
Since sugar uses a plugin system, it is easy to add support for new file formats on the fly. The following sequence file formats are supported out of the box:
format |
module |
read |
write |
description |
|---|---|---|---|---|
fasta |
FASTA IO |
|||
genbank |
GenBank reader |
|||
embl |
EMBL flat file reader for ENA and UniProt |
|||
clustal |
Clustal IO |
|||
stockholm |
Stockholm IO |
|||
gff |
Generic feature format (GFF) and Gene transfer format (GTF) IO |
|||
sjson |
SJson IO, custom lossless sugar format |
The table links to the used modules and function.
Architecture of the BioBasket and BioSeq classes
In the following we explore the BioSeq and BioBasket objects.
>>> from sugar import read
>>> seqs = read()
>>> print('ids', seqs.ids)
ids ['AB047639', 'AB677533']
>>> print(seqs)
2 seqs in basket
AB047639 9678 ACCTGCCCCTAATAGGGGCGACACTCCGCCATGAATCACTCCCCTGTGA... GC:58.26%
AB677533 9471 GCCCGCCCCCTGATGGGGGCGACACTCCGCCATGAATCACTCCCCTGTG... GC:57.46%
customize output with BioBasket.tostr() method
>>> type(seqs)
<class 'sugar.core.seq.BioBasket'>
>>> type(seqs[0])
<class 'sugar.core.seq.BioSeq'>
>>> seq1 = seqs[0]
>>> seq2 = seqs.d['AB677533'] # Select sequence by id
>>> print(f'First sequence {seq1.id} starts with {seq1[:10]}.')
First sequence AB047639 starts with ACCTGCCCCT.
>>> print(f'Metadata:\n{seq1.meta}')
Metadata:
id: AB047639
_fmt: genbank
_genbank: Attr(locus='AB047639, 9678, bp, RNA, linear, VRL, 12-NOV-2005', def...
features:
source 0+ 9_678 seqid=AB047639;_genbank=Attr(organism='Hepatitis C virus J...
CDS 340+ 9_102 seqid=AB047639;_genbank=Attr(codon_start=1, product='polyp...
>>> print('Metadata can be accessed with keys or as attributes:',
... seq1.meta.id, seq1.meta['id'])
Metadata can be accessed with keys or as attributes: AB047639 AB047639
As you can see, the seqs object is an instance of the BioBasket class,
which is basically a list of BioSeq objects,
each containing a single sequence and its metadata.
Here, the sequences have been loaded from a GenBank file,
which can be inferred from the value of the meta._fmt attribute.
Each format plugin stores format-specific data
in the private attribute of its name when reading, here _genbank.
This attribute can also be used by the corresponding format plugin to write format specific data.
The BioBasket and BioSeq classes have useful methods attached to them,
such as rc(), translate(), match(), or find_orfs().
Also, many string methods are accessible via the str attribute,
e.g. str.find() or str.replace().
Note, that unlike the original string methods,
the methods attached to the str attribute work in-place, if applicable.
The general architecture of the BioBasket and BioSeq classes
is shown in the following diagram:
Attributes marked with an asterisk are directly accessible from the sequence object.
Slicing BioBasket and BioSeq objects
Slicing is heavily overloaded for the BioBasket and BioSeq classes.
- The
BioSeqclass can take the following slicing arguments: A normal slice notation, in which case a BioSeq object with sliced data is returned.
A
Feature,LocationTuple, orLocationobject, in which case the sequence will be sliced according to the given location.A string, in which case the first Feature attached to the sequence of that type is used.
- The
BioBasketclass can take the following slicing arguments: A normal slice notation, in which case the underlying list is sliced.
For strings, locations, and features, each sequence is sliced separately with the argument.
For len-2 slices, the first item selects the sequences, and the second item slices the sequences according to the rules above.
Some examples:
>>> from sugar import read
>>> seqs = read()
>>> seq = seqs[0]
>>> print(seq[:10]) # prints the BioSeq object
ACCTGCCCCT
>>> print(seq.data[:10]) # use data attribute to get the string
ACCTGCCCCT
>>> loc = seq.fts[1].loc
>>> seq_of_ft = seq[loc]
>>> cds = seq['cds']
>>> print(seqs[0, :10])
ACCTGCCCCT
>>> print(seqs[:, :10])
2 seqs in basket
AB047639 10 ACCTGCCCCT GC:70.00%
AB677533 10 GCCCGCCCCC GC:100.00%
customize output with BioBasket.tostr() method
>>> print(seqs['cds'])
2 seqs in basket
AB047639 9102 ATGAGCACAAATCCTAAACCTCAAAGAAAAACCAAAAGAAACACCAACC... GC:58.68%
AB677533 9045 ATGAGCACAAATCCTAAACCTCAAAGAAAAACCAAAAGAAACACAAACC... GC:57.63%
customize output with BioBasket.tostr() method
It is possible to specify options for these slicing operations.
This is done using the BioBasket and BioSeq.sl() methods.
The syntax is as follows: seq.sl(**kw)[slice].
The gap option can be used to account for gaps.
Use it when slices or features are specified relative to the sequences without gaps.
This is best demonstrated with an example:
>>> seqs[:1, :5] = '-----' # replace 5 chars with gaps for first seq
>>> print(seqs[:, :10])
2 seqs in basket
AB047639 10 -----CCCCT GC:80.00%
AB677533 10 GCCCGCCCCC GC:100.00%
customize output with BioBasket.tostr() method
>>> print(seqs.sl(gap='-')[:, :10])
2 seqs in basket
AB047639 15 -----CCCCTAATAG GC:50.00%
AB677533 10 GCCCGCCCCC GC:100.00%
customize output with BioBasket.tostr() method
Some methods, including .sl(), come with an update_fts option,
which updates the features to match the new manipulated sequences.
An advanced example
The code is explained below.
>>> from sugar import read
>>> resource = ('https://github.com/rnajena/anchorna/'
... 'raw/refs/heads/master/anchorna/tests/data/pesti55.gff.zip')
>>> seqs = read(resource) # 1
>>> print(seqs) # 2
55 seqs in basket
NC_076029 12k GTATACGAGAACTAGATAAAATACTCGTATACATATTGGACAACAGAA... GC:45.79%
NC_001461 13k GTATACGAGAATTAGAAAAGGCACTCGTATACGTATTGGGCAATTAAA... GC:45.79%
KC853440 12k GTATACGAGATTTAGAAAAGTCTCTCGTATACACATTGGACATAACAA... GC:44.77%
JX419398 12k AATAAGGGGGGTAGCAACAGTGGTGAGTTCGTTGGATGGCTGAAGCCC... GC:45.47%
M96687 12k ATGTATACGAGAATTTGCCTAACCTCGTATACATATTGGGCATTCTAA... GC:46.11%
KT951841 12k GAATTAGAAAAAGCTCCTGTATACAGATTGGGCAATTAAAAATTGAAG... GC:45.33%
ON165517 12k TGCCATGCCCTTAGTAGGACTAGCATAATGGAAGGGGTAGCAACAGTG... GC:45.91%
KX577637 12k AGATTTAGGAAAGCTCTCGTATGCATATTGGGCAATCTAAAATAATAA... GC:45.86%
MW054939 12k GGATTGGGCAAACTAAATAATAATTAGGCCTAGGGAACAAATCCTCCA... GC:45.88%
...
MH221025 11k GGCGGATGCCTCAGGTAAGAGTATAAAATCCGTTGTTTTTAACATGAA... GC:46.22%
MN099165 12k CATAATGCTTAGATTGGCTGCATTATGTGTGGGACATCCTAAATTTTT... GC:46.13%
MN584738 11k AATGTAAGGGACTCCTAAATTGCTATAAGCCCTGCGGTGAGTGGGGGA... GC:46.41%
NC_077015 12k TGCATAACCCTGATTGTAATTGGCTGGGTTATGCATGTGAGAACGCAA... GC:43.03%
OU592965 12k TGTAACATCACAAAGTCACCAAATTTGCTCCAGCTACGAGGTTGGCGG... GC:44.79%
NC_077001 13k GTATAAGGGATGGGATCCTTGTACATTTCATACCTCTACCGAACAGTA... GC:39.23%
NC_025677 13k GTATAAGGGTTGGGAACCTTGTACCAAGCTACCTCTGCCATTCAGTAT... GC:40.63%
NC_077000 13k TGTATAAGGAAGGGGTGGGAGCCTTCCCTATACTTGTTTACAAATACC... GC:41.55%
OM030319 13k GTTTGTGTATACCGGCTGGCCACCGTTACCTGGCTCGGAGTCTCCTAA... GC:43.15%
customize output with BioBasket.tostr() method
>>> print(seqs.fts[:5]) # 3
CDS 385+ 11_697 seqid=NC_076029;_gff=Attr(seqid='NC_076029')
CDS 385+ 11_967 seqid=NC_001461;_gff=Attr(seqid='NC_001461')
CDS 384+ 11_697 seqid=KC853440;_gff=Attr(seqid='KC853440')
CDS 255+ 11_697 seqid=JX419398;_gff=Attr(seqid='JX419398')
CDS 383+ 11_928 seqid=M96687;_gff=Attr(seqid='M96687')
>>> print(seqs[0].str.find('ATG')) # 4
69
>>> print(seqs[0].matchall('start')[:8]) # 5
8 BioMatches
match=ATG span=(69, 72) rf=0 seqid=NC_076029
match=ATG span=(107, 110) rf=2 seqid=NC_076029
match=ATG span=(130, 133) rf=1 seqid=NC_076029
match=ATG span=(164, 167) rf=2 seqid=NC_076029
match=ATG span=(229, 232) rf=1 seqid=NC_076029
match=ATG span=(378, 381) rf=0 seqid=NC_076029
match=ATG span=(385, 388) rf=1 seqid=NC_076029
match=ATG span=(401, 404) rf=2 seqid=NC_076029
>>> orfs = seqs.find_orfs() # 6
>>> print(orfs[:5])
ORF 69+ 36 seqid=NC_076029;rf=0;has_start=True;has_stop=True
ORF 378+ 27 seqid=NC_076029;rf=0;has_start=True;has_stop=True
ORF 783+ 60 seqid=NC_076029;rf=0;has_start=True;has_stop=True
ORF 873+ 102 seqid=NC_076029;rf=0;has_start=True;has_stop=True
ORF 1_152+ 30 seqid=NC_076029;rf=0;has_start=True;has_stop=True
>>> long_orfs = orfs.select(len_gt=500)
>>> print(long_orfs[:5])
ORF 385+ 11_697 seqid=NC_076029;rf=1;has_start=True;has_stop=True
ORF 385+ 11_967 seqid=NC_001461;rf=1;has_start=True;has_stop=True
ORF 384+ 11_697 seqid=KC853440;rf=0;has_start=True;has_stop=True
ORF 255+ 11_697 seqid=JX419398;rf=0;has_start=True;has_stop=True
ORF 383+ 11_928 seqid=M96687;rf=2;has_start=True;has_stop=True
>>> proteins = seqs['cds'].copy().translate() # 7
>>> print(proteins)
55 seqs in basket
NC_076029 3898 MELITNELLYKTYKQKPVGVEEPVYDQAGNPLFGERGAIHPQSTLKLPHKRGERNVPTS...
NC_001461 3988 MELITNELLYKTYKQKPVGVEEPVYDQAGDPLFGERGAVHPQSTLKLPHKRGERDVPTN...
KC853440 3898 MELITNELLYKTYKQKPAGIEEPVYDQAGNPMFGEVGRIHPQSTLKLPHKRGECEVLTN...
JX419398 3898 MELITNELLYKTYKQKPTGVEEPVYDQAGNPLFGEKGEIHPQSTLKLPHKRGEREVPTN...
M96687 3975 MELITNELLYKTYKQKPAGVEEPVYNQAGDPLFGERGVVHPQATLKLPHKRGEREVPTN...
KT951841 3897 MELITNELLYKTYKQKPAGVEEPVYDRTGNPLFGERGPIHPQSTLKLPHKRGKDEIPTN...
ON165517 3898 MELISNELLYKTYKQKPVGVEEPVYDRGGNPLFGEKGTVHHQSTLKLPHKRGEREVPTN...
KX577637 3898 MKLITNELLYKTYKQKPFGVEEPVYDQAGEPLFGEKGMVHPQSTLKLPHKRGEREVLTN...
MW054939 3898 MELISNELLYKTYKQKPAGVEEPVYDQAGNPLFGERGAIHPQSTLKLPHKRGERNVLTN...
...
MH221025 3635 MKKQITYYLKKEKQRNGWTELVVGESHMKITTLSGRTYRGTWEMEKWSNPYGTYLPRPS...
MN099165 3635 MKKQIAYYLKKEKQKNGWTELVVGESHTKITTLSGKTYRGTWEMEKWPNPYGTYLPRPS...
MN584738 3635 MEKQISYYLKKEKQRNGWTELVVGESSTKITTLSGKTYRGTWEMEKRTNPYGTYLPRPS...
NC_077015 3663 MDTLNFYLTKEKRNNAWDELVVSETSTTITTLKGSRYRGIWEMPTKKREEEVVPKMNVV...
OU592965 3662 MDSLQSYLLKEKRNNAWDELIVSETRRTITTLKGSRYKGVWQMPTKKRKEEVIPKMDTN...
NC_077001 4023 MSCEHLLNDNNKKTIQIIKKFYNPLLRCGCRLMSKASGPPRRGSNTTLRLKEYKDNYQR...
NC_025677 3993 MSGVQGQPRRGSEKTVRMREYKDSFQRGLYVEIGPKLYSSYEGPVYDTIPIKITEEIQE...
NC_077000 3989 MRIDTTIEETKEKIFQKIKEKYDPVFRCGCRMMTKIKTRPRKGSTKTVSTDDYSRALYV...
OM030319 3779 MSEPSSCNNKSSTSASNADVKPKLRPDRLKKGKLQTRPKEHEPDVARKPPDATIVVDGT...
customize output with BioBasket.tostr() method
>>> proteins.write('pesti_proteins.fasta') # 8
Let us break up the code:
We
readan archive containing a GFF file with FASTA directive from an online resource. Sugar automatically downloads and extracts the archive (detected by file extension) and reads the file with the appropriate plugin (detected by content).The content of
seqsis printed, to print more than 20 lines, use e.g.print(seqs.tostr(h=100))to print 100 lines, useh=0to print all lines.Features are attached to the object, we can print the features of the first sequence with
print(seqs[0].fts). In the example, the first 5 features of all sequences are printed. Obviously, these mark the coding sequences.We try to find the start codon using the
str.find()method, but the first ATG string is found before the start of the CDS.We use the more dedicated
matchall()method to find possible start codons. In fact, one of the matchesspan=(385, 388) rf=1marks the start of the CDS according to the annotation data.We use the
find_orfs()method to find open reading frames. The first 5 ORFs found are printed, a CDS is not one of them. We couldsortthe ORFs by length, but we decided toselectall ORFs which are longer than 500 nucleotides. Again, we print the first 5 ORFs longer than 500 nucleotides which coincide with the CDS of the first 5 sequences.Finally, we translate the CDS of the sequences to proteins. Since the translation is an in-place operation, we create a copy to keep the original sequences in the
seqsvariable. Sugar methods generally return the calling object if applicable, so we were able to use method chaining. Note that instead of using the CDS annotation for slicing, we could have used the found ORFs:>>> seqs.fts = long_orfs >>> proteins = seqs['orf'].copy().translate()
We write the protein strings to a new FASTA file. The format is automatically detected from the file extension, if you want to be verbose, you can specify the format with
proteins.write('pesti_proteins.fasta', 'fasta').