Sequence features¶
Sequence features, respective annotations,
can be handled with the Feature and FeatureList classes.
To read features, use the read_fts() routine.
It has the same gimmicks as the sequence read() function.
To write features, use the FeatureList.write() method.
The following feature formats are supported out of the box:
format |
module |
read |
write |
description |
|---|---|---|---|---|
gff |
Generic feature format (GFF) and Gene transfer format (GTF) IO |
|||
gtf |
Generic feature format (GFF) and Gene transfer format (GTF) IO |
|||
genbank |
GenBank reader |
|||
embl |
EMBL flat file reader for ENA and UniProt |
|||
infernal |
Infernal reader for output generated with tblout fmt 1, 2, 3 |
|||
mmseqs |
MMseqs2 reader for output generated with option fmtmode 4 (preferred) or 0 |
|||
meme_txt |
Read support for file formats associated with the MEME Suite |
|||
blast |
BLAST reader for output generated with outfmt 7 (preferred), 6 or 10 |
|||
tsv |
TSV, CSV and XSV file formats IO |
|||
csv |
TSV, CSV and XSV file formats IO |
|||
sjson |
SJson IO, custom lossless sugar format |
The following example loads a BLAST file and writes the hit locations to a GFF file:
>>> from sugar import read_fts
>>> fts = read_fts('hits.blastn')
>>> fts.write('blast_hits.gff')
Printing a FeatureList shows lines like this:
{type} {start}{strand} {len} {meta}
The following example prints the the first 5 features of the sample GFF file included in the package. This file can be read by calling .read_fts() without any arguments.
>>> from sugar import read_fts
>>> fts = read_fts()
>>> print(fts[:5])
region 0+ 108_689_991 name=6;id=NC_072478.1:1..108689991;seqid=NC_0...
pseudogene 43_129- 497 name=LOC103298147;id=gene-LOC103298147;seqid=...
exon 43_129- 497 id=id-LOC103298147;seqid=NC_072478.1;_gff=Att...
gene 61_869+ 36_813 name=AP1AR;id=gene-AP1AR;seqid=NC_072478.1;_g...
mRNA 61_869+ 36_813 name=XM_054717066.1;id=rna-XM_054717066.1;seq...
Note
Location positions are Python-like 0-based numbers.
Selecting features
Features can be selected in several ways.
Selecting from the
FeatureListby slicing:
>>> from sugar import read_fts
>>> fts = read_fts()
>>> print(fts[:2]) # Select the first two features
region 0+ 108_689_991 name=6;id=NC_072478.1:1..108689991;seqid=NC_0...
pseudogene 43_129- 497 name=LOC103298147;id=gene-LOC103298147;seqid=...
Use
select()to select features of a particular type:
>>> print(fts.select('mRNA')) # Select the mRNA type features
mRNA 61_869+ 36_813 name=XM_054717066.1;id=rna-XM_054717066.1;seqid=NC_07247...
>>> print(fts.select(['gene', 'region'])) # Select the features of type gene and region
region 0+ 108_689_991 name=6;id=NC_072478.1:1..108689991;seqid=NC_07247...
gene 61_869+ 36_813 name=AP1AR;id=gene-AP1AR;seqid=NC_072478.1;_gff=A...
Use
select()to select features with other criteria:
/home/docs/checkouts/readthedocs.org/user_builds/rnajena-sugar/envs/latest/lib/python3.12/site-packages/sugar/core/cane.py:88: UserWarning: Attribute "name" not present in Meta object
warnings.warn(f'Attribute "{key}" not present in {obj.__class__.__name__} object')
Use
slice()to select and slice by position:
>>> slfts = fts.slice(70_000, 80_000)
>>> print(slfts)
region 70_000+ 10_000 name=6;id=NC_072478.1:1..108689991;seqid=NC_072478.1;_...
gene 70_000+ 10_000 name=AP1AR;id=gene-AP1AR;seqid=NC_072478.1;_gff=Attr(I...
mRNA 70_000+ 10_000 name=XM_054717066.1;id=rna-XM_054717066.1;seqid=NC_072...
exon 75_477+ 49 id=exon-XM_054717066.1-2;seqid=NC_072478.1;_gff=Attr(I...
exon 76_998+ 27 id=exon-XM_054717066.1-3;seqid=NC_072478.1;_gff=Attr(I...
CDS 75_477+ 49 name=XP_054573041.1;id=cds-XP_054573041.1;seqid=NC_072...
76_998+ 27
Selecting by slice may result in features with defects, i.e. the feature locations do not span the entire feature:
>>> print(slfts[0].loc) # Show the first location of the first feature
Location(70000, 80000, strand='+', defect=3)
>>> slfts[0].loc.defect
<Defect.MISS_LEFT|MISS_RIGHT: 3>
Other useful methods
To sort features, use the FeatureList.sort() method, e.g. to sort by id of the
corresponding sequence, use fts.sort('seqid').
The following example sorts by length:
>>> print(fts.sort(len)[:3]) # Sort is in-place by default
exon 80_391+ 26 id=exon-XM_054717066.1-4;seqid=NC_072478.1;_gff=Attr(ID='exo...
exon 76_998+ 27 id=exon-XM_054717066.1-3;seqid=NC_072478.1;_gff=Attr(ID='exo...
exon 75_477+ 49 id=exon-XM_054717066.1-2;seqid=NC_072478.1;_gff=Attr(ID='exo...
The FeatureList.tolists(), topandas() and frompandas() methods
can be handy in some cases:
>>> from sugar import read_fts
>>> fts = read_fts().select('cDNA_match')
>>> for record in fts.tolists('type start strand len'):
... print(record)
...
['cDNA_match', 101888622, <Strand.REVERSE: '-'>, 4245]
['cDNA_match', 103140200, <Strand.REVERSE: '-'>, 30745]
['cDNA_match', 103944892, <Strand.REVERSE: '-'>, 7136]
['cDNA_match', 107859806, <Strand.REVERSE: '-'>, 2392]
>>> print(fts.topandas())
type start stop strand
0 cDNA_match 101888622 101892867 -
1 cDNA_match 103140200 103170945 -
2 cDNA_match 103944892 103952028 -
3 cDNA_match 107859806 107862198 -
See also the advanced example in the Sequences Tutorial.
Data model of the FeatureList and Feature classes
Attributes marked with an asterisk are accessible directly from the feature object.
Associate features
You can associate features with sequences using the BioBasket.add_fts() methods,
or by setting the BioBasket.fts attribute directly.
For example, if you have a FASTA file and a GFF file with the corresponding features,
you can do the following:
>>> seqs = read('AF086833.fasta')
>>> fts = read_fts('AF086833.gff')
>>> seqs.fts = fts
The last line associates the features to the correct sequences in the BioSeq.meta.fts attribute
(also accessible via BioSeq.fts).
If you want to write sequences and features in separate files,
just use the seqs.write() and seqs.fts.write() methods.