Sequence features

Sequence features, respective annotations, can be handled with the Feature and FeatureList classes. To read features, use the read_fts() routine. It has the same gimmicks as the sequence read() function. To write features, use the FeatureList.write() method. The following feature formats are supported out of the box:

format

module

read

write

description

gff

sugar._io.gff

Generic feature format (GFF) and Gene transfer format (GTF) IO

gtf

sugar._io.gff

Generic feature format (GFF) and Gene transfer format (GTF) IO

genbank

sugar._io.genbank

GenBank reader

embl

sugar._io.embl

EMBL flat file reader for ENA and UniProt

infernal

sugar._io.tab.infernal

Infernal reader for output generated with tblout fmt 1, 2, 3

mmseqs

sugar._io.tab.mmseqs

MMseqs2 reader for output generated with option fmtmode 4 (preferred) or 0

meme_txt

sugar._io.meme

Read support for file formats associated with the MEME Suite

blast

sugar._io.tab.blast

BLAST reader for output generated with outfmt 7 (preferred), 6 or 10

tsv

sugar._io.tab.xsv

TSV, CSV and XSV file formats IO

csv

sugar._io.tab.xsv

TSV, CSV and XSV file formats IO

sjson

sugar._io.sjson

SJson IO, custom lossless sugar format

The following example loads a BLAST file and writes the hit locations to a GFF file:

>>> from sugar import read_fts
>>> fts = read_fts('hits.blastn')
>>> fts.write('blast_hits.gff')

Printing a FeatureList shows lines like this:

{type} {start}{strand} {len} {meta}

The following example prints the the first 5 features of the sample GFF file included in the package. This file can be read by calling .read_fts() without any arguments.

>>> from sugar import read_fts
>>> fts = read_fts()
>>> print(fts[:5])
    region      0+ 108_689_991  name=6;id=NC_072478.1:1..108689991;seqid=NC_0...
pseudogene 43_129-         497  name=LOC103298147;id=gene-LOC103298147;seqid=...
      exon 43_129-         497  id=id-LOC103298147;seqid=NC_072478.1;_gff=Att...
      gene 61_869+      36_813  name=AP1AR;id=gene-AP1AR;seqid=NC_072478.1;_g...
      mRNA 61_869+      36_813  name=XM_054717066.1;id=rna-XM_054717066.1;seq...

Note

Location positions are Python-like 0-based numbers.

Selecting features

Features can be selected in several ways.

  1. Selecting from the FeatureList by slicing:

>>> from sugar import read_fts
>>> fts = read_fts()
>>> print(fts[:2])  # Select the first two features
    region      0+ 108_689_991  name=6;id=NC_072478.1:1..108689991;seqid=NC_0...
pseudogene 43_129-         497  name=LOC103298147;id=gene-LOC103298147;seqid=...
  1. Use select() to select features of a particular type:

>>> print(fts.select('mRNA'))  # Select the mRNA type features
mRNA 61_869+ 36_813  name=XM_054717066.1;id=rna-XM_054717066.1;seqid=NC_07247...
>>> print(fts.select(['gene', 'region']))  # Select the features of type gene and region
region      0+ 108_689_991  name=6;id=NC_072478.1:1..108689991;seqid=NC_07247...
  gene 61_869+      36_813  name=AP1AR;id=gene-AP1AR;seqid=NC_072478.1;_gff=A...
  1. Use select() to select features with other criteria:

/home/docs/checkouts/readthedocs.org/user_builds/rnajena-sugar/envs/latest/lib/python3.12/site-packages/sugar/core/cane.py:88: UserWarning: Attribute "name" not present in Meta object
  warnings.warn(f'Attribute "{key}" not present in {obj.__class__.__name__} object')
  1. Use slice() to select and slice by position:

>>> slfts = fts.slice(70_000, 80_000)
>>> print(slfts)
region 70_000+ 10_000  name=6;id=NC_072478.1:1..108689991;seqid=NC_072478.1;_...
  gene 70_000+ 10_000  name=AP1AR;id=gene-AP1AR;seqid=NC_072478.1;_gff=Attr(I...
  mRNA 70_000+ 10_000  name=XM_054717066.1;id=rna-XM_054717066.1;seqid=NC_072...
  exon 75_477+     49  id=exon-XM_054717066.1-2;seqid=NC_072478.1;_gff=Attr(I...
  exon 76_998+     27  id=exon-XM_054717066.1-3;seqid=NC_072478.1;_gff=Attr(I...
   CDS 75_477+     49  name=XP_054573041.1;id=cds-XP_054573041.1;seqid=NC_072...
       76_998+     27  

Selecting by slice may result in features with defects, i.e. the feature locations do not span the entire feature:

>>> print(slfts[0].loc)  # Show the first location of the first feature
Location(70000, 80000, strand='+', defect=3)
>>> slfts[0].loc.defect
<Defect.MISS_LEFT|MISS_RIGHT: 3>

Other useful methods

To sort features, use the FeatureList.sort() method, e.g. to sort by id of the corresponding sequence, use fts.sort('seqid'). The following example sorts by length:

>>> print(fts.sort(len)[:3])  # Sort is in-place by default
exon 80_391+ 26  id=exon-XM_054717066.1-4;seqid=NC_072478.1;_gff=Attr(ID='exo...
exon 76_998+ 27  id=exon-XM_054717066.1-3;seqid=NC_072478.1;_gff=Attr(ID='exo...
exon 75_477+ 49  id=exon-XM_054717066.1-2;seqid=NC_072478.1;_gff=Attr(ID='exo...

The FeatureList.tolists(), topandas() and frompandas() methods can be handy in some cases:

>>> from sugar import read_fts
>>> fts = read_fts().select('cDNA_match')
>>> for record in fts.tolists('type start strand len'):
...     print(record)
... 
['cDNA_match', 101888622, <Strand.REVERSE: '-'>, 4245]
['cDNA_match', 103140200, <Strand.REVERSE: '-'>, 30745]
['cDNA_match', 103944892, <Strand.REVERSE: '-'>, 7136]
['cDNA_match', 107859806, <Strand.REVERSE: '-'>, 2392]
>>> print(fts.topandas())
         type      start       stop strand
0  cDNA_match  101888622  101892867      -
1  cDNA_match  103140200  103170945      -
2  cDNA_match  103944892  103952028      -
3  cDNA_match  107859806  107862198      -

See also the advanced example in the Sequences Tutorial.

Data model of the FeatureList and Feature classes

../_images/datamodel_fts.svg
../_images/datamodel_fts_dark.svg

Attributes marked with an asterisk are accessible directly from the feature object.

Associate features

You can associate features with sequences using the BioBasket.add_fts() methods, or by setting the BioBasket.fts attribute directly. For example, if you have a FASTA file and a GFF file with the corresponding features, you can do the following:

>>> seqs = read('AF086833.fasta')
>>> fts = read_fts('AF086833.gff')
>>> seqs.fts = fts

The last line associates the features to the correct sequences in the BioSeq.meta.fts attribute (also accessible via BioSeq.fts). If you want to write sequences and features in separate files, just use the seqs.write() and seqs.fts.write() methods.