Sequence features¶

Sequence features, respective annotations, can be handled with the Feature and FeatureList classes. To read features, use the read_fts() routine. It has the same gimmicks as the sequence read() function. To write features, use the FeatureList.write() method. The following feature formats are supported out of the box:

format	module	read	write	description
gff	`sugar._io.gff`	`✅`	`✅`	Generic feature format (GFF) and Gene transfer format (GTF) IO
gtf	`sugar._io.gff`	`✅`	`✅`	Generic feature format (GFF) and Gene transfer format (GTF) IO
genbank	`sugar._io.genbank`	`✅`		GenBank reader
embl	`sugar._io.embl`	`✅`		EMBL flat file reader for ENA and UniProt
infernal	`sugar._io.tab.infernal`	`✅`		Infernal reader for output generated with tblout fmt 1, 2, 3
mmseqs	`sugar._io.tab.mmseqs`	`✅`		MMseqs2 reader for output generated with option fmtmode 4 (preferred) or 0
meme_txt	`sugar._io.meme`	`✅`		Read support for file formats associated with the MEME Suite
blast	`sugar._io.tab.blast`	`✅`		BLAST reader for output generated with outfmt 7 (preferred), 6 or 10
tsv	`sugar._io.tab.xsv`	`✅`	`✅`	TSV, CSV and XSV file formats IO
csv	`sugar._io.tab.xsv`	`✅`	`✅`	TSV, CSV and XSV file formats IO
sjson	`sugar._io.sjson`	`✅`	`✅`	SJson IO, custom lossless sugar format

The following example loads a BLAST file and writes the hit locations to a GFF file:

>>> from sugar import read_fts
>>> fts = read_fts('hits.blastn')
>>> fts.write('blast_hits.gff')

Printing a FeatureList shows lines like this:

{type} {start}{strand} {len} {meta}

The following example prints the the first 5 features of the sample GFF file included in the package. This file can be read by calling .read_fts() without any arguments.

>>> from sugar import read_fts
>>> fts = read_fts()
>>> print(fts[:5])
    region      0+ 108_689_991  name=6;id=NC_072478.1:1..108689991;seqid=NC_0...
pseudogene 43_129-         497  name=LOC103298147;id=gene-LOC103298147;seqid=...
      exon 43_129-         497  id=id-LOC103298147;seqid=NC_072478.1;_gff=Att...
      gene 61_869+      36_813  name=AP1AR;id=gene-AP1AR;seqid=NC_072478.1;_g...
      mRNA 61_869+      36_813  name=XM_054717066.1;id=rna-XM_054717066.1;seq...

Note

Location positions are Python-like 0-based numbers.

Selecting features

Features can be selected in several ways.

Selecting from the FeatureList by slicing:

>>> from sugar import read_fts
>>> fts = read_fts()
>>> print(fts[:2])  # Select the first two features
    region      0+ 108_689_991  name=6;id=NC_072478.1:1..108689991;seqid=NC_0...
pseudogene 43_129-         497  name=LOC103298147;id=gene-LOC103298147;seqid=...

Use select() to select features of a particular type:

>>> print(fts.select('mRNA'))  # Select the mRNA type features
mRNA 61_869+ 36_813  name=XM_054717066.1;id=rna-XM_054717066.1;seqid=NC_07247...
>>> print(fts.select(['gene', 'region']))  # Select the features of type gene and region
region      0+ 108_689_991  name=6;id=NC_072478.1:1..108689991;seqid=NC_07247...
  gene 61_869+      36_813  name=AP1AR;id=gene-AP1AR;seqid=NC_072478.1;_gff=A...

Use select() to select features with other criteria:

/home/docs/checkouts/readthedocs.org/user_builds/rnajena-sugar/envs/latest/lib/python3.12/site-packages/sugar/core/cane.py:88: UserWarning: Attribute "name" not present in Meta object
  warnings.warn(f'Attribute "{key}" not present in {obj.__class__.__name__} object')

Use slice() to select and slice by position:

>>> slfts = fts.slice(70_000, 80_000)
>>> print(slfts)
region 70_000+ 10_000  name=6;id=NC_072478.1:1..108689991;seqid=NC_072478.1;_...
  gene 70_000+ 10_000  name=AP1AR;id=gene-AP1AR;seqid=NC_072478.1;_gff=Attr(I...
  mRNA 70_000+ 10_000  name=XM_054717066.1;id=rna-XM_054717066.1;seqid=NC_072...
  exon 75_477+     49  id=exon-XM_054717066.1-2;seqid=NC_072478.1;_gff=Attr(I...
  exon 76_998+     27  id=exon-XM_054717066.1-3;seqid=NC_072478.1;_gff=Attr(I...
   CDS 75_477+     49  name=XP_054573041.1;id=cds-XP_054573041.1;seqid=NC_072...
       76_998+     27  

Selecting by slice may result in features with defects, i.e. the feature locations do not span the entire feature:

>>> print(slfts[0].loc)  # Show the first location of the first feature
Location(70000, 80000, strand='+', defect=3)
>>> slfts[0].loc.defect
<Defect.MISS_LEFT|MISS_RIGHT: 3>

Other useful methods

To sort features, use the FeatureList.sort() method, e.g. to sort by id of the corresponding sequence, use fts.sort('seqid'). The following example sorts by length:

>>> print(fts.sort(len)[:3])  # Sort is in-place by default
exon 80_391+ 26  id=exon-XM_054717066.1-4;seqid=NC_072478.1;_gff=Attr(ID='exo...
exon 76_998+ 27  id=exon-XM_054717066.1-3;seqid=NC_072478.1;_gff=Attr(ID='exo...
exon 75_477+ 49  id=exon-XM_054717066.1-2;seqid=NC_072478.1;_gff=Attr(ID='exo...

The FeatureList.tolists(), topandas() and frompandas() methods can be handy in some cases:

>>> from sugar import read_fts
>>> fts = read_fts().select('cDNA_match')
>>> for record in fts.tolists('type start strand len'):
...     print(record)
... 
['cDNA_match', 101888622, <Strand.REVERSE: '-'>, 4245]
['cDNA_match', 103140200, <Strand.REVERSE: '-'>, 30745]
['cDNA_match', 103944892, <Strand.REVERSE: '-'>, 7136]
['cDNA_match', 107859806, <Strand.REVERSE: '-'>, 2392]
>>> print(fts.topandas())
         type      start       stop strand
0  cDNA_match  101888622  101892867      -
1  cDNA_match  103140200  103170945      -
2  cDNA_match  103944892  103952028      -
3  cDNA_match  107859806  107862198      -

See also the advanced example in the Sequences Tutorial.

Data model of the FeatureList and Feature classes

Attributes marked with an asterisk are accessible directly from the feature object.

Associate features

You can associate features with sequences using the BioBasket.add_fts() methods, or by setting the BioBasket.fts attribute directly. For example, if you have a FASTA file and a GFF file with the corresponding features, you can do the following:

>>> seqs = read('AF086833.fasta')
>>> fts = read_fts('AF086833.gff')
>>> seqs.fts = fts

The last line associates the features to the correct sequences in the BioSeq.meta.fts attribute (also accessible via BioSeq.fts). If you want to write sequences and features in separate files, just use the seqs.write() and seqs.fts.write() methods.