Sequence features
=================

Sequence features, or annotations, can be handled with the `.Feature` and `.FeatureList` classes.
To read features, use the `.read_fts()` routine.
It offers the same conveniences as the sequence `.read()` function.
To write features, use the `.FeatureList.write()` method.
The following feature formats are supported out of the box:

.. include:: autogenerated_format_table_fts.rst

The following example loads a BLAST file and writes the hit locations to a GFF file::

    >>> from sugar import read_fts
    >>> fts = read_fts('hits.blastn')
    >>> fts.write('blast_hits.gff')

Printing a FeatureList shows lines like this::

    {type} {start}{strand} {len} {meta}

The following example prints the first 5 features of the sample GFF file included in the package.
This file can be read by calling `.read_fts()` without any arguments.

.. runblock:: pycon

    >>> from sugar import read_fts
    >>> fts = read_fts()
    >>> print(fts[:5])

.. note::
    Location positions are Python-like 0-based numbers.

.. rubric:: Selecting features

Features can be selected in several ways.

1. Slice the `.FeatureList`:

   .. runblock:: pycon

       >>> from sugar import read_fts
       >>> fts = read_fts()
       >>> print(fts[:2])  # Select the first two features

2. Use `~.FeatureList.select()` to select features of a particular type:

   .. runblock:: pycon

       >>> from sugar import read_fts; fts = read_fts()  # ignore
       >>> print(fts.select('mRNA'))  # Select the mRNA type features
       >>> print(fts.select(['gene', 'region']))  # Select the features of type gene and region

3. Use `~.FeatureList.select()` to select features by other criteria:

   .. runblock:: pycon

       >>> from sugar import read_fts; fts = read_fts()  # ignore
       >>> print(fts.select('exon', len_gt=1000))  # Select exons longer than 1000 nucleotides
       >>> print(fts.select(name_eq='LOC103298147'))  # Selection based on name
       >>> print(fts.select(id_in=['exon-XM_054717066.1-6',
       ...                         'exon-XM_054717066.1-7']))  # Selection based on ids

4. Use `~.FeatureList.slice()` to select and slice by position:

   .. runblock:: pycon

       >>> from sugar import read_fts; fts = read_fts()  # ignore
       >>> slfts = fts.slice(70_000, 80_000)
       >>> print(slfts)

   Slicing may produce features with defects, i.e. features whose locations no longer span the entire feature:

   .. runblock:: pycon

       >>> from sugar import read_fts; fts = read_fts()  # ignore
       >>> slfts = fts.slice(70_000, 80_000)  # ignore
       >>> print(slfts[0].loc)  # Show the first location of the first feature
       >>> slfts[0].loc.defect

.. rubric:: Other useful methods

To sort features, use the `.FeatureList.sort()` method.
For example, to sort by the id of the corresponding sequence, use ``fts.sort('seqid')``.
The following example sorts by length:

.. runblock:: pycon

    >>> from sugar import read_fts; fts = read_fts()  # ignore
    >>> print(fts.sort(len)[:3])  # Sort is in-place by default

The `.FeatureList.tolists()`, `~.FeatureList.topandas()` and `~.FeatureList.frompandas()` methods convert a feature list to and from other data structures:

.. runblock:: pycon

    >>> from sugar import read_fts
    >>> fts = read_fts().select('cDNA_match')
    >>> for record in fts.tolists('type start strand len'):
    ...     print(record)
    >>> print(fts.topandas())

See also the advanced example in the :doc:`Sequences Tutorial `.

.. rubric:: Data model of the ``FeatureList`` and ``Feature`` classes

.. figure:: ../_static/datamodel_fts.svg
   :align: center
   :figclass: only-light
   :width: 90%

.. figure:: ../_static/datamodel_fts_dark.svg
   :align: center
   :figclass: only-dark
   :width: 90%

Attributes marked with an asterisk are accessible directly from the feature object.

.. rubric:: Associate features

You can associate features with sequences using the `.BioBasket.add_fts()` method,
or by setting the `.BioBasket.fts` attribute directly.
For example, if you have a FASTA file and a GFF file with the corresponding features, you can do the following:

>>> from sugar import read, read_fts
>>> seqs = read('AF086833.fasta')
>>> fts = read_fts('AF086833.gff')
>>> seqs.fts = fts

The last line associates the features with the correct sequences in the ``BioSeq.meta.fts`` attribute
(also accessible via ``BioSeq.fts``).
If you want to write sequences and features to separate files, use the ``seqs.write()`` and ``seqs.fts.write()`` methods.
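
As a conceptual illustration of the association step described above, here is a minimal sketch that does not use the library.
It assumes that each feature carries the id of its sequence (the ``seqid`` mentioned for sorting) and that matching is done on that id.
The plain dictionaries, the sample data and the ``associate`` helper are hypothetical stand-ins, not the library's implementation:

```python
from collections import defaultdict


def associate(seq_ids, features):
    """Group features by the id of the sequence they belong to.

    Hypothetical helper for illustration only; the library may work differently.
    """
    by_seqid = defaultdict(list)
    for ft in features:
        by_seqid[ft["seqid"]].append(ft)
    # Sequences without any matching feature get an empty list
    return {seq_id: by_seqid.get(seq_id, []) for seq_id in seq_ids}


# Made-up example data, positions are 0-based
features = [
    {"seqid": "seq1", "type": "gene", "start": 0, "stop": 90},
    {"seqid": "seq2", "type": "exon", "start": 10, "stop": 40},
    {"seqid": "seq1", "type": "exon", "start": 5, "stop": 30},
]
grouped = associate(["seq1", "seq2", "seq3"], features)
for seq_id, fts_of_seq in grouped.items():
    print(seq_id, [ft["type"] for ft in fts_of_seq])
```

In this sketch, each sequence id ends up with exactly the features that name it, in their original order, which is the behaviour the ``seqs.fts = fts`` assignment is described as providing.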