Sequence features
=================

Sequence features, also called annotations,
can be handled with the `.Feature` and `.FeatureList` classes.
To read features, use the `.read_fts()` routine.
It offers the same conveniences as the sequence `.read()` function.
To write features, use the `.FeatureList.write()` method.
The following feature formats are supported out of the box:

.. include:: autogenerated_format_table_fts.rst

The following example loads a BLAST file and writes the hit locations to a GFF file::

    >>> from sugar import read_fts
    >>> fts = read_fts('hits.blastn')
    >>> fts.write('blast_hits.gff')

Printing a FeatureList shows lines like this::

    {type} {start}{strand} {len} {meta}

The following example prints the first 5 features of the sample GFF file included in the package.
This file can be read by calling `.read_fts()` without any arguments.

.. runblock:: pycon

    >>> from sugar import read_fts
    >>> fts = read_fts()
    >>> print(fts[:5])

.. note::
    Location positions are Python-like 0-based numbers.
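
As an illustration of what 0-based positions mean in practice, the following sketch converts the
1-based, inclusive coordinates used in GFF files to 0-based numbers.
It assumes half-open intervals as used by Python slices;
the helper name ``gff_to_python`` is hypothetical and not part of the package.

.. code-block:: python

    def gff_to_python(start, end):
        """Convert 1-based inclusive GFF coordinates to a 0-based half-open interval."""
        return start - 1, end

    seq = 'ACGTACGTAC'
    start, stop = gff_to_python(3, 6)  # GFF feature covering bases 3 to 6
    print(start, stop)      # 2 6
    print(seq[start:stop])  # GTAC

Under this assumption the length of a feature is simply ``stop - start``.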


.. rubric:: Selecting features

Features can be selected in several ways.

1. Selecting from the `.FeatureList` by slicing:

.. runblock:: pycon

    >>> from sugar import read_fts
    >>> fts = read_fts()
    >>> print(fts[:2])  # Select the first two features

2. Use `~.FeatureList.select()` to select features of a particular type:

.. runblock:: pycon

    >>> from sugar import read_fts; fts = read_fts()  # ignore
    >>> print(fts.select('mRNA'))  # Select the mRNA type features
    >>> print(fts.select(['gene', 'region']))  # Select the features of type gene and region

3. Use `~.FeatureList.select()` to select features with other criteria:

.. runblock:: pycon

    >>> from sugar import read_fts; fts = read_fts()  # ignore
    >>> print(fts.select('exon', len_gt=1000))  # Select exons longer than 1000 nucleotides
    >>> print(fts.select(name_eq='LOC103298147'))  # Selection based on name
    >>> print(fts.select(id_in=['exon-XM_054717066.1-6',
    ...                         'exon-XM_054717066.1-7']))  # Selection based on ids

4. Use `~.FeatureList.slice()` to select and slice by position:

.. runblock:: pycon

    >>> from sugar import read_fts; fts = read_fts()  # ignore
    >>> slfts = fts.slice(70_000, 80_000)
    >>> print(slfts)
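
The keyword arguments passed to ``select()`` in the examples above follow an
``{attribute}_{operator}`` pattern, such as ``len_gt`` or ``id_in``.
The sketch below shows one way such criteria could be evaluated on plain dictionaries.
It is not the package's implementation; the ``select_records`` helper and the
operator table are assumptions made for illustration.

.. code-block:: python

    import operator

    # Hypothetical operator table, limited to the suffixes used in the examples above
    OPS = {
        'eq': operator.eq,
        'gt': operator.gt,
        'in': lambda value, options: value in options,
    }

    def select_records(records, **criteria):
        """Return the records that satisfy all ``{key}_{op}=value`` criteria."""
        selected = []
        for record in records:
            for name, value in criteria.items():
                # Split 'len_gt' into the key 'len' and the operator 'gt'
                key, _, op = name.rpartition('_')
                if not OPS[op](record[key], value):
                    break
            else:
                # No criterion failed, keep the record
                selected.append(record)
        return selected

    # Made-up records, for illustration only
    records = [
        {'type': 'exon', 'len': 1500, 'id': 'e1'},
        {'type': 'exon', 'len': 300, 'id': 'e2'},
        {'type': 'gene', 'len': 5000, 'id': 'g1'},
    ]
    long_exons = select_records(records, type_eq='exon', len_gt=1000)
    print([r['id'] for r in long_exons])  # ['e1']

All criteria are combined with a logical AND, which matches how the examples above read.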

Selecting by slice may result in features with defects,
i.e. the feature locations do not span the entire feature:

.. runblock:: pycon

    >>> from sugar import read_fts; fts = read_fts()  # ignore
    >>> slfts = fts.slice(70_000, 80_000)  # ignore
    >>> print(slfts[0].loc)  # Show the first location of the first feature
    >>> slfts[0].loc.defect
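
To make the idea of a defect concrete, here is a minimal sketch that clips a location to a window
and records which ends were cut off.
It assumes 0-based, half-open intervals; the ``clip`` function and the
``cut_left``/``cut_right`` flags are hypothetical and do not mirror the attributes
of the package's defect object.

.. code-block:: python

    def clip(start, stop, win_start, win_stop):
        """Clip the interval [start, stop) to the window [win_start, win_stop).

        Returns None if the interval lies outside the window, otherwise a dict
        with the clipped interval and flags for the ends that were cut off.
        """
        if stop <= win_start or start >= win_stop:
            return None
        return {
            'start': max(start, win_start),
            'stop': min(stop, win_stop),
            'cut_left': start < win_start,
            'cut_right': stop > win_stop,
        }

    # A made-up feature spanning 65,000 to 72,000, sliced to the window 70,000 to 80,000
    clipped = clip(65_000, 72_000, 70_000, 80_000)
    print(clipped)

Here the clipped location starts at the window border and ``cut_left`` is set,
signalling that the location no longer covers the start of the original feature.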


.. rubric:: Other useful methods

To sort features, use the `.FeatureList.sort()` method.
For example, ``fts.sort('seqid')`` sorts by the id of the corresponding sequence.
The following example sorts by length:

.. runblock:: pycon

    >>> from sugar import read_fts; fts = read_fts()  # ignore
    >>> print(fts.sort(len)[:3])  # Sort is in-place by default

The `.FeatureList.tolists()`, `~.FeatureList.topandas()` and `~.FeatureList.frompandas()` methods
convert features to plain lists and to or from a pandas ``DataFrame``:

.. runblock:: pycon

    >>> from sugar import read_fts
    >>> fts = read_fts().select('cDNA_match')
    >>> for record in fts.tolists('type start strand len'):
    ...     print(record)

    >>> print(fts.topandas())
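
The space-separated string passed to ``tolists()`` above names the values to extract from each feature.
A rough pure-Python analogue operating on dictionaries could look like the following sketch;
the ``to_lists`` helper is hypothetical and only illustrates the idea.

.. code-block:: python

    def to_lists(records, keys):
        """Yield one list per record with the values named in the ``keys`` string."""
        names = keys.split()
        for record in records:
            yield [record[name] for name in names]

    # Made-up records, for illustration only
    records = [
        {'type': 'cDNA_match', 'start': 100, 'strand': '+', 'len': 50},
        {'type': 'cDNA_match', 'start': 400, 'strand': '-', 'len': 80},
    ]
    rows = list(to_lists(records, 'type start strand len'))
    for row in rows:
        print(row)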

See also the advanced example in the :doc:`Sequences Tutorial <tutorial_seqs>`.


.. rubric:: Data model of the ``FeatureList`` and ``Feature`` classes

.. figure:: ../_static/datamodel_fts.svg
   :align: center
   :figclass: only-light
   :width: 90%


.. figure:: ../_static/datamodel_fts_dark.svg
   :align: center
   :figclass: only-dark
   :width: 90%

Attributes marked with an asterisk are accessible directly from the feature object.


.. rubric:: Associate features

You can associate features with sequences using the `.BioBasket.add_fts()` method,
or by setting the `.BioBasket.fts` attribute directly.
For example, if you have a FASTA file and a GFF file with the corresponding features,
you can do the following:

>>> from sugar import read, read_fts
>>> seqs = read('AF086833.fasta')
>>> fts = read_fts('AF086833.gff')
>>> seqs.fts = fts

The last line attaches each feature to its sequence; the features are stored in the ``BioSeq.meta.fts`` attribute
(also accessible via ``BioSeq.fts``).
If you want to write sequences and features to separate files,
use the ``seqs.write()`` and ``seqs.fts.write()`` methods.
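
Conceptually, associating features with sequences amounts to grouping the features by sequence id.
The following sketch illustrates this with plain dictionaries.
It assumes that each feature carries a ``seqid`` matching the id of its sequence;
the ``group_by_seqid`` helper and the sample data are hypothetical and not part of the package.

.. code-block:: python

    from collections import defaultdict

    def group_by_seqid(features):
        """Group feature records by the id of the sequence they belong to."""
        groups = defaultdict(list)
        for feature in features:
            groups[feature['seqid']].append(feature)
        return dict(groups)

    # Made-up features and sequences, for illustration only
    features = [
        {'seqid': 'seq1', 'type': 'gene', 'start': 5},
        {'seqid': 'seq1', 'type': 'CDS', 'start': 20},
        {'seqid': 'seq2', 'type': 'gene', 'start': 10},
    ]
    sequences = {'seq1': {'data': 'ACGT', 'fts': []}}

    groups = group_by_seqid(features)
    for seqid, seq in sequences.items():
        # Sequences without matching features get an empty list
        seq['fts'] = groups.get(seqid, [])
    print(len(sequences['seq1']['fts']))  # 2

Features whose ``seqid`` matches no sequence, such as the ``seq2`` record here, are simply left unattached in this sketch.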