Handling DNA/RNA sequences
==========================

The core of sugar are the sequence handling classes `.BioSeq` and `.BioBasket`.
The `.BioSeq` class behaves like a string with useful bioinformatics methods attached.
The `.BioBasket` class is a container for multiple `.BioSeq` objects and
behaves like a list with useful methods attached.
An example of such a useful method is certainly the `~.BioBasket.translate()` method.

To read sequences, use the powerful `~._io.main.read()` routine.
It can handle glob expressions, web resources, archives,
and automatically detects file formats by inspecting the file contents.
To write sequences to files, use the `.BioBasket.write()` method.
When writing, the format can be automatically detected from the file extension.

.. note::
   If you don't have any example files on hand:
   All of the local example files used in the tutorial are distributed with the sugar package.
   You can copy them into the current folder using the ``sugar tutorial`` command.

The following code loads two local files.
The format of the first file is specified explicitly,
the format of the second file is autodetected (Stockholm).
Gaps are removed from the alignment using a string method.
The original ``ali`` object is left untouched.
Then, a file containing the sequence from the FASTA file and
the first two sequences from the alignment is created. ::

    >>> from sugar import read
    >>> seqs = read('AF086833.fasta', 'fasta')
    >>> ali = read('pesti.stk')
    >>> cleaned = ali.copy().str.replace('-', '')
    >>> combined = seqs + cleaned[:2]
    >>> combined.write('combined.fasta')

Calling `.read()` without any arguments returns sample sequences.

.. runblock:: pycon

    >>> from sugar import read  # ignore
    >>> seqs = read()
    >>> print(seqs)


.. rubric:: Available file formats

Since sugar uses a plugin system, it is easy to add support for new file formats on the fly.
The following sequence file formats are supported out of the box:

.. include:: autogenerated_format_table_seqs.rst

The table links to the used modules and function.


.. rubric:: Architecture of the ``BioBasket`` and ``BioSeq`` classes

In the following we explore the `.BioSeq` and `.BioBasket` objects.

.. runblock:: pycon

    >>> from sugar import read
    >>> seqs = read()
    >>> print('ids', seqs.ids)
    >>> print(seqs)
    >>> type(seqs)
    >>> type(seqs[0])
    >>> seq1 = seqs[0]
    >>> seq2 = seqs.d['AB677533']  # Select sequence by id
    >>> print(f'First sequence {seq1.id} starts with {seq1[:10]}.')
    >>> print(f'Metadata:\n{seq1.meta}')
    >>> print('Metadata can be accessed with keys or as attributes:',
    ...       seq1.meta.id, seq1.meta['id'])

As you can see, the ``seqs`` object is an instance of the `.BioBasket` class,
which is basically a list of `.BioSeq` objects,
each containing a single sequence and its metadata.
Here, the sequences have been loaded from a GenBank file,
which can be inferred from the value of the ``meta._fmt`` attribute.
Each format plugin stores format-specific data
in the private attribute of its name when reading, here ``_genbank``.
This attribute can also be used by the corresponding format plugin to write format specific data.

The `.BioBasket` and `.BioSeq` classes have useful methods attached to them,
such as `~.BioBasket.rc()`, `~.BioBasket.translate()`, `~.BioBasket.match()`, or `~.BioBasket.find_orfs()`.
Also, many string methods are accessible via the ``str`` attribute,
e.g. `str.find()<._BioSeqStr.find()>` or `str.replace()<._BioSeqStr.replace()>`.
Note, that unlike the original string methods,
the methods attached to the ``str`` attribute work in-place, if applicable.

The general architecture of the `.BioBasket` and `.BioSeq` classes
is shown in the following diagram:

.. figure:: ../_static/datamodel_seq.svg
   :align: center
   :figclass: only-light
   :width: 90%

.. figure:: ../_static/datamodel_seq_dark.svg
   :align: center
   :figclass: only-dark
   :width: 90%

Attributes marked with an asterisk are directly accessible from the sequence object.


.. rubric:: Slicing ``BioBasket`` and ``BioSeq`` objects

Slicing is heavily overloaded for the `.BioBasket` and `.BioSeq` classes.

The `.BioSeq` class can take the following slicing arguments:
   * A normal slice notation, in which case a BioSeq object with sliced data is returned.
   * A `.Feature`, `.LocationTuple`, or `.Location` object,
     in which case the sequence will be sliced according to the given location.
   * A string, in which case the first Feature attached to the sequence of that type is used.

The `.BioBasket` class can take the following slicing arguments:
   * A normal slice notation, in which case the underlying list is sliced.
   * For strings, locations, and features, each sequence is sliced separately with the argument.
   * For len-2 slices, the first item selects the sequences,
     and the second item slices the sequences
     according to the rules above.

Some examples:

.. runblock:: pycon

    >>> from sugar import read
    >>> seqs = read()
    >>> seq = seqs[0]
    >>> print(seq[:10])  # prints the BioSeq object
    >>> print(seq.data[:10])  # use data attribute to get the string
    >>> loc = seq.fts[1].loc
    >>> seq_of_ft = seq[loc]
    >>> cds = seq['cds']

.. runblock:: pycon

    >>> from sugar import read  # ignore
    >>> seqs = read()  # ignore
    >>> print(seqs[0, :10])
    >>> print(seqs[:, :10])
    >>> print(seqs['cds'])

It is possible to specify options for these slicing operations.
This is done using the `.BioBasket` and `.BioSeq.sl()` methods.
The syntax is as follows: ``seq.sl(**kw)[slice]``.
The ``gap`` option can be used to account for gaps.
Use it when slices or features are specified relative to the sequences without gaps.
This is best demonstrated with an example:

.. runblock:: pycon

    >>> from sugar import read  # ignore
    >>> seqs = read()  # ignore
    >>> seqs[:1, :5] = '-----'  # replace 5 chars with gaps for first seq
    >>> print(seqs[:, :10])
    >>> print(seqs.sl(gap='-')[:, :10])

Some methods, including ``.sl()``, come with an ``update_fts`` option,
which updates the features to match the new manipulated sequences.


.. rubric:: An advanced example

The code is explained below.

.. runblock:: pycon

    >>> from sugar import read
    >>> resource = ('https://github.com/rnajena/anchorna/'
    ...             'raw/refs/heads/master/anchorna/tests/data/pesti55.gff.zip')
    >>> seqs = read(resource)  # 1
    >>> print(seqs)  # 2
    >>> print(seqs.fts[:5])  # 3
    >>> print(seqs[0].str.find('ATG'))  # 4
    >>> print(seqs[0].matchall('start')[:8])  # 5
    >>> orfs = seqs.find_orfs()  # 6
    >>> print(orfs[:5])
    >>> long_orfs = orfs.select(len_gt=500)
    >>> print(long_orfs[:5])
    >>> proteins = seqs['cds'].copy().translate()  # 7
    >>> print(proteins)
    >>> proteins.write('pesti_proteins.fasta')  # 8

Let us break up the code:

1. We `.read` an archive containing a GFF file with FASTA directive from an online resource.
   Sugar automatically downloads and extracts the archive (detected by file extension)
   and reads the file with the appropriate plugin (detected by content).
2. The content of ``seqs`` is printed, to print more than 20 lines, use e.g.
   ``print(seqs.tostr(h=100))`` to print 100 lines, use ``h=0`` to print all lines.
3. Features are attached to the object,
   we can print the features of the first sequence with ``print(seqs[0].fts)``.
   In the example, the first 5 features of all sequences are printed.
   Obviously, these mark the coding sequences.
4. We try to find the start codon using the ``str.find()`` method,
   but the first ATG string is found before the start of the CDS.
5. We use the more dedicated `~.BioSeq.matchall()` method to find possible start codons.
   In fact, one of the matches ``span=(385, 388) rf=1`` marks the start of the CDS according to the annotation data.
6. We use the `~.BioBasket.find_orfs()` method to find open reading frames.
   The first 5 ORFs found are printed, a CDS is not one of them.
   We could `~.FeatureList.sort` the ORFs by length,
   but we decided to `~.FeatureList.select` all ORFs which are longer than 500 nucleotides.
   Again, we print the first 5 ORFs longer than 500 nucleotides which coincide with the CDS
   of the first 5 sequences.
7. Finally, we translate the CDS of the sequences to proteins. Since the translation is an
   in-place operation, we create a copy to keep the original sequences in the ``seqs`` variable.
   Sugar methods generally return the calling object if applicable,
   so we were able to use method chaining.
   Note that instead of using the CDS annotation for slicing,
   we could have used the found ORFs::

      >>> seqs.fts = long_orfs
      >>> proteins = seqs['orf'].copy().translate()

8. We write the protein strings to a new FASTA file.
   The format is automatically detected from the file extension,
   if you want to be verbose, you can specify the format with
   ``proteins.write('pesti_proteins.fasta', 'fasta')``.