sugar.core.seq module

Sequence-related classes: BioSeq and BioBasket

class sugar.core.seq.BioSeq(data, id='', meta=None, type=None)[source]

Bases: object

Class holding sequence data and metadata, exposing bioinformatics methods.

Most methods work in-place by default, but return the BioSeq object again. Therefore, method chaining can be used.
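
The in-place-and-return pattern described above can be pictured with a minimal sketch. MiniSeq is a hypothetical stand-in written only for illustration; it is not the BioSeq implementation, and its complement table covers only unambiguous DNA letters.

```python
import copy


class MiniSeq:
    """Hypothetical stand-in for a sequence object (not the real BioSeq)."""

    # Assumption: only unambiguous DNA letters are complemented.
    _COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

    def __init__(self, data):
        self.data = data

    def reverse(self):
        self.data = self.data[::-1]  # modify the object in place ...
        return self                  # ... and return it to allow chaining

    def complement(self):
        self.data = self.data.translate(self._COMPLEMENT)
        return self

    def copy(self):
        return copy.deepcopy(self)


seq = MiniSeq("ATGC")
rc = seq.copy().reverse().complement()  # chain on a copy, seq stays untouched
same = seq.reverse()                    # no copy, so seq itself is modified
print(rc.data, seq.data, same is seq)
```

Calling copy() first is what keeps the original data intact; without it, every chained call changes the same object.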

classmethod frombiopython(obj)[source]

Create a BioSeq object from a biopython SeqRecord or Seq object.

Parameters:

obj – The object to convert.

Note

BioPython Features in the SeqRecord.features attribute are automatically converted.

classmethod frombiotite(obj)[source]

Create a BioSeq object from a biotite sequence object.

Parameters:

obj – The object to convert.

add_fts(fts)[source]

Add some features to the feature list.

If you want to set all features, use the BioSeq.fts attribute.

Parameters:

fts – features to add

complement()[source]

Complementary sequence, i.e. transcription

Note

This function works in place and modifies the data. If you want to keep the original data, use the copy() method first.

copy()[source]

Return a deep copy of the object

countall(*args, **kw)[source]
countplot(*args, hue=None, **kw)[source]
find_orfs(*args, **kw)[source]

Find ORFs in the sequence, see find_orfs()

match(*args, **kw)[source]

Search regex and return match, see match()

matchall(*args, **kw)[source]

Search regex and return BioMatchList with all matches, see match()

plot_ftsviewer(*args, **kw)[source]

Plot features of the sequence using DNAFeaturesViewer, see plot_ftsviewer()

Note

Using BioSeq.plot_ftsviewer() or BioBasket.plot_ftsviewer() instead of FeatureList.plot_ftsviewer() has the advantage that sequence lengths are used automatically.

rc(update_fts=False)[source]

Reverse complement, alias for BioSeq.reverse().complement()

reverse()[source]

Reverse the sequence

Note

This function works in place and modifies the data. If you want to keep the original data, use the copy() method first.

sl(**kw)[source]

Method that allows you to slice the BioSeq object with non-default options.

If you want to use the default options, you can slice the BioSeq object directly. For non-default options, slice the sliceable object returned by this method.

Parameters:
  • inplace (bool) – If True, the subsequence is returned and the original sequence is also modified in-place (default: False).

  • gap (str) – Gaps of the given characters are taken into account when slicing the sequence (default: gaps are not taken into account)

Slicing options:

The slice specifies which part of the sequence is returned, and is defined inside the square brackets []. The following types are supported.

int,slice

The location is specified by int or slice

Location

specified by location

Feature

specified by feature

str

Position of the first feature of the given type, e.g. 'cds' returns the subsequence of the first coding sequence.

Example:

>>> from sugar import read
>>> seq = read()[0]
>>> print(seq[:5])  # use direct slicing for default options
ACCTG
>>> print(seq[4])
G
>>> print(seq['cds'][:3])
ATG
>>> print(seq.sl(inplace=True, gap='-')[:5:2])  # non-default options
ACG
>>> print(seq)  # has been modified in-place
ACG

slindex(gap=None)[source]

Method that translates an index to account for gaps
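
One way to picture this translation is the following sketch. The function gapped_slice is a hypothetical helper, not part of the library, and it only handles non-negative start and stop values without a step. It assumes the translated slice ends right after the last selected residue; how the library treats trailing gaps is not stated here.

```python
def gapped_slice(data, start, stop, gap="-"):
    """Map the ungapped range [start, stop) onto the gapped string."""
    # Positions in data that hold a residue, i.e. not a gap character.
    positions = [i for i, char in enumerate(data) if char not in gap]
    if start >= stop or start >= len(positions):
        return slice(0, 0, None)
    stop = min(stop, len(positions))
    # Assumption: end right after the last selected residue.
    return slice(positions[start], positions[stop - 1] + 1, None)


data = "ATG---GGA"
index = gapped_slice(data, 1, 5)
print(index)        # slice(1, 8, None)
print(data[index])  # TG---GG
```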

Example:

>>> from sugar import BioSeq
>>> seq = BioSeq('ATG---GGA')
>>> print(seq)
ATG---GGA
>>> print(seq[1:5])
TG--
>>> print(seq.sl(gap='-')[1:5])
TG---GG
>>> print(seq.slindex(gap='-')[1:5])
slice(1, 8, None)
>>> print(seq[seq.slindex(gap='-')[1:5]])
TG---GG

tobiopython()[source]

Convert BioSeq to biopython SeqRecord instance

Attached BioSeq.fts features are automatically converted.

tobiotite(**kw)[source]

Convert BioSeq to biotite NucleotideSequence or ProteinSequence instance

Parameters:
  • type (str) – 'nt' creates a NucleotideSequence instance, 'aa' creates a ProteinSequence instance, by default the class is inferred from the sequence itself.

  • gap (str) – Gap characters that must be removed from the sequence string.

  • warn (bool) – Whether to warn if gap characters have been removed, default is True.

tofmtstr(fmt, **kw)[source]

Write object to a string of given format, see write()

toftsviewer(**kw)[source]

Convert features of this sequence to DNAFeaturesViewer GraphicRecord

See FeatureList.toftsviewer.

tostr(**kw)[source]

Return a nice string, see BioBasket.tostr()

translate(*args, update_fts=False, **kw)[source]

Translate nucleotide sequence to amino acid sequence, see translate().

The original translate method of the str class can be used via BioSeq.str.translate().

Note

This function works in place and modifies the data. If you want to keep the original data, use the copy() method first.

write(fname=None, fmt=None, **kw)[source]

Write sequence to file, see write()

data

Property holding the data string

property fts

Alias for BioSeq.meta.fts

The fts object holds all feature metadata. It is an instance of FeatureList.

property gc

GC content of the sequence
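
As a rough illustration of what such a property computes, here is a plain-Python sketch. The function gc_content is hypothetical, and it assumes that only the unambiguous letters A, C, G, T and U count towards the total; whether the library treats gaps and ambiguous letters the same way is not documented here.

```python
def gc_content(data):
    """Fraction of G and C among the unambiguous nucleotide letters."""
    data = data.upper()
    gc = data.count("G") + data.count("C")
    # Assumption: gaps and ambiguous letters are ignored in the total.
    total = gc + data.count("A") + data.count("T") + data.count("U")
    return gc / total if total else 0.0


print(gc_content("ATGC"))       # 0.5
print(gc_content("ATG---GGA"))  # 0.5
```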

property id

Alias for BioSeq.meta.id

meta

Property holding metadata

property str

Namespace holding all available string methods, see _BioSeqStr for available methods and str for documentation of the methods

Example:

>>> seq = read()[0]
>>> seq.str.find('ATG')  # Use string method
30

type

type of the sequence, either 'nt' or 'aa'

class sugar.core.seq.BioBasket(data=None, meta=None)[source]

Bases: UserList

Class holding a list of BioSeq objects

The BioBasket object can be used like a list. It has useful bioinformatics methods attached to it.

The list itself is stored in the data property. The BioBasket object may also have a metadata attribute.

classmethod frombiopython(obj)[source]

Create a BioBasket object from a list of biopython SeqRecord or Seq objects.

Parameters:

obj – The object to convert, can also be a MultipleSeqAlignment object.

Note

BioPython Features in the SeqRecord.features attribute are automatically converted.

classmethod frombiotite(obj)[source]

Create a BioBasket object from a list of biotite sequence objects.

Parameters:

obj – The object to convert, can also be a biotite Alignment object.

static fromfmtstr(in_, fmt=None, **kw)[source]

Read sequences from a string

add_fts(fts)[source]

Add some features to the feature list of the corresponding sequences.

If you want to set all features, use the BioBasket.fts attribute.

Parameters:

fts – features to add

complement()[source]

Complementary sequences, i.e. transcription

Note

This function works in place and modifies the data. If you want to keep the original data, use the copy() method first.

copy()[source]

Return a deep copy of the BioBasket object.

countall(rtype='counter', k=1)[source]

Count letters in sequences

This method may undergo disruptive changes, or it may be removed in a later release.

Parameters:

rtype

  • 'counter' Return Counter object

  • 'prob' Return dictionary with normalized counts

  • 'df' Return pandas DataFrame object with count, prob and tprob (total prob) fields

countplot(y='word', x='count', hue='id', order=None, plot='show', figsize=None, ax=None, savefigkw={}, **kw)[source]

Create a plot of letter counts

This method may undergo disruptive changes, or it may be removed in a later release.

Under the hood, this method uses the pandas and seaborn libraries. For help on most of the arguments, see seaborn.barplot().

find_orfs(*args, **kw)[source]

Find ORFs in sequences, see find_orfs()

groupby(keys=('id',), flatten=False)[source]

Group sequences

Parameters:

keys – Tuple of meta keys or functions to use for grouping. Can also be a single string or a callable. By default, the method groups only by id.

Returns:

Nested dict structure
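
The shape of such a nested dict can be sketched with plain dictionaries. The function group_nested and the record layout are hypothetical and only illustrate the structure. The leaves here are plain lists, whereas the values returned by groupby() are BioBasket objects.

```python
def group_nested(records, keys=("id",)):
    """Group dict records into a nested dict, one level per key."""
    if isinstance(keys, str) or callable(keys):
        keys = (keys,)  # accept a single string or callable
    grouped = {}
    for record in records:
        values = [key(record) if callable(key) else record[key] for key in keys]
        node = grouped
        for value in values[:-1]:
            node = node.setdefault(value, {})  # descend one level per key
        node.setdefault(values[-1], []).append(record)
    return grouped


records = [
    {"id": "a", "type": "nt"},
    {"id": "b", "type": "nt"},
    {"id": "a", "type": "aa"},
]
by_type_and_id = group_nested(records, keys=("type", "id"))
print(sorted(by_type_and_id))        # ['aa', 'nt']
print(sorted(by_type_and_id["nt"]))  # ['a', 'b']
```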

Example:

>>> from sugar import read
>>> seqs = read()
>>> grouped = seqs.groupby()

match(*args, **kw)[source]

Search regex and return BioMatchList of matches, see match()

matchall(*args, **kw)[source]

Search regex and return BioMatchList of all matches, see match()

merge(spacer='', update_fts=False, keys=('id',))[source]
plot_alignment(*args, **kw)[source]

Plot an alignment, see plot_alignment()

plot_ftsviewer(*args, **kw)[source]

Plot features of the sequences using DNAFeaturesViewer, see plot_ftsviewer()

Note

Using BioSeq.plot_ftsviewer() or BioBasket.plot_ftsviewer() instead of FeatureList.plot_ftsviewer() has the advantage that sequence lengths are used automatically.

rc(**kw)[source]

Reverse complement, alias for BioBasket.reverse().complement()

Note

This function works in place and modifies the data. If you want to keep the original data, use the copy() method first.

reverse(*args, **kw)[source]

Reverse sequences

Note

This function works in place and modifies the data. If you want to keep the original data, use the copy() method first.

select(inplace=False, **kw)[source]

Select sequences

Parameters:
  • **kw – All kwargs must be of the form key_op=value, where op is one of the operators from the operator module. Additionally, the operator 'in' (membership) is supported. The different select conditions are combined with the and operator. If you need or, call select twice and combine the results with the | operator, e.g. seqs.select(...) | seqs.select(...)

  • inplace – Whether to modify the original object (default: False)

Returns:

Selected sequences
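
The key_op=value convention can be illustrated with a small sketch built on the operator module. The function matches and the dict records are hypothetical and do not show how select() is implemented. The sketch requires every keyword to carry an explicit operator suffix.

```python
import operator


def matches(meta, **kw):
    """Return True if meta fulfils all key_op=value conditions."""
    for key_op, value in kw.items():
        key, _, op = key_op.rpartition("_")  # 'len_gt' -> 'len', 'gt'
        if op == "in":
            ok = meta[key] in value  # membership is handled separately
        else:
            ok = getattr(operator, op)(meta[key], value)
        if not ok:
            return False  # conditions are combined with "and"
    return True


records = [{"id": "s1", "len": 9600}, {"id": "s2", "len": 9400}]
long_ones = [r["id"] for r in records if matches(r, len_gt=9500)]
either = [r["id"] for r in records
          if matches(r, len_gt=9500) or matches(r, id_in=("s2",))]
print(long_ones)  # ['s1']
print(either)     # ['s1', 's2']
```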

Example:

>>> from sugar import read
>>> seqs = read()
>>> seqs2 = seqs.select(len_gt=9500)  # Select sequences with length > 9500

sl(**kw)[source]

Method that allows you to slice the BioBasket object with non-default options.

If you want to use the default options, you can slice the BioBasket object directly. For non-default options, slice the sliceable object returned by this method.

Parameters:

**kw – All kwargs are documented in BioSeq.sl().

Slice options:

The slice specifies which part of the sequence(s) is returned and is defined inside the square brackets []. The following options are supported.

int

Returns a BioSeq from the basket

slice

Returns a new BioBasket object with a subset of the sequences

str,feature,location

Returns a new BioBasket object with updated sequences inside, see BioSeq.sl()

(int, object)

Returns a BioSeq from the basket and slices it with the object, see BioSeq.sl()

(slice, object)

Returns a new BioBasket object with a subset of the sequences which are replaced by subsequences according to BioSeq.sl()
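
The dispatch over these index types can be sketched with a small UserList subclass. MiniBasket is a hypothetical stand-in that holds plain strings. It covers only the int, slice and tuple cases and is not the BioBasket implementation.

```python
from collections import UserList


class MiniBasket(UserList):
    """Hypothetical basket of plain strings (not the real BioBasket)."""

    def __getitem__(self, index):
        if isinstance(index, tuple):
            which, part = index
            selected = self[which]  # int -> one item, slice -> MiniBasket
            if isinstance(selected, MiniBasket):
                # (slice, object): slice every selected sequence
                return MiniBasket([seq[part] for seq in selected.data])
            return selected[part]   # (int, object): slice a single sequence
        return super().__getitem__(index)  # plain int or slice


basket = MiniBasket(["ACCTGCCCCT", "ACCTGCCCCC", "ATGAAA"])
print(basket[0])               # ACCTGCCCCT
print(basket[:2, 5:10].data)   # ['CCCCT', 'CCCCC']
print(basket[2, :3])           # ATG
```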

Example:

>>> from sugar import read
>>> seqs = read()
>>> print(seqs[:2, 5:10])
2 seqs in basket
AB047639  5  CCCCT  ...
AB677533  5  CCCCC  ...
>>> print(seqs[:2, 'cds'][:, 0:3])
2 seqs in basket
AB047639  3  ATG  ...
AB677533  3  ATG  ...

sort(keys=('id',), reverse=False)[source]

Sort sequences in-place

Parameters:
  • keys – Tuple of meta keys or functions to use for sorting. Can also be a single string or a callable. Defaults to sorting by id.

  • reverse – Use reverse order (default: False)

Returns:

Sorted sequences

Example:

>>> from sugar import read
>>> seqs = read()
>>> seqs.sort(len)

Note

This function works in place and modifies the data. If you want to keep the original data, use the copy() method first.

tobiopython(*, msa=False)[source]

Convert the BioBasket to a list of biopython SeqRecord objects

Parameters:

msa (bool) – Return a biopython MultipleSeqAlignment object instead of a list.

Attached BioSeq.fts features are not converted.

tobiotite(**kw)[source]

Convert BioBasket to a list of biotite NucleotideSequence or ProteinSequence instances

Parameters:
  • type (str) – 'nt' creates a NucleotideSequence instance, 'aa' creates a ProteinSequence instance, by default the class is inferred from the sequence itself.

  • msa (bool) – Return a biotite Alignment object instead of a list, default is False

  • gap (str) – Gap characters that must be removed from the sequence strings.

  • warn (bool) – Whether to warn if gap characters have been removed, default is True, not used with msa=True

todict()[source]

Return a dictionary with sequence ids as keys and sequences as values

Note

This method is different from the BioBasket.groupby() method. Each value of the dict returned by todict() is a sequence, while each value of the dict returned by groupby() is a BioBasket.

tofmtstr(fmt, **kw)[source]

Write sequences to a string of the specified format, see write()

tostr(h=19, w=80, wid=19, wlen=4, showgc=True, add_hint=False, raw=False, add_header=True)[source]

Return string with information about sequences, used by __str__() method

translate(*args, **kw)[source]

Translate nucleotide sequences to amino acid sequences, see translate().

The original translate method of the str class can be used via BioBasket.str.translate().

Note

This function works in place and modifies the data. If you want to keep the original data, use the copy() method first.

write(fname=None, fmt=None, **kw)[source]

Write sequences to file, see write()

property d

Alias for BioBasket.todict()

data

Property holding the list of sequences

property fts

FeatureList containing the features of all sequences

Can also be used as setter. Code example: seqs.fts = new_fts.

property ids

List of sequence ids

meta

Property holding metadata

property str

Namespace holding all available string methods.

The BioBasket.str methods call the corresponding BioSeq.str methods under the hood and return either the modified BioBasket object or a list of results. See _BioSeqStr for available methods and str for method documentation.

Example:

>>> seqs = read()
>>> seqs.str.find('ATG')  # Use string method
[30, 12]

class sugar.core.seq._BioBasketStr(parent)[source]

Bases: object

Helper class to move all string methods into the BioBasket.str namespace

It calls the corresponding BioSeq.str method under the hood and returns either the modified BioBasket object or a list of results.

class sugar.core.seq._BioSeqStr(parent)[source]

Bases: object

Helper class to hold all string methods in the BioSeq.str namespace.

The methods modify the data in-place, if applicable, which is different from the behavior of the original string methods.

See str for documentation of the methods.

static maketrans(*args)[source]
center(width, *args)[source]
count(sub, start=0, end=9223372036854775807)[source]
encode(encoding='utf-8', errors='strict')[source]
endswith(suffix, start=0, end=9223372036854775807)[source]
find(sub, start=0, end=9223372036854775807)[source]
index(sub, start=0, end=9223372036854775807)[source]
isalpha()[source]
isascii()[source]
islower()[source]
isupper()[source]
ljust(width, *args)[source]

The ljust() and rjust() methods can be used to fill up an alignment with gaps.

Example: seqs.str.ljust(500, '-')
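
The padding effect can be seen with the built-in str.ljust() on plain strings. This sketch uses hypothetical example rows and only illustrates the underlying string behavior; it does not go through the BioSeq.str namespace.

```python
rows = ["ATGGCA", "ATG", "ATGG"]

# Pad every row on the right with gap characters up to the longest row.
width = max(len(row) for row in rows)
padded = [row.ljust(width, "-") for row in rows]
print(padded)  # ['ATGGCA', 'ATG---', 'ATGG--']
```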

lower()[source]
lstrip(chars=None)[source]
removeprefix(prefix, /)[source]
removesuffix(suffix, /)[source]
replace(old, new, maxsplit=-1)[source]
rfind(sub, start=0, end=9223372036854775807)[source]
rindex(sub, start=0, end=9223372036854775807)[source]
rjust(width, *args)[source]
rsplit(sep=None, maxsplit=-1)[source]
rstrip(chars=None)[source]
split(sep=None, maxsplit=-1)[source]
splitlines(keepends=False)[source]
startswith(prefix, start=0, end=9223372036854775807)[source]
strip(chars=None)[source]
swapcase()[source]
translate(*args)[source]
upper()[source]