sugar.core.cane module

Several core helper classes and functions, such as translate(), match(), and find_orfs()

class sugar.core.cane.BioMatch(match, rf=None, seqlen=None, seqid=None)[source]

Bases: object

The BioMatch object is returned by match() and the different match methods.

It is designed to behave like the original re.Match object. See there for available methods. It also has the BioMatch.rf attribute, which holds the reading frame (between -3 and 2, inclusive) of the match.

Example:

>>> match = read()[0].match('AT.')
>>> match
<sugar.BioMatch object; seqid=AB047639; rf=2; span=(11, 14); match=ATA>
>>> print(match.group(), match.rf)
ATA 2
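The relation between a match position and its reading frame can be sketched with the standard re module alone. This is an illustration, not sugar's implementation: the helper forward_rf is hypothetical, and it assumes that the frame of a forward-strand match is its start index modulo 3, which agrees with the example above (span starts at 11, rf=2). Backward-strand frames (-3 to -1) are not covered.

```python
import re

def forward_rf(m):
    """Hypothetical helper: reading frame (0, 1 or 2) of a forward-strand re.Match."""
    # Assumption: the frame is the start index modulo 3
    return m.start() % 3

seq = "CCGGATACC"
m = re.search("AT.", seq)

# start(), end(), span() and group() behave as on any re.Match object
print(m.group(), m.span(), forward_rf(m))
```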
end()[source]
span()[source]
start()[source]
rf

reading frame (-3 to 2)

seqid

sequence id

class sugar.core.cane.BioMatchList(initlist=None)[source]

Bases: UserList

List of BioMatch objects

groupby(keys='rf')[source]

Group matches

Parameters:

keys – Tuple of meta keys or functions to use for grouping. Can also be a single string or a callable. By default the method groups by rf.

Returns:

Nested dict structure
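A nested dict grouping of this kind can be sketched in plain Python. The names Hit and groupby_sketch are hypothetical stand-ins, and the choice to store a plain list at the innermost level is an assumption; the actual method may return a different container there.

```python
from collections import namedtuple

# Hypothetical stand-in for BioMatch, carrying only the attributes used here
Hit = namedtuple("Hit", "seqid rf start")

def groupby_sketch(items, keys="rf"):
    """Group items into nested dicts, one nesting level per key."""
    # Accept a single attribute name or callable as well as a tuple of them
    if isinstance(keys, str) or callable(keys):
        keys = (keys,)
    result = {}
    for item in items:
        level = result
        for i, key in enumerate(keys):
            value = key(item) if callable(key) else getattr(item, key)
            if i == len(keys) - 1:
                # Innermost level: collect the items themselves
                level.setdefault(value, []).append(item)
            else:
                level = level.setdefault(value, {})
    return result

hits = [Hit("s1", 0, 3), Hit("s1", 2, 11), Hit("s2", 0, 6)]
by_rf = groupby_sketch(hits)                                   # one level, keyed by rf
nested = groupby_sketch(hits, ("seqid", lambda h: h.rf))       # seqid, then rf
```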

select(**kw)[source]

Select matches

Parameters:

keys – Tuple of meta keys or functions to use for grouping. Can also be a single string or a callable. By default the method groups by seqid only.

Returns:

Nested dict structure

tostr()[source]
property d

Group matches by seqid, alias for BioMatchList.groupby('seqid')

class sugar.core.cane.ORFList(data=None)[source]

Bases: FeatureList

List of open reading frames (ORFs)

sugar.core.cane.find_orfs(seq, rf='all', start='start', stop='stop', need_start='always', need_stop=True, nested='other', gap=None, len_ge=0, ftype='ORF')[source]

Find open reading frames (ORFs)

Parameters:
  • seq – BioSeq sequence

  • rf – reading frame, possible values: int, string or tuple. See also match. Default is 'all'.

  • start – regular expression defining the start codons, defaults to ATG/AUG

  • stop – regular expression defining the stop codons, defaults to stop codons in the default translation table

  • need_start – One of ('always', 'once', 'never'). Always: Each ORF starts with a start codon. Once: Only the first ORF on the forward and backward strand starts with a start codon. Never: ORFs can start at each codon.

  • need_stop – Whether the last ORF in each RF must end with a stop codon.

  • nested – Allow nested ORFs fully contained within other ORFs, one of ('no', 'other', 'all'). No: No nested ORFs. Other: Nested ORFs allowed in other reading frames (default). All: Nested ORFs allowed in all reading frames.

  • gap – Gap character inserted into the start and stop codon regexes, default is None.

  • len_ge – Return only ORFs with a length greater than or equal to this value, default: 0.

  • ftype – Feature type for found ORFs, default is 'ORF'

Returns:

Returns an ORFList of all found ORFs. You can attach these features to sequences using BioSeq.add_fts() or BioBasket.add_fts(). Use the BioSeq.fts and BioBasket.fts properties to overwrite features with the found ORFs.

Note

Python’s re.finditer() is used internally to find start and stop codons. The limitations of this function apply; for example, matches cannot overlap. Care must be taken in special cases. For instance, if ORFs do not need to start with a start codon, do not use the regular expression start='...'; use the need_start='never' option instead.
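The caveat about start='...' can be reproduced with re.finditer() directly, and the general idea of pairing start and stop codons per reading frame can be sketched in a few lines. The function orfs_in_frame is hypothetical and much simpler than find_orfs(): it only handles one forward reading frame, requires a start codon for every ORF and a closing stop codon, and hard-codes ATG and the three standard stop codons.

```python
import re

# '...' consumes three characters per match, so only frame 0 is ever reported
dot_starts = [m.start() for m in re.finditer("...", "ATGAAAT")]

def orfs_in_frame(seq, rf, start="ATG", stop="TAA|TAG|TGA"):
    """Return (begin, end) spans of ORFs in one forward reading frame.

    Each ORF begins at a start codon and ends with the first following
    in-frame stop codon; start codons inside an open ORF are skipped.
    """
    starts = [m.start() for m in re.finditer(start, seq) if m.start() % 3 == rf]
    stops = [m.start() for m in re.finditer(stop, seq) if m.start() % 3 == rf]
    orfs, pos = [], 0
    for s in starts:
        if s < pos:
            continue  # nested in the previous ORF of the same frame
        e = next((t for t in stops if t > s), None)
        if e is None:
            break  # no closing stop codon
        orfs.append((s, e + 3))
        pos = e + 3
    return orfs

seq = "CCATGAAATAGATGCCCTGA"
frame2 = orfs_in_frame(seq, 2)
frame0 = orfs_in_frame(seq, 0)
```

In this toy sequence both ORFs lie in frame 2; the TGA at index 3 belongs to frame 0 and is therefore ignored when scanning frame 2.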

sugar.core.cane.match(seq, sub, *, rf='fwd', start=0, gap='-', matchall=False)[source]

Return BioMatch object for first found match of regex sub, None if not found.

Parameters:
  • sub (str) – regex or 'start' or 'stop' to find start/stop codon, please specify different codons like

  • rf (int) – Can be set to an integer between -3 and 2 inclusive to respect the corresponding reading frame. Rfs 0 to 2 are on the forward strand, rfs -3 to -1 are on the backward strand. You can also specify a set or tuple of reading frames. Additionally, you can use one of ('fwd', 'bwd', 'all') to select all reading frames on the specified strands. Defaults to 'fwd', i.e. all three reading frames on the forward strand. Set rf to None to ignore reading frames (e.g. for amino acid sequences).

  • start (int) – Index of the nucleobase to start matching. Defaults to 0.

  • gap (str) – Gap character to take into account, defaults to '-'. The character is inserted between every two letters of the regex. Be careful: this approach does not work for arbitrary regexes.

  • matchall (bool) – False will return first match of type BioMatch, True will return all matches in a BioMatchList. Defaults to False.

Returns:

match (BioMatch, BioMatchList of matches, or None). The list is sorted by match position, with matches on the forward strand first, followed by matches on the backward strand.
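The gap handling and the reading frame filter can be illustrated with the standard re module. This is a sketch under assumptions, not the library's code: gapped and first_match are hypothetical names, the inserted sub-pattern is assumed to be "zero or more gap characters", only the forward strand is handled, and the frame is taken from the raw start index without discounting gaps.

```python
import re

def gapped(pattern, gap="-"):
    """Allow any number of gap characters between the letters of a plain pattern.

    Only meaningful for patterns made of literal letters; regex syntax such
    as groups, classes or quantifiers would be corrupted by the insertion.
    """
    return f"{re.escape(gap)}*".join(pattern)

def first_match(seq, sub, rf=(0, 1, 2), start=0, gap="-"):
    """Return the first forward-strand match whose start lies in one of the frames rf."""
    regex = re.compile(gapped(sub, gap) if gap else sub)
    for m in regex.finditer(seq, start):
        if rf is None or m.start() % 3 in rf:
            return m
    return None

seq = "CCA-TGATG"
any_frame = first_match(seq, "ATG")             # gapped hit 'A-TG' at index 2
frame0 = first_match(seq, "ATG", rf=(0,))       # skips index 2, finds index 6
missing = first_match(seq, "GGG")
```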

sugar.core.cane.translate(seq, *, complete=False, check_start=None, check_stop=False, final_stop=None, warn=False, astop='X', gap='-', gap_after=2, tt=1)[source]

Translate a string or BioSeq object into an amino acid string

Parameters:
  • complete (bool) – If set to True ignore stop codons, otherwise the translation is stopped at the first stop codon

  • check_start (bool) – Check that the first codon is a start codon, default is True for complete=False, otherwise False

  • check_stop (bool) – Check that the sequence ends with the first stop codon, default is False

  • final_stop (bool) – Append * for the final stop character, defaults to False for complete=False and True for complete=True

  • warn (bool) – Warn if the first codon might not be a start codon, warn for ambiguous stop codons, warn if the sequence does not end with a stop codon, default is False

  • astop (str) – Symbol for ambiguous stop codons

  • gap (str) – Gap character, default '-', set to None to raise an error for non-nucleotide characters

  • gap_after (int) – A single gap in the amino acid string is written after the first gap_after gaps in the nucleotide sequence and after every third gap thereafter, default is 2

  • tt (int) – the number of the translation table, default is 1

Returns:

Translated string
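How complete and final_stop interact can be sketched with a small stand-alone translator. The function translate_sketch is hypothetical and covers only these two options with the standard genetic code (translation table 1); start and stop checks, warnings, ambiguous codons and gaps are left out, and incomplete trailing codons are silently dropped.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate_sketch(nt, complete=False, final_stop=None):
    """Translate a nucleotide string codon by codon."""
    if final_stop is None:
        # Mirrors the documented default: False unless complete=True
        final_stop = complete
    nt = nt.upper().replace("U", "T")
    aas = ""
    for i in range(0, len(nt) - 2, 3):
        aas += TABLE[nt[i:i + 3]]
        if aas[-1] == "*" and not complete:
            break  # stop at the first stop codon
    if aas.endswith("*") and not final_stop:
        aas = aas[:-1]
    return aas

seq = "ATGAAATAGGGG"
print(translate_sketch(seq))                   # stops before the stop codon
print(translate_sketch(seq, final_stop=True))  # keeps the trailing '*'
print(translate_sketch(seq, complete=True))    # reads through the stop codon
```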