sugar.core.cane module
Several core helper classes and functions, such as translate(), match(), and find_orfs()
- class sugar.core.cane.BioMatch(match, rf=None, seqlen=None, seqid=None)
Bases: object
The BioMatch object is returned by match() and the different match methods.
It is designed to behave like the original re.Match object. See the re.Match documentation for available methods. It also has the BioMatch.rf attribute, which holds the reading frame (between -3 and 2, inclusive) of the match.
Example:
>>> match = read()[0].match('AT.')
>>> match
<sugar.BioMatch object; seqid=AB047639; rf=2; span=(11, 14); match=ATA>
>>> print(match.group(), match.rf)
ATA 2
- rf
reading frame (-3 to 2)
- seqid
sequence id
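The relation between a match position and its reading frame can be illustrated with the standard library alone. The sketch below is not sugar's implementation: it assumes that a forward-strand frame equals the start index modulo 3, which agrees with the example above (a match starting at index 11 reports rf=2). The helper name `forward_matches` and the test sequence are hypothetical, and backward-strand frames are not covered.

```python
import re


def forward_matches(seq, pattern, rf=(0, 1, 2)):
    """Yield (match, frame) for regex hits that start in one of the given frames."""
    for m in re.finditer(pattern, seq):
        frame = m.start() % 3  # assumption: forward frame = start index modulo 3
        if frame in rf:
            yield m, frame


seq = "CCATGAAATACGATATTT"  # made-up sequence for illustration
hits = [(m.group(), m.span(), frame) for m, frame in forward_matches(seq, "AT.")]
in_frame_0 = [m.span() for m, _ in forward_matches(seq, "AT.", rf=(0,))]
print(hits)
print(in_frame_0)
```

Filtering after matching keeps the sketch short; because `re.finditer()` returns non-overlapping hits, a real implementation may need to search each frame separately.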
- class sugar.core.cane.BioMatchList(initlist=None)
Bases: UserList
List of BioMatch objects
- groupby(keys='rf')
Group matches
- Parameters:
keys – Tuple of meta keys or functions to use for grouping. Can also be a single string or a callable. By default the method groups by rf.
- Returns:
Nested dict structure
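A nested dict grouping of this kind can be sketched in plain Python. The code below is an illustration under assumptions, not the library's code: `Hit` is a hypothetical stand-in for BioMatch, and the standalone `groupby` function only mimics the documented behaviour (a single string, a callable, or a tuple of them as keys, one dict level per key).

```python
from collections import namedtuple

# Hypothetical stand-in for BioMatch with only the attributes used here
Hit = namedtuple("Hit", "seqid rf span")


def groupby(items, keys=("rf",)):
    """Group items into nested dicts, one level per key (attribute name or callable)."""
    if isinstance(keys, str) or callable(keys):
        keys = (keys,)
    key, rest = keys[0], keys[1:]
    getter = key if callable(key) else (lambda item: getattr(item, key))
    groups = {}
    for item in items:
        groups.setdefault(getter(item), []).append(item)
    if rest:
        # Recurse so that each further key adds one more dict level
        return {k: groupby(v, rest) for k, v in groups.items()}
    return groups


hits = [Hit("s1", 0, (0, 3)), Hit("s1", 2, (5, 8)), Hit("s2", 0, (3, 6))]
by_rf = groupby(hits)
by_seqid_rf = groupby(hits, ("seqid", "rf"))
print(by_seqid_rf["s1"][2])
```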
- select(**kw)
Select matches
- Parameters:
keys – Tuple of meta keys or functions to use for grouping. Can also be a single string or a callable. By default the method groups by seqid only.
- Returns:
Nested dict structure
- property d
Group matches by seqid, alias for BioMatchList.groupby('seqid')
- class sugar.core.cane.ORFList(data=None)
Bases: FeatureList
List of open reading frames (ORFs)
- sugar.core.cane.find_orfs(seq, rf='all', start='start', stop='stop', need_start='always', need_stop=True, nested='other', gap=None, len_ge=0, ftype='ORF')
Find open reading frames (ORFs)
- Parameters:
seq – BioSeq sequence
rf – reading frame, possible values: int, string or tuple. See also match. Default is 'all'.
start – regular expression defining the start codons, defaults to ATG/AUG
stop – regular expression defining the stop codons, defaults to stop codons in the default translation table
need_start – One of ('always', 'once', 'never'). Always: Each ORF starts with a start codon. Once: Only the first ORF on the forward and backward strand starts with a start codon. Never: ORFs can start at each codon.
need_stop – Whether the last ORF in each RF must end with a stop codon.
nested – Allow nested ORFs fully contained within other ORFs, one of ('no', 'other', 'all'). No: No nested ORFs. Other: Nested ORFs allowed in other reading frames (default). All: Nested ORFs allowed in all reading frames.
gap – Gap character inserted into the start and stop codon regexes, default is None.
len_ge – Return only ORFs with a length greater than or equal to this value, default: 0.
ftype – Feature type for found ORFs, default is 'ORF'
- Returns:
An ORFList of all found ORFs. You can attach these features to sequences using BioSeq.add_fts() or BioBasket.add_fts(). Use the BioSeq.fts and BioBasket.fts properties to overwrite features with the found ORFs.
Note
Python’s re.finditer() is used internally to find start and stop codons. The limitations of this function apply; for example, matches cannot overlap. Care must be taken in special cases. For instance, if ORFs do not need to start with a start codon, do not use the regular expression start='...'; use the need_start='never' option instead.
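The limitation named in the note can be reproduced with `re.finditer()` directly. The short sketch below uses only the standard library and made-up sequences; it shows that hits never overlap and that the pattern `'...'` only yields codon positions counted from the first index.

```python
import re

# After a hit, the search resumes at the end of that hit, so the
# occurrence of 'ATA' starting at index 2 is skipped.
starts = [m.start() for m in re.finditer("ATA", "ATATATA")]

# The pattern '...' consumes three letters at a time from index 0,
# so it never reports positions in the other two reading frames.
any_codon = [m.start() for m in re.finditer("...", "ATGAAATAGC")]

print(starts)     # [0, 4]
print(any_codon)  # [0, 3, 6]
```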
- sugar.core.cane.match(seq, sub, *, rf='fwd', start=0, gap='-', matchall=False)
Return a BioMatch object for the first match of regex sub, or None if not found.
- Parameters:
sub (str) – regex or 'start' or 'stop' to find start/stop codon, please specify different codons like
rf (int) – Can be set to an integer between -3 and 2 inclusive to respect the corresponding reading frame. Rfs 0 to 2 are on the forward strand, rfs -3 to -1 are on the backward strand. You can also specify a set or tuple of reading frames. Additionally, you can use one of ('fwd', 'bwd', 'all') to select all reading frames on the specified strands. Defaults to 'fwd', i.e. all three reading frames on the forward strand. You may set rf to None to ignore reading frames (e.g. for amino acid sequences).
start (int) – Index of the nucleobase to start matching. Defaults to 0.
gap (str) – Consider gaps of the given character. Defaults to '-'. The character is inserted between each two letters of the regex. Be careful, this approach does not work for arbitrary regexes.
matchall (bool) – False will return the first match of type BioMatch, True will return all matches in a BioMatchList. Defaults to False.
- Returns:
match (BioMatch, BioMatchList of matches, or None). The list is sorted by match position, with matches on the forward strand first, then matches on the backward strand.
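The gap handling described for the gap parameter can be sketched as follows. This is a guess at the general idea, not the library's code: the helper `with_gaps` is hypothetical, and it assumes that "inserted between each two letters" means an optional run of gap characters (`gap*`) is placed between consecutive pattern characters.

```python
import re


def with_gaps(pattern, gap="-"):
    """Insert an optional run of gap characters between each two pattern characters."""
    return (re.escape(gap) + "*").join(pattern)


aligned = "CCA--TGAA"  # made-up aligned sequence with a gap inside the codon
plain = re.search("ATG", aligned)
gapped = re.search(with_gaps("ATG"), aligned)
print(plain)          # None, the gap interrupts the codon
print(gapped.span())  # (2, 7)
```

Joining on every character is also why such an approach fails for arbitrary regexes: a quantifier like `A{2}` or a group like `(A|T)` would be split apart by the inserted gap expression.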
- sugar.core.cane.translate(seq, *, complete=False, check_start=None, check_stop=False, final_stop=None, warn=False, astop='X', gap='-', gap_after=2, tt=1)
Translate a string or BioSeq object into an amino acid string
- Parameters:
complete (bool) – If set to True, ignore stop codons; otherwise the translation is stopped at the first stop codon
check_start (bool) – Check that the first codon is a start codon, default is True for complete=False, otherwise False
check_stop (bool) – Check that the sequence ends with the first stop codon, default is False
final_stop (bool) – Append * for the final stop character, defaults to False for complete=False and True for complete=True
warn (bool) – Warn if the first codon might not be a start codon, warn for ambiguous stop codons, warn if the sequence does not end with a stop codon, default is False
astop (str) – Symbol for ambiguous stop codons
gap (str) – Gap character, default '-', set to None to raise an error for non-nucleotide characters
gap_after (int) – A single gap in the amino acid string is written after the first gap_after gaps in the nucleotide sequence and after every third gap thereafter, default is 2
tt (int) – the number of the translation table, default is 1
- Returns:
Translated string
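The effect of the complete option can be illustrated with a heavily simplified translation routine. The sketch below is not sugar's translate(): the function name `translate_sketch` is hypothetical, only the standard genetic code (translation table 1) is built in, and start/stop checks, gaps, ambiguous codons and warnings are all left out. It also assumes that with complete=True stop codons are written as `*`.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (translation table 1), codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_sketch(nt, complete=False):
    """Translate codon by codon; stop at the first stop codon unless complete=True."""
    nt = nt.upper().replace("U", "T")  # treat RNA input like DNA
    aas = []
    for i in range(0, len(nt) - 2, 3):
        aa = CODON_TABLE[nt[i:i + 3]]  # raises KeyError for non-ACGT letters
        if aa == "*" and not complete:
            break
        aas.append(aa)
    return "".join(aas)


print(translate_sketch("ATGAAATAGGGG"))                 # MK
print(translate_sketch("ATGAAATAGGGG", complete=True))  # MK*G
```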