Pydna is a Python package for simulating the construction of
recombinant DNA molecules using
molecular biology
techniques. Pydna is developed in this GitHub repository.
Provided:
PCR simulation
Assembly simulation based on shared identical sequences
Primer design for amplification of a given sequence
Automatic design of primer tails for Gibson assembly or homologous recombination
Restriction digestion and cut&paste cloning
Agarose gel simulation
Download sequences from Genbank
Parsing various sequence formats, including the capacity to handle broken Genbank format
The most important modules and how to import functions or classes from
them are listed below. Class names start with a capital letter,
functions with a lowercase letter:
from pydna.module import function
from pydna.module import Class
Example: from pydna.gel import Gel
pydna
├── amplify
│ ├── Anneal
│ └── pcr
├── assembly
│ └── Assembly
├── design
│ ├── assembly_fragments
│ └── primer_design
├── download
│ └── download_text
├── dseqrecord
│ └── Dseqrecord
├── gel
│ └── Gel
├── genbank
│ ├── genbank
│ └── Genbank
├── parsers
│ ├── parse
│ └── parse_primers
└── readers
  ├── read
  └── read_primers
Documentation is available as docstrings provided in the source code for
each module.
These docstrings can be inspected by reading the source code directly.
See further below on how to obtain the code for pydna.
In the Python shell, use the built-in help function to view a
function’s docstring.
The docstrings are also used to provide an automatically generated reference
manual, available online at
Read the Docs.
Docstrings can be explored using IPython, an
advanced Python shell with
TAB-completion and introspection capabilities. To see which functions
are available in pydna,
type pydna.<TAB> (where <TAB> refers to the TAB key).
Use pydna.open_config_folder?<ENTER> to view the docstring or
pydna.open_config_folder??<ENTER> to view the source code.
In the Spyder IDE it is possible
to place the cursor immediately before the name of a module, class or
function and press ctrl+i to bring up docstrings in a separate window.
Code snippets are indicated by three greater-than signs (>>>).
Please join the
Google group
for pydna; this is the preferred location for help. If you find bugs
in pydna itself, open an issue at the
GitHub repository.
The email address is set to someone@example.com by default. If you change
this to your own address, the pydna.genbank.genbank() function can be
used to download sequences from Genbank directly without having to
explicitly add the email address.
Pydna can cache results from the following functions or methods:
Dseq holds information for a double stranded DNA fragment.
Dseq also holds information describing the topology of
the DNA fragment (linear or circular).
Parameters:
watson (str) – a string representing the watson (sense) DNA strand.
crick (str, optional) – a string representing the crick (antisense) DNA strand.
ovhg (int, optional) – A positive or negative number describing the stagger between the
watson and crick strands.
See below for a detailed explanation.
linear (bool, optional) – True indicates that sequence is linear, False that it is circular.
circular (bool, optional) – True indicates that sequence is circular, False that it is linear.
Examples
Dseq is a subclass of the Biopython Seq object. It stores two
strings representing the watson (sense) and crick (antisense) strands,
two properties called linear and circular, and a numeric value ovhg
(overhang) describing the stagger between the watson and crick strands
at the 5’ end of the fragment.
The most common usage is probably to create a Dseq object as a
part of a Dseqrecord object (see pydna.dseqrecord.Dseqrecord).
There are three ways of creating a Dseq object directly, listed below, but you can also
use the function Dseq.from_full_sequence_and_overhangs() to create a Dseq:
Only one argument (string):
The given string will be interpreted as the watson strand of a
blunt, linear double stranded sequence object. The crick strand
is created automatically from the watson strand.
Two arguments (string, string):
If both watson and crick are given, but not ovhg, an attempt
will be made to find the best annealing between the strands.
There are limitations to this. For long fragments it is quite
slow. The annealing region has to be at least
half the length of the shorter of the two strands.
Three arguments (string, string, ovhg=int):
The ovhg parameter is an integer describing the length of the
crick strand overhang in the 5’ end of the molecule.
The ovhg parameter controls the stagger at the five prime end:
If the ovhg parameter is specified a crick strand also
needs to be supplied, otherwise an exception is raised.
>>> Dseq(watson="agt",ovhg=2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pydna_/dsdna.py", line 169, in __init__
    else:
ValueError: ovhg defined without crick strand!
The shape of the fragment is set by circular = True or False.
Note that both ends of the DNA fragment have to be compatible to set
circular = True.
Circularization can only be done if the two ends are compatible;
otherwise a TypeError is raised.
Examples
>>> from pydna.dseq import Dseq
>>> a=Dseq("catcgatc")
>>> a
Dseq(-8)
catcgatc
gtagctag
>>> a.looped()
Dseq(o8)
catcgatc
gtagctag
>>> a.T4("t")
Dseq(-8)
catcgat
 tagctag
>>> a.T4("t").looped()
Dseq(o7)
catcgat
gtagcta
>>> a.T4("a")
Dseq(-8)
catcga
  agctag
>>> a.T4("a").looped()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pydna/dsdna.py", line 357, in looped
    if type5 == type3 and str(sticky5) == str(rc(sticky3)):
TypeError: DNA cannot be circularized.
5' and 3' sticky ends not compatible!
>>>
Fill in five prime protruding ends with a DNA polymerase
that has only DNA polymerase activity (such as exo-klenow [1])
and any combination of A, G, C or T. The default is all four
nucleotides together.
Transcribe a DNA sequence into RNA and return the RNA sequence as a new Seq object.
Following the usual convention, the sequence is interpreted as the
coding strand of the DNA double helix, not the template strand. This
means we can get the RNA sequence just by switching T to U.
As Seq objects are immutable, a TypeError is raised if
transcribe is called on a Seq object with inplace=True.
Trying to transcribe an RNA sequence has no effect.
If you have a nucleotide sequence which might be DNA or RNA
(or even a mixture), calling the transcribe method will ensure
any T becomes U.
Trying to transcribe a protein sequence will replace any
T for Threonine with U for Selenocysteine, which has no
biologically plausible rationale.
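The T-to-U convention described above can be sketched in plain Python. This is only an illustration of the rule (coding strand in, RNA out); the helper name transcribe_coding_strand is hypothetical and is not part of pydna or Biopython.

```python
def transcribe_coding_strand(dna: str) -> str:
    """Return the RNA equivalent of a coding-strand DNA string (T -> U)."""
    # Preserve case: upper-case T becomes U, lower-case t becomes u.
    return dna.replace("T", "U").replace("t", "u")


rna = transcribe_coding_strand("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
print(rna)

# Applying the helper to RNA changes nothing, since no T is left to replace.
print(transcribe_coding_strand(rna) == rna)
```

Because the rule is a blind character substitution, it also explains why a protein string would have its Threonine symbols rewritten.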
Fill in five prime protruding ends and chew back
three prime protruding ends with a DNA polymerase providing both
5’-3’ DNA polymerase activity and 3’-5’ nuclease activity
(such as T4 DNA polymerase). This can be done in the presence of any
combination of the four nucleotides A, G, C or T. Removing one or more nucleotides
can facilitate engineering of sticky ends. The default is all four nucleotides together.
Returns False if:
- Cut positions fall outside the sequence (could be moved to Biopython)
- Overhang is not double stranded
- Recognition site is not double stranded or is outside the sequence
- For enzymes that cut twice, it checks that at least one possibility is valid
Returns a list of cutsites, represented as ((cut_watson, ovhg), enz):
cut_watson is a positive integer contained in [0,len(seq)), where seq is the sequence
that will be cut. It represents the position of the cut on the watson strand, using the full
sequence as a reference. By “full sequence” I mean the one you would get from str(Dseq).
ovhg is the overhang left after the cut. It has the same meaning as ovhg in
the Bio.Restriction enzyme objects, or pydna’s Dseq property.
enz is the enzyme object. It’s not necessary to perform the cut, but can be
used to keep track of which enzyme was used.
Cuts are only returned if the recognition site and overhang are on the double-strand
part of the sequence.
For a given cut expressed as ((cut_watson, ovhg), enz), returns
a tuple (cut_watson, cut_crick, ovhg).
cut_watson: see get_cutsites docs
cut_crick: equivalent of cut_watson in the crick strand
ovhg: see get_cutsites docs
The cut can be None if it represents the left or right end of the sequence.
Then it will return the position of the watson and crick ends with respect
to the “full sequence”. The is_left parameter is only used in this case.
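The relation between the three returned values can be sketched with plain integers. The sketch assumes the Bio.Restriction sign convention, in which an enzyme leaving a 5’ overhang (such as EcoRI) has a negative ovhg and one leaving a 3’ overhang (such as KpnI) has a positive ovhg. The helper name cut_parameters and the wrap-around for circular sequences are illustrative assumptions, not pydna's implementation, and the None (sequence end) case is left out.

```python
def cut_parameters(cut_watson, ovhg, seq_len, circular=False):
    """Return (cut_watson, cut_crick, ovhg) in "full sequence" coordinates."""
    # The crick cut is displaced from the watson cut by the overhang length.
    cut_crick = cut_watson - ovhg
    if circular:
        # On a circular molecule positions wrap around the origin.
        cut_crick %= seq_len
    return cut_watson, cut_crick, ovhg


# EcoRI (G^AATTC, ovhg = -4) on "ttGAATTCtt": the site starts at index 2,
# so the watson strand is cut right after the G, at position 3.
print(cut_parameters(3, -4, 10))  # (3, 7, -4)

# KpnI (GGTAC^C, ovhg = +4) on "ttGGTACCtt": the watson strand is cut at 7.
print(cut_parameters(7, 4, 10))   # (7, 3, 4)
```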
Returns pairs of cutsites that render the edges of the resulting fragments.
A fragment produced by restriction is represented by a tuple of length 2 that
may contain cutsites or None:
Two cutsites: represents the extraction of a fragment between those two
cutsites, in that orientation. To represent the opening of a circular
molecule with a single cutsite, we put the same cutsite twice.
None, cutsite: represents the extraction of a fragment between the left
edge of a linear sequence and the cutsite.
cutsite, None: represents the extraction of a fragment between the cutsite
and the right edge of a linear sequence.
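The three cases above can be sketched as a small pairing function. For brevity the cutsites are plain integers here rather than ((cut_watson, ovhg), enz) tuples, and the function name cutsite_pairs is a hypothetical stand-in; treat it as a sketch of the representation, not as pydna's code.

```python
def cutsite_pairs(cutsites, circular):
    """Pair sorted cutsites into (left_edge, right_edge) fragment tuples."""
    if not cutsites:
        return []
    cutsites = sorted(cutsites)
    if circular:
        # Each cutsite is paired with the next one, wrapping around the origin.
        # A single cutsite is therefore paired with itself.
        return list(zip(cutsites, cutsites[1:] + cutsites[:1]))
    # Linear: None stands for the left and right ends of the molecule.
    edges = [None, *cutsites, None]
    return list(zip(edges, edges[1:]))


print(cutsite_pairs([30, 10], circular=False))
# [(None, 10), (10, 30), (30, None)]
print(cutsite_pairs([10], circular=True))
# [(10, 10)]
print(cutsite_pairs([10, 30], circular=True))
# [(10, 30), (30, 10)]
```

A linear molecule with n cutsites yields n + 1 fragments, while a circular molecule yields n.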
This module provides the Dseqrecord class, for handling double stranded
DNA sequences. The Dseqrecord holds sequence information in the form of a pydna.dseq.Dseq
object. The Dseq and Dseqrecord classes are subclasses of Biopython’s
Seq and SeqRecord classes, respectively.
The Dseq and Dseqrecord classes support the notion of circular and linear DNA topology.
Dseqrecord is a double stranded version of the Biopython SeqRecord [3] class.
The Dseqrecord object holds a Dseq object describing the sequence.
Additionally, Dseqrecord holds meta information about the sequence in the
form of a list of SeqFeatures, in the same way as the SeqRecord does.
The Dseqrecord can be initialized with a string, Seq, Dseq, SeqRecord
or another Dseqrecord. The sequence information will be stored in a
Dseq object in all cases.
Dseqrecord objects can be read or parsed from sequences in FASTA, EMBL or Genbank formats.
See the pydna.readers and pydna.parsers modules for further information.
There is a short representation associated with the Dseqrecord.
Dseqrecord(-3) represents a linear sequence of length 3
while Dseqrecord(o7)
represents a circular sequence of length 7.
Dseqrecord and Dseq share the same concept of length. This length can be larger
than each strand alone if they are staggered as in the example below.
<-- length -->
GATCCTTT
     AAAGCCTAG
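A minimal sketch of how this length can be computed from the two strands and the stagger. It assumes the sign convention given for the ovhg parameter (a positive value means the crick strand protrudes at the left end, a negative value means the watson strand does) and that both strands are written 5’ to 3’. The helper name duplex_length is hypothetical.

```python
def duplex_length(watson: str, crick: str, ovhg: int) -> int:
    """Length of a staggered linear duplex.

    ovhg > 0: the crick strand protrudes ovhg bases at the left end.
    ovhg < 0: the watson strand protrudes -ovhg bases at the left end.
    """
    # Offset of each strand from the leftmost position of the duplex.
    watson_offset = max(ovhg, 0)
    crick_offset = max(-ovhg, 0)
    return max(watson_offset + len(watson), crick_offset + len(crick))


# The staggered example: watson GATCCTTT, crick GATCCGAAA (written 5'->3').
print(duplex_length("GATCCTTT", "GATCCGAAA", ovhg=-5))  # 14
# A blunt duplex has the same length as each strand.
print(duplex_length("agt", "act", ovhg=0))              # 3
```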
Parameters:
record (string, Seq, SeqRecord, Dseq or other Dseqrecord object) – This data will be used to form the seq property
circular (bool, optional) – True or False reflecting the shape of the DNA molecule
linear (bool, optional) – True or False reflecting the shape of the DNA molecule
This checksum is the same as seguid but with base64.urlsafe
encoding instead of the normal base64. This means that
the characters + and / are replaced with - and _ so that
the checksum can be part of a URL.
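The difference between the two encodings can be sketched with the standard library. The sketch assumes that the checksum is a SHA-1 digest of the upper-cased sequence with base64 padding stripped; any prefix that pydna adds to the returned string is not reproduced, and the helper name sha1_b64 is hypothetical.

```python
import base64
import hashlib


def sha1_b64(seq: str, urlsafe: bool = False) -> str:
    """Base64-encoded SHA-1 digest of the upper-cased sequence, without padding."""
    digest = hashlib.sha1(seq.upper().encode("ascii")).digest()
    encode = base64.urlsafe_b64encode if urlsafe else base64.standard_b64encode
    return encode(digest).decode("ascii").rstrip("=")


standard = sha1_b64("gattaca")
urlsafe = sha1_b64("gattaca", urlsafe=True)
print(standard)
print(urlsafe)

# The two strings differ only in the characters used for values 62 and 63.
print(urlsafe == standard.replace("+", "-").replace("/", "_"))
```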
Writes the Dseqrecord to a file using the format f, which must
be a format supported by Biopython SeqIO for writing [5]. Default
is “gb” which is short for Genbank. Note that Biopython SeqIO reads
more formats than it writes.
Filename is the path to the file where the sequence is to be
written. The filename is optional; if it is not given, the
locus property (string) is used together with the format.
If obj is the Dseqrecord object, the default file name will be:
<obj.locus>.<f>
Where <f> is “gb” by default. If the filename already exists
and the sequence it contains is different, a new file name will be
used so that the old file is not lost.
This method returns a new circular sequence (Dseqrecord object), which has been rotated
in such a way that there is maximum overlap between the sequence and
ref, which may be a string, Biopython Seq, SeqRecord object or
another Dseqrecord object.
The reason for using this could be to rotate a new recombinant plasmid so
that it starts at the same position after cloning. See the example below:
Digest a Dseqrecord object with one or more restriction enzymes.
Returns a list of linear Dseqrecords. If there are no cuts, an empty
list is returned.
See also Dseq.cut()
Parameters:
enzymes (enzyme object or iterable of such objects) – A Bio.Restriction.XXX restriction object or iterable of such.
Returns:
Dseqrecord_frags – list of Dseqrecord objects formed by the digestion
The Amplicon class holds information about a PCR reaction involving two
primers and one template. This class is used by the Anneal class and is not
meant to be instantiated directly.
Parameters:
forward_primer (SeqRecord(Biopython)) – SeqRecord object holding the forward (sense) primer
reverse_primer (SeqRecord(Biopython)) – SeqRecord object holding the reverse (antisense) primer
template (Dseqrecord) – Dseqrecord object holding the template (circular or linear)
This module provides the Anneal class and the pcr() function
for PCR simulation. The pcr function is simpler to use, but expects only one
PCR product. The Anneal class should be used if more flexibility is required.
Primers with 5’ tails as well as inverse PCR on circular templates are handled
correctly.
pcr is a convenience function for the Anneal class to simplify its
usage, especially from the command line. If more than one or no PCR
product is formed, a ValueError is raised.
args is any iterable of Dseqrecords or an iterable of iterables of
Dseqrecords. args will be greedily flattened.
Parameters:
args (iterable containing sequence objects) – Several arguments are also accepted.
limit (int = 13, optional) – limit length of the annealing part of the primers.
Notes
sequences in args could be of type:
string
Seq
SeqRecord (or subclass)
Dseqrecord (or subclass)
The last sequence will be assumed to be the template while
all preceding sequences will be assumed to be primers.
This is a powerful function, use with care!
Returns:
product – A pydna.amplicon.Amplicon object representing the PCR
product. The direction of the PCR product will be the same as for
the template sequence.
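The behaviour described above (the last `limit` bases of each primer must anneal, 5’ tails are carried into the product, and anything other than exactly one product is an error) can be illustrated with a toy function on plain strings. This sketch handles only exact matches on a linear template; toy_pcr and reverse_complement are hypothetical helpers and do not reproduce the Anneal class.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(str.maketrans("acgtACGT", "tgcaTGCA"))[::-1]


def toy_pcr(fwd: str, rev: str, template: str, limit: int = 13) -> str:
    """Return the single product formed by fwd and rev on a linear template."""
    t = template.lower()
    fwd_foot = fwd.lower()[-limit:]                     # 3' end of the forward primer
    rev_foot = reverse_complement(rev.lower())[:limit]  # 3' end of rev, read on the watson strand
    if t.count(fwd_foot) != 1 or t.count(rev_foot) != 1:
        raise ValueError("expected exactly one annealing site per primer")
    start = t.find(fwd_foot) + len(fwd_foot)  # first template base after the fwd footprint
    end = t.find(rev_foot)                    # first base of the rev footprint
    if end < start:
        raise ValueError("primers do not face each other on this template")
    # Primers (including any 5' tails) are incorporated into the product.
    return fwd.lower() + t[start:end] + reverse_complement(rev.lower())


template = "atgactgctaacccttccttggtgttgaacaagatcgacgacatttcgttcgaaacttacgatg"
fwd = "gg" + template[:15]                 # forward primer with a two-base 5' tail
rev = reverse_complement(template[-15:])   # reverse primer without a tail
product = toy_pcr(fwd, rev, template)
print(product == "gg" + template)  # True
```

Raising ValueError when a primer has zero or several annealing sites mirrors the documented contract of pcr, which expects exactly one product.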
Assembly of sequences by homologous recombination.
Should also be useful for related techniques such as Gibson assembly and fusion
PCR. Given a list of sequences (Dseqrecords), all sequences are analyzed for
shared homology longer than the set limit.
A graph is constructed where each overlapping region forms a node and
sequences separating the overlapping regions form edges.
Assembly of a list of linear DNA fragments into linear or circular
constructs. The Assembly is meant to replace the Assembly method as it
is easier to use. Accepts a list of Dseqrecords (source fragments) to
initiate an Assembly object. Several methods are available for analysis
of overlapping sequences, graph construction and assembly.
Finds the flanking common substrings between stringx and stringy
longer than limit. This means that the results only contain substrings
that start or end at the ends of stringx and stringy.
This function is case sensitive.
Returns a list of tuples describing the substrings.
The list is sorted longest -> shortest.
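A brute-force sketch of the idea, checking only overlaps where the end of one string matches the start of the other. The tuple layout (start in stringx, start in stringy, length) and the function name flanking_overlaps are assumptions made for this illustration; the real function may report its results differently and is likely to use a faster algorithm.

```python
def flanking_overlaps(stringx: str, stringy: str, limit: int = 15):
    """Return (start_in_x, start_in_y, length) for flanking overlaps longer than limit."""
    hits = set()
    for n in range(limit + 1, min(len(stringx), len(stringy)) + 1):
        if stringx[-n:] == stringy[:n]:   # end of x matches start of y
            hits.add((len(stringx) - n, 0, n))
        if stringx[:n] == stringy[-n:]:   # start of x matches end of y
            hits.add((0, len(stringy) - n, n))
    # Longest overlap first; ties broken by position for a stable order.
    return sorted(hits, key=lambda hit: (-hit[2], hit))


print(flanking_overlaps("ttttGATTACAGATTACA", "GATTACAGATTACAcccc", limit=5))
# [(4, 0, 14), (11, 0, 7)]
```

The comparison is case sensitive, as stated above, so "GATTACA" and "gattaca" would not be reported as an overlap.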
This module contains functions for primer design for various purposes.
primer_design() designs primers for a sequence or a matching primer for an existing primer. Returns an Amplicon object (same as the amplify module returns).
assembly_fragments() adds tails to primers for a linear assembly through homologous recombination or Gibson assembly.
circular_assembly_fragments() adds tails to primers for a circular assembly through homologous recombination or Gibson assembly.
This function designs a forward primer and a reverse primer for PCR amplification
of a given template sequence.
The template argument is a Dseqrecord object or equivalent containing the template sequence.
The optional fp and rp arguments can contain an existing primer for the sequence (either the forward or reverse primer).
One of the two primers can be specified, but not both (since then there is nothing to design; use the pydna.amplify.pcr function instead).
If one of the primers is given, the other primer is designed to match in terms of Tm.
If both primers are designed, they will be designed to target_tm.
tm_func is a function that takes an ASCII string representing an oligonucleotide as argument and returns a float.
Some useful functions can be found in the pydna.tm module, but a custom made function can be substituted.
The function returns a pydna.amplicon.Amplicon class instance. This object has
the object.forward_primer and object.reverse_primer properties which contain the designed primers.
fp (pydna.primer.Primer, optional) – optional pydna.primer.Primer objects containing one primer each.
rp (pydna.primer.Primer, optional) – optional pydna.primer.Primer objects containing one primer each.
target_tm (float, optional) – target tm for the primers, set to 55°C by default.
tm_func (function) – Function used for tm calculation. This function takes an ASCII string
representing an oligonucleotide as argument and returns a float.
Some useful functions can be found in the pydna.tm module, but a
custom made function can be substituted.
This function returns a list of pydna.amplicon.Amplicon objects where
primers have been modified with tails so that the fragments can be fused in
the order they appear in the list by, for example, Gibson assembly or homologous
recombination.
Given two Amplicon objects a and b, we can modify the reverse primer of a and forward primer of b with tails to allow
fusion by fusion PCR, Gibson assembly or in-vivo homologous recombination.
The basic requirements for the primers for the three techniques are the same.
The first argument of this function is a list of sequence objects containing
Amplicons and other similar objects.
At least every second sequence object needs to be an Amplicon.
This rule exists because if a sequence object that is not a PCR product
is to be fused with another fragment, that other fragment needs to be an Amplicon
so that the primer of the other object can be modified to include the whole stretch
of sequence homology needed for the fusion. See the example below where a is a
non-amplicon (a linear plasmid vector for instance).
The overlap argument controls how many base pairs of overlap are required between
adjacent sequence fragments. In the junction between Amplicons, tails with the
length of about half of this value are added to the two primers
closest to the junction.
In the case of an Amplicon adjacent to a Dseqrecord object, the tail will
be twice as long (1*overlap) since the
recombining sequence is present entirely on this primer.
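The arithmetic of the two cases can be sketched as follows. The function name tail_lengths, the way an odd overlap is split between the two primers, and the value 35 used in the example are assumptions made for illustration; the text above only says that each tail is about half of the overlap.

```python
def tail_lengths(overlap: int, left_is_amplicon: bool, right_is_amplicon: bool):
    """Return (tail on the left fragment's reverse primer,
    tail on the right fragment's forward primer) for one junction."""
    if left_is_amplicon and right_is_amplicon:
        # Both primers facing the junction share the overlap.
        left = overlap // 2
        return left, overlap - left
    if left_is_amplicon:
        # Right fragment cannot be modified: whole overlap on the left primer.
        return overlap, 0
    if right_is_amplicon:
        # Left fragment cannot be modified: whole overlap on the right primer.
        return 0, overlap
    raise ValueError("at least one fragment at each junction must be an Amplicon")


print(tail_lengths(35, True, True))    # (17, 18)
print(tail_lengths(35, True, False))   # (35, 0)
print(tail_lengths(35, False, True))   # (0, 35)
```

The ValueError branch corresponds to the rule that at least every second sequence object needs to be an Amplicon.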
Note that if the sequence of DNA fragments starts or stops with an Amplicon,
the very first and very last primer will not be modified, i.e. assemblies are
always assumed to be linear. There are simple tricks around that for circular
assemblies, depicted in the last two examples below.
The maxlink argument controls the cut off length for sequences that will be
synthesized by adding them to primers for the adjacent fragment(s). The
argument list may contain short spacers (such as spacers between fusion proteins).
Example 1: Linear assembly of PCR products (pydna.amplicon.Amplicon class objects)
>       <         >       <
Amplicon1         Amplicon3
         Amplicon2         Amplicon4
         >       <         >       <
                 ⇣
   pydna.design.assembly_fragments
                 ⇣
>       <-       ->       <-               pydna.assembly.Assembly
Amplicon1         Amplicon3
         Amplicon2         Amplicon4   ➤   Amplicon1Amplicon2Amplicon3Amplicon4
        ->       <-       ->       <
Example 2: Linear assembly of alternating Amplicons and other fragments
>       <         >       <
Amplicon1         Amplicon2
         Dseqrecd1         Dseqrecd2
                 ⇣
   pydna.design.assembly_fragments
                 ⇣
>       <--     -->       <--              pydna.assembly.Assembly
Amplicon1         Amplicon2
         Dseqrecd1         Dseqrecd2   ➤   Amplicon1Dseqrecd1Amplicon2Dseqrecd2
Example 3: Linear assembly of alternating Amplicons and other fragments
Dseqrecd1         Dseqrecd2
         Amplicon1         Amplicon2
         >       <       -->       <
                 ⇣
   pydna.design.assembly_fragments
                 ⇣
                                           pydna.assembly.Assembly
Dseqrecd1         Dseqrecd2
         Amplicon1         Amplicon2   ➤   Dseqrecd1Amplicon1Dseqrecd2Amplicon2
       -->       <--     -->       <
Example 4: Circular assembly of alternating Amplicons and other fragments
                 ->       <==
Dseqrecd1         Amplicon2
         Amplicon1         Dseqrecd1
       -->       <-
                 ⇣
   pydna.design.assembly_fragments
                 ⇣
                                           pydna.assembly.Assembly
                 ->       <==
Dseqrecd1         Amplicon2                -Dseqrecd1Amplicon1Amplicon2-
         Amplicon1                     ➤   |                           |
       -->       <-                        -----------------------------
Example 5: Circular assembly of Amplicons
>       <         >       <
Amplicon1         Amplicon3
         Amplicon2         Amplicon1
         >       <         >       <
                 ⇣
   pydna.design.assembly_fragments
                 ⇣
>       <=       ->       <-
Amplicon1         Amplicon3
         Amplicon2         Amplicon1
        ->       <-       +>       <
                 ⇣
   make new Amplicon using the Amplicon1.template and
   the last fwd primer and the first rev primer.
                 ⇣
                                            pydna.assembly.Assembly
+>       <=       ->       <-
 Amplicon1         Amplicon3                -Amplicon1Amplicon2Amplicon3-
          Amplicon2                     ➤   |                           |
         ->       <-                        -----------------------------
Parameters:
f (list of pydna.amplicon.Amplicon and other Dseqrecord like objects) – list of Amplicon and Dseqrecord objects for which fusion primers should be constructed.
overlap (int, optional) – Length of required overlap between fragments.
maxlink (int, optional) – Maximum length of spacer sequences that may be present in f. These will be included in tails for designed primers.
>>> from pydna.dseqrecord import Dseqrecord
>>> from pydna.design import primer_design
>>> a=primer_design(Dseqrecord("atgactgctaacccttccttggtgttgaacaagatcgacgacatttcgttcgaaacttacgatg"))
>>> b=primer_design(Dseqrecord("ccaaacccaccaggtaccttatgtaagtacttcaagtcgccagaagacttcttggtcaagttgcc"))
>>> c=primer_design(Dseqrecord("tgtactggtgctgaaccttgtatcaagttgggtgttgacgccattgccccaggtggtcgtttcgtt"))
>>> from pydna.design import assembly_fragments
>>> # We would like a circular recombination, so the first sequence has to be repeated
>>> fa1,fb,fc,fa2=assembly_fragments([a,b,c,a])
>>> # Since all fragments are Amplicons, we need to extract the rp of the 1st and fp of the last fragments.
>>> from pydna.amplify import pcr
>>> fa=pcr(fa2.forward_primer,fa1.reverse_primer,a)
>>> [fa,fb,fc]
[Amplicon(100), Amplicon(101), Amplicon(102)]
>>> fa.name,fb.name,fc.name="fa fb fc".split()
>>> from pydna.assembly import Assembly
>>> assemblyobj=Assembly([fa,fb,fc])
>>> assemblyobj
Assembly
fragments....: 100bp 101bp 102bp
limit(bp)....: 25
G.nodes......: 6
algorithm....: common_sub_strings
>>> assemblyobj.assemble_linear()
[Contig(-231), Contig(-166), Contig(-36)]
>>> assemblyobj.assemble_circular()[0].seguid()
'cdseguid=85t6tfcvWav0wnXEIb-lkUtrl4s'
>>> (a+b+c).looped().seguid()
'cdseguid=85t6tfcvWav0wnXEIb-lkUtrl4s'
>>> print(assemblyobj.assemble_circular()[0].figure())
 -|fa|36
|     \/
|     /\
|     36|fb|36
|           \/
|           /\
|           36|fc|36
|                 \/
|                 /\
|                 36-
|                    |
 --------------------
>>>
The Editor class needs to be instantiated before use.
Parameters:
shell_command_for_editor (str) – String containing the path to the editor
tmpdir (str, optional) – String containing the path to the temporary directory where sequence
files are stored before opening.
Examples
>>> import pydna
>>> #ape = pydna.Editor("tclsh8.6 /home/bjorn/.ApE/apeextractor/ApE.vfs/lib/app-AppMain/AppMain.tcl")
>>> #ape.open("aaa") # This command opens the sequence in the ApE editor
This module provides a class for downloading sequences from Genbank
called Genbank and a function that does the same thing called genbank.
The function can be used if the environment variable pydna_email has
been set to a valid email address. The easiest way to do this permanently is to edit the
pydna.ini file. See the documentation of pydna.open_config_folder().
This function takes the same parameters as the
pydna.genbank.Genbank.nucleotide method. The email address stored
in the pydna_email environment variable is used. The easiest way to set
this permanently is to edit the pydna.ini file.
See the documentation of pydna.open_config_folder().
If no accession is given, a very short Genbank
entry
is used as an example (see below). This can be useful for testing the
connection to Genbank.
Please note that this result is also cached by default by settings in
the pydna.ini file.
See the documentation of pydna.open_config_folder().
LOCUS       CS570233                  14 bp    DNA     linear   PAT 18-MAY-2007
DEFINITION  Sequence 6 from Patent WO2007025016.
ACCESSION   CS570233
VERSION     CS570233.1
KEYWORDS    .
SOURCE      synthetic construct
  ORGANISM  synthetic construct
            other sequences; artificial sequences.
REFERENCE   1
  AUTHORS   Shaw,R.W. and Cottenoir,M.
  TITLE     Inhibition of metallo-beta-lactamase by double-stranded dna
  JOURNAL   Patent: WO 2007025016-A1 6 01-MAR-2007;
            Texas Tech University System (US)
FEATURES             Location/Qualifiers
     source          1..14
                     /organism="synthetic construct"
                     /mol_type="unassigned DNA"
                     /db_xref="taxon:32630"
                     /note="This is a 14bp aptamer inhibitor."
ORIGIN
        1 atgttcctac atga
//
This module provides the gbtext_clean() function which can clean up broken Genbank files enough to
pass the Biopython Genbank parser.
Almost all of this code was lifted from BioJSON (https://github.com/levskaya/BioJSON) by Anselm Levskaya.
The original code was not accompanied by any software licence. This parser is based on pyparsing.
There are some modifications to deal with fringe cases.
The parser first produces JSON as an intermediate format which is then formatted back into a
string in Genbank format.
The parser is not complete, so some fields do not survive the roundtrip (see below).
This should not be a difficult fix. The returned result has two properties,
.jseq which is the intermediate JSON produced by the parser and .gbtext
which is the formatted genbank string.
This function takes a string containing one genbank sequence
in Genbank format and returns a named tuple containing two fields,
the gbtext containing a string with the corrected genbank sequence and
jseq which contains the JSON intermediate.
Examples
>>> s='''LOCUS New_DNA 3 bp DNA CIRCULAR SYN 19-JUN-2013
... DEFINITION .
... ACCESSION
... VERSION
... SOURCE .
... ORGANISM .
... COMMENT
... COMMENT ApEinfo:methylated:1
... ORIGIN
... 1 aaa
... //'''
>>> from pydna.readers import read
>>> read(s)
/home/bjorn/anaconda3/envs/bjorn36/lib/python3.6/site-packages/Bio/GenBank/Scanner.py:1388: BiopythonParserWarning: Malformed LOCUS line found - is this correct?
:'LOCUS New_DNA 3 bp DNA CIRCULAR SYN 19-JUN-2013\n'
  "correct?\n:%r" % line, BiopythonParserWarning)
Traceback (most recent call last):
  File "/home/bjorn/python_packages/pydna/pydna/readers.py", line 48, in read
    results=results.pop()
IndexError: pop from empty list
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bjorn/python_packages/pydna/pydna/readers.py", line 50, in read
    raise ValueError("No sequences found in data:\n({})".format(data[:79]))
ValueError: No sequences found in data:
(LOCUS New_DNA 3 bp DNA CIRCULAR SYN 19-JUN-2013
DEFINITI)
>>> from pydna.genbankfixer import gbtext_clean
>>> s2,j2=gbtext_clean(s)
>>> print(s2)
LOCUS New_DNA 3 bp ds-DNA circular SYN 19-JUN-2013
DEFINITION .
ACCESSION
VERSION
SOURCE .
ORGANISM .
COMMENT
COMMENT ApEinfo:methylated:1
FEATURES Location/Qualifiers
ORIGIN
 1 aaa
//
>>> s3=read(s2)
>>> s3
Dseqrecord(o3)
>>> print(s3.format())
LOCUS New_DNA 3 bp DNA circular SYN 19-JUN-2013
DEFINITION .
ACCESSION New_DNA
VERSION New_DNA
KEYWORDS .
SOURCE
 ORGANISM .
 .
COMMENT ApEinfo:methylated:1
FEATURES Location/Qualifiers
ORIGIN
 1 aaa
//
The primers can be of any format readable by the parse_primers
function. Lines beginning with # are ignored. Path defaults to
the path given by the pydna_primers environment variable.
The primer list does not accept new primers. Use the
assign_numbers_to_new_primers method and paste the new
primers at the top of the list.
The primer list remembers the numbers of accessed primers.
The indices of accessed primers are stored in the .accessed
property.
If no sequences are found, an empty list is returned. This is a greedy
function, use carefully.
Parameters:
data (string or iterable) –
The data parameter is a string containing:
1. an absolute path to a local file.
The file will be read in text
mode and parsed for EMBL, FASTA
and Genbank sequences. Can be
a string or a Path object.
2. a string containing one or more
sequences in EMBL, GENBANK,
or FASTA format. Mixed formats
are allowed.
data can be a list or other iterable where the elements are 1 or 2.
ds (bool) – If True double stranded Dseqrecord objects are returned.
If False single stranded Bio.SeqRecord[7] objects are
returned.
This checksum is the same as seguid but with base64.urlsafe
encoding instead of the normal base64. This means that
the characters + and / are replaced with - and _ so that
the checksum can be a part of a URL or a filename.
Examples
>>> from pydna.seqrecord import SeqRecord
>>> a=SeqRecord("gattaca")
>>> a.seguid() # original seguid is +bKGnebMkia5kNg/gF7IORXMnIU
'lsseguid=tp2jzeCM2e3W4yxtrrx09CMKa_8'
Return the longest common substring between the sequence
and another sequence (other). The other sequence can be a string,
Seq, SeqRecord, Dseq or Dseqrecord.
The method returns a SeqFeature with type “read” as this method
is mostly used to map sequence reads to the sequence. This can be
changed by passing a type as keyword with some other string value.
Algorithm described in Duval, Jean-Pierre. 1983. Factorizing Words
over an Ordered Alphabet. Journal of Algorithms
4 (4) (December 1): 363–381, and in Algorithms on strings and sequences based
on Lyndon words, David Eppstein 2011.
https://gist.github.com/dvberkel/1950267
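The references above concern Lyndon words, and Duval's factorization is commonly used to find the lexicographically smallest rotation of a string in linear time. Assuming that this is the purpose here, a minimal sketch could look as follows. It is written from the general description of the algorithm rather than from pydna's source, and the function name smallest_rotation is an assumption.

```python
def smallest_rotation(s: str) -> str:
    """Return the lexicographically smallest rotation of s (Duval-style scan)."""
    n = len(s)
    doubled = s + s          # every rotation of s is a substring of s + s
    i, best = 0, 0
    while i < n:
        best = i             # start of the current Lyndon factor
        j, k = i + 1, i
        while j < 2 * n and doubled[k] <= doubled[j]:
            if doubled[k] < doubled[j]:
                k = i        # factor keeps growing, restart the comparison
            else:
                k += 1       # still inside a repetition of the factor
            j += 1
        while i <= k:
            i += j - k       # skip over the repetitions just scanned
    return doubled[best:best + n]


print(smallest_rotation("taaa"))  # aaat
print(smallest_rotation("bca"))   # abc
```

A canonical rotation of this kind is useful for circular sequences, since two circular permutations of the same string share the same smallest rotation.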
Turn a three letter code protein sequence into one with one letter code.
The single input argument ‘seq’ should be a protein sequence using three
letter codes, as a python string.
This function returns the amino acid sequence as a string using the one
letter amino acid codes. Output follows the IUPAC standard (including
ambiguous characters B for “Asx”, J for “Xle” and X for “Xaa”, and also U
for “Sel” and O for “Pyl”) plus “Ter” for a terminator given as an
asterisk.
Any unknown
character (including possible gap characters), is changed into ‘Xaa’.
Examples
>>> from Bio.SeqUtils import seq3
>>> seq3("MAIVMGRWKGAR*")
'MetAlaIleValMetGlyArgTrpLysGlyAlaArgTer'
>>> from pydna.utils import seq31
>>> seq31('MetAlaIleValMetGlyArgTrpLysGlyAlaArgTer')
'M A I V M G R W K G A R *'
Compares two or more DNA sequences for equality i.e. if they
represent the same DNA molecule.
Two linear sequences are considered equal if either:
They have the same sequence (case insensitive)
One sequence is the reverse complement of the other
Two circular sequences are considered equal if they are circular
permutations meaning that they have the same length and:
One sequence can be found in the concatenation of the other sequence with itself.
The reverse complement of one sequence can be found in the concatenation of the other sequence with itself.
The topology for the comparison can be set using one of the keywords
linear or circular to True or False.
If circular or linear is not set, it will be deduced from the topology of
each sequence for sequences that have a linear or circular attribute
(like Dseq and Dseqrecord).
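The rules above translate almost directly into string operations. The following is a sketch on plain strings with hypothetical helper names (same_linear, same_circular, reverse_complement); it does not handle ambiguous bases or the automatic deduction of topology from Dseq and Dseqrecord objects.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(str.maketrans("acgtACGT", "tgcaTGCA"))[::-1]


def same_linear(a: str, b: str) -> bool:
    """True if a and b are the same linear molecule (case insensitive)."""
    a, b = a.lower(), b.lower()
    return a == b or a == reverse_complement(b)


def same_circular(a: str, b: str) -> bool:
    """True if a and b are circular permutations of the same molecule."""
    a, b = a.lower(), b.lower()
    if len(a) != len(b):
        return False
    doubled = b + b  # contains every rotation of b
    return a in doubled or reverse_complement(a) in doubled


print(same_linear("GATTACA", "tgtaatc"))    # True, reverse complements
print(same_linear("gattaca", "attacag"))    # False as linear molecules
print(same_circular("gattaca", "attacag"))  # True as circular molecules
```

The length check matters: without it a short sequence would be found inside the doubled longer one and wrongly reported as equal.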