pcr is a convenience function for the Anneal class to simplify its
usage, especially from the command line. If more than one or no PCR
product is formed, a ValueError is raised.
args is any iterable of Dseqrecords or an iterable of iterables of
Dseqrecords. args will be greedily flattened.
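The greedy flattening described above can be pictured with a small sketch. This is not pydna's implementation: the helper name flatten is hypothetical, and plain strings stand in for sequence objects so that the example runs without pydna installed.

```python
from collections.abc import Iterable


def flatten(args):
    """Recursively flatten nested iterables, treating strings as single items."""
    flat = []
    for item in args:
        # Strings are iterable, but here each one stands for a whole sequence.
        if isinstance(item, Iterable) and not isinstance(item, (str, bytes)):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


# Hypothetical input: two primers and a template, nested arbitrarily.
nested = ["primer_fwd", ["primer_rev", ["template"]]]
flat = flatten(nested)

# The last item is taken as the template, everything before it as primers.
*primers, template = flat
print(primers, template)
```

Whatever the nesting, the order of the items is preserved, which is why the position of the template matters.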
Parameters:
args (iterable containing sequence objects) – Several arguments are also accepted.
limit (int = 13, optional) – limit length of the annealing part of the primers.
Notes
sequences in args could be of type:
string
Seq
SeqRecord (or subclass)
Dseqrecord (or subclass)
The last sequence will be assumed to be the template while
all preceding sequences will be assumed to be primers.
This is a powerful function, use with care!
Returns:
product – A pydna.amplicon.Amplicon object representing the PCR
product. The direction of the PCR product will be the same as for
the template sequence.
Assembly of a list of linear DNA fragments into linear or circular
constructs. The Assembly is meant to replace the Assembly method as it
is easier to use. Accepts a list of Dseqrecords (source fragments) to
initiate an Assembly object. Several methods are available for analysis
of overlapping sequences, graph construction and assembly.
This function takes the same parameters as the
:func:pydna.genbank.Genbank.nucleotide method. The email address stored
in the pydna_email environment variable is used. The easiest way to set
this permanently is to edit the pydna.ini file.
See the documentation of pydna.open_config_folder()
If no accession is given, a very short Genbank entry
is used as an example (see below). This can be useful for testing the
connection to Genbank.
Please note that this result is also cached by default by settings in
the pydna.ini file.
See the documentation of pydna.open_config_folder()
LOCUS       CS570233                  14 bp    DNA     linear   PAT 18-MAY-2007
DEFINITION  Sequence 6 from Patent WO2007025016.
ACCESSION   CS570233
VERSION     CS570233.1
KEYWORDS    .
SOURCE      synthetic construct
  ORGANISM  synthetic construct
            other sequences; artificial sequences.
REFERENCE   1
  AUTHORS   Shaw,R.W. and Cottenoir,M.
  TITLE     Inhibition of metallo-beta-lactamase by double-stranded dna
  JOURNAL   Patent: WO 2007025016-A1 6 01-MAR-2007;
            Texas Tech University System (US)
FEATURES             Location/Qualifiers
     source          1..14
                     /organism="synthetic construct"
                     /mol_type="unassigned DNA"
                     /db_xref="taxon:32630"
                     /note="This is a 14bp aptamer inhibitor."
ORIGIN
        1 atgttcctac atga
//
Dseqrecord is a double stranded version of the Biopython SeqRecord [1] class.
The Dseqrecord object holds a Dseq object describing the sequence.
Additionally, Dseqrecord holds meta information about the sequence in the
form of a list of SeqFeatures, in the same way as the SeqRecord does.
The Dseqrecord can be initialized with a string, Seq, Dseq, SeqRecord
or another Dseqrecord. The sequence information will be stored in a
Dseq object in all cases.
Dseqrecord objects can be read or parsed from sequences in FASTA, EMBL or Genbank formats.
See the pydna.readers and pydna.parsers modules for further information.
There is a short representation associated with the Dseqrecord.
Dseqrecord(-3) represents a linear sequence of length 3,
while Dseqrecord(o7) represents a circular sequence of length 7.
Dseqrecord and Dseq share the same concept of length. This length can be larger
than each strand alone if they are staggered as in the example below.
<-- length -->
GATCCTTT
     AAAGCCTAG
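A rough sketch of how the length of a staggered molecule can be worked out from the two strands and their offset. The function duplex_length is hypothetical and not part of pydna; it assumes the sign convention in which a negative ovhg means the watson strand protrudes at the left end.

```python
def duplex_length(watson: str, crick: str, ovhg: int) -> int:
    """Length of the region covered by at least one of the two strands."""
    # A positive ovhg shifts the watson strand to the right,
    # a negative ovhg shifts the crick strand to the right (assumed convention).
    watson_span = max(ovhg, 0) + len(watson)
    crick_span = max(-ovhg, 0) + len(crick)
    return max(watson_span, crick_span)


# The staggered example: the watson strand is GATCCTTT and the lower strand,
# drawn 3'->5' as AAAGCCTAG, reads GATCCGAAA when written 5'->3'.
length = duplex_length("GATCCTTT", "GATCCGAAA", ovhg=-5)
print(length)
```

Here the watson strand has 8 nucleotides and the crick strand 9, yet the molecule spans 14 positions because only three base pairs overlap.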
Parameters:
record (string, Seq, SeqRecord, Dseq or other Dseqrecord object) – This data will be used to form the seq property
circular (bool, optional) – True or False reflecting the shape of the DNA molecule
linear (bool, optional) – True or False reflecting the shape of the DNA molecule
This checksum is the same as seguid but with base64.urlsafe
encoding instead of the normal base64. This means that
the characters + and / are replaced with - and _ so that
the checksum can be part of a URL.
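The difference between the two encodings can be sketched with the standard library alone. This assumes the classic SEGUID definition (SHA-1 of the upper-cased sequence, base64-encoded, padding removed); seguid_digest is a hypothetical helper, not the pydna method, and recent pydna versions may add a prefix to the checksum.

```python
import base64
import hashlib


def seguid_digest(seq: str, urlsafe: bool = False) -> str:
    """SHA-1 checksum of a sequence, base64-encoded without padding."""
    digest = hashlib.sha1(seq.upper().encode("ascii")).digest()
    # urlsafe_b64encode uses '-' and '_' in place of '+' and '/'.
    encode = base64.urlsafe_b64encode if urlsafe else base64.b64encode
    return encode(digest).decode("ascii").rstrip("=")


normal = seguid_digest("atgttcctacatga")
safe = seguid_digest("atgttcctacatga", urlsafe=True)
print(normal)
print(safe)
```

The two strings differ only where the standard alphabet would have produced + or /.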
Writes the Dseqrecord to a file using the format f, which must
be a format supported by Biopython SeqIO for writing [3]. Default
is “gb” which is short for Genbank. Note that Biopython SeqIO reads
more formats than it writes.
Filename is the path to the file where the sequence is to be
written. The filename is optional; if it is not given, the
description property (string) is used together with the format.
If obj is the Dseqrecord object, the default file name will be:
<obj.locus>.<f>
Where <f> is “gb” by default. If the filename already exists
and the sequence it contains is different, a new file name will be
used so that the old file is not lost.
This method returns a new circular sequence (Dseqrecord object), which has been rotated
in such a way that there is maximum overlap between the sequence and
ref, which may be a string, Biopython Seq, SeqRecord object or
another Dseqrecord object.
The reason for using this could be to rotate a new recombinant plasmid so
that it starts at the same position after cloning. See the example below:
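The idea of rotating a circular sequence to a reference start can be sketched on plain strings. This is a deliberate simplification: rotate_to is a hypothetical helper that only looks for an exact match of the start of the reference on one strand, whereas the method described here searches for maximum overlap.

```python
def rotate_to(seq: str, ref_start: str) -> str:
    """Rotate a circular sequence so that it begins with ref_start."""
    if not ref_start or len(ref_start) > len(seq):
        raise ValueError("ref_start must be non-empty and no longer than seq")
    # Doubling the sequence makes matches that span the origin visible.
    doubled = (seq + seq).lower()
    start = doubled.find(ref_start.lower())
    if start == -1:
        raise ValueError("ref_start not found in the circular sequence")
    return seq[start:] + seq[:start]


rotated = rotate_to("cccataaagggg", "aaagg")
wrapped = rotate_to("ggaaacc", "ccgg")  # the match spans the origin
print(rotated, wrapped)
```

The rotated string represents the same circular molecule; only the position chosen as the origin changes.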
Digest a Dseqrecord object with one or more restriction enzymes.
Returns a list of linear Dseqrecords. If there are no cuts, an empty
list is returned.
See also Dseq.cut()
Parameters:
enzymes (enzyme object or iterable of such objects) – A Bio.Restriction.XXX restriction object or iterable of such.
Returns:
Dseqrecord_frags – list of Dseqrecord objects formed by the digestion
Dseq holds information for a double stranded DNA fragment.
Dseq also holds information describing the topology of
the DNA fragment (linear or circular).
Parameters:
watson (str) – a string representing the watson (sense) DNA strand.
crick (str, optional) – a string representing the crick (antisense) DNA strand.
ovhg (int, optional) – A positive or negative number to describe the stagger between the
watson and crick strands.
see below for a detailed explanation.
linear (bool, optional) – True indicates that sequence is linear, False that it is circular.
circular (bool, optional) – True indicates that sequence is circular, False that it is linear.
Examples
Dseq is a subclass of the Biopython Seq object. It stores two
strings representing the watson (sense) and crick (antisense) strands,
two properties called linear and circular, and a numeric value ovhg
(overhang) describing the stagger for the watson and crick strand
at the 5’ end of the fragment.
The most common usage is probably to create a Dseq object as a
part of a Dseqrecord object (see pydna.dseqrecord.Dseqrecord).
There are three ways of creating a Dseq object directly listed below, but you can also
use the function Dseq.from_full_sequence_and_overhangs() to create a Dseq:
One argument (string):
The given string will be interpreted as the watson strand of a
blunt, linear double stranded sequence object. The crick strand
is created automatically from the watson strand.
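For a blunt molecule the crick strand is simply the reverse complement of the watson strand. A minimal sketch on plain strings follows; the helper reverse_complement is written for this example, is not pydna's own function, and handles only the four unambiguous bases.

```python
# Translation table for the four unambiguous bases, in both cases.
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(strand: str) -> str:
    """Return the reverse complement of a DNA strand written 5'->3'."""
    return strand.translate(COMPLEMENT)[::-1]


watson = "catcgatc"
crick = reverse_complement(watson)  # written 5'->3'
print(watson)
print(crick[::-1])  # the crick strand drawn 3'->5', underneath the watson strand
```

Printing the crick strand reversed gives the two-line picture used in the Dseq representation.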
Two arguments (string, string):
If both watson and crick are given, but not ovhg, an attempt
will be made to find the best annealing between the strands.
There are limitations to this. For long fragments it is quite
slow. The length of the annealing sequence has to be at least
half the length of the shortest of the strands.
Three arguments (string, string, ovhg=int):
The ovhg parameter is an integer describing the length of the
crick strand overhang in the 5’ end of the molecule.
The ovhg parameter controls the stagger at the five prime end:
If the ovhg parameter is specified a crick strand also
needs to be supplied, otherwise an exception is raised.
>>> Dseq(watson="agt", ovhg=2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pydna_/dsdna.py", line 169, in __init__
    else:
ValueError: ovhg defined without crick strand!
The shape of the fragment is set by circular = True, False
Note that both ends of the DNA fragment have to be compatible to set
circular = True.
This can only be done if the two ends are compatible,
otherwise a TypeError is raised.
Examples
>>> from pydna.dseq import Dseq
>>> a = Dseq("catcgatc")
>>> a
Dseq(-8)
catcgatc
gtagctag
>>> a.looped()
Dseq(o8)
catcgatc
gtagctag
>>> a.T4("t")
Dseq(-8)
catcgat
 tagctag
>>> a.T4("t").looped()
Dseq(o7)
catcgat
gtagcta
>>> a.T4("a")
Dseq(-8)
catcga
  agctag
>>> a.T4("a").looped()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pydna/dsdna.py", line 357, in looped
    if type5 == type3 and str(sticky5) == str(rc(sticky3)):
TypeError: DNA cannot be circularized.
5' and 3' sticky ends not compatible!
>>>
Fill in of five prime protruding end with a DNA polymerase
that has only DNA polymerase activity (such as exo-klenow [4])
and any combination of A, G, C or T. The default is all four
nucleotides together.
Transcribe a DNA sequence into RNA and return the RNA sequence as a new Seq object.
Following the usual convention, the sequence is interpreted as the
coding strand of the DNA double helix, not the template strand. This
means we can get the RNA sequence just by switching T to U.
As Seq objects are immutable, a TypeError is raised if
transcribe is called on a Seq object with inplace=True.
Trying to transcribe an RNA sequence has no effect.
If you have a nucleotide sequence which might be DNA or RNA
(or even a mixture), calling the transcribe method will ensure
any T becomes U.
Trying to transcribe a protein sequence will replace any
T for Threonine with U for Selenocysteine, which has no
biologically plausible rationale.
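Because the sequence is read as the coding strand, transcription amounts to a character substitution. The sketch below works on plain strings rather than Seq objects, and transcribe_str is a name invented for this example.

```python
def transcribe_str(seq: str) -> str:
    """Replace T with U (both cases), leaving every other character untouched."""
    return seq.replace("T", "U").replace("t", "u")


coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
messenger_rna = transcribe_str(coding_dna)
print(messenger_rna)

# RNA input is returned unchanged, and a mixed input loses all of its T's.
print(transcribe_str("AUGGCC"))
print(transcribe_str("ATGgcuTT"))
```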
Fill in five prime protruding ends and chew back
three prime protruding ends with a DNA polymerase providing both
5’-3’ DNA polymerase activity and 3’-5’ nuclease activity
(such as T4 DNA polymerase). This can be done in the presence of any
combination of the four nucleotides A, G, C or T. Removing one or more nucleotides
can facilitate engineering of sticky ends. The default is all four nucleotides together.
Returns False if:
- Cut positions fall outside the sequence (could be moved to Biopython)
- Overhang is not double stranded
- Recognition site is not double stranded or is outside the sequence
- For enzymes that cut twice, it checks that at least one possibility is valid
Returns a list of cutsites, represented as ((cut_watson, ovhg), enz):
cut_watson is a positive integer contained in [0,len(seq)), where seq is the sequence
that will be cut. It represents the position of the cut on the watson strand, using the full
sequence as a reference. By “full sequence” I mean the one you would get from str(Dseq).
ovhg is the overhang left after the cut. It has the same meaning as ovhg in
the Bio.Restriction enzyme objects, or pydna’s Dseq property.
enz is the enzyme object. It’s not necessary to perform the cut, but can be
used to keep track of which enzyme was used.
Cuts are only returned if the recognition site and overhang are on the double-strand
part of the sequence.
For a given cut expressed as ((cut_watson, ovhg), enz), returns
a tuple (cut_watson, cut_crick, ovhg).
cut_watson: see get_cutsites docs
cut_crick: equivalent of cut_watson in the crick strand
ovhg: see get_cutsites docs
The cut can be None if it represents the left or right end of the sequence.
Then it will return the position of the watson and crick ends with respect
to the “full sequence”. The is_left parameter is only used in this case.
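The relation between the two cut positions can be sketched as follows. This assumes that cut_crick equals cut_watson minus ovhg, with negative ovhg values denoting 5’ overhangs as in Bio.Restriction; the function cut_positions is hypothetical and ignores the None case described above.

```python
def cut_positions(cut_watson: int, ovhg: int, seq_len: int, circular: bool = False):
    """Return (cut_watson, cut_crick, ovhg) in full-sequence coordinates."""
    # Assumption: the crick cut is displaced from the watson cut by -ovhg.
    cut_crick = cut_watson - ovhg
    if circular:
        # On a circular molecule the position wraps around the origin.
        cut_crick %= seq_len
    return cut_watson, cut_crick, ovhg


# EcoRI (G^AATTC) on aaGAATTCaa: the watson strand is cut after the G at index 2,
# and the enzyme leaves a four-nucleotide 5' overhang (ovhg = -4).
linear_cut = cut_positions(3, -4, seq_len=10)
circular_cut = cut_positions(8, -4, seq_len=10, circular=True)
print(linear_cut, circular_cut)
```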
Returns pairs of cutsites that render the edges of the resulting fragments.
A fragment produced by restriction is represented by a tuple of length 2 that
may contain cutsites or None:
Two cutsites: represents the extraction of a fragment between those two
cutsites, in that orientation. To represent the opening of a circular
molecule with a single cutsite, we put the same cutsite twice.
None, cutsite: represents the extraction of a fragment between the left
edge of linear sequence and the cutsite.
cutsite, None: represents the extraction of a fragment between the cutsite
and the right edge of a linear sequence.
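The three cases above can be generated by pairing each cutsite with its neighbour. The sketch below is an assumption about how such pairs could be built, not a copy of the library code; cutsite_pairs is a hypothetical function and the cutsites are assumed to be sorted already.

```python
def cutsite_pairs(cutsites, circular):
    """Pair consecutive cutsites so that each pair delimits one fragment."""
    if not cutsites:
        return []
    if circular:
        # Close the circle: the last fragment ends at the first cutsite.
        ordered = [*cutsites, cutsites[0]]
    else:
        # None marks the left and right ends of a linear molecule.
        ordered = [None, *cutsites, None]
    return list(zip(ordered, ordered[1:]))


# Placeholder cutsites in the ((cut_watson, ovhg), enz) form, with enz as a string.
c1 = ((3, -4), "EcoRI")
c2 = ((11, -4), "BamHI")

linear_pairs = cutsite_pairs([c1, c2], circular=False)
circular_pairs = cutsite_pairs([c1], circular=True)
print(linear_pairs)
print(circular_pairs)
```

A single cut in a circular molecule yields one pair holding the same cutsite twice, which represents the linearised molecule.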
If no sequences are found, an empty list is returned. This is a greedy
function, use carefully.
Parameters:
data (string or iterable) –
The data parameter is a string containing:
1. an absolute path to a local file. The file will be read in text mode and parsed for EMBL, FASTA and Genbank sequences. Can be a string or a Path object.
2. a string containing one or more sequences in EMBL, GENBANK, or FASTA format. Mixed formats are allowed.
3. data can be a list or other iterable where the elements are 1 or 2.
ds (bool) – If True double stranded Dseqrecord objects are returned.
If False single stranded Bio.SeqRecord[6] objects are
returned.
This function designs a forward primer and a reverse primer for PCR amplification
of a given template sequence.
The template argument is a Dseqrecord object or equivalent containing the template sequence.
The optional fp and rp arguments can contain an existing primer for the sequence (either the forward or reverse primer).
Only one of the two primers can be specified, not both (since then there is nothing to design; use the pydna.amplify.pcr function instead).
The limit argument is the minimum length of the primer. The default value is 13.
If one of the primers is given, the other primer is designed to match in terms of Tm.
If both primers are designed, they will be designed to target_tm.
tm_func is a function that takes an ascii string representing an oligonucleotide as argument and returns a float.
Some useful functions can be found in the pydna.tm module, but can be substituted for a custom made function.
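A custom tm_func only has to accept a string and return a float. As an illustration, the sketch below uses the Wallace rule (2 °C per A or T, 4 °C per G or C), a rough estimate meant for short oligonucleotides. The name tm_wallace is made up for this example and is not one of the functions in pydna.tm.

```python
def tm_wallace(primer: str) -> float:
    """Estimate the melting temperature of a short oligonucleotide in degrees C."""
    seq = primer.upper()
    at_count = seq.count("A") + seq.count("T")
    gc_count = seq.count("G") + seq.count("C")
    # Wallace rule: 2 degrees per A/T and 4 degrees per G/C.
    return float(2 * at_count + 4 * gc_count)


estimate = tm_wallace("atgactgctaaccc")
print(estimate)
```

Such a function could then be supplied through the tm_func argument, for example primer_design(template, tm_func=tm_wallace).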
estimate_function is a tm_func-like function that is used to get a first guess for the primer design, that is then used as starting
point for the final result. This is useful when the tm_func function is slow to calculate (e.g. it relies on an
external API, such as the NEB primer design API). The estimate_function should be faster than the tm_func function.
The default value is None.
To use the default tm_func as estimate function to get the NEB Tm faster, you can do:
primer_design(dseqr, target_tm=55, tm_func=tm_neb, estimate_function=tm_default).
The function returns a pydna.amplicon.Amplicon class instance. This object has
the object.forward_primer and object.reverse_primer properties which contain the designed primers.
fp (pydna.primer.Primer, optional) – optional pydna.primer.Primer objects containing one primer each.
rp (pydna.primer.Primer, optional) – optional pydna.primer.Primer objects containing one primer each.
target_tm (float, optional) – target tm for the primers, set to 55°C by default.
tm_func (function) – Function used for tm calculation. This function takes an ascii string
representing an oligonucleotide as argument and returns a float.
Some useful functions can be found in the pydna.tm module, but can be
substituted for a custom made function.
This function returns a list of pydna.amplicon.Amplicon objects where
primers have been modified with tails so that the fragments can be fused in
the order they appear in the list by for example Gibson assembly or homologous
recombination.
We can modify the reverse primer of a and forward primer of b with tails to allow
fusion by fusion PCR, Gibson assembly or in-vivo homologous recombination.
The basic requirements for the primers for the three techniques are the same.
The first argument of this function is a list of sequence objects containing
Amplicons and other similar objects.
At least every second sequence object needs to be an Amplicon.
This rule exists because if a sequence object that is not a PCR product
is to be fused with another fragment, that other fragment needs to be an Amplicon
so that the primer of the other object can be modified to include the whole stretch
of sequence homology needed for the fusion. See the example below where a is a
non-amplicon (a linear plasmid vector for instance).
The overlap argument controls how many base pairs of overlap are required between
adjacent sequence fragments. In the junction between Amplicons, tails with a
length of about half of this value are added to the two primers
closest to the junction.
In the case of an Amplicon adjacent to a Dseqrecord object, the tail will
be twice as long (1*overlap) since the
recombining sequence is present entirely on this primer:
Note that if the sequence of DNA fragments starts or stops with an Amplicon,
the very first and very last primer will not be modified, i.e. assemblies are
always assumed to be linear. There are simple tricks around that for circular
assemblies depicted in the last two examples below.
The maxlink argument controls the cut off length for sequences that will be
synthesized by adding them to primers for the adjacent fragment(s). The
argument list may contain short spacers (such as spacers between fusion proteins).
Example 1: Linear assembly of PCR products (pydna.amplicon.Amplicon class objects)

    >       <         >       <
    Amplicon1         Amplicon3
             Amplicon2         Amplicon4
             >       <         >       <
                     ⇣
                     pydna.design.assembly_fragments
                     ⇣
    >       <-       ->       <-             pydna.assembly.Assembly
    Amplicon1         Amplicon3
             Amplicon2         Amplicon4     ➤  Amplicon1Amplicon2Amplicon3Amplicon4
            ->       <-       ->       <
Example 2: Linear assembly of alternating Amplicons and other fragments

    >       <         >       <
    Amplicon1         Amplicon2
             Dseqrecd1         Dseqrecd2
                     ⇣
                     pydna.design.assembly_fragments
                     ⇣
    >       <--     -->       <--            pydna.assembly.Assembly
    Amplicon1         Amplicon2
             Dseqrecd1         Dseqrecd2     ➤  Amplicon1Dseqrecd1Amplicon2Dseqrecd2
Example 3: Linear assembly of alternating Amplicons and other fragments

    Dseqrecd1         Dseqrecd2
             Amplicon1         Amplicon2
             >       <       -->       <
                     ⇣
                     pydna.design.assembly_fragments
                     ⇣
                                             pydna.assembly.Assembly
    Dseqrecd1         Dseqrecd2
             Amplicon1         Amplicon2     ➤  Dseqrecd1Amplicon1Dseqrecd2Amplicon2
           -->       <--     -->       <
Example 4: Circular assembly of alternating Amplicons and other fragments

                     ->       <==
    Dseqrecd1         Amplicon2
             Amplicon1         Dseqrecd1
           -->       <-
                     ⇣
                     pydna.design.assembly_fragments
                     ⇣
                                             pydna.assembly.Assembly
                     ->       <==
    Dseqrecd1         Amplicon2                 -Dseqrecd1Amplicon1Amplicon2-
             Amplicon1                  ➤       |                           |
           -->       <-                         -----------------------------
Example 5: Circular assembly of Amplicons

    >       <         >       <
    Amplicon1         Amplicon3
             Amplicon2         Amplicon1
             >       <         >       <
                     ⇣
                     pydna.design.assembly_fragments
                     ⇣
    >       <=       ->       <-
    Amplicon1         Amplicon3
             Amplicon2         Amplicon1
            ->       <-       +>       <
                     ⇣
             make new Amplicon using the Amplicon1.template and
             the last fwd primer and the first rev primer.
                     ⇣
                                             pydna.assembly.Assembly
   +>       <=       ->       <-
    Amplicon1         Amplicon3                 -Amplicon1Amplicon2Amplicon3-
             Amplicon2                  ➤       |                           |
            ->       <-                         -----------------------------
Parameters:
f (list of pydna.amplicon.Amplicon and other Dseqrecord like objects) – list Amplicon and Dseqrecord object for which fusion primers should be constructed.
overlap (int, optional) – Length of required overlap between fragments.
maxlink (int, optional) – Maximum length of spacer sequences that may be present in f. These will be included in tails for designed primers.
circular (bool, optional) – If True, the assembly is circular. If False, the assembly is linear.
>>> from pydna.dseqrecord import Dseqrecord
>>> from pydna.design import primer_design
>>> a = primer_design(Dseqrecord("atgactgctaacccttccttggtgttgaacaagatcgacgacatttcgttcgaaacttacgatg"))
>>> b = primer_design(Dseqrecord("ccaaacccaccaggtaccttatgtaagtacttcaagtcgccagaagacttcttggtcaagttgcc"))
>>> c = primer_design(Dseqrecord("tgtactggtgctgaaccttgtatcaagttgggtgttgacgccattgccccaggtggtcgtttcgtt"))
>>> from pydna.design import assembly_fragments
>>> # We would like a circular recombination, so the first sequence has to be repeated
>>> fa1, fb, fc, fa2 = assembly_fragments([a, b, c, a])
>>> # Since all fragments are Amplicons, we need to extract the rp of the 1st and fp of the last fragments.
>>> from pydna.amplify import pcr
>>> fa = pcr(fa2.forward_primer, fa1.reverse_primer, a)
>>> [fa, fb, fc]
[Amplicon(100), Amplicon(101), Amplicon(102)]
>>> fa.name, fb.name, fc.name = "fa fb fc".split()
>>> from pydna.assembly import Assembly
>>> assemblyobj = Assembly([fa, fb, fc])
>>> assemblyobj
Assembly
fragments....: 100bp 101bp 102bp
limit(bp)....: 25
G.nodes......: 6
algorithm....: common_sub_strings
>>> assemblyobj.assemble_linear()
[Contig(-231), Contig(-166), Contig(-36)]
>>> assemblyobj.assemble_circular()[0].seguid()
'cdseguid=85t6tfcvWav0wnXEIb-lkUtrl4s'
>>> (a + b + c).looped().seguid()
'cdseguid=85t6tfcvWav0wnXEIb-lkUtrl4s'
>>> print(assemblyobj.assemble_circular()[0].figure())
 -|fa|36
|     \/
|     /\
|     36|fb|36
|           \/
|           /\
|           36|fc|36
|                 \/
|                 /\
|                 36-
|                   |
 --------------------
>>>
Compares two or more DNA sequences for equality i.e. if they
represent the same DNA molecule.
Two linear sequences are considered equal if either:
They have the same sequence (case insensitive)
One sequence is the reverse complement of the other
Two circular sequences are considered equal if they are circular
permutations meaning that they have the same length and:
One sequence can be found in the concatenation of the other sequence with itself.
The reverse complement of one sequence can be found in the concatenation of the other sequence with itself.
The topology for the comparison can be set by passing one of the keywords
linear or circular as True or False.
If circular or linear is not set, it will be deduced from the topology of
each sequence for sequences that have a linear or circular attribute
(like Dseq and Dseqrecord).
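The rules above can be expressed on plain strings in a few lines. The sketch is a simplified stand-in for the function documented here: same_molecule and rc are hypothetical names, only two sequences are compared, and the topology must be given explicitly.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")


def rc(seq: str) -> str:
    """Reverse complement of a lower-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def same_molecule(a: str, b: str, circular: bool = False) -> bool:
    """True if the two strings describe the same linear or circular molecule."""
    a, b = a.lower(), b.lower()
    if len(a) != len(b):
        return False
    if circular:
        # Every circular permutation of b is a substring of b + b.
        doubled = b + b
        return a in doubled or rc(a) in doubled
    return a == b or a == rc(b)


print(same_molecule("AAGT", "actt"))                 # reverse complements
print(same_molecule("aagt", "gtaa"))                 # different as linear molecules
print(same_molecule("aagt", "gtaa", circular=True))  # circular permutations
```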
This function takes a string containing one genbank sequence
in Genbank format and returns a named tuple containing two fields,
the gbtext containing a string with the corrected genbank sequence and
jseq which contains the JSON intermediate.
Examples
>>> s = '''LOCUS New_DNA 3 bp DNA CIRCULAR SYN 19-JUN-2013
... DEFINITION .
... ACCESSION
... VERSION
... SOURCE .
... ORGANISM .
... COMMENT
... COMMENT ApEinfo:methylated:1
... ORIGIN
... 1 aaa
... //'''
>>> from pydna.readers import read
>>> read(s)
/home/bjorn/anaconda3/envs/bjorn36/lib/python3.6/site-packages/Bio/GenBank/Scanner.py:1388: BiopythonParserWarning: Malformed LOCUS line found - is this correct?
:'LOCUS New_DNA 3 bp DNA CIRCULAR SYN 19-JUN-2013\n'
  "correct?\n:%r" % line, BiopythonParserWarning)
Traceback (most recent call last):
  File "/home/bjorn/python_packages/pydna/pydna/readers.py", line 48, in read
    results = results.pop()
IndexError: pop from empty list

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bjorn/python_packages/pydna/pydna/readers.py", line 50, in read
    raise ValueError("No sequences found in data:\n({})".format(data[:79]))
ValueError: No sequences found in data:
(LOCUS New_DNA 3 bp DNA CIRCULAR SYN 19-JUN-2013
DEFINITI)
>>> from pydna.genbankfixer import gbtext_clean
>>> s2, j2 = gbtext_clean(s)
>>> print(s2)
LOCUS New_DNA 3 bp ds-DNA circular SYN 19-JUN-2013
DEFINITION .
ACCESSION
VERSION
SOURCE .
ORGANISM .
COMMENT
COMMENT ApEinfo:methylated:1
FEATURES Location/Qualifiers
ORIGIN
1 aaa
//
>>> s3 = read(s2)
>>> s3
Dseqrecord(o3)
>>> print(s3.format())
LOCUS New_DNA 3 bp DNA circular SYN 19-JUN-2013
DEFINITION .
ACCESSION New_DNA
VERSION New_DNA
KEYWORDS .
SOURCE
ORGANISM .
.
COMMENT ApEinfo:methylated:1
FEATURES Location/Qualifiers
ORIGIN
1 aaa
//
The primers can be of any format readable by the parse_primers
function. Lines beginning with # are ignored. Path defaults to
the path given by the pydna_primers environment variable.
The primer list does not accept new primers. Use the
assign_numbers_to_new_primers method and paste the new
primers at the top of the list.
The primer list remembers the numbers of accessed primers.
The indices of accessed primers are stored in the .accessed
property.