Complete Function Documentation
This page contains the complete documentation for each function in the pegg module.
pegg.prime
- pegg.prime.PAM_finder(seq, PAM)
Finds indices of PAM sequences in sequence.
Parameters
- seq
type = str
Sequence to search for PAM sequences.
- PAM
type = str
PAM sequence for searching. Formatting: e.g. "NGG"
Supported codes: N = [A|T|C|G], R = [A|G], Y = [C|T], S = [G|C], W = [A|T], K = [G|T], M = [A|C], B = [C|G|T], D = [A|G|T], H = [A|C|T], V = [A|C|G]
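As an illustration of how a degenerate PAM can be searched for, here is a minimal sketch that translates the codes listed above into a regular expression. The helper name `find_pam_sites` and the choice of 0-based indices are assumptions made for this example; PAM_finder itself may be implemented differently.

```python
import re

# Nucleotide codes from the list above, mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "[ATCG]", "R": "[AG]", "Y": "[CT]", "S": "[GC]",
    "W": "[AT]", "K": "[GT]", "M": "[AC]", "B": "[CGT]",
    "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
}


def find_pam_sites(seq, pam):
    """Return 0-based start indices of every PAM match, including overlapping ones."""
    pattern = "".join(IUPAC[base] for base in pam.upper())
    # A lookahead matches without consuming characters, so overlapping sites are kept.
    return [m.start() for m in re.finditer(f"(?=({pattern}))", seq.upper())]


print(find_pam_sites("AAGGTGGCC", "NGG"))  # [1, 4]
```

The lookahead matters for PAMs such as NGG, where adjacent sites can overlap (e.g. "AGGG" contains two).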
- pegg.prime.RF_score(df)
Function for calculating the Random Forest score from pegRNA parameters.
Parameters
- df
type = pd.DataFrame
Dataframe containing the pegRNAs generated by run().
- pegg.prime.clinvar_VCF_translator(filepath, variation_ids)
Function that takes a clinvar.vcf.gz file containing information about variants, as well as a list of variation ID numbers, and returns a pandas dataframe containing the variants in a format that can be used by PEGG to design pegRNAs.
Parameters
- filepath
type = str
Filepath to the clinvar.vcf.gz file.
- variation_ids
type = list
List of variation IDs that the user wants to convert to a PEGG-compatible format.
- pegg.prime.df_formatter(df, chrom_dict, context_size=120)
Takes in variants (in cBioPortal format) and outputs a dataframe with REF and ALT oligos of the designated context_size that PEGG can use to create pegRNAs.
Parameters
- df
type = pd.DataFrame
The dataframe of input variants in cBioPortal format.
- chrom_dict
type = dict
Dictionary generated by genome_loader() that holds the chromosome sequences.
- context_size
type = int
The amount of context/flanking sequence on either side of the mutation to generate. For larger variants, set this larger. Default = 120. e.g. AAA(A/G)AAA = context_size of 3
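To make the meaning of context_size concrete, here is a minimal sketch of cutting REF and ALT oligos out of a chromosome sequence. The function `make_context_oligos` and the 1-based `start` coordinate are assumptions for illustration only; they are not part of the pegg API.

```python
def make_context_oligos(chrom_seq, start, ref, alt, context_size=120):
    """Return (WT, ALT) oligos with up to context_size bases on each side.

    Assumes `start` is the 1-based position of the first reference base.
    """
    left_end = start - 1                 # 0-based index of the first ref base
    right_start = left_end + len(ref)    # 0-based index just past the ref allele
    if chrom_seq[left_end:right_start].upper() != ref.upper():
        raise ValueError("reference allele does not match the chromosome sequence")
    left = chrom_seq[max(0, left_end - context_size):left_end]
    right = chrom_seq[right_start:right_start + context_size]
    return left + ref + right, left + alt + right


wt, alt = make_context_oligos("CCAAAAAAATT", start=6, ref="A", alt="G", context_size=3)
print(wt, alt)  # AAAAAAA AAAGAAA, i.e. AAA(A/G)AAA
```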
- pegg.prime.eligible_PAM_finder(mut, PAM, max_RTT_length, proto_size=19)
Determines eligible PAM sequences for creating pegRNAs. Returns PAM sequence locations on (1) the forward and (2) the reverse-complement strand.
Parameters
- mut
type = mutation class
See class: mutation
- PAM
type = str
PAM sequence for searching. Formatting: e.g. "NGG"
Supported codes: N = [A|T|C|G], R = [A|G], Y = [C|T], S = [G|C], W = [A|T], K = [G|T], M = [A|C], B = [C|G|T], D = [A|G|T], H = [A|C|T], V = [A|C|G]
- max_RTT_length
type = int
Max RTT length for pegRNAs being designed.
- proto_size
type = int
Size of protospacer being used. Default = 19 (G+19).
- pegg.prime.genome_loader(filepath_gz)
Takes in filepath of human or mouse genome and returns dictionary of chromosome sequences that PEGG can parse. Tested only on human and mouse genomes.
Returns (1) chromosome dictionary, and (2) file names of the chromosomes that are stored (for manual checking of errors).
Parameters
- filepath_gz
type = str
The filepath to the .gz file holding the reference genome file.
- pegg.prime.goldengate_oligos(peg_df)
Currently a placeholder. An automated Golden Gate oligo generator is planned (so the scaffold doesn't need to be synthesized).
- pegg.prime.input_formatter(input_df, input_format, chrom_dict, context_size)
Master function for putting input mutations in the correct format for pegRNA/gRNA design.
Parameters
- input_df
type = pd.DataFrame
Pandas dataframe that contains the input mutations.
- input_format
type = str
Options = "cBioPortal", "WT_ALT", "PrimeDesign". For "WT_ALT", make sure to put headers as "WT" and "ALT". For "PrimeDesign", put the header as "SEQ".
- chrom_dict
type = dict
Dictionary generated by genome_loader() that holds the chromosome sequences.
- context_size
type = int
The amount of context/flanking sequence on either side of the mutation to generate. For larger variants, set this larger. Default = 120. e.g. AAA(A/G)AAA = context_size of 3
- pegg.prime.make_aligner()
Aligner function from Bio.Align.PairwiseAligner with custom parameters.
- pegg.prime.mut_formatter(wt, alt)
Formats mutations for pegRNA generation. Takes in WT and ALT sequences, returns necessary parameters for gRNA/pegRNA generation. Note: This function will break with INDELs and complex variants since it relies on a simple alignment.
Parameters
- wt
type = str
WT sequence
- alt
type = str
Mutant/alternate sequence.
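For intuition about what "necessary parameters" means here, the sketch below splits a WT/ALT pair into the left flank, reference allele, alternate allele, and right flank by trimming the shared prefix and suffix. This is a deliberately simpler technique than the pairwise alignment that mut_formatter is described as using, and `split_variant` is a hypothetical helper, not a pegg function.

```python
def split_variant(wt, alt):
    """Return (left_seq, ref_seq, alt_seq, right_seq) for one contiguous change."""
    shortest = min(len(wt), len(alt))
    # Length of the shared prefix.
    p = 0
    while p < shortest and wt[p] == alt[p]:
        p += 1
    # Length of the shared suffix, never overlapping the prefix.
    s = 0
    while s < shortest - p and wt[len(wt) - 1 - s] == alt[len(alt) - 1 - s]:
        s += 1
    return wt[:p], wt[p:len(wt) - s], alt[p:len(alt) - s], wt[len(wt) - s:]


print(split_variant("AAAGTTT", "AAACTTT"))  # ('AAA', 'G', 'C', 'TTT')
```

The four returned pieces correspond loosely to the left_seq, ref_seq, alt_seq, and right_seq fields of the mutation class documented on this page.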
- class pegg.prime.mutation(wt_w_context, alt_w_context, left_seq, right_seq, var_type, ref_seq, alt_seq, chrom=None, genome=None)
Bases: object
Class for storing information about individual mutations. Used by functions throughout. Prevents the use of disorganized lists.
- pegg.prime.ontarget_score(df)
Calls crisporEffScores (taken from the CRISPOR GitHub repository) to generate on-target scores using Rule Set 2.
- pegg.prime.other_filtration(pegRNA_df, RE_sites=None, polyT_threshold=4)
Determines whether pegRNAs contain polyT termination sites or restriction enzyme (RE) sites. Returns pegRNA_df with these properties added. Note: RE_sites may contain only A, T, C, or G bases (Ns do not currently work; degenerate sites need custom filtration). The properties checked are:
U6 terminator (polyT sequence)
RE site presence
Parameters
- pegRNA_df
type = pd.DataFrame
A dataframe containing the pegRNAs for the selected input mutations. Generated by run() or pegRNA_generator()
- RE_sites
type = list or None
A list containing the RE recognition sites to filter (e.g. ["CGTCTC", "GAATTC"] for Esp3I and EcoRI). Default = None (no filtration).
- polyT_threshold
type = int
The length of the polyT sequence to classify as a terminator. Default = 4.
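The two checks described above can be sketched for a single sequence as follows. `flag_sequence` is a hypothetical helper, and scanning the reverse complement of each RE site is an assumption of this example; the source does not state whether other_filtration checks both strands.

```python
def reverse_complement(seq):
    """Reverse complement of an A/T/C/G sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def flag_sequence(seq, RE_sites=None, polyT_threshold=4):
    """Return (has_polyT, has_RE_site) for one sequence."""
    seq = seq.upper()
    # A run of polyT_threshold or more Ts can act as a U6 terminator.
    has_polyT = "T" * polyT_threshold in seq
    has_RE_site = False
    for site in RE_sites or []:
        site = site.upper()
        # Assumption: a site on either strand counts, so check both orientations.
        if site in seq or reverse_complement(site) in seq:
            has_RE_site = True
    return has_polyT, has_RE_site


print(flag_sequence("ACGTTTTGCA", RE_sites=["CGTCTC", "GAATTC"]))  # (True, False)
print(flag_sequence("AAGAGACGAA", RE_sites=["CGTCTC"]))            # (False, True)
```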
- pegg.prime.pegRNA_generator(mut, PAM, orientation, proto_size, RTT_lengths, PBS_lengths)
Generates pegRNAs for a given mutation.
Parameters
- mut
type = mutation class
See class: mutation
- PAM
type = str
PAM sequence for searching. Formatting: e.g. "NGG"
Supported codes: N = [A|T|C|G], R = [A|G], Y = [C|T], S = [G|C], W = [A|T], K = [G|T], M = [A|C], B = [C|G|T], D = [A|G|T], H = [A|C|T], V = [A|C|G]
- orientation
type = str
"+" or "-"
- proto_size
type = int
Size of protospacer being used. Default = 19 (G+19).
- RTT_lengths
type = list
List containing RTT lengths to design pegRNAs for.
- PBS_lengths
type = list
List containing PBS lengths to design pegRNAs for.
- pegg.prime.peggscore2(df)
Function for calculating the PEGG2 Score, a multiple linear regression score based on pegRNA parameters.
Parameters
- df
type = pd.DataFrame
Dataframe containing the pegRNAs generated by run().
- pegg.prime.prime_oligo_generator(peg_df, epeg=True, epeg_motif='tevopreQ1', five_prime_adapter='AGCGTACACGTCTCACACC', three_prime_adapter='GAATTCTAGATCCGGTCGTCAAC', gRNA_scaff='GTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGC')
A tool for automatically generating oligos from the output of run(). Returns input dataframe with new columns containing the pegRNA oligo.
Parameters
- peg_df
type = pd.DataFrame
A dataframe containing the pegRNAs for the selected input mutations. Generated by run() or pegRNA_generator()
- epeg
type = bool
True/False whether to include an epegRNA motif at the end of the 3' extension.
- epeg_motif
type = str
Which epeg motif to include at the end of the 3' extension. Default = "tevopreQ1"; other option = "mpknot" (both from Chen et al.). A custom motif sequence can also be supplied (enter the sequence in 5' to 3' orientation).
- five_prime_adapter
type = str
5' adapter. The default 5' adapter contains an Esp3I (BsmBI) site. Can be swapped with any string the user wants.
- three_prime_adapter
type = str
3' adapter. The default 3' adapter contains an EcoRI recognition site (GAATTC). Can be swapped with any string the user wants.
- gRNA_scaff
type = str
gRNA scaffold region. Automatically set to a functional gRNA scaffold. Can be swapped with whatever input string user wants.
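The sketch below shows one plausible way the pieces named in these parameters fit together into a single oligo. The concatenation order (5' adapter, protospacer, scaffold, 3' extension, optional motif, 3' adapter) is an assumption based on the usual pegRNA layout, and `assemble_pegRNA_oligo` plus the protospacer, extension, and motif sequences are placeholders, not output of prime_oligo_generator.

```python
# Default adapter and scaffold strings copied from the signature above.
FIVE_PRIME_ADAPTER = "AGCGTACACGTCTCACACC"
THREE_PRIME_ADAPTER = "GAATTCTAGATCCGGTCGTCAAC"
GRNA_SCAFF = ("GTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACC"
              "GAGTCGGTGC")


def assemble_pegRNA_oligo(protospacer, extension, motif="",
                          five_prime_adapter=FIVE_PRIME_ADAPTER,
                          three_prime_adapter=THREE_PRIME_ADAPTER,
                          gRNA_scaff=GRNA_SCAFF):
    """Concatenate pegRNA parts; every sequence is given 5' to 3'."""
    parts = [
        five_prime_adapter,
        protospacer,          # spacer, including any prepended G
        gRNA_scaff,
        extension,            # 3' extension: RTT followed by PBS
        motif,                # optional epegRNA motif after the 3' extension
        three_prime_adapter,
    ]
    return "".join(parts)


# Placeholder sequences for illustration only.
oligo = assemble_pegRNA_oligo("G" + "ACGTACGTACGTACGTACG", "TTGACCATGCAGGCTAGCTA")
print(len(oligo))
```

Passing a custom `motif` string mirrors the documented option of supplying a motif sequence directly.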
- pegg.prime.primedesign_formatter(seq)
Takes as input a sequence in PrimeDesign format, e.g. AATTCCG(G/C)AATTCGCT. Returns necessary parameters for pegRNA generation.
Parameters
- seq
type = str
The PrimeDesign formatted sequence.
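As a rough illustration of the input format, the following sketch parses the substitution form shown in the example above, (REF/ALT), into full WT and ALT sequences. It handles only that one form with a single edit; `parse_primedesign_substitution` is a hypothetical name and does not reproduce what primedesign_formatter returns.

```python
import re


def parse_primedesign_substitution(seq):
    """Return (wt, alt) for a sequence such as AATTCCG(G/C)AATTCGCT."""
    match = re.fullmatch(r"([ACGT]*)\(([ACGT]*)/([ACGT]*)\)([ACGT]*)", seq.upper())
    if match is None:
        raise ValueError("expected exactly one (REF/ALT) edit in the sequence")
    left, ref, alt, right = match.groups()
    return left + ref + right, left + alt + right


print(parse_primedesign_substitution("AATTCCG(G/C)AATTCGCT"))
# ('AATTCCGGAATTCGCT', 'AATTCCGCAATTCGCT')
```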
- pegg.prime.run(input_df, input_format, chrom_dict=None, PAM='NGG', rankby='PEGG2_Score', pegRNAs_per_mut='All', RTT_lengths=[5, 10, 15, 25, 30], PBS_lengths=[8, 10, 13, 15], min_RHA_size=1, RE_sites=None, polyT_threshold=4, proto_size=19, context_size=120, before_proto_context=5, sensor_length=60, sensor_orientation='reverse-complement', sensor=True)
Master function for generating pegRNAs. Takes as input a dataframe containing mutations in one of the acceptable formats. Returns a dataframe with pegRNAs with desired design parameters.
Parameters
- input_df
type = pd.DataFrame
Pandas dataframe that contains the input mutations.
- input_format
type = str
Options = "cBioPortal", "WT_ALT", "PrimeDesign". For "WT_ALT", make sure to put headers as "WT" and "ALT". For "PrimeDesign", put the header as "SEQ".
- chrom_dict
type = dict or None
Dictionary generated by genome_loader() that holds the chromosome sequences.
- PAM
type = str
PAM sequence for searching. Default = "NGG". Can include any nucleic acid code (e.g. PAM = "NRCH").
- rankby
type = str
What pegRNA parameter to rank pegRNAs by. Options = "PEGG2_Score" (weighted linear regression of different pegRNA parameters) or "RF_Score" (random forest predictor of pegRNA efficiency).
- pegRNAs_per_mut
type = "All" or int
How many pegRNAs to produce per mutation. Default = "All" (all possible pegRNAs with parameters). Otherwise, choose an integer value (e.g. 5).
- RTT_lengths
type = list
List containing RTT lengths to design pegRNAs for.
- PBS_lengths
type = list
List containing PBS lengths to design pegRNAs for.
- min_RHA_size
type = int
Minimum size of the RHA (Right homology arm). Default = 1. Generally pegRNAs with smaller RHA perform poorly.
- RE_sites
type = list or None
A list containing the RE recognition sites to filter (e.g. ["CGTCTC", "GAATTC"] for Esp3I and EcoRI). Default = None (no filtration).
- polyT_threshold
type = int
The length of the polyT sequence to classify as a terminator. Default = 4.
- proto_size
type = int
The length of the protospacer (excluding the appended G at the 5' end). Default = 19 (G+19).
- context_size
type = int
The amount of context/flanking sequence on either side of the mutation to generate. For larger variants, set this larger. Default = 120. e.g. AAA(A/G)AAA = context_size of 3
- before_proto_context
type = int
Default = 5. Amount of nucleotide context to put before the protospacer in the sensor
- sensor_length
type = int
Total length of the sensor in nt. Default = 60.
- sensor_orientation
type = str
Options for sensor_orientation = "reverse-complement" or "forward".
- sensor
type = bool
True/False whether to include a sensor in the pegRNA design or not.
- pegg.prime.sensor_generator(df, proto_size, before_proto_context=5, sensor_length=60, sensor_orientation='reverse-complement')
Generates sensor sequence for quantification of pegRNA editing outcomes. By default, puts the sensor in reverse-complement orientation with respect to the protospacer. This is highly recommended to reduce recombination during cloning/library preparation.
Parameters
- df
type = pd.DataFrame
Dataframe containing pegRNAs.
- proto_size
type = int
Size of protospacer.
- before_proto_context
type = int
Default = 5. Amount of nucleotide context to put before the protospacer in the sensor
- sensor_length
type = int
Total length of the sensor in nt. Default = 60.
- sensor_orientation
type = str
Options for sensor_orientation = "reverse-complement" or "forward".
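To show how before_proto_context, sensor_length, and sensor_orientation interact, here is a minimal sketch that cuts a sensor window out of a flanking sequence. `sensor_from_context` and its 0-based `proto_start` argument are assumptions for this example; sensor_generator works on a dataframe and may place the window differently.

```python
def reverse_complement(seq):
    """Reverse complement of an A/T/C/G sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def sensor_from_context(context, proto_start, before_proto_context=5,
                        sensor_length=60, sensor_orientation="reverse-complement"):
    """Cut a sensor window out of `context`, a 5'-to-3' sequence around the target.

    Assumes `proto_start` is the 0-based index of the first protospacer base.
    """
    start = proto_start - before_proto_context
    end = start + sensor_length
    if start < 0 or end > len(context):
        raise ValueError("not enough flanking context for this sensor")
    sensor = context[start:end].upper()
    if sensor_orientation == "reverse-complement":
        return reverse_complement(sensor)
    if sensor_orientation == "forward":
        return sensor
    raise ValueError("sensor_orientation must be 'reverse-complement' or 'forward'")


context = "AAAAACCCCCGGGGGTTTTT"
print(sensor_from_context(context, 5, before_proto_context=2, sensor_length=10,
                          sensor_orientation="forward"))  # AACCCCCGGG
print(sensor_from_context(context, 5, before_proto_context=2, sensor_length=10))
# CCCGGGGGTT
```

The out-of-bounds check illustrates why a larger context_size may be needed when sensor_length is increased.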
- pegg.prime.sensor_viz(df_w_sensor, i)
Function for visualizing the pegRNA aligned to the sensor. Input = dataframe with pegRNA-sensor pairs and the row index (i) to visualize.
Parameters
- df_w_sensor
type = pd.DataFrame
Dataframe containing the pegRNA-sensor pairs generated by run().
- i
type = int
Row index from the dataframe that you want to visualize.
- pegg.prime.split_word(word)
Simple function for splitting a string into its component characters.
pegg.base
- pegg.base.base_oligo_generator(peg_df, five_prime_adapter='AGCGTACACGTCTCACACC', three_prime_adapter='GAATTCTAGATCCGGTCGTCAAC', gRNA_scaff='GTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGC')
A tool for automatically generating oligos from the output of run_base(). Returns input dataframe with new columns containing gRNA oligo with or without sensor.
Parameters
- peg_df
type = pd.DataFrame
A dataframe containing the gRNAs for the selected input mutations. Generated by run_base() or gRNA_generator()
- five_prime_adapter
type = str
5' adapter. The default 5' adapter contains an Esp3I (BsmBI) site. Can be swapped with any string the user wants.
- three_prime_adapter
type = str
3' adapter. The default 3' adapter contains an EcoRI recognition site (GAATTC). Can be swapped with any string the user wants.
- gRNA_scaff
type = str
gRNA scaffold region. Automatically set to a functional gRNA scaffold. Can be swapped with whatever input string user wants.
- pegg.base.eligible_PAM_finder_base(mut, PAM, proto_size=19)
Determines eligible PAM sequences for creating gRNAs. Returns PAM sequence locations on (1) the forward and (2) the reverse-complement strand.
Parameters
- mut
type = mutation class
See class: mutation
- PAM
type = str
PAM sequence for searching. Formatting: e.g. "NGG"
Supported codes: N = [A|T|C|G], R = [A|G], Y = [C|T], S = [G|C], W = [A|T], K = [G|T], M = [A|C], B = [C|G|T], D = [A|G|T], H = [A|C|T], V = [A|C|G]
- proto_size
type = int
Size of protospacer being used. Default = 19 (G+19).
- pegg.base.gRNA_generator(mut, PAM, orientation, proto_size, ideal_edit_window=[4, 8])
Generates gRNAs for a given mutation.
Parameters
- mut
type = mutation class
See class: mutation
- PAM
type = str
PAM sequence for searching. Formatting: e.g. "NGG"
Supported codes: N = [A|T|C|G], R = [A|G], Y = [C|T], S = [G|C], W = [A|T], K = [G|T], M = [A|C], B = [C|G|T], D = [A|G|T], H = [A|C|T], V = [A|C|G]
- orientation
type = str
"+" or "-"
- proto_size
type = int
Size of protospacer being used. Default = 19 (G+19).
- ideal_edit_window
type = list
Ideal editing window for the editor being used. Default = [4,8]. Labels mutations that fall in this window for future filtration if desired.
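The window check itself is simple; a minimal sketch is given below. It assumes the window bounds are inclusive and that positions are counted 1-based from the PAM-distal end of the protospacer, which is the common base-editing convention but is not stated in the source. `in_ideal_window` is a hypothetical helper.

```python
def in_ideal_window(proto_position, ideal_edit_window=(4, 8)):
    """True if a 1-based protospacer position lies inside the inclusive window."""
    low, high = ideal_edit_window
    return low <= proto_position <= high


# Label a few candidate edit positions for later filtration.
for position in (3, 4, 8, 9):
    print(position, in_ideal_window(position))
```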
- pegg.base.run_base(input_df, input_format, chrom_dict=None, PAM='NGG', filtration='ABE+CBE', ideal_edit_window=[4, 8], auto_SNP_filter=True, proto_size=19, context_size=120, RE_sites=None, polyT_threshold=4, before_proto_context=5, sensor_length=40, sensor_orientation='reverse-complement', sensor=True)
Master function for generating base editing gRNAs. Takes as input a dataframe containing mutations in one of the acceptable formats. Returns a dataframe with gRNAs with desired design parameters.
Parameters
- input_df
type = pd.DataFrame
Pandas dataframe that contains the input mutations.
- input_format
type = str
Options = "cBioPortal", "WT_ALT", "PrimeDesign". For "WT_ALT", make sure to put headers as "WT" and "ALT". For "PrimeDesign", put the header as "SEQ".
- chrom_dict
type = dict or None
Dictionary generated by genome_loader() that holds the chromosome sequences.
- PAM
type = str
PAM sequence for searching. Default = "NGG". Can include any nucleic acid code (e.g. PAM = "NRCH").
- filtration
type = str or list
Filters the mutation input list to only include the desired SNPs. Options = "No filter", "ABE", "CBE", "ABE+CBE", or a list containing the desired SNPs to model (e.g. ["C>A", "T>C"]).
- ideal_edit_window
type = list
Ideal editing window for the editor being used. Default = [4,8]. Labels mutations that fall in this window for future filtration if desired.
- auto_SNP_filter
type = bool
True/False for whether to filter mutant input to exclude mutations that are NOT SNPs (and thus not BE amenable).
- proto_size
type = int
The length of the protospacer (excluding the appended G at the 5' end). Default = 19 (G+19).
- context_size
type = int
The amount of context/flanking sequence on either side of the mutation to generate. For larger variants, set this larger. Default = 120. e.g. AAA(A/G)AAA = context_size of 3
- RE_sites
type = list or None
A list containing the RE recognition sites to filter (e.g. ["CGTCTC", "GAATTC"] for Esp3I and EcoRI). Default = None (no filtration).
- polyT_threshold
type = int
The length of the polyT sequence to classify as a terminator. Default = 4.
- before_proto_context
type = int
Default = 5. Amount of nucleotide context to put before the protospacer in the sensor
- sensor_length
type = int
Total length of the sensor in nt. Default = 40.
- sensor_orientation
type = str
Options for sensor_orientation = "reverse-complement" or "forward".
- sensor
type = bool
True/False whether to include a sensor in the gRNA design or not.
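The filtration options can be pictured as a lookup from editor type to the SNPs that editor can install. The sketch below assumes SNPs are written as "REF>ALT" strings and that each editor's edit on the opposite strand is also allowed (A>G and T>C for ABE, C>T and G>A for CBE); `filter_snps` is a hypothetical helper and run_base may encode this differently.

```python
# Substitutions each editor class can install, written as REF>ALT.
# The second entry in each set is the same edit made on the opposite strand.
EDITABLE = {
    "ABE": {"A>G", "T>C"},
    "CBE": {"C>T", "G>A"},
}
EDITABLE["ABE+CBE"] = EDITABLE["ABE"] | EDITABLE["CBE"]


def filter_snps(snps, filtration="ABE+CBE"):
    """Keep only the SNPs (strings like 'C>T') allowed by `filtration`."""
    if filtration == "No filter":
        return list(snps)
    # A list is treated as an explicit set of allowed SNPs.
    allowed = set(filtration) if isinstance(filtration, list) else EDITABLE[filtration]
    return [snp for snp in snps if snp in allowed]


snps = ["A>G", "C>A", "G>A", "T>C"]
print(filter_snps(snps, "ABE"))      # ['A>G', 'T>C']
print(filter_snps(snps, ["C>A"]))    # ['C>A']
```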
- pegg.base.sensor_generator_base(df, proto_size, before_proto_context=5, sensor_length=60, sensor_orientation='reverse-complement')
Generates sensor sequence for quantification of gRNA editing outcomes. By default, puts the sensor in reverse-complement orientation with respect to the protospacer. This is highly recommended to reduce recombination during cloning/library preparation.
Parameters
- df
type = pd.DataFrame
Dataframe containing gRNAs.
- proto_size
type = int
Size of protospacer.
- before_proto_context
type = int
Default = 5. Amount of nucleotide context to put before the protospacer in the sensor
- sensor_length
type = int
Total length of the sensor in nt. Default = 60.
- sensor_orientation
type = str
Options for sensor_orientation = "reverse-complement" or "forward".
- pegg.base.sensor_viz_base(df_w_sensor, i)
Function for visualizing the gRNA aligned to the sensor. Input = dataframe with gRNA-sensor pairs and the row index (i) to visualize.
Parameters
- df_w_sensor
type = pd.DataFrame
Dataframe containing the gRNA-sensor pairs generated by run_base().
- i
type = int
Row index from the dataframe that you want to visualize.
- pegg.base.split_word(word)
Simple function for splitting a string into its component characters.
pegg.library
- pegg.library.aavs1_muts(chrom_dict, num_muts, genome_version='GRCh37')
Function for generating a list of "mutations" within the AAVS1 locus as negative controls. Generates a list of SNPs that contain identical reference and alternate alleles. Intron 1-2 of PPP1R12C (AAVS1 locus): GRCh37 transcript = ENST00000263433.3 | GRCh38 transcript = ENST00000263433.8
Not currently used in the pipeline, but provided anyway.
Parameters
- chrom_dict
type = dict
Dictionary containing chromosomes in dictionary format for parsing by PEGG.
- num_muts
type = int
Number of safe-targeting mutations to select/generate.
- genome_version
type = str
Options = "GRCh37", "GRCh38"
- pegg.library.library_maker(mutant_input, gene_name, chrom_dict, fraction_safetarget=0.05, organism='human', fraction_silent=0, chrom=None, strand=None, start_end_cds=None)
Combines the library design functions to aggregate all of the variants for a particular gene and to add silent substitution controls and safe-targeting controls at designated fractions. The result can then be fed into the pegg.prime.run() or pegg.base.run_base() functions.
Note: this only works for GRCh37 and GRCm38. Alternatively, generate these mutations/guides using these reference builds, and then generate your own guides targeting mutations in your desired reference build.
Parameters
- mutant_input
type = pd.DataFrame
DataFrame containing all of the input mutations. Doesn't require filtration for gene of interest.
- gene_name
type = str
Gene name to select from the mutant_input dataframe.
- chrom_dict
type = dict
Dictionary containing the reference genome. See genome_loader()
- fraction_safetarget
type = float
Value from 0 to 1 that corresponds to the fraction of safe-targeting mutations to include in the library (i.e. what fraction of the mutant_input).
- organism
type = str
Options = "human" or "mouse". Determines which of the organisms to generate safe-targeting guides for.
- fraction_silent
type = float
Value from 0 to 1 that corresponds to the fraction of silent mutations to include in the library. (i.e. what fraction of the mutant_input)
- chrom
type = int or str
Chromosome that the gene of interest falls on.
- strand
type = str
"+" or "-"; corresponds to the strand the transcript falls on.
- start_end_cds
type = list
Nested list that contains the CDS locations of the gene transcript of interest, in the + strand orientation.
- pegg.library.mutation_aggregator(mutant_input, gene_name)
Selects the mutations that correspond to the desired gene name. Removes duplicates (checking at DNA level; not amino acid level). Returns dataframe containing these aggregated mutants occurring in gene_name.
Parameters
- mutant_input
type = pd.DataFrame
A dataframe containing the input mutations from which selections are made to generate pegRNAs. See documentation for precise qualities of this dataframe
- gene_name
type = str
Gene's Hugo_Symbol (i.e. name).
- pegg.library.neutral_substitutions(gene_name, chrom, strand, start_end_cds, chrom_dict)
A function for generating all possible synonymous codon substitutions (i.e. silent mutations) of a given gene. See documentation for more information about format of start_end_cds & example usage.
Parameters
- gene_name
type = str
Gene's Hugo_Symbol (i.e. name).
- chrom
type = str
Chromosome that gene occurs on. Format = e.g. "chr17", "chrX", etc.
- strand
type = str
Strand that transcript is on. Options are "+" or "-".
- start_end_cds
type = list
A 2-d list containing the start/end locations of each region of the coding sequence (CDS) for the gene's selected transcript. See documentation for example and precise specifications of format.
- chrom_dict
type = dict
Dictionary containing the reference genome. See genome_loader()
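To illustrate what a synonymous substitution is at the codon level, the sketch below lists, for one codon, every single-base change that leaves the encoded amino acid unchanged under the standard genetic code. Restricting to single-base changes and the helper name `synonymous_snps` are assumptions of this example; neutral_substitutions operates on whole transcripts and may consider other substitutions.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def synonymous_snps(codon):
    """Return the codons one base away from `codon` that encode the same amino acid."""
    codon = codon.upper()
    hits = []
    for pos in range(3):
        for base in BASES:
            variant = codon[:pos] + base + codon[pos + 1:]
            if variant != codon and CODON_TABLE[variant] == CODON_TABLE[codon]:
                hits.append(variant)
    return sorted(hits)


print(synonymous_snps("GCT"))  # ['GCA', 'GCC', 'GCG']
print(synonymous_snps("ATG"))  # []
```

Codons such as ATG (Met) and TGG (Trp) have no silent single-base alternatives, so positions in those codons cannot contribute silent controls.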
- pegg.library.nontargeting_guides(num_guides, edit_type='prime')
Function for generating a list of non-targeting guides (in the human genome). Future versions will expand to mouse as well; unprocessed files for mouse are located within this module. Guides taken from: https://www.nature.com/articles/nmeth.4423 (https://doi.org/10.1038/nmeth.4423)
Parameters
- num_guides
type = int
Number of non-targeting guides to generate. Max = 1000.
- edit_type
type = str
Options = "base", "prime"
- pegg.library.safe_muts(num_muts, chrom_dict, organism='human')
Function for generating a list of safe-targeting pegRNAs/gRNAs. Note: this only works for GRCh37 and GRCm38. Alternatively, generate these mutations/guides using these reference builds, and then generate your own guides targeting mutations in your desired reference build.
This is a random subset of 100,000 "safe regions" taken from Morgens et al., 2017 (https://doi.org/10.1038/ncomms15178)
Parameters
- num_muts
type = int
Number of safe-targeting mutations to select/generate.
- chrom_dict
type = dict
Dictionary containing chromosomes in dictionary format for parsing by PEGG.
- organism
type = str
Choices = "human" or "mouse". Generates safe-targeting guides in the human or mouse genome.