📚 Complete Function Documentation¶

This page contains the complete documentation for each function in the pegg module.

pegg.prime¶

pegg.prime.PAM_finder(seq, PAM)¶

Finds indices of PAM sequences in a sequence.

Parameters¶

seq

type = str

Sequence to search for PAM sequences.

PAM

type = str

PAM sequence for searching. Formatting: e.g. “NGG”

N = [A|T|C|G] R = [A|G] Y = [C|T] S = [G|C] W = [A|T] K = [G|T] M = [A|C] B = [C|G|T] D = [A|G|T] H = [A|C|T] V = [A|C|G]
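As a rough illustration of how the ambiguity codes above can drive a search, the sketch below expands a PAM into a regular expression and reports every match start, including overlapping ones. The helper name `find_pam_indices` and the choice to return 0-based start positions are assumptions made for illustration; this is not PEGG's implementation of PAM_finder.

```python
import re

# IUPAC nucleotide codes, as listed in the PAM parameter description.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "[ATCG]", "R": "[AG]", "Y": "[CT]", "S": "[GC]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]", "H": "[ACT]",
    "V": "[ACG]",
}


def find_pam_indices(seq, pam):
    """Return 0-based start indices of every PAM match in seq (hypothetical helper)."""
    pattern = "".join(IUPAC[base] for base in pam.upper())
    # A zero-width lookahead lets overlapping matches (e.g. "GGG" for "NGG") be found.
    return [m.start() for m in re.finditer(f"(?=({pattern}))", seq.upper())]


print(find_pam_indices("ATGGCGGTAGG", "NGG"))
```

The lookahead matters for PAMs such as NGG, where a run like GGG contains two valid sites that a plain `re.finditer` would collapse into one.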

pegg.prime.RF_score(df)¶

Function for calculating the Random Forest score from pegRNA parameters.

Parameters¶

df

type = pd.DataFrame

Dataframe containing the pegRNAs generated by run().

pegg.prime.clinvar_VCF_translator(filepath, variation_ids)¶

Function that takes a clinvar.vcf.gz file containing information about variants, as well as a list of variation ID numbers, and returns a pandas dataframe containing the variants in a format that can be used by PEGG to design pegRNAs.

Parameters¶

filepath

type = str

Filepath to the clinvar.vcf.gz file.

variation_ids

type = list

List of ClinVar variation IDs that the user wants to convert into a PEGG-compatible format.

pegg.prime.df_formatter(df, chrom_dict, context_size=120)¶

Takes in variants (in cBioPortal format!) and outputs dataframe with REF and ALT oligos with designated context_size that can be used by PEGG for creating pegRNAs.

Parameters¶

df

type = pd.DataFrame

The dataframe of input variants in cBioPortal format.

chrom_dict

type = dict

Dictionary generated by genome_loader() that holds the chromosome sequences.

context_size

type = int

The amount of context/flanking sequence on either side of the mutation to generate. For larger variants, set this larger. Default = 120. e.g. AAA(A/G)AAA = context_size of 3
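To make the meaning of context_size concrete, here is a minimal sketch that builds REF and ALT sequences with a fixed amount of flanking sequence from a chromosome string. The function name `ref_alt_with_context` and the 0-based `start` coordinate are assumptions of this sketch (cBioPortal positions are 1-based), and it is not the code df_formatter uses.

```python
def ref_alt_with_context(chrom_seq, start, ref, alt, context_size=120):
    """Build REF and ALT sequences flanked by context_size bases on each side.

    start is the 0-based index of the first reference base (sketch assumption).
    """
    end = start + len(ref)
    if chrom_seq[start:end].upper() != ref.upper():
        raise ValueError("ref allele does not match the chromosome sequence")
    # Flanks are truncated at the chromosome ends rather than padded.
    left = chrom_seq[max(0, start - context_size):start]
    right = chrom_seq[end:end + context_size]
    return left + ref + right, left + alt + right


chrom = "CCCAAAAAAACCC"
# Reproduces the AAA(A/G)AAA example: three bases of context on either side.
print(ref_alt_with_context(chrom, 6, "A", "G", context_size=3))
```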

pegg.prime.eligible_PAM_finder(mut, PAM, max_RTT_length, proto_size=19)¶

Determines eligible PAM sequences for creating pegRNAs. Returns PAM sequence locations on (1) forward, and (2) reverse-complement strand

Parameters¶

mut

type = mutation class

See class: mutation

PAM

type = str

PAM sequence for searching. Formatting: e.g. “NGG”

N = [A|T|C|G] R = [A|G] Y = [C|T] S = [G|C] W = [A|T] K = [G|T] M = [A|C] B = [C|G|T] D = [A|G|T] H = [A|C|T] V = [A|C|G]

max_RTT_length

type = int

Max RTT length for pegRNAs being designed.

proto_size

type = int

Size of protospacer being used. Default = 19 (G+19).

pegg.prime.genome_loader(filepath_gz)¶

Takes in filepath of human or mouse genome and returns dictionary of chromosome sequences that PEGG can parse. Tested only on human and mouse genomes.

Returns (1) chromosome dictionary, and (2) file names of the chromosomes that are stored (for manual checking of errors).

Parameters¶

filepath_gz

type = str

The filepath to the .gz file holding the reference genome file.

pegg.prime.goldengate_oligos(peg_df)¶

Currently a place-holder. Going to implement automated Golden Gate oligo generator (so scaffold doesn’t need to be synthesized).

pegg.prime.input_formatter(input_df, input_format, chrom_dict, context_size)¶

Master function for putting input mutations in the correct format for pegRNA/gRNA design.

Parameters¶

input_df

type = pd.DataFrame

Pandas dataframe that contains the input mutations.

input_format

type = str

Options = ‘cBioPortal’, ‘WT_ALT’, ‘PrimeDesign’. For ‘WT_ALT’, make sure to put headers as “WT” and “ALT”. For “PrimeDesign”, put the header as “SEQ”.

chrom_dict

type = dict

Dictionary generated by genome_loader() that holds the chromosome sequences.

context_size

type = int

The amount of context/flanking sequence on either side of the mutation to generate. For larger variants, set this larger. Default = 120. e.g. AAA(A/G)AAA = context_size of 3

pegg.prime.make_aligner()¶

Aligner function from Bio.Align.PairwiseAligner with custom parameters.

pegg.prime.mut_formatter(wt, alt)¶

Formats mutations for pegRNA generation. Takes in WT and ALT sequences, returns necessary parameters for gRNA/pegRNA generation. Note: This function will break with INDELs and complex variants since it relies on a simple alignment.

Parameters¶

wt

type = str

WT sequence

alt

type = str

Mutant/alternate sequence.
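For intuition about what "formatting" a WT/ALT pair involves, the sketch below splits the two sequences into shared flanks and differing alleles by trimming the common prefix and suffix. This is a deliberately simpler technique than an alignment, it is not how mut_formatter works internally, and the name `simple_diff` is hypothetical.

```python
def simple_diff(wt, alt):
    """Split WT/ALT into (left flank, ref allele, alt allele, right flank)."""
    shortest = min(len(wt), len(alt))
    # Longest shared prefix.
    prefix = 0
    while prefix < shortest and wt[prefix] == alt[prefix]:
        prefix += 1
    # Longest shared suffix that does not overlap the prefix.
    suffix = 0
    limit = shortest - prefix
    while suffix < limit and wt[len(wt) - 1 - suffix] == alt[len(alt) - 1 - suffix]:
        suffix += 1
    return (
        wt[:prefix],
        wt[prefix:len(wt) - suffix],
        alt[prefix:len(alt) - suffix],
        wt[len(wt) - suffix:],
    )


print(simple_diff("AAAGAAA", "AAATAAA"))  # single-base substitution
print(simple_diff("ACGT", "ACGGT"))       # one-base insertion
```

Note that in repetitive sequence the placement of an insertion or deletion is ambiguous, which is one reason complex variants need more care than a simple comparison like this provides.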

class pegg.prime.mutation(wt_w_context, alt_w_context, left_seq, right_seq, var_type, ref_seq, alt_seq, chrom=None, genome=None)¶

Bases: object

Class for storing information about individual mutations. Used by functions throughout. Prevents the use of disorganized lists.

pegg.prime.ontarget_score(df)¶

Calls to crisporEffScores (taken from CRISPOR github) to generate on-target scores using Rule Set 2.

pegg.prime.other_filtration(pegRNA_df, RE_sites=None, polyT_threshold=4)¶

Determines whether pegRNAs contain polyT termination sites or restriction enzyme (RE) sites. Returns pegRNA_df with these properties added. Note: RE_sites accepts only A, T, C, or G bases (Ns are not currently supported; sites with ambiguous bases need custom filtration).

The properties checked are:

  1. U6 terminator (polyT sequence)

  2. RE site presence

Parameters¶

pegRNA_df

type = pd.DataFrame

A dataframe containing the pegRNAs for the selected input mutations. Generated by run() or pegRNA_generator()

RE_sites

type = list or None

A list containing the RE recognition sites to filter (e.g. [‘CGTCTC’, ‘GAATTC’] for Esp3I and EcoRI). Default = None (no filtration).

polyT_threshold

type = int

The length of the polyT sequence to classify as a terminator. Default = 4.
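The two checks described above can be sketched with plain string operations. The helper names below are hypothetical, and scanning for the reverse complement of each RE site is an assumption of this sketch rather than documented PEGG behavior.

```python
def reverse_complement(seq):
    """Reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def has_polyT(seq, polyT_threshold=4):
    """True if seq contains a run of at least polyT_threshold consecutive Ts."""
    return "T" * polyT_threshold in seq.upper()


def has_RE_site(seq, RE_sites=None):
    """True if any recognition site occurs in seq on either strand."""
    if not RE_sites:
        return False  # mirrors the documented default: None means no filtration
    seq = seq.upper()
    for site in RE_sites:
        site = site.upper()
        if site in seq or reverse_complement(site) in seq:
            return True
    return False


print(has_polyT("ACGTTTTACG"))                        # run of four Ts
print(has_RE_site("AACGTCTCAA", ["CGTCTC", "GAATTC"]))  # contains an Esp3I site
```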

pegg.prime.pegRNA_generator(mut, PAM, orientation, proto_size, RTT_lengths, PBS_lengths)¶

Generates pegRNAs for a given mutation.

Parameters¶

mut

type = mutation class

See class: mutation

PAM

type = str

PAM sequence for searching. Formatting: e.g. “NGG”

N = [A|T|C|G] R = [A|G] Y = [C|T] S = [G|C] W = [A|T] K = [G|T] M = [A|C] B = [C|G|T] D = [A|G|T] H = [A|C|T] V = [A|C|G]

orientation

type = str

‘+’ or ‘-’

proto_size

type = int

Size of protospacer being used. Default = 19 (G+19).

RTT_lengths

type = list

List containing RTT lengths to design pegRNAs for.

PBS_lengths

type = list

List containing PBS lengths to design pegRNAs for.

pegg.prime.peggscore2(df)¶

Function for calculating the PEGG2 Score. A multiple linear regression score based on pegRNA parameters.

Parameters¶

df

type = pd.DataFrame

Dataframe containing the pegRNAs generated by run().

pegg.prime.prime_oligo_generator(peg_df, epeg=True, epeg_motif='tevopreQ1', five_prime_adapter='AGCGTACACGTCTCACACC', three_prime_adapter='GAATTCTAGATCCGGTCGTCAAC', gRNA_scaff='GTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGC')¶

A tool for automatically generating oligos from the output of run(). Returns input dataframe with new columns containing the pegRNA oligo.

Parameters¶

peg_df

type = pd.DataFrame

A dataframe containing the pegRNAs for the selected input mutations. Generated by run() or pegRNA_generator()

epeg

type = bool

True/False whether to include an epegRNA motif at the end of the 3’ extension.

epeg_motif

type = str

Which epeg motif to include at end of 3’ extension. Default = “tevopreQ1”; Other option = “mpknot” (both from Chen et al.). Can also input a custom motif sequence (just put in the sequence in 5’ to 3’ orientation)!

five_prime_adapter

type = str

5’ adapter. The default 5’ adapter contains an Esp3I (BsmBI) site. Can be swapped with any sequence the user wants.

three_prime_adapter

type = str

3’ adapter. The default 3’ adapter contains an EcoRI site (GAATTC). Can be swapped with any sequence the user wants.

gRNA_scaff

type = str

gRNA scaffold region. Automatically set to a functional gRNA scaffold. Can be swapped with whatever input string user wants.
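The sketch below shows one plausible way the pieces named in these parameters could be concatenated into a single oligo. The component order (5’ adapter, protospacer, scaffold, 3’ extension, optional motif, 3’ adapter) is an assumption based on the usual layout of a pegRNA, and `assemble_oligo` is a hypothetical helper; the actual output of prime_oligo_generator may include additional elements. The adapter and scaffold strings are the defaults from the signature above, and the motif must be supplied by the caller as a sequence.

```python
FIVE_PRIME_ADAPTER = "AGCGTACACGTCTCACACC"
THREE_PRIME_ADAPTER = "GAATTCTAGATCCGGTCGTCAAC"
GRNA_SCAFF = (
    "GTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGC"
)


def assemble_oligo(
    protospacer,
    extension,
    motif="",
    five_prime_adapter=FIVE_PRIME_ADAPTER,
    three_prime_adapter=THREE_PRIME_ADAPTER,
    gRNA_scaff=GRNA_SCAFF,
):
    """Concatenate oligo components in 5' to 3' order (assumed layout)."""
    parts = [
        five_prime_adapter,
        protospacer,   # e.g. G + 19 nt
        gRNA_scaff,
        extension,     # 3' extension of the pegRNA
        motif,         # optional epegRNA motif sequence; empty string for none
        three_prime_adapter,
    ]
    return "".join(parts)


# Arbitrary example sequences, not real designs.
oligo = assemble_oligo("G" + "ACGTACGTACGTACGTACG", "TTGACCATGCAAGGCTA")
print(len(oligo))
```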

pegg.prime.primedesign_formatter(seq)¶

Takes as input sequence in prime design format e.g. AATTCCG(G/C)AATTCGCT. Returns necessary parameters for pegRNA generation.

Parameters¶

seq

type = str

The PrimeDesign formatted sequence.
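A minimal sketch of parsing the substitution form shown above, LEFT(REF/ALT)RIGHT, into full WT and ALT sequences. The helper `parse_primedesign` is hypothetical and handles only the single-edit `(ref/alt)` notation from the example; it is not PEGG's parser.

```python
import re

# One edit of the form (REF/ALT) surrounded by plain sequence.
PRIMEDESIGN_PATTERN = re.compile(r"([ACGT]*)\(([ACGT]*)/([ACGT]*)\)([ACGT]*)")


def parse_primedesign(seq):
    """Return (wt, alt) sequences for a LEFT(REF/ALT)RIGHT string."""
    match = PRIMEDESIGN_PATTERN.fullmatch(seq.upper())
    if match is None:
        raise ValueError(f"not a recognised (ref/alt) sequence: {seq!r}")
    left, ref, alt, right = match.groups()
    return left + ref + right, left + alt + right


print(parse_primedesign("AATTCCG(G/C)AATTCGCT"))
```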

pegg.prime.run(input_df, input_format, chrom_dict=None, PAM='NGG', rankby='PEGG2_Score', pegRNAs_per_mut='All', RTT_lengths=[5, 10, 15, 25, 30], PBS_lengths=[8, 10, 13, 15], min_RHA_size=1, RE_sites=None, polyT_threshold=4, proto_size=19, context_size=120, before_proto_context=5, sensor_length=60, sensor_orientation='reverse-complement', sensor=True)¶

Master function for generating pegRNAs. Takes as input a dataframe containing mutations in one of the acceptable formats. Returns a dataframe with pegRNAs with desired design parameters.

Parameters¶

input_df

type = pd.DataFrame

Pandas dataframe that contains the input mutations.

input_format

type = str

Options = ‘cBioPortal’, ‘WT_ALT’, ‘PrimeDesign’. For ‘WT_ALT’, make sure to put headers as “WT” and “ALT”. For “PrimeDesign”, put the header as “SEQ”.

chrom_dict

type = dict or None

Dictionary generated by genome_loader() that holds the chromosome sequences.

PAM

type = str

PAM sequence for searching. Default = “NGG”. Can include any nucleic acid code (e.g. PAM = “NRCH”).

rankby

type = str

What pegRNA parameter to rank pegRNAs by. Options = “PEGG2_Score” (weighted linear regression of different pegRNA parameters) or “RF_Score” (random forest predictor of pegRNA efficiency).

pegRNAs_per_mut

type = ‘All’ or int

How many pegRNAs to produce per mutation. Default = ‘All’ (all possible pegRNAs with parameters). Otherwise, choose an integer value (e.g. 5).

RTT_lengths

type = list

List containing RTT lengths to design pegRNAs for.

PBS_lengths

type = list

List containing PBS lengths to design pegRNAs for.

min_RHA_size

type = int

Minimum size of the RHA (Right homology arm). Default = 1. Generally pegRNAs with smaller RHA perform poorly.

RE_sites

type = list or None

A list containing the RE recognition sites to filter (e.g. [‘CGTCTC’, ‘GAATTC’] for Esp3I and EcoRI). Default = None (no filtration).

polyT_threshold

type = int

The length of the polyT sequence to classify as a terminator. Default = 4.

proto_size

type = int

The length of the protospacer (excluding the appended G at the 5’ end). Default = 19 (G+19).

context_size

type = int

The amount of context/flanking sequence on either side of the mutation to generate. For larger variants, set this larger. Default = 120. e.g. AAA(A/G)AAA = context_size of 3

before_proto_context

type = int

Default = 5. Amount of nucleotide context to put before the protospacer in the sensor

sensor_length

type = int

Total length of the sensor in nt. Default = 60.

sensor_orientation

type = str

Options for sensor_orientation = ‘reverse-complement’ or ‘forward’.

sensor

type = bool

True/False whether to include a sensor in the pegRNA design or not.

pegg.prime.sensor_generator(df, proto_size, before_proto_context=5, sensor_length=60, sensor_orientation='reverse-complement')¶

Generates sensor sequence for quantification of pegRNA editing outcomes. By default, puts the sensor in reverse-complement orientation with respect to the protospacer. This is highly recommended to reduce recombination during cloning/library preparation.

Parameters¶

df

type = pd.DataFrame

Dataframe containing pegRNAs.

proto_size

type = int

Size of protospacer.

before_proto_context

type = int

Default = 5. Amount of nucleotide context to put before the protospacer in the sensor

sensor_length

type = int

Total length of the sensor in nt. Default = 60.

sensor_orientation

type = str

Options for sensor_orientation = ‘reverse-complement’ or ‘forward’.
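The sensor parameters can be read as describing a fixed-length window around the protospacer. The sketch below cuts such a window and optionally reverse-complements it. The window placement (starting before_proto_context bases upstream of the protospacer) and the helper name `make_sensor` are assumptions for illustration, not a description of how sensor_generator computes its output.

```python
def reverse_complement(seq):
    """Reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def make_sensor(target, proto_start, before_proto_context=5,
                sensor_length=60, sensor_orientation="reverse-complement"):
    """Cut a sensor_length window that begins before_proto_context nt upstream
    of the protospacer start (0-based proto_start), then orient it."""
    start = proto_start - before_proto_context
    if start < 0 or start + sensor_length > len(target):
        raise ValueError("not enough flanking sequence for the requested sensor")
    sensor = target[start:start + sensor_length].upper()
    if sensor_orientation == "reverse-complement":
        return reverse_complement(sensor)
    if sensor_orientation == "forward":
        return sensor
    raise ValueError("sensor_orientation must be 'reverse-complement' or 'forward'")


target = "AAAAACCCCCGGGGGTTTTT"
print(make_sensor(target, 5, before_proto_context=2, sensor_length=10))
```

The explicit length check reflects a practical consequence of these parameters: the surrounding context must be long enough to hold the whole sensor.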

pegg.prime.sensor_viz(df_w_sensor, i)¶

Function for visualizing the pegRNA aligned to the sensor. Input = dataframe with pegRNA-sensor pairs and the row index (i) to visualize.

Parameters¶

df_w_sensor

type = pd.DataFrame

Dataframe containing the pegRNA-sensor pairs generated by run().

i

type = int

Row index from the dataframe that you want to visualize.

pegg.prime.split_word(word)¶

Simple function for splitting string into component characters

pegg.base¶

pegg.base.base_oligo_generator(peg_df, five_prime_adapter='AGCGTACACGTCTCACACC', three_prime_adapter='GAATTCTAGATCCGGTCGTCAAC', gRNA_scaff='GTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGC')¶

A tool for automatically generating oligos from the output of run_base(). Returns input dataframe with new columns containing gRNA oligo with or without sensor.

Parameters¶

peg_df

type = pd.DataFrame

A dataframe containing the gRNAs for the selected input mutations. Generated by run_base() or gRNA_generator()

five_prime_adapter

type = str

5’ adapter. The default 5’ adapter contains an Esp3I (BsmBI) site. Can be swapped with any sequence the user wants.

three_prime_adapter

type = str

3’ adapter. The default 3’ adapter contains an EcoRI site (GAATTC). Can be swapped with any sequence the user wants.

gRNA_scaff

type = str

gRNA scaffold region. Automatically set to a functional gRNA scaffold. Can be swapped with whatever input string user wants.

pegg.base.eligible_PAM_finder_base(mut, PAM, proto_size=19)¶

Determines eligible PAM sequences for creating gRNAs. Returns PAM sequence locations on (1) forward, and (2) reverse-complement strand

Parameters¶

mut

type = mutation class

See class: mutation

PAM

type = str

PAM sequence for searching. Formatting: e.g. “NGG”

N = [A|T|C|G] R = [A|G] Y = [C|T] S = [G|C] W = [A|T] K = [G|T] M = [A|C] B = [C|G|T] D = [A|G|T] H = [A|C|T] V = [A|C|G]

proto_size

type = int

Size of protospacer being used. Default = 19 (G+19).

pegg.base.gRNA_generator(mut, PAM, orientation, proto_size, ideal_edit_window=[4, 8])¶

Generates gRNAs for a given mutation.

Parameters¶

mut

type = mutation class

See class: mutation

PAM

type = str

PAM sequence for searching. Formatting: e.g. “NGG”

N = [A|T|C|G] R = [A|G] Y = [C|T] S = [G|C] W = [A|T] K = [G|T] M = [A|C] B = [C|G|T] D = [A|G|T] H = [A|C|T] V = [A|C|G]

orientation

type = str

‘+’ or ‘-’

proto_size

type = int

Size of protospacer being used. Default = 19 (G+19).

ideal_edit_window

type = list

Ideal editing window for the editor being used. Default = [4,8]. Labels mutations that fall in this window for future filtration if desired.

pegg.base.run_base(input_df, input_format, chrom_dict=None, PAM='NGG', filtration='ABE+CBE', ideal_edit_window=[4, 8], auto_SNP_filter=True, proto_size=19, context_size=120, RE_sites=None, polyT_threshold=4, before_proto_context=5, sensor_length=40, sensor_orientation='reverse-complement', sensor=True)¶

Master function for generating base editing gRNAs. Takes as input a dataframe containing mutations in one of the acceptable formats. Returns a dataframe with gRNAs with desired design parameters.

Parameters¶

input_df

type = pd.DataFrame

Pandas dataframe that contains the input mutations.

input_format

type = str

Options = ‘cBioPortal’, ‘WT_ALT’, ‘PrimeDesign’. For ‘WT_ALT’, make sure to put headers as “WT” and “ALT”. For “PrimeDesign”, put the header as “SEQ”.

chrom_dict

type = dict or None

Dictionary generated by genome_loader() that holds the chromosome sequences.

PAM

type = str

PAM sequence for searching. Default = “NGG”. Can include any nucleic acid code (e.g. PAM = “NRCH”).

filtration

type = str or list

Filters the mutation input list to only include the desired SNPs. Options = “No filter”, “ABE”, “CBE”, “ABE+CBE”, or a list containing the desired SNPs to model (e.g. [‘C>A’, ‘T>C’]).

ideal_edit_window

type = list

Ideal editing window for the editor being used. Default = [4,8]. Labels mutations that fall in this window for future filtration if desired.

auto_SNP_filter

type = bool

True/False for whether to filter mutant input to exclude mutations that are NOT SNPs (and thus not BE amenable).

proto_size

type = int

The length of the protospacer (excluding the appended G at the 5’ end). Default = 19 (G+19).

context_size

type = int

The amount of context/flanking sequence on either side of the mutation to generate. For larger variants, set this larger. Default = 120. e.g. AAA(A/G)AAA = context_size of 3

RE_sites

type = list or None

A list containing the RE recognition sites to filter (e.g. [‘CGTCTC’, ‘GAATTC’] for Esp3I and EcoRI). Default = None (no filtration).

polyT_threshold

type = int

The length of the polyT sequence to classify as a terminator. Default = 4.

before_proto_context

type = int

Default = 5. Amount of nucleotide context to put before the protospacer in the sensor

sensor_length

type = int

Total length of the sensor in nt. Default = 40.

sensor_orientation

type = str

Options for sensor_orientation = ‘reverse-complement’ or ‘forward’.

sensor

type = bool

True/False whether to include a sensor in the gRNA design or not.
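To illustrate the filtration options, the sketch below maps each option to a set of allowed SNPs and filters a list of (ref, alt) pairs. The editor-to-SNP mapping follows the standard chemistry (ABE converts A•T to G•C, CBE converts C•G to T•A, each readable on either strand); whether run_base() uses exactly these sets is an assumption, and the helper names are hypothetical.

```python
# SNPs reachable by each editor class, counting both strands.
EDITOR_SNPS = {
    "ABE": {"A>G", "T>C"},
    "CBE": {"C>T", "G>A"},
}


def allowed_snps(filtration):
    """Translate a filtration option into a set of 'REF>ALT' strings (None = keep all)."""
    if isinstance(filtration, (list, tuple, set)):
        return set(filtration)          # custom list such as ['C>A', 'T>C']
    if filtration == "No filter":
        return None
    allowed = set()
    for editor in filtration.split("+"):  # "ABE", "CBE" or "ABE+CBE"
        allowed |= EDITOR_SNPS[editor]
    return allowed


def filter_variants(variants, filtration="ABE+CBE"):
    """Keep only the (ref, alt) pairs permitted by the filtration option."""
    allowed = allowed_snps(filtration)
    if allowed is None:
        return list(variants)
    return [(ref, alt) for ref, alt in variants if f"{ref}>{alt}" in allowed]


variants = [("A", "G"), ("C", "T"), ("C", "A"), ("G", "A"), ("AT", "A")]
print(filter_variants(variants, "ABE+CBE"))
```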

pegg.base.sensor_generator_base(df, proto_size, before_proto_context=5, sensor_length=60, sensor_orientation='reverse-complement')¶

Generates sensor sequence for quantification of gRNA editing outcomes. By default, puts the sensor in reverse-complement orientation with respect to the protospacer. This is highly recommended to reduce recombination during cloning/library preparation.

Parameters¶

df

type = pd.DataFrame

Dataframe containing gRNAs.

proto_size

type = int

Size of protospacer.

before_proto_context

type = int

Default = 5. Amount of nucleotide context to put before the protospacer in the sensor

sensor_length

type = int

Total length of the sensor in nt. Default = 60.

sensor_orientation

type = str

Options for sensor_orientation = ‘reverse-complement’ or ‘forward’.

pegg.base.sensor_viz_base(df_w_sensor, i)¶

Function for visualizing the gRNA aligned to the sensor. Input = dataframe with gRNA-sensor pairs and the row index (i) to visualize.

Parameters¶

df_w_sensor

type = pd.DataFrame

Dataframe containing the gRNA-sensor pairs generated by run_base().

i

type = int

Row index from the dataframe that you want to visualize.

pegg.base.split_word(word)¶

Simple function for splitting string into component characters

pegg.library¶

pegg.library.aavs1_muts(chrom_dict, num_muts, genome_version='GRCh37')¶

Function for generating a list of “mutations” within the AAVS1 locus as negative controls. Generates a list of SNPs that contain identical reference and alternate alleles. Intron 1-2 of PPP1R12C (AAVS1 locus): GRCh37 transcript = ENST00000263433.3 | GRCh38 transcript = ENST00000263433.8

Not currently used in the pipeline, but provided anyway.

Parameters¶

chrom_dict

type = dict

Dictionary containing chromosomes in dictionary format for parsing by PEGG.

num_muts

type = int

Number of safe-targeting mutations to select/generate.

genome_version

type = str

Options = ‘GRCh37’, ‘GRCh38’

pegg.library.library_maker(mutant_input, gene_name, chrom_dict, fraction_safetarget=0.05, organism='human', fraction_silent=0, chrom=None, strand=None, start_end_cds=None)¶

Compiles the different library design functions to aggregate all of the variants for a particular gene, and to include silent substitution controls and safe-targeting controls at designated fractions. The result can then be fed into the pegg.prime.run() or pegg.base.run_base() functions.

NOTE: THIS ONLY WORKS FOR GRCh37 and GRCm38!!! Alternatively, generate these mutations/guides using these reference builds, and then generate your own guides targeting mutations in your desired reference build.

Parameters¶

mutant_input

type = pd.DataFrame

DataFrame containing all of the input mutations. Doesn’t require filtration for gene of interest.

gene_name

type = str

Gene name to select from the mutant_input dataframe.

chrom_dict

type = dict

Dictionary containing the reference genome. See genome_loader()

fraction_safetarget

type = float

Value from 0 to 1 that corresponds to the fraction of safe-targeting mutations to include in the library (i.e. what fraction of the mutant_input).

organism

type = str

Options = ‘human’ or ‘mouse’. Determines which of the organisms to generate safe targeting guides for.

fraction_silent

type = float

Value from 0 to 1 that corresponds to the fraction of silent mutations to include in the library. (i.e. what fraction of the mutant_input)

chrom

type = int or str

Chromosome that the gene of interest falls on.

strand

type = str

‘+’ or ‘-’ – corresponds with which strand the transcript falls on.

start_end_cds

type = list

Nested list that contains the CDS locations of the gene transcript of interest, in the + strand orientation.

pegg.library.mutation_aggregator(mutant_input, gene_name)¶

Selects the mutations that correspond to the desired gene name. Removes duplicates (checking at DNA level; not amino acid level). Returns dataframe containing these aggregated mutants occurring in gene_name.

Parameters¶

mutant_input

type = pd.DataFrame

A dataframe containing the input mutations from which selections are made to generate pegRNAs. See documentation for precise qualities of this dataframe

gene_name

type = str

Gene’s Hugo_Symbol (i.e. name).

pegg.library.neutral_substitutions(gene_name, chrom, strand, start_end_cds, chrom_dict)¶

A function for generating all possible synonymous codon substitutions (i.e. silent mutations) of a given gene. See documentation for more information about format of start_end_cds & example usage.

Parameters¶

gene_name

type = str

Gene’s Hugo_Symbol (i.e. name).

chrom

type = str

Chromosome that gene occurs on. Format = e.g. ‘chr17’, ‘chrX’, etc.

strand

type = str

Strand that transcript is on. Options are ‘+’ or ‘-’.

start_end_cds

type = list

A 2-d list containing the start/end locations of each region of the coding sequence (CDS) for the gene’s selected transcript. See documentation for example and precise specifications of format.

chrom_dict

type = dict

Dictionary containing the reference genome. See genome_loader()
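As a sketch of the underlying idea, the code below enumerates every single-nucleotide change in a coding sequence that leaves the encoded amino acid unchanged, using the standard genetic code. It works on a plain plus-strand CDS string rather than on start_end_cds coordinates, considers only single-base changes, and uses the hypothetical name `synonymous_snps`; neutral_substitutions() itself may differ on all of these points.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_snps(cds):
    """Return (0-based position, ref base, alt base) for every silent single-base change."""
    cds = cds.upper()
    results = []
    # Ignore any trailing partial codon.
    for codon_start in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[codon_start:codon_start + 3]
        for offset in range(3):
            for base in "ACGT":
                if base == codon[offset]:
                    continue
                mutant = codon[:offset] + base + codon[offset + 1:]
                if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                    results.append((codon_start + offset, codon[offset], base))
    return results


# ATG (Met) has no silent changes; GCT (Ala) has three at its third position.
print(synonymous_snps("ATGGCT"))
```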

pegg.library.nontargeting_guides(num_guides, edit_type='prime')¶

Function for generating a list of non-targeting guides (in the human genome). Future versions will expand to mouse as well; unprocessed files for mouse are located within this module. Guides taken from: https://www.nature.com/articles/nmeth.4423 (https://doi.org/10.1038/nmeth.4423)

Parameters¶

num_guides

type = int

Number of non-targeting guides to generate. Max = 1000.

edit_type

type = str

Options = ‘base’, ‘prime’

pegg.library.safe_muts(num_muts, chrom_dict, organism='human')¶

Function for generating a list of safe-targeting pegRNAs/gRNAs. NOTE: THIS ONLY WORKS FOR GRCh37 and GRCm38!!! Alternatively, generate these mutations/guides using these reference builds, and then generate your own guides targeting mutations in your desired reference build.

This is a random subset of 100,000 “safe regions” taken from Morgens et al., 2017 (https://doi.org/10.1038/ncomms15178)

Parameters¶

num_muts

type = int

Number of safe-targeting mutations to select/generate.

chrom_dict

type = dict

Dictionary containing chromosomes in dictionary format for parsing by PEGG.

organism

type = str

Choices = “human” or “mouse”. Generates safe targeting guides in human or mouse genome.