ShrutiVerma17/NeuralNetworksForGenomics

NeuralNetworksForGenomics

wrapper.sh: Shell script that takes three arguments: the directory containing the original FASTA files, the desired sequence length, and whether to run the neural network on the balanced or the imbalanced file (true selects balanced, false selects imbalanced).

one_hot_encode_dna_update_categorical.py: One-hot encodes the input DNA sequences and assigns a label to each sequence. Important: the label is inferred from letter case. Introns are expected to come in with every nucleotide lowercase and are assigned label 0. Similarly, CDS exons are expected with every nucleotide capitalized and are assigned label 1, 3 prime UTR exons are expected with only the first base capitalized and are assigned label 2, and 5 prime UTR exons are expected with only the second base capitalized and are assigned label 3.
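The case convention above can be illustrated with a minimal sketch. This is not the code from one_hot_encode_dna_update_categorical.py: the function names, the A/C/G/T column order, and the all-zero row for unknown bases such as N are assumptions made for illustration only.

```python
# Illustrative sketch only; names and encoding details are assumptions,
# not taken from one_hot_encode_dna_update_categorical.py.

NUCLEOTIDES = "ACGT"  # assumed column order for the one-hot vectors


def label_from_case(seq):
    """Infer the class label from which positions are uppercase."""
    if not seq:
        raise ValueError("empty sequence")
    upper_positions = [i for i, base in enumerate(seq) if base.isupper()]
    if not upper_positions:
        return 0  # all lowercase: intron
    if len(upper_positions) == len(seq):
        return 1  # all uppercase: CDS exon
    if upper_positions == [0]:
        return 2  # only the first base uppercase: 3 prime UTR exon
    if upper_positions == [1]:
        return 3  # only the second base uppercase: 5 prime UTR exon
    raise ValueError("sequence does not match any expected case pattern")


def one_hot_encode(seq):
    """Return one 4-element row per base; unknown bases become all zeros."""
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    rows = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows
```

Because the label is carried entirely by letter case, the sequence is uppercased only for the encoding step, after the label has been read.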

categorical_neural_network.py: Convolutional neural network that I wrote to do categorical prediction. Before running it, you have to run the input creation pipeline script described below. That script produces a large file composed of many sequences (CDS exons, introns, 3 prime UTR exons, and 5 prime UTR exons). To run the neural network, all you need to do is replace the input file name with the name of your file.

input_creation_pipeline.py: Assumes that, for each chromosome you want to include in your input file, you have created a directory dedicated to that chromosome containing four files: "cds_exons.fa", "utr_3_prime.fa", "utr_5_prime.fa", and "introns.fa". You can retrieve each of these files from the UCSC Genome Browser.

To indicate which chromosomes you plan to include in the input file (and thus which ones you have directories for), change the line "for i in range (1, 5)" to "for i in range (your_starting_chrom_number, your_ending_chrom_number + 1)". Before you run the script, also set the sequence length you want by changing the sequence_length variable (300 base pairs by default).

Note that the input file this script creates by default is not balanced: expect far more introns in the resulting file than sequences of any other category. If you'd like a balanced file, there is an easy fix. The script keeps track of how many sequences of each category go into the input file and prints these counts at the end. Run the script once, creating an imbalanced file, and note the printed counts. There should be roughly the same number of each exon type. Decide, based on that, how many introns you want in your more balanced file. Then add the clause "or (file == "introns.fa" and introns > $num)" to the if statement on line 34, where $num is the number of introns you want included in your file. This clause also exists on line 25 if you'd like to copy and paste it. Finally, run the script again to create your roughly balanced file.
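The idea behind the intron cap can be sketched as follows. This is a simplified stand-in, not the code from input_creation_pipeline.py: the function name cap_category, the (file name, sequence) record format, and the use of ">=" so that exactly max_count introns are kept are all assumptions made for illustration.

```python
# Illustrative sketch only; cap_category and the record format are
# hypothetical and do not appear in input_creation_pipeline.py.
from collections import Counter


def cap_category(records, category="introns.fa", max_count=None):
    """Keep every record, except stop adding `category` after max_count."""
    counts = Counter()
    kept = []
    for file_name, seq in records:
        over_cap = (
            max_count is not None
            and file_name == category
            and counts[file_name] >= max_count
        )
        if over_cap:
            continue  # skip surplus introns, as the added clause would
        counts[file_name] += 1
        kept.append((file_name, seq))
    return kept, counts


# First pass: no cap, just read off the per-category counts.
records = [("introns.fa", "acgt")] * 5 + [("cds_exons.fa", "ACGT")] * 2
_, first_pass = cap_category(records)

# Second pass: cap introns at roughly the exon count seen in the first pass.
kept, second_pass = cap_category(records, max_count=first_pass["cds_exons.fa"])
```

The two calls mirror the two runs described above: the first only reports counts, and the second uses those counts to choose the cap.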

About

Developed deep and interpretable neural networks to investigate how biological actors such as mRNA can distinguish between coding/non-coding sections of the genome using only sequence information
