From 245c79d900c80272d5bcde0093495c28d1be352d Mon Sep 17 00:00:00 2001 From: Razan <96405088+razan18@users.noreply.github.com> Date: Fri, 18 Feb 2022 15:15:20 -0600 Subject: [PATCH] Update UNIX_Assignment_Template.md --- UNIX_Assignment_Template.md | 183 ++++++++++++++++++++++++++++++------ 1 file changed, 152 insertions(+), 31 deletions(-) diff --git a/UNIX_Assignment_Template.md b/UNIX_Assignment_Template.md index 0e25d1d..4941ed1 100644 --- a/UNIX_Assignment_Template.md +++ b/UNIX_Assignment_Template.md @@ -2,57 +2,178 @@ ##Data Inspection -###Attributes of `fang_et_al_genotypes` +###Attributes of `fang_et_al_genotypes` and `snp_position` +To look at the first 10 lines and inspect the headers ``` -here is my snippet of code used for data inspection +head fang_et_al_genotypes.txt ``` - -By inspecting this file I learned that: - -1. point 1 -2. point 2 -3. point 3 - -or - -* point 1 -* point 2 -* point 3 - -###Attributes of `snp_position.txt` - +To look at the number of columns +``` +awk -F "\t" '{print NF; exit}' fang_et_al_genotypes.txt +``` +To look at the number of words, lines, characters (bytes) +``` +wc fang_et_al_genotypes.txt +``` +To look at the file size ``` -here is my snippet of code used for data inspection +du -h fang_et_al_genotypes.txt +``` +To look at file type +``` +file fang_et_al_genotypes.txt ``` By inspecting this file I learned that: +Note: all previous code was repeated for the snp position file, just changing the file being selected in the command for inspection. -1. point 1 -2. point 2 -3. point 3 - -or +1. The fang et al had was very unorganized and I can't tell what the headers are from what the data is. +2. The fang et al file had 986 columns, snp position file had 15 columns. +3. The fang et al file had 2783 lines, 2744038 words, 11051939 characters. The snp position file had 984 lines, 13198 words, 82763 characters. +4. The fang et al file is 6.5M in size, the snp position file is 41K in size. +5. Both files are ASCII text. -* point 1 -* point 2 -* point 3 ##Data Processing +To extract column 3 which is 'position' and sort it based on the number of occurences of each term in that column +``` +cut -f3 fang_et_al_genotypes.txt | sort | uniq -c +``` +To sort snp position file based on the required columns +``` +cut -f 1,3,4 snp_position.txt > 134col_snp_pos.txt +sort -k1,1 134col_snp_pos.txt > sorted_cut_snp_pos.txt +``` +To grab the header into a new file +``` +head -n 1 fang_et_al_genotypes.txt > header.txt +``` ###Maize Data +To match anything with 'ZMM' from the fang file and direct the standard out to a new file called maize_geno. Match 'ZMM' as all maize start with this. ``` -here is my snippet of code used for data processing +grep 'ZMM' fang_et_al_genotypes.txt > maize_genotypes.txt ``` -Here is my brief description of what this code does +To combine the header with maize genotypes. The code after checks how the data looks like. +``` +cat header.txt maize_genotypes.txt > header_maize_genotypes.txt +head -n 10 maize_genotypes.txt | cut -c -100 | column -t +``` +To extract the groups and maize genotypes to a new file +``` +cut -f 3-986 header_maize_genotypes.txt > snps_only_header_maize_genotypes.txt +``` +To transpose data after extracting maize data +``` +awk -f transpose.awk snps_only_header_maize_genotypes.txt > transposed_maize_genotypes.txt +``` +Before joining, we have to sort the SNP_ID. This will sort based on column 1 and print the standard out into the new files. I also checked if they're sorted. +``` +sort -k1,1 transposed_maize_genotypes.txt > sorted_maize_genotypes.txt +head -n 10 sorted_maize_genotypes.txt | cut -c -50 | column -t +``` +To join sorted snp and maize +``` +join -1 1 -2 1 -t $'\t' sorted_cut_snp_pos.txt sorted_maize_genotypes.txt > join_maize_genotypes.txt +``` +Sort increasing SNP position for maize +``` +grep -v "multiple" join_maize_genotypes.txt | grep -v "unkown" | sort -k3,3n > increase_maize_genotypes.txt +``` +Sort decreasing SNP position for maize +``` +grep -v "multiple" join_maize_genotypes.txt | grep -v "unkown" | sort -k3,3nr > decrease_maize_genotypes.txt +``` +Make a file for multiple snps in maize genotype +``` +awk '$3 ~ /^multiple$/' join_maize_genotypes.txt > maize_multiple.txt +``` +Make a file for unknown snps in maize genotype +``` +awk '$3 ~ /^unknown$/' join_maize_genotypes.txt > maize_unknown.txt +``` +Make directory to put all maize data files +``` +mkdir maize_data +``` +Loop to make individual chromosome files (chr1-10) based on increasing snp position with unknown ? +``` +for i in {1..10}; do awk '$2== '$i'' increase_maize_genotypes.txt | sort -k3,3n > maize_data/maize_data_chr"$i"_increase.txt; done +``` +Replace missing value in decreasing maize genotype with "-" +``` +sed 's/?/-/g' decrease_maize_genotypes.txt > decrease_maize_genotype_dash.txt +``` +Loop to make individual chromosome files (chr1-10) based on decreasing snp position with unknown "-" +``` +for i in {1..10}; do awk '$2=='$i'' decrease_maize_genotype_dash.txt > maize_data/maize_chr"$i"_decrease.txt; done +``` ###Teosinte Data - +To match anything with 'ZMP' from the fang file and direct the standard out to a new file called teosinte_geno. Match 'ZMP' as all teosinte start with this. +``` +grep 'ZMP' fang_et_al_genotypes.txt > teosinte_genotypes.txt ``` -here is my snippet of code used for data processing +To combine the header with teosinte genotypes. The code after checks how the data looks like. ``` +cat header.txt teosinte_genotypes.txt > header_teosinte_genotypes.txt -Here is my brief description of what this code does \ No newline at end of file +head -n 10 teosinte_genotypes.txt | cut -c -100 | column -t +``` +To check the number of columns is 986 +``` +awk -F "\t" '{print NF; exit}' header_teosinte_genotypes.txt +``` +To extract the groups and maize genotypes to a new file +``` +cut -f 3-986 teosinte_genotypes.txt > snps_only_header_teosinte_genotypes.txt +``` +To transpose data after extracting maize data +``` +awk -f transpose.awk snps_only_header_teosinte_genotypes.txt > transposed_teosinte_genotypes.txt +``` +Before joining, we have to sort the SNP_ID. This will sort based on column 1 and print the standard out into the new files. I also checked if they're sorted. +``` +sort -k1,1 transposed_teosinte_genotypes.txt > sorted_teosinte_genotypes.txt +head -n 10 sorted_teosinte_genotypes.txt | cut -c -50 | column -t +``` +To join sorted snp and teosinte +``` +join -1 1 -2 1 -t $'\t' sorted_cut_snp_pos.txt sorted_teosinte_genotypes.txt > join_teosinte_genotypes.txt +``` +Sort increasing SNP position for teosinte +``` +grep -v "multiple" join_teosinte_genotypes.txt | grep -v "unkown" | sort -k3,3n > increase_teosinte_genotypes.txt +``` +Sort decreasing SNP position for teosinte +``` +grep -v "multiple" join_teosinte_genotypes.txt | grep -v "unkown" | sort -k3,3nr > decrease_teosinte_genotypes.txt +``` +Make a file for multiple snps in teosinte genotype +``` +awk '$3 ~ /^multiple$/' join_teosinte_genotypes.txt > teosinte_multiple.txt +``` +Make a file for unknown snps in teosinte genotype +``` +awk '$3 ~ /^unknown$/' join_teosinte_genotypes.txt > teosinte_unknown.txt +``` +Make directory to put all teosinte data files +``` +mkdir teosinte_data +``` +Loop to make individual chromosome files (chr1-10) based on increasing snp position with unknown ? +``` +for i in {1..10}; do awk '$2== '$i'' increase_teosinte_genotypes.txt | sort -k3,3n > teosinte_data/teosinte_data_chr"$i"_increase.txt; done +``` +Replace missing value in decreasing teosinte genotype with "-" +``` +sed 's/?/-/g' decrease_teosinte_genotypes.txt > decrease_teosinte_genotypes_dash.txt +``` +Loop to make individual chromosome files (chr1-10) based on decreasing snp position with unknown "-" +``` +for i in {1..10}; do awk '$2=='$i'' decrease_teosinte_genotypes_dash.txt > teosinte_data/teosinte_chr"$i"_decrease.txt; done +```