razan18 · razan18 · Feb 18, 2022
diff --git a/UNIX_Assignment_Template.md b/UNIX_Assignment_Template.md
@@ -2,57 +2,178 @@
 
 ##Data Inspection
 
-###Attributes of `fang_et_al_genotypes`
+###Attributes of `fang_et_al_genotypes` and `snp_position`
 
+To look at the first 10 lines and inspect the headers
 ```
-here is my snippet of code used for data inspection
+head fang_et_al_genotypes.txt
 ```
-
-By inspecting this file I learned that:
-
-1. point 1
-2. point 2
-3. point 3
-
-or
-
-* point 1
-* point 2
-* point 3
-
-###Attributes of `snp_position.txt`
-
+To look at the number of columns
+```
+awk -F "\t" '{print NF; exit}' fang_et_al_genotypes.txt
+``` 
+To look at the number of lines, words, and characters (bytes)
+```
+wc fang_et_al_genotypes.txt
+```
+To look at the file size
 ```
-here is my snippet of code used for data inspection
+du -h fang_et_al_genotypes.txt
+```
+To look at file type
+```
+file fang_et_al_genotypes.txt
 ```
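Since the same checks are run on both input files, they can be wrapped in a loop. The following is only a sketch: the two files under `demo_inspect/` are tiny made-up stand-ins (with invented header names) so the snippet runs anywhere, and the real inputs would be `fang_et_al_genotypes.txt` and `snp_position.txt`.

```shell
# Tiny made-up stand-in files; replace with the real inputs when running for real.
mkdir -p demo_inspect
printf 'id\totu\tgroup\tsnp1\nS1\tO1\tZMM_a\tA/A\n' > demo_inspect/genotypes.txt
printf 'snp_id\tmarker\tchrom\tpos\nsnp1\t1\t1\t100\n' > demo_inspect/positions.txt

for f in demo_inspect/genotypes.txt demo_inspect/positions.txt; do
    echo "== $f"
    awk -F '\t' '{print "columns:", NF; exit}' "$f"   # field count of the first line
    echo "lines: $(wc -l < "$f")"                     # number of lines
    du -h "$f"                                        # size on disk
done > demo_inspect/report.txt

cat demo_inspect/report.txt
```

Collecting the output in one report file makes it easier to compare the two inputs side by side.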
 
 By inspecting this file I learned that:
+Note: the same commands were repeated for the snp position file, changing only the file name in each command.
 
-1. point 1
-2. point 2
-3. point 3
-
-or
+1. The fang et al file was very unorganized, and I couldn't tell the headers apart from the data.
+2. The fang et al file had 986 columns, snp position file had 15 columns.
+3. The fang et al file had 2783 lines, 2744038 words, 11051939 characters. The snp position file had 984 lines, 13198 words, 82763 characters.
+4. The fang et al file is 6.5M in size, the snp position file is 41K in size.
+5. Both files are ASCII text.
 
-* point 1
-* point 2
-* point 3
 
 ##Data Processing
 
+To extract column 3 (the group) and count the number of occurrences of each value in that column
+```
+cut -f3 fang_et_al_genotypes.txt | sort | uniq -c
+```
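`uniq -c` counts each value but leaves the result in alphabetical order. If a ranking by frequency is wanted, one more `sort` can be appended. This is a minimal sketch on a made-up three-column sample; the group labels are invented for illustration.

```shell
mkdir -p demo_count
# Made-up tab-separated sample; column 3 plays the role of the group.
printf 'a\tx\tZMM_a\nb\tx\tZMP_a\nc\tx\tZMM_a\nd\tx\tZMM_b\ne\tx\tZMM_a\nf\tx\tZMP_a\n' > demo_count/sample.txt

# uniq -c only merges adjacent duplicates, hence the first sort.
# The second sort ranks by count (numeric, descending), breaking ties by name.
cut -f3 demo_count/sample.txt | sort | uniq -c | sort -k1,1nr -k2,2 > demo_count/ranked.txt

cat demo_count/ranked.txt
```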
+To keep only the required columns (1, 3, and 4) of the snp position file and sort the result on column 1
+```
+cut -f 1,3,4 snp_position.txt > 134col_snp_pos.txt
+sort -k1,1 134col_snp_pos.txt > sorted_cut_snp_pos.txt 
+```
+To grab the header into a new file
+```
+head -n 1 fang_et_al_genotypes.txt > header.txt
+```
 ###Maize Data
 
+To match every line containing 'ZMM' in the fang file and redirect standard output to a new file called `maize_genotypes.txt`. 'ZMM' is matched because all maize groups start with it.
 ```
-here is my snippet of code used for data processing
+grep 'ZMM' fang_et_al_genotypes.txt > maize_genotypes.txt
 ```
 
-Here is my brief description of what this code does
+To combine the header with the maize genotypes. The second command checks what the data looks like.
+```
+cat header.txt maize_genotypes.txt > header_maize_genotypes.txt
 
+head -n 10 maize_genotypes.txt | cut -c -100 | column -t
+```
+To extract the groups and maize genotypes to a new file
+```
+cut -f 3-986 header_maize_genotypes.txt > snps_only_header_maize_genotypes.txt
+```
+To transpose data after extracting maize data
+```
+awk -f transpose.awk snps_only_header_maize_genotypes.txt > transposed_maize_genotypes.txt
+```
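The contents of `transpose.awk` are not shown here. As an illustration only, a transpose of a tab-separated table can be written in awk roughly as follows; this is a sketch under the assumption that the whole table fits in memory, not necessarily the script used above. The input table is made up.

```shell
mkdir -p demo_transpose
# Made-up table: one header row of SNP names, then one row per sample.
printf 'Group\tsnp1\tsnp2\nZMM_a\tA/A\tC/T\nZMM_b\tG/G\t?/?\n' > demo_transpose/in.txt

awk -F '\t' '
{
    for (i = 1; i <= NF; i++) cell[NR, i] = $i   # store every field by (row, column)
    if (NF > ncol) ncol = NF                     # remember the widest row
}
END {
    for (i = 1; i <= ncol; i++) {                # each input column becomes an output row
        row = cell[1, i]
        for (j = 2; j <= NR; j++) row = row "\t" cell[j, i]
        print row
    }
}' demo_transpose/in.txt > demo_transpose/out.txt

cat demo_transpose/out.txt
```

After transposing, each SNP name sits in column 1, which is what makes the later join on SNP_ID possible.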
+Before joining, the file must be sorted on SNP_ID (column 1), because `join` expects both inputs sorted on the join field. The first command writes the sorted output to a new file; the second checks that it is sorted.
+```
+sort -k1,1 transposed_maize_genotypes.txt > sorted_maize_genotypes.txt
+head -n 10 sorted_maize_genotypes.txt | cut -c -50 | column -t
+```
 
+To join the sorted snp position and maize files on column 1 (SNP_ID)
+```
+join -1 1 -2 1 -t $'\t' sorted_cut_snp_pos.txt sorted_maize_genotypes.txt > join_maize_genotypes.txt
+```
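A small self-contained sketch of the sort-then-join step on made-up rows. Setting `LC_ALL=C` for both `sort` and `join` is an added precaution, assumed here rather than taken from the commands above, so that both tools agree on the ordering of the key.

```shell
mkdir -p demo_join
tab=$(printf '\t')   # portable way to get a literal tab for -t

# Made-up position table (SNP_ID, chromosome, position) and transposed genotypes.
printf 'snp2\t1\t300\nsnp1\t2\t100\nsnp3\tunknown\tunknown\n' > demo_join/pos.txt
printf 'snp3\tG/G\nsnp1\tA/A\nsnp2\tC/T\nGroup\tZMM_a\n' > demo_join/geno.txt

# Sort both inputs on the key with the same collation that join will use.
LC_ALL=C sort -t "$tab" -k1,1 demo_join/pos.txt  > demo_join/pos.sorted.txt
LC_ALL=C sort -t "$tab" -k1,1 demo_join/geno.txt > demo_join/geno.sorted.txt

# Rows whose key appears in only one file (here: Group) are dropped by default.
LC_ALL=C join -t "$tab" -1 1 -2 1 demo_join/pos.sorted.txt demo_join/geno.sorted.txt > demo_join/joined.txt

cat demo_join/joined.txt
```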
+Sort increasing SNP position for maize
+```
+grep -v "multiple" join_maize_genotypes.txt | grep -v "unknown" | sort -k3,3n > increase_maize_genotypes.txt
+```
+Sort decreasing SNP position for maize
+```
+grep -v "multiple" join_maize_genotypes.txt | grep -v "unknown" | sort -k3,3nr > decrease_maize_genotypes.txt
+```
+Make a file for multiple snps in maize genotype
+```
+awk '$3 ~ /^multiple$/' join_maize_genotypes.txt > maize_multiple.txt
+```
+Make a file for unknown snps in maize genotype
+```
+awk '$3 ~ /^unknown$/' join_maize_genotypes.txt > maize_unknown.txt
+```
+Make directory to put all maize data files
+```
+mkdir maize_data
+```
+Loop to make individual chromosome files (chr1-10) sorted by increasing snp position, with missing data encoded as "?"
+```
+for i in {1..10}; do awk -v chr="$i" '$2 == chr' increase_maize_genotypes.txt | sort -k3,3n > maize_data/maize_data_chr"$i"_increase.txt; done
+```
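A small self-contained sketch of the per-chromosome split on made-up rows. It passes the chromosome number into awk with `-v`, which avoids splicing shell quotes into the awk program; the file names under `demo_split/` are hypothetical.

```shell
mkdir -p demo_split
# Made-up joined rows: SNP_ID, chromosome, position, one genotype column.
printf 'snpA\t1\t500\tA/A\nsnpB\t2\t50\tC/C\nsnpC\t1\t20\t?/?\nsnpD\t10\t7\tG/G\n' > demo_split/joined.txt

for chr in 1 2 10; do
    # Keep rows whose second field equals the chromosome, then sort by position.
    awk -F '\t' -v chr="$chr" '$2 == chr' demo_split/joined.txt |
        sort -t "$(printf '\t')" -k3,3n > "demo_split/chr${chr}_increase.txt"
done

cat demo_split/chr1_increase.txt
```

Because the comparison is on the whole field, chromosome 1 does not pick up rows from chromosome 10.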
+Replace missing value in decreasing maize genotype with "-"
+```
+sed 's/?/-/g' decrease_maize_genotypes.txt > decrease_maize_genotype_dash.txt
+```
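In a basic regular expression, which is what `sed` uses by default, `?` is an ordinary character, so the substitution replaces literal question marks. With `sed -E` the same character would be treated as a quantifier. A bracket expression is literal in both modes; the sketch below shows that variant on made-up rows.

```shell
mkdir -p demo_missing
# Made-up joined rows with "?" marking missing genotype calls.
printf 'snpA\t1\t500\t?/?\tA/A\nsnpB\t1\t20\tC/?\tG/G\n' > demo_missing/in.txt

# [?] matches a literal question mark in both basic and extended regex modes.
sed 's/[?]/-/g' demo_missing/in.txt > demo_missing/out.txt

cat demo_missing/out.txt
```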
+Loop to make individual chromosome files (chr1-10) sorted by decreasing snp position, with missing data encoded as "-"
+```
+for i in {1..10}; do awk -v chr="$i" '$2 == chr' decrease_maize_genotype_dash.txt > maize_data/maize_chr"$i"_decrease.txt; done
+```
 ###Teosinte Data
-
+To match every line containing 'ZMP' in the fang file and redirect standard output to a new file called `teosinte_genotypes.txt`. 'ZMP' is matched because all teosinte groups start with it.
+```
+grep 'ZMP' fang_et_al_genotypes.txt > teosinte_genotypes.txt
 ```
-here is my snippet of code used for data processing
+To combine the header with the teosinte genotypes. The second command checks what the data looks like.
 ```
+cat header.txt teosinte_genotypes.txt > header_teosinte_genotypes.txt
 
-Here is my brief description of what this code does
+head -n 10 teosinte_genotypes.txt | cut -c -100 | column -t
+```
+To check the number of columns is 986
+```
+awk -F "\t" '{print NF; exit}' header_teosinte_genotypes.txt
+```
+To extract the groups and teosinte genotypes to a new file
+```
+cut -f 3-986 header_teosinte_genotypes.txt > snps_only_header_teosinte_genotypes.txt
+```
+To transpose data after extracting teosinte data
+```
+awk -f transpose.awk snps_only_header_teosinte_genotypes.txt > transposed_teosinte_genotypes.txt
+```
+Before joining, the file must be sorted on SNP_ID (column 1), because `join` expects both inputs sorted on the join field. The first command writes the sorted output to a new file; the second checks that it is sorted.
+```
+sort -k1,1 transposed_teosinte_genotypes.txt > sorted_teosinte_genotypes.txt
+head -n 10 sorted_teosinte_genotypes.txt | cut -c -50 | column -t
+```
+To join the sorted snp position and teosinte files on column 1 (SNP_ID)
+```
+join -1 1 -2 1 -t $'\t' sorted_cut_snp_pos.txt sorted_teosinte_genotypes.txt > join_teosinte_genotypes.txt
+```
+Sort increasing SNP position for teosinte
+```
+grep -v "multiple" join_teosinte_genotypes.txt | grep -v "unknown" | sort -k3,3n > increase_teosinte_genotypes.txt
+```
+Sort decreasing SNP position for teosinte
+```
+grep -v "multiple" join_teosinte_genotypes.txt | grep -v "unknown" | sort -k3,3nr > decrease_teosinte_genotypes.txt
+```
+Make a file for multiple snps in teosinte genotype
+```
+awk '$3 ~ /^multiple$/' join_teosinte_genotypes.txt > teosinte_multiple.txt
+```
+Make a file for unknown snps in teosinte genotype
+```
+awk '$3 ~ /^unknown$/' join_teosinte_genotypes.txt > teosinte_unknown.txt
+```
+Make directory to put all teosinte data files
+```
+mkdir teosinte_data
+```
+Loop to make individual chromosome files (chr1-10) sorted by increasing snp position, with missing data encoded as "?"
+```
+for i in {1..10}; do awk -v chr="$i" '$2 == chr' increase_teosinte_genotypes.txt | sort -k3,3n > teosinte_data/teosinte_data_chr"$i"_increase.txt; done
+```
+Replace missing value in decreasing teosinte genotype with "-"
+```
+sed 's/?/-/g' decrease_teosinte_genotypes.txt > decrease_teosinte_genotypes_dash.txt
+```
+Loop to make individual chromosome files (chr1-10) sorted by decreasing snp position, with missing data encoded as "-"
+```
+for i in {1..10}; do awk -v chr="$i" '$2 == chr' decrease_teosinte_genotypes_dash.txt > teosinte_data/teosinte_chr"$i"_decrease.txt; done
+```