183 changes: 152 additions & 31 deletions UNIX_Assignment_Template.md
@@ -2,57 +2,178 @@

##Data Inspection

###Attributes of `fang_et_al_genotypes` and `snp_position`

To look at the first 10 lines and inspect the headers
```
head fang_et_al_genotypes.txt
```

To look at the number of columns
```
awk -F "\t" '{print NF; exit}' fang_et_al_genotypes.txt
```
To look at the number of lines, words and characters (bytes)
```
wc fang_et_al_genotypes.txt
```
To look at the file size
```
du -h fang_et_al_genotypes.txt
```
To look at file type
```
file fang_et_al_genotypes.txt
```
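
Because the same commands are run on both files, they can be wrapped in a small loop. This is only a sketch: the `inspect` function name is made up for illustration, and it assumes both files sit in the current directory (missing files are skipped).

```shell
# inspect FILE: print column count, line/word/byte counts, size and type.
# "inspect" is a hypothetical helper name, not part of the assignment.
inspect() {
    f=$1
    printf 'columns: '
    awk -F '\t' '{print NF; exit}' "$f"   # tab-separated fields on line 1
    printf 'lines words bytes: '
    wc "$f"
    printf 'size: '
    du -h "$f"
    printf 'type: '
    file "$f"
}

for f in fang_et_al_genotypes.txt snp_position.txt; do
    if [ -f "$f" ]; then
        inspect "$f"
    else
        echo "skipping $f (not found)" >&2
    fi
done
```
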

Note: all of the commands above were repeated for the snp position file, changing only the file name passed to each command.

By inspecting these files I learned that:

1. The fang et al file is very unorganized, and I can't tell the headers apart from the data.
2. The fang et al file has 986 columns; the snp position file has 15 columns.
3. The fang et al file has 2783 lines, 2744038 words and 11051939 characters. The snp position file has 984 lines, 13198 words and 82763 characters.
4. The fang et al file is 6.5M in size; the snp position file is 41K in size.
5. Both files are ASCII text.

##Data Processing

To extract column 3, which holds the group labels used in the steps below, and count the number of occurrences of each value in that column
```
cut -f3 fang_et_al_genotypes.txt | sort | uniq -c
```
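
`uniq -c` only counts each value; to also order the values by how often they occur, the counts can be passed through a second numeric sort. A minimal sketch on a made-up six-line sample (the sample rows and the `sample` and `counts` variable names are assumptions for illustration):

```shell
# Tiny stand-in for the real file: tab-separated, values of interest in column 3.
sample=$(mktemp)
printf 'id1\tx\tZMMIL\nid2\tx\tZMPBA\nid3\tx\tZMMIL\nid4\tx\tZMMIL\nid5\tx\tZMPBA\nid6\tx\tTRIPS\n' > "$sample"

# Count each value in column 3, then order the counts from most to least frequent.
counts=$(cut -f3 "$sample" | sort | uniq -c | sort -k1,1nr)
printf '%s\n' "$counts"
rm -f "$sample"
```

For the real data, `"$sample"` would be replaced with `fang_et_al_genotypes.txt`.
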
To keep only the required columns (1, 3 and 4) of the snp position file and sort the result by column 1
```
cut -f 1,3,4 snp_position.txt > 134col_snp_pos.txt
sort -k1,1 134col_snp_pos.txt > sorted_cut_snp_pos.txt
```
To grab the header into a new file
```
head -n 1 fang_et_al_genotypes.txt > header.txt
```
###Maize Data

To match every line containing 'ZMM' in the fang file and redirect the standard output to a new file called maize_genotypes.txt. 'ZMM' is matched because all maize entries start with it.
```
grep 'ZMM' fang_et_al_genotypes.txt > maize_genotypes.txt
```

To combine the header with the maize genotypes. The second command checks what the data looks like.
```
cat header.txt maize_genotypes.txt > header_maize_genotypes.txt

head -n 10 maize_genotypes.txt | cut -c -100 | column -t
```
To extract the groups and maize genotypes to a new file
```
cut -f 3-986 header_maize_genotypes.txt > snps_only_header_maize_genotypes.txt
```
To transpose data after extracting maize data
```
awk -f transpose.awk snps_only_header_maize_genotypes.txt > transposed_maize_genotypes.txt
```
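
The contents of `transpose.awk` are not shown here. As a rough idea of what such a script does, a transpose can be written in a few lines of awk. The following is an assumed stand-in run on a made-up three-line sample, not the script used above:

```shell
# Made-up sample: one header row and two data rows, tab-separated.
sample=$(mktemp)
printf 'Group\tSNP1\tSNP2\nZMMIL\tA/A\tC/G\nZMMLR\tT/T\t?/?\n' > "$sample"

transposed=$(awk -F '\t' '
{
    for (i = 1; i <= NF; i++) cell[NR, i] = $i   # store every field by (row, column)
    if (NF > ncol) ncol = NF                     # remember the widest row
}
END {
    for (i = 1; i <= ncol; i++) {                # each input column becomes an output row
        line = cell[1, i]
        for (j = 2; j <= NR; j++) line = line "\t" cell[j, i]
        print line
    }
}' "$sample")
printf '%s\n' "$transposed"
rm -f "$sample"
```

After the transpose, each SNP name sits in column 1 of its own row, which is what the later sort and join on column 1 rely on.
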
Before joining, the file has to be sorted by SNP_ID. This sorts on column 1 and writes the standard output to a new file. I also checked that the result is sorted.
```
sort -k1,1 transposed_maize_genotypes.txt > sorted_maize_genotypes.txt
head -n 10 sorted_maize_genotypes.txt | cut -c -50 | column -t
```

To join sorted snp and maize
```
join -1 1 -2 1 -t $'\t' sorted_cut_snp_pos.txt sorted_maize_genotypes.txt > join_maize_genotypes.txt
```
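
`join` expects both inputs to be sorted on the join field, so it can help to confirm the order with `sort -c` before joining. A minimal sketch on two made-up files (the `pos.txt` and `geno.txt` names and their contents are assumptions for illustration):

```shell
work=$(mktemp -d)
tab=$(printf '\t')
printf 'snpB\t2\t200\nsnpA\t1\t100\nsnpC\t1\t50\n' > "$work/pos.txt"
printf 'snpC\tA/A\nsnpA\tC/C\nsnpB\tG/G\n' > "$work/geno.txt"

# Sort both inputs on the join field (column 1).
sort -t "$tab" -k1,1 "$work/pos.txt"  > "$work/pos.sorted.txt"
sort -t "$tab" -k1,1 "$work/geno.txt" > "$work/geno.sorted.txt"

# sort -c exits non-zero if a file is not already in order.
sort -c -t "$tab" -k1,1 "$work/pos.sorted.txt" &&
    sort -c -t "$tab" -k1,1 "$work/geno.sorted.txt" &&
    echo "both inputs sorted"

# Join on column 1 of each file, keeping tab as the separator.
joined=$(join -t "$tab" -1 1 -2 1 "$work/pos.sorted.txt" "$work/geno.sorted.txt")
printf '%s\n' "$joined"
rm -rf "$work"
```
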
Sort increasing SNP position for maize
```
grep -v "multiple" join_maize_genotypes.txt | grep -v "unknown" | sort -k3,3n > increase_maize_genotypes.txt
```
Sort decreasing SNP position for maize
```
grep -v "multiple" join_maize_genotypes.txt | grep -v "unknown" | sort -k3,3nr > decrease_maize_genotypes.txt
```
Make a file for multiple snps in maize genotype
```
awk '$3 ~ /^multiple$/' join_maize_genotypes.txt > maize_multiple.txt
```
Make a file for unknown snps in maize genotype
```
awk '$3 ~ /^unknown$/' join_maize_genotypes.txt > maize_unknown.txt
```
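
One way to sanity-check the split is to confirm that the rows with a numeric position, the "multiple" rows and the "unknown" rows add up to the total number of joined rows. A minimal sketch on a made-up four-line joined file (file name and contents are assumptions for illustration):

```shell
work=$(mktemp -d)
printf 'snp1\t1\t300\tA/A\nsnp2\tmultiple\tmultiple\tC/C\nsnp3\tunknown\tunknown\tG/G\nsnp4\t2\t100\tT/T\n' > "$work/join.txt"

# Test column 3 exactly, so a match elsewhere on the line is not counted.
placed=$(awk -F '\t' '$3 != "multiple" && $3 != "unknown" {n++} END {print n + 0}' "$work/join.txt")
multiple=$(awk -F '\t' '$3 == "multiple" {n++} END {print n + 0}' "$work/join.txt")
unknown=$(awk -F '\t' '$3 == "unknown" {n++} END {print n + 0}' "$work/join.txt")
total=$(awk 'END {print NR}' "$work/join.txt")

# Every row should land in exactly one of the three groups.
if [ $((placed + multiple + unknown)) -eq "$total" ]; then
    echo "row counts add up: $total"
fi
rm -rf "$work"
```
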
Make directory to put all maize data files
```
mkdir maize_data
```
Loop to make individual chromosome files (chr1-10) sorted by increasing snp position, with missing values left as ?
```
for i in {1..10}; do awk -v chr="$i" '$2 == chr' increase_maize_genotypes.txt | sort -k3,3n > maize_data/maize_data_chr"$i"_increase.txt; done
```
Replace missing value in decreasing maize genotype with "-"
```
sed 's/?/-/g' decrease_maize_genotypes.txt > decrease_maize_genotype_dash.txt
```
Loop to make individual chromosome files (chr1-10) sorted by decreasing snp position, with missing values shown as "-"
```
for i in {1..10}; do awk -v chr="$i" '$2 == chr' decrease_maize_genotype_dash.txt > maize_data/maize_chr"$i"_decrease.txt; done
```
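
As an alternative to looping ten times over the same file, awk can route each row to a per-chromosome file in a single pass. This is a sketch on made-up data, not the approach used above; the `chrN_increase.txt` output names are hypothetical:

```shell
work=$(mktemp -d)
tab=$(printf '\t')
mkdir -p "$work/maize_data"
printf 'snp1\t1\t300\tA/A\nsnp2\t2\t50\t?/?\nsnp3\t1\t100\tC/C\nsnp4\t10\t20\tG/G\n' > "$work/join.txt"

# Sort once by position, then let awk write each row to the file for its chromosome.
sort -t "$tab" -k3,3n "$work/join.txt" |
    awk -F '\t' -v dir="$work/maize_data" '
        $2 ~ /^[0-9]+$/ && $2 >= 1 && $2 <= 10 {
            out = dir "/chr" $2 "_increase.txt"
            print > out
        }'

chr1=$(cat "$work/maize_data/chr1_increase.txt")
chr10=$(cat "$work/maize_data/chr10_increase.txt")
printf '%s\n' "$chr1"
rm -rf "$work"
```

Rows whose chromosome is not a number from 1 to 10 (for example "multiple" or "unknown") are skipped by the pattern.
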
###Teosinte Data

To match every line containing 'ZMP' in the fang file and redirect the standard output to a new file called teosinte_genotypes.txt. 'ZMP' is matched because all teosinte entries start with it.
```
grep 'ZMP' fang_et_al_genotypes.txt > teosinte_genotypes.txt
```
To combine the header with the teosinte genotypes. The second command checks what the data looks like.
```
cat header.txt teosinte_genotypes.txt > header_teosinte_genotypes.txt

head -n 10 teosinte_genotypes.txt | cut -c -100 | column -t
```
To check the number of columns is 986
```
awk -F "\t" '{print NF; exit}' header_teosinte_genotypes.txt
```
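
The command above only reads the first line. To confirm that every row has the same number of columns, the distinct field counts can be listed instead. A minimal sketch on a made-up sample (column names and values are assumptions for illustration):

```shell
sample=$(mktemp)
printf 'id\tinfo\tgroup\tsnp1\nid1\tx\tZMPBA\tA/A\nid2\ty\tZMPIL\tC/C\n' > "$sample"

# Print the distinct field counts; a single value means every row has the same width.
widths=$(awk -F '\t' '{print NF}' "$sample" | sort -n | uniq)
echo "$widths"
rm -f "$sample"
```
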
To extract the groups and teosinte genotypes to a new file
```
cut -f 3-986 header_teosinte_genotypes.txt > snps_only_header_teosinte_genotypes.txt
```
To transpose data after extracting teosinte data
```
awk -f transpose.awk snps_only_header_teosinte_genotypes.txt > transposed_teosinte_genotypes.txt
```
Before joining, the file has to be sorted by SNP_ID. This sorts on column 1 and writes the standard output to a new file. I also checked that the result is sorted.
```
sort -k1,1 transposed_teosinte_genotypes.txt > sorted_teosinte_genotypes.txt
head -n 10 sorted_teosinte_genotypes.txt | cut -c -50 | column -t
```
To join sorted snp and teosinte
```
join -1 1 -2 1 -t $'\t' sorted_cut_snp_pos.txt sorted_teosinte_genotypes.txt > join_teosinte_genotypes.txt
```
Sort increasing SNP position for teosinte
```
grep -v "multiple" join_teosinte_genotypes.txt | grep -v "unknown" | sort -k3,3n > increase_teosinte_genotypes.txt
```
Sort decreasing SNP position for teosinte
```
grep -v "multiple" join_teosinte_genotypes.txt | grep -v "unknown" | sort -k3,3nr > decrease_teosinte_genotypes.txt
```
Make a file for multiple snps in teosinte genotype
```
awk '$3 ~ /^multiple$/' join_teosinte_genotypes.txt > teosinte_multiple.txt
```
Make a file for unknown snps in teosinte genotype
```
awk '$3 ~ /^unknown$/' join_teosinte_genotypes.txt > teosinte_unknown.txt
```
Make directory to put all teosinte data files
```
mkdir teosinte_data
```
Loop to make individual chromosome files (chr1-10) sorted by increasing snp position, with missing values left as ?
```
for i in {1..10}; do awk -v chr="$i" '$2 == chr' increase_teosinte_genotypes.txt | sort -k3,3n > teosinte_data/teosinte_data_chr"$i"_increase.txt; done
```
Replace missing value in decreasing teosinte genotype with "-"
```
sed 's/?/-/g' decrease_teosinte_genotypes.txt > decrease_teosinte_genotypes_dash.txt
```
Loop to make individual chromosome files (chr1-10) sorted by decreasing snp position, with missing values shown as "-"
```
for i in {1..10}; do awk -v chr="$i" '$2 == chr' decrease_teosinte_genotypes_dash.txt > teosinte_data/teosinte_chr"$i"_decrease.txt; done
```