diff --git a/DataForTut12.R b/DataForTut12.R new file mode 100644 index 0000000..8e56fef --- /dev/null +++ b/DataForTut12.R @@ -0,0 +1,34 @@ +### Making the data for Q2 ### + +setwd('/Users/andrewguinness/Google Drive/ND/Fall 17/Intro to Biocomputing/Intro_Biocom_ND_319_Tutorial12') +library(stringr) + +SS <- round(runif(10, 111111111, 999999999)) +SS <- as.character(SS) +SS <- paste(substr(SS, 1, 3), substr(SS, 4, 5), substr(SS, 6, nchar(SS)), sep = '-') + + +hours <- sample(00:23, 15) +hours <- as.character(hours) +for(i in 1:15){ + if(nchar(hours[i]) == 1){ + hours[i] = paste("0", hours[i], sep = '') + } +} +minutes <- sample(00:59, 15) +minutes <- as.character(minutes) +for(i in 1:15){ + if(nchar(minutes[i]) == 1){ + minutes[i] = paste("0", minutes[i], sep = '') + } +} + +times <- paste(hours, minutes, sep = ':') + + + +sp <- c("H. sapiens", "C. pipiens", "H. cecropia", "E. coli", "R. pomonella") + +list <- as.list(c(times, SS, sp)) + +write.table(list, "pattern.txt", row.names = F, col.names = F, sep = '\n') diff --git a/Tutorial12.html b/Tutorial12.html new file mode 100644 index 0000000..aed84eb --- /dev/null +++ b/Tutorial12.html @@ -0,0 +1,12213 @@ + + +
+import pandas as pd
+import numpy as np
+import os
+from scipy.optimize import minimize
+from scipy.stats import norm
+import scipy.stats
+
+os.chdir('/Users/andrewguinness/Google Drive/ND/Fall 17/Intro to Biocomputing/Intro_Biocom_ND_319_Tutorial12')
+Let's begin by importing all of the packages we will need in a manner conducive to their use, and set our working directory to the directory containing our data.
+ +chickwts = pd.read_csv("chickwts.txt")
+
+chickwts = chickwts[(chickwts.feed == 'soybean') | (chickwts.feed == 'sunflower')]
+
+chickwts = chickwts.replace('soybean', 0)
+chickwts = chickwts.replace('sunflower',1)
+Now we want to read in our data and save it as an argument.
+Thereafter, we want to subset our data so that we can answer the question we're asking statistically. So, let's grab the chick weights ONLY for those fed soybean and sunflower diets. (Read the | bar as "or").
+Now, let's treat soybean like our intercept of the linear model, b0 by setting it to zero.
+What this does is it takes the linear model: y = b0 + b1 * x and changes it to y = b0 (because x = 0).
+This will become clearer later.
+ +def nllike(p,obs):
+ B0=p[0]
+ B1=p[1]
+ sigma=p[2]
+ expected=B0+B1*obs.feed
+ nll=-1*norm(expected,sigma).logpdf(obs.weight).sum()
+ return nll
+
+initialguess = np.array([1,1,1])
+
+fit = minimize(nllike, initialguess, method="Nelder-Mead", options={'disp': True}, args=chickwts)
+
+print(fit.x)
+
+scipy.stats.ttest_ind(chickwts[(chickwts.feed == 0)], chickwts[(chickwts.feed == 1)])
+Here we define the normal log likelihood function, which takes three parameter (p) values, which is our initialguess, and our observations (obs).
+So, how do we interpet this?
+You'll see from the fit (which optimizes the values to predict their LOWEST log likelihood (i.e. highest likelihood)) that b0 gives you some value and b1 gives you a positive additive value. That is to say, on average the chicks fed sunflower weigh b1 more than soybean chicks.
+We can confirm these results with a T-test, indicating the two populations are discrete.
+ +Here, I've synthesized some data on which we'll be doing three separate pattern matching steps, so many different types of data are available in the same file. Let's go ahead and read this in and take a look.
+ +import re
+
+textfile = open('pattern.txt', 'r')
+filetext = textfile.read()
+textfile.close()
+
+print filetext
+Alright, so now we want to pull out three separate things:
+1) Just the SSNs
+2) Times after noon (12:00)
+3) The species names
+Let's do these each one at a time.
+ +## Social Security Numbers
+SSNmatch = re.compile('\d{3}-\d{2}-\d{4}')
+
+SSNs = re.findall(SSNmatch, filetext)
+print SSNs
+So, this gives us a list of all 10 social security numbers in the data set by matching a pattern of three digits (\d), a hyphen, two digits, a hyphen, and four digits.
+Truthfully, we don't need to match them this way. Since the only data structure in this file that contains hyphens (or for that matter even contains digits longer than two) are the social security numbers, we can be a little bit lazier in this specific case.
+ +## Social Security Numbers the lazy way
+
+SSNmatch = re.compile('\d{3}.+')
+
+SSNs = re.findall(SSNmatch, filetext)
+print SSNs
+In this case, we match anything that has three digits in a row and all of the other characters that come after it (. means anything, + takes it ad infinitum)
+ +Let's skip challenge two for right now. We'll get back to it. We can be super lazy with the abbreviated binomial names also. They're the ONLY data structure here that has any word characters.
+ +## Binomial names
+
+speciesmatch = re.compile('[A-Za-z].+')
+
+spp = re.findall(speciesmatch, filetext)
+print spp
+The final option asks us to pull out times in a 24-hr format after noon. This is slightly more challenging because we can't just match all digits. So, we want to be able to pull out times like 14:56 and 21:43, but ignore times like 01:53.
+This sounds convoluted at first, but it's really quite simple.
+We want to match times that start with one digit of either 1 or 2, but not zero and are followed by a digit 2 or greater if they start with 1. Then there's a colon and some other stuff. The colon is a key here because it lets us ignore the social security numbers.
+ +## Times after 12:00
+timesmatch = re.compile('1{1}[2-9]{1}:.+|2{1}[0-9]{1}:.+')
+
+afternoon = re.findall(timesmatch, filetext)
+print afternoon
+So, let's translate this to English.
+We're looking for something that has a 1 once and then {1} digit 2 or greater, a colon, and anything after that.
+OR (|)
+We're looking for something that starts with a 2 and has any digit and a colon (etc.) thereafter.
+ +