diff --git a/Ex12Q1.html b/Ex12Q1.html new file mode 100755 index 0000000..1af2370 --- /dev/null +++ b/Ex12Q1.html @@ -0,0 +1,12137 @@ + + + +Ex12Q1 + + + + + + + + + + + + + + + + + + + +
+
+ +
+
+
+
+
+

Testing a hypothesis with maximum likekihood

Let's say we had some data, and we wanted to generate some statistical information about it and the potential patterns it might follow.

+

In this case, we have a file called "chickwts.txt", containing a bunch of rows and columns of data, with information separated by commas. The first column of this dataset indicates the measured weight of a whole bunch of chicks, and the second column indicates what kind of feed they were on (horsebean, linseed, soybean, etc.) How could we start analyzing this data and looking for patterns?

+

Import necessary packages

As usual, when working in Python, we want to make sure we have the right packages installed to make working with our data easier. Since we're working with a bunch of numbers and tables, it seems likely we'll need both numpy and pandas. As it turns out, we'll also want to look at some graphical representations, so we'll want plotnine, and we'll also end up doing some statistical calculations, so we'll want some elements of scipy.

+ +
+
+
+
+
+
In [ ]:
+
+
+
import numpy
+import pandas
+from plotnine import *
+from scipy.optimize import minimize
+from scipy.stats import norm
+
+ +
+
+
+ +
+
+
+
+
+
+

Import dataset

Conveniently, the dataset we're working with is delineated by commas and has a header line labeling the columns, making it pretty easy to import into Python. We'll import it as a pandas dataframe and call it something convenient, like "weights""

+ +
+
+
+
+
+
In [ ]:
+
+
+
weights=pandas.read_csv("chickwts.txt", header=0, sep=",")
+
+ +
+
+
+ +
+
+
+
+
+
+

Create a plot to summarize the data

Ideally, we'd like to plot each individual feed type separately, so we could see if there was a pattern or connection between how much the chicks weigh and what type of feed they're getting. To do this, we can separate the larger dataframe into a series of smaller dataframes, each representing one type of feed:

+ +
+
+
+
+
+
In [ ]:
+
+
+
horsebean=weights.loc[weights.feed.isin(['horsebean']),:]
+linseed=weights.loc[weights.feed.isin(['linseed']),:]
+soybean=weights.loc[weights.feed.isin(['soybean']),:]
+sunflower=weights.loc[weights.feed.isin(['sunflower']),:]
+meatmeal=weights.loc[weights.feed.isin(['meatmeal']),:]
+casein=weights.loc[weights.feed.isin(['casein']),:]
+
+ +
+
+
+ +
+
+
+
+
+
+

Then, we'll create an initial plotnine object with one of the individual datasets - say, the horsebean dataframe. We'll graph the data on a scatter plot, with individual data points as separate points on the graph, to better see variation between individuals within the same group. We'll also label the x and y axes, and tell Python we'd like to just keep it simple:

+ +
+
+
+
+
+
In [ ]:
+
+
+
g=ggplot(horsebean,aes(x='feed',y='weight'))+geom_point()+theme_classic()
+
+ +
+
+
+ +
+
+
+
+
+
+

Once this initial plotnine object exists, we can add more data to it, simply by adding plots for each new dataset onto the same object:

+ +
+
+
+
+
+
In [ ]:
+
+
+
g=g+geom_point(linseed, aes(x='feed',y='weight'))+theme_classic()
+g=g+geom_point(soybean, aes(x='feed',y='weight'))+theme_classic()
+g=g+geom_point(sunflower, aes(x='feed',y='weight'))+theme_classic()
+g=g+geom_point(meatmeal, aes(x='feed',y='weight'))+theme_classic()
+g=g+geom_point(casein, aes(x='feed',y='weight'))+theme_classic()
+
+ +
+
+
+ +
+
+
+
+
+
+

Our final step will be to actually give Python the command to display the finished plot:

+ +
+
+
+
+
+
In [ ]:
+
+
+
print(g)
+
+ +
+
+
+ +
+
+
+
+
+
+

Make some hypotheses

Looking at this data, we can start to make some hypotheses about the effect of feed type on chick weight. For instance, we might want to know if there's an overall difference in chick weight when the chicks are given soybean feed versus sunflower feed.

+

Put another way, we could define two hypotheses for feeding chicks soybean versus sunflower feed:

+

Null = There is no difference in chick weight whether they are fed with soybean or sunflower feed; the feed type has no effect on weight.

+

Alternative = There is a difference in chick weight when they are fed with soybean versus sunflower feed; the feed type has some effect on weight.

+

We've learned a pretty good way to test a null versus alternative hypothesis - by estimating the maximum likelihood of each model.

+

Rearrange some data

In order to directly compare these two datasets together, we have to combine them. We'll first place the two individual feed-specific dataframes we created above into a list, then tell pandas we'd like to concatenate the two dataframes (essentially, just add one to the end of the other):

+ +
+
+
+
+
+
In [ ]:
+
+
+
frames=[soybean,sunflower]
+
+df=pd.concat(frames)
+
+ +
+
+
+ +
+
+
+
+
+
+

Right now, one of the columns in our dataframe contains a string - the name of the feed type the chicks have been recieving. However, this won't work in later steps - we'll need to plug this dataframe into an equation to determine if the feed type has an effect on weight. In order to do this, we'll need to substitute the names of the feed type with numbers.

+

In this case, because there's only two different types of feed, this is a binary system - in other words, the chicks are either recieving one type of food, or they're recieving the other. This is actually pretty easy to translate into numbers - we'll simply assign one of the feed types to a value of "1" and the other to a value of "0" (it doesn't actually matter which is which, just that they're different).

+ +
+
+
+
+
+
In [ ]:
+
+
+
df.replace(['soybean','sunflower'],[0,1],inplace=True)
+
+ +
+
+
+ +
+
+
+
+
+
+

Define custom likelihood functions

We can now define some custom likelihood functions that describe our two hypotheses. Our null hypothesis function should only have one parameter - that is, it describes the y-variable (chick weight) as some function of parameters, with no input from whether the chicks are being fed soybean or sunflower feed:

+ +
+
+
+
+
+
In [ ]:
+
+
+
def null(p,obs):
+    B0=p[0]
+    sigma=p[1]
+    expected=B0
+    nll=-1*norm(expected,sigma).logpdf(obs.weight).sum()
+    return nll
+
+ +
+
+
+ +
+
+
+
+
+
+

In contrast, our alternative hypothesis function should have an additional parameter - one that varies depending on whether the chicks were fed soybean or sunflower feed (whether the column reads 0 or 1):

+ +
+
+
+
+
+
In [ ]:
+
+
+
def alt(p,obs):
+    B0=p[0]
+    B1=p[1]
+    sigma=p[2]
+    expected=B0+B1*obs.feed
+    nll=-1*norm(expected,sigma).logpdf(obs.weight).sum()
+    return nll
+
+ +
+
+
+ +
+
+
+
+
+
+

Estimate parameters by minimizing negative log likelihood

Now that we've defined some custom functions and asked them to return the negative log likelihood, we can attempt to find the best parameters for each function that match the data we already have. We do this by passing the functions some initial values (it doesn't really matter what these are, they're just a starting point for the functions to start their estimates) and asking it to iterate upon these numbers, changing them a little bit each time, until they find the values that yield the minimum value of log likelihood.

+ +
+
+
+
+
+
In [ ]:
+
+
+
initialVals1=numpy.array([1,1,1])
+
+fitNull=minimize(null,initialVals1,method="Nelder-Mead",options={'disp':True},args=df)
+fitAlt=minimize(alt,initialVals1,method="Nelder-Mead",options={'disp':True},args=df)
+
+ +
+
+
+ +
+
+
+
+
+
+

If we want to know what the ultimate parameters the functions calculated, we can ask Python to print the attribute 'x' for each function, which is a list of the calculated parameter values:

+ +
+
+
+
+
+
In [ ]:
+
+
+
print(fitNull.x)
+print(fitAlt.x)
+
+ +
+
+
+ +
+
+
+
+
+
+

Determine if the effect is statistically significant

Now that we have two values for likelihood, we can actually determine if the difference between two conditions is statistically significant by performing a Chi-square test to determine if two times the difference in likelihoods (D) is large relative to a chi-squared distribution with one degree of freedom.

+

We actually need to import one additional function from scipy that lets us do this - aptly named chi2:

+ +
+
+
+
+
+
In [ ]:
+
+
+
from scipy.stats import chi2
+
+D=(2*(fitNull.fun-fitAlt.fun))
+
+chi2feed=(1-chi2.cdf(x=D,df=1))
+
+print(chi2feed)
+
+ +
+
+
+ +
+
+
+
+
+
+

Turns out, there is a statistically significant difference in chick weight that is dependent on the type of feed they're getting - soybean or sunflower.

+ +
+
+
+
+
+ + + + + + diff --git a/Ex12Q1.ipynb b/Ex12Q1.ipynb new file mode 100755 index 0000000..7680dc0 --- /dev/null +++ b/Ex12Q1.ipynb @@ -0,0 +1,338 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Testing a hypothesis with maximum likekihood\n", + "\n", + "Let's say we had some data, and we wanted to generate some statistical information about it and the potential patterns it might follow. \n", + "\n", + "In this case, we have a file called \"chickwts.txt\", containing a bunch of rows and columns of data, with information separated by commas. The first column of this dataset indicates the measured **weight** of a whole bunch of chicks, and the second column indicates what kind of **feed** they were on (horsebean, linseed, soybean, etc.) How could we start analyzing this data and looking for patterns?\n", + "\n", + "## Import necessary packages\n", + "\n", + "As usual, when working in Python, we want to make sure we have the right packages installed to make working with our data easier. Since we're working with a bunch of numbers and tables, it seems likely we'll need both numpy and pandas. As it turns out, we'll also want to look at some graphical representations, so we'll want plotnine, and we'll also end up doing some statistical calculations, so we'll want some elements of scipy.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "import numpy\n", + "import pandas\n", + "from plotnine import *\n", + "from scipy.optimize import minimize\n", + "from scipy.stats import norm" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Import dataset\n", + "\n", + "Conveniently, the dataset we're working with is delineated by commas and has a header line labeling the columns, making it pretty easy to import into Python. We'll import it as a pandas dataframe and call it something convenient, like \"weights\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "weights=pandas.read_csv(\"chickwts.txt\", header=0, sep=\",\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create a plot to summarize the data\n", + "\n", + "Ideally, we'd like to plot each individual **feed** type separately, so we could see if there was a pattern or connection between how much the chicks **weigh** and what type of **feed** they're getting. To do this, we can separate the larger dataframe into a series of smaller dataframes, each representing one type of **feed**:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "horsebean=weights.loc[weights.feed.isin(['horsebean']),:]\n", + "linseed=weights.loc[weights.feed.isin(['linseed']),:]\n", + "soybean=weights.loc[weights.feed.isin(['soybean']),:]\n", + "sunflower=weights.loc[weights.feed.isin(['sunflower']),:]\n", + "meatmeal=weights.loc[weights.feed.isin(['meatmeal']),:]\n", + "casein=weights.loc[weights.feed.isin(['casein']),:]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Then, we'll create an initial plotnine object with one of the individual datasets - say, the horsebean dataframe. We'll graph the data on a scatter plot, with individual data points as separate points on the graph, to better see variation between individuals within the same group. We'll also label the x and y axes, and tell Python we'd like to just keep it simple:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "g=ggplot(horsebean,aes(x='feed',y='weight'))+geom_point()+theme_classic()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Once this initial plotnine object exists, we can add more data to it, simply by adding plots for each new dataset onto the same object:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "g=g+geom_point(linseed, aes(x='feed',y='weight'))+theme_classic()\n", + "g=g+geom_point(soybean, aes(x='feed',y='weight'))+theme_classic()\n", + "g=g+geom_point(sunflower, aes(x='feed',y='weight'))+theme_classic()\n", + "g=g+geom_point(meatmeal, aes(x='feed',y='weight'))+theme_classic()\n", + "g=g+geom_point(casein, aes(x='feed',y='weight'))+theme_classic()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Our final step will be to actually give Python the command to display the finished plot:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "print(g)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Make some hypotheses\n", + "\n", + "Looking at this data, we can start to make some hypotheses about the effect of **feed** type on chick **weight**. For instance, we might want to know if there's an overall difference in chick weight when the chicks are given soybean feed versus sunflower feed.\n", + "\n", + "Put another way, we could define two hypotheses for feeding chicks soybean versus sunflower feed:\n", + "\n", + "*Null = There is no difference in chick weight whether they are fed with soybean or sunflower feed; the **feed** type has no effect on **weight**.*\n", + "\n", + "*Alternative = There is a difference in chick weight when they are fed with soybean versus sunflower feed; the **feed** type has some effect on **weight**.*\n", + "\n", + "We've learned a pretty good way to test a null versus alternative hypothesis - by estimating the maximum likelihood of each model.\n", + "\n", + "## Rearrange some data\n", + "\n", + "In order to directly compare these two datasets together, we have to combine them. We'll first place the two individual **feed**-specific dataframes we created above into a list, then tell pandas we'd like to concatenate the two dataframes (essentially, just add one to the end of the other):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "frames=[soybean,sunflower]\n", + "\n", + "df=pd.concat(frames)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Right now, one of the columns in our dataframe contains a string - the name of the **feed** type the chicks have been recieving. However, this won't work in later steps - we'll need to plug this dataframe into an equation to determine if the **feed** type has an effect on **weight**. In order to do this, we'll need to substitute the names of the **feed** type with numbers.\n", + "\n", + "In this case, because there's only two different types of **feed**, this is a binary system - in other words, the chicks are either recieving one type of food, or they're recieving the other. This is actually pretty easy to translate into numbers - we'll simply assign one of the feed types to a value of \"1\" and the other to a value of \"0\" (it doesn't actually matter which is which, just that they're different)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "df.replace(['soybean','sunflower'],[0,1],inplace=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Define custom likelihood functions\n", + "\n", + "We can now define some custom likelihood functions that describe our two hypotheses. Our null hypothesis function should only have one parameter - that is, it describes the y-variable (chick **weight**) as some function of parameters, with no input from whether the chicks are being fed soybean or sunflower **feed**:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "def null(p,obs):\n", + " B0=p[0]\n", + " sigma=p[1]\n", + " expected=B0\n", + " nll=-1*norm(expected,sigma).logpdf(obs.weight).sum()\n", + " return nll" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In contrast, our alternative hypothesis function should have an additional parameter - one that varies depending on whether the chicks were fed soybean or sunflower **feed** (whether the column reads 0 or 1):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "def alt(p,obs):\n", + " B0=p[0]\n", + " B1=p[1]\n", + " sigma=p[2]\n", + " expected=B0+B1*obs.feed\n", + " nll=-1*norm(expected,sigma).logpdf(obs.weight).sum()\n", + " return nll" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Estimate parameters by minimizing negative log likelihood\n", + "\n", + "Now that we've defined some custom functions and asked them to return the negative log likelihood, we can attempt to find the best parameters for each function that match the data we already have. We do this by passing the functions some initial values (it doesn't really matter what these are, they're just a starting point for the functions to start their estimates) and asking it to iterate upon these numbers, changing them a little bit each time, until they find the values that yield the minimum value of log likelihood." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "initialVals1=numpy.array([1,1,1])\n", + "\n", + "fitNull=minimize(null,initialVals1,method=\"Nelder-Mead\",options={'disp':True},args=df)\n", + "fitAlt=minimize(alt,initialVals1,method=\"Nelder-Mead\",options={'disp':True},args=df)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If we want to know what the ultimate parameters the functions calculated, we can ask Python to print the attribute 'x' for each function, which is a list of the calculated parameter values:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "print(fitNull.x)\n", + "print(fitAlt.x)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Determine if the effect is statistically significant\n", + "\n", + "Now that we have two values for likelihood, we can actually determine if the difference between two conditions is statistically significant by performing a Chi-square test to determine if two times the difference in likelihoods (D) is large relative to a chi-squared distribution with one degree of freedom.\n", + "\n", + "We actually need to import one additional function from scipy that lets us do this - aptly named chi2:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "from scipy.stats import chi2\n", + "\n", + "D=(2*(fitNull.fun-fitAlt.fun))\n", + "\n", + "chi2feed=(1-chi2.cdf(x=D,df=1))\n", + "\n", + "print(chi2feed)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Turns out, there is a statistically significant difference in chick **weight** that is dependent on the type of **feed** they're getting - soybean or sunflower." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 2", + "language": "python", + "name": "python2" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython2", + "version": "2.7.13" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/Question1.py b/Question1.py new file mode 100755 index 0000000..fed7fe6 --- /dev/null +++ b/Question1.py @@ -0,0 +1,58 @@ +import numpy +import pandas +from plotnine import * +from scipy.optimize import minimize +from scipy.stats import norm + +weights=pandas.read_csv("chickwts.txt", header=0, sep=",") + +horsebean=weights.loc[weights.feed.isin(['horsebean']),:] +linseed=weights.loc[weights.feed.isin(['linseed']),:] +soybean=weights.loc[weights.feed.isin(['soybean']),:] +sunflower=weights.loc[weights.feed.isin(['sunflower']),:] +meatmeal=weights.loc[weights.feed.isin(['meatmeal']),:] +casein=weights.loc[weights.feed.isin(['casein']),:] + +g=ggplot(horsebean,aes(x='feed',y='weight'))+geom_point()+theme_classic() +g=g+geom_point(linseed, aes(x='feed',y='weight'))+theme_classic() +g=g+geom_point(soybean, aes(x='feed',y='weight'))+theme_classic() +g=g+geom_point(sunflower, aes(x='feed',y='weight'))+theme_classic() +g=g+geom_point(meatmeal, aes(x='feed',y='weight'))+theme_classic() +g=g+geom_point(casein, aes(x='feed',y='weight'))+theme_classic() + +frames=[soybean,sunflower] + +df=pd.concat(frames) +df.replace(['soybean','sunflower'],[0,1],inplace=True) + +######################## +#null hypothesis likelihood equation +def null(p,obs): + B0=p[0] + sigma=p[1] + expected=B0 + nll=-1*norm(expected,sigma).logpdf(obs.weight).sum() + return nll + +#Alternative hypothesis likelihood equation +def alt(p,obs): + B0=p[0] + B1=p[1] + sigma=p[2] + expected=B0+B1*obs.feed + nll=-1*norm(expected,sigma).logpdf(obs.weight).sum() + return nll + +#estimaitng parameters by minimizing the nll +initialVals1=numpy.array([1,1,1]) + +fitNull=minimize(null,initialVals1, method="Nelder-Mead",options={'disp': True}, args=df) +fitAlt=minimize(alt,initialVals1, method="Nelder-Mead",options={'disp': True}, args=df) + +print(fitNull.x) +print(fitAlt.x) + +from scipy.stats import chi2 +D=(2*(fitNull.fun-fitAlt.fun)) +chi2feed=(1-chi2.cdf(x=D,df=1)) +print(chi2feed) diff --git a/Question2.py b/Question2.py new file mode 100755 index 0000000..225d9a3 --- /dev/null +++ b/Question2.py @@ -0,0 +1,20 @@ +import re + +testset1=['13:40','11:20','22:00','24:49'] + +regextime=r"(12)|(13)|(14)|(15)|(16)|(17)|(18)|(19)|(20)|(21)|(22)|(23)|(24)\:[0-9]{2}" +regextime2=re.compile(regextime) +filter(regextime2.match,testset1) + + +testset2='H. sapien','D. rerio','YYstutorial' + +regexGenus=r"[A-Z]?\. [a-z]+" +regexGenus2=re.compile(regexGenus) +filter(regexGenus2.match,testset2) + +testset3=['1wo-rd-me22','food','123-45-6789','555-55-5555'] + +regexSSN=r"[0-9]{3}\-[0-9]{2}\-[0-9]{4}" +regexSSN2=re.compile(regexSSN) +filter(regexSSN2.match,testset3) \ No newline at end of file diff --git a/Question2completed.html b/Question2completed.html new file mode 100755 index 0000000..3b4fa09 --- /dev/null +++ b/Question2completed.html @@ -0,0 +1,12132 @@ + + + +Untitled + + + + + + + + + + + + + + + + + + + +
+
+ +
+
+
+
+
+

Question 2

+
+
+
+
+
+
+
+
+

How to use Regular Expression (REGEX) in Python

+
+
+
+
+
+
+
+
+

Example 1 - Military time after noon

+
+
+
+
+
+
+
+
+

Step 1 is to import the REGEX package in python

+
+
+
+
+
+
In [1]:
+
+
+
import re
+
+ +
+
+
+ +
+
+
+
+
+
+

Step 2 is to create a data list to use so we can test our REGEX - we will call this list 'testset1' for simplicity - we will include 4 times, with 3 that should work and 1 that shouldn't

+
+
+
+
+
+
In [2]:
+
+
+
testset1=['13:40','11:20','22:00','24:49']
+
+ +
+
+
+ +
+
+
+
+
+
+

Step 3 is to assign a variable to your regular expression

+
+
+
+
+
+
In [3]:
+
+
+
regextime=r"(12)|(13)|(14)|(15)|(16)|(17)|(18)|(19)|(20)|(21)|(22)|(23)|(24)\:[0-9]{2}"
+
+ +
+
+
+ +
+
+
+
+
+
+
This regular expression works because the first number before the " : " is anywhere between 12-24, then we put a ":" followed by any 2 numbers [0-9] for the minutes.
+
+
+
+
+
+
+
+
+

Step 4 we next compile the REGEX and then use the filter command with the compiled REGEX and .match on our test list 'testset1' and our resulting matches are printed for us.

+
+
+
+
+
+
In [4]:
+
+
+
regextime2=re.compile(regextime)
+filter(regextime2.match,testset1)
+
+ +
+
+
+ +
+
+ + +
+
Out[4]:
+ + + +
+
['13:40', '22:00', '24:49']
+
+ +
+ +
+
+ +
+
+
+
+
+
+

Example 2 - Genus species 'G. species'

+
+
+
+
+
+
+
+
+

Step 1. Again we will create a list of examples we know will work and also some that will not work to test our REGEX

+
+
+
+
+
+
In [5]:
+
+
+
testset2='H. sapien','D. rerio','YYstutorial'
+
+ +
+
+
+ +
+
+
+
+
+
+

Step 2. We will assign a new variable 'regexGenus' to this particular REGEX code and compile it.

+
+
+
+
+
+
In [6]:
+
+
+
regexGenus=r"[A-Z]?\. [a-z]+"
+regexGenus2=re.compile(regexGenus)
+
+ +
+
+
+ +
+
+
+
+
+
+
this REGEX works because we have a capital letter "[A-Z]? followed by a period " . " and a space " " then any number of lower cased letters [a-z]+
+
+
+
+
+
+
+
+
+

Step 3. Again we will check to see any matches using the filter command that will print the matches

+
+
+
+
+
+
In [7]:
+
+
+
filter(regexGenus2.match,testset2)
+
+ +
+
+
+ +
+
+ + +
+
Out[7]:
+ + + +
+
('H. sapien', 'D. rerio')
+
+ +
+ +
+
+ +
+
+
+
+
+
+

Example 3 - matching a social security number SSN

+
+
+
+
+
+
+
+
+

Step 1 - creaat data list

+
+
+
+
+
+
In [8]:
+
+
+
testset3=['1wo-rd-me22','food','123-45-6789','555-55-5555']
+
+ +
+
+
+ +
+
+
+
+
+
+

Step 2 - Assign a variable to REGEX code ' regexSSN' and compile it

+
+
+
+
+
+
In [9]:
+
+
+
regexSSN=r"[0-9]{3}\-[0-9]{2}\-[0-9]{4}"
+regexSSN2=re.compile(regexSSN)
+
+ +
+
+
+ +
+
+
+
+
+
+
this regular expression works because we have three {3} numbers [0-9] followed by a hyphen that we escape with a \ then we do the same thing for the next two sections of the REGEX depending on how many number would fit a SSN (ie {2} then {4}).
+
+
+
+
+
+
+
+
+

Step 3 - Again we will filter and match the results which then the matches get printed

+
+
+
+
+
+
In [10]:
+
+
+
filter(regexSSN2.match,testset3)
+
+ +
+
+
+ +
+
+ + +
+
Out[10]:
+ + + +
+
['123-45-6789', '555-55-5555']
+
+ +
+ +
+
+ +
+
+
+ + + + + +