diff --git a/Exercises/scripts/DataProcessingPython.ipynb b/Exercises/scripts/DataProcessingPython.ipynb
index e37a55c..e20e39b 100644
--- a/Exercises/scripts/DataProcessingPython.ipynb
+++ b/Exercises/scripts/DataProcessingPython.ipynb
@@ -7,7 +7,7 @@
"source": [
"# Data Processing in Python\n",
"\n",
- "QLS-MiCM Workshop - November 18, 2025\n",
+ "QLS-MiCM Workshop - March 11, 2026\n",
"\n",
"Benjamin Z. Rudski, PhD Candidate, Quantitative Life Sciences, McGill University\n",
"\n",
@@ -31,12 +31,12 @@
"metadata": {},
"outputs": [],
"source": [
- "using_colab = False\n",
+ "using_colab = True\n",
"\n",
"if using_colab:\n",
" !wget https://github.com/QLS-MiCM/DataProcessingInPython/archive/refs/heads/main.zip\n",
" !unzip main.zip\n",
- " base_dir = \"Data-Processing-in-Python-main/Exercises/\"\n",
+ " base_dir = \"DataProcessingInPython-main/Exercises/\"\n",
"else:\n",
" base_dir = \"../\""
]
@@ -517,7 +517,7 @@
"\n",
"An example is [CuPy](https://cupy.dev/), which allows performing NumPy and SciPy operations on the GPU. This package **does not** work on all systems. It requires an NVIDIA GPU and CUDA, which is not available on macOS. In these cases, it's very important to read the [installation instructions](https://docs.cupy.dev/en/stable/install.html).\n",
"\n",
- "If you have both `conda` and `pip` installed, the `conda` [documentation](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) recommends trying to install packages with `conda` first. You can easily search on https://anaconda.org to see if the package is available. Installing packages with `conda` makes it easier to manage multiple *environments* (which we'll discuss soon)."
+ "If you have both `conda` and `pip` installed, the `conda` [documentation](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) recommends trying to install packages with `conda` first. You can easily search on https://anaconda.org to see if the package is available. Installing packages with `conda` makes it easier to manage multiple *environments*."
]
},
{
@@ -530,7 +530,8 @@
"\n",
"We've seen what packages are and how to install them, but now how do we use them?\n",
"\n",
- "To use a package, we have to import it, just like we import a module. Since we use a lot of functions from a package, we often it a shorter name when we import it. Here's the syntax for doing this:\n",
+ "To use a package, we have to import it, just like we import a module. Since we use a lot of functions from a package, we often give it a shorter name when we import it. Here's the syntax for doing this:\n",
+ "\n",
"```python\n",
"import package_name as short_name\n",
"```\n",
diff --git a/Exercises/scripts/DataProcessingPythonCompact.ipynb b/Exercises/scripts/DataProcessingPythonCompact.ipynb
new file mode 100644
index 0000000..a46ffd7
--- /dev/null
+++ b/Exercises/scripts/DataProcessingPythonCompact.ipynb
@@ -0,0 +1,3525 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "e49c93e4-f02a-4c79-a5c1-2960e451c28a",
+ "metadata": {},
+ "source": [
+ "# Data Processing in Python\n",
+ "\n",
+ "QLS-MiCM Workshop - March 11, 2026\n",
+ "\n",
+ "Benjamin Z. Rudski, PhD Candidate, Quantitative Life Sciences, McGill University\n",
+ "\n",
+ "Dear `Reader | Workshop Attendee`,\n",
+ "\n",
+ "Welcome! In this interactive Jupyter notebook, we will explore some basics of processing data using the Python programming language. This workshop will focus mainly on how to perform powerful data processing tasks using packages developed and distributed by other people. We'll see how to process large data arrays, generate useful visualisations and store information in tables. \n",
+ "\n",
+ "This workshop assumes that you have a basic knowledge of Python. If you don't, feel free to check out some beginner resources. In a shameless self-promotion plug, you may find the QLS-MiCM [Intro to Python](https://github.com/QLS-MiCM/IntroToPython) workshop helpful.\n",
+ "\n",
+ "To maximise engagement, this workshop involves some live-coding. This **compact student version** notebook contains various blanks throughout, as well as empty cells for the end-of-module exercises. The [**solution version**](../solutions/DataProcessingPython.ipynb) can be found in the `solutions` folder.\n",
+ "\n",
+ "> 💡 **Tip:** Even if you don't want to fill in the blanks yourself, I **strongly** recommend trying the exercises before looking at the solutions. There's often more than one way to accomplish a task, so it's better that you figure out the intuition for yourself. Your answer may actually be better than the one I've provided!\n",
+ "\n",
+ "> ❗️ **Attention:** This notebook is a compact version, where many of my explanations have been removed for readability. If you would like the complete notebook, with all explanations but still with blanks for you to fill out, check out the [**student version**](./DataProcessingPython.ipynb).\n",
+ "\n",
+ "Before getting started, let's configure something quickly to ensure everything works when running on Colab (adapted from https://stackoverflow.com/a/69419390/):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "13862d53-fc7c-4214-b2fb-6544bcedb558",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "using_colab = True\n",
+ "\n",
+ "if using_colab:\n",
+ " !wget https://github.com/QLS-MiCM/DataProcessingInPython/archive/refs/heads/main.zip\n",
+ " !unzip main.zip\n",
+ " base_dir = \"DataProcessingInPython-main/Exercises/\"\n",
+ "else:\n",
+ " base_dir = \"../\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "408fe8a2-56d5-4583-981f-98ef8e1e31b3",
+ "metadata": {},
+ "source": [
+ "# Table of Contents\n",
+ "\n",
+ "1. **Module 1 -- Modules and Packages (40 minutes)**\n",
+ " 1. Using Modules\n",
+ " 2. A Brief Intro to Package Management\n",
+ " 3. **Exercises**\n",
+ "2. **Module 2 -- Introduction to NumPy Arrays (50 minutes)**\n",
+ " 1. Introducing NumPy\n",
+ " 2. Array Operations\n",
+ " 3. **Exercises**\n",
+ "3. **Module 3 -- Visualising Data with Matplotlib (50 minutes)**\n",
+ " 1. Creating Plots with Matplotlib\n",
+ " 2. Exploring the Matplotlib Documentation\n",
+ " 3. **Exercises**\n",
+ "4. **Module 4 -- Intro to Tabular Data with Pandas (30 minutes)**\n",
+ " 1. Fundamentals of pandas\n",
+ " 2. Exploring the pandas Documentation\n",
+ " 3. **Exercises**\n",
+ "5. **Module 5 -- A Brief Guide to Exploring the Unknown (10 minutes)**\n",
+ " 1. What to learn next? How?\n",
+ " 2. How to get help and how not to get help\n",
+ " 3. Other cool programming\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bf430b75-21ee-474b-bfe2-41ead87088bc",
+ "metadata": {},
+ "source": [
+ "# Learning Objectives\n",
+ "\n",
+ "By the end of this workshop, you'll have the skills necessary to:\n",
+ "\n",
+ "1. Import code from existing modules and packages.\n",
+ "2. Use NumPy to easily process multidimensional data.\n",
+ "3. Use Matplotlib to generate different types of plots to visualise data.\n",
+ "4. Approach a new package and explore its documentation and examples.\n",
+ "\n",
+ "Ready? Let's dive into the material!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ce950b06-2491-4a27-b65b-5a4fcd764f1c",
+ "metadata": {},
+ "source": [
+ "# Module 1 - Modules and Packages\n",
+ "\n",
+ "1. Using Modules\n",
+ " 1. What is a Module?\n",
+ " 2. Importing a Module\n",
+ " 3. Importing Specific Functions\n",
+ "2. A Brief Intro to Package Management\n",
+ " 1. What is a Package?\n",
+ " 2. Installing Packages using conda\n",
+ " 3. Installing Packages using pip\n",
+ " 4. Using Packages\n",
+ "3. Exercises"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b4fbd2a1-0f89-4fd8-ada1-13d36cc05f00",
+ "metadata": {},
+ "source": [
+ "## Using Modules"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a6cbd49d-cca0-4d90-903b-6a1fd1eded98",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### What is a module?\n",
+ "\n",
+ "Python code is organised in *modules*. A **module** is a file that ends with `.py`. If you share this file with someone else, they can reuse your code.\n",
+ "\n",
+ "So, what does this module look like? Usually, it contains a bunch of different code:\n",
+ "* **Functions**: bits of repeatable behaviour to simplify tasks.\n",
+ "* **Classes**: code that defines new types of objects.\n",
+ "* **Constants**: variables that have important pre-determined values, like $\\pi$.\n",
+ "\n",
+ "All of these are also typically accompanied by **documentation**, which is the series of *docstrings* from the module file.\n",
+ "\n",
+ "Python comes with **a lot** of [built-in modules](https://docs.python.org/3/py-modindex.html)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "091b66d7-2935-476d-9be8-7c1dd3441dda",
+ "metadata": {},
+ "source": [
+ "### Importing a module\n",
+ "\n",
+ "To use code from a module, we have to **import** it.\n",
+ "\n",
+ "To import a module so that we can use it in our code, here's the syntax:\n",
+ "```python\n",
+ "import module_name\n",
+ "```\n",
+ "\n",
+ "If you're importing code you've written yourself, then the `module_name` is just the name of your file, without the `.py` extension. Module names follow the same rules as variable names.\n",
+ "\n",
+ "> ⚠️ **Warning:** Always remember to import a module before you try to use it! Otherwise, Python won't be able to find the module and it will get mad at you."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0c58ac84-f386-4f8a-a0ff-c4b800789593",
+ "metadata": {},
+ "source": [
+ "Let's do an example. Let's import the `math` module, which provides basic functions for performing more complicated mathematical operations:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f244372d-9663-4c37-8acc-ba3a794dbb7c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to import the math module\n"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "339af350-5122-43cd-bfb8-456a5725bcee",
+ "metadata": {},
+ "source": [
+ "Great! We've imported the module! That's our first step done. The next step is to **read how to use the module**.\n",
+ "\n",
+ "Let's look at the [online documentation](https://docs.python.org/3/library/math.html#module-math) for the `math` module.\n",
+ "\n",
+ "Let's try to use some of the trig functions! Let's try to compute the sine and cosine of 180°. We expect to find the following:\n",
+ "\n",
+ "$$\n",
+ "\\begin{align*}\n",
+ " \\sin 180^\\circ &= 0\\\\\n",
+ " \\cos 180^\\circ &= -1\n",
+ "\\end{align*}\n",
+ "$$"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0bdb929b-7cbb-4a9e-bd49-6b64645a9614",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here... Compute sines and cosines using `math`\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1c4d386e-b691-4552-8512-35ee96ffe93c",
+ "metadata": {},
+ "source": [
+ "Aha! The angle has to be in radians! So, we need to convert the angle to radians first! We can do this manually by doing $\\text{radians} = \\pi/180 \\times \\text{degrees}$ or... we can use **another function** from `math`!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fef1e41d-7209-4638-9b1c-758281ece77b",
+ "metadata": {},
+ "source": [
+ "> 📝 **Note:** You may be thinking... Hang on! The value of `sin(180°)` didn't come out to zero. Well, it's something very small (around $1\\times 10^{-16}$) caused by floating-point rounding: computers can't represent most decimal numbers exactly. So, for our intents and purposes, we can say $1\\times 10^{-16} \\approx 0$."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4524103a-293b-48d7-933d-0f538ee891ed",
+ "metadata": {},
+ "source": [
+ "### Importing specific functions\n",
+ "\n",
+ "To import specific functions, we can write:\n",
+ "```python\n",
+ "from module_name import function1, function2, constant\n",
+ "```\n",
+ "\n",
+ "When we call the imported function, we **don't** need to write the module name. We only need to write the function name. We can also import **constants** in this way."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "263a3cee-c037-4598-82b9-8755180a933b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to import the specific functions for our sine and cosine example\n",
+ "from math import sin, cos, radians\n",
+ "\n",
+ "my_angle_in_degrees = 180\n",
+ "my_angle_in_radians = radians(my_angle_in_degrees)\n",
+ "sin180 = sin(my_angle_in_radians)\n",
+ "cos180 = cos(my_angle_in_radians)\n",
+ "\n",
+ "print(f\"sin(180)={sin180} and cos(180)={cos180}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b1939b41-b312-4b57-a4fd-94bd3bb6b9ec",
+ "metadata": {},
+ "source": [
+ "Notice that we were able to call the `sin`, `cos` and `radians` functions directly, as if we had defined them ourselves."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "315add98-d43d-4400-a002-6933b6c45c07",
+ "metadata": {},
+ "source": [
+ "## A Brief Intro to Package Management\n",
+ "\n",
+ "When doing scientific computing, we often need to use a lot of code that **doesn't** come included in Python. To access this code, we need to use packages."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d5c1a271-0543-4423-9574-fbe8c242ba9f",
+ "metadata": {
+ "jp-MarkdownHeadingCollapsed": true
+ },
+ "source": [
+ "### What is a package?\n",
+ "\n",
+ "A **package** is a collection of modules that usually interact and have been grouped together to be easily **distributed** to other people. Packages usually have a very specific focus. \n",
+ "\n",
+ "Here are some very common packages that you will almost definitely encounter in your career:\n",
+ "\n",
+ "* **NumPy**\n",
+ "* **SciPy**\n",
+ "* **Pandas**\n",
+ "* **Matplotlib**\n",
+ "* **scikit-image**\n",
+ "* **scikit-learn**\n",
+ "* **TensorFlow** and **PyTorch**\n",
+ "* **BioPython**\n",
+ "\n",
+ "There are two main tools that you'll use:\n",
+ "* `conda` -- available if you've installed Anaconda or miniconda.\n",
+ "* `pip` -- always available, regardless of how you installed Python.\n",
+ "\n",
+ "Let's see how to use each of them!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cf5d9710-77ce-4365-a049-490d339aa451",
+ "metadata": {},
+ "source": [
+ "### Installing packages using `conda`\n",
+ "\n",
+ "The packages available to install in `conda` come from various channels available via the **Anaconda** repository: https://anaconda.org/.\n",
+ "\n",
+ "#### Installing packages on the command line\n",
+ "\n",
+ "To install using the **command line**, open up the **Terminal** on macOS or Linux, or the **Anaconda Prompt** on Windows. It's very important to **not** use a Python shell for this. Again, we do this in a terminal, **NOT IN A PYTHON SHELL**.\n",
+ "\n",
+ "If everything is set up properly, you should see `(base)` before the prompt. This indicates that you are in the base `conda` environment.\n",
+ "\n",
+ "In general, to install a package with `conda`, at the **command prompt** you would write:\n",
+ "```bash\n",
+ "conda install package_name\n",
+ "```\n",
+ "\n",
+ "Press enter, wait for it to prompt you, type `y` and hit enter again to install! If you don't want to be prompted, then you can just add `-y` to the command so that it automatically answers \"yes\" to the prompt for installation.\n",
+ "\n",
+ "If the package you want isn't available in the main `anaconda` channel, you can specify the [`conda-forge` channel](https://conda-forge.org/) using the `-c` option (see [here](https://docs.conda.io/projects/conda/en/latest/commands/install.html) for more details).\n",
+ "\n",
+ "```bash\n",
+ "conda install -c conda-forge package_name\n",
+ "```\n",
+ "\n",
+ "You can also add additional channels, such as [**bioconda**](https://bioconda.github.io/).\n",
+ "\n",
+ "For example, let's try to install `numpy`, `scipy` and `matplotlib` from the `conda-forge` channel.\n",
+ "\n",
+ "> 💡 **Tip:** We can install multiple packages at the same time by including all their names."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "236c6a20-fc2d-46fa-8471-103a3420a7d6",
+ "metadata": {},
+ "source": [
+ "So, we'd write:\n",
+ "\n",
+ "```bash\n",
+ "conda install -c conda-forge numpy scipy matplotlib -y\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fed7e6a2-9645-4069-9ab2-b0d41a0e654f",
+ "metadata": {},
+ "source": [
+ "We can do other package management operations in `conda`. These are described in the `conda` [documentation](https://docs.conda.io/projects/conda/en/stable/commands/index.html). The main ones are:\n",
+ "* `conda remove` - uninstall a package.\n",
+ "* `conda update` - update a package.\n",
+ "\n",
+ "There are also various options for each command."
+ ]
+ },
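+ {
+ "cell_type": "markdown",
+ "id": "8d2e4f61-3a7b-4c09-b5d8-1e6f9a0c2b43",
+ "metadata": {},
+ "source": [
+ "For example (the package names here are purely illustrative):\n",
+ "\n",
+ "```bash\n",
+ "conda update numpy    # update an installed package\n",
+ "conda remove scipy    # uninstall a package\n",
+ "```"
+ ]
+ },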
+ {
+ "cell_type": "markdown",
+ "id": "629a3b88-5062-481f-87ed-1387894a2ad7",
+ "metadata": {},
+ "source": [
+ "### Installing packages using `pip`\n",
+ "\n",
+ "Every installation of Python comes with [`pip`](https://pip.pypa.io/en/stable/), the official tool for installing packages. `pip` lets you download packages from the official Python Packaging Index (PyPI), found at https://pypi.org/.\n",
+ "\n",
+ "To install packages using `pip`, again you must open the command line. At the prompt, you write:\n",
+ "```shell\n",
+ "pip install package_name\n",
+ "```\n",
+ "\n",
+ "Similar to `conda`, `pip` has a variety of other operations it can perform, which are all described in its [online documentation](https://pip.pypa.io/en/stable/cli/). The most important one for now is `pip uninstall package_name` which removes an installed package."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "282e6830-ae04-4627-95ba-67f2f7ca6f17",
+ "metadata": {},
+ "source": [
+ "### (Optional) Jupyter notebook trick - Magic commands\n",
+ "\n",
+ "If you're using Jupyter notebooks, there are built-in [magic commands](https://ipython.readthedocs.io/en/stable/interactive/magics.html) that let you install packages within Python code cells. (These are also discussed [here](https://discourse.jupyter.org/t/why-users-can-install-modules-from-pip-but-not-from-conda/10722/4?u=fomightez) and [here](https://discourse.jupyter.org/t/python-in-terminal-finds-module-jupyter-notebook-does-not/2262/9) and mentioned briefly in a comment [here](https://stackoverflow.com/questions/38694081/executing-terminal-commands-in-jupyter-notebook).)\n",
+ "\n",
+ "To install a package using `pip` or `conda`, you write the same line you would at the terminal, but put the `%` sign at the beginning.\n",
+ "\n",
+ "For example, to install NumPy in the environment associated with the Jupyter notebook using `conda`, write:\n",
+ "\n",
+ "```python\n",
+ "%conda install numpy\n",
+ "```\n",
+ "\n",
+ "To install Matplotlib using `pip`, write:\n",
+ "\n",
+ "```python\n",
+ "%pip install matplotlib\n",
+ "```\n",
+ "\n",
+ "There are other magic commands that can be used only in **Jupyter notebooks** and in the **IPython shell**. You can read more about them [here](https://ipython.readthedocs.io/en/stable/interactive/magics.html)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "73284ad0-4d29-4bb6-8fda-aac2f1018c64",
+ "metadata": {},
+ "source": [
+ "### (Optional) Other installation tips\n",
+ "\n",
+ "Most packages give you information in the **documentation** about how to install them. In practice, you rarely have to search Anaconda or PyPI. Usually, you just need to search for the package, and it will explain how to set it up.\n",
+ "\n",
+ "For example, NumPy provides the following [page](https://numpy.org/install/).\n",
+ "\n",
+ "Matplotlib provides [this page](https://matplotlib.org/stable/users/getting_started/index.html#installation-quick-start). \n",
+ "\n",
+ "PyQt provides [this page](https://www.riverbankcomputing.com/software/pyqt/download).\n",
+ "\n",
+ "**Usually**, the installation instructions are simple, telling you to `pip install` the package or to install it from `conda-forge`. There are a few cases, though, where things are more complicated.\n",
+ "\n",
+ "An example is [CuPy](https://cupy.dev/), which allows performing NumPy and SciPy operations on the GPU. This package **does not** work on all systems. It requires an NVIDIA GPU and CUDA, which is not available on macOS. In these cases, it's very important to read the [installation instructions](https://docs.cupy.dev/en/stable/install.html).\n",
+ "\n",
+ "If you have both `conda` and `pip` installed, the `conda` [documentation](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) recommends trying to install packages with `conda` first. You can easily search on https://anaconda.org to see if the package is available. Installing packages with `conda` makes it easier to manage multiple *environments*."
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "f0a834e0-d079-45d5-a5b9-29945f36c752",
+ "metadata": {},
+ "source": [
+ "### Importing packages\n",
+ "\n",
+ "To use a package, we have to import it, just like we import a module. We often give it a shorter name when we import it. Here's the syntax for doing this:\n",
+ "```python\n",
+ "import package_name as short_name\n",
+ "```\n",
+ "\n",
+ "You'll see this commonly for the NumPy package:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "33fa03bc-e88e-4685-b217-504023d6d3b0",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to import numpy\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "87b5fa6f-8621-4d1b-84f1-03f053cca9c5",
+ "metadata": {},
+ "source": [
+ "Now that we have done this, we don't write `numpy` before all functions. Instead, we write `np`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "14f8a244-2159-4e3f-bbae-76b640a486a1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "my_arr = np.arange(8)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "275b6f1c-995e-4cfe-a575-6ab455f1f1cb",
+ "metadata": {},
+ "source": [
+ "Packages can be very big! Developers often create additional modules and subpackages. One very common example is the `pyplot` module in the Matplotlib package.\n",
+ "\n",
+ "To import subpackages, we use the **dot notation**. We can also rename subpackages on import. For example, let's import the `pyplot` subpackage from Matplotlib and give it the alias `plt`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "10657d31-334e-422e-bde2-3514a62c9b83",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to import matplotlib.pyplot\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e1aa2e80-0044-4c9a-9d10-128dbf5ccf96",
+ "metadata": {},
+ "source": [
+ "Here, we've given the subpackage a much shorter name.\n",
+ "\n",
+ "> 📝 **Note:** This renaming works for all imports, not just from packages and subpackages. You can even rename functions that you import (although this can become confusing)."
+ ]
+ },
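+ {
+ "cell_type": "markdown",
+ "id": "2f7c1a9e-0b3d-4c52-9e41-7a8d6b5c4e21",
+ "metadata": {},
+ "source": [
+ "As a quick sketch, here's what renaming a single imported function looks like (using `factorial` from `math` purely as an illustration):\n",
+ "\n",
+ "```python\n",
+ "from math import factorial as fact\n",
+ "\n",
+ "print(fact(5))  # 120\n",
+ "```"
+ ]
+ },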
+ {
+ "cell_type": "markdown",
+ "id": "fb1d14a0-bb58-427a-b625-86dfc91d4e99",
+ "metadata": {},
+ "source": [
+ "## Module Summary\n",
+ "\n",
+ "Congratulations! We're at the end of this module on modules and packages! Here are the main points we saw:\n",
+ "* Python code is organised into **modules** that we can easily **import** into our own code to use.\n",
+ "* We can import **an entire module** or we can import **specific functions and constants** to accomplish certain tasks.\n",
+ "* Python comes with **many pre-installed modules** for performing common tasks, like mathematical operations and generating random numbers.\n",
+ "* Not all modules we need come installed with Python. We can install **packages** using `conda` or `pip` to get even more functionality.\n",
+ "* We can easily **import** packages into our code to use their added functionality.\n",
+ "\n",
+ "Now you can both write your own code and use code from existing modules and packages!\n",
+ "\n",
+ "To practice, let's do some exercises!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "25996faa-6280-45ec-8beb-b4c8c6df3d2d",
+ "metadata": {},
+ "source": [
+ "## Exercises"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "345917c9-6c90-4f34-980e-49ab158c5a36",
+ "metadata": {},
+ "source": [
+ "### Using `random` to generate random DNA sequences\n",
+ "\n",
+ "DNA sequences consist of many nucleotides, which can be `A`, `T`, `C` and `G`. Not all sequences of DNA are genes that can code for protein, though! For our intents and purposes, we can think of a gene as starting with the nucleotides `ATG`, then containing various triplets of nucleotides, until a stop signal is reached: either `TAA`, `TAG` or `TGA`. Using the [`random`](https://docs.python.org/3/library/random.html) module that comes pre-installed with Python, write a function that generates a random coding DNA sequence with a specific number of coding triplets.\n",
+ "\n",
+ "Your function should have the following signature:\n",
+ "```python\n",
+ "def generate_random_dna(number_of_codons)\n",
+ "```\n",
+ "where `number_of_codons` is an `int` and the function returns a `str`.\n",
+ "\n",
+ "**Hint:** To convert a list of strings into a single string, you can use the following code:\n",
+ "\n",
+ "```python\n",
+ "my_string = \"\".join(my_list)\n",
+ "```\n",
+ "where `\"\"` is the empty string and `my_list` is your list of strings.\n",
+ "\n",
+ "**BONUS:** Write a docstring for your function so that it is well-documented.\n",
+ "\n",
+ "**Want an extra challenge?** If you want to take things a step further, add some error to the length! Instead of using a fixed length, use the `random` module to select an actual sequence length that is the specified length plus or minus 3 codons.\n",
+ "\n",
+ "**Note:** Codons taken from [\"DNA and RNA codon tables\"](https://en.wikipedia.org/wiki/DNA_and_RNA_codon_tables) page on Wikipedia. This explanation is an oversimplification; the exact details are not relevant to this exercise."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "54da1f0d-2068-4c2d-bd53-59ea9598bc46",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here...\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "64be8430-45c3-49d6-bd27-4b3b7b8a6ec5",
+ "metadata": {},
+ "source": [
+ "### Using `textwrap` to nicely display DNA and Protein sequences\n",
+ "\n",
+ "DNA and amino acid sequences can be very long.\n",
+ "\n",
+ "Sequences that are very long don't look nice on the screen. To make our sequences easier to read, we want to wrap the sequences and break them into several smaller lines of 80 nucleotides.\n",
+ "\n",
+ "We could do this manually... but, as it turns out, Python includes a [`textwrap` module](https://docs.python.org/3/library/textwrap.html#module-textwrap) that can help! Read the module documentation and write code to break a long sequence into smaller chunks. I've given you a DNA sequence to test this code on.\n",
+ "\n",
+ "After breaking the DNA sequence up, print each line so that we get a nice wrapped sequence."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "343f8795-446a-4f66-9992-be9f646f9513",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "my_long_dna = \"AGGACAGTTGTACGATGCATCGTGCTACGATCGATGCTAGCGACGTACGTAGCATGCTAGCTAGCTGACGAGCGCGCGCGATCAGCATGCGCCGGACGTCAGTCAGTGTCAGTCATGCAGTACTGCAGTGTACGTCAGTACGTACTGCAGTCGTCATGTCGATGCATGCCATGTGACGTATGACTGCATGACGTACTG\"\n",
+ "# Your code here to run the wrapping example\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "139608f7-5879-458b-8399-8e4e2ce0ad98",
+ "metadata": {},
+ "source": [
+ "### BONUS: Random DNA with biased weights\n",
+ "\n",
+ "Ok, now let's add a twist: write a function that uses `random` to generate a long random sequence of nucleotides with different weights for each nucleotide.\n",
+ "\n",
+ "Your function should have the following signature:\n",
+ "\n",
+ "```python\n",
+ "generate_random_dna(length, weight_a, weight_t, weight_c, weight_g)\n",
+ "```\n",
+ "and return a string. If you have time, add a docstring to describe your function.\n",
+ "\n",
+ "Generate some random DNA sequences and then use your code from above to wrap them to 80 characters."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0a89f5ae-4932-4095-b3be-95d5233368fe",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to generate random DNA sequences\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2d4657a3-f670-4977-9386-ef797448b07e",
+ "metadata": {},
+ "source": [
+ "# Module 2 - Introduction to NumPy Arrays\n",
+ "\n",
+ "Now that we know how to import external packages, let's start learning about one! One of the most basic packages that comes up again and again in scientific work is NumPy. In this module, we'll explore the core of what NumPy can do. Here's the outline of what we'll see:\n",
+ "\n",
+ "1. Introducing NumPy\n",
+ " 1. What is an array?\n",
+ " 2. NumPy fundamentals\n",
+ " 3. Creating and reshaping arrays\n",
+ " 4. Indexing arrays\n",
+ " 5. Simple assignments\n",
+ "2. Array Operations\n",
+ " 1. Element-wise arithmetic\n",
+ " 2. Statistics on arrays\n",
+ " 3. Combining arrays\n",
+ " 4. Exploring the documentation\n",
+ "3. Exercises\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1bedbf39-98c6-425d-a4dc-fb68369b208a",
+ "metadata": {},
+ "source": [
+ "## Introducing NumPy\n",
+ "\n",
+ "### What is an array?\n",
+ "An **array** is essentially a grid of numbers in any number of dimensions.\n",
+ "\n",
+ "**NumPy** is a free and open source package that implements array operations on high-dimensional data. The centre of everything is the `ndarray`, the $n$-dimensional NumPy array.\n",
+ "\n",
+ "The official NumPy documentation can be found on the [NumPy website](https://numpy.org/doc/stable/).\n",
+ "\n",
+ "### NumPy fundamentals\n",
+ "\n",
+ "Very important: **Arrays are *not* lists**. Here are the important differences:\n",
+ "\n",
+ "* Arrays have a **fixed** size.\n",
+ "* Arrays can have more than one dimension.\n",
+ "* All elements in an array have the same type.\n",
+ "* Arrays can be provided to NumPy functions that perform mathematical operations without explicit iteration.\n",
+ "\n",
+ "#### Installing and Importing NumPy\n",
+ "\n",
+ "If you don't already have NumPy installed, you can install it using either `conda` or `pip`.\n",
+ "\n",
+ "```bash\n",
+ "# Using conda\n",
+ "conda install -c conda-forge numpy\n",
+ "\n",
+ "# Using pip\n",
+ "pip install numpy\n",
+ "```\n",
+ "\n",
+ "When importing NumPy, we typically assign the alias `np`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0bc2fd9d-7baf-472a-ac68-61c3ab14a46f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to import NumPy\n"
+ ]
+ },
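+ {
+ "cell_type": "markdown",
+ "id": "5b1c8d03-6e2f-4a74-9c35-d48e7f0a1b62",
+ "metadata": {},
+ "source": [
+ "To preview why the array-vs-list distinction above matters, here's a small sketch: doubling every value takes an explicit loop with a list, but a single expression with an array. (We'll cover these operations properly in the coming sections.)\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "\n",
+ "measurements = [1.0, 2.0, 3.0]\n",
+ "doubled_list = [x * 2 for x in measurements]  # lists: explicit iteration\n",
+ "\n",
+ "arr = np.array(measurements)\n",
+ "doubled_arr = arr * 2                         # arrays: one vectorised expression\n",
+ "print(doubled_arr)  # [2. 4. 6.]\n",
+ "```"
+ ]
+ },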
+ {
+ "cell_type": "markdown",
+ "id": "26dba51c-1de1-458d-94b8-8fe61b33d3ac",
+ "metadata": {},
+ "source": [
+ "### Creating and reshaping arrays\n",
+ "\n",
+ "Now that we have NumPy installed, we can start exploring arrays.\n",
+ "\n",
+ "#### Creating NumPy arrays\n",
+ "\n",
+ "We can create an array based on nested lists using the `np.array` function."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "74fee626-185e-4714-bbfd-7cbf1572779a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f7dc63a7-d2ea-4894-b3a4-ac85e5eecfb2",
+ "metadata": {},
+ "source": [
+ "Note that the lists need to line up properly.\n",
+ "\n",
+ "In this example, we used a list of lists, but you can create arrays of **any dimension** by further nesting lists."
+ ]
+ },
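+ {
+ "cell_type": "markdown",
+ "id": "9a4f2e17-8c5b-4d60-a1f3-6b7d0c8e9f54",
+ "metadata": {},
+ "source": [
+ "For instance, a list of lists of lists gives a 3D array (the values here are just placeholders):\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "\n",
+ "# Two \"pages\", each with 2 rows and 2 columns\n",
+ "volume = np.array([[[1, 2], [3, 4]],\n",
+ "                   [[5, 6], [7, 8]]])\n",
+ "print(volume.shape)  # (2, 2, 2)\n",
+ "```"
+ ]
+ },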
+ {
+ "cell_type": "markdown",
+ "id": "ba0db42a-e73b-49f3-bffd-43020da8a477",
+ "metadata": {},
+ "source": [
+ "It's important to remember that arrays are **objects**, and so they have **attributes** and **methods**. One of the most important attributes is the `shape`, which tells you what... well, the shape of the array is. Let's look at our example array:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "28d99eed-77ae-4185-920a-764a3adc7e0c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to get the shape of the array\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1bf31d3b-2378-4c75-b759-b5b280561522",
+ "metadata": {},
+ "source": [
+ "The `shape` attribute is a **tuple**, giving you the number of elements along each axis of the array. In our 2D array, we can think of the first number as the number of **rows** and the second as the number of **columns**.\n",
+ "\n",
+ "**Note:** This follows *matrix* order convention, **not** Cartesian order convention.\n",
+ "\n",
+ "The length of the `shape` tuple gives us the **number of dimensions** of the array. In our example, we had a 2D array, and there were thus 2 numbers. We can also get the number of dimensions using the `ndim` attribute."
+ ]
+ },
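+ {
+ "cell_type": "markdown",
+ "id": "a1f20e3b-1a03-4c2d-8e9f-0d1c2b3a4503",
+ "metadata": {},
+ "source": [
+ "For example, inspecting `shape` and `ndim` on a small 2D array (a quick sketch):\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "\n",
+ "my_array = np.array([[1, 2, 3], [4, 5, 6]])\n",
+ "\n",
+ "# shape is a tuple: (number of rows, number of columns)\n",
+ "print(my_array.shape)  # (2, 3)\n",
+ "\n",
+ "# ndim is the number of axes, i.e. len(my_array.shape)\n",
+ "print(my_array.ndim)  # 2\n",
+ "```"
+ ]
+ },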
+ {
+ "cell_type": "markdown",
+ "id": "5f7dfa47-5cbc-4bae-95c3-a8fafa40405c",
+ "metadata": {},
+ "source": [
+ "##### Other Ways to Create Arrays\n",
+ "\n",
+ "We can create arrays of a given shape filled with zeros or ones using the functions `np.zeros` and `np.ones`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1ea63e38-cc32-4d59-998a-77f3932c9747",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to produce some arrays of ones and zeros.\n"
+ ]
+ },
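+ {
+ "cell_type": "markdown",
+ "id": "a1f20e3b-1a04-4c2d-8e9f-0d1c2b3a4504",
+ "metadata": {},
+ "source": [
+ "One possible way to fill in the cell above:\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "\n",
+ "# A 1D array of 5 zeros\n",
+ "zeros_1d = np.zeros(5)\n",
+ "print(zeros_1d)  # [0. 0. 0. 0. 0.]\n",
+ "\n",
+ "# A 2D array of ones with 3 rows and 4 columns\n",
+ "ones_2d = np.ones((3, 4))\n",
+ "print(ones_2d)\n",
+ "```"
+ ]
+ },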
+ {
+ "cell_type": "markdown",
+ "id": "5a88c081-0927-46bd-9bbe-f554c15c1f05",
+ "metadata": {},
+ "source": [
+ "There's another easy way to create a 1D array. Using `np.arange`, we can create simple arrays of numbers separated by a consistent step size. There are different ways of using the function:\n",
+ "```python\n",
+ "\n",
+ "np.arange(a) # Array contains all integers from zero up to and excluding a.\n",
+ "np.arange(a, b) # Array contains all numbers from a (inclusive) to b (exclusive).\n",
+ "np.arange(a, b, c) # Array contains all numbers from a (inclusive) to b (exclusive) incrementing by c.\n",
+ "\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6debec17-0033-4d88-a552-deb2a427ba98",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here\n"
+ ]
+ },
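+ {
+ "cell_type": "markdown",
+ "id": "a1f20e3b-1a05-4c2d-8e9f-0d1c2b3a4505",
+ "metadata": {},
+ "source": [
+ "For example, the three forms of `np.arange` in action:\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "\n",
+ "print(np.arange(5))  # [0 1 2 3 4]\n",
+ "print(np.arange(2, 7))  # [2 3 4 5 6]\n",
+ "print(np.arange(0, 10, 3))  # [0 3 6 9]\n",
+ "```"
+ ]
+ },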
+ {
+ "cell_type": "markdown",
+ "id": "38e90ffd-11b3-42b3-8f9e-c3f8ef4d46fd",
+ "metadata": {},
+ "source": [
+ "#### Reshaping arrays\n",
+ "\n",
+ "We can reshape arrays very easily using the [`reshape`](https://numpy.org/doc/stable/reference/generated/numpy.reshape.html) method. We just need to pass the desired new shape to this function.\n",
+ "\n",
+ "**Note:** The dimensions must be compatible with the total number of elements in the array. If you know all the dimensions except one, you can set that dimension's size to `-1` and NumPy will infer it."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0ab9ff1b-a96a-4215-a675-0229f2bf4252",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f151bbc2-f085-49b5-87d9-d66bc6dd722c",
+ "metadata": {},
+ "source": [
+ "And we can even reshape to higher dimensions!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f1b54194-f8f6-4684-8e06-ba1d8c765a57",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to perform array reshaping\n"
+ ]
+ },
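+ {
+ "cell_type": "markdown",
+ "id": "a1f20e3b-1a06-4c2d-8e9f-0d1c2b3a4506",
+ "metadata": {},
+ "source": [
+ "As a sketch, reshaping a flat array into 2D and 3D, including the `-1` trick:\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "\n",
+ "flat = np.arange(12)\n",
+ "\n",
+ "# Reshape to 3 rows and 4 columns\n",
+ "grid = flat.reshape(3, 4)\n",
+ "print(grid.shape)  # (3, 4)\n",
+ "\n",
+ "# Use -1 to let NumPy infer one dimension\n",
+ "inferred = flat.reshape(2, -1)\n",
+ "print(inferred.shape)  # (2, 6)\n",
+ "\n",
+ "# Reshape to three dimensions\n",
+ "cube = flat.reshape(2, 2, 3)\n",
+ "print(cube.shape)  # (2, 2, 3)\n",
+ "```"
+ ]
+ },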
+ {
+ "cell_type": "markdown",
+ "id": "00751e0c-f4dc-4ed5-bf52-ae8cedbb654a",
+ "metadata": {},
+ "source": [
+ "#### Writing and loading arrays from files\n",
+ "\n",
+ "We can save arrays to binary NumPy files (`*.npy`) and load them back using the following functions:\n",
+ "* [`np.save`](https://numpy.org/doc/stable/reference/generated/numpy.save.html) -- save an array to a NumPy binary file (`*.npy`).\n",
+ "* [`np.load`](https://numpy.org/doc/stable/reference/generated/numpy.load.html) -- load an array from a NumPy binary file (`*.npy`).\n",
+ "\n",
+ "For example, we can save our array from the last example to a file, and then load it back into a new variable:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0f67042c-a564-4b3c-909a-e822ebccdec4",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "array_output_path = \"my_arr.npy\"\n",
+ "# Your code here to save the array to a file\n",
+ "\n",
+ "\n",
+ "# Your code here to load the array from a file\n",
+ "\n"
+ ]
+ },
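+ {
+ "cell_type": "markdown",
+ "id": "a1f20e3b-1a07-4c2d-8e9f-0d1c2b3a4507",
+ "metadata": {},
+ "source": [
+ "One possible way to complete the cell above, using the `my_arr.npy` path (note that this writes a file to your working directory):\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "\n",
+ "my_arr = np.arange(12).reshape(3, 4)\n",
+ "array_output_path = \"my_arr.npy\"\n",
+ "\n",
+ "# Save the array to a binary NumPy file\n",
+ "np.save(array_output_path, my_arr)\n",
+ "\n",
+ "# Load it back into a new variable\n",
+ "loaded_arr = np.load(array_output_path)\n",
+ "\n",
+ "print(np.array_equal(my_arr, loaded_arr))  # True\n",
+ "```"
+ ]
+ },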
+ {
+ "cell_type": "markdown",
+ "id": "3f1a6e49-e6cf-414e-b250-df4c5f089634",
+ "metadata": {},
+ "source": [
+ "So, now we know how to create arrays and view **all** the elements in an array... but what about just viewing *some* of the elements???"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "22d550fe-d951-4ac3-856b-1cbb62048252",
+ "metadata": {},
+ "source": [
+ "### Array indexing\n",
+ "\n",
+ "When working with arrays in NumPy, we can **index** and **slice** arrays in *almost* the same way as Python lists.\n",
+ "\n",
+ "To index NumPy arrays, we use square brackets `[]`. For **each axis** in the array, we can put either a single index or a range consisting of `start:end:increment`. To index along multiple axes, we can use multiple of these indices or slices **separated by commas**.\n",
+ "\n",
+ "Like with lists, each of the elements is optional. To index from the beginning of an axis up until `i`, simply use `:i`. If you want to index from `i` to the end, then write `i:`. If you want to go along the entire length of the array, skipping by `i`, write `::i`. Just like with lists. And negative indexing still works too!\n",
+ "\n",
+ "If you have multiple axes and you want to take all items along an axis, simply put `:`.\n",
+ "\n",
+ "*Advanced tip:* If you have many axes and you want to skip a bunch of axes in the middle, you can use `...` in the index."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7414af1f-8fe4-4a32-ada5-a7e08dc955c5",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here for indexing in a 1D array\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "557730c6-f8ea-4c22-9439-d950bd8a2baa",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here for indexing in a 2D array\n",
+ "\n",
+ "my_2d_array = np.array([[1, 4, 2], [5, 3, 7]])\n",
+ "\n",
+ "# Let's select the first row\n",
+ "\n",
+ "\n",
+ "# Let's select the last column\n",
+ "\n",
+ "\n",
+ "# Let's select the item in the bottom left corner\n"
+ ]
+ },
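+ {
+ "cell_type": "markdown",
+ "id": "a1f20e3b-1a08-4c2d-8e9f-0d1c2b3a4508",
+ "metadata": {},
+ "source": [
+ "One way to fill in the 2D selections above (a sketch):\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "\n",
+ "my_2d_array = np.array([[1, 4, 2], [5, 3, 7]])\n",
+ "\n",
+ "# First row: index 0 along axis 0\n",
+ "print(my_2d_array[0])  # [1 4 2]\n",
+ "\n",
+ "# Last column: all rows, index -1 along axis 1\n",
+ "print(my_2d_array[:, -1])  # [2 7]\n",
+ "\n",
+ "# Bottom-left corner: last row, first column\n",
+ "print(my_2d_array[-1, 0])  # 5\n",
+ "```"
+ ]
+ },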
+ {
+ "cell_type": "markdown",
+ "id": "f6af67b7-2cae-4ff5-92e3-b77170ecc5d9",
+ "metadata": {},
+ "source": [
+ "Notice that you only need to write the indices up until the last axis where you're selecting. If you omit indices after a specific axis, all values along each subsequent axis are selected."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4771a7b2-ba63-44a8-90a2-60cedc443603",
+ "metadata": {},
+ "source": [
+ "### Simple assignments\n",
+ "\n",
+ "To insert a value into a specific position in an array, we again use indexing.\n",
+ "\n",
+ "To assign a single value, we use syntax similar to list assignment:\n",
+ "\n",
+ "```python\n",
+ "my_array[i, j] = my_value\n",
+ "```\n",
+ "\n",
+ "We can also assign values to a specific region of the array (or sub-array) by indexing a region and assigning an array to those indices. Let's see some examples:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d74e75b0-61aa-45e2-be3c-8bebea93d6eb",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to perform assignment in a 3D array of ones\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6bfc7ca6-5160-4433-b0e7-2c3ff5e67093",
+ "metadata": {},
+ "source": [
+ "Careful! The shapes need to match up! There is one exception: we can assign all entries in a sub-array to a single value very easily. Here's how:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bf16426f-c453-407d-be14-62b720d8764d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to assign the same value to a sub-array\n"
+ ]
+ },
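+ {
+ "cell_type": "markdown",
+ "id": "a1f20e3b-1a09-4c2d-8e9f-0d1c2b3a4509",
+ "metadata": {},
+ "source": [
+ "A sketch of assignment into a 3D array of ones, covering all three cases:\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "\n",
+ "arr = np.ones((2, 3, 4))\n",
+ "\n",
+ "# Assign a single value to one position\n",
+ "arr[0, 1, 2] = 7\n",
+ "\n",
+ "# Assign an array to a sub-array (the shapes must match)\n",
+ "arr[1, 0, :] = np.array([1, 2, 3, 4])\n",
+ "\n",
+ "# Assign the same value to an entire sub-array\n",
+ "arr[0, 2, :] = 0\n",
+ "\n",
+ "print(arr[0, 2])  # [0. 0. 0. 0.]\n",
+ "```"
+ ]
+ },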
+ {
+ "cell_type": "markdown",
+ "id": "745ceecc-fb32-40fb-9036-88f860bbb303",
+ "metadata": {},
+ "source": [
+ "## Array Operations\n",
+ "\n",
+ "Now that we know how to *create* arrays and modify individual entries and sub-arrays through assignment, we can think about operations that we can perform with them.\n",
+ "\n",
+ "### Element-wise arithmetic\n",
+ "\n",
+ "Some of the simplest operations we can perform between arrays are *mathematical operations*. These basic arithmetic operators perform **element-wise** computations:\n",
+ "\n",
+ "* `+` - addition of two arrays, element-wise.\n",
+ "* `-` - subtraction of two arrays, element-wise.\n",
+ "* `*` - **element-wise** multiplication of two arrays.\n",
+ "* `/` - division of two arrays, element-wise.\n",
+ "* `**` - **element-wise** exponentiation of arrays.\n",
+ "* `%` - modulo of two arrays, element-wise.\n",
+ "* `//` - integer division (floor division) of two arrays, element-wise.\n",
+ "\n",
+ "**Note:** Again, these operations are all **element-wise**. This is unlike certain other programming languages, where the element-wise operators have different symbols.\n",
+ "\n",
+ "To perform these operations, the arrays must have *compatible shapes*. Here are some of the rules of working with different array shapes:\n",
+ "\n",
+ "* Exact same shape - the operation works directly\n",
+ "* One of the operands is a `float` or an `int` - the operation is performed with the scalar and each element in the array\n",
+ "* Shapes can be broadcast together - this process is described in the [documentation](https://numpy.org/doc/stable/user/basics.broadcasting.html)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6a467f5d-5df7-48c4-8afe-35eb071f0c52",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "arr1 = np.array([[3, 5, 4], [8, 5, 2], [8, 3, 9]])\n",
+ "\n",
+ "arr2 = np.array([[6, 3, 2], [6, 6, 1], [4, 9, 2]])\n",
+ "\n",
+ "print(f\"Array 1:\\n {arr1}\\n\")\n",
+ "print(f\"Array 2:\\n {arr2}\\n\")\n",
+ "\n",
+ "# Your code here for basic array arithmetic.\n",
+ "# Let's add these arrays together\n",
+ "print(\"Array 1 + Array 2 = \")\n",
+ "...\n",
+ "print()\n",
+ "\n",
+ "# Now, let's divide arr1 by arr2\n",
+ "print(\"Array 1 / Array 2 = \")\n",
+ "...\n",
+ "print()\n",
+ "\n",
+ "# We can also work with individual integers\n",
+ "...\n",
+ "print(arr1 ** 2)"
+ ]
+ },
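+ {
+ "cell_type": "markdown",
+ "id": "a1f20e3b-1a10-4c2d-8e9f-0d1c2b3a4510",
+ "metadata": {},
+ "source": [
+ "One possible way to fill in the ellipses above:\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "\n",
+ "arr1 = np.array([[3, 5, 4], [8, 5, 2], [8, 3, 9]])\n",
+ "arr2 = np.array([[6, 3, 2], [6, 6, 1], [4, 9, 2]])\n",
+ "\n",
+ "# Element-wise addition\n",
+ "print(arr1 + arr2)\n",
+ "\n",
+ "# Element-wise division\n",
+ "print(arr1 / arr2)\n",
+ "\n",
+ "# Operations with a scalar apply to every element\n",
+ "print(arr1 ** 2)\n",
+ "```"
+ ]
+ },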
+ {
+ "cell_type": "markdown",
+ "id": "c6ee3b78-4c3b-4f64-9eb4-503cfee9a079",
+ "metadata": {},
+ "source": [
+ "**Warning:** Be careful when dividing by arrays that may contain zero! This will produce potentially annoying results, which contain invalid values, such as `NaN` (not a number) or `inf` (infinity):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "14fb0808-6280-47af-b931-1c3fc9b89446",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "arr3 = np.array([[6, 5, 0], [2, 9, 4], [4, 5, 1]])\n",
+ "arr4 = np.array([[4, 2, 0], [0, 5, 2], [8, 0, 3]])\n",
+ "\n",
+ "# Your code here to divide by zero\n"
+ ]
+ },
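+ {
+ "cell_type": "markdown",
+ "id": "a1f20e3b-1a11-4c2d-8e9f-0d1c2b3a4511",
+ "metadata": {},
+ "source": [
+ "For example, dividing `arr3` by `arr4` produces `inf` where a nonzero value is divided by zero, and `nan` for `0 / 0` (NumPy also emits a runtime warning):\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "\n",
+ "arr3 = np.array([[6, 5, 0], [2, 9, 4], [4, 5, 1]])\n",
+ "arr4 = np.array([[4, 2, 0], [0, 5, 2], [8, 0, 3]])\n",
+ "\n",
+ "# 0 / 0 gives nan; nonzero / 0 gives inf\n",
+ "result = arr3 / arr4\n",
+ "print(result)\n",
+ "```"
+ ]
+ },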
+ {
+ "cell_type": "markdown",
+ "id": "57f4df41-82a5-4eeb-a2a0-1195e74362ca",
+ "metadata": {},
+ "source": [
+ "Notice that NumPy still gives you a result, but it lets you know that it's not happy with you."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "81cf6b75-f7fd-4af0-97f8-8ceb4eb4cf42",
+ "metadata": {},
+ "source": [
+ "### Statistics on arrays\n",
+ "\n",
+ "Here are some basic functions that all follow a similar pattern:\n",
+ "* [`ndarray.mean`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.mean.html) - Compute the average.\n",
+ "* [`ndarray.var`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.var.html) - Compute the variance.\n",
+ "* [`ndarray.std`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.std.html) - Compute the standard deviation.\n",
+ "\n",
+ "We can change the behaviour of these functions using the `axis` keyword argument. If `axis` is specified, the statistics are computed along that axis of the array, producing an array of results."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fe882e1a-11d3-4e1a-ae9e-d6aebe9602f7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to compute statistics on arrays\n",
+ "my_arr = np.array([\n",
+ " [4, 2, 5],\n",
+ " [7, 3, 2],\n",
+ " [8, 9, 4],\n",
+ " [2, 0, 3]\n",
+ "])\n",
+ "\n",
+ "# Let's compute the mean\n",
+ "\n",
+ "\n",
+ "# Let's now compute the mean along each row\n",
+ "\n",
+ "\n",
+ "# Let's compute the standard deviation along each column\n"
+ ]
+ },
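+ {
+ "cell_type": "markdown",
+ "id": "a1f20e3b-1a12-4c2d-8e9f-0d1c2b3a4512",
+ "metadata": {},
+ "source": [
+ "One possible way to complete the statistics cell above:\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "\n",
+ "my_arr = np.array([\n",
+ "    [4, 2, 5],\n",
+ "    [7, 3, 2],\n",
+ "    [8, 9, 4],\n",
+ "    [2, 0, 3]\n",
+ "])\n",
+ "\n",
+ "# Mean over the whole array\n",
+ "print(my_arr.mean())\n",
+ "\n",
+ "# Mean of each row (averaging across the columns, axis 1)\n",
+ "print(my_arr.mean(axis=1))\n",
+ "\n",
+ "# Standard deviation of each column (crossing the rows, axis 0)\n",
+ "print(my_arr.std(axis=0))\n",
+ "```"
+ ]
+ },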
+ {
+ "cell_type": "markdown",
+ "id": "2eac5aa1-0c2e-4bfe-9dc8-3a79ced37e87",
+ "metadata": {},
+ "source": [
+ "**Note:** It can be confusing to know which axis to specify. The axis is the one *across* which the operation is performed, i.e. the axis that gets collapsed. When computing the mean, think about which numbers we want averaged with each other: if we're averaging along a row, we're averaging numbers across the columns, and the columns are defined by axis `1`.\n",
+ "\n",
+ "This `axis` parameter can also be used for other methods, such as [`np.sum`](https://numpy.org/doc/stable/reference/generated/numpy.sum.html) and [`np.prod`](https://numpy.org/doc/stable/reference/generated/numpy.prod.html), which respectively add and multiply all elements in an array."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ac190a21-a314-4049-93f9-a9ea494e3aa9",
+ "metadata": {},
+ "source": [
+ "There are other array, matrix and vector operations that we can perform, such as dot products. We won't see those here, but I encourage you to read up on them in the [documentation](https://numpy.org/doc/stable/reference/routines.linalg.html).\n",
+ "\n",
+ "There are also other element-wise functions included in NumPy, such as the trigonometric functions `cos` and `sin`. When we import NumPy, we can simply call these functions using the dot notation (e.g. `np.cos`). They do not exist as array methods."
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "6dc21db2-5106-4230-ba6d-c5bba1e6a302",
+ "metadata": {},
+ "source": [
+ "### Combining arrays\n",
+ "\n",
+ "We can combine multiple arrays together using the following NumPy functions:\n",
+ "\n",
+ "* [`np.concatenate`](https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html) - Combine arrays along an **existing** axis.\n",
+ "* [`np.stack`](https://numpy.org/doc/stable/reference/generated/numpy.stack.html) - Combine arrays along a **new** axis.\n",
+ "* [`np.hstack`](https://numpy.org/doc/stable/reference/generated/numpy.hstack.html) - Combine arrays horizontally (column-wise, side by side).\n",
+ "* [`np.vstack`](https://numpy.org/doc/stable/reference/generated/numpy.vstack.html) - Combine arrays vertically (row-wise, one on top of the other).\n",
+ "\n",
+ "Let's see some examples:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6ca80d18-9adf-44db-ae76-b73f66d6efe0",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to combine arrays\n",
+ "my_first_array = np.arange(12).reshape(3, 4)\n",
+ "my_second_array = np.arange(24, 0, -2).reshape(3, 4)\n",
+ "\n",
+ "# Let's concatenate the two arrays\n",
+ "cat_array = ...\n",
+ "\n",
+ "print(f\"Concatenated array has shape {cat_array.shape}.\")\n",
+ "\n",
+ "# Let's stack the two arrays\n",
+ "stacked_array = ...\n",
+ "print(f\"Stacked array has shape {stacked_array.shape}.\")\n",
+ "\n",
+ "# Let's horizontally stack the two arrays\n",
+ "hstacked_array = ...\n",
+ "print(f\"Horizontally stacked array has shape {hstacked_array.shape}.\")\n",
+ "\n",
+ "# And now let's vertically stack the two arrays\n",
+ "vstacked_array = ...\n",
+ "print(f\"Vertically stacked array has shape {vstacked_array.shape}.\")"
+ ]
+ },
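+ {
+ "cell_type": "markdown",
+ "id": "a1f20e3b-1a13-4c2d-8e9f-0d1c2b3a4513",
+ "metadata": {},
+ "source": [
+ "One possible way to fill in the combination functions above, with the resulting shapes:\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "\n",
+ "my_first_array = np.arange(12).reshape(3, 4)\n",
+ "my_second_array = np.arange(24, 0, -2).reshape(3, 4)\n",
+ "\n",
+ "# Concatenate along an existing axis (axis 0 by default)\n",
+ "cat_array = np.concatenate([my_first_array, my_second_array])\n",
+ "print(cat_array.shape)  # (6, 4)\n",
+ "\n",
+ "# Stack along a new leading axis\n",
+ "stacked_array = np.stack([my_first_array, my_second_array])\n",
+ "print(stacked_array.shape)  # (2, 3, 4)\n",
+ "\n",
+ "# Stack horizontally (column-wise for 2D arrays)\n",
+ "hstacked_array = np.hstack([my_first_array, my_second_array])\n",
+ "print(hstacked_array.shape)  # (3, 8)\n",
+ "\n",
+ "# Stack vertically (row-wise for 2D arrays)\n",
+ "vstacked_array = np.vstack([my_first_array, my_second_array])\n",
+ "print(vstacked_array.shape)  # (6, 4)\n",
+ "```"
+ ]
+ },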
+ {
+ "cell_type": "markdown",
+ "id": "11ded154-b743-4e47-b61f-990f53168fc7",
+ "metadata": {},
+ "source": [
+ "There are **many** other useful functions in the NumPy package, but there are *way too many* to cover in this workshop... So, how can you get more info about NumPy?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f3072fee-5751-4278-a858-2d6483b747c1",
+ "metadata": {},
+ "source": [
+ "### Exploring the documentation\n",
+ "\n",
+ "The answer: the documentation!\n",
+ "\n",
+ "NumPy is a **huge** library. So, it's impossible to know and remember all the functions, their arguments, and all the details about them.\n",
+ "\n",
+ "Instead, it's better to become familiar with the **documentation**. It can be found on the NumPy website, at https://numpy.org/doc/stable/index.html. I've included links to the documentation along the way. Let's look at some of the pages for the functions we've discussed.\n",
+ "\n",
+ "The docs give detailed information about every class, function, constant and method available. There are instructions on how to use (and not use) everything. They provide hints and warnings, as well as a [\"User Guide\"](https://numpy.org/doc/stable/user/index.html).\n",
+ "\n",
+ "When in doubt, **check the documentation**. This should be your **first stop**. Unlike checking on Stack Overflow, the docs give you **official information** and **important guidance**. It's also important to remember that sometimes, just because code *works*, that doesn't mean the code is *right*. The documentation will help clear up any confusion."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d4ce9d14-2d0d-4bb7-ac82-3f34a1e9330f",
+ "metadata": {},
+ "source": [
+ "## Module Summary\n",
+ "\n",
+ "We've made it to the end of our module on NumPy! Here are the highlights of what we've seen:\n",
+ "\n",
+ "* NumPy is a **package** that offers tools to easily represent **arrays**.\n",
+ "* It is easy to perform **arithmetic operations** between arrays.\n",
+ "* Arrays have *attributes* like `shape` and *methods* like `mean` that allow us to learn more about our data and compute results.\n",
+ "* We can easily **combine** arrays and **reshape** them as needed.\n",
+ "* The **documentation** provides resources to learn more about all the features in NumPy.\n",
+ "\n",
+ "Congratulations! You're now ready to enter into the world of array programming! Let's work on an exercise!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "51f824de-e98e-49c9-8b27-754f5f52f2a1",
+ "metadata": {},
+ "source": [
+ "## Exercises\n",
+ "\n",
+ "Now, let's explore what you've learned using a few exercises!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c3c9619b-5efc-43ae-97c3-c8ffc5e9c2c3",
+ "metadata": {},
+ "source": [
+ "### Single Nucleotide Polymorphism analysis\n",
+ "\n",
+ "Genes are not identical between all individuals. Differences at a single location in a gene are referred to as *single nucleotide polymorphisms* (SNPs). Diploid organisms have two copies of most genes. At each location, an individual can have 0, 1 or 2 copies of a SNP. \n",
+ "\n",
+ "For this exercise, we've genotyped an unknown number of different artificial yeast cells to determine the number of copies of 5000 SNPs we have. Unfortunately, our equipment could only produce 1D arrays. The data are located in `../data/snp_individuals.npy`. Perform the following operations to allow us to analyse the SNP data.\n",
+ "\n",
+ "1. Load the genotype data and reshape it so that the columns correspond to each of the 5000 SNPs. Figure out how many individuals have been analysed.\n",
+ "2. Compute the following:\n",
+ " * The mean frequency of each SNP across the population (both with and without the built-in method).\n",
+ " * The mean frequency of all SNPs in each individual.\n",
+ " * The maximum and minimum frequency of each SNP.\n",
+ " * The maximum and minimum SNP frequency within each individual.\n",
+ "3. Then, extract the frequencies of every 10th SNP, and do the same.\n",
+ "\n",
+ "**Note:** These data are completely artificial (the genotyping was a nice problem scenario)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b5fdc20c-6fd4-49fb-97cb-0e7f6e530897",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "filename = base_dir + \"data/snp_individuals.npy\"\n",
+ "\n",
+ "# Your code here\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e4bf7c47-c47b-4387-b2cd-a08ddb63b1cf",
+ "metadata": {},
+ "source": [
+ "### Amino acid properties\n",
+ "\n",
+ "Before getting to the exercise, here's some code which will be helpful to have for the exercises in this and other modules. It allows reading DNA and amino acid sequences from a FASTA file.\n",
+ "\n",
+ "FASTA files contain two types of lines (my terms for them):\n",
+ "* header lines - begin with `>` and contain *metadata*.\n",
+ "* sequence lines - all other lines, which contain the actual sequence information.\n",
+ "\n",
+ "A single sequence can span multiple lines between headers.\n",
+ "\n",
+ "For more information on FASTA files, see the [BLAST documentation](https://blast.ncbi.nlm.nih.gov/doc/blast-topics/#fasta) and this [Wikipedia page](https://en.wikipedia.org/wiki/FASTA_format)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5366e11a-e54f-4cc4-8bea-3a7b730ba9f0",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def load_sequences(filename):\n",
+ " \"\"\"Load sequences from a FASTA file.\"\"\"\n",
+ " sequences = []\n",
+ " \n",
+ " with open(filename, \"r\") as f:\n",
+ " lines = f.readlines()\n",
+ "\n",
+ " # Define the current sequence\n",
+ " seq = \"\"\n",
+ "\n",
+ " for line in lines:\n",
+ " if line.startswith(\">\"):\n",
+ " if seq != \"\":\n",
+ " sequences.append(seq)\n",
+ " seq = \"\"\n",
+ " else:\n",
+ " seq += line.strip()\n",
+ "\n",
+ " if seq != \"\":\n",
+ " sequences.append(seq)\n",
+ "\n",
+ " return sequences"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d8cf983b-1f9b-4e5e-b2da-154bc859d749",
+ "metadata": {},
+ "source": [
+ "I've provided you with peptide sequences containing 100 amino acids from 50 simulated individuals.\n",
+ "\n",
+ "1. At each location, determine the frequency of amino acids with a non-polar side chain.\n",
+ "2. For each individual, determine the proportion of amino acids with acidic side chains."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "152d4024-160b-4701-83e8-dae7ade29e10",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "input_file = base_dir + \"data/HUMAN_HEALTHY.fasta\"\n",
+ "sequences = load_sequences(input_file)\n",
+ "\n",
+ "AMINO_ACID_PROPERTIES = {\n",
+ " \"NON_POLAR\": [\"F\", \"L\", \"I\", \"M\", \"V\", \"P\", \"A\", \"W\", \"G\"],\n",
+ " \"POLAR\": [\"S\", \"T\", \"Y\", \"Q\", \"N\", \"C\"],\n",
+ " \"ACIDIC\": [\"D\", \"E\"],\n",
+ " \"BASIC\": [\"H\", \"K\", \"R\"]\n",
+ "}\n",
+ "\n",
+ "# Your code here\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2aa2726f-46b7-482b-a646-540fdb8c9196",
+ "metadata": {},
+ "source": [
+ "**Have extra time? Want extra practice?** I've included simulated examples for other species and other disease states in the same folder. Feel free to do similar experiments on those."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c703c655-61ea-44b3-afe5-68e62ce248d6",
+ "metadata": {},
+ "source": [
+ "**Curious about how I generated these sequences?** Check out the [`generate_proteins.py`](../data_scripts/generate_proteins.py) script to see how I generated these artificial sequences."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4da504a9-45fe-45a2-90a9-3ea1dddb9042",
+ "metadata": {},
+ "source": [
+ "# Module 3 - Visualising Data with Matplotlib\n",
+ "\n",
+ "We've now seen how to process data stored in arrays... but an array is just an array. We want to get more out of our data. Let's enter the world of **data visualisation**. In this module, we'll see the following topics:\n",
+ "\n",
+ "1. Creating Plots with Matplotlib\n",
+ " 1. Introduction to pyplot\n",
+ " 2. Creating simple plots, bar plots, histograms, ...\n",
+ " 3. Customising plots: Labelling axes, titles, and more!\n",
+ " 4. Advanced plots: scatter plots, image plots\n",
+ " 5. Exporting plots as images\n",
+ " 6. Generating subplots\n",
+ "2. Exploring the Matplotlib Documentation\n",
+ "3. Exercises"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c9260841-ae50-4ec9-9b5e-10e9a5edfa7e",
+ "metadata": {},
+ "source": [
+ "## Creating Plots with Matplotlib\n",
+ "\n",
+ "[**Matplotlib**](https://matplotlib.org/) is a commonly-used package for producing plots."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0ce3b5b9-b8a0-4f85-9cc4-47b1837c005d",
+ "metadata": {},
+ "source": [
+ "### Introduction to `pyplot`\n",
+ "\n",
+ "To simplify the plotting experience, Matplotlib offers the `pyplot` sub-package. This interface lets us quickly generate many figures without having to worry about much of the nitty-gritty behind the scenes.\n",
+ "\n",
+ "To get started with `pyplot`, we have to import it."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "17925c38-aeea-44df-aa2a-9d1d083545e0",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to import pyplot\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c36a0d75-dcdf-46a0-a27b-87f2ba579157",
+ "metadata": {},
+ "source": [
+ "### Creating simple plots\n",
+ "\n",
+ "In this section, we'll cover a number of plots:\n",
+ "* simple plots\n",
+ "* bar plots\n",
+ "* histograms\n",
+ "* violin plots\n",
+ "* polar plots\n",
+ "\n",
+ "All of these will follow a similar pattern:\n",
+ "```python\n",
+ "\n",
+ "plt.plotting_function(data, ...)\n",
+ "plt.show()\n",
+ "\n",
+ "```\n",
+ "\n",
+ "The specific `plotting_function` will change, but typically you need to call [`plt.show()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.show.html) after defining the plot so that you actually see the result. In a Jupyter notebook, this is typically not required, but I'm including it anyways just to make sure!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5178fa1c-db25-42e2-8b7b-5b07dfaac406",
+ "metadata": {},
+ "source": [
+ "#### Simple plots\n",
+ "\n",
+ "What is the simplest thing we can plot? How about a function? To make a plot of `x` values and `y` values, we just use the [`plot`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html) function:\n",
+ "\n",
+ "```python\n",
+ "plt.plot(xs, ys)\n",
+ "```\n",
+ "\n",
+ "Let's do an example! Let's take the angles from 0 to 360 degrees, skipping by increments of 20, and compute the cosine:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2c64076c-502c-456f-8dfc-97568e3dd850",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to plot the function\n"
+ ]
+ },
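+ {
+ "cell_type": "markdown",
+ "id": "a1f20e3b-1a14-4c2d-8e9f-0d1c2b3a4514",
+ "metadata": {},
+ "source": [
+ "A minimal sketch (the imports are repeated so the snippet stands alone):\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "# Angles from 0 to 360 degrees in steps of 20\n",
+ "xs = np.arange(0, 361, 20)\n",
+ "\n",
+ "# np.cos expects radians, so convert first\n",
+ "ys = np.cos(np.radians(xs))\n",
+ "\n",
+ "plt.plot(xs, ys)\n",
+ "plt.show()\n",
+ "```"
+ ]
+ },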
+ {
+ "cell_type": "markdown",
+ "id": "78fd07d3-cfa4-4251-addd-1cb51330ddeb",
+ "metadata": {},
+ "source": [
+ "We can customise the line formatting using the **format parameter**.\n",
+ "\n",
+ "The format parameter consists of a **single string** combining:\n",
+ "* the colour (as a single letter abbreviation)\n",
+ "* the point markers (a variety of shapes are available)\n",
+ "* the line style (solid, dots, dashes)\n",
+ "\n",
+ "The formatting of this string is described in detail [here](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html).\n",
+ "\n",
+ "Let's make our cosine function appear in magenta (`m`) using stars at each point (`*`) with a dotted line in between (`:`):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5c6b326e-6527-438d-9d65-64bc3aa301aa",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to plot with a different format\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f66b5605-87f5-4fcd-b1a7-01186ef7c913",
+ "metadata": {},
+ "source": [
+ "**Note:** This isn't the only way to specify the formatting, but it's very easy to use."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0e799f11-658f-44a3-8176-03f7286eb9bf",
+ "metadata": {},
+ "source": [
+ "To plot many lines together, we can call `plot` as many times as we need **before calling `plt.show()`**. After we call `plt.show()`, the plot gets cleared."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b29d7b68-5c1b-4694-998b-d281b28df96d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "xs = np.arange(0, 361, 20)\n",
+ "cos_ys = np.cos(np.radians(xs))\n",
+ "sin_ys = np.sin(np.radians(xs))\n",
+ "\n",
+ "# Your code here to plot the functions for sin and cos\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ee70a17e-e68e-4ecc-8127-3fed9acce9ca",
+ "metadata": {},
+ "source": [
+ "We'll see later how to add a legend to this type of plot."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c6b54b5b-2f61-4c29-b568-9d0b1baca40c",
+ "metadata": {},
+ "source": [
+ "#### Bar plots\n",
+ "\n",
+ "We can also generate bar plots, where each bar has a height proportional to a specified value. Let's say we are looking at the number of people attending a course online versus in-person.\n",
+ "\n",
+ "We can use the [`plt.bar`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.bar.html) function to produce bar plots:\n",
+ "\n",
+ "```python\n",
+ "plt.bar(horiz_categories, heights)\n",
+ "```\n",
+ "\n",
+ "For an official example, see: https://matplotlib.org/stable/gallery/lines_bars_and_markers/bar_colors.html\n",
+ "\n",
+ "Let's do a quick example:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "cdf13af9-a324-4f0b-9251-8130f6bbc9a6",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "participation_options = [\"online\", \"in-person\", \"hybrid\"]\n",
+ "participants = [120, 30, 60]\n",
+ "\n",
+ "# Your code here for an attendance example\n"
+ ]
+ },
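+ {
+ "cell_type": "markdown",
+ "id": "a1f20e3b-1a15-4c2d-8e9f-0d1c2b3a4515",
+ "metadata": {},
+ "source": [
+ "One way to complete the attendance example:\n",
+ "\n",
+ "```python\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "participation_options = [\"online\", \"in-person\", \"hybrid\"]\n",
+ "participants = [120, 30, 60]\n",
+ "\n",
+ "# One bar per category, with the heights from participants\n",
+ "plt.bar(participation_options, participants)\n",
+ "plt.show()\n",
+ "```"
+ ]
+ },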
+ {
+ "cell_type": "markdown",
+ "id": "b3936823-43b8-46c0-8f87-4b577e9110cd",
+ "metadata": {},
+ "source": [
+ "Using other arguments for the `bar` function, it's possible to customise the bars to change their widths, colours, alignment, and more! We can even include error bars using the `xerr` and `yerr` parameters! Make sure to check out the complete documentation to see more parameters."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a31d3773-ac34-48c0-9092-025c67d0456d",
+ "metadata": {},
+ "source": [
+ "#### Histograms\n",
+ "\n",
+ "Matplotlib includes a [`hist` function](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html) specifically for computing histograms. All we need to do is supply one or more **flat** arrays (only one dimension) and it will do the rest. We can optionally specify the number of bins or their boundaries (using the `bins` parameter), and even indicate that the vertical axis should have a log scale (using the `log` parameter).\n",
+ "\n",
+ "Here's the syntax:\n",
+ "```python\n",
+ "plt.hist(my_data)\n",
+ "```\n",
+ "\n",
+ "Let's see a quick example:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "266a4293-c195-42c8-be6b-1f6ab4f723f8",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "random_numbers = np.random.default_rng().normal(size=10000)\n",
+ "\n",
+ "# Your code here for a simple histogram\n"
+ ]
+ },
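+ {
+ "cell_type": "markdown",
+ "id": "a1f20e3b-1a16-4c2d-8e9f-0d1c2b3a4516",
+ "metadata": {},
+ "source": [
+ "A quick sketch, using a fixed seed so the result is reproducible:\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "# 10,000 samples from a standard normal distribution\n",
+ "random_numbers = np.random.default_rng(42).normal(size=10000)\n",
+ "\n",
+ "# hist returns the counts, the bin edges, and the drawn patches\n",
+ "counts, bins, patches = plt.hist(random_numbers, bins=30)\n",
+ "plt.show()\n",
+ "```"
+ ]
+ },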
+ {
+ "cell_type": "markdown",
+ "id": "1e22ce7f-6e5f-4c63-8a94-ab3d7f25798b",
+ "metadata": {},
+ "source": [
+ "#### Violin plots\n",
+ "\n",
+ "Another way of representing this type of frequency data is using a *violin plot*. Matplotlib can generate violin plots in a similar way, using the [`violinplot` function](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.violinplot.html). The syntax is:\n",
+ "\n",
+ "```python\n",
+ "plt.violinplot(data)\n",
+ "```\n",
+ "\n",
+ "The function can easily be used for multiple datasets.\n",
+ "\n",
+ "Let's look at an example:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b2b4af86-d67c-4207-b4fd-8df21ee4a5df",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "random_numbers = np.random.default_rng().normal(size=10000)\n",
+ "plt.violinplot(random_numbers)\n",
+ "plt.show()"
+ ]
+ },
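+ {
+ "cell_type": "markdown",
+ "id": "b7e2c9d0-3f51-4a8c-8d2e-5b9f0c7a1202",
+ "metadata": {},
+ "source": [
+ "As noted above, `violinplot` also accepts a sequence of datasets, with one violin drawn per dataset. Here's a small sketch (the two distributions are arbitrary):\n",
+ "\n",
+ "```python\n",
+ "rng = np.random.default_rng()\n",
+ "data_a = rng.normal(size=10000)\n",
+ "data_b = rng.normal(loc=2, size=10000)\n",
+ "\n",
+ "plt.violinplot([data_a, data_b])\n",
+ "plt.show()\n",
+ "```"
+ ]
+ },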
+ {
+ "cell_type": "markdown",
+ "id": "a90ce8de-61a1-4855-bc80-aca3734585db",
+ "metadata": {},
+ "source": [
+ "#### Polar plots\n",
+ "\n",
+ "Matplotlib can even generate plots on polar axes using the [`plt.polar` function](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.polar.html). The syntax, per the documentation, is:\n",
+ "\n",
+ "```python\n",
+ "plt.polar(theta, r)\n",
+ "```\n",
+ "\n",
+ "So, let's say we want to create a circle with a radius $r = \\sin\\theta$:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "58b6facc-36fe-4d80-bf94-3af40b1d7eff",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "thetas = np.arange(360)\n",
+ "rs = np.sin(np.radians(thetas))\n",
+ "\n",
+ "plt.polar(np.radians(thetas), rs)\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8fd95288-ef7c-4515-9138-9869a65848ff",
+ "metadata": {},
+ "source": [
+ "These examples show the basics for each type of plot. For more parameters and customisations, check out the respective documentation pages, which I have linked to. Also, make sure to consult the documentation for other types of plots that I haven't show, such as pie charts."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "78003886-64cf-46a6-8d62-321071121c66",
+ "metadata": {},
+ "source": [
+ "### Customising plots\n",
+ "\n",
+ "The plots that we've made so far show the data... and not much else. We can add more information to our plots:\n",
+ "* Axis labels\n",
+ "* Titles\n",
+ "* Legends\n",
+ "* ... and more!\n",
+ "\n",
+ "We'll discuss the first three, but there are plenty more described in the Matplotlib documentation."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3850a5e3-a704-4abd-9687-0c8998ebbb8f",
+ "metadata": {},
+ "source": [
+ "#### Adding axis labels and titles\n",
+ "\n",
+ "To add axis labels, we just simply call the `pyplot` functions [`xlabel`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.xlabel.html) and [`ylabel`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.ylabel.html), like so:\n",
+ "\n",
+ "```python\n",
+ "# Set the x-axis label\n",
+ "plt.xlabel(\"My x axis\")\n",
+ "\n",
+ "# Set the y-axis label\n",
+ "plt.ylabel(\"My y axis\")\n",
+ "```\n",
+ "\n",
+ "For a title, we just call the function [`title`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.title.html):\n",
+ "\n",
+ "```python\n",
+ "# Set the title\n",
+ "plt.title(\"My Title\")\n",
+ "```\n",
+ "\n",
+ "Let's see an example using our bar plot from before:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "29f73010-ce7d-43a6-92ba-7a02096094f0",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "participation_options = [\"Online\", \"In-person\", \"Hybrid\"]\n",
+ "participants = [120, 30, 60]\n",
+ "\n",
+ "plt.bar(participation_options, participants)\n",
+ "\n",
+ "# Your code here to add axis labels and a title\n",
+ "\n",
+ "plt.show()"
+ ]
+ },
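+ {
+ "cell_type": "markdown",
+ "id": "c1d8e7f6-9a2b-4c3d-b4e5-6f7a8b9c0303",
+ "metadata": {},
+ "source": [
+ "If you'd like a hint, here's one possible way to fill in the blanks (the label and title text are just suggestions):\n",
+ "\n",
+ "```python\n",
+ "participation_options = [\"Online\", \"In-person\", \"Hybrid\"]\n",
+ "participants = [120, 30, 60]\n",
+ "\n",
+ "plt.bar(participation_options, participants)\n",
+ "plt.xlabel(\"Participation mode\")\n",
+ "plt.ylabel(\"Number of participants\")\n",
+ "plt.title(\"Workshop Participation\")\n",
+ "plt.show()\n",
+ "```"
+ ]
+ },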
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "a3a9f415-fbe5-4533-bbc9-81b0cb3cd3a4",
+ "metadata": {},
+ "source": [
+ "Now, this looks like a more respectable plot! As with everything else, there are many options for customising the appearance of labels and titles, which are described in the documentation."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c6cb6eab-44f8-480c-9497-5ebbe6daa926",
+ "metadata": {},
+ "source": [
+ "#### Plot legends\n",
+ "\n",
+ "And now we're back to our example from before with multiple plots! We want to let everyone know what each curve represents. To do this, we can add a legend using the [`plt.legend`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.legend.html) function.\n",
+ "\n",
+ "When adding a legend, we need to supply a set of *labels* that will be associated with each plot. We can do this using the `label` keyword argument in the `plot` function.\n",
+ "\n",
+ "We can also specify other parameters for the legend, such as the *location* (`loc`), the number of columns (`ncols`), the font size (`fontsize`) and more! See the documentation for all the details.\n",
+ "\n",
+ "Let's now use this function to make a legend for our sine and cosine graphs:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "be92d5a9-7d6a-4a5b-9934-af4670ab195d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "xs = np.arange(0, 360, 20)\n",
+ "cos_ys = np.cos(np.radians(xs))\n",
+ "sin_ys = np.sin(np.radians(xs))\n",
+ "\n",
+ "plt.xlabel\n",
+ "\n",
+ "# Generate the plot - Your code here to add the labels\n",
+ "plt.plot(xs, cos_ys)\n",
+ "plt.plot(xs, sin_ys)\n",
+ "\n",
+ "# Your code here to create the legend\n",
+ "\n",
+ "\n",
+ "plt.show()"
+ ]
+ },
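+ {
+ "cell_type": "markdown",
+ "id": "e9f0a1b2-c3d4-4e5f-a6b7-c8d9e0f10404",
+ "metadata": {},
+ "source": [
+ "Here's one possible sketch of the completed cell, with the `label` keyword argument on each `plot` call and a call to `legend` (the `loc` value is an arbitrary choice):\n",
+ "\n",
+ "```python\n",
+ "xs = np.arange(0, 360, 20)\n",
+ "cos_ys = np.cos(np.radians(xs))\n",
+ "sin_ys = np.sin(np.radians(xs))\n",
+ "\n",
+ "plt.plot(xs, cos_ys, label=r\"$\\cos\\theta$\")\n",
+ "plt.plot(xs, sin_ys, label=r\"$\\sin\\theta$\")\n",
+ "plt.legend(loc=\"upper right\")\n",
+ "plt.show()\n",
+ "```"
+ ]
+ },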
+ {
+ "cell_type": "markdown",
+ "id": "545ed220-170f-43df-83d7-baf8421bfe13",
+ "metadata": {},
+ "source": [
+ "In this example, I've also shown a little trick. You can slip $\\LaTeX$ math symbols into any text that is shown in a plot. There's just a catch. The backslash `\\` is a special character. So, to be able to include `\\`-commands, like `\\cos` or `\\sin`, you must do one of two things:\n",
+ "* *Escape* the slash, by typing it twice, as `\\\\`, to write `\\\\cos` or `\\\\sin`.\n",
+ "* Indicate that you are passing in raw text, by putting `r` **before** the string, as in `r\"$\\sin$\"`.\n",
+ "\n",
+ "As usual, when typing in $\\LaTeX$, don't forget to put the dollar signs `$` around the math text!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c84df7e2-6c11-42b4-9ceb-e899d997d86a",
+ "metadata": {},
+ "source": [
+ "#### Other customisations\n",
+ "\n",
+ "There are many other ways that we can customise our plots. We can changing the axis ticks using [`plt.xticks`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.xticks.html) and [`plt.yticks`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.yticks.html) or alter their appearance using [`plt.tick_params`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.tick_params.html#matplotlib.pyplot.tick_params). We can also hide the bounding box using [`plt.box`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.box.html) and alter the grid appearance using [`plt.grid`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.grid.html).\n",
+ "\n",
+ "I've linked to the documentation in each case."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6f1ad975-7ff1-4bef-9803-556ada0b72d2",
+ "metadata": {},
+ "source": [
+ "### Advanced plots\n",
+ "\n",
+ "Now, let's look at some more advanced plots. In the plots we saw before, our data had a simple structure; for each item, we wanted to only show one piece of information:\n",
+ "\n",
+ "* `plot` - For each `x`, show a `y`.\n",
+ "* `bar` - For each category, show a height.\n",
+ "* `hist` and `violinplot` - For each `x`, show a frequency.\n",
+ "* `polar` - For each `theta` show an `r`.\n",
+ "\n",
+ "For these plots, we could change the colour of bars and markers, but we were only plotting two dimensions of data. Now, let's explore more complicated plots, which require additional axes. But first, let's talk a bit about colours."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b32d6392-a364-4344-98f4-31b5810f77cc",
+ "metadata": {},
+ "source": [
+ "#### A brief intro to colour maps\n",
+ "\n",
+ "In scientific visualisation, we often use colours to provide additional insight into data. Colour maps allow us to assign specific colours to low and high values, with a spectrum in between. Matplotlib offers **many** different colour maps, which are described [here](https://matplotlib.org/stable/users/explain/colors/colormaps.html). This page also provides insight into different types of colour maps, as well as the strengths and weaknesses of each colour map.\n",
+ "\n",
+ "I **strongly recommend** that you take a look at this page and decide which colour map to use. Different contexts require different uses of colour.\n",
+ "\n",
+ "Each colour map has a name, and in this section we'll see some places where we can pass these names into functions."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8cb5e147-b6d2-4eb6-b66e-0b00ddf96692",
+ "metadata": {},
+ "source": [
+ "#### Scatter plots\n",
+ "\n",
+ "At first glance, you might assume that a scatter plot is just like the simple plots that we generated with `plot`... Well, there's actually a bit more here. Using [`plt.scatter`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html), we place points in a 2D `xy`-plane, but we can also have the colour and marker size change in response to additional variables.\n",
+ "\n",
+ "To generate a scatter plot, we use the following syntax:\n",
+ "```python\n",
+ "plt.scatter(xs, ys, sizes, colours, ...)\n",
+ "\n",
+ "```\n",
+ "\n",
+ "To control the colour map used, we can set the `cmap` argument equal to the desired colour map name. **Be careful!** The colour map names are case-sensitive. There are several other options that can be configured as well.\n",
+ "\n",
+ "Let's see an example where we change the size and the colour of the points:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "001365e8-f8f3-458c-b737-696fd46e0e37",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "number_of_points = 500\n",
+ "\n",
+ "# Generate some 2D points\n",
+ "my_points = np.random.default_rng().normal(size=(number_of_points, 2))\n",
+ "\n",
+ "# Generate the size data\n",
+ "my_sizes = np.random.default_rng().exponential(scale=10, size=number_of_points)\n",
+ "\n",
+ "# Generate the colour data\n",
+ "my_colour_data = np.random.default_rng().chisquare(5, size=number_of_points)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "72917d75-bcd8-48cb-9229-adda7a32bd49",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to generate the scatter plot\n"
+ ]
+ },
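+ {
+ "cell_type": "markdown",
+ "id": "f1e2d3c4-b5a6-4978-8a1b-2c3d4e5f0606",
+ "metadata": {},
+ "source": [
+ "If you'd like a hint, a sketch along these lines should work (the data are regenerated here so the snippet stands alone; the `viridis` colour map is an arbitrary choice):\n",
+ "\n",
+ "```python\n",
+ "rng = np.random.default_rng()\n",
+ "my_points = rng.normal(size=(500, 2))\n",
+ "my_sizes = rng.exponential(scale=10, size=500)\n",
+ "my_colour_data = rng.chisquare(5, size=500)\n",
+ "\n",
+ "plt.scatter(my_points[:, 0], my_points[:, 1], my_sizes, my_colour_data, cmap=\"viridis\")\n",
+ "plt.show()\n",
+ "```"
+ ]
+ },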
+ {
+ "cell_type": "markdown",
+ "id": "cc867cdc-34d0-4f7e-b60b-bdd6c3b59a27",
+ "metadata": {},
+ "source": [
+ "We can add a colour bar using the [`plt.colorbar()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.colorbar.html) function."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "80f827ba-52fd-473b-8707-a056e4d6b426",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "plt.scatter(xs, ys, my_sizes, my_colour_data)\n",
+ "\n",
+ "# Your code here to add the colour bar\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4524b22f-1b57-47e8-a669-a6c3d318e9ac",
+ "metadata": {},
+ "source": [
+ "Size legends are also possible: https://matplotlib.org/stable/gallery/lines_bars_and_markers/scatter_with_legend.html"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e6d46252-bbd2-4699-a9f4-fdb5dafe11bb",
+ "metadata": {},
+ "source": [
+ "#### Image plots\n",
+ "\n",
+ "We can also use Matplotlib to plot images. We can actually plot **any array** as an image.\n",
+ "\n",
+ "The function to use is [`plt.imshow`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.imshow.html). The syntax is:\n",
+ "\n",
+ "```python\n",
+ "plt.imshow(image_arr, ...)\n",
+ "```\n",
+ "\n",
+ "Other parameters include the colour map to use, as well as image interpolation, aspect ratio and normalisation settings. These are explained in the documentation.\n",
+ "\n",
+ "Like before, we can show a colour bar using the `plt.colorbar()` function. This is especially useful if your image intensities contain important measurements and/or computed results.\n",
+ "\n",
+ "Let's see an example involving [\"Ant SEM.jpg\"](https://en.wikipedia.org/wiki/File:Ant_SEM.jpg), produced by the US Government (public domain)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6641c064-1c75-4711-aab2-9e13c5c3336b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import PIL\n",
+ "\n",
+ "my_test_image = PIL.Image.open(base_dir + \"assets/mod3/Ant_SEM.jpg\")\n",
+ "\n",
+ "# Your code here for the sample image\n"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "3bb29f8c-ddfc-461f-8f31-822eabf36c5e",
+ "metadata": {},
+ "source": [
+ "We can get the original image by changing the colour map to `grey` by setting the `cmap` parameter.\n",
+ "\n",
+ "**Important note:** Notice the axes! In image coordinates, the `y`-axis decreases from the top corner!"
+ ]
+ },
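+ {
+ "cell_type": "markdown",
+ "id": "0a1b2c3d-4e5f-4a6b-9c8d-7e6f5a4b0707",
+ "metadata": {},
+ "source": [
+ "To see `imshow` applied to a plain array rather than a photograph, here's a minimal sketch that plots a computed 2D function along with a colour bar:\n",
+ "\n",
+ "```python\n",
+ "# Build a 2D array from a simple function of the pixel coordinates\n",
+ "ys, xs = np.mgrid[0:100, 0:100]\n",
+ "values = np.sin(xs / 10) * np.cos(ys / 10)\n",
+ "\n",
+ "plt.imshow(values, cmap=\"viridis\")\n",
+ "plt.colorbar()\n",
+ "plt.show()\n",
+ "```"
+ ]
+ },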
+ {
+ "cell_type": "markdown",
+ "id": "b32fa92e-3306-4d5a-a337-d8c692ebb356",
+ "metadata": {},
+ "source": [
+ "### Exporting plots as images\n",
+ "\n",
+ "To save figures, we just need to call the `pyplot` function [`plt.savefig`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html). The only required parameter is the **filename**.\n",
+ "\n",
+ "```python\n",
+ "plt.savefig(filename, ...)\n",
+ "```\n",
+ "\n",
+ "We can export plots as `PNG`, `TIFF` or `JPEG`, among other raster formats. We can even export a plot as an `SVG` or a `PDF` to get a vector image. The default extension is `PNG` if none is specified.\n",
+ "\n",
+ "There are other options that can be set. If we're saving in a format that supports transparency, we can indicate to save the plot with a transparent background (`transparent`). We can also change the resolution using the `dpi` parameter.\n",
+ "\n",
+ "As an example, let's save our sine and cosine plot from before:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "946fc206-e9bf-4c5f-a3a3-0cec7c54b95d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to plot the functions for sin and cos\n",
+ "xs = np.arange(0, 361, 20)\n",
+ "cos_ys = np.cos(np.radians(xs))\n",
+ "sin_ys = np.sin(np.radians(xs))\n",
+ "\n",
+ "# Generate the plot\n",
+ "plt.plot(xs, cos_ys, label=r\"$\\cos\\theta$\")\n",
+ "plt.plot(xs, sin_ys, label=r\"$\\sin\\theta$\")\n",
+ "\n",
+ "plt.legend()\n",
+ "\n",
+ "# Your code here to save the figure\n"
+ ]
+ },
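+ {
+ "cell_type": "markdown",
+ "id": "1f2e3d4c-5b6a-4798-8c9d-0e1f2a3b0808",
+ "metadata": {},
+ "source": [
+ "For reference, the save call might look like this (the filename and the `dpi` and `transparent` values are arbitrary):\n",
+ "\n",
+ "```python\n",
+ "plt.plot([0, 1], [1, 0])\n",
+ "\n",
+ "# Save BEFORE showing - plt.show() clears the figure\n",
+ "plt.savefig(\"my_plot.png\", dpi=300, transparent=True)\n",
+ "plt.show()\n",
+ "```"
+ ]
+ },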
+ {
+ "cell_type": "markdown",
+ "id": "d0694a49-54cc-40e1-b1d2-5182fb84be07",
+ "metadata": {},
+ "source": [
+ "> **Very Important:** You **cannot** call `plt.savefig` after calling `plt.show()`. `plt.show` clears the current plot after showing it. If you call `plt.show` first, you will just end up saving a blank canvas.\n",
+ "\n",
+ "Again, **always** call `plt.savefig` **FIRST** if you're going to both save and show the plot."
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "db2e61f2-57e7-444f-899d-616a7966096b",
+ "metadata": {},
+ "source": [
+ "### Generating subplots\n",
+ "\n",
+ "There are a few different ways that we can produce subplots.\n",
+ "\n",
+ "One approach is to use the [`plt.subplot`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplot.html) function.\n",
+ "\n",
+ "Here's the syntax:\n",
+ "```python\n",
+ "plt.subplot(nrows, ncols, index)\n",
+ "```\n",
+ "\n",
+ "First, we specify the number of rows `nrows` and number of columns `ncols` that will define our plot grid.\n",
+ "\n",
+ "The third argument, `index`, is a bit more complicated. This argument is either a single integer, if the plot takes up one panel, or a tuple containing a starting and ending index, if the plot should be bigger than one panel. Oddly enough, for the grid panels, the index starts at 1. Here's an illustration for `plt.subplot(3, 3, ...)`:\n",
+ "\n",
+ "\n",
+ "\n",
+ "The subplot is the smallest **rectangle** that fits between the start and end index.\n",
+ "\n",
+ "So, if we want a plot to extend vertically from the top left corner to the centre position, we would write:\n",
+ "\n",
+ "```python\n",
+ "plt.subplot(3, 3, (1, 5))\n",
+ "```\n",
+ "\n",
+ "This sub-plot then becomes active, and we can plot on it just like we would normally.\n",
+ "\n",
+ "**Note:** After specifying the `index`, we can pass other keyword arguments. For example, we can make polar plots by specifying `projection=\"polar\"` or 3D with `projection=\"3d\"`.\n",
+ "\n",
+ "Here's an example:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f30fa559-c505-4862-8d76-015795df3030",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "plt.subplot(3, 3, 3)\n",
+ "plt.plot(np.arange(10))\n",
+ "plt.subplot(3, 3, (7, 9))\n",
+ "plt.plot(np.arange(20, 0, -1))\n",
+ "plt.subplot(3, 3, (1, 5), projection=\"3d\")\n",
+ "plt.subplot(3, 3, 6, projection=\"polar\")\n",
+ "thetas = np.radians(np.arange(0, 365, 10))\n",
+ "plt.polar(thetas, np.cos(thetas))\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d741c0a6-2ae3-4d75-a6f1-a8518634458c",
+ "metadata": {},
+ "source": [
+ "As we can supply specific arguments to each sub-plot, we can have a mix of different types of plots."
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "4b137418-0e75-4c11-b9e6-ceca3dad08ca",
+ "metadata": {},
+ "source": [
+ "#### Alternative methods of generating subplots (Optional)\n",
+ "\n",
+ "##### Using `plt.subplots`\n",
+ "\n",
+ "A more straightforward way is to use the [`plt.subplots`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots.html) function. As arguments, we specify the number of rows and columns we want, and we get back a list of `Axes` objects on which we can plot.\n",
+ "\n",
+ "```python\n",
+ "my_figure, my_axes = plt.subplots(nrows, ncols, ...)\n",
+ "```\n",
+ "\n",
+ "This method is intuitive and we can easily share axes across plots. But, all plots must have the **same size**. So, if we have a 3 by 3 grid, we have 9 plots. That's it. No merging.\n",
+ "\n",
+ "##### Using `plt.subplot_mosaic`\n",
+ "\n",
+ "If we want to create a more elaborate setup, we can use [`plt.subplot_mosaic`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplot_mosaic.html), which takes in our desired layout using labels and produces the corresponding grid."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "839be448-d0f4-413a-a80d-c7c308a50f12",
+ "metadata": {},
+ "source": [
+ "## Exploring the Matplotlib Documentation\n",
+ "\n",
+ "We've covered a lot on Matplotlib here, but we're really just scratching the surface. If you want to learn more about Matplotlib, reading the online documentation is a **must**.\n",
+ "\n",
+ "The documentation can be found online at https://matplotlib.org/stable/.\n",
+ "\n",
+ "This guide contains many different sections, including:\n",
+ "* **User guide**: material explaining how to use Matplotlib, written in text form. These pages explain core concepts and how to use them.\n",
+ "* **Reference**: the guide to every Python class and function used under the hood. These pages tell you the ins and outs of how to plot, and what will make Matplotlib happy and what will annoy it. If you're ever using a new function, make sure to check out its page.\n",
+ "* **Examples**: Matplotlib is a very visual package. The website provides **tons** of examples. Just about every type of plot you can imagine has an example on this page. If you're ever in doubt about how to do something, chances are that there's an example that can help. Whether you're plotting heat maps or 3D vectors, or something in between, this page has you covered!\n",
+ "* **Cheatsheets**: These are a bit hidden, but if you want a quick summary of everything that Matplotlib can do, check out the [cheatsheets](https://matplotlib.org/cheatsheets/). Print them out, post them on your wall!\n",
+ "\n",
+ "Here are a couple of the cheatsheets for convenience:\n",
+ "\n",
+ "\n",
+ "\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1dd9adc6-ce3e-4242-82f1-3d77e9fcd3cf",
+ "metadata": {},
+ "source": [
+ "## Module Summary\n",
+ "\n",
+ "With that, we reach the end of our module on Matplotlib. Here are the highlights:\n",
+ "\n",
+ "* **Matplotlib** is a package for plotting. It offers the **`pyplot` interface** to simplify the plotting process.\n",
+ "* Various **types of plots** can be constructed, ranging from **simple Cartesian plots** to **more advanced** image plots and scatter plots.\n",
+ "* These plots can be **customised** with **titles** and **axis labels** and some can also include **colour bars**.\n",
+ "* Plots can either be **shown** immediately or **saved** to use later.\n",
+ "* **Subplots** can be used to generate more complicated, elaborate figures.\n",
+ "\n",
+ "And now, let's do some exercises!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9a462433-ea45-4c17-b4c3-3ca209b4e078",
+ "metadata": {},
+ "source": [
+ "## Exercises\n",
+ "\n",
+ "We've seen how to generate different types of plots. Let's now use these tools to visualise some data from our previous exercises."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1a09f6da-981a-4138-82e5-5019226043bc",
+ "metadata": {},
+ "source": [
+ "### Single Nucleotide Polymorphism analysis\n",
+ "\n",
+ "Let's go back to our SNP example. For the provided SNPs, construct the following plots:\n",
+ "\n",
+ "1. A plot showing the total number of copies of each SNP\n",
+ "2. A plot showing the number of SNPs for each individual\n",
+ "3. A histogram showing the number of copies of each SNP\n",
+ "\n",
+ "Think about how to best represent these data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3d905587-a507-462e-8a9a-0a7619dd5162",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "genotype_data = np.load(base_dir + \"data/snp_individuals.npy\")\n",
+ "\n",
+ "# Reshape the data array to have 5000 columns and unknown number of rows\n",
+ "genotype_data = genotype_data.reshape(-1, 5000)\n",
+ "\n",
+ "# Your code here for generating the plots\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cdc9719d-0b76-417c-8cce-30d55c6947e9",
+ "metadata": {},
+ "source": [
+ "### Amino acid properties\n",
+ "\n",
+ "Let's stick with our amino acid analysis. Before, we looked at non-polar side chains, and at acidic side chains. Well, let's produce some plots to visually show the difference in amino acid frequency.\n",
+ "\n",
+ "1. Construct plots to show the frequency of the amino acids of different properties in a single individual (take the first sequence). This plot may be a bar graph. If you want to check out the Matplotlib documentation, you can also try to make a pie chart to show the different properties.\n",
+ "\n",
+ "2. These plots give us a global picture, but how about a more local one? Generate a plot showing the frequency of amino acids with different properties at each position in the protein of interest. If you want to use a stacked bar graph, check out [this example](https://matplotlib.org/stable/gallery/lines_bars_and_markers/bar_stacked.html#sphx-glr-gallery-lines-bars-and-markers-bar-stacked-py).\n",
+ "\n",
+ "3. We have data from different species and disease states. Construct an array whose columns correspond to the frequency of acidic side chains at each position and the rows correspond to the different species. Use `imshow` to visualise these differences.\n",
+ "\n",
+ "I've provided you with some basic starter code to make it easier to count amino acids."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a61c7a9d-b473-420f-8c97-cc2f0403a7c8",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "AMINO_ACID_PROPERTIES = {\n",
+ " \"NON_POLAR\": [\"F\", \"L\", \"I\", \"M\", \"V\", \"P\", \"A\", \"W\", \"G\"],\n",
+ " \"POLAR\": [\"S\", \"T\", \"Y\", \"Q\", \"N\", \"C\"],\n",
+ " \"ACIDIC\": [\"D\", \"E\"],\n",
+ " \"BASIC\": [\"H\", \"K\", \"R\"]\n",
+ "}\n",
+ "\n",
+ "# Create a new dictionary where the amino acids are the keys and the\n",
+ "# amino acid properties are the values.\n",
+ "amino_acid_properties = {}\n",
+ "\n",
+ "for prop in AMINO_ACID_PROPERTIES:\n",
+ " for aa in AMINO_ACID_PROPERTIES[prop]:\n",
+ " amino_acid_properties[aa] = prop\n",
+ "\n",
+ "def count_amino_acid_properties(seq):\n",
+ " \"\"\"Count the amino acid properties.\"\"\"\n",
+ " \n",
+ " sequence_amino_acid_counts = {}\n",
+ " \n",
+ " # Get the list of amino acid properties in the sequence\n",
+ " for aa in seq:\n",
+ " prop = amino_acid_properties[aa]\n",
+ " \n",
+ " if prop not in sequence_amino_acid_counts:\n",
+ " sequence_amino_acid_counts[prop] = 1\n",
+ " else:\n",
+ " sequence_amino_acid_counts[prop] += 1\n",
+ " \n",
+ " return sequence_amino_acid_counts\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a842ab72-12cc-4f8c-8b26-f9766b293f7a",
+ "metadata": {},
+ "source": [
+ "And now, put your code here:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5468ecca-e0f3-4b45-a081-263e09e28c6d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "input_file = base_dir + \"data/HUMAN_HEALTHY.fasta\"\n",
+ "sequences = load_sequences(input_file)\n",
+ "\n",
+ "# Your code here\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bed78cca-50b4-4c81-8295-e533ee401728",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Part 2 - Get the amino acid property at each location in each individual\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "da515dba-4c60-427f-a17f-9196e819b47d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Part 3 - Perform similar analysis for many species\n",
+ "\n",
+ "# Let's load our sequences\n",
+ "filenames = [\n",
+ " base_dir + \"data/HUMAN_HEALTHY.fasta\",\n",
+ " base_dir + \"data/HUMAN_DISEASE.fasta\",\n",
+ " base_dir + \"data/CHIMP_HEALTHY.fasta\",\n",
+ " base_dir + \"data/CHIMP_DISEASE.fasta\",\n",
+ " base_dir + \"data/PIG_HEALTHY.fasta\",\n",
+ " base_dir + \"data/PIG_DISEASE.fasta\",\n",
+ "]\n",
+ "\n",
+ "seq_superlist = [load_sequences(fn) for fn in filenames]\n",
+ "\n",
+ "# Your code here\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e0d7b4aa-cecf-4bbc-a43c-b29d2cbc5147",
+ "metadata": {},
+ "source": [
+ "# Module 4 - Intro to Tabular Data with Pandas\n",
+ "\n",
+ "We've now seen how to store multi-dimensional arrays using NumPy and how to generate rich visualisations using Matplotlib. Now, let's take things in another direction and see how to represent and analyse tables of data using [**pandas**](https://pandas.pydata.org/). We're going to scratch the surface for this big, important package and cover only the absolute basics. After that, we'll explore the resources available from this project so that you can easily continue learning it on your own. Here's the outline for this module:\n",
+ "\n",
+ "1. Fundamentals of pandas\n",
+ " 1. Motivation: When arrays aren’t enough…\n",
+ " 2. Introducing the Series and DataFrame\n",
+ " 3. Intro to Grouping\n",
+ " 4. Reading and writing tables\n",
+ "2. Exploring the pandas Documentation\n",
+ "3. Exercise\n",
+ "\n",
+ "While pandas is very popular, it's not the *only* package out there. Hopefully, this module will provide you with helpful information that is relevant to those packages, as well.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1b86d50a-647f-41bf-a6d2-099dec660a5a",
+ "metadata": {},
+ "source": [
+ "## Fundamentals of pandas\n",
+ "\n",
+ "[pandas](https://pandas.pydata.org/) is a package for handling labelled data in tables."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "847c72dd-0883-4ab3-8c09-01c5aa5d6485",
+ "metadata": {},
+ "source": [
+ "### Motivation: When arrays aren't enough...\n",
+ "\n",
+ "Sometimes, we need something that works more like a table, where we can give our columns and rows names, and easily add new columns based on existing ones.\n",
+ "\n",
+ "\n",
+ "\n",
+ "Using pandas, we can work with tables in a similar way to how we could in a spreadsheet.\n",
+ "\n",
+ "If you don't already have it installed, you can install pandas using either from `conda-forge` using `conda` or from PyPI using `pip`:\n",
+ "\n",
+ "```bash\n",
+ "# conda\n",
+ "conda install -c conda-forge pandas\n",
+ "\n",
+ "# pip\n",
+ "pip install pandas\n",
+ "```\n",
+ "\n",
+ "For more info on installing pandas, see [this page](https://pandas.pydata.org/docs/getting_started/install.html) in the online documentation.\n",
+ "\n",
+ "When we **import** pandas, we typically give it the alias `pd`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "35424673-0503-4bce-8acb-78854654b835",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to import pandas\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "05b4fe4b-6e60-4afa-8ebe-a80806b30008",
+ "metadata": {},
+ "source": [
+ "### Introducing the Series and DataFrame\n",
+ "\n",
+ "In NumPy, the objects that we worked with were N-dimensional arrays, known as `ndarray`. In pandas, we work with two main types of objects:\n",
+ "\n",
+ "* [`Series`](https://pandas.pydata.org/docs/reference/series.html) - represents a single column of data, typically of the same type.\n",
+ "* [`DataFrame`](https://pandas.pydata.org/docs/reference/frame.html) - represents a table of data, where columns are `Series`.\n",
+ "\n",
+ "In this module, we'll cover the basics of these two data types."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8b51152c-6675-42d9-abbe-7b23fe8caa40",
+ "metadata": {},
+ "source": [
+ "#### Series\n",
+ "\n",
+ "A series as a sort of single-dimension array. We can create a `Series` object quite easily:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a0015204-4cfd-4ccb-a40c-35772f72d656",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "my_data = [6, 7, 1, 10, 2, 3]\n",
+ "\n",
+ "# Your code here to create a Series\n"
+ ]
+ },
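+ {
+ "cell_type": "markdown",
+ "id": "3b4c5d6e-7f80-4192-a3b4-c5d6e7f81010",
+ "metadata": {},
+ "source": [
+ "In case you get stuck, one possible answer is a single call to the `Series` constructor:\n",
+ "\n",
+ "```python\n",
+ "import pandas as pd\n",
+ "\n",
+ "my_data = [6, 7, 1, 10, 2, 3]\n",
+ "my_series = pd.Series(my_data)\n",
+ "print(my_series)\n",
+ "```"
+ ]
+ },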
+ {
+ "cell_type": "markdown",
+ "id": "05d9de89-c175-4498-9863-77428cd94379",
+ "metadata": {},
+ "source": [
+ "Now we have a `Series`! Based on the output, we can see that our series contains integers (`int64`).\n",
+ "\n",
+ "This first column does not actually contain data. It is known as the **index**. You can think of the index as giving names to the rows. When we created the `Series`, we didn't explicitly give an index, so pandas just numbered the rows sequentially.\n",
+ "\n",
+ "If we wanted to give the rows explicit names, we could do that using the `index` keyword argument in the `Series` constructor."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ba37794e-81fc-45b3-b454-4038f1fe408e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "my_data = [6, 7, 1, 10, 2, 3]\n",
+ "my_indices = [\"microCT\", \"MRI\", \"CT\", \"FIB-SEM\", \"cryoTEM\", \"PALM\"]\n",
+ "\n",
+ "# Your code here to create a Series with named rows\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a18c9070-3610-426c-9e87-1472c8785de1",
+ "metadata": {},
+ "source": [
+ "**Side note:** You may be thinking that this looks a lot like a dictionary... Well, we can actually create a `Series` using a dictionary, shown here:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "11fc8a69-fef6-4c3d-b138-067f09efc043",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "my_data = {\n",
+ " \"microCT\": 6,\n",
+ " \"MRI\": 7,\n",
+ " \"CT\": 1,\n",
+ " \"FIB-SEM\": 10,\n",
+ " \"cryoTEM\": 2,\n",
+ " \"PALM\": 3\n",
+ " }\n",
+ "\n",
+ "# Your code here to create a Series\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "afa2e91a-4375-4a2a-8120-54509d818293",
+ "metadata": {},
+ "source": [
+ "**Warning:** There are some differences between dictionaries and `Series` that we won't get into here."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f17db560-dcc9-4841-aae6-c3095ca62b8f",
+ "metadata": {},
+ "source": [
+ "##### Accessing Elements in a Series\n",
+ "\n",
+ "We can access elements in our `Series` using the bracket operator. We can index by **both** the row number, and the row index (if it's different from the row number).\n",
+ "\n",
+ "When using the square brackets after the variable name, you **must** use the row index. In our example with the named rows, here's how we can get the number of `FIB-SEM` datasets:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e347971e-5ead-462a-85c9-6998571e4f6a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to get the number of FIB-SEM datasets\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9c6d53f7-64b6-4cb7-b218-f2e48cd0484c",
+ "metadata": {},
+ "source": [
+ "Alternatively, we can use the special `.loc` property of the `Series`, followed by the row name in **square brackets**.\n",
+ "\n",
+ "So, we can rewrite this example as:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7e8141dd-00aa-4fed-b060-43ce6ad5b930",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to get the number of FIB-SEM datasets using loc\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c3f5f7fe-39b8-4fb4-b068-73fb2c580638",
+ "metadata": {},
+ "source": [
+ "If you want to access individual elements by row number, you must use the special `.iloc` property of the `Series`, followed by **square brackets**.\n",
+ "\n",
+ "For example, let's say we want to access the last row of our `Series`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0d9be18c-4c0b-4234-b922-a269db0ee9bb",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to access the last row\n"
+ ]
+ },
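+ {
+ "cell_type": "markdown",
+ "id": "3c9d1b2e-4f5a-4c6b-9d7e-1a2b3c4d5e6f",
+ "metadata": {},
+ "source": [
+ "To summarise the two access styles, here is a quick syntax sketch (the series and label names are placeholders, not part of the exercises):\n",
+ "\n",
+ "```python\n",
+ "value_by_name = my_series.loc[\"row_name\"]  # label-based access\n",
+ "value_by_number = my_series.iloc[-1]  # position-based access; -1 means the last row\n",
+ "```"
+ ]
+ },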
+ {
+ "cell_type": "markdown",
+ "id": "22b69b07-be6f-4e40-a07a-dab4c8bb941a",
+ "metadata": {},
+ "source": [
+ "Like with lists and arrays, we can select multiple rows using **slicing**. To perform slicing, we can again use the square brackets directly, or `.loc` or `.iloc`.\n",
+ "\n",
+ "**Important note:** When slicing with row names, the upper bound is **included** (per the pandas documentation).\n",
+ "\n",
+ "Let's get the range of rows from `MRI` to `FIB-SEM`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0fb375f9-8579-4826-bc5c-0f206b6e323b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to select the rows\n"
+ ]
+ },
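+ {
+ "cell_type": "markdown",
+ "id": "7e8f9a0b-1c2d-4e3f-8a9b-0c1d2e3f4a5b",
+ "metadata": {},
+ "source": [
+ "The general slicing patterns look like this (the labels here are placeholders):\n",
+ "\n",
+ "```python\n",
+ "by_label = my_series.loc[\"start_label\":\"end_label\"]  # upper bound INCLUDED\n",
+ "by_position = my_series.iloc[1:4]  # upper bound excluded, like with lists\n",
+ "```"
+ ]
+ },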
+ {
+ "cell_type": "markdown",
+ "id": "b20c4451-bfa2-4db5-a6ab-0f0ea50adb9d",
+ "metadata": {},
+ "source": [
+ "We can also perform indexing using booleans, described in the pandas [**User Guide**](https://pandas.pydata.org/docs/user_guide/indexing.html#boolean-indexing). We won't see that right now, but it is **very useful** for selecting specific rows."
+ ]
+ },
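+ {
+ "cell_type": "markdown",
+ "id": "9a0b1c2d-3e4f-4a5b-8c7d-6e5f4a3b2c1d",
+ "metadata": {},
+ "source": [
+ "Just to give a taste, boolean indexing looks like this (a sketch, not part of the exercises):\n",
+ "\n",
+ "```python\n",
+ "my_series[my_series > 5]  # keeps only the rows where the condition is True\n",
+ "```"
+ ]
+ },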
+ {
+ "cell_type": "markdown",
+ "id": "9bbb9442-dfbb-4fc1-aad7-fe164bcf09cc",
+ "metadata": {},
+ "source": [
+ "##### Modifying Series\n",
+ "\n",
+ "Like many of the other collection types we've seen, `Series` are **mutable**; we can change the data stored inside the `Series`... using the **bracket operator** `[]` similar to how we would for a dictionary or a list.\n",
+ "\n",
+ "```python\n",
+ "my_series[row_index] = new_value\n",
+ "```\n",
+ "\n",
+ "This can be used both to update existing values and to add new values."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ba3599fb-5572-488f-8016-c15f544659fb",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to change the number of FIB-SEM datasets to 11\n",
+ "\n",
+ "\n",
+ "# Your code here to add 4 datasets in the new row AFM\n",
+ "\n",
+ "\n",
+ "my_series"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e95f30e1-a37f-457b-b4ad-c826999b881f",
+ "metadata": {},
+ "source": [
+ "##### Operations on Series\n",
+ "\n",
+ "We can perform NumPy-style operations and arithmetic on our `Series`.\n",
+ "\n",
+ "Let's start by doing something simple. Let's use the [`sum()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.sum.html) method to get the total number of datasets we have:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0a5ef737-3736-4123-92ed-ec876d6fd1c7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to take the sum of our datasets\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e784a218-365a-4c23-b6e0-a3eb15dfb833",
+ "metadata": {},
+ "source": [
+ "We can also perform basic statistics using methods such as [`Series.std()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.std.html), [`Series.mean()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.mean.html), [`Series.median()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.median.html) and [`Series.quantile()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.quantile.html).\n",
+ "\n",
+ "Now, let's compute the proportion of datasets that are in each category:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "76101568-835d-42fa-9ff7-80d1405c636f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to compute the proportions\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f3daea7b-5f97-40c7-a4a0-8c6a67c58cc8",
+ "metadata": {},
+ "source": [
+ "So, we've now produced a new `Series`!\n",
+ "\n",
+ "There are many more operations that can be performed on `Series`, some of which we will see later."
+ ]
+ },
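+ {
+ "cell_type": "markdown",
+ "id": "2b3c4d5e-6f7a-4b8c-9d0e-1f2a3b4c5d6e",
+ "metadata": {},
+ "source": [
+ "One detail worth knowing: arithmetic on a `Series` works element-wise, and operations between two `Series` are aligned by index, not by position. A sketch with placeholder names:\n",
+ "\n",
+ "```python\n",
+ "doubled = my_series * 2  # scalar operations broadcast to every element\n",
+ "combined = my_series + other_series  # values are matched by row label, not position\n",
+ "```"
+ ]
+ },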
+ {
+ "cell_type": "markdown",
+ "id": "40cd8ada-b604-47e3-a1dd-56d3ee203b93",
+ "metadata": {},
+ "source": [
+ "#### DataFrame\n",
+ "\n",
+ "The `DataFrame` combines multiple `Series` together as columns in a table. These columns have names and each row still has an index.\n",
+ "\n",
+ "There are many different ways to get a `DataFrame`. We can create a `DataFrame` using a dictionary with strings as keys and lists as values:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ff7398c2-fdf4-48f7-b2f9-a695de5a8e72",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "my_data = {\n",
+ " \"height\": [145, 198, 157, 175, 157],\n",
+ " \"weight\": [50, 65, 53, 54, 67],\n",
+ " \"age\": [45, 50, 50, 45, 40],\n",
+ " \"sex\": [\"F\", \"F\", \"M\", \"F\", \"M\"]\n",
+ "}\n",
+ "\n",
+ "# Your code here to create a DataFrame\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "38106a22-eac0-46f7-8980-c3ff197e7e04",
+ "metadata": {},
+ "source": [
+ "We can access individual `Series` in the table using the bracket operator with the column name:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7469d0ce-3111-4f44-8fc6-6ea536010c80",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to access the height column\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "94aedf28-80e3-4157-a087-cd8e95225395",
+ "metadata": {},
+ "source": [
+ "This indeed gives us our familiar `Series`.\n",
+ "\n",
+ "We can generate new columns based on old columns and easily append them to our `DataFrame`.\n",
+ "\n",
+ "We just get the new values and assign them, again using the bracket operator.\n",
+ "\n",
+ "For example, the weights that we've provided are in kilograms. Let's say we want to add a new column called `weight (lb)` that has the weight in pounds.\n",
+ "\n",
+ "Here's how we can do this:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f0b160bd-62ff-4e5d-8028-11c0390d9e85",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to add a new column that contains weight in lbs\n"
+ ]
+ },
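+ {
+ "cell_type": "markdown",
+ "id": "5d6e7f8a-9b0c-4d1e-8f2a-3b4c5d6e7f8a",
+ "metadata": {},
+ "source": [
+ "The general pattern for adding a derived column looks like this (the names here are placeholders):\n",
+ "\n",
+ "```python\n",
+ "my_df[\"new_column\"] = my_df[\"old_column\"] * conversion_factor\n",
+ "```"
+ ]
+ },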
+ {
+ "cell_type": "markdown",
+ "id": "8968a7bb-edaa-47e5-be55-c522c5c45ec3",
+ "metadata": {},
+ "source": [
+ "We can see that we now have a new column! And, we can even perform operations on multiple columns! Let's compute the body mass index (BMI) for each individual.\n",
+ "\n",
+ "Recall the formula for BMI: $\\text{BMI} = \\text{weight} / \\text{height}^2$ where weight is in **kg** and height is in **metres**."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9c2e81a5-7a63-4174-aa5b-8cf190c9259a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to construct a BMI column\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7853f314-4b36-4e59-8e36-1a8b8a2e7b5b",
+ "metadata": {},
+ "source": [
+ "##### Accessing DataFrame elements\n",
+ "\n",
+ "We can access **individual elements** using `.loc` and `.iloc`, followed by square brackets and either row and column **names** (for `loc`) or row and column **numbers** (for `iloc`).\n",
+ "\n",
+ "**Note:** It is **super important** to remember that the indexing follows the same order as NumPy indexing: first the **row**, then the **column**.\n",
+ "\n",
+ "For example, let's get the BMI of the first individual in our table:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5d9a9a46-e48b-4fe0-80c7-c07f5b5b4003",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "26664446-a533-4275-909e-3f6be12e30f6",
+ "metadata": {},
+ "source": [
+ "We can also access information for multiple rows and multiple columns using slicing. Let's get age and BMI for the even-numbered rows:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "75489e5c-c8fc-4eed-88fc-d6959771890e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "41b9026c-7500-4d64-bd4f-5e45e4a60506",
+ "metadata": {},
+ "source": [
+ "This last example shows that if we index using a list, we can get specific rows or columns.\n",
+ "\n",
+ "##### Data filtering\n",
+ "\n",
+ "Now, let's say we want to get a new table where we only have the data for the female patients. We can perform **boolean indexing**:\n",
+ "\n",
+ "```python\n",
+ "filtered_df = my_df[some_boolean]\n",
+ "```\n",
+ "\n",
+ "The `some_boolean` can be generated using comparisons, as we'll see in our example:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fdc0ac4f-d9dc-48a1-bcb8-4130cde49d05",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to extract the rows corresponding to female patients\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6686c106-b6cd-45ce-a1b9-e26d151614dd",
+ "metadata": {},
+ "source": [
+ "Now we have a subset of our rows in our new `DataFrame`. Notice that the index does not change; we still have the original row numbers.\n",
+ "\n",
+ "Boolean indexing is covered in much more depth in the pandas [documentation](https://pandas.pydata.org/docs/user_guide/indexing.html#boolean-indexing)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "428ef0d2",
+ "metadata": {},
+ "source": [
+ "#### Missing data\n",
+ "\n",
+ "Sometimes, when conducting experiments, some data are missing. These may be recorded in a dataset as `NaN` or as `NA`. There are a couple of different things that we can do with missing data:\n",
+ "\n",
+ "* remove the rows that contain missing values\n",
+ "* fill in the missing values with some other value\n",
+ "\n",
+ "With `pandas`, we can do both these tasks."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ef44635e",
+ "metadata": {},
+ "source": [
+ "##### Removing missing data\n",
+ "\n",
+ "Let's start with removing `NaN` rows. We can take advantage of the functions:\n",
+ "* [`pandas.isna`](https://pandas.pydata.org/docs/reference/api/pandas.isna.html) -- identifies if a value is `NaN`.\n",
+ "* [`pandas.notna`](https://pandas.pydata.org/docs/reference/api/pandas.notna.html) -- identifies if a value is **not** `NaN`.\n",
+ "* [`DataFrame.dropna`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) -- `DataFrame` method to remove rows or columns containing missing values.\n",
+ "\n",
+ "These also exist as **methods** on `DataFrame` objects. We can also take advantage of the [`all`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.all.html) method, which can help us when indexing to remove rows with missing data.\n",
+ "\n",
+ "Let's now do an example!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "130b0bfd",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Create a DataFrame with missing values\n",
+ "my_arr = np.arange(30).reshape(6, 5).astype(float)\n",
+ "my_arr[0, 3] = np.nan\n",
+ "my_arr[3, 4] = np.nan\n",
+ "\n",
+ "my_df = pd.DataFrame(my_arr, columns=[\"Mon\", \"Tues\", \"Wed\", \"Thurs\", \"Fri\"])\n",
+ "\n",
+ "my_df"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "20e68241",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here for dropping missing values\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e4488636",
+ "metadata": {},
+ "source": [
+ "##### Filling missing data\n",
+ "\n",
+ "`pandas` offers a number of functions to fill in missing data **with data that is representative of your sample**. The following `DataFrame` methods are offered:\n",
+ "\n",
+ "* [`DataFrame.fillna`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html) -- fill with a predetermined value, such as a global constant, or values specific to that row or column.\n",
+ "* [`DataFrame.ffill`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.ffill.html) -- forward fill; push values forward into missing entries.\n",
+ "* [`DataFrame.bfill`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.bfill.html) -- backward fill; push values backward into missing entries.\n",
+ "* [`DataFrame.interpolate`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html) -- perform a more complicated interpolation with the neighbours to fill missing values.\n",
+ "\n",
+ "Let's go back to our array from before and let's see what happens if we do a forward fill along rows in the `DataFrame` object `my_df`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6f68099e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to use ffill\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2d111bdf",
+ "metadata": {},
+ "source": [
+ "Notice that the holes have been filled! But, this doesn't always work! Try switching from `axis=\"columns\"` to `axis=\"index\"` to see what happens...\n",
+ "\n",
+ "Most of the tools for dealing with missing data also exist for `Series`.\n",
+ "\n",
+ "For more details on how to work with missing data, make sure to check out the `pandas` [User's Guide](https://pandas.pydata.org/docs/user_guide/missing_data.html)."
+ ]
+ },
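+ {
+ "cell_type": "markdown",
+ "id": "8f9a0b1c-2d3e-4f4a-9b5c-6d7e8f9a0b1c",
+ "metadata": {},
+ "source": [
+ "As a sketch, the filling methods all follow a common pattern (shown here on the `my_df` frame from above):\n",
+ "\n",
+ "```python\n",
+ "my_df.fillna(0)  # replace every missing value with a constant\n",
+ "my_df.ffill(axis=\"columns\")  # push values forward along each row\n",
+ "my_df.interpolate()  # estimate missing values from their neighbours\n",
+ "```"
+ ]
+ },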
+ {
+ "cell_type": "markdown",
+ "id": "1edf0588-b7b7-4289-ad32-3e958970a9ae",
+ "metadata": {},
+ "source": [
+ "#### Functions on DataFrames and Series\n",
+ "\n",
+ "In addition to storing and modifying the information in `Series` and `DataFrame`s, it is also helpful to apply functions to the different entries. This process is described in-depth in the [documentation](https://pandas.pydata.org/docs/user_guide/basics.html#function-application).\n",
+ "\n",
+ "One of the key methods is [`DataFrame.apply()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html). Using this method, a function can be applied to each row or column in a `DataFrame`. This method also exists for `Series`. When used on a `DataFrame`, this method calls the provided function on each row or column, represented as a `Series`.\n",
+ "\n",
+ "Let's do a DNA example. We'll start with some DNA sequences. Let's create a `DataFrame` that has two columns: one for the sequence and one for the sequence length."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e81011f6-44c2-4518-b3aa-09c806d5df4a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "my_sequences = [\"ATTGACTACA\", \"ATCGGGCAGACTTTT\", \"GGCACATGTACATATG\", \"TGTCGTCACGTACGTCA\"]\n",
+ "\n",
+ "# Your code here\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ca3dd303-0e9e-4c9d-968e-bd3932bfcbd0",
+ "metadata": {},
+ "source": [
+ "> Wait, hang on! What did I just do with `len`? Well, in Python, **everything is an object**. So, we can just pass functions around as arguments to other functions! We won't get into too many details about this... but there's an entire programming paradigm known as **functional programming** that loves to do things like this with functions."
+ ]
+ },
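+ {
+ "cell_type": "markdown",
+ "id": "1a2b3c4d-5e6f-4a7b-8c9d-0e1f2a3b4c5d",
+ "metadata": {},
+ "source": [
+ "In general, `apply()` takes the function object itself, **without** parentheses. A sketch, assuming a hypothetical column named `sequence`:\n",
+ "\n",
+ "```python\n",
+ "lengths = my_df[\"sequence\"].apply(len)  # calls len on every entry of the column\n",
+ "```"
+ ]
+ },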
+ {
+ "cell_type": "markdown",
+ "id": "f951fa63-d31a-426f-a7ab-d11e84312bbe",
+ "metadata": {},
+ "source": [
+ "#### Is that it?\n",
+ "\n",
+ "No. There's much, much, much more that you can do with pandas. You can also merge tables together, group rows in tables, construct pivot tables, and more! We won't see all that today, but I'll very soon tell you how you can learn about it."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "788fcd09-e65f-4cae-86ac-d3fe4d0c5e98",
+ "metadata": {},
+ "source": [
+ "### Intro to Grouping\n",
+ "\n",
+ "In addition to these operations, we can also group the data based on certain series and then perform operations on the groups. We do this using the [`groupby`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html#pandas.DataFrame.groupby) method.\n",
+ "\n",
+ "We pass the column name(s) we wish to use to group the data.\n",
+ "\n",
+ "For example, let's go back to our sample patient data and group the data by sex."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b9ffdc58-16a7-4838-a0ec-d12788f32159",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "my_data = {\n",
+ " \"height\": [145, 198, 157, 175, 157],\n",
+ " \"weight\": [50, 65, 53, 54, 67],\n",
+ " \"age\": [45, 50, 50, 45, 40],\n",
+ " \"sex\": [\"F\", \"F\", \"M\", \"F\", \"M\"]\n",
+ "}\n",
+ "\n",
+ "my_df = pd.DataFrame(my_data)\n",
+ "\n",
+ "# Your code here to group by sex\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4d864281-a841-45cd-a338-c6ae0bedcac5",
+ "metadata": {},
+ "source": [
+ "We can't just display the grouped data like we could with a `DataFrame`, but we can use methods to learn more about the grouped dataset.\n",
+ "\n",
+ "For example, we can use the [`get_group()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.get_group.html#pandas.core.groupby.DataFrameGroupBy.get_group) method to get only the rows associated with a specific group. We can also use the [`groups`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.groups.html#pandas.core.groupby.DataFrameGroupBy.groups) property to see the names of the different groups."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "38ce5159-8420-40ea-8ea9-014483821874",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to get the names of the groups present\n",
+ "\n",
+ "\n",
+ "# Your code here to get the male data\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7c2d3b30-3971-4fd8-91df-cb301d4f7e4c",
+ "metadata": {},
+ "source": [
+ "The real power in grouping is the ability to perform operations on the groups. We can apply methods to each group separately. For example, we can call the `mean()` method to get the averages of each value for each group."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5b857458-ef48-49a6-8467-f00005a9f90e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to get the averages within each group\n",
+ "\n"
+ ]
+ },
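+ {
+ "cell_type": "markdown",
+ "id": "4d5e6f7a-8b9c-4d0e-9f1a-2b3c4d5e6f7a",
+ "metadata": {},
+ "source": [
+ "Beyond single summary methods, the [`agg()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.agg.html) method computes several statistics at once. A quick sketch (not part of the exercise):\n",
+ "\n",
+ "```python\n",
+ "my_df.groupby(\"sex\")[\"height\"].agg([\"mean\", \"std\", \"count\"])  # several statistics per group\n",
+ "```"
+ ]
+ },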
+ {
+ "cell_type": "markdown",
+ "id": "7c1e7753-5065-4331-b2bc-9a8762b5e6b4",
+ "metadata": {},
+ "source": [
+ "We can also get the number of non-missing members of each column in each group using the `count()` method."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3ccb8696-3c2c-477b-965c-8d1a66d77e43",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to count the number of rows in each group\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ab427361-f064-487f-8ac8-54fcd1f59bec",
+ "metadata": {},
+ "source": [
+ "We can also apply more complicated functions using the [`apply()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.apply.html#pandas.core.groupby.DataFrameGroupBy.apply) method, similar to what we saw with `Series` and `DataFrame`s.\n",
+ "\n",
+ "Let's see another simple way to get the number of members of each group by applying the `len` function."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "aba4f427-416d-4341-9ee4-3374627198ee",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here to get the number of members of each group\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cb9fe797-c2b2-4d4c-9bd7-815ae47a4530",
+ "metadata": {},
+ "source": [
+ "We now have a `Series` where each index is the name of a group, and the values are the results of calling the function.\n",
+ "\n",
+ "> Note\n",
+ ">\n",
+ "> When calling `apply()`, we **must** pass the keyword argument `include_groups=False`, per the documentation.\n",
+ "\n",
+ "There's plenty more that can be done with grouping. You can check out the [Group by: split-apply-combine](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#) guide in the pandas documentation to learn more."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d39ae322-ea1d-48a6-b102-bcca2405251b",
+ "metadata": {},
+ "source": [
+ "### Loading and saving tables\n",
+ "\n",
+ "`pandas` offers tools for reading and writing data in a variety of formats.\n",
+ "\n",
+ "We'll focus on one key format: the comma-separated values (`.csv`) file.\n",
+ "\n",
+ "#### CSV files\n",
+ "\n",
+ "A CSV file is a *text-based* format for storing information in tables. Since these files are based on text, they are very easy to read. The most important feature of this format is that columns are separated by a special character.\n",
+ "\n",
+ "Despite the name of this format, the separator can be one of several characters:\n",
+ "\n",
+ "* comma `,`\n",
+ "* tab `\\t`\n",
+ "* semicolon `;`\n",
+ "* space ` `\n",
+ "\n",
+ "It is very important to know how your file is structured in order to read it.\n",
+ "\n",
+ "Thankfully, pandas offers very useful functions for reading and writing CSV files.\n",
+ "\n",
+ "#### Reading CSV files\n",
+ "\n",
+ "To load a CSV file into a `DataFrame`, we use the function [`pd.read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv).\n",
+ "\n",
+ "The important parameters that we need to give are:\n",
+ "\n",
+ "* the filename (seems pretty obvious).\n",
+ "* `sep` - the column separator.\n",
+ "* `header` - whether there are any headers.\n",
+ "* `index_col` - whether there is an index column with row names.\n",
+ "\n",
+ "Let's do an example! I've provided a sample CSV file in [`Exercises/data/sample_file.csv`](../data/sample_file.csv). First, let's look at it, and then, let's open it using pandas.\n",
+ "\n",
+ "We notice that there is no index column, we have a header row, and the columns are separated by tabs (`\t`). Let's now read this file using pandas."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a32a5603-b99d-4938-8a14-e22dbe1ee13e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "filename = base_dir + \"data/sample_file.csv\"\n",
+ "# Your code here to read the CSV file\n",
+ "\n"
+ ]
+ },
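+ {
+ "cell_type": "markdown",
+ "id": "6f7a8b9c-0d1e-4f2a-8b3c-4d5e6f7a8b9c",
+ "metadata": {},
+ "source": [
+ "The call generally follows this pattern (the parameter values here are assumptions for a tab-separated file whose first row holds the column names):\n",
+ "\n",
+ "```python\n",
+ "my_df = pd.read_csv(filename, sep=\"\\t\", header=0)  # tab-separated; first row is the header\n",
+ "```"
+ ]
+ },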
+ {
+ "cell_type": "markdown",
+ "id": "c9fc8971-0bf2-4e53-b8eb-bee8806794eb",
+ "metadata": {},
+ "source": [
+ "#### Writing CSV files\n",
+ "\n",
+ "We can write the data to a CSV file using the [`DataFrame.to_csv`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html#pandas.DataFrame.to_csv) method.\n",
+ "\n",
+ "The important parameters we'll be setting are:\n",
+ "\n",
+ "* `path_or_buf` - this first argument determines where we'll output the data. We'll typically be passing a filename here.\n",
+ " * Fun fact: if we don't set a value, the CSV data is returned as a string!\n",
+ "* `sep` - the column separator.\n",
+ "* `header` - indicate whether to put column headers (or what to set as the headers).\n",
+ "* `index` - boolean value indicating whether to include the row names in the first column.\n",
+ "\n",
+ "Let's do an example now. We have our basic patient characteristics. Let's compute BMI as a new column, and export the new file to `patients_bmi.csv`, with commas separating the columns, row names excluded and column names included."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9a13ccca-7d30-4274-9a8f-546f20392c78",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "output_filename = \"patients_bmi.csv\"\n",
+ "# Your code here\n"
+ ]
+ },
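+ {
+ "cell_type": "markdown",
+ "id": "0c1d2e3f-4a5b-4c6d-9e7f-8a9b0c1d2e3f",
+ "metadata": {},
+ "source": [
+ "The write call mirrors the read. A sketch of the general pattern (assuming a `DataFrame` named `my_df`):\n",
+ "\n",
+ "```python\n",
+ "my_df.to_csv(output_filename, sep=\",\", index=False, header=True)  # no row names, keep column names\n",
+ "```"
+ ]
+ },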
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "f107cd63-6e54-41f2-8dc8-84ab18c616b7",
+ "metadata": {},
+ "source": [
+ "#### Notes on Excel spreadsheets\n",
+ "\n",
+ "It's also possible to process Excel spreadsheets using pandas, via functions and methods that closely mirror the CSV ones.\n",
+ "\n",
+ "You must have `openpyxl` installed. It may not be installed by default. To install this package using `conda`, you can run\n",
+ "```shell\n",
+ "conda install -c conda-forge openpyxl\n",
+ "```\n",
+ "\n",
+ "To ensure that you have all the necessary dependencies for working with Excel files, the official [documentation](https://pandas.pydata.org/docs/getting_started/install.html#installing-from-pypi) recommends the following when installing with `pip`:\n",
+ "```shell\n",
+ "pip install \"pandas[excel]\"\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "78568109-f85b-4244-aac2-405817f40032",
+ "metadata": {},
+ "source": [
+ "## Reading the pandas documentation\n",
+ "\n",
+ "This is definitely not the limit of what you can do with pandas! Many other topics are covered in the [**documentation**](https://pandas.pydata.org/docs/index.html)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e8f520a9-340d-4f7f-a978-040f0a8391f9",
+ "metadata": {},
+ "source": [
+ "### Getting Started\n",
+ "\n",
+ "The [**Getting Started**](https://pandas.pydata.org/docs/getting_started/index.html) section describes how to get started... unsurprisingly.\n",
+ "\n",
+ "But, what does that mean? It gives you information about how to install pandas, provides a basic tutorial to get you up and running and even provides external links to other tutorials.\n",
+ "\n",
+ "It also provides guides for users coming from other data analysis tools!\n",
+ "\n",
+ "If you want to start with the basics, definitely check out this section."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "062126dc-dbf4-4186-a000-fa7d09b4364b",
+ "metadata": {},
+ "source": [
+ "### User Guide\n",
+ "\n",
+ "Let's say you want a bit more depth. Check out the [**User Guide**](https://pandas.pydata.org/docs/user_guide/index.html). This section includes instructions and examples for using different aspects of pandas. The pages in the **User Guide** are very helpful, as they not only provide the code to perform a task, but also narrative explanations. If you want to learn how to do something, there is likely a page here that explains the concept.\n",
+ "\n",
+ "I strongly recommend looking at the **10 minutes to pandas** page. It covers a lot of what I've talked about, and more! It's a great way to get an overview of what pandas can do."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fa4b1341-76d8-478a-a774-2cc825aa4e01",
+ "metadata": {},
+ "source": [
+ "### API reference\n",
+ "\n",
+ "If you want to find a specific function, class or method, then the [**API reference**](https://pandas.pydata.org/docs/reference/index.html) is the place to look. This part of the documentation details every function and every parameter and every part of the code. If you want to learn more about the technical aspects of a single function, look here."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8cd5ec7e-97c4-45e0-9b41-cadf4780e96f",
+ "metadata": {},
+ "source": [
+ "### Important note on sections\n",
+ "\n",
+ "You may be wondering why all these sections exist.\n",
+ "\n",
+ "Programming is like assembling a puzzle. Each line is essentially a piece. The **API reference** helps you understand what each piece does, but the **User Guide** lets you know which piece to use when and how to fit the pieces together.\n",
+ "\n",
+ "The bottom line is that you should make use of the documentation. It's there to help you and it's written **by the project**. It's there for you to use, so make sure to get the most out of it!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "408e2269-4f6f-4190-b0b3-f23d2279e11e",
+ "metadata": {},
+ "source": [
+ "### But wait! There's also a book!\n",
+ "\n",
+ "If you want additional narrative guidance, Wes McKinney, the original creator of pandas, wrote an entire book, *Python for Data Analysis*, on how to use it. This book is available online **for free, legally** at https://wesmckinney.com/book/. It offers great hands-on examples and is very helpful."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "32498a32-60d8-4fbc-acc3-b5f4c0deb12b",
+ "metadata": {},
+ "source": [
+ "## Module Summary\n",
+ "\n",
+ "In this module, we've seen the basics of pandas. Here are the main points that we covered:\n",
+ "\n",
+ "* In pandas, data are stored in **`DataFrame`s**, which are like tables.\n",
+ "* Each column in these tables is a **`Series`**.\n",
+ "* We can easily **read** data from **CSV files** and **write** `DataFrame` objects to CSV files.\n",
+ "* To learn more about pandas, we can consult its very thorough documentation.\n",
+ "\n",
+ "With that, let's do an exercise on pandas."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f63eb34f-c48d-4575-8fbf-9b6ad0a7377b",
+ "metadata": {},
+ "source": [
+ "## Exercises\n",
+ "\n",
+ "And now, let's return to our exercises!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "858bdd72-373b-4a0e-ab22-85e2efec53b3",
+ "metadata": {},
+ "source": [
+ "### Single Nucleotide Polymorphism analysis\n",
+ "\n",
+ "For our last SNP example, let's look at some SNP properties. We've augmented the SNP data using some properties found in `snp_properties.csv`. Load this file and look at the properties contained in it.\n",
+ "\n",
+ "Then, filter the list of SNPs to extract the recessive mutations. Perform the following tasks on the data:\n",
+ "\n",
+ "1. Construct a histogram showing the distribution of recessive mutations on the different chromosomes.\n",
+ "2. Construct a pie chart showing the distribution of types of mutations (silent, synonymous, missense, nonsense, frameshift).\n",
+ "3. (**Bonus**) Extract the indices of silent recessive mutations and construct the histogram of copy number for those specific SNPs within the population."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a6b904f5-67bb-4add-9123-cad5cce068ae",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "filename = base_dir + \"data/snp_properties.csv\"\n",
+ "\n",
+ "# Your code here to analyse SNP properties\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3a5a502d-6306-461e-a785-ecb59397be7b",
+ "metadata": {},
+ "source": [
+ "### Amino acid sequence analysis\n",
+ "\n",
+ "Let's go back into the world of proteins. In a previous exercise, we wrote code to determine the number of amino acids with different properties (polar, non-polar, acidic, basic) in various amino acid sequences. But before, we stored the results in an array.\n",
+ "\n",
+ "Now, let's organise and process the results using a `DataFrame`. Create a `DataFrame` where each amino acid sequence has a row. The `DataFrame` should have columns for:\n",
+ "\n",
+ "* Sequence (optional)\n",
+ "* Total number of amino acids\n",
+ "* Number of polar amino acids\n",
+ "* Percent polar (in decimal)\n",
+ "\n",
+ "After creating this table, export it to a CSV file called `amino_acid_properties.csv`. Use tabs to separate the columns."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "425ba459-5825-45e0-86e7-672ae919b684",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "filename = base_dir + \"data/HUMAN_HEALTHY.fasta\"\n",
+ "\n",
+ "sequences = load_sequences(filename)\n",
+ "\n",
+ "# Your code here\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8d25622d-0515-4199-8ec5-3b8f818aa281",
+ "metadata": {},
+ "source": [
+ "# Module 5 - A Brief Guide to Exploring the Unknown\n",
+ "\n",
+ "Congratulations! You've now reached the end of this **Data Processing in Python** workshop. In this workshop, we've seen the following big ideas:\n",
+ "\n",
+ "* How to import **modules** and install **packages** to include code written by others.\n",
+ "* How to use **NumPy** to perform operations on arrays.\n",
+ "* How to use **Matplotlib** to visualise data.\n",
+ "* How to use **pandas** to perform basic data processing.\n",
+ "\n",
+ "By now, you've seen a fair amount of Python, and you're well on your way to writing successful code."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3a15f423-b855-4d80-86dc-693326d0b10e",
+ "metadata": {},
+ "source": [
+ "## What to Learn Next... and How?\n",
+ "\n",
+ "We've seen a lot. But, the learning is never done! There are many more topics that you can explore, and with the tools you now have, it should be fairly easy to pick them up. Here are some topics that are definitely worth looking into:\n",
+ "\n",
+ "### Using SciPy for scientific computing\n",
+ "\n",
+ "We've seen NumPy, Matplotlib and pandas, so [SciPy](https://scipy.org/) is a natural next step. SciPy provides more advanced scientific operations, like signal processing, spatial operations and mathematical optimisation. Its docs are structured much like NumPy's, and NumPy arrays underpin just about all of SciPy, so it shouldn't be too hard to jump right in.\n",
+ "\n",
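+ "For a quick taste (assuming SciPy is installed), here's a tiny optimisation sketch using `scipy.optimize.minimize_scalar`, which minimises a function of one variable:\n",
+ "\n",
+ "```python\n",
+ "from scipy.optimize import minimize_scalar\n",
+ "\n",
+ "# Minimise (x - 2)^2; the minimum is at x = 2.\n",
+ "result = minimize_scalar(lambda x: (x - 2) ** 2)\n",
+ "print(result.x)  # approximately 2\n",
+ "```\n",
+ "\n",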
+ "### Using Polars\n",
+ "\n",
+ "[Polars](https://pola.rs/) is a newer `DataFrame` library that positions itself as a faster alternative to pandas.\n",
+ "\n",
+ "### Image processing with Scikit-Image\n",
+ "\n",
+ "Much life science work relies on processing images. [Scikit-image](https://scikit-image.org/) provides many functions for processing images and getting insight.\n",
+ "\n",
+ "### Basic machine learning with Scikit-Learn\n",
+ "\n",
+ "Want to get started with advanced statistics and machine learning in Python? Check out [Scikit-learn](https://scikit-learn.org/stable/). This package provides tools for clustering, dimensionality reduction and much more!"
+ ]
+ },
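+ {
+ "cell_type": "markdown",
+ "id": "9e1f3a5b-7c2d-4d8e-8a4b-6c0d2e4f6a8b",
+ "metadata": {},
+ "source": [
+ "To get a feel for scikit-learn's style (assuming it's installed), here's a tiny clustering sketch:\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "from sklearn.cluster import KMeans\n",
+ "\n",
+ "# Two obvious groups of 2D points.\n",
+ "points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])\n",
+ "\n",
+ "kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)\n",
+ "print(kmeans.labels_)  # the first two points share one label, the last two the other\n",
+ "```"
+ ]
+ },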
+ {
+ "cell_type": "markdown",
+ "id": "410bf02a-7e77-4b8e-b8bc-f9f019a941c1",
+ "metadata": {},
+ "source": [
+ "## How to get help... and how not to get help\n",
+ "\n",
+ "But, of course, in the software development process, you'll inevitably run into bugs (if you never do, double-check that your code is actually being run and tested). There will be times when your code won't work. It happens to everyone. So, how can you get help when you need it? Here are some important resources that may (or may not) be of use (adapted from several of my previous workshops):\n",
+ "\n",
+ "### Your code editor\n",
+ "\n",
+ "Think about it... when you're writing code, you're using a piece of software that is designed **specifically for one purpose**: to help you write code. Yes! That's right! Your IDE can suggest code completions, tell you when there are errors and even help you reformat your files and restructure your code.\n",
+ "\n",
+ "So, please, please, please, **DO NOT** write your code in a simple text editor that has no additional features. There are **many** IDEs out there that have Python support, including:\n",
+ "\n",
+ "* [PyCharm](https://www.jetbrains.com/pycharm/)\n",
+ "* [Microsoft Visual Studio Code](https://code.visualstudio.com/)\n",
+ "* [Spyder](https://www.spyder-ide.org/)\n",
+ "* [Zed](https://zed.dev/)\n",
+ "\n",
+ "All of these are either completely free or have a free version with most of the functionality. And ***PLEASE*** don't use word processing software to write code. Use software that is made for coding!\n",
+ "\n",
+ "### Documentation\n",
+ "\n",
+ "Big projects have big, well-maintained documentation. Take a look at their guides for getting started. For example, [pandas](https://pandas.pydata.org/) has a [10 minutes to pandas](https://pandas.pydata.org/docs/user_guide/10min.html) tutorial. Use these resources! If you want to learn how to use a function, **look it up** and read its documentation page. The docs will tell you how to use the arguments, as well as any quirks to expect. In some cases, the authors have even included references to the papers behind the function. This is especially true in image processing and other fields that rely heavily on algorithms. So, the documentation will tell you not only how to use the code, but also **where it comes from**. And make sure to check out the official Python docs at https://docs.python.org/3/.\n",
+ "\n",
+ "### Books\n",
+ "\n",
+ "Books, books, books! There are tons and tons of books out there! For example, here are several general books that are free online:\n",
+ "* *Think Python 2e* by Allen B. Downey (FREE book): https://greenteapress.com/wp/think-python-2e/\n",
+ "* *Data Structures and Information Retrieval in Python* also by Allen B. Downey (FREE book): https://greenteapress.com/wp/data-structures-and-information-retrieval-in-python/\n",
+ "* *Introduction to Python Programming* by Udayan Das et al., published by OpenStax: https://openstax.org/details/books/introduction-python-programming\n",
+ "* *The Hitchhiker's Guide to Python* by Kenneth Reitz and Tanya Schlusser: https://docs.python-guide.org/\n",
+ "\n",
+ "There are also books online about more specialised topics, such as:\n",
+ "\n",
+ "* Package development: *Python Packages* by Tomas Beuzen and Tiffany Timbers -- https://py-pkgs.org/\n",
+ "* Data science:\n",
+ " * *Python for Data Analysis, 3E* by Wes McKinney -- https://wesmckinney.com/book/\n",
+ " * *Python Data Science Handbook* by Jake VanderPlas -- https://jakevdp.github.io/PythonDataScienceHandbook/\n",
+ "\n",
+ "Another book, which covers software development for research more generally, with more emphasis on the tools used, is:\n",
+ "\n",
+ "* *Research Software Engineering with Python* by Damien Irving et al.: https://third-bit.com/py-rse/index.html\n",
+ "\n",
+ "Through the databases at the McGill Library, we also have access to lots of books **for free**. Check out the library's online catalogue to see more.\n",
+ "\n",
+ "### Tutorials\n",
+ "\n",
+ "Tutorials are also great! And they're abundant! From more formal ones on sites like [freeCodeCamp](https://www.freecodecamp.org/) and [W3Schools](https://www.w3schools.com/python/default.asp) to less formal ones on [DEV](https://dev.to/), you can get lots of insight from these. There are also lots posted on Medium that you can check out. In addition to text-based tutorials, there are also videos on YouTube. And don't forget the official tutorials in the documentation! Tutorials are a very valuable resource that can help you see how to put pieces of code together in real-world examples.\n",
+ "\n",
+ "**Want to try some bioinformatics examples?** Check out the Rosalind platform.\n",
+ "\n",
+ "### Stack Overflow (and pitfalls)\n",
+ "\n",
+ "If you have a Python question, chances are that someone, somewhere has asked it on [Stack Overflow](https://stackoverflow.com/). Stack Overflow is a **great** resource for finding answers to real questions about programming. **But** make sure that you're using it properly. Try the other resources **before** going to Stack Overflow. The answer may turn out to be on the documentation page for the function you're using. If there's a link to the docs in a Stack Overflow answer, **use it**. Check out the answers in more detail. Make sure that you understand the code that you're about to add to your project and **don't just copy-paste** it. Coding is a thinking game. Make sure that you have thought about all the code that you're putting in and that you understand why it's there. And use your judgement and intuition when borrowing that code. If it looks sketchy, it could very well be sketchy and there may be a better way.\n",
+ "\n",
+ "In other words: Documentation **first**, documentation **last**, documentation **always**.\n",
+ "\n",
+ "### ChatGPT (and pitfalls)\n",
+ "\n",
+ "Everything I said above about Stack Overflow applies here, too. And more. Answers on Stack Overflow are written by humans who have written the code, tested it, and run the results. **Be careful** when using ChatGPT for code (if you're allowed to use it at all). Make extra sure that it makes sense, and test it. Don't just trust it because AI wrote it for you. After all, then you might wind up putting [glue on your pizza](https://www.theverge.com/2024/5/23/24162896/google-ai-overview-hallucinations-glue-in-pizza).\n",
+ "\n",
+ "You need to make extra sure that it actually makes sense and runs properly, because you don't have the same guarantee that a human has actually run this exact code themselves. Use your coding judgement and intuition.\n",
+ "\n",
+ "### Concluding help remarks...\n",
+ "\n",
+ "Again, ALWAYS remember to **read the documentation**. Often, if you're stuck, the answer is **right there**. If it's not, then it's probably on Stack Overflow. It's often a good idea to check the documentation **first** to see if there's an official explanation or an official example. And don't just copy a Stack Overflow answer or sample code. Think about what the code is doing. Does it make sense? Is there a better way? Try to look line by line to understand what is going on (play around in the IPython interpreter or in a Jupyter notebook!).\n",
+ "\n",
+ "## Other cool programming topics\n",
+ "\n",
+ "Aside from the packages that I discussed above, there are other cool topics that you should definitely take a look at! These will help you write code that runs better, is easier to update and is easier to share.\n",
+ "\n",
+ "### Writing packages\n",
+ "\n",
+ "We've seen how to install and use packages. But, you can also **write your own packages**. There are many great resources online about writing packages. The one that I most recommend is [this free online book](https://py-pkgs.org/): *Python Packages* by Tomas Beuzen and Tiffany Timbers. It's an easy read and helps you learn not only how to organise your code, but how to publish it, too. The authors also walk through how to render your own nice-looking documentation and host it online.\n",
+ "\n",
+ "### Object-oriented and functional programming\n",
+ "\n",
+ "We've come across a bunch of different data structures and types that can be useful for storing and organising data. Well, with Python you can easily define your own new types using **classes**. A **class** is a template that is used to define new objects. In addition to working with classes and objects, Python also offers tools for doing more with functions, in the realm of **functional programming**.\n",
+ "\n",
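+ "To get a flavour of what a class looks like, here is a minimal example (the class and attribute names are just for illustration):\n",
+ "\n",
+ "```python\n",
+ "class Sequence:\n",
+ "    \"\"\"A tiny class bundling a name with an amino acid sequence.\"\"\"\n",
+ "\n",
+ "    def __init__(self, name, residues):\n",
+ "        self.name = name\n",
+ "        self.residues = residues\n",
+ "\n",
+ "    def length(self):\n",
+ "        return len(self.residues)\n",
+ "\n",
+ "seq = Sequence(\"example\", \"MKVLG\")\n",
+ "print(seq.length())  # prints 5\n",
+ "```\n",
+ "\n",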
+ "### Developing graphical user interfaces\n",
+ "\n",
+ "Jupyter notebooks and command line scripts are powerful, but they aren't accessible for people who don't know how to code. Solution: build a graphical user interface! Using PyQt, the process is quite straightforward. Check out [this online tutorial series](https://www.pythonguis.com/) by Martin Fitzpatrick to learn about developing GUIs in Python. It has been a great help to me in my own research.\n",
+ "\n",
+ "### Hosting projects on GitHub\n",
+ "\n",
+ "What fun is a project if other people can't use it? By hosting your project on GitHub, you let others easily contribute to your project and build on it. Learning Git and GitHub is essential! And so are a few other skills along the way, like writing documents in Markdown. QLS-MiCM often has Git and GitHub workshops, so check out their workshop schedule!\n",
+ "\n",
+ "## Conclusion\n",
+ "We've reached the end of this workshop. You now have most of the skills you need to use existing software packages for research tasks and to learn additional packages not covered here.\n",
+ "\n",
+ "If you have any questions, please reach out!\n",
+ "\n",
+ "```python\n",
+ "print(\"Goodbye!\")\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "83996f5d-750b-457e-bc71-54082969d18a",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.13.3"
+ },
+ "widgets": {
+ "application/vnd.jupyter.widget-state+json": {
+ "state": {},
+ "version_major": 2,
+ "version_minor": 0
+ }
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/Exercises/solutions/DataProcessingPython.ipynb b/Exercises/solutions/DataProcessingPython.ipynb
index 85d2b90..5d8a0cb 100644
--- a/Exercises/solutions/DataProcessingPython.ipynb
+++ b/Exercises/solutions/DataProcessingPython.ipynb
@@ -7,7 +7,7 @@
"source": [
"# Data Processing in Python\n",
"\n",
- "QLS-MiCM Workshop - November 18, 2025\n",
+ "QLS-MiCM Workshop - March 11, 2026\n",
"\n",
"Benjamin Z. Rudski, PhD Candidate, Quantitative Life Sciences, McGill University\n",
"\n",
@@ -26,7 +26,7 @@
},
{
"cell_type": "code",
- "execution_count": 1,
+ "execution_count": null,
"id": "13862d53-fc7c-4214-b2fb-6544bcedb558",
"metadata": {
"execution": {
@@ -39,12 +39,12 @@
},
"outputs": [],
"source": [
- "using_colab = False\n",
+ "using_colab = True\n",
"\n",
"if using_colab:\n",
" !wget https://github.com/QLS-MiCM/DataProcessingInPython/archive/refs/heads/main.zip\n",
" !unzip main.zip\n",
- " base_dir = \"Data-Processing-in-Python-main/Exercises/\"\n",
+ " base_dir = \"DataProcessingInPython-main/Exercises/\"\n",
"else:\n",
" base_dir = \"../\""
]
@@ -629,7 +629,7 @@
"\n",
"An example is [CuPy](https://cupy.dev/), which allows performing NumPy and SciPy operations on the GPU. This package **does not** work on all systems. It requires an NVIDIA GPU and CUDA, which is not available on macOS. In these cases, it's very important to read the [installation instructions](https://docs.cupy.dev/en/stable/install.html).\n",
"\n",
- "If you have both `conda` and `pip` installed, the `conda` [documentation](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) recommends trying to install packages with `conda` first. You can easily search on https://anaconda.org to see if the package is available. Installing packages with `conda` makes it easier to manage multiple *environments* (which we'll discuss soon)."
+ "If you have both `conda` and `pip` installed, the `conda` [documentation](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) recommends trying to install packages with `conda` first. You can easily search on https://anaconda.org to see if the package is available. Installing packages with `conda` makes it easier to manage multiple *environments*."
]
},
{
@@ -642,7 +642,7 @@
"\n",
"We've seen what packages are and how to install them, but now how do we use them?\n",
"\n",
- "To use a package, we have to import it, just like we import a module. Since we use a lot of functions from a package, we often it a shorter name when we import it. Here's the syntax for doing this:\n",
+ "To use a package, we have to import it, just like we import a module. Since we use a lot of functions from a package, we often give it a shorter name when we import it. Here's the syntax for doing this:\n",
"```python\n",
"import package_name as short_name\n",
"```\n",
diff --git a/Outline/MiCM - Data Processing in Python Workshop Outline.docx b/Outline/MiCM - Data Processing in Python Workshop Outline.docx
new file mode 100644
index 0000000..f2c06db
Binary files /dev/null and b/Outline/MiCM - Data Processing in Python Workshop Outline.docx differ
diff --git a/Outline/MiCM - Data Processing in Python Workshop Outline.pdf b/Outline/MiCM - Data Processing in Python Workshop Outline.pdf
new file mode 100644
index 0000000..11e820b
Binary files /dev/null and b/Outline/MiCM - Data Processing in Python Workshop Outline.pdf differ
diff --git a/README.md b/README.md
index a5bf807..4864cd6 100755
--- a/README.md
+++ b/README.md
@@ -1,7 +1,8 @@
-# Data Processing in Python (Part 2)
+# Data Processing in Python
**Click one of these:**
[](https://colab.research.google.com/github/QLS-MiCM/DataProcessingInPython/blob/main/Exercises/scripts/DataProcessingPython.ipynb)
+[](https://colab.research.google.com/github/QLS-MiCM/DataProcessingInPython/blob/main/Exercises/scripts/DataProcessingPythonCompact.ipynb)
[](https://colab.research.google.com/github/QLS-MiCM/DataProcessingInPython/blob/main/Exercises/solutions/DataProcessingPython.ipynb)
## Overview
@@ -18,20 +19,15 @@ By the end of this workshop, you should be able to:
4. Use pandas to represent data stored in tables.
5. Approach a new package and explore its documentation and examples.
-## Prerequisites
+## Requirements
* Basic knowledge of Python is required.
-* Attendees must be comfortable using variables for simple data types,
- as well as collections. Attendees should also be comfortable with
- loops and control flow and be familiar with the basics of using
- functions in Python.
+* Attendees must be comfortable using variables for simple data types, as well as collections. Attendees should also be comfortable with loops and control flow and be familiar with the basics of using functions in Python.
* To be able to participate in the exercises, participants must either:
- * Have a local installation of Python and Jupyter notebooks.
- Microsoft Visual Studio Code with the Python extension installed
- can also be used to run the Notebook.
- * Have a Google Account (to run in-browser as a Colab notebook)
+ * **(Preferred)** Have a Google Account to run in-browser as a Colab notebook
+ * Have a local installation of Python and software to edit Jupyter notebooks (e.g., Jupyter Lab, Microsoft Visual Studio Code, PyCharm)
-## Setup Information
+## Software
This workshop is intended to be interactive. Before the workshop, please download the materials from this repository. You can download the material as a ZIP file using the green button higher up on this page, or you can simply clone this repository by typing the following in a terminal:
@@ -39,70 +35,27 @@ This workshop is intended to be interactive. Before the workshop, please downloa
git clone https://github.com/QLS-MiCM/DataProcessingInPython.git
```
-### Requirements
-
-To take full advantage of this interactive workshop, you must have access to a Python environment and Jupyter Lab.
-
-You must also install the following packages:
+In your Python environment, you must have the following packages installed:
* NumPy
* Matplotlib
* pandas
-#### Local
-
-The required steps depend on how you installed Python:
-
-* **(Recommended)** If you installed **minconda**, you can easily install all these packages by running the following on the command line:
-
-```shell
-conda install -c conda-forge jupyterlab numpy matplotlib pandas -y
-```
-
-* If you installed Python from the official website, you can easily install Jupyter using `pip` by running the following on the command line:
-
-```shell
-pip install jupyterlab numpy matplotlib pandas
-```
-
-* If you installed **Anaconda**, you already have everything you need installed.
-
-For more details on installing Jupyter Lab, see .
-
-Once you have Jupyter installed, open the `Data-Processing-in-Python` folder on your computer and launch Jupyter Lab by typing:
-
-```shell
-jupyter lab
-```
-
-Then you can open the Jupyter notebook files in the `Exercises/scripts` and `Exercises/solutions` folders.
-
-#### Cloud
+## Links to Colab
If you don't want to install anything locally, you can open the workshop materials using Google Colab:
* Student version (with blank fields):
+* Compact student version (with blank fields and shorter explanations):
* Solution version (filled out):
-> ⚠ **Warning:** To configure for Google Colab, make sure to set `using_colab = True` in the first code cell and run that cell to download all the data files.
-
-## Outline
-
-*For a more detailed outline, see [Outline/Outline.md](Outline/Outline.md).*
-
-1. **Module 1 -- Modules and Packages**
-2. **Module 2 -- Introduction to NumPy Arrays**
-3. **Module 3 -- Visualising Data with Matplotlib**
-4. **Module 4 -- Intro to Tabular Data with Pandas**
-5. **Module 5 -- A Brief Guide to Exploring the Unknown**
+> ⚠ **Warning:** Make sure that `using_colab = True` in the first code cell and run that cell to download all the data files required for this workshop.
## References
-In developing this workshop, I largely relied on the documentation of the various projects discussed, including NumPy, Matplotlib, pandas, conda and pip, as well as the official Python documentation. I've provided links to these projects in the interactive Jupyter notebook. I've also referenced a few useful other tutorials throughout the notebook.
-
-This workshop would also not have been possible without the professors and others who helped me on my Python journey.
+This workshop material relies heavily on the documentation of the various projects discussed, including NumPy, Matplotlib, pandas, conda and pip, as well as the official Python documentation. Links to relevant documentation pages are provided throughout the Jupyter notebook. There are also references to a few other useful tutorials.
-This workshop is based on my previous iterations of this workshop (as **Intermediate Python**) and my **Intro to Python** workshop, which can be found at the following repositories:
+This workshop is based on previous iterations of this workshop (as **Intermediate Python**) and the **Intro to Python** workshop, which can be found at the following repositories:
* Intro to Python:
* [Winter 2025](https://github.com/bzrudski/Intro-to-Python)
diff --git a/Slides/QLS-MiCM_DataProcessingInPython.pdf b/Slides/QLS-MiCM_DataProcessingInPython.pdf
index 91c9aba..2881dd3 100644
Binary files a/Slides/QLS-MiCM_DataProcessingInPython.pdf and b/Slides/QLS-MiCM_DataProcessingInPython.pdf differ
diff --git a/Slides/QLS-MiCM_DataProcessingInPython.pptx b/Slides/QLS-MiCM_DataProcessingInPython.pptx
index cdb2445..b2555dc 100644
Binary files a/Slides/QLS-MiCM_DataProcessingInPython.pptx and b/Slides/QLS-MiCM_DataProcessingInPython.pptx differ