This repository is the code associated with the WAF manuscript titled: "A Machine Learning Tutorial for Operational Meteorology, Part I: Traditional Machine Learning" written by Chase, R. J., Harrison, D. R., Burke, A., Lackmann, G. and McGovern, A. Find the paper here and provide any comments via email to the corresponding author. The SSEC-hosted branch of this repository has been edited for use in a 2026 AI/ML short course, and any problems encountered here are not the responsibility of the original creators with AI2ES. If you have any issues with the code (bugs or other questions) please leave an issue associated with this repo.
This first paper and repo (of two) covers the traditional supervised machine learning methods (e.g., the sklearn models; if you don't know what that phrase even means, that's OK! Check out Section 2 in the paper). We decided to start off with the original machine learning methods before jumping into the more advanced techniques. Part two of this paper series digs into neural networks and deep learning. That paper is under review now, but can be read here. The code for the part 2 paper can be found here.
The number of meteorological journal articles mentioning or using machine learning is growing rapidly (see figure above or Figure 1 in the paper; data are derived from Clarivate Web of Science). Because this growth is so rapid and formal instruction on machine learning topics tailored to meteorologists is scarce, this manuscript and code repository were created. The goal is to familiarize meteorologists with the tools of machine learning and accelerate the use of machine learning in meteorological workflows. To accomplish these goals, it is imperative that code and a sandbox for readers to play around with exist.
Beyond just discussing the machine learning topics in an abstract way, we decided to show an end-to-end example of the machine learning pipeline using the Storm EVent ImagRy (SEVIR) dataset.
SEVIR consists of over 10,000 matched storm events measured by satellite (i.e., GOES-16) and radar (i.e., NEXRAD) images. The specific variables are: red channel visible reflectance, mid-tropospheric water vapor channel brightness temperatures, clean infrared channel brightness temperatures, retrieved vertically integrated liquid, and Geostationary Lightning Mapper (GLM) measured lightning flashes. The SEVIR dataset GitHub repo can be found here and a helpful notebook tutorial can be found here. We thank the authors of SEVIR (Mark S. Veillette, Siddharth Samsi and Christopher J. Mattioli) for their efforts in creating a high-quality, open source meteorological dataset primed for machine learning. This dataset will be the centerpiece for both this paper and the next paper in the series.
There are two main ways to interact with the code here.
This is the recommended and quickest way to get started, and it only requires a (free) Google account. Google Colab is a cloud instance of Python that is run from your favorite web browser (although it works best in Chrome). If you wish to use these notebooks, navigate to the directory named colab_notebooks.
Once in that directory, select the notebook you would like to run. There will be a button that looks like this once it loads:
Click that button and it will take you to Google Colab, where you can run the notebook. Please note that Colab does not save your changes by default, so if you would like to keep your own copy, you will need to go to File -> Save a copy in Drive.
Google Colab is awesome for those who do not know how to install Python, or who just don't have the RAM/HDD locally to do things. You can think of it this way: the notebook just provides the instructions (i.e., code) to do what you want it to. Meanwhile, the data and the physical computer are on some Google machine somewhere, which will execute the code in the notebook. By default this Google-owned machine will have 12 GB of RAM and about 100 GB of HDD (i.e., storage).
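If you are curious what the machine behind your notebook actually looks like, here is a minimal sketch you could paste into a notebook cell to check. It is not part of the original notebooks: the function name `machine_resources` is made up for illustration, it uses only the Python standard library, and it assumes a Linux machine (which Colab instances are). The numbers you see may differ from the defaults quoted above, since Google can change what it allocates.

```python
import os
import shutil


def machine_resources(path="/"):
    """Return total RAM and disk size (in GB) of the machine running this code.

    Hypothetical helper for illustration; assumes a Linux/Unix system.
    """
    # Total physical memory = number of memory pages * bytes per page.
    ram_bytes = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    # Total size of the filesystem that holds `path`.
    disk_bytes = shutil.disk_usage(path).total
    gb = 1024 ** 3
    return {"ram_gb": ram_bytes / gb, "disk_gb": disk_bytes / gb}


resources = machine_resources()
print(f"RAM:  {resources['ram_gb']:.1f} GB")
print(f"Disk: {resources['disk_gb']:.1f} GB")
```

Run on your own laptop, the same cell reports your local hardware instead, which is a quick way to see that the notebook is only the instructions and the machine executing them is wherever the notebook is connected.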


