This is the repository for the Master of Applied Data Science course "Data Analysis & Visualisation", previously known as "Data Mining & Exploration". All instructions assume a UNIX machine (Linux or OS X). You should have received an invite link for a Linux VM; if not, contact your teacher. On the VM, everything you need (including uv) is already installed.
The manual for setting up the VM can be found in the references folder, in addition to a git_crash_course to help you work with git. Make sure to read both documents before you start asking questions about either.
uv is a modern dependency manager for Python, and it is quickly being adopted by the industry.
Besides being very fast, it also has a robust development team behind it.
- Make sure you have `uv` installed. You can check this by typing `which uv` in your bash terminal. If that doesn't return a location but `uv not found`, you need to install it: on Unix systems, you can use `curl -LsSf https://astral.sh/uv/install.sh | sh`; for Windows, read the uv documentation.
- Check if the `MADS-DAV` folder is already cloned to the machine. If not, clone it yourself with the command `git clone https://github.com/raoulg/MADS-DAV.git`
- Navigate to the MADS-DAV folder where the `pyproject.toml` is located with `cd MADS-DAV` and run `uv sync --all-extras`. This should create a virtual environment (a `.venv` folder) and install all dependencies there.
- Read through the uv docs "working on projects" intro into `uv`
If for some reason you are unable to install uv (e.g. because you have a company laptop with restrictions on what you can install), you can probably still create your .venv with base Python and pip. Skip these steps if you already installed with uv.
- Open a bash terminal in the folder where you cloned the repo
- Create a `.venv` with `python -m venv .venv`
- Activate the `.venv` and run `pip install -e .`. If you are new to `.venv`s and `pyproject.toml` files, or don't know how to activate a `.venv`, you can read all the details in the codestyle repo
Copy the config.example.toml file to a config.toml file. Update the contents after running the preprocessor.
Download a chat from WhatsApp and put it in the data/raw folder. Rename the file to _chat.txt (or change the config.toml file).
If you want to use my code in your own repo, do not copy-paste everything. Instead, install it as a package; I published it on PyPI, so you can simply do uv add wa-analyzer (or pip install wa-analyzer).
This preprocessor uses the datetime module to convert strings with a date and/or time into datetime objects. The preprocessor needs to know the formatting of the timestamps in your _chat.txt file. Therefore, you might need to update the datetime_format variable in the config.toml file accordingly. You can find the available format codes in the documentation of the datetime module.
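As a minimal sketch of what the datetime_format setting controls, the snippet below parses a timestamp string with `datetime.strptime`. The sample timestamp and both format strings are illustrative assumptions, not values taken from the preprocessor; check the first lines of your own _chat.txt and adjust the format codes to match.

```python
from datetime import datetime

# Hypothetical timestamp as it might appear in an exported chat;
# your own export may use a different order or different separators.
sample = "11-02-2024 16:07:19"

# Format codes: %d day, %m month, %Y four-digit year, %H:%M:%S 24-hour time.
datetime_format = "%d-%m-%Y %H:%M:%S"

parsed = datetime.strptime(sample, datetime_format)
print(parsed.isoformat())

# A format that does not match the string raises ValueError,
# which gives a quick way to test a candidate format.
try:
    datetime.strptime(sample, "%m/%d/%y, %H:%M")
    matches = True
except ValueError:
    matches = False
print(matches)
```

If the candidate format raises a ValueError on a line from your own file, compare the separators and the order of day, month and year first; those are the parts most likely to differ between exports.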
Now you can run the following command (on UNIX systems like Linux or OS X) to activate the virtual environment you created in the previous step:

    source .venv/bin/activate

You can check which Python is being used by running:

    which python

This should now return a path that includes the .venv folder in your project.
After this, you can run the preprocessor with the following command:

    analyzer --device ios

Change ios to android if you have an Android device.
This will run the src/wa_analyzer.py:main method, which will process the chat and save the results in the data/processed folder.
You should see some logs, like this:
2024-02-11 16:07:19.191 | INFO | __main__:main:71 - Using iOS regexes
2024-02-11 16:07:19.201 | INFO | __main__:process:61 - Found 1779 records
2024-02-11 16:07:19.201 | INFO | __main__:process:62 - Appended 152 records
2024-02-11 16:07:19.202 | INFO | __main__:save:30 - Writing to data/processed/whatsapp-20240211-160719.csv
2024-02-11 16:07:19.206 | SUCCESS | __main__:save:32 - Done!
Inside the log folder you will find a logfile, which has some additional information that might be useful for debugging.
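When debugging, it can help to split log lines into fields so you can filter on level or location. The sketch below does this for two of the console lines shown above. It assumes, without confirmation, that the logfile uses the same `timestamp | LEVEL | location - message` layout; `parse_line` is a hypothetical helper, not part of the package.

```python
# Sample lines copied from the console output shown in this README.
lines = [
    "2024-02-11 16:07:19.191 | INFO | __main__:main:71 - Using iOS regexes",
    "2024-02-11 16:07:19.206 | SUCCESS | __main__:save:32 - Done!",
]


def parse_line(line: str) -> dict:
    """Split 'timestamp | LEVEL | location - message' into named fields."""
    # Split on the first two pipes only, so a pipe inside the message survives.
    timestamp, level, rest = (part.strip() for part in line.split("|", 2))
    location, message = rest.split(" - ", 1)
    return {
        "timestamp": timestamp,
        "level": level,
        "location": location,
        "message": message,
    }


records = [parse_line(line) for line in lines]

# Example filter: keep everything that is not a plain INFO message.
not_info = [r for r in records if r["level"] != "INFO"]
for record in not_info:
    print(record["level"], record["location"], record["message"])
```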
After this, set the inputpath key in the config.toml file to the name of the .csv file that was saved.
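If you have run the preprocessor several times, the timestamp in the filename tells you which file is the most recent. The sketch below picks the newest file based on the `whatsapp-YYYYMMDD-HHMMSS.csv` pattern visible in the log output. `newest_csv` is a hypothetical helper, and the demonstration uses a temporary folder with empty files instead of your real data/processed folder.

```python
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Optional


def newest_csv(folder: Path) -> Optional[Path]:
    """Return the whatsapp-*.csv file with the latest timestamp in its name."""

    def stamp(path: Path) -> datetime:
        # "whatsapp-20240211-160719" -> "20240211-160719"
        return datetime.strptime(path.stem[len("whatsapp-"):], "%Y%m%d-%H%M%S")

    candidates = list(folder.glob("whatsapp-*.csv"))
    return max(candidates, key=stamp) if candidates else None


# Demonstration on a throwaway folder; in the project you would pass
# Path("data/processed") instead.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("whatsapp-20240211-160719.csv", "whatsapp-20240105-091500.csv"):
        (Path(tmp) / name).touch()
    latest = newest_csv(Path(tmp))
    latest_name = latest.name if latest else None

print(latest_name)
```

A file that matches the glob but carries a differently formatted timestamp raises a ValueError here, which is deliberate: it is safer to notice an unexpected filename than to silently pick the wrong file.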
You can then run the 01-cleaning.ipynb notebook. This will save a cleaned .parq file. Put the name of that file after the current key in the config.toml file.
This config.toml file should make it easier to run the code with multiple input files; you can simply change the current value and run all notebooks for the file specified there.
During the course, you will continue to improve your coding skills. Use the codestyle repo as a reference!