Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view

Large diffs are not rendered by default.

Binary file not shown.
9 changes: 9 additions & 0 deletions Groupwork (Group 5)/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
# Collecting data from r/Feminism: A Reddit webscraper for research in the Digital Humanities

This repository contains a group project we created for research in the field of Digital Humanities. We wanted to explore the different ways in which scholars of the field could actually collect various forms of data for their research purposes. In order to do so, we came up with a brief project which helps us to better understand the process of data collection. Moreover, this project gave us the opportunity to create a guide for further research.

We decided to explore the social media platform Reddit. On here we focused on a topic analysis of r/Feminism. In order to gain our data we needed to make our own scraper. The guide in this repository contains the step-by-step process of how we programmed the scraper. For this we used PRAW: Python Reddit Api Wrapper. This is a great and user-friendly way to scrape data from Reddit. We are aware that there are multiple ways of scraping Reddit. For example, using the Pushshift Api. These are also great tools. However, we found that using PRAW is more consistent in the data gathering and the servers it connects to.

The workbook in this repository allows researchers to make their own Reddit scraper. It takes you through the whole process again but this time with the subreddit of your own choosing. This file contains various example assignments you can do. Still, it also allows you to use it for your own desire. The code in the guide is reusable and with a view adjustments suitable for many projects.

The report in this repository contains the foundations of our own research project which guided us for the data collection. In addition, it contains the documentation of the file in which we created our own scraper in the form of a tutorial. In order to explore our own r/Feminism dataset we created a data visualization on flourish. This visualization is described in our Tutorial section but is also visible with a link in this repository. The visualization serves as a way in which we wanted to give a brief example of the ways in which the r/Feminism dataset could be used. Moreover, in the tutorial we followed up on this by giving a view recommendations for further, more in-depth, data analysis of our dataset. The workbook which serves as active learning exercise is also explained more in-depth in the report. Finally, this repository contains a Data Management Plan (DMP). In here we elaborated on the ways in which our dataset is considered FAIR. Moreover, we justify our methods, tools and research more in-depth in the DMP.
Binary file not shown.
358 changes: 358 additions & 0 deletions Groupwork (Group 5)/Workbook- Making your own scraper.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,358 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "c766bdfa",
"metadata": {},
"source": [
"# Workbook: Making your own Reddit scraper with PRAW"
]
},
{
"cell_type": "markdown",
"id": "c87b65f9",
"metadata": {},
"source": [
"In this file we will go step-by-step through the whole process of making a Reddit web scraper. If you want to create a dataset of any subreddit you like, you can just simply fill in the empty code spaces. In this way, any time you need a webscraper for Reddit you can just come back to this file and fill everything in. Moreover, we will guide you into cleaning and expecting the dataset but you can always skip this part if you think it is not needed (although we highly recommend do to so in order to get a better understanding of your dataset). \n",
"\n",
"If anything is unclear you can look at the other file in our github reposity where we created the dataset of r/Feminism. Moreover, you can always look at the tutorial in our report. "
]
},
{
"cell_type": "markdown",
"id": "ea57770c",
"metadata": {},
"source": [
"## 1. Installing PRAW"
]
},
{
"cell_type": "markdown",
"id": "4d6b33de",
"metadata": {},
"source": [
"First make sure you downloaded PRAW to your computer: "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d4d48ff0",
"metadata": {},
"outputs": [],
"source": [
"!pip install praw"
]
},
{
"cell_type": "markdown",
"id": "6214d481",
"metadata": {},
"source": [
"## 2. Importing the PRAW and pandas libraries"
]
},
{
"cell_type": "markdown",
"id": "c30629d7",
"metadata": {},
"source": [
"Now we have to import praw and pandas to build the scraper and analyze the dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b78a3bc2",
"metadata": {},
"outputs": [],
"source": [
"import praw\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"id": "c3ed37ca",
"metadata": {},
"source": [
"## 3. Creating a Reddit App and connecting to the subreddit"
]
},
{
"cell_type": "markdown",
"id": "26642b8f",
"metadata": {},
"source": [
"Use https://www.reddit.com/prefs/apps to create a Reddit app. Choose 'Create App.' Here you can fill in a name (user agent), description and redirect uri. As described in the PRAW documentation (https://praw.readthedocs.io/en/latest/getting_started/authentication.html#script-application) you should choose http://localhost:8080 as your uri.\n",
"\n",
"For the name you should avoid using words like 'scraping' or 'bot.' It could be that Reddit will not allow your authorization if you use these words. Lastly, select script for personal use and press 'create app.'\n",
"\n",
"The client_id is a code which can be found underneath 'personal use script.' The client_secret can be found next to 'secret.' The user_agent is the name you chose yourself.\n",
"\n",
"For our scraper we chose the 'reddit_read_only.' This means the scraper will only gather the data.\n",
"\n",
"For a more indepth explanation on creating the Reddit app we refer to the tutorial section in our report or take a look here: https://towardsdatascience.com/scraping-reddit-data-1c0af3040768."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0862f704",
"metadata": {},
"outputs": [],
"source": [
"reddit_read_only = praw.Reddit(client_id=\"\", #your client id \n",
" client_secret=\"\", #your client secret \n",
" user_agent=\"\") # your user agent\n",
"subreddit = reddit_read_only.subreddit(\"\") #The name of the subreddit.If you want to scrape all subreddits use 'all'\n",
" \n",
"#With these lines of code you can check if PRAW is connected to the subreddit of your choice.\n",
"\n",
"# Display the name of the Subreddit\n",
"print(\"Display Name:\", subreddit.display_name)\n",
" \n",
"# Display the title of the Subreddit\n",
"print(\"Title:\", subreddit.title)\n",
" \n",
"# Display the description of the Subreddit\n",
"print(\"Description:\", subreddit.description)"
]
},
{
"cell_type": "markdown",
"id": "d9cde1d2",
"metadata": {},
"source": [
"## 4. Scraping data and creating a dataset"
]
},
{
"cell_type": "markdown",
"id": "a1337aea",
"metadata": {},
"source": [
"Now it is time to actually gain the data and put it in a pandas dataset. For this you have to follow the three steps as explained in our guide: \n",
"\n",
"1. Make an empty list\n",
"2. Make a loop to append the desired values to your list. Think about the information you need: Do you want usernames, titles, upvotes, name of the subreddit etc (Praw collects them automatically)\n",
"3. Make a pandas dataframe and specify the column names.\n",
"\n",
"Think of the type of posts you need and the amount (limit): top posts or hot posts.\n",
"\n",
"Example assignment: You want to collect 50 top posts from all subreddits. For this you also want to know the usernames, title of the thread, amount of upvotes, amount of comments, date of creation, the text in the post and the name of the subreddit. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0bc61da5",
"metadata": {},
"outputs": [],
"source": [
"posts = []\n",
"\n",
"#your code here:\n",
"for post in ...:\n",
" posts.append([...])\n",
"df = pd.DataFrame(posts,columns=[...])\n",
"print(df)"
]
},
{
"cell_type": "markdown",
"id": "ca734636",
"metadata": {},
"source": [
"## 5. Inspecting and cleaning the dataset"
]
},
{
"cell_type": "markdown",
"id": "9bc32b89",
"metadata": {},
"source": [
"It is important to know what is in the dataset you created. Therefore you can run a few simple pandas commands:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "509be0c3",
"metadata": {},
"outputs": [],
"source": [
"#Checking the rows and columns: \n",
"df.shape"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3312867",
"metadata": {},
"outputs": [],
"source": [
"#Checking the values: \n",
"df.dtypes"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5537dd5d",
"metadata": {},
"outputs": [],
"source": [
"#Checking the observations: \n",
"df.info()"
]
},
{
"cell_type": "markdown",
"id": "f809c6ad",
"metadata": {},
"source": [
"You probably noticed that you cannot see the actual dates of when the posts are created. Lets change this. \n",
"\n",
"Example assignment: Change the created column to dates and drop the created column. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ee6887d0",
"metadata": {},
"outputs": [],
"source": [
"import datetime as dt\n",
"df['...'] = pd.to_datetime(df['...'] utc=True, unit='s')\n",
"df = df.drop(columns=['...'])"
]
},
{
"cell_type": "markdown",
"id": "c6b14a03",
"metadata": {},
"source": [
"Now lets take a look again at the observations of your dataset. Does it have any null values? It is likely the column which contains the text of the thread has some null values as Reddit users could post threads without text. \n",
"\n",
"Example assignment: Create an overview of the rows with missing values for this column and think how this affects your dataset and further research. Does it matter? How can you interpret this?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ce44b29e",
"metadata": {},
"outputs": [],
"source": [
"df[df.isnull().any(axis=1)]"
]
},
{
"cell_type": "markdown",
"id": "0394ce10",
"metadata": {},
"source": [
"Example assignment: Now lets say you only want a dataset with posts which actually have text in the post. Create a new dataframe to filter the other posts out."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1d0f82b5",
"metadata": {},
"outputs": [],
"source": [
"body_df = df[df['...'].notna()]\n",
"body_df"
]
},
{
"cell_type": "markdown",
"id": "7d8f5773",
"metadata": {},
"source": [
"Now its time to take a closer look at the values of your dataset. \n",
"\n",
"Example assignment: Interpret the values. What is the average amount of upvotes? What is the maximum and minimum? The same goes for the comments. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "97745b6b",
"metadata": {},
"outputs": [],
"source": [
"df.describe(include='all')"
]
},
{
"cell_type": "markdown",
"id": "3bb5f263",
"metadata": {},
"source": [
"Example assignment: As you saw in our guide, it is very likely for the title column to not only exist out of unique values. Check this for yourself. If this is the case with your dataset aswell, look at the rows with duplicates. How can you interpret the duplicates? Do you need to remove them from your dataset?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e29b3ae1",
"metadata": {},
"outputs": [],
"source": [
"df[df.duplicated(subset=['title'])]"
]
},
{
"cell_type": "markdown",
"id": "b3bbebbf",
"metadata": {},
"source": [
"## 6. Saving your dataset to a CSV file"
]
},
{
"cell_type": "markdown",
"id": "65a59547",
"metadata": {},
"source": [
"Now its time to save your dataset to a CSV file:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "af2fb073",
"metadata": {},
"outputs": [],
"source": [
"df.to_csv(\"...\", index=True)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
3 changes: 3 additions & 0 deletions Groupwork (Group 5)/feminism-reddit dataset.csv
Git LFS file not shown
2 changes: 2 additions & 0 deletions Groupwork (Group 5)/flourish plot
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
https://public.flourish.studio/visualisation/12415427/