diff --git a/Groupwork (Group 5)/Collecting data Group Project - building the reddit scraper.ipynb b/Groupwork (Group 5)/Collecting data Group Project - building the reddit scraper.ipynb new file mode 100644 index 0000000..aa56ab0 --- /dev/null +++ b/Groupwork (Group 5)/Collecting data Group Project - building the reddit scraper.ipynb @@ -0,0 +1,1425 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "4e6409e2", + "metadata": {}, + "source": [ + "# The creation and exploration of the r/Feminism dataset" + ] + }, + { + "cell_type": "markdown", + "id": "884380e1", + "metadata": {}, + "source": [ + "This notebook is a guide to building a Reddit scraper with PRAW. We use our own research question, described in our report, as an example of how to approach your own research with a scraper. After creating a dataframe with pandas, we show how to inspect and clean it for further use. In our report we illustrate some examples of further use with brief visualizations created with Flourish. What you do with the output of course depends on your own research. \n", + "\n", + "The workbook file on our GitHub lets you build a Reddit scraper of your own. You can use this notebook as a reference and as a worked example for the assignments in that workbook. " + ] + }, + { + "cell_type": "markdown", + "id": "bdb227e8", + "metadata": {}, + "source": [ + "1). Use https://www.reddit.com/prefs/apps to create a Reddit app. Choose 'Create App'. Here you can fill in a name (user agent), description and redirect URI. As described in the PRAW documentation (https://praw.readthedocs.io/en/latest/getting_started/authentication.html#script-application) \n", + "you should choose http://localhost:8080 as your redirect URI. \n", + "\n", + "For the name, avoid words like 'scraping' or 'bot'. Reddit may not allow your authorization if you use these words. 
Lastly, select 'script' (for personal use) and press 'create app'. \n", + "\n", + "The client_id is the code shown underneath 'personal use script'. The client_secret is shown next to 'secret'. The user_agent is the name you chose yourself. \n", + "\n", + "For our scraper we created a read-only Reddit instance, 'reddit_read_only'. This means the scraper only reads data from Reddit. \n", + "\n", + "For a more in-depth explanation of creating the Reddit app, see the tutorial section in our report or take a look here: https://towardsdatascience.com/scraping-reddit-data-1c0af3040768." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "ff2a759e", + "metadata": {}, + "outputs": [], + "source": [ + "# If you have not done so already, install these libraries first with !pip install praw pandas\n", + "import praw\n", + "import pandas as pd" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "cb19e399", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Display Name: Feminism\n", + "Title: Feminism - “the personal is political” \n", + "Description: >#\n", + "\n", + ">* [Library](http://www.reddit.com/r/Feminism/search?q=flair%3A%22full+text%22&sort=new&restrict_sr=on)\n", + "\n", + ">#\n", + "\n", + ">* [Tags](http://redd.it/209vts)\n", + "\n", + ">#\n", + "\n", + ">* [FAQ and resources](https://docs.google.com/document/d/1TpHPEo3pG-QlB7dWCF-fcJFgayFmlQia-RjpKQGFP4A/edit#bookmark=id.p80ha3e7jbzv)\n", + "* [Concepts](http://redd.it/1fkhq4)\n", + "* [Studies] (https://docs.google.com/document/d/1TpHPEo3pG-QlB7dWCF-fcJFgayFmlQia-RjpKQGFP4A/edit#bookmark=id.abrf8mm38svw)\n", + "* [Feminist works](https://docs.google.com/document/d/1TpHPEo3pG-QlB7dWCF-fcJFgayFmlQia-RjpKQGFP4A/edit#bookmark=id.jsay6nakas1s)\n", + "* [Organizations](https://docs.google.com/document/d/1TpHPEo3pG-QlB7dWCF-fcJFgayFmlQia-RjpKQGFP4A/edit#bookmark=id.jsay6nakas1s)\n", + "\n", + ">#\n", + "\n", + ">* 
[Currents](http://redd.it/166i8a)\n", + "\n", + ">#\n", + "\n", + ">* [Definition](http://redd.it/1fkhkf)\n", + "\n", + "\n", + "\n", + "\n", + "### [](#h3-blue)\n", + ">**Feminism** is the pursuit of equality in regards to women's rights. It has manifested across centuries and continents through [various movements, currents and ideologies](http://www.reddit.com/r/Feminism/comments/166i8a/a_short_introduction_to_feminist_movements/).\n", + "\n", + "Welcome to the feminism community! This is a space for discussing and promoting awareness of issues related to equality for women.\n", + "\n", + "####Recommended introductory reading:\n", + "\n", + "- a selection of **[feminist works](https://docs.google.com/document/d/1TpHPEo3pG-QlB7dWCF-fcJFgayFmlQia-RjpKQGFP4A/edit#heading=h.vyclhseeefrl)**\n", + "\n", + "- on the **[history of feminism](https://docs.google.com/document/d/1TpHPEo3pG-QlB7dWCF-fcJFgayFmlQia-RjpKQGFP4A/edit#heading=h.x60lc44f3gn2)**\n", + "\n", + "- feminist **[blogs and websites](http://redd.it/19l1wv)**\n", + "\n", + "- **[recurrent questions](http://www.reddit.com/r/AskFeminists/search?q=flair%3ARecurrent_questions&restrict_sr=on)**\n", + "\n", + "- tagged browsing: posted **[studies](http://www.reddit.com/r/Feminism/search?q=flair%3Astudy&restrict_sr=on&sort=relevance&t=all)**, **[classic works](http://www.reddit.com/r/Feminism/search?q=flair%3Aclassic&restrict_sr=on&sort=relevance&t=all)**\n", + "\n", + "\n", + "####Issues related to [women's rights](https://docs.google.com/document/d/1TpHPEo3pG-QlB7dWCF-fcJFgayFmlQia-RjpKQGFP4A/edit?pli=1#bookmark=id.v1jecz3p3hpc):\n", + "\n", + "- [bodily integrity and autonomy](http://redd.it/142nzm)\n", + "\n", + "- [fair wages and equal career opportunities](http://redd.it/142o2s)\n", + "\n", + "- [the right to vote and the representation of women in politics](http://redd.it/145z0n)\n", + "\n", + "- [the right to own property](http://redd.it/146es0)\n", + "\n", + "- [the right to 
education](http://redd.it/1475xh)\n", + "\n", + "Our FAQ also has sections on issues related to [LGBT](https://docs.google.com/document/d/1TpHPEo3pG-QlB7dWCF-fcJFgayFmlQia-RjpKQGFP4A/edit?pli=1#bookmark=id.gs6n32up92ey) rights and [men's](https://docs.google.com/document/d/1TpHPEo3pG-QlB7dWCF-fcJFgayFmlQia-RjpKQGFP4A/edit#bookmark=id.xf29a9z1r0tt) rights.\n", + "\n", + "####Other Recommended Subreddits \n", + "\n", + " | | \n", + ":--| :--|\n", + "/r/twoXchromosomes | /r/AskFeminists|\n", + "/r/CriticalTheory| /r/domesticviolence |\n", + "/r/MeToo | /r/relationship_advice |\n", + "/r/rapecounseling | /r/ainbow |\n", + "/r/BodyAcceptance | /r/SexPositive |\n", + " | | \n", + "\n", + "\n", + "For a larger selection of civic issues subreddits, click [here](https://docs.google.com/document/d/1TpHPEo3pG-QlB7dWCF-fcJFgayFmlQia-RjpKQGFP4A/edit#bookmark=id.h456x5acmpv9)\n", + "\n", + "####Posting Rules\n", + "\n", + "\\- all posts and discussions must be relevant to women's issues\n", + "\n", + "\\- all posts must come from an educated perspective\n", + "\n", + "\\- promoting regressive agendas is not permitted\n", + "\n", + "\\- be respectful and courteous\n", + "\n", + "\\- respect the \"assume good faith\" principle\n", + " \n", + "[Click here for more info](https://www.reddit.com/r/Feminism/about/rules)\n", + "\n", + "**Rules regarding debating**:\n", + "\n", + "Criticism of feminist concepts/organizations/persons is **welcomed** if it meets the following criteria:\n", + "\n", + "\\- it is topical/directly relevant to the topic at hand;\n", + "\n", + "\\- it is verifiably sourced (i.e. it doesn’t rely on mere dismissiveness/speculation, non-feminist preferences or anecdotal evidence. In particular, pure anti-feminist propaganda is not allowed, since personal non-/anti-feminist preferences are deemed as not informative or relevant); furthermore, presentation of relevant data must not be biased against the feminist position (i.e. 
there should be a best effort to include the evidence/arguments supportive of the feminist position);\n", + "\n", + "\\- it is properly qualified: i.e. it correctly identifies the problem at the appropriate level, instead of unwarrantably generalizing it, especially if it does so for the whole collection of movements that constitute feminism;\n", + "\n", + "\\- all ideological considerations must contribute to understanding the feminist perspective, and be consistent with an attitude of encouragement towards further learning.\n" + ] + } + ], + "source": [ + "reddit_read_only = praw.Reddit(client_id=\"\", # your client id \n", + " client_secret=\"\", # your client secret \n", + " user_agent=\"\") # your user agent\n", + "subreddit = reddit_read_only.subreddit(\"Feminism\") # The name of the subreddit, in our case: (r/)Feminism.\n", + " \n", + "# With these lines of code you can check if PRAW is connected to the subreddit of your choice.\n", + "\n", + "# Display the name of the Subreddit\n", + "print(\"Display Name:\", subreddit.display_name)\n", + " \n", + "# Display the title of the Subreddit\n", + "print(\"Title:\", subreddit.title)\n", + " \n", + "# Display the description of the Subreddit\n", + "print(\"Description:\", subreddit.description)" + ] + }, + { + "cell_type": "markdown", + "id": "64c20a41", + "metadata": {}, + "source": [ + "Once you are connected to the subreddit, it is time to build a dataframe with pandas. To do this, append the values of each post to an initially empty list. In our example we collected the hot posts. You can set the limit yourself, which is especially important if you want to scrape a larger subreddit. In our case we asked for up to 2000 hot posts and received 796. Reddit listings generally return at most about 1000 items, so this is not necessarily every post in the subreddit. \n", + "\n", + "It is also possible to scrape the top posts (a selection of the most popular posts from the subreddit). 
To do this you can run the line: \n", + "\n", + "for post in subreddit.top(time_filter=\"month\"): \n", + "\n", + "You can specify whether you want the top posts from the past week, month or year.\n", + "\n", + "You can decide which values you want to collect. Finally, you create a dataframe in which you specify your desired column names. " + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "fb6152d2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " title score id \\\n", + "0 This is a comprehensive list of resources for ... 2547 phrcrn \n", + "1 On New Year's Eve, Iranian women express their... 189 10134y8 \n", + "2 She led two historic victories for abortion ri... 204 101106p \n", + "3 Amal(12) | Pregnant child bride: A story of ma... 84 1014ivy \n", + "4 My brother watches Hamza and it’s scaring me 76 1010yt6 \n", + ".. ... ... ... \n", + "791 A 31-year-old woman who had her tubes removed ... 1030 xuiv4z \n", + "792 Husband subscribed to Jordan P podcast. Help! 190 xuvhd1 \n", + "793 Journalist/Writer Julia Ioffe posts the ultima... 0 xvox9v \n", + "794 Canada significantly undercounts maternal deat... 220 xuiix4 \n", + "795 Men under 30 are less accepting of women’s rig... 129 xum6lz \n", + "\n", + " subreddit url \\\n", + "0 Feminism https://www.reddit.com/r/Feminism/comments/phr... \n", + "1 Feminism https://v.redd.it/r6d7rq8b4k9a1 \n", + "2 Feminism https://www.theguardian.com/world/2023/jan/01/... \n", + "3 Feminism https://v.redd.it/w4k473wkyg9a1 \n", + "4 Feminism https://www.reddit.com/r/Feminism/comments/101... \n", + ".. ... ... \n", + "791 Feminism https://www.businessinsider.com/new-york-woman... \n", + "792 Feminism https://www.reddit.com/r/Feminism/comments/xuv... \n", + "793 Feminism https://twitter.com/juliaioffe/status/15772880... \n", + "794 Feminism https://www.cbc.ca/news/canada/canada-maternal... \n", + "795 Feminism https://www.msn.com/en-gb/news/world/men-under... 
\n", + "\n", + " num_comments body \\\n", + "0 236 **Update** I guess I've been mass reported for... \n", + "1 3 \n", + "2 4 \n", + "3 6 \n", + "4 46 My little brother (14M) listens to a lot of “r... \n", + ".. ... ... \n", + "791 74 \n", + "792 102 He told me this morning he subscribed. What ca... \n", + "793 2 \n", + "794 2 \n", + "795 17 \n", + "\n", + " created \n", + "0 1.630761e+09 \n", + "1 1.672633e+09 \n", + "2 1.672627e+09 \n", + "3 1.672638e+09 \n", + "4 1.672627e+09 \n", + ".. ... \n", + "791 1.664802e+09 \n", + "792 1.664831e+09 \n", + "793 1.664913e+09 \n", + "794 1.664801e+09 \n", + "795 1.664810e+09 \n", + "\n", + "[796 rows x 8 columns]\n" + ] + } + ], + "source": [ + "posts = []\n", + "\n", + "for post in subreddit.hot(limit=2000):\n", + " posts.append([post.title, post.score, post.id, post.subreddit, post.url, post.num_comments, post.selftext, post.created])\n", + "feminism_df = pd.DataFrame(posts,columns=['title', 'score', 'id', 'subreddit', 'url', 'num_comments', 'body', 'created'])\n", + "print(feminism_df)" + ] + }, + { + "cell_type": "markdown", + "id": "6fcfe574", + "metadata": {}, + "source": [ + "Here we can check what is now in our dataframe. In our case the r/Feminism dataset has 796 rows and 8 columns. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "e91f9263", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(796, 8)" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "feminism_df.shape" + ] + }, + { + "cell_type": "markdown", + "id": "c635b952", + "metadata": {}, + "source": [ + "Here we convert the 'created' timestamps (seconds since the Unix epoch) into the dates on which the threads were posted:" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "081ec7c4", + "metadata": {}, + "outputs": [], + "source": [ + "import datetime as dt\n", + "feminism_df['date'] = pd.to_datetime(feminism_df['created'], utc=True, unit='s')" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "b5c6a6e2", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 2021-09-04 13:15:02+00:00\n", + "1 2023-01-02 04:23:21+00:00\n", + "2 2023-01-02 02:37:39+00:00\n", + "3 2023-01-02 05:35:48+00:00\n", + "4 2023-01-02 02:35:44+00:00\n", + " ... \n", + "791 2022-10-03 13:02:01+00:00\n", + "792 2022-10-03 21:06:27+00:00\n", + "793 2022-10-04 19:56:50+00:00\n", + "794 2022-10-03 12:47:30+00:00\n", + "795 2022-10-03 15:14:54+00:00\n", + "Name: date, Length: 796, dtype: datetime64[ns, UTC]" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "feminism_df['date']" + ] + }, + { + "cell_type": "markdown", + "id": "755f4feb", + "metadata": {}, + "source": [ + "After collection, you can save the dataset as a CSV file. It is important to do this before your analysis, especially if you work with others: the credentials of the Reddit app you created are private and should definitely not be made public. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "2b68c222", + "metadata": {}, + "outputs": [], + "source": [ + "feminism_df.to_csv(\"feminism reddit dataset.csv\", index=True)" + ] + }, + { + "cell_type": "markdown", + "id": "ee229384", + "metadata": {}, + "source": [ + "To explore our r/Feminism dataset, we load it again from the CSV file, because the code above does not run without a client id, client secret and user agent:" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "c6609a59", + "metadata": {}, + "outputs": [], + "source": [ + "feminism_df = pd.read_csv('feminism reddit dataset.csv', delimiter=',', encoding='utf-8')" + ] + }, + { + "cell_type": "markdown", + "id": "b2cd33ec", + "metadata": {}, + "source": [ + "If you want to explore your dataset, you can start by looking at the types of values:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "2341223f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Unnamed: 0 int64\n", + "title object\n", + "score int64\n", + "id object\n", + "subreddit object\n", + "url object\n", + "num_comments int64\n", + "body object\n", + "created float64\n", + "date object\n", + "dtype: object" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "feminism_df.dtypes" + ] + }, + { + "cell_type": "markdown", + "id": "18c448b6", + "metadata": {}, + "source": [ + "Now let's see how many observations the dataset has and how many values are missing per column:" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "0ed0fde2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "RangeIndex: 796 entries, 0 to 795\n", + "Data columns (total 10 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 Unnamed: 0 796 non-null int64 \n", + " 1 title 796 non-null object \n", + " 2 score 796 non-null int64 \n", + " 3 id 796 non-null object \n", + " 
4 subreddit 796 non-null object \n", + " 5 url 796 non-null object \n", + " 6 num_comments 796 non-null int64 \n", + " 7 body 282 non-null object \n", + " 8 created 796 non-null float64\n", + " 9 date 796 non-null object \n", + "dtypes: float64(1), int64(3), object(6)\n", + "memory usage: 62.3+ KB\n" + ] + } + ], + "source": [ + "feminism_df.info()\n" + ] + }, + { + "cell_type": "markdown", + "id": "afc93f5e", + "metadata": {}, + "source": [ + "Only the 'body' column has null values. This is because Reddit users can make posts with only a title. A post can also consist of just a picture or video; in that case the link to the image or video is stored in the 'url' column. After the round trip through the CSV file, these empty bodies are read back as NaN." + ] + }, + { + "cell_type": "markdown", + "id": "06047e28", + "metadata": {}, + "source": [ + "The column 'Unnamed: 0' duplicates the index (it was written by to_csv with index=True), so we can drop it. The 'created' column is also redundant: it holds the creation time as a Unix timestamp, and we already added a readable 'date' column. " + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "b92d98ee", + "metadata": {}, + "outputs": [], + "source": [ + "feminism_df = feminism_df.drop(columns=['created'])\n", + "feminism_df = feminism_df.drop(columns=['Unnamed: 0'])" + ] + }, + { + "cell_type": "markdown", + "id": "7c1a2621", + "metadata": {}, + "source": [ + "Here we get a quick overview of the rows with missing values; only the 'body' column has any:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "15d3cf8a", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "
" + ], + "text/plain": [ + " Unnamed: 0 title score \\\n", + "1 1 On New Year's Eve, Iranian women express their... 189 \n", + "2 2 She led two historic victories for abortion ri... 204 \n", + "3 3 Amal(12) | Pregnant child bride: A story of ma... 84 \n", + "5 5 Women are more critical of female toplessness ... 146 \n", + "6 6 This has bothered me for a long time 2609 \n", + ".. ... ... ... \n", + "790 790 Infosys to face age, gender bias suit by forme... 3 \n", + "791 791 A 31-year-old woman who had her tubes removed ... 1030 \n", + "793 793 Journalist/Writer Julia Ioffe posts the ultima... 0 \n", + "794 794 Canada significantly undercounts maternal deat... 220 \n", + "795 795 Men under 30 are less accepting of women’s rig... 129 \n", + "\n", + " id subreddit url \\\n", + "1 10134y8 Feminism https://v.redd.it/r6d7rq8b4k9a1 \n", + "2 101106p Feminism https://www.theguardian.com/world/2023/jan/01/... \n", + "3 1014ivy Feminism https://v.redd.it/w4k473wkyg9a1 \n", + "5 100vlhd Feminism https://www.psypost.org/2022/10/women-are-more... \n", + "6 100c37f Feminism https://i.redd.it/2xwf53zf3d9a1.png \n", + ".. ... ... ... \n", + "790 xvsmln Feminism https://www.theregister.com/2022/10/04/infosys... \n", + "791 xuiv4z Feminism https://www.businessinsider.com/new-york-woman... \n", + "793 xvox9v Feminism https://twitter.com/juliaioffe/status/15772880... \n", + "794 xuiix4 Feminism https://www.cbc.ca/news/canada/canada-maternal... \n", + "795 xum6lz Feminism https://www.msn.com/en-gb/news/world/men-under... \n", + "\n", + " num_comments body created date \n", + "1 3 NaN 1.672633e+09 2023-01-02 04:23:21+00:00 \n", + "2 4 NaN 1.672627e+09 2023-01-02 02:37:39+00:00 \n", + "3 6 NaN 1.672638e+09 2023-01-02 05:35:48+00:00 \n", + "5 39 NaN 1.672613e+09 2023-01-01 22:35:35+00:00 \n", + "6 88 NaN 1.672548e+09 2023-01-01 04:46:51+00:00 \n", + ".. ... ... ... ... 
\n", + "790 0 NaN 1.664922e+09 2022-10-04 22:23:45+00:00 \n", + "791 74 NaN 1.664802e+09 2022-10-03 13:02:01+00:00 \n", + "793 2 NaN 1.664913e+09 2022-10-04 19:56:50+00:00 \n", + "794 2 NaN 1.664801e+09 2022-10-03 12:47:30+00:00 \n", + "795 17 NaN 1.664810e+09 2022-10-03 15:14:54+00:00 \n", + "\n", + "[514 rows x 10 columns]" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "feminism_df[feminism_df.isnull().any(axis=1)]" + ] + }, + { + "cell_type": "markdown", + "id": "1749098f", + "metadata": {}, + "source": [ + "To keep only the posts with actual text in the body (so without the image, link and video posts):" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "b86d50da", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "
" + ], + "text/plain": [ + " Unnamed: 0 title score \\\n", + "0 0 This is a comprehensive list of resources for ... 2547 \n", + "4 4 My brother watches Hamza and it’s scaring me 76 \n", + "7 7 Being a teen girl sucks 236 \n", + "8 8 Is the whole thing a lie? 31 \n", + "12 12 Girls fighting for the future of the environme... 11 \n", + ".. ... ... ... \n", + "771 771 Yes or no to bras 4 \n", + "772 772 Why do schools need to know female athletes cy... 26 \n", + "775 775 Mobile PP clinics launched today 4 \n", + "784 784 Please God tell me this is Not true. If it is ... 56 \n", + "792 792 Husband subscribed to Jordan P podcast. Help! 190 \n", + "\n", + " id subreddit url \\\n", + "0 phrcrn Feminism https://www.reddit.com/r/Feminism/comments/phr... \n", + "4 1010yt6 Feminism https://www.reddit.com/r/Feminism/comments/101... \n", + "7 100ofn7 Feminism https://www.reddit.com/r/Feminism/comments/100... \n", + "8 100yxd9 Feminism https://www.reddit.com/r/Feminism/comments/100... \n", + "12 100yiuh Feminism https://youtu.be/_HTdyorjL0E \n", + ".. ... ... ... \n", + "771 xx10f8 Feminism https://www.reddit.com/r/Feminism/comments/xx1... \n", + "772 xwpqup Feminism https://www.reddit.com/r/Feminism/comments/xwp... \n", + "775 xwr4ad Feminism https://www.reddit.com/r/Feminism/comments/xwr... \n", + "784 xvxxsb Feminism https://www.reddit.com/r/Feminism/comments/xvx... \n", + "792 xuvhd1 Feminism https://www.reddit.com/r/Feminism/comments/xuv... \n", + "\n", + " num_comments body \\\n", + "0 236 **Update** I guess I've been mass reported for... \n", + "4 46 My little brother (14M) listens to a lot of “r... \n", + "7 12 As I am in my last year of highschool and am g... \n", + "8 10 I've gone through hell this year. It has made ... \n", + "12 0 This documentary is a year old but these girls... \n", + ".. ... ... \n", + "771 9 Hello, hi.\\n\\nSo I am that kind of person that... \n", + "772 19 Why would schools and coaches need to know an ... 
\n", + "775 0 Mobile PP clinic launched today, bringing much... \n", + "784 35 —Florida’s state government finds itself in th... \n", + "792 102 He told me this morning he subscribed. What ca... \n", + "\n", + " created date \n", + "0 1.630761e+09 2021-09-04 13:15:02+00:00 \n", + "4 1.672627e+09 2023-01-02 02:35:44+00:00 \n", + "7 1.672594e+09 2023-01-01 17:28:45+00:00 \n", + "8 1.672621e+09 2023-01-02 01:00:42+00:00 \n", + "12 1.672620e+09 2023-01-02 00:42:40+00:00 \n", + ".. ... ... \n", + "771 1.665049e+09 2022-10-06 09:33:52+00:00 \n", + "772 1.665012e+09 2022-10-05 23:25:06+00:00 \n", + "775 1.665016e+09 2022-10-06 00:27:15+00:00 \n", + "784 1.664936e+09 2022-10-05 02:21:09+00:00 \n", + "792 1.664831e+09 2022-10-03 21:06:27+00:00 \n", + "\n", + "[282 rows x 10 columns]" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "body_df = feminism_df[feminism_df['body'].notna()]\n", + "body_df" + ] + }, + { + "cell_type": "markdown", + "id": "daac6082", + "metadata": {}, + "source": [ + "Lastly, let's take a look at the values of the dataset:" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "99a0467f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "
" + ], + "text/plain": [ + " Unnamed: 0 title score id subreddit \\\n", + "count 796.000000 796 796.000000 796 796 \n", + "unique NaN 792 NaN 796 1 \n", + "top NaN Yes we can. NaN phrcrn Feminism \n", + "freq NaN 3 NaN 1 796 \n", + "mean 397.500000 NaN 258.898241 NaN NaN \n", + "std 229.929699 NaN 442.125368 NaN NaN \n", + "min 0.000000 NaN 0.000000 NaN NaN \n", + "25% 198.750000 NaN 15.000000 NaN NaN \n", + "50% 397.500000 NaN 84.000000 NaN NaN \n", + "75% 596.250000 NaN 328.000000 NaN NaN \n", + "max 795.000000 NaN 3388.000000 NaN NaN \n", + "\n", + " url num_comments \\\n", + "count 796 796.000000 \n", + "unique 792 NaN \n", + "top https://theconversation.com/women-in-antarctic... NaN \n", + "freq 2 NaN \n", + "mean NaN 23.077889 \n", + "std NaN 40.938321 \n", + "min NaN 0.000000 \n", + "25% NaN 2.000000 \n", + "50% NaN 6.000000 \n", + "75% NaN 23.000000 \n", + "max NaN 278.000000 \n", + "\n", + " body created \\\n", + "count 282 7.960000e+02 \n", + "unique 282 NaN \n", + "top **Update** I guess I've been mass reported for... NaN \n", + "freq 1 NaN \n", + "mean NaN 1.668580e+09 \n", + "std NaN 2.703460e+06 \n", + "min NaN 1.630761e+09 \n", + "25% NaN 1.666444e+09 \n", + "50% NaN 1.668573e+09 \n", + "75% NaN 1.670594e+09 \n", + "max NaN 1.672650e+09 \n", + "\n", + " date \n", + "count 796 \n", + "unique 795 \n", + "top 2022-12-09 03:43:49+00:00 \n", + "freq 2 \n", + "mean NaN \n", + "std NaN \n", + "min NaN \n", + "25% NaN \n", + "50% NaN \n", + "75% NaN \n", + "max NaN " + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "feminism_df.describe(include='all')\n" + ] + }, + { + "cell_type": "markdown", + "id": "25c09a63", + "metadata": {}, + "source": [ + "Here we can see the the avarage score(upvotes) which is around 259 per post. We can see the avarage comments per post which is 23. The maximum number of upvotes is 3388. The maximum number of comments is 278. 
" + ] + }, + { + "cell_type": "markdown", + "id": "1e77fa97", + "metadata": {}, + "source": [ + "We also see that the 'title' column has 4 values which are not unique. This means there are a few posts in the dataset that appear more than once. Lets check which posts this are: " + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "7a53ef76", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
| | title | score | id | subreddit | url | num_comments | body | date |
|---|---|---|---|---|---|---|---|---|
| 118 | Yes we can. | 813 | zqn682 | Feminism | https://i.redd.it/021o2vz4r17a1.jpg | 8 | NaN | 2022-12-20 12:29:06+00:00 |
| 121 | Women Heavily Underrepresented in Political De... | 3 | zrfcid | Feminism | https://web-mind.io/artificial-intelligence/wo... | 0 | NaN | 2022-12-21 09:15:24+00:00 |
| 210 | Yes we can. | 895 | zfibca | Feminism | https://i.redd.it/2csazo9cck4a1.jpg | 19 | NaN | 2022-12-07 23:47:35+00:00 |
| 774 | Women in Antarctica face assault and harassmen... | 9 | xwlrzs | Feminism | https://theconversation.com/women-in-antarctic... | 0 | NaN | 2022-10-05 20:43:51+00:00 |
" + ], + "text/plain": [ + " title score id \\\n", + "118 Yes we can. 813 zqn682 \n", + "121 Women Heavily Underrepresented in Political De... 3 zrfcid \n", + "210 Yes we can. 895 zfibca \n", + "774 Women in Antarctica face assault and harassmen... 9 xwlrzs \n", + "\n", + " subreddit url \\\n", + "118 Feminism https://i.redd.it/021o2vz4r17a1.jpg \n", + "121 Feminism https://web-mind.io/artificial-intelligence/wo... \n", + "210 Feminism https://i.redd.it/2csazo9cck4a1.jpg \n", + "774 Feminism https://theconversation.com/women-in-antarctic... \n", + "\n", + " num_comments body date \n", + "118 8 NaN 2022-12-20 12:29:06+00:00 \n", + "121 0 NaN 2022-12-21 09:15:24+00:00 \n", + "210 19 NaN 2022-12-07 23:47:35+00:00 \n", + "774 0 NaN 2022-10-05 20:43:51+00:00 " + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "feminism_df[feminism_df.duplicated(subset=['title'])]" + ] + }, + { + "cell_type": "markdown", + "id": "433d41d6", + "metadata": {}, + "source": [ + "It could be that the 'Yes we can.' post is a repost. However, we see that these posts have no body, only an image, so it could also be that each one shows a different image. The other posts are not the same. Their titles share some words, but the posts show other URLs as well. For our dataset we do not have to remove these posts. We now know that our dataset does not contain actual reposts. You have to decide for yourself and your research whether rows have to be removed. If so, the easiest way is to drop rows by index number, for example df.drop([0, 1]), or you can take a look here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html. 
" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/Groupwork (Group 5)/Data Management Plan (DMP).pdf b/Groupwork (Group 5)/Data Management Plan (DMP).pdf new file mode 100644 index 0000000..9e3efea Binary files /dev/null and b/Groupwork (Group 5)/Data Management Plan (DMP).pdf differ diff --git a/Groupwork (Group 5)/README.md b/Groupwork (Group 5)/README.md new file mode 100644 index 0000000..eb7969d --- /dev/null +++ b/Groupwork (Group 5)/README.md @@ -0,0 +1,9 @@ +# Collecting data from r/Feminism: A Reddit webscraper for research in the Digital Humanities + +This repository contains a group project we created for research in the field of Digital Humanities. We wanted to explore the different ways in which scholars of the field could actually collect various forms of data for their research purposes. In order to do so, we came up with a brief project which helps us to better understand the process of data collection. Moreover, this project gave us the opportunity to create a guide for further research. + +We decided to explore the social media platform Reddit. On here we focused on a topic analysis of r/Feminism. In order to gain our data we needed to make our own scraper. The guide in this repository contains the step-by-step process of how we programmed the scraper. For this we used PRAW: Python Reddit Api Wrapper. This is a great and user-friendly way to scrape data from Reddit. We are aware that there are multiple ways of scraping Reddit. For example, using the Pushshift Api. These are also great tools. 
However, we found PRAW to be more consistent, both in the data it gathers and in the servers it connects to. + +The workbook in this repository allows researchers to make their own Reddit scraper. It takes you through the whole process again, but this time with a subreddit of your own choosing. The file contains various example assignments you can do, but you can also use it for your own purposes. The code in the guide is reusable and, with a few adjustments, suitable for many projects. + +The report in this repository contains the foundations of our own research project, which guided our data collection. In addition, it documents the file in which we created our own scraper, in the form of a tutorial. To explore our own r/Feminism dataset we created a data visualization with Flourish. This visualization is described in our Tutorial section and is also available through a link in this repository. It serves as a brief example of the ways in which the r/Feminism dataset could be used. In the tutorial we followed up on this by giving a few recommendations for further, more in-depth data analysis of our dataset. The workbook, which serves as an active learning exercise, is also explained in more depth in the report. Finally, this repository contains a Data Management Plan (DMP), in which we elaborate on the ways in which our dataset can be considered FAIR and justify our methods, tools and research in more depth. 
diff --git a/Groupwork (Group 5)/Report_Collecting data from rFeminism.pdf b/Groupwork (Group 5)/Report_Collecting data from rFeminism.pdf new file mode 100644 index 0000000..db576d8 Binary files /dev/null and b/Groupwork (Group 5)/Report_Collecting data from rFeminism.pdf differ diff --git a/Groupwork (Group 5)/Workbook- Making your own scraper.ipynb b/Groupwork (Group 5)/Workbook- Making your own scraper.ipynb new file mode 100644 index 0000000..c4a8ba4 --- /dev/null +++ b/Groupwork (Group 5)/Workbook- Making your own scraper.ipynb @@ -0,0 +1,358 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "c766bdfa", + "metadata": {}, + "source": [ + "# Workbook: Making your own Reddit scraper with PRAW" + ] + }, + { + "cell_type": "markdown", + "id": "c87b65f9", + "metadata": {}, + "source": [ + "In this file we will go step by step through the whole process of making a Reddit web scraper. If you want to create a dataset of any subreddit you like, you can simply fill in the empty code spaces. This way, any time you need a web scraper for Reddit you can come back to this file and fill everything in. Moreover, we will guide you through cleaning and inspecting the dataset, but you can always skip this part if you think it is not needed (although we highly recommend working through it to get a better understanding of your dataset). \n", + "\n", + "If anything is unclear you can look at the other file in our GitHub repository, where we created the dataset of r/Feminism. Moreover, you can always look at the tutorial in our report. " + ] + }, + { + "cell_type": "markdown", + "id": "ea57770c", + "metadata": {}, + "source": [ + "## 1. 
Installing PRAW" + ] + }, + { + "cell_type": "markdown", + "id": "4d6b33de", + "metadata": {}, + "source": [ + "First make sure you downloaded PRAW to your computer: " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d4d48ff0", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install praw" + ] + }, + { + "cell_type": "markdown", + "id": "6214d481", + "metadata": {}, + "source": [ + "## 2. Importing the PRAW and pandas libraries" + ] + }, + { + "cell_type": "markdown", + "id": "c30629d7", + "metadata": {}, + "source": [ + "Now we have to import praw and pandas to build the scraper and analyze the dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b78a3bc2", + "metadata": {}, + "outputs": [], + "source": [ + "import praw\n", + "import pandas as pd" + ] + }, + { + "cell_type": "markdown", + "id": "c3ed37ca", + "metadata": {}, + "source": [ + "## 3. Creating a Reddit App and connecting to the subreddit" + ] + }, + { + "cell_type": "markdown", + "id": "26642b8f", + "metadata": {}, + "source": [ + "Use https://www.reddit.com/prefs/apps to create a Reddit app. Choose 'Create App.' Here you can fill in a name (user agent), description and redirect uri. As described in the PRAW documentation (https://praw.readthedocs.io/en/latest/getting_started/authentication.html#script-application) you should choose http://localhost:8080 as your uri.\n", + "\n", + "For the name you should avoid using words like 'scraping' or 'bot.' It could be that Reddit will not allow your authorization if you use these words. Lastly, select script for personal use and press 'create app.'\n", + "\n", + "The client_id is a code which can be found underneath 'personal use script.' The client_secret can be found next to 'secret.' The user_agent is the name you chose yourself.\n", + "\n", + "For our scraper we chose the 'reddit_read_only.' 
This means the scraper will only gather the data.\n", + "\n", + "For a more in-depth explanation of creating the Reddit app, see the tutorial section in our report or take a look here: https://towardsdatascience.com/scraping-reddit-data-1c0af3040768." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0862f704", + "metadata": {}, + "outputs": [], + "source": [ + "reddit_read_only = praw.Reddit(client_id=\"\", # your client id\n", + " client_secret=\"\", # your client secret\n", + " user_agent=\"\") # your user agent\n", + "subreddit = reddit_read_only.subreddit(\"\") # The name of the subreddit. If you want to scrape all subreddits use 'all'\n", + " \n", + "# With these lines of code you can check if PRAW is connected to the subreddit of your choice.\n", + "\n", + "# Display the name of the Subreddit\n", + "print(\"Display Name:\", subreddit.display_name)\n", + " \n", + "# Display the title of the Subreddit\n", + "print(\"Title:\", subreddit.title)\n", + " \n", + "# Display the description of the Subreddit\n", + "print(\"Description:\", subreddit.description)" + ] + }, + { + "cell_type": "markdown", + "id": "d9cde1d2", + "metadata": {}, + "source": [ + "## 4. Scraping data and creating a dataset" + ] + }, + { + "cell_type": "markdown", + "id": "a1337aea", + "metadata": {}, + "source": [ + "Now it is time to actually collect the data and put it in a pandas dataframe. For this you have to follow the three steps explained in our guide: \n", + "\n", + "1. Make an empty list\n", + "2. Make a loop to append the desired values to your list. Think about the information you need: do you want usernames, titles, upvotes, the name of the subreddit, etc.? (PRAW collects them automatically)\n", + "3. Make a pandas dataframe and specify the column names.\n", + "\n", + "Think of the type of posts you need (top posts or hot posts) and the number of posts (limit).\n", + "\n", + "Example assignment: You want to collect 50 top posts from all subreddits. 
For this you also want to know the usernames, title of the thread, number of upvotes, number of comments, date of creation, the text in the post and the name of the subreddit. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0bc61da5", + "metadata": {}, + "outputs": [], + "source": [ + "posts = []\n", + "\n", + "#your code here:\n", + "for post in ...:\n", + " posts.append([...])\n", + "df = pd.DataFrame(posts,columns=[...])\n", + "print(df)" + ] + }, + { + "cell_type": "markdown", + "id": "ca734636", + "metadata": {}, + "source": [ + "## 5. Inspecting and cleaning the dataset" + ] + }, + { + "cell_type": "markdown", + "id": "9bc32b89", + "metadata": {}, + "source": [ + "It is important to know what is in the dataset you created. To find out, you can run a few simple pandas commands:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "509be0c3", + "metadata": {}, + "outputs": [], + "source": [ + "#Checking the rows and columns: \n", + "df.shape" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f3312867", + "metadata": {}, + "outputs": [], + "source": [ + "#Checking the data types: \n", + "df.dtypes" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5537dd5d", + "metadata": {}, + "outputs": [], + "source": [ + "#Checking the observations: \n", + "df.info()" + ] + }, + { + "cell_type": "markdown", + "id": "f809c6ad", + "metadata": {}, + "source": [ + "You probably noticed that you cannot see the actual dates on which the posts were created. Let's change this. \n", + "\n", + "Example assignment: Convert the created column to dates in a new column and drop the original created column. 
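
As a hedged sketch of what this conversion does, the snippet below runs it on two example timestamps. The frame toy_df is invented for illustration; the column names created and date follow our guide, and your own column names may differ. The values in created are assumed to be Unix timestamps in seconds, which is what the unit='s' argument expresses.

```python
import pandas as pd

# Toy frame, invented for illustration; 'created' holds Unix timestamps in seconds.
toy_df = pd.DataFrame({'created': [1630761302, 1672626944]})

# Convert the epoch seconds to timezone-aware (UTC) datetimes in a new column.
toy_df['date'] = pd.to_datetime(toy_df['created'], utc=True, unit='s')

# Drop the raw timestamp column now that a readable date exists.
toy_df = toy_df.drop(columns=['created'])

print(toy_df)
```

Passing utc=True makes the resulting dates timezone-aware, which avoids ambiguity when you later compare or sort posts from different sources.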
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ee6887d0", + "metadata": {}, + "outputs": [], + "source": [ + "import datetime as dt\n", + "df['...'] = pd.to_datetime(df['...'] utc=True, unit='s')\n", + "df = df.drop(columns=['...'])" + ] + }, + { + "cell_type": "markdown", + "id": "c6b14a03", + "metadata": {}, + "source": [ + "Now lets take a look again at the observations of your dataset. Does it have any null values? It is likely the column which contains the text of the thread has some null values as Reddit users could post threads without text. \n", + "\n", + "Example assignment: Create an overview of the rows with missing values for this column and think how this affects your dataset and further research. Does it matter? How can you interpret this?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ce44b29e", + "metadata": {}, + "outputs": [], + "source": [ + "df[df.isnull().any(axis=1)]" + ] + }, + { + "cell_type": "markdown", + "id": "0394ce10", + "metadata": {}, + "source": [ + "Example assignment: Now lets say you only want a dataset with posts which actually have text in the post. Create a new dataframe to filter the other posts out." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1d0f82b5", + "metadata": {}, + "outputs": [], + "source": [ + "body_df = df[df['...'].notna()]\n", + "body_df" + ] + }, + { + "cell_type": "markdown", + "id": "7d8f5773", + "metadata": {}, + "source": [ + "Now its time to take a closer look at the values of your dataset. \n", + "\n", + "Example assignment: Interpret the values. What is the average amount of upvotes? What is the maximum and minimum? The same goes for the comments. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "97745b6b", + "metadata": {}, + "outputs": [], + "source": [ + "df.describe(include='all')" + ] + }, + { + "cell_type": "markdown", + "id": "3bb5f263", + "metadata": {}, + "source": [ + "Example assignment: As you saw in our guide, it is very likely for the title column to not only exist out of unique values. Check this for yourself. If this is the case with your dataset aswell, look at the rows with duplicates. How can you interpret the duplicates? Do you need to remove them from your dataset?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e29b3ae1", + "metadata": {}, + "outputs": [], + "source": [ + "df[df.duplicated(subset=['title'])]" + ] + }, + { + "cell_type": "markdown", + "id": "b3bbebbf", + "metadata": {}, + "source": [ + "## 6. Saving your dataset to a CSV file" + ] + }, + { + "cell_type": "markdown", + "id": "65a59547", + "metadata": {}, + "source": [ + "Now its time to save your dataset to a CSV file:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "af2fb073", + "metadata": {}, + "outputs": [], + "source": [ + "df.to_csv(\"...\", index=True)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/Groupwork (Group 5)/feminism-reddit dataset.csv b/Groupwork (Group 5)/feminism-reddit dataset.csv new file mode 100644 index 0000000..61a811d --- /dev/null +++ b/Groupwork (Group 5)/feminism-reddit dataset.csv @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:fa7b42f00ec9040f6dc49c9857d003d63d8969bf9b5c9ef45553be858bb3ab01 +size 
506811 diff --git a/Groupwork (Group 5)/flourish plot b/Groupwork (Group 5)/flourish plot new file mode 100644 index 0000000..629352a --- /dev/null +++ b/Groupwork (Group 5)/flourish plot @@ -0,0 +1,2 @@ +https://public.flourish.studio/visualisation/12415427/ +