ajingu/ML_Article_Classifier

ML-based web application that classifies the news category from a news link. Classification ML model trained on more than 40,000 articles. (NLP, training classification ML models, web scraping, client/server, DB)
Article Classifier

Overview

Article Classifier is an application that fetches the HTML of an article URL entered into a form, determines the article's category, and displays it on screen. The web application serves a classification model trained on more than 40,000 articles.


Application Summary

Enter an article URL from https://gunosy.com/ into the form in the center of the application screen. Press the "Analyze" button to classify the article's category and display it on the screen.

Classifier Accuracy Overview

Two classifiers are available to determine the article category: one based on Naive Bayes and one based on Logistic Regression. Both were trained with a training-to-test data split of 8:2. The precision, recall, f1-score, and support (number of test samples) for each classifier are shown below. In addition, to show that the models do not overfit to a particular split, cross-validation scores are also listed.

Naive Bayes Classifier

| category | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| IT/Science | 0.79 | 0.94 | 0.86 | 541 |
| Funny | 0.75 | 0.15 | 0.25 | 101 |
| Entertainment | 0.97 | 0.94 | 0.96 | 4039 |
| Gourmet | 0.87 | 0.95 | 0.91 | 611 |
| Column | 0.81 | 0.87 | 0.83 | 1155 |
| Sports | 0.97 | 0.96 | 0.97 | 827 |
| Domestic | 0.87 | 0.82 | 0.84 | 671 |
| International | 0.84 | 0.86 | 0.85 | 336 |
| avg / total | 0.91 | 0.91 | 0.91 | 8281 |

The values obtained from 5-fold cross-validation are:

scores: [ 0.91063881 0.90882744 0.90725758 0.91038647 0.90736715]

with an average of 0.908895489763, i.e. an accuracy of approximately 90.9% in cross-validation.

Logistic Regression Classifier

| category | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| IT/Science | 0.90 | 0.95 | 0.92 | 541 |
| Funny | 0.80 | 0.69 | 0.74 | 101 |
| Entertainment | 0.98 | 0.98 | 0.98 | 4039 |
| Gourmet | 0.93 | 0.96 | 0.94 | 611 |
| Column | 0.91 | 0.89 | 0.90 | 1155 |
| Sports | 0.98 | 0.98 | 0.98 | 827 |
| Domestic | 0.90 | 0.87 | 0.88 | 671 |
| International | 0.88 | 0.91 | 0.89 | 336 |
| avg / total | 0.95 | 0.95 | 0.95 | 8281 |

The values obtained from 5-fold cross-validation are:

scores: [ 0.94046613 0.9364811 0.94119068 0.93538647 0.93913043]

with an average of 0.938530962853, i.e. an accuracy of approximately 93.9% in cross-validation.
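The per-category tables and cross-validation scores above are standard scikit-learn outputs. A minimal sketch of how such numbers are produced, with a toy corpus standing in for the ~40,000 scraped articles (the actual feature extraction in this project uses MeCab-tokenized Japanese text):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report

# Toy corpus standing in for the scraped articles.
texts = ["soccer match goal", "baseball pitcher strike", "new smartphone chip",
         "ai research model", "soccer goal keeper", "chip design ai"] * 10
labels = ["Sports", "Sports", "IT/Science",
          "IT/Science", "Sports", "IT/Science"] * 10

clf = make_pipeline(CountVectorizer(), MultinomialNB())

# 8:2 train/test split, as in the tables above.
X_tr, X_te, y_tr, y_te = train_test_split(texts, labels,
                                          test_size=0.2, random_state=0)
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))  # per-category table

# 5-fold cross-validation, as in the "scores" lists above.
scores = cross_val_score(clf, texts, labels, cv=5)
print("scores:", scores, "average:", scores.mean())
```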

Environment Setup

The execution environment is as follows:

Mac OS X: Sierra 10.12.2
Python: 3.6.1

In the terminal:

$ brew update
$ brew install python3
$ pip install virtualenv
$ virtualenv --python=/usr/local/bin/python3 --no-site-packages env
$ source env/bin/activate

to start the virtual environment.

Next:

$ brew install mecab
$ brew install mecab-ipadic
$ git clone --depth 1 git@github.com:neologd/mecab-ipadic-neologd.git
$ ./mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -y -n

to install mecab-ipadic-neologd, which is used as a dictionary for MeCab.

Furthermore:

$ git clone git@github.com:ajingu/gunosy.git
$ cd gunosy
$ pip install -r requirements.txt

to install the required Python packages into the virtual environment.

Finally, you need to write environment variables to .bash_profile. In this project, to hide the database password and other credentials, environment variables are added to .bash_profile and read within the program using the os module.

$ cd ~
$ vim .bash_profile

to open .bash_profile and add the following text:

export GUNOSY_HOST="****"
export GUNOSY_USERNAME="****"
export GUNOSY_PASSWORD="****"
export GUNOSY_DATABASE_NAME="****"
export GUNOSY_TABLE_NAME="****"

to add the environment variables.

The correspondence for each variable is as follows. You need to set appropriate values considering the database you will use.

| Environment Variable | Value |
| --- | --- |
| GUNOSY_HOST | Host name |
| GUNOSY_USERNAME | Username |
| GUNOSY_PASSWORD | Password |
| GUNOSY_DATABASE_NAME | Database name |
| GUNOSY_TABLE_NAME | Table name |
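Inside the project these credentials are then read with the os module. A minimal sketch (the variable names come from the table above; the exact code in database.py and settings.py may differ):

```python
import os

# Read the credentials exported in ~/.bash_profile.
# os.environ.get returns None for a missing variable, so misconfiguration
# can be reported clearly before any database connection is attempted.
DB_CONFIG = {
    "host": os.environ.get("GUNOSY_HOST"),
    "user": os.environ.get("GUNOSY_USERNAME"),
    "passwd": os.environ.get("GUNOSY_PASSWORD"),
    "db": os.environ.get("GUNOSY_DATABASE_NAME"),
}
TABLE_NAME = os.environ.get("GUNOSY_TABLE_NAME")

missing = [key for key, value in DB_CONFIG.items() if value is None]
if missing:
    print("Missing environment variables for:", missing)
```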

How to Run

Step 1: Creating the web app with the Naive Bayes classifier

※ Note: This repository includes pre-trained model data by default, so you can start the application right away by entering $ python manage.py runserver in the gunosychallenge directory.

  1. When collecting data, if you want to delete previously collected data, in the gunosychallenge repository:

    You can delete all rows in the relevant table and initialize the database by running the command $ python manage.py initialize.

  2. To collect data using Scrapy, in the gunosychallenge repository:

    Run the command $ python manage.py scrapy crawl gunosy. Data collection takes about 70 minutes and retrieves around 40,000 articles.

  3. To train the Naive Bayes classifier, in the gunosychallenge repository:

    Run the command $ python manage.py make_clf nb. Training takes about 8 minutes.

  4. To start the web application, in the gunosychallenge repository:

    Run the command $ python manage.py runserver.

    This starts a local server. Accessing http://127.0.0.1:8000/ will launch the web application. As explained in the "Overview" section, enter an article URL from https://gunosy.com/ into the central form and press the "Analyze" button. The application will predict the article's category from "Entertainment," "Sports," "Funny," "Domestic," "International," "Column," "IT/Science," and "Gourmet," and display it on the screen.

Step 2: Improving Classifier Accuracy

  1. To train the classifier using Logistic Regression, in the gunosychallenge repository:

    Run the command $ python manage.py make_clf logistic. Training takes about 10 minutes.

  2. Launching the web app, entering an article URL, and predicting the category are done in the exact same way as in Step 1.

Appendix: Testing

The application includes tests for both the crawler and the web app.

Scrapy Test

  • In the gunosynews repository: Enter $ python gunosynews/tests/scrapy_test.py to test the crawler.

Web App Test

  • In the gunosychallenge repository: Enter $ python manage.py test to test the web application.

Innovations

1. Regarding Data Collection

  • Acquiring as many articles as possible

    • Relevant file: gunosy.py
    • To use as many articles as possible for training and testing, articles were collected from https://gunosy.com/tags.
    • There are currently 2500 tags (from 1 to 2500), and each tag is assigned to one of eight categories: "Entertainment," "Sports," "Funny," "Domestic," "International," "Column," "IT/Science," and "Gourmet."
    • However, some tags are not assigned to a category, or exist as a tag but have no associated articles. An implementation was made to ignore such irregular tags.
  • Configuring MySQL in pipelines.py

    • Relevant file: pipelines.py
    • By setting up MySQL linked with the Django app in pipelines.py, the flow from data collection to database upload was made smooth.
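The tag enumeration and skip logic can be sketched in plain Python (the dictionary shape of a parsed tag page here is hypothetical; the actual spider in gunosy.py extracts these fields from the scraped HTML):

```python
CATEGORIES = {"Entertainment", "Sports", "Funny", "Domestic",
              "International", "Column", "IT/Science", "Gourmet"}

def tag_urls():
    # Tags are numbered 1 to 2500 under https://gunosy.com/tags.
    return [f"https://gunosy.com/tags/{i}" for i in range(1, 2501)]

def should_skip(tag_page):
    """Skip irregular tags: no category assignment, or no associated articles."""
    return tag_page.get("category") not in CATEGORIES or not tag_page.get("articles")

# Hypothetical parse results demonstrating the skip rules:
pages = [
    {"category": "Sports", "articles": ["a", "b"]},  # kept
    {"category": None, "articles": ["a"]},           # no category -> skipped
    {"category": "Gourmet", "articles": []},         # no articles -> skipped
]
kept = [p for p in pages if not should_skip(p)]
```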

2. Regarding Classifier Training

2-1. General Classifier Features

2-1-1. Use of mecab-ipadic-neologd
  • Relevant file: preprocess.py
  • For morphological analysis of Japanese, the morphological analyzer MeCab was used with the mecab-ipadic-neologd dictionary.
  • This allows for more appropriate feature extraction from news articles, which contain many proper nouns such as names of people and places. Feature words were limited to nouns and adjectives to extract words more relevant to the context of the text.
  • Also, since the dictionary path can change depending on the build environment, an implementation was added to search for the dictionary path first when preprocessing the data.
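The dictionary-path search can be sketched as follows (the candidate locations are assumptions based on common Homebrew installs; the actual list in preprocess.py may differ):

```python
import os

# Common install locations for mecab-ipadic-neologd; these are illustrative
# candidates, not necessarily the ones checked in preprocess.py.
CANDIDATE_DIC_DIRS = [
    "/usr/local/lib/mecab/dic/mecab-ipadic-neologd",
    "/opt/homebrew/lib/mecab/dic/mecab-ipadic-neologd",
    "/usr/lib/mecab/dic/mecab-ipadic-neologd",
]

def find_neologd_dir(candidates=CANDIDATE_DIC_DIRS):
    """Return the first existing dictionary directory, or None if not found."""
    for path in candidates:
        if os.path.isdir(path):
            return path
    return None

dic = find_neologd_dir()
# MeCab would then be constructed with the found path, e.g.
# MeCab.Tagger(f"-d {dic}") if dic else MeCab.Tagger()
```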
2-1-2. Stop Word Configuration
  • Relevant file: preprocess.py
  • Stop words were configured by implementing a program that reads from slothlib, a collection of Japanese stop words.
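Loading the stop words can be sketched like this (the slothlib URL shown is the commonly used one; the exact fetch in preprocess.py may differ):

```python
from urllib.request import urlopen

SLOTHLIB_URL = ("http://svn.sourceforge.jp/svnroot/slothlib/CSharp/"
                "Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt")

def parse_stopwords(text):
    """slothlib lists one stop word per line; drop blanks and whitespace."""
    return {line.strip() for line in text.splitlines() if line.strip()}

def load_stopwords(url=SLOTHLIB_URL):
    with urlopen(url) as resp:  # network access assumed
        return parse_stopwords(resp.read().decode("utf-8"))

# Offline example of the parsing step:
sample = "あそこ\nあたり\n\nあちら\n"
stopwords = parse_stopwords(sample)
```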
2-1-3. Classifier Serialization
  • Relevant files: NaiveBayes.py, Logistic.py
  • During classifier training, the trained classifier is serialized using the dill library.
  • Therefore, when an article URL is entered in the web app, the analysis of a new article can be done simply by loading the already created classifier, significantly reducing the time required for category prediction.
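The serialize-once, load-per-request pattern can be sketched as below. The project uses dill, whose dump/load calls mirror the standard-library pickle API shown here (the TrainedClassifier class is a stand-in, not the real model):

```python
import os
import pickle  # the project uses dill, which exposes the same dump/load API
import tempfile

class TrainedClassifier:
    """Stand-in for the trained Naive Bayes / Logistic Regression model."""
    def __init__(self, name):
        self.name = name
    def predict(self, text):
        return "Sports"  # placeholder prediction

clf = TrainedClassifier("nb")

# After training, serialize the classifier once...
path = os.path.join(tempfile.mkdtemp(), "clf.pkl")
with open(path, "wb") as f:
    pickle.dump(clf, f)

# ...then each web request simply loads the ready-made classifier,
# avoiding retraining and keeping category prediction fast.
with open(path, "rb") as f:
    loaded = pickle.load(f)
```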

2-2. Regarding the Naive Bayes Classifier

  • Relevant file: NaiveBayes.py
2-2-1. Implementation of Laplace Smoothing
  • To avoid the "zero-frequency problem" in Naive Bayes (where if a word in a test document was not present in a certain category during training, the probability of that category becomes 0), Laplace smoothing was implemented.
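With Laplace (add-one) smoothing, every word count is incremented by one, so no conditional probability is ever zero. A minimal sketch (variable names are illustrative, not the ones in NaiveBayes.py):

```python
def word_prob(word, category_counts, vocabulary_size):
    """P(word | category) with Laplace (add-one) smoothing.

    category_counts: {word: count} observed for one category in training.
    vocabulary_size: number of distinct words across all categories.
    """
    total = sum(category_counts.values())
    count = category_counts.get(word, 0)
    return (count + 1) / (total + vocabulary_size)

counts = {"goal": 3, "match": 1}  # toy counts for one category
V = 5                             # toy vocabulary size

p_seen = word_prob("goal", counts, V)      # (3+1)/(4+5) = 4/9
p_unseen = word_prob("keeper", counts, V)  # (0+1)/(4+5) = 1/9, not zero
```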

2-3. Regarding the Logistic Regression Classifier

  • Relevant file: Logistic.py
2-3-1. Use of Logistic Regression
  • The Naive Bayes method assumes that the occurrences of each word are mutually independent and multiplies their conditional probabilities. This makes it susceptible to the "zero-frequency problem" mentioned earlier, where results can be heavily influenced by words not present during training.
  • Therefore, a Logistic Regression model was used, which is widely used for category classification. This model performs calculations focusing only on words present during training (feature words not present during training have an input of 0 to the logistic function and thus have no practical effect).
2-3-2. TfidfVectorizer Configuration
  • TF-IDF was calculated for each word to apply appropriate weighting to each word.
2-3-3. Setting class_weight in the LogisticRegression model
  • Due to a significant difference in the number of samples for each category, the recall for the "Funny" category was a very low 0.15 with the Naive Bayes classifier. This is because the number of samples for the "Funny" category is much smaller than for other categories, leading to a high number of false negatives, as articles that are actually in the "Funny" category are often predicted to be in other categories.
  • To mitigate this bias in category prediction due to sample size, the class_weight parameter of the LogisticRegression model was set to "balanced". This makes the weight of feature words in each category inversely proportional to the number of samples, which improved the recall for the "Funny" category by more than 50 percentage points.
  • Undersampling the data to equalize the number of data points for each category was also considered, but this would have drastically reduced the total number of data points to about 4000, leading to a sharp drop in accuracy. Therefore, the method of setting class_weight was chosen this time.
2-3-4. Use of GridSearch
  • GridSearch was used to optimize the regularization parameter C of the LogisticRegression model.
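Sections 2-3-2 through 2-3-4 can be sketched together as a scikit-learn pipeline (a toy corpus stands in for the articles; the actual setup in Logistic.py may differ):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

texts = ["soccer match goal", "baseball pitcher strike", "new smartphone chip",
         "ai research model", "soccer goal keeper", "chip design ai"] * 5
labels = ["Sports", "Sports", "IT/Science",
          "IT/Science", "Sports", "IT/Science"] * 5

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),          # 2-3-2: TF-IDF weighting per word
    ("lr", LogisticRegression(
        class_weight="balanced",           # 2-3-3: counter class imbalance
        max_iter=1000)),
])

# 2-3-4: grid search over the regularization parameter C.
grid = GridSearchCV(pipe, {"lr__C": [0.1, 1.0, 10.0]}, cv=3)
grid.fit(texts, labels)
best_C = grid.best_params_["lr__C"]
```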

3. Regarding the Web App

3-1. Exception Handling

  • Relevant file: views.py
  • If a URL other than an article URL from https://gunosy.com/ is entered, the HTML structure cannot be parsed, and a Django error screen is displayed instead of the web app screen.
  • Therefore, exception handling was written in advance to display an error message on the web app screen when an inappropriate URL is entered.
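The view logic can be sketched framework-free as follows (fetch_and_parse and classify are hypothetical helpers injected for illustration; the real code is a Django view in views.py):

```python
def classify_view(url, fetch_and_parse, classify):
    """Return (ok, payload); fetch_and_parse raises on unparseable pages."""
    try:
        article_text = fetch_and_parse(url)
        return True, classify(article_text)
    except Exception:
        # Instead of Django's error screen, show a friendly in-app message.
        return False, "Please enter an article URL from https://gunosy.com/."

def failing_fetch(url):
    # Simulates a non-gunosy URL whose HTML structure cannot be parsed.
    raise ValueError("not a gunosy article URL")

# Stub helpers demonstrating both the success and the error path:
ok, result = classify_view("https://gunosy.com/articles/x",
                           lambda u: "soccer goal", lambda t: "Sports")
bad, message = classify_view("https://example.com/",
                             failing_fetch, lambda t: "Sports")
```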

4. Other

4-1. Hiding Database Credentials

  • Relevant files: settings.py (in gunosychallenge directory), database.py, pipelines.py
  • By setting environment variables in ~/.bash_profile, the exposure of the MySQL password was avoided.
