Conductor15/disease-article-crawler


Disease Article Crawler

A Python web crawler that collects disease-related articles from the website:

https://suckhoedoisong.vn

The crawler extracts:

  • Disease categories
  • Disease article lists
  • Article details including:
    • Title
    • Published date
    • Sapo (summary)
    • Full article content

The collected data is stored in JSON format.
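As a rough illustration of the JSON storage step, the sketch below writes and reads a list of records. The function names `save_json` and `load_json` are hypothetical; the actual helpers in `src/utils/file_utils.py` are not shown in this README and may differ.

```python
import json
from pathlib import Path


def save_json(records, path):
    """Write records to path as UTF-8 JSON, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Vietnamese text readable instead of \uXXXX escapes
        json.dump(records, f, ensure_ascii=False, indent=2)


def load_json(path):
    """Read a JSON file written by save_json."""
    with Path(path).open("r", encoding="utf-8") as f:
        return json.load(f)
```

Passing `ensure_ascii=False` is the main design choice here: without it, titles and disease names would be stored as escape sequences, which is valid JSON but hard to inspect by eye.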


Project Structure

    CRAWLING/
    │
    ├── data/                             # Output JSON files
    │   ├── diseases.json
    │   └── all_posts.json
    │
    ├── src/
    │   ├── crawler/
    │   │   ├── list_crawler.py           # Crawl disease list
    │   │   └── detail_disease_crawler.py # Crawl article details
    │   │
    │   ├── utils/
    │   │   ├── convert_to_datetype.py
    │   │   └── file_utils.py
    │   │
    │   └── driver.py                     # Selenium driver setup
    │
    ├── main.py                           # Main entry point
    ├── requirements.txt
    └── README.md

Installation

Clone the repository:

git clone <repo-url>
cd CRAWLING

Create a virtual environment:

python -m venv .venv

Activate the environment:

Windows:

.venv\Scripts\activate

Linux / macOS:

source .venv/bin/activate

Install dependencies:

pip install -r requirements.txt

Usage

Crawl the disease list

First, collect the list of diseases and their corresponding article pages.

Edit main.py to run:

if __name__ == "__main__":
    crawl_list()

Then run:

python main.py

This step will:

  • Visit the disease lookup page

  • Extract all disease names and their links

  • Save the results to data/diseases.json
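The extraction step above can be sketched without a browser. The project itself uses Selenium (see `src/driver.py`); the example below swaps in the standard-library `html.parser` so it runs on a plain HTML string. The markup, the `name` and `link` keys, and the sample path are assumptions for illustration, not the site's real structure.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://suckhoedoisong.vn"


class DiseaseLinkParser(HTMLParser):
    """Collect a name/link pair for every <a href=...> element in the page."""

    def __init__(self):
        super().__init__()
        self.diseases = []
        self._href = None   # href of the <a> currently being read, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            if name:
                # Resolve relative links against the site root
                link = urljoin(BASE_URL, self._href)
                self.diseases.append({"name": name, "link": link})
            self._href = None


def parse_disease_links(html):
    """Return a list of {"name", "link"} dicts found in the HTML string."""
    parser = DiseaseLinkParser()
    parser.feed(html)
    return parser.diseases
```

In a Selenium-based crawler the same function could be fed `driver.page_source` once the lookup page has loaded, and the returned list written to data/diseases.json.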

Crawl disease articles by character

After collecting the disease list, you can crawl detailed articles for diseases starting with a specific character.

Update the entry point in main.py to run the crawler for a specific character:

if __name__ == "__main__":
    crawl_data_by_char("A")  # Example: "A", "B", "C", ...

Run again:

python main.py

This step will:

  • Load disease data from data/diseases.json

  • Select diseases whose group character matches the specified letter (e.g., "A")

  • Visit each disease page

  • Extract article details: Title, Published date, Sapo (summary), Article content

  • Save results to data/A_posts.json
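The selection step can be sketched as a simple filter. The README does not show the layout of data/diseases.json, so the `group` key below is an assumption, and `select_by_char` and `output_path_for` are hypothetical helper names.

```python
def select_by_char(diseases, char):
    """Keep only diseases whose (assumed) "group" field equals the given character."""
    char = char.upper()
    return [d for d in diseases if str(d.get("group", "")).upper() == char]


def output_path_for(char):
    """Build the output file name used for one character, e.g. data/A_posts.json."""
    return f"data/{char.upper()}_posts.json"
```

Comparing upper-cased values makes `crawl_data_by_char("a")` and `crawl_data_by_char("A")` behave the same way, which may or may not match the real implementation.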

Handling missing pages

Some article pages may fail due to:

  • Website redirects

  • Bot protection

  • Temporary loading issues

The crawler records failed disease links and prints them at the end of execution.

Example:

['https://suckhoedoisong.vn/...']

These links can be retried later.
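One possible way to retry the printed links is a small loop with a growing pause between rounds. This is a sketch only: `retry_failed` is a hypothetical helper, and `fetch` stands for whatever function crawls a single disease link and raises an exception on failure.

```python
import time


def retry_failed(links, fetch, attempts=3, delay=2.0):
    """Retry each link up to `attempts` rounds; return (results, still_failed)."""
    results = []
    still_failed = list(links)
    for attempt in range(attempts):
        if not still_failed:
            break
        if attempt > 0:
            # Wait longer before each new round to ease off the server
            time.sleep(delay * attempt)
        pending, still_failed = still_failed, []
        for link in pending:
            try:
                results.append(fetch(link))
            except Exception:
                # Keep the link for the next round instead of aborting the run
                still_failed.append(link)
    return results, still_failed
```

Links that are still failing after the last round are returned rather than dropped, so they can be saved and inspected by hand.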

Crawl all disease articles

After the disease list is collected, crawl the detailed articles for each disease.

Update main.py:

if __name__ == "__main__":
    crawl_all_detail()

Run again:

python main.py

This step will:

  • Load disease data from data/diseases.json

  • Visit each disease page

  • Extract article details: Title, Published date, Sapo (summary), Article content

  • Save results to data/all_posts.json
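The steps above amount to a loop that gathers articles and remembers which links failed. The sketch below assumes each entry in data/diseases.json has `name` and `link` keys and that `fetch_articles` (a hypothetical callable) returns the articles for one disease page; neither is confirmed by this README.

```python
def crawl_all(diseases, fetch_articles):
    """Crawl every disease page; return (posts, failed_links)."""
    posts, failed = [], []
    for disease in diseases:
        try:
            # Materialise the result first so a mid-page error adds nothing partial
            articles = list(fetch_articles(disease["link"]))
        except Exception:
            failed.append(disease["link"])
            continue
        for article in articles:
            # Tag each article with the disease it was found under
            posts.append({**article, "disease": disease["name"]})
    return posts, failed
```

Catching the exception per disease means one blocked or redirected page does not stop the rest of the crawl.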

Handling missing pages

Failures are handled the same way as in the per-character crawl: pages may fail because of website redirects, bot protection, or temporary loading issues. The crawler records the failed disease links and prints them at the end of execution so they can be retried later.
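Editing main.py between runs works, but a small command-line wrapper is one alternative. The sketch below is not part of the project: the subcommand names are invented, and the three crawl functions are replaced by placeholder stubs so the example runs on its own.

```python
import argparse


# Placeholder stubs standing in for the project's real functions
def crawl_list():
    return "crawl_list"


def crawl_data_by_char(char):
    return f"crawl_data_by_char({char})"


def crawl_all_detail():
    return "crawl_all_detail"


def build_parser():
    parser = argparse.ArgumentParser(description="Disease article crawler")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("list", help="crawl the disease list")
    by_char = sub.add_parser("char", help="crawl articles for one character")
    by_char.add_argument("letter", help='group character, e.g. "A"')
    sub.add_parser("all", help="crawl articles for every disease")
    return parser


def main(argv=None):
    """Dispatch to the crawl function selected on the command line."""
    args = build_parser().parse_args(argv)
    if args.command == "list":
        return crawl_list()
    if args.command == "char":
        return crawl_data_by_char(args.letter)
    return crawl_all_detail()
```

With `main()` called under the usual `if __name__ == "__main__":` guard, `python main.py char A` would then replace the manual edit shown earlier.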

Output Example

{
  "title": "Chế độ ăn uống phù hợp với người mắc bệnh ấu trùng sán lợn",
  "published": "2024-08-17",
  "sapo": "...",
  "content": "...",
  "disease": "Ấu trùng sán lợn"
}
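A quick sanity check on the output can catch incomplete records early. The sketch below takes the five keys from the example above as the expected schema and assumes `published` is always an ISO date string, as in that example; `validate_post` and `count_by_disease` are hypothetical helper names.

```python
from collections import Counter
from datetime import date

REQUIRED_KEYS = {"title", "published", "sapo", "content", "disease"}


def validate_post(post):
    """Raise ValueError if a record lacks a key or has a malformed date."""
    missing = REQUIRED_KEYS - post.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    # fromisoformat raises ValueError for strings that are not YYYY-MM-DD
    date.fromisoformat(post["published"])
    return post


def count_by_disease(posts):
    """Count how many articles were collected for each disease."""
    return Counter(p["disease"] for p in posts)
```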

About

Python crawler that collects disease-related articles from the Suckhoedoisong website using Selenium.
