Disease Article Crawler

A Python web crawler that collects disease-related articles from the website https://suckhoedoisong.vn/.

The crawler extracts:

- Disease categories
- Disease article lists
- Article details, including:
  - Title
  - Published date
  - Sapo (summary)
  - Full article content

The collected data is stored in JSON format.

Project Structure

    CRAWLING/
    │
    ├── data/ # Output JSON files
    │ ├── diseases.json
    │ └── all_posts.json
    │
    ├── src/
    │ ├── crawler/
    │ │ ├── list_crawler.py # Crawl disease list
    │ │ └── detail_disease_crawler.py # Crawl article details
    │ │
    │ ├── utils/
    │ │ ├── convert_to_datetype.py
    │ │ └── file_utils.py
    │ │
    │ └── driver.py # Selenium driver setup
    │
    ├── main.py # Main entry point
    ├── requirements.txt
    └── README.md

Installation

Clone the repository:

git clone <repo-url>
cd CRAWLING

Create a virtual environment:

python -m venv .venv

Activate the environment:

Windows:

.venv\Scripts\activate

Linux / Mac:

source .venv/bin/activate

Install dependencies:

pip install -r requirements.txt

Usage

Crawl the disease list

First, collect the list of diseases and their corresponding article pages.

Edit main.py so that its entry point calls crawl_list():

if __name__ == "__main__":
    crawl_list()

Then run:

python main.py

This step will:

- Visit the disease lookup page
- Extract all disease names and their links
- Save the results to data/diseases.json
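The save step above could look roughly like the following sketch. It is an illustration only, not the project's actual file_utils.py: the helper names save_json and load_json and the record fields name and link are assumptions. Passing ensure_ascii=False is a common choice so that Vietnamese disease names stay readable in the output file.

```python
import json
from pathlib import Path


def save_json(records, path):
    """Write records to path as UTF-8 JSON, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII text as-is instead of \uXXXX escapes
        json.dump(records, f, ensure_ascii=False, indent=2)


def load_json(path):
    """Read a JSON file previously written by save_json."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)
```

With helpers like these, the list crawler would only need to build a list of dictionaries and call save_json(records, "data/diseases.json").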

Crawl disease articles by char

After collecting the disease list, you can crawl detailed articles for diseases starting with a specific character.

In main.py, change the entry point to run the crawler for a specific character:

if __name__ == "__main__":
    crawl_data_by_char("A")  # Example: "A", "B", "C", ...

Run again:

python main.py

This step will:

- Load disease data from data/diseases.json
- Select diseases whose group character matches the specified letter (e.g., "A")
- Visit each disease page
- Extract article details: title, published date, sapo (summary), article content
- Save results to data/A_posts.json
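The selection and output-naming steps above can be sketched as follows. The structure of data/diseases.json is not documented here, so the "group" field and both function names are hypothetical and only illustrate the idea.

```python
def filter_by_char(diseases, char):
    """Return the diseases whose (assumed) "group" field matches char.

    The comparison ignores case, so "a" and "A" select the same group.
    """
    wanted = char.strip().upper()
    return [d for d in diseases if str(d.get("group", "")).upper() == wanted]


def output_path_for(char):
    """Build the per-character output path, e.g. data/A_posts.json."""
    return f"data/{char.strip().upper()}_posts.json"
```

Keeping one output file per character means an interrupted run only affects the letter that was being crawled.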

Handling missing pages

Some article pages may fail due to:

- Website redirects
- Bot protection
- Temporary loading issues

The crawler records failed disease links and prints them at the end of execution.

Example:

['https://suckhoedoisong.vn/...']

These links can be retried later.
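One way to retry the printed links is a small loop with a growing delay between attempts. This is a sketch under assumptions, not part of the project: retry_failed is a hypothetical helper, and fetch stands for whatever function crawls a single link and raises an exception on failure.

```python
import time


def retry_failed(links, fetch, attempts=3, delay=2.0):
    """Retry each link up to `attempts` times.

    Returns (results, still_failed): the values returned by successful
    fetch calls, and the links that failed on every attempt.
    """
    results, still_failed = [], []
    for link in links:
        for attempt in range(1, attempts + 1):
            try:
                results.append(fetch(link))
                break  # success: stop retrying this link
            except Exception:
                if attempt < attempts:
                    # wait a little longer after each failed attempt
                    time.sleep(delay * attempt)
        else:
            # the inner loop never hit `break`, so every attempt failed
            still_failed.append(link)
    return results, still_failed
```

Waiting longer between attempts can help with temporary loading issues, but it will not get past redirects or bot protection, so some links may remain in still_failed.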

Crawl all disease articles

After the disease list is collected, crawl the detailed articles for each disease.

Update main.py:

if __name__ == "__main__":
    crawl_all_detail()

Run again:

python main.py

This step will:

- Load disease data from data/diseases.json
- Visit each disease page
- Extract article details: title, published date, sapo (summary), article content
- Save results to data/all_posts.json
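The overall loop for this step might be organized like the sketch below. It is not the project's crawl_all_detail implementation: crawl_details and fetch_posts are hypothetical names, and the name and link fields of each disease record are assumed. The page fetching is passed in as a function so the loop itself stays independent of Selenium.

```python
def crawl_details(diseases, fetch_posts):
    """Collect posts for every disease and remember links that failed.

    fetch_posts(link) is expected to return a list of post dictionaries
    and to raise an exception when the page cannot be crawled.
    """
    all_posts, failed_links = [], []
    for disease in diseases:
        link = disease["link"]
        try:
            posts = fetch_posts(link)
        except Exception:
            # keep going; the failed link can be reported at the end
            failed_links.append(link)
            continue
        for post in posts:
            # tag each post with its disease name, as in the "disease" output field
            all_posts.append({**post, "disease": disease["name"]})
    return all_posts, failed_links
```

Returning the failed links alongside the posts lets the caller save the successful results first and print the failures afterwards.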

Handling missing pages

Some article pages may fail due to:

- Website redirects
- Bot protection
- Temporary loading issues

The crawler records failed disease links and prints them at the end of execution.

Example:

['https://suckhoedoisong.vn/...']

These links can be retried later.

Output Example

{
  "title": "Chế độ ăn uống phù hợp với người mắc bệnh ấu trùng sán lợn",
  "published": "2024-08-17",
  "sapo": "...",
  "content": "...",
  "disease": "Ấu trùng sán lợn"
}
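When consuming the output, a record like the one above can be checked and its date converted with the standard library. This is a minimal sketch: parse_post and REQUIRED_FIELDS are hypothetical names, and it assumes that every record carries the five fields shown and that published is always an ISO date (YYYY-MM-DD) as in the example.

```python
from datetime import date

# Field names taken from the output example above
REQUIRED_FIELDS = ("title", "published", "sapo", "content", "disease")


def parse_post(post):
    """Check that a post has all expected fields and parse its date.

    Returns a copy of the post with "published" as a datetime.date.
    Raises ValueError if a field is missing or the date is not ISO formatted.
    """
    missing = [field for field in REQUIRED_FIELDS if field not in post]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return {**post, "published": date.fromisoformat(post["published"])}
```

Parsing the date up front makes it easy to sort or filter posts by publication date later.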
