A Python web crawler that collects disease-related articles from the suckhoedoisong.vn website.
The crawler extracts:
- Disease categories
- Disease article lists
- Article details including:
  - Title
  - Published date
  - Sapo (summary)
  - Full article content
The collected data is stored in JSON format.
CRAWLING/
│
├── data/ # Output JSON files
│ ├── diseases.json
│ └── all_posts.json
│
├── src/
│ ├── crawler/
│ │ ├── list_crawler.py # Crawl disease list
│ │ └── detail_disease_crawler.py # Crawl article details
│ │
│ ├── utils/
│ │ ├── convert_to_datetype.py
│ │ └── file_utils.py
│ │
│ └── driver.py # Selenium driver setup
│
├── main.py # Main entry point
├── requirements.txt
└── README.md
Clone the repository:
git clone <repo-url>
cd CRAWLING
Create virtual environment:
python -m venv .venv
Activate environment:
Windows:
.venv\Scripts\activate
Linux / Mac:
source .venv/bin/activate
Install dependencies:
pip install -r requirements.txt
First, collect the list of diseases and their corresponding article pages.
Edit main.py to run:
if __name__ == "__main__":
    crawl_list()
Then run:
python main.py
This step will:
- Visit the disease lookup page
- Extract all disease names and their links
- Save the results to data/diseases.json
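The extraction step above can be sketched as follows. This is a minimal illustration, not the project's actual code: it swaps in the standard-library HTML parser instead of Selenium so it runs anywhere, and the sample markup, the relative paths, and the `name`/`link` keys are assumptions made for the example. Only the site root comes from the example links in this README.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Site root, taken from the example links in this README.
BASE_URL = "https://suckhoedoisong.vn/"


class DiseaseLinkParser(HTMLParser):
    """Collect a {"name", "link"} record for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.diseases = []
        self._href = None   # href of the <a> currently being read, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            if name:
                # Resolve relative links against the site root.
                link = urljoin(BASE_URL, self._href)
                self.diseases.append({"name": name, "link": link})
            self._href = None


def extract_diseases(html):
    """Return the list of disease records found in an HTML string."""
    parser = DiseaseLinkParser()
    parser.feed(html)
    return parser.diseases


# Hypothetical markup; the real lookup page is likely structured differently.
SAMPLE_HTML = (
    '<ul><li><a href="/benh-a.htm">Bệnh A</a></li>'
    '<li><a href="/benh-b.htm"> Bệnh B </a></li></ul>'
)
```

With Selenium, the same function could be fed `driver.page_source` once the page has loaded, although the real crawler may well locate elements through the driver directly.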
After collecting the disease list, you can crawl detailed articles for diseases starting with a specific character.
Update the entry point in main.py to run the crawler for a specific character:
if __name__ == "__main__":
    crawl_data_by_char("A")  # Example: "A", "B", "C", ...
Run again:
python main.py
This step will:
- Load disease data from data/diseases.json
- Select diseases whose group character matches the specified letter (e.g., "A")
- Visit each disease page
- Extract article details: title, published date, sapo (summary), and article content
- Save results to data/A_posts.json
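The selection by group character might look like the sketch below. The layout of data/diseases.json is not documented here, so the `group`, `name`, and `link` keys are assumptions, and `select_by_char` and `output_path` are hypothetical helper names rather than functions from the repository.

```python
def select_by_char(diseases, char):
    """Keep only the records whose (assumed) "group" field matches char."""
    wanted = char.strip().upper()
    return [d for d in diseases if str(d.get("group", "")).upper() == wanted]


def output_path(char):
    """Build the output file name for a letter, e.g. data/A_posts.json."""
    return f"data/{char.strip().upper()}_posts.json"


# Hypothetical records standing in for the contents of data/diseases.json.
SAMPLE_DISEASES = [
    {"group": "A", "name": "Ấu trùng sán lợn", "link": "https://suckhoedoisong.vn/..."},
    {"group": "B", "name": "Bệnh B", "link": "https://suckhoedoisong.vn/..."},
]
```

Normalizing the letter with `upper()` means `crawl_data_by_char("a")` and `crawl_data_by_char("A")` would select the same records, assuming the stored group characters are single letters.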
Handling missing pages
Some article pages may fail due to:
- Website redirects
- Bot protection
- Temporary loading issues
The crawler records failed disease links and prints them at the end of execution.
Example:
['https://suckhoedoisong.vn/...']
These links can be retried later.
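One way to retry the failed links is a small loop with exponential backoff. This is a sketch under assumptions: `retry_failed` is a hypothetical helper, and `fetch` stands for whatever function the crawler uses to load and parse a single page (it is expected to raise an exception on failure).

```python
import time


def retry_failed(links, fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Retry each link up to `attempts` times, doubling the wait each time.

    Returns (recovered, still_failed): a dict mapping link -> fetch result,
    and a list of the links that failed on every attempt.
    """
    recovered, still_failed = {}, []
    for link in links:
        for attempt in range(attempts):
            try:
                recovered[link] = fetch(link)
                break
            except Exception:
                # Wait before the next try, but not after the last one.
                if attempt < attempts - 1:
                    sleep(base_delay * 2 ** attempt)
        else:
            # The inner loop never hit `break`: every attempt failed.
            still_failed.append(link)
    return recovered, still_failed
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays. Backoff helps with temporary loading issues; redirects and bot protection may need a different fix.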
After the disease list is collected, crawl the detailed articles for each disease.
Update main.py:
if __name__ == "__main__":
    crawl_all_detail()
Run again:
python main.py
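Editing main.py between steps works, but the three entry points could also be selected from the command line. The sketch below is one possible arrangement, not how the repository is currently set up: the three crawl functions are stand-in placeholders (their real import paths are not shown in this README), and the subcommand names `list`, `char`, and `all` are invented for the example.

```python
import argparse


# Placeholders standing in for the project's real functions.
def crawl_list():
    return "crawl_list()"


def crawl_data_by_char(char):
    return f"crawl_data_by_char({char!r})"


def crawl_all_detail():
    return "crawl_all_detail()"


def build_parser():
    parser = argparse.ArgumentParser(description="Disease article crawler")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("list", help="collect the disease list")
    by_char = sub.add_parser("char", help="crawl diseases for one letter")
    by_char.add_argument("letter", help='group character, e.g. "A"')
    sub.add_parser("all", help="crawl articles for every disease")
    return parser


def main(argv=None):
    """Dispatch to one crawl step based on the chosen subcommand."""
    args = build_parser().parse_args(argv)
    if args.command == "list":
        return crawl_list()
    if args.command == "char":
        return crawl_data_by_char(args.letter)
    return crawl_all_detail()
```

If main.py called `main()` under its `if __name__ == "__main__":` guard, the steps would run as `python main.py list`, `python main.py char A`, and `python main.py all`.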
This step will:
- Load disease data from data/diseases.json
- Visit each disease page
- Extract article details: title, published date, sapo (summary), and article content
- Save results to data/all_posts.json
Handling missing pages
Some article pages may fail due to:
- Website redirects
- Bot protection
- Temporary loading issues
The crawler records failed disease links and prints them at the end of execution.
Example:
['https://suckhoedoisong.vn/...']
These links can be retried later.
{
  "title": "Chế độ ăn uống phù hợp với người mắc bệnh ấu trùng sán lợn",
  "published": "2024-08-17",
  "sapo": "...",
  "content": "...",
  "disease": "Ấu trùng sán lợn"
}
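Records like the one above contain Vietnamese text, so how the JSON is written matters. The sketch below shows one way to save and reload such records with the standard library. `save_posts` and `load_posts` are hypothetical names, and the actual helpers in src/utils/file_utils.py may differ.

```python
import json
from pathlib import Path


def save_posts(posts, path):
    """Write a list of post records as UTF-8 JSON, creating the folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps "Ấu trùng" readable instead of \uXXXX escapes.
        json.dump(posts, f, ensure_ascii=False, indent=2)


def load_posts(path):
    """Read the records back from a JSON file written by save_posts."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


SAMPLE_POST = {
    "title": "Chế độ ăn uống phù hợp với người mắc bệnh ấu trùng sán lợn",
    "published": "2024-08-17",
    "sapo": "...",
    "content": "...",
    "disease": "Ấu trùng sán lợn",
}
```

Setting the encoding explicitly avoids platform-dependent defaults, which matters on Windows, where the default file encoding is often not UTF-8.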