This Python-based web crawler explores web pages up to a specified depth, extracting information and recording results in an output file. It utilizes asynchronous methods for efficient crawling.
- **WebCrawler Class**: The core component responsible for managing the crawling process. It tracks visited URLs, fetches web pages asynchronously, extracts links, and calculates same-domain ratios.
- **Utils Module**: Contains utility functions used by the crawler. The `time_calculator` decorator measures the time taken by functions, and `get_random_float` generates random floats within a specified range.
- **Config Module**: Holds configuration parameters for the crawler, such as the default URL protocol, output file name, retry count, backoff time, etc.
- **Log Module**: Implements a logging handler (`LogHandler`) and provides a logger object (`LOGGER`) for consistent and structured logging throughout the application.
- **Main Script (`crawler.py`)**: The entry point of the application. It initializes the crawler, sets up the necessary configurations, and starts the crawling process.
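As an illustration, a timing decorator like `time_calculator` could be written as follows. This is a minimal sketch, not the actual implementation from the Utils module; `slow_add` is a made-up demo function.

```python
import functools
import time

def time_calculator(func):
    """Measure and report how long a function call takes (illustrative sketch)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.4f}s")
        return result
    return wrapper

@time_calculator
def slow_add(a, b):
    # stand-in for real work, just to have something measurable
    time.sleep(0.1)
    return a + b
```

Note that the crawler's page-fetching functions are coroutines, so the real decorator would also need an `async` code path (an `async def wrapper` that awaits the function).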
- **Asynchronous Crawling**: Utilizes `aiohttp` and `asyncio` for asynchronous web page crawling.
- **Depth Control**: Allows crawling up to a specified depth.
- **Same Domain Ratio**: Calculates the ratio of same-domain links for each page.
- **Retry Mechanism**: Implements retries with exponential backoff for robustness against network issues.
- **Logging**: Uses Python's `logging` module for log handling.
- **Configurability**: Easily configurable via `config.py` for parameters like protocol, output file, retries, and more.
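The same-domain ratio can be computed by comparing each extracted link's host against the page's host. The sketch below uses only the standard library; the function name and the sample links are illustrative, not the crawler's actual API.

```python
from urllib.parse import urlparse

def same_domain_ratio(page_url, links):
    """Fraction of links whose network location matches the page's (sketch)."""
    if not links:
        return 0.0
    page_host = urlparse(page_url).netloc
    same = sum(1 for link in links if urlparse(link).netloc == page_host)
    return same / len(links)

links = [
    "https://example.com/about",
    "https://example.com/contact",
    "https://other.org/page",
]
ratio = same_domain_ratio("https://example.com", links)  # 2 of 3 links match
```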
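Retries with exponential backoff can be sketched generically as below. The real crawler presumably wraps an `aiohttp` request here; `fetch_with_retries` and the flaky stub fetch are illustrative names, and the backoff values are shortened for the demo.

```python
import asyncio

async def fetch_with_retries(fetch, url, retries=3, backoff=0.05):
    """Retry an async fetch with exponential backoff (illustrative sketch)."""
    for attempt in range(retries):
        try:
            return await fetch(url)
        except OSError:
            if attempt == retries - 1:
                raise  # out of retries: propagate the error
            # delay doubles each attempt: backoff, 2*backoff, 4*backoff, ...
            await asyncio.sleep(backoff * 2 ** attempt)

# demo with a flaky stub fetch that fails twice, then succeeds
calls = {"n": 0}
async def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("temporary network error")
    return f"<html>{url}</html>"

result = asyncio.run(fetch_with_retries(flaky_fetch, "https://example.com"))
```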
- Create a virtual environment:

  ```shell
  python -m venv venv
  ```

- Activate the virtual environment:

  ```shell
  source venv/bin/activate
  ```

- Install dependencies:

  ```shell
  pip install -r requirements.txt
  ```

- Run the crawler:

  ```shell
  python crawler.py <URL> <depth>
  ```

  Example:

  ```shell
  python crawler.py https://example.com 3
  ```
- `<URL>`: Starting URL for crawling.
- `<depth>`: Depth to which the crawler should explore.
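Handling these two arguments could look like the sketch below; `parse_args` is an illustrative helper, and `crawler.py`'s actual parsing may differ (e.g., it may use `argparse`).

```python
import sys

def parse_args(argv):
    """Parse <URL> and <depth> from the command line (illustrative sketch)."""
    if len(argv) != 3:
        raise SystemExit("usage: python crawler.py <URL> <depth>")
    url, depth = argv[1], int(argv[2])
    if depth < 0:
        raise SystemExit("depth must be non-negative")
    return url, depth

# mimics: python crawler.py https://example.com 3
url, depth = parse_args(["crawler.py", "https://example.com", "3"])
```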
Results will be written to `output.tsv` by default.
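Writing TSV output needs nothing beyond the standard `csv` module with a tab delimiter. The column names and sample rows below are assumptions for illustration, not the crawler's actual schema.

```python
import csv

# (url, depth, same_domain_ratio) — assumed columns, for illustration only
rows = [
    ("https://example.com", 0, 0.67),
    ("https://example.com/about", 1, 0.5),
]

with open("output.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["url", "depth", "same_domain_ratio"])
    writer.writerows(rows)
```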
Modify `config.py` to adjust parameters like URL protocol, output file name, retry count, and backoff time.
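A `config.py` along these lines might look like the fragment below; the constant names and default values are assumptions based on the parameters listed above, not the module's actual contents.

```python
# config.py — illustrative defaults; actual names and values may differ
DEFAULT_PROTOCOL = "https"
OUTPUT_FILE = "output.tsv"
RETRY_COUNT = 3
BACKOFF_TIME = 1.0  # seconds; grows exponentially on each retry
```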
Logs are written to `app.log` by default. Adjust logging settings in `log.py` as needed.
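A file-based logger like the `LOGGER` object described above could be set up as follows. This is a sketch assuming `app.log` as the target; the handler and format in `log.py` may differ.

```python
import logging

def build_logger(name="crawler", log_file="app.log"):
    """Create a logger that writes structured lines to a file (sketch)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.FileHandler(log_file, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger

LOGGER = build_logger()
LOGGER.info("crawler started")
```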
`black` and `pylint` are used for formatting and linting.