Content Scraper API

A high-performance FastAPI microservice that extracts content from web articles using newspaper3k, providing structured data through a RESTful API.

Features

Article Extraction: Extracts comprehensive article data including:
  - Title and main content
  - Authors and publication date
  - Images and videos
  - Meta information (keywords, description, language)
  - Additional metadata
Clean Architecture: Modular design with clear separation of concerns
FastAPI Framework: High performance, automatic OpenAPI documentation
Error Handling: Robust error handling for various failure scenarios
Type Safety: Full type hints and Pydantic models for request/response validation

Prerequisites

Python 3.8 or higher
pip package manager

Installation

Clone the repository:

git clone https://github.com/mr3od/content-scraper-api.git
cd content-scraper-api

Install dependencies:

pip install -r requirements.txt

Usage

Starting the server

python main.py

Or using uvicorn directly (the --reload flag restarts the server when code changes and is intended for development):

uvicorn main:app --reload

The API will be available at http://localhost:8000

API Endpoints

POST /fetch-article

Fetches and parses an article from the provided URL.

Request:

{
  "url": "https://example.com/news/article"
}

Response:

{
  "url": "https://example.com/news/article",
  "title": "Example Article Title",
  "content": "Article content text...",
  "top_image": "https://example.com/images/top.jpg",
  "authors": ["Author Name"],
  "images": [
    "https://example.com/images/1.jpg",
    "https://example.com/images/2.jpg"
  ],
  "movies": ["https://example.com/videos/1.mp4"]
}
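
The request and response above could be exercised from Python roughly as follows. This is a minimal sketch using only the standard library; the helper names `build_fetch_request` and `fetch_article` are illustrative and not part of the project, and the base URL assumes the default local server address given earlier.

```python
import json
import urllib.request

# Assumption: the server runs locally on the default address from this README.
BASE_URL = "http://localhost:8000"


def build_fetch_request(article_url: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /fetch-article request without sending it."""
    body = json.dumps({"url": article_url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/fetch-article",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_article(article_url: str, base_url: str = BASE_URL, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    request = build_fetch_request(article_url, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `fetch_article("https://example.com/news/article")` should return a dictionary with the keys shown in the response example, such as `title`, `content`, and `authors`.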

API Documentation

FastAPI automatically generates interactive API documentation:

Swagger UI: http://localhost:8000/docs
ReDoc: http://localhost:8000/redoc

Architecture

The application follows a clean architecture approach with the following components:

API Layer: Handles HTTP requests and responses
Service Layer: Contains the business logic for article extraction
Core: Configuration and shared utilities
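
The split between the API layer and the service layer might look something like the sketch below. It is not the project's actual code: a dataclass stands in for the Pydantic response model so the example runs with the standard library alone, and `ArticleResponse`, `to_response`, and `handle_fetch_article` are hypothetical names. The `parsed` object is assumed to expose the attributes that a newspaper3k `Article` provides after `download()` and `parse()` (`title`, `text`, `top_image`, `authors`, `images`, `movies`).

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Callable, List


@dataclass
class ArticleResponse:
    """Stand-in for the response model (the project itself uses Pydantic)."""
    url: str
    title: str
    content: str
    top_image: str
    authors: List[str] = field(default_factory=list)
    images: List[str] = field(default_factory=list)
    movies: List[str] = field(default_factory=list)


def to_response(url: str, parsed: Any) -> ArticleResponse:
    """Service layer: map a parsed article object onto the response shape."""
    return ArticleResponse(
        url=url,
        title=parsed.title,
        content=parsed.text,
        top_image=parsed.top_image,
        authors=list(parsed.authors),
        # Sorting gives a stable order whether images arrive as a list or a set.
        images=sorted(parsed.images),
        movies=list(parsed.movies),
    )


def handle_fetch_article(payload: dict, parse: Callable[[str], Any]) -> dict:
    """API layer: read the request body, call the service, return JSON-ready data."""
    url = payload["url"]
    return asdict(to_response(url, parse(url)))
```

Passing the parser in as a callable keeps the HTTP handling independent of the extraction library, so the handler can be tested with a fake article object.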

Error Handling

The API handles various error scenarios:

Invalid URLs
Unreachable sites
Parsing failures
Server errors
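
One way these scenarios could be mapped to HTTP responses is sketched below. The README does not say which status codes the API returns, so the codes, the exception classes, and the helper names here are all assumptions for illustration. The `{"detail": ...}` body mirrors the shape FastAPI uses for its own error responses.

```python
from typing import Tuple
from urllib.parse import urlparse


class ScraperError(Exception):
    """Hypothetical base error carrying the HTTP status to return."""
    status_code = 500


class InvalidUrlError(ScraperError):
    status_code = 422  # assumed: the request body failed validation


class UnreachableSiteError(ScraperError):
    status_code = 502  # assumed: the upstream site could not be fetched


class ParsingError(ScraperError):
    status_code = 500  # assumed: the page was fetched but extraction failed


def validate_url(url: str) -> str:
    """Reject anything that is not an absolute http(s) URL."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise InvalidUrlError(f"not an absolute http(s) URL: {url!r}")
    return url


def to_error_response(exc: Exception) -> Tuple[int, dict]:
    """Turn an exception into a (status, body) pair; unknown errors become 500."""
    status = exc.status_code if isinstance(exc, ScraperError) else 500
    return status, {"detail": str(exc) or exc.__class__.__name__}
```

Keeping the status code on the exception class means the API layer only needs a single handler to cover every failure scenario.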

Extending the API

The modular architecture makes it easy to extend the API:

Add new endpoints in api/routes.py
Add new services in the services package
Modify the data models in api/models.py

License

MIT
