A FastAPI RESTful API for extracting clean content and metadata from web articles using newspaper3k.

mr3od/content-scraper-api

Content Scraper API

A high-performance FastAPI microservice that extracts content from web articles using newspaper3k, providing structured data through a RESTful API.

Features

  • Article Extraction: Extracts comprehensive article data including:
    • Title and main content
    • Authors and publication date
    • Images and videos
    • Meta information (keywords, description, language)
    • Additional metadata
  • Clean Architecture: Modular design with clear separation of concerns
  • FastAPI Framework: High performance, automatic OpenAPI documentation
  • Error Handling: Robust error handling for various failure scenarios
  • Type Safety: Full type hints and Pydantic models for request/response validation

Prerequisites

  • Python 3.8 or higher
  • pip package manager

Installation

  1. Clone the repository:
git clone https://github.com/mr3od/content-scraper-api.git
cd content-scraper-api
  2. Install dependencies:
pip install -r requirements.txt

Usage

Starting the server

python main.py

Or using uvicorn directly:

uvicorn main:app --reload

The API will be available at http://localhost:8000

API Endpoints

POST /fetch-article

Fetches and parses an article from the provided URL.

Request:

{
  "url": "https://example.com/news/article"
}

Response:

{
  "url": "https://example.com/news/article",
  "title": "Example Article Title",
  "content": "Article content text...",
  "top_image": "https://example.com/images/top.jpg",
  "authors": ["Author Name"],
  "images": [
    "https://example.com/images/1.jpg",
    "https://example.com/images/2.jpg"
  ],
  "movies": ["https://example.com/videos/1.mp4"]
}
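The request and response shapes above can be exercised from a small client. The following is a minimal sketch using only the Python standard library; the names `ArticleResult`, `build_request`, `parse_response`, and `fetch_article` are hypothetical helpers for illustration and are not part of this repository. It assumes the server is running at the default address given earlier.

```python
import json
import urllib.request
from dataclasses import dataclass, field
from typing import List, Optional

API_BASE = "http://localhost:8000"  # default address from this README


@dataclass
class ArticleResult:
    """Client-side view of the /fetch-article response (hypothetical helper)."""
    url: str
    title: str
    content: str
    top_image: Optional[str] = None
    authors: List[str] = field(default_factory=list)
    images: List[str] = field(default_factory=list)
    movies: List[str] = field(default_factory=list)


def build_request(article_url: str, base: str = API_BASE) -> urllib.request.Request:
    """Build the POST /fetch-article request with a JSON body."""
    body = json.dumps({"url": article_url}).encode("utf-8")
    return urllib.request.Request(
        f"{base}/fetch-article",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_response(raw: bytes) -> ArticleResult:
    """Decode the JSON reply, ignoring any fields not listed in ArticleResult."""
    data = json.loads(raw)
    known = ArticleResult.__dataclass_fields__
    return ArticleResult(**{k: v for k, v in data.items() if k in known})


def fetch_article(article_url: str, base: str = API_BASE,
                  timeout: float = 30.0) -> ArticleResult:
    """Send the request to a running server and parse the reply."""
    request = build_request(article_url, base)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return parse_response(resp.read())
```

With the server running, `fetch_article("https://example.com/news/article").title` would return the extracted title. Unknown response fields are dropped on purpose so the client keeps working if the API adds metadata later.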

API Documentation

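FastAPI automatically generates interactive API documentation. With FastAPI's default settings, Swagger UI is served at http://localhost:8000/docs and ReDoc at http://localhost:8000/redoc.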

Architecture

The application follows a clean architecture approach with the following components:

  • API Layer: Handles HTTP requests and responses
  • Service Layer: Contains the business logic for article extraction
  • Core: Configuration and shared utilities
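To illustrate what the service layer might look like, here is a rough sketch built on newspaper3k's `Article` class (`download()`, `parse()`, and the `title`, `text`, `top_image`, `authors`, `images`, and `movies` attributes). The function names `article_to_dict` and `extract_article` are assumptions made for this example; the repository's actual service code may be organized differently.

```python
from typing import Any, Dict


def article_to_dict(url: str, article: Any) -> Dict[str, Any]:
    """Map a parsed newspaper3k Article onto the documented response fields."""
    return {
        "url": url,
        "title": article.title,
        "content": article.text,
        "top_image": article.top_image,
        "authors": list(article.authors),
        # images may be an unordered set, so sort for a stable JSON list
        "images": sorted(article.images),
        "movies": list(article.movies),
    }


def extract_article(url: str) -> Dict[str, Any]:
    """Download and parse one article (hypothetical service function)."""
    # Imported lazily so the mapping above can be tested without newspaper3k.
    from newspaper import Article

    article = Article(url)
    article.download()  # fetch the HTML over the network
    article.parse()     # extract title, text, authors, media
    return article_to_dict(url, article)
```

Keeping the field mapping separate from the network call means the API layer can be tested with a stand-in article object, which fits the separation of concerns described above.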

Error Handling

The API handles various error scenarios:

  • Invalid URLs
  • Unreachable sites
  • Parsing failures
  • Server errors
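One way to picture how these scenarios could map to HTTP responses is sketched below. The exception classes, the `validate_url` helper, and the specific status codes are all assumptions chosen for illustration; the README does not state which codes the API actually returns.

```python
from http import HTTPStatus
from urllib.parse import urlparse


class InvalidURLError(ValueError):
    """The submitted URL is not an absolute http(s) URL."""


class UnreachableSiteError(Exception):
    """The target site could not be downloaded."""


class ParsingError(Exception):
    """The page was downloaded but no article could be extracted."""


# Hypothetical mapping; the real API may use different codes.
STATUS_BY_ERROR = {
    InvalidURLError: HTTPStatus.BAD_REQUEST,           # 400
    UnreachableSiteError: HTTPStatus.BAD_GATEWAY,      # 502
    ParsingError: HTTPStatus.UNPROCESSABLE_ENTITY,     # 422
}


def validate_url(url: str) -> str:
    """Return the URL unchanged if it is absolute http(s), else raise."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise InvalidURLError(f"not an absolute http(s) URL: {url!r}")
    return url


def status_for(exc: Exception) -> HTTPStatus:
    """Pick a status code for an exception; unknown errors become 500."""
    for exc_type, status in STATUS_BY_ERROR.items():
        if isinstance(exc, exc_type):
            return status
    return HTTPStatus.INTERNAL_SERVER_ERROR
```

In a FastAPI route, the result of `status_for` could be passed as the `status_code` of an `HTTPException`, so each failure category yields a distinct, predictable response.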

Extending the API

The modular architecture makes it easy to extend the API:

  • Add new endpoints in api/routes.py
  • Add new services in the services package
  • Modify the data models in api/models.py

License

MIT
