corintxt/fpds-extract

FPDS Contract Extractor

A Python tool for extracting structured contract data from FPDS (Federal Procurement Data System) PDF reports into CSV format.

Overview

This script processes PDF files containing federal contract data from FPDS and extracts individual contract records with all 26 fields into a structured CSV file. Each row in the output represents one complete contract with standardized column headers.

Features

  • Comprehensive extraction: Extracts all 26 contract fields per record
  • Clean data output: Removes headers, newlines, and formatting artifacts
  • Robust processing: Handles errors gracefully and continues processing
  • Progress tracking: Shows real-time progress during extraction
  • Flexible input: Process entire PDFs or specify page limits for testing
  • Command-line interface: Easy to use from terminal or scripts

Requirements

  • Python 3.7+
  • Required packages:
    • pdfplumber - PDF parsing and table extraction
    • pandas - Data manipulation and CSV output

Installation

  1. Clone or download this repository
  2. Install required packages:
    pip install pdfplumber pandas

Usage

Basic Usage

python contract_extract.py input.pdf output.csv

Advanced Usage

# Process only first 50 pages (for testing)
python contract_extract.py contracts.pdf sample.csv --max-pages 50

# Get help
python contract_extract.py --help

Examples

# Extract all contracts from FPDS PDF
python contract_extract.py ICE_AllContracts_250121_50k.pdf all_contracts.csv

# Test with first 10 pages
python contract_extract.py ICE_AllContracts_250121_50k.pdf test.csv --max-pages 10
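The command line shown above (two positional paths plus an optional `--max-pages`) could be declared with `argparse` roughly as follows. This is a minimal sketch, not the script's actual source: the function name `build_parser` and the argument names `input_pdf` and `output_csv` are assumptions chosen for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented command line."""
    parser = argparse.ArgumentParser(
        prog="contract_extract.py",
        description="Extract FPDS contract records from a PDF into CSV.",
    )
    # Positional arguments: input PDF first, output CSV second.
    parser.add_argument("input_pdf", help="path to the FPDS PDF report")
    parser.add_argument("output_csv", help="path of the CSV file to write")
    # Optional page limit; None means "process every page".
    parser.add_argument(
        "--max-pages",
        type=int,
        default=None,
        help="process only the first N pages (useful for testing)",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["contracts.pdf", "sample.csv", "--max-pages", "50"])
print(args.input_pdf, args.output_csv, args.max_pages)
```

`argparse` generates the `--help` text automatically from the `help=` strings, which matches the "Get help" invocation above.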

Output Format

The script outputs a CSV file with 26 columns representing all contract fields:

| Column | Description | Example |
| --- | --- | --- |
| Contract ID | Unique contract identifier | 70CDCR22P00000024 |
| Reference IDV | Reference indefinite delivery vehicle | 70CDCR22D00000002 |
| Modification Number | Contract modification number | P00011 |
| Transaction Number | Transaction sequence number | 0 |
| Award/IDV Type | Type of award/contract | PO Purchase Order |
| Action Obligation ($) | Contract value | $258,000.00 |
| Date Signed | Contract signature date | Sep 16, 2025 |
| Solicitation Date | Date of solicitation | Jul 6, 2022 |
| Contracting Agency ID | Agency identifier | 7012 |
| Contracting Agency | Full agency name | U.S. IMMIGRATION AND CUSTOMS ENFORCEMENT |
| Contracting Office Name | Contracting office | DETENTION COMPLIANCE AND REMOVALS |
| PSC Type | Product/Service Code type | S |
| PSC | Product/Service Code | X1FB |
| PSC Description | Service description | LEASE/RENTAL OF RECREATIONAL BUILDINGS |
| NAICS | Industry classification code | 713990 |
| NAICS Description | Industry description | ALL OTHER AMUSEMENT AND RECREATION INDUSTRIES |
| Entity City | Contractor city | CONROE |
| Entity State | Contractor state | TX |
| Entity ZIP Code | Contractor ZIP code | 773024850 |
| Additional Reporting Code | Special reporting codes | E, S |
| Additional Reporting Description | Reporting description | EMPLOYMENT ELIGIBILITY VERIFICATION |
| Unique Entity ID | Contractor unique ID | XR3HKXN6M1B3 |
| Ultimate Parent Unique Entity ID | Parent company ID | WGN2KJJD27Q3 |
| Ultimate Parent Legal Business Name | Parent company name | AKIMA INFRASTRUCTURE PROTECTION LLC |
| Legal Business Name | Contractor legal name | GO & ZALEZ INC. |
| CAGE Code | Commercial and Government Entity code | 6S0S5 |
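Because amounts and dates are written as display text (`$258,000.00`, `Sep 16, 2025`), downstream analysis usually needs to convert them. The sketch below uses only the standard library and assumes every value follows the form shown in the examples above; the helper names `parse_amount` and `parse_fpds_date` are hypothetical and are not part of the tool.

```python
import csv
import io
from datetime import date, datetime
from decimal import Decimal


def parse_amount(text: str) -> Decimal:
    """Convert a value like '$258,000.00' to an exact Decimal."""
    return Decimal(text.replace("$", "").replace(",", "").strip())


def parse_fpds_date(text: str) -> date:
    """Convert a value like 'Sep 16, 2025' to a date.

    Assumes English month abbreviations (the default C locale).
    """
    return datetime.strptime(text.strip(), "%b %d, %Y").date()


# Three-column excerpt standing in for a real output file.
sample = io.StringIO(
    "Contract ID,Action Obligation ($),Date Signed\n"
    '70CDCR22P00000024,"$258,000.00","Sep 16, 2025"\n'
)
rows = list(csv.DictReader(sample))
amount = parse_amount(rows[0]["Action Obligation ($)"])
signed = parse_fpds_date(rows[0]["Date Signed"])
print(amount, signed.isoformat())
```

`Decimal` is used instead of `float` so that dollar amounts sum without rounding drift.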

Performance

  • Processing speed: ~2-3 pages per second
  • Memory usage: Moderate (processes one page at a time)
  • Error handling: Continues processing if individual pages fail
  • Large files: Tested with 600+ page PDFs

Sample Output

From a 10-page test extraction:

  • 23 contracts extracted
  • 26 columns with complete data
  • Contract values: $211K - $14.6M
  • Major contractors: CoreCivic, GEO Group, G4S, Akima Infrastructure

Troubleshooting

Common Issues

  1. Missing dependencies

    pip install pdfplumber pandas
  2. PDF file not found

    • Check file path and spelling
    • Use absolute paths if needed
  3. Memory issues with large PDFs

    • Use --max-pages to limit a run to the first N pages
    • Split the PDF into smaller files, process each one, and combine the resulting CSVs
  4. No contracts extracted

    • Verify PDF contains FPDS contract data
    • Check if PDF format matches expected structure
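For the memory workaround above, combining results means concatenating several CSV files that share the same 26-column header. The following is a minimal sketch under the assumption that each partial CSV came from a separate run on a smaller PDF; `combine_csvs` is a hypothetical helper, not something the tool provides.

```python
import csv
from pathlib import Path
from typing import Iterable, Optional


def combine_csvs(parts: Iterable[Path], combined: Path) -> int:
    """Concatenate CSV files that share one header; return data rows written."""
    written = 0
    writer: Optional[csv.DictWriter] = None
    with combined.open("w", newline="", encoding="utf-8") as out:
        for part in parts:
            with part.open(newline="", encoding="utf-8") as src:
                reader = csv.DictReader(src)
                if reader.fieldnames is None:
                    continue  # completely empty file, nothing to copy
                if writer is None:
                    # The first non-empty file defines the header.
                    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
                    writer.writeheader()
                elif list(reader.fieldnames) != list(writer.fieldnames):
                    raise ValueError(f"{part} has a different header")
                for row in reader:
                    writer.writerow(row)
                    written += 1
    return written
```

A call such as `combine_csvs(sorted(Path("parts").glob("*.csv")), Path("all_contracts.csv"))` would merge every partial file in a (hypothetical) `parts` directory. Rows are copied as-is, so a contract that straddles two partial files would need to be reconciled separately.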

Error Messages

  • ❌ Error: Input PDF file not found - Check file path
  • ❌ Error: Input file must be a PDF - Ensure file has .pdf extension
  • Missing library - Install required packages
  • KeyError during extraction - PDF format may not match expected structure
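The first two messages correspond to checks that can run before any PDF parsing starts. A minimal sketch of such a pre-flight check is shown below; the function name `check_input` and the order of the two checks are assumptions, and the real script may validate differently.

```python
from pathlib import Path
from typing import Optional


def check_input(path_text: str) -> Optional[str]:
    """Return an error message for an unusable input path, else None."""
    path = Path(path_text)
    # Extension check first: cheap and independent of the filesystem.
    if path.suffix.lower() != ".pdf":
        return "Error: Input file must be a PDF"
    if not path.is_file():
        return "Error: Input PDF file not found"
    return None


print(check_input("notes.txt"))
```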

Technical Details

PDF Structure Expected

  • 4-column table format with field:value pairs
  • Standard FPDS header: "www.fpds.gov List of contracts matching your search criteria"
  • 26 fields per contract from "Contract ID:" to "CAGE Code:"
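To make the 4-column layout concrete, the sketch below turns one table row into field:value pairs. It assumes each row is laid out as label, value, label, value, with labels ending in a colon and empty cells arriving as `None`; that layout and the helper name `row_to_pairs` are illustrative assumptions rather than a description of the script's internals.

```python
from typing import List, Optional, Tuple


def row_to_pairs(row: List[Optional[str]]) -> List[Tuple[str, str]]:
    """Split a [label, value, label, value] row into cleaned (field, value) pairs."""
    pairs = []
    # Even positions hold labels, odd positions hold the matching values.
    for label, value in zip(row[0::2], row[1::2]):
        if not label:
            continue  # empty half of a row
        # Collapse newlines and repeated spaces, then drop the trailing colon.
        name = " ".join(label.split()).rstrip(":").strip()
        text = " ".join((value or "").split())
        pairs.append((name, text))
    return pairs


row = ["Contract ID:", "70CDCR22P00000024", "Reference IDV:", "70CDCR22D00000002"]
print(row_to_pairs(row))
```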

Processing Steps

  1. Open PDF and identify total pages
  2. For each page:
    • Extract tables using pdfplumber
    • Remove FPDS headers
    • Parse 4-column format into field:value pairs
    • Clean data (remove newlines, extra whitespace)
    • Group fields into complete contracts
  3. Combine all contracts into single DataFrame
  4. Export to CSV with standardized column names
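The steps above can be sketched as two small functions: one that walks the pages with pdfplumber and one that groups a stream of field:value pairs into contracts. This is an illustrative outline under stated assumptions, not the script's code: the names `iter_table_rows` and `group_contracts` are hypothetical, a record is assumed to start at "Contract ID" and end at "CAGE Code", and plain dicts stand in for the pandas DataFrame.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

FIRST_FIELD = "Contract ID"
LAST_FIELD = "CAGE Code"


def iter_table_rows(
    pdf_path: str, max_pages: Optional[int] = None
) -> Iterator[List[Optional[str]]]:
    """Yield every table row from the PDF, skipping pages that fail."""
    import pdfplumber  # imported lazily so the grouping logic runs without it

    with pdfplumber.open(pdf_path) as pdf:
        pages = pdf.pages if max_pages is None else pdf.pages[:max_pages]
        for number, page in enumerate(pages, start=1):
            try:
                tables = page.extract_tables()
            except Exception as exc:  # keep going if a single page fails
                print(f"Page {number} skipped: {exc}")
                continue
            for table in tables:
                yield from table


def group_contracts(pairs: Iterable[Tuple[str, str]]) -> Iterator[Dict[str, str]]:
    """Group (field, value) pairs into one dict per contract."""
    current: Dict[str, str] = {}
    for name, value in pairs:
        if name == FIRST_FIELD and current:
            # A new record began before the previous one closed; emit it as-is.
            yield current
            current = {}
        current[name] = value
        if name == LAST_FIELD:
            yield current
            current = {}
    if current:
        yield current  # trailing partial record


pairs = [
    ("Contract ID", "70CDCR22P00000024"),
    ("CAGE Code", "6S0S5"),
]
print(list(group_contracts(pairs)))
```

Each dict maps column names to values, so the resulting list can be passed to `pandas.DataFrame(...)` and written out with `to_csv`. Emitting partial records rather than dropping them is a design choice in this sketch; a stricter version could discard any record with fewer than 26 fields.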

Contributing

To contribute improvements:

  1. Test with different FPDS PDF formats
  2. Add error handling for edge cases
  3. Optimize performance for very large files
  4. Add additional output formats (JSON, Excel)

License

This tool is provided as-is for processing federal contract data from public FPDS reports.
