A Python tool for extracting structured contract data from FPDS (Federal Procurement Data System) PDF reports into CSV format.
This script processes PDF files containing federal contract data from FPDS and extracts individual contract records with all 26 fields into a structured CSV file. Each row in the output represents one complete contract with standardized column headers.
- Comprehensive extraction: Extracts all 26 contract fields per record
- Clean data output: Removes headers, newlines, and formatting artifacts
- Robust processing: Handles errors gracefully and continues processing
- Progress tracking: Shows real-time progress during extraction
- Flexible input: Process entire PDFs or specify page limits for testing
- Command-line interface: Easy to use from terminal or scripts
- Python 3.7+
- Required packages:
  - pdfplumber: PDF parsing and table extraction
  - pandas: Data manipulation and CSV output
- Clone or download this repository
- Install required packages:
pip install pdfplumber pandas
python contract_extract.py input.pdf output.csv
# Process only first 50 pages (for testing)
python contract_extract.py contracts.pdf sample.csv --max-pages 50
# Get help
python contract_extract.py --help
# Extract all contracts from FPDS PDF
python contract_extract.py ICE_AllContracts_250121_50k.pdf all_contracts.csv
# Test with first 10 pages
python contract_extract.py ICE_AllContracts_250121_50k.pdf test.csv --max-pages 10

The script outputs a CSV file with 26 columns representing all contract fields:
| Column | Description | Example |
|---|---|---|
| Contract ID | Unique contract identifier | 70CDCR22P00000024 |
| Reference IDV | Reference indefinite delivery vehicle | 70CDCR22D00000002 |
| Modification Number | Contract modification number | P00011 |
| Transaction Number | Transaction sequence number | 0 |
| Award/IDV Type | Type of award/contract | PO Purchase Order |
| Action Obligation ($) | Contract value | $258,000.00 |
| Date Signed | Contract signature date | Sep 16, 2025 |
| Solicitation Date | Date of solicitation | Jul 6, 2022 |
| Contracting Agency ID | Agency identifier | 7012 |
| Contracting Agency | Full agency name | U.S. IMMIGRATION AND CUSTOMS ENFORCEMENT |
| Contracting Office Name | Contracting office | DETENTION COMPLIANCE AND REMOVALS |
| PSC Type | Product/Service Code type | S |
| PSC | Product/Service Code | X1FB |
| PSC Description | Service description | LEASE/RENTAL OF RECREATIONAL BUILDINGS |
| NAICS | Industry classification code | 713990 |
| NAICS Description | Industry description | ALL OTHER AMUSEMENT AND RECREATION INDUSTRIES |
| Entity City | Contractor city | CONROE |
| Entity State | Contractor state | TX |
| Entity ZIP Code | Contractor ZIP code | 773024850 |
| Additional Reporting Code | Special reporting codes | E, S |
| Additional Reporting Description | Reporting description | EMPLOYMENT ELIGIBILITY VERIFICATION |
| Unique Entity ID | Contractor unique ID | XR3HKXN6M1B3 |
| Ultimate Parent Unique Entity ID | Parent company ID | WGN2KJJD27Q3 |
| Ultimate Parent Legal Business Name | Parent company name | AKIMA INFRASTRUCTURE PROTECTION LLC |
| Legal Business Name | Contractor legal name | GO & ZALEZ INC. |
| CAGE Code | Commercial and Government Entity code | 6S0S5 |
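
Amounts and dates are stored in the CSV as formatted text, so analysis usually starts by converting them. The following is a minimal sketch using only the standard library. It assumes the values follow the formats shown in the Example column above (`$258,000.00`, `Sep 16, 2025`) with English month abbreviations. The helpers `parse_amount` and `parse_date` are illustrative names and not part of `contract_extract.py`.

```python
import csv
import io
from datetime import date, datetime
from decimal import Decimal


def parse_amount(text: str) -> Decimal:
    """Convert a string like '$258,000.00' to a Decimal."""
    return Decimal(text.replace("$", "").replace(",", "").strip())


def parse_date(text: str) -> date:
    """Convert a string like 'Sep 16, 2025' to a date."""
    return datetime.strptime(text.strip(), "%b %d, %Y").date()


# Stand-in for open("all_contracts.csv", newline="");
# only three of the 26 columns are shown here.
sample = io.StringIO(
    "Contract ID,Action Obligation ($),Date Signed\n"
    '70CDCR22P00000024,"$258,000.00","Sep 16, 2025"\n'
)

rows = []
for row in csv.DictReader(sample):
    row["Action Obligation ($)"] = parse_amount(row["Action Obligation ($)"])
    row["Date Signed"] = parse_date(row["Date Signed"])
    rows.append(row)
```

`Decimal` is used rather than `float` so that dollar amounts are not subject to binary rounding when summed.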
- Processing speed: ~2-3 pages per second
- Memory usage: Moderate (processes one page at a time)
- Error handling: Continues processing if individual pages fail
- Large files: Tested with 600+ page PDFs
From a 10-page test extraction:
- 23 contracts extracted
- 26 columns with complete data
- Contract values: $211K - $14.6M
- Major contractors: CoreCivic, GEO Group, G4S, Akima Infrastructure
- Missing dependencies
  - Run: pip install pdfplumber pandas
- PDF file not found
  - Check file path and spelling
  - Use absolute paths if needed
- Memory issues with large PDFs
  - Use --max-pages to process in chunks
  - Process smaller sections and combine results
- No contracts extracted
  - Verify PDF contains FPDS contract data
  - Check if PDF format matches expected structure
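
For the memory workaround above, the per-chunk CSVs can be merged once each chunk has been extracted. This is a minimal standard-library sketch, assuming every chunk was produced by the same script and therefore shares one header row. The function `combine_csvs` and the file names in the usage line are hypothetical.

```python
import csv
from typing import Iterable


def combine_csvs(chunk_paths: Iterable[str], combined_path: str) -> int:
    """Concatenate CSV chunks that share a header; return the data rows written."""
    written = 0
    header = None
    with open(combined_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in chunk_paths:
            with open(path, newline="", encoding="utf-8") as src:
                reader = csv.reader(src)
                chunk_header = next(reader, None)
                if chunk_header is None:
                    continue  # empty chunk, nothing to copy
                if header is None:
                    header = chunk_header
                    writer.writerow(header)  # write the header only once
                elif chunk_header != header:
                    raise ValueError(f"{path} has a different header")
                for row in reader:
                    writer.writerow(row)
                    written += 1
    return written
```

Usage might look like `combine_csvs(["part1.csv", "part2.csv"], "all_contracts.csv")`. The header check is a cheap guard against mixing outputs from PDFs with different layouts.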
- ❌ Error: Input PDF file not found - Check file path
- ❌ Error: Input file must be a PDF - Ensure file has .pdf extension
- Missing library - Install required packages
- KeyError during extraction - PDF format may not match expected structure
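
The first two messages correspond to simple checks on the input path. Below is a rough sketch of how such validation could look. `validate_input` is a hypothetical helper, and the actual checks and their order in `contract_extract.py` may differ.

```python
from pathlib import Path
from typing import Optional


def validate_input(pdf_path: str) -> Optional[str]:
    """Return an error message for an unusable input path, or None if it looks fine."""
    path = Path(pdf_path)
    if not path.is_file():
        return "❌ Error: Input PDF file not found"
    if path.suffix.lower() != ".pdf":
        return "❌ Error: Input file must be a PDF"
    return None
```

The extension comparison is case-insensitive, so `REPORT.PDF` passes the second check.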
- 4-column table format with field:value pairs
- Standard FPDS header: "www.fpds.gov List of contracts matching your search criteria"
- 26 fields per contract from "Contract ID:" to "CAGE Code:"
- Open PDF and identify total pages
- For each page:
  - Extract tables using pdfplumber
  - Remove FPDS headers
  - Parse 4-column format into field:value pairs
  - Clean data (remove newlines, extra whitespace)
  - Group fields into complete contracts
- Combine all contracts into single DataFrame
- Export to CSV with standardized column names
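
The per-page steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the script's actual code. It assumes each table row arrives as `[field, value, field, value]`, in the list-of-cells shape that pdfplumber's `page.extract_tables()` returns, and that a new record begins whenever "Contract ID" appears. The function names and the second contract ID (`EXAMPLE0001`) are made up for the example.

```python
from typing import Dict, List, Optional, Tuple

HEADER_MARKER = "www.fpds.gov"  # text found in the standard FPDS page header
FIRST_FIELD = "Contract ID"     # assumed to open every contract record

Row = List[Optional[str]]


def clean(cell: Optional[str]) -> str:
    """Collapse newlines and repeated whitespace; treat None as empty."""
    return " ".join((cell or "").split())


def rows_to_pairs(rows: List[Row]) -> List[Tuple[str, str]]:
    """Turn 4-column rows [field, value, field, value] into (field, value) pairs."""
    pairs = []
    for row in rows:
        cells = [clean(c) for c in row]
        if any(HEADER_MARKER in c for c in cells):
            continue  # skip FPDS header rows
        for i in range(0, len(cells) - 1, 2):
            field = cells[i].rstrip(":").strip()
            if field:
                pairs.append((field, cells[i + 1]))
    return pairs


def group_contracts(pairs: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Start a new record each time the first field appears."""
    contracts: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for field, value in pairs:
        if field == FIRST_FIELD and current:
            contracts.append(current)
            current = {}
        current[field] = value
    if current:
        contracts.append(current)
    return contracts


# Hand-written rows standing in for one extracted table.
rows = [
    ["www.fpds.gov List of contracts matching your search criteria", None, None, None],
    ["Contract ID:", "70CDCR22P00000024", "Reference IDV:", "70CDCR22D00000002"],
    ["Contracting Agency:", "U.S. IMMIGRATION AND\nCUSTOMS ENFORCEMENT", "PSC:", "X1FB"],
    ["Contract ID:", "EXAMPLE0001", "Reference IDV:", ""],
]
contracts = group_contracts(rows_to_pairs(rows))
```

The resulting list of dictionaries can be passed to `pandas.DataFrame(contracts)` and written out with `to_csv`, which matches the last two steps.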
To contribute improvements:
- Test with different FPDS PDF formats
- Add error handling for edge cases
- Optimize performance for very large files
- Support additional output formats (JSON, Excel)
This tool is provided as-is for processing federal contract data from public FPDS reports.