This repository shows how to extract invoices and receipts from email using a single API call to iGPT. The agent connects to iGPT using an API key, scans email threads and PDF attachments, and returns structured, deduplicated invoice data as validated JSON.
No MIME parsing, OCR pipeline, or attachment handling code.
Just a prompt, a schema, and the iGPT API.
from igptai import IGPT
client = IGPT(api_key="...", user="...")
result = client.recall.ask(input=SYSTEM_PROMPT, output_format=OUTPUT_FORMAT)
# β validated JSON: vendor, amount, dates, line items, payment status, source provenanceReturns structured, schema-validated JSON in ~3 seconds, even across large inboxes.
The project is designed for easy customization and extension, allowing you to:
- Target different document types (contracts, SOWs, purchase orders)
- Adjust extraction rules and output schemas
- Plug into any orchestration framework (LangChain, LangGraph, CrewAI)
Email threads contain 10-15x duplicate content from nested quoting. Attachments disagree with body text. Invoices get forwarded, CC'd, and duplicated across inboxes. Renewals look like invoices but aren't always. PDFs require document extraction and normalization. Most teams end up building custom Gmail API logic, PDF parsing pipelines, and deduplication layers from scratch. This repo skips all of that.
Use this pattern to: sync SaaS invoices into QuickBooks or Xero, track renewals across shared finance inboxes, reconcile subscription charges from forwarded receipts, or power an autonomous finance agent.
- π Attachment-aware extraction β Reads PDF and HTML invoices attached to emails, not just email body text
- π§Ή Automatic deduplication β Stable
dedupe_keyper invoice prevents double-counting across forwarded or CC'd threads - π Strict JSON schema β Output validated against a JSON schema server-side, so downstream systems can trust the structure
- π Source provenance β Each invoice record includes the source message ID, subject, sender, timestamp, and attachment filenames
- π User-scoped isolation β Each API call is isolated to the authenticated user's data
- β‘ Single API call β The entire extraction runs through one
recall.ask()call
Connect your email datasource β Ask iGPT to extract invoices using a system prompt + JSON schema β Get back validated, structured invoice records with source provenance.
-
Loads credentials
- Reads
IGPT_API_KEYandIGPT_API_USERfrom environment variables. - If missing, prompts securely using
getpass(so the key isn't echoed in your terminal).
- Reads
-
Creates an iGPT client
client = IGPT(api_key=..., user=...)
-
Checks whether you already have datasources connected
- If no datasources are connected yet, the agent requests an authorization URL.
- The run stops and prints the authorization URL so you can connect your data, then you re-run the script.
-
Runs an extraction request
- Calls:
client.recall.ask(input=SYSTEM_PROMPT, output_format=OUTPUT_FORMAT)
SYSTEM_PROMPTtells the model to:- find invoice-like documents (invoices, receipts, renewals with totals, refunds/credit memos)
- treat invoice attachments as the source of truth when they disagree with the email text
- use
dedupe_keyas a stable, normalized unique key to detect duplicate emails or files for the same invoice - leave any unavailable fields as
null(never infer missing values) - standardize dates/currency/amount formats for consistency
- classify each record into a fixed
invoice_typeenum
- Calls:
-
Returns strict JSON
OUTPUT_FORMATenforces a strict schema:- top-level keys:
run_metadataandinvoices - each invoice includes vendor info, amounts, dates, payment status, provenance fields (message id/subject/from/received timestamp), attachment filenames, and optional notes
- top-level keys:
app.pyprints either the error response or the structured output.
βββ app.py # Entry point: runs the agent
βββ igpt/
β βββ __init__.py # Exports IgptAgent, SYSTEM_PROMPT, OUTPUT_FORMAT
β βββ agent.py # IgptAgent class: auth check β extraction β output
β βββ prompts.py # System prompt defining extraction rules
β βββ schema.py # JSON schema enforcing output structure
βββ requirements.txt
βββ LICENSE.txt
βββ .gitignore
Requirements: Python >= 3.8
- Create and activate a virtual environment:
python3 -m venv venv
source venv/bin/activate- Install dependencies:
python3 -m pip install -r requirements.txt- Set environment variables:
export IGPT_API_KEY="your-igpt-api-key"
export IGPT_API_USER="your-igpt-api-user"If they are not set, app.py will prompt you via getpass.
You can create your iGPT API key here.
- Run:
python3 app.pyOn success, the program prints a JSON object shaped like:
run_metadata: when the run was generated + optional search metadatainvoices: array of extracted invoice/receipt records, each matching the schema inschema.py
Note: values below are illustrative. Fields may be
nullif not explicitly present in the source.
{
"run_metadata": {
"generated_at_utc": "2026-01-22T14:52:59Z",
"date_range": "2025-01-01..2026-01-22",
"query_terms": ["invoice", "receipt", "payment", "order", "subscription", "renewal"]
},
"invoices": [{
"invoice_type": "subscription",
"vendor_name": "Example SaaS Inc.",
"vendor_domain": "example.com",
"vendor_billing_email": "billing@example.com",
"invoice_number": "INV-10492",
"order_id": null,
"invoice_date": "2026-01-10",
"due_date": null,
"service_period_start": "2026-01-10",
"service_period_end": "2026-02-10",
"plan_name": "Pro",
"billing_interval": "monthly",
"seats": 5,
"usage_window_start": null,
"usage_window_end": null,
"description": "Pro plan subscription (5 seats)",
"currency": "USD",
"subtotal_amount": 100.0,
"discount_amount": null,
"tax_amount": 0.0,
"total_amount": 100.0,
"payment_status": "paid",
"paid_date": "2026-01-10",
"payment_method": "Visa **** 4242",
"line_items": [{
"description": "Pro plan (5 seats) - Jan 10, 2026 to Feb 10, 2026",
"quantity": 5,
"unit_price_amount": 20.0,
"amount": 100.0
}],
"source_id": "msg01HZYXABC123",
"source_subject": "Your Example SaaS receipt (INV-10492)",
"source_from": "Example SaaS Billing billing@example.com",
"source_occurred_at_utc": "2026-01-10T08:15:22Z",
"source_attachment_filenames": ["InvoiceINV-10492.pdf"],
"source_filename": null,
"extraction_notes": [
"Evidence(invoice_date): attachment:InvoiceINV-10492.pdf -> \"Invoice Date: Jan 10, 2026\"",
"Evidence(total_amount): attachment:InvoiceINV-10492.pdf -> \"Total: $100.00\"",
"Evidence(paid_date): emailbody -> \"Payment processed on Jan 10, 2026\"",
"Attachment values preferred over email body where conflicts occurred."
],
"dedupe_key": "example.com_inv-10492"
}]
}Change what gets extracted β Edit igpt/prompts.py to target different document types (contracts, SOWs, purchase orders) or adjust extraction rules.
Change the output shape β Edit igpt/schema.py to add or remove fields. iGPT enforces the schema server-side, so your output will always match.
Add a date range β Limit extraction to a specific time window to improve performance and control. For example, restrict to the last 7 days or a fixed range like 2026-02-01..2026-02-24.
Add custom query terms β Focus on messages containing specific keywords or vendors (e.g., renewal, payment confirmation, Apple Inc, Figma) to narrow extraction scope.
Extend invoice types β Add entries to the invoice_type enum in the system prompt if your business requires additional billing model classifications beyond the defaults.
Add to an existing agent β The core extraction is a single function call:
from igptai import IGPT
client = IGPT(api_key="...", user="...")
result = client.recall.ask(
input="<your prompt>",
output_format={"schema": your_json_schema}
)Works with any orchestration framework (LangChain, LangGraph, CrewAI) or standalone.
Store the data β Save results to a database, push to an ERP system, or export to CSV for further analysis.
Build a dashboard β Monitor monthly spend across vendors, track active subscriptions, or visualize payment trends over time.
Automate it β Run extraction on a daily or weekly schedule and store results as they come in. Set up alerts when an invoice exceeds a threshold. Auto-approve low-value invoices below a set amount. Create reminders before service_period_end dates to catch upcoming renewals.
- iGPT Python SDK β
pip install igptai - iGPT Node.js SDK β
npm install igptai - API Documentation β Full endpoint reference
- Playground β Try queries interactively before writing code
MIT