
Feat: Custom API Integrations#52

Open
sgogriff wants to merge 11 commits into onllm-dev:main from sgogriff:feat/api-integrations

Conversation


@sgogriff sgogriff commented Apr 4, 2026

Summary

Adds API Integrations as a new telemetry subsystem for tracking token and cost usage from custom API-driven scripts via local JSONL ingestion.

What Changed

  • added backend JSONL ingestion and SQLite storage for API Integrations events
  • added read-only API endpoints for current usage, history, and ingest health
  • added API Integrations dashboard UI and settings visibility control
  • added Python wrapper examples for Anthropic, OpenAI, Mistral, OpenRouter, and Gemini
  • added setup and README documentation

Notes

  • API Integrations is deliberately separate from subscription/quota tracking, and the codebase reflects that separation
  • it tracks cumulative usage telemetry, not remaining plan percentage
  • ingestion is controlled by ONWATCH_API_INTEGRATIONS_ENABLED
  • source directory is configurable via ONWATCH_API_INTEGRATIONS_DIR
  • Docs currently include only a .py wrapper as an example, but more could easily be added; the same goes for adapting to a wider range of providers.

Testing

  • go test -race ./...
  • go vet ./...
  • manually tested JSONL ingestion and dashboard rendering with seeded data. Seeded data generator script available in examples/api_integrations/python along with .py examples and JSONL wrapper.

Screenshots!

(images: api-integration-dark, api-integration-light)

codecov bot commented Apr 4, 2026


prakersh commented Apr 5, 2026

Thanks a lot for this contribution! This is a substantial PR so we'll need some time to review it properly. Will follow up with detailed feedback once we've gone through everything.


sgogriff commented Apr 5, 2026

No problem! Let me know if there is anything that needs re-thinking. Happy to help :)


prakersh commented Apr 5, 2026

Thanks for the detailed PR! I've done an initial review and the overall structure looks solid - clean separation, good test coverage, and consistent use of parameterized SQL.

A few things I'd like to address before merging:

Query bounds:

  • QueryAPIIntegrationUsageSummary() has no LIMIT - our project guardrails require bounded queries. Would need a cap here.
  • QueryAPIIntegrationUsageBuckets() loads all events into memory before grouping. For large time ranges this could be expensive - ideally the bucketing should happen in SQL.

Minor code issues:

  • Duplicate detection in InsertAPIIntegrationUsageEvent relies on strings.Contains(err.Error(), "unique") which is fragile. Prefer checking SQLite error codes.
  • GetActiveSystemAlertsByProvider double-parses createdAt (RFC3339 then RFC3339Nano) - second parse always overwrites the first. Just RFC3339Nano would suffice.

Typo:

  • README line mentions "accross seprarte" - should be "across separate"

Worth considering (non-blocking):

  • No data retention/pruning for api_integration_usage_events - will grow indefinitely
  • raw_line column stores full JSON alongside parsed columns, doubling storage

Happy to discuss any of these. Nice work on the docs and Python examples.


sgogriff commented Apr 6, 2026

Glad the PR looks good - I've gone through your comments and made these changes:

  • Bounded the current usage summary query to 500 rows.
  • Moved API Integrations history bucketing into SQLite so large time ranges no longer require loading and grouping all raw events in memory.
  • Hardened duplicate detection for ingested events by switching from error-string matching to SQLite unique-constraint codes.
  • Simplified API Integrations alert timestamp parsing to a single RFC3339Nano parse path.
  • Fixed the README typo in the API Integrations description.
  • Added automatic database retention/pruning for api_integration_usage_events via ONWATCH_API_INTEGRATIONS_RETENTION.
  • Default retention is 60 days (1440h); setting the value to 0 disables DB pruning.
  • Source API Integrations JSONL files are not pruned by onWatch (a possible future change - we'd have to think about how pruning interacts with offset_bytes in JSONL tailing). People with lots of integrations still need to rotate files manually.
  • Updated the API Integrations setup docs and README env var reference to document the new retention behaviour.

I didn't change the raw_line column, which still stores the full JSON and doubles per-event storage; that probably needs addressing.

Also had a couple of thoughts on where to take this next. I won't be doing it anytime soon, but any thoughts?

  • Add a settings tab for API integrations, similar to the others, to make things easier for the end user.
  • Add some kind of threshold alerts (i.e. integration exceeds X tokens or $ cost in Y window). This one would be particularly useful.

Let me know if something's not right, or if we should make more changes before merging :)

@sgogriff sgogriff force-pushed the feat/api-integrations branch from 99d54b4 to 6052ff6 Compare April 6, 2026 15:27
@prakersh

Hey @sgogriff, great work on the follow-up! You addressed everything cleanly - the SQL bucketing, SQLite error codes, retention/pruning, and the bounded summary query are all solid. Really nice execution.

I did a deeper pass and found a few more items to tidy up before we merge:

Must fix:

  1. Extra </div> in health summary (app.js ~line 5083)
    There's a stray closing </div> tag in renderAPIIntegrationsHealth that will break the health section layout in some browsers. Just needs removing.

  2. QueryAPIIntegrationUsageBuckets needs a LIMIT (api_integrations_store.go ~line 222)
    The summary query got bounded (500 cap - nice), but the bucket query slipped through. With 10 integrations over 30 days at 1-min granularity, that's potentially 432K rows loaded into memory before downsampling. A cap (e.g., 5000) would keep this safe.

  3. Fingerprint sub-second precision (types.go ~line 159)
    eventFingerprint formats the timestamp with time.RFC3339 (second precision). Two events within the same second that share identical integration/provider/model/tokens would silently dedup. Switching to time.RFC3339Nano would close this gap.

  4. String length validation on JSONL fields (types.go ~lines 65-154)
    integration, model, account, and metadata fields have no max length enforcement. A buggy producer could write very large strings. Simple guards would help - e.g., 256 chars for names, 4KB for metadata.

  5. Missing source_path index (api_integrations_store.go ~line 311)
    The health endpoint's LEFT JOIN on source_path has no index on that column in api_integration_usage_events. As event count grows, this becomes a full table scan per ingest state row. Adding CREATE INDEX IF NOT EXISTS idx_api_integration_usage_events_source ON api_integration_usage_events(source_path) would fix it.

  6. Drop raw_line column (types.go, store.go, api_integrations_store.go)
    Looking at this more closely - raw_line stores the full original JSON line alongside all the parsed columns, but nothing ever reads it. No handler or endpoint surfaces it. The source .jsonl files already serve as the raw audit trail (and they're not pruned by onWatch), so storing the same data again in SQLite is redundant and doubles per-event storage. I'd recommend removing the column, the RawLine struct field, and the related INSERT/SELECT references entirely.

Design questions (non-blocking, just want your thoughts):

  • AvailableProviders() / HasProvider() in config.go don't include api-integrations. Was this intentional? If backend code iterates providers for any reason, API integrations would be silently skipped. Wondering if it should be included there (gated on APIIntegrationsEnabled) or if you see it as fundamentally separate from polling providers.

Merge conflicts:

The PR currently has merge conflicts with main. Could you rebase onto the latest main and resolve those when you push the fixes above? That way we can get this merged smoothly once the changes look good.

On your future ideas - both a settings tab and threshold alerts sound great. The threshold alerts especially would be a killer feature for teams monitoring spend. Happy to discuss those in separate issues when you're ready.

Thanks again for the solid contribution and quick turnaround on feedback!

@sgogriff sgogriff force-pushed the feat/api-integrations branch from eb0aba7 to 2fd1fdf Compare April 13, 2026 09:40
@sgogriff

Thanks for the detailed pass, all fixed up!

Extra </div> - removed the stray closing tag.

Bucket query LIMIT - added apiIntegrationUsageBucketsLimit = 5000 constant as you suggested and capped the query alongside the existing summary limit.

Fingerprint precision - switched to time.RFC3339Nano in eventFingerprint.

String length validation - added maxIntegrationFieldLen = 256 and maxMetadataJSONLen = 4096 constants with guards in ParseUsageEventLine for integration, model, account, and compacted metadata_json. Tests added for each rejection path.

source_path index - added CREATE INDEX IF NOT EXISTS idx_api_integration_usage_events_source to the schema. Since it uses IF NOT EXISTS it runs on startup and covers existing databases automatically.

raw_line removal - dropped the field from the struct, removed it from INSERT/SELECT, and removed it from the schema. Added a DROP COLUMN migration in migrateSchema() for existing databases (with a TODO comment to remove it once everyone has migrated - a very very small number of users on this fork so should be quick, but nicer for us!).

Also rebased onto latest main - one conflict in config.go around the CodexHasProfiles field, resolved by keeping the upstream version.
While running tests on the rebase I also noticed TestRunStatus_PortFallbackDetectsOnwatchProcess is flaky on upstream main too (lsof lag on macOS? - the TCP connect succeeds before lsof can see the process). Included a fix in this PR but happy to pull it out into a separate issue/PR if you'd prefer to keep concerns separate.

Regarding the design: I kept api-integrations out of AvailableProviders() intentionally. The existing entries in that list all represent providers that onWatch polls directly using some kind of configured credential. api-integrations is deliberately credential-free: it reads whatever providers appear in the JSONL data (so it can naturally take input from any provider), meaning the set of providers is determined at runtime by the files, not at config time.
As you mention, there is a silent-skip risk, and the current structure doesn't address it, so I'm happy to change it if you think that's the better call given the reasoning above. One middle-ground option could be a separate AvailableFeatures() covering non-polling capabilities, which would close that gap without conflating the two concepts.

An alert feature is definitely on the roadmap -- it would be a very good addition. I'll get to it at some point, but you're welcome to take it on yourself if you'd rather not wait! :)
Please let me know if you find anything else needing attention on this PR in the meantime.

@prakersh

Hey @sgogriff, all six must-fixes from the last round are verified and look solid. Rebase is clean. Did a security pass - parameterized SQL, XSS escaping, auth on new endpoints, input validation all check out.

Three more items before merge:

  1. Unbounded PartialLine growth (api_integrations_ingest_agent.go ~L160) - A .jsonl file with no newlines grows state.PartialLine by 256KB per scan cycle indefinitely. Add a cap (e.g., 512KB) and discard with a warning when exceeded.

  2. Unbounded alert creation (api_integrations_ingest_agent.go ~L189) - recordInvalidLine creates a DB row per bad line. A garbage file floods the table. Rate-limit to e.g. 10 alerts per file per cycle.

  3. No cap on files scanned (api_integrations_ingest_agent.go ~L98) - Thousands of .jsonl files would all be stat'd + DB-queried each 5s cycle. A soft cap (e.g., 100) with a warning log would help.

The flaky test fix is fine to keep here. Will open an issue for the AvailableFeatures() idea. Almost there!

- Add 512KB cap on PartialLine with discard + warning when exceeded
- Rate-limit invalid line alerts to 10 per file per scan cycle
- Cap file scan to 100 files per cycle with round-robin cursor for coverage

All three prevent memory/DB flooding from malformed input files.
@sgogriff

Hey @prakersh,
I’ve now applied all three fixes in the latest push.
All set for another look if needed, and do let me know if you find any more issues! :)

