
feat: dedup, Cognee pipeline overhaul, and security patches #99

Merged
krapfj23 merged 16 commits into main from dedup-readme-and-improvements
Apr 17, 2026


@krapfj23
Collaborator

Summary

  • Content-hash dedup: /documents/upload computes SHA-256 of file bytes and short-circuits via find_document_by_hash() if a doc already exists.
  • Cognee pipeline fixes: use correct search types, remove legacy classification/migration/search services, show document filename in search results.
  • Per-dataset insights + entities: replaced the global-graph snapshot/diff (which raced under concurrent uploads, leaving most docs with 0 insights · 0 entities) with a direct per-dataset read via get_dataset_related_nodes / get_dataset_related_edges. Keyed on Node.slug (not Node.id) since edges reference the slug.
  • New reset script: backend/scripts/clear_all.py wipes R2, Supabase, the Cognee graph (resolved through cognee.base_config to handle both the venv-internal and backend-root .cognee_system/), and the pgvector schema.
  • Security patches: frontend 3 → 0 vulns (npm audit fix); backend 37 → 14 pip-audit findings via min-version pins on aiohttp, cryptography, pypdf, requests, starlette, python-multipart, pygments, litellm. Required bumping fastapi 0.119 → >=0.120 since the old pin forced starlette<0.49.
  • Docs: CLAUDE.md updated to reflect new pipeline, search types, and reset workflow.

Test plan

  • cd backend && ruff check && pytest — 51 tests pass locally (test_cognee.py and test_storage.py are skipped because they need credentials).
  • cd frontend && npm audit — 0 vulnerabilities.
  • cd backend && pip-audit — 14 remaining findings are all in explicitly-deferred categories (transformers 5.0 RC, pytest/setuptools/pip majors, diskcache with no upstream fix).
  • End-to-end: python backend/scripts/clear_all.py --yes, restart backend, upload 3 PDFs simultaneously, confirm each document shows non-zero insight/entity counts (regression test for the concurrency bug).
  • Open a completed doc: Summary populated from SearchType.SUMMARIES, Insights render as Subject → relation → Object cards, Entities render as chips. No raw chunk text.
  • Re-upload a file that was already uploaded → API returns duplicate: true with the existing doc_id.

🤖 Generated with Claude Code

krapfj23 and others added 16 commits April 15, 2026 20:22
Formatting, async Supabase migration, improved error handling,
and logging across routes and services.
Add frontend Dockerfiles, ESLint, Prettier, Vercel config, and nginx.
Update docker-compose, env example, and lint workflow.
Remove broken route-level tests from test_ingest (referenced removed
functions). Update test_storage and test_dataset_name_validation for
current service signatures.
23 tests exercising full HTTP request → route → service → response
chain. Covers upload, search, graph, document CRUD, file-url, and
health check endpoints. External services mocked at SDK boundary.
Replace standalone script with pytest-discoverable e2e test. Creates
temp fixture data (no external mock_data needed), uses Cognee embedded
defaults (LanceDB/KuzuDB), auto-skips when LLM_API_KEY is missing.
Run pytest on every PR touching backend/. Excludes broken test_storage
and e2e test_cognee. Adds pip caching, pytest-asyncio dependency, and
registers the e2e marker in pyproject.toml.
Architecture overview, key files, environment variables, run/test
commands, branch naming conventions, and code review checklist.
…, and cross-page navigation

Fixes the graph flipping/bouncing bug by stabilizing the force simulation
(cooldownTicks, d3AlphaDecay, d3VelocityDecay, warmupTicks) and memoizing
graph data to prevent unnecessary re-renders. Adds:

- Click-to-inspect node detail panel with connected entities, related
  content (Cognee CHUNKS search), and source documents
- Connected node highlighting: selected node glows, neighbors stay
  visible, unrelated nodes dim to 20% opacity
- Graph node search (client-side filter with dropdown, zoom-to-node)
- Search-to-graph bridge: "View in Graph" button on search result
  source cards navigates to /graph?dataset=X
- URL param support: ?dataset= auto-selects filter, ?node= auto-selects
  and zooms to a node
- Improved UI: overlaid controls, polished hover tooltip, degree-based
  node sizing, UUID label filtering

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Compute a SHA-256 hash of file contents at upload time and check for an
existing completed document with the same hash before running the pipeline.
Duplicates return the existing document immediately, skipping R2 upload,
LLM classification, and Cognee ingestion. Frontend shows a distinct amber
"Duplicate" card with a link to the existing document.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
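
The dedup check described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the in-memory `_DOCS` store, the `Document` shape, and the `upload()` helper are hypothetical stand-ins, and the signature of `find_document_by_hash()` is assumed from the PR summary.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Document:
    # Hypothetical record shape; the real table columns are not shown in the PR.
    doc_id: str
    content_hash: str
    status: str


# Hypothetical in-memory stand-in for the documents table.
_DOCS: dict[str, Document] = {}


def find_document_by_hash(content_hash: str) -> Optional[Document]:
    """Return a completed document with this hash, if one exists (assumed behavior)."""
    doc = _DOCS.get(content_hash)
    return doc if doc is not None and doc.status == "completed" else None


def upload(file_bytes: bytes) -> dict:
    """Hash the bytes first and short-circuit before any expensive work."""
    content_hash = hashlib.sha256(file_bytes).hexdigest()
    existing = find_document_by_hash(content_hash)
    if existing is not None:
        # Duplicate: skip storage upload and ingestion entirely.
        return {"duplicate": True, "doc_id": existing.doc_id}
    # Placeholder for the real pipeline (storage upload, ingestion, etc.).
    doc = Document(doc_id=f"doc-{len(_DOCS) + 1}", content_hash=content_hash, status="completed")
    _DOCS[content_hash] = doc
    return {"duplicate": False, "doc_id": doc.doc_id}
```

Hashing the raw bytes rather than the filename means a renamed copy of the same file is still detected, while any change to the contents produces a new document.
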
Replace outdated ETL-era README with practical setup instructions covering
Docker and manual workflows, project structure, API endpoints, testing,
linting, CI/CD, and branch/PR conventions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Delete legacy route and service modules that were superseded by the
Cognee-based pipeline. Update api.py, CLAUDE.md, and related services
to drop references to the removed modules.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…arch results

Summary/Insights/Entities tabs were all rendering raw document chunks
because the pipeline used SearchType.CHUNKS for every query. Switch to
GRAPH_SUMMARY_COMPLETION for the summary and GRAPH_COMPLETION for
insights and entities, and add _split_bulleted() to break the resulting
narrative answers into discrete list items.
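
One plausible shape for such a splitter is sketched below. The PR does not show the body of `_split_bulleted()`, so the function name `split_bulleted`, the recognized bullet markers, and the handling of continuation lines are all assumptions made for illustration.

```python
import re

# Assumed bullet markers: "-", "*", "•", or a number followed by "." or ")".
_BULLET = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")


def split_bulleted(text: str) -> list[str]:
    """Break a bulleted narrative answer into discrete list items."""
    items: list[str] = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines between bullets
        if _BULLET.match(line):
            # New bullet: drop the marker and start a new item.
            items.append(_BULLET.sub("", line).strip())
        elif items:
            # Unmarked line after a bullet: treat as a wrapped continuation.
            items[-1] += " " + stripped
        else:
            # Leading text before any bullet becomes its own item.
            items.append(stripped)
    return items
```

For example, `split_bulleted("- Alpha\n- Beta\n  continues\n2. Gamma")` yields three items, with the wrapped line folded into the second.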

Also swap the dataset-slug pill on search results for the underlying
document filename (falling back to the dataset name when no source is
attached) so users see the specific document rather than a sanitized
client slug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the flaky global-graph snapshot/diff approach for extracting
insights and entities with a direct per-dataset read from Cognee's
relational store (get_dataset_related_nodes / get_dataset_related_edges).

The previous approach snapshotted the whole Kuzu graph before cognify
and diffed after, which raced under concurrent uploads and left most
documents with 0 insights · 0 entities. The new path queries nodes
and edges by indexed dataset_id, so concurrent pipelines can't
interfere with each other.

Also:
- Summary now uses SearchType.SUMMARIES (the cognify-generated
  summary), not GRAPH_SUMMARY_COMPLETION.
- Edges reference nodes by Node.slug (the original DataPoint id),
  not Node.id (which is a derived uuid5) — filter keyed on slug.
- Structural node types (TextDocument, DocumentChunk, TextSummary,
  IndexSchema, Document) are excluded from the entities and insights
  endpoints.
- Add backend/scripts/clear_all.py to wipe R2, Supabase rows, the
  Cognee graph (resolved via cognee's own base_config to handle the
  venv-internal .cognee_system path), and the pgvector schema.
- Update CLAUDE.md to document the new pipeline and reset script.
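
The slug-keyed filtering described above might look something like the sketch below. The `Node` and `Edge` dataclasses are hypothetical simplifications, not Cognee's actual models, and `build_entities_and_insights` is an illustrative name; in the real pipeline the inputs would come from `get_dataset_related_nodes` / `get_dataset_related_edges`, whose return shapes are not shown in this PR.

```python
from dataclasses import dataclass

# Structural node types listed in the commit message.
STRUCTURAL_TYPES = {"TextDocument", "DocumentChunk", "TextSummary", "IndexSchema", "Document"}


@dataclass(frozen=True)
class Node:
    # Hypothetical shape: `slug` is what edges reference, `id` is derived from it.
    id: str
    slug: str
    type: str
    label: str


@dataclass(frozen=True)
class Edge:
    # Hypothetical shape: source and target hold node slugs, not node ids.
    source: str
    target: str
    relation: str


def build_entities_and_insights(nodes: list[Node], edges: list[Edge]):
    """Return (entity labels, subject/relation/object triples) for one dataset."""
    # Key on slug, and drop structural nodes up front.
    by_slug = {n.slug: n for n in nodes if n.type not in STRUCTURAL_TYPES}
    entities = sorted(n.label for n in by_slug.values())
    insights = [
        (by_slug[e.source].label, e.relation, by_slug[e.target].label)
        for e in edges
        # Keep only edges whose endpoints are both non-structural nodes.
        if e.source in by_slug and e.target in by_slug
    ]
    return entities, insights
```

Because the inputs are already scoped to a single dataset, nothing here depends on the state of the global graph, which is what made the earlier snapshot/diff approach race under concurrent uploads.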

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Frontend (3 → 0 vulns): npm audit fix bumped axios, follow-redirects,
and vite transitively. Lockfile-only; package.json untouched.

Backend (37 → 14 findings): added minimum-safe version pins in
requirements.txt for transitive deps flagged by pip-audit:
  aiohttp>=3.13.4      (10 aiohttp CVEs)
  cryptography>=46.0.7 (2 CVEs)
  pygments>=2.20.0
  pypdf>=6.10.2        (5 advisories)
  requests>=2.33.0
  starlette>=0.49.1    (CVE-2025-62727)
  python-multipart>=0.0.26
  litellm>=1.83.0      (3 CVEs)

The starlette pin forced a fastapi bump: fastapi 0.119 pinned
starlette<0.49, so fastapi is now >=0.120.0 (resolved to 0.136.0).
All 51 backend tests pass and ruff is clean.
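
As a rough illustration of what these pins guarantee, the sketch below compares a set of installed versions against the minimums listed above. The helper names are hypothetical, and the parser only handles plain dotted numeric versions; real tooling should use `packaging.version.Version`, which also understands pre-releases such as `5.0.0rc3`.

```python
# Minimum versions copied from the commit message above.
MIN_PINS = {
    "aiohttp": "3.13.4",
    "cryptography": "46.0.7",
    "pygments": "2.20.0",
    "pypdf": "6.10.2",
    "requests": "2.33.0",
    "starlette": "0.49.1",
    "python-multipart": "0.0.26",
    "litellm": "1.83.0",
    "fastapi": "0.120.0",
}


def parse(version: str) -> tuple[int, ...]:
    """Parse a plain dotted numeric version (assumption: no rc/dev suffixes)."""
    return tuple(int(part) for part in version.split("."))


def below_minimum(installed: dict[str, str]) -> list[str]:
    """Return the names of installed packages that fall below their pin."""
    return sorted(
        name
        for name, minimum in MIN_PINS.items()
        if name in installed and parse(installed[name]) < parse(minimum)
    )
```

Comparing parsed tuples rather than raw strings matters here: as strings, `"0.0.9"` sorts after `"0.0.26"`, which would hide an outdated python-multipart.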

Remaining 14 pip-audit findings are intentionally deferred:
- transformers 5.0.0rc3 (major RC, cognee dep, high breakage risk)
- pytest 9.0 and pip/setuptools majors (dev/build tooling only)
- diskcache 5.6.3 (no upstream fix available)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
…ovements

# Conflicts:
#	CLAUDE.md
#	backend/app/services/document_pipeline.py
@kaydencel kaydencel self-requested a review April 17, 2026 21:45
Collaborator

@kaydencel kaydencel left a comment


LGTM

@krapfj23 krapfj23 merged commit 7f274ff into main Apr 17, 2026
5 of 7 checks passed

2 participants