feat: dedup, Cognee pipeline overhaul, and security patches #99
Merged
Formatting, async Supabase migration, improved error handling, and logging across routes and services.
Add frontend Dockerfiles, ESLint, Prettier, Vercel config, and nginx. Update docker-compose, env example, and lint workflow.
Remove broken route-level tests from test_ingest (referenced removed functions). Update test_storage and test_dataset_name_validation for current service signatures.
23 tests exercising full HTTP request → route → service → response chain. Covers upload, search, graph, document CRUD, file-url, and health check endpoints. External services mocked at SDK boundary.
Replace standalone script with pytest-discoverable e2e test. Creates temp fixture data (no external mock_data needed), uses Cognee embedded defaults (LanceDB/KuzuDB), auto-skips when LLM_API_KEY is missing.
Run pytest on every PR touching backend/. Excludes broken test_storage and e2e test_cognee. Adds pip caching, pytest-asyncio dependency, and registers the e2e marker in pyproject.toml.
Architecture overview, key files, environment variables, run/test commands, branch naming conventions, and code review checklist.
…, and cross-page navigation

Fixes the graph flipping/bouncing bug by stabilizing the force simulation (cooldownTicks, d3AlphaDecay, d3VelocityDecay, warmupTicks) and memoizing graph data to prevent unnecessary re-renders.

Adds:
- Click-to-inspect node detail panel with connected entities, related content (Cognee CHUNKS search), and source documents
- Connected node highlighting: selected node glows, neighbors stay visible, unrelated nodes dim to 20% opacity
- Graph node search (client-side filter with dropdown, zoom-to-node)
- Search-to-graph bridge: "View in Graph" button on search result source cards navigates to /graph?dataset=X
- URL param support: ?dataset= auto-selects filter, ?node= auto-selects and zooms to a node
- Improved UI: overlaid controls, polished hover tooltip, degree-based node sizing, UUID label filtering

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Compute a SHA-256 hash of file contents at upload time and check for an existing completed document with the same hash before running the pipeline. Duplicates return the existing document immediately, skipping R2 upload, LLM classification, and Cognee ingestion. Frontend shows a distinct amber "Duplicate" card with a link to the existing document. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
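The hash-then-lookup flow in this commit could be sketched roughly as below. This is an illustration under stated assumptions, not the PR's code: the `Document` dataclass, the in-memory `_documents` list (standing in for the Supabase table), and `handle_upload` are hypothetical. Only the SHA-256 step and the name `find_document_by_hash` come from the PR text.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Document:
    doc_id: str
    content_hash: str
    status: str  # e.g. "processing" or "completed"


# Hypothetical in-memory stand-in for the documents table.
_documents: List[Document] = []


def find_document_by_hash(content_hash: str) -> Optional[Document]:
    """Return a completed document with the same content hash, if any."""
    for doc in _documents:
        if doc.content_hash == content_hash and doc.status == "completed":
            return doc
    return None


def handle_upload(file_bytes: bytes) -> dict:
    """Hash the uploaded bytes and short-circuit when a duplicate exists."""
    content_hash = hashlib.sha256(file_bytes).hexdigest()
    existing = find_document_by_hash(content_hash)
    if existing is not None:
        # Duplicate: skip storage upload, classification, and ingestion.
        return {"duplicate": True, "doc_id": existing.doc_id}

    # Assumed placeholder for the real pipeline (R2 upload, LLM, Cognee).
    doc = Document(
        doc_id=f"doc-{len(_documents) + 1}",
        content_hash=content_hash,
        status="completed",
    )
    _documents.append(doc)
    return {"duplicate": False, "doc_id": doc.doc_id}
```

Checking only completed documents, as the commit describes, means a failed or still-running upload does not block a retry of the same file.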
Replace outdated ETL-era README with practical setup instructions covering Docker and manual workflows, project structure, API endpoints, testing, linting, CI/CD, and branch/PR conventions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Delete legacy route and service modules that were superseded by the Cognee-based pipeline. Update api.py, CLAUDE.md, and related services to drop references to the removed modules. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…arch results Summary/Insights/Entities tabs were all rendering raw document chunks because the pipeline used SearchType.CHUNKS for every query. Switch to GRAPH_SUMMARY_COMPLETION for the summary and GRAPH_COMPLETION for insights and entities, and add _split_bulleted() to break the resulting narrative answers into discrete list items. Also swap the dataset-slug pill on search results for the underlying document filename (falling back to the dataset name when no source is attached) so users see the specific document rather than a sanitized client slug. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the flaky global-graph snapshot/diff approach for extracting insights and entities with a direct per-dataset read from Cognee's relational store (get_dataset_related_nodes / get_dataset_related_edges). The previous approach snapshotted the whole Kuzu graph before cognify and diffed after, which raced under concurrent uploads and left most documents with 0 insights · 0 entities. The new path queries nodes and edges by indexed dataset_id, so concurrent pipelines can't interfere with each other.

Also:
- Summary now uses SearchType.SUMMARIES (the cognify-generated summary), not GRAPH_SUMMARY_COMPLETION.
- Edges reference nodes by Node.slug (the original DataPoint id), not Node.id (which is a derived uuid5), so the filter is keyed on slug.
- Structural node types (TextDocument, DocumentChunk, TextSummary, IndexSchema, Document) excluded from entities and insight endpoints.
- Add backend/scripts/clear_all.py to wipe R2, Supabase rows, the Cognee graph (resolved via cognee's own base_config to handle the venv-internal .cognee_system path), and the pgvector schema.
- Update CLAUDE.md to document the new pipeline and reset script.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
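The slug-keyed filtering described in this commit might look something like the sketch below. It works on plain dictionaries rather than Cognee's real Node and Edge records, and the field names (`slug`, `type`, `label`, `source`, `target`, `relation`) and the function name are assumptions chosen for illustration. The list of structural types is taken from the commit message.

```python
STRUCTURAL_TYPES = {
    "TextDocument",
    "DocumentChunk",
    "TextSummary",
    "IndexSchema",
    "Document",
}


def extract_entities_and_insights(nodes, edges):
    """Drop structural nodes, then keep edges whose endpoints both survive."""
    # Index by slug, not id: edges refer to nodes by their slug.
    entities = {
        node["slug"]: node
        for node in nodes
        if node["type"] not in STRUCTURAL_TYPES
    }

    insights = []
    for edge in edges:
        source = entities.get(edge["source"])
        target = entities.get(edge["target"])
        if source is not None and target is not None:
            insights.append((source["label"], edge["relation"], target["label"]))

    return list(entities.values()), insights
```

Because the inputs are already scoped to one dataset, nothing here depends on global graph state, which is the property the commit relies on to avoid the race under concurrent uploads.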
Frontend (3 → 0 vulns): npm audit fix bumped axios, follow-redirects, and vite transitively. Lockfile-only; package.json untouched.

Backend (37 → 14 findings): added minimum-safe version pins in requirements.txt for transitive deps flagged by pip-audit:
- aiohttp>=3.13.4 (10 aiohttp CVEs)
- cryptography>=46.0.7 (2 CVEs)
- pygments>=2.20.0
- pypdf>=6.10.2 (5 advisories)
- requests>=2.33.0
- starlette>=0.49.1 (CVE-2025-62727)
- python-multipart>=0.0.26
- litellm>=1.83.0 (3 CVEs)

Starlette 0.49 forced a fastapi bump: 0.119 pinned starlette<0.49, so fastapi is now >=0.120.0 (resolved to 0.136.0). All 51 backend tests pass, ruff clean.

Remaining 14 pip-audit findings are intentionally deferred:
- transformers 5.0.0rc3 (major RC, cognee dep, high breakage risk)
- pytest 9.0 and pip/setuptools majors (dev/build tooling only)
- diskcache 5.6.3 (no upstream fix available)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
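As a rough way to sanity-check the floors listed in this commit against a resolved environment, a comparison like the following could be used. It is a sketch only: `MIN_PINS` copies the pins from the commit message, while `parse_release` and `below_floor` are hypothetical helpers that handle plain numeric versions and would not parse pre-releases such as 5.0.0rc3.

```python
from typing import Dict, List, Tuple

# Minimum versions as listed in the commit message.
MIN_PINS = {
    "aiohttp": "3.13.4",
    "cryptography": "46.0.7",
    "pygments": "2.20.0",
    "pypdf": "6.10.2",
    "requests": "2.33.0",
    "starlette": "0.49.1",
    "python-multipart": "0.0.26",
    "litellm": "1.83.0",
    "fastapi": "0.120.0",
}


def parse_release(version: str) -> Tuple[int, ...]:
    """Turn '3.13.4' into (3, 13, 4) so comparison is numeric, not textual."""
    return tuple(int(part) for part in version.split("."))


def below_floor(installed: Dict[str, str]) -> List[str]:
    """Return the sorted names of packages older than their minimum pin."""
    return sorted(
        name
        for name, version in installed.items()
        if name in MIN_PINS
        and parse_release(version) < parse_release(MIN_PINS[name])
    )
```

Comparing integer tuples matters here: as strings, "3.9.0" sorts after "3.13.4", which would hide an outdated aiohttp.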
…ovements

# Conflicts:
#	CLAUDE.md
#	backend/app/services/document_pipeline.py
Summary
- /documents/upload computes SHA-256 of file bytes and short-circuits via find_document_by_hash() if a doc already exists.
- Insights and entities are read per dataset via get_dataset_related_nodes / get_dataset_related_edges. Keyed on Node.slug (not Node.id) since edges reference the slug.
- backend/scripts/clear_all.py wipes R2, Supabase, the Cognee graph (resolved through cognee.base_config to handle both the venv-internal and backend-root .cognee_system/), and the pgvector schema.
- Frontend 3 → 0 vulnerabilities (npm audit fix); backend 37 → 14 pip-audit findings via min-version pins on aiohttp, cryptography, pypdf, requests, starlette, python-multipart, pygments, litellm. Required bumping fastapi 0.119 → >=0.120 since the old pin forced starlette<0.49.

Test plan
- cd backend && ruff check && pytest: 51 tests pass locally (skipping test_cognee.py / test_storage.py, which need credentials).
- cd frontend && npm audit: 0 vulnerabilities.
- cd backend && pip-audit: 14 remaining findings are all in explicitly deferred categories (transformers 5.0 RC, pytest/setuptools/pip majors, diskcache with no upstream fix).
- python backend/scripts/clear_all.py --yes, restart backend, upload 3 PDFs simultaneously, confirm each document shows non-zero insight/entity counts (regression test for the concurrency bug).
- Summary comes from SearchType.SUMMARIES, Insights render as Subject → relation → Object cards, Entities render as chips. No raw chunk text.
- Re-uploading an identical file returns duplicate: true with the existing doc_id.

🤖 Generated with Claude Code