feat: add upload deduplication, README rewrite, and backend improvements #98

Merged
krapfj23 merged 13 commits into main from dedup-readme-and-improvements
Apr 17, 2026

@krapfj23
Collaborator

Summary

  • Upload deduplication: SHA-256 content hashing at upload time. Duplicate files return the existing document immediately, skipping the full Cognee pipeline. Frontend shows an amber "Duplicate" card with a link to the existing document.
  • README rewrite: Replaced outdated ETL-era README with a developer onboarding guide covering Docker/manual setup, project structure, API endpoints, testing, linting, CI, and PR conventions.
  • Backend quality: Code refactoring, test suite (integration tests, dataset validation, storage), CI workflow for backend tests, CLAUDE.md project documentation.
  • Frontend enhancements: Knowledge graph node details panel, search, highlighting, cross-page navigation, upload progress tracking.

Changes

  • supabase/migrations/019_add_content_hash.sql — new content_hash column + index
  • backend/app/routes/documents.py — SHA-256 dedup check before pipeline
  • backend/app/services/document_metadata_service.py — find_document_by_hash(), hash in create_document()
  • backend/tests/test_integration.py — 28 integration tests including 5 dedup tests
  • frontend/src/pages/UploadPage.tsx — duplicate detection UI
  • frontend/src/services/api.ts — updated UploadedFile type
  • README.md — full rewrite for developer onboarding

Test plan

  • All 28 backend integration tests pass (pytest tests/test_integration.py -v)
  • Ruff lint passes on all backend files
  • TypeScript compiles with zero errors (tsc --noEmit)
  • Prettier formatting passes on frontend files
  • CI pipelines pass (lint, test, Docker build)

🤖 Generated with Claude Code

krapfj23 and others added 12 commits April 15, 2026 20:22
Formatting, async Supabase migration, improved error handling,
and logging across routes and services.

Add frontend Dockerfiles, ESLint, Prettier, Vercel config, and nginx.
Update docker-compose, env example, and lint workflow.

Remove broken route-level tests from test_ingest (referenced removed
functions). Update test_storage and test_dataset_name_validation for
current service signatures.

23 tests exercising full HTTP request → route → service → response
chain. Covers upload, search, graph, document CRUD, file-url, and
health check endpoints. External services mocked at SDK boundary.
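
To illustrate what mocking at the SDK boundary can look like, here is a minimal standard-library sketch. `StorageSDK`, `store_upload`, and the key layout are hypothetical names invented for this example, not the project's actual code; the point is only that the service logic runs for real while the external client call is replaced.

```python
from unittest.mock import patch


class StorageSDK:
    """Hypothetical stand-in for an external SDK client (for example object storage)."""

    def put_object(self, key: str, data: bytes) -> None:
        raise RuntimeError("would hit the network")


sdk = StorageSDK()


def store_upload(filename: str, data: bytes) -> dict:
    """Hypothetical service function: this logic is exercised unmocked."""
    key = f"uploads/{filename}"
    sdk.put_object(key, data)
    return {"key": key, "size": len(data)}


def test_store_upload_calls_sdk_once() -> None:
    # Patch only the SDK method; everything above it stays real.
    with patch.object(StorageSDK, "put_object") as fake_put:
        result = store_upload("report.pdf", b"hello")

    fake_put.assert_called_once_with("uploads/report.pdf", b"hello")
    assert result == {"key": "uploads/report.pdf", "size": 5}
```

Patching at this level keeps the test independent of network access while still catching regressions in how the service builds its request to the SDK.
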
Replace standalone script with pytest-discoverable e2e test. Creates
temp fixture data (no external mock_data needed), uses Cognee embedded
defaults (LanceDB/KuzuDB), auto-skips when LLM_API_KEY is missing.
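
The auto-skip and temp-fixture ideas can be sketched as follows. This is an assumption-laden illustration, not the project's test: it uses `unittest.skipUnless` to stay standard-library only (pytest collects `unittest.TestCase` classes and honors their skip decorators; a pytest-native version would typically use `pytest.mark.skipif`). The helper names `llm_key_present` and `write_fixture_docs` and the fixture text are invented here.

```python
import os
import tempfile
import unittest
from pathlib import Path


def llm_key_present(environ=os.environ) -> bool:
    """Return True when a non-empty LLM_API_KEY is configured."""
    return bool(environ.get("LLM_API_KEY", "").strip())


def write_fixture_docs(root: Path) -> list[Path]:
    """Create small throwaway text files instead of relying on checked-in mock data."""
    docs = {
        "alpha.txt": "Acme signed a contract with Globex.",
        "beta.txt": "Globex is headquartered in Springfield.",
    }
    paths = []
    for name, text in docs.items():
        path = root / name
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths


@unittest.skipUnless(llm_key_present(), "LLM_API_KEY is not set")
class EndToEndSketch(unittest.TestCase):
    def test_fixture_files_exist(self):
        with tempfile.TemporaryDirectory() as tmp:
            paths = write_fixture_docs(Path(tmp))
            # A real e2e test would pass these paths to the ingestion pipeline here.
            self.assertEqual(len(paths), 2)
```
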
Run pytest on every PR touching backend/. Excludes broken test_storage
and e2e test_cognee. Adds pip caching, pytest-asyncio dependency, and
registers the e2e marker in pyproject.toml.

Architecture overview, key files, environment variables, run/test
commands, branch naming conventions, and code review checklist.

…, and cross-page navigation

Fixes the graph flipping/bouncing bug by stabilizing the force simulation
(cooldownTicks, d3AlphaDecay, d3VelocityDecay, warmupTicks) and memoizing
graph data to prevent unnecessary re-renders. Adds:

- Click-to-inspect node detail panel with connected entities, related
  content (Cognee CHUNKS search), and source documents
- Connected node highlighting: selected node glows, neighbors stay
  visible, unrelated nodes dim to 20% opacity
- Graph node search (client-side filter with dropdown, zoom-to-node)
- Search-to-graph bridge: "View in Graph" button on search result
  source cards navigates to /graph?dataset=X
- URL param support: ?dataset= auto-selects filter, ?node= auto-selects
  and zooms to a node
- Improved UI: overlaid controls, polished hover tooltip, degree-based
  node sizing, UUID label filtering

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
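
The highlighting, degree-based sizing, and UUID label filtering described in this commit come down to a few small rules. The sketch below expresses them in Python purely as language-neutral logic (the actual frontend is TypeScript); the function names, the radius constants, and the edge representation are assumptions made for illustration. Only the 20% dimming value comes from the commit message.

```python
import re
from collections import defaultdict

UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)
DIMMED_OPACITY = 0.2  # unrelated nodes dim to 20% opacity


def build_adjacency(edges):
    """Map each node id to the set of its direct neighbors (undirected)."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)
    return adjacency


def node_opacity(node_id, selected_id, adjacency):
    """Selected node and its neighbors stay fully visible; everything else dims."""
    if selected_id is None:
        return 1.0
    if node_id == selected_id or node_id in adjacency.get(selected_id, set()):
        return 1.0
    return DIMMED_OPACITY


def node_radius(node_id, adjacency, base=4.0, per_edge=1.5, cap=16.0):
    """Degree-based sizing; base, per_edge, and cap are made-up values."""
    return min(cap, base + per_edge * len(adjacency.get(node_id, set())))


def display_label(label):
    """Hide labels that are bare UUIDs, since they carry no meaning for a reader."""
    return "" if UUID_RE.match(label) else label
```
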
Compute a SHA-256 hash of file contents at upload time and check for an
existing completed document with the same hash before running the pipeline.
Duplicates return the existing document immediately, skipping R2 upload,
LLM classification, and Cognee ingestion. Frontend shows a distinct amber
"Duplicate" card with a link to the existing document.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
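
A minimal sketch of the dedup check described above, assuming an in-memory store in place of the real database. The names `find_document_by_hash()` and `create_document()` appear in the PR's change list, but their signatures here, along with `InMemoryStore`, `Document`, `handle_upload`, and the response shape, are assumptions for illustration only.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Document:
    id: int
    filename: str
    content_hash: str
    status: str = "completed"


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for the documents table."""

    docs: list = field(default_factory=list)

    def find_document_by_hash(self, content_hash):
        # Only completed documents count as duplicates.
        for doc in self.docs:
            if doc.content_hash == content_hash and doc.status == "completed":
                return doc
        return None

    def create_document(self, filename, content_hash):
        doc = Document(len(self.docs) + 1, filename, content_hash)
        self.docs.append(doc)
        return doc


def handle_upload(store, filename, data: bytes, run_pipeline):
    """Hash the bytes first; run the expensive pipeline only for new content."""
    content_hash = hashlib.sha256(data).hexdigest()
    existing = store.find_document_by_hash(content_hash)
    if existing is not None:
        # Duplicate: return the existing document and skip storage and ingestion.
        return {"duplicate": True, "document": existing}

    run_pipeline(filename, data)
    doc = store.create_document(filename, content_hash)
    return {"duplicate": False, "document": doc}
```

Because the hash covers file contents rather than the filename, a renamed copy of the same file is still treated as a duplicate in this sketch.
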
Replace outdated ETL-era README with practical setup instructions covering
Docker and manual workflows, project structure, API endpoints, testing,
linting, CI/CD, and branch/PR conventions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Delete legacy route and service modules that were superseded by the
Cognee-based pipeline. Update api.py, CLAUDE.md, and related services
to drop references to the removed modules.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…arch results

Summary/Insights/Entities tabs were all rendering raw document chunks
because the pipeline used SearchType.CHUNKS for every query. Switch to
GRAPH_SUMMARY_COMPLETION for the summary and GRAPH_COMPLETION for
insights and entities, and add _split_bulleted() to break the resulting
narrative answers into discrete list items.

Also swap the dataset-slug pill on search results for the underlying
document filename (falling back to the dataset name when no source is
attached) so users see the specific document rather than a sanitized
client slug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@GordonBie123 GordonBie123 self-requested a review April 17, 2026 16:58
Collaborator

@GordonBie123 GordonBie123 left a comment

Merge

kaydencel previously approved these changes Apr 17, 2026
Collaborator

@kaydencel kaydencel left a comment

merge per Gordon and Jeff's request

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Collaborator

@kaydencel kaydencel left a comment

approve

@krapfj23 krapfj23 merged commit 336a365 into main Apr 17, 2026
5 of 8 checks passed

3 participants