Dedup readme and improvements #101

Merged
krapfj23 merged 19 commits into main from dedup-readme-and-improvements
Apr 23, 2026

Conversation

@krapfj23
Collaborator

{PR Title}

🎫 Closes #

🖹 Description

What changed:
Brief summary of the change (2-3 sentences max).

Why:
What problem this solves or requirement it addresses.

Technical decisions worth noting:
Any non-obvious choices and their rationale.

🤖 AI disclosure: Note what was AI-assisted vs. hand-written (Safe-Space).


🏗️ Change Type

  • Infrastructure / Config
  • CI/CD Pipeline
  • Feature
  • Security / Secrets
  • Dependency Update
  • Hotfix

✅ Tests & Verification

Automated:

  • CI pipeline passes (link: )
  • Added/updated unit tests

💥 Known Issues / Limitations

Document any known issues, pending concerns, or follow-up tickets.

ℹ️ Additional Context

Anything else reviewers should know — dependencies on other PRs, timing constraints, etc.

krapfj23 and others added 19 commits April 15, 2026 20:22
Formatting, async Supabase migration, improved error handling,
and logging across routes and services.

Add frontend Dockerfiles, ESLint, Prettier, Vercel config, and nginx.
Update docker-compose, env example, and lint workflow.

Remove broken route-level tests from test_ingest (referenced removed
functions). Update test_storage and test_dataset_name_validation for
current service signatures.

23 tests exercising full HTTP request → route → service → response
chain. Covers upload, search, graph, document CRUD, file-url, and
health check endpoints. External services mocked at SDK boundary.

Replace standalone script with pytest-discoverable e2e test. Creates
temp fixture data (no external mock_data needed), uses Cognee embedded
defaults (LanceDB/KuzuDB), auto-skips when LLM_API_KEY is missing.

Run pytest on every PR touching backend/. Excludes broken test_storage
and e2e test_cognee. Adds pip caching, pytest-asyncio dependency, and
registers the e2e marker in pyproject.toml.

Architecture overview, key files, environment variables, run/test
commands, branch naming conventions, and code review checklist.
…, and cross-page navigation

Fixes the graph flipping/bouncing bug by stabilizing the force simulation
(cooldownTicks, d3AlphaDecay, d3VelocityDecay, warmupTicks) and memoizing
graph data to prevent unnecessary re-renders. Adds:

- Click-to-inspect node detail panel with connected entities, related
  content (Cognee CHUNKS search), and source documents
- Connected node highlighting: selected node glows, neighbors stay
  visible, unrelated nodes dim to 20% opacity
- Graph node search (client-side filter with dropdown, zoom-to-node)
- Search-to-graph bridge: "View in Graph" button on search result
  source cards navigates to /graph?dataset=X
- URL param support: ?dataset= auto-selects filter, ?node= auto-selects
  and zooms to a node
- Improved UI: overlaid controls, polished hover tooltip, degree-based
  node sizing, UUID label filtering

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Compute a SHA-256 hash of file contents at upload time and check for an
existing completed document with the same hash before running the pipeline.
Duplicates return the existing document immediately, skipping R2 upload,
LLM classification, and Cognee ingestion. Frontend shows a distinct amber
"Duplicate" card with a link to the existing document.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
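
The duplicate check described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `InMemoryDocumentStore`, `find_completed_by_hash`, and `handle_upload` are hypothetical names, and an in-memory dict stands in for the Supabase documents table. Only `hashlib.sha256` is a real library call.

```python
import hashlib
from typing import Dict, Optional


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of the raw file contents."""
    return hashlib.sha256(data).hexdigest()


class InMemoryDocumentStore:
    """Hypothetical stand-in for the documents table (assumption)."""

    def __init__(self) -> None:
        self._by_hash: Dict[str, dict] = {}

    def find_completed_by_hash(self, content_hash: str) -> Optional[dict]:
        # Only a completed document counts as a duplicate.
        doc = self._by_hash.get(content_hash)
        if doc is not None and doc["status"] == "completed":
            return doc
        return None

    def save(self, doc: dict) -> None:
        self._by_hash[doc["content_hash"]] = doc


def handle_upload(store: InMemoryDocumentStore, filename: str, data: bytes) -> dict:
    """Hash first, and short-circuit before any expensive pipeline step."""
    content_hash = sha256_hex(data)
    existing = store.find_completed_by_hash(content_hash)
    if existing is not None:
        return {"duplicate": True, "document": existing}

    # The expensive steps (storage upload, classification, ingestion)
    # would run here; they are omitted from this sketch.
    doc = {"filename": filename, "content_hash": content_hash, "status": "completed"}
    store.save(doc)
    return {"duplicate": False, "document": doc}
```

Uploading the same bytes twice under different filenames returns the first document with `duplicate` set to `True` on the second call, which is the signal the frontend would need to render a distinct "Duplicate" card.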
Replace outdated ETL-era README with practical setup instructions covering
Docker and manual workflows, project structure, API endpoints, testing,
linting, CI/CD, and branch/PR conventions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Delete legacy route and service modules that were superseded by the
Cognee-based pipeline. Update api.py, CLAUDE.md, and related services
to drop references to the removed modules.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…arch results

Summary/Insights/Entities tabs were all rendering raw document chunks
because the pipeline used SearchType.CHUNKS for every query. Switch to
GRAPH_SUMMARY_COMPLETION for the summary and GRAPH_COMPLETION for
insights and entities, and add _split_bulleted() to break the resulting
narrative answers into discrete list items.
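
The commit names `_split_bulleted()` but does not show it, so the following is only one plausible way such a helper might work. The function name `split_bulleted`, the accepted bullet markers, and the continuation-line behavior are all assumptions made for illustration.

```python
import re
from typing import List

# Assumed bullet markers: "-", "*", "•", or a number followed by "." or ")".
_BULLET_PREFIX = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")


def split_bulleted(text: str) -> List[str]:
    """Split a narrative answer into discrete list items.

    Lines that start with a bullet marker open a new item; other
    non-empty lines are treated as continuations of the previous item.
    """
    items: List[str] = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if _BULLET_PREFIX.match(line):
            items.append(_BULLET_PREFIX.sub("", line, count=1).strip())
        elif items:
            items[-1] += " " + stripped  # continuation of the last item
        else:
            items.append(stripped)  # leading text before any bullet
    return items
```

Requiring whitespace after the marker keeps text such as "-5 degrees" from being mistaken for a bullet.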

Also swap the dataset-slug pill on search results for the underlying
document filename (falling back to the dataset name when no source is
attached) so users see the specific document rather than a sanitized
client slug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the flaky global-graph snapshot/diff approach for extracting
insights and entities with a direct per-dataset read from Cognee's
relational store (get_dataset_related_nodes / get_dataset_related_edges).

The previous approach snapshotted the whole Kuzu graph before cognify
and diffed after, which raced under concurrent uploads and left most
documents with 0 insights · 0 entities. The new path queries nodes
and edges by indexed dataset_id, so concurrent pipelines can't
interfere with each other.

Also:
- Summary now uses SearchType.SUMMARIES (the cognify-generated
  summary), not GRAPH_SUMMARY_COMPLETION.
- Edges reference nodes by Node.slug (the original DataPoint id),
  not Node.id (which is a derived uuid5) — filter keyed on slug.
- Structural node types (TextDocument, DocumentChunk, TextSummary,
  IndexSchema, Document) excluded from entities and insight endpoints.
- Add backend/scripts/clear_all.py to wipe R2, Supabase rows, the
  Cognee graph (resolved via cognee's own base_config to handle the
  venv-internal .cognee_system path), and the pgvector schema.
- Update CLAUDE.md to document the new pipeline and reset script.
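
The slug-keyed filtering and the structural-type exclusion described above might look something like this. It is a sketch under stated assumptions: nodes and edges are modeled as plain dicts with hypothetical `id`, `slug`, `type`, `source`, and `target` keys, whereas the real objects come from Cognee's relational store and may be shaped differently.

```python
from typing import Dict, List, Tuple

# Structural node types listed in the commit message.
STRUCTURAL_TYPES = {
    "TextDocument",
    "DocumentChunk",
    "TextSummary",
    "IndexSchema",
    "Document",
}


def extract_entities_and_insights(
    nodes: List[dict], edges: List[dict]
) -> Tuple[List[dict], List[dict]]:
    """Keep non-structural nodes and the edges that connect them.

    Edges are assumed to carry node slugs (not ids) in "source" and
    "target", so the lookup table is keyed on slug.
    """
    entities: Dict[str, dict] = {
        node["slug"]: node for node in nodes if node["type"] not in STRUCTURAL_TYPES
    }
    insights = [
        edge
        for edge in edges
        if edge["source"] in entities and edge["target"] in entities
    ]
    return list(entities.values()), insights
```

Keying the lookup on `id` instead would silently drop every edge in this model, because the edge endpoints never match the derived ids.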

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Frontend (3 → 0 vulns): npm audit fix bumped axios, follow-redirects,
and vite transitively. Lockfile-only; package.json untouched.

Backend (37 → 14 findings): added minimum-safe version pins in
requirements.txt for transitive deps flagged by pip-audit:
  aiohttp>=3.13.4      (10 aiohttp CVEs)
  cryptography>=46.0.7 (2 CVEs)
  pygments>=2.20.0
  pypdf>=6.10.2        (5 advisories)
  requests>=2.33.0
  starlette>=0.49.1    (CVE-2025-62727)
  python-multipart>=0.0.26
  litellm>=1.83.0      (3 CVEs)

Starlette 0.49 forced a fastapi bump: fastapi 0.119 pinned starlette<0.49,
so fastapi is now >=0.120.0 (resolved to 0.136.0). All 51 backend tests
pass, ruff clean.

Remaining 14 pip-audit findings are intentionally deferred:
- transformers 5.0.0rc3 (major RC, cognee dep, high breakage risk)
- pytest 9.0 and pip/setuptools majors (dev/build tooling only)
- diskcache 5.6.3 (no upstream fix available)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
…ovements

# Conflicts:
#	CLAUDE.md
#	backend/app/services/document_pipeline.py
find_document_by_hash() and get_document() use .maybe_single().execute(),
which returns None (not a result wrapper) when zero rows match in the
async supabase-py client. The old result.data access raised
AttributeError on every upload of a new (non-duplicate) file.

Check for None before accessing .data.
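
A minimal sketch of the guard, assuming the behavior the commit message describes (`.maybe_single().execute()` resolving to `None` when no row matches). `FakeQuery` is a hypothetical stand-in used so the example runs without a Supabase client, and the simplified `find_document_by_hash` signature here is an assumption rather than the service's real one.

```python
import asyncio
from types import SimpleNamespace
from typing import Optional


class FakeQuery:
    """Hypothetical stand-in that mimics the described client behavior."""

    def __init__(self, row: Optional[dict]) -> None:
        self._row = row

    def maybe_single(self) -> "FakeQuery":
        return self

    async def execute(self):
        # Per the commit message: None (not a result wrapper) on zero rows.
        if self._row is None:
            return None
        return SimpleNamespace(data=self._row)


async def find_document_by_hash(query: FakeQuery) -> Optional[dict]:
    result = await query.maybe_single().execute()
    if result is None:
        # No matching row: a brand-new file, not an error.
        return None
    return result.data


# Usage: a miss returns None instead of raising AttributeError.
miss = asyncio.run(find_document_by_hash(FakeQuery(None)))
hit = asyncio.run(find_document_by_hash(FakeQuery({"id": "doc-1"})))
```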

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Collaborator

@kaydencel kaydencel left a comment

merge per jeff's request

@krapfj23 krapfj23 merged commit cb1687d into main Apr 23, 2026
5 checks passed

2 participants