Speed up corpus home document index by trimming GraphQL over-fetch#1647
Speed up corpus home document index by trimming GraphQL over-fetch#1647JSv4 wants to merge 3 commits into
Conversation
The "Loading document structure..." spinner on Corpus Home was driven by
two GraphQL queries that requested far more data than the TOC renders.
1) GET_DOCUMENT_RELATIONSHIPS (cap: 500 rows) selected full source/target
document objects, corpus, creator, annotation label color/icon, data,
timestamps, and myPermissions on every row. DocumentTableOfContents
actually consumes only relationshipType, annotationLabel.text, and the
two document IDs -- the rest was discarded. The per-relationship icon
fields still triggered build_absolute_uri() calls in
create_file_resolver, and "parent" filtering happened entirely on the
client.
2) GET_CORPUS_DOCUMENTS_FOR_TOC selected icon (unused -- renderNode
derives the on-screen icon from fileType via getFileIcon) and
creator (also unused), adding a file-URL resolver call per document
plus an extra join.
Fixes:
- Add annotation_label_text iexact filter to DocumentRelationshipFilter
so the parent-label filter runs server-side alongside the existing
relationship_type filter.
- Introduce a new ultra-lean GET_CORPUS_DOCUMENT_TOC_EDGES query that
returns only { id, sourceDocument.id, targetDocument.id } and lives
alongside (does not replace) GET_DOCUMENT_RELATIONSHIPS -- the latter
is still consumed by CorpusDocumentRelationships and
DocumentRelationshipModal with its full payload.
- Slim GET_CORPUS_DOCUMENTS_FOR_TOC to { id, title, description, slug,
fileType }; RunCorpusActionModal (the only other consumer) only reads
id and title.
- Rewire DocumentTableOfContents to consume the new edges query, drop
redundant document-metadata overwrites from relationship rows, and
remove now-dead icon plumbing.
|
test comment |
Code Review — PR 1647: Speed up corpus home document index |
|
PLACEHOLDER_REVIEW_2 |
|
Part 1 of review - Testing single line with special chars: backtick test |
CLAUDE.mdThis file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. Project OverviewOpenContracts is an MIT-licensed enterprise document analytics platform for PDFs and text-based formats. It features a Django/GraphQL backend with PostgreSQL + pgvector, a React/TypeScript frontend with Jotai state management, and pluggable document processing pipelines powered by machine learning models. Baseline Commit Rules
Essential CommandsBackend (Django)# Run backend tests (sequential, use --keepdb to speed up subsequent runs)
docker compose -f test.yml run django python manage.py test --keepdb
# Run backend tests in PARALLEL (recommended - ~4x faster)
# Uses pytest-xdist with 4 workers, --dist loadscope keeps class tests together
docker compose -f test.yml run django pytest -n 4 --dist loadscope
# Run parallel tests with auto-detected worker count (uses all CPU cores)
docker compose -f test.yml run django pytest -n auto --dist loadscope
# Run parallel tests with fresh databases (first run or after schema changes)
docker compose -f test.yml run django pytest -n 4 --dist loadscope --create-db
# Run specific test file
docker compose -f test.yml run django python manage.py test opencontractserver.tests.test_notifications --keepdb
# Run specific test file in parallel
docker compose -f test.yml run django pytest opencontractserver/tests/test_notifications.py -n 4 --dist loadscope
# Run specific test class/method
docker compose -f test.yml run django python manage.py test opencontractserver.tests.test_notifications.TestNotificationModel.test_create_notification --keepdb
# Apply database migrations
docker compose -f local.yml run django python manage.py migrate
# Create new migration
docker compose -f local.yml run django python manage.py makemigrations
# Django shell
docker compose -f local.yml run django python manage.py shell
# Code quality (runs automatically via pre-commit hooks)
pre-commit run --all-filesFrontend (React/TypeScript)cd frontend
# Start development server (proxies to Django on :8000)
yarn start
# Run unit tests (Vitest) - watches by default
yarn test:unit
# Run component tests (Playwright) - CRITICAL: Use --reporter=list to prevent hanging
yarn test:ct --reporter=list
# Run component tests with grep filter
yarn test:ct --reporter=list -g "test name pattern"
# Run E2E tests
yarn test:e2e
# Coverage reports (unit tests via Vitest, component tests via Playwright + Istanbul)
yarn test:coverage:unit
yarn test:coverage:ct
# Linting and formatting
yarn lint
yarn fix-styles
# Build for production
yarn build
# Preview production build locally
yarn serveProduction Deployment# CRITICAL: Always run migrations FIRST in production
docker compose -f production.yml --profile migrate up migrate
# Then start main services
docker compose -f production.yml upHigh-Level ArchitectureBackend ArchitectureStack: Django 4.x + GraphQL (Graphene) + PostgreSQL + pgvector + Celery Key Patterns:
Frontend ArchitectureStack: React 18 + TypeScript + Apollo Client + Jotai (atoms) + PDF.js + Vite Key Patterns:
Data Flow ArchitectureDocument Processing:
GraphQL Permission Flow:
Critical Security Patterns
Critical Concepts
Testing PatternsManual Test ScriptsLocation: When performing manual testing (e.g., testing migrations, verifying database state, testing API endpoints interactively), always document the test steps in a markdown file under Format: # Test: [Brief description]
## Purpose
What this test verifies.
## Prerequisites
- Required state (e.g., "migration at 0058")
- Required data (e.g., "at least one document exists")
## Steps
1. Step one with exact command
```bash
docker compose -f local.yml run --rm django python manage.py shell -c "..."
Expected Results
CleanupCommands to restore original state if needed. Automated Documentation ScreenshotsLocation: Screenshots for documentation are captured during Playwright component tests and pushed back to a PR branch by the How it works:
Naming convention (
At least 2 segments required, 3 recommended. All lowercase alphanumeric with single hyphens. Example: import { docScreenshot } from "./utils/docScreenshot";
// After the component renders and assertions pass:
await docScreenshot(page, "badges--celebration-modal--auto-award");Rules:
Release Screenshots (Point-in-Time)For release notes, use Location: import { releaseScreenshot } from "./utils/docScreenshot";
await releaseScreenshot(page, "v3.0.0.b3", "landing-page", { fullPage: true });Key differences from
When to use which:
Authenticated Playwright Testing (Live Frontend Debugging)When you need to interact with the running frontend as an authenticated user (e.g., debugging why a query returns empty results), use Django admin session cookies to authenticate GraphQL requests. Architecture context: The frontend uses Auth0 for authentication, but the Django backend also accepts session cookie auth. Apollo Client sends GraphQL requests directly to Step 1: Set a password for the superuser (one-time setup): docker compose -f local.yml exec django python manage.py shell -c "
from django.contrib.auth import get_user_model
User = get_user_model()
u = User.objects.filter(is_superuser=True).first()
u.set_password('testpass123')
u.save()
print(f'Password set for {u.username}')
"Step 2: Playwright script pattern: const { chromium } = require('playwright');
(async () => {
const browser = await chromium.launch({ headless: true });
const context = await browser.newContext();
const page = await context.newPage();
// Collect console messages for debugging
const consoleMsgs = [];
page.on('console', msg => consoleMsgs.push('[' + msg.type() + '] ' + msg.text()));
// 1. Login to Django admin to get session cookie
await page.goto('http://localhost:8000/admin/login/');
await page.fill('#id_username', '<superuser-username>');
await page.fill('#id_password', 'testpass123');
await page.click('input[type=submit]');
await page.waitForTimeout(2000);
// 2. Extract the session cookie
const cookies = await context.cookies();
const sessionCookie = cookies.find(c => c.name === 'sessionid');
// 3. Intercept GraphQL requests to inject the session cookie
// (needed because Apollo sends cross-origin requests to :8000)
await page.route('**/graphql/**', async (route) => {
const headers = {
...route.request().headers(),
'Cookie': 'sessionid=' + sessionCookie.value,
};
await route.continue({ headers });
});
// 4. Navigate to the frontend page under test
await page.goto('http://localhost:5173/extracts');
await page.waitForTimeout(5000);
// 5. Inspect results
const bodyText = await page.textContent('body');
console.log(bodyText);
await browser.close();
})();Run from the frontend directory (where cd frontend && node /path/to/script.jsKey details:
Alternative — create a session programmatically (no admin login needed): docker compose -f local.yml exec django python manage.py shell -c "
from django.contrib.sessions.backends.db import SessionStore
from django.contrib.auth import get_user_model
User = get_user_model()
user = User.objects.filter(is_superuser=True).first()
session = SessionStore()
session['_auth_user_id'] = str(user.pk)
session['_auth_user_backend'] = 'django.contrib.auth.backends.ModelBackend'
session['_auth_user_hash'] = user.get_session_auth_hash()
session.save()
print(f'Session key: {session.session_key}')
"Then use the printed session key directly in curl or Playwright route interception. Documentation Locations
Branch StrategyThis project follows trunk-based development:
Changelog MaintenanceIMPORTANT: Always update The changelog follows Keep a Changelog format: ## [Unreleased] - YYYY-MM-DD
### Added
- New features
### Fixed
- Bug fixes with file locations and line numbers
### Changed
- Changes to existing functionality
### Technical Details
- Implementation specifics, architectural notesWhen to update:
What to include:
Pre-commit HooksAutomatically run on commit:
Run manually: Common Pitfalls
|
|
line1\nline2\nline3 |
Code Review - PR 1647: Speed up corpus home document indextest content |
Code Review — PR #1647: Speed up corpus home document indexOverviewThis is a well-targeted performance fix. The root cause analysis (over-fetched GraphQL fields + client-side filtering of what should be a server-side predicate) is correct, and the solution is proportionate: a new ultra-lean TOC-specific query, a new backend filter, and clean removal of dead field plumbing. No existing consumers of What works well
Issues1. Missing constants for hardcoded strings (CLAUDE.md — no magic numbers)
relationshipType: "RELATIONSHIP",
annotationLabelText: "parent",Per project convention these belong in 2. No new backend test for the
|
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
- Hoist hard-coded "RELATIONSHIP" / "parent" query-variable strings into
DOCUMENT_RELATIONSHIP_TYPE_RELATIONSHIP and DOCUMENT_RELATIONSHIP_LABEL_PARENT
in frontend/src/assets/configurations/constants.ts and consume them from
DocumentTableOfContents.tsx + the CT wrapper (no magic strings rule).
- Add inline comment on relationshipTotalCount explaining that the lean TOC
edges query counts only parent-labeled RELATIONSHIP rows server-side; the
truncation threshold is narrower than the legacy fetch-all behaviour.
- Add interface comment on CorpusDocumentTocEdge.{source,target}Document
documenting why the GraphQL schema marks them nullable when the DB FKs are
non-null (consumers still must null-guard).
- Pin the new DocumentRelationshipFilter.annotation_label_text iexact filter
with test_document_relationships_annotation_label_text_filter — creates
three relationships (2 from setUp + 1 parent-labeled), queries with mixed-
case "PARENT" and asserts exactly one edge is returned. Guards against a
future refactor silently dropping the filter and reverting the TOC to a
full fetch.
- Rewire the CT wrapper to the new GET_CORPUS_DOCUMENT_TOC_EDGES query shape:
id + source/target IDs only, with relationshipType / annotationLabelText
variables matching what the component sends (Apollo MockedProvider matches
variables by deep-equal).
- Slim the test mocks' document edges to {id,title,description,slug,fileType}
(icon / creator dropped server-side) and tighten the cache keyArgs for
documentRelationships to include the new filter args.
- Bump strict-mode-violating getByText("Parent Document") calls in the CT
tests to exact match — the mocked description "A parent document for
testing hierarchy" also contains the substring.
Code Review – PR #1647: Speed up corpus home document index by trimming GraphQL over-fetchOverviewThis PR targets a real, measurable performance regression: the "Loading document structure…" wait on Corpus Home was caused by two over-fetched GraphQL queries. The fix is correct in principle — push filters server-side, slim the query shapes, and eliminate wasted resolver calls. The implementation is clean and well-commented. One cache isolation bug must be fixed before merging. 🔴 Critical: Production Apollo cache not updatedFile: // Current (unchanged by this PR):
documentRelationships: relayStylePagination(["corpusId", "documentId"]),The new Concrete failure scenario:
The test wrapper correctly adds these to its local Fix: // frontend/src/graphql/cache.ts
documentRelationships: relayStylePagination([
"corpusId",
"documentId",
"relationshipType",
"annotationLabelText",
]),This is exactly the pattern called out in CLAUDE.md Common Pitfall #15 ("Apollo cache 🟡 Minor: Backend test covers label filter in isolation, not the combined filter the TOC actually sends
This is low risk given the filter implementation is simple, but worth adding for completeness. ✅ What's done well
SummaryOne change required before this is production-safe: // frontend/src/graphql/cache.ts ~line 309
documentRelationships: relayStylePagination([
"corpusId",
"documentId",
"relationshipType", // ← add
"annotationLabelText", // ← add
]),Without it, the new lean TOC query and the existing full query will silently share cache entries for the same corpus, leading to either stale ID-only data in relationship modals or oversized payloads polluting the TOC tree. |
The "Loading document structure..." spinner on Corpus Home was driven by
two GraphQL queries that requested far more data than the TOC renders.
GET_DOCUMENT_RELATIONSHIPS (cap: 500 rows) selected full source/target
document objects, corpus, creator, annotation label color/icon, data,
timestamps, and myPermissions on every row. DocumentTableOfContents
actually consumes only relationshipType, annotationLabel.text, and the
two document IDs -- the rest was discarded. The per-relationship icon
fields still triggered build_absolute_uri() calls in
create_file_resolver, and "parent" filtering happened entirely on the
client.
GET_CORPUS_DOCUMENTS_FOR_TOC selected icon (unused -- renderNode
derives the on-screen icon from fileType via getFileIcon) and
creator (also unused), adding a file-URL resolver call per document
plus an extra join.
Fixes:
so the parent-label filter runs server-side alongside the existing
relationship_type filter.
returns only { id, sourceDocument.id, targetDocument.id } and lives
alongside (does not replace) GET_DOCUMENT_RELATIONSHIPS -- the latter
is still consumed by CorpusDocumentRelationships and
DocumentRelationshipModal with its full payload.
fileType }; RunCorpusActionModal (the only other consumer) only reads
id and title.
redundant document-metadata overwrites from relationship rows, and
remove now-dead icon plumbing.