feat: Phase B1 layout and basic OCR pipeline (all 13 tasks complete)#3
feat: Phase B1 layout and basic OCR pipeline (all 13 tasks complete)#3williaby wants to merge 27 commits into
Conversation
Generates docs/planning/PROJECT-PLAN.md from the four source planning documents: project-vision.md, tech-spec.md, roadmap.md, and adr-001-docling-serve-http-integration.md. The plan covers all four implementation phases (B1 through B4) with semantic-release-aligned branch names, per-phase deliverable checklists, acceptance criteria verbatim from the roadmap, quality gate thresholds, and a Phase 0 environment setup checklist for immediate developer use. Also stages adr-001 (was untracked). Override: branch-first commit rule Reason: docs-only synthesis artifact; no code changes; task explicitly targets current branch (main); solo repo, no other contributors affected Compensating control: no code paths changed; plan is read-only reference material Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add httpx, google-cloud-storage, fastapi, uvicorn, python-multipart, and starlette to core dependencies for the B1 OCR pipeline. Extend Settings with docling-serve URL/timeout, layout confidence threshold, and GCS bucket template. Add DoclingServiceError, GCSError, MissingSourceTrackError, and MissingDoclingDOMError to the exception hierarchy. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…example updates - Add DoclingServiceError, GCSError, MissingSourceTrackError, MissingDoclingDOMError to core/__init__.py imports and __all__ re-exports - Add env_file=".env" to SettingsConfigDict in config.py per project convention - Add tests for all 4 new exception classes in test_exceptions.py using err.details["field"] (correct path, field stored in details not as attribute) - Add FOUNDRY_UNIFY_* env var entries to .env.example for pipeline settings Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements the canonical DoclingDOM Pydantic v2 schema (BBox, ElementDOM, PageDOM, DOMMetadata, DoclingDOM) as the GCS artifact contract for both document and audio tracks in the foundry-unify OCR pipeline. Fixes .gitignore to scope the models/ exclusion to the repo root only, not src/foundry_unify/models/. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds the B1-scoped DocumentMetadata input schema with DoclingRoutingParams (to_form_data helper), ProcessingRecommendation, and InboundDocumentMetadata, all using extra="ignore" for forward-compat with the full prepare-doc schema. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements HTTP client adapter for docling-serve REST API with structured response parsing, health check, and proper error handling via DoclingServiceError. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduces SourceTrack enum and SourceTrackRouter validator for document vs audio processing path detection. Raises MissingSourceTrackError (HTTP 422) for absent or unknown source_track values. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements KI-002 mitigation for Docling layout model misclassification. Multi-column body text is misclassified as Table with 100% false-positive rate when Docling layout confidence is below ~0.5. LayoutPostprocessor reclassifies Table detections at or below the confidence threshold to Text, preventing downstream reading-order corruption (KI-008). - Adds LayoutPostprocessor class with configurable confidence threshold - Applies deep copy to avoid mutating original data - Marks reclassified elements with ki002_reclassified flag - Preserves elements without confidence field (defaults to 1.0) - Comprehensive test coverage: 7 unit tests all passing - test_low_confidence_table_reclassified_to_text - test_high_confidence_table_preserved - test_table_at_threshold_is_reclassified (critical boundary test) - test_non_table_elements_not_affected - test_element_without_confidence_field_preserved_as_table - test_empty_pages_returns_empty_list - test_reclassified_element_tagged_with_ki002_flag Boundary condition verified: confidence == threshold triggers reclassification per the <= operator in the condition. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements DOMAssembler (Task 8/Phase B1) with 6 passing unit tests. Fixes S2583 dead-code branch (layout_confidence ternary after success guard), moves DoclingRawResponse to TYPE_CHECKING block (TC001), and adds PLR0913 per-file-ignore for the required 6-arg assemble() contract signature. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…r DOMAssembler Added reportExplicitAny to pyright suppressions for dict[str, Any] parameter annotations and .get() calls in DOMAssembler, as Docling's JSON schema is unversioned and typed access via Any is intentional. Also added explanatory comment for layout_confidence=1.0 and three new edge-case tests covering: success=False with complete status, empty pages handling, and page_no default value when absent from input. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements GCS adapter layer for reading DocumentMetadata.json from 01-preprocessed/ and writing DoclingDOM.json to 03-docling-dom/. Both classes lazily import google-cloud-storage to avoid hard SDK dependency at import time, and wrap all GCS errors in GCSError. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ests Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…od type annotations
Instantiates DoclingServeClient in /health/ready, returns 503 when health_check() is False; adds main.py FastAPI app factory and 3 unit tests.
…ify 503 body Wrap blocking DoclingServeClient.health_check() call with asyncio.to_thread() to avoid blocking the event loop on failing probes. Use the context manager for automatic connection cleanup. Remove scaffold dead code (check_cache, check_external_service, bare Kubernetes YAML string). Add 503 body shape assertions to the readiness test. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add .sonarlint/ to .gitignore - Exclude docs/superpowers/ and docs/ADRs/ from front-matter validation (dev-tool and architecture decision directories need no schema enforcement) - Move _EXCLUDED_DIRS to module-level frozenset in validate_front_matter.py - Suppress A005 for utils/logging.py (intentional stdlib shadow) and N802 for scripts/ (AST visitor methods follow ast.NodeVisitor naming contract) - Fix darglint DAR102 in correlation.py: align docstring arg names with underscore-prefixed _logger/_method_name parameters - Add docs/superpowers/plans/2026-05-06-phase-b1-layout-ocr.md - Add docs/template-defects-report.md Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
WalkthroughThis PR implements Phase B1 of the Foundry Unify OCR pipeline, adding a POST ChangesPhase B1 OCR Pipeline
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested labels
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
|
✅ FIPS Compatibility Check
Status: ✅ PASSED What is FIPS?FIPS 140-2/140-3 is a US government standard for cryptographic modules. Common issues:
|
✅ Mutation Testing Results
What is Mutation Testing?Mutation testing introduces small changes (mutations) to your code and checks if your tests detect them. A high mutation score indicates your tests are effective at catching bugs.
|
Implements the Cloud Workflows entry point: reads DocumentMetadata.json from GCS, selects processing tier, runs docling-serve OCR with KI-002 layout postprocessing, assembles DoclingDOM, and writes to GCS. Halted documents bypass OCR and emit a zero-page DOM. Audio track returns 422 with a clear NOT_IMPLEMENTED message for B2. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…t assertions - Add `# noqa: ASYNC240` tag to tmp_path.unlink() call with condensed justification comment - Wrap halted-path write_docling_dom in try/except GCSError for consistency with happy path - Remove dead _make_dom() helper and its DoclingDOM/DOMMetadata/PageDOM imports from test file - Patch TierSelector.select in happy-path test and add processing_status, processing_tier, document_id assertions - Remove unnecessary GCSArtifactReader patch from audio-track test and fix double-hyphen in docstring Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ions
- Use asyncio.to_thread for tmp_path.unlink to satisfy ASYNC240 properly
- Report tier.value ("halted") in ProcessResponse instead of hardcoded "standard"
- Add explicit TierSelector mock in HALTED test for determinism
- Assert processing_tier == "halted" in HALTED test to prevent regression
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Creates tests/integration/test_process_e2e.py with 11 B1 acceptance criteria. 10 are skipped (require live docling-serve or GCS). The one unit-level test (missing source_track -> 422) runs in CI without infra. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix basedpyright errors: Awaitable[Response] in CorrelationMiddleware, MutableHeaders lacks pop() in SecurityHeadersMiddleware - Add 9 security middleware unit tests to lift total coverage to 84.08% - All 22 pre-commit hooks pass; 0 basedpyright errors; 161 unit tests pass Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
There was a problem hiding this comment.
Actionable comments posted: 23
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/unit/test_exceptions.py (1)
811-822:⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
TestModuleExportsdoesn't guard the four new exception types
DoclingServiceError,GCSError,MissingSourceTrackError, andMissingDoclingDOMErrorare all introduced in this PR but are absent fromexpected_exports. If any of them are accidentally removed from__all__in a future change, this test will not catch the regression.🛡️ Proposed fix
expected_exports = [ "ProjectBaseError", "ConfigurationError", "ValidationError", "ResourceNotFoundError", "AuthenticationError", "AuthorizationError", "ExternalServiceError", "APIError", "DatabaseError", "BusinessLogicError", + "DoclingServiceError", + "GCSError", + "MissingSourceTrackError", + "MissingDoclingDOMError", ]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/test_exceptions.py` around lines 811 - 822, The test TestModuleExports misses the four new exception types, so update the expected_exports list in tests/unit/test_exceptions.py (the variable expected_exports used by TestModuleExports) to include "DoclingServiceError", "GCSError", "MissingSourceTrackError", and "MissingDoclingDOMError" so the test will assert those classes are exported via __all__ (also double-check the exceptions module's __all__ contains those symbols).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.env.example:
- Line 142: The .env.example contains a developer-specific LAN address in the
FOUNDRY_UNIFY_DOCLING_SERVE_URL entry; replace the hard-coded private IP value
with a generic placeholder (for example set
FOUNDRY_UNIFY_DOCLING_SERVE_URL=http://localhost:5001 or
FOUNDRY_UNIFY_DOCLING_SERVE_URL=https://docling.example.com or leave empty) so
contributors don't accidentally target an internal host and no internal network
topology is exposed.
In `@docs/planning/adr/adr-001-docling-serve-http-integration.md`:
- Around line 275-282: Replace the broken Markdown links by turning them into
plain, non-hyperlinked notes (remove the [text](path) link syntax) so MkDocs
strict mode no longer resolves them; specifically update the "docling-serve API:
POST /v1/convert/file", "DOCLING-API-CONTRACT.md", and the three Implementation
References ("Reference HTTP client", "DoclingRoutingParams schema", "Handoff doc
Section 7: Infrastructure") to simple plain-text repository paths or
descriptions (e.g., "image_detection/.../docling_client.py") instead of Markdown
links, leaving clear human-readable pointers but no clickable links.
In `@docs/template-defects-report.md`:
- Around line 40-54: The markdown has multiple fenced code blocks that violate
MD031/MD040 by lacking surrounding blank lines and missing language identifiers;
edit docs/template-defects-report.md to ensure every triple-backtick fence has a
blank line before and after it and add appropriate language specifiers (e.g.,
```dockerfile for the CMD block, ```python for the FastAPI snippet, or
```text/```toml for config samples) so markdownlint passes; update the
Dockerfile CMD and FastAPI snippet fences shown in the diff accordingly and scan
the file for any other fences to apply the same fixes.
- Line 17: Replace the personal email on the "Reporter:" line ("Byron Williams /
byronawilliams@gmail.com") with a non-PII identifier such as a GitHub username
or role address (e.g., "ByronWilliams" or "security-team@yourorg.com"); update
the "Reporter:" entry in template-defects-report.md accordingly and ensure the
visible text no longer contains the email address.
In `@pyproject.toml`:
- Line 32: The dependency declaration includes uvicorn with the [standard]
extras which we don't use; update the pyproject.toml entry referencing
"uvicorn[standard]>=0.23.0" to plain "uvicorn>=0.23.0" to remove unused extras,
then regenerate/update the lockfile (e.g., poetry.lock or requirements) and
reinstall/update dependencies so the lockfile and environment reflect the
change; ensure any CI or Docker build steps that pin or cache dependencies are
updated to use the new plain uvicorn spec.
In `@src/foundry_unify/adapters/docling_serve_client.py`:
- Around line 167-172: The return value is coercing a legitimate 0 page_count to
1 via "page_count or 1"; update the DoclingRawResponse construction (the return
in the function that builds DoclingRawResponse) to only default to 1 when
page_count is None—e.g., set page_count to page_count if page_count is not None
else 1—so an explicit 0 is preserved while missing values still default to 1.
In `@src/foundry_unify/adapters/gcs_client.py`:
- Around line 56-70: The Pydantic ValidationError from
InboundDocumentMetadata.model_validate_json can escape the current try/except
because only the download is wrapped; update the logic so validation is also
caught and rewrapped as a GCSError: either expand the existing try block to
include the InboundDocumentMetadata.model_validate_json(text) call, or add an
except pydantic.ValidationError as exc that raises GCSError(msg, path=gcs_path)
from exc; reference InboundDocumentMetadata.model_validate_json, GCSError, and
the logger.info("gcs_read_start")/text variables when making the change so
validation failures return a GCSError instead of leaking a 500.
In `@src/foundry_unify/api/health.py`:
- Around line 95-107: The health check for DoclingServeClient may raise
exceptions that propagate and cause a 500; wrap the call to DoclingServeClient
(the with block and the asyncio.to_thread(docling_client.health_check) call) in
a try/except Exception as e so any exception is caught and converted into a
ReadinessCheck entry in checks["docling_serve"]; on success keep the existing
reachable/latency/error behavior, on exception set status=False, latency_ms to
the elapsed time since start, and error to a clear message including
settings.docling_serve_url and the exception string (use DoclingServeClient,
health_check, ReadinessCheck, checks, and settings.docling_serve_url to locate
the code).
In `@src/foundry_unify/api/process.py`:
- Around line 90-116: The DOM assembled for halted documents uses a hardcoded
processing_tier="standard" which conflicts with the HTTP response using
tier.value ("halted"); update the DOMAssembler.assemble call in the
ProcessingTier.HALTED branch to pass the real tier value (e.g.,
processing_tier=tier.value) instead of the literal "standard" so DOMMetadata and
ProcessResponse are consistent; ensure this change propagates to
writer.write_docling_dom and the returned ProcessResponse remains unchanged.
- Around line 74-177: The async handler is making multiple blocking calls; wrap
all sync I/O in asyncio.to_thread and ensure the DoclingServeClient is closed:
run reader.download_document_metadata(...) via await asyncio.to_thread(...), run
both writer.write_docling_dom(...) calls via await asyncio.to_thread(...),
perform the tempfile.NamedTemporaryFile creation/write and tmp_path.unlink
inside asyncio.to_thread (or use an async tempfile helper) and call
docling_client.convert_file(...) via await asyncio.to_thread(...); additionally
instantiate DoclingServeClient using its context manager (with
DoclingServeClient(...) as docling_client:) or explicitly close it at function
end (await asyncio.to_thread(docling_client.close) or similar) so the underlying
httpx.Client is not leaked.
- Around line 68-72: Replace the deprecated HTTP_422_UNPROCESSABLE_ENTITY
constant used in the raise HTTPException(...) for the audio branch with
HTTP_422_UNPROCESSABLE_CONTENT; locate the block checking request.source_track
== "audio" (where HTTPException is raised) and change the status_code argument
to use HTTP_422_UNPROCESSABLE_CONTENT so the runtime no longer emits the
Starlette deprecation warning.
In `@src/foundry_unify/core/config.py`:
- Around line 25-28: The default docling_serve_url currently uses a private LAN
IP which is infrastructure-specific; update the Field default for the
docling_serve_url attribute in the Config class (symbol: docling_serve_url) to a
neutral localhost URL (e.g. "http://localhost:5001" or "http://127.0.0.1:5001")
so deployments that forget to set FOUNDRY_UNIFY_DOCLING_SERVE_URL fail fast
instead of targeting a developer machine.
In `@src/foundry_unify/main.py`:
- Around line 5-17: The CorrelationMiddleware class is defined but never
registered, so add the middleware to the FastAPI app: import
CorrelationMiddleware (the class) and call
app.add_middleware(CorrelationMiddleware) after the FastAPI(...) instantiation
in main.py (where app is created) so the correlation ContextVars
(_correlation_id_ctx, _request_id_ctx, _trace_id_ctx, _span_id_ctx) are
populated and correlation_context_processor can emit X-Correlation-ID /
X-Request-ID headers for responses.
In `@src/foundry_unify/models/docling_dom.py`:
- Around line 19-22: The BBox model currently allows inverted coordinates; add
schema-level validation in the BBox class to reject cases where x2 < x1 or y2 <
y1 by raising a validation error. If BBox is a dataclass, implement a
__post_init__ that checks (x2 >= x1 and y2 >= y1) and raises ValueError with a
clear message; if BBox is a pydantic BaseModel, add a `@root_validator` or field
validators that perform the same checks and raise ValidationError. Ensure the
error message references the BBox and the failing fields (x1, x2, y1, y2) so
invalid geometry is caught early.
In `@src/foundry_unify/pipeline/layout_postprocessor.py`:
- Around line 51-52: The code assigns confidence = item.get("confidence", 1.0)
which does not handle an explicit None from Docling and later does None <=
self.confidence_threshold causing a TypeError; update the code in
layout_postprocessor (where confidence is assigned) to explicitly detect None
(e.g., raw_conf = item.get("confidence"); confidence = 1.0 if raw_conf is None
else float(raw_conf)) so that a true 0.0 remains valid and only None is
defaulted to 1.0 before comparing against self.confidence_threshold.
In `@tests/integration/test_process_e2e.py`:
- Around line 15-70: The module-level pytestmark applies integration/slow to
every test so test_missing_source_track_returns_422 is accidentally excluded
from default CI; move the test (test_missing_source_track_returns_422) out of
the integration-marked module into your unit tests suite, mark it as a unit test
(e.g. `@pytest.mark.unit`), and ensure it uses TestClient properly (use a context
manager with TestClient(app) or otherwise close the client) so the 422
validation assertion runs in CI as intended.
In `@tests/unit/test_adapters_gcs_client.py`:
- Around line 137-148: The test shows corrupt JSON from GCS currently raises
pydantic.ValidationError instead of the expected GCSError; update
GCSArtifactReader.download_document_metadata to catch json.JSONDecodeError and
pydantic.ValidationError around the blob.download/parsing/validation logic and
re-raise a GCSError that includes the offending path/bucket/object (and preserve
the original exception as context or cause), so callers can catch GCSError for
both I/O and deserialization failures.
In `@tests/unit/test_api_health.py`:
- Around line 44-61: Add a third unit test that simulates
DoclingServeClient.health_check raising an exception and asserts the readiness
endpoint returns 503 with the expected structured payload; specifically, create
a test named test_readiness_returns_503_when_docling_serve_raises that patches
"foundry_unify.api.health.DoclingServeClient" (like the other tests), set
mock_instance.health_check.side_effect to an Exception (or httpx.ConnectError)
and keep the context manager dunder methods, then call
client.get("/health/ready") and assert response.status_code == 503 and that
response.json() contains the same "detail" structure and a failing
"docling_serve" check as in
test_readiness_returns_503_when_docling_serve_unreachable.
In `@tests/unit/test_middleware_security.py`:
- Around line 30-49: Tests currently check added OWASP headers and HSTS but miss
the case where an incoming response already contains a Server header that
SecurityHeadersMiddleware should strip; add a unit test (e.g.,
test_security_headers_strip_server_header) that uses
TestClient(_make_app_with_security_headers()) to call the /ping endpoint (or a
route that returns a response with a pre-populated "Server" header) and assert
that "Server" is not present in response.headers, referencing the
SecurityHeadersMiddleware behavior and _make_app_with_security_headers() helper
to locate the middleware attachment.
In `@tests/unit/test_models_docling_dom.py`:
- Around line 14-17: The test test_bbox_validates_coordinates only asserts a
static literal and doesn't verify that BBox enforces x2 > x1; either add a
negative case that attempts to construct BBox(x1=100.0, x2=0.0, y1=..., y2=...)
and assert it raises the model's validation exception (e.g., use
pytest.raises(ValidationError) or the appropriate exception type) to prove the
validator works, referencing the BBox constructor and its validator, or if BBox
intentionally does not enforce ordering, rename the test to something like
test_bbox_accepts_valid_coordinates to reflect it only checks successful
construction.
In `@tests/unit/test_pipeline_layout_postprocessor.py`:
- Line 45: The docstring for the test currently says “< is strict; == triggers”
which is contradictory; update the test's docstring (the string that currently
reads "Confidence equal to the threshold is reclassified (< is strict; ==
triggers).") to explicitly state the rule as "Confidence is reclassified when
confidence <= threshold" (or equivalent wording: "reclassification occurs when
confidence <= threshold") so the behavior is unambiguous.
- Around line 58-69: The test only checks the "label" key and can miss other
in-place mutations; before calling
LayoutPostprocessor(confidence_threshold=0.5).apply(pages) create a deep copy of
the original pages (e.g., via copy.deepcopy) using the same _make_raw_pages
output, call apply(pages), then assert that the entire pages structure equals
the deep copy (full structural equality) to ensure no nested changes were made
by LayoutPostprocessor.apply; reference the test function
test_original_pages_not_mutated, the _make_raw_pages helper, and
LayoutPostprocessor.apply when updating the assertion.
In `@tools/validate_front_matter.py`:
- Around line 276-281: The exclusion check uses _EXCLUDED_DIRS against f.parts /
path.parts which matches any path segment at any depth; update the logic that
builds md_files (the branch handling Path.is_dir() and the ".md" suffix branch)
to anchor exclusions to the path segment directly under the docs root instead of
scanning all parts: compute the path relative to the docs root (e.g., using
Path.relative_to or by locating the "docs" segment in path.parts), then check
only the first component of that relative path (or apply the same anchored regex
used by the pre-commit config) against _EXCLUDED_DIRS before skipping; ensure
the change references the same variables (_EXCLUDED_DIRS, md_files, f/path) so
only files directly under docs/<excluded_dir> are ignored.
---
Outside diff comments:
In `@tests/unit/test_exceptions.py`:
- Around line 811-822: The test TestModuleExports misses the four new exception
types, so update the expected_exports list in tests/unit/test_exceptions.py (the
variable expected_exports used by TestModuleExports) to include
"DoclingServiceError", "GCSError", "MissingSourceTrackError", and
"MissingDoclingDOMError" so the test will assert those classes are exported via
__all__ (also double-check the exceptions module's __all__ contains those
symbols).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: eb677c1b-28bf-4f83-8dfe-e6bda5c8149d
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock,!**/*.lock
📒 Files selected for processing (42)
.env.example.gitignore.pre-commit-config.yamldocs/api-reference.mddocs/planning/PROJECT-PLAN.mddocs/planning/adr/adr-001-docling-serve-http-integration.mddocs/superpowers/plans/2026-05-06-phase-b1-layout-ocr.mddocs/template-defects-report.mdpyproject.tomlsrc/foundry_unify/adapters/__init__.pysrc/foundry_unify/adapters/docling_serve_client.pysrc/foundry_unify/adapters/gcs_client.pysrc/foundry_unify/api/health.pysrc/foundry_unify/api/process.pysrc/foundry_unify/core/__init__.pysrc/foundry_unify/core/config.pysrc/foundry_unify/core/exceptions.pysrc/foundry_unify/main.pysrc/foundry_unify/middleware/correlation.pysrc/foundry_unify/middleware/security.pysrc/foundry_unify/models/__init__.pysrc/foundry_unify/models/docling_dom.pysrc/foundry_unify/models/document_metadata.pysrc/foundry_unify/pipeline/__init__.pysrc/foundry_unify/pipeline/dom_assembler.pysrc/foundry_unify/pipeline/layout_postprocessor.pysrc/foundry_unify/pipeline/source_track_router.pysrc/foundry_unify/pipeline/tier_selector.pytests/integration/test_process_e2e.pytests/unit/test_adapters_docling_serve_client.pytests/unit/test_adapters_gcs_client.pytests/unit/test_api_health.pytests/unit/test_api_process.pytests/unit/test_exceptions.pytests/unit/test_middleware_security.pytests/unit/test_models_docling_dom.pytests/unit/test_models_document_metadata.pytests/unit/test_pipeline_dom_assembler.pytests/unit/test_pipeline_layout_postprocessor.pytests/unit/test_pipeline_source_track_router.pytests/unit/test_pipeline_tier_selector.pytools/validate_front_matter.py
💤 Files with no reviewable changes (1)
- docs/api-reference.md
| # ============================================================================ | ||
|
|
||
| # docling-serve connection | ||
| FOUNDRY_UNIFY_DOCLING_SERVE_URL=http://192.168.1.209:5001 |
There was a problem hiding this comment.
Replace the private LAN IP with a generic placeholder
http://192.168.1.209:5001 is a developer-specific home-network address. Contributors who copy .env.example verbatim will target a non-existent host, and the commit exposes internal network topology unnecessarily.
🔧 Suggested fix
-FOUNDRY_UNIFY_DOCLING_SERVE_URL=http://192.168.1.209:5001
+FOUNDRY_UNIFY_DOCLING_SERVE_URL=http://localhost:5001📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| FOUNDRY_UNIFY_DOCLING_SERVE_URL=http://192.168.1.209:5001 | |
| FOUNDRY_UNIFY_DOCLING_SERVE_URL=http://localhost:5001 |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.env.example at line 142, The .env.example contains a developer-specific LAN
address in the FOUNDRY_UNIFY_DOCLING_SERVE_URL entry; replace the hard-coded
private IP value with a generic placeholder (for example set
FOUNDRY_UNIFY_DOCLING_SERVE_URL=http://localhost:5001 or
FOUNDRY_UNIFY_DOCLING_SERVE_URL=https://docling.example.com or leave empty) so
contributors don't accidentally target an internal host and no internal network
topology is exposed.
| - [docling-serve API: POST /v1/convert/file](http://192.168.1.209:5001/docs) | ||
| - [DOCLING-API-CONTRACT.md](homelab-infra/docs/planning/contracts/DOCLING-API-CONTRACT.md) | ||
|
|
||
| ### Implementation References | ||
|
|
||
| - [Reference HTTP client](../../../../image_detection/src/image_preprocessing_detector/text_extraction/docling_client.py) | ||
| - [DoclingRoutingParams schema](../../../../image_detection/src/image_preprocessing_detector/schema.py#L755) | ||
| - [Handoff doc Section 7: Infrastructure](../../../../image_detection/docs/development/RAG%20Pipeline/foundry-unify-team-handoff.md) |
There was a problem hiding this comment.
Broken cross-repo links are failing the MkDocs CI job — confirmed by pipeline failures
Four Implementation References links (lines 280–282) and one External References link (line 277) point to paths in sibling repositories (image_detection/, homelab-infra/) that do not exist relative to this repo's MkDocs docs/ root. MkDocs strict mode raises an unresolved-link warning for each, and the pipeline failure log confirms all four.
🔗 Proposed fix — convert to non-hyperlinked notes
### Implementation References
-- [Reference HTTP client](../../../../image_detection/src/image_preprocessing_detector/text_extraction/docling_client.py)
-- [DoclingRoutingParams schema](../../../../image_detection/src/image_preprocessing_detector/schema.py#L755)
-- [Handoff doc Section 7: Infrastructure](../../../../image_detection/docs/development/RAG%20Pipeline/foundry-unify-team-handoff.md)
+- Reference HTTP client: `image_detection/src/image_preprocessing_detector/text_extraction/docling_client.py`
+- DoclingRoutingParams schema: `image_detection/src/image_preprocessing_detector/schema.py` line 755
+- Handoff doc Section 7: `image_detection/docs/development/RAG Pipeline/foundry-unify-team-handoff.md`Similarly for line 277:
-- [DOCLING-API-CONTRACT.md](homelab-infra/docs/planning/contracts/DOCLING-API-CONTRACT.md)
+- DOCLING-API-CONTRACT.md: `homelab-infra/docs/planning/contracts/DOCLING-API-CONTRACT.md`🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/planning/adr/adr-001-docling-serve-http-integration.md` around lines 275
- 282, Replace the broken Markdown links by turning them into plain,
non-hyperlinked notes (remove the [text](path) link syntax) so MkDocs strict
mode no longer resolves them; specifically update the "docling-serve API: POST
/v1/convert/file", "DOCLING-API-CONTRACT.md", and the three Implementation
References ("Reference HTTP client", "DoclingRoutingParams schema", "Handoff doc
Section 7: Infrastructure") to simple plain-text repository paths or
descriptions (e.g., "image_detection/.../docling_client.py") instead of Markdown
links, leaving clear human-readable pointers but no clickable links.
| **Source project:** ByronWilliamsCPA/Unify (foundry_unify) | ||
| **Template version:** 0.1.0 (via cruft, `.cruft.json` commit ref) | ||
| **Discovered:** 2026-05-05, via full compliance audit across 8 domain agents | ||
| **Reporter:** Byron Williams / byronawilliams@gmail.com |
There was a problem hiding this comment.
Remove the personal email address — PII committed to version history.
byronawilliams@gmail.com will be permanently embedded in git history. For an org repo this is a compliance/privacy concern. Replace with a GitHub username or a role address.
-**Reporter:** Byron Williams / byronawilliams@gmail.com
+**Reporter:** Byron Williams ([`@williaby`](https://github.com/williaby))📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| **Reporter:** Byron Williams / byronawilliams@gmail.com | |
| **Reporter:** Byron Williams ([`@williaby`](https://github.com/williaby)) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/template-defects-report.md` at line 17, Replace the personal email on
the "Reporter:" line ("Byron Williams / byronawilliams@gmail.com") with a
non-PII identifier such as a GitHub username or role address (e.g.,
"ByronWilliams" or "security-team@yourorg.com"); update the "Reporter:" entry in
template-defects-report.md accordingly and ensure the visible text no longer
contains the email address.
| ```dockerfile | ||
| CMD ["uvicorn", "{{cookiecutter.project_slug}}.main:app", "--host", "0.0.0.0", "--port", "8000"] | ||
| ``` | ||
| No `main.py` is generated under `src/{{cookiecutter.project_slug}}/`. The container fails | ||
| to start with a `ModuleNotFoundError` on first deployment. | ||
|
|
||
| **Suggested fix:** | ||
| Either generate `src/{{cookiecutter.project_slug}}/main.py` with a minimal FastAPI app: | ||
| ```python | ||
| from fastapi import FastAPI | ||
| from {{cookiecutter.project_slug}}.api.health import health_router | ||
|
|
||
| app = FastAPI(title="{{ cookiecutter.project_name }}") | ||
| app.include_router(health_router) | ||
| ``` |
There was a problem hiding this comment.
40+ markdownlint violations (MD031/MD040) may fail Qlty pre-push checks.
Static analysis flags fenced code blocks that are not surrounded by blank lines (MD031) and blocks without a language specifier (MD040) throughout the file. Qlty runs markdownlint on pre-push; depending on severity configuration, these could block future contributor pushes.
Representative fixes:
- Add a blank line before and after every
```fence. - Add a language identifier to plain-text fences that represent generic config snippets (e.g.,
```textor```toml).
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 40-40: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
[warning] 42-42: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
[warning] 48-48: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
[warning] 54-54: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/template-defects-report.md` around lines 40 - 54, The markdown has
multiple fenced code blocks that violate MD031/MD040 by lacking surrounding
blank lines and missing language identifiers; edit
docs/template-defects-report.md to ensure every triple-backtick fence has a
blank line before and after it and add appropriate language specifiers (e.g.,
```dockerfile for the CMD block, ```python for the FastAPI snippet, or
```text/```toml for config samples) so markdownlint passes; update the
Dockerfile CMD and FastAPI snippet fences shown in the diff accordingly and scan
the file for any other fences to apply the same fixes.
| "httpx>=0.27.0", | ||
| "google-cloud-storage>=2.10.0", | ||
| "fastapi>=0.120.1", | ||
| "uvicorn[standard]>=0.23.0", |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify whether uvicorn[standard] extras appear to be required by current code.
# Expected:
# - If no usage/signals are found, downgrade to plain uvicorn.
# - If required operationally, document why [standard] is needed.
rg -nP --type=py '\b(watchfiles|uvloop|httptools|websockets|wsproto)\b' src tests
rg -n 'uvicorn(\[standard\])?' pyproject.tomlRepository: ByronWilliamsCPA/Unify
Length of output: 197
🏁 Script executed:
# Check the context around uvicorn in pyproject.toml
head -n 50 pyproject.toml | tail -n 25Repository: ByronWilliamsCPA/Unify
Length of output: 1088
🏁 Script executed:
# Also check if there's any README or documentation explaining dependencies
find . -maxdepth 2 -type f \( -name "README*" -o -name "DEPENDENCIES*" \) | head -5Repository: ByronWilliamsCPA/Unify
Length of output: 173
🏁 Script executed:
# Check the application structure to understand its purpose
ls -la
cat -n pyproject.toml | head -n 40Repository: ByronWilliamsCPA/Unify
Length of output: 5069
Downgrade uvicorn to plain version without [standard] extras.
The [standard] extras (watchfiles, uvloop, httptools, websockets, wsproto) are not used in the codebase. Plain uvicorn is fully capable of running the FastAPI application. Removing these extras reduces unnecessary dependencies and the security/update surface for a production OCR orchestration service.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pyproject.toml` at line 32, The dependency declaration includes uvicorn with
the [standard] extras which we don't use; update the pyproject.toml entry
referencing "uvicorn[standard]>=0.23.0" to plain "uvicorn>=0.23.0" to remove
unused extras, then regenerate/update the lockfile (e.g., poetry.lock or
requirements) and reinstall/update dependencies so the lockfile and environment
reflect the change; ensure any CI or Docker build steps that pin or cache
dependencies are updated to use the new plain uvicorn spec.
| @pytest.mark.unit | ||
| def test_security_headers_added_to_response() -> None: | ||
| """SecurityHeadersMiddleware adds OWASP headers to every response.""" | ||
| client = TestClient(_make_app_with_security_headers()) | ||
| response = client.get("/ping") | ||
| assert response.status_code == 200 | ||
| assert response.headers.get("X-Content-Type-Options") == "nosniff" | ||
| assert response.headers.get("X-Frame-Options") == "DENY" | ||
| assert response.headers.get("X-XSS-Protection") == "1; mode=block" | ||
| assert "Content-Security-Policy" in response.headers | ||
| assert "Referrer-Policy" in response.headers | ||
| assert "Permissions-Policy" in response.headers | ||
|
|
||
|
|
||
| @pytest.mark.unit | ||
| def test_security_headers_no_hsts_on_http() -> None: | ||
| """Strict-Transport-Security is omitted for plain-HTTP requests.""" | ||
| client = TestClient(_make_app_with_security_headers()) | ||
| response = client.get("/ping") | ||
| assert "Strict-Transport-Security" not in response.headers |
There was a problem hiding this comment.
Add a test that explicitly verifies Server header stripping.
Current tests validate added headers and HSTS behavior, but they don’t cover the branch where a response already contains Server and middleware must remove it.
Proposed test addition
+@pytest.mark.unit
+def test_security_headers_remove_server_header_when_present() -> None:
+ """SecurityHeadersMiddleware removes upstream Server header."""
+ app = FastAPI()
+ app.add_middleware(SecurityHeadersMiddleware)
+
+ `@app.get`("/ping")
+ def ping() -> Response:
+ response = Response(content="ok", media_type="text/plain")
+ response.headers["Server"] = "custom-server"
+ return response
+
+ client = TestClient(app)
+ response = client.get("/ping")
+ assert "Server" not in response.headersAs per coding guidelines, "Review test files for: - Comprehensive test coverage of edge cases".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unit/test_middleware_security.py` around lines 30 - 49, Tests currently
check added OWASP headers and HSTS but miss the case where an incoming response
already contains a Server header that SecurityHeadersMiddleware should strip;
add a unit test (e.g., test_security_headers_strip_server_header) that uses
TestClient(_make_app_with_security_headers()) to call the /ping endpoint (or a
route that returns a response with a pre-populated "Server" header) and assert
that "Server" is not present in response.headers, referencing the
SecurityHeadersMiddleware behavior and _make_app_with_security_headers() helper
to locate the middleware attachment.
| @pytest.mark.unit | ||
| def test_bbox_validates_coordinates() -> None: | ||
| bbox = BBox(x1=0.0, y1=0.0, x2=100.0, y2=50.0) | ||
| assert bbox.x2 > bbox.x1 |
There was a problem hiding this comment.
test_bbox_validates_coordinates only exercises valid input — the assertion is trivially true.
assert bbox.x2 > bbox.x1 is always 100.0 > 0.0 regardless of any model behaviour. The test neither confirms that BBox enforces x2 > x1 nor catches any regression where that constraint is removed. Per coding guidelines, tests should cover edge cases with assertions that clearly reflect the model contract.
Either add a negative case to confirm the validator rejects x2 < x1, or rename the test to reflect that it only checks construction with valid coordinates:
✅ Option A — add negative case (if BBox enforces coordinate ordering)
+import pytest
+from pydantic import ValidationError as PydanticValidationError
+
`@pytest.mark.unit`
def test_bbox_validates_coordinates() -> None:
bbox = BBox(x1=0.0, y1=0.0, x2=100.0, y2=50.0)
assert bbox.x2 > bbox.x1
+
+
+@pytest.mark.unit
+def test_bbox_rejects_inverted_x_coordinates() -> None:
+ with pytest.raises(PydanticValidationError):
+ BBox(x1=100.0, y1=0.0, x2=0.0, y2=50.0)✅ Option B — rename if BBox has no ordering constraint
-def test_bbox_validates_coordinates() -> None:
+def test_bbox_stores_coordinates() -> None:📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| @pytest.mark.unit | |
| def test_bbox_validates_coordinates() -> None: | |
| bbox = BBox(x1=0.0, y1=0.0, x2=100.0, y2=50.0) | |
| assert bbox.x2 > bbox.x1 | |
| `@pytest.mark.unit` | |
| def test_bbox_stores_coordinates() -> None: | |
| bbox = BBox(x1=0.0, y1=0.0, x2=100.0, y2=50.0) | |
| assert bbox.x2 > bbox.x1 |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unit/test_models_docling_dom.py` around lines 14 - 17, The test
test_bbox_validates_coordinates only asserts a static literal and doesn't verify
that BBox enforces x2 > x1; either add a negative case that attempts to
construct BBox(x1=100.0, x2=0.0, y1=..., y2=...) and assert it raises the
model's validation exception (e.g., use pytest.raises(ValidationError) or the
appropriate exception type) to prove the validator works, referencing the BBox
constructor and its validator, or if BBox intentionally does not enforce
ordering, rename the test to something like test_bbox_accepts_valid_coordinates
to reflect it only checks successful construction.
|
|
||
| @pytest.mark.unit | ||
| def test_table_at_threshold_is_reclassified() -> None: | ||
| """Confidence equal to the threshold is reclassified (< is strict; == triggers).""" |
There was a problem hiding this comment.
Fix contradictory threshold docstring wording.
“< is strict; == triggers” is self-contradictory. Reword to explicitly say reclassification occurs when confidence <= threshold.
As per coding guidelines, "tests/**/*.py: Review test files for: Clear test names describing behavior".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unit/test_pipeline_layout_postprocessor.py` at line 45, The docstring
for the test currently says “< is strict; == triggers” which is contradictory;
update the test's docstring (the string that currently reads "Confidence equal
to the threshold is reclassified (< is strict; == triggers).") to explicitly
state the rule as "Confidence is reclassified when confidence <= threshold" (or
equivalent wording: "reclassification occurs when confidence <= threshold") so
the behavior is unambiguous.
| def test_original_pages_not_mutated() -> None: | ||
| """apply() must not modify the caller's input list.""" | ||
| pages = _make_raw_pages( | ||
| [ | ||
| [ | ||
| {"label": "Table", "confidence": 0.2, "text": "col1 col2"}, | ||
| ] | ||
| ] | ||
| ) | ||
| original_label = pages[0]["items"][0]["label"] | ||
| LayoutPostprocessor(confidence_threshold=0.5).apply(pages) | ||
| assert pages[0]["items"][0]["label"] == original_label |
There was a problem hiding this comment.
Strengthen the non-mutation assertion to catch nested changes.
This test currently checks only label, so in-place mutation of other keys would go undetected. Assert full structural equality against a deep copy.
Suggested patch
+from copy import deepcopy
from typing import Any
@@
def test_original_pages_not_mutated() -> None:
@@
- original_label = pages[0]["items"][0]["label"]
+ original_pages = deepcopy(pages)
LayoutPostprocessor(confidence_threshold=0.5).apply(pages)
- assert pages[0]["items"][0]["label"] == original_label
+ assert pages == original_pagesAs per coding guidelines, "tests/**/*.py: Review test files for: Comprehensive test coverage of edge cases ... Assertions that clearly explain failures".
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def test_original_pages_not_mutated() -> None: | |
| """apply() must not modify the caller's input list.""" | |
| pages = _make_raw_pages( | |
| [ | |
| [ | |
| {"label": "Table", "confidence": 0.2, "text": "col1 col2"}, | |
| ] | |
| ] | |
| ) | |
| original_label = pages[0]["items"][0]["label"] | |
| LayoutPostprocessor(confidence_threshold=0.5).apply(pages) | |
| assert pages[0]["items"][0]["label"] == original_label | |
| def test_original_pages_not_mutated() -> None: | |
| """apply() must not modify the caller's input list.""" | |
| pages = _make_raw_pages( | |
| [ | |
| [ | |
| {"label": "Table", "confidence": 0.2, "text": "col1 col2"}, | |
| ] | |
| ] | |
| ) | |
| original_pages = deepcopy(pages) | |
| LayoutPostprocessor(confidence_threshold=0.5).apply(pages) | |
| assert pages == original_pages |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unit/test_pipeline_layout_postprocessor.py` around lines 58 - 69, The
test only checks the "label" key and can miss other in-place mutations; before
calling LayoutPostprocessor(confidence_threshold=0.5).apply(pages) create a deep
copy of the original pages (e.g., via copy.deepcopy) using the same
_make_raw_pages output, call apply(pages), then assert that the entire pages
structure equals the deep copy (full structural equality) to ensure no nested
changes were made by LayoutPostprocessor.apply; reference the test function
test_original_pages_not_mutated, the _make_raw_pages helper, and
LayoutPostprocessor.apply when updating the assertion.
| if not any(part in _EXCLUDED_DIRS for part in f.parts) | ||
| ] | ||
| ) | ||
| elif path.suffix.lower() == ".md": | ||
| # Only add file if not in planning directory | ||
| if "planning" not in path.parts: | ||
| if not any(part in _EXCLUDED_DIRS for part in path.parts): | ||
| md_files.append(path) |
There was a problem hiding this comment.
_EXCLUDED_DIRS matches at any path depth, not just under docs/.
f.parts / path.parts includes every path component, so a file like docs/guides/planning/overview.md would be silently excluded from validation even though the pre-commit files: regex does not exclude it (the regex only anchors the exclusion directly under docs/). The script would skip the file with no warning, leaving it permanently unvalidated.
Anchor the exclusion to the components that actually follow docs/ by checking only the first path segment relative to the docs/ root, or replicate the same anchoring logic the regex uses:
🛡️ Proposed fix
if path.is_dir():
all_md_files = path.rglob("*.md")
md_files.extend(
[
f
for f in all_md_files
- if not any(part in _EXCLUDED_DIRS for part in f.parts)
+ if not any(
+ part in _EXCLUDED_DIRS
+ for part in f.relative_to(path).parts[:1] # only top-level under the search root
+ )
]
)
elif path.suffix.lower() == ".md":
- if not any(part in _EXCLUDED_DIRS for part in path.parts):
+ # Exclude only if the immediate parent (relative to docs/) is an excluded dir
+ if not any(part in _EXCLUDED_DIRS for part in path.parts[1:2]):
md_files.append(path)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/validate_front_matter.py` around lines 276 - 281, The exclusion check
uses _EXCLUDED_DIRS against f.parts / path.parts which matches any path segment
at any depth; update the logic that builds md_files (the branch handling
Path.is_dir() and the ".md" suffix branch) to anchor exclusions to the path
segment directly under the docs root instead of scanning all parts: compute the
path relative to the docs root (e.g., using Path.relative_to or by locating the
"docs" segment in path.parts), then check only the first component of that
relative path (or apply the same anchored regex used by the pre-commit config)
against _EXCLUDED_DIRS before skipping; ensure the change references the same
variables (_EXCLUDED_DIRS, md_files, f/path) so only files directly under
docs/<excluded_dir> are ignored.




Summary
Implements the complete Phase B1: Layout and Basic OCR for the Foundry RAG pipeline. All 13 planned tasks are done and all quality gates pass.
Completed (Tasks 1-13):
DoclingDOM,InboundDocumentMetadata,DoclingRoutingParamsPydantic schemasDoclingServeClient— HTTP adapter forPOST /v1/convert/filewith context-manager lifecycleTierSelector— routing logic with halted-document fast-pathSourceTrackRouter— audio/document track dispatch withMISSING_SOURCE_TRACKerror codeLayoutPostprocessor— KI-002 mitigation (Table elements at confidence ≤ 0.5 reclassified to Text)DOMAssembler— converts raw Docling JSON to canonicalDoclingDOMGCSArtifactReader/GCSArtifactWriter— GCS I/O with lazy client injection for testability/readinesschecks docling-serve reachability viaasyncio.to_threadPOST /process— Cloud Workflows entry point orchestrating OCR pipeline end-to-end; audio track returns 422tests/integration/test_process_e2e.py; 10 skip (live infra), 1 runs in CINote: This PR depends on
fix/compliance-audit-2026-05-05(PR #2) being merged first to avoid rebasing churn. Both branches share the same ancestor.Key design decisions
asyncio.to_thread()wraps synchronous calls in async handlers (health check, temp-file cleanup) to prevent event-loop blockingdataclasses.replace()(notcopy.replace(), which is Python 3.13+) used for immutable updatesgoogle.cloud.storageclient imported inside_get_client()to allowMagicMockinjection without a live GCS connectionPOST /processusesprocessing_tier=tier.value("halted") in the wire response, while passing"standard"toDOMAssembler.assemblewhich has a narrowerLiteraltype constraint — two distinct contractsreportAny/reportExplicitAnysuppressions scoped to untyped Docling JSON (dict[str, Any]) access lines onlyKnown follow-up for B2
GCSArtifactReader,GCSArtifactWriter, andDoclingServeClientinPOST /processshould be moved to module-level singletons orDepends()providers for connection-pool hygiene under load#ASSUME/#VERIFYtagged inprocess.py)Test plan
uv run pytest tests/unit/ -v— 161 unit tests pass, 84.08% coverageuv run pytest tests/integration/test_process_e2e.py::test_missing_source_track_returns_422— passespre-commit run --all-files— all 22 hooks passuv run basedpyright src/— 0 errorsuv run bandit -r src/ -ll— 0 HIGH/CRITICAL findingsGenerated with Claude Code
Summary by CodeRabbit
Release Notes
New Features
/processendpoint for OCR document processing with layout-to-text refinement and cloud storage integrationDocumentation
Configuration