Skip to content

Conversation

@harumiWeb
Copy link
Owner

@harumiWeb harumiWeb commented Jan 23, 2026

#45

Summary by CodeRabbit

  • New Features

    • MCP stdio server (exstruct-mcp) with tools: extract, cursored JSON chunk read, and input validation; extraction modes: light/standard/verbose; output conflict handling and path policy controls.
    • Mode-aware table detection and improved border scanning.
  • Documentation

    • MCP docs, examples, agent integration guidance (EN/JA), site nav entry, and new CHANGELOG.
  • Configuration

    • Package version bumped and new optional "mcp" extras and CLI script.
  • Chores

    • Expanded .gitignore; added timing/logging instrumentation.
  • Tests

    • Extensive MCP/unit/integration tests for chunking, policies, tools, server behavior, and utilities.

✏️ Tip: You can customize this high-level summary in your review settings.

@harumiWeb harumiWeb self-assigned this Jan 23, 2026
@harumiWeb harumiWeb added the enhancement New feature or request label Jan 23, 2026
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 23, 2026

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

Adds a stdio-based MCP server and public MCP package surface, new MCP tools (extract, read_json_chunk, validate_input), PathPolicy path safety, JSON chunking with pagination/filters, extraction runner with conflict policies, mode-aware table detection and scan limits, timing instrumentation, docs, packaging, and extensive tests.

Changes

Cohort / File(s) Summary
Project config & CI
/.gitignore, pyproject.toml, mkdocs.yml, .github/workflows/pytest.yml
Update .gitignore; bump version to 0.4.2; add mcp and all optional deps; add exstruct-mcp CLI script and taskipy task; add docs nav and CI extra install.
Documentation
README.md, README.ja.md, docs/README.en.md, docs/README.ja.md, docs/mcp.md, docs/release-notes/*, CHANGELOG.md
Add MCP Server (stdio) docs, examples, agent guidance, release notes and changelog; minor formatting fixes.
Agents / Roadmap / Tasks
docs/agents/FEATURE_SPEC.md, docs/agents/TASKS.md, docs/agents/ROADMAP.md
Replace formula-focused content with MCP server MVP spec, tasks, and roadmap entries.
MCP public API surface
src/exstruct/mcp/__init__.py
New package init re-exporting MCP types, requests, results, tools, and run_* helpers.
MCP server & tooling
src/exstruct/mcp/server.py, src/exstruct/mcp/tools.py
New server CLI/entrypoint and ServerConfig; logging, MCP import/warmup, FastMCP app creation and tool registration; tool adapters and tool-facing models.
MCP core implementations
src/exstruct/mcp/extract_runner.py, src/exstruct/mcp/chunk_reader.py, src/exstruct/mcp/validate_input.py, src/exstruct/mcp/io.py
New extract runner (conflict policies, metadata), JSON chunk reader with cursor/filter/pagination, input validation (COM checks), and PathPolicy enforcement.
Core detection & instrumentation
src/exstruct/core/cells.py, src/exstruct/core/backends/openpyxl_backend.py, src/exstruct/core/pipeline.py, src/exstruct/io/__init__.py, src/exstruct/core/shapes.py, src/exstruct/engine.py
Add ExtractionMode and TableScanLimits; mode-aware detect_tables; border-cluster backend selection; time instrumentation around pipeline and IO; docstring tweaks.
Tests
tests/mcp/*, tests/core/*, tests/integration/*
Add extensive MCP tests (chunk reader, extract runner utils/conflict, path policy, server, tool models, validate_input, tool handlers); adapt detector test helpers to accept **kwargs.
Minor changes / test signatures
tests/core/test_pipeline.py, tests/core/test_table_detection_branching.py, src/exstruct/core/backends/openpyxl_backend.py
Update test helpers to accept additional kwargs; extend detect_tables signature to accept mode parameter and thread mode through call sites.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant FastMCP
  participant PathPolicy
  participant ExtractRunner
  participant Filesystem

  Client->>FastMCP: call "exstruct.extract"(xlsx_path, mode, format, out_dir, on_conflict)
  FastMCP->>PathPolicy: ensure_allowed(input_path/out_dir)
  PathPolicy-->>FastMCP: allowed / raise
  FastMCP->>ExtractRunner: run_extract(request)
  ExtractRunner->>Filesystem: read input workbook
  Filesystem-->>ExtractRunner: workbook bytes
  ExtractRunner->>ExtractRunner: process_excel -> produce outputs
  ExtractRunner->>Filesystem: write output file (apply conflict policy)
  Filesystem-->>ExtractRunner: write result
  ExtractRunner-->>FastMCP: ExtractResult (out_path, warnings, workbook_meta)
  FastMCP-->>Client: return tool output
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~70 minutes

Possibly related PRs

  • Dev/refactor #23 — overlaps core detection and pipeline changes (mode/limits-aware table detection and pipeline timing).

Poem

🐰 A nibble, a hop, a server takes flight,

three tiny tools hum through day and night,
chunks and extracts, policies guard the gate,
logs softly tapping the data's new fate,
I chew carrots and celebrate this update!

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 56.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Implementation of MCP server functionality' is a straightforward description of the main change in the PR, referring to the primary focus on introducing MCP server capabilities throughout the codebase.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7402f0660c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@codecov-commenter
Copy link

codecov-commenter commented Jan 23, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 92.34694% with 45 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/exstruct/core/cells.py 76.11% 16 Missing ⚠️
src/exstruct/mcp/chunk_reader.py 91.08% 14 Missing ⚠️
src/exstruct/mcp/server.py 89.58% 10 Missing ⚠️
src/exstruct/mcp/extract_runner.py 97.22% 3 Missing ⚠️
src/exstruct/mcp/io.py 92.00% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🤖 Fix all issues with AI agents
In `@README.ja.md`:
- Around line 72-74: リンク先がCLIガイドになっているためREADMEの「MCPサーバー」リンクを正しいドキュメントに差し替えてください
— README.ja.mdの該当箇所でリンクテキスト「MCPサーバー」を保持し、URL
`https://harumiweb.github.io/exstruct/cli/` をナビゲーションに合わせた
`/mcp/`(または静的サイトの完全URLであれば
`https://harumiweb.github.io/exstruct/mcp/`)に更新してください。

In `@README.md`:
- Around line 75-77: Update the "MCP Server" link target in README.md so it
points to the current MCP docs; locate the markdown anchor text "MCP Server" and
change the href from /cli/ to /mcp/ (or the correct published path) after
verifying the new MCP guide URL, ensuring the link consistently references the
new MCP documentation location.

In `@src/exstruct/mcp/chunk_reader.py`:
- Around line 83-100: In _resolve_output_path, after resolving the path (in
function _resolve_output_path and where policy.ensure_allowed is used), add a
guard to verify resolved.is_file() and if it is not a regular file raise a
ValueError with a clear message (e.g. "Output path is not a file: {resolved}")
instead of allowing a later IsADirectoryError; keep the existing
FileNotFoundError for non-existent paths.

In `@src/exstruct/mcp/extract_runner.py`:
- Around line 25-35: The ExtractRequest.options field is declared but never used
by run_extract, so either wire it into the extraction pipeline or remove/replace
it with a typed schema; update the run_extract function to accept and apply
ExtractRequest.options (or a new typed Options dataclass/Pydantic model) when
building the extractor and passing configuration to functions that control
output/format/on_conflict, and replace dict[str, Any] with a concrete model type
to avoid Any per guidelines; do the same fix for the similar unused options
declaration found around lines 85-90 (replace with the same typed options model
or remove it if unused).
- Around line 101-118: In _resolve_input_path, after obtaining resolved (via
policy.ensure_allowed or path.resolve()), add an explicit check that
resolved.is_file() and raise a ValueError (e.g. "Input path is not a file:
{resolved}") if it is not a regular file so directories don't slip through the
exists() check and later fail in process_excel; keep the existing
FileNotFoundError for missing paths and update the docstring to mention the new
ValueError for non-file inputs.

In `@src/exstruct/mcp/server.py`:
- Around line 242-271: The _read_json_chunk_tool function is currently
synchronous and calls run_read_json_chunk_tool directly, blocking the async
event loop; change its signature to async def _read_json_chunk_tool(...) and
call run_read_json_chunk_tool inside anyio.to_thread.run_sync (e.g. await
anyio.to_thread.run_sync(functools.partial(run_read_json_chunk_tool, payload,
policy=policy))) casting the result to ReadJsonChunkToolOutput as in the
_extract_tool pattern; ensure anyio and functools are imported and leave the
chunk_tool = app.tool(...) registration unchanged.
- Around line 273-286: The _validate_input_tool handler is currently synchronous
and calls run_validate_input_tool which does blocking file I/O; make
_validate_input_tool an async function and call the blocking work via await
anyio.to_thread.run_sync(run_validate_input_tool, payload, policy=policy)
(mirroring the _extract_tool pattern), then register it with
validate_tool(_validate_input_tool) unchanged so the FastMCP event loop is not
blocked by file.open/file.read.

In `@tests/mcp/test_server.py`:
- Around line 8-9: CI is failing because tests/mcp/test_server.py imports anyio
(a transitive dependency of the mcp extra) but the GitHub Actions workflow only
installs the [yaml] extra; update the workflow step in
.github/workflows/pytest.yml that runs pip install -e .[yaml] to install the mcp
extra as well (pip install -e .[yaml,mcp]) so tests/mcp/test_server.py and its
anyio dependency are available during CI.
♻️ Duplicate comments (2)
docs/README.ja.md (1)

72-74: Same MCP link-target concern as in README.md.

docs/README.en.md (1)

75-77: Same MCP link-target concern as in README.md.

🧹 Nitpick comments (10)
src/exstruct/io/__init__.py (1)

459-486: Consider DEBUG level for granular timing logs.

The three-phase timing (model_dump, serialization, total) provides excellent observability. However, logging all three at INFO level may be verbose in production. Consider using DEBUG for the intermediate phases and INFO only for the total, or make the log level configurable.

♻️ Optional: Use DEBUG for intermediate timing
     logger.info(
-        "serialize_workbook model_dump completed in %.2fs",
+    logger.debug(
+        "serialize_workbook model_dump completed in %.2fs",
         time.monotonic() - dump_start,
     )
     serialize_start = time.monotonic()
     result = _serialize_payload_from_hint(
         filtered_dict, format_hint, pretty=pretty, indent=indent
     )
-    logger.info(
+    logger.debug(
         "serialize_workbook serialization completed in %.2fs",
         time.monotonic() - serialize_start,
     )
     logger.info(
         "serialize_workbook total completed in %.2fs",
         time.monotonic() - total_start,
     )
src/exstruct/core/cells.py (1)

142-149: Consider documenting the 1.5 scaling factor rationale.

The mode-based limit resolution is clear, but the 1.5x scaling for standard/verbose modes could benefit from a brief comment explaining the reasoning (e.g., "allow larger scans for richer extraction modes").

📝 Optional: Add comment for scaling rationale
 def _resolve_table_scan_limits(
     mode: ExtractionMode, scan_limits: TableScanLimits | None
 ) -> TableScanLimits:
     if scan_limits is not None:
         return scan_limits
     if mode in {"standard", "verbose"}:
+        # Standard/verbose modes allow larger scans for more thorough detection
         return _DEFAULT_TABLE_SCAN_LIMITS.scaled(1.5)
     return _DEFAULT_TABLE_SCAN_LIMITS
src/exstruct/core/backends/openpyxl_backend.py (1)

120-140: Consider using the ExtractionMode type alias for consistency.

The mode parameter uses an inline Literal["light", "standard", "verbose"] type annotation, but cells.py already defines an ExtractionMode type alias for this. Using the alias would ensure consistency if the allowed modes ever change.

♻️ Suggested refactor
 from typing import Literal
+from ..cells import ExtractionMode
 
 ...
 
     def detect_tables(
         self,
         sheet_name: str,
         *,
-        mode: Literal["light", "standard", "verbose"] = "standard",
+        mode: ExtractionMode = "standard",
     ) -> list[str]:
tests/mcp/test_path_policy.py (1)

10-30: Add Google-style docstrings to the new test functions.

These test helpers are missing docstrings required by the project’s Python guidelines. As per coding guidelines, please add concise Google-style docstrings to each test.

tests/mcp/test_validate_input.py (1)

13-52: Add Google-style docstrings to the new test functions.

These tests were added without docstrings. As per coding guidelines, please add concise Google-style docstrings to each test.

docs/agents/FEATURE_SPEC.md (1)

76-85: Add a language specifier to the fenced code block.

The code block should have a language identifier for markdown lint compliance and better rendering. Use text or plaintext for directory structures.

Proposed fix
-```
+```text
 src/exstruct/
   mcp/
     __init__.py
     server.py          # MCP server entrypoint (stdio)
     tools.py           # tool definitions + handlers
     io.py              # path validation, safe read/write
     extract_runner.py  # internal API call or subprocess fallback
     chunk_reader.py    # JSON partial read / pointer / sheet filters
</details>

</blockquote></details>
<details>
<summary>src/exstruct/mcp/chunk_reader.py (1)</summary><blockquote>

`192-213`: **Prefer a dedicated result object over tuples for structured returns.**  
Helpers like `_apply_filters` and `_build_sheet_chunk` return multi-value tuples; a small result type would make fields explicit and align with project guidance. As per coding guidelines, consider replacing tuple returns with a dedicated result model.  



Also applies to: 304-365

</blockquote></details>
<details>
<summary>src/exstruct/mcp/extract_runner.py (1)</summary><blockquote>

`192-219`: **Consider a named result type for helper outputs.**  
`_apply_conflict_policy` and `_try_read_workbook_meta` return tuples; using a small result object would make call sites clearer and align with project guidance. As per coding guidelines, prefer structured result types over tuples.  



Also applies to: 235-256

</blockquote></details>
<details>
<summary>src/exstruct/mcp/server.py (2)</summary><blockquote>

`172-185`: **Incomplete docstring: missing `on_conflict` parameter.**

The docstring documents `policy` but omits the `on_conflict` parameter. Per coding guidelines, all parameters should be documented.

<details>
<summary>📝 Proposed fix</summary>

```diff
 def _create_app(policy: PathPolicy, *, on_conflict: OnConflictPolicy) -> FastMCP:
     """Create the MCP FastMCP application.
 
     Args:
         policy: Path policy for filesystem access.
+        on_conflict: Default conflict resolution policy for output files.
 
     Returns:
         FastMCP application instance.
     """

188-196: Incomplete docstring: missing default_on_conflict parameter.

The default_on_conflict parameter is not documented in the docstring.

📝 Proposed fix
 def _register_tools(
     app: FastMCP, policy: PathPolicy, *, default_on_conflict: OnConflictPolicy
 ) -> None:
     """Register MCP tools for the server.
 
     Args:
         app: FastMCP application instance.
         policy: Path policy for filesystem access.
+        default_on_conflict: Default conflict resolution policy when not specified per-call.
     """

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements MCP (Model Context Protocol) server functionality for ExStruct, enabling AI agents to use ExStruct as a tool for extracting structured data from Excel files. The implementation adds three MCP tools: extraction, chunked JSON reading, and input validation, with support for light/standard/verbose extraction modes.

Changes:

  • Added MCP server with stdio transport supporting three tools for Excel extraction
  • Implemented path-based security with allowlist/denylist patterns
  • Added performance instrumentation with timing logs for pipeline steps and serialization
  • Enhanced table detection with configurable scan limits that scale based on extraction mode

Reviewed changes

Copilot reviewed 36 out of 38 changed files in this pull request and generated no comments.

Show a summary per file
File Description
pyproject.toml Added mcp optional dependency group and exstruct-mcp CLI entrypoint
uv.lock Added MCP SDK and dependencies (anyio, httpx, starlette, pydantic-settings, etc.)
src/exstruct/mcp/*.py New MCP server implementation with tools, path policy, extraction runner, chunk reader, and validation
src/exstruct/core/cells.py Added table scan limits, border clustering backend selection, and openpyxl scan optimizations
src/exstruct/core/pipeline.py Added timing instrumentation for pipeline steps
src/exstruct/io/init.py Added timing logs for serialization and file write operations
tests/mcp/*.py Comprehensive test suite for MCP functionality
tests/core/*.py Updated tests to pass mode parameter to table detection functions
docs/mcp.md New documentation for MCP server usage and configuration
README*.md Added MCP server section with installation and usage instructions

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…xstruct.validate_input`をそれぞれ`exstruct_extract`、`exstruct_read_json_chunk`、`exstruct_validate_input`に変更。バージョンを0.4.1に更新し、リリースノートを追加。
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@pyproject.toml`:
- Around line 51-54: Update the mcp dependency constraint in pyproject.toml:
locate the mcp array entry (the lines defining "mcp>=1.6.0,<2.0.0") and change
the minimum version to the current stable baseline by replacing ">=1.6.0,<2.0.0"
with ">=1.25.0,<2.0.0" so the project requires at least mcp 1.25.0 while still
preventing v2.x releases.
🧹 Nitpick comments (4)
src/exstruct/mcp/server.py (2)

172-185: Incomplete docstring for _create_app.

The docstring is missing the on_conflict parameter description in the Args section.

📝 Proposed fix
 def _create_app(policy: PathPolicy, *, on_conflict: OnConflictPolicy) -> FastMCP:
     """Create the MCP FastMCP application.

     Args:
         policy: Path policy for filesystem access.
+        on_conflict: Output conflict policy for extraction.

     Returns:
         FastMCP application instance.
     """

188-196: Incomplete docstring for _register_tools.

The docstring is missing the default_on_conflict parameter description in the Args section.

📝 Proposed fix
 def _register_tools(
     app: FastMCP, policy: PathPolicy, *, default_on_conflict: OnConflictPolicy
 ) -> None:
     """Register MCP tools for the server.

     Args:
         app: FastMCP application instance.
         policy: Path policy for filesystem access.
+        default_on_conflict: Default output conflict policy for extraction.
     """
docs/agents/FEATURE_SPEC.md (1)

76-85: Add language specifier to fenced code block.

The fenced code block is missing a language identifier. For directory structure listings, use text or plaintext to satisfy markdown linting rules.

📝 Proposed fix
-```
+```text
 src/exstruct/
   mcp/
     __init__.py
pyproject.toml (1)

40-47: Consider deduplicating dependency specifications.

The mcp and httpx dependencies are specified in both the all group and the mcp group with identical constraints. While this works correctly, you could reference the mcp extra from all to avoid duplication if the tooling supports it.

Also applies to: 51-54

@harumiWeb harumiWeb merged commit f5a4642 into main Jan 24, 2026
9 checks passed
@harumiWeb harumiWeb deleted the feat/mcp branch January 24, 2026 00:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants