-
Notifications
You must be signed in to change notification settings - Fork 15
Implementation of MCP server functionality #47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
.gitignoreにエラーログファイルを追加
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. 📝 WalkthroughWalkthroughAdds a stdio-based MCP server and public MCP package surface, new MCP tools (extract, read_json_chunk, validate_input), PathPolicy path safety, JSON chunking with pagination/filters, extraction runner with conflict policies, mode-aware table detection and scan limits, timing instrumentation, docs, packaging, and extensive tests. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant FastMCP
participant PathPolicy
participant ExtractRunner
participant Filesystem
Client->>FastMCP: call "exstruct.extract"(xlsx_path, mode, format, out_dir, on_conflict)
FastMCP->>PathPolicy: ensure_allowed(input_path/out_dir)
PathPolicy-->>FastMCP: allowed / raise
FastMCP->>ExtractRunner: run_extract(request)
ExtractRunner->>Filesystem: read input workbook
Filesystem-->>ExtractRunner: workbook bytes
ExtractRunner->>ExtractRunner: process_excel -> produce outputs
ExtractRunner->>Filesystem: write output file (apply conflict policy)
Filesystem-->>ExtractRunner: write result
ExtractRunner-->>FastMCP: ExtractResult (out_path, warnings, workbook_meta)
FastMCP-->>Client: return tool output
Estimated code review effort🎯 4 (Complex) | ⏱️ ~70 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7402f0660c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 8
🤖 Fix all issues with AI agents
In `@README.ja.md`:
- Around line 72-74: リンク先がCLIガイドになっているためREADMEの「MCPサーバー」リンクを正しいドキュメントに差し替えてください
— README.ja.mdの該当箇所でリンクテキスト「MCPサーバー」を保持し、URL
`https://harumiweb.github.io/exstruct/cli/` をナビゲーションに合わせた
`/mcp/`(または静的サイトの完全URLであれば
`https://harumiweb.github.io/exstruct/mcp/`)に更新してください。
In `@README.md`:
- Around line 75-77: Update the "MCP Server" link target in README.md so it
points to the current MCP docs; locate the markdown anchor text "MCP Server" and
change the href from /cli/ to /mcp/ (or the correct published path) after
verifying the new MCP guide URL, ensuring the link consistently references the
new MCP documentation location.
In `@src/exstruct/mcp/chunk_reader.py`:
- Around line 83-100: In _resolve_output_path, after resolving the path (in
function _resolve_output_path and where policy.ensure_allowed is used), add a
guard to verify resolved.is_file() and if it is not a regular file raise a
ValueError with a clear message (e.g. "Output path is not a file: {resolved}")
instead of allowing a later IsADirectoryError; keep the existing
FileNotFoundError for non-existent paths.
In `@src/exstruct/mcp/extract_runner.py`:
- Around line 25-35: The ExtractRequest.options field is declared but never used
by run_extract, so either wire it into the extraction pipeline or remove/replace
it with a typed schema; update the run_extract function to accept and apply
ExtractRequest.options (or a new typed Options dataclass/Pydantic model) when
building the extractor and passing configuration to functions that control
output/format/on_conflict, and replace dict[str, Any] with a concrete model type
to avoid Any per guidelines; do the same fix for the similar unused options
declaration found around lines 85-90 (replace with the same typed options model
or remove it if unused).
- Around line 101-118: In _resolve_input_path, after obtaining resolved (via
policy.ensure_allowed or path.resolve()), add an explicit check that
resolved.is_file() and raise a ValueError (e.g. "Input path is not a file:
{resolved}") if it is not a regular file so directories don't slip through the
exists() check and later fail in process_excel; keep the existing
FileNotFoundError for missing paths and update the docstring to mention the new
ValueError for non-file inputs.
In `@src/exstruct/mcp/server.py`:
- Around line 242-271: The _read_json_chunk_tool function is currently
synchronous and calls run_read_json_chunk_tool directly, blocking the async
event loop; change its signature to async def _read_json_chunk_tool(...) and
call run_read_json_chunk_tool inside anyio.to_thread.run_sync (e.g. await
anyio.to_thread.run_sync(functools.partial(run_read_json_chunk_tool, payload,
policy=policy))) casting the result to ReadJsonChunkToolOutput as in the
_extract_tool pattern; ensure anyio and functools are imported and leave the
chunk_tool = app.tool(...) registration unchanged.
- Around line 273-286: The _validate_input_tool handler is currently synchronous
and calls run_validate_input_tool which does blocking file I/O; make
_validate_input_tool an async function and call the blocking work via await
anyio.to_thread.run_sync(run_validate_input_tool, payload, policy=policy)
(mirroring the _extract_tool pattern), then register it with
validate_tool(_validate_input_tool) unchanged so the FastMCP event loop is not
blocked by file.open/file.read.
In `@tests/mcp/test_server.py`:
- Around line 8-9: CI is failing because tests/mcp/test_server.py imports anyio
(a transitive dependency of the mcp extra) but the GitHub Actions workflow only
installs the [yaml] extra; update the workflow step in
.github/workflows/pytest.yml that runs pip install -e .[yaml] to install the mcp
extra as well (pip install -e .[yaml,mcp]) so tests/mcp/test_server.py and its
anyio dependency are available during CI.
♻️ Duplicate comments (2)
docs/README.ja.md (1)
72-74: Same MCP link-target concern as in README.md.docs/README.en.md (1)
75-77: Same MCP link-target concern as in README.md.
🧹 Nitpick comments (10)
src/exstruct/io/__init__.py (1)
459-486: Consider DEBUG level for granular timing logs.The three-phase timing (model_dump, serialization, total) provides excellent observability. However, logging all three at INFO level may be verbose in production. Consider using DEBUG for the intermediate phases and INFO only for the total, or make the log level configurable.
♻️ Optional: Use DEBUG for intermediate timing
logger.info( - "serialize_workbook model_dump completed in %.2fs", + logger.debug( + "serialize_workbook model_dump completed in %.2fs", time.monotonic() - dump_start, ) serialize_start = time.monotonic() result = _serialize_payload_from_hint( filtered_dict, format_hint, pretty=pretty, indent=indent ) - logger.info( + logger.debug( "serialize_workbook serialization completed in %.2fs", time.monotonic() - serialize_start, ) logger.info( "serialize_workbook total completed in %.2fs", time.monotonic() - total_start, )src/exstruct/core/cells.py (1)
142-149: Consider documenting the 1.5 scaling factor rationale.The mode-based limit resolution is clear, but the 1.5x scaling for standard/verbose modes could benefit from a brief comment explaining the reasoning (e.g., "allow larger scans for richer extraction modes").
📝 Optional: Add comment for scaling rationale
def _resolve_table_scan_limits( mode: ExtractionMode, scan_limits: TableScanLimits | None ) -> TableScanLimits: if scan_limits is not None: return scan_limits if mode in {"standard", "verbose"}: + # Standard/verbose modes allow larger scans for more thorough detection return _DEFAULT_TABLE_SCAN_LIMITS.scaled(1.5) return _DEFAULT_TABLE_SCAN_LIMITSsrc/exstruct/core/backends/openpyxl_backend.py (1)
120-140: Consider using theExtractionModetype alias for consistency.The
modeparameter uses an inlineLiteral["light", "standard", "verbose"]type annotation, butcells.pyalready defines anExtractionModetype alias for this. Using the alias would ensure consistency if the allowed modes ever change.♻️ Suggested refactor
from typing import Literal +from ..cells import ExtractionMode ... def detect_tables( self, sheet_name: str, *, - mode: Literal["light", "standard", "verbose"] = "standard", + mode: ExtractionMode = "standard", ) -> list[str]:tests/mcp/test_path_policy.py (1)
10-30: Add Google-style docstrings to the new test functions.These test helpers are missing docstrings required by the project’s Python guidelines. As per coding guidelines, please add concise Google-style docstrings to each test.
tests/mcp/test_validate_input.py (1)
13-52: Add Google-style docstrings to the new test functions.These tests were added without docstrings. As per coding guidelines, please add concise Google-style docstrings to each test.
docs/agents/FEATURE_SPEC.md (1)
76-85: Add a language specifier to the fenced code block.The code block should have a language identifier for markdown lint compliance and better rendering. Use
textorplaintextfor directory structures.Proposed fix
-``` +```text src/exstruct/ mcp/ __init__.py server.py # MCP server entrypoint (stdio) tools.py # tool definitions + handlers io.py # path validation, safe read/write extract_runner.py # internal API call or subprocess fallback chunk_reader.py # JSON partial read / pointer / sheet filters</details> </blockquote></details> <details> <summary>src/exstruct/mcp/chunk_reader.py (1)</summary><blockquote> `192-213`: **Prefer a dedicated result object over tuples for structured returns.** Helpers like `_apply_filters` and `_build_sheet_chunk` return multi-value tuples; a small result type would make fields explicit and align with project guidance. As per coding guidelines, consider replacing tuple returns with a dedicated result model. Also applies to: 304-365 </blockquote></details> <details> <summary>src/exstruct/mcp/extract_runner.py (1)</summary><blockquote> `192-219`: **Consider a named result type for helper outputs.** `_apply_conflict_policy` and `_try_read_workbook_meta` return tuples; using a small result object would make call sites clearer and align with project guidance. As per coding guidelines, prefer structured result types over tuples. Also applies to: 235-256 </blockquote></details> <details> <summary>src/exstruct/mcp/server.py (2)</summary><blockquote> `172-185`: **Incomplete docstring: missing `on_conflict` parameter.** The docstring documents `policy` but omits the `on_conflict` parameter. Per coding guidelines, all parameters should be documented. <details> <summary>📝 Proposed fix</summary> ```diff def _create_app(policy: PathPolicy, *, on_conflict: OnConflictPolicy) -> FastMCP: """Create the MCP FastMCP application. Args: policy: Path policy for filesystem access. + on_conflict: Default conflict resolution policy for output files. Returns: FastMCP application instance. """
188-196: Incomplete docstring: missingdefault_on_conflictparameter.The
default_on_conflictparameter is not documented in the docstring.📝 Proposed fix
def _register_tools( app: FastMCP, policy: PathPolicy, *, default_on_conflict: OnConflictPolicy ) -> None: """Register MCP tools for the server. Args: app: FastMCP application instance. policy: Path policy for filesystem access. + default_on_conflict: Default conflict resolution policy when not specified per-call. """
cells.pyに最小行数の列幅縮小制限を追加し、load_border_maps_xlsx関数を修正。テストケースを追加して動作を確認。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements MCP (Model Context Protocol) server functionality for ExStruct, enabling AI agents to use ExStruct as a tool for extracting structured data from Excel files. The implementation adds three MCP tools: extraction, chunked JSON reading, and input validation, with support for light/standard/verbose extraction modes.
Changes:
- Added MCP server with stdio transport supporting three tools for Excel extraction
- Implemented path-based security with allowlist/denylist patterns
- Added performance instrumentation with timing logs for pipeline steps and serialization
- Enhanced table detection with configurable scan limits that scale based on extraction mode
Reviewed changes
Copilot reviewed 36 out of 38 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| pyproject.toml | Added mcp optional dependency group and exstruct-mcp CLI entrypoint |
| uv.lock | Added MCP SDK and dependencies (anyio, httpx, starlette, pydantic-settings, etc.) |
| src/exstruct/mcp/*.py | New MCP server implementation with tools, path policy, extraction runner, chunk reader, and validation |
| src/exstruct/core/cells.py | Added table scan limits, border clustering backend selection, and openpyxl scan optimizations |
| src/exstruct/core/pipeline.py | Added timing instrumentation for pipeline steps |
| src/exstruct/io/init.py | Added timing logs for serialization and file write operations |
| tests/mcp/*.py | Comprehensive test suite for MCP functionality |
| tests/core/*.py | Updated tests to pass mode parameter to table detection functions |
| docs/mcp.md | New documentation for MCP server usage and configuration |
| README*.md | Added MCP server section with installation and usage instructions |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…xstruct.validate_input`をそれぞれ`exstruct_extract`、`exstruct_read_json_chunk`、`exstruct_validate_input`に変更。バージョンを0.4.1に更新し、リリースノートを追加。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@pyproject.toml`:
- Around line 51-54: Update the mcp dependency constraint in pyproject.toml:
locate the mcp array entry (the lines defining "mcp>=1.6.0,<2.0.0") and change
the minimum version to the current stable baseline by replacing ">=1.6.0,<2.0.0"
with ">=1.25.0,<2.0.0" so the project requires at least mcp 1.25.0 while still
preventing v2.x releases.
🧹 Nitpick comments (4)
src/exstruct/mcp/server.py (2)
172-185: Incomplete docstring for_create_app.The docstring is missing the
on_conflictparameter description in the Args section.📝 Proposed fix
def _create_app(policy: PathPolicy, *, on_conflict: OnConflictPolicy) -> FastMCP: """Create the MCP FastMCP application. Args: policy: Path policy for filesystem access. + on_conflict: Output conflict policy for extraction. Returns: FastMCP application instance. """
188-196: Incomplete docstring for_register_tools.The docstring is missing the
default_on_conflictparameter description in the Args section.📝 Proposed fix
def _register_tools( app: FastMCP, policy: PathPolicy, *, default_on_conflict: OnConflictPolicy ) -> None: """Register MCP tools for the server. Args: app: FastMCP application instance. policy: Path policy for filesystem access. + default_on_conflict: Default output conflict policy for extraction. """docs/agents/FEATURE_SPEC.md (1)
76-85: Add language specifier to fenced code block.The fenced code block is missing a language identifier. For directory structure listings, use
textorplaintextto satisfy markdown linting rules.📝 Proposed fix
-``` +```text src/exstruct/ mcp/ __init__.pypyproject.toml (1)
40-47: Consider deduplicating dependency specifications.The
mcpandhttpxdependencies are specified in both theallgroup and themcpgroup with identical constraints. While this works correctly, you could reference themcpextra fromallto avoid duplication if the tooling supports it.Also applies to: 51-54
#45
Summary by CodeRabbit
New Features
Documentation
Configuration
Chores
Tests
✏️ Tip: You can customize this high-level summary in your review settings.