feat: add describe_data_tool for agentic model selection #387
Open
kpal002 wants to merge 1 commit into
Returns a statistical fingerprint of a named sktime dataset: length, frequency, mean/std, trend slope, missingness, and candidate seasonal period (detected via ACF of the first-differenced series). Enables LLM agents to reason about data characteristics before choosing a forecaster — the missing first step in an agentic selection loop. Ref: #<YOUR_ISSUE_NUMBER>
Reference Issues/PRs
Ref #386 — Agentic forecaster workflow: missing tools for iterative candidate evaluation.
What does this implement/fix? Explain your changes.
Adds `describe_data_tool(dataset, target_col?)`, a series fingerprinting tool that returns summary statistics for a named sktime dataset:

- `length`, `n_missing`, `min`, `max`, `mean`, `std`
- `trend_slope_per_step`: OLS slope of the series against the time index
- `candidate_seasonal_period`: detected via the ACF of the first-differenced series
- `frequency`: inferred from the pandas DatetimeIndex where available

Why first-differencing matters: the raw ACF of a trending series (e.g. airline passengers) is dominated by trend autocorrelation at short lags and returns spurious periods like `sp=2`. First-differencing removes the trend so that seasonal peaks surface correctly (`sp=12` for airline data).

Motivation: an LLM agent running a model-selection loop needs to inspect data characteristics before deciding which estimators to try. Without this tool, the agent must call `evaluate_estimator` or `fit_predict` blind. `describe_data` provides the "look at the data first" step that makes iterative agentic selection meaningful. See the linked issue for the full workflow gap analysis.

Changes:

- `src/sktime_mcp/tools/describe_data.py`: new tool (pure numpy, no new deps)
- `src/sktime_mcp/tools/__init__.py`: export added
- `src/sktime_mcp/server.py`: `Tool()` schema + dispatcher case added

Does your contribution introduce a new dependency? If yes, which one?
No. All helpers use only `numpy` and `pandas`, both already core dependencies.
What should a reviewer concentrate their feedback on?

- Named datasets (`"airline"`, `"sunspots"`, `"lynx"`) are used to match the convention in `evaluate_estimator`. If the preferred pattern is data handles from `load_data_source`, happy to change the interface.
- The seasonal-period detection threshold is `0.2` on the differenced series. Happy to tune or expose as a parameter.

Any other comments?
This is a draft PR accompanying a proposal for the ESoC 2026 sktime agentic track. The tool is part of a larger agentic forecaster prototype (sktime/sktime#9721) where the same fingerprinting logic has been validated against real sktime datasets including airline passengers.
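For reviewers who want to see the first-differencing argument in concrete terms, here is a minimal numpy-only sketch of the two less obvious fingerprint fields. It is an illustration under assumptions, not the code in `describe_data.py`: the helper names, the `max_lag` default, and the "highest local ACF peak above the threshold" rule are hypothetical choices made for this example. Only the `0.2` threshold and the use of the first-differenced series come from the description above.

```python
import numpy as np


def acf(x, max_lag):
    """Sample autocorrelation of x at lags 1..max_lag (biased estimator)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    if denom == 0.0:
        # Constant input: autocorrelation is undefined, report zeros.
        return np.zeros(max_lag)
    return np.array([np.dot(x[:-k], x[k:]) / denom for k in range(1, max_lag + 1)])


def candidate_seasonal_period(y, max_lag=36, threshold=0.2):
    """Hypothetical helper: lag of the highest local ACF peak above
    `threshold`, computed on the first-differenced series, else None."""
    d = np.diff(np.asarray(y, dtype=float))  # differencing removes a linear trend
    max_lag = min(max_lag, len(d) // 2)
    if max_lag < 3:
        return None
    r = acf(d, max_lag)  # r[i] holds the autocorrelation at lag i + 1
    best_lag, best_val = None, threshold
    for i in range(1, len(r) - 1):
        is_local_peak = r[i] > r[i - 1] and r[i] > r[i + 1]
        if is_local_peak and r[i] > best_val:
            best_lag, best_val = i + 1, r[i]
    return best_lag


def trend_slope_per_step(y):
    """Hypothetical helper: OLS slope of y against the index 0..n-1."""
    y = np.asarray(y, dtype=float)
    slope, _intercept = np.polyfit(np.arange(len(y)), y, 1)
    return float(slope)


# Synthetic monthly-like series: linear trend plus a period-12 cycle.
t = np.arange(144)
y = 0.5 * t + 10.0 * np.sin(2.0 * np.pi * t / 12.0)

period = candidate_seasonal_period(y)  # 12 for this synthetic series
slope = trend_slope_per_step(y)        # close to the true trend of 0.5
print(period, round(slope, 3))
```

Picking the highest local peak rather than the first lag above the threshold is one way to avoid short-lag artifacts. Whether the tool should do the same is part of the threshold question raised for reviewers above.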
PR checklist
For all contributions