- 03/28/2026: this stack is now containerized - see the Docker quickstart below.
The goal of the OHDSI Study Design Assistant (SDA) is to provide an experience similar to working with a coding agent, but for designing and executing observational retrospective studies using OHDSI tools. SDA is designed to organize, and enable users to interact with, a wide variety of agentic tools that support their study work. It does so by providing a clean separation between the agentic user experience and the generative AI tools. Check out the tag first_agent_and_strategus for the first version to assist with Strategus (not validated), as shown in the more recent video for the second version (no sound). This demonstrates a possible way for the agent to help the user design, run, and interpret the results of an OHDSI incidence rate analysis using the CohortIncidenceModule of OHDSI Strategus. This older video shows a prior test of this concept.
Here are some ways to contribute:
- Create a fork of the project, branch from the fork's main branch, edit the README.md, and open a pull request back to this main branch. Your changes could be integrated very quickly that way!
- Join the discussion on the OHDSI Forums
- Attend the Generative AI WG monthly calls (currently the 2nd Tuesday of the month at 12pm Eastern) or reach out directly to Rich Boyce on OHDSI Teams or the OHDSI forums.
- You may also post "question" issues on this repo.
- `data_quality_interpretation`: the study agent provides interpretation of Data Quality Dashboard, Achilles Heel data quality checks, and Achilles data source characterizations over one or more sources that a user intends to use within a study. In this mode, the study agent derives insights from those sources based on the user's study intent. This is important because it will make the information in the characterizations and QC reports more relevant and actionable to users than static and broad-scope reports (the current state). Users will use this tool from R initially.
- `create_new_phenotype_definition`: the study agent will guide the user through the creation of a definition for an EHR phenotype for the target or outcome cohort relevant to their study intent. This workflow involves selection of concepts, organization of concepts into concept sets, and assembly into cohort definition logic. In addition to concept retrieval, the agent will support reasoning over the semantic relationships encoded in the OMOP vocabulary system (via identity, hierarchical, compositional, associative, and attribute links) to help users identify appropriate inclusions, exclusions, and boundary conditions. This enables deterministic validation of constructed concept sets, supports principled disambiguation of similar concepts during grounding, and provides traceable justification for why specific concepts or groups are included in a phenotype definition. Users will use this tool from R or Atlas initially.
- `keeper_design_sample`: the study agent helps the user create the createKeeper function call to pull cases matching a clinical definition. This will guide the user through building the set of symptoms, related differential diagnoses (those that need to be ruled out), diagnostic procedures, complications, exposures, and measurements for the clinical definition.
Build out the entire set of planned services, each one evaluated and user-tested.
- An Agent Client Protocol (ACP) server that owns interaction policy: confirmations, safe summaries, and tool invocation routing.
  - `acp_agent/`: interaction policy + routing; calls MCP tools or falls back to core.
- Multiple MCP servers that own tool contracts: JSON schemas + deterministic tool outputs.
  - `mcp_server/`: exposes tool APIs (core tools plus phenotype retrieval and prompt bundles).
- Core logic stays pure and reusable across both ACP and MCP layers.
  - `core/`: pure, deterministic business logic (no IO, no network).
ACP provides consistent UX and control across environments (R, Atlas/WebAPI, notebooks), while MCP provides a shared tool bus that can be reused across agents and institutions. ACP orchestrates tool calls and LLM calls; MCP owns retrieval, prompt assets, and deterministic tool outputs. This lets the same core tools be accessed via MCP or directly by ACP without coupling to datasets or local files.
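The layering above can be sketched in miniature: a pure core function, an MCP wrapper that owns the JSON tool contract, and an ACP wrapper that owns the confirmation policy. All names here are illustrative, not the repo's actual API:

```python
# Illustrative sketch of the ACP / MCP / core layering.
# Function and field names are hypothetical, not the repo's real API.

def core_filter_candidates(candidates, allowed_ids):
    """Pure, deterministic core logic: no IO, no network."""
    allowed = set(allowed_ids)
    return [c for c in candidates if c["id"] in allowed]

def mcp_tool_filter(params):
    """MCP layer: owns the tool contract (JSON in, deterministic JSON out)."""
    result = core_filter_candidates(params["candidates"], params["allowed_ids"])
    return {"status": "ok", "candidates": result}

def acp_confirm_and_run(params, confirm):
    """ACP layer: owns interaction policy (confirmation gates the tool call)."""
    if not confirm("Filter candidate phenotypes?"):
        return {"status": "cancelled"}
    return mcp_tool_filter(params)

out = acp_confirm_and_run(
    {"candidates": [{"id": 1}, {"id": 2}], "allowed_ids": [2]},
    confirm=lambda msg: True,  # stand-in for a real user prompt
)
```

Because the core function is pure, it can be exercised by either layer (or in unit tests) without any server running.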
NOTE: at no time, for any of the services, should an LLM see row-level data. This can be accomplished through careful use of protocols (MCP for tooling, the Agent Client Protocol for OHDSI tool <-> LLM communication) and a security layer.
See docs/TESTING.md for install and CLI smoke tests.
- ACP calls MCP `phenotype_search` to retrieve candidates.
- ACP calls MCP `phenotype_prompt_bundle` to fetch prompt assets and output schema.
- ACP calls an OpenAI-compatible LLM API to rank candidates.
- Core validates and filters LLM output.
For details on the design, see docs/PHENOTYPE_RECOMMENDATION_DESIGN.md.
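The core validation step in this flow could look like the following sketch: keep only candidates the retrieval step actually returned, preserving the LLM's ordering and dropping duplicates or hallucinated ids (function and field names are hypothetical):

```python
def validate_llm_ranking(llm_ids, retrieved):
    """Filter an LLM-produced ranking against the retrieved candidate set.

    Ids not present in `retrieved` (hallucinations) and repeated ids are
    dropped; the LLM's relative ordering of valid ids is preserved.
    """
    by_id = {c["id"]: c for c in retrieved}
    seen = set()
    ranked = []
    for cid in llm_ids:
        if cid in by_id and cid not in seen:
            seen.add(cid)
            ranked.append(by_id[cid])
    return ranked

retrieved = [{"id": "ph_12"}, {"id": "ph_7"}, {"id": "ph_3"}]
# LLM output includes a hallucinated id ("ph_99") and a duplicate.
ranked = validate_llm_ranking(["ph_7", "ph_99", "ph_7", "ph_3"], retrieved)
```

This kind of deterministic post-filter is what keeps LLM output from introducing ids that were never retrieved.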
- ACP calls MCP `phenotype_prompt_bundle` for improvement prompts.
- ACP calls an OpenAI-compatible LLM API for improvement suggestions.
- ACP calls MCP `phenotype_improvements` with LLM output for validation.
This flow reviews one phenotype definition at a time. If multiple cohorts are provided, ACP uses the first.
- ACP calls MCP `lint_prompt_bundle` for lint prompts.
- ACP calls an OpenAI-compatible LLM API for findings/patches/actions.
- ACP calls MCP `propose_concept_set_diff` with LLM output for validation.
- ACP calls MCP `phenotype_prompt_bundle` for cohort critique prompts.
- ACP calls an OpenAI-compatible LLM API for findings/patches.
- ACP calls MCP `cohort_lint` with LLM output for validation.
- ACP calls MCP `keeper_sanitize_row` to remove PHI/PII (fail-closed).
- ACP calls MCP `keeper_prompt_bundle` and `keeper_build_prompt` for a sanitized patient prompt.
- ACP calls an OpenAI-compatible LLM API to review the patient summary.
- ACP calls MCP `keeper_parse_response` to normalize the label.
LLM requests never include row-level PHI/PII; only sanitized summaries are sent.
For details on PHI/PII handling, see docs/PHENOTYPE_VALIDATION_REVIEW.md.
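The fail-closed property means the sanitizer rejects anything it does not positively recognize rather than passing it through. A minimal sketch, assuming an allow-list of KEEPER fields and a couple of illustrative identifier patterns (the real tool's field list and rules are not shown here):

```python
import re

# Illustrative allow-list; the real KEEPER row has more fields.
ALLOWED_FIELDS = {"age", "gender", "visitContext", "presentation",
                  "priorDisease", "priorDrugs"}

# Illustrative patterns for obvious identifiers; a production
# sanitizer would be far more thorough.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # SSN-like
    re.compile(r"\b\S+@\S+\.\S+\b"),       # email-like
]

def sanitize_row(row):
    """Fail closed: raise on unknown fields or suspected identifiers."""
    unknown = set(row) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"unknown fields rejected: {sorted(unknown)}")
    for key, value in row.items():
        for pat in PII_PATTERNS:
            if pat.search(str(value)):
                raise ValueError(f"suspected PII in field {key!r}")
    return dict(row)

clean = sanitize_row({"age": 44, "presentation": "GI hemorrhage"})
```

Raising on anything unrecognized, instead of silently dropping it, is what makes the design fail-closed: a new upstream field can never leak to the LLM unnoticed.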
- ACP calls MCP `phenotype_recommendation_advice` for advisory prompt assets and schema.
- ACP calls an OpenAI-compatible LLM API to return actionable guidance.
- Core validates the advisory output.
This flow is used as a fallback when users do not accept initial recommendations.
The interactive Strategus shell orchestrates phenotype selection, improvements, and script
generation for a CohortIncidence study. See docs/STRATEGUS_SHELL.md.
Service definitions live in docs/SERVICE_REGISTRY.yaml. ACP exposes a /services endpoint that
reports registry entries plus any additional ACP-implemented services. You can list services
quickly with doit list_services.
Prerequisite: you must have embedded phenotype definitions; see ./docs/PHENOTYPE_INDEXING.md.
- Start the ACP server (runs on http://127.0.0.1:8765/ by default):
```shell
export LLM_API_KEY=<YOUR KEY>
export LLM_API_URL="<URL BASE>/api/chat/completions"
export LLM_LOG=1
export LLM_MODEL=<a model that supports completions>
export EMBED_API_KEY=<YOUR KEY>
export EMBED_MODEL=<a text embedding model>
export EMBED_URL="<URL BASE>/v1/embeddings"
export PHENOTYPE_INDEX_DIR="<ABSOLUTE PATH TO phenotype_index>"
export STUDY_AGENT_MCP_CWD="<REPO ROOT (optional, for stable relative paths)>"
export STUDY_AGENT_HOST=127.0.0.1
export STUDY_AGENT_PORT=8765
export STUDY_AGENT_MCP_COMMAND=study-agent-mcp
export STUDY_AGENT_MCP_ARGS=""
study-agent-acp
```
Note: This starts MCP via stdio. If you use MCP over HTTP, do not set `STUDY_AGENT_MCP_COMMAND`.
Note: Prefer stopping the ACP process (SIGINT/SIGTERM) so the MCP subprocess is closed cleanly. Killing the MCP directly can leave defunct processes.
Note: ACP uses a threaded HTTP server by default. Set STUDY_AGENT_THREADING=0 to disable threading.
Note: /health includes MCP preflight details under mcp_index when MCP is configured.
Troubleshooting: run python mcp_server/scripts/mcp_probe.py to verify index paths and search without ACP.
Start MCP as a separate HTTP service:
```shell
export MCP_TRANSPORT=http
export MCP_HOST=127.0.0.1
export MCP_PORT=8790
export MCP_PATH=/mcp
study-agent-mcp
```
Then point ACP at it:
```shell
export STUDY_AGENT_MCP_URL="http://127.0.0.1:8790/mcp"
study-agent-acp
```
Note: `STUDY_AGENT_MCP_URL` must include the port (e.g. `:8790`). When set, ACP uses HTTP and ignores `STUDY_AGENT_MCP_COMMAND`.
PowerShell (Windows) quickstart:
```powershell
$env:MCP_TRANSPORT = "http"
$env:MCP_HOST = "127.0.0.1"
$env:MCP_PORT = "8790"
$env:MCP_PATH = "/mcp"
study-agent-mcp
```
```powershell
$env:STUDY_AGENT_MCP_URL = "http://127.0.0.1:8790/mcp"
study-agent-acp
```
- Run `phenotype_recommendation`:
```shell
curl -s -X POST http://127.0.0.1:8765/flows/phenotype_recommendation \
  -H 'Content-Type: application/json' \
  -d '{"study_intent":"Identify clinical risk factors for older adult patients who experience an adverse event of acute gastro-intestinal (GI) bleeding", "top_k":20, "max_results":10, "candidate_limit":10}'
```
Use Docker Compose to run MCP and ACP together with MCP over HTTP.
NOTE: If you plan to use phenotype services, you will need to build the phenotype index (see ./docs/PHENOTYPE_INDEXING.md) with output to data/phenotype_index/.
- Prepare environment variables:
```shell
cp .env.example .env
```
Recommended contents of .env:
```shell
EMBED_API_KEY=<your api key>
EMBED_MODEL=<an embedding model>
EMBED_URL=http://172.17.0.1:3000/ollama/api/embed # or equivalent
LLM_API_KEY=<your api key>
LLM_API_URL=http://172.17.0.1:3000/api/chat/completions # or equivalent
LLM_MODEL=<a chat completion model>
LLM_LOG=1
LLM_USE_RESPONSES=0
LLM_TIMEOUT=180
STUDY_AGENT_ALLOW_CORE_FALLBACK=0
STUDY_AGENT_DEBUG=1
```
- Build and start both services:
```shell
sudo docker compose up --build -d  # you might not need sudo, depending on your Docker setup
```
- Check service health and tool listing:
```shell
curl -s http://127.0.0.1:8765/health | python -m json.tool
```
Expected output:
```json
{
    "status": "ok",
    "mcp": {
        "ok": true,
        "mode": "http"
    },
    "mcp_index": {
        "skipped": true
    }
}
```
This should show a number of services with an empty warnings list:
```shell
curl -s http://127.0.0.1:8765/services | python -m json.tool
```
Notes:
- ACP is exposed on port 8765 and MCP on port 8790.
- The phenotype index is mounted from `./data/phenotype_index` into MCP at `/data/phenotype_index`.
Detailed tests can be found in docs/TESTING.md, but this one is useful as a quick check that a tool using chat completion is functioning and reachable:
```shell
# phenotype_intent_split
curl -s -X POST http://127.0.0.1:8765/flows/phenotype_intent_split \
  -H 'Content-Type: application/json' \
  -d '{"study_intent":"Identify clinical risk factors for older adult patients who experience an adverse event of acute gastro-intestinal (GI) bleeding"}'
```
...expected output something like:
```json
{"status": "ok", "llm_used": true, "intent_split": {"plan": "The target cohort identifies the initial group of patients, and the outcome cohort defines the adverse event of interest to be tracked over time in this cohort.", "target_statement": "Patients aged 65 years and older who have a record of admission to a hospital.", "outcome_statement": "Patients aged 65 years and older who have a record of acute gastrointestinal (GI) bleeding.", "rationale": "This split defines the cohort of older adults at risk of GI bleeding (target) and identifies the specific adverse event we will be tracking in this population (outcome).", "questions": ["What are the specific definitions of 'acute GI bleeding' and 'hospital admission' within the study?", "Are there specific GI conditions that should be included or excluded from the outcome cohort (e.g., ulcers, diverticulitis)?", "What is the desired timeframe for the follow-up period after the index date?"], "mode": "llm"}}
```
Another example, this one examining Safe Harbor patient data to determine if a GI bleed occurred:
```shell
curl -s -X POST http://127.0.0.1:8765/flows/phenotype_validation_review \
  -H 'Content-Type: application/json' \
  -d '{
    "disease_name": "Gastrointestinal bleeding",
    "keeper_row": {
      "age": 44,
      "gender": "Male",
      "visitContext": "Inpatient Visit",
      "presentation": "Gastrointestinal hemorrhage",
      "priorDisease": "Peptic ulcer",
      "symptoms": "",
      "comorbidities": "",
      "priorDrugs": "celecoxib",
      "priorTreatmentProcedures": "",
      "diagnosticProcedures": "",
      "measurements": "",
      "alternativeDiagnosis": "",
      "afterDisease": "",
      "afterDrugs": "Naproxen",
      "afterTreatmentProcedures": ""
    }
  }'
```
...expected result something like:
```json
{"status": "ok", "tool": "keeper_parse_response", "warnings": [], "safe_summary": {"plan": null}, "full_result": {"label": "yes", "rationale": "The patient's diagnosis recorded on the day of the visit is 'Gastrointestinal hemorrhage', which directly indicates the presence of GI bleeding. This is sufficient evidence to confirm the presence of the phenotype.", "_meta": {"tool": "keeper_parse_response"}}, "llm_used": true}
```
Below is a set of planned study agent services, organized by category. For each service, the input, output, and validation approach are documented.
Input: PICO/TAR for a study intent.
Output: Templated protocol.
Validation: Protocol completeness and consistency review.
Input: PICO/TAR and hypothesis.
Output: Background document justifying the study (systematic research summary).
Validation: Source coverage and alignment with hypothesis.
Input: Protocol.
Output: Critique reviewing required components and consistency.
Validation: Checklist of required components; coherence checks.
Input: Protocol or study intent statement.
Output: Directed acyclic graph of known causal/associative relations (LLM + literature discovery).
Validation: Consistency with cited relations and domain plausibility.
Input: The user's study intent statement and cohort diagnostics output, including code to run and the results files.
Output: Narrative summary / report of the analysis.
Validation: Correctly reported summary of the methods and results.
Input: The user's study intent statement, cohort diagnostics, and a completed analysis with Strategus output folders, with code to run and the results files (incidence/estimation/characterization).
Output: Narrative summary / report of the analysis.
Validation: Correctly reported summary of the methods and results.
Input: Study specification intent or existing Strategus JSON.
Output: Composed/compared/edited/criticized/debugged Strategus JSON.
Validation: Schema validation and diff review.
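The "schema validation and diff review" step could be approximated with a deterministic module-level diff between two analysis specifications. A sketch; the `moduleSpecifications`/`module` field names follow my reading of the Strategus analysis specification JSON and should be verified against your Strategus version:

```python
def diff_strategus(old_spec, new_spec):
    """Report which analysis modules were added or removed between two
    Strategus analysis specifications (module-level only; a full diff
    would also compare each module's settings)."""
    old_mods = {m["module"] for m in old_spec.get("moduleSpecifications", [])}
    new_mods = {m["module"] for m in new_spec.get("moduleSpecifications", [])}
    return {"added": sorted(new_mods - old_mods),
            "removed": sorted(old_mods - new_mods)}

before = {"moduleSpecifications": [{"module": "CohortGeneratorModule"}]}
after = {"moduleSpecifications": [{"module": "CohortGeneratorModule"},
                                  {"module": "CohortIncidenceModule"}]}
diff = diff_strategus(before, after)
```

A deterministic diff like this gives the user something concrete to confirm before any LLM-proposed edit is written back.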
Input: Study intent.
Output: Suggested phenotypes with cohort definition artifacts for user-accepted selections.
Validation: Allowed-id filtering; user confirmation before writes.
Input: Selected phenotypes + study intent.
Output: Improved cohort definitions or Atlas records for accepted changes.
Validation: Target cohort ID validation; user confirmation before writes.
Input: Phenotype/covariate intent lacking a cohort definition.
Output: Suggested concept sets and created concept set artifacts if accepted.
Validation: Concept set schema validation; user confirmation before writes.
Input: Target (optionally comparator).
Output: Recommended negative control outcomes with cohort definitions if accepted.
Validation: Clinical plausibility check; user confirmation before writes.
Input: Target.
Output: Proposed comparator cohort definition if accepted (optionally using OHDSI Comparator Selector).
Validation: Comparator appropriateness review; user confirmation before writes.
Input: Study intent + DAG.
Output: Adjustment set from OHDSI features plus suggested FeatureExtraction features.
Validation: Confounder/collider/mediator checks against the DAG, e.g., warning the user if a known, biased collider reported in a published paper might accidentally be included in their study design. See this JAMA article for more about colliders. Also, potentially using a knowledge graph of causal findings from the literature to inform the user of the same.
Input: Concept set + study intent.
Output: Proposed patches to concept set artifacts if accepted.
Validation: Deterministic diff rules; user confirmation before writes.
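Deterministic diff rules for concept set patches might reduce to set arithmetic over concept ids, with a confirmation gate before any write (structure illustrative, not the repo's actual artifact format):

```python
def concept_set_diff(current_ids, proposed_ids):
    """Deterministic patch: sorted add/remove lists of concept ids."""
    cur, prop = set(current_ids), set(proposed_ids)
    return {"add": sorted(prop - cur), "remove": sorted(cur - prop)}

def apply_if_confirmed(current_ids, patch, confirmed):
    """User confirmation gates any write; unconfirmed patches are no-ops."""
    if not confirmed:
        return sorted(current_ids)
    return sorted((set(current_ids) - set(patch["remove"])) | set(patch["add"]))

patch = concept_set_diff([192671, 195585, 4103703], [195585, 4103703, 443530])
updated = apply_if_confirmed([192671, 195585, 4103703], patch, confirmed=True)
```

Because the diff is pure set arithmetic, the same LLM proposal always yields the same patch, which makes the review step reproducible.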
Input: Selected phenotype(s).
Output: R code (or Atlas services) to characterize populations.
Validation: Execution preview; user confirmation before running.
Input: Phenotype definitions + data quality sources (DQD, Achilles Heel, characterization).
Output: Mitigations and patches for accepted issues.
Validation: Issue traceability to data quality sources; user confirmation before writes.
Input: Phenotype definition(s) + datasets.
Output: R code to run (e.g., Cohort Diagnostics) and a brief summary of drivers of cohort size variation.
Validation: Reproducible execution outputs; summary tied to diagnostics.
Input: Selected phenotype definition (usually for an outcome cohort) and a narrative clinical description with differential diagnoses and known associated factors for validation and to compare to known phenotype performance.
Output: Code to extract sample cases based on the clinical description, and LLM assessment of a sample (user-specified or random) of cohort records stripped of PHI.
Validation: Sampling logic review; user confirmation.
Input: Phenotype/covariate intent without a cohort definition.
Output: Capr code for cohort definition.
Validation: Schema validation; user confirmation before writes.
Input: Cohort JSON.
Output: Proposed patches for design issues and execution efficiency.
Validation: Deterministic lint rules; user confirmation before writes.
Input: Target + outcome.
Output: Judgement on causal implausibility with explanation and citations.
Validation: Citation review and domain plausibility.