Enable langsmith studio (or Look Ma, no Command Line!) #304

yangm2 wants to merge 60 commits into codeforpdx:main from …
Conversation
- add a schema file to enforce scenario structure and validation flow
- add CLI to work with LangSmith website
- …ler editing
- separate langgraph entry point to graph.py for LangSmith Studio
- make system_prompt editable in LangSmith Studio
add property-based tests
clean up noisy tests
@claude pr-review include mermaid diagrams showing the proportion of changes in code vs markdown
Claude finished @yangm2's task in 2m 58s

**PR Review: Enable LangSmith Studio**
**Change composition**

This is a big PR (35 files, 4152 additions, 775 deletions), but a large chunk is new infra and docs, not functional churn.

Additions by category (4152 lines added):

```mermaid
pie title Additions by category
    "Production Python" : 1571
    "Tests" : 984
    "Markdown / docs" : 887
    "Config & data" : 710
```

All changed lines, adds plus deletes (4927 total):

```mermaid
pie title Total lines changed by file type
    "Python code" : 3062
    "Markdown / docs" : 1128
    "Config & data" : 737
```
The docs/markdown share (~21% of additions) is notably large for a code PR — the new …

**Overall impression**

Solid architecture work. The key moves — extracting …

**Issues**

1. 

```python
assert SINGLETON.GOOGLE_APPLICATION_CREDENTIALS is not None, (
    "GOOGLE_APPLICATION_CREDENTIALS is not set"
)
```
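On the credential assert above: a bare `assert` is stripped when Python runs with `-O`, so a startup check that must always fire is safer as an explicit raise. A minimal sketch of that alternative; the `require_env` helper and the commented usage line are hypothetical, not the PR's actual code:

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable's value, failing fast with a clear
    message. Unlike `assert`, this still runs under `python -O`."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; check the deployment environment")
    return value


# Hypothetical startup usage; the real settings singleton may differ:
# credentials_path = require_env("GOOGLE_APPLICATION_CREDENTIALS")
```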
2. Module-level evaluator instantiation —

```python
citation_accuracy_evaluator: SimpleEvaluator = create_llm_as_judge(
    model=EVALUATOR_MODEL_NAME, prompt=load_rubric("citation_accuracy"), ...
)
```

These three calls to …

3. 

```python
if not hasattr(run, "trace") or not run.trace:
    return {..., "score": 0.0, "comment": "No trace available"}
for step in run.trace.get("steps", []):
```

The LangSmith …

4. 

```python
latency_seconds = (run.end_time - run.start_time).total_seconds()
token_usage = run.usage.get("total_tokens", 0) if run.usage else 0
```

5. CodeQL: implicit return in … — the CodeQL scan flagged a mixed explicit/implicit return in this file (review comment, no line number). Looking at the current code, the culprit is likely …

6. The push command only adds new scenarios (those whose …

7. 

```python
if self.agent is None:
    self.agent = self.__create_agent_for_session(city, state, thread_id)
```

The agent (and its baked-in system prompt) is created once and reused for all subsequent calls, even if …

**Nits**
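The staleness in issue 7 could be addressed by keying the cached agent on the current system prompt, so an edit made in Studio takes effect on the next turn. A minimal sketch under that assumption; `SessionChat` and `build_agent` are hypothetical stand-ins for the PR's chat manager and `__create_agent_for_session`:

```python
import hashlib
from typing import Any, Callable, Optional


class SessionChat:
    """Sketch: rebuild the cached agent whenever the system prompt changes."""

    def __init__(self, build_agent: Callable[[str], Any]):
        self._build_agent = build_agent
        self._agent: Optional[Any] = None
        self._prompt_key: Optional[str] = None

    def agent_for(self, system_prompt: str) -> Any:
        # Key the cache on a hash of the prompt; any edit invalidates it.
        key = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()
        if self._agent is None or key != self._prompt_key:
            self._agent = self._build_agent(system_prompt)
            self._prompt_key = key
        return self._agent
```

With this pattern the first call builds the agent, repeated calls with the same prompt reuse it, and an edited prompt triggers a rebuild on the next call.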
**Positive callouts**
What type of PR is this? (check all applicable)
Description
tl;dr — use a GUI to iterate on model and evaluation inputs. The GUI runs on an app that is deployed on LangSmith. After this PR is merged, we can deploy the `tenantfirstaid/main` branch, which will automatically update as changes are pushed (or a different branch that we manually sync).

**Evaluation framework** (`backend/evaluate/`): moved evaluation scripts into their own subdir with a CLI (`langsmith_dataset.py`) for managing datasets, scenarios, and experiments. Scenarios live in a git-tracked JSONL file with a JSON schema for validation. The CLI supports push/pull to sync with LangSmith's web UI, plus fine-grained operations like append/remove/diff/merge on individual scenarios.

**Editable prompts and rubrics**: the system prompt, letter template, and LLM-as-judge evaluator rubrics are now plain markdown files that lawyers can edit without touching Python. `constants.py` loads them at startup with placeholder substitution for the system prompt. The evaluator rubrics live in `evaluators/*.md` and are wrapped by thin Python code in `langsmith_evaluators.py`.

**LangGraph entry point** (`graph.py` + `langgraph.json`): a shared module that exposes the LLM, tools, and a `create_graph()` factory. `LangChainChatManager` now delegates to this instead of duplicating the LLM config and tool list. The `langgraph.json` manifest enables `langgraph dev` for local Studio testing (no LangSmith seat or Docker required) and future LangSmith Cloud deployment.

**Docs**: comprehensive `EVALUATION.md` covering the evaluation flow, dataset management, scoring rubrics, `langgraph dev` setup, and collaboration workflows.

TODO:

- concoct incantation to use LangSmith Workspace secret in Deployment environment variable (bug filed on LangSmith)
- `EVALUATION.md` with WorkSpace setup, Deployment setup & Online Evaluator setup

Related Tickets & Documents
QA Instructions, Screenshots, Recordings
LangSmith Studio running on a backend deployed from a GitHub branch ...
Test out changes to the System Prompt ...
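Testing system prompt changes exercises the placeholder substitution that `constants.py` performs when loading the prompt markdown. A minimal sketch of that pattern, assuming `$`-style placeholders; the `render_system_prompt` helper, the prompt text, and the placeholder names are illustrative, and the PR's actual substitution syntax may differ:

```python
from string import Template


def render_system_prompt(raw: str, **placeholders: str) -> str:
    """Fill $-style placeholders. safe_substitute leaves unknown placeholders
    intact instead of raising, which is forgiving for hand-edited markdown."""
    return Template(raw).safe_substitute(**placeholders)


# Hypothetical prompt text and placeholder names:
prompt = render_system_prompt(
    "You help tenants in $city, $state.", city="Portland", state="Oregon"
)
```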

Added/updated tests?
Documentation
`Architecture.md` has been updated

[optional] Are there any post deployment tasks we need to perform?