
Enable langsmith studio (or Look Ma, no Command Line!)#304

Open
yangm2 wants to merge 60 commits into codeforpdx:main from yangm2:enable-langsmith-studio

Conversation

@yangm2
Contributor

@yangm2 yangm2 commented Mar 11, 2026

What type of PR is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Optimization
  • Documentation Update
  • Infrastructure
  • Maintenance

Description

tl;dr - Use a GUI to iterate on model and evaluation inputs. The GUI runs on an app that is deployed on LangSmith. After this PR is merged, we can deploy the tenantfirstaid/main branch, which will automatically update as changes are pushed (or a different branch that we sync manually).

Evaluation framework (backend/evaluate/): moved evaluation scripts into their own subdir with a CLI (langsmith_dataset.py) for managing datasets, scenarios, and experiments. Scenarios live in a git-tracked JSONL file with a JSON schema for validation. The CLI supports push/pull to sync with LangSmith's web UI, plus fine-grained operations like append/remove/diff/merge on individual scenarios.
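The scenario JSONL format can be validated with very little machinery. A minimal sketch, assuming hypothetical field names (`scenario_id`, `messages`) that stand in for whatever the PR's actual JSON schema requires:

```python
import json

# Hypothetical scenario shape: the field names below are illustrative,
# not the PR's actual JSON schema.
REQUIRED_KEYS = {"scenario_id", "messages"}

def validate_jsonl(text):
    """Parse a JSONL blob and check each scenario for the required keys."""
    scenarios = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
        scenarios.append(record)
    return scenarios
```

Running each line through a check like this before push/pull keeps a bad hand-edit from silently corrupting the shared dataset.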

Editable prompts and rubrics: the system prompt, letter template, and LLM-as-judge evaluator rubrics are now plain markdown files that lawyers can edit without touching Python. constants.py loads them at startup with placeholder substitution for the system prompt. The evaluator rubrics live in evaluators/*.md and are wrapped by thin Python code in langsmith_evaluators.py.
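The placeholder-substitution step can be sketched with the standard library. This is illustrative only: the prompt text and placeholder names below are assumptions, not the actual contents of the PR's markdown files:

```python
from string import Template

# Hypothetical prompt text; the real system prompt lives in a markdown file.
PROMPT_TEMPLATE = "You are a tenants' rights assistant for $city, $state."

def render_prompt(template_text, **placeholders):
    # Template.substitute raises KeyError for a missing placeholder,
    # which fails fast instead of shipping a half-filled prompt.
    return Template(template_text).substitute(**placeholders)
```

Explicit substitution like this is what makes the "safer than raw f-strings" claim in the review hold: a lawyer editing the markdown cannot accidentally execute code, and a typo'd placeholder fails at startup rather than at chat time.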

LangGraph entry point (graph.py + langgraph.json): a shared module that exposes the LLM, tools, and a create_graph() factory. LangChainChatManager now delegates to this instead of duplicating the LLM config and tool list. The langgraph.json manifest enables langgraph dev for local Studio testing (no LangSmith seat or Docker required) and future LangSmith Cloud deployment.

Docs: comprehensive EVALUATION.md covering the evaluation flow, dataset management, scoring rubrics, langgraph dev setup, and collaboration workflows.

TODO:

  • concoct an incantation to use a LangSmith Workspace secret in a Deployment environment variable (bug filed with LangSmith)
  • manually configure inline/online evaluators for dataset
  • update EVALUATION.md with Workspace setup, Deployment setup & Online Evaluator setup

Related Tickets & Documents

QA Instructions, Screenshots, Recordings

LangSmith Studio running on a backend deployed from a GitHub branch ...

[screenshot of LangSmith Studio]

Test out changes to the System Prompt ...

When plumbing breaks, your landlord has a role,
To keep your dwelling habitable and whole.
The law requires that they must maintain,
Your home, so you may peacefully remain.
A home is not considered fit to lease,
If it does not have working

Added/updated tests?

  • Yes
  • No, and this is why: please replace this line with details on why tests have not been included
  • I need help with writing tests

Documentation

  • If this PR changes the system architecture, Architecture.md has been updated

[optional] Are there any post deployment tasks we need to perform?

@yangm2 yangm2 self-assigned this Mar 11, 2026
@yangm2 yangm2 added documentation Improvements or additions to documentation backend Bot implementation and other backend concerns infrastructure Pull requests related to infrastructure and underlying workflows labels Mar 11, 2026
@dan-moncada dan-moncada marked this pull request as ready for review March 12, 2026 01:19
@dan-moncada dan-moncada requested a review from leekahung March 12, 2026 01:21
@dan-moncada dan-moncada reopened this Mar 12, 2026
@yangm2
Contributor Author

yangm2 commented Mar 16, 2026

@claude pr-review

@claude

This comment was marked as outdated.

@yangm2
Contributor Author

yangm2 commented Mar 16, 2026

@claude pr-review

@claude

This comment was marked as outdated.

@yangm2
Contributor Author

yangm2 commented Mar 17, 2026

/claude pr-review

@leekahung
Contributor

@claude pr-review

@claude

This comment was marked as outdated.

"""Tests for evaluate/langsmith_dataset.py."""

import json
import sys

Check notice — Code scanning / CodeQL: Unused import (note, test)

Import of 'sys' is not used.

yangm2 (Contributor Author): fixed


import json
import sys
from io import StringIO

Check notice — Code scanning / CodeQL: Unused import (note, test)

Import of 'StringIO' is not used.

yangm2 (Contributor Author): fixed

@yangm2
Contributor Author

yangm2 commented Mar 17, 2026

@claude pr-review

@claude

This comment was marked as outdated.

@yangm2
Contributor Author

yangm2 commented Mar 17, 2026

@claude pr-review include mermaid diagrams showing the proportion of changes in code vs markdown

@claude
Contributor

claude bot commented Mar 17, 2026

Claude finished @yangm2's task in 2m 58s —— View job


PR Review: Enable LangSmith Studio

  • Gather diff stats (code vs markdown)
  • Read key changed files
  • Generate mermaid diagrams
  • Write comprehensive review

Change composition

This is a big PR — 35 files, 4152 additions, 775 deletions — but a large chunk is new infra and docs, not functional churn.

Additions by category (4152 lines added):

pie title Additions by category
    "Production Python" : 1571
    "Tests" : 984
    "Markdown / docs" : 887
    "Config & data" : 710

All changed lines (adds + deletes) (4927 total):

pie title Total lines changed by file type
    "Python code" : 3062
    "Markdown / docs" : 1128
    "Config & data" : 737

The docs/markdown share (~21% of additions) is notably large for a code PR — the new EVALUATION.md alone is 675 lines. That's a feature, not a complaint: the rubrics and prompts moving to markdown is the whole point of this PR.


Overall impression

Solid architecture work. The key moves — extracting graph.py as the shared entry point, externalising prompts/rubrics to markdown, and building the langsmith_dataset CLI — are well-motivated and well-executed. Test coverage for the new CLI is thorough (634 lines, property-based tests with Hypothesis). A few issues worth addressing before merge.


Issues

1. assert used for a runtime precondition — graph.py:44

assert SINGLETON.GOOGLE_APPLICATION_CREDENTIALS is not None, (
    "GOOGLE_APPLICATION_CREDENTIALS is not set"
)

assert is stripped by the Python optimiser (python -O). This should be an explicit if ... raise ValueError(...). The constants.py constructor already raises ValueError for missing vars, so this assertion is probably redundant anyway — but if you want the guard, use the right mechanism. Fix this →
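A minimal sketch of the suggested fix (function name is illustrative, not the PR's code):

```python
import os

def require_env(name):
    """Explicit guard that survives `python -O`, unlike a bare assert."""
    value = os.environ.get(name)
    if value is None:
        raise ValueError(f"{name} is not set")
    return value
```

The `if ... raise` form also lets you raise the same `ValueError` type the `constants.py` constructor already uses, so callers handle one exception class.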

2. Module-level evaluator instantiation — langsmith_evaluators.py:79-100

citation_accuracy_evaluator: SimpleEvaluator = create_llm_as_judge(
    model=EVALUATOR_MODEL_NAME, prompt=load_rubric("citation_accuracy"), ...
)

These three calls to create_llm_as_judge run at import time. If that constructor makes network calls or requires credentials, importing this module in CI (or in any test that pulls in the evaluate package) will fail without a live model endpoint. Consider lazy initialisation (a get_evaluators() factory, or instantiation inside the evaluation runner) so the module stays importable in offline environments.
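One way to sketch the lazy-factory suggestion; the stub constructor and the rubric names other than `citation_accuracy` are hypothetical stand-ins, not the module's actual API:

```python
from functools import lru_cache

def _create_llm_as_judge_stub(rubric_name):
    # Stand-in for the real create_llm_as_judge, which may need credentials
    # or a live endpoint; nothing here touches the network.
    return {"rubric": rubric_name}

@lru_cache(maxsize=1)
def get_evaluators():
    """Build evaluators on first call, not at import time, so the module
    stays importable in offline CI."""
    return tuple(_create_llm_as_judge_stub(name)
                 for name in ("citation_accuracy", "tone", "completeness"))
```

`lru_cache` gives the same construct-once behaviour as module-level globals, but only when an evaluation run actually asks for the evaluators.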

3. tool_usage_evaluator accesses run.trace — langsmith_evaluators.py:154-164

if not hasattr(run, "trace") or not run.trace:
    return {..., "score": 0.0, "comment": "No trace available"}
for step in run.trace.get("steps", []):

The LangSmith Run object doesn't have a .trace attribute with a "steps" key in its public API. This evaluator will almost always fall into the "No trace available" branch and return 0.0, which would silently skew results. Either use the correct LangSmith API to inspect child runs, or remove this evaluator until the API surface is verified.

4. performance_evaluator — langsmith_evaluators.py:191-192

latency_seconds = (run.end_time - run.start_time).total_seconds()
token_usage = run.usage.get("total_tokens", 0) if run.usage else 0

run.end_time can be None if the run hasn't completed; the subtraction will raise TypeError. Add a guard or handle the None case explicitly.
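A minimal sketch of the guard (function name is illustrative):

```python
from datetime import datetime, timezone

def safe_latency_seconds(run_start, run_end):
    """Return elapsed seconds, or None while the run is still in flight."""
    if run_start is None or run_end is None:
        return None
    return (run_end - run_start).total_seconds()
```

Returning `None` (rather than 0.0) keeps an in-flight run from being scored as instantaneous and skewing the latency metric.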

5. CodeQL: implicit return in langchain_tools.py

The CodeQL scan flagged a mixed explicit/implicit return in this file (review comment, no line number). Looking at the current code, the culprit is likely RagBuilder.__init__ — it has no explicit return None but the class has __credentials assigned via multiple branches with no unified return path. This is cosmetic but easy to silence with an explicit return or restructuring.

6. cmd_dataset_push doesn't update or remove — langsmith_dataset.py:178-207

The push command only adds new scenarios (those whose scenario_id isn't already in LangSmith). If a scenario's content is changed locally and re-pushed, the remote copy is silently left stale. This may be intentional (push is additive; pull overwrites), but it's not documented in the help text and could trip up collaborators. Consider adding a note to the docstring/help text, or a --force flag that updates existing scenarios.
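The suggested `--force` semantics could look like the sketch below, treating local and remote datasets as dicts keyed by `scenario_id`. This is a hypothetical model of the proposal, not the CLI's actual behaviour:

```python
def push_scenarios(local, remote, force=False):
    """Additive by default; with force=True also overwrite remote
    scenarios whose content changed. Values are scenario payloads."""
    merged = dict(remote)
    for sid, payload in local.items():
        if sid not in merged:
            merged[sid] = payload       # additive: new scenarios always push
        elif force and merged[sid] != payload:
            merged[sid] = payload       # overwrite only when forced
    return merged
```

Either way, the additive-by-default behaviour deserves a sentence in the subcommand's help text so collaborators don't assume push is a full sync.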

7. LangChainChatManager agent is bound to first-call location — langchain_chat_manager.py:100-101

if self.agent is None:
    self.agent = self.__create_agent_for_session(city, state, thread_id)

The agent (and its baked-in system prompt) is created once and reused for all subsequent calls, even if city/state change. This is a pre-existing bug, not introduced here, but the refactor touches these lines and it's worth flagging for a follow-up.
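A possible shape for that follow-up: cache one agent per `(city, state)` key instead of a single agent pinned to the first call's location. Class and method names here are hypothetical, not the PR's code:

```python
class ChatManagerSketch:
    """Cache one agent per (city, state) rather than a single agent."""

    def __init__(self, create_agent):
        self._create_agent = create_agent   # e.g. the session-agent factory
        self._agents = {}

    def get_agent(self, city, state):
        key = (city, state)
        if key not in self._agents:
            self._agents[key] = self._create_agent(city, state)
        return self._agents[key]
```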


Nits

  • langchain_tools.py: retrieve_city_state_laws constructs a new RagBuilder (and thus a new VertexAISearchRetriever) on every tool call. That's fine for correctness (filter changes per call), but it means a fresh credential load each time. Not a bug, worth a comment explaining the intent.

  • constants.py:73-80: The VERTEX_AI_DATASTORE fixup FIXME comment has an open issue (change env var VERTEX_AI_DATASTORE value for langchain port #247). Good that it's tracked; easy to forget.

  • langgraph.json: $schema URL uses https://langgra.ph/schema.json — looks like a typo for https://langchain-ai.github.io/langgraph/.... This likely doesn't matter at runtime but may affect IDE validation.

  • langsmith_evaluators.py:19-20: The commented-out EVALUATOR_MODEL_NAME line is dead code. Remove it or make it a real decision point.


Positive callouts

  • The google_auth.py dual-mode credential loading (file path vs inline JSON) is clean and well-tested, and the AssertionError("unreachable") CodeQL workaround is clearly commented.
  • Moving prompts and rubrics to markdown is exactly the right abstraction for a non-technical audience. _load_system_prompt() with explicit placeholder substitution is much safer than raw f-strings.
  • The langsmith_dataset.py CLI is well-structured: consistent argparse subcommands, _Validate dataclass with error/warn modes, git-cleanliness guard on pull, dry-run support. The property-based tests (Hypothesis) for the diff/merge logic are a nice touch.
  • Deferred LLM initialisation in graph.py (_get_llm with a lock) is the right pattern for a module that gets imported in tests.

