
Enable langsmith studio (or Look Ma, no Command Line!)#304

Open
yangm2 wants to merge 60 commits into codeforpdx:main from yangm2:enable-langsmith-studio

Conversation

@yangm2
Contributor

@yangm2 yangm2 commented Mar 11, 2026

What type of PR is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Optimization
  • Documentation Update
  • Infrastructure
  • Maintenance

Description

tl;dr - Use a GUI to iterate on model and evaluation inputs. The GUI runs on an app that is deployed on LangSmith. After this PR is merged, we can deploy the tenantfirstaid/main branch, which will automatically update as changes are pushed (or a different branch that we sync manually).

Evaluation framework (backend/evaluate/): moved evaluation scripts into their own subdir with a CLI (langsmith_dataset.py) for managing datasets, scenarios, and experiments. Scenarios live in a git-tracked JSONL file with a JSON schema for validation. The CLI supports push/pull to sync with LangSmith's web UI, plus fine-grained operations like append/remove/diff/merge on individual scenarios.
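The scenario JSONL format can be validated with very little machinery. A minimal sketch, assuming hypothetical field names (`scenario_id`, `messages`) that stand in for whatever the PR's actual JSON schema requires:

```python
import json

# Hypothetical scenario shape: the field names below are illustrative,
# not the PR's actual JSON schema.
REQUIRED_KEYS = {"scenario_id", "messages"}

def validate_jsonl(text):
    """Parse a JSONL blob and check each scenario for the required keys."""
    scenarios = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
        scenarios.append(record)
    return scenarios
```

Running each line through a check like this before push/pull keeps a bad hand-edit from silently corrupting the shared dataset.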

Editable prompts and rubrics: the system prompt, letter template, and LLM-as-judge evaluator rubrics are now plain markdown files that lawyers can edit without touching Python. constants.py loads them at startup with placeholder substitution for the system prompt. The evaluator rubrics live in evaluators/*.md and are wrapped by thin Python code in langsmith_evaluators.py.
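The placeholder-substitution step can be sketched with the standard library. This is illustrative only: the prompt text and placeholder names below are assumptions, not the actual contents of the PR's markdown files:

```python
from string import Template

# Hypothetical prompt text; the real system prompt lives in a markdown file.
PROMPT_TEMPLATE = "You are a tenants' rights assistant for $city, $state."

def render_prompt(template_text, **placeholders):
    # Template.substitute raises KeyError for a missing placeholder,
    # which fails fast instead of shipping a half-filled prompt.
    return Template(template_text).substitute(**placeholders)
```

Explicit substitution like this is what makes the "safer than raw f-strings" claim in the review hold: a lawyer editing the markdown cannot accidentally execute code, and a typo'd placeholder fails at startup rather than at chat time.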

LangGraph entry point (graph.py + langgraph.json): a shared module that exposes the LLM, tools, and a create_graph() factory. LangChainChatManager now delegates to this instead of duplicating the LLM config and tool list. The langgraph.json manifest enables langgraph dev for local Studio testing (no LangSmith seat or Docker required) and future LangSmith Cloud deployment.

Docs: comprehensive EVALUATION.md covering the evaluation flow, dataset management, scoring rubrics, langgraph dev setup, and collaboration workflows.

TODO:

  • concoct an incantation to use a LangSmith Workspace secret in a Deployment environment variable (bug filed with LangSmith)
  • manually configure inline/online evaluators for dataset
  • update EVALUATION.md with Workspace setup, Deployment setup & Online Evaluator setup

Related Tickets & Documents

QA Instructions, Screenshots, Recordings

LangSmith Studio running on a backend deployed from a GitHub branch ...

[screenshot of LangSmith Studio]

Test out changes to the System Prompt ...

When plumbing breaks, your landlord has a role,
To keep your dwelling habitable and whole.
The law requires that they must maintain,
Your home, so you may peacefully remain.
A home is not considered fit to lease,
If it does not have working

Added/updated tests?

  • Yes
  • No, and this is why: please replace this line with details on why tests have not been included
  • I need help with writing tests

Documentation

  • If this PR changes the system architecture, Architecture.md has been updated

[optional] Are there any post deployment tasks we need to perform?

@yangm2 yangm2 self-assigned this Mar 11, 2026
@yangm2 yangm2 added documentation Improvements or additions to documentation backend Bot implementation and other backend concerns infrastructure Pull requests related to infrastructure and underlying workflows labels Mar 11, 2026
@dan-moncada dan-moncada marked this pull request as ready for review March 12, 2026 01:19
@dan-moncada dan-moncada requested a review from leekahung March 12, 2026 01:21
@dan-moncada dan-moncada reopened this Mar 12, 2026
@yangm2
Contributor Author

yangm2 commented Mar 16, 2026

@claude pr-review

@claude

This comment was marked as outdated.

@yangm2
Contributor Author

yangm2 commented Mar 16, 2026

@claude pr-review

@claude

This comment was marked as outdated.

@yangm2
Contributor Author

yangm2 commented Mar 17, 2026

/claude pr-review

@leekahung
Contributor

@claude pr-review

@claude

This comment was marked as outdated.

"""Tests for evaluate/langsmith_dataset.py."""

import json
import sys

Check notice — Code scanning / CodeQL: Unused import (note, test)

Import of 'sys' is not used.

yangm2 (Contributor Author): fixed


import json
import sys
from io import StringIO

Check notice — Code scanning / CodeQL: Unused import (note, test)

Import of 'StringIO' is not used.

yangm2 (Contributor Author): fixed

@yangm2
Contributor Author

yangm2 commented Mar 17, 2026

@claude pr-review

@claude

This comment was marked as outdated.

@yangm2
Contributor Author

yangm2 commented Mar 17, 2026

@claude pr-review include mermaid diagrams showing the proportion of changes in code vs markdown

@claude
Contributor

claude bot commented Mar 17, 2026

Claude finished @yangm2's task in 2m 58s —— View job


PR Review: Enable LangSmith Studio

  • Gather diff stats (code vs markdown)
  • Read key changed files
  • Generate mermaid diagrams
  • Write comprehensive review

Change composition

This is a big PR — 35 files, 4152 additions, 775 deletions — but a large chunk is new infra and docs, not functional churn.

Additions by category (4152 lines added):

pie title Additions by category
    "Production Python" : 1571
    "Tests" : 984
    "Markdown / docs" : 887
    "Config & data" : 710

All changed lines (adds + deletes) (4927 total):

pie title Total lines changed by file type
    "Python code" : 3062
    "Markdown / docs" : 1128
    "Config & data" : 737

The docs/markdown share (~21% of additions) is notably large for a code PR — the new EVALUATION.md alone is 675 lines. That's a feature, not a complaint: the rubrics and prompts moving to markdown is the whole point of this PR.


Overall impression

Solid architecture work. The key moves — extracting graph.py as the shared entry point, externalising prompts/rubrics to markdown, and building the langsmith_dataset CLI — are well-motivated and well-executed. Test coverage for the new CLI is thorough (634 lines, property-based tests with Hypothesis). A few issues worth addressing before merge.


Issues

1. assert used for a runtime precondition — graph.py:44

assert SINGLETON.GOOGLE_APPLICATION_CREDENTIALS is not None, (
    "GOOGLE_APPLICATION_CREDENTIALS is not set"
)

assert is stripped by the Python optimiser (python -O). This should be an explicit if ... raise ValueError(...). The constants.py constructor already raises ValueError for missing vars, so this assertion is probably redundant anyway — but if you want the guard, use the right mechanism. Fix this →
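A minimal sketch of the suggested fix (function name is illustrative, not the PR's code):

```python
import os

def require_env(name):
    """Explicit guard that survives `python -O`, unlike a bare assert."""
    value = os.environ.get(name)
    if value is None:
        raise ValueError(f"{name} is not set")
    return value
```

The `if ... raise` form also lets you raise the same `ValueError` type the `constants.py` constructor already uses, so callers handle one exception class.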

2. Module-level evaluator instantiation — langsmith_evaluators.py:79-100

citation_accuracy_evaluator: SimpleEvaluator = create_llm_as_judge(
    model=EVALUATOR_MODEL_NAME, prompt=load_rubric("citation_accuracy"), ...
)

These three calls to create_llm_as_judge run at import time. If that constructor makes network calls or requires credentials, importing this module in CI (or in any test that pulls in the evaluate package) will fail without a live model endpoint. Consider lazy initialisation (a get_evaluators() factory, or instantiation inside the evaluation runner) so the module stays importable in offline environments.
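One way to sketch the lazy-factory suggestion; the stub constructor and the rubric names other than `citation_accuracy` are hypothetical stand-ins, not the module's actual API:

```python
from functools import lru_cache

def _create_llm_as_judge_stub(rubric_name):
    # Stand-in for the real create_llm_as_judge, which may need credentials
    # or a live endpoint; nothing here touches the network.
    return {"rubric": rubric_name}

@lru_cache(maxsize=1)
def get_evaluators():
    """Build evaluators on first call, not at import time, so the module
    stays importable in offline CI."""
    return tuple(_create_llm_as_judge_stub(name)
                 for name in ("citation_accuracy", "tone", "completeness"))
```

`lru_cache` gives the same construct-once behaviour as module-level globals, but only when an evaluation run actually asks for the evaluators.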

3. tool_usage_evaluator accesses run.trace — langsmith_evaluators.py:154-164

if not hasattr(run, "trace") or not run.trace:
    return {..., "score": 0.0, "comment": "No trace available"}
for step in run.trace.get("steps", []):

The LangSmith Run object doesn't have a .trace attribute with a "steps" key in its public API. This evaluator will almost always fall into the "No trace available" branch and return 0.0, which would silently skew results. Either use the correct LangSmith API to inspect child runs, or remove this evaluator until the API surface is verified.

4. performance_evaluator — langsmith_evaluators.py:191-192

latency_seconds = (run.end_time - run.start_time).total_seconds()
token_usage = run.usage.get("total_tokens", 0) if run.usage else 0

run.end_time can be None if the run hasn't completed; the subtraction will raise TypeError. Add a guard or handle the None case explicitly.
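A minimal sketch of the guard (function name is illustrative):

```python
from datetime import datetime, timezone

def safe_latency_seconds(run_start, run_end):
    """Return elapsed seconds, or None while the run is still in flight."""
    if run_start is None or run_end is None:
        return None
    return (run_end - run_start).total_seconds()
```

Returning `None` (rather than 0.0) keeps an in-flight run from being scored as instantaneous and skewing the latency metric.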

5. CodeQL: implicit return in langchain_tools.py

The CodeQL scan flagged a mixed explicit/implicit return in this file (review comment, no line number). Looking at the current code, the culprit is likely RagBuilder.__init__ — it has no explicit return None but the class has __credentials assigned via multiple branches with no unified return path. This is cosmetic but easy to silence with an explicit return or restructuring.

6. cmd_dataset_push doesn't update or remove — langsmith_dataset.py:178-207

The push command only adds new scenarios (those whose scenario_id isn't already in LangSmith). If a scenario's content is changed locally and re-pushed, the remote copy is silently left stale. This may be intentional (push is additive; pull overwrites), but it's not documented in the help text and could trip up collaborators. Consider adding a note to the docstring/help text, or a --force flag that updates existing scenarios.
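The suggested `--force` semantics could look like the sketch below, treating local and remote datasets as dicts keyed by `scenario_id`. This is a hypothetical model of the proposal, not the CLI's actual behaviour:

```python
def push_scenarios(local, remote, force=False):
    """Additive by default; with force=True also overwrite remote
    scenarios whose content changed. Values are scenario payloads."""
    merged = dict(remote)
    for sid, payload in local.items():
        if sid not in merged:
            merged[sid] = payload       # additive: new scenarios always push
        elif force and merged[sid] != payload:
            merged[sid] = payload       # overwrite only when forced
    return merged
```

Either way, the additive-by-default behaviour deserves a sentence in the subcommand's help text so collaborators don't assume push is a full sync.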

7. LangChainChatManager agent is bound to first-call location — langchain_chat_manager.py:100-101

if self.agent is None:
    self.agent = self.__create_agent_for_session(city, state, thread_id)

The agent (and its baked-in system prompt) is created once and reused for all subsequent calls, even if city/state change. This is a pre-existing bug, not introduced here, but the refactor touches these lines and it's worth flagging for a follow-up.
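A possible shape for that follow-up: cache one agent per `(city, state)` key instead of a single agent pinned to the first call's location. Class and method names here are hypothetical, not the PR's code:

```python
class ChatManagerSketch:
    """Cache one agent per (city, state) rather than a single agent."""

    def __init__(self, create_agent):
        self._create_agent = create_agent   # e.g. the session-agent factory
        self._agents = {}

    def get_agent(self, city, state):
        key = (city, state)
        if key not in self._agents:
            self._agents[key] = self._create_agent(city, state)
        return self._agents[key]
```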


Nits

  • langchain_tools.py: retrieve_city_state_laws constructs a new RagBuilder (and thus a new VertexAISearchRetriever) on every tool call. That's fine for correctness (filter changes per call), but it means a fresh credential load each time. Not a bug, worth a comment explaining the intent.

  • constants.py:73-80: The VERTEX_AI_DATASTORE fixup FIXME comment has an open issue (change env var VERTEX_AI_DATASTORE value for langchain port #247). Good that it's tracked; easy to forget.

  • langgraph.json: $schema URL uses https://langgra.ph/schema.json — looks like a typo for https://langchain-ai.github.io/langgraph/.... This likely doesn't matter at runtime but may affect IDE validation.

  • langsmith_evaluators.py:19-20: The commented-out EVALUATOR_MODEL_NAME line is dead code. Remove it or make it a real decision point.


Positive callouts

  • The google_auth.py dual-mode credential loading (file path vs inline JSON) is clean and well-tested, and the AssertionError("unreachable") CodeQL workaround is clearly commented.
  • Moving prompts and rubrics to markdown is exactly the right abstraction for a non-technical audience. _load_system_prompt() with explicit placeholder substitution is much safer than raw f-strings.
  • The langsmith_dataset.py CLI is well-structured: consistent argparse subcommands, _Validate dataclass with error/warn modes, git-cleanliness guard on pull, dry-run support. The property-based tests (Hypothesis) for the diff/merge logic are a nice touch.
  • Deferred LLM initialisation in graph.py (_get_llm with a lock) is the right pattern for a module that gets imported in tests.

