feat: integrate step-level metrics into STOI product #1
Open
liziye627-design wants to merge 6 commits into KevinYoung-Kw:main from
…dity, feedback-validity)
Based on the research report defining F/V/C/U quality dimensions and TE/SUS/FR/MG/RR core metrics:

- Add stoi_metrics.py: StepParser, TokenAttributor, QualityScorer, and MetricsCalculator, implementing step-level token attribution with a tiktoken/heuristic fallback, four-dimensional quality scoring (Factuality, Validity, Coherence, Utility), and five quantitative metrics (Token Efficiency, Step Utility Score, Faithfulness Risk, Monitorability Gain, Redundancy Ratio)
- Fix calc_stoi algorithm: wasted_tokens now correctly counts only new_tokens instead of (total_context - cache_read), and applies a cache_creation investment ratio to reduce the waste penalty
- Fix calc_stoi_score L3 weight comment (0.35 -> 0.20 to match code)
- Update stoi_proxy.py with the corrected waste calculation
- Add token efficiency and useful output ratio to session reports
- Add a comprehensive test suite (73 tests) covering the engine fixes and the new metrics framework

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
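The commit describes step-level token attribution with a tiktoken/heuristic fallback. A minimal sketch of that idea follows, assuming the `cl100k_base` encoding, a rough 4-characters-per-token fallback, and blank-line-separated steps; `count_tokens`, `split_steps`, and `attribute_tokens` are hypothetical names, not the PR's `StepParser`/`TokenAttributor` API.

```python
from typing import Dict, List

try:
    import tiktoken  # optional dependency: exact BPE counts when available
    _ENCODING = tiktoken.get_encoding("cl100k_base")  # encoding choice is an assumption
except Exception:  # ImportError, or the encoding file could not be loaded
    _ENCODING = None


def count_tokens(text: str) -> int:
    """Count tokens with tiktoken if available, else estimate ~4 chars per token."""
    if not text:
        return 0
    if _ENCODING is not None:
        return len(_ENCODING.encode(text))
    # Heuristic fallback (assumption): about 4 characters per token, at least 1.
    return max(1, len(text) // 4)


def split_steps(output_text: str) -> List[str]:
    """Hypothetical step parser: treat blank-line-separated blocks as steps."""
    return [block.strip() for block in output_text.split("\n\n") if block.strip()]


def attribute_tokens(output_text: str) -> List[Dict[str, object]]:
    """Attach a token count and its share of the total to each parsed step."""
    steps = split_steps(output_text)
    counts = [count_tokens(step) for step in steps]
    total = sum(counts) or 1  # avoid division by zero on empty output
    return [
        {"index": i, "text": step, "tokens": n, "share": n / total}
        for i, (step, n) in enumerate(zip(steps, counts))
    ]
```

Counting each step separately means the per-step totals can differ slightly from encoding the whole text at once, since BPE tokenization is not additive across arbitrary split points.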
- stoi_analyze: extract output_text from session content blocks, add compute_step_metrics/aggregate_step_metrics/print_step_metrics_report functions, support --metrics flag in analyze command
- stoi: add 'metrics' and 'report' CLI commands with F/V/C/U quality dimensions and TE/SUS/FR/MG/RR core metrics
- stoi_tui: add metrics-box to dashboard with F/V/C/U bars and core metrics display, enhance suggestions with step-level insights
- tests: add 10 integration tests for compute/aggregate/output_text
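The commit adds compute_step_metrics and aggregate_step_metrics but does not spell out how per-step scores roll up to a session. One plausible roll-up is a token-weighted mean, sketched here under the assumption that each step carries a token count and F/V/C/U scores in [0, 1]; `aggregate_quality_scores`, the dict keys, and the weighting are hypothetical and not the PR's implementation.

```python
from typing import Dict, List

# The four quality dimensions named in the PR; the key names are assumptions.
DIMENSIONS = ("factuality", "validity", "coherence", "utility")


def aggregate_quality_scores(steps: List[Dict[str, float]]) -> Dict[str, float]:
    """Token-weighted mean of each F/V/C/U score across steps (hypothetical schema).

    Each step dict is assumed to hold a "tokens" count plus one score in [0, 1]
    per dimension.
    """
    total_tokens = sum(step["tokens"] for step in steps)
    if total_tokens == 0:
        # No attributed tokens: report zeros rather than dividing by zero.
        return {dim: 0.0 for dim in DIMENSIONS}
    return {
        dim: sum(step[dim] * step["tokens"] for step in steps) / total_tokens
        for dim in DIMENSIONS
    }
```

Weighting by tokens lets a long, low-utility step pull the session score down more than a short one; an unweighted mean over steps is the obvious alternative if every step should count equally.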
Summary
- stoi_analyze: extract `output_text` from session content blocks, add `compute_step_metrics()`, `aggregate_step_metrics()`, and `print_step_metrics_report()` functions, support the `--metrics` flag
- stoi: add `metrics` and `report` CLI commands with F/V/C/U quality dimensions and TE/SUS/FR/MG/RR core metrics

Test plan
- `python3 -m pytest tests/ -v`
- `python3 stoi.py metrics --latest`: verify the `metrics` command works
- `python3 stoi.py report --days 7`: verify the `report` command works
- `python3 stoi.py analyze --metrics --top 5`: verify the `--metrics` flag works
- `python3 stoi.py stats`: verify existing commands still work
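The test plan exercises the `metrics`, `report`, `analyze`, and `stats` subcommands. A minimal argparse sketch of how such a CLI could be wired follows; `build_parser`, the help strings, and the defaults for `--days` and `--top` are assumptions, not taken from `stoi.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the commands exercised in the test plan."""
    parser = argparse.ArgumentParser(prog="stoi.py")
    sub = parser.add_subparsers(dest="command", required=True)

    # Step-level metrics for a single session.
    metrics = sub.add_parser("metrics", help="step-level metrics for one session")
    metrics.add_argument("--latest", action="store_true", help="use the most recent session")

    # Aggregate report over a look-back window (default of 7 days is an assumption).
    report = sub.add_parser("report", help="F/V/C/U and core-metric report")
    report.add_argument("--days", type=int, default=7, help="look-back window in days")

    # Existing analyze command, extended with an opt-in --metrics flag.
    analyze = sub.add_parser("analyze", help="analyze sessions")
    analyze.add_argument("--metrics", action="store_true", help="include step-level metrics")
    analyze.add_argument("--top", type=int, default=10, help="number of sessions to show")

    # Pre-existing command that should keep working unchanged.
    sub.add_parser("stats", help="existing summary statistics")
    return parser
```

Making `--metrics` an opt-in `store_true` flag keeps the default `analyze` output unchanged, which matches the last test-plan item's goal of not regressing existing commands.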