Add Deep QA benchmark mode and Telegram intent handling #149
Code Review
This pull request introduces a "Deep QA" benchmark mode and significantly enhances the Telegram bridge's interaction capabilities through message intent classification and LLM-powered contextual prompts for login credentials. The CLI and benchmark scripts have been updated to support and propagate the new qa_mode parameter across the execution pipeline. Feedback focuses on improving maintainability by addressing code duplication in the Telegram bridge's message handling logic and suggests centralizing redundant QA mode helper functions that are currently duplicated across multiple benchmark-related files.
    if self._looks_like_casual_reply(raw):
        await message.reply_text(
            self._format_pending_help(
                kind,
                # Default question text: "Additional input is required."
                str(hub_pending.get("question") or "추가 입력이 필요합니다."),
                normalized_fields,
            )
        )
        return
This logic for handling a casual reply while an input is pending from the hub_context is duplicated from the block handling pending interventions from the bridge itself (lines 1973-1981).
To avoid code duplication and improve maintainability, you could extract this logic into a helper method that takes the pending information as arguments and sends the appropriate help message.
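A minimal sketch of the suggested extraction follows. Only `_looks_like_casual_reply`, `_format_pending_help`, and `message.reply_text` come from the diff above; `BridgeSketch`, `FakeMessage`, the helper name `_reply_pending_help_if_casual`, and both placeholder method bodies are hypothetical and stand in for the real bridge code.

```python
from __future__ import annotations

import asyncio


class FakeMessage:
    """Hypothetical stand-in for a Telegram message; records replies."""

    def __init__(self) -> None:
        self.replies: list[str] = []

    async def reply_text(self, text: str) -> None:
        self.replies.append(text)


class BridgeSketch:
    """Hypothetical slice of the bridge with only what the helper needs."""

    def _looks_like_casual_reply(self, raw: str) -> bool:
        # Placeholder heuristic; the real check lives in telegram_bridge.py.
        return raw.strip().lower() in {"hi", "hello", "thanks", "ok"}

    def _format_pending_help(self, kind: str, question: str, fields: list[str]) -> str:
        # Placeholder formatting; the real method may render this differently.
        return f"[{kind}] {question} (fields: {', '.join(fields) or 'none'})"

    async def _reply_pending_help_if_casual(
        self,
        message: FakeMessage,
        raw: str,
        *,
        kind: str,
        question: str,
        fields: list[str],
    ) -> bool:
        """Send the pending-input help text if `raw` is small talk.

        Returns True when a reply was sent so the caller can return early.
        """
        if not self._looks_like_casual_reply(raw):
            return False
        await message.reply_text(self._format_pending_help(kind, question, fields))
        return True
```

With a helper shaped like this, both call sites (bridge-owned interventions and hub-context pending input) could reduce to `if await self._reply_pending_help_if_casual(...): return`, each passing its own `kind`, question, and field list.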
    def _normalize_qa_mode(value: str | None) -> str | None:
        raw = str(value or "").strip().lower()
        if raw in {"", "off", "none", "default", "false", "0"}:
            return None
        if raw in {"adaptive", ADAPTIVE_QA_MODE, "progressive_qa"}:
            return ADAPTIVE_QA_MODE
        if raw in {"deep", "deep_qa", "aggressive_qa", DEEP_ADAPTIVE_QA_MODE}:
            return DEEP_ADAPTIVE_QA_MODE
        return None


    def _benchmark_mode_label(qa_mode: str | None) -> str:
        normalized = _normalize_qa_mode(qa_mode)
        if normalized == DEEP_ADAPTIVE_QA_MODE:
            return "deep_qa"
        if normalized == ADAPTIVE_QA_MODE:
            return "adaptive_qa"
        return "standard"
These QA mode helper functions (_normalize_qa_mode, _benchmark_mode_label) are also defined in gaia/src/terminal_benchmark_mode.py and scripts/run_kpi_benchmark_pack.py. This code duplication can lead to inconsistencies and maintenance challenges. For instance, the implementation of _normalize_qa_mode here uses if/elif statements, while the one in terminal_benchmark_mode.py uses a more extensible dictionary-based approach.
To improve maintainability, consider refactoring these helpers into a shared utility module (e.g., in gaia/src/utils/ or a new gaia/src/benchmark_utils.py). This would centralize the logic and ensure consistency across the different benchmark scripts.
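One possible shape for such a shared module is sketched below, using the dictionary-based lookup the comment mentions. The constant values, the public function names `normalize_qa_mode` and `benchmark_mode_label`, and the module layout are assumptions; only the alias strings and the three labels are taken from the snippet under review.

```python
from __future__ import annotations

# Assumed values; the real constants are defined elsewhere in the codebase.
ADAPTIVE_QA_MODE = "adaptive_qa"
DEEP_ADAPTIVE_QA_MODE = "deep_adaptive_qa"

# Every accepted spelling maps to one canonical mode. Anything not listed
# here (including "", "off", "none", "default", "false", "0") means no QA mode.
_QA_MODE_ALIASES: dict[str, str] = {
    "adaptive": ADAPTIVE_QA_MODE,
    "progressive_qa": ADAPTIVE_QA_MODE,
    ADAPTIVE_QA_MODE: ADAPTIVE_QA_MODE,
    "deep": DEEP_ADAPTIVE_QA_MODE,
    "deep_qa": DEEP_ADAPTIVE_QA_MODE,
    "aggressive_qa": DEEP_ADAPTIVE_QA_MODE,
    DEEP_ADAPTIVE_QA_MODE: DEEP_ADAPTIVE_QA_MODE,
}

_BENCHMARK_MODE_LABELS: dict[str, str] = {
    ADAPTIVE_QA_MODE: "adaptive_qa",
    DEEP_ADAPTIVE_QA_MODE: "deep_qa",
}


def normalize_qa_mode(value: str | None) -> str | None:
    """Return the canonical QA mode, or None when QA mode is off or unknown."""
    raw = str(value or "").strip().lower()
    return _QA_MODE_ALIASES.get(raw)


def benchmark_mode_label(qa_mode: str | None) -> str:
    """Return the artifact label for a QA mode, defaulting to "standard"."""
    normalized = normalize_qa_mode(qa_mode)
    if normalized is None:
        return "standard"
    return _BENCHMARK_MODE_LABELS.get(normalized, "standard")
```

The three call sites could then import these two functions instead of redefining them, so adding a new alias would be a one-line change in a single place.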
Pull request overview
This PR adds a dedicated “Deep QA” benchmark launch path (for human-vs-GAIA comparisons) and propagates qa_mode / benchmark_mode tagging through benchmark artifacts, while also improving Telegram ingress by classifying freeform messages (help/status/casual/goal) and making login prompts more conversational.
Changes:
- Add `--qa-mode` handling and `benchmark_mode` labeling to the KPI pack and per-goal benchmark scripts and their artifacts.
- Extend terminal benchmark mode and the CLI launcher to support a “Deep QA 전용 벤치마크 실행” ("Run Deep QA-only benchmark") path that forwards deep QA settings to suite runs.
- Update the Telegram bridge to (a) generate contextual login prompts (optionally via LLM), and (b) classify messages to avoid queuing casual/help/status text as goals.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| scripts/run_kpi_benchmark_pack.py | Add --qa-mode forwarding and include qa_mode / benchmark_mode in pack artifacts. |
| scripts/run_goal_benchmark.py | Add --qa-mode support, set QA env flags, and tag per-scenario/summary artifacts. |
| gaia/src/terminal_benchmark_mode.py | Add QA-mode normalization/labeling and forward QA mode into spawned benchmark scripts. |
| gaia/cli.py | Add terminal “Deep QA benchmark” purpose and route it to terminal benchmark mode with deep QA enabled. |
| gaia/telegram_bridge.py | Add contextual login prompts + freeform intent classification; improve user-facing queue/help/status responses. |
| gaia/tests/unit/test_terminal_benchmark_mode.py | Cover deep QA forwarding + artifact tagging in terminal benchmark mode. |
| gaia/tests/unit/test_cli_adaptive_qa_mode.py | Cover deep QA benchmark purpose selection + launcher routing. |
| gaia/tests/unit/test_run_kpi_benchmark_pack_script.py | Cover KPI pack command construction with deep QA mode. |
| gaia/tests/unit/test_run_goal_benchmark_script.py | Cover QA helpers + child payload propagation for deep QA. |
| gaia/tests/unit/test_telegram_bridge.py | Cover login prompt behavior and intent classification (help/status/casual/goal). |
| gaia/tests/unit/test_gui_benchmark_sync.py | Adjust subprocess fakes for expanded kwargs; update benchmark UI expectation. |
    requested_qa_mode = str(args.qa_mode or "").strip()
    if not requested_qa_mode or requested_qa_mode.lower() in {"off", "none", "default", "false", "0"}:
        requested_qa_mode = str(suite.get("qa_mode") or requested_qa_mode).strip()
    normalized_qa_mode = _normalize_qa_mode(requested_qa_mode)

    runner_id = resolve_runner_id(env=os.environ)
    normalized_qa_mode = normalize_benchmark_qa_mode(qa_mode)
    if normalized_qa_mode:
        # Message: "Deep QA benchmark profile: running in <label> mode."
        emit(f"Deep QA 벤치마크 프로필: {benchmark_qa_mode_label(normalized_qa_mode)} 모드로 실행합니다.")

    queued_ahead = self.queue.qsize()
    with self._state_lock:
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 34369fc6d0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
        await message.reply_text(self._format_live_status(chat_id))
        return

    intent = self._classify_freeform_message_intent(raw)
Preserve slash commands before intent filtering
This freeform intent path now runs for every allowed message, including Chat Hub slash commands. Short commands such as /ai, /session, /plan, or /status are classified as casual/status/help before they reach the queue, so they return a Telegram helper response instead of being dispatched to dispatch_command, which still implements those commands. Skip this classifier for non-Telegram slash commands (or queue them directly) so existing remote command handling keeps working.
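The guard described above could look roughly like the following. This is a sketch under assumptions: `TELEGRAM_NATIVE_COMMANDS`, `is_hub_slash_command`, and `route_message` are hypothetical names, and the set of commands the bridge handles natively is a guess that should be replaced with the real list.

```python
from __future__ import annotations

# Assumption: commands the Telegram bridge answers itself. Everything else
# that starts with "/" is treated as a Chat Hub command.
TELEGRAM_NATIVE_COMMANDS = frozenset({"/start", "/help"})


def is_hub_slash_command(raw: str) -> bool:
    """Return True for slash commands that should bypass intent classification."""
    text = raw.strip()
    if len(text) < 2 or not text.startswith("/"):
        return False
    # Keep only the command token and drop a "@botname" suffix if present.
    command = text.split(maxsplit=1)[0].split("@", 1)[0].lower()
    return command not in TELEGRAM_NATIVE_COMMANDS


def route_message(raw: str) -> str:
    """Decide whether a message is queued directly or sent to the classifier."""
    if is_hub_slash_command(raw):
        return "queue"      # reaches dispatch_command unchanged
    return "classify"       # freeform text goes through intent classification
```

Checking for the slash prefix before calling `_classify_freeform_message_intent` keeps `/ai`, `/session`, `/plan`, and `/status` on their existing path while leaving freeform text handling untouched.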
💡 Codex Review
Reviewed commit: 28c65d7890
    "intent": "goal",
    "confidence": 0.0,
    "reply": "",
    "goal_text": raw,
    "pending_text": "",
Keep deterministic status/help routing when LLM routing is off
When the LLM intent router is disabled (GAIA_TELEGRAM_LLM_MESSAGE_INTENT=0) or unavailable (_get_chat_router_client() returns None), this fallback always classifies non-pending input as goal, so messages like 현재 상태 are queued as test goals instead of returning live status/help. handle_message() now depends on intent == "status"/"help" for those responses, so this regression breaks core Telegram control behavior in no-LLM/degraded environments.
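A deterministic fallback along these lines would keep status and help working without the LLM. The payload keys mirror the fragment above; the function name `deterministic_intent` and both keyword sets are assumptions, and exact matching is used deliberately so that a goal which merely mentions "status" is still queued as a goal.

```python
from __future__ import annotations

# Assumed phrases; the real bridge may recognise a different set.
_STATUS_PHRASES = frozenset({"status", "현재 상태", "상태"})
_HELP_PHRASES = frozenset({"help", "도움말", "사용법"})


def deterministic_intent(raw: str) -> dict[str, object]:
    """Classify a non-pending message without calling the LLM router."""
    text = raw.strip().lower()
    if text in _STATUS_PHRASES:
        intent = "status"
    elif text in _HELP_PHRASES:
        intent = "help"
    else:
        intent = "goal"
    return {
        "intent": intent,
        "confidence": 0.0,  # no model was consulted
        "reply": "",
        "goal_text": raw if intent == "goal" else "",
        "pending_text": "",
    }
```

Because the check is an exact match on the normalized text, it only intercepts short control phrases; anything longer falls through to `goal`, which matches the current fallback behaviour.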
    normalized_qa_mode = _normalize_qa_mode(str(args.qa_mode or ""))
    benchmark_mode = _benchmark_mode_label(normalized_qa_mode)
Report effective QA mode instead of only CLI argument mode
run_goal_benchmark.py can enable QA mode from each suite file when --qa-mode is off (it falls back to suite["qa_mode"]), but this pack-level report records qa_mode/benchmark_mode only from CLI args. That means packs can run in deep/adaptive mode while top-level metadata still says off/standard, which mislabels artifacts and downstream KPI filtering.
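One way to report the effective mode is to apply the same CLI-then-suite fallback at pack level and aggregate across suites. The sketch below is hypothetical: `effective_qa_mode`, `pack_qa_mode`, and the `"mixed"` value for packs whose suites disagree are assumptions, and alias normalization is left out to keep the example short.

```python
from __future__ import annotations

from typing import Any, Iterable, Mapping

_OFF_VALUES = frozenset({"", "off", "none", "default", "false", "0"})


def effective_qa_mode(cli_qa_mode: str | None, suite_qa_mode: str | None) -> str:
    """Mirror the per-goal script: the CLI value wins unless it is off."""
    requested = str(cli_qa_mode or "").strip().lower()
    if requested in _OFF_VALUES:
        requested = str(suite_qa_mode or "").strip().lower()
    return "off" if requested in _OFF_VALUES else requested


def pack_qa_mode(cli_qa_mode: str | None, suites: Iterable[Mapping[str, Any]]) -> str:
    """Summarise the QA mode the pack actually ran with."""
    modes = {effective_qa_mode(cli_qa_mode, suite.get("qa_mode")) for suite in suites}
    if not modes:
        return effective_qa_mode(cli_qa_mode, None)
    if len(modes) == 1:
        return next(iter(modes))
    # Assumed convention: flag packs whose suites ran in different modes.
    return "mixed"
```

Recording this value (and a label derived from it) in the pack report would keep top-level metadata consistent with what each suite ran, which is what downstream KPI filtering relies on.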
Summary
Tests