Add Deep QA benchmark mode and Telegram intent handling by coldmans · Pull Request #149 · capston2025/capston

coldmans · 2026-05-19T11:05:07Z

Summary

add a dedicated Deep QA benchmark launch path for human-vs-GAIA comparison runs
propagate qa_mode/benchmark_mode through terminal, single-suite, and KPI pack benchmark artifacts
make Telegram ingress classify help/status/casual messages before queueing test goals, with more conversational login prompts

Tests

.venv/bin/python -m pytest gaia/tests/unit -q
python scripts/lint_harness_docs.py
git diff --check

gemini-code-assist

Code Review

This pull request introduces a "Deep QA" benchmark mode and significantly enhances the Telegram bridge's interaction capabilities through message intent classification and LLM-powered contextual prompts for login credentials. The CLI and benchmark scripts have been updated to support and propagate the new qa_mode parameter across the execution pipeline. Feedback focuses on improving maintainability by addressing code duplication in the Telegram bridge's message handling logic and suggests centralizing redundant QA mode helper functions that are currently duplicated across multiple benchmark-related files.

gemini-code-assist · 2026-05-19T11:08:33Z

+            if self._looks_like_casual_reply(raw):
+                await message.reply_text(
+                    self._format_pending_help(
+                        kind,
+                        str(hub_pending.get("question") or "추가 입력이 필요합니다."),
+                        normalized_fields,
+                    )
+                )
+                return


This logic for handling a casual reply while an input is pending from the hub_context is duplicated from the block handling pending interventions from the bridge itself (lines 1973-1981).

To avoid code duplication and improve maintainability, you could extract this logic into a helper method that takes the pending information as arguments and sends the appropriate help message.
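The extraction suggested above could look roughly like the sketch below. `BridgeSketch` and `_reply_pending_help_if_casual` are hypothetical names, and the two stub methods only stand in for the bridge's real `_looks_like_casual_reply` and `_format_pending_help`, whose bodies are not shown in this PR.

```python
from __future__ import annotations


class BridgeSketch:
    """Hypothetical stand-in for the Telegram bridge; only the call shape matters."""

    def _looks_like_casual_reply(self, raw: str) -> bool:
        # Placeholder heuristic (assumption); the real check lives in the bridge.
        return raw.strip().lower() in {"hi", "hello", "thanks", "ok"}

    def _format_pending_help(self, kind: str, question: str, fields: list[str]) -> str:
        # Placeholder formatting (assumption); the real method may differ.
        return f"[{kind}] {question} (fields: {', '.join(fields) or 'none'})"

    async def _reply_pending_help_if_casual(
        self, message, raw: str, kind: str, question: str, fields: list[str]
    ) -> bool:
        """Reply with the pending-input help text when `raw` is a casual message.

        Returns True when a reply was sent, so the caller can return early.
        """
        if not self._looks_like_casual_reply(raw):
            return False
        await message.reply_text(self._format_pending_help(kind, question, fields))
        return True
```

Both call sites (the hub_context branch and the bridge's own pending-intervention branch) would then reduce to `if await self._reply_pending_help_if_casual(...): return`, each passing its own pending question and fields.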

gemini-code-assist · 2026-05-19T11:08:33Z

+def _normalize_qa_mode(value: str | None) -> str | None:
+    raw = str(value or "").strip().lower()
+    if raw in {"", "off", "none", "default", "false", "0"}:
+        return None
+    if raw in {"adaptive", ADAPTIVE_QA_MODE, "progressive_qa"}:
+        return ADAPTIVE_QA_MODE
+    if raw in {"deep", "deep_qa", "aggressive_qa", DEEP_ADAPTIVE_QA_MODE}:
+        return DEEP_ADAPTIVE_QA_MODE
+    return None
+
+
+def _benchmark_mode_label(qa_mode: str | None) -> str:
+    normalized = _normalize_qa_mode(qa_mode)
+    if normalized == DEEP_ADAPTIVE_QA_MODE:
+        return "deep_qa"
+    if normalized == ADAPTIVE_QA_MODE:
+        return "adaptive_qa"
+    return "standard"
+


These QA mode helper functions (_normalize_qa_mode, _benchmark_mode_label) are also defined in gaia/src/terminal_benchmark_mode.py and scripts/run_kpi_benchmark_pack.py. This code duplication can lead to inconsistencies and maintenance challenges. For instance, the implementation of _normalize_qa_mode here uses a chain of if statements, while the one in terminal_benchmark_mode.py uses a more extensible dictionary-based approach.

To improve maintainability, consider refactoring these helpers into a shared utility module (e.g., in gaia/src/utils/ or a new gaia/src/benchmark_utils.py). This would centralize the logic and ensure consistency across the different benchmark scripts.
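As a rough illustration of the shared module proposed above, the sketch below folds both helpers into dictionary lookups. The constant values are assumed placeholders (the real `ADAPTIVE_QA_MODE` and `DEEP_ADAPTIVE_QA_MODE` are defined elsewhere in the repository), and the public names `normalize_qa_mode` / `benchmark_mode_label` are hypothetical.

```python
from __future__ import annotations

# Assumed placeholder values; the real constants are defined elsewhere in the repo.
ADAPTIVE_QA_MODE = "adaptive_qa"
DEEP_ADAPTIVE_QA_MODE = "deep_adaptive_qa"

# Alias table taken from the quoted diff. "Off" values such as "", "off",
# "none", "default", "false", "0" are simply absent, so they map to None.
_QA_MODE_ALIASES: dict[str, str] = {
    "adaptive": ADAPTIVE_QA_MODE,
    ADAPTIVE_QA_MODE: ADAPTIVE_QA_MODE,
    "progressive_qa": ADAPTIVE_QA_MODE,
    "deep": DEEP_ADAPTIVE_QA_MODE,
    "deep_qa": DEEP_ADAPTIVE_QA_MODE,
    "aggressive_qa": DEEP_ADAPTIVE_QA_MODE,
    DEEP_ADAPTIVE_QA_MODE: DEEP_ADAPTIVE_QA_MODE,
}

_BENCHMARK_MODE_LABELS: dict[str, str] = {
    DEEP_ADAPTIVE_QA_MODE: "deep_qa",
    ADAPTIVE_QA_MODE: "adaptive_qa",
}


def normalize_qa_mode(value: str | None) -> str | None:
    """Map a user-supplied mode string to a canonical constant, or None."""
    raw = str(value or "").strip().lower()
    return _QA_MODE_ALIASES.get(raw)


def benchmark_mode_label(qa_mode: str | None) -> str:
    """Return the artifact label for a mode; unknown or off modes are 'standard'."""
    return _BENCHMARK_MODE_LABELS.get(normalize_qa_mode(qa_mode), "standard")
```

With one module like this, the three call sites could import the same functions, and adding an alias would mean editing a single table.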

Copilot

Pull request overview

This PR adds a dedicated “Deep QA” benchmark launch path (for human-vs-GAIA comparisons) and propagates qa_mode / benchmark_mode tagging through benchmark artifacts, while also improving Telegram ingress by classifying freeform messages (help/status/casual/goal) and making login prompts more conversational.

Changes:

Add --qa-mode handling and benchmark_mode labeling to KPI pack + per-goal benchmark scripts and their artifacts.
Extend terminal benchmark mode + CLI launcher to support a “Deep QA 전용 벤치마크 실행” ("Run Deep QA-only benchmark") path that forwards deep QA settings to suite runs.
Update Telegram bridge to (a) generate contextual login prompts (optionally via LLM), and (b) classify messages to avoid queuing casual/help/status text as goals.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

File	Description
scripts/run_kpi_benchmark_pack.py	Add `--qa-mode` forwarding and include `qa_mode` / `benchmark_mode` in pack artifacts.
scripts/run_goal_benchmark.py	Add `--qa-mode` support, set QA env flags, and tag per-scenario/summary artifacts.
gaia/src/terminal_benchmark_mode.py	Add QA-mode normalization/labeling and forward QA mode into spawned benchmark scripts.
gaia/cli.py	Add terminal “Deep QA benchmark” purpose and route it to terminal benchmark mode with deep QA enabled.
gaia/telegram_bridge.py	Add contextual login prompts + freeform intent classification; improve user-facing queue/help/status responses.
gaia/tests/unit/test_terminal_benchmark_mode.py	Cover deep QA forwarding + artifact tagging in terminal benchmark mode.
gaia/tests/unit/test_cli_adaptive_qa_mode.py	Cover deep QA benchmark purpose selection + launcher routing.
gaia/tests/unit/test_run_kpi_benchmark_pack_script.py	Cover KPI pack command construction with deep QA mode.
gaia/tests/unit/test_run_goal_benchmark_script.py	Cover QA helpers + child payload propagation for deep QA.
gaia/tests/unit/test_telegram_bridge.py	Cover login prompt behavior and intent classification (help/status/casual/goal).
gaia/tests/unit/test_gui_benchmark_sync.py	Adjust subprocess fakes for expanded kwargs; update benchmark UI expectation.


+    requested_qa_mode = str(args.qa_mode or "").strip()
+    if not requested_qa_mode or requested_qa_mode.lower() in {"off", "none", "default", "false", "0"}:
+        requested_qa_mode = str(suite.get("qa_mode") or requested_qa_mode).strip()
+    normalized_qa_mode = _normalize_qa_mode(requested_qa_mode)


    runner_id = resolve_runner_id(env=os.environ)
+    normalized_qa_mode = normalize_benchmark_qa_mode(qa_mode)
+    if normalized_qa_mode:
+        emit(f"Deep QA 벤치마크 프로필: {benchmark_qa_mode_label(normalized_qa_mode)} 모드로 실행합니다.")


+        queued_ahead = self.queue.qsize()
        with self._state_lock:


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 34369fc6d0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-19T11:09:34Z

            await message.reply_text(self._format_live_status(chat_id))
            return

+        intent = self._classify_freeform_message_intent(raw)


Preserve slash commands before intent filtering

This freeform intent path now runs for every allowed message, including Chat Hub slash commands. Short commands such as /ai, /session, /plan, or /status are classified as casual/status/help before they reach the queue, so they return a Telegram helper response instead of being dispatched to dispatch_command, which still implements those commands. Skip this classifier for non-Telegram slash commands (or queue them directly) so existing remote command handling keeps working.
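One way to read the suggestion above is a small guard that runs before the classifier. This is a sketch under assumptions: `TELEGRAM_NATIVE_COMMANDS`, `is_hub_slash_command`, and `route_message` are hypothetical names, and the set of commands the bridge answers itself is not shown in the PR.

```python
from __future__ import annotations

from typing import Callable

# Assumption: commands the Telegram bridge handles itself; the real set is not shown.
TELEGRAM_NATIVE_COMMANDS = frozenset({"/start", "/help"})


def is_hub_slash_command(
    raw: str, native_commands: frozenset[str] = TELEGRAM_NATIVE_COMMANDS
) -> bool:
    """True for slash commands that should bypass intent classification."""
    text = raw.strip()
    if not text.startswith("/"):
        return False
    # Telegram may append "@botname" to a command; compare only the command token.
    command = text.split(maxsplit=1)[0].split("@", 1)[0].lower()
    return command not in native_commands


def route_message(raw: str, classify: Callable[[str], str]) -> str:
    """Return 'queue' for hub slash commands, else the classifier's intent."""
    if is_hub_slash_command(raw):
        return "queue"
    return classify(raw)
```

Under these assumptions, `/ai`, `/session`, `/plan`, and `/status` go straight to the queue, and only freeform text reaches the intent classifier.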


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 28c65d7890


chatgpt-codex-connector · 2026-05-19T17:52:14Z

+            "intent": "goal",
+            "confidence": 0.0,
+            "reply": "",
+            "goal_text": raw,
+            "pending_text": "",


Keep deterministic status/help routing when LLM routing is off

When the LLM intent router is disabled (GAIA_TELEGRAM_LLM_MESSAGE_INTENT=0) or unavailable (_get_chat_router_client() returns None), this fallback always classifies non-pending input as goal, so messages like 현재 상태 ("current status") are queued as test goals instead of returning live status/help. handle_message() now depends on intent == "status"/"help" for those responses, so this regression breaks core Telegram control behavior in no-LLM or degraded environments.
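A deterministic fallback along the lines requested above might look like the sketch below. The phrase sets are assumptions for illustration (only 현재 상태 comes from the review comment), and `classify_without_llm` is a hypothetical name. The returned dictionary reuses the keys from the quoted diff.

```python
from __future__ import annotations

# Assumed phrase sets; the bridge's real heuristics are not shown in the PR.
STATUS_PHRASES = frozenset({"status", "현재 상태"})
HELP_PHRASES = frozenset({"help", "도움말"})


def classify_without_llm(raw: str) -> dict:
    """Classify a message with exact phrase matching when no LLM router is available."""
    text = raw.strip().lower()
    if text in STATUS_PHRASES:
        intent = "status"
    elif text in HELP_PHRASES:
        intent = "help"
    else:
        intent = "goal"
    return {
        "intent": intent,
        "confidence": 0.0,  # no model was consulted
        "reply": "",
        "goal_text": raw if intent == "goal" else "",
        "pending_text": "",
    }
```

Exact matching is deliberately conservative: a goal such as "check the status page" still queues as a goal, at the cost of missing looser phrasings that the LLM router would catch.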


chatgpt-codex-connector · 2026-05-19T17:52:14Z

+    normalized_qa_mode = _normalize_qa_mode(str(args.qa_mode or ""))
+    benchmark_mode = _benchmark_mode_label(normalized_qa_mode)


Report effective QA mode instead of only CLI argument mode

run_goal_benchmark.py can enable QA mode from each suite file when --qa-mode is off (it falls back to suite["qa_mode"]), but this pack-level report records qa_mode/benchmark_mode only from CLI args. That means packs can run in deep/adaptive mode while top-level metadata still says off/standard, which mislabels artifacts and downstream KPI filtering.
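To make the mismatch concrete, the sketch below mirrors the per-suite fallback quoted earlier from run_goal_benchmark.py and aggregates the result at pack level. `requested_mode_for_suite` and `effective_pack_modes` are hypothetical names, and suites are assumed to be plain dictionaries with an optional "qa_mode" key.

```python
from __future__ import annotations

_OFF_VALUES = {"", "off", "none", "default", "false", "0"}


def requested_mode_for_suite(cli_qa_mode: str | None, suite: dict) -> str:
    """CLI value wins unless it is empty/off, in which case suite['qa_mode'] is used."""
    requested = str(cli_qa_mode or "").strip()
    if requested.lower() in _OFF_VALUES:
        requested = str(suite.get("qa_mode") or requested).strip()
    return requested


def effective_pack_modes(cli_qa_mode: str | None, suites: list[dict]) -> list[str]:
    """Sorted distinct modes that the suites in a pack would actually run with."""
    return sorted(
        {requested_mode_for_suite(cli_qa_mode, suite).lower() or "off" for suite in suites}
    )


# With --qa-mode off, suite files still decide the mode that actually runs:
suites = [{"qa_mode": "deep_qa"}, {}, {"qa_mode": "adaptive"}]
print(effective_pack_modes("off", suites))
```

A pack report could record this list (or a single value when all suites agree) instead of the raw CLI argument, so downstream KPI filtering sees the mode that was actually used.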


Add Deep QA benchmark mode and Telegram intent handling (34369fc)

Copilot AI review requested due to automatic review settings, May 19, 2026 11:05

Copilot started reviewing on behalf of coldmans, May 19, 2026 11:05

gemini-code-assist Bot reviewed May 19, 2026

Copilot AI reviewed May 19, 2026

chatgpt-codex-connector Bot reviewed May 19, 2026

Fix Deep QA routing and Telegram chatbot handling (28c65d7)

chatgpt-codex-connector Bot reviewed May 19, 2026

coldmans wants to merge 2 commits into main from codex/deep-qa-benchmark-telegram-chat
