Add Deep QA benchmark mode and Telegram intent handling #149

Open
coldmans wants to merge 2 commits into main from codex/deep-qa-benchmark-telegram-chat

@coldmans

Summary

  • add a dedicated Deep QA benchmark launch path for human-vs-GAIA comparison runs
  • propagate qa_mode/benchmark_mode through terminal, single-suite, and KPI pack benchmark artifacts
  • make Telegram ingress classify help/status/casual messages before queueing test goals, with more conversational login prompts

Tests

  • .venv/bin/python -m pytest gaia/tests/unit -q
  • python scripts/lint_harness_docs.py
  • git diff --check

Copilot AI review requested due to automatic review settings May 19, 2026 11:05

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a "Deep QA" benchmark mode and significantly enhances the Telegram bridge's interaction capabilities through message intent classification and LLM-powered contextual prompts for login credentials. The CLI and benchmark scripts have been updated to support and propagate the new qa_mode parameter across the execution pipeline. Feedback focuses on improving maintainability by addressing code duplication in the Telegram bridge's message handling logic and suggests centralizing redundant QA mode helper functions that are currently duplicated across multiple benchmark-related files.

Comment thread gaia/telegram_bridge.py Outdated
Comment on lines +2025 to +2033
if self._looks_like_casual_reply(raw):
    await message.reply_text(
        self._format_pending_help(
            kind,
            str(hub_pending.get("question") or "추가 입력이 필요합니다."),  # "Additional input is required."
            normalized_fields,
        )
    )
    return

medium

This logic for handling a casual reply while an input is pending from the hub_context is duplicated from the block handling pending interventions from the bridge itself (lines 1973-1981).

To avoid code duplication and improve maintainability, you could extract this logic into a helper method that takes the pending information as arguments and sends the appropriate help message.
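
The extraction suggested above might look roughly like the following sketch. `PendingHelpMixin`, `_reply_pending_help_if_casual`, `FakeMessage`, and `demo` are hypothetical names, and the bodies of `_looks_like_casual_reply` and `_format_pending_help` are stand-ins because the diff excerpt does not show them. Only the call shape is taken from the snippet under review.

```python
import asyncio


class PendingHelpMixin:
    """Hypothetical holder for the shared pending-input help logic."""

    def _looks_like_casual_reply(self, raw):
        # Stand-in heuristic; the real implementation is not shown in the PR.
        return raw.strip().lower() in {"hi", "hello", "thanks", "ok"}

    def _format_pending_help(self, kind, question, fields):
        # Stand-in formatter; the real one is not shown in the PR.
        return f"[{kind}] {question} (fields: {', '.join(fields) or '-'})"

    async def _reply_pending_help_if_casual(self, message, raw, kind, pending, fields):
        """Reply with the pending-input help text; return True if `raw` was casual."""
        if not self._looks_like_casual_reply(raw):
            return False
        # Default question text copied from the diff ("Additional input is required.").
        question = str(pending.get("question") or "추가 입력이 필요합니다.")
        await message.reply_text(self._format_pending_help(kind, question, fields))
        return True


class FakeMessage:
    """Test double that records replies instead of calling Telegram."""

    def __init__(self):
        self.sent = []

    async def reply_text(self, text):
        self.sent.append(text)


def demo():
    message = FakeMessage()
    handled = asyncio.run(
        PendingHelpMixin()._reply_pending_help_if_casual(
            message, "thanks", "login", {"question": "Enter your ID"}, ["username"]
        )
    )
    return handled, message.sent
```

With a helper of this shape, both the bridge-pending and hub-pending branches could shrink to a single `if await self._reply_pending_help_if_casual(...): return`.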

Comment on lines +132 to +150
def _normalize_qa_mode(value: str | None) -> str | None:
    raw = str(value or "").strip().lower()
    if raw in {"", "off", "none", "default", "false", "0"}:
        return None
    if raw in {"adaptive", ADAPTIVE_QA_MODE, "progressive_qa"}:
        return ADAPTIVE_QA_MODE
    if raw in {"deep", "deep_qa", "aggressive_qa", DEEP_ADAPTIVE_QA_MODE}:
        return DEEP_ADAPTIVE_QA_MODE
    return None


def _benchmark_mode_label(qa_mode: str | None) -> str:
    normalized = _normalize_qa_mode(qa_mode)
    if normalized == DEEP_ADAPTIVE_QA_MODE:
        return "deep_qa"
    if normalized == ADAPTIVE_QA_MODE:
        return "adaptive_qa"
    return "standard"

medium

These QA mode helper functions (_normalize_qa_mode, _benchmark_mode_label) are also defined in gaia/src/terminal_benchmark_mode.py and scripts/run_kpi_benchmark_pack.py. This duplication can lead to inconsistencies and maintenance problems. For instance, the implementation of _normalize_qa_mode here uses a chain of if statements, while the one in terminal_benchmark_mode.py uses a more extensible dictionary-based approach.

To improve maintainability, consider refactoring these helpers into a shared utility module (e.g., in gaia/src/utils/ or a new gaia/src/benchmark_utils.py). This would centralize the logic and ensure consistency across the different benchmark scripts.
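
A shared module along these lines could back all three call sites. This is a minimal sketch, not the code in terminal_benchmark_mode.py: the constant values for `ADAPTIVE_QA_MODE` and `DEEP_ADAPTIVE_QA_MODE` are assumed (the excerpt shows only the names), and the public names `normalize_qa_mode` and `benchmark_mode_label` are hypothetical. The alias sets and labels are copied from the snippet under review.

```python
from __future__ import annotations

# Assumed values; the PR excerpt references these constants without showing them.
ADAPTIVE_QA_MODE = "adaptive_qa"
DEEP_ADAPTIVE_QA_MODE = "deep_adaptive_qa"

# Inputs that mean "no QA mode requested".
_OFF_VALUES = frozenset({"", "off", "none", "default", "false", "0"})

# Every accepted spelling maps to one canonical mode.
_QA_MODE_ALIASES = {
    "adaptive": ADAPTIVE_QA_MODE,
    ADAPTIVE_QA_MODE: ADAPTIVE_QA_MODE,
    "progressive_qa": ADAPTIVE_QA_MODE,
    "deep": DEEP_ADAPTIVE_QA_MODE,
    "deep_qa": DEEP_ADAPTIVE_QA_MODE,
    "aggressive_qa": DEEP_ADAPTIVE_QA_MODE,
    DEEP_ADAPTIVE_QA_MODE: DEEP_ADAPTIVE_QA_MODE,
}

# Canonical mode -> label written into benchmark artifacts.
_BENCHMARK_LABELS = {
    DEEP_ADAPTIVE_QA_MODE: "deep_qa",
    ADAPTIVE_QA_MODE: "adaptive_qa",
}


def normalize_qa_mode(value: str | None) -> str | None:
    """Return the canonical QA mode, or None when off or unrecognized."""
    raw = str(value or "").strip().lower()
    if raw in _OFF_VALUES:
        return None
    return _QA_MODE_ALIASES.get(raw)


def benchmark_mode_label(qa_mode: str | None) -> str:
    """Return the artifact label, defaulting to "standard"."""
    return _BENCHMARK_LABELS.get(normalize_qa_mode(qa_mode), "standard")
```

Adding a new alias then means touching one dictionary rather than three copies of the same conditional chain.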

Copilot AI left a comment

Pull request overview

This PR adds a dedicated “Deep QA” benchmark launch path (for human-vs-GAIA comparisons) and propagates qa_mode / benchmark_mode tagging through benchmark artifacts, while also improving Telegram ingress by classifying freeform messages (help/status/casual/goal) and making login prompts more conversational.

Changes:

  • Add --qa-mode handling and benchmark_mode labeling to KPI pack + per-goal benchmark scripts and their artifacts.
  • Extend terminal benchmark mode + CLI launcher to support a “Deep QA 전용 벤치마크 실행” (“Run Deep QA-only benchmark”) path that forwards deep QA settings to suite runs.
  • Update Telegram bridge to (a) generate contextual login prompts (optionally via LLM), and (b) classify messages to avoid queuing casual/help/status text as goals.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| scripts/run_kpi_benchmark_pack.py | Add --qa-mode forwarding and include qa_mode / benchmark_mode in pack artifacts. |
| scripts/run_goal_benchmark.py | Add --qa-mode support, set QA env flags, and tag per-scenario/summary artifacts. |
| gaia/src/terminal_benchmark_mode.py | Add QA-mode normalization/labeling and forward QA mode into spawned benchmark scripts. |
| gaia/cli.py | Add terminal “Deep QA benchmark” purpose and route it to terminal benchmark mode with deep QA enabled. |
| gaia/telegram_bridge.py | Add contextual login prompts + freeform intent classification; improve user-facing queue/help/status responses. |
| gaia/tests/unit/test_terminal_benchmark_mode.py | Cover deep QA forwarding + artifact tagging in terminal benchmark mode. |
| gaia/tests/unit/test_cli_adaptive_qa_mode.py | Cover deep QA benchmark purpose selection + launcher routing. |
| gaia/tests/unit/test_run_kpi_benchmark_pack_script.py | Cover KPI pack command construction with deep QA mode. |
| gaia/tests/unit/test_run_goal_benchmark_script.py | Cover QA helpers + child payload propagation for deep QA. |
| gaia/tests/unit/test_telegram_bridge.py | Cover login prompt behavior and intent classification (help/status/casual/goal). |
| gaia/tests/unit/test_gui_benchmark_sync.py | Adjust subprocess fakes for expanded kwargs; update benchmark UI expectation. |

Comment on lines +631 to +634
requested_qa_mode = str(args.qa_mode or "").strip()
if not requested_qa_mode or requested_qa_mode.lower() in {"off", "none", "default", "false", "0"}:
    requested_qa_mode = str(suite.get("qa_mode") or requested_qa_mode).strip()
normalized_qa_mode = _normalize_qa_mode(requested_qa_mode)

runner_id = resolve_runner_id(env=os.environ)
normalized_qa_mode = normalize_benchmark_qa_mode(qa_mode)
if normalized_qa_mode:
    # Message: "Deep QA benchmark profile: running in {label} mode."
    emit(f"Deep QA 벤치마크 프로필: {benchmark_qa_mode_label(normalized_qa_mode)} 모드로 실행합니다.")
Comment thread gaia/telegram_bridge.py
Comment on lines +2049 to 2050
queued_ahead = self.queue.qsize()
with self._state_lock:

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 34369fc6d0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread gaia/telegram_bridge.py Outdated
    await message.reply_text(self._format_live_status(chat_id))
    return

intent = self._classify_freeform_message_intent(raw)

P1 Badge Preserve slash commands before intent filtering

This freeform intent path now runs for every allowed message, including Chat Hub slash commands. Short commands such as /ai, /session, /plan, or /status are classified as casual/status/help before they reach the queue, so they return a Telegram helper response instead of being dispatched to dispatch_command, which still implements those commands. Skip this classifier for non-Telegram slash commands (or queue them directly) so existing remote command handling keeps working.
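
One way to express the suggested guard is a small predicate that runs before the classifier. This is a sketch under assumptions: `is_hub_slash_command` and `TELEGRAM_LOCAL_COMMANDS` are hypothetical names, and the set of commands the bridge answers itself is a placeholder because the PR excerpt does not list them.

```python
# Hypothetical: commands the Telegram bridge handles locally (real list not shown in the PR).
TELEGRAM_LOCAL_COMMANDS = frozenset({"/start", "/help"})


def is_hub_slash_command(raw, local_commands=TELEGRAM_LOCAL_COMMANDS):
    """Return True when `raw` is a slash command that should bypass intent classification."""
    text = raw.strip()
    if not text.startswith("/"):
        return False
    # Take the first token and drop an optional "@botname" suffix used in group chats.
    command = text.split(maxsplit=1)[0].split("@", 1)[0].lower()
    # A bare "/" is not a command; locally handled commands stay on the Telegram path.
    return len(command) > 1 and command not in local_commands
```

In `handle_message()`, a check such as `if is_hub_slash_command(raw): ...queue directly...` placed ahead of `_classify_freeform_message_intent(raw)` would keep /ai, /session, /plan, and /status flowing to `dispatch_command`.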

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 28c65d7890

Comment thread gaia/telegram_bridge.py
Comment on lines +1296 to +1300
"intent": "goal",
"confidence": 0.0,
"reply": "",
"goal_text": raw,
"pending_text": "",

P1 Badge Keep deterministic status/help routing when LLM routing is off

When the LLM intent router is disabled (GAIA_TELEGRAM_LLM_MESSAGE_INTENT=0) or unavailable (_get_chat_router_client() returns None), this fallback always classifies non-pending input as goal, so messages like 현재 상태 ("current status") are queued as test goals instead of returning live status/help. handle_message() now depends on intent == "status"/"help" for those responses, so this regression breaks core Telegram control behavior in no-LLM/degraded environments.
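
A deterministic fallback could check a short list of exact phrases before defaulting to goal. The sketch below is illustrative only: `classify_without_llm` and both phrase sets are hypothetical, with 현재 상태 being the only phrase taken from the review comment. The returned dictionary reuses the keys shown in the diff. Exact matching is used rather than substring matching so that a goal such as "Check the status page loads" is not misrouted.

```python
# Hypothetical phrase lists; only "현재 상태" ("current status") comes from the review comment.
STATUS_PHRASES = frozenset({"현재 상태", "상태", "status"})
HELP_PHRASES = frozenset({"도움말", "help"})


def classify_without_llm(raw):
    """Rule-based intent result used when the LLM router is off or unavailable."""
    # Normalize case, surrounding whitespace, and trailing punctuation.
    text = raw.strip().lower().rstrip("?!. ")
    result = {
        "intent": "goal",
        "confidence": 0.0,
        "reply": "",
        "goal_text": raw,
        "pending_text": "",
    }
    if text in STATUS_PHRASES:
        return {**result, "intent": "status", "goal_text": ""}
    if text in HELP_PHRASES:
        return {**result, "intent": "help", "goal_text": ""}
    # Anything else keeps the existing behavior: treat it as a test goal.
    return result
```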

Comment on lines +506 to +507
normalized_qa_mode = _normalize_qa_mode(str(args.qa_mode or ""))
benchmark_mode = _benchmark_mode_label(normalized_qa_mode)

P2 Badge Report effective QA mode instead of only CLI argument mode

run_goal_benchmark.py can enable QA mode from each suite file when --qa-mode is off (it falls back to suite["qa_mode"]), but this pack-level report records qa_mode/benchmark_mode only from CLI args. That means packs can run in deep/adaptive mode while top-level metadata still says off/standard, which mislabels artifacts and downstream KPI filtering.
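
To make the mismatch concrete, the pack script could resolve the mode per suite with the same fallback that run_goal_benchmark.py applies, then report what actually ran. This is a sketch: `effective_qa_mode` and `pack_qa_modes` are hypothetical helpers, and the fallback rule is copied from the run_goal_benchmark.py snippet quoted earlier in this review thread.

```python
OFF_VALUES = frozenset({"", "off", "none", "default", "false", "0"})


def effective_qa_mode(cli_value, suite):
    """Resolve the mode a suite will run in: CLI value first, suite file as fallback."""
    requested = str(cli_value or "").strip()
    if requested.lower() in OFF_VALUES:
        # CLI did not request a mode, so the suite's own qa_mode (if any) applies.
        requested = str(suite.get("qa_mode") or requested).strip()
    return requested.lower()


def pack_qa_modes(cli_value, suites):
    """Return the sorted, de-duplicated modes that will run across the whole pack."""
    modes = set()
    for suite in suites:
        mode = effective_qa_mode(cli_value, suite)
        modes.add("off" if mode in OFF_VALUES else mode)
    return sorted(modes)
```

A pack whose result has more than one entry is running mixed modes, which a single top-level `qa_mode` field cannot describe; recording the list (or a per-suite value) would keep downstream KPI filtering accurate.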
