Adds amber scenario.#19

Open
CdavM wants to merge 2 commits into Froot-NetSys:a2a-agentx from RDI-Foundation:cdavm/amber

@CdavM commented Apr 1, 2026

No description provided.

entrypoint: [
"uv",
"run",
"malt_agent.py",
@Kolleida (Collaborator), Apr 2, 2026

The entrypoint should run ./start_route_agent.sh instead of uv run malt_agent.py and use the route_agent image (I'm assuming this is for the route app?). Also, this container needs to be run with --privileged and with /lib/modules mounted.
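
For reference outside amber, the container setup described above could be sketched as a plain `docker run` invocation built in Python. This is only an illustration: the image tag `route_agent` and the script path come from the comment, while the helper name `build_route_agent_cmd`, the `--rm` flag, and the read-only mount are assumptions. Whether amber can express any of this is exactly the open question in this thread.

```python
import shlex


def build_route_agent_cmd(image: str = "route_agent") -> list[str]:
    """Build a `docker run` argv for the route agent (hypothetical helper)."""
    return [
        "docker", "run", "--rm",
        "--privileged",                          # requested in the review comment
        "-v", "/lib/modules:/lib/modules:ro",    # host kernel modules; read-only is an assumption
        image,
        "./start_route_agent.sh",                # passed as the container command
    ]


if __name__ == "__main__":
    # Print a shell-safe version of the command for inspection.
    print(shlex.join(build_route_agent_cmd()))
```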

@CdavM (Author)

Thanks @Kolleida! I believe it's not possible to run an image with --privileged in amber. This is why we went with the MALT agent instead. Is there another way we can run your benchmark via amber? Please let me know what image/entrypoint/config parameters to use. Thank you!

@Kolleida (Collaborator)

@CdavM If you are running MALT, then the role should be "malt_operator", not "route_operator" (this was mainly used by the leaderboard query to filter MALT-specific results). Also, the config should look something like this:

assessment_config: {
  prompt_type: "zeroshot_base",
  num_queries: 3,
  complexity_level: ["level1", "level2", "level3"],
  output_dir: "dump",
  output_file: "query_output.jsonl",
  benchmark_path: "assessment_queries.jsonl",
  regenerate_benchmark: true
}

This config generates 30 queries in total, spread across the 3 levels. Each increment of num_queries adds 10 more queries (you can choose how many you think is appropriate for good signal).
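
The arithmetic here can be sketched as follows, assuming the total scales linearly at 10 queries per unit of `num_queries`. That rate is inferred from the two figures in this comment and has not been checked against the benchmark code; the function and constant names are illustrative.

```python
# Assumption: inferred from "30 queries" at num_queries = 3 and
# "+10 queries" per increment; not taken from the benchmark source.
QUERIES_PER_UNIT = 10


def total_queries(num_queries: int) -> int:
    """Estimate the total number of generated queries across all levels."""
    if num_queries < 0:
        raise ValueError("num_queries must be non-negative")
    return QUERIES_PER_UNIT * num_queries


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, total_queries(n))
```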

Later I saw the agentbeats version of NetArena on the website, and the description references the K8s benchmark, but you guys are doing MALT instead. Is this also because the setup needed (e.g. bootstrapping a KIND cluster) is impossible/hard to express in amber?

@Kolleida (Collaborator) left a comment

What agent is this supposed to run? There seems to be a mismatch between the assessment config/operator name at the bottom of the manifest and the green agent container being run. If there is anything I can do to help/clarify things please lmk!
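
A mismatch like this could be caught with a small pre-flight check. The sketch below is hypothetical: the pairing of `malt_agent.py` with "malt_operator" follows the comments in this thread, the pairing of `./start_route_agent.sh` with "route_operator" is an assumption, and the function name and table are not part of amber or the benchmark.

```python
from typing import Optional

# Hypothetical mapping from a green-agent entrypoint token to the role it expects.
EXPECTED_ROLE = {
    "malt_agent.py": "malt_operator",
    "./start_route_agent.sh": "route_operator",  # assumption
}


def check_role(entrypoint: list[str], role: str) -> Optional[str]:
    """Return an error message on a mismatch, or None if consistent or unknown."""
    for token in entrypoint:
        expected = EXPECTED_ROLE.get(token)
        if expected is not None and expected != role:
            return f"entrypoint {token!r} expects role {expected!r}, got {role!r}"
    return None


if __name__ == "__main__":
    print(check_role(["uv", "run", "malt_agent.py"], "route_operator"))
```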
