entrypoint: [
  "uv",
  "run",
  "malt_agent.py",
The entrypoint should run ./start_route_agent.sh instead of uv run malt_agent.py, and it should use the route_agent image (I'm assuming this is for the route app?). Also, this container needs to be run with --privileged and with /lib/modules mounted.
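For reference, outside amber the requirements above would map onto a plain `docker run` invocation roughly as follows. This is only a sketch: the helper name `docker_run_command` is hypothetical, the image is referenced by the bare name `route_agent` with no registry or tag, and mounting /lib/modules at the same path inside the container is an assumption.

```python
import shlex


def docker_run_command(image: str, entrypoint: str) -> list[str]:
    """Build a `docker run` argv for a privileged container (hypothetical helper)."""
    return [
        "docker", "run", "--rm",
        "--privileged",                      # required by the route agent, per the comment
        "-v", "/lib/modules:/lib/modules",   # assumption: same path inside the container
        "--entrypoint", entrypoint,
        image,
    ]


if __name__ == "__main__":
    cmd = docker_run_command("route_agent", "./start_route_agent.sh")
    # Print a copy-pasteable shell command instead of executing it.
    print(shlex.join(cmd))
```

The sketch only prints the command, so it can be inspected before anything is run with elevated privileges.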
Thanks @Kolleida! I believe it's not possible to run an image with --privileged in amber. This is why we went with the MALT agent instead. Is there another way we can run your benchmark via amber? Please let me know what image/entrypoint/config parameters to use. Thank you!
@CdavM If you are running MALT, then the role should be "malt_operator", not "route_operator" (the role is mainly used by the leaderboard query to filter MALT-specific results). Also, the config should look something like this:
assessment_config: {
  prompt_type: "zeroshot_base",
  num_queries: 3,
  complexity_level: ["level1", "level2", "level3"],
  output_dir: "dump",
  output_file: "query_output.jsonl",
  benchmark_path: "assessment_queries.jsonl",
  regenerate_benchmark: true
}
This config generates 30 queries in total, spread across the 3 levels. Each increment of num_queries adds 10 queries (you can choose how many you think is appropriate for a good signal).
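As a rough sanity check on the sizing described above, the relationship can be sketched in Python. The factor of 10 queries per unit of num_queries is inferred only from this comment (3 gives 30, and each increment adds 10). The names `QUERIES_PER_UNIT` and `total_queries` are hypothetical and not part of the benchmark code, and how the queries are split between the levels is not modeled.

```python
# Assumption: inferred from the comment above (num_queries=3 -> 30 queries).
QUERIES_PER_UNIT = 10


def total_queries(num_queries: int) -> int:
    """Estimate the total number of generated queries for a given num_queries."""
    if num_queries < 0:
        raise ValueError("num_queries must be non-negative")
    return QUERIES_PER_UNIT * num_queries


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(f"num_queries={n} -> about {total_queries(n)} queries")
```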
Later I saw the agentbeats version of NetArena on the website, and the description references the K8s benchmark, but you guys are doing MALT instead. Is this also because the required setup (e.g. bootstrapping a KIND cluster) is impossible or hard to express in amber?