This web application evaluates OpenAI and Anthropic models using single prompts or batch JSONL exam datasets (with optional images). It is designed for benchmarking model performance on construction management certification exams, including AIC and CMAA.
- At a Glance
- Prerequisites
- Start the App
- Configure Models
- Run a Single Question
- Run Batch from JSONL
- JSONL Format (expected)
- Recommended file layout (to avoid missing images)
- Extend to More Models
- Resource access
- More Information
- License
- Purpose: benchmark LLM performance on construction management certification-style questions.
- Providers: OpenAI and Anthropic (API key required).
- Modes: single question runs and batch JSONL evaluation with optional images.
- Output: live run logs plus exportable JSON results.
- Python 3.9+ installed
- Internet access to model APIs
- OpenAI API key and/or Anthropic API key
python "./server.py" --host 127.0.0.1 --port 8000
Open:
http://127.0.0.1:8000
In Model Settings:
- Select one or more Providers.
- Select one or more Models.
- Enter API keys:
  - OpenAI API Key
  - Anthropic API Key
- (Optional) Enable Show API keys to verify the typed key.
- Set:
  - Temperature (`0.00` to `1.00`)
  - Running Times (`1` to `20`)
- Enter question text.
- (Optional) Upload images.
- Click Run.
- Watch real-time output in the Output panel.
- Click Save Results to export the latest run JSON.
- In Batch JSONL, upload one or more `.jsonl` files.
- Click Run Batch.
- Before runs start, the app prints image path diagnostics (unresolved image URIs).
- Progress bar updates while running.
- Output streams in real time per question/model/run.
- Review the Result Table under the output panel.
- Click Save Results to export batch results JSON.
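When a question carries an expected `answer` and the same question is run several times, the replies can be scored offline. The following is a minimal sketch, not the app's actual evaluation logic: it assumes each model reply is a JSON string with `answer` and `explanation` keys, as in the structured output mode described under JSONL Format. The names `extract_answer` and `accuracy` are hypothetical, and the schema of the exported results file is not documented here, so the sketch works on raw reply strings only.

```python
import json


def extract_answer(reply_text):
    """Return the upper-cased answer key from a structured reply, or None if unparseable."""
    try:
        reply = json.loads(reply_text)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(reply, dict):
        return None
    answer = reply.get("answer")
    return answer.strip().upper() if isinstance(answer, str) else None


def accuracy(expected, replies):
    """Fraction of replies whose parsed answer matches the expected answer key."""
    if not replies:
        return 0.0
    target = expected.strip().upper()
    hits = sum(1 for text in replies if extract_answer(text) == target)
    return hits / len(replies)
```

Replies that fail to parse count as incorrect in this sketch; a stricter report might track them separately.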
Each line must be one JSON object (no trailing commas). Required and optional fields:
- `id` (string, recommended)
- `question` (string, required)
- `choices` (object with `A`/`B`/`C`/`D`, recommended for MCQ)
- `answer` (string, optional, expected answer key for evaluation)
- `images` (array or `null`, optional)
- `table_markdown` (string or `null`, optional)
Example line:
{"id":"CAC-0001","question":"...","choices":{"A":"...","B":"...","C":"...","D":"..."},"answer":"B","images":[{"uri":"data/images/CAC-0015_fig1.png"}],"table_markdown":null}
Notes:
- `images` can be `null` or an array.
- Each `images` item should look like: `{"uri":"relative/or/absolute/path.png","caption":"...","type":"figure"}` (`caption`/`type` optional).
- Relative `images[].uri` in uploaded JSONL may be unresolved; unresolved entries are listed in image path diagnostics before run starts.
- Structured output mode asks models to return JSON with keys: `answer`, `explanation`.
- A ready-to-run toy sample is included at `cert_eval/data/example_format.jsonl` with image `cert_eval/data/images/toy_blocks.png`.
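Before uploading a file, it can help to check each line against the field rules above. The sketch below is an illustration only and is not part of the app; `validate_record` and `validate_file` are hypothetical helper names, and the checks cover only the field types listed in this section.

```python
import json


def validate_record(line):
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["line is not a JSON object"]

    problems = []
    question = record.get("question")
    if not isinstance(question, str) or not question.strip():
        problems.append("question must be a non-empty string")
    if "id" in record and not isinstance(record["id"], str):
        problems.append("id must be a string")
    choices = record.get("choices")
    if choices is not None and not isinstance(choices, dict):
        problems.append("choices must be an object")
    answer = record.get("answer")
    if answer is not None and not isinstance(answer, str):
        problems.append("answer must be a string")
    images = record.get("images")
    if images is not None:
        if not isinstance(images, list):
            problems.append("images must be an array or null")
        else:
            for i, item in enumerate(images):
                if not isinstance(item, dict) or not isinstance(item.get("uri"), str):
                    problems.append(f"images[{i}] needs a string uri")
    table = record.get("table_markdown")
    if table is not None and not isinstance(table, str):
        problems.append("table_markdown must be a string or null")
    return problems


def validate_file(path):
    """Map 1-based line numbers to their problems, skipping blank lines."""
    report = {}
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if line.strip():
                problems = validate_record(line)
                if problems:
                    report[number] = problems
    return report
```

For example, `validate_file("cert_eval/data/example_format.jsonl")` would return an empty dictionary if every line passes these checks.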
Keep JSONL and image files under the same served root so relative URIs resolve consistently.
Example:
construction-education-llm/
  cert_eval/
    data/
      CAC.jsonl
      images/
        CAC-0015_fig1.png
        CAC-0016_fig1.png
Then in JSONL use relative URIs like:
"uri": "data/images/CAC-0015_fig1.png"
And set Batch Image Base Path to:
/cert_eval/
If your images are in a different folder, either:
- Update `images[].uri` to correct relative paths from your chosen base path, or
- Use absolute `http(s)` image URLs.
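The app reports unresolved image URIs before a batch starts, but the same check can be approximated locally ahead of time. The following is a rough sketch under stated assumptions: `find_unresolved_images` is a hypothetical helper, and it assumes relative URIs are resolved against a local directory that corresponds to the Batch Image Base Path (for the layout above, the `cert_eval` folder). The app's own resolution logic may differ.

```python
import json
from pathlib import Path


def find_unresolved_images(jsonl_path, base_dir):
    """Return (record id, uri) pairs whose image file is missing under base_dir."""
    base = Path(base_dir)
    missing = []
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            record = json.loads(line)
            # `images` may be null or absent, so fall back to an empty list.
            for item in record.get("images") or []:
                uri = item.get("uri", "")
                # Absolute http(s) URLs are not checked on disk.
                if uri.startswith(("http://", "https://")):
                    continue
                # An absolute filesystem path replaces `base` when joined.
                if not (base / uri).is_file():
                    missing.append((record.get("id"), uri))
    return missing
```

With the example layout, a call such as `find_unresolved_images("cert_eval/data/CAC.jsonl", "cert_eval")` would list every record whose `data/images/...` file is not present.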
Model/provider options are defined in app.js in providerCatalog.
Example:
const providerCatalog = {
  openai: {
    label: "OpenAI",
    endpoint: "https://api.openai.com/v1/responses",
    models: ["gpt-4o", "gpt-5.2"]
  },
  anthropic: {
    label: "Anthropic",
    endpoint: "/api/anthropic/messages",
    models: ["claude-sonnet-4-20250514", "claude-sonnet-4-6"]
  }
};
To add a new model version:
- Update the `models` list under the correct provider in `providerCatalog`.
- Save and refresh browser (`Ctrl+F5`).
- Select the model in the UI and run a quick single test.
To add a new provider:
- Add a new provider entry to `providerCatalog` (`label`, `endpoint`, `models`).
- Add request payload mapping in `buildPayload(...)`.
- Add response parsing in `parseProviderResponse(...)`.
- Add provider key routing in `getApiKeyForProvider(...)`.
- If browser CORS blocks direct calls, add a proxy route in `server.py` and point provider `endpoint` to local `/api/...`.
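The proxy step can be sketched with the Python standard library. This is an illustration only: the README does not show how `server.py` is implemented, so the handler class, the local route `/api/example/chat`, the upstream URL, the `X-Api-Key` request header, and the bearer-token auth scheme are all assumptions to be replaced with the real provider's requirements.

```python
import json
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler

# Hypothetical upstream URL for illustration; use the provider's documented endpoint.
UPSTREAM_URL = "https://api.example.com/v1/chat"


def build_upstream_request(body, api_key, upstream_url=UPSTREAM_URL):
    """Build the server-side POST that the local /api/... route forwards upstream."""
    return urllib.request.Request(
        upstream_url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer auth; the header name and scheme are provider-specific.
            "Authorization": f"Bearer {api_key}",
        },
    )


class ProxyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/api/example/chat":  # hypothetical local route
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        # Assumption: the browser passes the user's key in an X-Api-Key header.
        api_key = self.headers.get("X-Api-Key", "")
        try:
            with urllib.request.urlopen(build_upstream_request(body, api_key)) as upstream:
                status, payload = upstream.status, upstream.read()
        except urllib.error.HTTPError as exc:
            # Relay the provider's error status and body to the browser.
            status, payload = exc.code, exc.read()
        except urllib.error.URLError as exc:
            # Network-level failure: report it as a bad gateway.
            status, payload = 502, json.dumps({"error": str(exc.reason)}).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

Because the request is made server-side, the browser only talks to the local origin, which is why pointing `endpoint` at a local `/api/...` path sidesteps the CORS restriction.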
Tip for latest versions:
- Use exact official API model IDs from provider docs.
- Retire old model IDs from `providerCatalog.models` when no longer supported.
- AIC resources can be accessed at: https://aic-builds.org/certifications/
- CMAA exam resources are available at: https://www.cmaanet.org/bookstore
- Some source materials may require purchase or direct contact with the issuing organization for access.
If you use this benchmark or repository, please cite:
@misc{xiong2025aimasterconstruction,
title={Can AI Master Construction Management (CM)? Benchmarking State-of-the-Art Large Language Models on CM Certification Exams},
author={Ruoxin Xiong and Yanyu Wang and Suat Gunhan and Yimin Zhu and Charles Berryman},
year={2025},
eprint={2504.08779},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2504.08779}
}
Code in this repository is licensed under the Apache License 2.0. See LICENSE.
Exam datasets, images, and third-party source materials referenced by this project may have separate terms and are not automatically covered by the repository code license. You are responsible for obtaining any required permissions for those materials.
