Supported by the Technology Agency of the Czech Republic project:
Applied Legal Language Model and Benchmarks for Legal Practice (FW11020230)
This repository contains inference and evaluation code for the Czech Law Multiple-Choice Benchmark (CLMC).
The dataset is not yet publicly available. Please contact us to request access for academic purposes.
This guide assumes a SLURM cluster environment. Adapting the setup to other environments should be straightforward.
Load required modules:
ml vLLM/0.12.0-foss-2025a-CUDA-12.8.0
ml Triton/3.5.0-gfbf-2025a-CUDA-12.8.0
Create a Python virtual environment:
python -m venv .venv
Create a .env file and set the OPENAI_API_KEY environment variable.
Activate the environment:
source .venv/bin/activate
Install required packages:
pip install -r requirements.txt
Install vLLM manually if your environment does not provide a module.
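The .env file mentioned above holds simple KEY=VALUE lines. How this repository actually reads it is not shown here, so the following is only a minimal stdlib sketch of one way to pick up OPENAI_API_KEY; the helper names load_env_file and get_openai_api_key are hypothetical and not part of the repository.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a dotenv-style file into a dict (hypothetical helper)."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Tolerate shell-style "export KEY=VALUE" lines.
        if line.startswith("export "):
            line = line[len("export "):]
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=VALUE line
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_openai_api_key(path=".env"):
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("OPENAI_API_KEY") or load_env_file(path).get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(f"OPENAI_API_KEY is not set in the environment or in {path}")
    return key
```

Checking the process environment first means a key exported in a SLURM batch script takes precedence over the file.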
To switch easily between models, the benchmarking suite uses the AIC vLLM Proxy server, which wraps vllm serve and provides Ollama-like model management.
Currently, only a single model can be served at a time.
Run the AIC vLLM Proxy server. Select (or create) a configuration matching the models you want to evaluate. The following example runs the proxy on the CTU RCI cluster using an NVIDIA H200 GPU:
cd slurm
sbatch vllm_proxy_1h200.batch
SLURM logs are stored in the logs/ directory.
Configuration is done by editing the main() function in:
src/lexbench_cs/run_clmc_inference.py
Key parameters:
PROXY_URL: proxy connection string
MODEL2SPEC: target LLM definitions
TEMPLATE_NAMES: selected evaluation prompt templates
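To make the three parameters above more concrete, here is an illustrative sketch of how such a configuration might be laid out. The real definitions live in main() of run_clmc_inference.py and may look quite different: the ModelSpec class, its fields, the URL, the model names, and the template names below are all placeholder assumptions. Because only one model is served at a time, the hypothetical plan_runs helper orders runs model by model so that each model needs to be loaded only once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical description of one model; the repository's real spec may differ."""
    hf_id: str
    max_model_len: int = 8192


# All values below are placeholders, not the repository's actual settings.
PROXY_URL = "http://localhost:8000"
MODEL2SPEC = {
    "model-a": ModelSpec(hf_id="example-org/model-a"),
    "model-b": ModelSpec(hf_id="example-org/model-b", max_model_len=4096),
}
TEMPLATE_NAMES = ["zero_shot", "few_shot"]


def plan_runs(model2spec, template_names):
    """Return (model, template) pairs grouped by model.

    Iterating over models in the outer loop keeps all templates for one
    model together, so the served model changes as rarely as possible.
    """
    return [(model, template) for model in model2spec for template in template_names]


if __name__ == "__main__":
    for model, template in plan_runs(MODEL2SPEC, TEMPLATE_NAMES):
        print(f"{model}: {template}")
```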
Run inference:
cd slurm
sbatch run_clmc_inference.batch
Results will be stored in the EXP/ directory.
To run inference with OpenAI models, edit:
src/lexbench_cs/run_clmc_inference_openai.py
Then run:
cd slurm
sbatch run_clmc_inference_openai.batch
Run evaluation:
cd slurm
sbatch run_evaluate.batch
Aggregated results (Markdown and LaTeX tables) are stored in:
EXP/clmc/evaluation.md
EXP/clmc/evaluation.tex
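The evaluation script itself is not shown here. As a rough illustration of the kind of aggregation a multiple-choice benchmark involves, the sketch below computes per-model accuracy and renders a Markdown table. The record layout (dicts with model, predicted, and gold keys) and both function names are assumptions made for this example, not the repository's actual format.

```python
from collections import defaultdict


def accuracy_by_model(records):
    """Compute the fraction of correctly answered questions per model.

    `records` is assumed to be an iterable of dicts with the keys
    'model', 'predicted', and 'gold' (a hypothetical layout).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["model"]] += 1
        if record["predicted"] == record["gold"]:
            correct[record["model"]] += 1
    return {model: correct[model] / total[model] for model in total}


def to_markdown_table(accuracies):
    """Render a {model: accuracy} mapping as a Markdown table, sorted by model name."""
    lines = ["| Model | Accuracy |", "| --- | --- |"]
    for model in sorted(accuracies):
        lines.append(f"| {model} | {accuracies[model]:.3f} |")
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [
        {"model": "model-a", "predicted": "A", "gold": "A"},
        {"model": "model-a", "predicted": "B", "gold": "C"},
        {"model": "model-b", "predicted": "D", "gold": "D"},
    ]
    print(to_markdown_table(accuracy_by_model(demo)))
```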
MIT License
© AIC, Czech Technical University in Prague, 2026