aic-factcheck/lexbench-cs

LexBench-CS: LLM benchmarking for Czech legal reasoning tasks

Supported by the Technology Agency of the Czech Republic project:
Applied Legal Language Model and Benchmarks for Legal Practice (FW11020230)

This repository contains inference and evaluation code for the Czech Law Multiple-Choice Benchmark (CLMC).

The dataset is not yet publicly available. Please contact us to request access for academic purposes.


Installation

This guide assumes a SLURM cluster environment. Adapting the setup to other environments should be straightforward.

Load required modules:

ml vLLM/0.12.0-foss-2025a-CUDA-12.8.0
ml Triton/3.5.0-gfbf-2025a-CUDA-12.8.0

Create a Python virtual environment:

python -m venv .venv

Create a .env file and set the OPENAI_API_KEY variable in it.
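
For reference, a .env file is a plain-text list of KEY=VALUE lines. The minimal sketch below shows one way such a file could be read into the process environment using only the Python standard library. It only illustrates the file format and is not necessarily how this repository loads the file: the helper name load_env_file is hypothetical, and the key value is a placeholder.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from `path` and export them to os.environ.

    Hypothetical helper for illustration; the repository may load .env differently.
    """
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        # Variables already set in the shell take precedence over the file.
        os.environ.setdefault(key, value)
        # Record what the file itself contained.
        loaded[key] = value
    return loaded
```

With a .env file containing the single line OPENAI_API_KEY=<your key>, calling load_env_file() would make the key available through os.environ["OPENAI_API_KEY"].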

Activate the environment:

source .venv/bin/activate

Install required packages:

pip install -r requirements.txt

Install vLLM manually if your environment does not provide a module.


Run Inference

Local models using vLLM

To make switching between models easy, the benchmarking suite uses the AIC vLLM Proxy server, which wraps vllm serve and provides Ollama-like model management. Currently, only one model can be served at a time.

Start the AIC vLLM Proxy server using a configuration that matches the models you want to evaluate (select an existing one or create your own). The following example runs the proxy on the CTU RCI cluster using an NVIDIA H200 GPU:

cd slurm
sbatch vllm_proxy_1h200.batch

SLURM logs are stored in the logs/ directory.

Configure inference by editing the main() function in:

src/lexbench_cs/run_clmc_inference.py

Key parameters:

  • PROXY_URL: proxy connection string
  • MODEL2SPEC: target LLM definitions
  • TEMPLATE_NAMES: selected evaluation prompt templates
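
To make the three parameters more concrete, here is a rough sketch of what such a configuration might look like. Only the names PROXY_URL, MODEL2SPEC, and TEMPLATE_NAMES come from this README; every value, the shape of each structure, the planned_runs helper, and the assumption that each model is run with each template are hypothetical. Check run_clmc_inference.py for the actual definitions.

```python
# Illustrative placeholders only: the real values and spec fields are defined
# in src/lexbench_cs/run_clmc_inference.py and may look different.
PROXY_URL = "http://localhost:8000"  # assumed host and port of the running proxy

# Assumed shape: model identifier -> dictionary of serving options.
MODEL2SPEC = {
    "example-org/example-model": {"max_model_len": 8192},  # hypothetical entry
}

# Assumed shape: list of prompt template names to evaluate.
TEMPLATE_NAMES = ["example_template"]  # hypothetical name


def planned_runs(model2spec, template_names):
    """Enumerate every (model, template) pair the configuration would cover."""
    return [(model, template) for model in model2spec for template in template_names]


for model, template in planned_runs(MODEL2SPEC, TEMPLATE_NAMES):
    print(f"{model} x {template}")
```

Listing the pairs before submitting the job is a cheap way to confirm that the configuration covers the intended models and templates.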

Run inference:

cd slurm
sbatch run_clmc_inference.batch

Results will be stored in the EXP/ directory.


OpenAI models

Configure the run by editing:

src/lexbench_cs/run_clmc_inference_openai.py

Then run:

cd slurm
sbatch run_clmc_inference_openai.batch

Run Evaluation

cd slurm
sbatch run_evaluate.batch

Aggregated results (Markdown and LaTeX tables) are stored in:

  • EXP/clmc/evaluation.md
  • EXP/clmc/evaluation.tex
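
Once the evaluation job has finished, a quick check such as the following sketch can confirm that both tables were written. It assumes the script is run from the directory that contains EXP/; the helper name missing_outputs is hypothetical and not part of the repository.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_OUTPUTS = ("evaluation.md", "evaluation.tex")


def missing_outputs(exp_dir="EXP/clmc"):
    """Return the expected evaluation files that are absent or empty."""
    base = Path(exp_dir)
    return [
        name
        for name in EXPECTED_OUTPUTS
        if not (base / name).is_file() or (base / name).stat().st_size == 0
    ]


if __name__ == "__main__":
    missing = missing_outputs()
    if missing:
        print("Missing or empty:", ", ".join(missing))
    else:
        print("All evaluation tables are present.")
```

If a file is reported as missing, the SLURM logs in the logs/ directory are the first place to look.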

License

MIT License

© AIC, Czech Technical University in Prague, 2026
