This repository contains reusable code for the experiments in "Decoupling the Effect of Chain-of-Thought Reasoning: A Human Label Variation Perspective". The project studies whether long Chain-of-Thought (CoT) reasoning helps large language models approximate human label distributions on ambiguous NLI examples.
The code supports three workflows:
- Generate CoT traces for ChaosNLI examples.
- Split each CoT into cumulative prefixes and inject those prefixes into other models for Cross-CoT / step-wise logit tracing.
- Evaluate first-token option logits against human judgment distributions with accuracy, Jensen-Shannon distance, Spearman correlation, and additive ANOVA.
No data, model checkpoints, or generated traces are included in this repository. See docs/data.md for dataset sources and expected file names.
python -m venv .venv
source .venv/bin/activate
pip install -e .

Install PyTorch according to your CUDA environment before running large-model experiments if the default wheel is not suitable.
Download ChaosNLI from the official release:
https://github.com/easonnie/ChaosNLI
Place the files in a local directory such as:
data/
  chaosNLI_mnli_m.jsonl
  chaosNLI_snli.jsonl
  chaosNLI_alphanli.jsonl
The code intentionally uses command-line paths; no machine-specific paths are hard-coded.
Evaluate the tiny synthetic example:
python scripts/evaluate_trace.py \
--input examples/tiny_trace.jsonl \
--aggregate

python scripts/generate_cot.py \
--dataset snli \
--input data/chaosNLI_snli.jsonl \
--output outputs/snli/CoT_gpt.jsonl \
--model openai/gpt-oss-20b \
--cache-dir .cache/huggingface

For local checkpoints, pass the checkpoint directory to --model.

python scripts/make_chunks.py \
--input outputs/snli/CoT_gpt.jsonl \
--output outputs/snli/CoT_chunks_gpt.jsonl \
--num-chunks 10 \
--method token

This produces 11 prefixes: 0%, 10%, ..., 100%.

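
The way 10 chunks yield 11 cumulative prefixes can be sketched as follows. This is a minimal illustration, not the code in scripts/make_chunks.py: the function name `cumulative_prefixes` is hypothetical, and whitespace splitting stands in for a real model tokenizer.

```python
def cumulative_prefixes(tokens, num_chunks=10):
    """Return num_chunks + 1 cumulative prefixes of `tokens` (0%, ..., 100%).

    Hypothetical helper for illustration; the repository's chunking may
    choose cut points differently.
    """
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    n = len(tokens)
    prefixes = []
    for k in range(num_chunks + 1):
        # Round k/num_chunks of the token count using integer arithmetic.
        cut = (k * n + num_chunks // 2) // num_chunks
        prefixes.append(tokens[:cut])
    return prefixes


# Assumption: whitespace tokens stand in for tokenizer tokens.
cot = ("The premise says a man is outside . "
       "The hypothesis says he is indoors . These conflict .")
tokens = cot.split()
prefixes = cumulative_prefixes(tokens, num_chunks=10)
```

The first prefix is empty (no reasoning injected) and the last is the full trace, so each later prefix extends the previous one.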
python scripts/run_logit_trace.py \
--input outputs/snli/CoT_chunks_gpt.jsonl \
--output outputs/snli/r1-qwen_logits_trace_for_CoT_gpt.jsonl \
--model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
--options A B C \
--template think \
--cache-dir .cache/huggingface

For alphaNLI, use --options A B. If a model tokenizer does not map these letters to single tokens, pass explicit IDs with --option-token-ids.
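
One way to check whether the option letters are single tokens before running a trace is sketched below. The helper `single_token_ids` and the toy encoder are hypothetical and only illustrate the check; they are not part of this repository.

```python
def single_token_ids(options, encode):
    """Map each option string to its token ID.

    `encode` is any callable that returns a list of token IDs for a string.
    Raises ValueError if an option does not encode to exactly one token.
    """
    ids = {}
    for opt in options:
        toks = list(encode(opt))
        if len(toks) != 1:
            raise ValueError(
                f"Option {opt!r} encodes to {len(toks)} tokens; "
                "pass explicit token IDs instead."
            )
        ids[opt] = toks[0]
    return ids


# Toy encoder for illustration only: known letters are one token,
# anything else falls back to one ID per character.
toy_vocab = {"A": 32, "B": 33, "C": 34}

def toy_encode(text):
    return [toy_vocab[text]] if text in toy_vocab else [ord(ch) for ch in text]

option_ids = single_token_ids(["A", "B", "C"], toy_encode)
```

With a Hugging Face tokenizer, `encode` could be `lambda s: tokenizer.encode(s, add_special_tokens=False)`. Note that many tokenizers assign different IDs to "A" and " A", so the check should use the exact string the prompt template places before the answer.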
Use --template gpt-oss for GPT-OSS style analysis/final channel delimiters and
--template think for models that use <think>...</think> conventions.
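
As a rough illustration of what injecting a CoT prefix under a think-style template involves, the sketch below builds a prompt that ends at the answer position. The function name, the prompt layout, and the "Answer: " suffix are all assumptions for illustration; the templates actually used by run_logit_trace.py may differ.

```python
def build_think_prompt(question, cot_prefix, close_think=True):
    """Hypothetical think-style prompt with an injected CoT prefix.

    The layout (newlines, 'Answer: ' suffix) is an assumption, not the
    repository's template.
    """
    prompt = f"{question}\n<think>\n{cot_prefix}"
    if close_think:
        # Close the reasoning span so the next token is the answer option.
        prompt += "\n</think>\n\nAnswer: "
    return prompt


question = (
    "Premise: A man is outside.\n"
    "Hypothesis: He is indoors.\n"
    "Options: A) entailment B) neutral C) contradiction"
)
prompt = build_think_prompt(question, "The man cannot be both outside and indoors.")
```

The first-token logits for the option letters would then be read at the position right after this prompt.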
python scripts/evaluate_trace.py \
--input outputs/snli/r1-qwen_logits_trace_for_CoT_gpt.jsonl \
--output outputs/snli/r1-qwen_for_gpt_metrics.csv

The default probability conversion is the linear normalization used in the paper. For a robustness check, use --probability-mode softmax.
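
The sketch below shows how option logits could be turned into a distribution and compared with a human label distribution. The softmax and Jensen-Shannon distance are standard; `linear_normalize` is only one possible reading of "linear normalization" (shift by the minimum, then divide by the sum) and is an assumption, so consult the paper and source for the exact definition. The example numbers are invented.

```python
import math


def softmax(logits):
    """Standard softmax with max-subtraction for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_normalize(logits):
    """ASSUMPTION: min-shift then sum-normalize; the paper's formula may differ."""
    lo = min(logits)
    shifted = [x - lo for x in logits]
    total = sum(shifted)
    if total == 0:
        # All logits equal: fall back to a uniform distribution.
        return [1.0 / len(logits)] * len(logits)
    return [s / total for s in shifted]


def js_distance(p, q):
    """Jensen-Shannon distance (square root of the base-2 JS divergence)."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    mix = [(x + y) / 2 for x, y in zip(p, q)]
    return math.sqrt(max(0.0, 0.5 * kl(p, mix) + 0.5 * kl(q, mix)))


human = [0.6, 0.3, 0.1]    # invented human label distribution
logits = [2.0, 1.0, 0.0]   # invented first-token option logits
p_soft = softmax(logits)
p_lin = linear_normalize(logits)
jsd_soft = js_distance(human, p_soft)
jsd_lin = js_distance(human, p_lin)
```

Under this reading, the min-shift variant always assigns zero probability to the lowest-scoring option, which is one reason a softmax robustness check can be informative.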
After concatenating metric CSVs with model and cot_source columns:
python scripts/anova_contributions.py \
--input outputs/snli/final_step_metrics.csv \
--score jsd

This fits an additive two-way ANOVA model:
score ~ C(model) + C(cot_source)
and reports each factor's sum-of-squares contribution percentage.
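
For intuition, the sum-of-squares contribution of each factor can be computed by hand for a balanced table (every model paired with every CoT source equally often). This is a simplified sketch with invented scores and a hypothetical function name; scripts/anova_contributions.py may rely on a statistics library and handle unbalanced data differently.

```python
from collections import defaultdict


def additive_anova_contributions(rows):
    """Percent of total sum of squares per factor in an additive two-way layout.

    rows: iterable of (model, cot_source, score). Assumes a balanced design.
    """
    rows = list(rows)
    scores = [s for _, _, s in rows]
    grand = sum(scores) / len(scores)

    by_model, by_cot = defaultdict(list), defaultdict(list)
    for model, cot_source, score in rows:
        by_model[model].append(score)
        by_cot[cot_source].append(score)

    def factor_ss(groups):
        # Between-group sum of squares: group size times squared mean offset.
        return sum(len(v) * (sum(v) / len(v) - grand) ** 2 for v in groups.values())

    ss_total = sum((s - grand) ** 2 for s in scores)
    if ss_total == 0:
        raise ValueError("All scores are identical; contributions are undefined.")
    ss_model = factor_ss(by_model)
    ss_cot = factor_ss(by_cot)
    ss_resid = ss_total - ss_model - ss_cot
    return {
        "model": 100 * ss_model / ss_total,
        "cot_source": 100 * ss_cot / ss_total,
        "residual": 100 * ss_resid / ss_total,
    }


# Invented JSD scores for two models and two CoT sources.
rows = [
    ("m1", "c1", 0.1), ("m1", "c2", 0.3),
    ("m2", "c1", 0.2), ("m2", "c2", 0.4),
]
contrib = additive_anova_contributions(rows)
```

In this toy table the CoT source explains about 80% of the variance and the model about 20%, with no residual, because the scores are exactly additive.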
configs/        Model and dataset metadata templates
docs/           Data provenance and usage notes
examples/       Tiny synthetic examples for smoke checks
scripts/        Command-line entry points
src/cot_hlv/    Reusable Python package
If you use this code, please cite:
@article{DBLP:journals/corr/abs-2601-03154,
author = {Beiduo Chen and Tiancheng Hu and Caiqi Zhang and Robert Litschko and Anna Korhonen and Barbara Plank},
title = {Decoupling the Effect of Chain-of-Thought Reasoning: {A} Human Label Variation Perspective},
journal = {CoRR},
volume = {abs/2601.03154},
year = {2026},
url = {https://doi.org/10.48550/arXiv.2601.03154},
doi = {10.48550/ARXIV.2601.03154},
eprinttype = {arXiv},
eprint = {2601.03154}
}

- Large-model inference is compute intensive. The paper used multi-GPU inference for the largest models.
- Generated CoT traces may contain model-specific template markers. The helper functions cover common <think> and GPT-OSS channel formats, but you may need to adjust extract_reasoning for new models.
- This repository omits private paths, local caches, raw data, and intermediate outputs by design.