Repo for "CoDiQ: Test-Time Scaling for Controllable Difficult Question Generation"
- [2026.02.02] We release CoDiQ-Corpus, containing 44K competition-grade question sequences!
- [2026.02.02] We open-source CoDiQ-Gen-8B, a specialized generator trained via RL to synthesize high-difficulty problems.
- [2026.02.02] The CoDiQ paper is released. We propose a framework enabling fine-grained difficulty control via test-time scaling.
Large Reasoning Models (LRMs) benefit substantially from training on challenging, competition-level questions. However, existing automated synthesis methods struggle with "fake hard" questions—problems that are complex but unsolvable or ill-defined.
CoDiQ (Controllable Difficult Question Generation) is a novel framework that enables fine-grained difficulty control via test-time scaling while ensuring solvability.
Key innovations include:
- Test-Time Scaling Tendency: We identify that extending the reasoning token budget boosts difficulty but can reduce solvability.
- CoDiQ-Generator: A specialized model (finetuned from Qwen3-8B) that improves the upper bound of valid, high-difficulty question generation.
- CoDiQ-Corpus: A dataset of 44K competition-grade math and coding question sequences, which is significantly more challenging than LiveCodeBench and AIME.
Training LRMs on CoDiQ-Corpus substantially enhances downstream reasoning performance. The CoDiQ-Generator and CoDiQ-Corpus are released.
The CoDiQ pipeline iteratively evolves a seed question into a sequence of progressively harder, verified variants.
To avoid superficial difficulty, we guide the LLM with six specific cognitive scaffolds:
- Dimensionality & Constraints: Exploding data scale/dimensions to invalidate naive simulation.
- Mathematical Abstraction: Reframing procedural problems into formal models (e.g., Number Theory).
- Inverse & Constructive: Reversing the "Given X find Y" flow to "Construct X such that Y holds".
- State Explosion: Enriching DP states or dependencies.
- Theorem Disguise: Hiding standard algorithms behind abstract narratives.
- Edge Case & Rigor: Targeting precision, overflow, or degenerate cases.
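As an illustrative sketch (the scaffold wording and function names below are our own, not the repository's actual implementation), one evolution step can be framed as choosing a scaffold and composing a rewrite prompt around it:

```python
# Illustrative sketch: composing a scaffold-guided rewrite prompt for one
# evolution step. The prompt text is an assumption, not the paper's exact prompt.
SCAFFOLDS = {
    "dimensionality": "Scale up data sizes/dimensions so naive simulation is infeasible.",
    "abstraction": "Reframe the procedural task as a formal model (e.g., number theory).",
    "inverse": "Reverse 'given X, find Y' into 'construct X such that Y holds'.",
    "state_explosion": "Enrich the DP state space or dependency structure.",
    "theorem_disguise": "Hide a standard algorithm behind an abstract narrative.",
    "edge_rigor": "Target precision, overflow, or degenerate cases.",
}

def build_evolution_prompt(question: str, scaffold: str) -> str:
    """Compose a rewrite prompt that applies one cognitive scaffold."""
    instruction = SCAFFOLDS[scaffold]
    return (
        "Rewrite the following problem so it is strictly harder.\n"
        f"Strategy: {instruction}\n"
        "The rewritten problem must remain well-defined and solvable.\n\n"
        f"Problem:\n{question}"
    )
```

The resulting prompt would then be sent to the generator model (e.g., CoDiQ-Gen-8B) to produce the next, harder question in the sequence.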
We ensure quality using two verifiers during the scaling process:
- Difficulty Estimation: Uses LLM-Ranking and a ValueNetwork to ensure monotonic difficulty growth.
- Solvability Verification: A reasoner (e.g., Qwen3-32B) checks if the new problem remains well-defined and solvable.
The generation process is terminated if the difficulty becomes non-monotonic or if the newly generated question is unsolvable, in which case the invalid candidate is discarded and the sequence up to the previous step is retained.
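The accept/terminate rule above can be sketched as a small loop. This is a minimal illustration under assumed interfaces: `propose`, `estimate_difficulty`, and `is_solvable` stand in for the generator, the LLM-ranking/value-network difficulty estimator, and the reasoner-based solvability verifier, respectively.

```python
from typing import Callable, List

def evolve_sequence(
    seed: str,
    propose: Callable[[str], str],                # generator: produce a harder candidate
    estimate_difficulty: Callable[[str], float],  # difficulty estimator
    is_solvable: Callable[[str], bool],           # solvability verifier
    max_rounds: int = 8,
) -> List[str]:
    """Grow a question sequence; stop when difficulty stops growing
    monotonically or a candidate becomes unsolvable, discarding that
    candidate and keeping the sequence built so far."""
    sequence = [seed]
    last_difficulty = estimate_difficulty(seed)
    for _ in range(max_rounds):
        candidate = propose(sequence[-1])
        difficulty = estimate_difficulty(candidate)
        # Reject if difficulty is non-monotonic or the candidate is unsolvable.
        if difficulty <= last_difficulty or not is_solvable(candidate):
            break
        sequence.append(candidate)
        last_difficulty = difficulty
    return sequence
```

The returned list is the retained sequence up to the last step that passed both verifiers.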
To systematically evaluate the question generation capability of LRMs, we first construct CoDiQ-Bench, a curated dataset comprising 200 carefully selected cases across coding and mathematical domains.
CoDiQ-Gen-8B significantly outperforms the larger Qwen3-32B baseline in generating high-difficulty solvable questions on CoDiQ-Bench.
Table 2. Performance of different Long-CoT models on CoDiQ-Bench. Group rankings are based on the highest difficulty of solvable questions generated across 8 rounds without difficulty degradation. The best, second-best, and third-best scores for each indicator are shown boxed, in bold, and underlined, respectively.
The following figure highlights the positive correlation between reasoning token volume and difficulty ranking.
Figure 2. Question Difficulty Scaling on CoDiQ-Bench. Scatter plot showing the relationship between average reasoning tokens and difficulty ranking (DR-AVG) for models using CoDiQ Prompt. Each point represents a model, demonstrating the positive correlation between increased reasoning computation and generated problem difficulty.
CoDiQ-Corpus achieves higher difficulty ratings compared to standard competition benchmarks.
Table 3. Dataset Difficulty Comparison. The best, second-best, and third-best scores for each indicator are shown boxed, in bold, and underlined, respectively.
We apply CoDiQ-Gen-8B, following the CoDiQ pipeline, to transform eight diverse mathematical and programming datasets into the more challenging CoDiQ-Corpus, which comprises 44,453 question sequences with progressive difficulty from easy to hard. The detailed distribution is presented in the following table.
```bash
git clone https://github.com/ALEX-nlp/CoDiQ.git
cd CoDiQ
pip install -r requirements.txt
```

You can leverage CoDiQ-Gen-8B to increase the complexity of any seed problem. First, update the configuration in `tools_api.py`, `codiq_api.py`, and `count_tokens.py`, then execute the following script:
```bash
bash run.sh
```

If you find CoDiQ useful for your research, please consider citing our paper:
```bibtex
@article{codiq2026,
  title={CoDiQ: Test-Time Scaling for Controllable Difficult Question Generation},
  author={Zhongyuan Peng and Caijun Xu and Changyi Xiao and Shibo Hong and Eli Zhang and Stephen Huang and Yixin Cao},
  journal={arXiv preprint arXiv:2602.01660},
  year={2026}
}
```



