CoDiQ: Test-Time Scaling for Controllable Difficult Question Generation

Repo for "CoDiQ: Test-Time Scaling for Controllable Difficult Question Generation"

🔥 News

[2026.02.02] We release CoDiQ-Corpus, containing 44K competition-grade question sequences!
[2026.02.02] We open-source CoDiQ-Gen-8B, a specialized generator trained via RL to synthesize high-difficulty problems.
[2026.02.02] The CoDiQ paper is released. We propose a framework enabling fine-grained difficulty control via test-time scaling.

💡 Introduction

Large Reasoning Models (LRMs) benefit substantially from training on challenging, competition-level questions. However, existing automated synthesis methods struggle with "fake hard" questions—problems that are complex but unsolvable or ill-defined.

CoDiQ (Controllable Difficult Question Generation) is a novel framework that enables fine-grained difficulty control via test-time scaling while ensuring solvability.

Key innovations include:

Test-Time Scaling Tendency: We identify that extending the reasoning token budget boosts difficulty but can reduce solvability.
CoDiQ-Generator: A specialized model (finetuned from Qwen3-8B) that improves the upper bound of valid, high-difficulty question generation.
CoDiQ-Corpus: A dataset of 44K competition-grade math and coding question sequences, which is significantly more challenging than LiveCodeBench and AIME.

Training LRMs on CoDiQ-Corpus substantially enhances downstream reasoning performance. The CoDiQ-Generator and CoDiQ-Corpus are released.

Figure 1. Distribution of CoDiQ-Corpus Dataset.

🚀 The CoDiQ Framework

The CoDiQ pipeline iteratively evolves a seed question $Q_0$ into harder variants $\{Q_1, \dots, Q_n\}$ through a rigorous cycle of generation and verification.

1. Difficulty-Enhancement Strategies

To avoid superficial difficulty, we guide the LLM with six specific cognitive scaffolds:

Dimensionality & Constraints: Exploding data scale/dimensions to invalidate naive simulation.
Mathematical Abstraction: Reframing procedural problems into formal models (e.g., Number Theory).
Inverse & Constructive: Reversing the "Given X find Y" flow to "Construct X such that Y holds".
State Explosion: Enriching DP states or dependencies.
Theorem Disguise: Hiding standard algorithms behind abstract narratives.
Edge Case & Rigor: Targeting precision, overflow, or degenerate cases.
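
As a rough illustration of how the six scaffolds could be passed to a generator, the sketch below stores them in a lookup table and renders one into an instruction. The strategy names and descriptions come from the list above. The `build_prompt` helper and the prompt wording are hypothetical, and so is the choice to apply a single strategy per call. None of this is the repository's actual prompt.

```python
from typing import Dict

# Strategy names and descriptions are taken from the list above.
STRATEGIES: Dict[str, str] = {
    "Dimensionality & Constraints": "Explode data scale or dimensions to invalidate naive simulation.",
    "Mathematical Abstraction": "Reframe the procedural problem as a formal model (e.g., number theory).",
    "Inverse & Constructive": "Reverse 'Given X find Y' into 'Construct X such that Y holds'.",
    "State Explosion": "Enrich DP states or dependencies.",
    "Theorem Disguise": "Hide a standard algorithm behind an abstract narrative.",
    "Edge Case & Rigor": "Target precision, overflow, or degenerate cases.",
}


def build_prompt(question: str, strategy: str) -> str:
    """Render a hypothetical difficulty-enhancement instruction for one strategy."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return (
        "Rewrite the question below so that it is harder "
        "but still well-defined and solvable.\n"
        f"Strategy: {strategy}. {STRATEGIES[strategy]}\n"
        f"Question: {question}"
    )
```

Keeping the scaffolds in a plain dictionary makes it easy to log which strategy produced each variant.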

2. Hybrid Verification

We ensure quality using two verifiers during the scaling process:

Difficulty Estimation: Uses LLM-Ranking and a ValueNetwork to ensure monotonic difficulty growth.
Solvability Verification: A reasoner (e.g., Qwen3-32B) checks if the new problem remains well-defined and solvable.

The generation process is terminated if the difficulty becomes non-monotonic or if the newly generated question is unsolvable, in which case the invalid candidate is discarded and the sequence up to the previous step is retained.
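
The termination rule above can be sketched as a short loop. This is a minimal sketch rather than the repository's implementation. The callables `generate`, `is_harder`, and `is_solvable` are hypothetical stand-ins for the generator, the difficulty estimator, and the solvability verifier. The default of 8 rounds is an assumption that mirrors the round count mentioned for the CoDiQ-Bench evaluation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class EvolutionResult:
    """Accepted sequence Q_0..Q_k plus the reason the loop stopped."""
    questions: List[str] = field(default_factory=list)
    stop_reason: str = "max_rounds"


def evolve(
    seed: str,
    generate: Callable[[str], str],         # hypothetical: Q_{i-1} -> candidate Q_i
    is_harder: Callable[[str, str], bool],  # hypothetical: (previous, candidate) -> bool
    is_solvable: Callable[[str], bool],     # hypothetical: candidate -> bool
    max_rounds: int = 8,                    # assumed default, not a documented setting
) -> EvolutionResult:
    result = EvolutionResult(questions=[seed])
    for _ in range(max_rounds):
        previous = result.questions[-1]
        candidate = generate(previous)
        # Stop if difficulty did not grow; the candidate is discarded.
        if not is_harder(previous, candidate):
            result.stop_reason = "non_monotonic_difficulty"
            break
        # Stop if the candidate is not solvable; the candidate is discarded.
        if not is_solvable(candidate):
            result.stop_reason = "unsolvable"
            break
        result.questions.append(candidate)
    return result
```

In both failure cases the returned sequence ends at the last accepted question, which matches the "retain the sequence up to the previous step" behaviour described above.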

📋 CoDiQ-Bench

To systematically evaluate the question generation capability of LRMs, we construct CoDiQ-Bench, a curated dataset of 200 carefully selected cases across the coding and mathematical domains.

Table 1. Dataset statistics of CoDiQ-Bench.

📊 Performance

Generator Capability

CoDiQ-Gen-8B significantly outperforms the larger Qwen3-32B baseline in generating high-difficulty solvable questions on CoDiQ-Bench.

Table 2. Performance of different Long-CoT models on CoDiQ-Bench. Groups are ranked by the highest difficulty of solvable questions generated across 8 rounds without difficulty degradation. The best, second-best, and third-best scores for each indicator are shown boxed, in bold, and underlined, respectively.
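
One possible reading of the ranking criterion in the caption is sketched below. It is an interpretation, not the paper's scoring code. The function name `best_valid_difficulty`, the per-round `(difficulty, solvable)` input format, and the choice to treat only a strict decrease as "degradation" are all assumptions.

```python
from typing import List, Optional, Tuple


def best_valid_difficulty(rounds: List[Tuple[float, bool]]) -> Optional[float]:
    """Highest difficulty reached before the first unsolvable or easier round.

    rounds[i] holds the (difficulty score, solvable flag) of round i + 1.
    Returns None when no round is valid.
    """
    best: Optional[float] = None
    for difficulty, solvable in rounds:
        if not solvable:
            break  # an unsolvable question ends the valid prefix
        if best is not None and difficulty < best:
            break  # difficulty degraded relative to the previous round
        best = difficulty
    return best
```

Because each accepted round is at least as hard as the one before it, the last accepted value is also the maximum over the valid prefix.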

Difficulty Scaling

Figure 2 shows a positive correlation between reasoning-token volume and difficulty ranking.

Figure 2. Question Difficulty Scaling on CoDiQ-Bench. Scatter plot showing the relationship between average reasoning tokens and difficulty ranking (DR-AVG) for models using CoDiQ Prompt. Each point represents a model, demonstrating the positive correlation between increased reasoning computation and generated problem difficulty.

📋 CoDiQ-Corpus

Difficulty of CoDiQ-Corpus

CoDiQ-Corpus achieves higher difficulty ratings compared to standard competition benchmarks.

Table 3. Dataset difficulty comparison. The best, second-best, and third-best scores for each indicator are shown boxed, in bold, and underlined, respectively.

Statistics of CoDiQ-Corpus

We apply CoDiQ-Gen-8B within the CoDiQ pipeline to transform eight diverse mathematical and programming datasets into the more challenging CoDiQ-Corpus, which comprises 44,453 question sequences that progress from easy to hard. The detailed distribution is presented in Table 4.

Table 4. Dataset statistics of CoDiQ-Corpus.

🛠️ Quick Start

Installation

git clone https://github.com/ALEX-nlp/CoDiQ.git
cd CoDiQ
pip install -r requirements.txt

Inference: Generating Difficult Questions

You can use CoDiQ-Gen-8B to increase the complexity of any seed problem. To begin, update the configuration in tools_api.py, codiq_api.py, and count_tokens.py, then run the following script:

bash run.sh

📖 Citation

If you find CoDiQ useful for your research, please consider citing our paper:

@article{codiq2026,
  title={CoDiQ: Test-Time Scaling for Controllable Difficult Question Generation},
  author={Zhongyuan Peng and Caijun Xu and Changyi Xiao and Shibo Hong and Eli Zhang and Stephen Huang and Yixin Cao},
  journal={arXiv preprint arXiv:2602.01660},
  year={2026}
}
