- [Coming soon]: Full code, models, datasets, more R&D...
- [24/11/2024]: Uploaded paper to GitHub repo.
In recent months, OpenAI o1 has shown promising progress in solving complex reasoning tasks by synthesizing long chains of thought (CoT) before giving a final answer. This approach has demonstrated the potential to enhance performance on reasoning and coding tasks by increasing test-time compute. Existing open-source approaches remain limited by the need for human labeling, distilled datasets, or grounded verifiers; a self-improving framework for open-ended reasoning tasks has yet to be fully explored.
This paper introduces SeDiR, a novel framework for enabling fully open-ended self-improvement in reasoning LLMs. By leveraging the diversity of data at both pretraining and post-training stages, SeDiR iteratively generates and scores high-quality reasoning traces without requiring human intervention or seed data. This is a report on replicating o1-like reasoning capabilities with open-ended self-improving systems.
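To make the "iteratively generates and scores" idea concrete, here is a minimal toy sketch of a generic generate-score-keep loop. It is not the SeDiR implementation: the function names (`self_improvement_round`, `run_loop`), the parameters (`samples_per_prompt`, `keep_top`), and the toy generator and scorer are all hypothetical stand-ins chosen for illustration. In a real system the generator and scorer would be LLM calls, and the kept traces would presumably feed a training step.

```python
from typing import Callable, List, Tuple

# (prompt, reasoning trace, score); hypothetical record layout for this sketch
Trace = Tuple[str, str, float]


def self_improvement_round(
    prompts: List[str],
    generate: Callable[[str, int], str],
    score: Callable[[str, str], float],
    samples_per_prompt: int = 4,
    keep_top: int = 1,
) -> List[Trace]:
    """Sample several traces per prompt, score them, keep the best ones."""
    kept: List[Trace] = []
    for prompt in prompts:
        candidates = [generate(prompt, i) for i in range(samples_per_prompt)]
        scored = sorted(
            ((score(prompt, c), c) for c in candidates),
            key=lambda pair: pair[0],
            reverse=True,
        )
        kept.extend((prompt, c, s) for s, c in scored[:keep_top])
    return kept


def run_loop(
    prompts: List[str],
    generate: Callable[[str, int], str],
    score: Callable[[str, str], float],
    rounds: int = 2,
    **kwargs,
) -> List[Trace]:
    """Repeat the round and accumulate kept traces.

    Assumption: a real pipeline would update the model on the pool
    between rounds; that step is omitted here.
    """
    pool: List[Trace] = []
    for _ in range(rounds):
        pool.extend(self_improvement_round(prompts, generate, score, **kwargs))
    return pool


# Toy stand-ins (assumptions): a "model" that emits longer traces for higher
# sample indices, and a "scorer" that simply counts whitespace-separated tokens.
def toy_generate(prompt: str, sample_index: int) -> str:
    steps = " ".join(f"step{j}" for j in range(sample_index + 1))
    return f"{prompt}: {steps}"


def toy_score(prompt: str, trace: str) -> float:
    return float(len(trace.split()))


if __name__ == "__main__":
    for record in run_loop(["Q1", "Q2"], toy_generate, toy_score):
        print(record)
```

The loop needs no labeled data or seed traces, only prompts, a generator, and a scorer, which is the property the abstract emphasizes. How SeDiR actually generates and scores traces is described in the paper.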