
feat: add greedy packing, MiniCPM packing support, and dataset progress tracking #7904

Open
Lollipop wants to merge 16 commits into modelscope:release/3.12 from Lollipop:feat/greedy-packing-3.12

Conversation

@Lollipop

Summary

This PR adds several new features for training optimization, based on the release/3.12 branch.

  1. Greedy Packing (greedy_packing=True)

    • On-the-fly packing with O(1) overhead
    • Alternative to bin-packing preprocessing
    • Uses GreedyPackingDataLoader wrapper
  2. MiniCPM-V Packing Support

    • MiniCPMV2_6Template._data_collator for packing scenarios
    • packing_row method for multimodal data merging
    • Deferred dtype conversion for streaming+packing compatibility
  3. Dataset Progress Tracking (track_dataset_progress=True)

    • DatasetProgressCallback for per-dataset training progress
    • TensorBoard integration
    • Distributed training support
  4. Training Speed Statistics

    • train_speed(tokens/s) metric
    • total_tokens logging

New Parameters

  • greedy_packing: bool = False - Enable greedy packing
  • track_dataset_progress: bool = False - Enable progress tracking
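For illustration, a programmatic invocation might look like the sketch below, assuming the standard swift.llm Python entry point; the model and dataset values are placeholders, and the CLI equivalent simply passes `--greedy_packing true --track_dataset_progress true`.

```python
from swift.llm import TrainArguments, sft_main

# Placeholder model/dataset values; greedy_packing and track_dataset_progress
# are the two parameters introduced by this PR.
args = TrainArguments(
    model='OpenBMB/MiniCPM-V-2_6',
    dataset=['data1.json', 'data2.json'],
    train_type='lora',
    greedy_packing=True,
    track_dataset_progress=True,
)
sft_main(args)
```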

Files Changed

  • swift/llm/dataset/utils.py - GreedyPackingDataLoader
  • swift/llm/dataset/collator.py - ProgressTrackingCollator (new)
  • swift/llm/template/template/minicpm.py - MiniCPM packing support
  • swift/trainers/callback.py - DatasetProgressCallback
  • swift/trainers/mixin.py - GreedyPackingDataLoader integration

liuxiaoming added 15 commits January 26, 2026 21:19
- Add packing_row method to MiniCPMV2_6Template for image_bound offset
  adjustment and pixel_values/tgt_sizes concatenation
- Add packing_row method to MiniCPMV4_5Template with temporal_ids support
- Add unit tests for packing functionality
- Add _dataset_source column injection in DatasetLoader for tracking
- Preserve dataset_name or _dataset_source in Template.encode
- Handle _dataset_source in packing_row as list for packed samples
- Add _update_dataset_progress method to track and remove source field
- Add DatasetProgressCallback with distributed gather support
- Add track_dataset_progress argument in TrainArguments
- Register callback in SwiftSft when enabled
- Add unit tests (17 tests) for the complete flow

Usage: swift sft --track_dataset_progress True --dataset data1.json data2.json
TensorBoard will show dataset_progress/{source}: percentage for each dataset
- Fix pixel_values format: handle double-nested list [[Tensor]] correctly
- Fix image_bound format: cat all bounds into single Tensor [N, 2]
- Add _data_collator override to properly handle packing scenario
- Update packing_row to flatten pixel_values from [[T]] to [T]
- Add support_padding_free = True flag
- Update unit tests with correct double-nested pixel_values format
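Taken together, the packing_row merge these commits describe can be pictured roughly as follows. This is an illustrative sketch only (the real method lives in swift/llm/template/template/minicpm.py): it concatenates input_ids, shifts image_bound by the running token offset and cats the bounds into one [N, 2] tensor, and flattens pixel_values from [[Tensor]] to [Tensor].

```python
from typing import Any, Dict, List
import torch


def packing_row_sketch(rows: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Illustrative MiniCPM-V merge of several encoded samples into one packed row."""
    packed: Dict[str, Any] = {'input_ids': [], 'pixel_values': [], 'tgt_sizes': []}
    bounds: List[torch.Tensor] = []
    offset = 0
    for row in rows:
        packed['input_ids'] += row['input_ids']
        if row.get('image_bound') is not None:
            # image_bound holds (start, end) token positions per image; shift them
            # so they index into the packed sequence instead of the original sample.
            bounds.append(row['image_bound'] + offset)
        for per_image in row.get('pixel_values', []):
            packed['pixel_values'] += per_image  # flatten [[Tensor]] -> [Tensor]
        packed['tgt_sizes'] += row.get('tgt_sizes', [])
        offset += len(row['input_ids'])
    packed['image_bound'] = (torch.cat(bounds) if bounds
                             else torch.empty(0, 2, dtype=torch.long))
    return packed
```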
Refactor dataset progress tracking to use a non-invasive collator wrapper approach:

- Add ProgressTrackingCollator that extracts _dataset_source from batch samples
- Remove template dependency from DatasetProgressCallback
- Simplify progress tracking by collecting stats in main process via _batch_sources
- Add comprehensive tests for collator wrapper and callback methods

This approach minimizes code intrusion and makes it easier to upgrade ms-swift.
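As a sketch of the wrapper idea (the real ProgressTrackingCollator lives in swift/llm/dataset/collator.py and its exact fields may differ), the collator pops the tracking tag before the inner collator runs, then attaches per-batch statistics for the callback:

```python
from typing import Any, Callable, Dict, List


class ProgressTrackingCollatorSketch:
    """Illustrative wrapper: strip _dataset_source from each sample, delegate to
    the wrapped collator, and attach per-batch source/length statistics."""

    def __init__(self, collator: Callable[[List[Dict[str, Any]]], Dict[str, Any]]):
        self.collator = collator

    def __call__(self, batch: List[Dict[str, Any]]) -> Dict[str, Any]:
        sources = [item.pop('_dataset_source', None) for item in batch]
        lengths = [len(item['input_ids']) for item in batch]
        res = self.collator(batch)
        res['_batch_sources'] = sources   # consumed by DatasetProgressCallback
        res['_batch_lengths'] = lengths   # used for tokens/s statistics
        return res
```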
Fix NCCL timeout by moving is_world_process_zero check after _gather_counts().
Added distributed collective sync tests.
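The ordering matters because every rank has to enter the collective before any rank is allowed to return early. A minimal sketch of the pattern, using torch.distributed directly rather than the callback's own _gather_counts helper:

```python
import torch.distributed as dist


def gather_counts_then_log(local_counts: dict, is_world_process_zero: bool) -> None:
    """All ranks must call the collective; returning early on non-zero ranks before
    the gather would leave rank 0 blocked until the NCCL timeout fires."""
    if dist.is_available() and dist.is_initialized():
        gathered = [None] * dist.get_world_size()
        dist.all_gather_object(gathered, local_counts)  # every rank participates
    else:
        gathered = [local_counts]
    if not is_world_process_zero:
        return  # safe now: the collective has already completed on this rank
    merged: dict = {}
    for counts in gathered:
        for key, value in counts.items():
            merged[key] = merged.get(key, 0) + value
    # rank 0 would log `merged` to TensorBoard here
```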
- Change progress calculation in DatasetProgressCallback to reflect epoch count instead of percentage.
- Update logging precision to four decimal places for improved accuracy.
- Add comment to clarify the new progress representation.
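In other words, the logged value now reads as "how many times this dataset has been seen", along these lines:

```python
def dataset_progress(consumed: int, original_size: int) -> float:
    """Progress as an epoch count: 1.25 means the dataset has been traversed
    1.25 times, rather than a percentage clamped at 100%."""
    return consumed / original_size


# Logged with four decimal places, e.g. dataset_progress/data1.json: 1.2500
assert f'{dataset_progress(1250, 1000):.4f}' == '1.2500'
```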
Handle dataset source extraction for packing, use spawn context queues,
and make MiniCPM packing work in subprocesses.

Introduce a standalone packing processor function to enhance multiprocessing support in the PackingDataset class. This change avoids pickling issues with Queue objects by using a separate function instead of an instance method, improving the robustness of data encoding in a multiprocessing context.
Eliminate dataset progress tracking functionality from the Template class and related methods. This change simplifies the code by removing unused variables and methods associated with dataset source tracking, enhancing maintainability.
- Add greedy_packing parameter for non-streaming datasets
  - Avoids binpacking preprocessing overhead
  - Uses O(1) greedy strategy at DataLoader level
  - Automatically enables padding_free

- Add tokens/s training speed statistics
  - Collect batch_lengths in ProgressTrackingCollator
  - Support distributed training with gather_object
  - Display train_speed(tokens/s) and total_tokens in logs

- New GreedyPackingDataLoader class
  - Wraps DataLoader with greedy packing layer
  - Reuses the DataLoader's worker processes and prefetching
  - Calls template.packing_row for multimodal support
- Add greedy_packing and packing_length to TrainArgumentsMixin for proper parameter passing
- Change packing_length to base_length * batch_size for better batch control
- Preserve _dataset_source in LazyLLMDataset for progress tracking
- Add _batch_lengths and _dataset_source to packed output for tokens/s statistics
- Clean up debug logging code
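A rough outline of what such a wrapper can look like, assuming the inner DataLoader yields individual encoded samples and the template exposes the packing_row method added above; the real class in swift/llm/dataset/utils.py differs in detail:

```python
from typing import Any, Dict, Iterator
from torch.utils.data import DataLoader


class GreedyPackingDataLoaderSketch:
    """Illustrative greedy packing layer: keep the wrapped DataLoader's workers
    and prefetching, fill a pack until the next sample would exceed
    packing_length, then let the template merge the pack into one row."""

    def __init__(self, dataloader: DataLoader, template: Any, packing_length: int):
        self.dataloader = dataloader
        self.template = template
        self.packing_length = packing_length

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        buffer, used = [], 0
        for sample in self.dataloader:
            length = len(sample['input_ids'])
            if buffer and used + length > self.packing_length:
                yield self.template.packing_row(buffer)
                buffer, used = [], 0
            buffer.append(sample)
            used += length
        if buffer:
            yield self.template.packing_row(buffer)  # flush the final partial pack
```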
Change _dataset_source to _batch_sources in GreedyPackingDataLoader output
to match DatasetProgressCallback's expected field name.
The packing_collator (template.data_collator) modifies the buffer in-place via `batch[:] = [packing_row(batch)]`, which destroys the original samples and their _dataset_source fields.

Fix by collecting batch_lengths and sources BEFORE calling packing_collator.
This ensures dataset_progress tracking works correctly with greedy_packing.
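The failure mode is easy to reproduce in isolation; a toy illustration of why the collection has to happen first:

```python
rows = [{'input_ids': [1, 2], '_dataset_source': 'data1.json'},
        {'input_ids': [3, 4, 5], '_dataset_source': 'data2.json'}]

# Collect tracking info BEFORE packing.
sources = [r['_dataset_source'] for r in rows]
lengths = [len(r['input_ids']) for r in rows]

# An in-place merge in the style of `batch[:] = [packing_row(batch)]`
# replaces the original samples, so the source fields are gone afterwards.
rows[:] = [{'input_ids': [1, 2, 3, 4, 5]}]

assert sources == ['data1.json', 'data2.json'] and lengths == [2, 3]
```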
- Introduced tracking of original dataset sizes before mixing/resampling to improve training progress metrics.
- Updated load_dataset function to store original sizes for both training and validation datasets.
- Modified DatasetProgressCallback to utilize original dataset sizes for accurate epoch-based progress reporting.
- Added logging for original dataset sizes to aid in debugging and monitoring during training.
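A toy version of the idea, using Hugging Face datasets directly (the actual change records the sizes inside load_dataset):

```python
from datasets import Dataset, concatenate_datasets

d1 = Dataset.from_dict({'text': ['a'] * 100})
d2 = Dataset.from_dict({'text': ['b'] * 400})

# Record per-source sizes before mixing: once the datasets are concatenated and
# shuffled, the per-source totals can no longer be read off the mixed result,
# so the callback divides consumed-sample counts by these stored sizes.
original_sizes = {'data1': len(d1), 'data2': len(d2)}
mixed = concatenate_datasets([d1, d2]).shuffle(seed=42)
assert len(mixed) == sum(original_sizes.values())
```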

Remove the forced spawn multiprocessing context and standalone _packing_processor
function. The default fork mode on Linux works correctly and is more efficient.

This change restores the original _processor instance method implementation.
@gemini-code-assist
Contributor

Summary of Changes

Hello @Lollipop, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the training capabilities by introducing an efficient greedy packing method for datasets, enabling seamless packing for multimodal MiniCPM-V models, and providing detailed per-dataset training progress tracking. These improvements aim to optimize training performance, reduce preprocessing overhead, and offer better visibility into the training process, particularly for complex multi-dataset and multimodal setups.

Highlights

  • Greedy Packing Implementation: Introduced an on-the-fly greedy packing mechanism (greedy_packing=True) as an alternative to traditional bin-packing preprocessing. This is handled by a new GreedyPackingDataLoader wrapper, which reuses existing DataLoader functionalities and has O(1) overhead.
  • MiniCPM-V Multimodal Packing Support: Added comprehensive packing support for MiniCPM-V models, specifically MiniCPMV2_6Template and MiniCPMV4_5Template. This includes a specialized packing_row method to correctly merge multimodal data fields like pixel_values, image_bound, and tgt_sizes (and temporal_ids for V4.5), adjusting token offsets as needed. Dtype conversion for multimodal data is now deferred to the _data_collator to ensure compatibility with streaming and multiprocessing.
  • Dataset Progress Tracking: Implemented a DatasetProgressCallback (track_dataset_progress=True) to monitor training progress per dataset, especially useful in multi-dataset training scenarios with mixing or resampling. This callback integrates with TensorBoard and accurately tracks epoch-based progress by leveraging original dataset sizes. A new ProgressTrackingCollator wraps the data collator to extract dataset source and token length information from batches.
  • Enhanced Training Speed Statistics: Added train_speed(tokens/s) and total_tokens metrics to provide more granular insights into training performance. These metrics are gathered across distributed training processes for accurate reporting.
  • Argument and Infrastructure Updates: New arguments greedy_packing and track_dataset_progress have been added to TrainArguments. The dataset loading and processing pipeline has been updated to inject dataset source identifiers and store original dataset sizes for the progress tracking feature.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces several significant features: greedy packing for training optimization, MiniCPM-V packing support for multimodal data, and comprehensive dataset progress tracking. The changes include new arguments for enabling these features, a ProgressTrackingCollator to extract dataset source and token length information, and a GreedyPackingDataLoader for on-the-fly packing. Additionally, the MiniCPM templates have been updated to correctly handle multimodal data during packing and to defer dtype conversion for streaming compatibility. Unit tests have been added for all new functionalities, including crucial distributed training scenarios, ensuring robustness and correctness. Overall, the PR enhances training efficiency and provides better observability into multi-dataset training progress.

- Move _extract_info from inner function to class method to avoid
  redefinition overhead on each __call__ invocation
- Extract duplicated sources/lengths collection logic into
  _collect_sources_and_lengths method
- Use item.pop() instead of get() + del for cleaner code
- Add type hints for better code clarity
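For instance, the pop() change mentioned above collapses the two-step lookup-and-delete into one call with an optional default:

```python
item = {'_dataset_source': 'data1.json', 'input_ids': [1, 2, 3]}

# Before: source = item.get('_dataset_source'); del item['_dataset_source']
# After: one step, with a default when the key is absent.
source = item.pop('_dataset_source', None)
assert source == 'data1.json' and '_dataset_source' not in item
```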