
feat: add greedy packing, MiniCPM packing support, and dataset progress tracking #7904

Open
Lollipop wants to merge 16 commits into modelscope:release/3.12 from Lollipop:feat/greedy-packing-3.12

Conversation

@Lollipop

Summary

This PR adds several new features for training optimization, based on the release/3.12 branch.

  1. Greedy Packing (greedy_packing=True)

    • On-the-fly packing with O(1) overhead
    • Alternative to bin-packing preprocessing
    • Uses GreedyPackingDataLoader wrapper
  2. MiniCPM-V Packing Support

    • MiniCPMV2_6Template._data_collator for packing scenarios
    • packing_row method for multimodal data merging
    • Deferred dtype conversion for streaming+packing compatibility
  3. Dataset Progress Tracking (track_dataset_progress=True)

    • DatasetProgressCallback for per-dataset training progress
    • TensorBoard integration
    • Distributed training support
  4. Training Speed Statistics

    • train_speed(tokens/s) metric
    • total_tokens logging

New Parameters

  • greedy_packing: bool = False - Enable greedy packing
  • track_dataset_progress: bool = False - Enable progress tracking
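For illustration, a programmatic invocation might look like the sketch below, assuming the standard swift.llm Python entry point; the model and dataset values are placeholders, and the CLI equivalent simply passes `--greedy_packing true --track_dataset_progress true`.

```python
from swift.llm import TrainArguments, sft_main

# Placeholder model/dataset values; greedy_packing and track_dataset_progress
# are the two parameters introduced by this PR.
args = TrainArguments(
    model='OpenBMB/MiniCPM-V-2_6',
    dataset=['data1.json', 'data2.json'],
    train_type='lora',
    greedy_packing=True,
    track_dataset_progress=True,
)
sft_main(args)
```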

Files Changed

  • swift/llm/dataset/utils.py - GreedyPackingDataLoader
  • swift/llm/dataset/collator.py - ProgressTrackingCollator (new)
  • swift/llm/template/template/minicpm.py - MiniCPM packing support
  • swift/trainers/callback.py - DatasetProgressCallback
  • swift/trainers/mixin.py - GreedyPackingDataLoader integration

liuxiaoming added 15 commits January 26, 2026 21:19
- Add packing_row method to MiniCPMV2_6Template for image_bound offset
  adjustment and pixel_values/tgt_sizes concatenation
- Add packing_row method to MiniCPMV4_5Template with temporal_ids support
- Add unit tests for packing functionality
- Add _dataset_source column injection in DatasetLoader for tracking
- Preserve dataset_name or _dataset_source in Template.encode
- Handle _dataset_source in packing_row as list for packed samples
- Add _update_dataset_progress method to track and remove source field
- Add DatasetProgressCallback with distributed gather support
- Add track_dataset_progress argument in TrainArguments
- Register callback in SwiftSft when enabled
- Add unit tests (17 tests) for the complete flow

Usage: swift sft --track_dataset_progress True --dataset data1.json data2.json
TensorBoard will show dataset_progress/{source}: percentage for each dataset
- Fix pixel_values format: handle double-nested list [[Tensor]] correctly
- Fix image_bound format: cat all bounds into single Tensor [N, 2]
- Add _data_collator override to properly handle packing scenario
- Update packing_row to flatten pixel_values from [[T]] to [T]
- Add support_padding_free = True flag
- Update unit tests with correct double-nested pixel_values format
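Taken together, the packing_row merge these commits describe can be pictured roughly as follows. This is an illustrative sketch only (the real method lives in swift/llm/template/template/minicpm.py): it concatenates input_ids, shifts image_bound by the running token offset and cats the bounds into one [N, 2] tensor, and flattens pixel_values from [[Tensor]] to [Tensor].

```python
from typing import Any, Dict, List
import torch


def packing_row_sketch(rows: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Illustrative MiniCPM-V merge of several encoded samples into one packed row."""
    packed: Dict[str, Any] = {'input_ids': [], 'pixel_values': [], 'tgt_sizes': []}
    bounds: List[torch.Tensor] = []
    offset = 0
    for row in rows:
        packed['input_ids'] += row['input_ids']
        if row.get('image_bound') is not None:
            # image_bound holds (start, end) token positions per image; shift them
            # so they index into the packed sequence instead of the original sample.
            bounds.append(row['image_bound'] + offset)
        for per_image in row.get('pixel_values', []):
            packed['pixel_values'] += per_image  # flatten [[Tensor]] -> [Tensor]
        packed['tgt_sizes'] += row.get('tgt_sizes', [])
        offset += len(row['input_ids'])
    packed['image_bound'] = (torch.cat(bounds) if bounds
                             else torch.empty(0, 2, dtype=torch.long))
    return packed
```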
Refactor dataset progress tracking to use a non-invasive collator wrapper approach:

- Add ProgressTrackingCollator that extracts _dataset_source from batch samples
- Remove template dependency from DatasetProgressCallback
- Simplify progress tracking by collecting stats in main process via _batch_sources
- Add comprehensive tests for collator wrapper and callback methods

This approach minimizes code intrusion and makes it easier to upgrade ms-swift.
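As a sketch of the wrapper idea (the real ProgressTrackingCollator lives in swift/llm/dataset/collator.py and its exact fields may differ), the collator pops the tracking tag before the inner collator runs, then attaches per-batch statistics for the callback:

```python
from typing import Any, Callable, Dict, List


class ProgressTrackingCollatorSketch:
    """Illustrative wrapper: strip _dataset_source from each sample, delegate to
    the wrapped collator, and attach per-batch source/length statistics."""

    def __init__(self, collator: Callable[[List[Dict[str, Any]]], Dict[str, Any]]):
        self.collator = collator

    def __call__(self, batch: List[Dict[str, Any]]) -> Dict[str, Any]:
        sources = [item.pop('_dataset_source', None) for item in batch]
        lengths = [len(item['input_ids']) for item in batch]
        res = self.collator(batch)
        res['_batch_sources'] = sources   # consumed by DatasetProgressCallback
        res['_batch_lengths'] = lengths   # used for tokens/s statistics
        return res
```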
Fix NCCL timeout by moving is_world_process_zero check after _gather_counts().
Added distributed collective sync tests.
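The ordering matters because every rank has to enter the collective before any rank is allowed to return early. A minimal sketch of the pattern, using torch.distributed directly rather than the callback's own _gather_counts helper:

```python
import torch.distributed as dist


def gather_counts_then_log(local_counts: dict, is_world_process_zero: bool) -> None:
    """All ranks must call the collective; returning early on non-zero ranks before
    the gather would leave rank 0 blocked until the NCCL timeout fires."""
    if dist.is_available() and dist.is_initialized():
        gathered = [None] * dist.get_world_size()
        dist.all_gather_object(gathered, local_counts)  # every rank participates
    else:
        gathered = [local_counts]
    if not is_world_process_zero:
        return  # safe now: the collective has already completed on this rank
    merged: dict = {}
    for counts in gathered:
        for key, value in counts.items():
            merged[key] = merged.get(key, 0) + value
    # rank 0 would log `merged` to TensorBoard here
```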
- Change progress calculation in DatasetProgressCallback to reflect epoch count instead of percentage.
- Update logging precision to four decimal places for improved accuracy.
- Add comment to clarify the new progress representation.
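In other words, the logged value now reads as "how many times this dataset has been seen", along these lines:

```python
def dataset_progress(consumed: int, original_size: int) -> float:
    """Progress as an epoch count: 1.25 means the dataset has been traversed
    1.25 times, rather than a percentage clamped at 100%."""
    return consumed / original_size


# Logged with four decimal places, e.g. dataset_progress/data1.json: 1.2500
assert f'{dataset_progress(1250, 1000):.4f}' == '1.2500'
```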
Handle dataset source extraction for packing, use spawn context queues,
and make MiniCPM packing work in subprocesses.

Introduce a standalone packing processor function to enhance multiprocessing support in the PackingDataset class. This change avoids pickling issues with Queue objects by using a separate function instead of an instance method, improving the robustness of data encoding in a multiprocessing context.
Eliminate dataset progress tracking functionality from the Template class and related methods. This change simplifies the code by removing unused variables and methods associated with dataset source tracking, enhancing maintainability.
- Add greedy_packing parameter for non-streaming datasets
  - Avoids binpacking preprocessing overhead
  - Uses O(1) greedy strategy at DataLoader level
  - Automatically enables padding_free

- Add tokens/s training speed statistics
  - Collect batch_lengths in ProgressTrackingCollator
  - Support distributed training with gather_object
  - Display train_speed(tokens/s) and total_tokens in logs

- New GreedyPackingDataLoader class
  - Wraps DataLoader with greedy packing layer
  - Reuses the DataLoader's worker processes and prefetching
  - Calls template.packing_row for multimodal support
- Add greedy_packing and packing_length to TrainArgumentsMixin for proper parameter passing
- Change packing_length to base_length * batch_size for better batch control
- Preserve _dataset_source in LazyLLMDataset for progress tracking
- Add _batch_lengths and _dataset_source to packed output for tokens/s statistics
- Clean up debug logging code
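A rough outline of what such a wrapper can look like, assuming the inner DataLoader yields individual encoded samples and the template exposes the packing_row method added above; the real class in swift/llm/dataset/utils.py differs in detail:

```python
from typing import Any, Dict, Iterator
from torch.utils.data import DataLoader


class GreedyPackingDataLoaderSketch:
    """Illustrative greedy packing layer: keep the wrapped DataLoader's workers
    and prefetching, fill a pack until the next sample would exceed
    packing_length, then let the template merge the pack into one row."""

    def __init__(self, dataloader: DataLoader, template: Any, packing_length: int):
        self.dataloader = dataloader
        self.template = template
        self.packing_length = packing_length

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        buffer, used = [], 0
        for sample in self.dataloader:
            length = len(sample['input_ids'])
            if buffer and used + length > self.packing_length:
                yield self.template.packing_row(buffer)
                buffer, used = [], 0
            buffer.append(sample)
            used += length
        if buffer:
            yield self.template.packing_row(buffer)  # flush the final partial pack
```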
Change _dataset_source to _batch_sources in GreedyPackingDataLoader output
to match DatasetProgressCallback's expected field name.
The packing_collator (template.data_collator) modifies the buffer in-place via `batch[:] = [packing_row(batch)]`, which destroys the original samples and their _dataset_source fields.

Fix by collecting batch_lengths and sources BEFORE calling packing_collator.
This ensures dataset_progress tracking works correctly with greedy_packing.
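The failure mode is easy to reproduce in isolation; a toy illustration of why the collection has to happen first:

```python
rows = [{'input_ids': [1, 2], '_dataset_source': 'data1.json'},
        {'input_ids': [3, 4, 5], '_dataset_source': 'data2.json'}]

# Collect tracking info BEFORE packing.
sources = [r['_dataset_source'] for r in rows]
lengths = [len(r['input_ids']) for r in rows]

# An in-place merge in the style of `batch[:] = [packing_row(batch)]`
# replaces the original samples, so the source fields are gone afterwards.
rows[:] = [{'input_ids': [1, 2, 3, 4, 5]}]

assert sources == ['data1.json', 'data2.json'] and lengths == [2, 3]
```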
- Introduced tracking of original dataset sizes before mixing/resampling to improve training progress metrics.
- Updated load_dataset function to store original sizes for both training and validation datasets.
- Modified DatasetProgressCallback to utilize original dataset sizes for accurate epoch-based progress reporting.
- Added logging for original dataset sizes to aid in debugging and monitoring during training.
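A toy version of the idea, using Hugging Face datasets directly (the actual change records the sizes inside load_dataset):

```python
from datasets import Dataset, concatenate_datasets

d1 = Dataset.from_dict({'text': ['a'] * 100})
d2 = Dataset.from_dict({'text': ['b'] * 400})

# Record per-source sizes before mixing: once the datasets are concatenated and
# shuffled, the per-source totals can no longer be read off the mixed result,
# so the callback divides consumed-sample counts by these stored sizes.
original_sizes = {'data1': len(d1), 'data2': len(d2)}
mixed = concatenate_datasets([d1, d2]).shuffle(seed=42)
assert len(mixed) == sum(original_sizes.values())
```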

Remove the forced spawn multiprocessing context and standalone _packing_processor
function. The default fork mode on Linux works correctly and is more efficient.

This change restores the original _processor instance method implementation.
@gemini-code-assist
Contributor

Summary of Changes

Hello @Lollipop, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the training capabilities by introducing an efficient greedy packing method for datasets, enabling seamless packing for multimodal MiniCPM-V models, and providing detailed per-dataset training progress tracking. These improvements aim to optimize training performance, reduce preprocessing overhead, and offer better visibility into the training process, particularly for complex multi-dataset and multimodal setups.

Highlights

  • Greedy Packing Implementation: Introduced an on-the-fly greedy packing mechanism (greedy_packing=True) as an alternative to traditional bin-packing preprocessing. This is handled by a new GreedyPackingDataLoader wrapper, which reuses existing DataLoader functionalities and has O(1) overhead.
  • MiniCPM-V Multimodal Packing Support: Added comprehensive packing support for MiniCPM-V models, specifically MiniCPMV2_6Template and MiniCPMV4_5Template. This includes a specialized packing_row method to correctly merge multimodal data fields like pixel_values, image_bound, and tgt_sizes (and temporal_ids for V4.5), adjusting token offsets as needed. Dtype conversion for multimodal data is now deferred to the _data_collator to ensure compatibility with streaming and multiprocessing.
  • Dataset Progress Tracking: Implemented a DatasetProgressCallback (track_dataset_progress=True) to monitor training progress per dataset, especially useful in multi-dataset training scenarios with mixing or resampling. This callback integrates with TensorBoard and accurately tracks epoch-based progress by leveraging original dataset sizes. A new ProgressTrackingCollator wraps the data collator to extract dataset source and token length information from batches.
  • Enhanced Training Speed Statistics: Added train_speed(tokens/s) and total_tokens metrics to provide more granular insights into training performance. These metrics are gathered across distributed training processes for accurate reporting.
  • Argument and Infrastructure Updates: New arguments greedy_packing and track_dataset_progress have been added to TrainArguments. The dataset loading and processing pipeline has been updated to inject dataset source identifiers and store original dataset sizes for the progress tracking feature.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces several significant features: greedy packing for training optimization, MiniCPM-V packing support for multimodal data, and comprehensive dataset progress tracking. The changes include new arguments for enabling these features, a ProgressTrackingCollator to extract dataset source and token length information, and a GreedyPackingDataLoader for on-the-fly packing. Additionally, the MiniCPM templates have been updated to correctly handle multimodal data during packing and to defer dtype conversion for streaming compatibility. Unit tests have been added for all new functionalities, including crucial distributed training scenarios, ensuring robustness and correctness. Overall, the PR enhances training efficiency and provides better observability into multi-dataset training progress.

- Move _extract_info from inner function to class method to avoid
  redefinition overhead on each __call__ invocation
- Extract duplicated sources/lengths collection logic into
  _collect_sources_and_lengths method
- Use item.pop() instead of get() + del for cleaner code
- Add type hints for better code clarity
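For instance, the pop() change mentioned above collapses the two-step lookup-and-delete into one call with an optional default:

```python
item = {'_dataset_source': 'data1.json', 'input_ids': [1, 2, 3]}

# Before: source = item.get('_dataset_source'); del item['_dataset_source']
# After: one step, with a default when the key is absent.
source = item.pop('_dataset_source', None)
assert source == 'data1.json' and '_dataset_source' not in item
```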