feat: add greedy packing, MiniCPM packing support, and dataset progress tracking #7904
Lollipop wants to merge 16 commits into modelscope:release/3.12
Conversation
- Add packing_row method to MiniCPMV2_6Template for image_bound offset adjustment and pixel_values/tgt_sizes concatenation (sketched below)
- Add packing_row method to MiniCPMV4_5Template with temporal_ids support
- Add unit tests for packing functionality
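For orientation, here is a minimal sketch of the kind of merge such a `packing_row` method performs, assuming each encoded row carries `input_ids`, an `image_bound` tensor of `[start, end]` pairs, and lists of `pixel_values`/`tgt_sizes`; the function below is illustrative, not the actual template code.

```python
from typing import Any, Dict, List

import torch


def packing_row_sketch(rows: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Illustrative merge of several encoded rows into one packed row."""
    packed: Dict[str, Any] = {'input_ids': [], 'pixel_values': [], 'tgt_sizes': []}
    bounds = []
    offset = 0
    for row in rows:
        packed['input_ids'] += row['input_ids']
        # Shift image_bound indices by the number of tokens packed before this row.
        bounds.append(row['image_bound'] + offset)
        offset += len(row['input_ids'])
        # Concatenate the per-row image features and target sizes.
        packed['pixel_values'] += row['pixel_values']
        packed['tgt_sizes'] += row['tgt_sizes']
    # Collect all bounds into a single [N, 2] tensor (the format a later commit settles on).
    packed['image_bound'] = torch.cat(bounds) if bounds else torch.zeros(0, 2, dtype=torch.long)
    return packed
```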
- Add _dataset_source column injection in DatasetLoader for tracking
- Preserve dataset_name or _dataset_source in Template.encode
- Handle _dataset_source in packing_row as list for packed samples
- Add _update_dataset_progress method to track and remove source field
- Add DatasetProgressCallback with distributed gather support
- Add track_dataset_progress argument in TrainArguments
- Register callback in SwiftSft when enabled
- Add unit tests (17 tests) for the complete flow
Usage: swift sft --track_dataset_progress True --dataset data1.json data2.json
TensorBoard will show dataset_progress/{source}: percentage for each dataset
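As a rough illustration of what such a metric involves, here is a hedged sketch that turns per-source sample counts into `dataset_progress/{source}` values; the metric name follows the commit message, everything else (function and variable names) is made up for the example.

```python
from typing import Dict


def dataset_progress_metrics(consumed: Dict[str, int], total: Dict[str, int]) -> Dict[str, float]:
    """Turn per-source consumed/total sample counts into loggable metrics."""
    metrics = {}
    for source, seen in consumed.items():
        size = total.get(source, 0)
        if size > 0:
            # Rounded to four decimals; later commits report this as an epoch
            # count rather than a percentage, so values above 1.0 are possible.
            metrics[f'dataset_progress/{source}'] = round(seen / size, 4)
    return metrics


# 50 of 200 samples from data1 consumed so far -> {'dataset_progress/data1': 0.25}
print(dataset_progress_metrics({'data1': 50}, {'data1': 200}))
```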
- Fix pixel_values format: handle double-nested list [[Tensor]] correctly
- Fix image_bound format: cat all bounds into single Tensor [N, 2]
- Add _data_collator override to properly handle packing scenario
- Update packing_row to flatten pixel_values from [[T]] to [T]
- Add support_padding_free = True flag
- Update unit tests with correct double-nested pixel_values format
Refactor dataset progress tracking to use a non-invasive collator wrapper approach:
- Add ProgressTrackingCollator that extracts _dataset_source from batch samples (sketched below)
- Remove template dependency from DatasetProgressCallback
- Simplify progress tracking by collecting stats in main process via _batch_sources
- Add comprehensive tests for collator wrapper and callback methods

This approach minimizes code intrusion and makes it easier to upgrade ms-swift.
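A minimal sketch of the wrapper idea, assuming samples carry a `_dataset_source` field injected earlier in the pipeline; the class name matches the PR, but the body below is illustrative rather than the actual ms-swift implementation.

```python
from typing import Any, Callable, Dict, List


class ProgressTrackingCollatorSketch:
    """Wraps an existing data collator and records where each sample came from."""

    def __init__(self, base_collator: Callable[[List[Dict[str, Any]]], Dict[str, Any]]):
        self.base_collator = base_collator

    def __call__(self, batch: List[Dict[str, Any]]) -> Dict[str, Any]:
        # Pop the bookkeeping field before the wrapped collator sees the samples,
        # so the tensors handed to the model are unchanged.
        sources = [item.pop('_dataset_source', None) for item in batch]
        lengths = [len(item.get('input_ids', [])) for item in batch]
        out = self.base_collator(batch)
        # Side-channel fields that a callback can read and strip before the model step.
        out['_batch_sources'] = sources
        out['_batch_lengths'] = lengths
        return out
```

Because the source field is popped before the wrapped collator runs, the model inputs stay exactly what the template produces, which is what makes the approach non-invasive.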
Fix NCCL timeout by moving is_world_process_zero check after _gather_counts(). Added distributed collective sync tests.
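The ordering matters because collective operations must be entered by every rank; below is a hedged sketch under the assumption that `torch.distributed` is initialized (helper names are illustrative).

```python
from typing import Dict

import torch.distributed as dist


def gather_counts(local_counts: Dict[str, int]) -> Dict[str, int]:
    """Sum per-source sample counts across ranks; every rank must call this."""
    if not (dist.is_available() and dist.is_initialized()):
        return dict(local_counts)
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local_counts)  # collective: all ranks participate
    totals: Dict[str, int] = {}
    for counts in gathered:
        for source, n in counts.items():
            totals[source] = totals.get(source, 0) + n
    return totals


def on_log_sketch(is_world_process_zero: bool, local_counts: Dict[str, int]) -> None:
    totals = gather_counts(local_counts)  # run the collective on every rank first...
    if not is_world_process_zero:         # ...and only then let non-zero ranks return
        return
    print(totals)                         # rank 0 reports the aggregated counts
```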
- Change progress calculation in DatasetProgressCallback to reflect epoch count instead of percentage.
- Update logging precision to four decimal places for improved accuracy.
- Add comment to clarify the new progress representation.
Handle dataset source extraction for packing, use spawn context queues, and make MiniCPM packing work in subprocesses.
Introduce a standalone packing processor function to enhance multiprocessing support in the PackingDataset class. This change avoids pickling issues with Queue objects by using a separate function instead of an instance method, improving the robustness of data encoding in a multiprocessing context.
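For readers unfamiliar with the pickling constraint, here is a hedged, generic sketch (not the PackingDataset code): with the spawn start method the worker target must be picklable, so a module-level function that receives the queues as arguments is safer than a bound method on an object that holds them.

```python
import multiprocessing as mp


def _packing_worker(in_queue, out_queue):
    """Module-level target: spawn can pickle it by its qualified name."""
    while True:
        item = in_queue.get()
        if item is None:           # sentinel: no more work
            break
        out_queue.put(item)        # a real processor would encode/pack here


if __name__ == '__main__':
    ctx = mp.get_context('spawn')
    in_q, out_q = ctx.Queue(), ctx.Queue()
    worker = ctx.Process(target=_packing_worker, args=(in_q, out_q))
    worker.start()
    in_q.put({'input_ids': [1, 2, 3]})
    in_q.put(None)
    print(out_q.get())             # {'input_ids': [1, 2, 3]}
    worker.join()
```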
Eliminate dataset progress tracking functionality from the Template class and related methods. This change simplifies the code by removing unused variables and methods associated with dataset source tracking, enhancing maintainability.
- Add greedy_packing parameter for non-streaming datasets
  - Avoids binpacking preprocessing overhead
  - Uses O(1) greedy strategy at DataLoader level
  - Automatically enables padding_free
- Add tokens/s training speed statistics
  - Collect batch_lengths in ProgressTrackingCollator
  - Support distributed training with gather_object
  - Display train_speed(tokens/s) and total_tokens in logs
- New GreedyPackingDataLoader class
  - Wraps DataLoader with greedy packing layer
  - Reuses DataLoader's multi-workers and prefetch
  - Calls template.packing_row for multimodal support (a sketch of the greedy strategy follows this list)
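The O(1) greedy strategy referenced above can be pictured as a buffer that flushes whenever the next sample would exceed the packing length. The sketch below assumes samples expose `input_ids` and is illustrative; it is not the `GreedyPackingDataLoader` implementation itself.

```python
from typing import Any, Dict, Iterable, Iterator, List


def greedy_pack(samples: Iterable[Dict[str, Any]],
                packing_length: int) -> Iterator[List[Dict[str, Any]]]:
    """Greedily group consecutive samples so each group stays within packing_length."""
    buffer: List[Dict[str, Any]] = []
    buffer_len = 0
    for sample in samples:
        n = len(sample['input_ids'])
        # Each decision looks only at the running buffer length: O(1) per sample,
        # unlike binpacking, which preprocesses the whole dataset up front.
        if buffer and buffer_len + n > packing_length:
            yield buffer            # a downstream packing_row/collator merges this group
            buffer, buffer_len = [], 0
        buffer.append(sample)
        buffer_len += n
    if buffer:
        yield buffer
```

Because the grouping happens on top of an ordinary DataLoader iterator, the existing multi-worker loading and prefetch can be reused unchanged.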
- Add greedy_packing and packing_length to TrainArgumentsMixin for proper parameter passing
- Change packing_length to base_length * batch_size for better batch control
- Preserve _dataset_source in LazyLLMDataset for progress tracking
- Add _batch_lengths and _dataset_source to packed output for tokens/s statistics
- Clean up debug logging code
Change _dataset_source to _batch_sources in GreedyPackingDataLoader output to match DatasetProgressCallback's expected field name.
The packing_collator (template.data_collator) modifies buffer in-place via `batch[:] = [packing_row(batch)]`, which destroys the original samples and their _dataset_source fields. Fix by collecting batch_lengths and sources BEFORE calling packing_collator. This ensures dataset_progress tracking works correctly with greedy_packing.
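A small sketch of the resulting ordering, with illustrative names: read the bookkeeping fields first, then hand the buffer to the collator that rewrites it in place.

```python
from typing import Any, Callable, Dict, List


def pack_with_tracking(buffer: List[Dict[str, Any]],
                       packing_collator: Callable[[List[Dict[str, Any]]], Dict[str, Any]]
                       ) -> Dict[str, Any]:
    """Collect per-sample stats first, then let the collator mutate the buffer."""
    # Must happen BEFORE packing_collator: it may replace the buffer contents
    # in place, dropping _dataset_source and the original per-sample lengths.
    sources = [row.get('_dataset_source') for row in buffer]
    lengths = [len(row['input_ids']) for row in buffer]
    packed = packing_collator(buffer)
    packed['_batch_sources'] = sources
    packed['_batch_lengths'] = lengths
    return packed
```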
- Introduced tracking of original dataset sizes before mixing/resampling to improve training progress metrics.
- Updated load_dataset function to store original sizes for both training and validation datasets.
- Modified DatasetProgressCallback to utilize original dataset sizes for accurate epoch-based progress reporting.
- Added logging for original dataset sizes to aid in debugging and monitoring during training.
Remove the forced spawn multiprocessing context and standalone _packing_processor function. The default fork mode on Linux works correctly and is more efficient. This change restores the original _processor instance method implementation.
Summary of Changes
Hello @Lollipop, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the training capabilities by introducing an efficient greedy packing method for datasets, enabling seamless packing for multimodal MiniCPM-V models, and providing detailed per-dataset training progress tracking. These improvements aim to optimize training performance, reduce preprocessing overhead, and offer better visibility into the training process, particularly for complex multi-dataset and multimodal setups.
Code Review
This pull request introduces several significant features: greedy packing for training optimization, MiniCPM-V packing support for multimodal data, and comprehensive dataset progress tracking. The changes include new arguments for enabling these features, a ProgressTrackingCollator to extract dataset source and token length information, and a GreedyPackingDataLoader for on-the-fly packing. Additionally, the MiniCPM templates have been updated to correctly handle multimodal data during packing and to defer dtype conversion for streaming compatibility. Unit tests have been added for all new functionalities, including crucial distributed training scenarios, ensuring robustness and correctness. Overall, the PR enhances training efficiency and provides better observability into multi-dataset training progress.
- Move _extract_info from inner function to class method to avoid redefinition overhead on each __call__ invocation
- Extract duplicated sources/lengths collection logic into _collect_sources_and_lengths method
- Use item.pop() instead of get() + del for cleaner code
- Add type hints for better code clarity
Summary
This PR adds several new features for training optimization, based on the release/3.12 branch.
Greedy Packing (`greedy_packing=True`)
- `GreedyPackingDataLoader` wrapper

MiniCPM-V Packing Support
- `MiniCPMV2_6Template._data_collator` for packing scenarios
- `packing_row` method for multimodal data merging

Dataset Progress Tracking (`track_dataset_progress=True`)
- `DatasetProgressCallback` for per-dataset training progress

Training Speed Statistics
- `train_speed(tokens/s)` metric
- `total_tokens` logging

New Parameters
- `greedy_packing: bool = False` - Enable greedy packing
- `track_dataset_progress: bool = False` - Enable progress tracking

Files Changed
- `swift/llm/dataset/utils.py` - GreedyPackingDataLoader
- `swift/llm/dataset/collator.py` - ProgressTrackingCollator (new)
- `swift/llm/template/template/minicpm.py` - MiniCPM packing support
- `swift/trainers/callback.py` - DatasetProgressCallback
- `swift/trainers/mixin.py` - GreedyPackingDataLoader integration