Description
Implement Phase 1 of the parallelism-aware optimization plan to reduce CI wall clock time by 45% (103 min → 56 min).
Background
Current CI performance (after PR #370 parallelization):
Wall clock: 102.6 minutes (1.7 hours)
Parallelization: 3.6× speedup
Critical path bottleneck: 8-rank jobs take 52.9 min and set the overall wall clock time
The analysis in this PR shows that, once jobs run in parallel, the critical path (the longest-running jobs) determines wall clock time. The top 6 tensor creation test files contain about 480K tests because of excessive parametrization (8 dtypes × 8 shapes).
Scope: Phase 1 - Critical Path Optimization
Goal: Reduce parametrization in top 6 test files while maintaining multi-rank testing for all tests.
Changes needed:
Reduce parametrization in top 6 files (tests/unittests/):
test_zeros_like.py: 139,216 tests → ~27,000 tests (~80% reduction)
test_empty.py: 95,872 tests → ~19,000 tests (~80% reduction)
test_full.py: 76,608 tests → ~15,000 tests (~80% reduction)
test_randint.py: 59,360 tests → ~12,000 tests (~80% reduction)
test_ones.py: 59,136 tests → ~12,000 tests (~80% reduction)
test_zeros.py: 50,176 tests → ~10,000 tests (~80% reduction)
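The per-file figures above can be sanity-checked with a few lines of arithmetic. This is only a sketch: the dictionary copies the counts listed in this issue, and the helper name `reduction` is illustrative.

```python
# Current and target test counts per file, copied from the list above.
COUNTS = {
    "test_zeros_like.py": (139_216, 27_000),
    "test_empty.py": (95_872, 19_000),
    "test_full.py": (76_608, 15_000),
    "test_randint.py": (59_360, 12_000),
    "test_ones.py": (59_136, 12_000),
    "test_zeros.py": (50_176, 10_000),
}


def reduction(before: int, after: int) -> float:
    """Return the fractional reduction going from `before` to `after`."""
    return 1.0 - after / before


total_before = sum(before for before, _ in COUNTS.values())
total_after = sum(after for _, after in COUNTS.values())

for name, (before, after) in COUNTS.items():
    print(f"{name}: {reduction(before, after):.1%}")
print(f"top 6 total: {total_before:,} -> {total_after:,} "
      f"({reduction(total_before, total_after):.1%})")
```

The six files together account for 480,368 tests today and 95,000 at the targets, so each file and the group as a whole land close to an 80% reduction.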
Parametrization strategy:
Current: 8 dtypes × 8 shapes = 64 base combinations
Target: 4 dtypes × 4 shapes = 16 base combinations (75% reduction)
Representative dtypes: torch.float32, torch.float16, torch.int32, torch.bool
Representative shapes: (1,), (100,), (32, 32), (4, 8, 16)
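One way the reduced grid could look in a test file is sketched below. The constant names and the test function are hypothetical, and `torch.zeros` stands in for whichever tensor creation function the real test exercises; the actual tests in tests/unittests/ may be structured differently.

```python
import pytest
import torch

# Hypothetical shared constants; the names are illustrative, not from the PR.
REPRESENTATIVE_DTYPES = [torch.float32, torch.float16, torch.int32, torch.bool]
REPRESENTATIVE_SHAPES = [(1,), (100,), (32, 32), (4, 8, 16)]


# Stacked decorators produce the cross product: 4 dtypes x 4 shapes = 16 cases.
@pytest.mark.parametrize("dtype", REPRESENTATIVE_DTYPES, ids=str)
@pytest.mark.parametrize("shape", REPRESENTATIVE_SHAPES, ids=str)
def test_zeros_representative(dtype, shape):
    # torch.zeros is a stand-in for the function under test (an assumption).
    result = torch.zeros(shape, dtype=dtype)
    assert result.shape == torch.Size(shape)
    assert result.dtype == dtype
    assert bool((result == 0).all())
```

Keeping the two lists in one shared module would let all six files pick up the same grid, so a later change to the representative set happens in one place.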
Add explicit edge case tests:
Large tensors: (1024, 1024) for memory validation
Edge dtypes: torch.int8, torch.float64 for numerical precision
Complex shapes: (2, 3, 4, 5) for multi-dimensional handling
These explicit tests retain coverage of dtype/shape combinations dropped from the parametrized grid
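The edge cases listed above could be written as standalone tests rather than extra grid entries. The following is a minimal sketch: the function names are hypothetical, and `torch.zeros` again stands in for the function under test.

```python
import pytest
import torch


def test_zeros_large_tensor():
    # Large tensor: checks that a (1024, 1024) allocation succeeds.
    result = torch.zeros((1024, 1024), dtype=torch.float32)
    assert result.shape == (1024, 1024)
    assert result.numel() == 1024 * 1024


@pytest.mark.parametrize("dtype", [torch.int8, torch.float64], ids=str)
def test_zeros_edge_dtypes(dtype):
    # Edge dtypes that are no longer part of the representative grid.
    result = torch.zeros((100,), dtype=dtype)
    assert result.dtype == dtype
    assert bool((result == 0).all())


def test_zeros_four_dimensional():
    # Complex shape: a 4-D tensor for multi-dimensional handling.
    result = torch.zeros((2, 3, 4, 5), dtype=torch.float32)
    assert result.dim() == 4
    assert result.shape == (2, 3, 4, 5)
```

Each edge case runs once per rank count instead of being multiplied across the full grid, which keeps the added cost small.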
Keep all multi-rank testing:
Unlike PR #356 ("Reduce CI time 30% via marker-based multi-rank test filtering", closed), this approach does NOT remove multi-rank testing
All tests still run on 1, 2, 4, 8 ranks
Only reduces the number of dtype/shape combinations tested
Expected impact:
Unittests (8-rank): 50 min → 29 min (42% reduction)
Examples (8-rank): 53 min → 35 min (34% reduction)
Wall clock: 103 min → 56 min (45% reduction)
Test count: 530,877 → ~95,000 (82% reduction)
Annual cost savings: $102K
Implementation approach:
Create parametrization constants for representative values
Update @pytest.mark.parametrize decorators in top 6 files
Add explicit edge case test functions
Verify coverage with pytest-cov
Reference: See PARALLELISM_AWARE_OPTIMIZATION_PLAN.md in this PR for complete analysis and implementation details.