feat(quantization): add calibration cache to quantize_static #28221
Rishi-Dave wants to merge 3 commits into microsoft:main
Conversation
Introduce an optional calibration_cache_path parameter on quantize_static so users can save the computed TensorsData after calibration and reload it on subsequent runs. This avoids repeating the expensive calibration inference pass when only post-calibration knobs (e.g. nodes_to_exclude, quant types) change between runs. The cache is a human-readable JSON file whose schema mirrors the encoder used by write_calibration_table: TensorData / TensorsData round-trip through new from_dict classmethods and module-level save_tensors_data / load_tensors_data helpers in calibrate.py. calibration_data_reader is now optional; at least one of it or an existing cache file must be provided. Fixes microsoft#21908
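The intended call pattern can be sketched with a runnable stub; `quantize_static` itself is not invoked here, so `fake_quantize_static` is a stand-in that mirrors only the cache hit/miss control flow described above (names and behavior are assumptions drawn from this description, not the PR's actual code):

```python
from pathlib import Path
import json
import tempfile

calibration_runs = 0  # counts how often the "expensive" calibration pass executes

def fake_quantize_static(cache_path, calibration_data_reader=None):
    """Stub mirroring the PR's cache logic: load on hit, calibrate + save on miss."""
    global calibration_runs
    cache_path = Path(cache_path)
    if cache_path.exists():
        return json.loads(cache_path.read_text())  # cache hit: reader not required
    if calibration_data_reader is None:
        raise ValueError("provide calibration_data_reader or an existing cache")
    calibration_runs += 1  # stands in for the full calibration inference loop
    ranges = {"conv1_out": [-1.0, 1.0]}  # toy per-tensor range
    cache_path.write_text(json.dumps(ranges))
    return ranges

with tempfile.TemporaryDirectory() as d:
    cache = Path(d) / "calibration.json"
    first = fake_quantize_static(cache, calibration_data_reader=object())  # miss
    second = fake_quantize_static(cache)  # hit: calibration skipped, reader omitted
```

On the second call the reader can be omitted entirely, which is the "only one of reader or cache is required" contract.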
Pull request overview
Adds a JSON-backed calibration cache for Python static quantization to allow reusing TensorsData across runs and skipping repeated calibration inference when inputs are unchanged.
Changes:
- Add JSON serialization/deserialization utilities for `TensorData`/`TensorsData` (save/load calibration caches).
- Extend `quantize_static()` with an optional `calibration_cache_path` and make `calibration_data_reader` optional when an existing cache is provided.
- Add test coverage for cache roundtrips and `quantize_static` cache hit/miss behavior.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| onnxruntime/python/tools/quantization/calibrate.py | Adds from_dict support plus JSON cache encoder and save_tensors_data / load_tensors_data. |
| onnxruntime/python/tools/quantization/quantize.py | Adds calibration_cache_path to quantize_static and reuses cached TensorsData when present. |
| onnxruntime/python/tools/quantization/`__init__.py` | Exports cache-related helpers and data structures via onnxruntime.quantization. |
| onnxruntime/test/python/quantization/test_calibration.py | Adds TestCalibrationCache covering serialization and end-to-end cache usage. |
- `load_tensors_data`: reject non-file paths up front with a `ValueError` instead of letting `Path.open` raise `IsADirectoryError` or similar.
- `quantize_static`: when `extra_options['SmoothQuant']=True`, require a non-None `calibration_data_reader`, since the cache stores per-tensor ranges only and cannot drive the SmoothQuant transform.
- `quantize_static`: treat the cache path as a hit only when it is a regular file; raise `ValueError` if it exists but is e.g. a directory, so callers get a clear message instead of a low-level `IOError`.
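The path and reader guards suggested above could look roughly like this; `resolve_cache` is a hypothetical helper written for illustration, not code from the PR:

```python
from pathlib import Path
import tempfile

def resolve_cache(cache_path, calibration_data_reader=None, smooth_quant=False):
    """Hypothetical guard logic per the review suggestions.

    Returns True on a cache hit (path is a regular file), False on a miss.
    """
    if smooth_quant and calibration_data_reader is None:
        # The cache stores per-tensor ranges only; SmoothQuant needs live inference.
        raise ValueError("SmoothQuant requires a calibration_data_reader")
    p = Path(cache_path)
    if p.exists() and not p.is_file():
        # Fail early with a clear message instead of a low-level IsADirectoryError.
        raise ValueError(f"calibration cache path exists but is not a regular file: {p}")
    return p.is_file()

with tempfile.TemporaryDirectory() as d:
    missing = resolve_cache(Path(d) / "absent.json")  # no file yet: cache miss
    try:
        resolve_cache(d)  # a directory is not a valid cache file
        dir_rejected = False
    except ValueError:
        dir_rejected = True
```

Validating the path shape before opening it keeps the error surface at the API boundary rather than deep inside the I/O layer.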
tianleiwu
left a comment
Review Summary
The calibration cache feature is well-motivated and the implementation is clean overall. The atomic write pattern, method mismatch validation, and SmoothQuant guard are good design choices.
However, there is a correctness bug in the deserialization path that will cause ValueError at runtime when caching Entropy or Distribution calibration results. The root cause is that numpy scalars (produced by ndarray.min()/ndarray.max() in the Entropy calibrator at line 1154) are encoded as plain JSON numbers, but from_dict() doesn't convert them back to numpy-typed values before passing to TensorData.__init__(), which requires a .dtype attribute on float fields.
- `TensorData.from_dict`: wrap plain int/float values for `_floats` keys as `np.array(value, dtype=np.float32)` so the cache round-trip works for Entropy/Distribution calibration, where `hist_edges.min()`/`.max()` are serialized as numpy scalars and deserialize as plain Python floats.
- `save_tensors_data`: wrap `json.dump` + `os.replace` in a `try/except BaseException` that unlinks the `.tmp` file on failure, so partial serialization (or a KeyboardInterrupt mid-write) does not leave stray `.tmp` files behind.
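The float-coercion fix can be sketched in isolation; the key set and helper name below are assumptions made for illustration, not the PR's actual `from_dict` code:

```python
import numpy as np

FLOAT_KEYS = {"lowest", "highest"}  # assumed names of the _floats fields

def coerce_float_fields(d):
    """Sketch of the from_dict fix.

    JSON round-trips numpy scalars as plain Python floats, which lack the
    .dtype attribute TensorData.__init__ expects; wrap them back into numpy
    values before reconstruction.
    """
    out = dict(d)
    for key in FLOAT_KEYS & out.keys():
        value = out[key]
        if isinstance(value, (int, float)):
            out[key] = np.array(value, dtype=np.float32)  # restores .dtype
    return out

# Simulate what the Entropy calibrator produces and what JSON gives back:
edges = np.array([0.0, 1.0, 2.0], dtype=np.float32)
raw = {"lowest": float(edges.min()), "highest": float(edges.max())}
fixed = coerce_float_fields(raw)
```

After coercion, `fixed["lowest"]` carries a `float32` dtype again, so downstream code that reads `.dtype` no longer raises.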
tianleiwu
left a comment
Review Summary (Round 2)
Both concerns from the previous review round are addressed:
- numpy scalar roundtrip bug (thread #4): `from_dict()` now coerces plain Python floats back to numpy arrays for float fields. Fixed.
- Orphaned `.tmp` file (thread #5): `save_tensors_data()` now wraps the write in `try/except BaseException` and removes the temp file on failure. Fixed.
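The atomic-write-with-cleanup pattern described here can be sketched as a minimal standalone helper (this is an illustration of the pattern, not the PR's `save_tensors_data` itself):

```python
import json
import os
import tempfile

def save_json_atomic(obj, path, encoder=None):
    """Write JSON atomically: serialize to a sibling .tmp file, then
    os.replace() into place. On any failure, including KeyboardInterrupt
    mid-write, remove the .tmp file so no stray temp files are left behind."""
    tmp = f"{path}.tmp"
    try:
        with open(tmp, "w") as f:
            json.dump(obj, f, cls=encoder)
        os.replace(tmp, path)  # atomic rename on POSIX and Windows
    except BaseException:
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "cache.json")
    save_json_atomic({"a": 1}, target)
    with open(target) as f:
        saved_ok = json.load(f) == {"a": 1}
    try:
        save_json_atomic({"bad": object()}, target)  # not JSON-serializable
    except TypeError:
        pass
    tmp_left_behind = os.path.exists(target + ".tmp")
```

Because the replace happens only after a complete, successful dump, a failed write can never corrupt an existing cache file.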
The implementation is clean, backward-compatible, and well-tested. One remaining minor test coverage suggestion below.
Verdict: APPROVE
```python
        self.assertEqual(lo.shape, ())
        self.assertEqual(hi.shape, ())

    def test_save_load_tensors_data_entropy_roundtrip(self):
```
Nitpick — test uses 0-d ndarray, not actual numpy scalars
This test uses np.array(-0.5, dtype=np.float32) (a 0-d ndarray) for lowest/highest, but the real Entropy calibrator produces numpy scalars via hist_edges.min()/hist_edges.max(). As a result, this test doesn't directly exercise the from_dict fix that coerces plain float → np.float32.
Consider adding a case with actual numpy scalars:
```python
def test_save_load_tensors_data_numpy_scalar_roundtrip(self):
    edges = np.array([0.0, 1.0, 2.0], dtype=np.float32)
    td = TensorsData(
        CalibrationMethod.Entropy,
        {"t": TensorData(lowest=edges.min(), highest=edges.max(), hist=edges[:-1], hist_edges=edges)},
    )
    cache_path = Path(self._tmp_dir.name) / "scalar_roundtrip.json"
    save_tensors_data(td, cache_path)
    loaded = load_tensors_data(cache_path)
    np.testing.assert_almost_equal(loaded["t"].range_value[0], 0.0)
    np.testing.assert_almost_equal(loaded["t"].range_value[1], 2.0)
```
Summary
- Adds a `calibration_cache_path` parameter to `quantize_static()` so users can save and reload the calibration result (`TensorsData`) across runs.
- Skips the expensive calibration inference pass when only post-calibration knobs change between runs, e.g. `nodes_to_exclude`, `activation_type`, or `weight_type`.
- The cache schema mirrors the encoder used by `write_calibration_table`: no new serialization surface area.

Motivation
Fixes #21908. Users commonly re-run `quantize_static` multiple times on the same model and calibration dataset while varying the set of excluded nodes or the quant types, to trade off accuracy vs. speed. Today, every call repeats the full calibration inference loop even though the calibration result is identical, which is costly on large calibration datasets. There was no supported way to persist the computed tensor ranges: `write_calibration_table` writes a lossy table (drops histogram data) and has no paired reader. This PR closes that gap.

Changes
- `python/tools/quantization/calibrate.py`: adds `TensorData.from_dict` and `TensorsData.from_dict` classmethods (the inverse of the existing `to_dict`), plus `_CalibrationCacheEncoder(json.JSONEncoder)`, `save_tensors_data(tensors, path)`, and `load_tensors_data(path)`. The encoder handles `TensorData`/`TensorsData`/`np.ndarray`/`CalibrationMethod`/numpy scalars. Writes are atomic (tmp file + `os.replace`) and parent directories are auto-created.
- `python/tools/quantization/quantize.py`: `quantize_static` gains `calibration_cache_path: str | Path | None = None`. If the path exists, calibration is skipped and ranges are loaded from the cache; if the path is new, calibration runs and the result is saved. Raises `ValueError` if the cached `calibration_method` does not match the caller's `calibrate_method`. `calibration_data_reader` becomes optional; at least one of it or an existing cache must be provided, else `ValueError`.
- `python/tools/quantization/__init__.py`: exports `TensorData`, `TensorsData`, `save_tensors_data`, `load_tensors_data`.
- `test/python/quantization/test_calibration.py`: adds `TestCalibrationCache` covering MinMax roundtrip, Entropy roundtrip (with histogram), missing-path error, parent-dir auto-creation, numpy scalar `bins` handling, method-mismatch guard, end-to-end `quantize_static` cache hit/miss, and `ValueError` when neither reader nor cache is provided.
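An encoder of the kind described above could look roughly like this; it is a sketch, and the field names (`__ndarray__`, `dtype`) are assumptions, since the PR's actual `_CalibrationCacheEncoder` schema is not reproduced here:

```python
import json
import numpy as np

class CacheEncoder(json.JSONEncoder):
    """Sketch of a calibration-cache encoder: lowers numpy types to
    JSON-native values, tagging ndarrays so they can be reconstructed."""

    def default(self, o):
        if isinstance(o, np.ndarray):
            return {"__ndarray__": o.tolist(), "dtype": str(o.dtype)}
        if isinstance(o, np.generic):  # numpy scalar, e.g. np.float32
            return o.item()
        return super().default(o)

blob = json.dumps(
    {"lowest": np.float32(-0.5), "hist_edges": np.arange(3, dtype=np.float32)},
    cls=CacheEncoder,
)
decoded = json.loads(blob)
```

Note that numpy scalars come back as plain Python floats on decode, which is exactly why the `from_dict` side needs the coercion discussed in the review threads.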
- `python -m pytest onnxruntime/test/python/quantization/test_calibration.py::TestCalibrationCache -v`
- `python -m pytest onnxruntime/test/python/quantization/test_calibration.py::TestCalibrateMinMaxCalibrator -v` (regression)
- `lintrunner -a` on changed files: clean.

Backward Compatibility
`calibration_data_reader` changes from required-positional to optional-keyword. Existing call sites, whether positional or keyword, continue to work unchanged. The new behavior is only engaged when `calibration_cache_path` is provided.