perf: Vectorize PredictionsModelWrapper lookup with numpy arrays #187
Merged
imatiach-msft merged 1 commit into microsoft:main on Mar 16, 2026
Conversation
Problem: `PredictionsModelWrapper.predict()` used row-by-row DataFrame operations that were extremely slow on large datasets. With 93K rows and 33 features, `predict()` achieved only ~130 rows/sec, causing RAI dashboard generation to time out after 2+ hours when MimicExplainer called `predict()` on all training data for surrogate model training.

Solution:

- Pre-extract feature data as numpy arrays during initialization
- Use `str(row.tolist())` + MD5 for fast, consistent row hashing
- Perform all lookups using numpy array indexing instead of `DataFrame.iloc`
- Store predictions as numpy arrays for direct O(1) access
- Add a per-instance warning when fallback to the slow path occurs

Performance Results (93K rows, 33 features):

| Metric                | Before     | After         | Speedup |
|-----------------------|------------|---------------|---------|
| predict() throughput  | 130 rows/s | 28,000 rows/s | ~215x   |
| 93K row prediction    | ~40 min    | 3.3 sec       | ~720x   |
| RAIInsights creation  | ~110 min   | 11.5 sec      | ~570x   |

This fixes the RAI dashboard timeout issue for large AutoML datasets.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
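The approach in the commit message can be sketched as follows. This is a minimal illustration, not the actual ml-wrappers implementation: the class name, attribute names, and the hard failure on an unmatched row are all hypothetical (the real wrapper falls back to a slow path with a warning instead).

```python
import hashlib

import numpy as np
import pandas as pd


class FastLookupWrapper:
    """Illustrative sketch: answer predict() by hash lookup into stored predictions."""

    def __init__(self, combined_data, feature_names, y_pred):
        # Pre-extract feature data as a numpy array once, up front.
        self._feature_values = combined_data[feature_names].values
        # Store predictions as a numpy array for direct O(1) access.
        self._predictions = np.asarray(y_pred).copy()
        self._feature_names = feature_names
        # Map each row's hash to its index so lookups avoid DataFrame.iloc.
        self._hash_to_index = {
            self._row_hash(row): i for i, row in enumerate(self._feature_values)
        }

    @staticmethod
    def _row_hash(row):
        # str(row.tolist()) + MD5 gives a fast, consistent row fingerprint.
        return hashlib.md5(str(row.tolist()).encode()).hexdigest()

    def predict(self, query_test_data):
        # Extract query features once as a numpy array, not row by row.
        query_values = query_test_data[self._feature_names].values
        out = np.empty(len(query_values), dtype=self._predictions.dtype)
        for i, row in enumerate(query_values):
            # Raises KeyError on an unseen row; the real wrapper would
            # instead warn and fall back to a slow comparison path.
            matched_idx = self._hash_to_index[self._row_hash(row)]
            out[i] = self._predictions[matched_idx]
        return out
```

Because hashing and indexing operate on plain numpy rows, no per-row DataFrame objects are created inside the loop.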
Problem
The previous hash-based optimization (0.6.2) still performed expensive DataFrame operations inside the prediction loop, yielding only ~130 rows/sec throughput. As a result, RAI dashboard generation still timed out on large datasets (~93K rows).
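The cost gap behind this is easy to reproduce. A small timing sketch with synthetic data (sizes hypothetical) contrasting per-row `iloc` slicing against a single up-front `.values` extraction:

```python
import time

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(20_000, 33))

# Old pattern: a brand-new one-row DataFrame is built on every iteration.
t0 = time.perf_counter()
for i in range(len(df)):
    _ = df.iloc[i:i + 1]
slow = time.perf_counter() - t0

# New pattern: extract the numpy array once, then index plain rows.
t0 = time.perf_counter()
values = df.values
for i in range(len(values)):
    _ = values[i]
fast = time.perf_counter() - t0

print(f"iloc per row: {slow:.3f}s, numpy rows: {fast:.3f}s")
assert fast < slow
```

The absolute numbers depend on the machine, but the numpy path is consistently orders of magnitude cheaper per row.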
Root Cause
The 0.6.2 implementation had these bottlenecks:
- `query_test_data.iloc[index:index + 1]` created a new DataFrame for each row
- `_compute_row_hash()` called `pd.isna()` on each value individually
- `_rows_equal()` iterated through each column using pandas Series operations

Solution
This PR replaces DataFrame operations with numpy array operations:
1. Pre-extract numpy arrays during `__init__`:
   - `self._feature_values = combined_data[feature_names].values`
   - `self._predictions = y_pred.copy()`
2. Fast hashing using `str(row.tolist())` + MD5
3. Numpy array indexing in the predict loop:
   - `query_values = query_test_data[feature_names].values`
   - `self._predictions[matched_idx]`

Performance Results
Tested with production AutoML data (93,340 train rows, 5,000 test rows, 33 features):
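The production dataset behind the quoted numbers is not included here, but the measurement method can be reproduced with a synthetic harness along these lines (sizes and names hypothetical):

```python
import hashlib
import time

import numpy as np

rng = np.random.default_rng(0)
values = rng.random((20_000, 33))
predictions = rng.integers(0, 2, size=20_000)


def row_hash(row):
    # Same fingerprint scheme as the PR: str(row.tolist()) fed to MD5.
    return hashlib.md5(str(row.tolist()).encode()).hexdigest()


# One-time index build, analogous to the work done in __init__.
index = {row_hash(r): i for i, r in enumerate(values)}

# Time the hash-lookup predict loop and report throughput.
t0 = time.perf_counter()
out = np.array([predictions[index[row_hash(r)]] for r in values])
elapsed = time.perf_counter() - t0
print(f"hash-lookup predict: {len(values) / elapsed:,.0f} rows/s")
```

Exact throughput varies by machine; the point is that hashing plus dict lookup stays cheap even at tens of thousands of rows.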
Validation
Files Changed
- `python/ml_wrappers/model/predictions_wrapper.py`: Vectorized implementation

Backwards Compatibility