perf: Vectorize PredictionsModelWrapper lookup with numpy arrays #187
Merged
imatiach-msft merged 1 commit into microsoft:main on Mar 16, 2026
Conversation
Problem: `PredictionsModelWrapper.predict()` used row-by-row DataFrame operations that were extremely slow on large datasets. With 93K rows and 33 features, `predict()` achieved only ~130 rows/sec, causing RAI dashboard generation to time out after 2+ hours when MimicExplainer called `predict()` on all training data for surrogate model training.

Solution:

- Pre-extract feature data as numpy arrays during initialization
- Use `str(row.tolist())` + MD5 for fast, consistent row hashing
- Perform all lookups using numpy array indexing instead of `DataFrame.iloc`
- Store predictions as numpy arrays for direct O(1) access
- Add a per-instance warning when fallback to the slow path occurs

Performance Results (93K rows, 33 features):

| Metric                | Before     | After         | Speedup |
|-----------------------|------------|---------------|---------|
| predict() throughput  | 130 rows/s | 28,000 rows/s | ~215x   |
| 93K row prediction    | ~40 min    | 3.3 sec       | ~720x   |
| RAIInsights creation  | ~110 min   | 11.5 sec      | ~570x   |

This fixes the RAI dashboard timeout issue for large AutoML datasets.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
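The approach in the commit message can be sketched as follows. This is a minimal illustration, not the actual ml-wrappers implementation: the class name, attribute names, and the hard failure on an unmatched row are all hypothetical (the real wrapper falls back to a slow path with a warning instead).

```python
import hashlib

import numpy as np
import pandas as pd


class FastLookupWrapper:
    """Illustrative sketch: answer predict() by hash lookup into stored predictions."""

    def __init__(self, combined_data, feature_names, y_pred):
        # Pre-extract feature data as a numpy array once, up front.
        self._feature_values = combined_data[feature_names].values
        # Store predictions as a numpy array for direct O(1) access.
        self._predictions = np.asarray(y_pred).copy()
        self._feature_names = feature_names
        # Map each row's hash to its index so lookups avoid DataFrame.iloc.
        self._hash_to_index = {
            self._row_hash(row): i for i, row in enumerate(self._feature_values)
        }

    @staticmethod
    def _row_hash(row):
        # str(row.tolist()) + MD5 gives a fast, consistent row fingerprint.
        return hashlib.md5(str(row.tolist()).encode()).hexdigest()

    def predict(self, query_test_data):
        # Extract query features once as a numpy array, not row by row.
        query_values = query_test_data[self._feature_names].values
        out = np.empty(len(query_values), dtype=self._predictions.dtype)
        for i, row in enumerate(query_values):
            # Raises KeyError on an unseen row; the real wrapper would
            # instead warn and fall back to a slow comparison path.
            matched_idx = self._hash_to_index[self._row_hash(row)]
            out[i] = self._predictions[matched_idx]
        return out
```

Because hashing and indexing operate on plain numpy rows, no per-row DataFrame objects are created inside the loop.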
Problem
The previous hash-based optimization (0.6.2) still performed expensive DataFrame operations inside the prediction loop, yielding only ~130 rows/sec throughput. As a result, RAI dashboard generation still timed out on large datasets (~93K rows).
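The cost gap behind this is easy to reproduce. A small timing sketch with synthetic data (sizes hypothetical) contrasting per-row `iloc` slicing against a single up-front `.values` extraction:

```python
import time

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(20_000, 33))

# Old pattern: a brand-new one-row DataFrame is built on every iteration.
t0 = time.perf_counter()
for i in range(len(df)):
    _ = df.iloc[i:i + 1]
slow = time.perf_counter() - t0

# New pattern: extract the numpy array once, then index plain rows.
t0 = time.perf_counter()
values = df.values
for i in range(len(values)):
    _ = values[i]
fast = time.perf_counter() - t0

print(f"iloc per row: {slow:.3f}s, numpy rows: {fast:.3f}s")
assert fast < slow
```

The absolute numbers depend on the machine, but the numpy path is consistently orders of magnitude cheaper per row.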
Root Cause
The 0.6.2 implementation had these bottlenecks:
- `query_test_data.iloc[index:index + 1]` created a new DataFrame for each row
- `_compute_row_hash()` called `pd.isna()` on each value individually
- `_rows_equal()` iterated through each column using pandas Series operations

Solution
This PR replaces DataFrame operations with numpy array operations:
1. Pre-extract numpy arrays during `__init__`:
   - `self._feature_values = combined_data[feature_names].values`
   - `self._predictions = y_pred.copy()`
2. Fast hashing using `str(row.tolist())` + MD5
3. Numpy array indexing in the predict loop:
   - `query_values = query_test_data[feature_names].values`
   - `self._predictions[matched_idx]`

Performance Results
Tested with production AutoML data (93,340 train rows, 5,000 test rows, 33 features):
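The production dataset behind the quoted numbers is not included here, but the measurement method can be reproduced with a synthetic harness along these lines (sizes and names hypothetical):

```python
import hashlib
import time

import numpy as np

rng = np.random.default_rng(0)
values = rng.random((20_000, 33))
predictions = rng.integers(0, 2, size=20_000)


def row_hash(row):
    # Same fingerprint scheme as the PR: str(row.tolist()) fed to MD5.
    return hashlib.md5(str(row.tolist()).encode()).hexdigest()


# One-time index build, analogous to the work done in __init__.
index = {row_hash(r): i for i, r in enumerate(values)}

# Time the hash-lookup predict loop and report throughput.
t0 = time.perf_counter()
out = np.array([predictions[index[row_hash(r)]] for r in values])
elapsed = time.perf_counter() - t0
print(f"hash-lookup predict: {len(values) / elapsed:,.0f} rows/s")
```

Exact throughput varies by machine; the point is that hashing plus dict lookup stays cheap even at tens of thousands of rows.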
Validation
Files Changed
- `python/ml_wrappers/model/predictions_wrapper.py`: Vectorized implementation

Backwards Compatibility