Skip to content

perf: Vectorize PredictionsModelWrapper lookup with numpy arrays#187

Merged
imatiach-msft merged 1 commit intomicrosoft:mainfrom
imatiach-msft:perf/vectorized-predictions-wrapper-lookup
Mar 16, 2026
Merged

perf: Vectorize PredictionsModelWrapper lookup with numpy arrays#187
imatiach-msft merged 1 commit intomicrosoft:mainfrom
imatiach-msft:perf/vectorized-predictions-wrapper-lookup

Conversation

@imatiach-msft
Copy link
Copy Markdown
Contributor

@imatiach-msft imatiach-msft commented Mar 13, 2026

Problem

The previous hash-based optimization (0.6.2) still used expensive DataFrame operations inside the prediction loop, resulting in only ~130 rows/sec throughput. This caused RAI dashboard generation to still timeout on large datasets (~93K rows).

Root Cause

The 0.6.2 implementation had these bottlenecks:

  • query_test_data.iloc[index:index + 1] created a new DataFrame for each row
  • _compute_row_hash() called pd.isna() on each value individually
  • _rows_equal() iterated through each column using pandas Series operations
  • Hash lookup was O(1), but per-row preparation was still O(m) with high constant factors

Solution

This PR replaces DataFrame operations with numpy array operations:

  1. Pre-extract numpy arrays during __init__:

    • self._feature_values = combined_data[feature_names].values
    • self._predictions = y_pred.copy()
  2. Fast hashing using str(row.tolist()) + MD5:

    • Handles NaN, mixed dtypes consistently
    • Much faster than per-value type checking
  3. Numpy array indexing in the predict loop:

    • Extract all query data as numpy array once: query_values = query_test_data[feature_names].values
    • Direct array indexing: self._predictions[matched_idx]

Performance Results

Tested with production AutoML data (93,340 train rows, 5,000 test rows, 33 features):

Metric Before (0.6.2) After Speedup
predict() throughput 130 rows/s 28,000 rows/s ~215x
93K row prediction ~40 min 3.3 sec ~720x
RAIInsights creation ~110 min 11.5 sec ~570x
Total RAI job ~194 min < 1 min ~200x

Validation

  • ✅ All 60 existing tests pass
  • ✅ Validated on AzureML with production AutoML RAI job
  • ✅ Job completed successfully in < 1 minute

Files Changed

  • python/ml_wrappers/model/predictions_wrapper.py - Vectorized implementation

Backwards Compatibility

  • API unchanged - Same public interface
  • Behavior unchanged - Same results for all inputs
  • Pickle compatible - Numpy arrays rebuilt on deserialization

Problem:
PredictionsModelWrapper.predict() used row-by-row DataFrame operations
that were extremely slow on large datasets. With 93K rows and 33 features,
predict() achieved only ~130 rows/sec, causing RAI dashboard generation
to timeout after 2+ hours when MimicExplainer called predict() on all
training data for surrogate model training.

Solution:
- Pre-extract feature data as numpy arrays during initialization
- Use str(row.tolist()) + MD5 for fast, consistent row hashing
- Perform all lookups using numpy array indexing instead of DataFrame.iloc
- Store predictions as numpy arrays for direct O(1) access
- Add per-instance warning when fallback to slow path occurs

Performance Results (93K rows, 33 features):
| Metric                   | Before    | After       | Speedup |
|--------------------------|-----------|-------------|---------|
| predict() throughput     | 130 rows/s| 28,000 rows/s| ~215x  |
| 93K row prediction       | ~40 min   | 3.3 sec     | ~720x   |
| RAIInsights creation     | ~110 min  | 11.5 sec    | ~570x   |

This fixes the RAI dashboard timeout issue for large AutoML datasets.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@imatiach-msft imatiach-msft force-pushed the perf/vectorized-predictions-wrapper-lookup branch from 7b10f9a to 249a025 Compare March 16, 2026 18:43
@imatiach-msft imatiach-msft merged commit 5872fb1 into microsoft:main Mar 16, 2026
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant