Overview
Add SIMD vectorization to comparison operations. Currently, comparison kernels use scalar IL opcodes despite having the IL kernel infrastructure in place.
Parent issue: #545
Current State
| Aspect |
Status |
| IL kernel generation |
✅ Exists (ILKernelGenerator.cs:3156-3790) |
| Execution path classification |
✅ SimdFull, ScalarRight, General |
| Vector256 SIMD |
❌ NOT USED |
| Actual comparison |
Scalar IL opcodes (Ceq, Clt, Cgt) |
Current implementation (line 3698-3762):
case ComparisonOp.Greater:
il.Emit(OpCodes.Cgt); // Scalar comparison → 0 or 1
The Challenge
Vector256.GreaterThan() returns a mask, not bools:
Vector256<int> a = Vector256.Create(1, 2, 3, 4, 5, 6, 7, 8);
Vector256<int> b = Vector256.Create(5, 5, 5, 5, 5, 5, 5, 5);
Vector256<int> mask = Vector256.GreaterThan(a, b);
// mask = [0, 0, 0, 0, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF]
// (false, false, false, false, true, true, true, true)
Need to convert mask lanes (0x00000000 / 0xFFFFFFFF) to bool array.
Task List
Implementation Approach
Option A: MoveMask Extraction (Recommended for AVX2)
// For int32 (8 elements per Vector256):
var mask = Vector256.GreaterThan(lhsVec, rhsVec); // Vector256<int>
// MoveMask extracts MSB of each byte (32 bits total)
uint moveMask = (uint)Avx2.MoveMask(mask.AsByte());
// For int32, relevant bits are at positions 3, 7, 11, 15, 19, 23, 27, 31
// (MSB of each 4-byte lane)
result[i+0] = ((moveMask >> 3) & 1) != 0;
result[i+1] = ((moveMask >> 7) & 1) != 0;
result[i+2] = ((moveMask >> 11) & 1) != 0;
// ... etc for all 8 lanes
Option B: ConditionalSelect + Narrow
var mask = Vector256.GreaterThan(lhsVec, rhsVec);
var ones = Vector256.Create(1);
var zeros = Vector256<int>.Zero;
var boolAsInt = Vector256.ConditionalSelect(mask, ones, zeros);
// boolAsInt = [0, 0, 0, 0, 1, 1, 1, 1]
// Narrow to bytes and store
var narrow1 = Avx2.PackSignedSaturate(boolAsInt.GetLower(), boolAsInt.GetUpper());
// Continue narrowing to bytes...
Type-Specific Handling
| Type |
Elements/Vector |
Extraction Method |
| byte |
32 |
MoveMask direct (1 bit per element) |
| short |
16 |
MoveMask + stride 2 |
| int |
8 |
MoveMask + stride 4 |
| long |
4 |
MoveMask + stride 8 |
| float |
8 |
Same as int |
| double |
4 |
Same as long |
Files to Modify
| File |
Changes |
ILKernelGenerator.cs:3222-3339 |
Replace scalar loops with SIMD |
ILKernelGenerator.cs:3346-3409 |
New EmitComparisonSimdLoop() |
ILKernelGenerator.cs:3698-3762 |
EmitVectorComparisonOperation() |
Code Structure
private static void EmitComparisonSimdLoop(ILGenerator il, ComparisonKernelKey key)
{
int vectorCount = GetVectorCount(key.LhsType);
// Locals
var locI = il.DeclareLocal(typeof(int));
var locVectorEnd = il.DeclareLocal(typeof(int));
// vectorEnd = totalSize - vectorCount
// ...
// SIMD loop
var lblSimdLoop = il.DefineLabel();
il.MarkLabel(lblSimdLoop);
// if (i > vectorEnd) goto tail
// Load vectors
EmitVectorLoad(il, key.LhsType); // lhs
EmitVectorLoad(il, key.RhsType); // rhs
// Compare
EmitVectorComparison(il, key.Op, key.ComparisonType);
// Extract mask to bools
EmitMaskToBoolExtraction(il, key.ComparisonType, vectorCount);
// i += vectorCount
// goto lblSimdLoop
// Tail loop (scalar fallback for remainder)
// ...
}
Benchmarks
[Benchmark] public NDArray Greater_10M() => _a > _b;
[Benchmark] public NDArray Equal_10M() => _a == _b;
[Benchmark] public NDArray LessEqual_10M() => _a <= _b;
Expected Results
| Operation |
Current (scalar) |
With SIMD |
Speedup |
| a > b (10M int32) |
~15 ms |
~5-7 ms |
2-3× |
| a == b (10M float64) |
~20 ms |
~7-10 ms |
2-3× |
Success Criteria
- SIMD comparisons ≥2× faster than current scalar
- All existing comparison tests pass
- Correct handling of NaN comparisons (float/double)
- Works with all execution paths (SimdFull, ScalarRight, General)
Overview
Add SIMD vectorization to comparison operations. Currently, comparison kernels use scalar IL opcodes despite having the IL kernel infrastructure in place.
Parent issue: #545
Current State
ILKernelGenerator.cs:3156-3790)Current implementation (line 3698-3762):
The Challenge
Vector256.GreaterThan()returns a mask, not bools:Need to convert mask lanes (0x00000000 / 0xFFFFFFFF) to bool array.
Task List
Implement SIMD comparison loop
Vector256.GreaterThan/LessThan/Equalsetc.Handle all comparison types
Vector256.Equals()~Vector256.Equals()Vector256.LessThan()Vector256.LessThanOrEqual()Vector256.GreaterThan()Vector256.GreaterThanOrEqual()Mask-to-bool conversion strategies
Implementation Approach
Option A: MoveMask Extraction (Recommended for AVX2)
Option B: ConditionalSelect + Narrow
Type-Specific Handling
Files to Modify
ILKernelGenerator.cs:3222-3339ILKernelGenerator.cs:3346-3409EmitComparisonSimdLoop()ILKernelGenerator.cs:3698-3762EmitVectorComparisonOperation()Code Structure
Benchmarks
Expected Results
Success Criteria