Problem
UnmanagedMemoryBlock.Casting.cs contains 2,228 lines of repetitive type-dispatch code with 144 nested switch cases (12 input types × 12 output types), each containing nearly identical for-loops:
    case NPTypeCode.Boolean:
    {
        var src = (bool*)source.Address;
        switch (InfoOf<TOut>.NPTypeCode)
        {
            case NPTypeCode.Int32:
                var dst = (int*)ret.Address;
                for (int i = 0; i < len; i++)
                    *(dst + i) = Converts.ToInt32(*(src + i));
                break;
            // ... 11 more output types
        }
        break;
    }
    // ... 11 more input types (144 total combinations)
Issues
| Problem | Impact |
|---|---|
| Code bloat | 2,228 lines for a simple operation |
| Maintenance burden | Changes must be replicated across 144 branches |
| Regen dependency | Uses #if _REGEN template generation |
| No SIMD | Scalar loops where vectorization is possible |
| Cache pollution | 144 code paths = poor instruction cache utilization |
Proposed Solution
Replace with IL-generated kernels using the established ILKernelGenerator pattern:
New API (~20 lines)
    public static partial class UnmanagedMemoryBlock
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static IMemoryBlock CastTo(this IMemoryBlock source, NPTypeCode to)
        {
            if (source.TypeCode == to)
                return source.Clone();
            return CastKernelGenerator.Execute(source, to);
        }
    }
Kernel Generator (~300 lines)
    public static class CastKernelGenerator
    {
        private delegate void CastKernel(IntPtr src, IntPtr dst, int count);

        private static readonly ConcurrentDictionary<(NPTypeCode, NPTypeCode), CastKernel> _cache = new();

        public static IMemoryBlock Execute(IMemoryBlock source, NPTypeCode dstType)
        {
            var kernel = _cache.GetOrAdd(
                (source.TypeCode, dstType),
                key => GenerateKernel(key.Item1, key.Item2));
            var dst = AllocateBlock(dstType, source.Count);
            kernel((IntPtr)source.Address, (IntPtr)dst.Address, source.Count);
            return dst;
        }

        private static CastKernel GenerateKernel(NPTypeCode srcType, NPTypeCode dstType)
        {
            var method = new DynamicMethod($"Cast_{srcType}_{dstType}", ...);
            var il = method.GetILGenerator();

            // Try SIMD for compatible types (widening, float<->double)
            if (TryEmitSimdCast(il, srcType, dstType))
                return (CastKernel)method.CreateDelegate(typeof(CastKernel));

            // Fallback: scalar loop with IL conversion opcodes
            EmitScalarCast(il, srcType, dstType);
            return (CastKernel)method.CreateDelegate(typeof(CastKernel));
        }
    }
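The proposal leaves the DynamicMethod signature and EmitScalarCast unspecified. As a rough illustration of what the scalar fallback could emit, here is a minimal sketch for a single pair (int32 to double). The class name ScalarCastSketch and the helpers EmitElementAddress and Run are hypothetical and not part of NumSharp; the real generator would pick element sizes and load/convert/store opcodes from srcType and dstType.

```csharp
using System;
using System.Reflection.Emit;
using System.Runtime.InteropServices;

// Hypothetical sketch, not NumSharp code: one hard-coded pair, int32 -> double.
public static class ScalarCastSketch
{
    // Same shape as the CastKernel delegate in the proposal.
    public delegate void CastKernel(IntPtr src, IntPtr dst, int count);

    // Pushes (base + (native int)index * elementSize) onto the IL stack.
    private static void EmitElementAddress(ILGenerator il, OpCode loadBase, LocalBuilder index, int elementSize)
    {
        il.Emit(loadBase);
        il.Emit(OpCodes.Ldloc, index);
        il.Emit(OpCodes.Conv_I);
        il.Emit(OpCodes.Ldc_I4, elementSize);
        il.Emit(OpCodes.Conv_I);
        il.Emit(OpCodes.Mul);
        il.Emit(OpCodes.Add);
    }

    // Emits the equivalent of:
    //   for (int i = 0; i < count; i++) ((double*)dst)[i] = (double)((int*)src)[i];
    public static CastKernel BuildInt32ToDouble()
    {
        var method = new DynamicMethod(
            "Cast_Int32_Double",
            typeof(void),
            new[] { typeof(IntPtr), typeof(IntPtr), typeof(int) },
            typeof(ScalarCastSketch).Module,
            skipVisibility: true);

        var il = method.GetILGenerator();
        var i = il.DeclareLocal(typeof(int));
        var check = il.DefineLabel();
        var body = il.DefineLabel();

        il.Emit(OpCodes.Ldc_I4_0);          // i = 0
        il.Emit(OpCodes.Stloc, i);
        il.Emit(OpCodes.Br, check);

        il.MarkLabel(body);
        EmitElementAddress(il, OpCodes.Ldarg_1, i, sizeof(double)); // &dst[i]
        EmitElementAddress(il, OpCodes.Ldarg_0, i, sizeof(int));    // &src[i]
        il.Emit(OpCodes.Ldind_I4);          // load src[i]
        il.Emit(OpCodes.Conv_R8);           // convert to double
        il.Emit(OpCodes.Stind_R8);          // store into dst[i]

        il.Emit(OpCodes.Ldloc, i);          // i++
        il.Emit(OpCodes.Ldc_I4_1);
        il.Emit(OpCodes.Add);
        il.Emit(OpCodes.Stloc, i);

        il.MarkLabel(check);
        il.Emit(OpCodes.Ldloc, i);          // loop while i < count
        il.Emit(OpCodes.Ldarg_2);
        il.Emit(OpCodes.Blt, body);
        il.Emit(OpCodes.Ret);

        return (CastKernel)method.CreateDelegate(typeof(CastKernel));
    }

    // Convenience wrapper for trying the kernel on managed arrays.
    public static double[] Run(int[] input)
    {
        var output = new double[input.Length];
        IntPtr src = Marshal.AllocHGlobal(sizeof(int) * Math.Max(input.Length, 1));
        IntPtr dst = Marshal.AllocHGlobal(sizeof(double) * Math.Max(input.Length, 1));
        try
        {
            Marshal.Copy(input, 0, src, input.Length);
            BuildInt32ToDouble()(src, dst, input.Length);
            Marshal.Copy(dst, output, 0, input.Length);
        }
        finally
        {
            Marshal.FreeHGlobal(src);
            Marshal.FreeHGlobal(dst);
        }
        return output;
    }
}
```

Run builds a fresh kernel on every call for simplicity; the proposal's _cache would avoid that by keeping one delegate per (srcType, dstType) pair.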
IL Emission (uses native conversion opcodes)
    private static void EmitConversion(ILGenerator il, NPTypeCode srcType, NPTypeCode dstType)
    {
        switch (dstType)
        {
            case NPTypeCode.Byte:   il.Emit(OpCodes.Conv_U1); break;
            case NPTypeCode.Int16:  il.Emit(OpCodes.Conv_I2); break;
            case NPTypeCode.Int32:  il.Emit(OpCodes.Conv_I4); break;
            case NPTypeCode.Int64:  il.Emit(OpCodes.Conv_I8); break;
            case NPTypeCode.Single: il.Emit(OpCodes.Conv_R4); break;
            case NPTypeCode.Double: il.Emit(OpCodes.Conv_R8); break;
            // ... etc
        }
    }
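One detail the switch above does not show is that the opcode sometimes depends on the source type as well. Conv_R4 and Conv_R8 treat an integer on the evaluation stack as signed, so an unsigned source needs Conv_R_Un first. The sketch below demonstrates the difference on a single value; UnsignedConvSketch and Build are hypothetical names used only for illustration.

```csharp
using System;
using System.Reflection.Emit;

// Hypothetical sketch: shows why srcType matters when the source is unsigned.
public static class UnsignedConvSketch
{
    // Builds double f(uint x) with the body: ldarg.0, [conv.r.un], conv.r8, ret.
    public static Func<uint, double> Build(bool treatSourceAsUnsigned)
    {
        var method = new DynamicMethod("UInt32ToDouble", typeof(double), new[] { typeof(uint) });
        var il = method.GetILGenerator();

        il.Emit(OpCodes.Ldarg_0);
        if (treatSourceAsUnsigned)
            il.Emit(OpCodes.Conv_R_Un);   // interpret the stack value as unsigned
        il.Emit(OpCodes.Conv_R8);         // on its own, this interprets it as signed
        il.Emit(OpCodes.Ret);

        return (Func<uint, double>)method.CreateDelegate(typeof(Func<uint, double>));
    }
}
```

With the unsigned path, 4000000000u converts to 4000000000.0. Without Conv_R_Un, the same bits are read as a negative int32, so the result is negative. An EmitConversion that covers UInt16, UInt32 and UInt64 sources would presumably need a branch on srcType for this reason.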
Expected Outcome
| Metric | Before | After | Change |
|---|---|---|---|
| Lines of code | 2,228 | ~320 | -86% |
| Type switches | 144 | 2 | -99% |
| For-loops in source | 291 | 0 | -100% |
| SIMD support | None | Yes | New |
| Regen dependency | Yes | No | Removed |
SIMD Opportunities
| Conversion | SIMD Method |
|---|---|
| int32 → int64 | Avx2.ConvertToVector256Int64(Vector128<int>) |
| float → double | Avx.ConvertToVector256Double(Vector128<float>) |
| byte → int32 | Avx2.ConvertToVector256Int32(Vector128<byte>) |
| Same-size reinterpret | Buffer.MemoryCopy |
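To make the first row concrete, here is a minimal sketch of the int32 → int64 widening loop written directly in C# rather than emitted as IL. WideningSketch, Int32ToInt64 and Widen are hypothetical names; the code assumes the project allows unsafe blocks and falls back to a scalar loop when AVX2 is unavailable or fewer than four elements remain.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical sketch of the widening path; requires <AllowUnsafeBlocks>true</AllowUnsafeBlocks>.
public static class WideningSketch
{
    public static unsafe void Int32ToInt64(int* src, long* dst, int count)
    {
        int i = 0;
        if (Avx2.IsSupported)
        {
            // Each iteration loads 4 int32 lanes and stores 4 sign-extended int64 lanes.
            for (; i <= count - 4; i += 4)
            {
                Vector128<int> narrow = Sse2.LoadVector128(src + i);
                Vector256<long> wide = Avx2.ConvertToVector256Int64(narrow);
                Avx.Store(dst + i, wide);
            }
        }

        // Scalar tail (and the whole array when AVX2 is not supported).
        for (; i < count; i++)
            dst[i] = src[i];
    }

    // Convenience wrapper for trying the loop on managed arrays.
    public static unsafe long[] Widen(int[] input)
    {
        var output = new long[input.Length];
        fixed (int* s = input)
        fixed (long* d = output)
            Int32ToInt64(s, d, input.Length);
        return output;
    }
}
```

An IL-emitted kernel would call the same intrinsics through their MethodInfo, so a version like this can also serve as a reference result when testing the generated code.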
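Implementation Plan
- Create ILKernelGenerator.Cast.cs with scalar conversion loop
- Cache kernels by (srcType, dstType) key
- Update UnmanagedMemoryBlock.CastTo to use new generator
- Delete UnmanagedMemoryBlock.Casting.cs
- Update ArrayConvert.cs to reuse cast kernels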
Complexity Assessment
| Aspect | Difficulty | Notes |
|---|---|---|
| IL emission basics | Easy | Copy patterns from ILKernelGenerator.Binary.cs |
| Conversion opcodes | Easy | IL has native Conv_* opcodes |
| Decimal handling | Medium | Requires Convert.ToDecimal() call |
| SIMD widening | Medium | Well-documented intrinsics |
| Testing 144 pairs | Tedious | Straightforward but time-consuming |
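The "Decimal handling" row is the one case where no Conv_* opcode applies, because decimal is a struct rather than an IL primitive. A minimal sketch of the call-based approach for a single value follows; DecimalCastSketch and BuildDoubleToDecimal are hypothetical names, and the choice of the Convert.ToDecimal(double) overload is only an example.

```csharp
using System;
using System.Reflection;
using System.Reflection.Emit;

// Hypothetical sketch: converting to decimal by emitting a method call.
public static class DecimalCastSketch
{
    public static Func<double, decimal> BuildDoubleToDecimal()
    {
        // Resolve the overload matching the source type.
        MethodInfo toDecimal =
            typeof(Convert).GetMethod(nameof(Convert.ToDecimal), new[] { typeof(double) })
            ?? throw new MissingMethodException("Convert.ToDecimal(double)");

        var method = new DynamicMethod("Cast_Double_Decimal", typeof(decimal), new[] { typeof(double) });
        var il = method.GetILGenerator();

        il.Emit(OpCodes.Ldarg_0);          // push the double
        il.Emit(OpCodes.Call, toDecimal);  // call Convert.ToDecimal(double)
        il.Emit(OpCodes.Ret);              // return the decimal

        return (Func<double, decimal>)method.CreateDelegate(typeof(Func<double, decimal>));
    }
}
```

Inside a pointer-based kernel, the result would likely be written with Stobj for typeof(decimal) instead of one of the Stind_* opcodes, since those only cover primitive types.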
Related Files
Will be deleted:
src/NumSharp.Core/Backends/Unmanaged/UnmanagedMemoryBlock.Casting.cs (2,228 lines)
Will be simplified:
src/NumSharp.Core/Utilities/ArrayConvert.cs (can reuse cast kernels)
New file:
src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Cast.cs (~300 lines)
References
- Existing pattern: ILKernelGenerator.Binary.cs, ILKernelGenerator.Unary.cs
- Design doc: docs/examples/CastKernel_Proposal.cs
- Parent tracking issue: docs/ISSUE_IL_MIGRATION.md