SIMD diagonal fill (SSE4.2) #71

## Parent

#58 (Phase 1 MVP: Evidence-Informed Alignment with Deferred Filtering)

## What to build

Implement an SSE4.2 SIMD-accelerated diagonal fill for WFA (wavefront alignment) in phraya-align, on x86_64 only. Use runtime CPU feature detection via `is_x86_feature_detected!("sse4.2")` to dispatch between the SIMD and scalar implementations. Document every `unsafe` block with a `SAFETY` comment that states the invariants it relies on.

The SIMD diagonal fill processes several cells per instruction (a 128-bit SSE register holds four `i32` lanes), which should speed up the fill on x86_64 CPUs that support SSE4.2. The scalar fallback keeps results correct on CPUs without it.
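The dispatch described above could look roughly like the sketch below. It is illustrative only: the per-cell operation (advance each offset by one and clamp negatives to zero) is a placeholder standing in for the real WFA recurrence from #70, and `wfa_diagonal_fill` and `wfa_diagonal_fill_scalar` are hypothetical names. Only `wfa_diagonal_fill_sse42` and its signature come from this issue.

```rust
/// Scalar reference kernel.
/// ASSUMPTION: placeholder per-cell update (add 1, clamp at 0), not the real WFA recurrence.
pub fn wfa_diagonal_fill_scalar(diagonal: &mut [i32]) {
    for cell in diagonal.iter_mut() {
        // wrapping_add matches the wrap-around behaviour of _mm_add_epi32.
        *cell = cell.wrapping_add(1).max(0);
    }
}

/// SSE4.2 kernel: same placeholder update, four i32 lanes per iteration.
///
/// # Safety
/// The caller must ensure the CPU supports SSE4.2.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.2")]
pub unsafe fn wfa_diagonal_fill_sse42(diagonal: &mut [i32]) {
    use std::arch::x86_64::*;

    let mut chunks = diagonal.chunks_exact_mut(4);
    for chunk in &mut chunks {
        let ptr = chunk.as_mut_ptr() as *mut __m128i;
        // SAFETY: `chunk` holds exactly four i32 (16 bytes), so a 128-bit
        // unaligned load and store through `ptr` stay in bounds. SSE4.2
        // support is guaranteed by this function's safety contract.
        unsafe {
            let one = _mm_set1_epi32(1);
            let zero = _mm_setzero_si128();
            let v = _mm_loadu_si128(ptr as *const __m128i);
            let v = _mm_max_epi32(_mm_add_epi32(v, one), zero);
            _mm_storeu_si128(ptr, v);
        }
    }
    // Fewer than four cells remain; finish them with the scalar kernel.
    wfa_diagonal_fill_scalar(chunks.into_remainder());
}

/// Hypothetical entry point: picks the SIMD kernel when the CPU supports it.
pub fn wfa_diagonal_fill(diagonal: &mut [i32]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse4.2") {
            // SAFETY: SSE4.2 support was just confirmed at runtime.
            unsafe { wfa_diagonal_fill_sse42(diagonal) };
            return;
        }
    }
    // Non-x86_64 targets and CPUs without SSE4.2 take the scalar path.
    wfa_diagonal_fill_scalar(diagonal);
}
```

Calling `wfa_diagonal_fill` and `wfa_diagonal_fill_scalar` on copies of the same slice and comparing the results is one way to check the "SIMD result == scalar result" criterion, including slices whose length is not a multiple of four.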

## Acceptance criteria

- [ ] `wfa_diagonal_fill_sse42(diagonal: &mut [i32])` using SSE4.2 intrinsics
- [ ] Runtime CPU detection: use SIMD if available, scalar fallback otherwise
- [ ] All unsafe blocks documented with SAFETY comments
- [ ] Tests: correctness (SIMD result == scalar result)
- [ ] Tests: SIMD enabled path (requires SSE4.2 CPU or emulation)
- [ ] Tests: scalar fallback path (force disable SIMD)
- [ ] Tests: alignment correctness vs baseline scalar WFA
- [ ] Benchmark: 10kb alignment (scalar vs SIMD, measure speedup)
- [ ] Cross-platform: SIMD path compiled out on non-x86_64 targets (ARM, etc.), which use the scalar implementation
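One possible way to make the "force disable SIMD" and cross-platform criteria testable is to separate backend selection from the kernels. The sketch below is an assumption about how that might be structured, not a decided design: the `Backend` enum, `select_backend`, and the `force_scalar` flag are all hypothetical names.

```rust
/// Hypothetical backend tag returned by the selector.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Scalar,
    Sse42,
}

/// Hypothetical selector. `force_scalar` lets tests and benchmarks pin the
/// fallback path even on CPUs that support SSE4.2.
pub fn select_backend(force_scalar: bool) -> Backend {
    if force_scalar {
        return Backend::Scalar;
    }
    // The detection macro only exists on x86 targets, so gate it at compile time.
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse4.2") {
            return Backend::Sse42;
        }
    }
    // Non-x86_64 targets, or x86_64 CPUs without SSE4.2.
    Backend::Scalar
}
```

A test for the scalar fallback can then call `select_backend(true)` and assert `Backend::Scalar`, while the SIMD-path test calls `select_backend(false)` and skips itself when the result is not `Backend::Sse42`.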

## Blocked by

#70 (WFA base implementation)
