Conversation


@JH-Leon-KIM-AMD JH-Leon-KIM-AMD commented Jan 14, 2026

[CK_BUILDER] Testing-specific descriptor initialization

Problem

The CK Builder testing framework (conv_fwd.hpp) currently depends on old CK host descriptor helpers to generate tensor descriptors. This causes several issues:

  1. Noisy warnings during tests:

    Warning: Only RowMajor and ColumnMajor layouts are supported for empty strides...
    Warning: Tensor layout verification for ck::tensor_layout::convolution layouts is not supported yet...
    
  2. Unwanted coupling: The backend-agnostic testing layer depends on old CK utility code:

    #include "ck/library/utility/convolution_parameter.hpp"
    #include "ck/library/utility/convolution_host_tensor_descriptor_helper.hpp"
  3. Packed-only limitation: Old CK helpers only support packed (contiguous) tensors; custom strides are not supported

  4. Extra indirection: Must construct ck::utils::conv::ConvParam just to compute descriptors

Implementation

Implement testing-specific descriptor initialization in conv_fwd.hpp:

Core changes:

  1. Remove old CK dependency

    • Removed includes: ck/library/utility/convolution_*
    • Removed calls: ck::utils::conv::make_*_host_tensor_descriptor_*_packed()
    • Removed method: Args::to_ck_conv_param()
  2. Add testing-owned conv parameter struct (ConvFwdProblem)

    • Contains: G, N, C, K, spatial dims, strides, dilations, pads
    • Computes output spatial using standard conv formula
    • No dependency on old CK or CK Tile utility types
  3. Add Args::make_conv_problem()

    • Single source of truth for conv parameters
    • Used by both descriptor generation and kernel launchers
  4. Rewrite descriptor methods with testing-specific stride mapping

    • make_input_descriptor() / make_weight_descriptor() / make_output_descriptor()
    • Supports all forward layouts: GNHWC, NHWGC, NGCHW, etc. (1D/2D/3D)
    • Includes *_strided layout variants
  5. Add explicit tensor-memory stride API

    • New optional fields: input_strides, weight_strides, output_strides
    • Default behavior: packed (existing tests unchanged)
    • Enables future non-contiguous tensor testing
  6. Update launchers to use make_conv_problem()

    • conv_fwd_ck.hpp (old CK runner)
    • conv_fwd_reference.hpp (GPU reference runner)

Additional improvements:

  1. Add CK Tile forward EndToEnd support

    • New file: conv_fwd_ck_tile.hpp (CK Tile runner)
    • Merged test: test_ckb_conv_fwd_2d_fp16_v3.cpp (Create + EndToEnd)
    • Proves the shared Args/descriptor design works across backends
  2. Fix smoke test (unit_validation.cpp)

    • Changed BF16 initialization from bit pattern to numeric conversion
    • Ensures reliable mismatch detection

Note on PR #3518

This PR conflicts with #3518 (barkocot/tile-builder-testing). Both modify conv_fwd.hpp, but with different approaches.
This PR will wait for #3518 to be merged and then resolve the resulting conflicts.

Checklist

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt (test already registered)
  • I have added inline documentation which helps the maintainers understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes (internal testing framework change, not user-facing)
  • I have run clang-format on all changed files
  • Any dependent changes have been merged (based on current develop)

Discussion

Fix smoke test (unit_validation.cpp)
Bug: Converting an integer directly to BF16 treated the integer as a raw bit pattern instead of a numeric value, making the test unreliable.
Example:
ck::type_convert<ck::bhalf_t>(100) // integer → BF16

Integer 100 in binary (16-bit): 0000 0000 0110 0100 = 0x0064

When interpreted as BF16 format:
┌─ sign (1 bit)
│ ┌─ exponent (8 bits)
│ │        ┌─ mantissa (7 bits)
│ │        │
0 00000000 1100100
│ │        │
│ │        └─ mantissa = 100
│ └─ exponent = 0 (special: subnormal number!)
└─ sign = 0 (positive)

BF16 with exponent=0 = subnormal = extremely tiny
≈ 9.2 × 10⁻³⁹ (basically 0.0)

Fix: Convert integer → float → BF16 so we get numeric values (5.0, 100.0, etc.) instead of raw bit patterns.
ck::type_convert<ck::bhalf_t>(static_cast<float>(100)) // integer → float → BF16

Step 1: integer 100 → float 100.0
  Float32 representation of 100.0:
  0x42C80000 = 0100 0010 1100 1000 0000 0000 0000 0000
               │        │                            │
               sign=0   exp=10000101 (133)           mantissa

Step 2: float 100.0 → BF16
  BF16 = take top 16 bits of float32
  0x42C8 = 0100 0010 1100 1000
           │        │      │
           sign=0   exp=10000101   mantissa (truncated)

This represents: 1.5625 × 2^(133-127) = 1.5625 × 2^6 = 100.0 ✓

Test code and Result

// Demonstration of BF16 conversion bug
// Compile: hipcc -std=c++20 bf16_conversion_demo.cpp -o bf16_demo
// Run: ./bf16_demo

#include <hip/hip_runtime.h>
#include <iostream>
#include <iomanip>
#include <cstdint>
#include <cmath> // for std::abs(float); without it, the integer overload may be picked

// BF16 is just ushort (16-bit unsigned)
using bhalf_t = unsigned short;

// Generic type_convert (what was used before - WRONG for int→bf16)
template <typename Y, typename X>
Y type_convert_generic(X x)
{
    return static_cast<Y>(x);
}

// Proper conversion via float (CORRECT)
bhalf_t float_to_bf16(float x)
{
    // BF16 = top 16 bits of float32
    union {
        float f;
        uint32_t i;
    } u;
    u.f = x;
    return static_cast<bhalf_t>(u.i >> 16);
}

// Convert BF16 back to float for display
float bf16_to_float(bhalf_t x)
{
    union {
        float f;
        uint32_t i;
    } u;
    u.i = static_cast<uint32_t>(x) << 16;
    return u.f;
}

int main()
{
    std::cout << std::fixed << std::setprecision(10);
    std::cout << "=== BF16 Conversion Bug Demo ===\n\n";

    // Test values
    int test_values[] = {0, 1, 5, 100, 101};

    std::cout << "Converting integers to BF16:\n";
    std::cout << "-----------------------------------------------------------\n";
    std::cout << "Value  | Wrong (int→bf16)      | Correct (int→float→bf16)\n";
    std::cout << "-----------------------------------------------------------\n";

    for(int val : test_values)
    {
        // WRONG: direct integer cast
        bhalf_t wrong = type_convert_generic<bhalf_t>(val);
        
        // CORRECT: via float
        bhalf_t correct = float_to_bf16(static_cast<float>(val));

        std::cout << std::setw(6) << val << " | "
                  << "0x" << std::hex << std::setw(4) << std::setfill('0') << wrong
                  << " → " << std::dec << std::setfill(' ') << std::setw(12) 
                  << bf16_to_float(wrong) << " | "
                  << "0x" << std::hex << std::setw(4) << std::setfill('0') << correct
                  << " → " << std::dec << std::setfill(' ') << std::setw(12)
                  << bf16_to_float(correct) << "\n";
    }

    std::cout << "\n=== The Problem ===\n";
    std::cout << "When test fills tensor with i % 100 and i % 101:\n\n";

    // Show the bug at i=100 and i=101
    std::cout << "At index i=100:\n";
    bhalf_t a_100_wrong = type_convert_generic<bhalf_t>(100 % 100);  // = 0
    bhalf_t b_100_wrong = type_convert_generic<bhalf_t>(100 % 101);  // = 100
    std::cout << "  Tensor a (i%100): " << (100%100) << " → BF16 " 
              << bf16_to_float(a_100_wrong) << "\n";
    std::cout << "  Tensor b (i%101): " << (100%101) << " → BF16 " 
              << bf16_to_float(b_100_wrong) << "\n";
    std::cout << "  Compare: " << bf16_to_float(a_100_wrong) 
              << " vs " << bf16_to_float(b_100_wrong);
    
    float diff_wrong = std::abs(bf16_to_float(a_100_wrong) - bf16_to_float(b_100_wrong));
    if(diff_wrong < 1e-6)
        std::cout << " → Nearly identical! ❌\n";
    else
        std::cout << " → Different ✓\n";

    std::cout << "\nWith FIX (via float):\n";
    bhalf_t a_100_fixed = float_to_bf16(static_cast<float>(100 % 100));
    bhalf_t b_100_fixed = float_to_bf16(static_cast<float>(100 % 101));
    std::cout << "  Tensor a (i%100): " << (100%100) << " → BF16 " 
              << bf16_to_float(a_100_fixed) << "\n";
    std::cout << "  Tensor b (i%101): " << (100%101) << " → BF16 " 
              << bf16_to_float(b_100_fixed) << "\n";
    std::cout << "  Compare: " << bf16_to_float(a_100_fixed) 
              << " vs " << bf16_to_float(b_100_fixed);
    
    float diff_fixed = std::abs(bf16_to_float(a_100_fixed) - bf16_to_float(b_100_fixed));
    if(diff_fixed > 1.0)
        std::cout << " → Clearly different! ✓\n";
    else
        std::cout << " → Too similar ❌\n";


    return 0;
}

Result

=== BF16 Conversion Bug Demo ===

Converting integers to BF16:
-----------------------------------------------------------
Value  | Wrong (int→bf16)      | Correct (int→float→bf16)
-----------------------------------------------------------
     0 | 0x0000 → 0.0000000000 | 0x0000 → 0.0000000000
     1 | 0x0001 → 0.0000000000 | 0x3f80 → 1.0000000000
     5 | 0x0005 → 0.0000000000 | 0x40a0 → 5.0000000000
   100 | 0x0064 → 0.0000000000 | 0x42c8 → 100.0000000000
   101 | 0x0065 → 0.0000000000 | 0x42ca → 101.0000000000

=== The Problem ===
When test fills tensor with i % 100 and i % 101:

At index i=100:
  Tensor a (i%100): 0 → BF16 0.0000000000
  Tensor b (i%101): 100 → BF16 0.0000000000
  Compare: 0.0000000000 vs 0.0000000000 → Nearly identical! ❌

With FIX (via float):
  Tensor a (i%100): 0 → BF16 0.0000000000
  Tensor b (i%101): 100 → BF16 100.0000000000
  Compare: 0.0000000000 vs 100.0000000000 → Clearly different! ✓

@JH-Leon-KIM-AMD JH-Leon-KIM-AMD changed the title [CK_BUILDER] ALMIOPEN-522: Testing-specific descriptor initialization [CK_BUILDER] Testing-specific descriptor initialization Jan 14, 2026
Remove old CK host descriptor helper dependency from CK Builder testing
framework and implement testing-owned descriptor computation.

Core changes (ALMIOPEN-522):
- Remove ck/library/utility/convolution_* includes from conv_fwd.hpp
- Add ConvFwdProblem struct (testing-owned conv parameter description)
- Add Args::make_conv_problem() (computes output spatial, no old CK dependency)
- Rewrite make_input/weight/output_descriptor() with testing-specific
  stride mapping for all supported layouts (GNHWC, NHWGC, etc.)
- Add optional explicit tensor-memory stride API (std::optional fields)
- Update conv_fwd_ck.hpp and conv_fwd_reference.hpp to use make_conv_problem()
- Remove to_ck_conv_param() method

Additional improvements:
- Add CK Tile forward EndToEnd support (conv_fwd_ck_tile.hpp + test)
- Proves shared Args/descriptor design works across backends
- Merge Create + EndToEnd tests into single file (matches old CK pattern)
- Fix unit_validation.cpp BF16 initialization for reliable testing

Results:
- Old CK warnings removed (no more RowMajor/ColumnMajor spam)
- All smoke tests pass (5/5)
- Old CK EndToEnd passes (XDL vs GPU reference)
- CK Tile EndToEnd passes (Tile vs GPU reference)

Note: This PR conflicts with #3518 (tile-builder-testing). Both touch
conv_fwd.hpp but with different approaches. This implementation directly
addresses ALMIOPEN-522 requirements by removing old CK dependency.
@JH-Leon-KIM-AMD JH-Leon-KIM-AMD force-pushed the jeonghyun/ckb-almiopen-522-descriptor-init branch from df4523d to ae55803 on January 15, 2026 09:53
Comment on lines +60 to +76
template <int SPATIAL_DIM>
struct ConvFwdProblem
{
    int G = 1;
    int N = 1;
    int C = 1;
    int K = 1;

    std::array<long_index_t, SPATIAL_DIM> input_spatial  = {};
    std::array<long_index_t, SPATIAL_DIM> filter_spatial = {};
    std::array<long_index_t, SPATIAL_DIM> output_spatial = {};

    std::array<long_index_t, SPATIAL_DIM> conv_strides   = {};
    std::array<long_index_t, SPATIAL_DIM> conv_dilations = {};
    std::array<long_index_t, SPATIAL_DIM> left_pads      = {};
    std::array<long_index_t, SPATIAL_DIM> right_pads     = {};
};
Contributor
Hmm, I'm not totally convinced that this structure is required, since it basically encodes what's in the Args but in a slightly different way. In the ckt::run implementations, couldn't we just get these values straight from the Args struct? If the FilterExtent type is annoying, we could consider adding something like to_vector<index_type>() and/or to_array<index_type>() as members. For output_spatial, you could turn that into a member function of Args. That would then return a FilterExtent<SPATIAL_DIM>, so that the type is the same as Args::lengths::image and Args::lengths::filter. The implementation of ckt::run would then have something like this:

auto ck_args = conv.MakeArgument(
  ...,
  args.filter_strides.to_array<ck::index_t>(),
  args.filter_dilations.to_array<ck::index_t>(),
  args.input_left_pad.to_array<ck::index_t>(),
  args.input_right_pad.to_array<ck::index_t>(),
  ...
)

What do you think?

Comment on lines +109 to +128
template <int SPATIAL_DIM>
std::array<long_index_t, SPATIAL_DIM> to_spatial_array(const FilterExtent<SPATIAL_DIM>& extent)
{
    if constexpr(SPATIAL_DIM == 1)
    {
        return {static_cast<long_index_t>(extent.width)};
    }
    else if constexpr(SPATIAL_DIM == 2)
    {
        // CK Builder uses spatial ordering {H, W} for 2D.
        return {static_cast<long_index_t>(extent.height),
                static_cast<long_index_t>(extent.width)};
    }
    else
    {
        // CK Builder uses spatial ordering {D, H, W} for 3D.
        return {static_cast<long_index_t>(extent.depth),
                static_cast<long_index_t>(extent.height),
                static_cast<long_index_t>(extent.width)};
    }
}
Contributor

You could replace this then by the hypothetical FilterExtent::to_array<index_type>()

Comment on lines +274 to +289
if constexpr(SPATIAL_DIM == 1)
{
    if constexpr(layout == TensorLayout::GNCW)
        return detail::make_packed_strides_for_order<4>(lens, {0, 1, 2, 3});
    else if constexpr(layout == TensorLayout::GNWC ||
                      layout == TensorLayout::G_NW_C_strided)
        return detail::make_packed_strides_for_order<4>(lens, {0, 1, 3, 2});
    else if constexpr(layout == TensorLayout::NWGC)
        return detail::make_packed_strides_for_order<4>(lens, {1, 3, 0, 2});
    else if constexpr(layout == TensorLayout::NGCW)
        return detail::make_packed_strides_for_order<4>(lens, {1, 0, 2, 3});
    else
        static_assert(sizeof(UnsupportedEnumValue<layout>) == 0,
                      "Unsupported 1D input layout for descriptor initialization.");
}
else if constexpr(SPATIAL_DIM == 2)
else if constexpr(SPATIAL_DIM == 2)
Contributor
@Snektron Snektron Jan 16, 2026


I don't really like the amount of chained conditions going on in this function. The TensorLayout already kind of "encodes" the spatial dimension, so the first step I would do is turning this into a giant switch instead:

switch (layout) {
// 1D layouts
case TensorLayout::GNCW:
  ...
case TensorLayout::GNWC:
  ...
// 2D layouts
case TensorLayout::GNCHW:
  ...
/// 3D layouts
case TensorLayout::GNCDHW:
  ...
}

Note that this switch can in theory contain the logic for input, weight, and output tensors alike, since all the layouts are in the same enum.

I also don't really like the way that the condition is tied to make_packed_strides_for_order. There is already functionality for constructing packed TensorDescriptors (see PackedRightLayout), and putting all this code here means a hard-to-test function. On top of that, I think a more generic way to query the layout order from a layout would be nice for downstream users, and this would also be much easier to test. I would try to extract this code into a separate function, maybe something like this:

// Get the order from dims=[G, N/K, K/C, D/Z, H/Y, W/X] to LAYOUT.
// Returns an array which has on index i a mapping to the index in
// the `dims` array.
template <TensorLayout LAYOUT>
constexpr auto layout_order() {
  // avoid manually writing the size of the order every time
  const auto order = [](auto... args) {
    return std::array<size_t, sizeof...(args)>{args...};
  };
 
  switch (LAYOUT)
  {
  case GNCW:
    return order(0, 1, 2, 3);
  case GNWC:
    return order(0, 1, 3, 2);
  ... etc, for 1D, 2D, 3D, and input/weight/output in the same function
  }
}

Then your implementation of make_input/weight/output_descriptor looks like this:

InputDescriptor make_input_descriptor() const {
  Extent lens = // compute as before
  if (input_strides.has_value()) {
    return InputDescriptor(lens, input_strides.value());
  }

  const auto order = layout_order<SIGNATURE.input.config.layout>();
  const auto ordered_lens = reorder(lens, order); // You can put this in tensor_descriptor.hpp
  return InputDescriptor(ordered_lens, PackedRightLayout{});
}

You can now separately test layout_order and reorder, and there is no need for ALSO testing whether make_packed_strides_for_order actually packs the order or not, since constructing tensors using PackedRightLayout{} is already tested separately.
