Conversation


@JH-Leon-KIM-AMD JH-Leon-KIM-AMD commented Jan 14, 2026

[CK_BUILDER] Testing-specific descriptor initialization

Problem

The CK Builder testing framework (conv_fwd.hpp) currently depends on old CK host descriptor helpers to generate tensor descriptors. This causes several issues:

  1. Noisy warnings during tests:

    Warning: Only RowMajor and ColumnMajor layouts are supported for empty strides...
    Warning: Tensor layout verification for ck::tensor_layout::convolution layouts is not supported yet...
    
  2. Unwanted coupling: The backend-agnostic testing layer depends on old CK utility code:

    #include "ck/library/utility/convolution_parameter.hpp"
    #include "ck/library/utility/convolution_host_tensor_descriptor_helper.hpp"
  3. Packed-only limitation: Old CK helpers only support packed (contiguous) tensors; custom strides are not supported

  4. Extra indirection: Must construct ck::utils::conv::ConvParam just to compute descriptors

Implementation

Implement testing-specific descriptor initialization in conv_fwd.hpp:

Core changes:

  1. Remove old CK dependency

    • Removed includes: ck/library/utility/convolution_*
    • Removed calls: ck::utils::conv::make_*_host_tensor_descriptor_*_packed()
    • Removed method: Args::to_ck_conv_param()
  2. Add testing-owned conv parameter struct (ConvFwdProblem)

    • Contains: G, N, C, K, spatial dims, strides, dilations, pads
    • Computes output spatial using standard conv formula
    • No dependency on old CK or CK Tile utility types
  3. Add Args::make_conv_problem()

    • Single source of truth for conv parameters
    • Used by both descriptor generation and kernel launchers
  4. Rewrite descriptor methods with testing-specific stride mapping

    • make_input_descriptor() / make_weight_descriptor() / make_output_descriptor()
    • Supports all forward layouts: GNHWC, NHWGC, NGCHW, etc. (1D/2D/3D)
    • Includes *_strided layout variants
  5. Add explicit tensor-memory stride API

    • New optional fields: input_strides, weight_strides, output_strides
    • Default behavior: packed (existing tests unchanged)
    • Enables future non-contiguous tensor testing
  6. Update launchers to use make_conv_problem()

    • conv_fwd_ck.hpp (old CK runner)
    • conv_fwd_reference.hpp (GPU reference runner)

Additional improvements:

  1. Add CK Tile forward EndToEnd support

    • New file: conv_fwd_ck_tile.hpp (CK Tile runner)
    • Merged test: test_ckb_conv_fwd_2d_fp16_v3.cpp (Create + EndToEnd)
    • Proves the shared Args/descriptor design works across backends
  2. Fix smoke test (unit_validation.cpp)

    • Changed BF16 initialization from bit pattern to numeric conversion
    • Ensures reliable mismatch detection

Note on PR #3518

This PR conflicts with #3518 (barkocot/tile-builder-testing). Both modify conv_fwd.hpp, but with different approaches.
This PR will wait for #3518 to be merged and then resolve the resulting conflicts.

Checklist

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt (test already registered)
  • I have added inline documentation which helps the maintainers understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes (internal testing framework change, not user-facing)
  • I have run clang-format on all changed files
  • Any dependent changes have been merged (based on current develop)

Discussion

Fix smoke test (unit_validation.cpp)
Bug: Converting an integer directly to BF16 treated the integer as a raw bit pattern instead of a numeric value, making the test unreliable.
Example:
ck::type_convert<ck::bhalf_t>(100) // integer → BF16

Integer 100 in binary (16-bit): 0000 0000 0110 0100 = 0x0064

When interpreted as BF16 format:
┌─ sign (1 bit)
│ ┌─ exponent (8 bits)
│ │        ┌─ mantissa (7 bits)
│ │        │
0 00000000 1100100
│ │        │
│ │        └─ mantissa = 100
│ └─ exponent = 0 (special: subnormal number!)
└─ sign = 0 (positive)

BF16 with exponent=0 = subnormal = extremely tiny
≈ 9.2 × 10⁻³⁹ (basically 0.0)

Fix: Convert integer → float → BF16 so we get numeric values (5.0, 100.0, etc.) instead of raw bit patterns.
ck::type_convert<ck::bhalf_t>(static_cast<float>(100)) // integer → float → BF16

Step 1: integer 100 → float 100.0
  Float32 representation of 100.0:
  0x42C80000 = 0100 0010 1100 1000 0000 0000 0000 0000
               │        │                            │
               sign=0   exp=10000101 (133)           mantissa

Step 2: float 100.0 → BF16
  BF16 = take top 16 bits of float32
  0x42C8 = 0100 0010 1100 1000
           │        │      │
           sign=0   exp=10000101   mantissa (truncated)

This represents: 1.5625 × 2^(133-127) = 1.5625 × 2^6 = 100.0 ✓

Test code and Result

// Demonstration of BF16 conversion bug
// Compile: hipcc -std=c++20 bf16_conversion_demo.cpp -o bf16_demo
// Run: ./bf16_demo

#include <hip/hip_runtime.h>
#include <iostream>
#include <iomanip>
#include <cstdint>
#include <cmath> // for std::abs(float); without it, the integer overload may be picked

// BF16 is just ushort (16-bit unsigned)
using bhalf_t = unsigned short;

// Generic type_convert (what was used before - WRONG for int→bf16)
template <typename Y, typename X>
Y type_convert_generic(X x)
{
    return static_cast<Y>(x);
}

// Proper conversion via float (CORRECT)
bhalf_t float_to_bf16(float x)
{
    // BF16 = top 16 bits of float32
    union {
        float f;
        uint32_t i;
    } u;
    u.f = x;
    return static_cast<bhalf_t>(u.i >> 16);
}

// Convert BF16 back to float for display
float bf16_to_float(bhalf_t x)
{
    union {
        float f;
        uint32_t i;
    } u;
    u.i = static_cast<uint32_t>(x) << 16;
    return u.f;
}

int main()
{
    std::cout << std::fixed << std::setprecision(10);
    std::cout << "=== BF16 Conversion Bug Demo ===\n\n";

    // Test values
    int test_values[] = {0, 1, 5, 100, 101};

    std::cout << "Converting integers to BF16:\n";
    std::cout << "-----------------------------------------------------------\n";
    std::cout << "Value  | Wrong (int→bf16)      | Correct (int→float→bf16)\n";
    std::cout << "-----------------------------------------------------------\n";

    for(int val : test_values)
    {
        // WRONG: direct integer cast
        bhalf_t wrong = type_convert_generic<bhalf_t>(val);
        
        // CORRECT: via float
        bhalf_t correct = float_to_bf16(static_cast<float>(val));

        std::cout << std::setw(6) << val << " | "
                  << "0x" << std::hex << std::setw(4) << std::setfill('0') << wrong
                  << " → " << std::dec << std::setfill(' ') << std::setw(12) 
                  << bf16_to_float(wrong) << " | "
                  << "0x" << std::hex << std::setw(4) << std::setfill('0') << correct
                  << " → " << std::dec << std::setfill(' ') << std::setw(12)
                  << bf16_to_float(correct) << "\n";
    }

    std::cout << "\n=== The Problem ===\n";
    std::cout << "When test fills tensor with i % 100 and i % 101:\n\n";

    // Show the bug at i=100 and i=101
    std::cout << "At index i=100:\n";
    bhalf_t a_100_wrong = type_convert_generic<bhalf_t>(100 % 100);  // = 0
    bhalf_t b_100_wrong = type_convert_generic<bhalf_t>(100 % 101);  // = 100
    std::cout << "  Tensor a (i%100): " << (100%100) << " → BF16 " 
              << bf16_to_float(a_100_wrong) << "\n";
    std::cout << "  Tensor b (i%101): " << (100%101) << " → BF16 " 
              << bf16_to_float(b_100_wrong) << "\n";
    std::cout << "  Compare: " << bf16_to_float(a_100_wrong) 
              << " vs " << bf16_to_float(b_100_wrong);
    
    float diff_wrong = std::abs(bf16_to_float(a_100_wrong) - bf16_to_float(b_100_wrong));
    if(diff_wrong < 1e-6)
        std::cout << " → Nearly identical! ❌\n";
    else
        std::cout << " → Different ✓\n";

    std::cout << "\nWith FIX (via float):\n";
    bhalf_t a_100_fixed = float_to_bf16(static_cast<float>(100 % 100));
    bhalf_t b_100_fixed = float_to_bf16(static_cast<float>(100 % 101));
    std::cout << "  Tensor a (i%100): " << (100%100) << " → BF16 " 
              << bf16_to_float(a_100_fixed) << "\n";
    std::cout << "  Tensor b (i%101): " << (100%101) << " → BF16 " 
              << bf16_to_float(b_100_fixed) << "\n";
    std::cout << "  Compare: " << bf16_to_float(a_100_fixed) 
              << " vs " << bf16_to_float(b_100_fixed);
    
    float diff_fixed = std::abs(bf16_to_float(a_100_fixed) - bf16_to_float(b_100_fixed));
    if(diff_fixed > 1.0)
        std::cout << " → Clearly different! ✓\n";
    else
        std::cout << " → Too similar ❌\n";


    return 0;
}

Result

=== BF16 Conversion Bug Demo ===

Converting integers to BF16:
-----------------------------------------------------------
Value  | Wrong (int→bf16)      | Correct (int→float→bf16)
-----------------------------------------------------------
     0 | 0x0000 → 0.0000000000 | 0x0000 → 0.0000000000
     1 | 0x0001 → 0.0000000000 | 0x3f80 → 1.0000000000
     5 | 0x0005 → 0.0000000000 | 0x40a0 → 5.0000000000
   100 | 0x0064 → 0.0000000000 | 0x42c8 → 100.0000000000
   101 | 0x0065 → 0.0000000000 | 0x42ca → 101.0000000000

=== The Problem ===
When test fills tensor with i % 100 and i % 101:

At index i=100:
  Tensor a (i%100): 0 → BF16 0.0000000000
  Tensor b (i%101): 100 → BF16 0.0000000000
  Compare: 0.0000000000 vs 0.0000000000 → Nearly identical! ❌

With FIX (via float):
  Tensor a (i%100): 0 → BF16 0.0000000000
  Tensor b (i%101): 100 → BF16 100.0000000000
  Compare: 0.0000000000 vs 100.0000000000 → Clearly different! ✓

@JH-Leon-KIM-AMD JH-Leon-KIM-AMD changed the title [CK_BUILDER] ALMIOPEN-522: Testing-specific descriptor initialization [CK_BUILDER] Testing-specific descriptor initialization Jan 14, 2026
Remove old CK host descriptor helper dependency from CK Builder testing
framework and implement testing-owned descriptor computation.

Core changes (ALMIOPEN-522):
- Remove ck/library/utility/convolution_* includes from conv_fwd.hpp
- Add ConvFwdProblem struct (testing-owned conv parameter description)
- Add Args::make_conv_problem() (computes output spatial, no old CK dependency)
- Rewrite make_input/weight/output_descriptor() with testing-specific
  stride mapping for all supported layouts (GNHWC, NHWGC, etc.)
- Add optional explicit tensor-memory stride API (std::optional fields)
- Update conv_fwd_ck.hpp and conv_fwd_reference.hpp to use make_conv_problem()
- Remove to_ck_conv_param() method

Additional improvements:
- Add CK Tile forward EndToEnd support (conv_fwd_ck_tile.hpp + test)
- Proves shared Args/descriptor design works across backends
- Merge Create + EndToEnd tests into single file (matches old CK pattern)
- Fix unit_validation.cpp BF16 initialization for reliable testing

Results:
- Old CK warnings removed (no more RowMajor/ColumnMajor spam)
- All smoke tests pass (5/5)
- Old CK EndToEnd passes (XDL vs GPU reference)
- CK Tile EndToEnd passes (Tile vs GPU reference)

Note: This PR conflicts with #3518 (tile-builder-testing). Both touch
conv_fwd.hpp but with different approaches. This implementation directly
addresses ALMIOPEN-522 requirements by removing old CK dependency.
@JH-Leon-KIM-AMD JH-Leon-KIM-AMD force-pushed the jeonghyun/ckb-almiopen-522-descriptor-init branch from df4523d to ae55803 on January 15, 2026 09:53
Comment on lines +60 to +76
template <int SPATIAL_DIM>
struct ConvFwdProblem
{
    int G = 1;
    int N = 1;
    int C = 1;
    int K = 1;

    std::array<long_index_t, SPATIAL_DIM> input_spatial  = {};
    std::array<long_index_t, SPATIAL_DIM> filter_spatial = {};
    std::array<long_index_t, SPATIAL_DIM> output_spatial = {};

    std::array<long_index_t, SPATIAL_DIM> conv_strides   = {};
    std::array<long_index_t, SPATIAL_DIM> conv_dilations = {};
    std::array<long_index_t, SPATIAL_DIM> left_pads      = {};
    std::array<long_index_t, SPATIAL_DIM> right_pads     = {};
};
Contributor
Hmm, I'm not totally convinced that this structure is required, since it basically encodes what's in the Args but in a slightly different way. In the ckt::run implementations, couldn't we just get these values straight from the Args struct? If the FilterExtent type is annoying, we could consider adding something like to_vector<index_type>() and/or to_array<index_type>() as members. For output_spatial, you could turn that into a member function of Args. That would then return a FilterExtent<SPATIAL_DIM>, so that the type is the same as Args::lengths::image and Args::lengths::filter. The implementation of ckt::run would then have something like this:

auto ck_args = conv.MakeArgument(
  ...,
  args.filter_strides.to_array<ck::index_t>(),
  args.filter_dilations.to_array<ck::index_t>(),
  args.input_left_pad.to_array<ck::index_t>(),
  args.input_right_pad.to_array<ck::index_t>(),
  ...
)

What do you think?

Comment on lines +109 to +128
template <int SPATIAL_DIM>
std::array<long_index_t, SPATIAL_DIM> to_spatial_array(const FilterExtent<SPATIAL_DIM>& extent)
{
    if constexpr(SPATIAL_DIM == 1)
    {
        return {static_cast<long_index_t>(extent.width)};
    }
    else if constexpr(SPATIAL_DIM == 2)
    {
        // CK Builder uses spatial ordering {H, W} for 2D.
        return {static_cast<long_index_t>(extent.height),
                static_cast<long_index_t>(extent.width)};
    }
    else
    {
        // CK Builder uses spatial ordering {D, H, W} for 3D.
        return {static_cast<long_index_t>(extent.depth),
                static_cast<long_index_t>(extent.height),
                static_cast<long_index_t>(extent.width)};
    }
}
Contributor

You could replace this then by the hypothetical FilterExtent::to_array<index_type>()

Comment on lines +274 to +289
if constexpr(SPATIAL_DIM == 1)
{
    if constexpr(layout == TensorLayout::GNCW)
        return detail::make_packed_strides_for_order<4>(lens, {0, 1, 2, 3});
    else if constexpr(layout == TensorLayout::GNWC ||
                      layout == TensorLayout::G_NW_C_strided)
        return detail::make_packed_strides_for_order<4>(lens, {0, 1, 3, 2});
    else if constexpr(layout == TensorLayout::NWGC)
        return detail::make_packed_strides_for_order<4>(lens, {1, 3, 0, 2});
    else if constexpr(layout == TensorLayout::NGCW)
        return detail::make_packed_strides_for_order<4>(lens, {1, 0, 2, 3});
    else
        static_assert(sizeof(UnsupportedEnumValue<layout>) == 0,
                      "Unsupported 1D input layout for descriptor initialization.");
}
else if constexpr(SPATIAL_DIM == 2)
else if constexpr(SPATIAL_DIM == 2)
Contributor
@Snektron Snektron Jan 16, 2026


I don't really like the amount of chained conditions going on in this function. The TensorLayout already kind of "encodes" the spatial dimension, so the first step I would do is turning this into a giant switch instead:

switch (layout) {
// 1D layouts
case TensorLayout::GNCW:
  ...
case TensorLayout::GNWC:
  ...
// 2D layouts
case TensorLayout::GNCHW:
  ...
/// 3D layouts
case TensorLayout::GNCDHW:
  ...
}

Note that this switch can in theory contain the logic for input, weight, and output tensors alike, since all the layouts are in the same enum.

I also don't really like the way that the condition is tied to make_packed_strides_for_order. There is already functionality for constructing packed TensorDescriptors (see PackedRightLayout), and putting all this code here means a hard-to-test function. On top of that, I think a more generic way to query the layout order from a layout would be nice for downstream users, and this would also be much easier to test. I would try to extract this code into a separate function, maybe something like this:

// Get the order from dims=[G, N/K, K/C, D/Z, H/Y, W/X] to LAYOUT.
// Returns an array which has on index i a mapping to the index in
// the `dims` array.
template <TensorLayout LAYOUT>
constexpr auto layout_order() {
  // avoid manually writing the size of the order every time
  const auto order = [](auto... args) {
    return std::array<size_t, sizeof...(args)>{args...};
  };
 
  switch (LAYOUT)
  {
  case GNCW:
    return order(0, 1, 2, 3);
  case GNWC:
    return order(0, 1, 3, 2);
  ... etc, for 1D, 2D, 3D, and input/weight/output in the same function
  }
}

Then your implementation of make_input/weight/output_descriptor looks like this:

InputDescriptor make_input_descriptor() const {
  Extent lens = // compute as before
  if (input_strides.has_value()) {
    return InputDescriptor(lens, input_strides.value());
  }

  const auto order = layout_order<SIGNATURE.input.config.layout>();
  const auto ordered_lens = reorder(lens, order); // You can put this in tensor_descriptor.hpp
  return InputDescriptor(ordered_lens, PackedRightLayout{});
}

You can now separately test layout_order and reorder, and there is no need for ALSO testing whether make_packed_strides_for_order actually packs the order or not, since constructing tensors using PackedRightLayout{} is already tested separately.
