Status: Pre-release Alpha Language: C++20 / CUDA Location:
probe-agent/src/probes/
This guide walks through the process of designing, implementing, and integrating a new probe type into the SENTINEL probe agent. Probes are the foundation of SENTINEL's hardware integrity detection -- each one exercises a specific GPU functional unit and compares the output against a known-good (golden) answer.
- Architecture Overview
- Probe Interface Reference
- Step-by-Step: Creating a New Probe
- SM Affinity Deep Dive
- Golden Answer Generation with MPFR
- Tolerance Selection
- Performance Budgeting
- Example: Implementing a Custom Reduction Probe
- Testing Probes with the SDC Injector
The SENTINEL probe agent runs on every GPU node in the cluster. Its job is to continuously execute lightweight "probe kernels" on each GPU, compare their outputs against precomputed golden answers, and report results to the Correlation Engine.
+-----------------------------------------------------------+
| Probe Agent (per node) |
| |
| +-------------+ +-----------+ +------------------+ |
| | Scheduler |--->| Executor |--->| Result Reporter | |
| | (priority | | (SM pin, | | (batch, HMAC, | |
| | queue) | | stream) | | gRPC stream) | |
| +-------------+ +-----------+ +------------------+ |
| | | |
| v v |
| +-------------+ +-----------+ |
| | Probe | | SM | |
| | Registry | | Affinity | |
| | (all probe | | Engine | |
| | types) | | | |
| +-------------+ +-----------+ |
+-----------------------------------------------------------+
- Scheduler selects the next probe to run based on priority, period, and SM coverage configuration.
- SM Affinity Engine determines which SM(s) to target. It computes over-launched grid dimensions so that at least one thread block lands on each target SM.
- Executor launches the probe kernel on a dedicated CUDA stream. Blocks
that land on non-target SMs exit immediately (via
%smidPTX readback). - The probe kernel computes its test and writes results to a device buffer.
- Executor copies results to host, computes SHA-256 of the output buffer, and compares against the precomputed golden hash.
- Result Reporter batches probe results, computes HMAC-SHA256 signatures, and streams them to the Correlation Engine via gRPC.
| Probe | Functional Unit | Approach |
|---|---|---|
| FMA | FP32 FMA ALUs | 1024 test vectors, 16 repetitions, bit-exact comparison. |
| Tensor Core | Matrix units (HMMA) | FP16 GEMM on 16x16 tiles, bit-exact hash comparison. |
| Transcendental | SFU (sin, cos, exp, log) | 4096 inputs, ULP-tolerant comparison via MPFR golden answers. |
| AES | Integer ALU / combinational logic | AES-128 encryption of known plaintext, exact comparison. |
| Memory | Global memory (HBM) | Walking-ones and MATS+ pattern tests. |
| Register File | Register file | Write/readback of known patterns (all-zeros, all-ones, alternating). |
| Shared Memory | Shared memory (SMEM) | Pattern write/readback within each SM's shared memory. |
Every probe implements the ProbeInterface abstract base class defined in
probe-agent/src/probes/probe_interface.h.
class ProbeInterface {
public:
virtual ~ProbeInterface() = default;
// Identity
virtual ProbeType type() const noexcept = 0;
virtual std::string_view name() const noexcept = 0;
// Lifecycle
virtual bool initialize(int device_index) = 0;
virtual ProbeResult execute(uint32_t sm_id, cudaStream_t stream) = 0;
virtual void teardown() = 0;
// Scheduling hints
virtual std::size_t memory_footprint() const noexcept = 0;
virtual uint32_t default_period_seconds() const noexcept = 0;
virtual Priority default_priority() const noexcept = 0;
virtual SmSelection default_sm_selection() const noexcept = 0;
protected:
static uint64_t next_probe_id();
};| Method | Description |
|---|---|
type() |
Return the ProbeType enum value for this probe. Must match the protobuf enum. |
name() |
Return a short lowercase name (e.g., "fma", "tensor_core"). Used in logging and metrics. |
initialize(device_index) |
One-time setup: allocate device/host buffers, upload golden data, precompute test vectors. Called once per GPU at agent startup. Return true on success. |
execute(sm_id, stream) |
Run the probe on a specific SM using the given CUDA stream. Return a ProbeResult with pass/fail, timing data, and optional mismatch details. |
teardown() |
Release all device and host resources. Called at agent shutdown. |
memory_footprint() |
Return estimated device memory usage in bytes. Used by the scheduler to ensure probes fit within the memory budget. |
default_period_seconds() |
Recommended execution period. Operators can override via configuration. |
default_priority() |
Scheduling priority: kHigh runs before kMedium and kLow. |
default_sm_selection() |
Default SM coverage: kAll (every SM each period), kSample25Pct (random 25%), or kSample10Pct (random 10%). |
next_probe_id() |
Thread-safe monotonically increasing ID generator (inherited helper). |
struct ProbeResult {
uint64_t probe_id;
ProbeType probe_type;
std::string gpu_uuid;
uint32_t sm_id;
util::Sha256Digest expected_hash;
util::Sha256Digest actual_hash;
bool match;
std::vector<MismatchDetail> mismatch_details;
uint64_t execution_time_ns;
uint32_t gpu_clock_mhz;
float gpu_temp_c;
float gpu_power_w;
std::chrono::system_clock::time_point timestamp;
util::Sha256Digest hmac;
};The executor converts match == false to PROBE_RESULT_FAIL and match == true
to PROBE_RESULT_PASS. If execute() throws or the kernel fails to launch,
the result is PROBE_RESULT_ERROR. If execution exceeds the timeout, the
result is PROBE_RESULT_TIMEOUT.
This walkthrough creates a hypothetical Dot Product Probe that tests FP32 dot product operations.
Create the header file probe-agent/src/probes/dot_product_probe.h:
/// @file dot_product_probe.h
/// @brief Dot product determinism probe.
///
/// Tests FP32 dot product computation on individual SMs by computing
/// dot(a, b) for 512 carefully chosen vector pairs and comparing
/// results bit-exact against precomputed golden answers.
#pragma once
#include "probes/probe_interface.h"
#include <cstdint>
#include <memory>
namespace sentinel::probes {
class DotProductProbe : public ProbeInterface {
public:
DotProductProbe();
~DotProductProbe() override;
// Identity.
[[nodiscard]] ProbeType type() const noexcept override {
return ProbeType::kDotProduct; // Add to enum first!
}
[[nodiscard]] std::string_view name() const noexcept override {
return "dot_product";
}
// Lifecycle.
bool initialize(int device_index) override;
[[nodiscard]] ProbeResult execute(uint32_t sm_id,
cudaStream_t stream) override;
void teardown() override;
// Scheduling hints.
[[nodiscard]] std::size_t memory_footprint() const noexcept override;
[[nodiscard]] uint32_t default_period_seconds() const noexcept override {
return 120; // Every 2 minutes.
}
[[nodiscard]] Priority default_priority() const noexcept override {
return Priority::kMedium;
}
[[nodiscard]] SmSelection default_sm_selection() const noexcept override {
return SmSelection::kSample25Pct;
}
// Configuration.
static constexpr uint32_t kNumVectorPairs = 512;
static constexpr uint32_t kVectorLength = 256;
static constexpr uint32_t kRepetitions = 8;
private:
struct Impl;
std::unique_ptr<Impl> impl_;
};
} // namespace sentinel::probesKey design decisions:
- Pimpl pattern: The
Implstruct holds all CUDA-specific state (device pointers, host buffers). This keeps CUDA headers out of the public interface. - Constants:
kNumVectorPairs,kVectorLength, andkRepetitionsdefine the probe's workload size. More vectors = more coverage but higher overhead. - Scheduling:
kMediumpriority and 25% SM sampling because dot products are less critical than FMA units.
Create probe-agent/src/probes/dot_product_probe.cu:
/// @file dot_product_probe.cu
/// @brief Dot product probe CUDA kernel and host-side implementation.
#include "probes/dot_product_probe.h"
#include "probes/sm_affinity.h"
#include "util/crypto.h"
#include "util/logging.h"
#include <chrono>
#include <cstring>
#include <vector>
namespace sentinel::probes {
// -- Test vector structure --
struct DotProductVectors {
float a[DotProductProbe::kVectorLength];
float b[DotProductProbe::kVectorLength];
};
// -- Device kernel --
/// Dot product probe kernel. Each block checks if it is on the target SM.
/// If yes, computes dot products for all vector pairs with repetitions.
__global__ void dot_product_kernel(
const DotProductVectors* __restrict__ vectors,
float* __restrict__ results,
uint32_t num_pairs,
uint32_t vec_length,
uint32_t repetitions,
uint32_t target_sm_id,
int* __restrict__ executed_flag)
{
// SM affinity check via inline PTX.
uint32_t sm_id;
asm volatile("mov.u32 %0, %%smid;" : "=r"(sm_id));
if (sm_id != target_sm_id) {
return;
}
// Signal that at least one block ran on the target SM.
if (threadIdx.x == 0) {
atomicExch(executed_flag, 1);
}
__syncthreads();
const uint32_t tid = threadIdx.x;
const uint32_t stride = blockDim.x;
// Each thread handles a subset of vector pairs.
for (uint32_t pair = tid; pair < num_pairs; pair += stride) {
float sum = 0.0f;
for (uint32_t rep = 0; rep < repetitions; rep++) {
float dot = 0.0f;
for (uint32_t i = 0; i < vec_length; i++) {
dot = __fmaf_rn(vectors[pair].a[i], vectors[pair].b[i], dot);
}
sum += dot;
}
results[pair] = sum;
}
}
// -- Host implementation --
struct DotProductProbe::Impl {
int device_index = -1;
int sm_count = 0;
// Host-side golden data.
std::vector<DotProductVectors> h_vectors;
std::vector<float> h_golden_results;
util::Sha256Digest golden_hash;
// Device buffers.
DotProductVectors* d_vectors = nullptr;
float* d_results = nullptr;
int* d_executed_flag = nullptr;
// Host output buffer.
std::vector<float> h_results;
};
DotProductProbe::DotProductProbe() : impl_(std::make_unique<Impl>()) {}
DotProductProbe::~DotProductProbe() { teardown(); }
bool DotProductProbe::initialize(int device_index) {
impl_->device_index = device_index;
cudaSetDevice(device_index);
// Query SM count.
impl_->sm_count = get_sm_count(device_index);
if (impl_->sm_count == 0) {
SENTINEL_LOG_ERROR("Failed to query SM count for device {}", device_index);
return false;
}
// Generate test vectors (deterministic, seeded).
impl_->h_vectors.resize(kNumVectorPairs);
generate_test_vectors(impl_->h_vectors); // See golden answer section.
// Compute golden answers on host using double precision.
impl_->h_golden_results.resize(kNumVectorPairs);
compute_golden_results(impl_->h_vectors, impl_->h_golden_results);
// Compute golden hash.
impl_->golden_hash = util::sha256(
impl_->h_golden_results.data(),
impl_->h_golden_results.size() * sizeof(float));
// Allocate device buffers.
cudaMalloc(&impl_->d_vectors,
kNumVectorPairs * sizeof(DotProductVectors));
cudaMalloc(&impl_->d_results, kNumVectorPairs * sizeof(float));
cudaMalloc(&impl_->d_executed_flag, sizeof(int));
// Upload test vectors.
cudaMemcpy(impl_->d_vectors, impl_->h_vectors.data(),
kNumVectorPairs * sizeof(DotProductVectors),
cudaMemcpyHostToDevice);
// Allocate host output buffer.
impl_->h_results.resize(kNumVectorPairs);
SENTINEL_LOG_INFO("DotProductProbe initialized on device {} ({} SMs)",
device_index, impl_->sm_count);
return true;
}
ProbeResult DotProductProbe::execute(uint32_t sm_id, cudaStream_t stream) {
ProbeResult result;
result.probe_id = next_probe_id();
result.probe_type = ProbeType::kDotProduct;
result.sm_id = sm_id;
result.expected_hash = impl_->golden_hash;
auto start = std::chrono::high_resolution_clock::now();
// Clear executed flag.
int zero = 0;
cudaMemcpyAsync(impl_->d_executed_flag, &zero, sizeof(int),
cudaMemcpyHostToDevice, stream);
// Clear results.
cudaMemsetAsync(impl_->d_results, 0,
kNumVectorPairs * sizeof(float), stream);
// Compute launch parameters for SM pinning.
dim3 grid, block;
compute_sm_pinned_launch_params(impl_->sm_count, 256, grid, block);
// Launch kernel.
dot_product_kernel<<<grid, block, 0, stream>>>(
impl_->d_vectors, impl_->d_results,
kNumVectorPairs, kVectorLength, kRepetitions,
sm_id, impl_->d_executed_flag);
// Copy results to host.
cudaMemcpyAsync(impl_->h_results.data(), impl_->d_results,
kNumVectorPairs * sizeof(float),
cudaMemcpyDeviceToHost, stream);
// Copy executed flag.
int executed = 0;
cudaMemcpyAsync(&executed, impl_->d_executed_flag, sizeof(int),
cudaMemcpyDeviceToHost, stream);
// Synchronize.
cudaStreamSynchronize(stream);
auto end = std::chrono::high_resolution_clock::now();
result.execution_time_ns =
std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
result.timestamp = std::chrono::system_clock::now();
if (!executed) {
// No block landed on the target SM. This is rare but possible.
// Mark as ERROR so the scheduler retries.
result.match = false;
return result;
}
// Compute SHA-256 of actual results.
result.actual_hash = util::sha256(
impl_->h_results.data(),
impl_->h_results.size() * sizeof(float));
// Compare.
result.match = (result.actual_hash == result.expected_hash);
if (!result.match) {
// Populate mismatch details.
for (uint32_t i = 0; i < kNumVectorPairs; i++) {
if (impl_->h_results[i] != impl_->h_golden_results[i]) {
MismatchDetail detail;
detail.byte_offset = i * sizeof(float);
detail.expected_value.assign(
reinterpret_cast<uint8_t*>(&impl_->h_golden_results[i]),
reinterpret_cast<uint8_t*>(&impl_->h_golden_results[i]) + sizeof(float));
detail.actual_value.assign(
reinterpret_cast<uint8_t*>(&impl_->h_results[i]),
reinterpret_cast<uint8_t*>(&impl_->h_results[i]) + sizeof(float));
// Compute differing bits.
uint32_t expected_bits, actual_bits;
std::memcpy(&expected_bits, &impl_->h_golden_results[i], 4);
std::memcpy(&actual_bits, &impl_->h_results[i], 4);
uint32_t xor_val = expected_bits ^ actual_bits;
for (int b = 0; b < 32; b++) {
if (xor_val & (1u << b)) {
detail.differing_bits.push_back(b);
}
}
result.mismatch_details.push_back(detail);
}
}
}
return result;
}
void DotProductProbe::teardown() {
if (impl_->d_vectors) { cudaFree(impl_->d_vectors); impl_->d_vectors = nullptr; }
if (impl_->d_results) { cudaFree(impl_->d_results); impl_->d_results = nullptr; }
if (impl_->d_executed_flag) { cudaFree(impl_->d_executed_flag); impl_->d_executed_flag = nullptr; }
}
std::size_t DotProductProbe::memory_footprint() const noexcept {
return kNumVectorPairs * sizeof(DotProductVectors) // Input vectors.
+ kNumVectorPairs * sizeof(float) // Results.
+ sizeof(int); // Executed flag.
}
} // namespace sentinel::probesGolden answers must be generated with higher precision than the GPU will use. For FP32 probes, compute golden answers in FP64 or with MPFR at 128-bit precision.
// In dot_product_probe.cu (or a separate golden generation tool)
#include <cmath>
#include <random>
void generate_test_vectors(std::vector<DotProductVectors>& vectors) {
// Use a fixed seed for deterministic test vectors.
std::mt19937 rng(0xDEADBEEF);
// Distribution categories:
std::uniform_real_distribution<float> normal(-1.0f, 1.0f);
std::uniform_real_distribution<float> large(1e30f, 1e38f);
std::uniform_real_distribution<float> small(1e-38f, 1e-30f);
for (size_t pair = 0; pair < vectors.size(); pair++) {
auto& v = vectors[pair];
for (uint32_t i = 0; i < DotProductProbe::kVectorLength; i++) {
if (pair < vectors.size() / 4) {
// Normal range.
v.a[i] = normal(rng);
v.b[i] = normal(rng);
} else if (pair < vectors.size() / 2) {
// Large values (near overflow).
v.a[i] = large(rng);
v.b[i] = small(rng); // Keep product in normal range.
} else if (pair < 3 * vectors.size() / 4) {
// Denormals.
v.a[i] = std::ldexp(normal(rng), -126);
v.b[i] = std::ldexp(normal(rng), -126);
} else {
// Special mantissa patterns.
uint32_t bits = rng();
std::memcpy(&v.a[i], &bits, sizeof(float));
bits = rng();
std::memcpy(&v.b[i], &bits, sizeof(float));
// Ensure we do not get NaN/Inf.
if (!std::isfinite(v.a[i])) v.a[i] = 1.0f;
if (!std::isfinite(v.b[i])) v.b[i] = 1.0f;
}
}
}
}
void compute_golden_results(const std::vector<DotProductVectors>& vectors,
std::vector<float>& results) {
// Compute in double precision, then cast to float.
// This matches IEEE 754 round-to-nearest-even for the final result.
for (size_t pair = 0; pair < vectors.size(); pair++) {
float sum = 0.0f;
for (uint32_t rep = 0; rep < DotProductProbe::kRepetitions; rep++) {
float dot = 0.0f;
for (uint32_t i = 0; i < DotProductProbe::kVectorLength; i++) {
// Use __fmaf_rn equivalent: standard FMA with round-to-nearest.
dot = std::fma(vectors[pair].a[i], vectors[pair].b[i], dot);
}
sum += dot;
}
results[pair] = sum;
}
}Important: The golden computation must exactly replicate the GPU kernel's
arithmetic sequence, including accumulation order. FMA operations are
deterministic within a single SM on NVIDIA GPUs when using __fmaf_rn
(round-to-nearest-even). The golden host computation must use the same
rounding mode.
Edit probe-agent/src/probes/probe_interface.h to add the new probe type:
enum class ProbeType : uint8_t {
kFma = 1,
kTensorCore = 2,
kTranscendental = 3,
kAes = 4,
kMemory = 5,
kRegisterFile = 6,
kSharedMemory = 7,
kDotProduct = 8, // <-- Add this.
};Update probe_type_name():
case ProbeType::kDotProduct: return "dot_product";Add the probe to the factory function in probe_interface.cpp:
std::vector<std::unique_ptr<ProbeInterface>> create_all_probes() {
std::vector<std::unique_ptr<ProbeInterface>> probes;
probes.push_back(std::make_unique<FmaProbe>());
probes.push_back(std::make_unique<TensorCoreProbe>());
probes.push_back(std::make_unique<TranscendentalProbe>());
probes.push_back(std::make_unique<AesProbe>());
probes.push_back(std::make_unique<MemoryProbe>());
probes.push_back(std::make_unique<RegisterFileProbe>());
probes.push_back(std::make_unique<SharedMemoryProbe>());
probes.push_back(std::make_unique<DotProductProbe>()); // <-- Add this.
return probes;
}Edit proto/sentinel/v1/probe.proto to add the new enum value:
enum ProbeType {
PROBE_TYPE_UNSPECIFIED = 0;
PROBE_TYPE_FMA = 1;
PROBE_TYPE_TENSOR_CORE = 2;
PROBE_TYPE_TRANSCENDENTAL = 3;
PROBE_TYPE_AES = 4;
PROBE_TYPE_MEMORY = 5;
PROBE_TYPE_REGISTER_FILE = 6;
PROBE_TYPE_SHARED_MEMORY = 7;
PROBE_TYPE_DOT_PRODUCT = 8; // <-- Add this.
}Update the corresponding SDK types:
-
Python (
sdk/python/src/sentinel_sdk/types.py):class ProbeType(IntEnum): DOT_PRODUCT = 8
-
Go (
sdk/go/sentinel/types.go):ProbeTypeDotProduct ProbeType = 8
Regenerate protobuf stubs:
# From the repository root:
buf generateCreate probe-agent/tests/dot_product_probe_test.cu:
#include "probes/dot_product_probe.h"
#include <gtest/gtest.h>
namespace sentinel::probes {
class DotProductProbeTest : public ::testing::Test {
protected:
void SetUp() override {
int device_count;
cudaGetDeviceCount(&device_count);
if (device_count == 0) {
GTEST_SKIP() << "No CUDA devices available";
}
probe_ = std::make_unique<DotProductProbe>();
ASSERT_TRUE(probe_->initialize(0));
cudaStreamCreate(&stream_);
}
void TearDown() override {
if (probe_) probe_->teardown();
if (stream_) cudaStreamDestroy(stream_);
}
std::unique_ptr<DotProductProbe> probe_;
cudaStream_t stream_ = nullptr;
};
// Verify that the probe passes on a healthy GPU.
TEST_F(DotProductProbeTest, PassesOnHealthyGpu) {
int sm_count = get_sm_count(0);
ASSERT_GT(sm_count, 0);
// Test on SM 0.
ProbeResult result = probe_->execute(0, stream_);
EXPECT_TRUE(result.match) << "Probe should pass on a healthy GPU";
EXPECT_EQ(result.expected_hash, result.actual_hash);
EXPECT_GT(result.execution_time_ns, 0);
}
// Verify determinism: same SM produces the same hash.
TEST_F(DotProductProbeTest, Deterministic) {
ProbeResult r1 = probe_->execute(0, stream_);
ProbeResult r2 = probe_->execute(0, stream_);
EXPECT_EQ(r1.actual_hash, r2.actual_hash)
<< "Same probe on same SM should produce identical results";
}
// Verify the probe runs on all SMs without error.
TEST_F(DotProductProbeTest, AllSMs) {
int sm_count = get_sm_count(0);
for (int sm = 0; sm < sm_count; sm++) {
ProbeResult result = probe_->execute(sm, stream_);
EXPECT_TRUE(result.match)
<< "Probe failed on SM " << sm;
}
}
// Verify memory footprint is reasonable.
TEST_F(DotProductProbeTest, MemoryFootprint) {
size_t footprint = probe_->memory_footprint();
EXPECT_GT(footprint, 0);
EXPECT_LT(footprint, 256 * 1024 * 1024)
<< "Probe should use less than 256 MiB of device memory";
}
// Verify scheduling defaults.
TEST_F(DotProductProbeTest, SchedulingDefaults) {
EXPECT_EQ(probe_->type(), ProbeType::kDotProduct);
EXPECT_EQ(probe_->name(), "dot_product");
EXPECT_EQ(probe_->default_period_seconds(), 120);
EXPECT_EQ(probe_->default_priority(), Priority::kMedium);
EXPECT_EQ(probe_->default_sm_selection(), SmSelection::kSample25Pct);
}
} // namespace sentinel::probesSM pinning is essential for attributing probe failures to specific SMs. SENTINEL uses an "over-launch with filter" strategy.
-
Over-launch: The kernel is launched with enough thread blocks to cover every SM on the GPU (typically
sm_count * 2blocks). CUDA's hardware scheduler distributes blocks across SMs round-robin. -
SM ID readback: Each block reads its SM ID using inline PTX:
asm volatile("mov.u32 %0, %%smid;" : "=r"(sm_id)); -
Filter: Blocks on non-target SMs return immediately. Only the block(s) on the target SM perform the actual probe computation.
-
Execution flag: An atomic flag in device memory confirms that at least one block executed on the target SM. If not (rare, due to scheduling anomalies), the probe is retried.
void compute_sm_pinned_launch_params(int sm_count,
int threads_per_block,
dim3& grid_dim,
dim3& block_dim) {
// Launch 2x SM count blocks to ensure coverage.
grid_dim = dim3(sm_count * 2);
block_dim = dim3(threads_per_block);
}CU_STREAM_PER_SMis not publicly available in all CUDA toolkit versions.- CUDA Green Contexts (Compute Capability 9.0+) offer SM partitioning but require CUDA 12+ and are not universally supported.
- The over-launch approach works on all GPU architectures from Volta (CC 7.0) onwards and has minimal overhead (early-exit blocks consume only a few nanoseconds).
The overhead of non-target blocks is negligible:
| SM Count | Blocks launched | Target blocks | Non-target blocks | Overhead |
|---|---|---|---|---|
| 132 (H100) | 264 | 1-2 | 262-263 | < 1 microsecond |
| 108 (A100) | 216 | 1-2 | 214-215 | < 1 microsecond |
For probes that test floating-point operations, golden answers must be computed at higher precision than the GPU uses. SENTINEL uses the GNU MPFR library for this purpose.
- IEEE 754 compliance with configurable precision (up to millions of bits)
- Correctly rounded results for all operations
- Support for all rounding modes (RN, RZ, RU, RD)
- Deterministic across platforms
- Generate test vectors using a deterministic PRNG (fixed seed).
- Compute golden answers using MPFR at 128-bit precision.
- Round to target precision (FP32 or FP16) using
mpfr_get_flt()with round-to-nearest-even (MPFR_RNDN). - Hash the golden output using SHA-256.
- Embed the hash in the probe (compiled into the agent binary or loaded from a golden data file).
#include <mpfr.h>
float compute_fma_golden(float a, float b, float c) {
mpfr_t ma, mb, mc, result;
mpfr_init2(ma, 128);
mpfr_init2(mb, 128);
mpfr_init2(mc, 128);
mpfr_init2(result, 128);
mpfr_set_flt(ma, a, MPFR_RNDN);
mpfr_set_flt(mb, b, MPFR_RNDN);
mpfr_set_flt(mc, c, MPFR_RNDN);
// FMA: result = a * b + c with a single rounding.
mpfr_fma(result, ma, mb, mc, MPFR_RNDN);
float golden = mpfr_get_flt(result, MPFR_RNDN);
mpfr_clear(ma);
mpfr_clear(mb);
mpfr_clear(mc);
mpfr_clear(result);
return golden;
}Some probes do not require high-precision golden answers:
- AES probe: AES is integer-based. The golden answer is computed using a reference AES implementation (e.g., OpenSSL).
- Memory probe: Walking-ones and MATS+ patterns are exact. Golden answers are the patterns themselves.
- Register file probe: Known bit patterns. Golden answers are trivial.
Most SENTINEL probes use bit-exact comparison: the SHA-256 of the GPU's output must match the SHA-256 of the golden output. This is the strongest possible test for SDC.
This works because:
- IEEE 754 requires deterministic results for +, -, *, /, sqrt, and FMA with a given rounding mode.
- NVIDIA GPUs guarantee bit-exact reproducibility for FP32 operations within a single SM when using the same instruction sequence.
- Probes are designed to use only deterministic operations.
Some operations are not guaranteed bit-exact:
- Transcendental functions (sin, cos, exp, log): NVIDIA specifies accuracy
in ULPs (Units in the Last Place), not bit-exactness. For example,
__sinf()is accurate to 2 ULP. - Cross-SM operations: If a probe spans multiple SMs (avoid this), floating-point accumulation order may vary.
For these cases, SENTINEL uses ULP-based comparison:
int32_t ulp_difference(float expected, float actual) {
// Reinterpret as integers for ULP computation.
int32_t expected_bits, actual_bits;
std::memcpy(&expected_bits, &expected, sizeof(float));
std::memcpy(&actual_bits, &actual, sizeof(float));
// Handle sign bit (IEEE 754 integer ordering works for same-sign values).
if (expected_bits < 0) expected_bits = 0x80000000 - expected_bits;
if (actual_bits < 0) actual_bits = 0x80000000 - actual_bits;
return std::abs(expected_bits - actual_bits);
}The transcendental probe allows up to 2 ULP deviation for results within the normal range. Any deviation beyond the documented accuracy bound indicates potential SDC.
| Probe type | Tolerance | Rationale |
|---|---|---|
| FMA | Exact (0 ULP) | IEEE 754 mandates exact rounding for FMA. |
| Tensor Core | Exact (hash) | HMMA is deterministic on a single SM. |
| AES | Exact (hash) | Integer operations are always bit-exact. |
| Memory | Exact (pattern) | Memory reads must return exactly what was written. |
| Register file | Exact (pattern) | Same as memory. |
| Shared memory | Exact (pattern) | Same as memory. |
| Transcendental | 2 ULP | NVIDIA documents 2 ULP accuracy for SFU. |
SENTINEL probes must fit within a strict overhead budget. The default is 2% of
GPU time, configurable via OverheadBudgetUpdate.
Each probe reports execution_time_ns in its result. The scheduler tracks the
aggregate GPU time consumed by all probes and throttles execution if the budget
is exceeded.
overhead_pct = (sum of probe execution times per second) / (1 second) * 100
For a budget of 2% on an H100:
Available probe time per second = 0.02 * 1,000,000,000 ns = 20,000,000 ns = 20 ms
| Probe | Typical execution time | SM coverage | Period | Overhead per GPU |
|---|---|---|---|---|
| FMA | 200 us | 100% (132 SMs) | 60s | 132 * 200us / 60s = 0.44 ms/s (0.044%) |
| Tensor Core | 500 us | 25% (33 SMs) | 120s | 33 * 500us / 120s = 0.14 ms/s (0.014%) |
| Transcendental | 150 us | 100% | 60s | 132 * 150us / 60s = 0.33 ms/s (0.033%) |
| AES | 100 us | 10% (13 SMs) | 120s | 13 * 100us / 120s = 0.01 ms/s (0.001%) |
| Memory | 1 ms | 100% | 300s | 132 * 1ms / 300s = 0.44 ms/s (0.044%) |
| Register file | 50 us | 100% | 120s | 132 * 50us / 120s = 0.06 ms/s (0.006%) |
| Shared memory | 80 us | 100% | 120s | 132 * 80us / 120s = 0.09 ms/s (0.009%) |
| Total | ~1.0 ms/s (0.1%) |
This leaves significant headroom within the 2% budget for custom probes and increased frequency during investigations.
- Minimize data transfers: Upload test vectors once at
initialize(), not per execution. - Use small output buffers: Hash the output on the GPU if possible to reduce DtoH transfer size.
- Avoid synchronization within kernels: SM-pinned probes run on a single SM; no inter-SM sync needed.
- Keep repetition counts reasonable: More repetitions increase sensitivity to intermittent faults but cost proportionally more time.
- Use CUDA streams: The executor uses a dedicated stream so probes do not block production kernels.
This complete example implements a tree-reduction probe that tests the parallel reduction pattern commonly used in softmax, layer norm, and loss computation.
- What it tests: FP32 addition in a tree-reduction pattern (warp shuffles and shared memory reduces).
- Why it matters: SDC in reduction operations can cause subtle errors in softmax outputs, loss values, and gradient norms.
- Test vectors: 64 fixed input arrays of 1024 elements each.
- Golden answer: Sum computed with Kahan compensated summation in FP64.
- Tolerance: Exact (the reduction uses
__shfl_xor_syncand shared memory adds, which are deterministic on a single SM).
#pragma once
#include "probes/probe_interface.h"
#include <memory>
namespace sentinel::probes {
class ReductionProbe : public ProbeInterface {
public:
ReductionProbe();
~ReductionProbe() override;
[[nodiscard]] ProbeType type() const noexcept override { return ProbeType::kReduction; }
[[nodiscard]] std::string_view name() const noexcept override { return "reduction"; }
bool initialize(int device_index) override;
[[nodiscard]] ProbeResult execute(uint32_t sm_id, cudaStream_t stream) override;
void teardown() override;
[[nodiscard]] std::size_t memory_footprint() const noexcept override;
[[nodiscard]] uint32_t default_period_seconds() const noexcept override { return 90; }
[[nodiscard]] Priority default_priority() const noexcept override { return Priority::kMedium; }
[[nodiscard]] SmSelection default_sm_selection() const noexcept override { return SmSelection::kSample25Pct; }
static constexpr uint32_t kNumArrays = 64;
static constexpr uint32_t kArrayLength = 1024;
static constexpr uint32_t kRepetitions = 4;
private:
struct Impl;
std::unique_ptr<Impl> impl_;
};
}#include "probes/reduction_probe.h"
#include "probes/sm_affinity.h"
#include "util/crypto.h"
namespace sentinel::probes {
// Warp-level reduction using shuffle.
__device__ float warp_reduce_sum(float val) {
for (int offset = 16; offset > 0; offset >>= 1) {
val += __shfl_xor_sync(0xffffffff, val, offset);
}
return val;
}
// Block-level reduction using shared memory.
__device__ float block_reduce_sum(float val) {
__shared__ float shared[32]; // One slot per warp.
int lane = threadIdx.x % 32;
int warp_id = threadIdx.x / 32;
val = warp_reduce_sum(val);
if (lane == 0) {
shared[warp_id] = val;
}
__syncthreads();
// First warp reduces the per-warp sums.
val = (threadIdx.x < blockDim.x / 32) ? shared[lane] : 0.0f;
if (warp_id == 0) {
val = warp_reduce_sum(val);
}
return val;
}
__global__ void reduction_probe_kernel(
const float* __restrict__ input,
float* __restrict__ output,
uint32_t array_length,
uint32_t num_arrays,
uint32_t repetitions,
uint32_t target_sm_id,
int* __restrict__ executed_flag)
{
uint32_t sm_id;
asm volatile("mov.u32 %0, %%smid;" : "=r"(sm_id));
if (sm_id != target_sm_id) return;
if (threadIdx.x == 0) atomicExch(executed_flag, 1);
__syncthreads();
// Each block processes one array.
for (uint32_t arr = blockIdx.x; arr < num_arrays; arr += gridDim.x) {
float accumulated = 0.0f;
for (uint32_t rep = 0; rep < repetitions; rep++) {
// Load and reduce.
float thread_sum = 0.0f;
for (uint32_t i = threadIdx.x; i < array_length; i += blockDim.x) {
thread_sum += input[arr * array_length + i];
}
float block_sum = block_reduce_sum(thread_sum);
if (threadIdx.x == 0) {
accumulated += block_sum;
}
}
if (threadIdx.x == 0) {
output[arr] = accumulated;
}
}
}
// ... (Impl struct, initialize, execute, teardown follow the same pattern
// as the DotProductProbe example above.)
}This probe tests the exact GPU functional units used by common DNN operations. A single-bit error in a warp shuffle or shared memory write would cause a hash mismatch and trigger investigation.
SENTINEL includes an SDC fault injector for testing probes in controlled conditions. It uses CUDA Compute Sanitizer's fault injection API (where available) or a software-based bit-flip injection.
The injector modifies the GPU's output buffer after kernel execution but before hash computation. This simulates an SDC event and verifies the probe correctly detects it.
# Run the probe agent with fault injection enabled:
sentinel-probe-agent --inject-sdc \
--inject-probe-type fma \
--inject-sm 42 \
--inject-bit-position 17 \
--inject-probability 0.01When testing a new probe with the injector:
-
Single-bit flip: Inject a 1-bit error. Verify the probe reports
PROBE_RESULT_FAILwith the correct SM ID and mismatch details. -
Multi-bit flip: Inject errors in multiple bit positions. Verify all differing bits are reported in
mismatch_details.differing_bits. -
Different SMs: Inject faults on different SMs. Verify the correct SM is attributed in the probe result.
-
No injection: Run without injection for 1000+ executions. Verify zero false positives.
-
Performance: Verify execution time stays within budget. Measure with
nsys profileor the probe's reportedexecution_time_ns.
# Run the full test suite with SDC injection:
cd probe-agent
cmake --build build --target test
./build/tests/probe_integration_test --gtest_filter="*SDCInjection*"Expected output:
[ PASSED ] SDCInjection.DetectsFlippedBit (42 ms)
[ PASSED ] SDCInjection.ReportsCorrectSM (38 ms)
[ PASSED ] SDCInjection.ReportsMismatchDetails (41 ms)
[ PASSED ] SDCInjection.NoFalsePositives (1204 ms)
[ PASSED ] SDCInjection.WithinPerformanceBudget (52 ms)