Reference C implementation of all ten CNN architectures defined in
cnn_models.py. The models originate from the following peer-reviewed paper —
please cite it if you use this code or the architectures in your work:
L. Gutiérrez-Martín, C. López-Ongil, and J. A. Miranda-Calero, "DeepBindi: An End-to-End Fear Detection System Optimized for Extreme-Edge Deployment," IEEE Journal of Biomedical and Health Informatics, vol. 30, no. 1, Jan. 2026. DOI: 10.1109/JBHI.2025.3587961
Three C variants are provided:
| Variant | Directory | Memory strategy | Data type | Use case |
|---|---|---|---|---|
| Dynamic | c_port/ | malloc / free – heap allocated | float | Rapid prototyping, host machines |
| Static | static_port/ | Static global pools, no heap | float | Embedded targets, no OS |
| X-HEEP / int32 | deepbindi_cnn_x_heep/ | Static global pools, no heap | int32_t | X-HEEP RISC-V SoC |
The dynamic and static variants are self-contained (no dependencies beyond libc and libm),
produce identical checksums, and use the same tensor layout, layer
primitives, and compute loops.
The X-HEEP variant targets CNN_1D_v2 (PYE) only, uses 32-bit integer arithmetic throughout
(no float, no math.h), and is built to run bare-metal on X-HEEP with or without the FPU.
# Dynamic version (debug build – logging enabled)
cd c_port
make run
# Static version (debug build – logging enabled)
cd static_port
make run
# Production / embedded build (silent, no printf anywhere)
make
# Verify both produce the same numbers
cd static_port
make verify

Expected output (same for both):
CNN_2D_v1 | shape=(1,1,1,1) | checksum=0.496194 | values=[0.496194]
CNN_2D_v2 | shape=(1,1,1,1) | checksum=0.496471 | values=[0.496471]
CNN_2D_v3 | shape=(1,1,1,1) | checksum=0.498775 | values=[0.498775]
CNN_1D_v1 | shape=(1,1,1,1) | checksum=0.496733 | values=[0.496733]
CNN_1D_v2 | shape=(1,1,1,1) | checksum=0.503341 | values=[0.503341]
CNN_1D_v3 | shape=(1,1,1,1) | checksum=0.502004 | values=[0.502004]
MobileNetV3Custom | shape=(1,1,1,1) | checksum=0.496607 | values=[0.496607]
CNN_1d_tf_sigmoid | shape=(1,1,1,1) | checksum=0.496598 | values=[0.496598]
CNN_1d_tf_softmax | shape=(1,2,1,1) | checksum=1.501474 | values=[0.498526, 0.501474]
CNN_2d_tf_softmax | shape=(1,2,1,1) | checksum=1.511614 | values=[0.488386, 0.511614]
Requires: gcc (≥ C99) and libm. No Python, PyTorch, or TensorFlow needed.
| ID | C function | Python class | Architecture summary |
|---|---|---|---|
| PYA | run_cnn_2d_v1 | CNN_2D_v1 | 2D-CNN · 2 conv + 2 FC · binary sigmoid |
| PYB | run_cnn_2d_v2 | CNN_2D_v2 | 2D-CNN · 3 parallel branches + 2nd conv + 2 FC |
| PYC | run_cnn_2d_v3 | CNN_2D_v3 | 2D-CNN · 2 parallel branches + 2nd conv + 2 FC |
| PYD | run_cnn_1d_v1 | CNN_1D_v1 | 1D-CNN · 1 conv + 1 FC |
| PYE | run_cnn_1d_v2 | CNN_1D_v2 | 1D-CNN · 2 conv + 1 FC |
| PYF | run_cnn_1d_v3 | CNN_1D_v3 | 1D-CNN · 1 conv + 2 FC |
| PYG | run_mobilenet_v3_custom | MobileNetV3Custom | MobileNetV3-Small (width×0.35, 1-ch), 11 inverted-residual (MB) blocks + SE |
| TF1 | run_cnn_1d_tensorflow_sigmoid | CNN_1d_tensorflow_sigmoid | Keras 1D-CNN, sigmoid binary output |
| TF2 | run_cnn_1d_tensorflow_softmax | CNN_1d_tensorflow_softmax | Keras 1D-CNN, softmax 2-class output |
| TF3 | run_cnn_2d_tensorflow_softmax | CNN_2d_tensorflow_softmax | Keras 2D-CNN (1×k kernel), softmax 2-class output |
All tensors use NCHW layout (batch × channels × height × width).
1-D signals are stored as (N, C, 1, W) so the same conv2d_forward
primitive handles both 1-D and 2-D cases uniformly.
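The layout rule above can be illustrated with a small offset helper. This is a minimal sketch, not code from the repository: `TensorView`, `nchw_offset` and `signal_offset` are hypothetical names chosen for the example, and the only assumption carried over from the text is that tensors are contiguous in NCHW order with 1-D signals stored as (N, C, 1, W).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tensor descriptor; field names are illustrative only. */
typedef struct {
    int n, c, h, w;   /* batch, channels, height, width */
    float *data;      /* n*c*h*w contiguous values */
} TensorView;

/* Flat offset of element (n, c, h, w) in a contiguous NCHW buffer. */
static size_t nchw_offset(const TensorView *t, int n, int c, int h, int w)
{
    assert(n >= 0 && n < t->n && c >= 0 && c < t->c);
    assert(h >= 0 && h < t->h && w >= 0 && w < t->w);
    return (((size_t)n * t->c + c) * t->h + h) * t->w + w;
}

/* A 1-D signal with C features and W time steps simply uses H = 1. */
static size_t signal_offset(const TensorView *t, int feature, int step)
{
    return nchw_offset(t, 0, feature, 0, step);
}
```

For a (1, 57, 1, 10) tensor, feature 3 at time step 7 resolves to offset 3 × 10 + 7 = 37, which is why one 2-D indexing scheme can serve both the 1-D and 2-D models.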
| Model group | Input shape (NCHW) | Meaning |
|---|---|---|
| 2D-CNN (PYA–PYC) | (1, 1, 57, 57) | Single-channel 57×57 feature map |
| 1D-CNN (PYD–PYF) | (1, 57, 1, 10) | 57 features × 10 time steps |
| MobileNetV3 (PYG) | (1, 1, 57, 57) | Single-channel 57×57 feature map |
| Keras 1D (TF1–TF2) | (1, 57, 1, 10) | 57 features × 10 time steps |
| Keras 2D (TF3) | (1, 57, 10, 10) | 57 channels × 10×10 spatial map |
model/
├── cnn_models.py Original Python model definitions (PyTorch + Keras)
│
├── c_port/ ── Dynamic variant (float, all 10 models) ───────────
│ ├── deepbindi_config.h Logging + fatal-error macros
│ ├── nn_runtime.h/c Scalar float kernels using malloc/free
│ ├── cnn_models_c.h/c All 10 model forward passes
│ ├── main.c Demo driver: runs all models, prints checksums
│ └── Makefile Builds deepbindi_c_demo
│
├── static_port/ ── Static variant (float, all 10 models) ────────────
│ ├── deepbindi_config.h Logging + fatal-error macros
│ ├── arena.h/c g_weight_pool / g_act_arena bump allocators
│ ├── nn_runtime.h/c Same kernels; tensor_free / layer_free are no-ops
│ ├── cnn_models_c.h/c Same models; act_arena_reset() between runs
│ ├── main.c Driver; calls arena_stats() after all models
│ └── Makefile Builds deepbindi_static_demo
│
└── deepbindi_cnn_x_heep/ ── X-HEEP / int32 variant (CNN_1D_v2 only) ─────────
├── deepbindi_config.h TARGET_PC stubs + DEEPBINDI_ENABLE_FPU guard
├── arena.h/c int32_t pools (WEIGHT_POOL_WORDS / ACT_ARENA_WORDS)
├── nn_runtime.h/c int32_t kernels; no float, no math.h
├── cnn_models_c.h/c CNN_1D_v2 only; accepts real or dummy int32 input
├── main.c X-HEEP driver (CSR cycle count, FPU optional)
├── test_input.h 570 int32_t values from test_data.txt, sample 0
├── Makefile PC build: `make run` (gcc -DTARGET_PC)
└── extract_weights.py TFLite model inspector + C int32_t header generator
| Aspect | c_port/ (dynamic) | static_port/ (static) |
|---|---|---|
| Weight allocation | malloc inside *_layer_create | weight_alloc (bump into g_weight_pool[]) |
| Activation allocation | malloc inside tensor_create | act_alloc (bump into g_act_arena[]) |
| Freeing | free in tensor_free / *_layer_free | no-op – arena is bulk-reset |
| Between-model cleanup | Each tensor freed after use | act_arena_reset() at start of each model |
| Static RAM (all 10 models) | OS heap (invisible) | ≈ 10 MB BSS (g_weight_pool + g_act_arena) |
| OS / libc requirement | malloc / free | No heap; stdio only when logging is on |
| Logging / output | make run (debug build) | make run (debug build) |
| Embedded / bare-metal ready | No (needs heap) | Yes |
The static version prints a memory usage table at the end:
── Arena usage ──────────────────────────────────────────
Weight pool : 1748492 / 2000000 floats (6830 / 7813 KB)
Act arena : 372770 / 500000 floats HWM (1456 / 1953 KB)
Tensor pool : 5 / 128 structs
─────────────────────────────────────────────────────────
Use this to right-size WEIGHT_POOL_FLOATS and ACT_ARENA_FLOATS in
static_port/arena.h when deploying a single model.
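The bump-allocation scheme behind these numbers can be sketched as follows. This is an illustrative reduction, not the contents of arena.c: the `demo_*` names and the 1024-float pool size are assumptions made for the example, and the sketch returns NULL on overflow where the text says the real port calls DEEPBINDI_FATAL.

```c
#include <stddef.h>

#define DEMO_ARENA_FLOATS 1024   /* illustrative size, not the repo's value */

static float  demo_arena[DEMO_ARENA_FLOATS];
static size_t demo_used;         /* floats handed out since the last reset */
static size_t demo_high_water;   /* largest demo_used observed so far */

/* Bump-allocate `count` floats; returns NULL when the pool is exhausted. */
static float *demo_alloc(size_t count)
{
    float *p;
    if (count > DEMO_ARENA_FLOATS - demo_used)
        return NULL;             /* the real port reports a fatal error here */
    p = &demo_arena[demo_used];
    demo_used += count;
    if (demo_used > demo_high_water)
        demo_high_water = demo_used;
    return p;
}

/* Release everything at once; the high-water mark is kept for sizing. */
static void demo_reset(void)
{
    demo_used = 0;
}
```

Because the high-water mark survives resets, running one model and reading it back gives the smallest pool size that model needs, which is the figure to copy into the pool-size macros.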
Both float ports isolate all platform-specific behaviour in deepbindi_config.h,
which is the only file that needs to change per target. Only the static port
runs on bare-metal targets (no OS, no heap); the dynamic port still needs malloc / free.
By default the build is fully silent — no printf or fprintf anywhere.
Enable output for debugging with:
make run # debug build (adds -DDEEPBINDI_ENABLE_LOGGING automatically)
make debug # same, without running
make # production / embedded build – no output, no stdio dependency

Or pass the flag directly to your cross-compiler:

CFLAGS += -DDEEPBINDI_ENABLE_LOGGING # host / UART debug build

Shape mismatches and pool overflows call DEEPBINDI_FATAL(msg). The default
behaviour depends on the build mode:
| Build | Default behaviour |
|---|---|
| With logging | Print message to stderr + exit(1) |
| Without logging | Infinite loop for(;;){} → triggers watchdog reset |
Override for your target by defining DEEPBINDI_FATAL before the build:
/* ARM Cortex-M: halt at breakpoint */
#define DEEPBINDI_FATAL(msg) do { __BKPT(0); for(;;){} } while(0)
/* RISC-V: illegal instruction trap */
#define DEEPBINDI_FATAL(msg) do { __asm__("unimp"); for(;;){} } while(0)
/* Custom UART + watchdog reset */
#define DEEPBINDI_FATAL(msg) do { uart_puts(msg); system_reset(); } while(0)

| Issue | c_port/ / static_port/ | deepbindi_cnn_x_heep/ |
|---|---|---|
| int width | Shape fields are plain int; 16-bit MCUs may overflow on large 2-D tensors. Use a 32-bit toolchain. | Same. |
| Arithmetic type | float throughout; no implicit double promotion. | int32_t throughout; no float, no math.h. |
| memset to zero | Relies on IEEE-754 all-zero = 0.0f. Safe on all common targets. | Uses explicit scalar zero-fill loops; no libc dependency. |
| %f format specifier | Not used (newlib-nano limitation). Values printed as scaled integers. | Not used; values are integers, printed with %d directly. |
| %zu format specifier | Not supported by newlib-nano; %u with (unsigned) cast used instead. | Same. |
| stdio.h / stdlib.h | Included only when DEEPBINDI_ENABLE_LOGGING is defined. | Always included (logging always on); stdlib.h included only with TARGET_PC. |
| FPU | Not required; no CSR writes. | Not required for int32 port. Guard with #ifdef DEEPBINDI_ENABLE_FPU if adding FP code. |
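The "scaled integers" approach to printing floats without %f can be sketched like this. The helper name `format_fixed6` and the six-decimal format are assumptions for illustration; the ports' own logging code may scale or format differently. The sketch also assumes the target printf family handles %ld with zero-padded widths.

```c
#include <stdio.h>

/* Format `v` with six decimals using integer conversions only (no %f).
 * Assumes |v| is small enough that v * 1e6 fits in a long. */
static int format_fixed6(char *buf, size_t size, float v)
{
    const char *sign = (v < 0.0f) ? "-" : "";
    float mag = (v < 0.0f) ? -v : v;
    long scaled = (long)(mag * 1000000.0f + 0.5f);   /* round to nearest */
    return snprintf(buf, size, "%s%ld.%06ld",
                    sign, scaled / 1000000L, scaled % 1000000L);
}
```

The sign is handled separately so that values between -1 and 0 keep their minus sign even though their integer part is zero.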
| Primitive | Description |
|---|---|
| conv2d_forward | 2-D convolution with padding, stride, groups (incl. depthwise) |
| batchnorm_forward_inplace | Inference BN: γ·(x−μ)/√(σ²+ε) + β |
| maxpool2d_forward | Sliding-window max reduction |
| adaptive_avg_pool2d_forward | Global (or partial) average pooling to target H×W |
| flatten_forward | Reshape to (N, C·H·W, 1, 1) |
| dense_forward | Fully-connected: y = x·Wᵀ + b |
| concat_height | Concatenate two tensors along the H axis |
| add_forward | Element-wise addition (residual connections) |
| channel_scale_forward | Per-channel scalar multiply (SE attention gate) |
| relu_inplace / sigmoid_inplace / softmax_inplace | Standard activations |
| hardsigmoid_inplace / hardswish_inplace | MobileNetV3 approximated activations |
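For reference, the two MobileNetV3 activations can be written without any libm call. The sketch below uses the standard definitions (hard-sigmoid as clamp(x + 3, 0, 6) / 6, hard-swish as x times hard-sigmoid); the function names and the flat-buffer signature are illustrative and are not the signatures of hardsigmoid_inplace / hardswish_inplace in nn_runtime.h.

```c
#include <stddef.h>

/* Hard-sigmoid: clamp(x + 3, 0, 6) / 6, a piecewise-linear sigmoid. */
static float hardsigmoid(float x)
{
    float y = x + 3.0f;
    if (y < 0.0f) y = 0.0f;
    if (y > 6.0f) y = 6.0f;
    return y / 6.0f;
}

/* Hard-swish: x * hardsigmoid(x). */
static float hardswish(float x)
{
    return x * hardsigmoid(x);
}

/* In-place variant over a flat buffer of `count` values. */
static void hardswish_buffer(float *data, size_t count)
{
    size_t i;
    for (i = 0; i < count; ++i)
        data[i] = hardswish(data[i]);
}
```

Both functions need only an add, two compares and a multiply or divide, which is what makes them attractive on targets where expf is expensive or unavailable.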
| Helper | Fuses |
|---|---|
| apply_conv_bn_act | Conv2D → BatchNorm → Activation |
| apply_dense_bn_act | Dense → (optional BN) → Activation |
| apply_se_block | GlobalAvgPool → FC → ReLU → FC → HardSigmoid → Scale |
| apply_mobilenet_block | Expand conv → Depthwise conv → SE → Project conv → (Residual add) |
Dropout is a training-only operation. It is a pure identity at inference time and is omitted entirely.
BatchNorm is applied in inference mode (frozen running mean/var). The
closed-form formula γ·(x−μ)/√(σ²+ε) + β is computed directly. Dummy
parameters are deterministic and reproducible – replace with real trained values
for production.
1-D convolutions as 2-D: the Keras models (TF1–TF3) already represent 1-D
convolutions as 2-D kernels of shape (1×k). The C port adopts the same
representation uniformly, so all convolutions go through conv2d_forward.
MobileNetV3 channel scaling: width_mult = 0.35 is applied to every
channel count with a make_divisible(..., 8) rounding step (matching
torchvision) to keep memory accesses aligned.
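The rounding step can be sketched in C as below. It follows the usual torchvision-style rule (round to the nearest multiple of the divisor, never go below the divisor, and bump up one step if rounding lost more than 10 %); `scaled_channels` is a hypothetical wrapper added for the example, and the port's own helper may be named or typed differently.

```c
/* Round `value` to the nearest multiple of `divisor`, never below
 * `divisor`, and never more than 10 % below the requested value. */
static int make_divisible(double value, int divisor)
{
    int rounded = (int)(value + divisor / 2.0) / divisor * divisor;
    if (rounded < divisor)
        rounded = divisor;
    if ((double)rounded < 0.9 * value)
        rounded += divisor;
    return rounded;
}

/* Scaled channel count for a base width, e.g. width_mult = 0.35. */
static int scaled_channels(int base_channels, double width_mult)
{
    return make_divisible(base_channels * width_mult, 8);
}
```

As a worked case, a base width of 96 with width_mult = 0.35 gives 33.6, which rounds to 32; a request of 11 first rounds to 8, falls below 90 % of 11, and is bumped to 16.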
This directory contains a dedicated port of CNN_1D_v2 (PYE) for the X-HEEP RISC-V SoC, targeting bare-metal inference with or without a hardware FPU.
| Constraint | Source | Implication for C code |
|---|---|---|
| 32-bit word length | HW accelerator requirement | int32_t throughout; no int8_t TFLite quantization |
| No FPU guarantee | X-HEEP bare-metal startup | No float; no math.h (expf, sqrtf, fabsf banned) |
| No heap allocator | Bare-metal, no OS | Static global pools; malloc/free banned |
| No memset / memcpy | Avoid libc symbol dependencies | Explicit scalar loops everywhere |
| No %f in printf | newlib-nano limitation | Integer printing only |
| CSR cycle counter | X-HEEP hardware performance measurement | CSR_WRITE/READ(CSR_REG_MCYCLE, ...) |
All arithmetic is int32_t — no float anywhere in the data path:
typedef struct {
int n, c, h, w;
int32_t *data; /* was float * */
} Tensor;

BatchNorm is pre-folded to Q7 scale + offset, which eliminates sqrtf from the
forward pass entirely:
typedef struct {
int num_features;
int32_t *scale; /* Q7: scale[c] = round(gamma[c]/sqrt(var[c]+eps) * 128) */
int32_t *offset; /* offset[c] = round(beta[c] - gamma[c]*mean[c]/sqrt(var[c]+eps)) */
} BatchNormLayer;
/* Forward: */
y = (int32_t)(((int64_t)x * scale[c]) >> 7) + offset[c];

Sigmoid replaced by a sign threshold: the output layer is binary (fear / no fear),
so sigmoid(x) > 0.5 is equivalent to x > 0, which requires no expf:

void sigmoid_inplace(Tensor *input) {
    const int total = input->n * input->c * input->h * input->w;
    int i;
    /* sigmoid(x) > 0.5 <=> x > 0, so only the sign is tested */
    for (i = 0; i < total; ++i)
        input->data[i] = (input->data[i] > 0) ? 1 : 0;
}

FPU enable is optional: the int32 port does not trigger any FP instruction,
so the mstatus.FS write is guarded:
#ifdef DEEPBINDI_ENABLE_FPU
CSR_SET_BITS(CSR_REG_MSTATUS, (FS_INITIAL << 13));
#endif

PC testing: build with a standard host gcc using -DTARGET_PC (which
the Makefile sets automatically). This stubs all CSR macros and redirects
DEEPBINDI_FATAL to exit(1):
cd deepbindi_cnn_x_heep
make run

Expected output (dummy weights, test sample 0):
DeepBindi CNN_1D_v2 on X-HEEP
int32 inference, test sample 0 (label=0)
Output : 1 (FEAR)
Cycles : 0
-- Arena usage --
Weight pool : 19713 / 24000 words (77 / 93 KB)
Act arena : 1211 / 2048 words (4 / 8 KB)
Tensor pool : 7 / 16 structs
-----------------
Note: with dummy (seeded pseudo-random) weights the output is meaningless — FEAR
here does not indicate a real prediction. The arena numbers are the meaningful
check: weight pool usage (19 713/24 000) and act arena (1 211/2 048) must match
these values exactly for any correct build.
| Buffer | Elements | Size |
|---|---|---|
| g_weight_pool[] | 24 000 × 4 B | 93.75 KB (move to flash for production) |
| g_act_arena[] | 2 048 × 4 B | 8.00 KB |
| g_tensor_pool[] | 16 structs | 0.38 KB |
| Total | | ~102 KB |
Of the 93.75 KB weight pool, 19 713 words (77 KB) are actually used by CNN_1D_v2
with dummy weights. Once trained weights are loaded as const int32_t arrays
in flash (.rodata), the weight pool can be eliminated and SRAM drops to ~8 KB.
| Layer | Max accumulator value | Headroom |
|---|---|---|
| Conv1: 57×5 MACs, inputs ≤ 127, weights ≤ 8 | 285 × 127 × 8 ≈ 290 K | INT32_MAX = 2.1 G ✓ |
| Conv2: 32×5 MACs, inputs ≤ 290 K, weights ≤ 8 | 160 × 290 K × 8 ≈ 371 M | INT32_MAX = 2.1 G ✓ |
| BN multiply (int64 intermediate) | 371 M × 128 ≈ 47 G | handled by int64_t cast ✓ |
| Dense: 64 MACs, inputs ≤ 371 M, weights ≤ 8 | 64 × 371 M × 8 ≈ 190 G | handled at INT32 post-BN clamp |
In practice, pseudo-random dummy weights cancel out; worst-case values are achieved only when all weights and inputs have the same sign.
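The two overflow strategies named in the table (a 64-bit intermediate for the BN multiply and a clamp back to int32_t) can be sketched as follows. This is a sketch under assumptions: the helper names are hypothetical, and where exactly the port applies its clamp is not stated in this document. The right shift of a negative value is implementation-defined in C; the sketch assumes an arithmetic shift, as gcc and clang provide.

```c
#include <stdint.h>

/* Clamp a 64-bit intermediate into the int32_t range. */
static int32_t saturate_i32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Q7 batch-norm step: y = ((x * scale) >> 7) + offset, with the product
 * held in 64 bits so that 371 M * 128 cannot wrap. */
static int32_t bn_q7(int32_t x, int32_t scale_q7, int32_t offset)
{
    int64_t scaled = ((int64_t)x * scale_q7) >> 7;
    return saturate_i32(scaled + offset);
}

/* Dot product with a 64-bit accumulator, saturated once at the end. */
static int32_t dot_sat(const int32_t *x, const int32_t *w, int count)
{
    int64_t acc = 0;
    int i;
    for (i = 0; i < count; ++i)
        acc += (int64_t)x[i] * w[i];
    return saturate_i32(acc);
}
```

With a Q7 scale of 128 the BN step is the identity plus offset, so `bn_q7(100, 128, 5)` yields 105, while a worst-case dense accumulation that exceeds INT32_MAX saturates instead of wrapping.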
test_input.h contains sample 0 from
CH07_TFLite/saved_model/micro/test_data.txt (label = 0, NO_FEAR):
static const int32_t test_input_0[570] = { 7, 8, 13, 10, ... };

Layout: data[ch * 10 + t] for channel ch ∈ [0, 56], time step t ∈ [0, 9].
Values are original int8-range integers widened to int32_t.
The .tflite files in CH07_TFLite/saved_model/tflite/ are trained weights for
CNN_2d_tensorflow_softmax (TF3 / model_quant_1FC.tflite), not for the
PyTorch CNN_1D_v2 (PYE) that this C port implements:
| | CNN_1D_v2 (this C port) | TFLite micro (model_quant_1FC) |
|---|---|---|
| Input layout | NCHW (1, 57, 1, 10) | NHWC (1, 57, 10, 1) |
| Conv blocks | 2 (channels 57→32→64) | 1 (filters 1→64) |
| Kernel | (1×5) × 2 | (1×5) × 1 |
| Flatten features | 64 | 57 × 3 × 64 = 10 944 |
| Output | 1 × threshold(0) | 2-class softmax + argmax |
Consequently the .tflite weights cannot be loaded directly into the C port.
Use extract_weights.py to inspect the TFLite model structure and quantization
parameters. To load real weights into the C port, export the PyTorch CNN_1D_v2
checkpoint instead (see Replacing dummy weights below).
Add the application to the X-HEEP build system and build as usual:
cmake -DAPP=deepbindi_cnn_x_heep [other X-HEEP flags] ..
make

The application directory (deepbindi_cnn_x_heep/) is self-contained and
follows the same conventions as other X-HEEP example applications
(example_matadd, example_matfloat, etc.).
A Coarse-Grained Reconfigurable Array (CGRA) accelerates computation by mapping loop nests onto a 2-D array of functional units connected by a configurable interconnect. CGRAs excel at data-parallel, regular loop structures with predictable memory access patterns – exactly what neural network inference provides.
Located in nn_runtime.c. The 7-level loop nest is:
for n // batch – independent per sample
for oc // output channel
for oh, ow // spatial output ← tile across CGRA rows/cols
sum = bias[oc]
for icg, kh, kw // filter window ← MAC chain on FUs
sum += input[...] * weight[...] /* one MAC per iteration */
output[n][oc][oh][ow] = sum

Key observations:
- The innermost (kh, kw) loops perform one MAC per iteration, with no loop-carried dependence across different output positions – the textbook CGRA MAC-chain.
- The (oh, ow) loops produce independent output pixels; distribute them across CGRA rows/columns as a spatial tile.
- Depthwise convolutions (MobileNetV3, groups == in_channels) collapse the icg loop to 1, making scheduling simpler with the same MAC structure.
- BN + activation fusion: BN is a per-channel scale+shift; ReLU is a compare-with-zero. Both can be merged into the CGRA output stage immediately after the final accumulate, eliminating two memory round-trips per element.
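The 7-level loop nest can be written out as compilable C to make the index arithmetic explicit. This is a minimal sketch and not the body of conv2d_forward: the function name, the flat-pointer signature, the single stride/pad value shared by both axes, and the weight layout [out_ch][in_ch / groups][kh][kw] are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Direct convolution over contiguous NCHW float buffers.
 * Assumed weight layout: [out_ch][in_ch / groups][k_h][k_w]. */
static void conv2d_direct(const float *in, int n_batch, int in_ch,
                          int in_h, int in_w,
                          const float *weight, const float *bias,
                          int out_ch, int k_h, int k_w,
                          int stride, int pad, int groups, float *out)
{
    const int out_h = (in_h + 2 * pad - k_h) / stride + 1;
    const int out_w = (in_w + 2 * pad - k_w) / stride + 1;
    const int icg_count = in_ch / groups;    /* input channels per group */
    const int ocg_count = out_ch / groups;   /* output channels per group */
    int n, oc, oh, ow, icg, kh, kw;

    assert(in_ch % groups == 0 && out_ch % groups == 0);

    for (n = 0; n < n_batch; ++n)
    for (oc = 0; oc < out_ch; ++oc) {
        const int ic_base = (oc / ocg_count) * icg_count;
        for (oh = 0; oh < out_h; ++oh)
        for (ow = 0; ow < out_w; ++ow) {
            float sum = bias ? bias[oc] : 0.0f;
            for (icg = 0; icg < icg_count; ++icg)
            for (kh = 0; kh < k_h; ++kh)
            for (kw = 0; kw < k_w; ++kw) {
                const int ih = oh * stride - pad + kh;
                const int iw = ow * stride - pad + kw;
                if (ih < 0 || ih >= in_h || iw < 0 || iw >= in_w)
                    continue;                /* implicit zero padding */
                sum += in[((n * in_ch + ic_base + icg) * in_h + ih) * in_w + iw]
                     * weight[((oc * icg_count + icg) * k_h + kh) * k_w + kw];
            }
            out[((n * out_ch + oc) * out_h + oh) * out_w + ow] = sum;
        }
    }
}
```

Setting groups equal to in_ch makes icg_count 1, so the icg loop runs once per output channel; that is the depthwise case described in the observations above.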
for n // batch
for out // output neuron ← map across CGRA rows
sum = bias[out]
for in // inner product ← MAC pipeline per row
sum += x[in] * W[out][in]
output[out] = sum

FC layers in these models are small (32–192 neurons) – a compact CGRA covers them without tiling.
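The dense loop nest above translates directly into C for a single sample. The sketch assumes a row-major weight matrix with one row of in_features weights per output neuron; the name `dense_direct` and its flat-pointer signature are illustrative and differ from dense_forward in nn_runtime.h.

```c
/* Fully-connected layer: out[o] = bias[o] + sum_i x[i] * W[o][i].
 * W is stored row-major, one row of `in_features` weights per output. */
static void dense_direct(const float *x, int in_features,
                         const float *weight, const float *bias,
                         int out_features, float *out)
{
    int o, i;
    for (o = 0; o < out_features; ++o) {
        const float *row = weight + o * in_features;
        float sum = bias ? bias[o] : 0.0f;
        for (i = 0; i < in_features; ++i)
            sum += x[i] * row[i];
        out[o] = sum;
    }
}
```

Each output neuron reads its own weight row sequentially and shares the same input vector, which matches the "map across CGRA rows, MAC pipeline per row" annotation in the pseudocode.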
| Access | Pattern | CGRA hint |
|---|---|---|
| Conv weights | Sequential; reused over all (oh,ow) | Broadcast / double-buffer |
| Input activations (conv) | Sliding window stencil | Line buffer / shift register |
| Output activations | One write per (n,oc,oh,ow) | Direct DMA out |
| Dense weight matrix | Sequential row reads | Sequential DMA |
| BN parameters | One scalar per channel, broadcast over H×W | Constant broadcast |
| SE squeeze vector | One scalar per channel after global avg-pool | Small local buffer |
| Model | Dominant ops | CGRA notes |
|---|---|---|
| PYA (CNN_2D_v1) | 2×Conv2D(5×5) + 2×Dense | Simplest 2-D model; good first 2-D test |
| PYB (CNN_2D_v2) | 3 parallel Conv2D + Conv2D + 2×Dense | Branches are fully independent (parallelisable) |
| PYC (CNN_2D_v3) | 2 parallel Conv2D + Conv2D + 2×Dense | Two-branch variant of PYB |
| PYD (CNN_1D_v1) | 1×Conv(1×5) + 1×Dense | Recommended first CGRA target |
| PYE (CNN_1D_v2) | 2×Conv(1×5) + 1×Dense | Two sequential conv stages |
| PYF (CNN_1D_v3) | 1×Conv(1×5) + 2×Dense | Two FC stages |
| PYG (MobileNetV3) | 13×pointwise + 11×depthwise + 11×SE + 2×Dense | Most complex; SE adds GlobalAvgPool + 2 small FC per block |
| TF1–TF2 | Conv(1×5) + 2×Dense | Keras equivalents of PYF |
| TF3 | Conv(1×5) + 1×Dense | 2-D kernel emulating 1-D |
- Start with run_cnn_1d_v1 (PYD) – one conv2d_forward with a (1×5) kernel (57 input channels → 64 output, 1-D FIR pattern) plus one dense_forward (64→1). Total MACs ≈ 21 900. Easy to verify.
- Scale to run_cnn_2d_v1 (PYA) – 2-D spatial tiling over 57×57 feature maps.
- Tackle run_mobilenet_v3_custom (PYG) – full depthwise + SE pipeline with 11 inverted-residual blocks.
Each primitive has a well-defined C function signature. To swap in a CGRA version without touching model code:
- Implement the same signature in nn_runtime_cgra.c.
- In the Makefile, replace nn_runtime.c:

    SOURCES := main.c nn_runtime_cgra.c cnn_models_c.c # dynamic
    SOURCES := main.c arena.c nn_runtime_cgra.c cnn_models_c.c # static

- cnn_models_c.c and main.c are unchanged – they call through the same header (nn_runtime.h).
For partial acceleration (e.g. only conv2d_forward), keep nn_runtime.c and
guard with a compile-time flag:
/* nn_runtime_cgra.c */
#include "nn_runtime.h"
Tensor *conv2d_forward(...) { /* CGRA path */ }
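/* all other primitives: link against nn_runtime.c for the scalar fallback */

Both c_port/main.c and static_port/main.c print a checksum for each model.
It is described as a sum of absolute output values, although the two-class rows
in the expected output above match a position-weighted sum, Σ (i+1)·v_i
(for example 0.498526 + 2 × 0.501474 = 1.501474). Use these as reference values: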
# software reference
cd c_port && make run > ref.txt
# after CGRA substitution
make run > cgra.txt
diff ref.txt cgra.txt # should be identical (or within ~1e-4 tolerance)

The tensor_checksum helper is defined in nn_runtime.c. For stricter
validation, compare element-wise with a tolerance of 1e-4.
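An element-wise comparison of this kind can be sketched as below. `first_mismatch` is a hypothetical helper, not part of nn_runtime.c; it takes two flat output buffers (software reference and CGRA result) and avoids fabsf so it needs no libm.

```c
#include <stddef.h>

/* Returns the index of the first element whose absolute difference
 * exceeds `tol`, or -1 when the two buffers agree everywhere. */
static long first_mismatch(const float *ref, const float *test,
                           size_t count, float tol)
{
    size_t i;
    for (i = 0; i < count; ++i) {
        float diff = ref[i] - test[i];
        if (diff < 0.0f)
            diff = -diff;
        if (!(diff <= tol))      /* written this way so NaN is flagged too */
            return (long)i;
    }
    return -1;
}
```

Returning the failing index rather than a boolean makes it easier to map a mismatch back to a specific output channel or position when debugging an accelerated kernel.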
- Export PyTorch weights to a flat binary (torch.save + a custom extraction script, or ONNX export + onnx Python package).
- Replace the *_layer_create() calls in cnn_models_c.c with a loader that fills pre-allocated float arrays from the binary file.
- For the static variant, pre-populate g_weight_pool[] at link time using a generated C header (weights_cnn_1d_v2.h) with trained values as a static const float array.
- Verify correctness by comparing tensor_checksum against a Python reference forward pass on the same input values.
The quantized .tflite models in CH07_TFLite/saved_model/tflite/ are for a
different architecture (see architecture mismatch note above). To load real
weights into the CNN_1D_v2 C port:
- Option A, PyTorch export (recommended): Load the trained PyTorch CNN_1D_v2 checkpoint, iterate over model.state_dict(), quantize each weight tensor to int8 range (multiply by a per-layer scale, round, clamp to ±127), and write a weights_cnn_1d_v2.h header with const int32_t arrays.
- Option B, TFLite inspection only: Run extract_weights.py to inspect the TFLite model and understand its quantization parameters. These weights are not directly usable in the C port but are helpful for comparison and cross-validation.

    python deepbindi_cnn_x_heep/extract_weights.py --inspect
    python deepbindi_cnn_x_heep/extract_weights.py # writes weights_tflite_1FC.h

- Once weights_cnn_1d_v2.h exists, in nn_runtime.c replace the seeded_value_int32() loops with reads from the const arrays:

    /* in conv2d_layer_create: */
    for (i = 0; i < weight_count; ++i)
        layer.weights[i] = cnn1d_conv1_weights[i];

  The linker places const arrays in .rodata (flash), reducing SRAM from ~102 KB to ~8 KB (activations only).
- Pre-fold trained BN parameters into Q7 scale+offset:

    scale_int = np.round(gamma / np.sqrt(var + eps) * 128).astype(np.int32)
    offset_int = np.round(beta - gamma * mean / np.sqrt(var + eps)).astype(np.int32)
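The quantization arithmetic in Option A (multiply by a per-layer scale, round, clamp to ±127) can be sketched in C as follows. This only illustrates the arithmetic: the function names are hypothetical, how the per-layer scale is chosen is left to the exporter, and the rounding here is half-away-from-zero, which differs from np.round (half-to-even) on exact .5 ties.

```c
#include <stdint.h>
#include <stddef.h>

/* Round to nearest integer, halves away from zero, without math.h. */
static int32_t round_nearest(float v)
{
    return (int32_t)(v >= 0.0f ? v + 0.5f : v - 0.5f);
}

/* Quantize float weights to the int8 range [-127, 127], stored as int32_t.
 * `scale` is a per-layer multiplier supplied by the caller (assumption). */
static void quantize_weights(const float *src, size_t count,
                             float scale, int32_t *dst)
{
    size_t i;
    for (i = 0; i < count; ++i) {
        float v = src[i] * scale;
        if (v > 127.0f)  v = 127.0f;    /* clamp before the cast so the */
        if (v < -127.0f) v = -127.0f;   /* conversion can never overflow */
        dst[i] = round_nearest(v);
    }
}
```

With a scale of 64, a weight of 0.5 becomes 32 and a weight of 2.0 saturates at 127, so the scale should be picked so that the largest trained weight in the layer lands near the clamp limit.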
If you use these models or this C port in your work, please cite the original paper:
@article{gutierrez2026deepbindi,
author = {Gutiérrez-Martín, Laura and López-Ongil, Celia and Miranda-Calero, Jose A.},
title = {{DeepBindi}: An End-to-End Fear Detection System Optimized for Extreme-Edge Deployment},
journal = {IEEE Journal of Biomedical and Health Informatics},
volume = {30},
number = {1},
year = {2026},
doi = {10.1109/JBHI.2025.3587961},
note = {Date of publication: 10 July 2025; date of current version: 8 January 2026}
}

Plain-text reference:
L. Gutiérrez-Martín, C. López-Ongil, and J. A. Miranda-Calero, "DeepBindi: An End-to-End Fear Detection System Optimized for Extreme-Edge Deployment," IEEE Journal of Biomedical and Health Informatics, vol. 30, no. 1, Jan. 2026, doi: 10.1109/JBHI.2025.3587961.
Context: The paper presents a fear-recognition system based on physiological signals (BVP, SKT, GSR) from the WEMAC dataset, achieving 80% F1-score and 74% accuracy. The system was validated on an ultra-low-power ARM Cortex-M4 (16 mW @ 5 V, 496 ms per inference). The C port in this repository implements the same model architectures to support deployment on similar extreme-edge targets.