DeepBindi CNN models – C inference port

Reference C implementation of all ten CNN architectures defined in cnn_models.py. The models originate from the following peer-reviewed paper — please cite it if you use this code or the architectures in your work:

L. Gutiérrez-Martín, C. López-Ongil, and J. A. Miranda-Calero, "DeepBindi: An End-to-End Fear Detection System Optimized for Extreme-Edge Deployment," IEEE Journal of Biomedical and Health Informatics, vol. 30, no. 1, Jan. 2026. DOI: 10.1109/JBHI.2025.3587961

Three C variants are provided:

Variant	Directory	Memory strategy	Data type	Use case
Dynamic	`c_port/`	`malloc` / `free` – heap allocated	`float`	Rapid prototyping, host machines
Static	`static_port/`	Static global pools, no heap	`float`	Embedded targets, no OS
X-HEEP / int32	`deepbindi_cnn_x_heep/`	Static global pools, no heap	`int32_t`	X-HEEP RISC-V SoC

The dynamic and static variants are self-contained (no dependencies beyond libc and libm), produce identical checksums, and use the same tensor layout, layer primitives, and compute loops.

The X-HEEP variant targets CNN_1D_v2 (PYE) only, uses 32-bit integer arithmetic throughout (no float, no math.h), and is built to run bare-metal on X-HEEP with or without the FPU.

Quick start

# Dynamic version (debug build – logging enabled)
cd c_port
make run

# Static version (debug build – logging enabled)
cd static_port
make run

# Production / embedded build (silent, no printf anywhere)
make

# Verify both produce the same numbers
cd static_port
make verify

Expected output (same for both):

CNN_2D_v1  | shape=(1,1,1,1) | checksum=0.496194 | values=[0.496194]
CNN_2D_v2  | shape=(1,1,1,1) | checksum=0.496471 | values=[0.496471]
CNN_2D_v3  | shape=(1,1,1,1) | checksum=0.498775 | values=[0.498775]
CNN_1D_v1  | shape=(1,1,1,1) | checksum=0.496733 | values=[0.496733]
CNN_1D_v2  | shape=(1,1,1,1) | checksum=0.503341 | values=[0.503341]
CNN_1D_v3  | shape=(1,1,1,1) | checksum=0.502004 | values=[0.502004]
MobileNetV3Custom | shape=(1,1,1,1) | checksum=0.496607 | values=[0.496607]
CNN_1d_tf_sigmoid | shape=(1,1,1,1) | checksum=0.496598 | values=[0.496598]
CNN_1d_tf_softmax | shape=(1,2,1,1) | checksum=1.501474 | values=[0.498526, 0.501474]
CNN_2d_tf_softmax | shape=(1,2,1,1) | checksum=1.511614 | values=[0.488386, 0.511614]

Requires: gcc (≥ C99) and libm. No Python, PyTorch, or TensorFlow needed.

Model catalogue

ID	C function	Python class	Architecture summary
PYA	`run_cnn_2d_v1`	`CNN_2D_v1`	2D-CNN · 2 conv + 2 FC · binary sigmoid
PYB	`run_cnn_2d_v2`	`CNN_2D_v2`	2D-CNN · 3 parallel branches + 2nd conv + 2 FC
PYC	`run_cnn_2d_v3`	`CNN_2D_v3`	2D-CNN · 2 parallel branches + 2nd conv + 2 FC
PYD	`run_cnn_1d_v1`	`CNN_1D_v1`	1D-CNN · 1 conv + 1 FC
PYE	`run_cnn_1d_v2`	`CNN_1D_v2`	1D-CNN · 2 conv + 1 FC
PYF	`run_cnn_1d_v3`	`CNN_1D_v3`	1D-CNN · 1 conv + 2 FC
PYG	`run_mobilenet_v3_custom`	`MobileNetV3Custom`	MobileNetV3-Small (width×0.35, 1-ch), 11 inverted-residual blocks + SE
TF1	`run_cnn_1d_tensorflow_sigmoid`	`CNN_1d_tensorflow_sigmoid`	Keras 1D-CNN, sigmoid binary output
TF2	`run_cnn_1d_tensorflow_softmax`	`CNN_1d_tensorflow_softmax`	Keras 1D-CNN, softmax 2-class output
TF3	`run_cnn_2d_tensorflow_softmax`	`CNN_2d_tensorflow_softmax`	Keras 2D-CNN (1×k kernel), softmax 2-class output

Input conventions

All tensors use NCHW layout (batch × channels × height × width). 1-D signals are stored as (N, C, 1, W) so the same conv2d_forward primitive handles both 1-D and 2-D cases uniformly.
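
As a minimal sketch of this layout, the flat offset of element (n, c, h, w) in a contiguous NCHW buffer can be computed as follows. The helper name `nchw_index` is hypothetical and is not claimed to exist in nn_runtime.h.

```c
#include <stddef.h>

/* Flat offset of element (n, c, h, w) in a contiguous NCHW buffer with
 * C channels, H rows and W columns. Hypothetical helper, for illustration. */
size_t nchw_index(size_t n, size_t c, size_t h, size_t w,
                  size_t C, size_t H, size_t W)
{
    return ((n * C + c) * H + h) * W + w;
}
```

For a 1-D input stored as (1, 57, 1, 10), `nchw_index(0, ch, 0, t, 57, 1, 10)` reduces to `ch * 10 + t`.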

Model group	Input shape (NCHW)	Meaning
2D-CNN (PYA–PYC)	(1, 1, 57, 57)	Single-channel 57×57 feature map
1D-CNN (PYD–PYF)	(1, 57, 1, 10)	57 features × 10 time steps
MobileNetV3 (PYG)	(1, 1, 57, 57)	Single-channel 57×57 feature map
Keras 1D (TF1–TF2)	(1, 57, 1, 10)	57 features × 10 time steps
Keras 2D (TF3)	(1, 57, 10, 10)	57 channels × 10×10 spatial map

Directory structure

model/
├── cnn_models.py                   Original Python model definitions (PyTorch + Keras)
│
├── c_port/                         ── Dynamic variant (float, all 10 models) ───────────
│   ├── deepbindi_config.h          Logging + fatal-error macros
│   ├── nn_runtime.h/c              Scalar float kernels using malloc/free
│   ├── cnn_models_c.h/c            All 10 model forward passes
│   ├── main.c                      Demo driver: runs all models, prints checksums
│   └── Makefile                    Builds deepbindi_c_demo
│
├── static_port/                    ── Static variant (float, all 10 models) ────────────
│   ├── deepbindi_config.h          Logging + fatal-error macros
│   ├── arena.h/c                   g_weight_pool / g_act_arena bump allocators
│   ├── nn_runtime.h/c              Same kernels; tensor_free / layer_free are no-ops
│   ├── cnn_models_c.h/c            Same models; act_arena_reset() between runs
│   ├── main.c                      Driver; calls arena_stats() after all models
│   └── Makefile                    Builds deepbindi_static_demo
│
└── deepbindi_cnn_x_heep/           ── X-HEEP / int32 variant (CNN_1D_v2 only) ─────────
    ├── deepbindi_config.h          TARGET_PC stubs + DEEPBINDI_ENABLE_FPU guard
    ├── arena.h/c                   int32_t pools (WEIGHT_POOL_WORDS / ACT_ARENA_WORDS)
    ├── nn_runtime.h/c              int32_t kernels; no float, no math.h
    ├── cnn_models_c.h/c            CNN_1D_v2 only; accepts real or dummy int32 input
    ├── main.c                      X-HEEP driver (CSR cycle count, FPU optional)
    ├── test_input.h                570 int32_t values from test_data.txt, sample 0
    ├── Makefile                    PC build: `make run` (gcc -DTARGET_PC)
    └── extract_weights.py          TFLite model inspector + C int32_t header generator

Dynamic vs. static – key differences

Aspect	`c_port/` (dynamic)	`static_port/` (static)
Weight allocation	`malloc` inside `*_layer_create`	`weight_alloc` (bump into `g_weight_pool[]`)
Activation allocation	`malloc` inside `tensor_create`	`act_alloc` (bump into `g_act_arena[]`)
Freeing	`free` in `tensor_free` / `*_layer_free`	no-op – arena is bulk-reset
Between-model cleanup	Each tensor freed after use	`act_arena_reset()` at start of each model
Static RAM (all 10 models)	OS heap (invisible)	≈ 10 MB BSS (`g_weight_pool` + `g_act_arena`)
OS / libc requirement	`malloc` / `free`	No heap; `stdio` only when logging is on
Logging / output	`make run` (debug build)	`make run` (debug build)
Embedded / bare-metal ready	No (needs heap)	Yes

The static version prints a memory usage table at the end:

── Arena usage ──────────────────────────────────────────
  Weight pool : 1748492 / 2000000 floats  (6830 / 7813 KB)
  Act arena   : 372770  / 500000  floats HWM  (1456 / 1953 KB)
  Tensor pool : 5 / 128 structs
─────────────────────────────────────────────────────────

Use this to right-size WEIGHT_POOL_FLOATS and ACT_ARENA_FLOATS in static_port/arena.h when deploying a single model.
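
The bump-allocation idea behind the pools can be sketched in a few lines. This is an illustration only: the names `sketch_alloc`, `sketch_reset`, `sketch_high_water` and the capacity `SKETCH_ARENA_FLOATS` are hypothetical, and the real arena.c may differ in detail (for instance, the ports call DEEPBINDI_FATAL on overflow rather than returning NULL).

```c
#include <stddef.h>

#define SKETCH_ARENA_FLOATS 1024   /* illustrative capacity, not the real pool size */

static float  sketch_arena[SKETCH_ARENA_FLOATS];
static size_t sketch_used = 0;     /* current bump offset, in floats  */
static size_t sketch_hwm  = 0;     /* high-water mark across resets   */

/* Hand out `count` floats from the pool, or NULL if it would overflow. */
float *sketch_alloc(size_t count)
{
    if (count > SKETCH_ARENA_FLOATS - sketch_used)
        return NULL;
    float *p = &sketch_arena[sketch_used];
    sketch_used += count;
    if (sketch_used > sketch_hwm)
        sketch_hwm = sketch_used;
    return p;
}

/* Release everything at once; the high-water mark is kept for reporting. */
void sketch_reset(void) { sketch_used = 0; }

/* Largest number of floats ever in use at the same time. */
size_t sketch_high_water(void) { return sketch_hwm; }
```

The high-water mark is the figure to size a pool against: it records the peak demand even after the arena has been reset between models.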

Embedded portability

Both float ports isolate all platform-specific behaviour in deepbindi_config.h, which is the only file that needs to change per target. Only the static port runs without a heap and is therefore suited to bare-metal targets (no OS); the dynamic port still needs malloc / free.

Logging (silent by default)

By default the build is fully silent — no printf or fprintf anywhere. Enable output for debugging with:

make run      # debug build (adds -DDEEPBINDI_ENABLE_LOGGING automatically)
make debug    # same, without running
make          # production / embedded build – no output, no stdio dependency

Or pass the flag directly to your cross-compiler:

CFLAGS += -DDEEPBINDI_ENABLE_LOGGING   # host / UART debug build

Fatal error handler

Shape mismatches and pool overflows call DEEPBINDI_FATAL(msg). The default behaviour depends on the build mode:

Build	Default behaviour
With logging	Print message to stderr + `exit(1)`
Without logging	Infinite loop `for(;;){}` (a watchdog, if one is enabled, then resets the device)

Override for your target by defining DEEPBINDI_FATAL before the build:

/* ARM Cortex-M: halt at breakpoint */
#define DEEPBINDI_FATAL(msg)  do { __BKPT(0); for(;;){} } while(0)

/* RISC-V: illegal instruction trap */
#define DEEPBINDI_FATAL(msg)  do { __asm__("unimp"); for(;;){} } while(0)

/* Custom UART + watchdog reset */
#define DEEPBINDI_FATAL(msg)  do { uart_puts(msg); system_reset(); } while(0)
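
The override mechanism can be sketched as below, assuming the default in deepbindi_config.h is wrapped in an `#ifndef` guard (an assumption; the header itself is not shown here). The counter `sketch_fatal_count` and the function `sketch_check_width` are hypothetical and exist only to make the example observable.

```c
/* Target-specific override: it must be visible before the default is
 * reached, e.g. via -D on the command line or an earlier include. */
int sketch_fatal_count = 0;
#define DEEPBINDI_FATAL(msg)  do { (void)(msg); ++sketch_fatal_count; } while (0)

/* Assumed shape of the default: only defined if nothing came first. */
#ifndef DEEPBINDI_FATAL
#define DEEPBINDI_FATAL(msg)  do { (void)(msg); for (;;) {} } while (0)
#endif

/* A shape check in the style described above; returns 1 when shapes agree. */
int sketch_check_width(int expected, int actual)
{
    if (expected != actual) {
        DEEPBINDI_FATAL("width mismatch");
        return 0;
    }
    return 1;
}
```

Because the override here returns instead of halting, the caller has to cope with the failure itself; the halting variants shown above avoid that burden.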

Other portability notes

Issue	`c_port/` / `static_port/`	`deepbindi_cnn_x_heep/`
`int` width	Shape fields are plain `int`; 16-bit MCUs may overflow on large 2-D tensors. Use a 32-bit toolchain.	Same.
Arithmetic type	`float` throughout; no implicit `double` promotion.	`int32_t` throughout; no `float`, no `math.h`.
`memset` to zero	Relies on IEEE-754 all-zero = 0.0f. Safe on all common targets.	Uses explicit scalar zero-fill loops; no libc dependency.
`%f` format specifier	Not used (newlib-nano limitation). Values printed as scaled integers.	Not used; values are integers, printed with `%d` directly.
`%zu` format specifier	Not supported by newlib-nano; `%u` with `(unsigned)` cast used instead.	Same.
`stdio.h` / `stdlib.h`	Included only when `DEEPBINDI_ENABLE_LOGGING` is defined.	Always included (logging always on); `stdlib.h` included only with `TARGET_PC`.
FPU	Not required; no CSR writes.	Not required for int32 port. Guard with `#ifdef DEEPBINDI_ENABLE_FPU` if adding FP code.

Layer primitives (`nn_runtime.c` / `static_port/nn_runtime.c`)

Primitive	Description
`conv2d_forward`	2-D convolution with padding, stride, groups (incl. depthwise)
`batchnorm_forward_inplace`	Inference BN: `γ·(x−μ)/√(σ²+ε) + β`
`maxpool2d_forward`	Sliding-window max reduction
`adaptive_avg_pool2d_forward`	Global (or partial) average pooling to target H×W
`flatten_forward`	Reshape to `(N, C·H·W, 1, 1)`
`dense_forward`	Fully-connected: `y = x·Wᵀ + b`
`concat_height`	Concatenate two tensors along the H axis
`add_forward`	Element-wise addition (residual connections)
`channel_scale_forward`	Per-channel scalar multiply (SE attention gate)
`relu_inplace` / `sigmoid_inplace` / `softmax_inplace`	Standard activations
`hardsigmoid_inplace` / `hardswish_inplace`	MobileNetV3 approximated activations
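
For reference, the two MobileNetV3 activations are piecewise-linear stand-ins for sigmoid and swish. The sketch below uses the standard definitions on a single value; the function names are hypothetical and the in-place tensor versions in nn_runtime.c may be organised differently.

```c
/* Piecewise-linear activations as defined for MobileNetV3:
 *   hardsigmoid(x) = clamp(x + 3, 0, 6) / 6
 *   hardswish(x)   = x * hardsigmoid(x)                     */
float sketch_hardsigmoid(float x)
{
    float t = x + 3.0f;
    if (t < 0.0f) t = 0.0f;
    if (t > 6.0f) t = 6.0f;
    return t / 6.0f;
}

float sketch_hardswish(float x)
{
    return x * sketch_hardsigmoid(x);
}
```

Neither function needs expf, which is what makes them attractive on small targets.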

Fused building blocks (`cnn_models_c.c`)

Helper	Fuses
`apply_conv_bn_act`	Conv2D → BatchNorm → Activation
`apply_dense_bn_act`	Dense → (optional BN) → Activation
`apply_se_block`	GlobalAvgPool → FC → ReLU → FC → HardSigmoid → Scale
`apply_mobilenet_block`	Expand conv → Depthwise conv → SE → Project conv → (Residual add)

Design decisions

Dropout is a training-only operation. It is a pure identity at inference time and is omitted entirely.

BatchNorm is applied in inference mode (frozen running mean/var). The closed-form formula γ·(x−μ)/√(σ²+ε) + β is computed directly. Dummy parameters are deterministic and reproducible – replace with real trained values for production.

1-D convolutions as 2-D: the Keras models (TF1–TF3) already represent 1-D convolutions as 2-D kernels of shape (1×k). The C port adopts the same representation uniformly, so all convolutions go through conv2d_forward.

MobileNetV3 channel scaling: width_mult = 0.35 is applied to every channel count with a make_divisible(..., 8) rounding step (matching torchvision) to keep memory accesses aligned.
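
The rounding step can be sketched in C as follows. It mirrors the logic of torchvision's helper with `min_value` left at its default (the divisor); the function name `sketch_make_divisible` is hypothetical, and the base widths used in the usage note are illustrative rather than taken from cnn_models.py.

```c
/* Round a scaled channel count to a multiple of `divisor`, without
 * dropping more than 10% below the requested value. */
int sketch_make_divisible(double v, int divisor)
{
    /* Round to the nearest multiple (ties go up). */
    int new_v = (int)(v + divisor / 2.0) / divisor * divisor;
    if (new_v < divisor)
        new_v = divisor;            /* min_value defaults to divisor */
    if (new_v < 0.9 * v)
        new_v += divisor;           /* avoid shrinking by more than 10% */
    return new_v;
}
```

With width_mult = 0.35, a base width of 16 gives 5.6 and is rounded up to 8, while a base width of 40 gives 14 and is rounded to 16.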

X-HEEP deployment variant (`deepbindi_cnn_x_heep/`)

This directory contains a dedicated port of CNN_1D_v2 (PYE) for the X-HEEP RISC-V SoC, targeting bare-metal inference with or without a hardware FPU.

Why a separate variant?

Constraint	Source	Implication for C code
32-bit word length	HW accelerator requirement	`int32_t` throughout; no `int8_t` TFLite quantization
No FPU guarantee	X-HEEP bare-metal startup	No `float`; no `math.h` (`expf`, `sqrtf`, `fabsf` banned)
No heap allocator	Bare-metal, no OS	Static global pools; `malloc`/`free` banned
No `memset` / `memcpy`	Avoid libc symbol dependencies	Explicit scalar loops everywhere
No `%f` in `printf`	newlib-nano limitation	Integer printing only
CSR cycle counter	X-HEEP hardware performance measurement	`CSR_WRITE/READ(CSR_REG_MCYCLE, ...)`

Key design changes vs `static_port/`

All arithmetic is int32_t — no float anywhere in the data path:

typedef struct {
    int      n, c, h, w;
    int32_t *data;          /* was float * */
} Tensor;

BatchNorm is pre-folded to Q7 scale + offset — eliminates sqrtf from the forward pass entirely:

typedef struct {
    int      num_features;
    int32_t *scale;   /* Q7: scale[c] = round(gamma[c]/sqrt(var[c]+eps) * 128) */
    int32_t *offset;  /* offset[c] = round(beta[c] - gamma[c]*mean[c]/sqrt(var[c]+eps)) */
} BatchNormLayer;

/* Forward: */
y = (int32_t)(((int64_t)x * scale[c]) >> 7) + offset[c];
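
A small worked sketch of the Q7 arithmetic, under the assumption that the real-valued multiplier gamma / sqrt(var + eps) has already been computed offline. The helper names are hypothetical. Rounding is done by hand so that the example, like the port, does not need math.h.

```c
#include <stdint.h>

/* Convert a real-valued multiplier to Q7 (x128), rounding half away
 * from zero. Intended for small multipliers computed offline. */
int32_t sketch_to_q7(float m)
{
    float scaled = m * 128.0f;
    return (int32_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}

/* Q7 batch-norm forward for one element. The 64-bit product avoids
 * overflow. Note that >> on a negative value is implementation-defined
 * in C; GCC and Clang use an arithmetic shift (rounds toward -infinity). */
int32_t sketch_bn_q7(int32_t x, int32_t scale_q7, int32_t offset)
{
    return (int32_t)(((int64_t)x * scale_q7) >> 7) + offset;
}
```

For example, with gamma = 2, var + eps = 4, mean = 4 and beta = 1, the multiplier is 1.0, so scale = 128 and offset = 1 − 2·4/2 = −3. An input of 10 then yields (10·128 >> 7) − 3 = 7, which matches the floating-point result 2·(10 − 4)/2 + 1 = 7.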

Sigmoid replaced by a sign threshold — the output layer is binary (fear / no fear), so sigmoid(x) > 0.5 is equivalent to x > 0, which requires no expf:

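void sigmoid_inplace(Tensor *input) {
    /* Binary decision only: sigmoid(x) > 0.5 exactly when x > 0. */
    int total = input->n * input->c * input->h * input->w;
    int i;
    for (i = 0; i < total; ++i)
        input->data[i] = (input->data[i] > 0) ? 1 : 0;
}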

FPU enable is optional — the int32 port does not trigger any FP instruction, so the mstatus.FS write is guarded:

#ifdef DEEPBINDI_ENABLE_FPU
    CSR_SET_BITS(CSR_REG_MSTATUS, (FS_INITIAL << 13));
#endif

PC testing — build with a standard host gcc using -DTARGET_PC (which the Makefile sets automatically). This stubs all CSR macros and redirects DEEPBINDI_FATAL to exit(1):

cd deepbindi_cnn_x_heep
make run

Expected output (dummy weights, test sample 0):

DeepBindi CNN_1D_v2 on X-HEEP
int32 inference, test sample 0 (label=0)
Output : 1 (FEAR)
Cycles : 0
-- Arena usage --
  Weight pool : 19713 / 24000 words  (77 / 93 KB)
  Act arena   : 1211 / 2048 words  (4 / 8 KB)
  Tensor pool : 7 / 16 structs
-----------------

Note: with dummy (seeded pseudo-random) weights the output is meaningless — FEAR here does not indicate a real prediction. The arena numbers are the meaningful check: weight pool usage (19 713/24 000) and act arena (1 211/2 048) must match these values exactly for any correct build.

Static SRAM footprint (CNN_1D_v2, int32_t)

Buffer	Elements	Size
`g_weight_pool[]`	24 000 × 4 B	93.75 KB (move to flash for production)
`g_act_arena[]`	2 048 × 4 B	8.00 KB
`g_tensor_pool[]`	16 structs	0.38 KB
Total		~102 KB

Of the 93.75 KB weight pool, 19 713 words (77 KB) are actually used by CNN_1D_v2 with dummy weights. Once trained weights are loaded as const int32_t arrays in flash (.rodata), the weight pool can be eliminated and SRAM drops to ~8 KB.

Overflow analysis

Layer	Max accumulator value	Headroom
Conv1: 57×5 MACs, inputs ≤ 127, weights ≤ 8	285 × 127 × 8 ≈ 290 K	INT32_MAX = 2.1 G ✓
Conv2: 32×5 MACs, inputs ≤ 290 K, weights ≤ 8	160 × 290 K × 8 ≈ 371 M	INT32_MAX = 2.1 G ✓
BN multiply (int64 intermediate)	371 M × 128 ≈ 47 G	handled by `int64_t` cast ✓
Dense: 64 MACs, inputs ≤ 371 M, weights ≤ 8	64 × 371 M × 8 ≈ 190 G	handled at INT32 post-BN clamp

In practice, the signed pseudo-random dummy weights largely cancel; the worst-case values are reached only when all weights and inputs share the same sign.
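
The bounds in the table can be reproduced with a few lines of 64-bit arithmetic. The helper names are hypothetical. The exact products come out slightly below the rounded figures in the table (289 560 rather than 290 K for Conv1, and 370 636 800 rather than 371 M for Conv2).

```c
#include <stdint.h>

/* Worst-case |accumulator| when every product has the same sign:
 * macs * max|input| * max|weight|, evaluated in 64 bits. */
int64_t sketch_worst_case_acc(int64_t macs, int64_t max_in, int64_t max_w)
{
    return macs * max_in * max_w;
}

/* Returns 1 if the bound fits in int32_t, 0 otherwise. */
int sketch_fits_int32(int64_t bound)
{
    return bound <= (int64_t)INT32_MAX;
}
```

Feeding the Conv1 bound into Conv2 still fits in 32 bits, while the BN product and the Dense accumulator do not, which is why those stages rely on the `int64_t` intermediate and the clamp listed in the table.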

Test data

test_input.h contains sample 0 from CH07_TFLite/saved_model/micro/test_data.txt (label = 0, NO_FEAR):

static const int32_t test_input_0[570] = { 7, 8, 13, 10, ... };

Layout: data[ch * 10 + t] for channel ch ∈ [0, 56], time step t ∈ [0, 9]. Values are original int8-range integers widened to int32_t.

Important: TFLite model vs CNN_1D_v2 architecture mismatch

The .tflite files in CH07_TFLite/saved_model/tflite/ are trained weights for CNN_2d_tensorflow_softmax (TF3 / model_quant_1FC.tflite), not for the PyTorch CNN_1D_v2 (PYE) that this C port implements:

	CNN_1D_v2 (this C port)	TFLite micro (model_quant_1FC)
Input layout	NCHW `(1, 57, 1, 10)`	NHWC `(1, 57, 10, 1)`
Conv blocks	2 — channels 57→32→64	1 — filters 1→64
Kernel	`(1×5)` × 2	`(1×5)` × 1
Flatten features	64	57 × 3 × 64 = 10 944
Output	1 × threshold(0)	2-class softmax + argmax

Consequently the .tflite weights cannot be loaded directly into the C port. Use extract_weights.py to inspect the TFLite model structure and quantization parameters. To load real weights into the C port, export the PyTorch CNN_1D_v2 checkpoint instead (see Replacing dummy weights below).

Building for X-HEEP (CMake)

Add the application to the X-HEEP build system and build as usual:

cmake -DAPP=deepbindi_cnn_x_heep [other X-HEEP flags] ..
make

The application directory (deepbindi_cnn_x_heep/) is self-contained and follows the same conventions as other X-HEEP example applications (example_matadd, example_matfloat, etc.).

CGRA acceleration guide

A Coarse-Grained Reconfigurable Array (CGRA) accelerates computation by mapping loop nests onto a 2-D array of functional units connected by a configurable interconnect. CGRAs excel at data-parallel, regular loop structures with predictable memory access patterns – exactly what neural network inference provides.

Primary kernel: `conv2d_forward` ★★★ (highest priority)

Located in nn_runtime.c. The 7-level loop nest is:

for n                        // batch – independent per sample
  for oc                     // output channel
    for oh, ow               // spatial output  ← tile across CGRA rows/cols
      sum = bias[oc]
      for icg, kh, kw        // filter window   ← MAC chain on FUs
        sum += input[...] * weight[...]   /* one MAC per iteration */
      output[n][oc][oh][ow] = sum

Key observations:

Each iteration of the innermost (icg, kh, kw) loops is a single MAC. The accumulation into sum is a reduction within one output position, and nothing is carried between different output positions: the textbook CGRA MAC chain.
The (oh, ow) loops produce independent output pixels; distribute them across CGRA rows/columns as a spatial tile.
Depthwise convolutions (MobileNetV3, groups == in_channels) collapse the icg loop to 1, making scheduling simpler with the same MAC structure.
BN + activation fusion: BN is a per-channel scale+shift; ReLU is a compare-with-zero. Both can be merged into the CGRA output stage immediately after the final accumulate, eliminating two memory round-trips per element.
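
The loop nest above can be written out as a runnable scalar reference. This is a simplified sketch (stride 1, no padding, groups == 1, raw float buffers instead of the Tensor struct) and is not the repository's conv2d_forward; the function names are hypothetical. It can still serve as a golden model when bringing up a CGRA mapping.

```c
/* Simplified scalar reference: stride 1, no padding, groups == 1.
 * Input is contiguous NCHW; weights are laid out as (OC, IC, KH, KW). */
void sketch_conv2d(const float *in, const float *w, const float *bias,
                   float *out, int n, int ic, int ih, int iw,
                   int oc, int kh, int kw)
{
    int oh = ih - kh + 1;
    int ow = iw - kw + 1;
    for (int b = 0; b < n; ++b)
        for (int o = 0; o < oc; ++o)
            for (int y = 0; y < oh; ++y)
                for (int x = 0; x < ow; ++x) {
                    float sum = bias[o];
                    /* Reduction over the filter window: one MAC per step. */
                    for (int c = 0; c < ic; ++c)
                        for (int ky = 0; ky < kh; ++ky)
                            for (int kx = 0; kx < kw; ++kx)
                                sum += in[((b * ic + c) * ih + y + ky) * iw + x + kx]
                                     * w[((o * ic + c) * kh + ky) * kw + kx];
                    out[((b * oc + o) * oh + y) * ow + x] = sum;
                }
}

/* Tiny check: a (1x2) kernel of ones over {1, 2, 3, 4} with bias 0.5
 * writes {3.5, 5.5, 7.5}; the middle output is returned. */
float sketch_conv2d_demo(void)
{
    const float in[4]   = {1.0f, 2.0f, 3.0f, 4.0f};
    const float w[2]    = {1.0f, 1.0f};
    const float bias[1] = {0.5f};
    float out[3] = {0.0f, 0.0f, 0.0f};
    sketch_conv2d(in, w, bias, out, 1, 1, 1, 4, 1, 1, 2);
    return out[1];
}
```

The (y, x) iterations never read each other's results, so they are the natural axis to tile across the array.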

Secondary kernel: `dense_forward` ★★

for n               // batch
  for out           // output neuron   ← map across CGRA rows
    sum = bias[out]
    for in          // inner product   ← MAC pipeline per row
      sum += x[in] * W[out][in]
    output[out] = sum

FC layers in these models are small (32–192 neurons) – a compact CGRA covers them without tiling.
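
The same inner-product structure, written out as a minimal single-sample sketch. Raw float buffers are used instead of the Tensor struct, W is assumed to be stored row-major as (out_features, in_features), and the function names are hypothetical.

```c
/* y[o] = bias[o] + sum over i of x[i] * W[o][i]. */
void sketch_dense(const float *x, const float *w, const float *bias,
                  float *y, int in_features, int out_features)
{
    for (int o = 0; o < out_features; ++o) {
        float sum = bias[o];
        for (int i = 0; i < in_features; ++i)
            sum += x[i] * w[o * in_features + i];
        y[o] = sum;
    }
}

/* Tiny check: x = {1, 2} through a 3x2 matrix gives y = {1, 2, 4};
 * the three outputs are packed into one number for easy comparison. */
float sketch_dense_demo(void)
{
    const float x[2] = {1.0f, 2.0f};
    const float w[6] = {1.0f, 0.0f,   0.0f, 1.0f,   1.0f, 1.0f};
    const float b[3] = {0.0f, 0.0f, 1.0f};
    float y[3];
    sketch_dense(x, w, b, y, 2, 3);
    return y[0] + 10.0f * y[1] + 100.0f * y[2];
}
```

Each output neuron reads one contiguous row of W, which is the sequential access pattern a per-row MAC pipeline expects.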

Memory access patterns

Access	Pattern	CGRA hint
Conv weights	Sequential; reused over all `(oh,ow)`	Broadcast / double-buffer
Input activations (conv)	Sliding window stencil	Line buffer / shift register
Output activations	One write per `(n,oc,oh,ow)`	Direct DMA out
Dense weight matrix	Sequential row reads	Sequential DMA
BN parameters	One scalar per channel, broadcast over H×W	Constant broadcast
SE squeeze vector	One scalar per channel after global avg-pool	Small local buffer

Operation count by model

Model	Dominant ops	CGRA notes
PYA (`CNN_2D_v1`)	2×Conv2D(5×5) + 2×Dense	Simplest 2-D model; good first 2-D test
PYB (`CNN_2D_v2`)	3 parallel Conv2D + Conv2D + 2×Dense	Branches are fully independent (parallelisable)
PYC (`CNN_2D_v3`)	2 parallel Conv2D + Conv2D + 2×Dense	Two-branch variant of PYB
PYD (`CNN_1D_v1`)	1×Conv(1×5) + 1×Dense	Recommended first CGRA target
PYE (`CNN_1D_v2`)	2×Conv(1×5) + 1×Dense	Two sequential conv stages
PYF (`CNN_1D_v3`)	1×Conv(1×5) + 2×Dense	Two FC stages
PYG (`MobileNetV3`)	13×pointwise + 11×depthwise + 11×SE + 2×Dense	Most complex; SE adds GlobalAvgPool + 2 small FC per block
TF1–TF2	Conv(1×5) + 2×Dense	Keras equivalents of PYF
TF3	Conv(1×5) + 1×Dense	2-D kernel emulating 1-D

Suggested progression

Start with run_cnn_1d_v1 (PYD) – one conv2d_forward with a (1×5) kernel (57 input channels → 64 output, 1-D FIR pattern) plus one dense_forward (64→1). Total MACs ≈ 21 900. Easy to verify.
Scale to run_cnn_2d_v1 (PYA) – 2-D spatial tiling over 57×57 feature maps.
Tackle run_mobilenet_v3_custom (PYG) – full depthwise + SE pipeline with 11 inverted-residual blocks.

Replacing a primitive with a CGRA implementation

Each primitive has a well-defined C function signature. To swap in a CGRA version without touching model code:

Implement the same signature in nn_runtime_cgra.c.

In the Makefile, replace nn_runtime.c:

SOURCES := main.c nn_runtime_cgra.c cnn_models_c.c      # dynamic
SOURCES := main.c arena.c nn_runtime_cgra.c cnn_models_c.c  # static

cnn_models_c.c and main.c are unchanged – they call through the same header (nn_runtime.h).

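For partial acceleration (e.g. only conv2d_forward), keep nn_runtime.c for the scalar fallbacks and exclude its own conv2d_forward with a compile-time flag, so that the symbol is defined exactly once at link time: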

/* nn_runtime_cgra.c */
#include "nn_runtime.h"
Tensor *conv2d_forward(...) { /* CGRA path */ }
/* all other primitives: link against nn_runtime.c for the scalar fallback */
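
One way to arrange the guard is sketched below in a single file. The flag name DEEPBINDI_USE_CGRA_CONV and the function `sketch_conv_is_cgra` are hypothetical; in the real tree the first branch would sit in nn_runtime_cgra.c and the second would wrap the scalar conv2d_forward in nn_runtime.c.

```c
/* Build with -DDEEPBINDI_USE_CGRA_CONV to select the accelerated
 * definition. Exactly one definition of the function is ever compiled,
 * so the linker never sees a duplicate symbol. */
#ifdef DEEPBINDI_USE_CGRA_CONV

/* Accelerated path (would live in nn_runtime_cgra.c). */
int sketch_conv_is_cgra(void) { return 1; }

#else

/* Scalar fallback (would stay in nn_runtime.c). */
int sketch_conv_is_cgra(void) { return 0; }

#endif
```

Without such a guard, linking nn_runtime.c and nn_runtime_cgra.c together would fail with a multiple-definition error for conv2d_forward.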

Validating CGRA results

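Both c_port/main.c and static_port/main.c print a checksum for each model. Judging by the two-class rows of the expected output (0.498526 + 2 × 0.501474 = 1.501474), the checksum is a position-weighted sum in which element i is weighted by i + 1, not a plain sum of the output values. Use these as reference values: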

# software reference
cd c_port && make run > ref.txt

# after CGRA substitution
make run > cgra.txt

diff ref.txt cgra.txt    # should be identical (or within ~1e-4 tolerance)

The tensor_checksum helper is defined in nn_runtime.c. For stricter validation, compare element-wise with a tolerance of 1e-4.
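
An element-wise comparison with an absolute tolerance could look like the sketch below. The function names are hypothetical and the buffers are raw float arrays rather than Tensor structs; the reference values in the demo are taken from the CNN_1d_tf_softmax row of the expected output.

```c
#include <stddef.h>

/* Returns 1 if every |a[i] - b[i]| <= tol, otherwise 0. */
int sketch_all_close(const float *a, const float *b, size_t count, float tol)
{
    for (size_t i = 0; i < count; ++i) {
        float d = a[i] - b[i];
        if (d < 0.0f)
            d = -d;
        if (!(d <= tol))        /* written this way so NaN is rejected too */
            return 0;
    }
    return 1;
}

/* Demo: the first candidate is within 1e-4 of the reference, the second
 * is off by 1e-3 in its last element. Returns 10 (pass, then fail). */
int sketch_all_close_demo(void)
{
    const float ref[2]  = {0.498526f, 0.501474f};
    const float good[2] = {0.498530f, 0.501470f};
    const float bad[2]  = {0.498526f, 0.502474f};
    return sketch_all_close(ref, good, 2, 1e-4f) * 10
         + sketch_all_close(ref, bad,  2, 1e-4f);
}
```

Unlike a single checksum, this check cannot be fooled by errors of opposite sign that happen to cancel.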

Replacing dummy weights with real trained weights

`c_port/` and `static_port/` (float)

Export PyTorch weights to a flat binary (torch.save + a custom extraction script, or ONNX export + onnx Python package).
Replace the *_layer_create() calls in cnn_models_c.c with a loader that fills pre-allocated float arrays from the binary file.
For the static variant, pre-populate g_weight_pool[] at link time using a generated C header (weights_cnn_1d_v2.h) with trained values as a static const float array.
Verify correctness by comparing tensor_checksum against a Python reference forward pass on the same input values.

`deepbindi_cnn_x_heep/` (int32)

The quantized .tflite models in CH07_TFLite/saved_model/tflite/ are for a different architecture (see architecture mismatch note above). To load real weights into the CNN_1D_v2 C port:

Option A — PyTorch export (recommended): Load the trained PyTorch CNN_1D_v2 checkpoint, iterate over model.state_dict(), quantize each weight tensor to int8 range (multiply by a per-layer scale, round, clamp to ±127), and write a weights_cnn_1d_v2.h header with const int32_t arrays.
Option B — TFLite inspection only: Run extract_weights.py to inspect the TFLite model and understand its quantization parameters. These weights are not directly usable in the C port but are helpful for comparison and cross-validation.
```
python deepbindi_cnn_x_heep/extract_weights.py --inspect
python deepbindi_cnn_x_heep/extract_weights.py  # writes weights_tflite_1FC.h
```
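
The quantisation step of Option A is described above in Python terms; the per-weight arithmetic is sketched here in C for consistency with the rest of the port. The function name is hypothetical, and choosing the per-layer scale as 127 / max|w| is a common convention assumed for illustration, not something the repository prescribes.

```c
#include <stdint.h>

/* Symmetric quantisation of one weight to the int8 range, stored in an
 * int32_t. `scale` is the per-layer factor (assumed: 127 / max|w|). */
int32_t sketch_quantize_weight(float w, float scale)
{
    float scaled = w * scale;
    /* Clamp before converting so the cast can never overflow. */
    if (scaled >  127.0f) scaled =  127.0f;
    if (scaled < -127.0f) scaled = -127.0f;
    /* Round half away from zero without math.h. */
    return (int32_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}
```

The results are what would be written into the `const int32_t` arrays of weights_cnn_1d_v2.h.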
Once weights_cnn_1d_v2.h exists, in nn_runtime.c replace the seeded_value_int32() loops with reads from the const arrays:
```
/* in conv2d_layer_create: */
for (i = 0; i < weight_count; ++i)
    layer.weights[i] = cnn1d_conv1_weights[i];
```
The linker places const arrays in .rodata (flash), reducing SRAM from ~102 KB (see the static SRAM footprint table) to ~8 KB (activations only).

Pre-fold trained BN parameters into Q7 scale+offset:

scale_int  = np.round(gamma / np.sqrt(var + eps) * 128).astype(np.int32)
offset_int = np.round(beta - gamma * mean / np.sqrt(var + eps)).astype(np.int32)

Citation

If you use these models or this C port in your work, please cite the original paper:

@article{gutierrez2026deepbindi,
  author    = {Gutiérrez-Martín, Laura and López-Ongil, Celia and Miranda-Calero, Jose A.},
  title     = {{DeepBindi}: An End-to-End Fear Detection System Optimized for Extreme-Edge Deployment},
  journal   = {IEEE Journal of Biomedical and Health Informatics},
  volume    = {30},
  number    = {1},
  year      = {2026},
  doi       = {10.1109/JBHI.2025.3587961},
  note      = {Date of publication: 10 July 2025; date of current version: 8 January 2026}
}

Plain-text reference:

L. Gutiérrez-Martín, C. López-Ongil, and J. A. Miranda-Calero, "DeepBindi: An End-to-End Fear Detection System Optimized for Extreme-Edge Deployment," IEEE Journal of Biomedical and Health Informatics, vol. 30, no. 1, Jan. 2026, doi: 10.1109/JBHI.2025.3587961.

Context: The paper presents a fear-recognition system based on physiological signals (BVP, SKT, GSR) from the WEMAC dataset, achieving 80% F1-score and 74% accuracy. The system was validated on an ultra-low-power ARM Cortex-M4 (16 mW @ 5 V, 496 ms per inference). The C port in this repository implements the same model architectures to support deployment on similar extreme-edge targets.
