Skip to content

Handle addrList allocation failures safely in RF24Mesh master nodes#266

Merged
TMRh20 merged 7 commits into
masterfrom
copilot/harden-allocation-failure-handling
Jun 6, 2026
Merged

Handle addrList allocation failures safely in RF24Mesh master nodes#266
TMRh20 merged 7 commits into
masterfrom
copilot/harden-allocation-failure-handling

Conversation

Copilot AI commented May 23, 2026

Copy link
Copy Markdown
Contributor
  • Inspect PR feedback that explicitly mentions @copilot
  • Run clang-format-14 on the modified PR file (RF24Mesh.cpp)
  • Run clang-tidy-14 against RF24Mesh.cpp and review diagnostics
  • Commit any required formatting updates
Original prompt

Create a pull request in repository nRF24/RF24Mesh against base branch master that hardens allocation failure handling for the master node address list in RF24Mesh.cpp.

Scope:

  • Modify RF24Mesh.cpp.
  • In ESBMesh<network_t, radio_t>::begin(...), check the result of the initial malloc() used to allocate addrList.
  • If allocation fails, return false and do not set addrMemAllocated = true.
  • In ESBMesh<network_t, radio_t>::setAddress(...), replace the direct assignment of realloc() back to addrList with a temporary pointer.
  • If realloc() fails, return early without losing the original addrList pointer.
  • Add a defensive null check before writing the new entry.

Constraints:

  • Keep the change minimal and localized.
  • Preserve existing behavior during successful allocations.
  • Do not refactor unrelated code.
  • Follow the repository's existing coding style.

Rationale:

  • The current code assigns realloc() directly to addrList, which can lose the original allocation on failure and lead to a memory leak plus null dereference.
  • The initial malloc() in begin() is also unchecked.

Use this PR metadata:
Title: Handle addrList allocation failures safely in RF24Mesh master nodes

Body:

Summary

This PR hardens dynamic memory handling for the master node's addrList in RF24Mesh.

Specifically:

  • checks the initial malloc() in begin()
  • avoids overwriting addrList directly with realloc() in setAddress()
  • preserves the existing allocation if realloc() fails
  • avoids dereferencing a null addrList after allocation failure

Problem

RF24Mesh allocates the master node's address list dynamically:

  • initial allocation in begin()
  • growth allocation in setAddress()

Before this change, setAddress() assigned realloc() directly back to addrList:

addrList = (addrListStruct*)realloc(addrList, ...);

If realloc() fails, it returns NULL while the original allocation remains valid. Assigning that return value directly to addrList loses the only pointer to the original allocation, which creates:

  • a memory leak of the old buffer
  • a likely null dereference on the next access to addrList

Similarly, the initial malloc() in begin() was not checked before loadDHCP() and later use.

Changes

begin()

  • check the result of malloc()
  • return false if allocation fails
  • only set addrMemAllocated = true after successful allocation

setAddress()

  • use a temporary pointer for realloc()
  • return early if reallocation fails
  • keep the original addrList untouched on failure
  • add a defensive null check before writing the new entry

Why this is useful

On many embedded targets allocation failure may be rare, but this still improves correctness and makes failure behavior safer and more predictable, especially on:

  • Linux builds
  • constrained-memory systems
  • long-running master nodes with growing addrList state

Behavior impact

There should be no behavior change during normal operation.

On allocation failure:

  • begin() now fails cleanly instead of proceeding with an invalid addrList
  • setAddress() now leaves the existing address list intact instead of losing it

Files changed

  • RF24Mesh.cpp

Notes

This PR intentionally keeps the current allocation model and only hardens failure handling. It does not change the overall ownership/lifetime design of addrList.

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: Can you find any memory leaks in RF24Mesh?

Assistant: Yes — I found at least one real memory-management bug that can leak memory and also lose the only pointer to allocated storage.

Most likely leak: unchecked realloc() in setAddress().

if (addrListTop > 0 && addrListTop % MESH_MEM_ALLOC_SIZE == 0) {
    addrList = (addrListStruct*)realloc(addrList, (addrListTop + MESH_MEM_ALLOC_SIZE) * sizeof(addrListStruct));
}
addrList[addrListTop].address = address;
addrList[addrListTop++].nodeID = nodeID;

Why this is a bug

If realloc() fails, it returns NULL and the original pointer is not freed. But this code assigns the return value directly back to addrList, so on failure:

  • the original allocation becomes unreachable from the program
  • that memory is effectively leaked
  • the next line dereferences addrList, which can also crash

So this is both:

  1. a memory leak on allocation failure
  2. a null-pointer dereference risk

Safer pattern

Use a temporary pointer:

auto* newList = (addrListStruct*)realloc(addrList, newSize);
if (!newList) {
    // handle allocation failure
    return;
}
addrList = newList;

Another memory-lifetime concern

The master node allocates addrList with malloc() in begin():

</details>



<!-- START COPILOT CODING AGENT SUFFIX -->

*This pull request was created from Copilot chat.*
>

@2bndy5

2bndy5 commented May 23, 2026

Copy link
Copy Markdown
Member

Woah! Was that prompt generated? It is ridiculously long.

Copilot AI changed the title [WIP] Fix allocation handling for master node address list in RF24Mesh Handle addrList allocation failures safely in RF24Mesh master nodes May 23, 2026
Copilot AI requested a review from TMRh20 May 23, 2026 11:53
Copilot finished work on behalf of TMRh20 May 23, 2026 11:53
@TMRh20

TMRh20 commented May 23, 2026

Copy link
Copy Markdown
Member

Woah! Was that prompt generated? It is ridiculously long.

What is ridiculously long? This PR is all prompt generated.

@2bndy5

2bndy5 commented May 23, 2026

Copy link
Copy Markdown
Member

Click "Original prompt" in the description (at the bottom), and you'll see what I mean.

@TMRh20

TMRh20 commented May 23, 2026

Copy link
Copy Markdown
Member

lol, i see. Yeah, I just asked it to find memory leaks in the RF24Mesh repo lol.

@copilot run clang-tidy and clang-format on this pr

@2bndy5

2bndy5 commented May 23, 2026

Copy link
Copy Markdown
Member

I think Anthropic (& some exclusive friends) has been doing the same for a lot of package registries recently. I've seen a lot of CVE alerts about python, npm, and rust deps lately.

Copilot AI commented May 23, 2026

Copy link
Copy Markdown
Contributor Author

lol, i see. Yeah, I just asked it to find memory leaks in the RF24Mesh repo lol.

@copilot run clang-tidy and clang-format on this pr
...

Ran both locally on this PR.

  • clang-format-14: applied and committed in 67f0e6f.
  • clang-tidy-14 on RF24Mesh.cpp: executed with checks (bugprone-*, clang-analyzer-*); it reports one environment dependency error (RF24/RF24_config.h not found in this sandbox) plus existing warnings outside this change set.

Copilot finished work on behalf of TMRh20 May 23, 2026 12:03
@2bndy5

2bndy5 commented May 23, 2026

Copy link
Copy Markdown
Member

We may need to add an AGENTS.md file in these repos, so that they know how to setup the build env (including RF24* deps and their git HEAD refs).

@TMRh20

TMRh20 commented May 23, 2026

Copy link
Copy Markdown
Member

I have no idea how to do that :p

@2bndy5

2bndy5 commented May 23, 2026

Copy link
Copy Markdown
Member

It is basically like a CONTRIBUTING.md doc for any AI agents: totally free-form, no standard convention.

@github-actions

github-actions Bot commented May 23, 2026

Copy link
Copy Markdown
Contributor

Memory usage change @ 55a6ce3

Board Flash % RAM for global variables %
arduino:avr:nano 🔺 0 - +20 0 - +0.07 0 - 0 0 - 0
arduino:samd:mkrzero 🔺 +4 - +16 0 - +0.01 0 - 0 0 - 0
Click for full report per board

arduino:avr:nano

Sketch Flash % RAM for global variables %
examples/RF24Mesh_Example 0 0 0 0
examples/RF24Mesh_Example_Master_Statics +16 +0.05 0 0
examples/RF24Mesh_Example_Master_To_Nodes +20 +0.07 0 0
examples/RF24Mesh_Example_Node2Node 0 0 0 0
examples/RF24Mesh_Example_Node2NodeExtra 0 0 0 0
examples/RF24Mesh_SerialConfig +8 +0.03 0 0
examples/RF24Mesh_Example_Master +20 +0.07 0 0

arduino:samd:mkrzero

Sketch Flash % RAM for global variables %
examples/RF24Mesh_Example +4 0 0 0
examples/RF24Mesh_Example_Master_Statics +12 0 0 0
examples/RF24Mesh_Example_Master_To_Nodes +16 +0.01 0 0
examples/RF24Mesh_Example_Node2Node +4 0 0 0
examples/RF24Mesh_Example_Node2NodeExtra +4 0 0 0
examples/RF24Mesh_SerialConfig +4 0 0 0
examples/RF24Mesh_Example_Master +16 +0.01 0 0

@TMRh20

TMRh20 commented May 23, 2026

Copy link
Copy Markdown
Member

Would you want to tackle that, or are you leaving it up to me?

@2bndy5

2bndy5 commented May 23, 2026

Copy link
Copy Markdown
Member

I was just thinking about how I'd make copilot do it for all the RF24 repos at once... I think I can do it in a VS Code workspace with all the local RF24 repos added.

PS - I've been rewriting the arduino/compile-sketches action in rust. When I'm done, it should be a lot faster (parallel compilation, no need for a complicated CI workflow matrix per board, and stuff). The current arduino/compile-sketches action we're using hasn't been updated in couple years.

@2bndy5

2bndy5 commented May 23, 2026

Copy link
Copy Markdown
Member

Hmm, having an agent build RF24Ethernet is very different from other Linux-compatible RF24 repos. It would have to use arduino-cli, and I doubt we can use clang-tidy unless the arduino-cli fixed their compilation database support.

I think I'll skip that one for now. Are you planning on tasking AI to analyze RF24Ethernet?

@TMRh20

TMRh20 commented May 23, 2026

Copy link
Copy Markdown
Member

Are you planning on tasking AI to analyze RF24Ethernet?

Already done to find that previous memory leak, this time I used it just to confirm what I already suspected and found another one. Its just a matter of testing and figuring out lwIP process now without crashing the MCU. This whole memory leak stuff is what is killing me as far as a timeframe for addressing these other PRs,

@2bndy5

2bndy5 commented May 23, 2026

Copy link
Copy Markdown
Member

You heavily rely on calling begin() a lot too. That's not something I was considering in my past code contributions/reviews.

Almost done with the AGENTS.md and clang-tidy configs. I usually waste about 3 prompts before I throw my hands in the air and just say aloud, "Damn it, I'll just do it myself!"

@2bndy5

2bndy5 commented May 23, 2026

Copy link
Copy Markdown
Member

This whole memory leak stuff is what is killing me ...

Another reason why I F&#$ing ❤️ Rust (🦀).

@TMRh20

TMRh20 commented Jun 3, 2026

Copy link
Copy Markdown
Member

@copilot are you able to modify this PR to suit the changes in #248 that now exist in the master branch?

Copilot AI commented Jun 3, 2026

Copy link
Copy Markdown
Contributor Author

@copilot are you able to modify this PR to suit the changes in #248 that now exist in the master branch?

Updated this PR to match the #248 master changes. The allocation-failure handling is now applied in the new code path (setNodeID() + setAddress()), with the update in commit e9b0033.

Copilot finished work on behalf of TMRh20 June 3, 2026 13:31
@TMRh20 TMRh20 marked this pull request as ready for review June 4, 2026 07:30
@TMRh20 TMRh20 requested a review from 2bndy5 June 4, 2026 07:30

@2bndy5 2bndy5 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a good idea to check the result of (re)malloc calls, and only proceed normally when successfully allocated the requested mem block.

If this wasn't embedded C++, then I'd insist that an exception is thrown to propagate alloc errors to the caller. Unfortunately, throwing exceptions in embedded C++ isn't really a thing, at least not for most implementations (I think the ESP32 core actually allows exceptions).

I say this because if alloc fails, this patch will silently drop any incoming requests for newly assigned mesh addresses. As it stands now, this patch just prevents data loss of previously assigned addresses.

  1. Is there some kind of error handling elsewhere that I am missing? I think this is heavily relying on child nodes to interpret a master's response, be it silent, good, or somehow erroneous/unexpected.
  2. I'm guessing the child node's double-check of a newly assigned address would surface this alloc problem in an obscure way. Didn't we add a Error message type in the reserved system message types?

@TMRh20

TMRh20 commented Jun 4, 2026

Copy link
Copy Markdown
Member

I'm running this code on all of my Linux/Gateway nodes with no apparent issues so far.

I think this is heavily relying on child nodes to interpret a master's response, be it silent, good, or somehow erroneous/unexpected.

The child nodes should either see themselves de-allocated or allocated with the wrong address, meaning they should just continue retrying to renew their addresses. There should be nothing new to interpret.

I'm not sure how to test this fully, other than hard-coding it to fail once or twice, but in a working situation, there should be no real "unexpected" behaviour that I can see.

@2bndy5

2bndy5 commented Jun 4, 2026

Copy link
Copy Markdown
Member

I'm envisioning a DoS when mem is saturated on master node.

@TMRh20

TMRh20 commented Jun 4, 2026

Copy link
Copy Markdown
Member

What else? If memory is saturated, the saveDHCP and actual list updating is handled prior to re-allocation, so existing nodes would have no problems, just any additional nodes would experience a DOS condition.

@2bndy5

2bndy5 commented Jun 4, 2026

Copy link
Copy Markdown
Member

yeah, that's what I'm thinking too. I was just thinking it would be better to have a message saying "we're at capacity" to prospecting child nodes, instead of letting them clog the bandwidth trying to reconnect. I'm probably overthinking again.

@TMRh20

TMRh20 commented Jun 4, 2026

Copy link
Copy Markdown
Member

Haha, yeah I'd agree with that expectation for Linux devices, but for embedded systems I think that is probably overkill. There should typically be a timeout on address renewals, so it's not that bad, but yeah a large number of nodes trying to connect constantly is a burden & could even create a DOS for the whole system.

@TMRh20

TMRh20 commented Jun 6, 2026

Copy link
Copy Markdown
Member

Wondering if you approve this PR or have more concerns @2bndy5 ?

@2bndy5 2bndy5 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry. thought my last review was an approval

@TMRh20 TMRh20 merged commit 7dc5f18 into master Jun 6, 2026
57 checks passed
@TMRh20 TMRh20 deleted the copilot/harden-allocation-failure-handling branch June 6, 2026 11:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants