Skip to content

lib: implement hashset#470

Open
pzmarzly wants to merge 3 commits intofacebook:mainfrom
pzmarzly:push-mlqkpqspnupl
Open

lib: implement hashset#470
pzmarzly wants to merge 3 commits intofacebook:mainfrom
pzmarzly:push-mlqkpqspnupl

Conversation

@pzmarzly
Copy link
Copy Markdown
Contributor

@pzmarzly pzmarzly commented Mar 12, 2026

Currently, bf_set uses bf_list under the hood. This is suboptimal performance-wise. Here I am changing bf_set to be backed by bf_hashset.

  • lib: helper: add FNV-1a hash function - imported from another PR, a simple hash function.
  • lib: add bf_hashset - main commit. Implements a simple hashset (tombstones on deletion, grows 2x when load factor exceeds 50%, never shrinks, linear probing on collision). See https://en.wikipedia.org/wiki/Linear_probing (with "special flag value" strategy in "Deletion" section).
  • lib: set: use bf_hashset for elems field - changes bf_set's inner structure

See #460 for previous attempt.

Fixes #418

@pzmarzly pzmarzly requested a review from qdeslandes as a code owner March 12, 2026 15:00
@pzmarzly pzmarzly marked this pull request as draft March 12, 2026 15:00
@meta-cla meta-cla bot added the cla signed label Mar 12, 2026
@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 12, 2026

Claude review of PR #470 (c911d48)

Suggestions

  • Defensive assert for free_idxsrc/libbpfilter/core/hashset.c:284free_idx could remain SIZE_MAX if internal invariants are violated; add assert(free_idx < set->cap) before first use
  • Deduplicate _bf_round_next_power_of_2src/libbpfilter/core/hashset.c:26 — duplicated with cgen/program.c:66; the two copies differ in correctness (zero handling, 64-bit support)
  • _bf_hashset_resize allows new_cap == set->len — guard changed from < to <=
  • Document silent -EEXIST dedup in bf_set_add_elem — Doxygen now documents "if it is not already present"
  • Clarify bf_hashset_foreach delete-safetysrc/libbpfilter/include/bpfilter/core/hashset.h:107 — documentation should note that only the current element is safe to delete during iteration; deleting a different element could corrupt the pre-captured __next pointer

Nits

  • Transient dangling state in bf_hashset_cleansrc/libbpfilter/core/hashset.c:201set->head and slot pointers are stale between foreach frees and final cleanup; add a comment noting intentional ordering
  • Test harness accesses hashset internalstests/harness/test.c:262bft_set_eq_ordered reads elems.head directly despite API warning against field access
  • bf_set_add_many copies instead of transferring ownershipsrc/libbpfilter/set.c:498 — doubles per-element allocation vs old bf_list_node_take_data approach; note for future optimization
  • Document tombstone sentinel assumptionsrc/libbpfilter/core/hashset.c:14(bf_hashset_elem *)1 relies on address 0x1 never being a valid pointer; add a brief comment

CLAUDE.md improvements

  • The __next variable in bf_hashset_foreach (and existing bf_list_foreach/bf_list_foreach_rev) uses a double-underscore prefix reserved by the C standard (ISO C 7.1.3). The PR author has already flagged this as an open thread. Consider a project-wide convention change (e.g., _bf_* prefix) or documenting the intentional use of GCC extensions in CLAUDE.md

Workflow run

@pzmarzly pzmarzly force-pushed the push-mlqkpqspnupl branch from 314af18 to 02938a3 Compare March 12, 2026 23:56
Copy link
Copy Markdown
Contributor

@qdeslandes qdeslandes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, first pass of review and a few things to fix. I'll do a second, deeper pass, when those are solved. That being said, it's a very welcome addition! :D

@qdeslandes qdeslandes mentioned this pull request Mar 13, 2026
@pzmarzly pzmarzly force-pushed the push-mlqkpqspnupl branch from 02938a3 to 1aae907 Compare March 13, 2026 15:07
@pzmarzly pzmarzly changed the title lib: create data_structures directory, implement hashset data structure lib: create directory for data structures, implement hashset data structure Mar 13, 2026
@pzmarzly pzmarzly changed the title lib: create directory for data structures, implement hashset data structure lib: create directory for data structures, implement hashset Mar 13, 2026
@pzmarzly pzmarzly force-pushed the push-mlqkpqspnupl branch from 1aae907 to 550c254 Compare March 13, 2026 15:46
@pzmarzly pzmarzly force-pushed the push-mlqkpqspnupl branch from 550c254 to d6883e1 Compare March 16, 2026 22:13
@pzmarzly pzmarzly force-pushed the push-mlqkpqspnupl branch from 2111d2d to 8f94d51 Compare March 31, 2026 16:12
@pzmarzly pzmarzly requested a review from yaakov-stein March 31, 2026 16:12
@pzmarzly
Copy link
Copy Markdown
Contributor Author

Ready for review. Remaining open threads:

  1. The code here is not proper C17, and uses GCC extensions. Many other places in the codebase do, actually, and pedantic_c test doesn't check macros. Wanna make a follow up? Or give up on pure C17? lib: implement hashset #470 (comment)
  2. Would a bf_hashset_take function make sense? What should it return? lib: implement hashset #470 (comment)
  3. If someone calls bf_hashset_delete and then tries to use the deleted data, what should happen? Same problem with bf_list_delete, by the way. FWIW I think Claude is overly pedantic. lib: implement hashset #470 (comment)
  4. The structure is optimized for foreach speed, not for random access. lib: implement hashset #470 (comment)

/**
* @brief Iterate over all elements in a hashset in insertion order.
*
* Unsafe to add elements during iteration (can cause rehashing).
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: nit: bf_hashset_foreach doc should clarify dangling pointer after bf_hashset_delete

The doc says "Safe to bf_hashset_delete the current element during iteration" but does not mention that elem_var becomes a dangling pointer after the delete call and must not be accessed in the same iteration body. Consider:

Safe to call `bf_hashset_delete()` on the current element during iteration.
Do not access `elem_var` after the delete call.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's obvious?

*/
typedef struct bf_hashset_elem
{
/// User-provided data pointer.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: nit: bf_hashset_elem.data accessor pattern contradicts @warning

The @warning on bf_hashset_elem (line 44) says "the inside of this structure should not be directly accessed", yet the bf_hashset_foreach macro doc (line 107) instructs users to "Access the stored data via elem_var->data". This is the intended access pattern (there's no getter function like bf_list_node_get_data). Either relax the @warning to exempt data, or add a bf_hashset_elem_get_data() inline accessor for consistency with the list API.

Copy link
Copy Markdown
Contributor Author

@pzmarzly pzmarzly Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same pattern in bf_list_foreach. Macros need access to struct internals.

for (; n0 || n1; n0 = bf_list_node_next(n0), n1 = bf_list_node_next(n1)) {
if (0 != memcmp(bf_list_node_get_data(n0), bf_list_node_get_data(n1),
lhs->elem_size))
for (bf_hashset_elem *n0 = lhs->elems.head, *n1 = rhs->elems.head; n0 || n1;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: nit: bft_set_eq_ordered directly accesses bf_hashset struct internals

lhs->elems.head and n0->next bypass the public API, which the @warning on bf_hashset discourages. This could be rewritten using bf_hashset_foreach:

bf_hashset_elem *n1 = rhs->elems.head; // or use a second manual iteration
bf_hashset_foreach (&lhs->elems, n0) {
    if (!n1 || memcmp(n0->data, n1->data, lhs->elem_size) != 0)
        return false;
    n1 = n1->next;
}

Alternatively, since this is test harness code, a comment noting the intentional internal access would suffice.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test code, should be fine.

@pzmarzly pzmarzly force-pushed the push-mlqkpqspnupl branch 2 times, most recently from fd7b629 to 3fd44fb Compare March 31, 2026 17:08
@pzmarzly pzmarzly force-pushed the push-mlqkpqspnupl branch from 3fd44fb to c911d48 Compare April 1, 2026 09:29
// Take ownership of data to stop to_add cleanup from freeing it.
bf_list_node_take_data(elem_node);
}
bf_hashset_foreach (&(*to_add)->elems, elem) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: nit: The old list-based code used bf_list_node_take_data to transfer ownership from to_add to dest without copying. The new implementation calls bf_set_add_elem which does malloc + memcpy for each element. This doubles per-element allocations for large sets. Worth noting as a potential future optimization (e.g., a bf_hashset_take that moves an element between sets without reallocation).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point, something like bf_hashset_take_node and bf_hashset_add_node` (that rewires existing bf_hashset_elem*) would work, though I'm not sure if it's worth it?

In my intuition, a free(<bf_hashset_elem*>) followed by malloc(sizeof(bf_hashset_elem)) should be pretty cheap, since we have a perfect gap of unallocated memory to reuse.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Use hashset data structure in bf_set

3 participants