Merged
34 commits
4c980c1
docs: add repository agent guide
Str-Gen Mar 25, 2026
47960a8
build: support non-linux userspace development
Str-Gen Mar 25, 2026
b3eed7a
test: restore focused flow coverage
Str-Gen Mar 25, 2026
c161f86
chore: clean workspace and ebpf build warnings
Str-Gen Mar 25, 2026
7d5c098
docs: add concise protocol coverage guide (#88)
Str-Gen Mar 25, 2026
ec87d27
test: restore current exporter coverage (#78)
Str-Gen Mar 25, 2026
a292df4
fix: parse IPv6 extension headers in pcap path (#86)
Str-Gen Mar 25, 2026
05df785
test: add feature module invariants
Str-Gen Mar 25, 2026
e3fc34a
test: add tiny pcap regression fixture
Str-Gen Mar 25, 2026
68ec270
test: add concap udp fixture coverage
Str-Gen Mar 25, 2026
088dbac
test: add concap smoke helper
Str-Gen Mar 25, 2026
5460a7b
docs: add agent priorities and performance roadmap
Str-Gen Mar 25, 2026
3867f4e
realtime: preserve kernel capture timestamps
Str-Gen Mar 25, 2026
9910faf
realtime: align ebpf packet length fields
Str-Gen Mar 25, 2026
278d564
docs: note deferred stress testing options
Str-Gen Mar 25, 2026
b787c86
features: preserve sub-millisecond timing precision
Str-Gen Mar 25, 2026
1c5cde2
docs: condense agent workplan
Str-Gen Mar 25, 2026
6673889
features: refine retransmission tracking
Str-Gen Mar 25, 2026
51f37f5
features: tighten active and subflow thresholds
Str-Gen Mar 25, 2026
e5ee6dd
features: expand icmp behavior tracking
Str-Gen Mar 25, 2026
c7c092c
features: expose tcp lifecycle quality
Str-Gen Mar 25, 2026
d95f4df
test: harden parser and lifecycle edge cases
Str-Gen Mar 25, 2026
8eb37b2
fix: preserve flow table termination causes
Str-Gen Mar 25, 2026
04b7904
test: harden tcp lifecycle and flow integration cases
Str-Gen Mar 25, 2026
a7e46b2
fix: reject non-first ipv4 fragments offline
Str-Gen Mar 25, 2026
63d6c89
docs: refresh agent engineering checklist
Str-Gen Mar 25, 2026
30b29f1
test: remove stale unwired cic test
Str-Gen Mar 25, 2026
cc78a45
feat: add tcp quality signals to rustiflow
Str-Gen Mar 25, 2026
153fd9b
feat: export ip version in nf flow
Str-Gen Mar 25, 2026
4532465
perf: use typed flow keys in hot paths
Str-Gen Mar 25, 2026
65ce1ae
perf: use welford variance in feature stats
Str-Gen Mar 25, 2026
11d5a6f
perf: bypass packet graph state in headless mode
Str-Gen Mar 25, 2026
c4cd821
feat: add rustiflow ip scope and path locality
Str-Gen Mar 25, 2026
eb9bdc0
perf: keep flow table updates in place
Str-Gen Mar 25, 2026
140 changes: 140 additions & 0 deletions AGENTS.md
@@ -0,0 +1,140 @@
# RustiFlow Agent Guide

This repository is a Rust workspace for a network flow extractor. The main crates are:

- `rustiflow`: user-space CLI, pcap reader, realtime capture, flow extraction, CSV/TUI output
- `common`: shared packet/event structs used by user space and eBPF programs
- `xtask`: helper commands for building and running the project
- `ebpf-ipv4` / `ebpf-ipv6`: Linux eBPF programs used for realtime capture

## Remote Machine Guardrails

- Remote Linux machines reachable over SSH may be used only for this RustiFlow project.
- On those machines, run only RustiFlow-related commands, builds, checks, and tests.
- Do not use those machines for unrelated exploration, installs, experiments, or general development tasks.
- If remote work needs a dedicated workspace or directory, ask the user to create/provide it first.
- If any software or dependency needs to be installed on those machines, ask the user to do it.
- If there is any uncertainty about whether a command is appropriate to run on those machines, ask the user before running it.

## Non-Negotiables

- When writing or editing Rust in this repository, always apply the Rust guidance in this file first. Treat it as an active coding standard, not optional reading.
- Prefer changes that are small, local, and easy to review. Avoid broad opportunistic refactors unless the task specifically calls for them.
- Preserve the existing human-made structure of the codebase where possible. Fit new work into current boundaries before creating new ones.

## Commit Hygiene

- Keep commits clean, bounded, and purpose-specific.
- Prefer one logical change per commit. Do not mix unrelated fixes, refactors, docs updates, and test rewrites unless they are tightly coupled.
- When work spans multiple concerns, split it into a short chain of commits with readable messages.
- Before committing, check that the diff matches the stated purpose of the commit and does not include unrelated workspace noise.
- If a change is exploratory or lower confidence, prefer using a separate branch until it is trusted.

## Working Principles

- Prefer small, targeted changes over broad rewrites.
- Keep flow logic modular. Shared measurement logic belongs in `rustiflow/src/flows/features/`; exporter-specific schema logic belongs in the relevant flow type.
- Preserve output compatibility unless a schema change is intentional and documented.
- When changing CLI behavior, config structure, or CSV headers, update the README and any related examples.
- When using `format!`, inline variables into `{}` when possible.
- Prefer exhaustive `match` statements when practical; avoid wildcard arms that hide protocol or feature cases.
- Avoid bool-heavy APIs that create unclear call sites. Prefer enums or named methods when that improves clarity.
- Prefer comparing whole values in tests instead of asserting many individual fields when feasible.
- Do not add one-off helper functions that are only used once unless they make a complex block substantially clearer.
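A minimal sketch of three of the principles above (inlined `format!` arguments, exhaustive `match`, and an enum instead of a `bool` parameter). The `Direction` and `Transport` types and both functions are hypothetical names chosen for illustration; they are not taken from the RustiFlow codebase.

```rust
/// Hypothetical direction type, used instead of a `bool` such as `is_forward`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Direction {
    Forward,
    Backward,
}

/// Hypothetical transport classification, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Transport {
    Tcp,
    Udp,
    Icmp,
    Icmpv6,
}

/// Exhaustive `match` with no wildcard arm: adding a new `Transport`
/// variant makes this function fail to compile until it is handled.
pub fn has_ports(transport: Transport) -> bool {
    match transport {
        Transport::Tcp | Transport::Udp => true,
        Transport::Icmp | Transport::Icmpv6 => false,
    }
}

/// The call site reads `describe(Direction::Forward, ...)` rather than
/// `describe(true, ...)`, and the `format!` arguments are inlined.
pub fn describe(direction: Direction, transport: Transport, packets: u64) -> String {
    let arrow = match direction {
        Direction::Forward => "->",
        Direction::Backward => "<-",
    };
    format!("{arrow} {transport:?} packets={packets}")
}
```

With this shape, a call such as `describe(Direction::Backward, Transport::Udp, 7)` documents itself at the call site, which is the point of the bool-avoidance rule.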

## Rust Style

- Follow `rustfmt` and Clippy guidance.
- Collapse nested `if` statements when it improves readability.
- Inline `format!` arguments when possible.
- Use method references instead of trivial closures when that is clearer.
- Keep modules from growing unnecessarily large. Prefer extracting a focused submodule instead of adding more unrelated logic to an already large file.

## RustiFlow-Specific Guidance

- Treat the offline pcap path and the realtime eBPF path as two distinct ingestion modes that should stay semantically aligned.
- Be careful with timing-related features. Realtime and offline timestamp sources differ, so changes to timing, IAT, active/idle, or expiration logic should be validated deliberately.
- Be careful with packet length semantics. Realtime and offline paths may observe slightly different length fields.
- `BasicFlow` owns flow lifecycle and termination behavior. Do not duplicate expiration or TCP teardown logic in higher-level flow types unless there is a strong reason.
- If you add a new feature family, first decide whether it belongs in:
  - a reusable `FlowFeature` implementation, or
  - one exporter only
- If you change contamination-free exports, keep in mind that these outputs intentionally avoid raw identifiers such as exact ports/IPs.

## Platform Notes

- Realtime eBPF support is Linux-specific.
- macOS may be usable for some read-only work, formatting, and limited code inspection, but Linux is the source of truth for full build and runtime validation.
- Do not assume that successful macOS builds imply realtime correctness.
- When touching `aya`/eBPF/realtime code, prefer validating on Linux or in a Linux container/VM.

## Commands

Use the smallest command that gives confidence:

- Format:
  - `cargo fmt`
- Check the main crate:
  - `cargo check -p rustiflow`
- Run Rust tests for the main crate:
  - `cargo test -p rustiflow`
- Build eBPF programs:
  - `cargo xtask ebpf-ipv4`
  - `cargo xtask ebpf-ipv6`
- Run in dev mode:
  - `cargo xtask run -- [OPTIONS] <COMMAND>`

If a change touches shared code used by multiple crates, prefer checking the workspace as needed.

## Validation Expectations

- After Rust code changes, run `cargo fmt`.
- Run the narrowest relevant check/test command for the code you changed.
- If you change dependencies, run at least `cargo check` again after the dependency update.
- If you change CSV headers, config behavior, or user-facing commands, verify the corresponding documentation and examples.

## Notes On Existing Tests

- Treat the current test suite carefully: some tests may be stale or incomplete relative to the active code.
- When adding or repairing tests, prefer tests that reflect the current flow architecture and public behavior rather than resurrecting outdated internal field expectations.
- Before adding more feature work, prefer adversarial deterministic tests around TCP lifecycle, parser edge cases, and tiny offline fixtures that prove exported semantics.

## Engineering Checklist

Keep this section short and current. Completed work and decision history belong
in `docs/engineering-notes.md`.

### Current Focus

- [ ] Stabilize and measure before expanding the eBPF event payload further.
- [x] Finish the remaining TCP quality signals that current metadata already supports:
duplicate ACKs, zero-window events, and close style.
- [x] Add the next IP and path signals once they can be trusted in both offline
and realtime modes.

Primary files:

- `rustiflow/src/packet_features.rs`
- `rustiflow/src/pcap.rs`
- `rustiflow/src/realtime.rs`
- `common/src/lib.rs`
- `ebpf-ipv4/src/main.rs`
- `ebpf-ipv6/src/main.rs`
- `rustiflow/src/flows/basic_flow.rs`
- `rustiflow/src/flows/features/`

### Later Work

- [ ] Optional lightweight application-aware metadata: DNS, TLS, HTTP, QUIC.
- [ ] Better contamination-free abstractions than only coarse IANA port buckets.
- [ ] Fill remaining `nf_flow` gaps such as `vlan_id` and `tunnel_id` once
packet metadata exists in both ingestion modes.

### Working Rule

Before adding a new feature, ask:

- Is the underlying packet metadata trustworthy in both offline and realtime modes?
- Does this improve diagnostics more than refining an existing weak feature?
- Can it live in a reusable `FlowFeature`?
- Can it be tested with a tiny deterministic fixture?
3 changes: 2 additions & 1 deletion Cargo.toml
@@ -1,2 +1,3 @@
 [workspace]
-members = ["rustiflow", "common", "xtask"]
+members = ["rustiflow", "common", "xtask"]
+resolver = "2"
39 changes: 36 additions & 3 deletions README.md
@@ -19,11 +19,30 @@ This tool is engineered for robust and efficient feature extraction, particularl
- **Versatile Feature Sets:** Offers a variety of pre-defined feature sets (flows) and the flexibility to create custom feature sets tailored to specific requirements. An example of the custom flow is shown [here](https://github.com/idlab-discover/RustiFlow/blob/main/rustiflow/src/flows/custom_flow.rs).
- **Pcap File Support:** Facilitates packet analysis from pcap files, compatible with both Linux and Windows generated files.
- **Diverse Output Options:** Features can be outputted to the console, a CSV file, or other formats with minimal effort.
- **Richer TCP Quality Signals:** The RustiFlow feature set exports duplicate ACK counts, zero-window observations, and TCP close style in addition to the existing lifecycle and retransmission fields.
- **Endpoint-Aware IP Context:** The RustiFlow feature set exports `ip_version`, endpoint IP scope, and coarse `path_locality` derived from normalized addresses without expanding the eBPF event payload.
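As a rough illustration of how an endpoint IP scope can be derived from an address alone, the sketch below classifies a `std::net::IpAddr` into coarse buckets. The `IpScope` enum, its variant names, and the `ip_scope` function are assumptions made for this example; the actual RustiFlow field values and classification rules may differ. IPv4-mapped IPv6 addresses are not unwrapped here, on the assumption that addresses have already been normalized.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// Hypothetical coarse scope buckets; `Other` is a catch-all, not a claim
/// that the address is globally routable.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IpScope {
    Loopback,
    LinkLocal,
    Private,
    Shared,
    Multicast,
    Other,
}

fn ipv4_scope(addr: Ipv4Addr) -> IpScope {
    let octets = addr.octets();
    if addr.is_loopback() {
        IpScope::Loopback
    } else if addr.is_link_local() {
        IpScope::LinkLocal
    } else if addr.is_private() {
        IpScope::Private
    } else if octets[0] == 100 && (octets[1] & 0xc0) == 0x40 {
        // 100.64.0.0/10, the shared address space used for carrier-grade NAT.
        IpScope::Shared
    } else if addr.is_multicast() {
        IpScope::Multicast
    } else {
        IpScope::Other
    }
}

fn ipv6_scope(addr: Ipv6Addr) -> IpScope {
    let first = addr.segments()[0];
    if addr.is_loopback() {
        IpScope::Loopback
    } else if (first & 0xffc0) == 0xfe80 {
        // fe80::/10, link-local unicast.
        IpScope::LinkLocal
    } else if (first & 0xfe00) == 0xfc00 {
        // fc00::/7, unique local addresses, treated here as "private".
        IpScope::Private
    } else if addr.is_multicast() {
        IpScope::Multicast
    } else {
        IpScope::Other
    }
}

/// Classifies an already-normalized address into a coarse scope.
pub fn ip_scope(addr: IpAddr) -> IpScope {
    match addr {
        IpAddr::V4(v4) => ipv4_scope(v4),
        IpAddr::V6(v6) => ipv6_scope(v6),
    }
}
```

Because the classification depends only on the address, it needs no extra fields in the eBPF event payload.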

## Feature sets

See the [wiki](https://github.com/idlab-discover/RustiFlow/wiki) for the different feature sets available.

## Supported Packet/Header Coverage

RustiFlow currently extracts flows from the following protocol/header combinations:

| Layer | Offline pcap | Realtime eBPF |
| --- | --- | --- |
| Link | Ethernet, Linux cooked capture, 802.1Q VLAN | Ethernet |
| Network | IPv4, IPv6 | IPv4, IPv6 |
| IPv6 extras | Extension headers supported before transport parsing | Extension headers supported before transport parsing |
| Transport | TCP, UDP, ICMP, ICMPv6 | TCP, UDP, ICMP, ICMPv6 |

Notes:

- Realtime support is Linux-only.
- Offline and realtime aim to expose the same flow semantics, but timestamp and packet-length sources can differ slightly.
- Realtime VLAN parsing is not implemented yet.

## <img src="figures/RustiFlow_nobg.png" width="60px"/> Architecture

### Realtime processing
@@ -200,10 +219,10 @@ Options:

 Possible values:
 - basic: A basic flow that stores the basic features of a flow
-- cic: Represents the CIC Flow, giving 83 features
+- cic: Represents the CIC Flow, giving 90 features
 - cidds: Represents the CIDDS Flow, giving 10 features
-- nfstream: Represents a nfstream inspired flow, giving 69 features
-- rustiflow: Represents the Rusti Flow, giving 120 features
+- nfstream: Represents a nfstream inspired flow, giving 71 features
+- rustiflow: Represents the Rusti Flow, giving 203 features
 - custom: Represents a flow that you can implement yourself

 --active-timeout <ACTIVE_TIMEOUT>
@@ -259,6 +278,20 @@ Options:
RUST_LOG=info cargo xtask run --
```

Run the focused Rust test suite with:

```bash
cargo test -p rustiflow
```

If you also have the sibling `concap` repository and a reachable Kubernetes cluster,
you can run a tiny ConCap-backed smoke check and then reprocess the downloaded pcap
with your current local RustiFlow checkout:

```bash
./scripts/concap_smoke.sh ../concap nmap-tcp-syn-version.yaml
```

### Binary

```bash
12 changes: 9 additions & 3 deletions common/src/lib.rs
@@ -2,10 +2,11 @@

 pub use network_types::{icmp::IcmpHdr, tcp::TcpHdr, udp::UdpHdr};

-/// BasicFeaturesIpv4 is a struct collection all ipv4 traffic data and is 32 bytes in size.
+/// BasicFeaturesIpv4 is a struct collection all ipv4 traffic data.
 #[repr(C, packed)]
 #[derive(Copy, Clone)]
 pub struct EbpfEventIpv4 {
+    pub timestamp_ns: u64,
     pub ipv4_destination: u32,
     pub ipv4_source: u32,
     pub port_destination: u16,
@@ -25,6 +26,7 @@ pub struct EbpfEventIpv4 {

 impl EbpfEventIpv4 {
     pub fn new(
+        timestamp_ns: u64,
         ipv4_destination: u32,
         ipv4_source: u32,
         port_destination: u16,
@@ -41,6 +43,7 @@ impl EbpfEventIpv4 {
         icmp_code: u8,
     ) -> Self {
         EbpfEventIpv4 {
+            timestamp_ns,
             ipv4_destination,
             ipv4_source,
             port_destination,
@@ -63,10 +66,11 @@ impl EbpfEventIpv4 {
 #[cfg(feature = "user")]
 unsafe impl aya::Pod for EbpfEventIpv4 {}

-/// BasicFeaturesIpv6 is a struct collection all ipv6 traffic data and is 64 bytes in size.
+/// BasicFeaturesIpv6 is a struct collection all ipv6 traffic data.
 #[repr(C, packed)]
 #[derive(Clone, Copy)]
 pub struct EbpfEventIpv6 {
+    pub timestamp_ns: u64,
     pub ipv6_destination: u128,
     pub ipv6_source: u128,
     pub port_destination: u16,
@@ -86,6 +90,7 @@ pub struct EbpfEventIpv6 {

 impl EbpfEventIpv6 {
     pub fn new(
+        timestamp_ns: u64,
         ipv6_destination: u128,
         ipv6_source: u128,
         port_destination: u16,
@@ -102,6 +107,7 @@ impl EbpfEventIpv6 {
         icmp_code: u8,
     ) -> Self {
         EbpfEventIpv6 {
+            timestamp_ns,
             ipv6_destination,
             ipv6_source,
             port_destination,
@@ -157,7 +163,7 @@ impl NetworkHeader for TcpHdr {
             | ((self.cwr() as u8) << 7)
     }
     fn header_length(&self) -> u8 {
-        TcpHdr::LEN as u8
+        (self.doff() * 4) as u8
     }
     fn sequence_number(&self) -> u32 {
         self.seq
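The `header_length` change above replaces the fixed 20-byte `TcpHdr::LEN` with a value derived from the TCP data-offset field, so headers carrying options are measured correctly. The sketch below shows the same arithmetic on a raw byte slice, without the `network_types` crate. The function name `tcp_header_len` and the decision to reject offsets below 5 words are assumptions for this example, not behavior confirmed by the diff.

```rust
/// Returns the TCP header length in bytes, read from the data-offset field
/// (upper 4 bits of byte 12), or `None` if the slice is too short or the
/// offset is below the 20-byte minimum header.
///
/// Hypothetical helper for illustration; not part of the RustiFlow API.
pub fn tcp_header_len(tcp: &[u8]) -> Option<usize> {
    const MIN_TCP_HEADER_LEN: usize = 20;

    let offset_byte = *tcp.get(12)?;
    // The data offset counts 32-bit words, so multiply by 4 to get bytes.
    let data_offset_words = usize::from(offset_byte >> 4);
    let len = data_offset_words * 4;

    (len >= MIN_TCP_HEADER_LEN).then_some(len)
}
```

The data offset is a 4-bit field, so the largest possible result is 15 × 4 = 60 bytes, which is why a `u8` return type is wide enough in the diff.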
80 changes: 80 additions & 0 deletions docs/engineering-notes.md
@@ -0,0 +1,80 @@
# Engineering Notes

This file keeps short-lived design choices and execution notes that would make
`AGENTS.md` too long.

## 2026-03-25

- Use branch `codex/ingestion-semantics-foundation` for the AGENTS-driven
improvement track instead of landing exploratory changes directly on `main`.
- Prefer performance-aware correctness for ingestion work so foundational
metadata changes do not need to be redone later.
- Realtime packet events now carry kernel monotonic timestamps and aligned
packet/header/payload length semantics. Stabilize and measure before adding
more event fields.
- Timing and IAT features now preserve sub-millisecond precision internally.
- Retransmission work should stay bounded: fix non-TCP false positives, move
beyond exact duplicate sequence numbers, and leave richer TCP quality signals
such as duplicate ACKs and handshake analysis for later checklist items.
- Retransmission stats now stay TCP-only and count overlap in TCP sequence
space, including SYN and FIN sequence-number use, instead of only exact
duplicate sequence numbers.
- Active/idle tracking now compares thresholds in microseconds before converting
to exported millisecond values, and subflow counting now represents actual
subflows instead of only counting gap boundaries after the first packet.
- ICMP stats now keep the original first seen type and code, but also track
echo request and reply counts plus error and destination-unreachable counts
across ICMPv4 and ICMPv6 traffic.
- TCP lifecycle export now distinguishes observed handshake completion from
resets seen before or after that observed handshake, so richer flow schemas
do not have to infer lifecycle quality from flag totals alone.
- RustiFlow export now includes duplicate ACK counts, zero-window
observations, and `tcp_close_style`. Duplicate ACKs currently mean repeated
pure ACKs with the same ACK number and advertised window; zero-window events
count TCP packets advertising a zero receive window; close style stays rooted
in `BasicFlow` lifecycle state so timeout/reset/FIN semantics are not
reimplemented in exporter code.
- `nf_flow` now exports `ip_version` without expanding the eBPF event payload.
The value is derived from the normalized `IpAddr` already shared by offline
and realtime ingestion, and fixture-backed tests lock down the IPv4 path
while direct flow construction locks down the IPv6 path.
- Internal sharding and flow-table lookup now use typed `FlowKey` values
instead of rebuilding formatted strings on the hot path. String flow ids are
still created when a new flow is instantiated for export compatibility.
- `FeatureStats` now keeps running variance state (`m2`) and derives standard
deviation on demand instead of updating `std` itself on every packet.
Dedicated tests now lock down population-std semantics, order invariance, and
merged directional variance behavior.
- Realtime packet-graph mode is now explicit and testable. When the graph is
disabled, RustiFlow no longer constructs the packet-count watch channel or
mutex-protected counter state, so high-throughput runs skip that observability
plumbing entirely instead of merely branching around it in the loop body.
- RustiFlow now exports `ip_version`, `source_ip_scope`,
`destination_ip_scope`, and `path_locality` derived from the normalized
`IpAddr` endpoints already shared by offline and realtime ingestion. The
adversarial test matrix covers private/shared/link-local/loopback/multicast
cases across IPv4 and IPv6 so these coarse path signals do not depend on
extra kernel event fields.
- `FlowTable` now keeps the ordinary existing-flow update path in place instead
of removing and reinserting the map entry on every packet. Table-level tests
now lock down two semantics that matter for that optimization: replacing an
expired flow with a fresh flow on the same key, and early export that keeps
the live flow resident for later final export.
- Current test-hardening focus is to add adversarial deterministic cases before
more feature work: false handshake completion, teardown edge cases, parser
rejection behavior, and tiny fixture assertions that prove exported
lifecycle semantics.
- Test hardening already exposed two parser quirks worth locking down: short
unsupported offline frames must not panic the reader, and non-first IPv6
fragments should be dropped instead of being treated like fresh transport
headers.
- Test hardening also exposed a real `FlowTable` lifecycle bug: packet-driven
termination export could overwrite `TcpReset` with `TcpTermination`, and a
first-packet-terminated flow could be left behind for duplicate export.
- The next adversarial test layer should prefer integrated semantics over raw
test count: simultaneous close teardown, contiguous-versus-overlapping TCP
segments, and wrapper-level feature coordination in `RustiFlow`.
- That test layer exposed another real parser bug: offline IPv4 parsing was
treating non-first IPv4 fragments as if they started with a fresh transport
header. Non-first IPv4 fragments should now be dropped while first fragments
still parse their transport header normally.
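The `FeatureStats` note above (running `m2` state, standard deviation derived on demand, population-std semantics, merged directional variance) matches the shape of Welford's online algorithm. The following is a minimal sketch of that technique under stated assumptions: the `RunningStats` type, its fields, and its method names are hypothetical and are not the repository's `FeatureStats` implementation.

```rust
/// Hypothetical running statistics accumulator using Welford's algorithm.
#[derive(Debug, Clone, Copy, Default)]
pub struct RunningStats {
    count: u64,
    mean: f64,
    /// Sum of squared deviations from the running mean.
    m2: f64,
}

impl RunningStats {
    /// Folds one sample into the running mean and `m2`.
    pub fn add(&mut self, value: f64) {
        self.count += 1;
        let delta = value - self.mean;
        self.mean += delta / self.count as f64;
        // Uses the updated mean, which is what keeps the update stable.
        self.m2 += delta * (value - self.mean);
    }

    pub fn count(&self) -> u64 {
        self.count
    }

    pub fn mean(&self) -> f64 {
        self.mean
    }

    /// Population standard deviation (divides by `n`, not `n - 1`),
    /// computed only when requested.
    pub fn population_std(&self) -> f64 {
        if self.count == 0 {
            return 0.0;
        }
        (self.m2 / self.count as f64).sqrt()
    }

    /// Combines two accumulators (for example forward and backward
    /// directions) without revisiting the samples.
    pub fn merge(&self, other: &Self) -> Self {
        if self.count == 0 {
            return *other;
        }
        if other.count == 0 {
            return *self;
        }
        let count = self.count + other.count;
        let delta = other.mean - self.mean;
        let mean = self.mean + delta * other.count as f64 / count as f64;
        let m2 = self.m2
            + other.m2
            + delta * delta * self.count as f64 * other.count as f64 / count as f64;
        Self { count, mean, m2 }
    }
}
```

Keeping only `count`, `mean`, and `m2` means the per-packet update avoids a square root; the square root is paid once at export time. Results are order-independent only up to floating-point rounding, which is why tests for this kind of code usually compare with a tolerance.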