Multipass performance optimizations (~4.3x speedup) by schulzchristian · Pull Request #4 · KaHIP/FREIGHT

schulzchristian · 2026-04-02T13:52:58Z

Review Round 2 changes: verified first-pass connectivity shortcut (k<=64, unweighted guard), OOM guard for bitset, fstat/offset validation in mmap, PGO mutual exclusivity
Fix integer underflow in mmap fallback: guard remaining with (end_off > offset) ? (end_off - offset) : 0 (both hypergraph and graph)
Fix invalid seekg(cur) in mmap fallback: guard entire fallback read with if (cur >= 0) (both variants)
Add in.good() check after seekg(cur) before read() (both variants)
Both builds compile cleanly; CodeQL scan: 0 alerts

codacy-production · 2026-04-02T13:55:31Z

Not up to standards ⛔

🔴 Issues 2 critical · 13 medium

Alerts:
⚠ 15 issues (≤ 0 issues of at least minor severity)

Results:
15 new issues

Category Results

UnusedCode 1 medium

BestPractice 3 medium

ErrorProne 9 medium

Security 2 critical

View in Codacy

🟢 Metrics 154 complexity · -7 duplication

Metric Results

Complexity 154

Duplication -7

View in Codacy

_{TIP This summary will be updated as you push new changes. Give us feedback}

Agent-Logs-Url: https://github.com/KaHIP/FREIGHT/sessions/70ec7d09-80a2-4157-ba10-86a0c3d6872c Co-authored-by: schulzchristian <5701578+schulzchristian@users.noreply.github.com>

Copilot

Pull request overview

This PR targets major speedups for multi-pass streaming partitioning (especially --ram_stream with --num_streams_passes > 1) by caching inputs across passes, reducing per-pass I/O, and optimizing evaluation and inner-loop overhead for both hypergraph (freight_cut / freight_con) and graph (freight_graphs) paths.

Changes:

Add ram_stream input caching via mmap/zero-copy parsing and avoid re-reading the file on subsequent passes.
Make evaluation cheaper by computing only the needed metric per mode and replacing per-block unordered_set tracking with a bitset approach.
Add optional PGO build flags and remove allocator “no-builtin” flags.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.

Show a summary per file

File	Description
code_for_hypergraphs/lib/partition/partition_config.h	Adds runtime flags to control which evaluation metrics to compute.
code_for_hypergraphs/lib/partition/onepass_partitioning/vertex_partitioning.h	Skips OpenMP critical section overhead when single-threaded.
code_for_hypergraphs/lib/io/graph_io_stream.h	Adds mmap-based cached input loading and a fast registration path using cached input.
code_for_hypergraphs/lib/io/graph_io_stream.cpp	Reworks netlist evaluation to optionally use cached input and faster connectivity tracking.
code_for_hypergraphs/CMakeLists.txt	Adds PGO options and removes `-fno-builtin-malloc*` flags.
code_for_hypergraphs/app/freight.cpp	Avoids redundant I/O across passes, adds cached-input registration, and reduces evaluation work per pass.
code_for_hypergraphs/app/configuration.h	Initializes new evaluation flags in the standard configuration.
code_for_graphs/lib/partition/onepass_partitioning/vertex_partitioning.h	Same OpenMP critical-section optimization as hypergraphs.
code_for_graphs/lib/io/graph_io_stream.h	Adds mmap-based cached input loading for `ram_stream`.
code_for_graphs/CMakeLists.txt	Adds PGO options and removes `-fno-builtin-malloc*` flags.
code_for_graphs/app/streammultisection.cpp	Avoids redundant I/O across passes by resetting state for subsequent `ram_stream` passes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-02T15:07:03Z

+		// First pass: compute connectivity from bitset (sequential, after parallel loop)
+		if (n_edges_for_bits > 0 && config.ram_stream) {
+			PartitionID* nodes_data = config.stream_nodes_assign->data();
+			for (LongNodeID node = 0; node < config.n_batches; node++) {
+				PartitionID block = nodes_data[node];
+				auto& ln = (*input)[node];
+				LongNodeID cc = config.read_nw ? 1 : 0;
+				LongNodeID ls = ln.size();
+				PartitionID sf = 1 + (PartitionID)config.read_ew;
+				while (cc < ls) {
+					size_t net_idx = ln[cc] - 1;
+					cc += sf;
+					uint64_t bit = 1ULL << (block % 64);
+					uint64_t old_word = first_pass_bits[net_idx];
+					if (!(old_word & bit)) {
+						if (old_word != 0) first_pass_connectivity += 1;
+						first_pass_bits[net_idx] = old_word | bit;
+					}
+				}


The first-pass connectivity shortcut uses a single 64-bit word per net and increments by 1, which breaks correctness when (a) k > 64 (bit collisions via block % 64) and/or (b) the input has edge weights (connectivity should add edge_weight, not 1). Consider reusing streamEvaluateHPartition_netl with cached_input for the first pass, or implement a weighted multi-word bitset with words_per_net = (k+63)/64 (matching the evaluator).

Suggested change

// First pass: compute connectivity from bitset (sequential, after parallel loop)

if (n_edges_for_bits > 0 && config.ram_stream) {

PartitionID* nodes_data = config.stream_nodes_assign->data();

for (LongNodeID node = 0; node < config.n_batches; node++) {

PartitionID block = nodes_data[node];

auto& ln = (*input)[node];

LongNodeID cc = config.read_nw ? 1 : 0;

LongNodeID ls = ln.size();

PartitionID sf = 1 + (PartitionID)config.read_ew;

while (cc < ls) {

size_t net_idx = ln[cc] - 1;

cc += sf;

uint64_t bit = 1ULL << (block % 64);

uint64_t old_word = first_pass_bits[net_idx];

if (!(old_word & bit)) {

if (old_word != 0) first_pass_connectivity += 1;

first_pass_bits[net_idx] = old_word | bit;

}

}

// First pass: compute connectivity (sequential, after parallel loop)

if (n_edges_for_bits > 0 && config.ram_stream) {

PartitionID* nodes_data = config.stream_nodes_assign->data();

// Use the bitset shortcut only when it is safe:

// - unweighted edges (config.read_ew == 0), so each cut contributes 1

// - number of blocks k <= 64, so block indices fit in a single 64-bit word

const bool use_bitset_shortcut = (!config.read_ew) && (config.k <= 64);

if (use_bitset_shortcut) {

// Original bitset-based connectivity computation

for (LongNodeID node = 0; node < config.n_batches; node++) {

PartitionID block = nodes_data[node];

auto& ln = (*input)[node];

LongNodeID cc = config.read_nw ? 1 : 0;

LongNodeID ls = ln.size();

PartitionID sf = 1 + (PartitionID)config.read_ew;

while (cc < ls) {

size_t net_idx = ln[cc] - 1;

cc += sf;

uint64_t bit = 1ULL << (block % 64);

uint64_t old_word = first_pass_bits[net_idx];

if (!(old_word & bit)) {

if (old_word != 0) first_pass_connectivity += 1;

first_pass_bits[net_idx] = old_word | bit;

}

}

}

} else {

// Fallback: exact connectivity for k > 64 and/or weighted edges.

// Track, for each net, which blocks have been seen.

std::vector<std::vector<PartitionID>> net_blocks(n_edges_for_bits);

for (LongNodeID node = 0; node < config.n_batches; node++) {

PartitionID block = nodes_data[node];

auto& ln = (*input)[node];

LongNodeID cc = config.read_nw ? 1 : 0;

LongNodeID ls = ln.size();

PartitionID sf = 1 + (PartitionID)config.read_ew;

while (cc < ls) {

size_t net_idx = ln[cc] - 1;

// Edge weight (if present) is stored immediately after net id.

LongNodeID edge_weight = 1;

if (config.read_ew && cc + 1 < ls) {

edge_weight = ln[cc + 1];

}

cc += sf;

auto& blocks_for_net = net_blocks[net_idx];

bool seen = false;

for (PartitionID b : blocks_for_net) {

if (b == block) {

seen = true;

break;

}

}

if (!seen) {

// If this net already touches some block(s), adding a new block

// increases connectivity by the edge weight (or 1 if unweighted).

if (!blocks_for_net.empty()) {

first_pass_connectivity += static_cast<double>(edge_weight);

}

blocks_for_net.push_back(block);

}

}

}

Copilot · 2026-04-02T15:07:03Z

+				// First pass cut mode: count CUT_NET from stream_edges_assign directly
+				for (const auto& entry : *config.stream_edges_assign) {
+					if (entry == CUT_NET) pass_cut += 1;
+				}
+				pass_qap = pass_cut;


In cut-net mode, the first-pass objective shortcut counts CUT_NET entries rather than summing net weights from the input. This changes the objective when read_ew is enabled (or when nets have non-unit weights), potentially selecting a different “best” pass than the baseline. Safer option: run streamEvaluateHPartition_netl using cached_input (or store per-net weights once) even on the first pass.

Copilot · 2026-04-02T15:07:04Z

+	size_t words_per_net = ((size_t)config.k + 63) / 64;
+	std::vector<uint64_t> net_block_bits;
+	if (compute_connectivity) {
+		net_block_bits.assign(n_edges * words_per_net, 0);


Connectivity evaluation now allocates net_block_bits as n_edges * words_per_net (where words_per_net = (k+63)/64). For large k and/or large n_edges this can become very large (O(n_edges*k) bits) and may cause OOM or heavy paging during evaluation. Consider using the bitset fast-path only for small k (e.g., k<=64 or a memory threshold) and falling back to the previous sparse approach for larger k.

Suggested change

size_t words_per_net = ((size_t)config.k + 63) / 64;

std::vector<uint64_t> net_block_bits;

if (compute_connectivity) {

net_block_bits.assign(n_edges * words_per_net, 0);

// For connectivity tracking, use a bitset fast path only when memory usage is reasonable.

// Otherwise, disable connectivity evaluation to avoid excessive memory consumption.

size_t words_per_net = 0;

std::vector<uint64_t> net_block_bits;

if (compute_connectivity) {

words_per_net = ((size_t)config.k + 63) / 64;

// Guard against overflow in n_edges * words_per_net

bool disable_connectivity = false;

if (n_edges != 0 && words_per_net > std::numeric_limits<size_t>::max() / n_edges) {

disable_connectivity = true;

} else {

size_t total_words = n_edges * words_per_net;

// Impose a maximum size (in 64-bit words) for the connectivity bitset.

// Here we cap at 256 MiB worth of uint64_t entries.

const size_t MAX_CONNECTIVITY_WORDS =

(static_cast<size_t>(256) * 1024 * 1024) / sizeof(uint64_t);

if (total_words > MAX_CONNECTIVITY_WORDS) {

disable_connectivity = true;

} else {

net_block_bits.assign(total_words, 0);

}

}

if (disable_connectivity) {

compute_connectivity = false;

}

Copilot · 2026-04-02T15:07:04Z

+			fstat(fd, &st);
+			file_size = (size_t)st.st_size;
+#ifdef MAP_POPULATE
+			mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);
+#else
+			mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
+#endif


The mmap fast-path does not check fstat()’s return value (or that file_size is non-zero) before using st.st_size to size the mapping. If fstat fails, file_size may be garbage and mmap/munmap behavior becomes undefined. Please check the return code from fstat (and ensure file_size > 0) and fall back to the ifstream::read path on failure.

Suggested change

fstat(fd, &st);

file_size = (size_t)st.st_size;

#ifdef MAP_POPULATE

mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);

#else

mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE, fd, 0);

#endif

if (fstat(fd, &st) == 0) {

file_size = (size_t)st.st_size;

if (file_size > 0) {

#ifdef MAP_POPULATE

mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);

#else

mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE, fd, 0);

#endif

}

}

Copilot · 2026-04-02T15:07:05Z

+		std::ifstream& in = *(partition_config.stream_in);
+		std::streampos cur = in.tellg();
+		size_t offset = (size_t)cur;
+
+		// mmap the entire file for zero-copy access (POSIX: Linux + macOS)
+		int fd = open(partition_config.graph_filename.c_str(), O_RDONLY);


offset is derived from in.tellg() and cast to size_t without validation. If tellg() returns std::streampos(-1) (stream error) or if offset > file_size, pointer arithmetic on the mapped buffer will go out of bounds. Please validate tellg() and clamp/guard offset against file_size (fall back to the buffered read path when invalid).

Copilot · 2026-04-02T15:07:05Z

+			fstat(fd, &st);
+			file_size = (size_t)st.st_size;
+#ifdef MAP_POPULATE
+			mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);
+#else
+			mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
+#endif


The mmap fast-path does not check fstat()’s return value before using st.st_size to size the mapping. If fstat fails, file_size may be invalid and mmap/munmap behavior becomes undefined. Please check the return code from fstat (and ensure file_size > 0) and fall back to the ifstream::read path on failure.

Suggested change

fstat(fd, &st);

file_size = (size_t)st.st_size;

#ifdef MAP_POPULATE

mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);

#else

mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE, fd, 0);

#endif

int stat_res = fstat(fd, &st);

if (stat_res == 0 && st.st_size > 0 && (off_t)offset < st.st_size) {

file_size = (size_t)st.st_size;

#ifdef MAP_POPULATE

mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);

#else

mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE, fd, 0);

#endif

} else {

mmap_addr = MAP_FAILED;

}

Copilot · 2026-04-02T15:07:06Z

+		size_t offset = (size_t)cur;
+
+		int fd = open(partition_config.graph_filename.c_str(), O_RDONLY);
+		void* mmap_addr = MAP_FAILED;
+		size_t file_size = 0;
+		if (fd != -1) {
+			struct stat st;
+			fstat(fd, &st);
+			file_size = (size_t)st.st_size;
+#ifdef MAP_POPULATE
+			mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);
+#else
+			mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
+#endif


offset is derived from in.tellg() and cast to size_t without validation. If tellg() returns std::streampos(-1) (stream error) or if offset > file_size, pointer arithmetic on the mapped buffer will go out of bounds. Please validate tellg() and guard offset against file_size (fall back to the buffered read path when invalid).

Suggested change

size_t offset = (size_t)cur;

int fd = open(partition_config.graph_filename.c_str(), O_RDONLY);

void* mmap_addr = MAP_FAILED;

size_t file_size = 0;

if (fd != -1) {

struct stat st;

fstat(fd, &st);

file_size = (size_t)st.st_size;

#ifdef MAP_POPULATE

mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);

#else

mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE, fd, 0);

#endif

size_t offset = 0;

int fd = open(partition_config.graph_filename.c_str(), O_RDONLY);

void* mmap_addr = MAP_FAILED;

size_t file_size = 0;

if (fd != -1) {

struct stat st;

if (fstat(fd, &st) == 0 && st.st_size > 0) {

file_size = static_cast<size_t>(st.st_size);

#ifdef MAP_POPULATE

mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);

#else

mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE, fd, 0);

#endif

if (mmap_addr != MAP_FAILED) {

if (cur < 0) {

// Invalid stream position: do not use mmap, fall back to buffered path.

munmap(mmap_addr, file_size);

mmap_addr = MAP_FAILED;

} else {

offset = static_cast<size_t>(cur);

if (offset > file_size) {

// Offset beyond end of mapped file: avoid out-of-bounds access.

munmap(mmap_addr, file_size);

mmap_addr = MAP_FAILED;

}

}

}

}

}

Copilot · 2026-04-02T15:07:06Z

+option(PGO_GENERATE "Build with PGO instrumentation (phase 1)" OFF)
+option(PGO_USE "Build with PGO optimization from collected profiles (phase 2)" OFF)
+if(CMAKE_CXX_COMPILER_ID MATCHES "GNU|Clang")
+  if(PGO_GENERATE)
+    add_definitions(-fprofile-generate=${CMAKE_BINARY_DIR}/pgo_data)
+    set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -fprofile-generate=${CMAKE_BINARY_DIR}/pgo_data")
+  endif()
+  if(PGO_USE)
+    add_definitions(-fprofile-use=${CMAKE_BINARY_DIR}/pgo_data -fprofile-correction)
+    set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -fprofile-use=${CMAKE_BINARY_DIR}/pgo_data")
+  endif()


PGO_GENERATE and PGO_USE can currently both be enabled at the same time, which will add conflicting -fprofile-generate / -fprofile-use flags and can lead to compiler/linker errors or unpredictable behavior. Consider making these options mutually exclusive (e.g., error out when both are ON, or have PGO_USE override PGO_GENERATE).

Copilot · 2026-04-02T15:07:06Z

+option(PGO_GENERATE "Build with PGO instrumentation (phase 1)" OFF)
+option(PGO_USE "Build with PGO optimization from collected profiles (phase 2)" OFF)
+if(CMAKE_CXX_COMPILER_ID MATCHES "GNU|Clang")
+  if(PGO_GENERATE)
+    add_definitions(-fprofile-generate=${CMAKE_BINARY_DIR}/pgo_data)
+    set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -fprofile-generate=${CMAKE_BINARY_DIR}/pgo_data")
+  endif()
+  if(PGO_USE)
+    add_definitions(-fprofile-use=${CMAKE_BINARY_DIR}/pgo_data -fprofile-correction)
+    set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -fprofile-use=${CMAKE_BINARY_DIR}/pgo_data")
+  endif()


PGO_GENERATE and PGO_USE can both be enabled simultaneously, which will add conflicting -fprofile-generate / -fprofile-use flags and can cause build failures. Consider enforcing mutual exclusivity (or explicit precedence) between these options.

… freight_graphs ~4.3x speedup for multipass streaming partitioning while preserving bit-identical partition output. Applied to both hypergraph and graph streaming partitioners. I/O optimizations: - Cache binary input across multipass iterations (ram_stream) - Skip readFirstLineStream on subsequent ram_stream passes - Use mmap with MAP_POPULATE for zero-copy file loading (macOS compat) - Bulk-read + parse from char buffer with pre-reserved vectors - Skip per-node gettimeofday syscalls in inner loop Evaluation optimizations (hypergraphs): - Use cached data for evaluation instead of re-reading file - Replace unordered_set with flat bitset for connectivity tracking - Skip unused metric per mode (cut-only or connectivity-only) - Specialize bitset check for k<=64 (single word) - Cheap first-pass shortcuts for cut/connectivity (unweighted, k<=64) - Use local scratch vector for inter-pass eval to avoid save/restore - Skip evaluation when partition has converged Processing optimizations: - Bypass OMP critical sections when single-threaded - Eliminate valid_neighboring_nets vector for ram_stream (hypergraphs) - Prefetch stream_edges_assign entries in hot loops (hypergraphs) - Remove -fno-builtin-malloc flags - Use memset for all_blocks_to_keys initialization - Skip dead computations for default config - Add PGO support to CMakeLists.txt (GCC/Clang only) Safety: - OMP atomic for nodes_moved; first-pass bitset in sequential post-pass - Hardened mmap: fstat/tellg/offset validation with ifstream fallback - OOM guard for connectivity bitset (256 MiB cap) - PGO_GENERATE/PGO_USE mutual exclusivity enforced - Default initializers for evaluate_cut/evaluate_connectivity

schulzchristian · 2026-04-02T16:05:11Z

Addressed Copilot review feedback (#5 + inline comments)

Round 1 (from #5)

Fixed race condition on nodes_moved: added #pragma omp atomic
Fixed race on first_pass_bits/first_pass_connectivity: moved bitset computation to sequential post-pass after the parallel loop
Checked open() before fstat(): added fd != -1 guard in both mmap paths
Guarded PGO CMake flags: wrapped with CMAKE_CXX_COMPILER_ID MATCHES "GNU|Clang"
Added default initializers for evaluate_cut/evaluate_connectivity in partition_config.h
Fixed stale .size() in prefetch guard: now uses pre-computed line_size
Removed dead best_cut/best_con/best_qap/best_pins variables
Added comment explaining intentional OMP critical guard duplication
Added comment explaining remaining_stream_nodes-- removal safety
Removed redundant cursor = curr_node alias

Round 2 (from inline review comments)

Guarded first-pass connectivity shortcut: only used when k <= 64 and unweighted edges; falls back to full evaluation otherwise (fixes k>64 bit collision and weighted edge correctness)
Guarded first-pass cut shortcut: only used when unweighted; falls back to full evaluation otherwise
Added OOM guard for connectivity bitset: 256 MiB cap with overflow check; disables connectivity when exceeded
Hardened mmap in both variants: checks fstat() return, validates tellg() and offset <= file_size, falls back to ifstream::read on any failure
Made PGO_GENERATE/PGO_USE mutually exclusive: FATAL_ERROR when both enabled

All 32 configs verified bit-identical (24 hypergraph + 8 graph).

schulzchristian · 2026-04-02T16:06:16Z

@copilot check the new changes

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-02T16:25:35Z

 				decision = solve_linear_complexity(curr_node_id, curr_node_weight, my_thread);
+				// Duplicate call is intentional: #pragma omp critical cannot be conditionally skipped
+				if (n_threads > 1) {
 #pragma omp critical(update_self_sorting_vector)
-				this->sorted_blocks.increment(decision);
+					this->sorted_blocks.increment(decision);
+				} else {
+					this->sorted_blocks.increment(decision);


The added comment is inaccurate/misleading: the call to sorted_blocks.increment(decision) is not duplicated, and an OpenMP critical section can be conditionally executed as written. Please update/remove this comment to reflect the actual reason for the branching (skipping critical overhead when n_threads == 1).

Copilot · 2026-04-02T16:25:36Z

 			case SAMPLING_INACTIVE_LINEAR_COMPLEXITY:
 				decision = solve_linear_complexity(curr_node_id, curr_node_weight, my_thread);
+				// Duplicate call is intentional: #pragma omp critical cannot be conditionally skipped
+				if (n_threads > 1) {
 #pragma omp critical(update_self_sorting_vector)
-				this->sorted_blocks.increment(decision);
+					this->sorted_blocks.increment(decision);
+				} else {
+					this->sorted_blocks.increment(decision);


The added comment is inaccurate/misleading: the call to sorted_blocks.increment(decision) is not duplicated, and an OpenMP critical section can be conditionally executed as written. Please update/remove this comment to reflect the actual reason for the branching (skipping critical overhead when n_threads == 1).

Copilot · 2026-04-02T16:25:36Z

+	size_t words_per_net = ((size_t)config.k + 63) / 64;
+	std::vector<uint64_t> net_block_bits;
+	if (compute_connectivity) {
+		// Guard against excessive memory usage for large k * n_edges
+		size_t total_words = n_edges * words_per_net;
+		const size_t MAX_BITSET_WORDS = (size_t(256) * 1024 * 1024) / sizeof(uint64_t); // 256 MiB cap
+		if (words_per_net != 0 && n_edges <= MAX_BITSET_WORDS / words_per_net) {
+			net_block_bits.assign(total_words, 0);
+		} else {
+			compute_connectivity = false; // too large, skip connectivity computation
+		}


When the bitset would exceed the 256 MiB cap, the code sets compute_connectivity = false and proceeds, which will silently return connectivity = 0 and can affect best-pass selection/output. Consider falling back to the previous (set-based) connectivity computation, or at least emitting an explicit error/warning and returning a sentinel so callers don’t treat the result as valid.

Copilot · 2026-04-02T16:25:36Z

 	std::vector< NodeID > *perm_rank = NULL;
 	matrix* D=NULL;
 	if (config.enable_mapping) {
 		perm_rank = new std::vector< NodeID >(config.k);
 		for( unsigned i = 0; i < perm_rank->size(); i++) {
 			(*perm_rank)[i] = i;
 		}
 		D = new online_distance_matrix(config.k, config.k);
 		D->setPartitionConfig(config);
 	}


perm_rank and D are heap-allocated when enable_mapping is true but are never freed before returning, causing a leak on every evaluation call. Prefer stack ownership (e.g., std::vector and std::unique_ptr) or ensure the allocations are deleted on all exit paths.

Copilot · 2026-04-02T16:25:37Z

+			size_t remaining = (size_t)in.tellg() - offset;
+			in.seekg(cur);
+			buf.resize(remaining);
+			in.read(buf.data(), remaining);
+			actual = (size_t)in.gcount();


Fallback path casts tellg() results to size_t without validating they’re non-negative/valid. If tellg() returns -1 (or offset > end), this underflows and can attempt to allocate a huge buffer. Add stream state checks and guard the remaining-size computation before resizing the buffer.

Suggested change

size_t remaining = (size_t)in.tellg() - offset;

in.seekg(cur);

buf.resize(remaining);

in.read(buf.data(), remaining);

actual = (size_t)in.gcount();

std::streampos endpos = in.tellg();

size_t remaining = 0;

if (!in.fail() && endpos >= 0) {

size_t end_off = static_cast<size_t>(endpos);

if (end_off > offset) {

remaining = end_off - offset;

}

}

// Restore original position before reading

in.clear();

in.seekg(cur);

if (remaining > 0) {

buf.resize(remaining);

in.read(buf.data(), remaining);

actual = static_cast<size_t>(in.gcount());

} else {

buf.clear();

actual = 0;

}

Copilot · 2026-04-02T16:25:37Z

+			in.seekg(0, std::ios::end);
+			size_t remaining = (size_t)in.tellg() - offset;
+			in.seekg(cur);
+			buf.resize(remaining);
+			in.read(buf.data(), remaining);
+			actual = (size_t)in.gcount();
+			p = buf.data();
+			pend = p + actual;


Fallback path casts tellg() results to size_t without validating they’re non-negative/valid. If tellg() returns -1 (or offset > end), this underflows and can attempt to allocate a huge buffer. Add stream state checks and guard the remaining-size computation before resizing the buffer.

Suggested change

in.seekg(0, std::ios::end);

size_t remaining = (size_t)in.tellg() - offset;

in.seekg(cur);

buf.resize(remaining);

in.read(buf.data(), remaining);

actual = (size_t)in.gcount();

p = buf.data();

pend = p + actual;

// Fallback to reading from stream: guard tellg() and size computations

in.clear();

in.seekg(0, std::ios::end);

std::streampos end_pos = in.tellg();

if (!in || end_pos < 0 || static_cast<size_t>(end_pos) < offset) {

// Invalid stream position or offset beyond end: treat as no remaining data

in.clear();

in.seekg(cur);

actual = 0;

buf.clear();

p = buf.data();

pend = p;

} else {

size_t remaining = static_cast<size_t>(end_pos) - offset;

in.clear();

in.seekg(cur);

buf.resize(remaining);

in.read(buf.data(), remaining);

actual = static_cast<size_t>(in.gcount());

p = buf.data();

pend = p + actual;

}

Copilot · 2026-04-02T16:25:37Z

+				io_t.restart();
+				if((config.one_pass_algorithm != ONEPASS_HASHING) && (config.one_pass_algorithm != ONEPASS_HASHING_CRC32)) {
+					graph_io_stream::loadBufferLinesToBinary(config, input, 1);
+				}


buffer_io_time is updated from multiple threads inside the OpenMP parallel loop without synchronization, which is a data race (undefined behavior) even if it’s “just timing”. Use a reduction, an atomic add, or a per-thread accumulator merged after the parallel region.

Suggested change

}

}

#pragma omp atomic

Copilot · 2026-04-02T16:25:38Z

+			if (!config.ram_stream) {
+				io_t.restart();
+				if((config.one_pass_algorithm != ONEPASS_HASHING) && (config.one_pass_algorithm != ONEPASS_HASHING_CRC32)) {
+					graph_io_stream::loadBufferLinesToBinary(config, input, 1);
+				}
+				buffer_io_time += io_t.elapsed();
 			}


buffer_io_time is updated from multiple threads inside the OpenMP parallel loop without synchronization, which is a data race (undefined behavior) even if it’s “just timing”. Use a reduction, an atomic add, or a per-thread accumulator merged after the parallel region.

Copilot · 2026-04-02T16:25:38Z

+	(*config.stream_nodes_assign)[curr_node] = assigned_block;
+	(*config.stream_blocks_weight)[assigned_block] += 1;
+#if defined MODE_NETLIST
+	PartitionID* edges_data = config.stream_edges_assign->data();
+	LongNodeID col_counter = config.read_nw ? 1 : 0;
+	LongNodeID line_size = line_numbers.size();
+	PartitionID selecting_factor = 1 + (PartitionID)config.read_ew;
+	while (col_counter < line_size) {
+		LongEdgeID net = line_numbers[col_counter];
+		col_counter += selecting_factor;
+		PartitionID & old_block = edges_data[net-1];
+		if (old_block == CUT_NET) continue;
+#if defined MODE_CONNECTIVITY
+		old_block = assigned_block;
+#else
+		old_block = (old_block==assigned_block || old_block==INVALID_PARTITION) ? assigned_block : CUT_NET;
+#endif
+	}
+#endif


register_result_from_input mutates shared state (stream_blocks_weight and stream_edges_assign) without any synchronization. When called from the OpenMP parallel loop this is a data race and can lead to non-deterministic results. Consider using atomics/per-thread buffers with a merge step, or otherwise ensure this function is only called in a single-threaded context.

Suggested change

(*config.stream_nodes_assign)[curr_node] = assigned_block;

(*config.stream_blocks_weight)[assigned_block] += 1;

#if defined MODE_NETLIST

PartitionID* edges_data = config.stream_edges_assign->data();

LongNodeID col_counter = config.read_nw ? 1 : 0;

LongNodeID line_size = line_numbers.size();

PartitionID selecting_factor = 1 + (PartitionID)config.read_ew;

while (col_counter < line_size) {

LongEdgeID net = line_numbers[col_counter];

col_counter += selecting_factor;

PartitionID & old_block = edges_data[net-1];

if (old_block == CUT_NET) continue;

#if defined MODE_CONNECTIVITY

old_block = assigned_block;

#else

old_block = (old_block==assigned_block || old_block==INVALID_PARTITION) ? assigned_block : CUT_NET;

#endif

}

#endif

#ifdef _OPENMP

#pragma omp critical(register_result_from_input)

#endif

{

(*config.stream_nodes_assign)[curr_node] = assigned_block;

(*config.stream_blocks_weight)[assigned_block] += 1;

#if defined MODE_NETLIST

PartitionID* edges_data = config.stream_edges_assign->data();

LongNodeID col_counter = config.read_nw ? 1 : 0;

LongNodeID line_size = line_numbers.size();

PartitionID selecting_factor = 1 + (PartitionID)config.read_ew;

while (col_counter < line_size) {

LongEdgeID net = line_numbers[col_counter];

col_counter += selecting_factor;

PartitionID & old_block = edges_data[net-1];

if (old_block == CUT_NET) continue;

#if defined MODE_CONNECTIVITY

old_block = assigned_block;

#else

old_block = (old_block==assigned_block || old_block==INVALID_PARTITION) ? assigned_block : CUT_NET;

#endif

}

#endif

}

Copilot · 2026-04-02T16:31:38Z

@copilot check the new changes

Reviewed the Round 2 changes — the main fixes all look correct:

First-pass connectivity shortcut (freight.cpp lines 177–178): correctly gated on k <= 64 && !config.read_ew, so n_edges_for_bits is 0 for k>64 or weighted edges, making the bitset loop a no-op. ✅
First-pass cut shortcut (lines 281–294): correctly gated on !config.read_ew. ✅
OOM guard for connectivity bitset (graph_io_stream.cpp lines 215–226): 256 MiB cap with overflow-safe comparison. ✅
mmap hardening (both .h files): fstat return check, cur >= 0 guard, and offset <= file_size guard are all in place. ✅
PGO mutual exclusivity (both CMakeLists.txt): FATAL_ERROR when both PGO_GENERATE and PGO_USE are ON. ✅

Two residual edge-case bugs remain in the mmap fallback path (only reached when mmap fails, e.g. on systems where it's unsupported):

Integer underflow — (size_t)in.tellg() - offset underflows when offset > end_off (stream somehow positioned past EOF).
Invalid seekg — in.seekg(cur) is called without checking cur >= 0; if tellg() returned -1 earlier, this has undefined behaviour.

Fix for both (same in both variants):

} else {
    actual = 0;
    if (cur >= 0) {
        in.seekg(0, std::ios::end);
        std::streampos end_pos = in.tellg();
        if (end_pos >= 0) {
            size_t end_off = (size_t)end_pos;
            size_t remaining = (end_off > offset) ? (end_off - offset) : 0;
            in.seekg(cur);
            if (remaining > 0 && in.good()) {
                buf.resize(remaining);
                in.read(buf.data(), remaining);
                actual = (size_t)in.gcount();
            }
        }
    }
    p = buf.data();
    pend = p + actual;
}

These only affect error paths (mmap failure + invalid stream state), so they're non-critical for normal operation, but worth a follow-up fix for robustness.

Copilot AI mentioned this pull request Apr 2, 2026

Add code review for PR #4 (multipass performance optimizations) #5

Draft

schulzchristian force-pushed the perf/multipass-speedup branch from e13230c to db91616 Compare April 2, 2026 14:53

schulzchristian requested a review from Copilot April 2, 2026 14:58

Copilot started reviewing on behalf of schulzchristian April 2, 2026 14:59 View session

Copilot AI reviewed Apr 2, 2026

View reviewed changes

schulzchristian force-pushed the perf/multipass-speedup branch from db91616 to 2e76bfb Compare April 2, 2026 16:03

schulzchristian requested a review from Copilot April 2, 2026 16:18

Copilot started reviewing on behalf of schulzchristian April 2, 2026 16:18 View session

Copilot started work on behalf of schulzchristian April 2, 2026 16:22 View session

schulzchristian merged commit 64b132b into main Apr 2, 2026
3 of 5 checks passed

Copilot AI reviewed Apr 2, 2026

View reviewed changes

-	size_t words_per_net = ((size_t)config.k + 63) / 64;
-	std::vector<uint64_t> net_block_bits;
-	if (compute_connectivity) {
-		net_block_bits.assign(n_edges * words_per_net, 0);
+	// For connectivity tracking, use a bitset fast path only when memory usage is reasonable.
+	// Otherwise, disable connectivity evaluation to avoid excessive memory consumption.
+	size_t words_per_net = 0;
+	std::vector<uint64_t> net_block_bits;
+	if (compute_connectivity) {
+		words_per_net = ((size_t)config.k + 63) / 64;
+		// Guard against overflow in n_edges * words_per_net
+		bool disable_connectivity = false;
+		if (n_edges != 0 && words_per_net > std::numeric_limits<size_t>::max() / n_edges) {
+			disable_connectivity = true;
+		} else {
+			size_t total_words = n_edges * words_per_net;
+			// Impose a maximum size (in 64-bit words) for the connectivity bitset.
+			// Here we cap at 256 MiB worth of uint64_t entries.
+			const size_t MAX_CONNECTIVITY_WORDS =
+				(static_cast<size_t>(256) * 1024 * 1024) / sizeof(uint64_t);
+			if (total_words > MAX_CONNECTIVITY_WORDS) {
+				disable_connectivity = true;
+			} else {
+				net_block_bits.assign(total_words, 0);
+			}
+		}
+		if (disable_connectivity) {
+			compute_connectivity = false;
+		}

-			size_t remaining = (size_t)in.tellg() - offset;
-			in.seekg(cur);
-			buf.resize(remaining);
-			in.read(buf.data(), remaining);
-			actual = (size_t)in.gcount();
+			std::streampos endpos = in.tellg();
+			size_t remaining = 0;
+			if (!in.fail() && endpos >= 0) {
+				size_t end_off = static_cast<size_t>(endpos);
+				if (end_off > offset) {
+					remaining = end_off - offset;
+				}
+			}
+			// Restore original position before reading
+			in.clear();
+			in.seekg(cur);
+			if (remaining > 0) {
+				buf.resize(remaining);
+				in.read(buf.data(), remaining);
+				actual = static_cast<size_t>(in.gcount());
+			} else {
+				buf.clear();
+				actual = 0;
+			}

Conversation

schulzchristian commented Apr 2, 2026 • edited by Copilot AI Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codacy-production Bot commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Not up to standards ⛔

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

schulzchristian commented Apr 2, 2026

Addressed Copilot review feedback (#5 + inline comments)

Round 1 (from #5)

Round 2 (from inline review comments)

Uh oh!

schulzchristian commented Apr 2, 2026

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI commented Apr 2, 2026

schulzchristian commented Apr 2, 2026 •

edited by Copilot AI

Loading

codacy-production Bot commented Apr 2, 2026 •

edited

Loading