Skip to content

Multipass performance optimizations (~4.3x speedup)#4

Merged
schulzchristian merged 1 commit into
mainfrom
perf/multipass-speedup
Apr 2, 2026
Merged

Multipass performance optimizations (~4.3x speedup)#4
schulzchristian merged 1 commit into
mainfrom
perf/multipass-speedup

Conversation

@schulzchristian
Copy link
Copy Markdown
Member

@schulzchristian schulzchristian commented Apr 2, 2026

  • Review Round 2 changes: verified first-pass connectivity shortcut (k<=64, unweighted guard), OOM guard for bitset, fstat/offset validation in mmap, PGO mutual exclusivity
  • Fix integer underflow in mmap fallback: guard remaining with (end_off > offset) ? (end_off - offset) : 0 (both hypergraph and graph)
  • Fix invalid seekg(cur) in mmap fallback: guard entire fallback read with if (cur >= 0) (both variants)
  • Add in.good() check after seekg(cur) before read() (both variants)
  • Both builds compile cleanly; CodeQL scan: 0 alerts

@codacy-production
Copy link
Copy Markdown

codacy-production Bot commented Apr 2, 2026

Not up to standards ⛔

🔴 Issues 2 critical · 13 medium

Alerts:
⚠ 15 issues (≤ 0 issues of at least minor severity)

Results:
15 new issues

Category Results
UnusedCode 1 medium
BestPractice 3 medium
ErrorProne 9 medium
Security 2 critical

View in Codacy

🟢 Metrics 154 complexity · -7 duplication

Metric Results
Complexity 154
Duplication -7

View in Codacy

TIP This summary will be updated as you push new changes. Give us feedback

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR targets major speedups for multi-pass streaming partitioning (especially --ram_stream with --num_streams_passes > 1) by caching inputs across passes, reducing per-pass I/O, and optimizing evaluation and inner-loop overhead for both hypergraph (freight_cut / freight_con) and graph (freight_graphs) paths.

Changes:

  • Add ram_stream input caching via mmap/zero-copy parsing and avoid re-reading the file on subsequent passes.
  • Make evaluation cheaper by computing only the needed metric per mode and replacing per-block unordered_set tracking with a bitset approach.
  • Add optional PGO build flags and remove allocator “no-builtin” flags.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
code_for_hypergraphs/lib/partition/partition_config.h Adds runtime flags to control which evaluation metrics to compute.
code_for_hypergraphs/lib/partition/onepass_partitioning/vertex_partitioning.h Skips OpenMP critical section overhead when single-threaded.
code_for_hypergraphs/lib/io/graph_io_stream.h Adds mmap-based cached input loading and a fast registration path using cached input.
code_for_hypergraphs/lib/io/graph_io_stream.cpp Reworks netlist evaluation to optionally use cached input and faster connectivity tracking.
code_for_hypergraphs/CMakeLists.txt Adds PGO options and removes -fno-builtin-malloc* flags.
code_for_hypergraphs/app/freight.cpp Avoids redundant I/O across passes, adds cached-input registration, and reduces evaluation work per pass.
code_for_hypergraphs/app/configuration.h Initializes new evaluation flags in the standard configuration.
code_for_graphs/lib/partition/onepass_partitioning/vertex_partitioning.h Same OpenMP critical-section optimization as hypergraphs.
code_for_graphs/lib/io/graph_io_stream.h Adds mmap-based cached input loading for ram_stream.
code_for_graphs/CMakeLists.txt Adds PGO options and removes -fno-builtin-malloc* flags.
code_for_graphs/app/streammultisection.cpp Avoids redundant I/O across passes by resetting state for subsequent ram_stream passes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +245 to +263
// First pass: compute connectivity from bitset (sequential, after parallel loop)
if (n_edges_for_bits > 0 && config.ram_stream) {
PartitionID* nodes_data = config.stream_nodes_assign->data();
for (LongNodeID node = 0; node < config.n_batches; node++) {
PartitionID block = nodes_data[node];
auto& ln = (*input)[node];
LongNodeID cc = config.read_nw ? 1 : 0;
LongNodeID ls = ln.size();
PartitionID sf = 1 + (PartitionID)config.read_ew;
while (cc < ls) {
size_t net_idx = ln[cc] - 1;
cc += sf;
uint64_t bit = 1ULL << (block % 64);
uint64_t old_word = first_pass_bits[net_idx];
if (!(old_word & bit)) {
if (old_word != 0) first_pass_connectivity += 1;
first_pass_bits[net_idx] = old_word | bit;
}
}
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first-pass connectivity shortcut uses a single 64-bit word per net and increments by 1, which breaks correctness when (a) k > 64 (bit collisions via block % 64) and/or (b) the input has edge weights (connectivity should add edge_weight, not 1). Consider reusing streamEvaluateHPartition_netl with cached_input for the first pass, or implement a weighted multi-word bitset with words_per_net = (k+63)/64 (matching the evaluator).

Suggested change
// First pass: compute connectivity from bitset (sequential, after parallel loop)
if (n_edges_for_bits > 0 && config.ram_stream) {
PartitionID* nodes_data = config.stream_nodes_assign->data();
for (LongNodeID node = 0; node < config.n_batches; node++) {
PartitionID block = nodes_data[node];
auto& ln = (*input)[node];
LongNodeID cc = config.read_nw ? 1 : 0;
LongNodeID ls = ln.size();
PartitionID sf = 1 + (PartitionID)config.read_ew;
while (cc < ls) {
size_t net_idx = ln[cc] - 1;
cc += sf;
uint64_t bit = 1ULL << (block % 64);
uint64_t old_word = first_pass_bits[net_idx];
if (!(old_word & bit)) {
if (old_word != 0) first_pass_connectivity += 1;
first_pass_bits[net_idx] = old_word | bit;
}
}
// First pass: compute connectivity (sequential, after parallel loop)
if (n_edges_for_bits > 0 && config.ram_stream) {
PartitionID* nodes_data = config.stream_nodes_assign->data();
// Use the bitset shortcut only when it is safe:
// - unweighted edges (config.read_ew == 0), so each cut contributes 1
// - number of blocks k <= 64, so block indices fit in a single 64-bit word
const bool use_bitset_shortcut = (!config.read_ew) && (config.k <= 64);
if (use_bitset_shortcut) {
// Original bitset-based connectivity computation
for (LongNodeID node = 0; node < config.n_batches; node++) {
PartitionID block = nodes_data[node];
auto& ln = (*input)[node];
LongNodeID cc = config.read_nw ? 1 : 0;
LongNodeID ls = ln.size();
PartitionID sf = 1 + (PartitionID)config.read_ew;
while (cc < ls) {
size_t net_idx = ln[cc] - 1;
cc += sf;
uint64_t bit = 1ULL << (block % 64);
uint64_t old_word = first_pass_bits[net_idx];
if (!(old_word & bit)) {
if (old_word != 0) first_pass_connectivity += 1;
first_pass_bits[net_idx] = old_word | bit;
}
}
}
} else {
// Fallback: exact connectivity for k > 64 and/or weighted edges.
// Track, for each net, which blocks have been seen.
std::vector<std::vector<PartitionID>> net_blocks(n_edges_for_bits);
for (LongNodeID node = 0; node < config.n_batches; node++) {
PartitionID block = nodes_data[node];
auto& ln = (*input)[node];
LongNodeID cc = config.read_nw ? 1 : 0;
LongNodeID ls = ln.size();
PartitionID sf = 1 + (PartitionID)config.read_ew;
while (cc < ls) {
size_t net_idx = ln[cc] - 1;
// Edge weight (if present) is stored immediately after net id.
LongNodeID edge_weight = 1;
if (config.read_ew && cc + 1 < ls) {
edge_weight = ln[cc + 1];
}
cc += sf;
auto& blocks_for_net = net_blocks[net_idx];
bool seen = false;
for (PartitionID b : blocks_for_net) {
if (b == block) {
seen = true;
break;
}
}
if (!seen) {
// If this net already touches some block(s), adding a new block
// increases connectivity by the edge weight (or 1 if unweighted).
if (!blocks_for_net.empty()) {
first_pass_connectivity += static_cast<double>(edge_weight);
}
blocks_for_net.push_back(block);
}
}
}

Copilot uses AI. Check for mistakes.
Comment thread code_for_hypergraphs/app/freight.cpp Outdated
Comment on lines +282 to +286
// First pass cut mode: count CUT_NET from stream_edges_assign directly
for (const auto& entry : *config.stream_edges_assign) {
if (entry == CUT_NET) pass_cut += 1;
}
pass_qap = pass_cut;
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In cut-net mode, the first-pass objective shortcut counts CUT_NET entries rather than summing net weights from the input. This changes the objective when read_ew is enabled (or when nets have non-unit weights), potentially selecting a different “best” pass than the baseline. Safer option: run streamEvaluateHPartition_netl using cached_input (or store per-net weights once) even on the first pass.

Copilot uses AI. Check for mistakes.
Comment on lines +215 to +218
size_t words_per_net = ((size_t)config.k + 63) / 64;
std::vector<uint64_t> net_block_bits;
if (compute_connectivity) {
net_block_bits.assign(n_edges * words_per_net, 0);
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Connectivity evaluation now allocates net_block_bits as n_edges * words_per_net (where words_per_net = (k+63)/64). For large k and/or large n_edges this can become very large (O(n_edges*k) bits) and may cause OOM or heavy paging during evaluation. Consider using the bitset fast-path only for small k (e.g., k<=64 or a memory threshold) and falling back to the previous sparse approach for larger k.

Suggested change
size_t words_per_net = ((size_t)config.k + 63) / 64;
std::vector<uint64_t> net_block_bits;
if (compute_connectivity) {
net_block_bits.assign(n_edges * words_per_net, 0);
// For connectivity tracking, use a bitset fast path only when memory usage is reasonable.
// Otherwise, disable connectivity evaluation to avoid excessive memory consumption.
size_t words_per_net = 0;
std::vector<uint64_t> net_block_bits;
if (compute_connectivity) {
words_per_net = ((size_t)config.k + 63) / 64;
// Guard against overflow in n_edges * words_per_net
bool disable_connectivity = false;
if (n_edges != 0 && words_per_net > std::numeric_limits<size_t>::max() / n_edges) {
disable_connectivity = true;
} else {
size_t total_words = n_edges * words_per_net;
// Impose a maximum size (in 64-bit words) for the connectivity bitset.
// Here we cap at 256 MiB worth of uint64_t entries.
const size_t MAX_CONNECTIVITY_WORDS =
(static_cast<size_t>(256) * 1024 * 1024) / sizeof(uint64_t);
if (total_words > MAX_CONNECTIVITY_WORDS) {
disable_connectivity = true;
} else {
net_block_bits.assign(total_words, 0);
}
}
if (disable_connectivity) {
compute_connectivity = false;
}

Copilot uses AI. Check for mistakes.
Comment on lines +361 to +367
fstat(fd, &st);
file_size = (size_t)st.st_size;
#ifdef MAP_POPULATE
mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);
#else
mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
#endif
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mmap fast-path does not check fstat()’s return value (or that file_size is non-zero) before using st.st_size to size the mapping. If fstat fails, file_size may be garbage and mmap/munmap behavior becomes undefined. Please check the return code from fstat (and ensure file_size > 0) and fall back to the ifstream::read path on failure.

Suggested change
fstat(fd, &st);
file_size = (size_t)st.st_size;
#ifdef MAP_POPULATE
mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);
#else
mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
#endif
if (fstat(fd, &st) == 0) {
file_size = (size_t)st.st_size;
if (file_size > 0) {
#ifdef MAP_POPULATE
mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);
#else
mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
#endif
}
}

Copilot uses AI. Check for mistakes.
Comment on lines +351 to +356
std::ifstream& in = *(partition_config.stream_in);
std::streampos cur = in.tellg();
size_t offset = (size_t)cur;

// mmap the entire file for zero-copy access (POSIX: Linux + macOS)
int fd = open(partition_config.graph_filename.c_str(), O_RDONLY);
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

offset is derived from in.tellg() and cast to size_t without validation. If tellg() returns std::streampos(-1) (stream error) or if offset > file_size, pointer arithmetic on the mapped buffer will go out of bounds. Please validate tellg() and clamp/guard offset against file_size (fall back to the buffered read path when invalid).

Copilot uses AI. Check for mistakes.
Comment on lines +210 to +216
fstat(fd, &st);
file_size = (size_t)st.st_size;
#ifdef MAP_POPULATE
mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);
#else
mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
#endif
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mmap fast-path does not check fstat()’s return value before using st.st_size to size the mapping. If fstat fails, file_size may be invalid and mmap/munmap behavior becomes undefined. Please check the return code from fstat (and ensure file_size > 0) and fall back to the ifstream::read path on failure.

Suggested change
fstat(fd, &st);
file_size = (size_t)st.st_size;
#ifdef MAP_POPULATE
mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);
#else
mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
#endif
int stat_res = fstat(fd, &st);
if (stat_res == 0 && st.st_size > 0 && (off_t)offset < st.st_size) {
file_size = (size_t)st.st_size;
#ifdef MAP_POPULATE
mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);
#else
mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
#endif
} else {
mmap_addr = MAP_FAILED;
}

Copilot uses AI. Check for mistakes.
Comment on lines +203 to +216
size_t offset = (size_t)cur;

int fd = open(partition_config.graph_filename.c_str(), O_RDONLY);
void* mmap_addr = MAP_FAILED;
size_t file_size = 0;
if (fd != -1) {
struct stat st;
fstat(fd, &st);
file_size = (size_t)st.st_size;
#ifdef MAP_POPULATE
mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);
#else
mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
#endif
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

offset is derived from in.tellg() and cast to size_t without validation. If tellg() returns std::streampos(-1) (stream error) or if offset > file_size, pointer arithmetic on the mapped buffer will go out of bounds. Please validate tellg() and guard offset against file_size (fall back to the buffered read path when invalid).

Suggested change
size_t offset = (size_t)cur;
int fd = open(partition_config.graph_filename.c_str(), O_RDONLY);
void* mmap_addr = MAP_FAILED;
size_t file_size = 0;
if (fd != -1) {
struct stat st;
fstat(fd, &st);
file_size = (size_t)st.st_size;
#ifdef MAP_POPULATE
mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);
#else
mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
#endif
size_t offset = 0;
int fd = open(partition_config.graph_filename.c_str(), O_RDONLY);
void* mmap_addr = MAP_FAILED;
size_t file_size = 0;
if (fd != -1) {
struct stat st;
if (fstat(fd, &st) == 0 && st.st_size > 0) {
file_size = static_cast<size_t>(st.st_size);
#ifdef MAP_POPULATE
mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);
#else
mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
#endif
if (mmap_addr != MAP_FAILED) {
if (cur < 0) {
// Invalid stream position: do not use mmap, fall back to buffered path.
munmap(mmap_addr, file_size);
mmap_addr = MAP_FAILED;
} else {
offset = static_cast<size_t>(cur);
if (offset > file_size) {
// Offset beyond end of mapped file: avoid out-of-bounds access.
munmap(mmap_addr, file_size);
mmap_addr = MAP_FAILED;
}
}
}
}
}

Copilot uses AI. Check for mistakes.
Comment on lines +29 to +39
option(PGO_GENERATE "Build with PGO instrumentation (phase 1)" OFF)
option(PGO_USE "Build with PGO optimization from collected profiles (phase 2)" OFF)
if(CMAKE_CXX_COMPILER_ID MATCHES "GNU|Clang")
if(PGO_GENERATE)
add_definitions(-fprofile-generate=${CMAKE_BINARY_DIR}/pgo_data)
set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -fprofile-generate=${CMAKE_BINARY_DIR}/pgo_data")
endif()
if(PGO_USE)
add_definitions(-fprofile-use=${CMAKE_BINARY_DIR}/pgo_data -fprofile-correction)
set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -fprofile-use=${CMAKE_BINARY_DIR}/pgo_data")
endif()
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PGO_GENERATE and PGO_USE can currently both be enabled at the same time, which will add conflicting -fprofile-generate / -fprofile-use flags and can lead to compiler/linker errors or unpredictable behavior. Consider making these options mutually exclusive (e.g., error out when both are ON, or have PGO_USE override PGO_GENERATE).

Copilot uses AI. Check for mistakes.
Comment on lines +28 to +38
option(PGO_GENERATE "Build with PGO instrumentation (phase 1)" OFF)
option(PGO_USE "Build with PGO optimization from collected profiles (phase 2)" OFF)
if(CMAKE_CXX_COMPILER_ID MATCHES "GNU|Clang")
if(PGO_GENERATE)
add_definitions(-fprofile-generate=${CMAKE_BINARY_DIR}/pgo_data)
set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -fprofile-generate=${CMAKE_BINARY_DIR}/pgo_data")
endif()
if(PGO_USE)
add_definitions(-fprofile-use=${CMAKE_BINARY_DIR}/pgo_data -fprofile-correction)
set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -fprofile-use=${CMAKE_BINARY_DIR}/pgo_data")
endif()
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PGO_GENERATE and PGO_USE can both be enabled simultaneously, which will add conflicting -fprofile-generate / -fprofile-use flags and can cause build failures. Consider enforcing mutual exclusivity (or explicit precedence) between these options.

Copilot uses AI. Check for mistakes.
… freight_graphs

~4.3x speedup for multipass streaming partitioning while preserving
bit-identical partition output. Applied to both hypergraph and graph
streaming partitioners.

I/O optimizations:
- Cache binary input across multipass iterations (ram_stream)
- Skip readFirstLineStream on subsequent ram_stream passes
- Use mmap with MAP_POPULATE for zero-copy file loading (macOS compat)
- Bulk-read + parse from char buffer with pre-reserved vectors
- Skip per-node gettimeofday syscalls in inner loop

Evaluation optimizations (hypergraphs):
- Use cached data for evaluation instead of re-reading file
- Replace unordered_set with flat bitset for connectivity tracking
- Skip unused metric per mode (cut-only or connectivity-only)
- Specialize bitset check for k<=64 (single word)
- Cheap first-pass shortcuts for cut/connectivity (unweighted, k<=64)
- Use local scratch vector for inter-pass eval to avoid save/restore
- Skip evaluation when partition has converged

Processing optimizations:
- Bypass OMP critical sections when single-threaded
- Eliminate valid_neighboring_nets vector for ram_stream (hypergraphs)
- Prefetch stream_edges_assign entries in hot loops (hypergraphs)
- Remove -fno-builtin-malloc flags
- Use memset for all_blocks_to_keys initialization
- Skip dead computations for default config
- Add PGO support to CMakeLists.txt (GCC/Clang only)

Safety:
- OMP atomic for nodes_moved; first-pass bitset in sequential post-pass
- Hardened mmap: fstat/tellg/offset validation with ifstream fallback
- OOM guard for connectivity bitset (256 MiB cap)
- PGO_GENERATE/PGO_USE mutual exclusivity enforced
- Default initializers for evaluate_cut/evaluate_connectivity
@schulzchristian schulzchristian force-pushed the perf/multipass-speedup branch from db91616 to 2e76bfb Compare April 2, 2026 16:03
@schulzchristian
Copy link
Copy Markdown
Member Author

Addressed Copilot review feedback (#5 + inline comments)

Round 1 (from #5)

  • Fixed race condition on nodes_moved: added #pragma omp atomic
  • Fixed race on first_pass_bits/first_pass_connectivity: moved bitset computation to sequential post-pass after the parallel loop
  • Checked open() before fstat(): added fd != -1 guard in both mmap paths
  • Guarded PGO CMake flags: wrapped with CMAKE_CXX_COMPILER_ID MATCHES "GNU|Clang"
  • Added default initializers for evaluate_cut/evaluate_connectivity in partition_config.h
  • Fixed stale .size() in prefetch guard: now uses pre-computed line_size
  • Removed dead best_cut/best_con/best_qap/best_pins variables
  • Added comment explaining intentional OMP critical guard duplication
  • Added comment explaining remaining_stream_nodes-- removal safety
  • Removed redundant cursor = curr_node alias

Round 2 (from inline review comments)

  • Guarded first-pass connectivity shortcut: only used when k <= 64 and unweighted edges; falls back to full evaluation otherwise (fixes k>64 bit collision and weighted edge correctness)
  • Guarded first-pass cut shortcut: only used when unweighted; falls back to full evaluation otherwise
  • Added OOM guard for connectivity bitset: 256 MiB cap with overflow check; disables connectivity when exceeded
  • Hardened mmap in both variants: checks fstat() return, validates tellg() and offset <= file_size, falls back to ifstream::read on any failure
  • Made PGO_GENERATE/PGO_USE mutually exclusive: FATAL_ERROR when both enabled

All 32 configs verified bit-identical (24 hypergraph + 8 graph).

@schulzchristian
Copy link
Copy Markdown
Member Author

@copilot check the new changes

@schulzchristian schulzchristian requested a review from Copilot April 2, 2026 16:18
@schulzchristian schulzchristian merged commit 64b132b into main Apr 2, 2026
3 of 5 checks passed
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 175 to +181
decision = solve_linear_complexity(curr_node_id, curr_node_weight, my_thread);
// Duplicate call is intentional: #pragma omp critical cannot be conditionally skipped
if (n_threads > 1) {
#pragma omp critical(update_self_sorting_vector)
this->sorted_blocks.increment(decision);
this->sorted_blocks.increment(decision);
} else {
this->sorted_blocks.increment(decision);
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The added comment is inaccurate/misleading: the call to sorted_blocks.increment(decision) is not duplicated, and an OpenMP critical section can be conditionally executed as written. Please update/remove this comment to reflect the actual reason for the branching (skipping critical overhead when n_threads == 1).

Copilot uses AI. Check for mistakes.
Comment on lines 175 to +182
case SAMPLING_INACTIVE_LINEAR_COMPLEXITY:
decision = solve_linear_complexity(curr_node_id, curr_node_weight, my_thread);
// Duplicate call is intentional: #pragma omp critical cannot be conditionally skipped
if (n_threads > 1) {
#pragma omp critical(update_self_sorting_vector)
this->sorted_blocks.increment(decision);
this->sorted_blocks.increment(decision);
} else {
this->sorted_blocks.increment(decision);
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The added comment is inaccurate/misleading: the call to sorted_blocks.increment(decision) is not duplicated, and an OpenMP critical section can be conditionally executed as written. Please update/remove this comment to reflect the actual reason for the branching (skipping critical overhead when n_threads == 1).

Copilot uses AI. Check for mistakes.
Comment on lines +215 to +225
size_t words_per_net = ((size_t)config.k + 63) / 64;
std::vector<uint64_t> net_block_bits;
if (compute_connectivity) {
// Guard against excessive memory usage for large k * n_edges
size_t total_words = n_edges * words_per_net;
const size_t MAX_BITSET_WORDS = (size_t(256) * 1024 * 1024) / sizeof(uint64_t); // 256 MiB cap
if (words_per_net != 0 && n_edges <= MAX_BITSET_WORDS / words_per_net) {
net_block_bits.assign(total_words, 0);
} else {
compute_connectivity = false; // too large, skip connectivity computation
}
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the bitset would exceed the 256 MiB cap, the code sets compute_connectivity = false and proceeds, which will silently return connectivity = 0 and can affect best-pass selection/output. Consider falling back to the previous (set-based) connectivity computation, or at least emitting an explicit error/warning and returning a sentinel so callers don’t treat the result as valid.

Copilot uses AI. Check for mistakes.
Comment on lines 184 to 193
std::vector< NodeID > *perm_rank = NULL;
matrix* D=NULL;
if (config.enable_mapping) {
perm_rank = new std::vector< NodeID >(config.k);
for( unsigned i = 0; i < perm_rank->size(); i++) {
(*perm_rank)[i] = i;
}
D = new online_distance_matrix(config.k, config.k);
D->setPartitionConfig(config);
}
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perm_rank and D are heap-allocated when enable_mapping is true but are never freed before returning, causing a leak on every evaluation call. Prefer stack ownership (e.g., std::vector and std::unique_ptr) or ensure the allocations are deleted on all exit paths.

Copilot uses AI. Check for mistakes.
Comment on lines +386 to +390
size_t remaining = (size_t)in.tellg() - offset;
in.seekg(cur);
buf.resize(remaining);
in.read(buf.data(), remaining);
actual = (size_t)in.gcount();
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fallback path casts tellg() results to size_t without validating they’re non-negative/valid. If tellg() returns -1 (or offset > end), this underflows and can attempt to allocate a huge buffer. Add stream state checks and guard the remaining-size computation before resizing the buffer.

Suggested change
size_t remaining = (size_t)in.tellg() - offset;
in.seekg(cur);
buf.resize(remaining);
in.read(buf.data(), remaining);
actual = (size_t)in.gcount();
std::streampos endpos = in.tellg();
size_t remaining = 0;
if (!in.fail() && endpos >= 0) {
size_t end_off = static_cast<size_t>(endpos);
if (end_off > offset) {
remaining = end_off - offset;
}
}
// Restore original position before reading
in.clear();
in.seekg(cur);
if (remaining > 0) {
buf.resize(remaining);
in.read(buf.data(), remaining);
actual = static_cast<size_t>(in.gcount());
} else {
buf.clear();
actual = 0;
}

Copilot uses AI. Check for mistakes.
Comment on lines +233 to +240
in.seekg(0, std::ios::end);
size_t remaining = (size_t)in.tellg() - offset;
in.seekg(cur);
buf.resize(remaining);
in.read(buf.data(), remaining);
actual = (size_t)in.gcount();
p = buf.data();
pend = p + actual;
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fallback path casts tellg() results to size_t without validating they’re non-negative/valid. If tellg() returns -1 (or offset > end), this underflows and can attempt to allocate a huge buffer. Add stream state checks and guard the remaining-size computation before resizing the buffer.

Suggested change
in.seekg(0, std::ios::end);
size_t remaining = (size_t)in.tellg() - offset;
in.seekg(cur);
buf.resize(remaining);
in.read(buf.data(), remaining);
actual = (size_t)in.gcount();
p = buf.data();
pend = p + actual;
// Fallback to reading from stream: guard tellg() and size computations
in.clear();
in.seekg(0, std::ios::end);
std::streampos end_pos = in.tellg();
if (!in || end_pos < 0 || static_cast<size_t>(end_pos) < offset) {
// Invalid stream position or offset beyond end: treat as no remaining data
in.clear();
in.seekg(cur);
actual = 0;
buf.clear();
p = buf.data();
pend = p;
} else {
size_t remaining = static_cast<size_t>(end_pos) - offset;
in.clear();
in.seekg(cur);
buf.resize(remaining);
in.read(buf.data(), remaining);
actual = static_cast<size_t>(in.gcount());
p = buf.data();
pend = p + actual;
}

Copilot uses AI. Check for mistakes.
io_t.restart();
if((config.one_pass_algorithm != ONEPASS_HASHING) && (config.one_pass_algorithm != ONEPASS_HASHING_CRC32)) {
graph_io_stream::loadBufferLinesToBinary(config, input, 1);
}
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buffer_io_time is updated from multiple threads inside the OpenMP parallel loop without synchronization, which is a data race (undefined behavior) even if it’s “just timing”. Use a reduction, an atomic add, or a per-thread accumulator merged after the parallel region.

Suggested change
}
}
#pragma omp atomic

Copilot uses AI. Check for mistakes.
Comment on lines +156 to 162
if (!config.ram_stream) {
io_t.restart();
if((config.one_pass_algorithm != ONEPASS_HASHING) && (config.one_pass_algorithm != ONEPASS_HASHING_CRC32)) {
graph_io_stream::loadBufferLinesToBinary(config, input, 1);
}
buffer_io_time += io_t.elapsed();
}
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buffer_io_time is updated from multiple threads inside the OpenMP parallel loop without synchronization, which is a data race (undefined behavior) even if it’s “just timing”. Use a reduction, an atomic add, or a per-thread accumulator merged after the parallel region.

Copilot uses AI. Check for mistakes.
Comment on lines +466 to +484
(*config.stream_nodes_assign)[curr_node] = assigned_block;
(*config.stream_blocks_weight)[assigned_block] += 1;
#if defined MODE_NETLIST
PartitionID* edges_data = config.stream_edges_assign->data();
LongNodeID col_counter = config.read_nw ? 1 : 0;
LongNodeID line_size = line_numbers.size();
PartitionID selecting_factor = 1 + (PartitionID)config.read_ew;
while (col_counter < line_size) {
LongEdgeID net = line_numbers[col_counter];
col_counter += selecting_factor;
PartitionID & old_block = edges_data[net-1];
if (old_block == CUT_NET) continue;
#if defined MODE_CONNECTIVITY
old_block = assigned_block;
#else
old_block = (old_block==assigned_block || old_block==INVALID_PARTITION) ? assigned_block : CUT_NET;
#endif
}
#endif
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

register_result_from_input mutates shared state (stream_blocks_weight and stream_edges_assign) without any synchronization. When called from the OpenMP parallel loop this is a data race and can lead to non-deterministic results. Consider using atomics/per-thread buffers with a merge step, or otherwise ensure this function is only called in a single-threaded context.

Suggested change
(*config.stream_nodes_assign)[curr_node] = assigned_block;
(*config.stream_blocks_weight)[assigned_block] += 1;
#if defined MODE_NETLIST
PartitionID* edges_data = config.stream_edges_assign->data();
LongNodeID col_counter = config.read_nw ? 1 : 0;
LongNodeID line_size = line_numbers.size();
PartitionID selecting_factor = 1 + (PartitionID)config.read_ew;
while (col_counter < line_size) {
LongEdgeID net = line_numbers[col_counter];
col_counter += selecting_factor;
PartitionID & old_block = edges_data[net-1];
if (old_block == CUT_NET) continue;
#if defined MODE_CONNECTIVITY
old_block = assigned_block;
#else
old_block = (old_block==assigned_block || old_block==INVALID_PARTITION) ? assigned_block : CUT_NET;
#endif
}
#endif
#ifdef _OPENMP
#pragma omp critical(register_result_from_input)
#endif
{
(*config.stream_nodes_assign)[curr_node] = assigned_block;
(*config.stream_blocks_weight)[assigned_block] += 1;
#if defined MODE_NETLIST
PartitionID* edges_data = config.stream_edges_assign->data();
LongNodeID col_counter = config.read_nw ? 1 : 0;
LongNodeID line_size = line_numbers.size();
PartitionID selecting_factor = 1 + (PartitionID)config.read_ew;
while (col_counter < line_size) {
LongEdgeID net = line_numbers[col_counter];
col_counter += selecting_factor;
PartitionID & old_block = edges_data[net-1];
if (old_block == CUT_NET) continue;
#if defined MODE_CONNECTIVITY
old_block = assigned_block;
#else
old_block = (old_block==assigned_block || old_block==INVALID_PARTITION) ? assigned_block : CUT_NET;
#endif
}
#endif
}

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown

Copilot AI commented Apr 2, 2026

@copilot check the new changes

Reviewed the Round 2 changes — the main fixes all look correct:

  • First-pass connectivity shortcut (freight.cpp lines 177–178): correctly gated on k <= 64 && !config.read_ew, so n_edges_for_bits is 0 for k>64 or weighted edges, making the bitset loop a no-op. ✅
  • First-pass cut shortcut (lines 281–294): correctly gated on !config.read_ew. ✅
  • OOM guard for connectivity bitset (graph_io_stream.cpp lines 215–226): 256 MiB cap with overflow-safe comparison. ✅
  • mmap hardening (both .h files): fstat return check, cur >= 0 guard, and offset <= file_size guard are all in place. ✅
  • PGO mutual exclusivity (both CMakeLists.txt): FATAL_ERROR when both PGO_GENERATE and PGO_USE are ON. ✅

Two residual edge-case bugs remain in the mmap fallback path (only reached when mmap fails, e.g. on systems where it's unsupported):

  1. Integer underflow(size_t)in.tellg() - offset underflows when offset > end_off (stream somehow positioned past EOF).
  2. Invalid seekgin.seekg(cur) is called without checking cur >= 0; if tellg() returned -1 earlier, this has undefined behaviour.

Fix for both (same in both variants):

} else {
    actual = 0;
    if (cur >= 0) {
        in.seekg(0, std::ios::end);
        std::streampos end_pos = in.tellg();
        if (end_pos >= 0) {
            size_t end_off = (size_t)end_pos;
            size_t remaining = (end_off > offset) ? (end_off - offset) : 0;
            in.seekg(cur);
            if (remaining > 0 && in.good()) {
                buf.resize(remaining);
                in.read(buf.data(), remaining);
                actual = (size_t)in.gcount();
            }
        }
    }
    p = buf.data();
    pend = p + actual;
}

These only affect error paths (mmap failure + invalid stream state), so they're non-critical for normal operation, but worth a follow-up fix for robustness.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants