Multipass performance optimizations (~4.3x speedup)#4
Conversation
Not up to standards ⛔🔴 Issues
|
| Category | Results |
|---|---|
| UnusedCode | 1 medium |
| BestPractice | 3 medium |
| ErrorProne | 9 medium |
| Security | 2 critical |
🟢 Metrics 154 complexity · -7 duplication
Metric Results Complexity 154 Duplication -7
TIP This summary will be updated as you push new changes. Give us feedback
Agent-Logs-Url: https://github.com/KaHIP/FREIGHT/sessions/70ec7d09-80a2-4157-ba10-86a0c3d6872c Co-authored-by: schulzchristian <5701578+schulzchristian@users.noreply.github.com>
e13230c to
db91616
Compare
There was a problem hiding this comment.
Pull request overview
This PR targets major speedups for multi-pass streaming partitioning (especially --ram_stream with --num_streams_passes > 1) by caching inputs across passes, reducing per-pass I/O, and optimizing evaluation and inner-loop overhead for both hypergraph (freight_cut / freight_con) and graph (freight_graphs) paths.
Changes:
- Add
ram_streaminput caching via mmap/zero-copy parsing and avoid re-reading the file on subsequent passes. - Make evaluation cheaper by computing only the needed metric per mode and replacing per-block
unordered_settracking with a bitset approach. - Add optional PGO build flags and remove allocator “no-builtin” flags.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| code_for_hypergraphs/lib/partition/partition_config.h | Adds runtime flags to control which evaluation metrics to compute. |
| code_for_hypergraphs/lib/partition/onepass_partitioning/vertex_partitioning.h | Skips OpenMP critical section overhead when single-threaded. |
| code_for_hypergraphs/lib/io/graph_io_stream.h | Adds mmap-based cached input loading and a fast registration path using cached input. |
| code_for_hypergraphs/lib/io/graph_io_stream.cpp | Reworks netlist evaluation to optionally use cached input and faster connectivity tracking. |
| code_for_hypergraphs/CMakeLists.txt | Adds PGO options and removes -fno-builtin-malloc* flags. |
| code_for_hypergraphs/app/freight.cpp | Avoids redundant I/O across passes, adds cached-input registration, and reduces evaluation work per pass. |
| code_for_hypergraphs/app/configuration.h | Initializes new evaluation flags in the standard configuration. |
| code_for_graphs/lib/partition/onepass_partitioning/vertex_partitioning.h | Same OpenMP critical-section optimization as hypergraphs. |
| code_for_graphs/lib/io/graph_io_stream.h | Adds mmap-based cached input loading for ram_stream. |
| code_for_graphs/CMakeLists.txt | Adds PGO options and removes -fno-builtin-malloc* flags. |
| code_for_graphs/app/streammultisection.cpp | Avoids redundant I/O across passes by resetting state for subsequent ram_stream passes. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // First pass: compute connectivity from bitset (sequential, after parallel loop) | ||
| if (n_edges_for_bits > 0 && config.ram_stream) { | ||
| PartitionID* nodes_data = config.stream_nodes_assign->data(); | ||
| for (LongNodeID node = 0; node < config.n_batches; node++) { | ||
| PartitionID block = nodes_data[node]; | ||
| auto& ln = (*input)[node]; | ||
| LongNodeID cc = config.read_nw ? 1 : 0; | ||
| LongNodeID ls = ln.size(); | ||
| PartitionID sf = 1 + (PartitionID)config.read_ew; | ||
| while (cc < ls) { | ||
| size_t net_idx = ln[cc] - 1; | ||
| cc += sf; | ||
| uint64_t bit = 1ULL << (block % 64); | ||
| uint64_t old_word = first_pass_bits[net_idx]; | ||
| if (!(old_word & bit)) { | ||
| if (old_word != 0) first_pass_connectivity += 1; | ||
| first_pass_bits[net_idx] = old_word | bit; | ||
| } | ||
| } |
There was a problem hiding this comment.
The first-pass connectivity shortcut uses a single 64-bit word per net and increments by 1, which breaks correctness when (a) k > 64 (bit collisions via block % 64) and/or (b) the input has edge weights (connectivity should add edge_weight, not 1). Consider reusing streamEvaluateHPartition_netl with cached_input for the first pass, or implement a weighted multi-word bitset with words_per_net = (k+63)/64 (matching the evaluator).
| // First pass: compute connectivity from bitset (sequential, after parallel loop) | |
| if (n_edges_for_bits > 0 && config.ram_stream) { | |
| PartitionID* nodes_data = config.stream_nodes_assign->data(); | |
| for (LongNodeID node = 0; node < config.n_batches; node++) { | |
| PartitionID block = nodes_data[node]; | |
| auto& ln = (*input)[node]; | |
| LongNodeID cc = config.read_nw ? 1 : 0; | |
| LongNodeID ls = ln.size(); | |
| PartitionID sf = 1 + (PartitionID)config.read_ew; | |
| while (cc < ls) { | |
| size_t net_idx = ln[cc] - 1; | |
| cc += sf; | |
| uint64_t bit = 1ULL << (block % 64); | |
| uint64_t old_word = first_pass_bits[net_idx]; | |
| if (!(old_word & bit)) { | |
| if (old_word != 0) first_pass_connectivity += 1; | |
| first_pass_bits[net_idx] = old_word | bit; | |
| } | |
| } | |
| // First pass: compute connectivity (sequential, after parallel loop) | |
| if (n_edges_for_bits > 0 && config.ram_stream) { | |
| PartitionID* nodes_data = config.stream_nodes_assign->data(); | |
| // Use the bitset shortcut only when it is safe: | |
| // - unweighted edges (config.read_ew == 0), so each cut contributes 1 | |
| // - number of blocks k <= 64, so block indices fit in a single 64-bit word | |
| const bool use_bitset_shortcut = (!config.read_ew) && (config.k <= 64); | |
| if (use_bitset_shortcut) { | |
| // Original bitset-based connectivity computation | |
| for (LongNodeID node = 0; node < config.n_batches; node++) { | |
| PartitionID block = nodes_data[node]; | |
| auto& ln = (*input)[node]; | |
| LongNodeID cc = config.read_nw ? 1 : 0; | |
| LongNodeID ls = ln.size(); | |
| PartitionID sf = 1 + (PartitionID)config.read_ew; | |
| while (cc < ls) { | |
| size_t net_idx = ln[cc] - 1; | |
| cc += sf; | |
| uint64_t bit = 1ULL << (block % 64); | |
| uint64_t old_word = first_pass_bits[net_idx]; | |
| if (!(old_word & bit)) { | |
| if (old_word != 0) first_pass_connectivity += 1; | |
| first_pass_bits[net_idx] = old_word | bit; | |
| } | |
| } | |
| } | |
| } else { | |
| // Fallback: exact connectivity for k > 64 and/or weighted edges. | |
| // Track, for each net, which blocks have been seen. | |
| std::vector<std::vector<PartitionID>> net_blocks(n_edges_for_bits); | |
| for (LongNodeID node = 0; node < config.n_batches; node++) { | |
| PartitionID block = nodes_data[node]; | |
| auto& ln = (*input)[node]; | |
| LongNodeID cc = config.read_nw ? 1 : 0; | |
| LongNodeID ls = ln.size(); | |
| PartitionID sf = 1 + (PartitionID)config.read_ew; | |
| while (cc < ls) { | |
| size_t net_idx = ln[cc] - 1; | |
| // Edge weight (if present) is stored immediately after net id. | |
| LongNodeID edge_weight = 1; | |
| if (config.read_ew && cc + 1 < ls) { | |
| edge_weight = ln[cc + 1]; | |
| } | |
| cc += sf; | |
| auto& blocks_for_net = net_blocks[net_idx]; | |
| bool seen = false; | |
| for (PartitionID b : blocks_for_net) { | |
| if (b == block) { | |
| seen = true; | |
| break; | |
| } | |
| } | |
| if (!seen) { | |
| // If this net already touches some block(s), adding a new block | |
| // increases connectivity by the edge weight (or 1 if unweighted). | |
| if (!blocks_for_net.empty()) { | |
| first_pass_connectivity += static_cast<double>(edge_weight); | |
| } | |
| blocks_for_net.push_back(block); | |
| } | |
| } | |
| } |
| // First pass cut mode: count CUT_NET from stream_edges_assign directly | ||
| for (const auto& entry : *config.stream_edges_assign) { | ||
| if (entry == CUT_NET) pass_cut += 1; | ||
| } | ||
| pass_qap = pass_cut; |
There was a problem hiding this comment.
In cut-net mode, the first-pass objective shortcut counts CUT_NET entries rather than summing net weights from the input. This changes the objective when read_ew is enabled (or when nets have non-unit weights), potentially selecting a different “best” pass than the baseline. Safer option: run streamEvaluateHPartition_netl using cached_input (or store per-net weights once) even on the first pass.
| size_t words_per_net = ((size_t)config.k + 63) / 64; | ||
| std::vector<uint64_t> net_block_bits; | ||
| if (compute_connectivity) { | ||
| net_block_bits.assign(n_edges * words_per_net, 0); |
There was a problem hiding this comment.
Connectivity evaluation now allocates net_block_bits as n_edges * words_per_net (where words_per_net = (k+63)/64). For large k and/or large n_edges this can become very large (O(n_edges*k) bits) and may cause OOM or heavy paging during evaluation. Consider using the bitset fast-path only for small k (e.g., k<=64 or a memory threshold) and falling back to the previous sparse approach for larger k.
| size_t words_per_net = ((size_t)config.k + 63) / 64; | |
| std::vector<uint64_t> net_block_bits; | |
| if (compute_connectivity) { | |
| net_block_bits.assign(n_edges * words_per_net, 0); | |
| // For connectivity tracking, use a bitset fast path only when memory usage is reasonable. | |
| // Otherwise, disable connectivity evaluation to avoid excessive memory consumption. | |
| size_t words_per_net = 0; | |
| std::vector<uint64_t> net_block_bits; | |
| if (compute_connectivity) { | |
| words_per_net = ((size_t)config.k + 63) / 64; | |
| // Guard against overflow in n_edges * words_per_net | |
| bool disable_connectivity = false; | |
| if (n_edges != 0 && words_per_net > std::numeric_limits<size_t>::max() / n_edges) { | |
| disable_connectivity = true; | |
| } else { | |
| size_t total_words = n_edges * words_per_net; | |
| // Impose a maximum size (in 64-bit words) for the connectivity bitset. | |
| // Here we cap at 256 MiB worth of uint64_t entries. | |
| const size_t MAX_CONNECTIVITY_WORDS = | |
| (static_cast<size_t>(256) * 1024 * 1024) / sizeof(uint64_t); | |
| if (total_words > MAX_CONNECTIVITY_WORDS) { | |
| disable_connectivity = true; | |
| } else { | |
| net_block_bits.assign(total_words, 0); | |
| } | |
| } | |
| if (disable_connectivity) { | |
| compute_connectivity = false; | |
| } |
| fstat(fd, &st); | ||
| file_size = (size_t)st.st_size; | ||
| #ifdef MAP_POPULATE | ||
| mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0); | ||
| #else | ||
| mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE, fd, 0); | ||
| #endif |
There was a problem hiding this comment.
The mmap fast-path does not check fstat()’s return value (or that file_size is non-zero) before using st.st_size to size the mapping. If fstat fails, file_size may be garbage and mmap/munmap behavior becomes undefined. Please check the return code from fstat (and ensure file_size > 0) and fall back to the ifstream::read path on failure.
| fstat(fd, &st); | |
| file_size = (size_t)st.st_size; | |
| #ifdef MAP_POPULATE | |
| mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0); | |
| #else | |
| mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE, fd, 0); | |
| #endif | |
| if (fstat(fd, &st) == 0) { | |
| file_size = (size_t)st.st_size; | |
| if (file_size > 0) { | |
| #ifdef MAP_POPULATE | |
| mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0); | |
| #else | |
| mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE, fd, 0); | |
| #endif | |
| } | |
| } |
| std::ifstream& in = *(partition_config.stream_in); | ||
| std::streampos cur = in.tellg(); | ||
| size_t offset = (size_t)cur; | ||
|
|
||
| // mmap the entire file for zero-copy access (POSIX: Linux + macOS) | ||
| int fd = open(partition_config.graph_filename.c_str(), O_RDONLY); |
There was a problem hiding this comment.
offset is derived from in.tellg() and cast to size_t without validation. If tellg() returns std::streampos(-1) (stream error) or if offset > file_size, pointer arithmetic on the mapped buffer will go out of bounds. Please validate tellg() and clamp/guard offset against file_size (fall back to the buffered read path when invalid).
| fstat(fd, &st); | ||
| file_size = (size_t)st.st_size; | ||
| #ifdef MAP_POPULATE | ||
| mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0); | ||
| #else | ||
| mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE, fd, 0); | ||
| #endif |
There was a problem hiding this comment.
The mmap fast-path does not check fstat()’s return value before using st.st_size to size the mapping. If fstat fails, file_size may be invalid and mmap/munmap behavior becomes undefined. Please check the return code from fstat (and ensure file_size > 0) and fall back to the ifstream::read path on failure.
| fstat(fd, &st); | |
| file_size = (size_t)st.st_size; | |
| #ifdef MAP_POPULATE | |
| mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0); | |
| #else | |
| mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE, fd, 0); | |
| #endif | |
| int stat_res = fstat(fd, &st); | |
| if (stat_res == 0 && st.st_size > 0 && (off_t)offset < st.st_size) { | |
| file_size = (size_t)st.st_size; | |
| #ifdef MAP_POPULATE | |
| mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0); | |
| #else | |
| mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE, fd, 0); | |
| #endif | |
| } else { | |
| mmap_addr = MAP_FAILED; | |
| } |
| size_t offset = (size_t)cur; | ||
|
|
||
| int fd = open(partition_config.graph_filename.c_str(), O_RDONLY); | ||
| void* mmap_addr = MAP_FAILED; | ||
| size_t file_size = 0; | ||
| if (fd != -1) { | ||
| struct stat st; | ||
| fstat(fd, &st); | ||
| file_size = (size_t)st.st_size; | ||
| #ifdef MAP_POPULATE | ||
| mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0); | ||
| #else | ||
| mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE, fd, 0); | ||
| #endif |
There was a problem hiding this comment.
offset is derived from in.tellg() and cast to size_t without validation. If tellg() returns std::streampos(-1) (stream error) or if offset > file_size, pointer arithmetic on the mapped buffer will go out of bounds. Please validate tellg() and guard offset against file_size (fall back to the buffered read path when invalid).
| size_t offset = (size_t)cur; | |
| int fd = open(partition_config.graph_filename.c_str(), O_RDONLY); | |
| void* mmap_addr = MAP_FAILED; | |
| size_t file_size = 0; | |
| if (fd != -1) { | |
| struct stat st; | |
| fstat(fd, &st); | |
| file_size = (size_t)st.st_size; | |
| #ifdef MAP_POPULATE | |
| mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0); | |
| #else | |
| mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE, fd, 0); | |
| #endif | |
| size_t offset = 0; | |
| int fd = open(partition_config.graph_filename.c_str(), O_RDONLY); | |
| void* mmap_addr = MAP_FAILED; | |
| size_t file_size = 0; | |
| if (fd != -1) { | |
| struct stat st; | |
| if (fstat(fd, &st) == 0 && st.st_size > 0) { | |
| file_size = static_cast<size_t>(st.st_size); | |
| #ifdef MAP_POPULATE | |
| mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0); | |
| #else | |
| mmap_addr = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE, fd, 0); | |
| #endif | |
| if (mmap_addr != MAP_FAILED) { | |
| if (cur < 0) { | |
| // Invalid stream position: do not use mmap, fall back to buffered path. | |
| munmap(mmap_addr, file_size); | |
| mmap_addr = MAP_FAILED; | |
| } else { | |
| offset = static_cast<size_t>(cur); | |
| if (offset > file_size) { | |
| // Offset beyond end of mapped file: avoid out-of-bounds access. | |
| munmap(mmap_addr, file_size); | |
| mmap_addr = MAP_FAILED; | |
| } | |
| } | |
| } | |
| } | |
| } |
| option(PGO_GENERATE "Build with PGO instrumentation (phase 1)" OFF) | ||
| option(PGO_USE "Build with PGO optimization from collected profiles (phase 2)" OFF) | ||
| if(CMAKE_CXX_COMPILER_ID MATCHES "GNU|Clang") | ||
| if(PGO_GENERATE) | ||
| add_definitions(-fprofile-generate=${CMAKE_BINARY_DIR}/pgo_data) | ||
| set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -fprofile-generate=${CMAKE_BINARY_DIR}/pgo_data") | ||
| endif() | ||
| if(PGO_USE) | ||
| add_definitions(-fprofile-use=${CMAKE_BINARY_DIR}/pgo_data -fprofile-correction) | ||
| set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -fprofile-use=${CMAKE_BINARY_DIR}/pgo_data") | ||
| endif() |
There was a problem hiding this comment.
PGO_GENERATE and PGO_USE can currently both be enabled at the same time, which will add conflicting -fprofile-generate / -fprofile-use flags and can lead to compiler/linker errors or unpredictable behavior. Consider making these options mutually exclusive (e.g., error out when both are ON, or have PGO_USE override PGO_GENERATE).
| option(PGO_GENERATE "Build with PGO instrumentation (phase 1)" OFF) | ||
| option(PGO_USE "Build with PGO optimization from collected profiles (phase 2)" OFF) | ||
| if(CMAKE_CXX_COMPILER_ID MATCHES "GNU|Clang") | ||
| if(PGO_GENERATE) | ||
| add_definitions(-fprofile-generate=${CMAKE_BINARY_DIR}/pgo_data) | ||
| set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -fprofile-generate=${CMAKE_BINARY_DIR}/pgo_data") | ||
| endif() | ||
| if(PGO_USE) | ||
| add_definitions(-fprofile-use=${CMAKE_BINARY_DIR}/pgo_data -fprofile-correction) | ||
| set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -fprofile-use=${CMAKE_BINARY_DIR}/pgo_data") | ||
| endif() |
There was a problem hiding this comment.
PGO_GENERATE and PGO_USE can both be enabled simultaneously, which will add conflicting -fprofile-generate / -fprofile-use flags and can cause build failures. Consider enforcing mutual exclusivity (or explicit precedence) between these options.
… freight_graphs ~4.3x speedup for multipass streaming partitioning while preserving bit-identical partition output. Applied to both hypergraph and graph streaming partitioners. I/O optimizations: - Cache binary input across multipass iterations (ram_stream) - Skip readFirstLineStream on subsequent ram_stream passes - Use mmap with MAP_POPULATE for zero-copy file loading (macOS compat) - Bulk-read + parse from char buffer with pre-reserved vectors - Skip per-node gettimeofday syscalls in inner loop Evaluation optimizations (hypergraphs): - Use cached data for evaluation instead of re-reading file - Replace unordered_set with flat bitset for connectivity tracking - Skip unused metric per mode (cut-only or connectivity-only) - Specialize bitset check for k<=64 (single word) - Cheap first-pass shortcuts for cut/connectivity (unweighted, k<=64) - Use local scratch vector for inter-pass eval to avoid save/restore - Skip evaluation when partition has converged Processing optimizations: - Bypass OMP critical sections when single-threaded - Eliminate valid_neighboring_nets vector for ram_stream (hypergraphs) - Prefetch stream_edges_assign entries in hot loops (hypergraphs) - Remove -fno-builtin-malloc flags - Use memset for all_blocks_to_keys initialization - Skip dead computations for default config - Add PGO support to CMakeLists.txt (GCC/Clang only) Safety: - OMP atomic for nodes_moved; first-pass bitset in sequential post-pass - Hardened mmap: fstat/tellg/offset validation with ifstream fallback - OOM guard for connectivity bitset (256 MiB cap) - PGO_GENERATE/PGO_USE mutual exclusivity enforced - Default initializers for evaluate_cut/evaluate_connectivity
db91616 to
2e76bfb
Compare
Addressed Copilot review feedback (#5 + inline comments)Round 1 (from #5)
Round 2 (from inline review comments)
All 32 configs verified bit-identical (24 hypergraph + 8 graph). |
|
@copilot check the new changes |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| decision = solve_linear_complexity(curr_node_id, curr_node_weight, my_thread); | ||
| // Duplicate call is intentional: #pragma omp critical cannot be conditionally skipped | ||
| if (n_threads > 1) { | ||
| #pragma omp critical(update_self_sorting_vector) | ||
| this->sorted_blocks.increment(decision); | ||
| this->sorted_blocks.increment(decision); | ||
| } else { | ||
| this->sorted_blocks.increment(decision); |
There was a problem hiding this comment.
The added comment is inaccurate/misleading: the call to sorted_blocks.increment(decision) is not duplicated, and an OpenMP critical section can be conditionally executed as written. Please update/remove this comment to reflect the actual reason for the branching (skipping critical overhead when n_threads == 1).
| case SAMPLING_INACTIVE_LINEAR_COMPLEXITY: | ||
| decision = solve_linear_complexity(curr_node_id, curr_node_weight, my_thread); | ||
| // Duplicate call is intentional: #pragma omp critical cannot be conditionally skipped | ||
| if (n_threads > 1) { | ||
| #pragma omp critical(update_self_sorting_vector) | ||
| this->sorted_blocks.increment(decision); | ||
| this->sorted_blocks.increment(decision); | ||
| } else { | ||
| this->sorted_blocks.increment(decision); |
There was a problem hiding this comment.
The added comment is inaccurate/misleading: the call to sorted_blocks.increment(decision) is not duplicated, and an OpenMP critical section can be conditionally executed as written. Please update/remove this comment to reflect the actual reason for the branching (skipping critical overhead when n_threads == 1).
| size_t words_per_net = ((size_t)config.k + 63) / 64; | ||
| std::vector<uint64_t> net_block_bits; | ||
| if (compute_connectivity) { | ||
| // Guard against excessive memory usage for large k * n_edges | ||
| size_t total_words = n_edges * words_per_net; | ||
| const size_t MAX_BITSET_WORDS = (size_t(256) * 1024 * 1024) / sizeof(uint64_t); // 256 MiB cap | ||
| if (words_per_net != 0 && n_edges <= MAX_BITSET_WORDS / words_per_net) { | ||
| net_block_bits.assign(total_words, 0); | ||
| } else { | ||
| compute_connectivity = false; // too large, skip connectivity computation | ||
| } |
There was a problem hiding this comment.
When the bitset would exceed the 256 MiB cap, the code sets compute_connectivity = false and proceeds, which will silently return connectivity = 0 and can affect best-pass selection/output. Consider falling back to the previous (set-based) connectivity computation, or at least emitting an explicit error/warning and returning a sentinel so callers don’t treat the result as valid.
| std::vector< NodeID > *perm_rank = NULL; | ||
| matrix* D=NULL; | ||
| if (config.enable_mapping) { | ||
| perm_rank = new std::vector< NodeID >(config.k); | ||
| for( unsigned i = 0; i < perm_rank->size(); i++) { | ||
| (*perm_rank)[i] = i; | ||
| } | ||
| D = new online_distance_matrix(config.k, config.k); | ||
| D->setPartitionConfig(config); | ||
| } |
There was a problem hiding this comment.
perm_rank and D are heap-allocated when enable_mapping is true but are never freed before returning, causing a leak on every evaluation call. Prefer stack ownership (e.g., std::vector and std::unique_ptr) or ensure the allocations are deleted on all exit paths.
| size_t remaining = (size_t)in.tellg() - offset; | ||
| in.seekg(cur); | ||
| buf.resize(remaining); | ||
| in.read(buf.data(), remaining); | ||
| actual = (size_t)in.gcount(); |
There was a problem hiding this comment.
Fallback path casts tellg() results to size_t without validating they’re non-negative/valid. If tellg() returns -1 (or offset > end), this underflows and can attempt to allocate a huge buffer. Add stream state checks and guard the remaining-size computation before resizing the buffer.
| size_t remaining = (size_t)in.tellg() - offset; | |
| in.seekg(cur); | |
| buf.resize(remaining); | |
| in.read(buf.data(), remaining); | |
| actual = (size_t)in.gcount(); | |
| std::streampos endpos = in.tellg(); | |
| size_t remaining = 0; | |
| if (!in.fail() && endpos >= 0) { | |
| size_t end_off = static_cast<size_t>(endpos); | |
| if (end_off > offset) { | |
| remaining = end_off - offset; | |
| } | |
| } | |
| // Restore original position before reading | |
| in.clear(); | |
| in.seekg(cur); | |
| if (remaining > 0) { | |
| buf.resize(remaining); | |
| in.read(buf.data(), remaining); | |
| actual = static_cast<size_t>(in.gcount()); | |
| } else { | |
| buf.clear(); | |
| actual = 0; | |
| } |
| in.seekg(0, std::ios::end); | ||
| size_t remaining = (size_t)in.tellg() - offset; | ||
| in.seekg(cur); | ||
| buf.resize(remaining); | ||
| in.read(buf.data(), remaining); | ||
| actual = (size_t)in.gcount(); | ||
| p = buf.data(); | ||
| pend = p + actual; |
There was a problem hiding this comment.
Fallback path casts tellg() results to size_t without validating they’re non-negative/valid. If tellg() returns -1 (or offset > end), this underflows and can attempt to allocate a huge buffer. Add stream state checks and guard the remaining-size computation before resizing the buffer.
| in.seekg(0, std::ios::end); | |
| size_t remaining = (size_t)in.tellg() - offset; | |
| in.seekg(cur); | |
| buf.resize(remaining); | |
| in.read(buf.data(), remaining); | |
| actual = (size_t)in.gcount(); | |
| p = buf.data(); | |
| pend = p + actual; | |
| // Fallback to reading from stream: guard tellg() and size computations | |
| in.clear(); | |
| in.seekg(0, std::ios::end); | |
| std::streampos end_pos = in.tellg(); | |
| if (!in || end_pos < 0 || static_cast<size_t>(end_pos) < offset) { | |
| // Invalid stream position or offset beyond end: treat as no remaining data | |
| in.clear(); | |
| in.seekg(cur); | |
| actual = 0; | |
| buf.clear(); | |
| p = buf.data(); | |
| pend = p; | |
| } else { | |
| size_t remaining = static_cast<size_t>(end_pos) - offset; | |
| in.clear(); | |
| in.seekg(cur); | |
| buf.resize(remaining); | |
| in.read(buf.data(), remaining); | |
| actual = static_cast<size_t>(in.gcount()); | |
| p = buf.data(); | |
| pend = p + actual; | |
| } |
| io_t.restart(); | ||
| if((config.one_pass_algorithm != ONEPASS_HASHING) && (config.one_pass_algorithm != ONEPASS_HASHING_CRC32)) { | ||
| graph_io_stream::loadBufferLinesToBinary(config, input, 1); | ||
| } |
There was a problem hiding this comment.
buffer_io_time is updated from multiple threads inside the OpenMP parallel loop without synchronization, which is a data race (undefined behavior) even if it’s “just timing”. Use a reduction, an atomic add, or a per-thread accumulator merged after the parallel region.
| } | |
| } | |
| #pragma omp atomic |
| if (!config.ram_stream) { | ||
| io_t.restart(); | ||
| if((config.one_pass_algorithm != ONEPASS_HASHING) && (config.one_pass_algorithm != ONEPASS_HASHING_CRC32)) { | ||
| graph_io_stream::loadBufferLinesToBinary(config, input, 1); | ||
| } | ||
| buffer_io_time += io_t.elapsed(); | ||
| } |
There was a problem hiding this comment.
buffer_io_time is updated from multiple threads inside the OpenMP parallel loop without synchronization, which is a data race (undefined behavior) even if it’s “just timing”. Use a reduction, an atomic add, or a per-thread accumulator merged after the parallel region.
| (*config.stream_nodes_assign)[curr_node] = assigned_block; | ||
| (*config.stream_blocks_weight)[assigned_block] += 1; | ||
| #if defined MODE_NETLIST | ||
| PartitionID* edges_data = config.stream_edges_assign->data(); | ||
| LongNodeID col_counter = config.read_nw ? 1 : 0; | ||
| LongNodeID line_size = line_numbers.size(); | ||
| PartitionID selecting_factor = 1 + (PartitionID)config.read_ew; | ||
| while (col_counter < line_size) { | ||
| LongEdgeID net = line_numbers[col_counter]; | ||
| col_counter += selecting_factor; | ||
| PartitionID & old_block = edges_data[net-1]; | ||
| if (old_block == CUT_NET) continue; | ||
| #if defined MODE_CONNECTIVITY | ||
| old_block = assigned_block; | ||
| #else | ||
| old_block = (old_block==assigned_block || old_block==INVALID_PARTITION) ? assigned_block : CUT_NET; | ||
| #endif | ||
| } | ||
| #endif |
There was a problem hiding this comment.
register_result_from_input mutates shared state (stream_blocks_weight and stream_edges_assign) without any synchronization. When called from the OpenMP parallel loop this is a data race and can lead to non-deterministic results. Consider using atomics/per-thread buffers with a merge step, or otherwise ensure this function is only called in a single-threaded context.
| (*config.stream_nodes_assign)[curr_node] = assigned_block; | |
| (*config.stream_blocks_weight)[assigned_block] += 1; | |
| #if defined MODE_NETLIST | |
| PartitionID* edges_data = config.stream_edges_assign->data(); | |
| LongNodeID col_counter = config.read_nw ? 1 : 0; | |
| LongNodeID line_size = line_numbers.size(); | |
| PartitionID selecting_factor = 1 + (PartitionID)config.read_ew; | |
| while (col_counter < line_size) { | |
| LongEdgeID net = line_numbers[col_counter]; | |
| col_counter += selecting_factor; | |
| PartitionID & old_block = edges_data[net-1]; | |
| if (old_block == CUT_NET) continue; | |
| #if defined MODE_CONNECTIVITY | |
| old_block = assigned_block; | |
| #else | |
| old_block = (old_block==assigned_block || old_block==INVALID_PARTITION) ? assigned_block : CUT_NET; | |
| #endif | |
| } | |
| #endif | |
| #ifdef _OPENMP | |
| #pragma omp critical(register_result_from_input) | |
| #endif | |
| { | |
| (*config.stream_nodes_assign)[curr_node] = assigned_block; | |
| (*config.stream_blocks_weight)[assigned_block] += 1; | |
| #if defined MODE_NETLIST | |
| PartitionID* edges_data = config.stream_edges_assign->data(); | |
| LongNodeID col_counter = config.read_nw ? 1 : 0; | |
| LongNodeID line_size = line_numbers.size(); | |
| PartitionID selecting_factor = 1 + (PartitionID)config.read_ew; | |
| while (col_counter < line_size) { | |
| LongEdgeID net = line_numbers[col_counter]; | |
| col_counter += selecting_factor; | |
| PartitionID & old_block = edges_data[net-1]; | |
| if (old_block == CUT_NET) continue; | |
| #if defined MODE_CONNECTIVITY | |
| old_block = assigned_block; | |
| #else | |
| old_block = (old_block==assigned_block || old_block==INVALID_PARTITION) ? assigned_block : CUT_NET; | |
| #endif | |
| } | |
| #endif | |
| } |
Reviewed the Round 2 changes — the main fixes all look correct:
Two residual edge-case bugs remain in the mmap fallback path (only reached when mmap fails, e.g. on systems where it's unsupported):
Fix for both (same in both variants): } else {
actual = 0;
if (cur >= 0) {
in.seekg(0, std::ios::end);
std::streampos end_pos = in.tellg();
if (end_pos >= 0) {
size_t end_off = (size_t)end_pos;
size_t remaining = (end_off > offset) ? (end_off - offset) : 0;
in.seekg(cur);
if (remaining > 0 && in.good()) {
buf.resize(remaining);
in.read(buf.data(), remaining);
actual = (size_t)in.gcount();
}
}
}
p = buf.data();
pend = p + actual;
}These only affect error paths (mmap failure + invalid stream state), so they're non-critical for normal operation, but worth a follow-up fix for robustness. |
remainingwith(end_off > offset) ? (end_off - offset) : 0(both hypergraph and graph)seekg(cur)in mmap fallback: guard entire fallback read withif (cur >= 0)(both variants)in.good()check afterseekg(cur)beforeread()(both variants)