Repo for FuseFlow artifact generation
- Getting Started (5 human-minutes + 30 compute-minutes)
- Data Persistence Setup
- Quick Start: Run All Benchmarks (5 human-minutes + 96 compute-hours)
- Run Experiments:
- Run Top-Level Script (5 human-minutes + 96 compute-hours)
- Run Figure 12: Performance Comparison (5 human-minutes + up to 96 compute-hours)
- Run Figure 14: GCN FLOPs and Memory Analysis (5 human-minutes + 5 compute-minutes)
- Run Figure 15a: Sparsity Sweep (5 human-minutes + 1 compute-hour)
- Run Figure 15b: Parallelism Factor Sweep (5 human-minutes + 10 compute-minutes)
- Run Figure 16: Block Size Comparison (5 human-minutes + 30 compute-minutes)
- Run Figure 17: Dataflow Order Sweep (5 human-minutes + 5 compute-minutes)
- Validate All Results
- [Optional] How to Reuse Artifact Beyond the Paper (10+ human-minutes)
- [Optional] Detailed Explanation of What the Top-Level Script Does
This guide assumes a working installation of Docker and some version of Python 3.
- Run the following commands to build the Docker image named `fuseflow-artifact` locally from the files in this GitHub repo. NOTE: Building the Docker image requires ~16GB RAM during LLVM compilation, ~50GB disk space, and takes 10-20 minutes depending on CPU.

  ```shell
  git submodule update --init --recursive
  docker build -t fuseflow-artifact .
  ```
Create host directories for persistent data, then start the container with volume mounts:

```shell
mkdir -p ~/fuseflow_data/results
mkdir -p ~/fuseflow_data/logs
mkdir -p ~/fuseflow_data/checkpoints

docker run -d -it \
  -v ~/fuseflow_data/results:/fuseflow-artifact/results \
  -v ~/fuseflow_data/logs:/fuseflow-artifact/logs \
  -v ~/fuseflow_data/checkpoints:/fuseflow-artifact/checkpoints \
  --name fuseflow-container \
  fuseflow-artifact bash

docker attach fuseflow-container
```

Use CTRL-p, CTRL-q to detach without killing the container.

```shell
docker start fuseflow-container   # Start if stopped
docker attach fuseflow-container  # Reattach to running container
```

Within the Docker container:

```shell
./scripts/run_all_benchmarks.sh 2>&1 | tee logs/run_all_benchmarks.log
```
This script will:

- Run all benchmark experiments (Figures 12, 14, 15a, 15b, 16, 17)
- Generate all PDF plots automatically
- Save results to the `results/` directory (persisted to host via volume mount)
- Log all output to `logs/run_all_benchmarks.log`

Monitor progress (from host machine):

```shell
tail -f ~/fuseflow_data/logs/run_all_benchmarks.log
```

Check results while running (from host machine):

```shell
ls -lh ~/fuseflow_data/results/
```

Once complete, exit the Docker container (CTRL-p, CTRL-q) and view results:

```shell
ls ~/fuseflow_data/results/
```

All PDFs and JSON files are now on your host filesystem.
Within the Docker container, run the following commands to generate all results:
```shell
# Figure 12 - Main performance comparison (SAE, GCN, GraphSAGE, GPT-3)
python3 scripts/run_figure12_benchmarks.py --mode complete 2>&1 | tee logs/figure12.log

# Figure 14 - GCN FLOPs and memory metrics
python3 scripts/process_figure14_metrics.py 2>&1 | tee logs/figure14.log

# Figure 15a - Sparsity sweep
python3 scripts/run_figure15a_sweep.py 2>&1 | tee logs/figure15a.log

# Figure 15b - Parallelism sweep
python3 scripts/run_figure15b_sweep.py 2>&1 | tee logs/figure15b.log

# Figure 16 - Block size comparison
python3 scripts/run_figure16.py 2>&1 | tee logs/figure16.log

# Figure 17 - Dataflow order sweep
python3 scripts/run_figure17_sweep.py 2>&1 | tee logs/figure17.log
```
Results are saved to ~/fuseflow_data/results/ on your host machine.
Figure 12 compares performance across four model architectures (SAE, GCN, GraphSAGE, GPT-3).
Default Configuration (Medium Mode, No HBM):
By default, the artifact runs Figure 12 in medium mode with HBM simulation disabled (--no-hbm) to validate the fusion performance trends shown in the paper while keeping evaluation time practical (~96 compute-hours). This configuration:
- Runs all datasets for all models
- Skips the `fully_fused` configuration only for the largest datasets (GCN MAG, GraphSAGE Collab/MAG), since `fully_fused` is shown to be inefficient overall for GCN and GraphSAGE across all datasets
- Disables HBM memory simulation to reduce total runtime from weeks to days
Parallel Execution:
The Figure 12 benchmark script supports parallel execution of simulation jobs using the --workers flag. By default, the artifact runs with 4 parallel workers (requires peak ~140GB memory). Users can adjust the number of workers to balance speed vs. memory usage:
To modify the worker count, edit the --workers/-w parameter in scripts/run_all_benchmarks.sh or specify it when running manually:

```shell
python3 scripts/run_figure12_benchmarks.py --mode medium --workers 4 --no-hbm 2>&1 | tee logs/figure12.log
```

HBM Simulation: HBM simulation is disabled by default to keep artifact evaluation time reasonable. Enabling HBM increases total runtime to over a week due to detailed memory simulation overhead. Disabling HBM affects absolute latency values but preserves qualitative fusion trends (relative ordering of unfused, partially fused, and fully fused configurations), since these trends are primarily driven by fusion-induced changes in intermediate materialization, recomputation, and coordinate processing rather than peak off-chip bandwidth.
Reviewers who wish to reproduce the paper's HBM-backed results may enable it by removing the --no-hbm flag in scripts/run_all_benchmarks.sh.
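The worker count can be chosen from available RAM. Below is a rough sketch assuming the stated ~140GB peak at 4 workers scales roughly linearly (~35GB per worker); that per-worker figure is an estimate derived from the numbers above, not something the scripts report.

```python
# Heuristic for picking a --workers value that fits in available RAM.
# Assumption: ~35 GB peak per worker (140 GB / 4 workers, per the README),
# capped at the default of 4 workers.
def suggest_workers(available_gb: int, per_worker_gb: int = 35, cap: int = 4) -> int:
    """Return a worker count expected to fit within available_gb of RAM."""
    return max(1, min(cap, available_gb // per_worker_gb))
```

For example, a 64GB machine would get a single worker, trading speed for a much lower memory footprint.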
Benchmark Modes: The Figure 12 sweep script supports three modes with different dataset coverage:
| Mode | SAE Datasets | GCN Datasets | GraphSAGE Datasets | GPT-3 Block Sizes | Notes |
|---|---|---|---|---|---|
| `fast` | All 3 (ImageNet, NIH-CXR, LUNA16) | 3 smaller (Cora, Cora-ML, DBLP) | 3 smaller (Cora, Cora-ML, DBLP) | All 3 (16, 32, 64) | Quick testing (~1 compute-hour) |
| `medium` | All 3 (ImageNet, NIH-CXR, LUNA16) | All 5 (Cora, Cora-ML, DBLP, Collab, MAG) | All 5 (Cora, Cora-ML, DBLP, Collab, MAG) | All 3 (16, 32, 64) | Default for artifact (~96 compute-hours). Skips `fully_fused` for MAG (GCN) and Collab/MAG (GraphSAGE) |
| `complete` | All 3 (ImageNet, NIH-CXR, LUNA16) | All 5 with all fusion types | All 5 with all fusion types | All 3 (16, 32, 64) | Full paper results (>1 week). Runs `fully_fused` on large datasets |
The medium mode skips the fully_fused configuration for the largest GCN and GraphSAGE datasets (which are shown to be inefficient overall) to avoid long-running experiments that would add over a week of simulation time. We only run the experiments that complete in a reasonable time while still validating the key fusion performance trends.
Models evaluated:
| Model | Datasets/Configs |
|---|---|
| SAE (Sparse Autoencoder) | ImageNet, NIH-CXR, LUNA16 |
| GCN | Cora, Cora-ML, DBLP, Collab, MAG |
| GraphSAGE | Cora, Cora-ML, DBLP, Collab, MAG |
| GPT-3 w/ BigBird | Block sizes: 16, 32, 64 |
Choose one of the following options to run:

- Run `--mode fast` for a restricted set of experiments for quick testing (about 1 compute-hour):

  ```shell
  python3 scripts/run_figure12_benchmarks.py --model gcn --gcn-datasets cora --mode fast 2>&1 | tee logs/figure12_fast.log
  ```

- Run `--mode complete` for the full set of benchmarks, which takes over a week:

  ```shell
  python3 scripts/run_figure12_benchmarks.py --mode complete 2>&1 | tee logs/figure12_complete.log
  ```

- Run specific models or datasets:

  ```shell
  # Run only GCN on specific datasets
  python3 scripts/run_figure12_benchmarks.py --model gcn --gcn-datasets cora cora_ml dblp 2>&1 | tee logs/figure12_gcn.log

  # Run only SAE benchmarks
  python3 scripts/run_figure12_benchmarks.py --model sae 2>&1 | tee logs/figure12_sae.log

  # Run only GPT-3 BigBird benchmarks
  python3 scripts/run_figure12_benchmarks.py --model gpt3 2>&1 | tee logs/figure12_gpt3.log
  ```

- The script generates a `figure12_results.json` file with cycle counts for each configuration (automatically saved to `results/`).

- Once all desired benchmarks are run, generate Figure 12 as a PDF:

  ```shell
  python3 scripts/plot_figure12.py --input figure12_results.json --output results/figure12.pdf
  ```
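If you want to sanity-check cycle counts before plotting, the sketch below derives fused-vs-unfused speedups from a results dict. The nested `{benchmark: {config: cycles}}` layout and the config names are assumptions about the JSON schema, not the script's documented format; adapt the keys to what `figure12_results.json` actually contains.

```python
# Sketch: fused-vs-unfused speedups from per-configuration cycle counts.
# Schema assumption: {benchmark: {config_name: cycles}} with "unfused" and
# "fully_fused" among the config names.
def fusion_speedups(results: dict) -> dict:
    """Map each benchmark to unfused_cycles / fully_fused_cycles."""
    return {
        bench: cfgs["unfused"] / cfgs["fully_fused"]
        for bench, cfgs in results.items()
        if "unfused" in cfgs and "fully_fused" in cfgs
    }
```

Benchmarks missing either configuration (e.g. the large datasets where `fully_fused` is skipped in medium mode) are simply omitted from the output.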
Figure 14 analyzes computational efficiency and memory access patterns for GCN.
- Run the following commands:

  ```shell
  python3 scripts/process_figure14_metrics.py 2>&1 | tee logs/figure14.log
  ```

- Generate Figure 14 as a PDF:

  ```shell
  python3 scripts/plot_figure14.py
  ```

  The script will create a plot at /fuseflow-artifact/results/figure14.pdf.
Figure 15a shows how performance varies with different sparsity levels.
- Run the sparsity sweep script:

  ```shell
  python3 scripts/run_figure15a_sweep.py 2>&1 | tee logs/figure15a.log
  ```

  Results are saved to `figure15a_results.json`.

- Generate Figure 15a as a PDF:

  ```shell
  python3 scripts/plot_figure15a.py
  ```

  The script will create a plot at /fuseflow-artifact/results/figure15a.pdf.
Figure 15b shows how performance scales with different parallelization factors.
- Run the parallelism sweep script:

  ```shell
  python3 scripts/run_figure15b_sweep.py 2>&1 | tee logs/figure15b.log
  ```

  Results are saved to `figure15b_results.json`.

- Generate Figure 15b as a PDF:

  ```shell
  python3 scripts/plot_figure15b.py
  ```

  The script will create a plot at /fuseflow-artifact/results/figure15b.pdf.
Figure 16 compares performance across different block sizes.
- Run the block size comparison script:

  ```shell
  python3 scripts/run_figure16.py 2>&1 | tee logs/figure16.log
  ```

  Results are saved to `figure16_results.json`.

- Generate Figure 16 as a PDF:

  ```shell
  python3 scripts/plot_figure16.py
  ```

  The script will create a plot at /fuseflow-artifact/results/figure16.pdf.
Figure 17 evaluates different dataflow ordering strategies.
- Run the dataflow order sweep script:

  ```shell
  python3 scripts/run_figure17_sweep.py 2>&1 | tee logs/figure17.log
  ```

  Results are saved to `figure17_results.json`.

- Generate Figure 17 as a PDF:

  ```shell
  python3 scripts/plot_figure17.py
  ```

  The script will create a plot at /fuseflow-artifact/results/figure17.pdf.
Results are on your host machine at ~/fuseflow_data/results/:

```shell
ls -lh ~/fuseflow_data/results/
```

You can open the PDFs directly from your host machine:

```shell
# On Linux
xdg-open ~/fuseflow_data/results/figure12.pdf

# On macOS
open ~/fuseflow_data/results/figure12.pdf
```

- Validate that the plot in `figure12.pdf` matches Figure 12 in the paper (performance comparison).
- Validate that the plot in `figure14.pdf` matches Figure 14 in the paper (GCN FLOPs and memory analysis).
- Validate that the plot in `figure15a.pdf` matches Figure 15a in the paper (sparsity sweep).
- Validate that the plot in `figure15b.pdf` matches Figure 15b in the paper (parallelism sweep).
- Validate that the plot in `figure16.pdf` matches Figure 16 in the paper (block size comparison).
- Validate that the plot in `figure17.pdf` matches Figure 17 in the paper (dataflow order sweep).
Please note that all active development beyond this paper is located in the main repositories and not this artifact repository.
The FuseFlow compiler transforms high-level MLIR (Linalg + SparseTensor) to SAM dataflow programs.
All compiler source files can be found in /fuseflow-artifact/fuseflow-compiler/.
To compile MLIR to SAM dataflow:

```shell
cd /fuseflow-artifact

# Compile MLIR to SAMML dialect
./fuseflow-compiler/build/tools/sam-opt --linalg-to-sam <input.mlir>

# Emit protobuf binary for simulator
./fuseflow-compiler/build/tools/sam-opt --linalg-to-sam <input.mlir> | \
  ./fuseflow-compiler/build/tools/sam-translate --emit-proto

# With parallelization
./fuseflow-compiler/build/tools/sam-opt --linalg-to-sam \
  "--stream-parallelizer=stream-level=0 par-factor=4" <input.mlir> | \
  ./fuseflow-compiler/build/tools/sam-translate --emit-proto

# With vectorization
./fuseflow-compiler/build/tools/sam-opt --linalg-to-sam \
  "--stream-vectorizer=stream-shape=16" <input.mlir> | \
  ./fuseflow-compiler/build/tools/sam-translate --emit-proto
```

Use `./fuseflow-compiler/build/tools/sam-opt --help` for specific instructions on compiler passes.
The Comal simulator is a cycle-accurate dataflow simulator written in Rust with Python bindings.
Source files can be found in /fuseflow-artifact/comal/.
To use the simulator programmatically:

```python
import comal
# See scripts/run_end_to_end.py for usage examples
```

For manual testing or debugging, use the end-to-end script directly:

```shell
# GCN example
python3 scripts/run_end_to_end.py \
  --infile fuseflow-compiler/tests/models/gcn_unfused/gcn_sparse.mlir \
  --build fuseflow-compiler/build \
  --sparsity 0.5 \
  --par 1

# GPT-3 with BigBird example
python3 scripts/run_end_to_end.py \
  --infile fuseflow-compiler/tests/models/gpt-3/outLinear_layernorm_FFN_layernorm_QKVprojection.mlir \
  --build fuseflow-compiler/build \
  --block 64 \
  --sparsity 0.9 \
  --outformat UNC
```
The Tortilla visualizer provides dataflow graph visualization.
Source files can be found in /fuseflow-artifact/tortilla-visualizer/.
The run_figure12_benchmarks.py script performs the following:

- For each model (SAE, GCN, GraphSAGE, GPT-3) and dataset:
  - Compiles the MLIR representation using FuseFlow
  - Runs benchmark configurations
  - Collects cycle counts from the Comal simulator
- Saves all results to `figure12_results.json`

The plot_figure12.py script:

- Loads results from JSON
- Generates bar charts showing performance comparison across models

Expected results: Performance variations across different models and datasets.
The process_figure14_metrics.py script:
- Analyzes GCN workloads across different fusion configurations
- Collects FLOPs (floating-point operations) counts
- Collects memory access patterns and traffic
The plot_figure14.py script generates visualizations showing:
- Computational efficiency improvements
- Memory traffic reduction through fusion
Expected results: FuseFlow reduces memory traffic by exploiting spatial locality through fusion.
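From raw FLOP counts and memory traffic, two derived metrics make the fusion benefit concrete. The functions below are illustrative helpers, not part of the artifact's scripts; the inputs would come from the metrics that process_figure14_metrics.py collects.

```python
# Sketch: derived metrics for figure 14-style data (illustrative helpers).
def traffic_reduction(unfused_bytes: float, fused_bytes: float) -> float:
    """Fraction of memory traffic eliminated by fusion."""
    return 1.0 - fused_bytes / unfused_bytes

def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs per byte of memory traffic (higher means less memory-bound)."""
    return flops / bytes_moved
```

A fused configuration that moves 25 GB where the unfused one moves 100 GB yields a 0.75 traffic reduction, and its arithmetic intensity rises correspondingly.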
The run_figure15a_sweep.py script:
- Sweeps across different sparsity levels
- Records performance metrics for each sparsity level
- Saves results to `figure15a_results.json`
The plot_figure15a.py script generates visualizations showing performance vs sparsity.
Expected results: Performance trends correlate with sparsity levels.
The run_figure15b_sweep.py script:
- Sweeps parallelization factors (e.g., 1, 2, 4, 8, 16, 32, 64)
- Records cycle counts for each configuration
- Saves results to `figure15b_results.json`
The plot_figure15b.py script generates:
- Line plot showing cycles vs parallelization factor
- Demonstrates scaling behavior
Expected results: Near-linear scaling with parallelization factor up to hardware limits.
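Near-linear scaling can be checked numerically. The sketch below computes parallel efficiency from a `{par_factor: cycles}` mapping, as one might extract from figure15b_results.json; the exact field layout of that file is an assumption.

```python
# Sketch: parallel efficiency per parallelization factor.
# Efficiency 1.0 means ideal linear scaling relative to the par=1 baseline.
def parallel_efficiency(cycles_by_par: dict) -> dict:
    """Map each parallelization factor p to (cycles[1] / cycles[p]) / p."""
    base = cycles_by_par[1]
    return {p: (base / c) / p for p, c in cycles_by_par.items()}
```

Values near 1.0 across factors indicate the near-linear scaling described above; efficiency dropping at large factors marks where hardware limits take over.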
The run_figure16.py script:
- Runs benchmarks with different block sizes
- Collects performance metrics
- Saves results to `figure16_results.json`
The plot_figure16.py script generates visualizations comparing block size performance.
Expected results: Performance varies with block size based on hardware characteristics.
The run_figure17_sweep.py script:
- Evaluates different dataflow ordering strategies
- Measures performance impact of each ordering
- Saves results to `figure17_results.json`
The plot_figure17.py script generates visualizations showing impact of dataflow ordering.
Expected results: Different dataflow orders exhibit varying performance characteristics.
```
fuseflow-artifact/
├── fuseflow-compiler/     # FuseFlow compiler (MLIR-based)
│   ├── external/          # External dependencies (llvm-project, or-tools)
│   ├── lib/               # Compiler library sources
│   ├── tools/             # sam-opt, sam-translate binaries
│   └── tests/             # Test MLIR files and models
├── comal/                 # Cycle-accurate dataflow simulator (Rust)
│   ├── src/               # Simulator source code
│   └── external/          # Ramulator2 DRAM wrapper
├── sam/                   # Sparse Abstract Machine library
├── scripts/               # Benchmark and plotting scripts
├── tortilla-visualizer/   # Dataflow graph visualization tool
├── results/               # Output directory (mounted to host)
├── logs/                  # Log files (mounted to host)
├── checkpoints/           # Checkpoint files (mounted to host)
├── setup.sh               # Automated setup script
├── Dockerfile             # Docker build configuration
└── README.md              # This file
```
- CPU: x86_64 processor (tested on Intel/AMD)
- RAM: 32GB minimum (64GB recommended for LLVM build and full benchmarks)
- Disk: 50GB free space (LLVM build is large)
- OS: Linux (Ubuntu 20.04+ recommended)
| Issue | Solution |
|---|---|
| LLVM build fails with OOM | Reduce parallelism: use `-j4` or `-j2` instead of `-j$(nproc)` |
| `protoc` version mismatch | Install `protoc` version 24.0 from GitHub releases |
| Rust compilation errors | Update Rust: `rustup update` |
| `maturin` build fails | Ensure virtual environment is activated |
| Out of memory during benchmarks | Use `--mode fast` or reduce parallelization |
| Missing clang/lld | Install: `sudo apt install clang lld` |
| Container deleted after crash | DO NOT use `--rm` flag when running container |
| Data lost on reboot | Ensure volume mounts are specified in `docker run` |
```shell
# Check FuseFlow compiler
./fuseflow-compiler/build/tools/sam-opt --help

# Check Comal Python bindings
python3 -c "import comal; print('Comal OK')"

# Check SAM
python3 -c "import sam; print('SAM OK')"
```
The artifact should reproduce the following key findings from the paper:
- Figure 12: Performance comparison across SAE, GCN, GraphSAGE, and GPT-3 models
- Figure 14: FuseFlow reduces memory traffic by exploiting spatial locality through fusion
- Figure 15a: Performance trends correlate with sparsity levels
- Figure 15b: Near-linear scaling with parallelization factor up to hardware limits
- Figure 16: Performance varies with block size based on hardware characteristics
- Figure 17: Different dataflow orders exhibit varying performance characteristics
Note: Exact cycle counts may vary slightly due to:
- Random initialization of sparse tensors
- Different system configurations
- Floating-point non-determinism
The relative speedups and trends should match the paper's results.
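One way to check trend agreement despite the nondeterminism noted above is to compare only the relative ordering of configurations, not absolute cycle counts. The sketch below does this for two `{config: cycles}` dicts; the data values in the test are illustrative, not reference numbers from the paper.

```python
# Sketch: check that measured results preserve the paper's relative ordering
# of configurations, even when absolute cycle counts differ.
def same_ordering(reference: dict, measured: dict) -> bool:
    """True if both dicts rank their configurations identically by cycles."""
    rank = lambda d: sorted(d, key=d.get)
    return rank(reference) == rank(measured)
```

If the ordering matches (e.g. fully fused fastest, unfused slowest, in both runs), the qualitative trend from the paper is reproduced.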