Performance Optimization
This guide covers techniques for optimizing DTUTMO performance, from basic speed improvements to advanced GPU acceleration.
- Performance Overview
- CPU Optimization
- GPU Acceleration
- Memory Optimization
- Profiling and Benchmarking
- Architecture-Specific Tips
Performance Overview
CPU (Intel i7-12700K, single-threaded):
| Resolution | Default | Fast | Research |
|---|---|---|---|
| 1920×1080 | 0.7s | 0.2s | 2.1s |
| 2560×1440 | 1.2s | 0.4s | 3.8s |
| 3840×2160 | 2.8s | 0.9s | 8.4s |
GPU (NVIDIA RTX 4090):
| Resolution | Default | Fast | Research |
|---|---|---|---|
| 1920×1080 | 0.08s | 0.04s | 0.15s |
| 2560×1440 | 0.14s | 0.07s | 0.28s |
| 3840×2160 | 0.32s | 0.16s | 0.65s |
Speedup: 8-15× on GPU for the Default and Research presets; about 5-6× for the Fast preset.
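The speedup figure can be checked directly against the two tables. The sketch below only divides the CPU timings by the GPU timings copied from above; the dictionary layout and the function name `gpu_speedups` are illustrative and not part of DTUTMO.

```python
# Timings in seconds, copied from the CPU and GPU tables above.
CPU = {
    "1920x1080": {"Default": 0.7, "Fast": 0.2, "Research": 2.1},
    "2560x1440": {"Default": 1.2, "Fast": 0.4, "Research": 3.8},
    "3840x2160": {"Default": 2.8, "Fast": 0.9, "Research": 8.4},
}
GPU = {
    "1920x1080": {"Default": 0.08, "Fast": 0.04, "Research": 0.15},
    "2560x1440": {"Default": 0.14, "Fast": 0.07, "Research": 0.28},
    "3840x2160": {"Default": 0.32, "Fast": 0.16, "Research": 0.65},
}


def gpu_speedups(cpu, gpu):
    """Return {preset: [CPU time / GPU time for each resolution]}."""
    presets = ["Default", "Fast", "Research"]
    return {p: [cpu[r][p] / gpu[r][p] for r in cpu] for p in presets}


speedups = gpu_speedups(CPU, GPU)
for preset, values in speedups.items():
    print(f"{preset:<9} {min(values):.1f}x to {max(values):.1f}x")
```

The Fast preset gains less than the other two in these numbers, so the headline range is best read as applying to the heavier presets.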
Fast configuration:
config = DTUTMOConfig(
    use_otf=False,
    use_glare=False,
    use_bilateral=False,
    use_cam=CAMType.NONE,
    display_mapping=DisplayMapping.WHITEBOARD,
)
1080p: ~0.2s CPU, ~0.04s GPU
Default configuration:
config = DTUTMOConfig(
    use_otf=True,
    use_glare=True,
    use_bilateral=True,
    use_cam=CAMType.DTUCAM,
    display_mapping=DisplayMapping.PRODUCTION_HYBRID,
)
1080p: ~0.7s CPU, ~0.08s GPU
Research configuration:
config = DTUTMOConfig(
    use_otf=True,
    use_glare=True,
    use_bilateral=True,
    use_cam=CAMType.DTUCAM,
    display_mapping=DisplayMapping.FULL_INVERSE,
)
1080p: ~2.1s CPU, ~0.15s GPU
CPU Optimization
NumPy uses multithreaded BLAS for matrix operations. Control the thread count with environment variables:
import os
# Set before importing numpy
os.environ['OMP_NUM_THREADS'] = '8'
os.environ['MKL_NUM_THREADS'] = '8'
os.environ['OPENBLAS_NUM_THREADS'] = '8'
import numpy as np
from dtutmo import CompleteDTUTMO
Benchmark:
| Threads | 1080p Time | Speedup |
|---|---|---|
| 1 | 1.2s | 1.0× |
| 4 | 0.8s | 1.5× |
| 8 | 0.7s | 1.7× |
| 16 | 0.7s | 1.7× |
Recommendation: Use 4-8 threads (diminishing returns beyond 8).
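As a minimal sketch of that recommendation, the helper below caps the BLAS thread count at 8 or at the number of available cores, whichever is smaller. The name `configure_blas_threads` is hypothetical; only the three environment variables come from the guide.

```python
import os


def configure_blas_threads(max_threads=8):
    """Cap BLAS/OpenMP threads through environment variables.

    Call this before NumPy is first imported, because the BLAS
    libraries read these variables when they are loaded.
    """
    available = os.cpu_count() or 1
    n = max(1, min(max_threads, available))
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        os.environ[var] = str(n)
    return n


threads = configure_blas_threads()
print(f"BLAS threads capped at {threads}")
```

Capping at the core count avoids oversubscription on small machines, where asking for 8 threads on 4 cores would likely slow things down.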
Each stage has computational cost:
| Stage | 1080p Time | Speedup if Disabled |
|---|---|---|
| OTF | 0.08s | 1.13× |
| Glare | 0.12s | 1.21× |
| Bilateral | 0.15s | 1.27× |
| CAM | 0.10s | 1.17× |
Cumulative speedup (all disabled): 2.5-3×
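The per-stage speedups appear to follow from simple arithmetic: total time divided by the time left once a stage is removed. The sketch below assumes the baseline is the 0.7s Default 1080p CPU time from the overview table; that baseline is an inference from the numbers, not something the guide states.

```python
TOTAL = 0.7  # assumed baseline: Default 1080p CPU time in seconds
STAGE_COST = {"OTF": 0.08, "Glare": 0.12, "Bilateral": 0.15, "CAM": 0.10}


def speedup_if_disabled(total, *costs):
    """Speedup obtained by removing stages with the given costs (seconds)."""
    remaining = total - sum(costs)
    if remaining <= 0:
        raise ValueError("stage costs exceed the total time")
    return total / remaining


for stage, cost in STAGE_COST.items():
    print(f"{stage:<10} {speedup_if_disabled(TOTAL, cost):.2f}x")

all_off = speedup_if_disabled(TOTAL, *STAGE_COST.values())
print(f"All off    {all_off:.2f}x")
```

Under this assumption the four stages account for 0.45s of the 0.7s, which is why disabling all of them lands inside the 2.5-3× range.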
# Fastest configuration
config = DTUTMOConfig(
    use_otf=False,
    use_glare=False,
    use_bilateral=False,
    use_cam=CAMType.NONE,
)
The display mapping strategy also affects processing time:
| Strategy | 1080p Time | Relative Speed |
|---|---|---|
| WHITEBOARD | 0.15s | 1.0× (fastest) |
| HYBRID | 0.28s | 0.54× |
| PRODUCTION_HYBRID | 0.35s | 0.43× |
| FULL_INVERSE | 0.85s | 0.18× |
Recommendation:
- Preview: WHITEBOARD
- Production: PRODUCTION_HYBRID
- Research: FULL_INVERSE
Process at lower resolution, then upscale the result:
import cv2
def fast_process(tmo, hdr_large, scale=0.5):
    """Process at reduced resolution, then upscale to the original size."""
    # Downscale (cv2.resize takes the target size as (width, height))
    h = int(hdr_large.shape[0] * scale)
    w = int(hdr_large.shape[1] * scale)
    hdr_small = cv2.resize(hdr_large, (w, h), interpolation=cv2.INTER_AREA)
    # Process
    ldr_small = tmo.process(hdr_small)
    # Upscale
    ldr_large = cv2.resize(
        ldr_small,
        (hdr_large.shape[1], hdr_large.shape[0]),
        interpolation=cv2.INTER_CUBIC,
    )
    return ldr_large
Performance:
| Scale | Time (4K input) | Quality Loss |
|---|---|---|
| 1.0 | 2.8s | 0% |
| 0.75 | 1.6s | <5% |
| 0.5 | 0.7s | ~10% |
| 0.25 | 0.2s | ~25% |
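The timings in this table track the pixel count closely: halving both dimensions leaves a quarter of the pixels and roughly a quarter of the time. The sketch below assumes cost is proportional to pixel count (scale squared), which is a simplification, and compares it with the measured values above. The function name is illustrative.

```python
def estimated_time(full_res_time, scale):
    """Estimate processing time, assuming cost scales with pixel count."""
    if not 0 < scale <= 1:
        raise ValueError("scale must be in (0, 1]")
    return full_res_time * scale ** 2


FULL_4K_TIME = 2.8  # seconds at scale 1.0, from the table above
MEASURED = {1.0: 2.8, 0.75: 1.6, 0.5: 0.7, 0.25: 0.2}

for scale, measured in MEASURED.items():
    predicted = estimated_time(FULL_4K_TIME, scale)
    print(f"scale {scale:<5} predicted {predicted:.2f}s  measured {measured:.1f}s")
```

The fit suggests fixed overheads are small at these sizes, so the quadratic rule is a reasonable first guess when picking a scale for a time budget.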
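OTF and glare use FFT (via scipy.fft). Optionally install pyfftw for faster FFTs:
pip install pyfftw
Enable it in code:
import pyfftw
import scipy.fft
# Assumption: DTUTMO calls scipy.fft directly, so register pyFFTW as its backend
scipy.fft.set_global_backend(pyfftw.interfaces.scipy_fft)
# Cache FFT plans so repeated transforms of the same shape are cheap
pyfftw.interfaces.cache.enable()
Speedup: 1.2-1.5× for FFT operations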
Use float32 instead of float64:
# Convert input
hdr_f32 = hdr.astype(np.float32)
# Process
ldr = tmo.process(hdr_f32)
Benefits:
- 2× less memory
- Faster operations on some hardware
- Minimal precision loss for imaging
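To make the memory and precision claims concrete, the sketch below converts a synthetic 1080p frame and reports its size in both dtypes together with the worst relative rounding error. The frame is random data standing in for a real HDR image.

```python
import numpy as np

# Synthetic 1080p HDR frame (height, width, channels); values are arbitrary.
hdr64 = np.random.default_rng(0).random((1080, 1920, 3)) * 100.0
hdr32 = hdr64.astype(np.float32)

print(f"float64: {hdr64.nbytes / 1e6:.1f} MB")
print(f"float32: {hdr32.nbytes / 1e6:.1f} MB")

# Largest relative rounding error introduced by the conversion.
rel_err = np.max(
    np.abs(hdr32.astype(np.float64) - hdr64) / np.maximum(hdr64, 1e-12)
)
print(f"max relative error: {rel_err:.2e}")
```

The error measured here is for storage only; errors accumulated through the pipeline stages may be larger and are worth checking on real images.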
GPU Acceleration
Install PyTorch with CUDA:
# CUDA 11.8
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
# CUDA 12.1
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
# ROCm (AMD)
pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm5.6
Verify:
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"Device: {torch.cuda.get_device_name(0)}")
Basic GPU usage:
import torch
from dtutmo import TorchDTUTMO
# Create GPU tone mapper
tmo = TorchDTUTMO().cuda()
# Prepare input (BCHW format)
hdr_tensor = torch.from_numpy(hdr).permute(2, 0, 1).unsqueeze(0).cuda()
# Process
ldr_tensor = tmo.process(hdr_tensor)
# Convert back
ldr = ldr_tensor.squeeze(0).permute(1, 2, 0).cpu().numpy()
Process multiple images in parallel:
import torch
from dtutmo import TorchDTUTMO
def batch_process_gpu(image_paths, batch_size=4):
    """Process images in batches on GPU"""
    tmo = TorchDTUTMO().cuda()
    results = []
    for i in range(0, len(image_paths), batch_size):
        batch_paths = image_paths[i:i+batch_size]
        # Load batch (load_hdr is your own loader returning an HxWx3 array)
        batch = []
        for path in batch_paths:
            hdr = load_hdr(path)
            tensor = torch.from_numpy(hdr).permute(2, 0, 1)
            batch.append(tensor)
        # torch.stack requires every image in the batch to have the same size
        batch_tensor = torch.stack(batch).cuda()
        # Process batch
        with torch.no_grad():
            ldr_batch = tmo.process(batch_tensor)
        # Store results
        for ldr_tensor in ldr_batch:
            ldr = ldr_tensor.permute(1, 2, 0).cpu().numpy()
            results.append(ldr)
    return results
Speedup: Near-linear scaling up to batch size ~8
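The batching loop relies on slicing the path list into fixed-size chunks, with a shorter final chunk when the count does not divide evenly. A minimal sketch of that step in isolation follows; `iter_batches` and the file names are hypothetical.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


paths = [f"frame_{i:03d}.exr" for i in range(10)]  # hypothetical file names
batches = list(iter_batches(paths, batch_size=4))
print([len(b) for b in batches])
```

Keeping the chunking separate from loading makes it easy to group images by resolution first, which matters because a stacked batch needs identically sized images.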
Use automatic mixed precision for faster processing:
import torch
from dtutmo import TorchDTUTMO
tmo = TorchDTUTMO().cuda()
# torch.autocast replaces the deprecated torch.cuda.amp.autocast
with torch.no_grad(), torch.autocast(device_type="cuda"):
    ldr_tensor = tmo.process(hdr_tensor)
Benefits:
- 1.5-2× faster on Tensor Core GPUs (Volta+)
- 50% less memory
- Minimal precision loss
import torch
# Clear cache between images
torch.cuda.empty_cache()
# Set memory allocator
torch.cuda.set_per_process_memory_fraction(0.8) # Use 80% of GPU memory
# Monitor usage
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")
Distribute across multiple GPUs:
import torch
from torch.nn.parallel import DataParallel
from dtutmo import TorchDTUTMO
# Wrap in DataParallel (assumes TorchDTUTMO is an nn.Module whose forward() runs the pipeline)
tmo = TorchDTUTMO().cuda()
tmo_parallel = DataParallel(tmo)
# Process (splits the batch dimension across the available GPUs)
ldr_tensor = tmo_parallel(hdr_tensor)
Scaling:
| GPUs | 4K Time | Speedup |
|---|---|---|
| 1 | 0.32s | 1.0× |
| 2 | 0.18s | 1.8× |
| 4 | 0.11s | 2.9× |
Scaling efficiency: ~90% with 2 GPUs, ~73% with 4 GPUs (communication overhead)
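Scaling efficiency is the measured speedup divided by the number of GPUs. The sketch below derives both from the 4K timings in the table above; the function name `scaling` is illustrative.

```python
TIMES_4K = {1: 0.32, 2: 0.18, 4: 0.11}  # seconds per 4K image, by GPU count


def scaling(times):
    """Return {gpu_count: (speedup, efficiency)} relative to one GPU."""
    base = times[1]
    return {n: (base / t, base / t / n) for n, t in times.items()}


for n, (speedup, efficiency) in scaling(TIMES_4K).items():
    print(f"{n} GPU(s): {speedup:.1f}x speedup, {efficiency:.0%} efficiency")
```

Efficiency falls as GPUs are added, so the quoted figure depends on which row of the table is meant.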
Memory Optimization
For image size H × W:
| Component | Memory |
|---|---|
| Input HDR (float32) | H × W × 3 × 4 bytes |
| FFT buffers (complex64) | |
| Intermediate stages | |
| Output LDR (float32) | H × W × 3 × 4 bytes |
| Peak | |
Examples:
| Resolution | Peak Memory |
|---|---|
| 1920×1080 | ~600 MB |
| 2560×1440 | ~1.1 GB |
| 3840×2160 | ~2.4 GB |
| 7680×4320 | ~9.6 GB |
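The peak figures above work out to roughly 290 bytes per pixel at every listed resolution. The sketch below uses that ratio as a rule of thumb; the constant is fitted to this table and is an assumption, not a documented DTUTMO value. The input size, by contrast, is exact for a 3-channel float32 image.

```python
PEAK_BYTES_PER_PIXEL = 290  # assumption: fitted to the example table above


def input_bytes(height, width, channels=3, bytes_per_value=4):
    """Exact size of an HxWxC float32 image in bytes."""
    return height * width * channels * bytes_per_value


def estimate_peak_bytes(height, width):
    """Rough peak memory estimate from the fitted per-pixel ratio."""
    return height * width * PEAK_BYTES_PER_PIXEL


for width, height in [(1920, 1080), (2560, 1440), (3840, 2160), (7680, 4320)]:
    peak_gb = estimate_peak_bytes(height, width) / 1e9
    in_mb = input_bytes(height, width) / 1e6
    print(f"{width}x{height}: input {in_mb:.0f} MB, estimated peak {peak_gb:.2f} GB")
```

An estimate like this is mainly useful for deciding whether an image fits in memory whole or should be tiled.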
For large images, process in tiles:
import numpy as np
def process_tiled(tmo, hdr, tile_size=1024, overlap=64):
    """Process image in overlapping tiles"""
    h, w = hdr.shape[:2]
    output = np.zeros((h, w, 3), dtype=np.float32)
    weight = np.zeros((h, w, 1), dtype=np.float32)
    # Generate tiles; stepping by tile_size - overlap makes neighbours overlap
    step = tile_size - overlap
    for y in range(0, h, step):
        for x in range(0, w, step):
            # Extract tile
            y_end = min(y + tile_size, h)
            x_end = min(x + tile_size, w)
            tile = hdr[y:y_end, x:x_end]
            # Process
            tile_out = tmo.process(tile)
            # Blend weights: uniform here, so overlap regions are plainly averaged
            # (a linear ramp would hide seams better)
            tile_weight = np.ones(tile_out.shape[:2] + (1,), dtype=np.float32)
            # Blend into output
            output[y:y_end, x:x_end] += tile_out * tile_weight
            weight[y:y_end, x:x_end] += tile_weight
    # Normalize
    output /= np.maximum(weight, 1e-6)
    return output
Memory reduction: an 8K image can be processed with about 2 GB instead of about 10 GB.
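The tiling code mentions a linear ramp in the overlap regions. One possible way to build such weights is sketched below; `ramp_1d` and `tile_weight` are hypothetical helpers, not DTUTMO functions. Each weight fades linearly over `overlap` samples on the sides where a neighbouring tile exists and stays at 1 elsewhere. The weights are strictly positive, so the accumulate-then-normalize scheme never divides by zero.

```python
import numpy as np


def ramp_1d(length, overlap, ramp_start, ramp_end):
    """1-D weights: 1 in the interior, linear ramps at the flagged ends."""
    w = np.ones(length, dtype=np.float32)
    n = min(overlap, length)
    if n > 0:
        ramp = np.arange(1, n + 1, dtype=np.float32) / (n + 1)
        if ramp_start:
            w[:n] = np.minimum(w[:n], ramp)
        if ramp_end:
            w[-n:] = np.minimum(w[-n:], ramp[::-1])
    return w


def tile_weight(tile_h, tile_w, overlap, top, bottom, left, right):
    """2-D blend weights of shape (tile_h, tile_w, 1).

    top/bottom/left/right say whether a neighbouring tile exists on that side.
    """
    wy = ramp_1d(tile_h, overlap, top, bottom)
    wx = ramp_1d(tile_w, overlap, left, right)
    return (wy[:, None] * wx[None, :])[..., None]


# 1-D check: two tiles overlapping by 40 samples reproduce a constant signal.
signal = np.full(160, 5.0, dtype=np.float32)
out = np.zeros_like(signal)
acc = np.zeros_like(signal)
for start, end, has_prev, has_next in [(0, 100, False, True), (60, 160, True, False)]:
    w = ramp_1d(end - start, 40, has_prev, has_next)
    out[start:end] += signal[start:end] * w
    acc[start:end] += w
out /= acc
print(float(out.min()), float(out.max()))
```

Blending only smooths the transition between tiles. Stages with a large spatial footprint, such as FFT-based glare, can still differ between a tiled and a whole-image run.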
Reduce copies by modifying arrays in-place:
# Instead of:
img_new = img * scale # Creates copy
# Use:
img *= scale # In-place
In DTUTMO internals, many operations are already in-place.
Force garbage collection between images:
import gc
for img_path in image_paths:
    hdr = load_hdr(img_path)
    ldr = tmo.process(hdr)
    save_ldr(img_path, ldr)
    # Force cleanup
    del hdr, ldr
    gc.collect()
Profiling and Benchmarking
import time
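start = time.perf_counter()
ldr = tmo.process(hdr)
elapsed = time.perf_counter() - start
print(f"Processing time: {elapsed:.3f}s")
# hdr.size counts every channel value, so use height × width for the pixel count
pixels = hdr.shape[0] * hdr.shape[1]
print(f"Throughput: {pixels / elapsed / 1e6:.1f} Mpixels/s")
Use cProfile for detailed analysis: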
import cProfile
import pstats
profiler = cProfile.Profile()
profiler.enable()
ldr = tmo.process(hdr)
profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(30) # Top 30 functions
Sample output:
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.001 0.001 0.721 0.721 pipeline.py:45(process)
1 0.156 0.156 0.156 0.156 optics.py:23(apply_otf)
1 0.098 0.098 0.098 0.098 glare.py:67(apply_glare)
1 0.234 0.234 0.234 0.234 photoreceptors.py:145(compute_response)
...
result = tmo.process(hdr, return_intermediate=True)
# Access stage timings
timings = result.get('timings', {})
for stage, duration in timings.items():
    print(f"{stage}: {duration:.3f}s")
Use PyTorch profiler:
import torch
from torch.profiler import profile, ProfilerActivity
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    ldr = tmo.process(hdr_tensor)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
Track memory usage with tracemalloc:
import tracemalloc
tracemalloc.start()
ldr = tmo.process(hdr)
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory: {current / 1e6:.1f} MB")
print(f"Peak memory: {peak / 1e6:.1f} MB")
tracemalloc.stop()
Architecture-Specific Tips
Use MKL:
pip install mkl mkl-service
Enable MKL optimizations:
import mkl
mkl.set_num_threads(8)
Benefits: 1.2-1.5× faster linear algebra
Use OpenBLAS:
pip install openblas-devel
Set threads:
import os
os.environ['OPENBLAS_NUM_THREADS'] = '8'
Use Accelerate framework (automatic with NumPy 1.21+):
import numpy as np
# Automatically uses Accelerate BLAS
Metal GPU (experimental):
# PyTorch with MPS backend
import torch
device = torch.device("mps")
tmo = TorchDTUTMO().to(device)
Performance: 3-5× faster than CPU on M1 Max/Ultra
Tensor Cores (RTX 20/30/40 series):
- Use mixed precision (autocast())
- Ensure tensor dimensions are multiples of 8
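For the multiple-of-8 advice, the amount of padding an image needs can be computed as below. This is a sketch: `round_up` and `padding_for` are hypothetical helpers, and whether padding pays off for DTUTMO's operations has not been measured here.

```python
def round_up(value, multiple=8):
    """Smallest multiple of `multiple` that is >= value."""
    return -(-value // multiple) * multiple


def padding_for(height, width, multiple=8):
    """Extra rows and columns needed to reach the next multiple."""
    return round_up(height, multiple) - height, round_up(width, multiple) - width


for h, w in [(1080, 1920), (1000, 1501)]:
    pad_h, pad_w = padding_for(h, w)
    print(f"{w}x{h}: pad {pad_w} columns, {pad_h} rows")
```

Common video resolutions such as 1920×1080 already satisfy the constraint, so padding is mostly relevant for cropped or odd-sized inputs. If padding is applied, the extra rows and columns should be cropped from the output.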
CUDA Graphs (advanced):
# Capture processing graph
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    ldr = tmo.process(hdr_tensor)
# Replay graph (faster); it reads hdr_tensor and overwrites ldr in place
graph.replay()
Benefits: 1.2-1.3× faster, lower latency
ROCm optimization:
pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm5.6
Set environment:
export HSA_OVERRIDE_GFX_VERSION=10.3.0 # For RX 6000 series
export ROCR_VISIBLE_DEVICES=0
For speed:
- Use GPU if available (8-15× speedup)
- Set optimal thread count (4-8 threads)
- Use PRODUCTION_HYBRID or WHITEBOARD mapping
- Disable optional stages if quality allows
- Use float32 instead of float64
- Process multiple images in batches (GPU)
- Enable mixed precision (GPU, Tensor Cores)
For memory:
- Use tiled processing for large images
- Clear GPU cache between images
- Use float32 to halve memory
- Disable use_bilateral (saves ~15% memory)
- Process at reduced resolution
- Force garbage collection
For quality:
- Use FULL_INVERSE display mapping
- Enable all stages (OTF, glare, bilateral)
- Use DTUCAM for color appearance
- Process at native resolution
- Use float64 for maximum precision
Target: <50ms for 1080p preview
config = DTUTMOConfig(
    use_otf=False,
    use_glare=False,
    use_bilateral=False,
    use_local_adapt=True,  # Keep for quality
    use_cam=CAMType.NONE,
    display_mapping=DisplayMapping.WHITEBOARD,
)
tmo = TorchDTUTMO(config).cuda()  # GPU
# Result: ~40ms for 1080p
Target: 30 fps at 1080p
import torch
from dtutmo import TorchDTUTMO, DTUTMOConfig, DisplayMapping
config = DTUTMOConfig(
    use_otf=True,
    use_glare=False,  # Skip expensive glare
    use_bilateral=False,
    display_mapping=DisplayMapping.PRODUCTION_HYBRID,
)
tmo = TorchDTUTMO(config).cuda()
# Process frames in batches
batch_size = 8
# video_batches: your own iterable of BCHW tensors holding batch_size frames each
for frame_batch in video_batches:
    with torch.no_grad(), torch.autocast(device_type="cuda"):
        ldr_batch = tmo.process(frame_batch)
    # Write ldr_batch to video
# Result: ~30-35 fps
Target: Maximum quality, time not critical
from multiprocessing import Pool
config = DTUTMOConfig(
    observer_age=24,
    use_otf=True,
    use_glare=True,
    use_bilateral=True,
    use_cam=CAMType.DTUCAM,
    display_mapping=DisplayMapping.FULL_INVERSE,
)
tmo = CompleteDTUTMO(config)
def process_one(path):
    hdr = load_hdr(path)
    return tmo.process(hdr)
# Use multiprocessing; the __main__ guard is required on platforms that spawn workers
if __name__ == '__main__':
    with Pool(8) as pool:
        results = pool.map(process_one, image_paths)
# Result: up to 8× parallel speedup, given 8 free cores
Run this script to benchmark your system:
import time
import numpy as np
from dtutmo import CompleteDTUTMO, DTUTMOConfig, DisplayMapping, CAMType
def benchmark():
    # Create a synthetic 1080p test image (height, width, channels)
    hdr = np.random.rand(1080, 1920, 3).astype(np.float32) * 100
    configs = {
        'Fast': DTUTMOConfig(
            use_otf=False, use_glare=False, use_bilateral=False,
            use_cam=CAMType.NONE, display_mapping=DisplayMapping.WHITEBOARD
        ),
        'Balanced': DTUTMOConfig(
            display_mapping=DisplayMapping.PRODUCTION_HYBRID
        ),
        'Quality': DTUTMOConfig(
            display_mapping=DisplayMapping.FULL_INVERSE
        ),
    }
    # hdr.size would count every channel value; throughput is per pixel
    pixels = hdr.shape[0] * hdr.shape[1]
    print(f"{'Config':<12} {'Time (s)':<10} {'Mpixels/s':<12}")
    print("-" * 40)
    for name, config in configs.items():
        tmo = CompleteDTUTMO(config)
        # Warmup
        _ = tmo.process(hdr)
        # Benchmark: average over 5 runs
        start = time.perf_counter()
        for _ in range(5):
            _ = tmo.process(hdr)
        elapsed = (time.perf_counter() - start) / 5
        mpixels_per_sec = pixels / elapsed / 1e6
        print(f"{name:<12} {elapsed:<10.3f} {mpixels_per_sec:<12.1f}")
if __name__ == '__main__':
    benchmark()
See also:
- Configuration Guide - Tune parameters for your needs
- Pipeline Stages - Understand stage costs
- System Architecture - Memory and threading details
- API Reference - Implementation details