Performance Optimization

Thierry Soreze edited this page Oct 28, 2025 · 1 revision

This guide covers techniques for optimizing DTUTMO performance, from basic speed improvements to advanced GPU acceleration.

Table of Contents

  1. Performance Overview
  2. CPU Optimization
  3. GPU Acceleration
  4. Memory Optimization
  5. Profiling and Benchmarking
  6. Architecture-Specific Tips

Performance Overview

Baseline Performance

CPU (Intel i7-12700K, single-threaded):

| Resolution | Default | Fast | Research |
|---|---|---|---|
| 1920×1080 | 0.7s | 0.2s | 2.1s |
| 2560×1440 | 1.2s | 0.4s | 3.8s |
| 3840×2160 | 2.8s | 0.9s | 8.4s |

GPU (NVIDIA RTX 4090):

| Resolution | Default | Fast | Research |
|---|---|---|---|
| 1920×1080 | 0.08s | 0.04s | 0.15s |
| 2560×1440 | 0.14s | 0.07s | 0.28s |
| 3840×2160 | 0.32s | 0.16s | 0.65s |

Speedup: roughly 5-14× on GPU, depending on profile (about 5-6× for Fast, 9× for Default, 13-14× for Research)
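
The per-profile speedups follow directly from the two baseline tables. A minimal sketch that recomputes them at 1080p; the dictionary and function names are illustrative, and the timings are copied from the tables:

```python
# Timings in seconds at 1920x1080, copied from the baseline tables above.
cpu_1080p = {"Default": 0.7, "Fast": 0.2, "Research": 2.1}
gpu_1080p = {"Default": 0.08, "Fast": 0.04, "Research": 0.15}

def gpu_speedups(cpu, gpu):
    """Return the CPU-to-GPU time ratio for each profile."""
    return {name: cpu[name] / gpu[name] for name in cpu}

speedups = gpu_speedups(cpu_1080p, gpu_1080p)
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")
```

The Fast profile gains the least because little work remains to offload once the optional stages are disabled.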

Performance Profiles

Fast Preview

config = DTUTMOConfig(
    use_otf=False,
    use_glare=False,
    use_bilateral=False,
    use_cam=CAMType.NONE,
    display_mapping=DisplayMapping.WHITEBOARD,
)

1080p: ~0.2s CPU, ~0.04s GPU

Balanced (Default)

config = DTUTMOConfig(
    use_otf=True,
    use_glare=True,
    use_bilateral=True,
    use_cam=CAMType.DTUCAM,
    display_mapping=DisplayMapping.PRODUCTION_HYBRID,
)

1080p: ~0.7s CPU, ~0.08s GPU

Maximum Quality

config = DTUTMOConfig(
    use_otf=True,
    use_glare=True,
    use_bilateral=True,
    use_cam=CAMType.DTUCAM,
    display_mapping=DisplayMapping.FULL_INVERSE,
)

1080p: ~2.1s CPU, ~0.15s GPU


CPU Optimization

1. NumPy/BLAS Threading

NumPy uses multithreaded BLAS for matrix operations. Control the thread count with environment variables, which must be set before NumPy is first imported:

import os

# Set before importing numpy
os.environ['OMP_NUM_THREADS'] = '8'
os.environ['MKL_NUM_THREADS'] = '8'
os.environ['OPENBLAS_NUM_THREADS'] = '8'

import numpy as np
from dtutmo import CompleteDTUTMO

Benchmark:

| Threads | 1080p Time | Speedup |
|---|---|---|
| 1 | 1.2s | 1.0× |
| 4 | 0.8s | 1.5× |
| 8 | 0.7s | 1.7× |
| 16 | 0.7s | 1.7× |

Recommendation: Use 4-8 threads (diminishing returns beyond 8).
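
One way to apply this recommendation without hard-coding a value is to cap the thread count at the number of available cores. This is a sketch only; `configure_blas_threads` is a hypothetical helper, not part of DTUTMO, and it must run before NumPy is imported:

```python
import os

def configure_blas_threads(max_threads=8):
    """Cap BLAS/OpenMP threads; call this before NumPy is first imported."""
    # os.cpu_count() can return None, so fall back to a single thread.
    n = max(1, min(max_threads, os.cpu_count() or 1))
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        os.environ[var] = str(n)
    return n

threads = configure_blas_threads(8)
print(f"BLAS threads capped at {threads}")
```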

2. Disable Optional Stages

Each stage has computational cost:

| Stage | 1080p Time | Speedup if Disabled |
|---|---|---|
| OTF | 0.08s | 1.13× |
| Glare | 0.12s | 1.21× |
| Bilateral | 0.15s | 1.27× |
| CAM | 0.10s | 1.17× |

Cumulative speedup (all disabled): 2.5-3×

# Fastest configuration
config = DTUTMOConfig(
    use_otf=False,
    use_glare=False,
    use_bilateral=False,
    use_cam=CAMType.NONE,
)
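
The "Speedup if Disabled" column is consistent with dividing the 0.7s Default baseline by the baseline minus the stage cost. The sketch below reproduces it under the assumption that stage costs simply add; the function name is illustrative:

```python
# Per-stage costs in seconds at 1080p, copied from the stage table above.
stage_cost = {"OTF": 0.08, "Glare": 0.12, "Bilateral": 0.15, "CAM": 0.10}
baseline = 0.7  # Default-profile CPU time at 1080p from the baseline table

def speedup_without(stages, baseline, costs):
    """Speedup from skipping the given stages, assuming costs simply add."""
    saved = sum(costs[s] for s in stages)
    return baseline / (baseline - saved)

for stage in stage_cost:
    print(f"{stage}: {speedup_without([stage], baseline, stage_cost):.2f}x")

all_off = speedup_without(list(stage_cost), baseline, stage_cost)
print(f"All four disabled: {all_off:.1f}x")
```

With all four stages disabled the estimate is 0.7 / 0.25 = 2.8×, inside the quoted 2.5-3× range.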

3. Display Mapping Strategy

| Strategy | 1080p Time | Relative Speed |
|---|---|---|
| WHITEBOARD | 0.15s | 1.0× (fastest) |
| HYBRID | 0.28s | 0.54× |
| PRODUCTION_HYBRID | 0.35s | 0.43× |
| FULL_INVERSE | 0.85s | 0.18× |

Recommendation:

  • Preview: WHITEBOARD
  • Production: PRODUCTION_HYBRID
  • Research: FULL_INVERSE

4. Image Downsampling

Process at lower resolution, upscale result:

import cv2

def fast_process(tmo, hdr_large, scale=0.5):
    """Process at reduced resolution"""
    # Downscale
    h, w = int(hdr_large.shape[0] * scale), int(hdr_large.shape[1] * scale)
    hdr_small = cv2.resize(hdr_large, (w, h), interpolation=cv2.INTER_AREA)
    
    # Process
    ldr_small = tmo.process(hdr_small)
    
    # Upscale
    ldr_large = cv2.resize(ldr_small, 
                          (hdr_large.shape[1], hdr_large.shape[0]),
                          interpolation=cv2.INTER_CUBIC)
    
    return ldr_large

Performance:

| Scale | Time (4K input) | Quality Loss |
|---|---|---|
| 1.0 | 2.8s | 0% |
| 0.75 | 1.6s | <5% |
| 0.5 | 0.7s | ~10% |
| 0.25 | 0.2s | ~25% |
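
These timings are close to what you get by assuming cost is proportional to pixel count, that is, to the square of the scale factor. A small sketch under that assumption (it ignores the cost of the two resize calls):

```python
def predicted_time(full_res_time, scale):
    """Estimate processing time when both image sides are multiplied by `scale`.

    Assumes cost is proportional to pixel count, i.e. to scale squared.
    """
    return full_res_time * scale ** 2

full_4k = 2.8  # seconds, Default profile at 3840x2160 (baseline table)
for scale in (1.0, 0.75, 0.5, 0.25):
    print(f"scale {scale}: ~{predicted_time(full_4k, scale):.2f}s")
```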

5. FFT Optimization

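OTF and glare use FFT (via scipy.fft). pyfftw can act as a faster scipy.fft backend (optional). Install it from the shell:

pip install pyfftw

Then register it in Python before processing. Enabling the cache alone does not reroute scipy.fft calls:

import pyfftw
import scipy.fft

# Route scipy.fft calls through pyfftw and cache its FFT plans
scipy.fft.set_global_backend(pyfftw.interfaces.scipy_fft)
pyfftw.interfaces.cache.enable()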

Speedup: 1.2-1.5× for FFT operations

6. Data Type Optimization

Use float32 instead of float64:

# Convert input
hdr_f32 = hdr.astype(np.float32)

# Process
ldr = tmo.process(hdr_f32)

Benefits:

  • 2× less memory
  • Faster operations on some hardware
  • Minimal precision loss for imaging

GPU Acceleration

Setup

Install PyTorch with CUDA:

# CUDA 11.8
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

# CUDA 12.1
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

# ROCm (AMD)
pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm5.6

Verify:

import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
if torch.cuda.is_available():  # get_device_name raises without a CUDA device
    print(f"Device: {torch.cuda.get_device_name(0)}")

Basic GPU Usage

import torch
from dtutmo import TorchDTUTMO

# Create GPU tone mapper
tmo = TorchDTUTMO().cuda()

# Prepare input (BCHW format)
hdr_tensor = torch.from_numpy(hdr).permute(2, 0, 1).unsqueeze(0).cuda()

# Process
ldr_tensor = tmo.process(hdr_tensor)

# Convert back
ldr = ldr_tensor.squeeze(0).permute(1, 2, 0).cpu().numpy()

Batch Processing

Process multiple images in parallel:

import torch
from dtutmo import TorchDTUTMO

def batch_process_gpu(image_paths, batch_size=4):
    """Process images in batches on GPU"""
    tmo = TorchDTUTMO().cuda()
    results = []
    
    for i in range(0, len(image_paths), batch_size):
        batch_paths = image_paths[i:i+batch_size]
        
        # Load batch (torch.stack requires every image in the batch
        # to have the same resolution)
        batch = []
        for path in batch_paths:
            hdr = load_hdr(path)
            tensor = torch.from_numpy(hdr).permute(2, 0, 1)
            batch.append(tensor)
        
        batch_tensor = torch.stack(batch).cuda()
        
        # Process batch
        with torch.no_grad():
            ldr_batch = tmo.process(batch_tensor)
        
        # Store results (iterating a BCHW tensor yields CHW images)
        for ldr_tensor in ldr_batch:
            ldr = ldr_tensor.permute(1, 2, 0).cpu().numpy()
            results.append(ldr)
    
    return results

Speedup: Near-linear scaling up to batch size ~8
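
Because `torch.stack` needs every image in a batch to share one resolution, mixed-size inputs have to be grouped before batching. A minimal sketch of that grouping; `batches_by_shape` is a hypothetical helper, and how the shapes are obtained (for example from file headers) is left to the caller:

```python
from collections import defaultdict

def batches_by_shape(items, batch_size=4):
    """Yield lists of paths whose images share one (height, width).

    `items` is an iterable of (path, (height, width)) pairs.
    """
    groups = defaultdict(list)
    for path, shape in items:
        groups[shape].append(path)
    for paths in groups.values():
        for i in range(0, len(paths), batch_size):
            yield paths[i:i + batch_size]

items = [("a.exr", (1080, 1920)), ("b.exr", (2160, 3840)),
         ("c.exr", (1080, 1920)), ("d.exr", (1080, 1920))]
batches = list(batches_by_shape(items, batch_size=2))
print(batches)
```

Each yielded list can then be passed to a loader like the one in `batch_process_gpu` above.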

Mixed Precision

Use automatic mixed precision for faster processing:

import torch
from dtutmo import TorchDTUTMO

tmo = TorchDTUTMO().cuda()

# torch.autocast replaces the deprecated torch.cuda.amp.autocast
with torch.no_grad(), torch.autocast(device_type="cuda"):
    ldr_tensor = tmo.process(hdr_tensor)

Benefits:

  • 1.5-2× faster on Tensor Core GPUs (Volta+)
  • 50% less memory
  • Minimal precision loss

Memory Management

import torch

# Clear cache between images
torch.cuda.empty_cache()

# Cap this process at 80% of the GPU's memory
torch.cuda.set_per_process_memory_fraction(0.8)

# Monitor usage
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

Multi-GPU Processing

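Distribute a batch across multiple GPUs:

import torch
from torch.nn.parallel import DataParallel
from dtutmo import TorchDTUTMO

# Wrap in DataParallel (the module must live on the first GPU)
tmo = TorchDTUTMO().cuda()
tmo_parallel = DataParallel(tmo)

# DataParallel splits the batch dimension across GPUs and calls forward(),
# so this assumes TorchDTUTMO.forward is equivalent to process()
with torch.no_grad():
    ldr_tensor = tmo_parallel(hdr_tensor)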

Scaling:

| GPUs | 4K Time | Speedup |
|---|---|---|
| 1 | 0.32s | 1.0× |
| 2 | 0.18s | 1.8× |
| 4 | 0.11s | 2.9× |

Scaling efficiency: ~90% on 2 GPUs, ~73% on 4 GPUs (communication overhead)
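
Scaling efficiency here means the achieved speedup divided by the number of GPUs. A short sketch that derives it from the table; the function name is illustrative:

```python
def scaling_efficiency(single_gpu_time, multi_gpu_time, n_gpus):
    """Fraction of ideal linear speedup actually achieved."""
    speedup = single_gpu_time / multi_gpu_time
    return speedup / n_gpus

# 4K timings in seconds, copied from the scaling table above.
timings = {1: 0.32, 2: 0.18, 4: 0.11}
for n, t in timings.items():
    print(f"{n} GPU(s): {scaling_efficiency(timings[1], t, n):.0%} efficient")
```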


Memory Optimization

Memory Usage Analysis

For image size $H \times W$:

| Component | Memory |
|---|---|
| Input HDR (float32) | $12HW$ bytes |
| FFT buffers (complex64) | $24HW$ bytes |
| Intermediate stages | $\sim 20HW$ bytes |
| Output LDR (float32) | $12HW$ bytes |
| Peak | $\sim 80HW$ bytes |

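Examples. These figures run about 3.6× above the $\sim 80HW$ estimate (which gives roughly 166 MB at 1920×1080), so budget for the larger numbers: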

| Resolution | Peak Memory |
|---|---|
| 1920×1080 | ~600 MB |
| 2560×1440 | ~1.1 GB |
| 3840×2160 | ~2.4 GB |
| 7680×4320 | ~9.6 GB |
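
For comparison, the $\sim 80HW$ rule of thumb from the component table can be evaluated directly. The sketch below reports the formula value only. It need not match the example table, which may include allocator and library overhead (an assumption, not something this page states):

```python
def estimated_peak_bytes(height, width, bytes_per_pixel=80):
    """Peak memory from the ~80*H*W rule of thumb in the component table."""
    return bytes_per_pixel * height * width

for w, h in [(1920, 1080), (2560, 1440), (3840, 2160), (7680, 4320)]:
    mb = estimated_peak_bytes(h, w) / 1e6
    print(f"{w}x{h}: ~{mb:.0f} MB")
```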

Tiled Processing

For large images, process in tiles:

import numpy as np

def process_tiled(tmo, hdr, tile_size=1024, overlap=64):
    """Process image in overlapping tiles, blended with linear ramps"""
    h, w = hdr.shape[:2]
    output = np.zeros((h, w, 3), dtype=np.float32)
    weight = np.zeros((h, w, 1), dtype=np.float32)
    step = tile_size - overlap

    def ramp(n):
        # 1-D weights: rise linearly over `overlap` pixels at each end, 1 between
        d = np.minimum(np.arange(n) + 1, np.arange(n, 0, -1))
        return np.minimum(d / (overlap + 1), 1.0).astype(np.float32)

    # Generate tiles
    for y in range(0, h, step):
        for x in range(0, w, step):
            # Extract tile
            y_end = min(y + tile_size, h)
            x_end = min(x + tile_size, w)
            tile = hdr[y:y_end, x:x_end]

            # Process
            tile_out = tmo.process(tile)

            # Create blend weights (linear ramp in overlap regions)
            wy, wx = ramp(y_end - y), ramp(x_end - x)
            tile_weight = (wy[:, None] * wx[None, :])[..., None]

            # Blend into output
            output[y:y_end, x:x_end] += tile_out * tile_weight
            weight[y:y_end, x:x_end] += tile_weight

    # Normalize (weights are strictly positive, so every pixel is covered)
    output /= np.maximum(weight, 1e-6)

    return output

Memory reduction: Process 8K image with 2GB instead of 10GB
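
To see how many tiles the default settings produce, the tile grid from the loop in `process_tiled` can be enumerated without touching any pixel data. This is a sketch; `tile_origins` is a hypothetical helper that mirrors the `range(0, h, tile_size - overlap)` stepping:

```python
def tile_origins(height, width, tile_size=1024, overlap=64):
    """Return (y, x, y_end, x_end) for every tile in the grid."""
    step = tile_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than tile_size")
    return [(y, x, min(y + tile_size, height), min(x + tile_size, width))
            for y in range(0, height, step)
            for x in range(0, width, step)]

tiles = tile_origins(4320, 7680)  # 8K frame
print(f"{len(tiles)} tiles, last one: {tiles[-1]}")
```

For an 8K frame this gives a 5 × 8 grid of 40 tiles. The bottom row is only 480 pixels tall, so tiles near the edges are smaller than `tile_size`.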

In-Place Operations

Reduce copies by modifying arrays in-place:

# Instead of:
img_new = img * scale  # Creates copy

# Use:
img *= scale  # In-place

In DTUTMO internals, many operations are already in-place.

Garbage Collection

Force garbage collection between images:

import gc

for img_path in image_paths:
    hdr = load_hdr(img_path)
    ldr = tmo.process(hdr)
    save_ldr(img_path, ldr)
    
    # Force cleanup
    del hdr, ldr
    gc.collect()

Profiling and Benchmarking

Simple Timing

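import time

start = time.perf_counter()  # monotonic, higher resolution than time.time()
ldr = tmo.process(hdr)
elapsed = time.perf_counter() - start

# hdr.size counts every channel value, so use height × width for pixels
n_pixels = hdr.shape[0] * hdr.shape[1]
print(f"Processing time: {elapsed:.3f}s")
print(f"Throughput: {n_pixels / elapsed / 1e6:.1f} Mpixels/s")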
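
A single measurement is noisy and includes one-time setup costs. The sketch below adds a warmup call and takes the median over several runs; `time_call` is a hypothetical helper, and the workload shown is a stand-in for `tmo.process`:

```python
import statistics
import time

def time_call(fn, *args, warmup=1, repeats=5):
    """Return the median wall-clock time of fn(*args) in seconds."""
    for _ in range(warmup):
        fn(*args)  # warm caches and lazy initialisation; not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# Stand-in workload; replace with e.g. time_call(tmo.process, hdr)
elapsed = time_call(sum, range(100_000))
print(f"Median time: {elapsed * 1e3:.3f} ms")
```

On a GPU, CUDA kernels launch asynchronously, so call `torch.cuda.synchronize()` before reading the clock. Otherwise the measurement covers only the launch overhead.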

Detailed Profiling

Use cProfile for detailed analysis:

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()

ldr = tmo.process(hdr)

profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(30)  # Top 30 functions

Sample output:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    0.721    0.721 pipeline.py:45(process)
        1    0.234    0.234    0.234    0.234 photoreceptors.py:145(compute_response)
        1    0.156    0.156    0.156    0.156 optics.py:23(apply_otf)
        1    0.098    0.098    0.098    0.098 glare.py:67(apply_glare)
        ...

Stage-by-Stage Timing

result = tmo.process(hdr, return_intermediate=True)

# Access stage timings
timings = result.get('timings', {})
for stage, duration in timings.items():
    print(f"{stage}: {duration:.3f}s")
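
If per-stage timings are not available from the result, the same breakdown can be collected by hand with a small context manager. This is a generic sketch using only the standard library; `timed` and the stand-in stages are illustrative and not part of DTUTMO:

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    """Accumulate wall-clock seconds spent inside the block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start

# Stand-in stages; wrap your own calls instead.
with timed("load"):
    data = [float(i) for i in range(50_000)]
with timed("scale"):
    data = [v * 0.5 for v in data]

for stage, duration in timings.items():
    print(f"{stage}: {duration:.4f}s")
```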

GPU Profiling

Use PyTorch profiler:

import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    ldr = tmo.process(hdr_tensor)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Memory Profiling

import tracemalloc

tracemalloc.start()

ldr = tmo.process(hdr)

current, peak = tracemalloc.get_traced_memory()
print(f"Current memory: {current / 1e6:.1f} MB")
print(f"Peak memory: {peak / 1e6:.1f} MB")

tracemalloc.stop()

Architecture-Specific Tips

Intel CPUs

Use MKL:

pip install mkl mkl-service

Enable MKL optimizations:

import mkl
mkl.set_num_threads(8)

Benefits: 1.2-1.5× faster linear algebra

AMD CPUs

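Use OpenBLAS. NumPy wheels from PyPI for Linux and Windows already bundle it, so a plain install is enough (openblas-devel is a conda or system package name, not a pip one):

pip install numpy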

Set threads:

import os
os.environ['OPENBLAS_NUM_THREADS'] = '8'

Apple Silicon (M1/M2/M3)

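Use Accelerate framework (NumPy's macOS arm64 wheels link against Accelerate from NumPy 2.0 on macOS 14 or later; earlier wheels bundle OpenBLAS):

import numpy as np
np.show_config()  # Check which BLAS/LAPACK NumPy was built against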

Metal GPU (experimental):

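# PyTorch with MPS backend
import torch
from dtutmo import TorchDTUTMO

# Fall back to CPU when the MPS backend is unavailable
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
tmo = TorchDTUTMO().to(device)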

Performance: 3-5× faster than CPU on M1 Max/Ultra

NVIDIA GPUs

Tensor Cores (RTX 20/30/40 series):

  • Use mixed precision (autocast())
  • Ensure tensor dimensions are multiples of 8
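
A minimal sketch of the multiple-of-8 rule: compute how many rows and columns to append so both spatial sides are divisible by 8. `pad_amounts` is a hypothetical helper; whether padding helps for this pipeline depends on which operations actually run on Tensor Cores.

```python
def pad_amounts(height, width, multiple=8):
    """Rows and columns to append so both sides are multiples of `multiple`."""
    pad_h = (-height) % multiple
    pad_w = (-width) % multiple
    return pad_h, pad_w

print(pad_amounts(1080, 1920))  # 1080 = 8 * 135 and 1920 = 8 * 240
print(pad_amounts(1082, 1925))
```

With PyTorch, the padding itself could be applied to a BCHW tensor with `torch.nn.functional.pad(x, (0, pad_w, 0, pad_h), mode="replicate")`, cropping the output back to the original size afterwards.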

CUDA Graphs (advanced):

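static_input = hdr_tensor.clone()  # the graph always reads from this buffer

# Warm up on a side stream (required before capture)
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    tmo.process(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture processing graph
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = tmo.process(static_input)

# Replay graph (faster): refresh the input buffer in place, then replay
static_input.copy_(hdr_tensor)
graph.replay()  # static_output now holds the new result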

Benefits: 1.2-1.3× faster, lower latency

AMD GPUs

ROCm optimization:

pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm5.6

Set environment:

export HSA_OVERRIDE_GFX_VERSION=10.3.0  # For RX 6000 series
export ROCR_VISIBLE_DEVICES=0

Optimization Checklist

For Speed

  • Use GPU if available (roughly 5-14× speedup in the baseline tables)
  • Set optimal thread count (4-8 threads)
  • Use PRODUCTION_HYBRID or WHITEBOARD mapping
  • Disable optional stages if quality allows
  • Use float32 instead of float64
  • Process multiple images in batches (GPU)
  • Enable mixed precision (GPU, Tensor Cores)

For Memory

  • Use tiled processing for large images
  • Clear GPU cache between images
  • Use float32 to halve memory
  • Disable use_bilateral (saves ~15% memory)
  • Process at reduced resolution
  • Force garbage collection

For Quality

  • Use FULL_INVERSE display mapping
  • Enable all stages (OTF, glare, bilateral)
  • Use DTUCAM for color appearance
  • Process at native resolution
  • Use float64 for maximum precision

Performance Tuning Examples

Example 1: Real-Time Preview

Target: <50ms for 1080p preview

from dtutmo import TorchDTUTMO, DTUTMOConfig, DisplayMapping, CAMType

config = DTUTMOConfig(
    use_otf=False,
    use_glare=False,
    use_bilateral=False,
    use_local_adapt=True,  # Keep for quality
    use_cam=CAMType.NONE,
    display_mapping=DisplayMapping.WHITEBOARD,
)

tmo = TorchDTUTMO(config).cuda()  # GPU

# Result: ~40ms for 1080p

Example 2: Video Processing

Target: 30 fps at 1080p

import torch
from dtutmo import TorchDTUTMO, DTUTMOConfig, DisplayMapping

config = DTUTMOConfig(
    use_otf=True,
    use_glare=False,  # Skip expensive glare
    use_bilateral=False,
    display_mapping=DisplayMapping.PRODUCTION_HYBRID,
)

tmo = TorchDTUTMO(config).cuda()

# Process frames in batches; video_batches is assumed to yield
# BCHW CUDA tensors holding batch_size frames each
batch_size = 8
for frame_batch in video_batches:
    with torch.no_grad(), torch.autocast(device_type="cuda"):
        ldr_batch = tmo.process(frame_batch)
    # Write to video

# Result: ~30-35 fps
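
The frame-rate target translates into a time budget per frame and per batch. A small sketch of that arithmetic; the 0.25 s batch time is a made-up illustration, not a measurement:

```python
def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at the target frame rate."""
    return 1000.0 / fps

def sustained_fps(batch_time_s, batch_size):
    """Frames per second when a batch of `batch_size` takes `batch_time_s`."""
    return batch_size / batch_time_s

print(f"Budget at 30 fps: {frame_budget_ms(30):.1f} ms per frame")
print(f"Batch of 8 must finish within {8 * frame_budget_ms(30):.0f} ms")
print(f"A 0.25 s batch of 8 sustains {sustained_fps(0.25, 8):.0f} fps")
```

Batching raises throughput but also latency, since the first frame of a batch is not available until the whole batch has been processed.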

Example 3: High-Quality Batch

Target: Maximum quality, time not critical

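from multiprocessing import Pool
from dtutmo import CompleteDTUTMO, DTUTMOConfig, DisplayMapping, CAMType

config = DTUTMOConfig(
    observer_age=24,
    use_otf=True,
    use_glare=True,
    use_bilateral=True,
    use_cam=CAMType.DTUCAM,
    display_mapping=DisplayMapping.FULL_INVERSE,
)

tmo = None  # one tone mapper per worker process, created in init_worker

def init_worker():
    global tmo
    tmo = CompleteDTUTMO(config)

def process_one(path):
    hdr = load_hdr(path)
    return tmo.process(hdr)

# The guard is required where workers are spawned (Windows, macOS).
# Consider OMP_NUM_THREADS=1 so 8 workers do not oversubscribe the CPU.
if __name__ == '__main__':
    with Pool(8, initializer=init_worker) as pool:
        results = pool.map(process_one, image_paths)

# Result: up to 8× speedup with 8 workers, given enough cores and memory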

Benchmarking Your System

Run this script to benchmark your system:

import time
import numpy as np
from dtutmo import CompleteDTUTMO, DTUTMOConfig, DisplayMapping, CAMType

def benchmark():
    # Create test image: (height, width, channels) for 1080p landscape
    hdr = np.random.rand(1080, 1920, 3).astype(np.float32) * 100
    
    configs = {
        'Fast': DTUTMOConfig(
            use_otf=False, use_glare=False, use_bilateral=False,
            use_cam=CAMType.NONE, display_mapping=DisplayMapping.WHITEBOARD
        ),
        'Balanced': DTUTMOConfig(
            display_mapping=DisplayMapping.PRODUCTION_HYBRID
        ),
        'Quality': DTUTMOConfig(
            display_mapping=DisplayMapping.FULL_INVERSE
        ),
    }
    
    print(f"{'Config':<12} {'Time (s)':<10} {'Mpixels/s':<12}")
    print("-" * 40)
    
    for name, config in configs.items():
        tmo = CompleteDTUTMO(config)
        
        # Warmup
        _ = tmo.process(hdr)
        
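        # Benchmark
        start = time.perf_counter()
        for _ in range(5):
            _ = tmo.process(hdr)
        elapsed = (time.perf_counter() - start) / 5
        
        # Pixels are height × width; hdr.size would also count the 3 channels
        mpixels_per_sec = hdr.shape[0] * hdr.shape[1] / elapsed / 1e6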
        print(f"{name:<12} {elapsed:<10.3f} {mpixels_per_sec:<12.1f}")

if __name__ == '__main__':
    benchmark()
