Performance Optimization

This guide covers techniques for optimizing DTUTMO performance, from basic speed improvements to advanced GPU acceleration.

Performance Overview

Baseline Performance

CPU (Intel i7-12700K, single-threaded):

Resolution	Default	Fast	Research
1920×1080	0.7s	0.2s	2.1s
2560×1440	1.2s	0.4s	3.8s
3840×2160	2.8s	0.9s	8.4s

GPU (NVIDIA RTX 4090):

Resolution	Default	Fast	Research
1920×1080	0.08s	0.04s	0.15s
2560×1440	0.14s	0.07s	0.28s
3840×2160	0.32s	0.16s	0.65s

Speedup: roughly 5-14× on GPU (about 5-6× for Fast, ~9× for Default, 13-14× for Research)
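
The per-profile GPU speedup can be recomputed from the two baseline tables. The sketch below simply divides each CPU time by the matching GPU time; the dictionary layout and the function name `gpu_speedups` are illustrative and not part of DTUTMO.

```python
# Timings in seconds, copied from the baseline tables above.
cpu = {
    "1920x1080": {"Default": 0.7, "Fast": 0.2, "Research": 2.1},
    "2560x1440": {"Default": 1.2, "Fast": 0.4, "Research": 3.8},
    "3840x2160": {"Default": 2.8, "Fast": 0.9, "Research": 8.4},
}
gpu = {
    "1920x1080": {"Default": 0.08, "Fast": 0.04, "Research": 0.15},
    "2560x1440": {"Default": 0.14, "Fast": 0.07, "Research": 0.28},
    "3840x2160": {"Default": 0.32, "Fast": 0.16, "Research": 0.65},
}

def gpu_speedups(cpu_times, gpu_times):
    """Return {resolution: {profile: cpu_time / gpu_time}}."""
    return {
        res: {p: cpu_times[res][p] / gpu_times[res][p] for p in cpu_times[res]}
        for res in cpu_times
    }

speedups = gpu_speedups(cpu, gpu)
for res, row in speedups.items():
    print(res, {p: f"{s:.1f}x" for p, s in row.items()})
```

The lightest profile benefits least from the GPU, which suggests fixed overheads (transfers, kernel launches) dominate when there is little work per image.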

Performance Profiles

Fast Preview

from dtutmo import DTUTMOConfig, CAMType, DisplayMapping

config = DTUTMOConfig(
    use_otf=False,
    use_glare=False,
    use_bilateral=False,
    use_cam=CAMType.NONE,
    display_mapping=DisplayMapping.WHITEBOARD,
)

1080p: ~0.2s CPU, ~0.04s GPU

Balanced (Default)

config = DTUTMOConfig(
    use_otf=True,
    use_glare=True,
    use_bilateral=True,
    use_cam=CAMType.DTUCAM,
    display_mapping=DisplayMapping.PRODUCTION_HYBRID,
)

1080p: ~0.7s CPU, ~0.08s GPU

Maximum Quality

config = DTUTMOConfig(
    use_otf=True,
    use_glare=True,
    use_bilateral=True,
    use_cam=CAMType.DTUCAM,
    display_mapping=DisplayMapping.FULL_INVERSE,
)

1080p: ~2.1s CPU, ~0.15s GPU

CPU Optimization

1. NumPy/BLAS Threading

NumPy uses multithreaded BLAS for matrix operations. Control the thread count with environment variables, set before NumPy is first imported:

import os

# Set before importing numpy
os.environ['OMP_NUM_THREADS'] = '8'
os.environ['MKL_NUM_THREADS'] = '8'
os.environ['OPENBLAS_NUM_THREADS'] = '8'

import numpy as np
from dtutmo import CompleteDTUTMO

Benchmark:

Threads	1080p Time	Speedup
1	1.2s	1.0×
4	0.8s	1.5×
8	0.7s	1.7×
16	0.7s	1.7×

Recommendation: Use 4-8 threads (diminishing returns beyond 8).
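
One way to apply this recommendation is a small helper that caps the thread count and sets all three variables in one place. This is a minimal sketch: `set_blas_threads` and `suggested_threads` are hypothetical helper names, and the cap of 8 simply mirrors the benchmark above.

```python
import os

def suggested_threads(max_threads=8):
    """Cap the thread count at max_threads; os.cpu_count() may return None."""
    return max(1, min(os.cpu_count() or 1, max_threads))

def set_blas_threads(n_threads):
    """Set the common BLAS/OpenMP thread variables.

    Must run before numpy (or anything that imports numpy) is imported,
    because the BLAS library reads these variables when it is loaded.
    """
    value = str(int(n_threads))
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        os.environ[var] = value
    return value

set_blas_threads(suggested_threads())
```

Call the helper at the very top of the entry-point script, above `import numpy` and `from dtutmo import ...`.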

2. Disable Optional Stages

Each stage has computational cost:

Stage	1080p Time	Speedup if Disabled
OTF	0.08s	1.13×
Glare	0.12s	1.21×
Bilateral	0.15s	1.27×
CAM	0.10s	1.17×

Cumulative speedup (all disabled): 2.5-3×

# Fastest configuration
config = DTUTMOConfig(
    use_otf=False,
    use_glare=False,
    use_bilateral=False,
    use_cam=CAMType.NONE,
)
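
The "Speedup if Disabled" column is consistent with removing each stage's cost from the 0.7 s Default baseline at 1080p. The sketch below reproduces that arithmetic; it assumes stage costs are additive and independent, which is a simplification, and `speedup_if_disabled` is an illustrative name.

```python
baseline = 0.7  # seconds, Default profile at 1080p on CPU (from the baseline table)
stage_cost = {"OTF": 0.08, "Glare": 0.12, "Bilateral": 0.15, "CAM": 0.10}

def speedup_if_disabled(total, removed):
    """Speedup from removing `removed` seconds out of a `total`-second run."""
    return total / (total - removed)

for name, cost in stage_cost.items():
    print(f"{name}: {speedup_if_disabled(baseline, cost):.2f}x")

all_off = speedup_if_disabled(baseline, sum(stage_cost.values()))
print(f"All four disabled: {all_off:.1f}x")
```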

3. Display Mapping Strategy

Strategy	1080p Time	Relative Speed
WHITEBOARD	0.15s	1.0× (fastest)
HYBRID	0.28s	0.54×
PRODUCTION_HYBRID	0.35s	0.43×
FULL_INVERSE	0.85s	0.18×

Recommendation:

Preview: WHITEBOARD
Production: PRODUCTION_HYBRID
Research: FULL_INVERSE

4. Image Downsampling

Process at lower resolution, upscale result:

import cv2

def fast_process(tmo, hdr_large, scale=0.5):
    """Process at reduced resolution"""
    # Downscale
    h, w = int(hdr_large.shape[0] * scale), int(hdr_large.shape[1] * scale)
    hdr_small = cv2.resize(hdr_large, (w, h), interpolation=cv2.INTER_AREA)
    
    # Process
    ldr_small = tmo.process(hdr_small)
    
    # Upscale
    ldr_large = cv2.resize(ldr_small, 
                          (hdr_large.shape[1], hdr_large.shape[0]),
                          interpolation=cv2.INTER_CUBIC)
    
    return ldr_large

Performance:

Scale	Time (4K input)	Quality Loss
1.0	2.8s	0%
0.75	1.6s	<5%
0.5	0.7s	~10%
0.25	0.2s	~25%
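
The timings in this table track the pixel count closely: halving the scale quarters the number of pixels and roughly quarters the time. A quick check, assuming cost is proportional to scale squared and ignoring the resize overhead (`predicted_time` is an illustrative helper):

```python
full_res_time = 2.8  # seconds at scale 1.0 (4K input, from the table)
measured = {1.0: 2.8, 0.75: 1.6, 0.5: 0.7, 0.25: 0.2}

def predicted_time(scale, t_full=full_res_time):
    """Assume cost is proportional to pixel count, i.e. to scale squared."""
    return t_full * scale ** 2

for scale, t in measured.items():
    print(f"scale={scale}: measured {t:.1f}s, predicted {predicted_time(scale):.2f}s")
```

This gives a quick way to estimate the scale needed for a time budget before benchmarking.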

5. FFT Optimization

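OTF and glare use FFT (via scipy.fft). pyFFTW can be installed as an optional, faster FFT backend:

pip install pyfftw

# Route scipy.fft calls through pyFFTW and cache FFTW plans
import pyfftw
import scipy.fft

pyfftw.interfaces.cache.enable()
scipy.fft.set_global_backend(pyfftw.interfaces.scipy_fft)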

Speedup: 1.2-1.5× for FFT operations
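
Because only part of the pipeline is FFT-bound, the end-to-end gain is smaller than the FFT speedup itself. The sketch below applies Amdahl's law using the OTF and glare stage costs from section 2; treating those two stages as pure FFT time is an assumption made for illustration.

```python
def overall_speedup(total, accelerated, factor):
    """Amdahl's law: only `accelerated` seconds of `total` get `factor` times faster."""
    return total / ((total - accelerated) + accelerated / factor)

total = 0.7              # s, Default profile at 1080p on CPU
fft_time = 0.08 + 0.12   # s, OTF + glare stage costs (assumed to be all FFT)

for factor in (1.2, 1.5):
    gain = overall_speedup(total, fft_time, factor)
    print(f"FFT {factor}x faster -> pipeline {gain:.2f}x faster")
```

Under these assumptions the whole pipeline gains only about 5-10%, so pyFFTW is most worthwhile when OTF and glare are both enabled.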

6. Data Type Optimization

Use float32 instead of float64:

import numpy as np

# Convert input (a no-op copy if hdr is already float32)
hdr_f32 = hdr.astype(np.float32)

# Process
ldr = tmo.process(hdr_f32)

Benefits:

Half the memory of float64
Faster operations on some hardware
Minimal precision loss for imaging

GPU Acceleration

Setup

Install PyTorch with CUDA:

# CUDA 11.8
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

# CUDA 12.1
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

# ROCm (AMD)
pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm5.6

Verify:

import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
if torch.cuda.is_available():
    # get_device_name raises an error when no CUDA device is present
    print(f"Device: {torch.cuda.get_device_name(0)}")

Basic GPU Usage

import torch
from dtutmo import TorchDTUTMO

# Create GPU tone mapper
tmo = TorchDTUTMO().cuda()

# Prepare input (BCHW format)
hdr_tensor = torch.from_numpy(hdr).permute(2, 0, 1).unsqueeze(0).cuda()

# Process
ldr_tensor = tmo.process(hdr_tensor)

# Convert back
ldr = ldr_tensor.squeeze(0).permute(1, 2, 0).cpu().numpy()

Batch Processing

Process multiple images in parallel:

import torch
from dtutmo import TorchDTUTMO

def batch_process_gpu(image_paths, batch_size=4):
    """Process images in batches on GPU.

    All images in a batch must share the same resolution (torch.stack
    requires equal shapes). load_hdr is a user-supplied loader that
    returns an HxWx3 float array.
    """
    tmo = TorchDTUTMO().cuda()
    results = []

    for i in range(0, len(image_paths), batch_size):
        batch_paths = image_paths[i:i+batch_size]

        # Load batch and convert each image from HWC to CHW
        batch = []
        for path in batch_paths:
            hdr = load_hdr(path)
            tensor = torch.from_numpy(hdr).permute(2, 0, 1)
            batch.append(tensor)

        batch_tensor = torch.stack(batch).cuda()

        # Process batch
        with torch.no_grad():
            ldr_batch = tmo.process(batch_tensor)

        # Move the whole batch to the CPU once, then split into HWC arrays
        ldr_batch = ldr_batch.permute(0, 2, 3, 1).cpu().numpy()
        results.extend(ldr_batch)

    return results

Speedup: Near-linear scaling up to batch size ~8
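
The batching loop can be factored into a small generator, which also makes the short final batch explicit. This is a sketch with a hypothetical helper name (`chunked`) and made-up file names:

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

paths = [f"frame_{i:03d}.exr" for i in range(10)]  # hypothetical file names
for batch in chunked(paths, 4):
    print(len(batch), batch[0], "...", batch[-1])
```

The last batch may be smaller than `batch_size`; code that assumes a fixed batch dimension (for example a captured CUDA graph) needs to pad or handle it separately.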

Mixed Precision

Use automatic mixed precision for faster processing:

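import torch
from dtutmo import TorchDTUTMO

tmo = TorchDTUTMO().cuda()

# torch.autocast supersedes the older torch.cuda.amp.autocast
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    ldr_tensor = tmo.process(hdr_tensor)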

Benefits:

1.5-2× faster on Tensor Core GPUs (Volta+)
50% less memory
Minimal precision loss

Memory Management

import torch

# Release cached, unused GPU memory back to the driver between images
torch.cuda.empty_cache()

# Cap this process at 80% of the GPU's total memory
torch.cuda.set_per_process_memory_fraction(0.8)

# Monitor usage
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

Multi-GPU Processing

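Distribute a batch across multiple GPUs. DataParallel splits the input along the batch dimension, so the batch must contain at least one image per GPU:

import torch
from torch.nn.parallel import DataParallel
from dtutmo import TorchDTUTMO

# Wrap in DataParallel (the module must live on the first GPU)
tmo = TorchDTUTMO().cuda()
tmo_parallel = DataParallel(tmo)

# Process: calling the wrapper invokes the module's forward(),
# which scatters the batch across all visible GPUs
with torch.no_grad():
    ldr_tensor = tmo_parallel(hdr_tensor)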

Scaling:

GPUs	4K Time	Speedup
1	0.32s	1.0×
2	0.18s	1.8×
4	0.11s	2.9×

Scaling efficiency: ~90% on 2 GPUs, ~73% on 4 GPUs (communication overhead)
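
Scaling efficiency is the speedup divided by the number of GPUs. The sketch below derives both from the timings in the table; the function name `scaling` is illustrative.

```python
times = {1: 0.32, 2: 0.18, 4: 0.11}  # seconds per 4K image, from the table

def scaling(times_by_gpus):
    """Return {n_gpus: (speedup, efficiency)} relative to the single-GPU time."""
    t1 = times_by_gpus[1]
    return {n: (t1 / t, t1 / t / n) for n, t in times_by_gpus.items()}

for n, (speedup, eff) in scaling(times).items():
    print(f"{n} GPU(s): {speedup:.1f}x speedup, {eff:.0%} efficiency")
```

Efficiency falls as GPUs are added, so for large offline jobs running one independent process per GPU may be worth comparing against DataParallel.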

Memory Optimization

Memory Usage Analysis

For image size $H \times W$:

Component	Memory
Input HDR (float32)	$12HW$ bytes
FFT buffers (complex64)	$24HW$ bytes
Intermediate stages	$\sim 20HW$ bytes
Output LDR (float32)	$12HW$ bytes
Peak	$\sim 80HW$ bytes

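Examples (note: these figures correspond to roughly 290 bytes per pixel, several times the $\sim 80HW$ estimate from the previous table):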

Resolution	Peak Memory
1920×1080	~600 MB
2560×1440	~1.1 GB
3840×2160	~2.4 GB
7680×4320	~9.6 GB
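
For planning purposes, peak memory can be estimated as a per-pixel cost times the pixel count. The sketch below evaluates the $\sim 80HW$ array estimate directly; the tabulated examples are noticeably higher (about 290 bytes per pixel), so `bytes_per_pixel` is left as a parameter to be calibrated against a measurement on the target system. The function name is illustrative.

```python
def peak_memory_bytes(height, width, bytes_per_pixel=80):
    """Estimate peak memory as bytes_per_pixel * H * W (default: the ~80HW figure)."""
    return bytes_per_pixel * height * width

for w, h in [(1920, 1080), (2560, 1440), (3840, 2160), (7680, 4320)]:
    est = peak_memory_bytes(h, w)
    print(f"{w}x{h}: ~{est / 1e6:.0f} MB at 80 bytes/pixel")
```

Calling `peak_memory_bytes(h, w, bytes_per_pixel=12)` gives the size of a single float32 RGB image, which matches the $12HW$ rows in the component table.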

Tiled Processing

For large images, process in tiles:

import numpy as np

def process_tiled(tmo, hdr, tile_size=1024, overlap=64):
    """Process image in overlapping tiles"""
    h, w = hdr.shape[:2]
    output = np.zeros((h, w, 3), dtype=np.float32)
    weight = np.zeros((h, w, 1), dtype=np.float32)
    
    # Generate tiles
    for y in range(0, h, tile_size - overlap):
        for x in range(0, w, tile_size - overlap):
            # Extract tile
            y_end = min(y + tile_size, h)
            x_end = min(x + tile_size, w)
            tile = hdr[y:y_end, x:x_end]
            
            # Process
            tile_out = tmo.process(tile)
            
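            # Create blend weights (linear ramp over `overlap` pixels at each tile edge)
            th, tw = tile_out.shape[:2]
            ramp = max(overlap, 1)
            wy = np.minimum(np.minimum(np.arange(1, th + 1), np.arange(th, 0, -1)) / ramp, 1.0)
            wx = np.minimum(np.minimum(np.arange(1, tw + 1), np.arange(tw, 0, -1)) / ramp, 1.0)
            tile_weight = (wy[:, None] * wx[None, :])[..., None].astype(np.float32)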
            
            # Blend into output
            output[y:y_end, x:x_end] += tile_out * tile_weight
            weight[y:y_end, x:x_end] += tile_weight
    
    # Normalize
    output /= np.maximum(weight, 1e-6)
    
    return output

Memory reduction: Process 8K image with 2GB instead of 10GB

In-Place Operations

Reduce copies by modifying arrays in-place:

# Instead of:
img_new = img * scale  # Creates copy

# Use:
img *= scale  # In-place

In DTUTMO internals, many operations are already in-place.

Garbage Collection

Force garbage collection between images:

import gc

for img_path in image_paths:
    hdr = load_hdr(img_path)
    ldr = tmo.process(hdr)
    save_ldr(img_path, ldr)
    
    # Force cleanup
    del hdr, ldr
    gc.collect()

Profiling and Benchmarking

Simple Timing

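import time

start = time.perf_counter()  # monotonic, high-resolution timer
ldr = tmo.process(hdr)
elapsed = time.perf_counter() - start

# Count pixels as H x W; hdr.size would also count the 3 color channels
n_pixels = hdr.shape[0] * hdr.shape[1]

print(f"Processing time: {elapsed:.3f}s")
print(f"Throughput: {n_pixels / elapsed / 1e6:.1f} Mpixels/s")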
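
A single measurement is noisy, especially on the first call when caches and FFT plans are cold. A slightly more robust sketch runs a warm-up pass and reports the median of several repeats; `time_call` and `megapixels_per_second` are illustrative helpers, and the workload here is a stand-in for `tmo.process(hdr)`.

```python
import statistics
import time

def time_call(fn, *args, warmup=1, repeats=5):
    """Return the median wall-clock time of fn(*args) over `repeats` runs."""
    for _ in range(warmup):
        fn(*args)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def megapixels_per_second(height, width, seconds):
    """Throughput in pixels (H * W), not array elements (H * W * channels)."""
    return height * width / seconds / 1e6

# Stand-in workload; replace with `lambda: tmo.process(hdr)` in practice.
elapsed = time_call(lambda: sum(range(100_000)))
print(f"{elapsed * 1e3:.2f} ms")
```

When timing GPU code, call `torch.cuda.synchronize()` inside the timed function, since CUDA kernels are launched asynchronously.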

Detailed Profiling

Use cProfile for detailed analysis:

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()

ldr = tmo.process(hdr)

profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(30)  # Top 30 functions

Sample output:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    0.721    0.721 pipeline.py:45(process)
        1    0.156    0.156    0.156    0.156 optics.py:23(apply_otf)
        1    0.098    0.098    0.098    0.098 glare.py:67(apply_glare)
        1    0.234    0.234    0.234    0.234 photoreceptors.py:145(compute_response)
        ...

Stage-by-Stage Timing

result = tmo.process(hdr, return_intermediate=True)

# Access stage timings
timings = result.get('timings', {})
for stage, duration in timings.items():
    print(f"{stage}: {duration:.3f}s")
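
If the timings dictionary is empty, or code around the pipeline (loading, saving, resizing) also needs measuring, a small context manager can collect per-stage wall-clock times. This is a generic sketch independent of DTUTMO; `StageTimer` is a hypothetical class and the workloads are stand-ins.

```python
import time
from contextlib import contextmanager

class StageTimer:
    """Accumulate wall-clock time per named stage."""

    def __init__(self):
        self.timings = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.timings[name] = self.timings.get(name, 0.0) + elapsed

    def report(self):
        total = sum(self.timings.values()) or 1.0
        for name, t in sorted(self.timings.items(), key=lambda kv: -kv[1]):
            print(f"{name:<12} {t:.4f}s  {t / total:5.1%}")

timer = StageTimer()
with timer.stage("load"):
    data = [i * 0.5 for i in range(50_000)]   # stand-in for loading an image
with timer.stage("process"):
    out = [x * x for x in data]               # stand-in for tmo.process(hdr)
timer.report()
```

Repeated use of the same stage name accumulates, so the timer can wrap a loop over many images and report totals at the end.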

GPU Profiling

Use PyTorch profiler:

import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    ldr = tmo.process(hdr_tensor)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Memory Profiling

import tracemalloc

tracemalloc.start()

ldr = tmo.process(hdr)

current, peak = tracemalloc.get_traced_memory()
print(f"Current memory: {current / 1e6:.1f} MB")
print(f"Peak memory: {peak / 1e6:.1f} MB")

tracemalloc.stop()

Architecture-Specific Tips

Intel CPUs

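Use MKL. This helps only when NumPy itself is linked against MKL (for example a conda build); NumPy wheels from PyPI generally bundle OpenBLAS instead. The mkl-service package provides runtime control: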

pip install mkl mkl-service

Enable MKL optimizations:

import mkl
mkl.set_num_threads(8)

Benefits: 1.2-1.5× faster linear algebra

AMD CPUs

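Use OpenBLAS. NumPy wheels from PyPI already bundle OpenBLAS, so a standard install is usually sufficient:

pip install numpy

Set threads (before importing NumPy):

import os
os.environ['OPENBLAS_NUM_THREADS'] = '8'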

Apple Silicon (M1/M2/M3)

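Use the Accelerate framework. NumPy 2.0+ wheels from PyPI are built against Accelerate on macOS 14 and later; older wheels bundle OpenBLAS:

import numpy as np
np.show_config()  # Reports which BLAS/LAPACK library this build uses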

Metal GPU (experimental):

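# PyTorch with MPS backend
import torch
from dtutmo import TorchDTUTMO

# Fall back to the CPU when the MPS backend is unavailable
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
tmo = TorchDTUTMO().to(device)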

Performance: 3-5× faster than CPU on M1 Max/Ultra

NVIDIA GPUs

Tensor Cores (RTX 20/30/40 series):

Use mixed precision (autocast())
Ensure tensor dimensions are multiples of 8
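
Common resolutions such as 1920×1080 already satisfy the multiple-of-8 rule, but cropped or tiled inputs may not. The sketch below computes how much padding each dimension would need; the helper names are illustrative, and the padding itself could be applied with `torch.nn.functional.pad` before processing and cropped off afterwards.

```python
def round_up(n, multiple=8):
    """Smallest multiple of `multiple` that is >= n."""
    return -(-n // multiple) * multiple

def padding_for(height, width, multiple=8):
    """Rows and columns to add so both dimensions are multiples of `multiple`."""
    return round_up(height, multiple) - height, round_up(width, multiple) - width

print(padding_for(1080, 1920))  # already aligned
print(padding_for(1081, 1925))
```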

CUDA Graphs (advanced):

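# Capture processing graph. PyTorch expects a few warm-up iterations on a
# side stream before capture, and all input shapes must stay fixed.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    ldr = tmo.process(hdr_tensor)

# Replay graph (faster). Replay reuses the captured memory: copy each new
# frame into hdr_tensor with hdr_tensor.copy_(...), then read the result from ldr.
graph.replay()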

Benefits: 1.2-1.3× faster, lower latency

AMD GPUs

ROCm optimization:

pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm5.6

Set environment:

export HSA_OVERRIDE_GFX_VERSION=10.3.0  # For RX 6000 series
export ROCR_VISIBLE_DEVICES=0

Optimization Checklist

For Speed

Use GPU if available (roughly 5-14× speedup)
Set optimal thread count (4-8 threads)
Use PRODUCTION_HYBRID or WHITEBOARD mapping
Disable optional stages if quality allows
Use float32 instead of float64
Process multiple images in batches (GPU)
Enable mixed precision (GPU, Tensor Cores)

For Memory

Use tiled processing for large images
Clear GPU cache between images
Use float32 to halve memory
Disable use_bilateral (saves ~15% memory)
Process at reduced resolution
Force garbage collection

For Quality

Use FULL_INVERSE display mapping
Enable all stages (OTF, glare, bilateral)
Use DTUCAM for color appearance
Process at native resolution
Use float64 for maximum precision

Performance Tuning Examples

Example 1: Real-Time Preview

Target: <50ms for 1080p preview

from dtutmo import TorchDTUTMO, DTUTMOConfig, CAMType, DisplayMapping

config = DTUTMOConfig(
    use_otf=False,
    use_glare=False,
    use_bilateral=False,
    use_local_adapt=True,  # Keep for quality
    use_cam=CAMType.NONE,
    display_mapping=DisplayMapping.WHITEBOARD,
)

tmo = TorchDTUTMO(config).cuda()  # GPU

# Result: ~40ms for 1080p

Example 2: Video Processing

Target: 30 fps at 1080p

import torch
from dtutmo import TorchDTUTMO, DTUTMOConfig, DisplayMapping

config = DTUTMOConfig(
    use_otf=True,
    use_glare=False,  # Skip expensive glare
    use_bilateral=False,
    display_mapping=DisplayMapping.PRODUCTION_HYBRID,
)

tmo = TorchDTUTMO(config).cuda()

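# Process frames in batches; video_batches is assumed to yield
# BCHW float tensors of batch_size frames each, already on the GPU
batch_size = 8
for frame_batch in video_batches:
    with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):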
        ldr_batch = tmo.process(frame_batch)
    # Write to video

# Result: ~30-35 fps
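
The frame-rate target translates directly into a time budget per frame and per batch. A small sketch of that arithmetic (the helper names are illustrative):

```python
def frame_budget_ms(fps):
    """Time available per frame to sustain `fps`."""
    return 1000.0 / fps

def batch_budget_ms(fps, batch_size):
    """Time available to process one batch without falling behind."""
    return frame_budget_ms(fps) * batch_size

def achieved_fps(batch_size, batch_seconds):
    """Throughput when a batch of `batch_size` frames takes `batch_seconds`."""
    return batch_size / batch_seconds

print(f"{frame_budget_ms(30):.1f} ms per frame")
print(f"{batch_budget_ms(30, 8):.0f} ms per batch of 8")
```

Batching improves throughput but adds latency: with a batch of 8, the first frame of a batch is not available until the whole batch has been collected and processed, which matters for live preview but not for offline transcoding.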

Example 3: High-Quality Batch

Target: Maximum quality, time not critical

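from multiprocessing import Pool
from dtutmo import CompleteDTUTMO, DTUTMOConfig, CAMType, DisplayMapping

config = DTUTMOConfig(
    observer_age=24,
    use_otf=True,
    use_glare=True,
    use_bilateral=True,
    use_cam=CAMType.DTUCAM,
    display_mapping=DisplayMapping.FULL_INVERSE,
)

tmo = CompleteDTUTMO(config)

def process_one(path):
    hdr = load_hdr(path)  # user-supplied loader
    return tmo.process(hdr)

# The __main__ guard is required on platforms that spawn workers (Windows, macOS)
if __name__ == '__main__':
    with Pool(8) as pool:
        results = pool.map(process_one, image_paths)

# Result: up to 8× parallel speedup on 8+ cores
# (set OMP_NUM_THREADS=1 so the workers do not oversubscribe the CPU)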

Benchmarking Your System

Run this script to benchmark your system:

import time
import numpy as np
from dtutmo import CompleteDTUTMO, DTUTMOConfig, DisplayMapping, CAMType

def benchmark():
    # Create test image (height x width x channels for 1080p)
    hdr = np.random.rand(1080, 1920, 3).astype(np.float32) * 100
    
    configs = {
        'Fast': DTUTMOConfig(
            use_otf=False, use_glare=False, use_bilateral=False,
            use_cam=CAMType.NONE, display_mapping=DisplayMapping.WHITEBOARD
        ),
        'Balanced': DTUTMOConfig(
            display_mapping=DisplayMapping.PRODUCTION_HYBRID
        ),
        'Quality': DTUTMOConfig(
            display_mapping=DisplayMapping.FULL_INVERSE
        ),
    }
    
    print(f"{'Config':<12} {'Time (s)':<10} {'Mpixels/s':<12}")
    print("-" * 40)
    
    for name, config in configs.items():
        tmo = CompleteDTUTMO(config)
        
        # Warmup
        _ = tmo.process(hdr)
        
        # Benchmark
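        start = time.perf_counter()
        for _ in range(5):
            _ = tmo.process(hdr)
        elapsed = (time.perf_counter() - start) / 5
        
        # Pixels are H x W; hdr.size would also count the color channels
        mpixels_per_sec = hdr.shape[0] * hdr.shape[1] / elapsed / 1e6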
        print(f"{name:<12} {elapsed:<10.3f} {mpixels_per_sec:<12.1f}")

if __name__ == '__main__':
    benchmark()

Next Steps

Configuration Guide - Tune parameters for your needs
Pipeline Stages - Understand stage costs
System Architecture - Memory and threading details
API Reference - Implementation details
