
Batches the quat to vec, rotation, and mat compose functions for upst…#108

Open
apasarkar wants to merge 4 commits into pygfx:main from apasarkar:vectorize_quat_funcs

Conversation

@apasarkar

@apasarkar apasarkar commented Apr 15, 2026


This PR allows for faster vectorized computations of some key functions that are regularly used in fastplotlib. Specifically, the functionality to go from vectors --> quaternions and the function to compose a transformation matrix from translation vectors/quaternions/scaling offsets is updated.
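
To illustrate what batching the compose step involves, here is a minimal NumPy sketch. It is not the code in this PR: the name `compose_batch_sketch` is hypothetical, and it assumes unit quaternions in (x, y, z, w) order and a translate * rotate * scale composition.

```python
import numpy as np


def compose_batch_sketch(translation, rotation, scaling):
    """Hypothetical sketch: (..., 3), (..., 4) xyzw, (..., 3) -> (..., 4, 4)."""
    t = np.asarray(translation, dtype=float)
    q = np.asarray(rotation, dtype=float)
    s = np.asarray(scaling, dtype=float)
    x, y, z, w = q[..., 0], q[..., 1], q[..., 2], q[..., 3]

    batch_shape = np.broadcast_shapes(t.shape[:-1], q.shape[:-1], s.shape[:-1])
    out = np.zeros(batch_shape + (4, 4))

    # Rotation block built from a unit quaternion (assumed xyzw order).
    out[..., 0, 0] = 1 - 2 * (y * y + z * z)
    out[..., 0, 1] = 2 * (x * y - z * w)
    out[..., 0, 2] = 2 * (x * z + y * w)
    out[..., 1, 0] = 2 * (x * y + z * w)
    out[..., 1, 1] = 1 - 2 * (x * x + z * z)
    out[..., 1, 2] = 2 * (y * z - x * w)
    out[..., 2, 0] = 2 * (x * z - y * w)
    out[..., 2, 1] = 2 * (y * z + x * w)
    out[..., 2, 2] = 1 - 2 * (x * x + y * y)

    # Scale the columns (R @ diag(s)), then fill in the translation column.
    out[..., :3, :3] *= s[..., None, :]
    out[..., :3, 3] = t
    out[..., 3, 3] = 1.0
    return out
```

The same function accepts a single `(3,)`, `(4,)`, `(3,)` triple and returns one `(4, 4)` matrix, so the n = 1 case goes through exactly the same array code as a large batch.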

Edit: also includes updated tests for checking that batching works.
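
A batching test of this kind can be sketched as a comparison between the batched call and a plain loop over the scalar function. The helper `assert_batch_matches_loop` and the toy `scale_vec` functions below are hypothetical stand-ins so the example runs on its own; they are not the tests added in this PR.

```python
import numpy as np


def assert_batch_matches_loop(scalar_fn, batch_fn, *batched_args, rtol=1e-12):
    """Check that batch_fn equals scalar_fn applied row by row along axis 0."""
    expected = np.stack([scalar_fn(*row) for row in zip(*batched_args)])
    np.testing.assert_allclose(batch_fn(*batched_args), expected, rtol=rtol)


# Toy stand-ins (hypothetical): scale one vector vs. a batch of vectors.
def scale_vec(v, s):
    return np.asarray(v) * s


def scale_vec_batch(v, s):
    return np.asarray(v) * np.asarray(s)[..., None]


rng = np.random.default_rng(0)
assert_batch_matches_loop(scale_vec, scale_vec_batch, rng.random((5, 3)), rng.random(5))
```

Passing the real scalar and batched functions in place of the toy ones would give the same check for the functions touched here.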

@apasarkar apasarkar requested a review from Korijn as a code owner April 15, 2026 14:15
@almarklein

Member

Thanks for this contribution! Some general comments:

  • You reduced and removed some docstrings, not sure why?
  • This project is formatted with ruff. You can run ruff format to autoformat the code, and ruff check to check for linting errors.
  • Would be good to have some benchmarks to verify that the code has not become significantly slower due to these changes, when used without the batching.

@apasarkar

apasarkar commented Apr 15, 2026

Author

Sounds good, thanks @almarklein! To address the comments:

  1. Yep good call, I've reintroduced those comments.
  2. Formatted with ruff in latest commits
  3. Batching experiments are below for the two key functions modified (mat_compose and quat_from_vecs)

Code for profiling compose_mat:


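import timeit

import numpy as np
import pandas as pd

# mat_compose is the scalar implementation copied from main;
# mat_batch_compose is the batched version from this PR.


def benchmark_mat_compose(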
    ns=(1, 10, 100, 1_000, 10_000),
    n_repeat=1,
    n_number=20,
    seed=0,
):
    """
    Benchmark mat_compose (scalar loop) vs mat_batch_compose (vectorized).
 
    Parameters
    ----------
    ns : sequence of int
        Batch sizes to test.
    n_repeat : int
        Number of timeit repeats (best-of is taken).
    n_number : int
        Number of calls per timeit repeat.
    seed : int
        RNG seed for reproducibility.
 
    Returns
    -------
    pd.DataFrame with columns: n, loop_ms, batch_ms, speedup
    """
    rng = np.random.default_rng(seed)
 
    rows = []
    for n in ns:
        translations = rng.random((n, 3))
        rotations = rng.random((n, 4))
        rotations /= np.linalg.norm(rotations, axis=1, keepdims=True)
        scalings  = rng.random((n, 3)) + 0.1
 
        def loop_fn():
            for i in range(n):
                mat_compose(translations[i], rotations[i], scalings[i])
 
        def batch_fn():
            mat_batch_compose(translations, rotations, scalings)
 
        t_loop  = min(timeit.repeat(loop_fn,  repeat=n_repeat, number=n_number)) / n_number * 1e3
        t_batch = min(timeit.repeat(batch_fn, repeat=n_repeat, number=n_number)) / n_number * 1e3
 
        rows.append(dict(n=n, loop_ms=round(t_loop, 4), batch_ms=round(t_batch, 4), speedup=round(t_loop / t_batch, 1)))
        print(f"n={n:>7,} | loop {t_loop:8.3f} ms | batch {t_batch:8.3f} ms | speedup {t_loop/t_batch:.1f}x")
 
    return pd.DataFrame(rows).set_index("n")

The relative runtimes here are:

n=      1 | loop    0.024 ms | batch    0.092 ms | speedup 0.3x
n=     10 | loop    0.235 ms | batch    0.095 ms | speedup 2.5x
n=    100 | loop    1.623 ms | batch    0.049 ms | speedup 33.3x
n=  1,000 | loop   13.604 ms | batch    0.154 ms | speedup 88.5x
n= 10,000 | loop  132.154 ms | batch    1.278 ms | speedup 103.4x

So mat_compose looks good w.r.t. overheads observable at n = 1.

I ran a similar experiment for the quat_from_vecs code:



import timeit

import numpy as np


def benchmark_quat_from_vecs_singleton(n_repeat=5, n_number=1000, seed=0):
    """
    Compare quat_from_vecs (scalar) vs quat_from_vecs_batch on singleton [3] inputs.
    Goal: verify batch overhead is not significant for a single vector pair.
    """
    rng = np.random.default_rng(seed)
 
    def random_unit_vec():
        v = rng.random(3)
        return v / np.linalg.norm(v)
 

    src, tgt = random_unit_vec(), random_unit_vec()

    t_scalar = min(timeit.repeat(
        lambda: quat_from_vecs(src, tgt),
        repeat=n_repeat, number=n_number
    )) / n_number * 1e3

    t_batch = min(timeit.repeat(
        lambda: quat_from_vecs_batch(src, tgt),
        repeat=n_repeat, number=n_number
    )) / n_number * 1e3

    overhead = t_batch / t_scalar
    print(f"scalar: {t_scalar:>10.4f} ms | batch: {t_batch:>10.4f} ms | overhead {overhead:.2f}x")

Each call takes ~0.1 ms, so the batching logic does not seem to have changed the runtime noticeably.
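
For context, a batched vectors-to-quaternion conversion can be sketched as below. This is a minimal illustration, not the PR's implementation: `quat_from_vecs_sketch` is a hypothetical name, the inputs are assumed to be unit vectors, the output is assumed to be in (x, y, z, w) order, and exactly opposite vectors are not handled.

```python
import numpy as np


def quat_from_vecs_sketch(source, target):
    """Hypothetical sketch: (..., 3) unit vectors -> (..., 4) xyzw unit quaternions."""
    a = np.asarray(source, dtype=float)
    b = np.asarray(target, dtype=float)
    xyz = np.cross(a, b)  # rotation axis scaled by sin(angle)
    w = 1.0 + np.sum(a * b, axis=-1, keepdims=True)  # 1 + cos(angle)
    q = np.concatenate([xyz, w], axis=-1)
    # The norm is zero when source == -target; that case is left out here.
    return q / np.linalg.norm(q, axis=-1, keepdims=True)
```

Normalizing `(a x b, 1 + a . b)` yields the half-angle quaternion directly, which avoids an explicit `arccos` and works unchanged for a single pair or a stack of pairs.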

@Korijn

Korijn commented Apr 15, 2026

Contributor

I think in particular we are eager to see a comparison with the implementation on the main branch :) to avoid a potential performance regression

@apasarkar

apasarkar commented Apr 15, 2026

Author

Hi @Korijn! The comparison above is w.r.t. the code on the main branch. (I cut/pasted the code on main right now and compared that with the new batched code).

@Korijn

Korijn commented Apr 16, 2026

Contributor

So then do I correctly understand that the n=1 case became 3x slower? It's on the hot path for pygfx if I am not mistaken, can you address this?

E.g. the explicit float typecast in asarray is some overhead you can potentially do without... There's more ways to get it done
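
Two ways to look into this are sketched below, as assumptions rather than the approach taken in the PR. `time_asarray_variants` measures what the explicit `dtype=float` argument costs when the input is already a float64 array, and `compose_dispatch` is a hypothetical wrapper that routes 1-D inputs to the existing scalar code so the n = 1 hot path skips the batched code entirely.

```python
import timeit

import numpy as np


def time_asarray_variants(number=100_000):
    """Return per-call times in microseconds: (with dtype=float, without)."""
    v = np.array([1.0, 2.0, 3.0])  # already float64, the common hot-path case
    t_cast = timeit.timeit(lambda: np.asarray(v, dtype=float), number=number)
    t_plain = timeit.timeit(lambda: np.asarray(v), number=number)
    return t_cast / number * 1e6, t_plain / number * 1e6


def compose_dispatch(translation, rotation, scaling, scalar_fn, batch_fn):
    """Hypothetical dispatcher: use scalar_fn when every input is 1-D."""
    if all(np.ndim(a) == 1 for a in (translation, rotation, scaling)):
        return scalar_fn(translation, rotation, scaling)
    return batch_fn(translation, rotation, scaling)
```

Whether either change recovers the n = 1 timing would need to be measured against main with the same benchmark used above.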
