Conversation

@NimaSarajpoor (Collaborator) commented Dec 30, 2025

This PR addresses #26. The goal is to use a less aggressive planning flag for FFTW.


As mentioned in #26 (comment), MATLAB takes a similar approach by default:

Description
method = fftw('planner') returns the method that the fast Fourier transform functions fft, fft2, fftn, ifft, ifft2, and ifftn use to determine a transform algorithm. The default method is 'estimate', which determines the algorithm based on the size of the data.


@NimaSarajpoor (Collaborator, Author) commented

In the following figure, the baseline is the modified version of pyfftw_sdp, where the planning flag is set to FFTW_ESTIMATE.

#!/bin/bash

rm -rf sdp/__pycache__
./timing.py -timeout 5.0 -pmin 6 -pmax 20 pyfftw pocketfft_r2c_c2r scipy_oaconvolve > timing.csv
rm -rf sdp/__pycache__
image (pyfftw_sdp_ESTIMATE)

@seanlaw (Contributor) commented Dec 30, 2025

In the following figure, the baseline is the modified version of pyfftw_sdp, where the planning flag is set to FFTW_ESTIMATE.

So, what do these plots tell us? All that I can conclude is that pocketfft is faster than FFTW_ESTIMATE when the length is greater than 2^15

@NimaSarajpoor (Collaborator, Author) commented

So, what do these plots tell us?

I should have added "WIP". Apologies. I was planning to update.

All that I can conclude is that pocketfft is faster than FFTW_ESTIMATE when the length is greater than 2^15

Right. In some of those cases, pocketfft is faster than scipy_oaconvolve, and in others it is the other way around. However, I think we should pause here and compare the performance of pyfftw_sdp with MATLAB_sdp.

@NimaSarajpoor (Collaborator, Author) commented

MATLAB_sdp vs pyfftw_sdp (planning_flag is FFTW_ESTIMATE)

MATLAB_sdp (1 physical core) vs pyfftw_sdp (1 logical core)

image

The blue dots are the (Q, T) cases where MATLAB_sdp performs better. For those cases, we can switch to other algos. To be more precise, we can revise the sliding_dot_product as follows:

def sliding_dot_product(Q, T, n_threads=1, planning_flag="FFTW_ESTIMATE"):
    if len(Q) == len(T):
        return _njit_dot(Q, T)  # njit-decorated for-loop
    elif len(Q) <= 128: # 2^7 == 128 
        return _njit_sliding_dot_product(Q, T)  # njit_sdp
    else:
        return _sliding_dot_product(Q, T, n_threads=n_threads, planning_flag=planning_flag)

And that will give us:
Color "red" means pyfftw_sdp is faster than MATLAB_sdp. Each point is annotated with its corresponding performance ratio.

image

I think this result indicates that we’re on the right track. However, we should keep in mind that MATLAB’s FFT uses multi-threading by default. At this stage, we need to compare the single-threaded performance of pyfftw_sdp against multi-threaded MATLAB_sdp, which benefits from automatically adjusted multi-threaded FFT execution.

MATLAB_sdp (up to 8 physical cores) vs pyfftw_sdp (1 logical core)

Like before, I am using njit_sdp when len(Q) <= 2^7.

image

As shown, the multi-threaded MATLAB_sdp shows better performance in certain cases when len(T) >= 2^18. I think we are now in a position to enhance pyfftw_sdp to properly support n_threads > 1.

SDP_RUN_ON_MATLAB.zip

@NimaSarajpoor (Collaborator, Author) commented

As shown, the multi-threaded MATLAB_sdp shows better performance in certain cases when len(T) >= 2^18. I think we are now in a position to enhance pyfftw_sdp to properly support n_threads > 1.

Just sharing here before I forget: Section 5.3 "How Many Threads to Use?" of FFTW documentation says:

As a general rule, you don’t want to use more threads than you have processors. (Using more threads will work, but there will be extra overhead with no benefit.) In fact, if the problem size is too small, you may want to use fewer threads than you have processors.

You will have to experiment with your system to see what level of parallelization is best for your problem size. Typically, the problem will have to involve at least a few thousand data points before threads become beneficial.

So, I think it is better to use psutil.cpu_count(logical=False) instead of os.cpu_count() as the maximum value for n_threads in pyfftw.FFTW.

@seanlaw (Contributor) commented Dec 31, 2025

I think it is better to use psutil.cpu_count(logical=False)

Note that psutil is not a part of the Python standard library. I think os.cpu_count() is fine. To some extent, we should loop over num_threads in range(1, os.cpu_count() + 1) in order to find the optimal number of threads and stop if we see the performance degrading.
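
Something along these lines (a rough sketch only; time_rfft is a hypothetical helper and the timing harness is purely illustrative):

import os
import timeit

import numpy as np
import pyfftw


def time_rfft(n, n_threads, repeat=5):
    # Hypothetical helper: best average wall time of one threaded RFFT of size n
    real_arr = pyfftw.empty_aligned(n, dtype="float64")
    complex_arr = pyfftw.empty_aligned(1 + n // 2, dtype="complex128")
    real_arr[:] = np.random.rand(n)
    plan = pyfftw.FFTW(real_arr, complex_arr, flags=("FFTW_ESTIMATE",), threads=n_threads)
    return min(timeit.repeat(plan.execute, number=10, repeat=repeat)) / 10


def find_optimal_n_threads(n):
    # Increase n_threads until the runtime stops improving, then stop
    best_n_threads, best_time = 1, time_rfft(n, 1)
    for n_threads in range(2, os.cpu_count() + 1):
        t = time_rfft(n, n_threads)
        if t >= best_time:
            break  # performance degraded; keep the previous best
        best_n_threads, best_time = n_threads, t
    return best_n_threads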

At this stage, we need to compare the single-threaded performance of pyfftw_sdp against multi-threaded MATLAB_sdp, which benefits from automatically adjusted multi-threaded FFT execution.

I wonder how Matlab is able to achieve this dynamically 🤔

Color "red" means pyfftw_sdp is faster than MATLAB_sdp. Each point is annotated with its corresponding performance ratio.

I like this visualization. It makes it very easy to understand!

As shown, the multi-threaded MATLAB_sdp shows better performance in certain cases when len(T) >= 2^18. I think we are now in a position to enhance pyfftw_sdp to properly support n_threads > 1.

Based on these results alone, it seems that it's not even worth using multithreaded Matlab because having 8 cores doesn't even help it be anywhere close to being 2x faster than pyfftw. Is that the correct interpretation?

Also, may I ask where you got the Matlab numbers from?

@seanlaw (Contributor) commented Dec 31, 2025

we can revise the sliding_dot_product as follows:

What happens when pyfftw isn't installed? How should we fall back (to oaconvolve and pocketfft)?

@NimaSarajpoor (Collaborator, Author) commented Jan 1, 2026

Note that psutil is not a part of the Python standard library.

👍

To some extent, we should loop over num_threads in range(1, os.cpu_count() + 1) in order to find the optimal number of threads and stop if we see the performance degrading.

I have been doing some work locally on this and tried a few approaches:

  • Loop over range(1, os.cpu_count() + 1) and find the optimal n_threads for EACH input size (2^2, 2^3, ..., 2^20), saving the results in a dictionary, one for rfft and one for irfft. I thought a single n_threads should work for both rfft and irfft, but the optimal values I found for each did not support that conclusion.

  • Check only two cases, n_threads=1 and n_threads=os.cpu_count(), for different input sizes, and find the input size beyond which multi-threading performs better.

  • Use a hard-coded cutoff of 2**15 for the input size: for any input size below that, choose n_threads=1; if the input size is >= 2**15, choose n_threads=os.cpu_count() (see the sketch right after this list).
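
A minimal sketch of that third, cutoff-based heuristic:

import os


def choose_n_threads(n, cutoff=2**15):
    # Hard-coded heuristic: single-threaded below the cutoff,
    # all available cores at or above it
    return 1 if n < cutoff else os.cpu_count()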

The best approach is probably the first one, which is what you suggested. We just compute the proper number of threads for input sizes 2^pmin, 2^(pmin+1), ..., 2^pmax. The total time should be close to:

(pmax - pmin + 1) * os.cpu_count() * timeout * 2

# timeout: the time spent collecting samples for each case
# 2 because: rfft and irfft

So, for pmin=2, pmax=25, os.cpu_count()==16, and a timeout of 0.1 sec, the total time will be roughly 24 * 16 * 0.1 * 2 ≈ 77 sec. We can probably start our search from pmin=11, as any input size <= 2^10 should be fine with n_threads=1. For any other (fast) input size, we can choose its optimal n_threads based on some kind of logic (??) using the optimal n_threads chosen for input sizes that are exact powers of two; one possible version of that logic is sketched below.
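
For instance (a sketch only; the OPTIMAL_N_THREADS table below is made up and would come from the per-size search):

import math

# Hypothetical output of the per-size search; keys are exponents p for sizes 2**p
OPTIMAL_N_THREADS = {11: 1, 12: 2, 13: 2, 14: 4, 15: 4, 16: 8}


def n_threads_for(n):
    # Round the size up to the next power of two and reuse its tuned value;
    # sizes below the smallest tuned exponent run single-threaded
    p = math.ceil(math.log2(n))
    if p < min(OPTIMAL_N_THREADS):
        return 1
    p = min(p, max(OPTIMAL_N_THREADS))  # clamp for very long inputs
    return OPTIMAL_N_THREADS[p]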

Btw, I tried the last approach (i.e. using the hard-coded cutoff 2**15) and got the following figure.

image

So this result shows that pyfftw_sdp performs well when we use multi-threading for large input sizes.

I wonder how Matlab is able to achieve this dynamically

Couldn't find anything before. Will provide an update if I find something.

it seems that it's not even worth using multithreaded Matlab because having 8 cores doesn't even help it be anywhere close to being 2x faster than pyfftw

Our comparison may not be fair from MATLAB's perspective because we are using "RFFT". The only reason I am comparing pyfftw_sdp with MATLAB_sdp is to make sure that this building block of DAMP is going to hold up in the DAMP implementation.

Also, may I ask where you got the Matlab numbers from?

By running the MATLAB_sdp code (see DAMP_2_0.m) on MATLAB Online using the code provided in the attached zip file. It would be great if you could review the code that I used for timing! This can help us make sure there is no silly mistake in my numbers.

SDP_RUN_ON_MATLAB.zip

After I collect the running times saved as .mat & .npy files, I plot the performance-ratio figure using the following code:

import scipy.io
import numpy as np
import matplotlib.pyplot as plt

mat_file_name = ""
npy_file_name = ""

ref = scipy.io.loadmat(mat_file_name)
ref = ref['timing_values']

comp = np.load(npy_file_name)

p_min = 2
p_max = 24
n_physical_cores = 8  # used in MATLAB_sdp
n_logical_cores = 16  # used in pyfftw_sdp



points = []
plt.figure(figsize=(12, 12))
ax = plt.gca()

need_legend_red = True
need_legend_blue = True
for q in range(p_min, p_max + 1):
    for p in range(q, p_max + 1):
        r = ref[q - 1, p - 1]  # MATLAB_sdp timing
        c = comp[q - 1, p - 1]  # pyfftw_sdp timing

        ratio = r / c

        if ratio >= 1.0:
            color = 'r'
            msg = 'FASTER'
        else:
            color = 'b'
            msg = 'SLOWER'

        if ratio < 1.0:
            print(f'({msg}) q={q}, p={p} --> ratio: {ratio}', flush=True)

        # Label each color only once in the legend
        legend = None
        if color == 'r' and need_legend_red:
            legend = f'pyfftw_sdp ({n_logical_cores} logical cores) is faster'
            need_legend_red = False
        elif color == 'b' and need_legend_blue:
            legend = f'MATLAB_sdp ({n_physical_cores} physical cores) is faster'
            need_legend_blue = False

        points.append((q, p))

        if legend is not None:
            plt.scatter(q, p, color=color, label=legend)
        else:
            plt.scatter(q, p, color=color)

        ax.annotate(
            f"{ratio:.1f}",
            (q, p),
            textcoords="offset points",
            xytext=(2, 2),
            ha="left",
            va="bottom",
        )


plt.xlabel('log2 of len(Q)')
plt.ylabel('log2 of len(T)')
plt.xticks(ticks=np.arange(p_min, p_max + 1), labels=np.arange(p_min, p_max + 1))
plt.yticks(ticks=np.arange(p_min, p_max + 1), labels=np.arange(p_min, p_max + 1))
plt.grid()
plt.legend()
plt.show()

What happens when pyfftw isn't installed? How should we fall back (to oaconvolve and pocketfft)?

Haven't thought about it yet! I was wondering if we should wrap up pyfftw_sdp first and then focus on this part.

@seanlaw (Contributor) commented Jan 1, 2026

Our comparison may not be fair from MATLAB's perspective because we are using "RFFT".

Do you mean that Matlab is not doing rfft but is instead doing fft, which is less efficient?

@NimaSarajpoor (Collaborator, Author) commented Jan 2, 2026

Update: Please ignore the figures


(R)FFT

Do you mean that Matlab is not doing rfft but is instead doing fft, which is less efficient?

Right. MATLAB does not have an rfft function. Although MATLAB's fft on a real-valued array is faster than its fft on a complex-valued array (probably because it uses some rfft trick under the hood), it is still less efficient than pyFFTW's RFFT for large arrays.
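
For reference, the saving comes from conjugate symmetry: for a real-valued input of length n, rfft only needs 1 + n//2 coefficients instead of the full n. For example:

import numpy as np

T = np.random.rand(2**20)
print(np.fft.fft(T).shape)   # (1048576,) -- full complex spectrum
print(np.fft.rfft(T).shape)  # (524289,)  -- 1 + n//2 coefficients suffice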

The following figure shows the performance ratio of pyFFTW's RFFT (with 1 logical thread) relative to MATLAB's fft on 1 physical core (blue) and relative to MATLAB's fft with auto-adjusted 8 physical cores (red). The code is provided at the bottom of this comment. When a y value is larger than one, pyfftw's RFFT is faster.

image

According to the figure, pyfftw's RFFT performs better for larger arrays compared to MATLAB's fft. Results may change a bit from one run to another, but the overall trend remains the same. Now that I've provided the performance ratio for (R)FFT, it is worth showing the plot for I(R)FFT, the other component of SDP.

I(R)FFT

I can see a different trend for irfft. When a y value is larger than one, it means pyfftw's IRFFT is faster.

image

This shows that MATLAB's ifft is faster for longer arrays. Considering BOTH this I(R)FFT plot and the previous (R)FFT plot partly explains why we saw some cases with performance ratio < 1 in the SDP plot, the last figure in this comment.

Code for (R)FFT

% MATLAB
function [z] = fft_func_single(T)
    z = fft(T);
end

num_of_thread = 'automatic';  % set num_of_thread to 1 to use a single physical core
maxNumCompThreads(num_of_thread);

and,

# python
import pyfftw


class REAL_FFT:
    def __init__(self, max_n=2**20):
        """
        Parameters
        ----------
        max_n : int
            Maximum length to preallocate arrays for. This will be the size of
            the real-valued array. A complex-valued array of size `1 + (max_n // 2)`
            will also be preallocated.
        """
        self.real_arr = pyfftw.empty_aligned(max_n, dtype="float64")
        self.complex_arr = pyfftw.empty_aligned(1 + (max_n // 2), dtype="complex128")
        self.rfft_objects = {}
        
    def __call__(self, T, n_threads=1, planning_flag="FFTW_ESTIMATE"):
        """
        Compute the RFFT of `T` using FFTW via pyfftw.

        Parameters
        ----------
        T : numpy.ndarray
            Time series or sequence.

        n_threads : int, default 1
            Number of logical threads

        planning_flag : str, default="FFTW_ESTIMATE"
            The planning flag that will be used in FFTW for planning.
            See pyfftw documentation for details. Current options include:
            "FFTW_ESTIMATE", "FFTW_MEASURE", "FFTW_PATIENT", and "FFTW_EXHAUSTIVE".

        Returns
        -------
        out : numpy.ndarray
            The RFFT of T
        """
        
        n = T.shape[0]

        # next_fast_n = pyfftw.next_fast_len(n)
        next_fast_n = n

        if next_fast_n > len(self.real_arr):
            self.real_arr = pyfftw.empty_aligned(next_fast_n, dtype="float64")
            self.complex_arr = pyfftw.empty_aligned(
                1 + (next_fast_n // 2), dtype="complex128"
            )

        real_arr = self.real_arr[:next_fast_n]
        complex_arr = self.complex_arr[: 1 + (next_fast_n // 2)]

        # Get or create FFTW objects
        key = (next_fast_n, n_threads, planning_flag)

        rfft_obj = self.rfft_objects.get(key, None)
        if rfft_obj is None:
            rfft_obj = pyfftw.FFTW(
                input_array=real_arr,
                output_array=complex_arr,
                direction="FFTW_FORWARD",
                flags=(planning_flag,),
                threads=n_threads,
            )
            self.rfft_objects[key] = rfft_obj
        else:
            rfft_obj.update_arrays(real_arr, complex_arr)

        # RFFT(T)
        real_arr[:] = T
        rfft_obj.execute()  # output is in complex_arr
        
        return complex_arr


_real_fft = REAL_FFT()


def setup(T, n_threads=1, planning_flag="FFTW_ESTIMATE"):
    _real_fft(T, n_threads=n_threads, planning_flag=planning_flag)
    return

def real_fft(T, n_threads=1, planning_flag="FFTW_ESTIMATE"):
    return _real_fft(T, n_threads=n_threads, planning_flag=planning_flag)
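
Usage for the timing runs is then just (the size here is illustrative):

import numpy as np

T = np.random.rand(2**16)
setup(T)                # warm-up call; creates and caches the FFTW plan
spectrum = real_fft(T)  # reuses the cached plan
print(spectrum.shape)   # (32769,) == (1 + len(T) // 2,)

Note that the returned array is a view into the preallocated buffer, so it is overwritten by the next call.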

@seanlaw (Contributor) commented Jan 2, 2026

I think that the plots above confirm that "auto-adjusted 8 physical cores" doesn't really help (based on the fact that the blue and red lines are very similar). Having said that, I wouldn't put too much emphasis on your ability to leverage all 8 cores on Matlab Online! I would guess that the hardware resources on Matlab Online are shared and, maybe, you have access to 2 cores.

@NimaSarajpoor (Collaborator, Author) commented Jan 2, 2026

Something doesn't add up... To verify

I would guess that the hardware resources on Matlab Online are shared and, maybe, you have access to 2-cores.

I've decided to put pyfftw aside and focus only on MATLAB's fft / ifft. I wanted to see how their performance changes with different numbers of cores on MATLAB Online. So, I checked the following cases (the FULL MATLAB CODE is provided at the bottom of this comment):

  • Auto-core
  • 8-core
  • 4-core
  • 2-core
  • 1-core

And I plotted a figure to show the performance ratio of multi-threaded cases relative to 1-core. I got the following plots, one for FFT, and another for IFFT. A value higher than one means the multi-threaded approach is faster.

images: FFT_MATLAB_Online, IFFT_MATLAB_Online

According to the figures above, the performance ratio for "Auto" (auto-adjusted 8-core) is close to that of the "8-core" case (for large arrays), which is higher than the ratios of the 4-core and 2-core cases. This means I might have access to more than two cores. BUT these numbers do not seem to be aligned with the plots I provided in my previous comment. Is it possible that the number of cores fft can actually use depends on how much load the server is handling at a particular time?? I am going to check those plots again.

FULL MATLAB CODE

% COPY THIS SCRIPT, PASTE IT IN A FILE, AND RUN IT!
% NOTE: in a MATLAB script file, local function definitions must appear at
% the END of the file, so the script code comes first and benchmark_fft is
% defined at the bottom.

num_of_thread = 8;
maxNumCompThreads(num_of_thread);
n_threads_multi = maxNumCompThreads;
disp(strcat("MAX number of physical cores: ", num2str(n_threads_multi)));
results_8 = benchmark_fft();

disp('=================================')

num_of_thread = 'automatic';
maxNumCompThreads(num_of_thread);
n_threads_multi = maxNumCompThreads;
disp(strcat("MAX number of physical cores: ", num2str(n_threads_multi)));
results_auto = benchmark_fft();

disp('=================================')

num_of_thread = 4;
maxNumCompThreads(num_of_thread);
n_threads_multi = maxNumCompThreads;
disp(strcat("MAX number of physical cores: ", num2str(n_threads_multi)));
results_4 = benchmark_fft();

disp('=================================')

num_of_thread = 2;
maxNumCompThreads(num_of_thread);
n_threads_multi = maxNumCompThreads;
disp(strcat("MAX number of physical cores: ", num2str(n_threads_multi)));
results_2 = benchmark_fft();

disp('=================================')

% Set threads to 1
num_of_thread = 1;
maxNumCompThreads(num_of_thread);
n_threads = maxNumCompThreads;
disp(strcat("MAX number of physical cores: ", num2str(n_threads)));
results_1 = benchmark_fft();


%%%%%%
% PLOT
exponents = log2(results_1.n);

ratio_2 = results_1.real_avg_time ./ results_2.real_avg_time;
ratio_4 = results_1.real_avg_time ./ results_4.real_avg_time;
ratio_8 = results_1.real_avg_time ./ results_8.real_avg_time;
ratio_auto = results_1.real_avg_time ./ results_auto.real_avg_time;


figure;

plot(exponents, ratio_2, '-o', 'LineWidth', 1.5, 'DisplayName', '2-cores');
hold on;
plot(exponents, ratio_4, '-o', 'LineWidth', 1.5, 'DisplayName', '4-cores');
plot(exponents, ratio_8, '-o', 'LineWidth', 1.5, 'DisplayName', '8-cores');
plot(exponents, ratio_auto, '-o', 'LineWidth', 1.5, 'DisplayName', 'Auto');
yline(1, '--k', 'LineWidth', 1.2);  % y = 1 reference line
hold off;

xlabel('Exponent (n = 2^k)');
ylabel('MATLAB FFT: 1 core / X core');
title('FFT: Performance Ratio for X cores relative to 1 core');
xticks(exponents)
grid on;


function results = benchmark_fft(min_exp, max_exp, min_time)
    if nargin < 1, min_exp = 2; end
    if nargin < 2, max_exp = 26; end
    if nargin < 3, min_time = 5.0; end

    exps = min_exp:max_exp;
    n_vals = 2 .^ exps;

    real_avg = zeros(size(n_vals));
    real_iters = zeros(size(n_vals));

    for i = 1:length(n_vals)
        n = n_vals(i);
        T_real = rand(n, 1);

        % fft warm-up
        fft(T_real);

        % ifft warm-up
        % F = fft(T_real);
        % ifft(F);

        count = 0;
        total_time = 0;
        while total_time < min_time
            t_start = tic;
            fft(T_real); % ifft(F);
            total_time = total_time + toc(t_start);
            count = count + 1;
        end

        real_avg(i) = total_time / count;
        real_iters(i) = count;

        fprintf('n = 2^%d (%d): real = %.3e s, real_iters = %d \n', ...
                exps(i), n, real_avg(i), real_iters(i));
    end

    results.n = n_vals;
    results.real_avg_time = real_avg;
    results.real_iters = real_iters;
end

@seanlaw (Contributor) commented Jan 2, 2026

Is it possible that the number of cores fft can actually use depends on how much load the server is handling at a particular time??

Certainly possible but nearly impossible to confirm. I just don't want you to waste time trying to interpret results that may be highly variable and that you may have little control over. The max performance gain between 1 core and 2 cores is around 10-20% and appears to diminish as you add more cores.

@NimaSarajpoor (Collaborator, Author) commented

I just don't want you to waste time trying to interpret results that may be highly variable and that you may have little control over.

Right! For instance....

I am going to check those plots again.

I checked those pyfftw-vs-MATLAB plots again for FFT and for IFFT. I got a slightly different outcome, but the overall trends were almost the same. I also plotted that Q-T scatter plot (to check the correctness of the last plot in this comment) and the results were slightly different, but the overall trend was the same: pyfftw (1 thread) outperforms MATLAB_sdp in most cases. In the remaining cases, particularly for 2^19, 2^20, and 2^24, it was only 20-30% slower.

The max performance gain between 1 core and 2 cores is around 10-20% and appears to diminish as you add more cores.

Right! And, therefore, my suggestion is to just go with single-threaded pyfftw for now as I think it is ALMOST ready to be used in DAMP. And, later, if needed, we can try to enhance it to find the proper number of threads. To have a smaller search space, we can also limit the maximum number of threads to four (or even two??). We can discuss it further when the time comes.

@NimaSarajpoor (Collaborator, Author) commented Jan 2, 2026

What happens when pyfftw isn't installed? How should we fall back (to oaconvolve and pocketfft)?

I think we can ignore oaconvolve and just fall back to pocketfft. If we look at the performance plots in this comment, there are only a few cases where oaconvolve performs better, and in those cases pocketfft is already better than the baseline, pyfftw_sdp. And when their performance is below the baseline, pocketfft is closer to the baseline pyfftw_sdp. We can think about including oaconvolve later if needed. A minimal sketch of the fallback is below.
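
A sketch only, assuming a pocketfft-based implementation with the same signature is available (the _pocketfft_sliding_dot_product name is a placeholder):

try:
    import pyfftw
    HAS_PYFFTW = True
except ImportError:
    HAS_PYFFTW = False


def sliding_dot_product(Q, T, n_threads=1, planning_flag="FFTW_ESTIMATE"):
    if HAS_PYFFTW:
        # FFTW-backed path described in this PR
        return _sliding_dot_product(Q, T, n_threads=n_threads, planning_flag=planning_flag)
    # pyfftw is not installed: fall back to the pocketfft-based implementation
    return _pocketfft_sliding_dot_product(Q, T)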

@seanlaw
What do you think? Note that the maximum input size in that performance plot is 2^20. Maybe I should check it for larger input sizes?
How about merging this PR and addressing it in a new PR?

@seanlaw (Contributor) commented Jan 3, 2026

Right! And, therefore, my suggestion is to just go with single-threaded pyfftw for now as I think it is ALMOST ready to be used in DAMP. And, later, if needed, we can try to enhance it to find the proper number of threads. To have a smaller search space, we can also limit the maximum number of threads to four (or even two??). We can discuss it further when the time comes.

Agreed!

How about merging this PR and addressing it in a new PR?

Yes, please feel free to go ahead and merge when you think this is ready. I don't see anything wrong with the changed files.

@NimaSarajpoor NimaSarajpoor merged commit 9b4cf54 into main Jan 3, 2026
3 checks passed