Skip to content

Star analysis performance improvement#257

Open
Borschtsch wants to merge 7 commits into
isbeorn:developfrom
Borschtsch:star_analysis_performance_improvement
Open

Star analysis performance improvement#257
Borschtsch wants to merge 7 commits into
isbeorn:developfrom
Borschtsch:star_analysis_performance_improvement

Conversation

@Borschtsch
Copy link
Copy Markdown
Contributor

🚀 Purpose

Hello everyone,

I want to propose a performance improvement in the image analysis path.
As well as demosaicing of the image, the canny edge detectors could also benefit of parallelization and some local optimizations.
In order to achieve bit-exact results during the optimization, I have established different frame format tests and also benchmarks to compare against original algorithms. This is done by reusing original source file for no blur canny edge detector and using original Accord's canny edge detector. I also improved testing for demosaicing the same way, rather than relying on rewritten implementation in the test itself.

I have added a Benchmarks sub-project but I am not sure if it should be kept. Please let me know. See the benchmark results below.

🧪 How Was It Tested?

The code is unit tested with different real-world frame formats and output is compared against reference calculations that are preserved in the test folder or reused from Accord library.

✅ PR Checklist

  • All unit tests pass
  • UI changes tested across Light, Dark, and Night themes (if applicable)
  • Plugin API compatibility reviewed
    • No breaking changes to interfaces
    • No changes to method/class signatures (especially optional parameters)
  • Changelog.md updated (if applicable)
  • Documentation updated (if applicable)

🔗 Related Issues

Follow up on #147.

📸 Screenshots

📝 Additional Notes

CannyBlurredBenchmarks

CaseName: Gray8Aligned_IMX571_6224x4168

Thread Count Method Mean Error StdDev Median Ratio RatioSD Allocated Alloc Ratio
1 OriginalBlurredCannyEdgeDetector 1910.75 ms 1490.05 ms 81.67 ms 1881.41 ms 1.00 - 123.70 MB 1.00
1 ImprovedBlurredCannyEdgeDetector 1078.00 ms 348.05 ms 19.08 ms 1088.47 ms 0.56 0.03 123.72 MB 1.00
15 OriginalBlurredCannyEdgeDetector 1854.01 ms 632.31 ms 34.66 ms 1837.76 ms 1.00 - 123.70 MB 1.00
15 ImprovedBlurredCannyEdgeDetector 221.26 ms 270.38 ms 14.82 ms 228.75 ms 0.12 0.01 123.76 MB 1.00

CannyNoBlurBenchmarks

CaseName: Gray8Aligned_IMX571_6224x4168

Thread Count Method Mean Error StdDev Median Ratio RatioSD Allocated Alloc Ratio
1 OriginalNoBlurCannyEdgeDetector 715.86 ms 402.48 ms 22.06 ms 703.15 ms 1.00 - 123.69 MB 1.00
1 ImprovedNoBlurCannyEdgeDetector 640.61 ms 169.74 ms 9.30 ms 637.04 ms 0.89 0.03 123.71 MB 1.00
15 OriginalNoBlurCannyEdgeDetector 716.97 ms 208.48 ms 11.43 ms 711.54 ms 1.00 - 123.69 MB 1.00
15 ImprovedNoBlurCannyEdgeDetector 146.40 ms 15.08 ms 826.61 us 146.25 ms 0.20 0.00 123.72 MB 1.00

@daleghent
Copy link
Copy Markdown
Contributor

It seems that some of these new code paths are still quite scalar in nature. Have you explored using the Vector and Vector3d classes, especially when iterating pixel by pixel?

@Borschtsch
Copy link
Copy Markdown
Contributor Author

It seems that some of these new code paths are still quite scalar in nature. Have you explored using the Vector and Vector3d classes, especially when iterating pixel by pixel?

Hello, I have explored this idea before my first PR with demosaicing as it seemed the rigth thing to do in the first place. Though I could not get a meaningful improvement compared with current scalar path, Could be because I didn't have good benchmaks at that time. Now with the right test and benchmark harness in place this idea can be pursued. Let me take a look.

@Borschtsch
Copy link
Copy Markdown
Contributor Author

I have collected some data from test implementation for demosaicing as I had already worked on it. I am not sure what is the oldest CPU N.I.N.A. could have supported - probably limited with CPU that can run Windows 10? But one approach was using standard Vector that is SIMD-friendly and another one is using AVX intrinsics.

I had benchmarks running with the following JIT knobs - DOTNET_EnableHWIntrinsic = 0/1 (0 for scalar path), DOTNET_EnableAVX2 = 0/1 and DOTNET_EnableAVX512F = 0/1
With proper benchmarks it is clear that even Vector is faster than scalar path.

CaseName: Bayer16Aligned_IMX461_11656x8750_Demosaic

Reference algorithm

SIMD Mode Worker Count Method Mean Error StdDev Median Ratio RatioSD Allocated Alloc Ratio
Scalar 1 OriginalBayerFilter16bpp 3462.99 ms 2893.51 ms 158.60 ms 3430.13 ms 1.00 - 778.12 MB 1.00

Vector

SIMD Mode Worker Count Method Mean Error StdDev Median Ratio RatioSD Allocated Alloc Ratio
MaxSIMD 1 OriginalBayerFilter16bpp 3009.95 ms 1721.71 ms 94.37 ms 2998.91 ms 1.00 - 778.12 MB 1.00
Scalar 1 ImprovedBayerFilter16bpp 1049.09 ms 558.52 ms 30.61 ms 1049.42 ms 0.32 0.01 778.13 MB 1.00
NoAVX2 1 ImprovedBayerFilter16bpp 908.95 ms 485.99 ms 26.64 ms 917.23 ms 0.32 0.01 778.13 MB 1.00
MaxSIMD 1 ImprovedBayerFilter16bpp 771.50 ms 58.32 ms 3.20 ms 770.87 ms 0.26 0.01 778.13 MB 1.00
Scalar 15 ImprovedBayerFilter16bpp 265.42 ms 58.73 ms 3.22 ms 263.88 ms 0.08 0.00 778.13 MB 1.00
NoAVX2 15 ImprovedBayerFilter16bpp 236.64 ms 191.78 ms 10.51 ms 234.94 ms 0.08 0.00 778.13 MB 1.00
MaxSIMD 15 ImprovedBayerFilter16bpp 257.65 ms 107.70 ms 5.90 ms 260.39 ms 0.09 0.00 778.13 MB 1.00

AVX intrinsics

SIMD Mode Worker Count Method Mean Error StdDev Median Ratio RatioSD Allocated Alloc Ratio
NoAVX2 1 ImprovedBayerFilter16bpp 746.39 ms 172.98 ms 9.48 ms 741.04 ms 0.24 0.01 778.13 MB 1.00
MaxSIMD 1 ImprovedBayerFilter16bpp 553.47 ms 1237.55 ms 67.83 ms 575.11 ms 0.16 0.02 778.13 MB 1.00
Scalar 4 ImprovedBayerFilter16bpp 431.21 ms 267.29 ms 14.65 ms 429.20 ms 0.12 0.00 778.13 MB 1.00
NoAVX2 4 ImprovedBayerFilter16bpp 284.38 ms 312.54 ms 17.13 ms 275.95 ms 0.09 0.01 778.13 MB 1.00
MaxSIMD 4 ImprovedBayerFilter16bpp 289.84 ms 747.15 ms 40.95 ms 268.17 ms 0.09 0.01 778.13 MB 1.00
Scalar 8 ImprovedBayerFilter16bpp 308.52 ms 579.60 ms 31.77 ms 308.11 ms 0.09 0.01 778.13 MB 1.00
NoAVX2 8 ImprovedBayerFilter16bpp 225.21 ms 116.29 ms 6.37 ms 225.87 ms 0.07 0.00 778.13 MB 1.00
MaxSIMD 8 ImprovedBayerFilter16bpp 215.16 ms 112.50 ms 6.17 ms 217.53 ms 0.07 0.00 778.13 MB 1.00

With the number of threads performance does not change much. I guess, I have saturated memory I/O on my laptop.

What do you think?

@isbeorn
Copy link
Copy Markdown
Owner

isbeorn commented Apr 3, 2026

How did you verify that the edge detection will yield equivalent results. This is such a central component and crucial for star detection that I'm very hesitant to replace it.

@daleghent
Copy link
Copy Markdown
Contributor

Considering NINA is ran most often on low-power CPUs, "typical" core count is going to be 4-8 cores. The Vector* classes will dynamically use what is available and fall back to scalar iteration internally if more advanced instruction sets aren't available. SSE2 and AVX are old, going back over 15 years. AVX2 is more recent, appearing with the Intel Haswell microacrch. Yet, Haswell was introduced over 10 years ago. I would use the Vector classes and let the JIT determine what to use. Working on 8 or 16 16bit pixels at a time is far better than one at a time. Threading is going to be problematic if you are scanning the entire frame in broken-up sections. What if a star sits on the border between sections?

@Borschtsch
Copy link
Copy Markdown
Contributor Author

Borschtsch commented Apr 10, 2026

How did you verify that the edge detection will yield equivalent results. This is such a central component and crucial for star detection that I'm very hesitant to replace it.

I understand the question needs to be addressed better. In the last commits an image fixture has been created and used in bayer, gaussian and canny tests on different image formats - this fixture creates images with features like uniform black, uniform white, star field, diagonal line, step edge, checker board, etc.
The tests are running against reference implementations in N.I.N.A. and Accord - bayer and both canny ones. I put the both N.I.N.A. original files back in the tests folder.

The change is mostly focused on improving performance of gaussian blur and canny edge detection (magnitude and edge orientation, non-maximums suppression, hysteresis is unchanged as it relies on sequential execution). I tried to keep the structure of the non blur canny the same is in the original file, so it is easier to compare side-by-side. The rest of the updates is benchmarking and test improvements.

Anyway, I think I need to spend more time hardening the tests. The change was relatively easy to do, but testing is lot more important.

@Borschtsch
Copy link
Copy Markdown
Contributor Author

Considering NINA is ran most often on low-power CPUs, "typical" core count is going to be 4-8 cores. The Vector* classes will dynamically use what is available and fall back to scalar iteration internally if more advanced instruction sets aren't available. SSE2 and AVX are old, going back over 15 years. AVX2 is more recent, appearing with the Intel Haswell microacrch. Yet, Haswell was introduced over 10 years ago. I would use the Vector classes and let the JIT determine what to use. Working on 8 or 16 16bit pixels at a time is far better than one at a time.

This is also my line of thinking with Vector classes, it feels like keeping generic implementation is more beneficial for all kinds of PCs out there.. I will try it later on mini-PCs running N100 to see if we can still benefit. But I think this can be another round once we have good tests we can trust.

Threading is going to be problematic if you are scanning the entire frame in broken-up sections. What if a star sits on the border between sections?

If I understood the question right, the parallel processing is not a problem as long as we can produce independent artifacts like debayered pixels or parts of canny pipeline like in this change..

@Borschtsch
Copy link
Copy Markdown
Contributor Author

Borschtsch commented Apr 13, 2026

I have improved tests by adding randomly generated star field with different image features to compare new canny behavior against the original (blured vs non-blured).
The input images look like this. If tests fail, then the input/expected/actual and diff will be saved into NINA.Test\bin\Release\net10.0-windows\Artifacts folder.

Please find attached the example of the input images.

Example 1
input1

Example 2
input2

Example 3
input3

Added back the actual reference implementation Bayer demosaicing and No Blur Canny Edge Detector to use in tests, accordingly tests are updated to use reference calculations (including Accord's ones)
Added Benchmark project
…nnyEdgeDetector uses this core implementation along with a separate GaussianBlur

Using deterministic image fixtures that are reused across all tests for all image formats
…generate a starfield with different features (100 images) that are used for both blured and non-blured canny paths
@Borschtsch Borschtsch force-pushed the star_analysis_performance_improvement branch from f8d537b to 056e70f Compare April 13, 2026 18:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants