Star analysis performance improvement by Borschtsch · Pull Request #257 · isbeorn/nina

Borschtsch · 2026-04-02T09:58:31Z

🚀 Purpose

Hello everyone,

I want to propose a performance improvement in the image analysis path.
As well as demosaicing of the image, the canny edge detectors could also benefit of parallelization and some local optimizations.
In order to achieve bit-exact results during the optimization, I have established different frame format tests and also benchmarks to compare against original algorithms. This is done by reusing original source file for no blur canny edge detector and using original Accord's canny edge detector. I also improved testing for demosaicing the same way, rather than relying on rewritten implementation in the test itself.

I have added a Benchmarks sub-project but I am not sure if it should be kept. Please let me know. See the benchmark results below.

🧪 How Was It Tested?

The code is unit tested with different real-world frame formats and output is compared against reference calculations that are preserved in the test folder or reused from Accord library.

✅ PR Checklist

All unit tests pass
UI changes tested across Light, Dark, and Night themes (if applicable)
Plugin API compatibility reviewed
- No breaking changes to interfaces
- No changes to method/class signatures (especially optional parameters)
Changelog.md updated (if applicable)
Documentation updated (if applicable)

🔗 Related Issues

Follow up on #147.

📸 Screenshots

📝 Additional Notes

CannyBlurredBenchmarks

CaseName: Gray8Aligned_IMX571_6224x4168

Thread Count	Method	Mean	Error	StdDev	Median	Ratio	RatioSD	Allocated	Alloc Ratio
1	OriginalBlurredCannyEdgeDetector	1910.75 ms	1490.05 ms	81.67 ms	1881.41 ms	1.00	-	123.70 MB	1.00
1	ImprovedBlurredCannyEdgeDetector	1078.00 ms	348.05 ms	19.08 ms	1088.47 ms	0.56	0.03	123.72 MB	1.00
15	OriginalBlurredCannyEdgeDetector	1854.01 ms	632.31 ms	34.66 ms	1837.76 ms	1.00	-	123.70 MB	1.00
15	ImprovedBlurredCannyEdgeDetector	221.26 ms	270.38 ms	14.82 ms	228.75 ms	0.12	0.01	123.76 MB	1.00

CannyNoBlurBenchmarks

CaseName: Gray8Aligned_IMX571_6224x4168

Thread Count	Method	Mean	Error	StdDev	Median	Ratio	RatioSD	Allocated	Alloc Ratio
1	OriginalNoBlurCannyEdgeDetector	715.86 ms	402.48 ms	22.06 ms	703.15 ms	1.00	-	123.69 MB	1.00
1	ImprovedNoBlurCannyEdgeDetector	640.61 ms	169.74 ms	9.30 ms	637.04 ms	0.89	0.03	123.71 MB	1.00
15	OriginalNoBlurCannyEdgeDetector	716.97 ms	208.48 ms	11.43 ms	711.54 ms	1.00	-	123.69 MB	1.00
15	ImprovedNoBlurCannyEdgeDetector	146.40 ms	15.08 ms	826.61 us	146.25 ms	0.20	0.00	123.72 MB	1.00

daleghent · 2026-04-02T14:12:15Z

It seems that some of these new code paths are still quite scalar in nature. Have you explored using the Vector and Vector3d classes, especially when iterating pixel by pixel?

Borschtsch · 2026-04-02T14:36:15Z

It seems that some of these new code paths are still quite scalar in nature. Have you explored using the Vector and Vector3d classes, especially when iterating pixel by pixel?

Hello, I have explored this idea before my first PR with demosaicing as it seemed the rigth thing to do in the first place. Though I could not get a meaningful improvement compared with current scalar path, Could be because I didn't have good benchmaks at that time. Now with the right test and benchmark harness in place this idea can be pursued. Let me take a look.

Borschtsch · 2026-04-03T20:50:54Z

I have collected some data from test implementation for demosaicing as I had already worked on it. I am not sure what is the oldest CPU N.I.N.A. could have supported - probably limited with CPU that can run Windows 10? But one approach was using standard Vector that is SIMD-friendly and another one is using AVX intrinsics.

I had benchmarks running with the following JIT knobs - DOTNET_EnableHWIntrinsic = 0/1 (0 for scalar path), DOTNET_EnableAVX2 = 0/1 and DOTNET_EnableAVX512F = 0/1
With proper benchmarks it is clear that even Vector is faster than scalar path.

CaseName: Bayer16Aligned_IMX461_11656x8750_Demosaic

Reference algorithm

SIMD Mode	Worker Count	Method	Mean	Error	StdDev	Median	Ratio	RatioSD	Allocated	Alloc Ratio
Scalar	1	OriginalBayerFilter16bpp	3462.99 ms	2893.51 ms	158.60 ms	3430.13 ms	1.00	-	778.12 MB	1.00

Vector

SIMD Mode	Worker Count	Method	Mean	Error	StdDev	Median	Ratio	RatioSD	Allocated	Alloc Ratio
MaxSIMD	1	OriginalBayerFilter16bpp	3009.95 ms	1721.71 ms	94.37 ms	2998.91 ms	1.00	-	778.12 MB	1.00
Scalar	1	ImprovedBayerFilter16bpp	1049.09 ms	558.52 ms	30.61 ms	1049.42 ms	0.32	0.01	778.13 MB	1.00
NoAVX2	1	ImprovedBayerFilter16bpp	908.95 ms	485.99 ms	26.64 ms	917.23 ms	0.32	0.01	778.13 MB	1.00
MaxSIMD	1	ImprovedBayerFilter16bpp	771.50 ms	58.32 ms	3.20 ms	770.87 ms	0.26	0.01	778.13 MB	1.00
Scalar	15	ImprovedBayerFilter16bpp	265.42 ms	58.73 ms	3.22 ms	263.88 ms	0.08	0.00	778.13 MB	1.00
NoAVX2	15	ImprovedBayerFilter16bpp	236.64 ms	191.78 ms	10.51 ms	234.94 ms	0.08	0.00	778.13 MB	1.00
MaxSIMD	15	ImprovedBayerFilter16bpp	257.65 ms	107.70 ms	5.90 ms	260.39 ms	0.09	0.00	778.13 MB	1.00

AVX intrinsics

SIMD Mode	Worker Count	Method	Mean	Error	StdDev	Median	Ratio	RatioSD	Allocated	Alloc Ratio
NoAVX2	1	ImprovedBayerFilter16bpp	746.39 ms	172.98 ms	9.48 ms	741.04 ms	0.24	0.01	778.13 MB	1.00
MaxSIMD	1	ImprovedBayerFilter16bpp	553.47 ms	1237.55 ms	67.83 ms	575.11 ms	0.16	0.02	778.13 MB	1.00
Scalar	4	ImprovedBayerFilter16bpp	431.21 ms	267.29 ms	14.65 ms	429.20 ms	0.12	0.00	778.13 MB	1.00
NoAVX2	4	ImprovedBayerFilter16bpp	284.38 ms	312.54 ms	17.13 ms	275.95 ms	0.09	0.01	778.13 MB	1.00
MaxSIMD	4	ImprovedBayerFilter16bpp	289.84 ms	747.15 ms	40.95 ms	268.17 ms	0.09	0.01	778.13 MB	1.00
Scalar	8	ImprovedBayerFilter16bpp	308.52 ms	579.60 ms	31.77 ms	308.11 ms	0.09	0.01	778.13 MB	1.00
NoAVX2	8	ImprovedBayerFilter16bpp	225.21 ms	116.29 ms	6.37 ms	225.87 ms	0.07	0.00	778.13 MB	1.00
MaxSIMD	8	ImprovedBayerFilter16bpp	215.16 ms	112.50 ms	6.17 ms	217.53 ms	0.07	0.00	778.13 MB	1.00

With the number of threads performance does not change much. I guess, I have saturated memory I/O on my laptop.

What do you think?

isbeorn · 2026-04-03T21:02:21Z

How did you verify that the edge detection will yield equivalent results. This is such a central component and crucial for star detection that I'm very hesitant to replace it.

daleghent · 2026-04-03T21:09:31Z

Considering NINA is ran most often on low-power CPUs, "typical" core count is going to be 4-8 cores. The Vector* classes will dynamically use what is available and fall back to scalar iteration internally if more advanced instruction sets aren't available. SSE2 and AVX are old, going back over 15 years. AVX2 is more recent, appearing with the Intel Haswell microacrch. Yet, Haswell was introduced over 10 years ago. I would use the Vector classes and let the JIT determine what to use. Working on 8 or 16 16bit pixels at a time is far better than one at a time. Threading is going to be problematic if you are scanning the entire frame in broken-up sections. What if a star sits on the border between sections?

Borschtsch · 2026-04-10T00:08:09Z

How did you verify that the edge detection will yield equivalent results. This is such a central component and crucial for star detection that I'm very hesitant to replace it.

I understand the question needs to be addressed better. In the last commits an image fixture has been created and used in bayer, gaussian and canny tests on different image formats - this fixture creates images with features like uniform black, uniform white, star field, diagonal line, step edge, checker board, etc.
The tests are running against reference implementations in N.I.N.A. and Accord - bayer and both canny ones. I put the both N.I.N.A. original files back in the tests folder.

The change is mostly focused on improving performance of gaussian blur and canny edge detection (magnitude and edge orientation, non-maximums suppression, hysteresis is unchanged as it relies on sequential execution). I tried to keep the structure of the non blur canny the same is in the original file, so it is easier to compare side-by-side. The rest of the updates is benchmarking and test improvements.

Anyway, I think I need to spend more time hardening the tests. The change was relatively easy to do, but testing is lot more important.

Borschtsch · 2026-04-10T00:50:57Z

Considering NINA is ran most often on low-power CPUs, "typical" core count is going to be 4-8 cores. The Vector* classes will dynamically use what is available and fall back to scalar iteration internally if more advanced instruction sets aren't available. SSE2 and AVX are old, going back over 15 years. AVX2 is more recent, appearing with the Intel Haswell microacrch. Yet, Haswell was introduced over 10 years ago. I would use the Vector classes and let the JIT determine what to use. Working on 8 or 16 16bit pixels at a time is far better than one at a time.

This is also my line of thinking with Vector classes, it feels like keeping generic implementation is more beneficial for all kinds of PCs out there.. I will try it later on mini-PCs running N100 to see if we can still benefit. But I think this can be another round once we have good tests we can trust.

Threading is going to be problematic if you are scanning the entire frame in broken-up sections. What if a star sits on the border between sections?

If I understood the question right, the parallel processing is not a problem as long as we can produce independent artifacts like debayered pixels or parts of canny pipeline like in this change..

Borschtsch · 2026-04-13T18:23:51Z

I have improved tests by adding randomly generated star field with different image features to compare new canny behavior against the original (blured vs non-blured).
The input images look like this. If tests fail, then the input/expected/actual and diff will be saved into NINA.Test\bin\Release\net10.0-windows\Artifacts folder.

Please find attached the example of the input images.

Example 1

Example 2

Example 3

Added back the actual reference implementation Bayer demosaicing and No Blur Canny Edge Detector to use in tests, accordingly tests are updated to use reference calculations (including Accord's ones) Added Benchmark project

…nnyEdgeDetector uses this core implementation along with a separate GaussianBlur Using deterministic image fixtures that are reused across all tests for all image formats

…generate a starfield with different features (100 images) that are used for both blured and non-blured canny paths

Borschtsch added 7 commits April 13, 2026 20:29

Added performance improved Canny Edge Detectors

ecfba39

Added back the actual reference implementation Bayer demosaicing and No Blur Canny Edge Detector to use in tests, accordingly tests are updated to use reference calculations (including Accord's ones) Added Benchmark project

Benchmarks allow to compare single and multi-thread performance

9eab110

Disabling the canny edge detector tests (similar to demosaicing tests)

aaa1f07

Making clear the one test script covers both canny edge detectors

c3e04c0

NoBlurCannyEdgeDetector holds the core implementation, the regular Ca…

15c5af6

…nnyEdgeDetector uses this core implementation along with a separate GaussianBlur Using deterministic image fixtures that are reused across all tests for all image formats

Image fixture generates random sized stars

7f5dd5d

Refactored the canny tests - a new series of tests has been added to …

056e70f

…generate a starfield with different features (100 images) that are used for both blured and non-blured canny paths

Borschtsch force-pushed the star_analysis_performance_improvement branch from f8d537b to 056e70f Compare April 13, 2026 18:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Star analysis performance improvement#257

Star analysis performance improvement#257
Borschtsch wants to merge 7 commits into
isbeorn:developfrom
Borschtsch:star_analysis_performance_improvement

Borschtsch commented Apr 2, 2026

Uh oh!

daleghent commented Apr 2, 2026

Uh oh!

Borschtsch commented Apr 2, 2026

Uh oh!

Borschtsch commented Apr 3, 2026

Uh oh!

isbeorn commented Apr 3, 2026

Uh oh!

daleghent commented Apr 3, 2026

Uh oh!

Borschtsch commented Apr 10, 2026 •

edited

Loading

Uh oh!

Borschtsch commented Apr 10, 2026

Uh oh!

Borschtsch commented Apr 13, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

Borschtsch commented Apr 2, 2026

🚀 Purpose

🧪 How Was It Tested?

✅ PR Checklist

🔗 Related Issues

📸 Screenshots

📝 Additional Notes

CannyBlurredBenchmarks

CannyNoBlurBenchmarks

Uh oh!

daleghent commented Apr 2, 2026

Uh oh!

Borschtsch commented Apr 2, 2026

Uh oh!

Borschtsch commented Apr 3, 2026

Uh oh!

isbeorn commented Apr 3, 2026

Uh oh!

daleghent commented Apr 3, 2026

Uh oh!

Borschtsch commented Apr 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Borschtsch commented Apr 10, 2026

Uh oh!

Borschtsch commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Borschtsch commented Apr 10, 2026 •

edited

Loading

Borschtsch commented Apr 13, 2026 •

edited

Loading