Skip to content

Add a numerical mixing diagnostic to MOM6#57

Draft
jbisits wants to merge 288 commits into
2026.05from
jib/numerical-mixing
Draft

Add a numerical mixing diagnostic to MOM6#57
jbisits wants to merge 288 commits into
2026.05from
jib/numerical-mixing

Conversation

@jbisits

@jbisits jbisits commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator

This PR adds a diagnostic that can track the variance production over the transport timestep, which is related to the numerical mixing of tracers, of any registered tracer in the MOM6 tracer directory. The diagnostic is:

$$ \begin{eqnarray} %&&\text{{Advection scheme variance production in 2D/vertical ALE cell $i,j$}} =\nonumber &&\chi_{i, j}^{\tau} = \bigg[\frac{hc^2}{\Delta t}\bigg]_{i,j,\tau}^{i,j,\tau + 1} + \frac{1}{\Delta_{i,j}A} \left( \bigg[ 2f^xc - Uc^2 \bigg]_{i-\frac{1}{2},j,\tau}^{i+\frac{1}{2},j,\tau} + \bigg[ 2f^yc - Vc^2 \bigg]_{i,j-\frac{1}{2},\tau}^{i,j+\frac{1}{2},\tau} \right) \end{eqnarray} $$

We have a manuscript that details the derivation and uses the diagnostic in ANU-TUB that I think we can share if needed.

The diagnostic (equation above) is computed in src/tracer/MOM_tracer_numerical_mixing.F90 from prognostic and diagnostic variables available in MOM6. The terms in the equation above are:

  • $c$: tracer concentration (any tracer that is registered in tracer registry)
  • $h$: thickness at previous and current timestep;
  • $\Delta t$: transport timestep
  • $1 / \Delta A_{i, j}$: inverse grid cell area;
  • $f^{x}$: diagnostic x advective flux;
  • $U$: zonal mass transport;
  • $f^{y}$: diagnostic y advective flux;
  • $V$: meridional mass transport.

Upwind value are used for concentration on the grid cell faces. Please let me know what other information I should add here!

I am baffled and need some help with the failing thickness test. The diagnostic is split into two subroutines: thickness_weighted_variance_advection and thickness_weighted_variance_flux_divergence. When both subroutines are used the CI thickness test fails on the OBC test but if I run CI on the subroutines one at a time all tests pass; see CI only with variance flux subroutine and CI only with variance advection subroutine). Any help with this would be great!

Closes ACCESS-NRI/access-om3-configs#897

cc @janzika

@jbisits

jbisits commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator Author

I need to bring this branch up to date with the current ACCESS release which I will do before this is ready for review. This should reduce the total number of changes as the only intended addition of this PR is the numerical mixing diagnostic. But because I forked from the mom6-ocean fork I have conflicts.

@dougiesquire

Copy link
Copy Markdown
Collaborator

@jbisits let me know if you'd like any help with rebasing

@jbisits

jbisits commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator Author

@jbisits let me know if you'd like any help with rebasing

Yeah that would be great! I am happy to try first but if you have time I am sure you will be able to do it much faster! I can add a list of file that I have changed for the new diagnostic if that helps?

@dougiesquire

Copy link
Copy Markdown
Collaborator

I suspect rebasing will fix the failing tests - it looks like you might've incorrectly resolved a few merge conflicts along the way. I'll have a stab today.

@jbisits

jbisits commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator Author

I suspect rebasing will fix the failing tests - it looks like you might've incorrectly resolved a few merge conflicts along the way. I'll have a stab today.

Great, thanks @dougiesquire! Unfortunately I do recall when I merged my fork with ACCESS-NRI I had a lot of conflicts and there were quite a few that I did not look at too carefully. My apologies! Let me know if there is anything I can do to help.

@dougiesquire

dougiesquire commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator

That was a lot of commits and conflicts 😅. Unfortunately, it looks like the test-dim (h, vertical thickness) tests are still failing.

Could you please check that this looks like the full set of changes? Once you've confirmed, I'll force push to this branch

@jbisits

jbisits commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator Author

That was a lot of commits and conflicts 😅. Unfortunately, it looks like the test-dim (h, vertical thickness) tests are still failing.

That's unfortunate.

Could you please check that this looks like the full set of changes? Once you've confirmed, I'll force push to this branch

Yep these look like all the changes, thanks!

@dougiesquire dougiesquire force-pushed the jib/numerical-mixing branch from 5386bbb to 9b1dd04 Compare June 2, 2026 04:08
@dougiesquire

Copy link
Copy Markdown
Collaborator

I've force pushed @jbisits. You'll probably want to update your local branch. Something like:

git fetch access-nri
git reset --hard access-nri/jib/numerical-mixing

@jbisits

jbisits commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator Author

Thanks for all that work @dougiesquire! I have tried many times to fix the thickness test and always come up short. I also have not quite been able to figure out how the test works so I can run locally. Do you think it would be worth coming to the next TWG working group meeting and showing where this is all at? Maybe someone has encountered something like this before.

@dougiesquire

dougiesquire commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator

I've tried to look into the failing test and I am also pretty baffled.

It's failing the thickness dimensional rescaling test for tc3. This runs the .testing/tc3 configuration with H_RESCALE_POWER=1 and H_RESCALE_POWER=11 and expects the same results.

The test fails because the remapped z-level _advection_scheme_variance_production diagnostics for all of the tr_D1-tr_D11 DOME tracers differ very slightly at every time step except for the first, e.g.:

...
- h-point: c=     24416 ocean_model_z-tr_D1_advection_scheme_variance_production 0 240
+ h-point: c=     24400 ocean_model_z-tr_D1_advection_scheme_variance_production 0 240
...
- h-point: c=     24416 ocean_model_z-tr_D2_advection_scheme_variance_production 0 240
+ h-point: c=     24400 ocean_model_z-tr_D2_advection_scheme_variance_production 0 240
...

The native diagnostics are the same between the two runs; it's only the remapped diagnostics that are different.

The magnitudes of the *_advection_scheme_variance_production diagnostics are very small in this configuration (somewhere around e-50). I think this is because there's effectively zero tracer advection. My guess is that what we're seeing is floating-point rounding differences in the z-level remapping.

I've output the tr_D1_advection_scheme_variance_production diagnostic on z-levels at double precision at every timestep to see where the differences are. These are the differences at every depth level at the final time step (note the tiny magnitudes):

Screenshot 2026-06-03 at 10 37 49 am

I'm not really sure what to do about this at this point. The fact that the native diagnostics pass the test indicates that the computation of asvp is h-scaling-invariant, so I don't think reformulating the diagnostic can help. The issue seems to be the z-level remapping of near-zero values. Perhaps @marshallward has a suggestion?

@jbisits

jbisits commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator Author

Thanks for the investigative work @dougiesquire!

The native diagnostics are the same between the two runs, it's only the remapped diagnostics that are different.

That's good!

I've output the diagnostics at double precision at every timestep to see where the differences are. These are the differences at every depth level at the final time step (note the tiny magnitudes):

For me to double check, this figure is the difference between z-tr_D1_advection_scheme_variance_production and z-tr_D2_advection_scheme_variance_production, where D1=1 and D2=11, at each depth level? And from the figure it is not every x and y point in the domain that there is error, only some? And at greater depths there is none?

I think this is because there's effectively zero tracer advection. My guess is that what we're seeing is floating-point rounding differences in the z-level remapping.

I don't know much about the diagnostic remapping but I am keen to learn more. Extremely unlikely I will find anything you have not done, or know, already though.

@dougiesquire

Copy link
Copy Markdown
Collaborator

For me to double check, this figure is the difference between z-tr_D1_advection_scheme_variance_production and z-tr_D2_advection_scheme_variance_production, where D1=1 and D2=11, at each depth level? And from the figure it is not every x and y point in the domain that there is error, only some? And at greater depths there is none?

Sorry, my comment was a bit unclear (now edited). The figure shows the difference in the tr_D1_advection_scheme_variance_production diagnostic on z-levels between the runs with H_RESCALE_POWER=1 and H_RESCALE_POWER=11. The other tracers (tr_D2 etc) show similar differences. Yeah, most of the points are the same (white). Some depths show no differences, but the location of the differences changes at each timestep, e.g. this is two timesteps prior:

Screenshot 2026-06-03 at 12 26 18 pm

@jbisits

jbisits commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator Author

For me to double check, this figure is the difference between z-tr_D1_advection_scheme_variance_production and z-tr_D2_advection_scheme_variance_production, where D1=1 and D2=11, at each depth level? And from the figure it is not every x and y point in the domain that there is error, only some? And at greater depths there is none?

Sorry, my comment was a bit unclear (now edited). The figure shows the difference in the tr_D1_advection_scheme_variance_production diagnostic on z-levels between the runs with H_RESCALE_POWER=1 and H_RESCALE_POWER=11. The other tracers (tr_D2 etc) show similar differences. Yeah, most of the points are the same (white). Some depths show no differences, but the location of the differences changes at each timestep, e.g. this is two timesteps prior:

Screenshot 2026-06-03 at 12 26 18 pm

Cool, actually was me not understanding but all cleared up now!

@marshallward

Copy link
Copy Markdown

I don't have much time to have a look this week, but could it be a H vs Z mixup somewhere?

@jbisits

jbisits commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator Author

Would a H and Z mixup be more likely internally or when outputting?

Here is the internal dimensional analysis of the computation in the subroutines:

thickness_weighted_variance_advection computation:

Ih = 1 / h(i,j,k)                                                              !< [H-1]
ht_prev = h_diag(i,j,k) * Tr%t_prev(i,j,k)                                     !< [H]*[CU] = [H CU] 
ht_adv = ht_prev + dt * Tr%advection_xy(i,j,k)                                 !< [H CU] + [T]*[CU H T-1] = [H CU]
asvp(i,j,k) = ( (Ih * (ht_adv*ht_adv)) - (ht_prev * Tr%t_prev(i,j,k)) ) * Idt  !< ([H-1]*[H2 CU2] - [H CU]*[CU]) * [T-1] = [CU H T-1]

thickness_weighted_variance_flux_divergence computation:

 east = (2 * (Tr%ad_x(I,j,k)  *x_upwind(I,j,k)))   - ((Idt*uhtr(I,j,k))   * (x_upwind(I,j,k)  *x_upwind(I,j,k)))    !< [CU H L2 T-1] * [CU] - [T-1]*[H L2]*[CU]*[CU] = [CU2 H L2 T-1]
 west = (2 * (Tr%ad_x(I-1,j,k)*x_upwind(I-1,j,k))) - ((Idt*uhtr(I-1,j,k)) * (x_upwind(I-1,j,k)*x_upwind(I-1,j,k)))  !< [CU H L2 T-1] * [CU] - [T-1]*[H L2]*[CU]*[CU] = [CU2 H L2 T-1]
north = (2 * (Tr%ad_y(i,J,k)  *y_upwind(i,J,k)))   - ((Idt*vhtr(i,J,k))   * (y_upwind(i,J,k)  *y_upwind(i,J,k)))    !< [CU H L2 T-1] * [CU] - [T-1]*[H L2]*[CU]*[CU] = [CU2 H L2 T-1]
south = (2 * (Tr%ad_y(i,J-1,k)*y_upwind(i,J-1,k))) - ((Idt*vhtr(i,J-1,k)) * (y_upwind(i,J-1,k)*y_upwind(i,J-1,k)))  !< [CU H L2 T-1] * [CU] - [T-1]*[H L2]*[CU]*[CU] = [CU2 H L2 T-1]
asvp(i,j,k) = asvp(i,j,k) + (((east - west) + (north - south)) * G%IareaT(i,j)) !< [CU2 H L2 T-1] * [L-2] = [CU2 H T-1]

And the conversion when outputting:

Tr%id_advection_scheme_variance_production = register_diag_field("ocean_model", &
    trim(shortnm)//"_advection_scheme_variance_production", diag%axesTL, Time, &
     "Spurious variance production of "//trim(shortnm)//" variance due to advection", &
     trim(Tr%units)//"2 m s-1", conversion=(TR%conc_scale**2)*GV%H_to_MKS*US%s_to_T)

@marshallward

marshallward commented Jun 4, 2026

Copy link
Copy Markdown

tc3 is based on a zero-flow open boundary test. Looking at OBC code might help to track it down.

But the way to do these is to turn on Debug = True and diff through the output. You might need to add extra [huvB]chksum() functions to narrow it down.

If there's no urgency here then I can also give it a try next week.

@jbisits

jbisits commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator Author

tc3 is based on a zero-flow open boundary test. Looking at OBC code might help to track it down.

But the way to do these is to turn on Debug = True and diff through the output. You might need to add extra [huvB]chksum() functions to narrow it down.

I have the tests up and running locally so I will see if I can make any progress on this.

If there's no urgency here then I can also give it a try next week.

No urgency here; would be great if you could take a look next week!

@jbisits

jbisits commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator Author

I ran with Debug = True and the only diff between work/symmetric/std.err and work/dim.h/str.err was:

diff symmetric/std.err dim.h/std.err
5651c5651
<                                Memuse(MB) at Memory HiWaterMark=  4.502E+01  4.502E+01  0.000E+00  4.502E+01
---
>                                Memuse(MB) at Memory HiWaterMark=  4.505E+01  4.505E+01  0.000E+00  4.505E+01

Does this indicate there was no difference between the model runs (apart from memory usage)? If you get some time to take a look next week @marshallward that would be great!

@marshallward

Copy link
Copy Markdown

I was able to reproduce it on my laptop:

cd .testing
make -j tc3.dim.h.diag

and the error is in ocean_model_z-tr_D3_advection_scheme_variance_production.

DEBUG = True unfortunately doesn't have much to add, since there are (understandably) very few debug checksums in your new module.

I added some chksum calls for asvp, and the min and max do seem to scale like H, so that part is likely correct.

But I also noticed is that these numbers are very small.

h-point: mean=   0.0000000000000000E+00 min=   1.4165571386305437E-59 max=   1.1872909495924985E-43 Tr nummix after advect: asvp
h-point: c=     25200 Tr nummix after advect: asvp                              
h-point: mean=   0.0000000000000000E+00 min=   1.4165571386305437E-59 max=   1.1872909495924985E-43 Tr nummix after div: asvp
h-point: c=     25200 Tr nummix after div: asvp                                                                                                                                                                                                                                        

The min and max are 1e-43 to 1e-60. The mean is reported as zero, because most of these values are too small to be recorded in our reproducing_sum() subroutine.

Also, if I reduce H_RESCALE_POWER to 6 (i.e. rescale H by 2**6), then it passes. If I increase to 7, then it fails.

So the short answer is that your numbers are too small to apply dimensional scaling.

There are a few things you can do here, perhaps add a "floor" function that zeros out numbers below a certain value. Or perhaps there is something about your calculation that is capturing residuals in a strange way.

@Hallberg-NOAA has encountered this before, so he might be a better person to comment.

@jbisits

jbisits commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator Author

Thanks for taking a look @marshallward!

There are a few things you can do here, perhaps add a "floor" function that zeros out numbers below a certain value.

I can add this but will just need to choose the value below which to zero below. The maximum value in your example above may work for these tests but perhaps there is more appropriate value to choose (though I do not know what that might be!).

@Hallberg-NOAA if you have other suggestions or comments please let me know!

Thanks again for the help on this!

@dougiesquire

Copy link
Copy Markdown
Collaborator

Thanks for looking @marshallward

So the short answer is that your numbers are too small to apply dimensional scaling.

Just to state the obvious: As I mentioned here, the native diagnostics (e.g. ocean_model-tr_D1_advection_scheme_variance_production) are obviously equally small and do pass dimensional rescaling, even with H_RESCALE_POWER=11. It's only the remapped z-level diagnostics that fail.

There are a few things you can do here, perhaps add a "floor" function that zeros out numbers below a certain value. Or perhaps there is something about your calculation that is capturing residuals in a strange way.

I'm not really across the diagnostic but it may effectively be defined as a residual. If there were no numerical mixing, the two terms cancel? That at least seems to be the case in the failing configuration - e.g. these are the chksums for each term:

h-point: mean=   0.0000000000000000E+00 min=  -4.6008012880328669E-43 max=   2.3705811131449731E-42 ocean_model_z-tr_D1_thickness_weighted_variance_advection 0 240
h-point: c=     25652 ocean_model_z-tr_D1_thickness_weighted_variance_advection 0 240

h-point: mean=   0.0000000000000000E+00 min=  -2.3705811131449811E-42 max=   4.6008012880327427E-43 ocean_model_z-tr_D1_thickness_weighted_variance_flux_divergence 0 240
h-point: c=     25628 ocean_model_z-tr_D1_thickness_weighted_variance_flux_divergence 0 240

I guess this makes it particularly prone to this kind of issue.

Good to know that flooring is an option. I wasn't sure that MOM6 maintainers would be happy with such an approach.

@jbisits

jbisits commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator Author

I'm not really across the diagnostic but it may effectively be defined as a residual. If there were no numerical mixing, the two terms cancel?

If the two terms cancelled there would be no variance production from the advection scheme i.e. the advection scheme would conserve the second moment of tracers (which they tend not to unless they are specifically designed to). So you are right @dougiesquire it can be thought of as a residual for tracer variance over the advection timestep. Apologies if this is an obvious restatement of your question!

That at least seems to be the case in the failing configuration - e.g. these are the chksums for each term:

h-point: mean=   0.0000000000000000E+00 min=  -4.6008012880328669E-43 max=   2.3705811131449731E-42 ocean_model_z-tr_D1_thickness_weighted_variance_advection 0 240
h-point: c=     25652 ocean_model_z-tr_D1_thickness_weighted_variance_advection 0 240

h-point: mean=   0.0000000000000000E+00 min=  -2.3705811131449811E-42 max=   4.6008012880327427E-43 ocean_model_z-tr_D1_thickness_weighted_variance_flux_divergence 0 240
h-point: c=     25628 ocean_model_z-tr_D1_thickness_weighted_variance_flux_divergence 0 240

Ah, this explains when I ran CI on the subroutines separately they passed; see above. Is summing these two very small components where you thought the numerical error might be coming from in the remapped z case? Or is it in the bit count that these numbers might be slightly different but very close?

@marshallward

Copy link
Copy Markdown

Good to know that flooring is an option. I wasn't sure that MOM6 maintainers would be happy with such an approach.

I think this has been done before, but I probably would only do it as a last resort.

I wonder if this operation should be handled inside of chksum rather than the solvers. But this is something that @Hallberg-NOAA should address. (He is on leave this week but should be back soon.)

(Also apologies @dougiesquire for not reading your earlier message more carefully about the smallness of the values.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Adding a numerical mixing diagnostic to the MC_25km_jra_ryf configuration

4 participants