We've recently been seeing builds fail because they exhaust resources (memory, I think) on the Gadi login node.
For example, see this ACCESS-OM3 prerelease, which has particularly high memory usage due to LTO flags. It failed with
ifx: error #10106: Fatal error in /apps/intel-tools/.packages/2025.2.0.575/compiler/2025.2/bin/compiler/ld.lld, terminated by kill signal
while building access3. I can reproduce the failure using my own spack instance on a login node, but the build succeeds on a Sapphire Rapids compute node (104 CPUs, 496G memory).
Some options have been floated on Zulip for preventing these failures:
- Have the option to run particularly meaty builds on compute nodes. One complication is that Gadi compute nodes do not have network access, so sources would need to be mirrored on the login node first. I've typically done this with something like:
[login-node] $ spack env activate <env>
[login-node] $ spack concretize -f --fresh
[login-node] $ spack mirror create -d sources -a
<Start interactive job>
[compute-node] $ spack env activate <env>
[compute-node] $ spack install
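The interactive steps above could also be run as a batch job. The following is a minimal sketch, not a tested workflow: the function name `make_build_job`, the `normalsr` queue, the resource requests (taken from the node size mentioned above), the walltime, and the use of `$SPACK_ROOT` are all assumptions, and a real Gadi job would likely also need project and storage directives.

```shell
#!/usr/bin/env bash
# Sketch only: print a PBS job script that installs an already-concretized
# spack environment. Queue, resources and walltime are assumptions.
make_build_job() {
    local env_name=$1
    cat <<EOF
#!/bin/bash
#PBS -q normalsr
#PBS -l ncpus=104
#PBS -l mem=496GB
#PBS -l walltime=04:00:00
#PBS -l wd
# Compute nodes have no network access, so sources must already be mirrored.
. "\$SPACK_ROOT/share/spack/setup-env.sh"
spack env activate "${env_name}"
spack install
EOF
}

# Intended use on the login node, after mirroring the sources
# (the mirror name "local-sources" is hypothetical):
#   spack mirror create -d sources -a
#   spack mirror add local-sources "file://$PWD/sources"
#   make_build_job <env> > build.pbs && qsub build.pbs
```

Registering the mirror with `spack mirror add` before submitting is what lets the compute-node `spack install` find the sources without network access.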
- From @aidanheerdegen:
We could ask NCI if they could increase the allowed memory for the service user ... it would be a lot simpler.
- From @aidanheerdegen:
We might be able to have a simpler qsub version of a build by using a persistent session tunnel to access the source code, so making it more transparent to spack.
- From Angus Gibson:
As far as I can tell there's a 4GB limit of resident memory (I'm not sure if per session or per process), and after that it'll start spilling into swap. There's only 16GB swap and that's shared across all users of the node. On e.g. gadi-login-06 it's currently almost all used:
$ free -h
               total        used        free      shared  buff/cache   available
Mem:           250Gi       108Gi        36Gi        13Gi       104Gi       126Gi
Swap:           15Gi        15Gi        245M
In fact, my test got killed after allocating only 2.8GB there... But on gadi-login-03 there's about 11Gi free and I could get a lot more. So there's a lot of non-determinism depending on where the builder lands, particularly if LTO is a bit hungry (or it's just a complex compile...)
...
Seems to dip into swap at around 4GB combined; it's probably actually per-user, because the cgroup limit is /sys/fs/cgroup/memory/user.slice/user-$(id -u).slice/memory.limit_in_bytes (= 4294967296)
We could potentially try to target nodes that aren't under heavy load?
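One way to approach targeting lightly loaded nodes might be a pre-flight check before starting a build. The sketch below parses `free -m` output for free swap and reads the per-user cgroup limit quoted above; the function names and any threshold passed in are hypothetical, and the cgroup path assumes the cgroup v1 layout described in the quote.

```shell
#!/usr/bin/env bash
# Sketch: check whether the current login node has headroom for a heavy build.
# Function names and thresholds are illustrative assumptions.

# Print free swap in MiB, given `free -m` output on stdin.
swap_free_mib() {
    awk '$1 == "Swap:" { print $4 }'
}

# Succeed if free swap (from `free -m` output on stdin) is at least MIN_MIB.
node_has_headroom() {
    local min_mib=$1 free_mib
    free_mib=$(swap_free_mib)
    [ -n "$free_mib" ] && [ "$free_mib" -ge "$min_mib" ]
}

# Print the per-user memory limit in bytes, if the cgroup v1 file is readable.
user_mem_limit_bytes() {
    local f="/sys/fs/cgroup/memory/user.slice/user-$(id -u).slice/memory.limit_in_bytes"
    [ -r "$f" ] && cat "$f"
}
```

For example, `free -m | node_has_headroom 8192 || echo "low swap on $(hostname)"` would flag a node like gadi-login-06 above. Because swap is shared across all users of the node, such a check is only a snapshot and cannot guarantee the build will fit.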