Some builds are too hungry for login nodes

We've recently seen builds fail because they use more resources (I think memory) than the Gadi login nodes allow.

For example, see [this ACCESS-OM3 prerelease](https://github.com/ACCESS-NRI/ACCESS-OM3/pull/189#issuecomment-3918419102), which has particularly high memory usage due to LTO flags. It failed with

```
ifx: error #10106: Fatal error in /apps/intel-tools/.packages/2025.2.0.575/compiler/2025.2/bin/compiler/ld.lld, terminated by kill signal
```

while building `access3`. I can reproduce the failure using my own spack instance on a login node, but the build succeeds on a Sapphire-Rapids compute node (104 CPUs, 496G memory).

Some options have been floated on [Zulip](https://access-nri.zulipchat.com/#narrow/channel/470325-model-release/topic/Prerelease.20build.20running.20out.20of.20memory/with/574612895) for preventing these failures:

- Have the option to run particularly meaty builds on compute nodes. One complication is that Gadi compute nodes do not have network access, so sources would need to be mirrored from a login node first. I've typically done this in the past with something like:

  ```
  [login-node] $ spack env activate <env>
  [login-node] $ spack concretize -f --fresh
  [login-node] $ spack mirror create -d sources -a
  
  <Start interactive job>
  
  [compute-node] $ spack env activate <env>
  [compute-node] $ spack install
  ```
- From @aidanheerdegen
  > We could ask NCI if they could increase the allowed memory for the service user ... it would be a lot simpler.
- From @aidanheerdegen
  > We might be able to have a simpler `qsub` version of a build by using a persistent session tunnel to access the source code, so making it more transparent to `spack`.
- From Angus Gibson
  > As far as I can tell there's a 4GB limit of resident memory (I'm not sure if per session or per process), and after that it'll start spilling into swap. There's only 16GB swap and that's shared across all users of the node. On e.g. gadi-login-06 it's currently almost all used:
  > 
  > ```
  > $ free -h
  >               total        used        free      shared  buff/cache   available
  > Mem:          250Gi       108Gi        36Gi        13Gi       104Gi       126Gi
  > Swap:          15Gi        15Gi       245M
  > ```
  > In fact, my test got killed after allocating only 2.8GB there...But on gadi-login-03 there's about 11Gi free and I could get a lot more. So a lot of non-determinism around where the builder lands, particularly if LTO is a bit hungry (or just a complex compile...)
  > ... 
  > Seems to dip into swap at around 4GB combined, it's probably actually per-user because the cgroup is `/sys/fs/cgroup/memory/user.slice/user-$(id).slice/memory.limit_in_bytes` (= 4294967296)

  We could potentially try to target nodes that aren't under heavy load?
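
As a rough sketch of what "targeting" could look like, the builder could check the per-user cgroup limit and the node's free swap before starting, and bail out (or try another login node) if there isn't enough headroom. The function names and the 8 GiB threshold below are made up for illustration, the cgroup path is the one quoted above with `id -u` assumed for the user id, and I haven't checked how well free swap actually predicts a successful build:

```shell
# Sketch only: helper names and the threshold in the usage example are hypothetical.

# Path of the per-user memory limit (cgroup v1 layout, as quoted above).
user_cgroup_limit_file() {
    echo "/sys/fs/cgroup/memory/user.slice/user-$(id -u).slice/memory.limit_in_bytes"
}

# Read `free -b` output on stdin and print the free swap in bytes
# (fourth field of the "Swap:" row).
parse_swap_free() {
    awk '/^Swap:/ {print $4}'
}

# Succeed if free swap ($1, bytes) is at least the required minimum ($2, bytes).
has_swap_headroom() {
    [ "$1" -ge "$2" ]
}

# Example usage on a login node (8 GiB threshold is an arbitrary guess):
#   cat "$(user_cgroup_limit_file)"
#   swap_free=$(free -b | parse_swap_free)
#   if has_swap_headroom "$swap_free" $((8 * 1024 * 1024 * 1024)); then
#       spack install
#   else
#       echo "Not enough free swap on $(hostname), try another node" >&2
#   fi
```

This only helps with the non-determinism between login nodes. It doesn't get around the 4GB per-user limit itself, so the compute-node or NCI options are still needed for the really hungry builds.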




Some builds are too hungry for login nodes #362
