VMs disappeared after strange event causing many repeated actions on node #2650

@scottyeager

On December 4, we received reports of multiple VMs becoming unreachable on mainnet node 8. This coincided with broader reports of workload issues across FreeFarm and Naiein_000 (same physical location), including VM failures and nodes failing to provide gateway services to workloads on nodes outside FreeFarm.

I focused my efforts on node 8, since I had a VM on that node which suffered from the issue:

  • Inspection of the node via SSH revealed that only two VMs remained, although at least four VM contracts were active at the time
  • Missing VMs appeared to be completely decommissioned ("no mount points/logs/remnants")
  • There are no obvious logs explaining the disappearance of these VMs

The strange event

We don't know exactly when these VMs disappeared, but there is evidence of some significant event happening on this node during the same day.

Metrics:

Image

Logs volume:

Image

Checking the logs, we see many elements of a node boot-up sequence (node registering itself, various services starting), but the machine did not reboot. I would suspect this to be related to a system update, but the last log entry indicating an update is from November 20.

Closer inspection of the logs reveals that certain actions were repeated many times within a very short period. For example, we can see that Redis started up 33 times over about two minutes.

Image
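For anyone wanting to check other nodes for the same pattern, here is a rough Python sketch that counts how many times a given log message appears within a sliding time window. It is only an illustration under assumptions: the log format (an ISO 8601 timestamp as the first field), the message text `redis: starting`, and the sample lines are all made up here and are not taken from the actual node logs.

```python
from datetime import datetime, timedelta


def parse_matching(lines, needle):
    """Collect timestamps of lines containing `needle`.

    Assumes (hypothetically) that each line starts with an ISO 8601
    timestamp followed by whitespace. Lines that do not parse are skipped.
    """
    stamps = []
    for line in lines:
        if needle not in line:
            continue
        fields = line.split(maxsplit=1)
        if not fields:
            continue
        try:
            stamps.append(datetime.fromisoformat(fields[0]))
        except ValueError:
            continue
    return sorted(stamps)


def busiest_window(stamps, window):
    """Return (count, window_start) for the densest window of length `window`.

    `stamps` must be sorted. Returns (0, None) when there are no timestamps.
    """
    best_count, best_start = 0, None
    left = 0
    for right, ts in enumerate(stamps):
        # Shrink the window from the left until it spans at most `window`.
        while ts - stamps[left] > window:
            left += 1
        count = right - left + 1
        if count > best_count:
            best_count, best_start = count, stamps[left]
    return best_count, best_start


# Invented sample data, only to show the call shape.
sample = [
    "2000-01-01T10:00:00 redis: starting",
    "2000-01-01T10:00:05 redis: starting",
    "2000-01-01T10:00:09 other: hello",
    "2000-01-01T10:01:30 redis: starting",
    "2000-01-01T10:30:00 redis: starting",
]

count, start = busiest_window(
    parse_matching(sample, "redis: starting"), timedelta(minutes=2)
)
print(count, start)
```

Running something like this against exported logs from node 8 and node 50 with the same message and window would make it easier to compare the two events, provided the parsing is adapted to the real log format.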

Also of potential interest, we see repeated attempts to mount an flist for a VM called "georunner", which is one of the two VMs that survived the incident:

Image

The other surviving VM, "vm_v2clsvb8", didn't get the same treatment, so there's not a clear correlation in that regard.

Conclusion

I'm not sure whether the VM disappearance and the strange event are directly related, but it would be rather coincidental if they were not. Either way, I think both are worth investigating.

I should also note that node 50 experienced a similar strange event a couple of hours later:

Image Image
