Investigating "ghost deadlock" occurring randomly and unpredictably #707

@NotRequiem

Source info

  • This is from the latest commit in the main branch
  • This is from the latest release
  • Other

Issue type

  • False positive
  • Compilation warning/error
  • Suggestion
  • Runtime error/crash
  • Other

Environment

  • Windows
  • Linux
  • MacOS
  • Other

Description

Since commit ddab9e5, VM::TIMER has been reported to deadlock on a Lenovo IdeaPad Miix 510-12ISK. The function never returns, even though the CPU core count appears to be sufficient.

After researching the CPUs that this Lenovo model normally ships with, the candidates are the Intel Pentium 4405U, Core i3-6006U / i3-6100U, Core i5-6200U, and Core i7-6500U. It was heuristically determined that the CPU must be an i3. Even if those CPUs do not have many cores, the random entropy algorithm may be choosing unexpected core indexes when pinning the counter and trigger threads of the timing anomalies check.
Specifically, the get_trigger_mask lambda is pending review on June 15, once I'm free. It attempts to randomize where the routine runs, so that the hypervisor cannot predict where the counter is even after intercepting instructions like RDRAND/RDSEED (which is why VMAware doesn't use the standard library's random_device).
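To make the hypothesis concrete, here is a minimal sketch of a defensive core selection. It is not VMAware's get_trigger_mask: the function name, its parameters, and the idea of passing in a 64-bit allowed-affinity mask are all assumptions for illustration. In real code the mask would presumably come from something like GetProcessAffinityMask on Windows or sched_getaffinity on Linux. The sketch only shows how entropy can be reduced so that the two picks always fall on usable, distinct cores.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper for illustration; this is NOT VMAware's get_trigger_mask.
// Picks two distinct core indexes out of the cores the process may run on.
// Returns false when fewer than two cores are usable, so the caller can
// skip pinning altogether.
bool pick_counter_and_trigger_cores(std::uint64_t allowed_mask,
                                    std::uint64_t entropy_a,
                                    std::uint64_t entropy_b,
                                    unsigned& counter_core,
                                    unsigned& trigger_core) {
    // Keep only the indexes whose bit is set in the allowed affinity mask.
    std::vector<unsigned> usable;
    for (unsigned i = 0; i < 64; ++i) {
        if (allowed_mask & (std::uint64_t{1} << i)) {
            usable.push_back(i);
        }
    }
    if (usable.size() < 2) {
        return false;
    }

    // Reduce the entropy into the usable range, not the raw core count.
    const std::size_t a = static_cast<std::size_t>(entropy_a % usable.size());
    // Draw the second slot from the remaining size-1 slots and shift it past
    // the first one, so both picks can never land on the same core.
    std::size_t b = static_cast<std::size_t>(entropy_b % (usable.size() - 1));
    if (b >= a) {
        ++b;
    }
    assert(a != b && b < usable.size());

    counter_core = usable[a];
    trigger_core = usable[b];
    return true;
}
```

For example, with allowed_mask = 0xA (cores 1 and 3 usable), entropy_a = 3 and entropy_b = 2, the sketch picks core 3 for the counter and core 1 for the trigger. With a single usable core it returns false instead of producing an index.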

Some arm64 machines are also freezing from ddab9e5 onward when calling VMAware routines on a single thread. However, VM::TIMER doesn't run on ARM machines, and we've confirmed that the runtime safeguard that detects whether the x86 assembly code is being translated to ARM on Windows is working and prevents the detection from running in such environments. This suggests that the issue:

  • Might be happening at a completely different, unknown point in the code.
  • Might be happening at multiple points in the code, which may or may not include VM::TIMER.

The issue ends in an unrecoverable freeze, which makes it impossible to analyze debug logs.

On June 15, once I have some time, VMAware will be re-run on these CPUs to determine the cause of the deadlock via stack tracing, and all the relevant code will be re-analyzed.

CLI output (if possible)

Labels

bug: Something isn't working
