23 changes: 23 additions & 0 deletions README.md
@@ -39,6 +39,7 @@ Use this only on systems you own or are explicitly authorized to administer.

## Contents

- [What's New in 1.8.5](#whats-new-in-185)
- [What's New in 1.8.0](#whats-new-in-180)
- [What's New in 1.7.0](#whats-new-in-170)
- [What's New in 1.6.0](#whats-new-in-160)
@@ -57,6 +58,28 @@ Use this only on systems you own or are explicitly authorized to administer.

---

## What's New in 1.8.5

A fleet **reliability and observability** release. **No database schema change since 1.8.0.**

**Agent liveness over a named pipe**
- The Helper (updater) now reads agent liveness from the agent's read-only status pipe
(`RemoteAgent.status` → `LastHeartbeatUtc`) instead of a heartbeat file, removing a file-race that could
report a bogus multi-billion-second "stale heartbeat" and force an unnecessary agent restart.
- A two-poll confirmation keeps a single transient blip from restarting a healthy agent. The legacy
heartbeat file is still written for an older, file-based Helper during a rolling update and **self-retires**
once the co-located Helper is the new pipe-aware build.
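
The two-poll confirmation can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not the shipped `SupervisorWorker`: `HungDetector` and `Observe` are hypothetical names, and the 90-second threshold and two-poll requirement mirror the `HeartbeatStale` and `HungConfirmPolls` constants introduced in this change.

```csharp
using System;

// Hypothetical sketch of the transient-blip filter; not the real SupervisorWorker.
public sealed class HungDetector
{
    private readonly TimeSpan _staleAfter;   // heartbeat older than this counts as unhealthy
    private readonly int _confirmPolls;      // consecutive unhealthy polls required to act
    private int _unhealthyPolls;

    public HungDetector(TimeSpan staleAfter, int confirmPolls)
    {
        _staleAfter = staleAfter;
        _confirmPolls = confirmPolls;
    }

    // heartbeatAge == null models "the status pipe did not answer".
    // Returns true only when a forced restart should happen on this poll.
    public bool Observe(TimeSpan? heartbeatAge)
    {
        if (heartbeatAge is { } age && age <= _staleAfter)
        {
            _unhealthyPolls = 0;             // a healthy reading clears the streak
            return false;
        }

        if (++_unhealthyPolls < _confirmPolls)
            return false;                    // first bad reading: wait for confirmation

        _unhealthyPolls = 0;                 // restart is triggered; start counting afresh
        return true;
    }
}
```

With `new HungDetector(TimeSpan.FromSeconds(90), 2)`, one missing reading followed by a fresh one never restarts the agent; two unhealthy readings in a row do.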

**Flaky-link detection (observability only)**
- The device list now tells **"alive but on a poor network"** apart from **"offline / dead"**: a device with
frequent C2 reconnects shows as **`◐ flaky`** (amber) instead of **`○ offline`** (grey), with the reconnect
count in the tooltip and a *Link* row in the telemetry panel.
- Computed **server-side** from C2 connection churn (in-memory, last hour). It is **pure observability and
never triggers a restart**, needs no schema change, and is backward/forward compatible (older clients
ignore the new field; an older server leaves it dormant).
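
The server-side computation is not part of the files shown in this diff, so the following is only a sketch of how an in-memory, last-hour reconnect count could feed the display decision. `ReconnectWindow` and `LinkDisplay` are hypothetical names; the threshold of 3 matches `DeviceInfo.FlakyReconnectThreshold`, and the sketch assumes connection timestamps arrive in chronological order and that callers handle their own locking.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sliding-window counter for one device's C2 (re)connections.
public sealed class ReconnectWindow
{
    private static readonly TimeSpan Window = TimeSpan.FromHours(1);
    private readonly Queue<DateTimeOffset> _connects = new();

    // Assumes calls arrive in chronological order; not thread-safe.
    public void RecordConnect(DateTimeOffset at) => _connects.Enqueue(at);

    // Drops entries older than one hour, then returns how many remain.
    public int CountRecent(DateTimeOffset now)
    {
        while (_connects.Count > 0 && now - _connects.Peek() > Window)
            _connects.Dequeue();
        return _connects.Count;
    }
}

// Hypothetical display helper for a device that is currently not connected.
public static class LinkDisplay
{
    public const int FlakyReconnectThreshold = 3; // same value as DeviceInfo.FlakyReconnectThreshold

    public static string DescribeDisconnected(int recentReconnects) =>
        recentReconnects >= FlakyReconnectThreshold ? "flaky" : "offline";
}
```

Because the count lives only in memory, it resets when the server restarts, which is consistent with the feature being observability-only.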

---

## What's New in 1.8.0

1.8.0 adds **agentless operator consoles for Linux and Windows** and hardens the keyless sign-in path.
12 changes: 12 additions & 0 deletions src/RemoteAgent.Contracts/Admin.cs
@@ -17,6 +17,18 @@ public sealed class DeviceInfo
[JsonPropertyName("online")]
public bool Online { get; set; }

/// <summary>C2 (re)connections observed for this device in the last hour (server connection registry).
/// Below <see cref="FlakyReconnectThreshold"/> = stable; at or above it = flaky link (agent likely alive, poor network), not a dead device.</summary>
[JsonPropertyName("recentReconnects")]
public int RecentReconnects { get; set; }

/// <summary>Churn at or above this is shown as "flaky" rather than "stable". Shared display threshold.</summary>
public const int FlakyReconnectThreshold = 3;

/// <summary>Derived display flag: the link churns enough to call it flaky. Not serialized.</summary>
[JsonIgnore]
public bool LinkFlaky => RecentReconnects >= FlakyReconnectThreshold;

[JsonPropertyName("lastSeenAt")]
public DateTimeOffset? LastSeenAt { get; set; }

4 changes: 4 additions & 0 deletions src/RemoteAgent.Contracts/Status.cs
@@ -39,6 +39,10 @@ public sealed class StatusReport
/// <summary>Time of the last successful server contact, either C2 connection or telemetry.</summary>
[JsonPropertyName("lastServerContactUtc")] public DateTimeOffset? LastServerContactUtc { get; set; }

/// <summary>Agent liveness tick, updated by the agent roughly every 15 s. The Helper reads it over this
/// status pipe to detect a hung agent (stale or missing tick), replacing the old heartbeat file.</summary>
[JsonPropertyName("lastHeartbeatUtc")] public DateTimeOffset? LastHeartbeatUtc { get; set; }

/// <summary>Local agent device ID sent by the client in login/reset requests for the device-level failure counter.</summary>
[JsonPropertyName("deviceId")] public string? DeviceId { get; set; }
}
1 change: 1 addition & 0 deletions src/RemoteAgent.Updater/Localization/String.en.cs
@@ -7,6 +7,7 @@ internal static partial class Strings
private static readonly Dictionary<string, string> En = new()
{
[nameof(SupervisorWorker_AgentHungHeartbeatAbout0)] = "agent hung (heartbeat about {0:F0}s old) - forced restart",
[nameof(SupervisorWorker_AgentHungNoHeartbeat)] = "agent hung (heartbeat file missing/unreadable) - forced restart",
[nameof(SupervisorWorker_RemoteAgentIsNotRunningState)] = "RemoteAgent is not running ({State}) - starting.",
[nameof(SupervisorWorker_AgentStoppedRestarted)] = "agent stopped -> restarted",
[nameof(SupervisorWorker_AgentStartFailed)] = "agent start failed",
1 change: 1 addition & 0 deletions src/RemoteAgent.Updater/Localization/String.hu.cs
@@ -7,6 +7,7 @@ internal static partial class Strings
private static readonly Dictionary<string, string> Hu = new()
{
[nameof(SupervisorWorker_AgentHungHeartbeatAbout0)] = "agent beragadt (életjel ~{0:F0}s régi) — kényszerített újraindítás",
[nameof(SupervisorWorker_AgentHungNoHeartbeat)] = "agent beragadt (nincs/olvashatatlan életjel-fájl) — kényszerített újraindítás",
[nameof(SupervisorWorker_RemoteAgentIsNotRunningState)] = "A RemoteAgent nem fut ({State}) — indítás.",
[nameof(SupervisorWorker_AgentStoppedRestarted)] = "agent leállt → újraindítva",
[nameof(SupervisorWorker_AgentStartFailed)] = "agent indítása sikertelen",
1 change: 1 addition & 0 deletions src/RemoteAgent.Updater/Localization/Strings.cs
@@ -61,6 +61,7 @@ private static string NormalizeLanguageCode(string? langCode)
}

public static string SupervisorWorker_AgentHungHeartbeatAbout0 => Get(nameof(SupervisorWorker_AgentHungHeartbeatAbout0));
public static string SupervisorWorker_AgentHungNoHeartbeat => Get(nameof(SupervisorWorker_AgentHungNoHeartbeat));
public static string SupervisorWorker_RemoteAgentIsNotRunningState => Get(nameof(SupervisorWorker_RemoteAgentIsNotRunningState));
public static string SupervisorWorker_AgentStoppedRestarted => Get(nameof(SupervisorWorker_AgentStoppedRestarted));
public static string SupervisorWorker_AgentStartFailed => Get(nameof(SupervisorWorker_AgentStartFailed));
2 changes: 1 addition & 1 deletion src/RemoteAgent.Updater/RemoteAgent.Updater.csproj
@@ -7,7 +7,7 @@
<RootNamespace>RemoteAgent.Updater</RootNamespace>
<AssemblyName>RemoteAgent.Updater</AssemblyName>
<OutputType>Exe</OutputType>
<Version>1.8.0.0</Version>
<Version>1.8.5.0</Version>
<ApplicationIcon>..\..\icon\app.ico</ApplicationIcon>

<!-- Small standalone executable that replaces the main agent, so it cannot be the same exe. -->
61 changes: 40 additions & 21 deletions src/RemoteAgent.Updater/SupervisorWorker.cs
@@ -1,8 +1,10 @@
using System.Diagnostics;
using System.Globalization;
using System.IO.Pipes;
using System.Text.Json;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;
using RemoteAgent.Admin;
using RemoteAgent.Commands;
using L = RemoteAgent.Updater.Localization.Strings;

namespace RemoteAgent.Updater;
@@ -15,16 +17,16 @@ namespace RemoteAgent.Updater;
/// service, replaces the executable, and restarts it. A running service cannot
/// replace its own binary, so this lives in a separate executable/service.
///
/// 2) WATCHDOG: watches the agent heartbeat file
/// (&lt;ProgramData&gt;\RemoteAgent\agent.heartbeat).
/// 2) WATCHDOG: checks the agent's liveness over its read-only status named pipe
/// ("RemoteAgent.status", StatusReport.LastHeartbeatUtc).
/// - if the service is not running, it tries to start it;
/// - if the service appears running but the heartbeat is stale, the agent is hung
/// (SCM only sees process exit): stop, kill by PID if it does not stop in time,
/// then restart.
/// - if the service appears running but the pipe is unresponsive or the heartbeat tick is
/// stale, the agent is hung (SCM only sees process exit): stop, kill by PID if it does not
/// stop in time, then restart.
/// Backoff and circuit breaker prevent a tight failure loop; reboot is the natural reset.
///
/// The Helper has no network or command authority. It only reacts to local markers and
/// heartbeat files. Only the authenticated Agent talks to the server. Incidents are
/// The Helper has no network or command authority. It only reacts to local update markers and the
/// agent's status pipe. Only the authenticated Agent talks to the server. Incidents are
/// written to a local status file and uploaded by the Agent as telemetry.
/// </summary>
public sealed class SupervisorWorker(ILogger<SupervisorWorker> logger) : BackgroundService
@@ -34,11 +36,13 @@ public sealed class SupervisorWorker(ILogger<SupervisorWorker> logger) : Backgro
private static readonly string DataDir =
Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.CommonApplicationData), "RemoteAgent");
private static readonly string UpdateDir = Path.Combine(DataDir, "update");
private static readonly string HeartbeatFile = Path.Combine(DataDir, "agent.heartbeat");
private static readonly string StatusFile = Path.Combine(DataDir, "supervisor.status");
private const string StatusPipeName = "RemoteAgent.status";

private static readonly TimeSpan Poll = TimeSpan.FromSeconds(10);
private static readonly TimeSpan HeartbeatStale = TimeSpan.FromSeconds(90);
private const int HungConfirmPolls = 2; // consecutive unhealthy polls required before a forced restart
private const int PipeConnectTimeoutMs = 5000; // the status pipe must answer within this, else the agent is treated as hung
private static readonly TimeSpan StartGrace = TimeSpan.FromSeconds(60); // do not judge hang immediately after start
private static readonly TimeSpan StopTimeout = TimeSpan.FromSeconds(20); // graceful stop window before killing
private const int MaxConsecutiveFailures = 5;
@@ -49,6 +53,7 @@ public sealed class SupervisorWorker(ILogger<SupervisorWorker> logger) : Backgro
private DateTimeOffset _lastAgentAction = DateTimeOffset.UtcNow;
private DateTimeOffset _parkedUntil = DateTimeOffset.MinValue;
private int _consecutiveFailures;
private int _unhealthyPolls; // consecutive polls with a stale/missing heartbeat (transient-blip filter)
private int _agentRestarts;
private string? _lastIncident;

@@ -91,21 +96,30 @@ private async Task WatchdogAsync(CancellationToken ct)
if (DateTimeOffset.UtcNow - _lastAgentAction < StartGrace)
return;

var age = HeartbeatAge();
if (age <= HeartbeatStale)
var age = await HeartbeatAgeAsync(ct);
if (age is { } fresh && fresh <= HeartbeatStale)
{
// Healthy means both running and heartbeat present; only this resets failure state.
// Healthy means both running and a recent heartbeat; only this resets failure state.
_unhealthyPolls = 0;
_consecutiveFailures = 0;
_parkedUntil = DateTimeOffset.MinValue;
return;
}

// A single missing/unreadable heartbeat is usually a transient blip, not a hang;
// only act once it stays unhealthy for HungConfirmPolls consecutive polls.
if (++_unhealthyPolls < HungConfirmPolls)
return;

// Running but silent means hung. When parked, do not hammer SCM.
if (DateTimeOffset.UtcNow < _parkedUntil)
return;

_lastIncident = L.Format(L.SupervisorWorker_AgentHungHeartbeatAbout0, age.TotalSeconds);
_lastIncident = age is { } stale
? L.Format(L.SupervisorWorker_AgentHungHeartbeatAbout0, stale.TotalSeconds)
: L.SupervisorWorker_AgentHungNoHeartbeat;
logger.LogWarning("{Incident}", _lastIncident);
_unhealthyPolls = 0;
await RestartHungAgentAsync(ct);
await RegisterFailureAsync(); // hung-service churn should also trip the breaker
return;
@@ -154,18 +168,23 @@ private async Task RegisterFailureAsync()
await WriteStatusAsync();
}

private static TimeSpan HeartbeatAge()
/// <summary>Agent liveness age read over the status named pipe (now - StatusReport.LastHeartbeatUtc).
/// Null when the pipe does not answer in time (agent hung/dead). An older agent that serves the pipe but
/// has no heartbeat field counts as fresh (TimeSpan.Zero) — the pipe answering already proves it is alive.</summary>
private static async Task<TimeSpan?> HeartbeatAgeAsync(CancellationToken ct)
{
try
{
if (!File.Exists(HeartbeatFile)) return TimeSpan.MaxValue;
var txt = File.ReadAllText(HeartbeatFile).Trim();
if (DateTimeOffset.TryParse(txt, CultureInfo.InvariantCulture,
DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal, out var ts))
return DateTimeOffset.UtcNow - ts;
return DateTimeOffset.UtcNow - File.GetLastWriteTimeUtc(HeartbeatFile); // fallback
await using var pipe = new NamedPipeClientStream(".", StatusPipeName, PipeDirection.In, PipeOptions.Asynchronous);
await pipe.ConnectAsync(PipeConnectTimeoutMs, ct);
using var ms = new MemoryStream();
await pipe.CopyToAsync(ms, ct);
if (ms.Length == 0) return null;
var report = JsonSerializer.Deserialize(ms.ToArray(), AgentJsonContext.Default.StatusReport);
if (report is null) return null;
return report.LastHeartbeatUtc is { } beat ? DateTimeOffset.UtcNow - beat : TimeSpan.Zero;
}
catch { return TimeSpan.MaxValue; }
catch { return null; } // pipe unavailable / connect timeout = agent not serving = hung
}

// ---------------- UPDATE SWAP ----------------
2 changes: 1 addition & 1 deletion src/RemoteAgent/RemoteAgent.csproj
@@ -7,7 +7,7 @@
<ImplicitUsings>enable</ImplicitUsings>
<RootNamespace>RemoteAgent</RootNamespace>
<AssemblyName>RemoteAgent</AssemblyName>
<Version>1.8.0.0</Version>
<Version>1.8.5.0</Version>
<ApplicationIcon>..\..\icon\app.ico</ApplicationIcon>

<!-- The service runs under SYSTEM with no user interaction. -->
14 changes: 14 additions & 0 deletions src/RemoteAgent/Services/AgentStatusState.cs
@@ -9,6 +9,7 @@ public sealed class AgentStatusState
{
private volatile bool _c2Connected;
private long _lastContactTicks; // DateTimeOffset.UtcNow.UtcTicks, 0 = never
private long _lastHeartbeatTicks; // agent liveness tick, 0 = never

public bool C2Connected => _c2Connected;

@@ -31,4 +32,17 @@ public void SetC2Connected(bool connected)
/// <summary>Successful server communication occurred through C2 or telemetry.</summary>
public void MarkServerContact() =>
Interlocked.Exchange(ref _lastContactTicks, DateTimeOffset.UtcNow.UtcTicks);

/// <summary>Agent liveness tick, bumped periodically while the work loop is alive. The Helper reads it
/// over the status pipe (StatusReport.LastHeartbeatUtc) to detect a hung agent.</summary>
public DateTimeOffset? LastHeartbeatUtc
{
get
{
var t = Interlocked.Read(ref _lastHeartbeatTicks);
return t == 0 ? null : new DateTimeOffset(t, TimeSpan.Zero);
}
}

public void Heartbeat() => Interlocked.Exchange(ref _lastHeartbeatTicks, DateTimeOffset.UtcNow.UtcTicks);
}
33 changes: 24 additions & 9 deletions src/RemoteAgent/Services/HeartbeatService.cs
@@ -7,14 +7,17 @@
namespace RemoteAgent.Services;

/// <summary>
/// Periodically updates the heartbeat file (&lt;EnrollmentDir&gt;\agent.heartbeat). The Helper
/// (RemoteAgent.Updater) watches it: if the heartbeat is stale while the service is "running",
/// the agent is hung. SCM cannot see that, only process exit. The Helper recovers through
/// stop, optional kill, and restart. Deliberately cheap signal: one file timestamp, no IPC.
/// Periodically bumps the agent liveness tick (<see cref="AgentStatusState.Heartbeat"/>), which the
/// Helper (RemoteAgent.Updater) reads over the status pipe as StatusReport.LastHeartbeatUtc: if the
/// tick is stale while the service is "running", the agent is hung. SCM cannot see that, only process
/// exit. The Helper recovers through stop, optional kill, and restart. The legacy heartbeat file
/// (&lt;EnrollmentDir&gt;\agent.heartbeat) is written only while the installed Helper is older than 1.8.1
/// (file-based); once the co-located Helper is pipe-aware the agent stops writing it automatically.
/// </summary>
public sealed class HeartbeatService(IOptions<AgentOptions> options, ILogger<HeartbeatService> logger) : BackgroundService
public sealed class HeartbeatService(IOptions<AgentOptions> options, AgentStatusState status, RemoteAgent.Telemetry.SystemInfoCollector sysInfo, ILogger<HeartbeatService> logger) : BackgroundService
{
private static readonly TimeSpan Interval = TimeSpan.FromSeconds(15);
private static readonly Version PipeAwareHelper = new(1, 8, 1, 0); // first Helper that reads liveness over the status pipe
private readonly string _file = Path.Combine(options.Value.EnrollmentDir, "agent.heartbeat");

protected override async Task ExecuteAsync(CancellationToken stoppingToken)
@@ -23,15 +26,27 @@ protected override async Task ExecuteAsync(CancellationToken stoppingToken)

while (!stoppingToken.IsCancellationRequested)
{
try
status.Heartbeat(); // primary liveness signal: the Helper reads it over the status pipe (StatusReport.LastHeartbeatUtc)

// Legacy heartbeat file: only needed while an older, file-based Helper is installed. As soon as the
// co-located Helper is pipe-aware (>= 1.8.1) it reads the tick over the pipe, so this self-retires.
if (!HelperReadsPipe())
{
await File.WriteAllTextAsync(_file, DateTimeOffset.UtcNow.ToString("O"), stoppingToken);
try
{
await File.WriteAllTextAsync(_file, DateTimeOffset.UtcNow.ToString("O"), stoppingToken);
}
catch (OperationCanceledException) when (stoppingToken.IsCancellationRequested) { break; }
catch (Exception ex) { logger.LogDebug(ex, L.HeartbeatService_HeartbeatWriteFailed); }
}
catch (OperationCanceledException) when (stoppingToken.IsCancellationRequested) { break; }
catch (Exception ex) { logger.LogDebug(ex, L.HeartbeatService_HeartbeatWriteFailed); }

try { await Task.Delay(Interval, stoppingToken); }
catch (OperationCanceledException) { break; }
}
}

/// <summary>True when the installed Helper reads liveness over the status pipe (>= 1.8.1), making the legacy
/// heartbeat file redundant. Unknown or older version → false: keep writing the file (fail safe).</summary>
private bool HelperReadsPipe() =>
Version.TryParse(sysInfo.ComponentVersions().Helper, out var v) && v >= PipeAwareHelper;
}
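
The fail-safe behaviour of the version gate above can be checked in isolation. The sketch below is a standalone illustration, not the shipped `HeartbeatService`: `HelperVersionGate` and `ReadsPipe` are hypothetical names, and the version string is passed in directly instead of coming from `sysInfo.ComponentVersions().Helper`. One detail of `System.Version` is worth noting: a three-part string such as `1.8.1` has an undefined revision and compares lower than `1.8.1.0`, so it would be treated as an older Helper.

```csharp
using System;

// Hypothetical standalone version of the pipe-aware Helper check.
public static class HelperVersionGate
{
    // First Helper version that reads liveness over the status pipe (from the diff above).
    private static readonly Version PipeAwareHelper = new(1, 8, 1, 0);

    // Null, empty, or unparsable input yields false, so the legacy file keeps being written.
    public static bool ReadsPipe(string? reportedVersion) =>
        Version.TryParse(reportedVersion, out var v) && v >= PipeAwareHelper;
}
```

Both project files in this change use four-part versions (`1.8.5.0`), so the three-part case only matters if the reported version is ever formatted differently.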
1 change: 1 addition & 0 deletions src/RemoteAgent/Services/StatusPipeService.cs
@@ -71,6 +71,7 @@ private async Task WriteStatusAsync(NamedPipeServerStream pipe, CancellationToke
BastionTransport = transport.Transport,
ActiveBastionPort = transport.LastWorkingPort,
LastServerContactUtc = state.LastServerContactUtc,
LastHeartbeatUtc = state.LastHeartbeatUtc,
Healthy = state.C2Connected,
DeviceId = _deviceId,
};
5 changes: 5 additions & 0 deletions src/RemoteClient.Core/Localization/String.en.cs
@@ -230,6 +230,11 @@ public static partial class Strings
[nameof(DevicesView_Connect)] = "Connect",
[nameof(DevicesView_Device)] = "Device",
[nameof(DevicesView_LastOnline)] = "Last online",
[nameof(DevicesView_LinkFlaky)] = "flaky",
[nameof(DevicesView_LinkFlakyTip)] = "Flaky link: {0} reconnects in the last hour — the agent is likely alive, just on a poor network.",
[nameof(DeviceTelemetryPanel_LinkQuality)] = "Link",
[nameof(DeviceTelemetryPanel_LinkStable)] = "stable",
[nameof(DeviceTelemetryPanel_LinkFlakyDetail)] = "flaky · {0} reconnects/hour",
[nameof(DevicesView_Update)] = "Update",
[nameof(DevicesView_UnlockSignIn)] = "Unlock sign-in",
[nameof(DevicesView_Approve)] = "Approve",