Skip to content

Improve Playground CLI child process error handling#2844

Merged
fredrikekelund merged 18 commits intotrunkfrom
f26d/improve-timeout-error-handling
Mar 20, 2026
Merged

Improve Playground CLI child process error handling#2844
fredrikekelund merged 18 commits intotrunkfrom
f26d/improve-timeout-error-handling

Conversation

@fredrikekelund
Copy link
Contributor

Related issues

  • Related to p1773148946011199-slack-C06DRMD6VPZ

How AI was used in this PR

Codex was used to implement the timestamping in the process manager daemon and to reason about startingPromise logic in the child process.

Proposed Changes

A couple of simpler changes:

  1. Unset the stack prop on CliCommandError errors. This is because the strongest "event grouping" signal Sentry uses is the error stack trace. All CliCommandError stack traces are identical, because we always construct them at the same place in execute-command.ts. This currently results in all CliCommandError errors being grouped into a single Sentry issue, which is incorrect.
  2. Timestamp all lines that are written to the child process logs.
  3. More logging in the Playground CLI child process.

And some more complex changes:

  1. I've improved the abort event handling in startServer and made stopServer able to stop an in-flight startServer operation.
    1. This is based on a hunch from looking at one of the CliCommandError Sentry events. It's possible to run a site stop CLI command for a site that's currently starting, and it's also possible that users might trigger site stop operations from Studio if another error leads to Studio losing track of the current started/stopped state for sites. Reproducing this error from within Studio is impossible under normal circumstances, so this falls mostly under "robustness".
  2. I've made stopWordPressServer wait for the child process to exit cleanly (with fallback), and I've made the child process actually exit cleanly in response to stop-server messages.
    1. This is also more of a robustness change. Previously, we'd always send a SIGKILL signal to the child. That's OK, but ideally a stop-server message should lead to a 0 exit code unless something actually went wrong.

Testing Instructions

For the changes in wordpress-server-child.ts, code review and CI are paramount. However, we can verify that stopping an in-flight start operation works as expected:

  1. Run npm run cli:build
  2. Open two terminal tabs
  3. Type node apps/cli/dist/cli/main.js site stop --path PATH_TO_SITE in the first tab. Don't run this command yet. Leave it so you can quickly return to the tab after starting the site.
  4. Type node apps/cli/dist/cli/main.js site start --path PATH_TO_SITE in the second tab and run the command.
  5. Quickly return to the first terminal tab and click Enter
  6. In the second tab, ensure that you see the following message Failed to start WordPress server: This operation was aborted

Test that log file lines are timestamped:

  1. Open ~/.studio/pm2/logs
  2. Clean the directory
  3. Run npm start
  4. Start a site
  5. Find the logfile for your site
  6. Ensure that every line has a timestamp

Pre-merge Checklist

  • Have you checked for TypeScript, React or other console errors?

@fredrikekelund fredrikekelund requested a review from a team March 19, 2026 11:19
@fredrikekelund fredrikekelund self-assigned this Mar 19, 2026
@wpmobilebot
Copy link
Collaborator

wpmobilebot commented Mar 19, 2026

📊 Performance Test Results

Comparing 10dc943 vs trunk

app-size

Metric trunk 10dc943 Diff Change
App Size (Mac) 1236.42 MB 1236.42 MB +0.01 MB ⚪ 0.0%

site-editor

Metric trunk 10dc943 Diff Change
load 1946 ms 1883 ms 63 ms 🟢 -3.2%

site-startup

Metric trunk 10dc943 Diff Change
siteCreation 7129 ms 7074 ms 55 ms 🟢 -0.8%
siteStartup 3934 ms 3943 ms +9 ms ⚪ 0.0%

Results are median values from multiple test runs.

Legend: 🟢 Improvement (faster) | 🔴 Regression (slower) | ⚪ No change (<50ms diff)

// Cache the process list returned from the process manager for a very short time to make multiple
// calls in quick succession more efficient
const listProcesses = cacheFunctionTTL( async () => {
async function listProcesses() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the cacheFunctionTTL is not crucial to this PR's purpose. I just started viewing it as a liability while working on the stopWordPressServer logic. isProcessRunning depends on listProcesses, which caches its result for 1 second. That could easily lead to a sneaky bug at some point, and communicating with the process manager daemon over a socket is pretty fast to begin with.

Comment on lines -283 to +295
signal.addEventListener(
'abort',
() => {
throw new Error( 'Operation aborted' );
},
{ once: true }
);
stopSignal.throwIfAborted();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a mistake on my part to think that an exception thrown from this event handler would be caught by the surrounding try..catch. We need to rely on throwIfAborted instead.

Comment on lines +464 to 477
function sendErrorMessage( messageId: string, error: unknown ): Promise< void > {
return new Promise( ( resolve ) => {
const errorResponse: ChildMessageRaw = {
originalMessageId: messageId,
topic: 'error',
errorMessage: parsePhpError( error ),
errorStack: error instanceof Error ? error.stack : undefined,
cliArgs: lastCliArgs ?? undefined,
};
process.send!( errorResponse, () => {
resolve();
} );
} );
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we really need to wait for the process.send callback or if that operation is effectively synchronous, but since the next thing that happens after this function is called is that we kill the process, I thought it best to.

Comment on lines +549 to +550
} finally {
delete abortControllers[ validMessage.messageId ];
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taken from #2805 👍

Copy link
Contributor

@epeicher epeicher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @fredrikekelund! I tested it and received the expected error message when stopping a site that was starting. I can also see the timestamped logs. Also, code changes LGTM!

Stopping a Starting site Timestamped Logs
Image Image

@fredrikekelund fredrikekelund merged commit e4accf2 into trunk Mar 20, 2026
14 checks passed
@fredrikekelund fredrikekelund deleted the f26d/improve-timeout-error-handling branch March 20, 2026 13:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants