
docs: add troubleshooting guide with 10 real-world entries #284

Open

cookiebody wants to merge 1 commit into TencentCloud:master from cookiebody:docs/troubleshooting-guide

@cookiebody

Add 10 bilingual troubleshooting articles covering deployment, operations, and runtime issues encountered during real Cube Sandbox PVM deployment:

  • PVM enable flag overridden by .env file
  • kvm_pvm module missing after reboot
  • network-agent missing after server reboot
  • disable_tso missing field in cube-shim config
  • Template build stuck in UNPACKING phase
  • Template has no ready replica error
  • envd stdout truncation limitation
  • envd Basic Auth requirement
  • CoreDNS container restart loop
  • Insufficient disk space

Closes #241

Assisted-by: Hermes Agent:qwen3.6-plus

Signed-off-by: chrisloong <chrisloong5@gmaiil.com>
@cookiebody requested a review from tinklone as a code owner on May 16, 2026, 02:30
@kinwin-ustc
Collaborator

/cubebot review

@github-actions

github-actions Bot commented May 16, 2026

Claude finished @kinwin-ustc's task in 1m 20s


Reviewing PR #284: Troubleshooting guide

  • Read all changed files (22 files)
  • Checked VitePress sidebar config
  • Checked template file for format compliance
  • Verified bilingual content consistency
  • Compile and post review

Code Review: docs: add troubleshooting guide with 10 real-world entries

Overview

This PR adds 10 bilingual (EN/ZH) troubleshooting articles to the Cube Sandbox documentation. Each article covers a real deployment or operations issue with a consistent Symptom → Environment → Root Cause → Resolution → References structure.

1097 additions / 77 deletions across 22 files — mostly new content, with two index.md files rewritten to include article tables.

What's done well

  • Consistent structure: Every article follows the _template.md format. Frontmatter, section headings, and content flow are uniform across all 10 articles.
  • Bilingual alignment: Each English article has a matching Chinese counterpart with same filename, identical structure, and properly localized content. Chinese articles correctly use /zh/ prefixed links.
  • Practical value: All 10 articles describe real, reproducible issues with actionable resolution steps. Commands are copy-pasteable.
  • Index page redesign: The empty "Published articles" placeholder table has been replaced with a filled table of the 10 new articles. The _template.md naming/frontmatter documentation was moved into the template itself, which is a cleaner approach.
  • VitePress compatibility: The srcExclude: ['**/_template.md'] config already handles template exclusion. The sidebar link to /guide/troubleshooting/ resolves to index.md without additional config changes.

Issues found

1. envd-stdout-truncated.md — Python HTTP server blocks execution

File: docs/guide/troubleshooting/envd-stdout-truncated.md:33-35

# Inside the sandbox, serve the file via HTTP
import subprocess
subprocess.run(["python3", "-m", "http.server", "8888"])

subprocess.run() is a blocking call. If this is submitted via envd exec, it will hang waiting for the server to exit. If it's meant to be run inside the sandbox shell interactively, the snippet is fine as-is, but this context isn't clear.

Suggestion: Either add & to the bash equivalent, or mention that the HTTP server should be started in a separate terminal/session.
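
A minimal sketch of the non-blocking variant, assuming the snippet is submitted as a script rather than typed into an interactive shell. It swaps `subprocess.run()` for `subprocess.Popen()`, which returns as soon as the child process is spawned. The helper names `start_file_server` and `stop_file_server` are illustrative and not part of envd or Cube Sandbox; port 8888 is taken from the snippet under review.

```python
import subprocess
import sys


def start_file_server(directory: str, port: int = 8888) -> subprocess.Popen:
    """Start `python -m http.server` in the background and return immediately."""
    return subprocess.Popen(
        [sys.executable, "-m", "http.server", str(port), "--directory", directory],
        stdout=subprocess.DEVNULL,  # keep server logs out of the caller's output
        stderr=subprocess.DEVNULL,
    )


def stop_file_server(server: subprocess.Popen, timeout: float = 5.0) -> None:
    """Terminate the background server and wait for it to exit."""
    server.terminate()
    try:
        server.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Fall back to a hard kill if the process ignores the terminate signal.
        server.kill()
        server.wait()
```

The caller keeps the returned `Popen` handle so the server can be stopped once the file has been downloaded, instead of leaving a stray process inside the sandbox.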

2. pvm-enable-overridden-by-env.md — Missing edge case for absent .env

File: docs/guide/troubleshooting/pvm-enable-overridden-by-env.md:27-28

The root cause covers the case where .env exists with CUBE_PVM_ENABLE=0, but doesn't mention what happens if .env doesn't exist at all. Does the installer create it? Will the env var be respected?

Suggestion: Add a note clarifying that if .env doesn't exist, the environment variable is used directly (or state the installer's behavior explicitly).
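
To make the open question concrete, here is a small diagnostic sketch that reports where `CUBE_PVM_ENABLE` is defined. It deliberately does not reproduce the installer's logic or claim which source wins, since the article does not state that. The `.env` path is a caller-supplied parameter, and the function names are hypothetical.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional


def read_env_file(path: str) -> Optional[Dict[str, str]]:
    """Parse simple KEY=VALUE lines; return None when the file is absent."""
    env_path = Path(path)
    if not env_path.is_file():
        return None
    values: Dict[str, str] = {}
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("\"'")
    return values


def describe_flag(env_file: str, name: str = "CUBE_PVM_ENABLE",
                  environ: Mapping[str, str] = os.environ) -> str:
    """Report where `name` is defined, without guessing which source wins."""
    file_values = read_env_file(env_file)
    shell_value = environ.get(name)
    if file_values is None:
        return f"{env_file} not found; shell has {name}={shell_value!r}"
    file_value = file_values.get(name)
    if file_value is not None and shell_value is not None and file_value != shell_value:
        return (f"conflict: {env_file} has {name}={file_value!r}, "
                f"shell has {name}={shell_value!r}")
    return f"{env_file} has {name}={file_value!r}; shell has {name}={shell_value!r}"
```

A "conflict" result matches the case the article describes. The "not found" result is the edge case the article should document.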

3. pvm-enable-overridden-by-env.md — Bilingual references section mismatch

English (lines 50-52): References split across two lines

- Related docs: [PVM Deployment](/guide/pvm-deploy)
- Related issue: [#147](https://github.com/TencentCloud/CubeSandbox/issues/147)

Chinese: Combined on one line

- 相关文档: [PVM部署](/zh/guide/pvm-deploy), 相关 issue: [#147](https://github.com/TencentCloud/CubeSandbox/issues/147)

Minor inconsistency — the Chinese version combines two reference items on one line while the English version uses separate lines.

4. insufficient-disk-space.md — Destructive Docker prune without warning

File: docs/guide/troubleshooting/insufficient-disk-space.md:39

docker system prune -f
docker volume prune -f

docker system prune -f removes stopped containers, unused networks, dangling images, and unused build cache without asking for confirmation. docker volume prune -f additionally removes unused volumes, which can delete persistent data that might be needed later.

Suggestion: Add a brief warning note that docker volume prune -f will remove all unused volumes and their data. Suggest using docker system prune -f alone as a safer first step.
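
The staged approach suggested above could be sketched as a plan of commands in which volume pruning is opt-in. The function name `cleanup_plan` is hypothetical, and the script only prints the commands rather than running them. `docker system df` and `docker volume ls --filter dangling=true` are standard Docker CLI commands, added here so the operator can inspect usage before deleting anything.

```python
from typing import List


def cleanup_plan(include_volumes: bool = False) -> List[List[str]]:
    """Return Docker cleanup commands in the order they would be run."""
    plan = [
        ["docker", "system", "df"],           # show what is using space first
        ["docker", "system", "prune", "-f"],  # safer first step: leaves volumes alone
    ]
    if include_volumes:
        # List candidate volumes before deleting them; pruning is irreversible.
        plan.append(["docker", "volume", "ls", "--filter", "dangling=true"])
        plan.append(["docker", "volume", "prune", "-f"])
    return plan


if __name__ == "__main__":
    # Print the default plan so it can be reviewed before running anything.
    for command in cleanup_plan():
        print(" ".join(command))
```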

5. kvm-pvm-missing-after-reboot.md — lsmod expected output removed in Chinese version

English (lines 36-38): Shows expected lsmod output

lsmod | grep kvm_pvm
# Expected output:
# kvm_pvm                49152  10
# kvm                   1175552  1 kvm_pvm

Chinese (lines 34-35): No expected output shown

lsmod | grep kvm_pvm

Suggestion: Add the expected output to the Chinese version for consistency.
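
For readers who want a scriptable version of the same check, here is a minimal sketch that reads `/proc/modules`, the file `lsmod` itself formats. Matching on the first field avoids the false positive that `grep kvm_pvm` would give on a line that only lists `kvm_pvm` as a dependent. The function names are illustrative and not part of Cube Sandbox.

```python
from pathlib import Path


def module_loaded(name: str, modules_text: str) -> bool:
    """Return True if `name` is the first field of any /proc/modules line."""
    for line in modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


def kvm_pvm_loaded(proc_modules: str = "/proc/modules") -> bool:
    """Check the running kernel; returns False if the file cannot be read."""
    try:
        return module_loaded("kvm_pvm", Path(proc_modules).read_text())
    except OSError:
        return False
```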

6. envd-stdout-truncated.md — Vague reference

File: docs/guide/troubleshooting/envd-stdout-truncated.md:43

- Related issue: Host mount and file transfer discussions

This is not a specific, actionable reference. If there's a GitHub issue number, use it. Otherwise, this should link to a relevant docs page.

Minor suggestions

  • Tags capitalization: Some tags use lowercase (deployment, template) while others use uppercase (PVM, DNS) or mixed (Basic Auth). Consider standardizing tag casing for consistency.
  • zh index.md title change: The Chinese index title changed from "故障排障" to "故障排查". Both are valid, but "故障排障" is somewhat redundant (障 appears twice). The new title "故障排查" reads better.

Summary

This is a solid documentation PR. The content is practical, well-structured, and properly bilingual. The issues above are minor — mostly consistency nits and one functional concern about the Python snippet. With those addressed, this is ready to merge.

Member

These fields have been removed from the latest version of CubeSandbox, so this doc is outdated.

Member

I'd like to understand — what scenario triggers this issue? From my understanding, when users interact with sandboxes normally through the E2B SDK, they wouldn't encounter this problem. Could you describe your use case?

Comment on lines +1 to +7
# 故障排查

::: warning 必须同时提交中英文
本栏目所有投稿都必须同时包含 `docs/guide/troubleshooting/` 下的英文文件和 `docs/zh/guide/troubleshooting/` 下的中文文件。只更新单一语言的 PR 不会被合并。
::: warning 需要双语 PR
本目录下的每篇贡献都必须同时包含英文文件(`docs/guide/troubleshooting/`)和中文文件(`docs/zh/guide/troubleshooting/`)。只更新一种语言的 PR 不会被合并。
:::

这里收录 Cube Sandbox 在部署、使用与运维过程中遇到的真实问题与解决方案。我们更欢迎可复现、可验证、可直接落地的排障经验
本页面收集了 Cube Sandbox 在部署、配置和运行过程中的真实故障排查文章

Member

Please do not modify the content of this document's frontmatter.

Member

Please do not modify any part of the index page other than the article list.

Comment on lines +42 to +45
```bash
cd /usr/local/services/cubetoolbox/scripts/one-click/
bash down.sh && sleep 2 && bash up.sh
```

Member

Use down-with-deps.sh and up-with-deps.sh here instead.

Member

That's not the right fix. down-with-deps.sh and up-with-deps.sh already handle starting and stopping all components.

Member

This issue only requires waiting a bit if the template is in READY state. It happens right after restarting cubelet, before cubelet has reported node template availability. Also, the command shown for restarting cubelet is incorrect — it drops the dynamic config. The proper way is to use our up-with-deps.sh / down-with-deps.sh scripts to restart.
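
The "wait a bit" advice can be expressed as a bounded polling loop. This is a generic sketch: `is_ready` is a hypothetical callable supplied by the caller, since the comment does not say how node template availability is queried, and the timeout and interval defaults are placeholders rather than recommended values.

```python
import time
from typing import Callable


def wait_until(is_ready: Callable[[], bool], timeout: float = 60.0,
               interval: float = 2.0) -> bool:
    """Poll `is_ready` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False  # give up; the caller decides how to report it
        time.sleep(interval)
```

Returning `False` on timeout, rather than raising, lets the caller distinguish "not yet reported" from a real failure and retry or escalate as appropriate.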

@fslongjin
Member

fslongjin commented May 16, 2026

Just a friendly reminder — troubleshooting tutorials should be verified against a real CubeSandbox deployment before submission, so that the examples don't inadvertently mislead other users. Also, any GitHub issues referenced in the documentation should point to real, existing GitHub issues with actual links rather than being fabricated by an LLM. Could you please review, correct, and verify all affected documents against a real deployment? Thanks!

Development

Successfully merging this pull request may close these issues.

[good first issue] docs: Help us build the troubleshooting guide

3 participants