diff --git a/docs/implementation-guide/pxe-boot.md b/docs/implementation-guide/pxe-boot.md new file mode 100644 index 00000000..6037f627 --- /dev/null +++ b/docs/implementation-guide/pxe-boot.md @@ -0,0 +1,205 @@ +--- +title: "PXE Boot Setup Guide" +description: "Configure VergeOS to PXE boot nodes for installation or every-boot runtime. Covers External vNet setup, VLAN/DHCP requirements, and boot configuration." +semantic_keywords: + - "VergeOS PXE boot setup" + - "diskless compute node PXE" + - "PXE install VergeOS nodes" + - "ybos PXE option" +use_cases: + - pxe_boot_node_installation + - diskless_compute_node_deployment + - mass_node_provisioning +tags: + - installation + - pxe + - networking + - diskless +--- + + +# PXE Boot Setup Guide + +**Audience:** System administrators and support engineers who want to implement a PXE network for installing VergeOS nodes from the network and/or deploying nodes without boot disks + + +## Overview + +VergeOS has a built-in PXE boot service that allows optionally installing nodes from the network (replacing USB installers, remote BMC ISO mounting, etc.) and network boot of registered, diskless nodes. These are two distinct scenarios: + +| Scenario | Use case | +|----------|----------| +| **First-time install via PXE** | The VergeOS install is initiated from the network providing an alternative to USB installer, IPMI virtual media, etc. Node PXE-boots to the installer, you select Scale-Out / Compute / Storage, installer completes and node joins the cluster | +| **Every-boot PXE** | Node has no local drives (or none configured as bootable). Node is installed and registered as a pxe boot node. Every reboot pulls the running VergeOS image from the cluster over PXE. Common for diskless compute-only nodes; can also be used for nodes with non-bootable storage | + +Both use the same underlying service: the provider cluster runs dnsmasq on a designated vNet and serves the VergeOS boot image over the network. + + +## Prerequisites + +- An **operational VergeOS cluster** (at minimum, controller node up and reachable for pxe-installing, two controllers up and running for pxe node every-boot) +- **Dedicated PXE NIC (recommended)** — the PXE network should be on a separate NIC on each node, dedicated to PXE booting. Typically implemented as a maintenance network, isolated from production data paths so that PXE traffic doesn't compete with other network roles on the same interface +- **Switch configuration** allowing the target PXE NIC to reach the Verge PXE network. Using a dedicated, inexpensive, basic switch for this network to serve pxe and provide out-of-band maintenance is recommended. +- **Native VLAN match** — the native/access VLAN on the target node's switch port (or vNIC, if applicable) **must match the VLAN of the Verge PXE network**. PXE boot broadcasts leave the NIC untagged, so they land on whatever VLAN the port treats as native. If the native VLAN doesn't match where Verge is serving PXE, the boot request never reaches Verge's dnsmasq and the node will get no PXE response. +- **BIOS/UEFI or boot policy** on the target node configured to boot from NIC/LAN +- **No competing DHCP** on the PXE segment (this is the #1 cause of silent PXE failures) + + +## Network Configuration + +The PXE service runs on a VergeOS **External vNet** with DHCP enabled and a specific PXE option set. + +1. **Networks → New External** (or edit an existing External vNet) +2. Configure basic network settings: + - **Name** — a descriptive name, such as "pxe-boot" or "pxe" + - **Layer 2 Type** — set to *none* or *vLan* with *Layer 2 ID* — your VLAN ID (e.g. 50) + - **Interface Network** — the physical network backing this vNet + - **IP Address Type** — *Static* + - **IP Address** — the router IP for this network e.g. 10.50.5.10 + - **Network Address** — CIDR, e.g. 10.50.5.0/24 +3. Enable DHCP (VergeOS's dnsmasq will be the authoritative DHCP for this segment) + - Check the **DHCP** option + - Set **DHCP Start Address** and **DHCP Stop Address** for the address pool + - Check the **Dynamic DHCP** option + - **Gateway setting** — leave the DHCP **Gateway** field blank (VergeOS defaults it to the vNet's own router IP.) Do NOT set it to an upstream/off-segment gateway — that makes PXE clients try to route their TFTP/HTTP fetches off-network, and installs a default route that can conflict with the production networks the node will use after install +4. Set the **PXE Boot** option to ***ybos*** +5. Power on the vNet + - **Submit** the form + - On the resulting dashboard, click **Power On** + - Verify the network status is *Running* + + + +## First-time node install via PXE + +1. Target node boots → PXE menu appears showing `Verge.io OS PXE` +2. Auto-boot countdown starts (default 1 second — press any key to interrupt) +3. Installer loads +4. Select install type: + - **Scale-Out** — node contributes both compute and vSAN storage + - **Compute** — compute only, no local storage in vSAN + - **Storage** — vSAN storage only +5. Follow the installer prompts (cluster selection, NIC identification, disk selection, etc.) — same flow as USB installer +6. Node reboots, joins cluster + + +## Every-boot PXE (diskless nodes) + +Use case: compute nodes with no local bootable disks, or nodes that should always pull a fresh OS image from the cluster. + +### Boot Flow +1. Node powers on → boot policy → LAN Boot +2. NIC sends DHCP on the pxe network +3. Verge's dnsmasq responds with IP + *next-server* + boot filename +4. Node downloads the VergeOS boot image from Verge +5. VergeOS loads into RAM, node rejoins the cluster +6. Reboot → repeat from step 1 + +No per-node customization. Scale identically across N nodes with the same configuration. + +See KB article: [Configuring a Node for Diskless PXE Boot (PXE every-boot)](/knowledge-base/pxe-every-boot-node-config) for step-by-step instructions. + + +## Considerations + +- **PXE server must be up for the node to boot** — if the Verge cluster is down during a node reboot, the node hangs waiting for PXE response. Healthy cluster = never a problem, but worth noting for DR scenarios +- **Boot policy retry behavior** — check the platform's "reboot on boot failure" settings to avoid stuck-in-loop scenarios +- **No per-node image management** — when the cluster is updated on the provider side, nodes pick up the new VergeOS version at their next reboot. You don't maintain separate OS images per node + +--- + +### Changing a node's NIC configuration (caution) + +!!! warning + **Changing the NIC or interface used by a PXE-booting node can break its ability to boot.** Proceed carefully, especially for every-boot PXE (diskless) nodes that depend on PXE for every startup. + +The PXE boot path is tied to a specific NIC and interface: +- **Node identity is tied to the MAC address of the NIC used during install** — this MAC must remain constant for the life of the node. If it changes, the cluster sees the node as new/unknown and it will not rejoin automatically +- Managed boot policies (where applicable) reference a specific vNIC by name +- BIOS/UEFI boot order points at a specific physical NIC +- The switch port (or vNIC, where applicable) carries a specific native VLAN matching the PXE network + +Any change to that chain can prevent PXE from working on the next reboot. + +#### What to watch out for + +- **Replacing the physical NIC** — new hardware = new MAC. VergeOS may not recognize the node, and any MAC-based DHCP reservations will no longer match +- **Swapping which vNIC is used for LAN Boot** — the new vNIC needs the correct native VLAN (or Boot VLAN set in the boot policy) to reach the Verge PXE network +- **Rebuilding a managed service profile** — unless MACs are preserved (via a MAC pool or explicit assignment), the node will get a new MAC and Verge will see it as a new node +- **Moving the cable to a different switch port** — the new port must carry the same native VLAN as the PXE network +- **Changing bond configuration or NIC teaming** — PXE boots before the OS forms the bond, so it uses a single physical NIC; make sure that NIC is still on the PXE VLAN + +#### Before making the change + +1. Note the current MAC address and which NIC/vNIC is used for boot +2. If possible, have console/IPMI access during the first reboot after the change — in case PXE fails and you need to intervene +3. If the node is part of a running cluster, consider putting it in maintenance / draining VMs first so you can take your time troubleshooting if needed +4. For every-boot PXE nodes: verify after the change that the node successfully PXE-boots once before declaring the change complete + +#### If the node fails to boot after a NIC change + +Start by verifying the basics: + +- Check console output for PXE error codes +- Verify the new NIC/vNIC is on the correct VLAN (native VLAN on the vNIC, or access VLAN on the switch) +- If using a managed boot policy, verify it still references a valid vNIC + +If the hardware side looks correct but the cluster still references the old MAC, you have two ways to update VergeOS's view of the node: + +**Option 1 — Refresh Drives and NICs (node is running / bootable):** +On the node's dashboard in the provider UI, run the **Refresh → Drives and NICs** action. The cluster re-detects the current hardware and picks up the new MAC automatically. + +**Option 2 — Manually edit the NIC's MAC (node is down, same NIC slot, new MAC):** +Useful when the NIC hardware / slot hasn't changed but the MAC did — for example, a vNIC rebuild or service profile change on a managed platform. With the node offline: + +- Navigate to the node's dashboard → NICs +- Edit the NIC entry corresponding to the PXE network +- Update the stored MAC to match the new one +- **Reboot the PXE network** (power-cycle the External vNet) so dnsmasq picks up the new MAC-to-config mapping +- The node should boot correctly on the next attempt + +If neither option resolves it, contact support. + +--- + +## Troubleshooting + +### Common things to verify + +Work through this list first when PXE isn't behaving. Most failures trace back to one of these: + +- **Did the node get a DHCP lease?** Watch the boot console — it should show "DHCP..." followed by an IP. If no IP, the DHCP phase itself is failing. +- **Is the IP in the expected subnet?** If the node gets a lease but the IP isn't from the pool you configured on the Verge External vNet, a different DHCP server answered. A rogue/corporate DHCP on the same L2 will usually win the race. +- **No competing DHCP server on the segment?** This is the #1 cause of silent PXE failures. Verge's dnsmasq must be the only DHCP on the VLAN. +- **PXE option set to *ybos*** on the External vNet? Without this, Verge answers DHCP but doesn't hand out the PXE boot filename. +- **DHCP enabled on the vNet** (not just configured)? The checkbox must be checked and the network powered on. +- **Gateway on the DHCP scope is sensible?** Blank or the vNet's own router IP is fine; an upstream/off-segment gateway will make PXE clients try to route off-network. +- **Native VLAN match?** The switch port (or vNIC) carrying the node's PXE NIC must have the Verge PXE VLAN as its native/access VLAN. PXE broadcasts are untagged. +- **Correct NIC configured for PXE?** BIOS/UEFI boot order (or the boot policy on managed platforms) must point at the NIC actually cabled to the PXE network. +- **Physical connectivity?** Link lights on both ends, cable seated, switch port not administratively down or error-disabled. +- **External vNet status is "Running"?** If the vNet is stopped/initializing, dnsmasq isn't answering. +- **VergeOS cluster is healthy?** A cluster that's degraded or with controllers down may not be serving PXE properly. Check the main dashboard. +- **MAC registration (for nodes previously joined)?** If a NIC was swapped or a Service Profile rebuilt, the new MAC may not be recognized. See *Changing a node's NIC configuration* above. +- **Try forcing the correct NIC via the BIOS boot manager.** If the persistent boot order isn't picking the right NIC, use the server's one-time boot manager (typically F11/F12 at POST, or a one-time boot override in the BMC / management controller) to explicitly select the PXE NIC. This isolates whether the issue is boot-order selection vs. something downstream like DHCP or PXE serving. + +### Specific error codes + +| Symptom | Likely cause | Fix | +|---------|--------------|-----| +| Node gets DHCP lease, no PXE menu | Competing DHCP on segment, OR PXE option not set to `ybos` | Verify only Verge DHCP is on the VLAN; verify the vNet's PXE option | +| `PXE-E53: No boot filename received` | PXE option missing or DHCP not enabled on the vNet | Enable DHCP; set PXE option to `ybos` | +| `PXE-E32: TFTP open timeout` | Firewall or VLAN isolation between node and Verge | Check L2 path from node to the Verge vNet | +| Boot loops | BIOS/boot order issue — node booting from empty local disk | Fix boot order, or remove Local Disk from the boot policy | +| Node hangs at "PXE-MOF: Exiting..." | Usually a boot order issue — PXE succeeded but machine tried to fall through to another boot device that isn't bootable | Set PXE as the only boot option, or ensure local disk is actually bootable post-install | +| Node gets IP but from the wrong subnet | Rogue/competing DHCP server answered first | Isolate the PXE VLAN; disable other DHCP on the segment | +| No DHCP lease at all | Native VLAN mismatch, cable issue, NIC not configured for PXE/LAN Boot | Verify switch port native VLAN matches Verge PXE VLAN; check physical link; enable PXE/network boot on the NIC | +| After a diskless install completes, the node only PXE-boots back into the installer (not the running OS) | PXE server's state hasn't refreshed post-install — it's still serving the install image for that node | Reboot the PXE network (power-cycle the External vNet) and apply rules. On next boot, the node should get the running OS image | + +--- + +## Related Resources + +- [Configuring a Node for Diskless PXE Boot (PXE every-boot)](/knowledge-base/pxe-every-boot-node-config/) +- [Connect VergeOS to an Existing LAN/WAN](/product-guide/networks/connect-lan-wan/) +- [Installation Guide](/implementation-guide/installation-guide/) +- [Scale-out Node Installation Guide](/implementation-guide/scale-out-nodes/) diff --git a/docs/knowledge-base/posts/pxe-every-boot-node-config.md b/docs/knowledge-base/posts/pxe-every-boot-node-config.md new file mode 100644 index 00000000..46164d2a --- /dev/null +++ b/docs/knowledge-base/posts/pxe-every-boot-node-config.md @@ -0,0 +1,173 @@ +--- +title: Configuring a Node for Diskless PXE Boot (PXE every-boot) +slug: pxe-every-boot-node-config +description: How to configure a registered VergeOS compute node to network-boot from the cluster on every startup — for nodes with no local storage or no bootable storage devices. +author: VergeOS Documentation Team +draft: false +date: 2026-05-12 +semantic_keywords: + - "pxe every boot diskless compute node vergeos" + - "network boot vergeos node no local storage" + - "configure node to pxe boot every reboot" + - "diskless compute node maintenance network pxe" +use_cases: + - pxe_every_boot_node_configuration + - diskless_compute_node_deployment + - non_bootable_storage_node_setup +tags: + - pxe + - nodes + - networking + - installation + - diskless +categories: + - Installation + - Network +editor: markdown +dateCreated: 2026-05-12 +--- + +# Configuring a Node for PXE Every-Boot + +## Overview + +!!! info "Key Points" + - PXE every-boot lets a registered VergeOS compute node load the running OS from the cluster on every startup, with no local boot media + - Intended for nodes that have no local storage, or only non-bootable storage devices + - Requires a PXE-capable physical network with no competing DHCP servers + - This guide covers nodes that are installed to PXE-boot on **every reboot**. It does **not** cover one-time PXE installs (e.g. PXE only replaces the USB installer and the node boots from local disk afterward), or PXE networks for tenant workloads. + +VergeOS supports PXE booting for host nodes, allowing you to deploy nodes that network-boot from the cluster every time they start. This guide covers configuration of a PXE network and VergeOS host node to use PXE diskless boot. + +!!! note "When PXE every-boot applies (dedicated boot drive not needed)" + + VergeOS does **not** require a separate drive for the operating system. By default, the installer places a small boot partition on each storage drive — including drives that will participate in the vSAN. The VergeOS system is small, so the impact on vSAN capacity is negligible. PXE every-boot only makes sense when the node either has **no storage at all** or all installed storage devices are **non-bootable**. If the node has any bootable disk, install as a standard (non-pxe) node instead. + + +## Configuration Summary + +Configuring a node for PXE every-boot involves: + +1. **Physical network configuration** — establish an accessible physical network in VergeOS and **Node NIC assignment** — connect the node's PXE NIC to the physical network (Physical network/node NIC assignment is often already configured during initial controller node installation.) +2. **Create an external network** - configure a VergeOS vNet (external network) to serve PXE. +3. **BIOS / UEFI / boot policy** — configure the node to boot from the PXE NIC +4. **Install the node** — boot the VergeOS ISO and select the PXE node install option + +Detailed PXE network prerequisites and configuration information is available at [PXE Implementation Guide](/implementation-guide/pxe-boot). + + +## Network Setup + +### Physical Network Requirements + +The physical network used for PXE must meet a few requirements: + +- **No competing DHCP servers.** VergeOS runs its own DHCP service on the PXE network. Any other DHCP server reachable on the same Layer 2 segment will race with VergeOS, and whichever responds first wins. If a competing server wins, the node gets an IP but no PXE boot information. +- **Broadcast traffic is allowed.** PXE relies on broadcasts to discover the boot server. Switches and any network gear in the path must permit broadcast traffic and must not filter PXE/TFTP traffic. + +!!! tip "Recommended pattern: dedicated maintenance network" + The preferred setup is a dedicated physical network used for maintenance traffic (IPMI / iDRAC) and PXE boot, using inexpensive basic switches (for example, 1 GbE). Keeping PXE on a dedicated maintenance network avoids interference with production traffic and removes the risk of DHCP contention with other system processes. + +The physical network **may already exist from initial system installation** (for example, a maintenance network created during the install of controller nodes). A physical network can also be created post-install. + +To create a new physical network post-install: + +1. Navigate to **Networks** > **New Physical**. +2. Configure network settings: + - **Name**: a descriptive name such as "Maintenance-pxe Switch" + - **Layer 2 Type**: `none` + - **MTU size**: The MTU setting must always be a value supported by the physical switching hardware. + - **IP Address Type:** None +3. **Assign the node NIC to the physical network:** + - Navigate to **Infrastructure** > **Nodes** and double-click the target node. + - Click **NICs** on the left menu. + - Double-click the NIC that is cabled to the PXE network. + - Set **Interface** to **Direct**. + - Select the appropriate physical network (e.g. *Maintenance Switch*). + +### Create an External network for PXE service + +The physical network is the underlying L1/L2 fabric. To actually serve PXE (DHCP + boot image), an External network must be configured on top of this physical network with **DHCP enabled** and the **PXE boot option** set. + +1. Navigate to **Networks** > **New External**. +2. Configure network settings: + - **Name:** provide a descriptive name, such as "PXE boot" + - **Layer 2 Type**: ***none*** for a dedicated management / PXE network (recommended). Alternatively, Layer 2 Type: *vLan* with a dedicated *Layer 2 ID* (vLAN id) if PXE will share a trunked uplink with other VLANs. + - **PXE Boot:** ***ybos*** + - **Interface Network:** select the **physical network** for the PXE network (e.g. *Maintenance Switch*) + - **IP Address Type:** ***Static*** + - **DHCP:** ***enabled*** + - **DHCP Start Address:** and **DHCP Stop Address:** defining the address pool + - **Dynamic DHCP:** ***enabled*** + + +## BIOS / UEFI / Boot Policy + +The target node must be configured to boot from the NIC cabled into the PXE network. The specific steps are **vendor-specific** and depend on the server hardware and/or management controller, but typically include: + +- **Enable PXE / network boot** on the target NIC +- **Configure the PXE NIC first** in the boot order +- **Verify no other bootable devices take priority** (for example, USB install media) + + !!! warning "**PXE boot will be reliant on specific NIC and interface**: Node identity is tied to the MAC address of the NIC used during node installation - this MAC must remain constant." + +!!! tip "Pro Tip" + Consult your server hardware or BIOS documentation for the exact menus and options. On managed platforms (blade chassis, converged infrastructure), the boot configuration lives in a service profile or boot policy rather than per-node BIOS. + + +## Install the Node + +!!! tip "Consider creating a new cluster first" + PXE every-boot nodes typically have a different hardware makeup than the nodes in your existing clusters. Because like nodes should be grouped together, you'll usually want a dedicated cluster for them. Common cases: + + - **Compute-only PXE nodes** — create a new compute cluster for them to join, separate from your existing storage or hyperconverged clusters. + - **Nodes with non-bootable storage devices** — create a new cluster so they don't mix with existing nodes that have bootable vSAN drives. + + Create the new cluster in the VergeOS UI **before** running the installer so it's available for selection in the step below. + +With the network and BIOS configured, install the node using the VergeOS ISO: + +1. Boot the node to the VergeOS installation ISO. **The ISO version must match the version of VergeOS running on the cluster.** +2. From the install menu, select ***PXE Compute node with optional storage***. +3. Enter **admin credentials** for the existing VergeOS system. The installer checks the network configuration to identify core network connections. +4. If prompted, select the appropriate **cluster for the node to join**. +5. For diskless (no-storage) nodes, the installer will report that no drives were discovered. Press **Enter** to proceed with the install. +6. The node registers with the system. +7. When the install finishes processing, remove the install media and reboot the node. + +### Verifying success + +After the install completes and the node reboots, you should see: + +- The node listed in the VergeOS system (additional node visible in the cluster) +- The node automatically network-booting to VergeOS on reboot, with no install media present + +## Troubleshooting + +!!! warning "Common Issues" + - **Problem:** After installation and reboot, the VergeOS installer screen appears again instead of the running OS + - **Solution:** Remove the installation media (for example, USB installer or virtual media mounted via IPMI) and reboot. The node will fall through to the next item in the boot order — the PXE NIC. + + - **Problem:** Installation completes but the node does not boot to VergeOS after reboot + - **Solution:** Work through the following checks: + - Verify the BIOS / UEFI / boot policy is configured to PXE boot from the correct NIC. These settings are vendor-dependent — consult your vendor's documentation. + - Ensure the underlying network infrastructure allows PXE traffic (broadcasts permitted, TFTP not blocked, no firewall or ACL filtering between the node and the VergeOS PXE network). + - Confirm the PXE NIC is physically connected to the correct switch port and is receiving a DHCP lease from the configured VergeOS PXE network. Verify the lease is **not** coming from a competing DHCP server on the segment. + + - See [PXE Implementation Guide](/implementation-guide/pxe-boot/) for additional Troubleshooting information. + +!!! tip "Pro Tip" + If you're stuck, use the [Network Diagnostics Tool](/product-guide/networks/network-diagnostics) on the PXE external network, running the TCPDump command with a filter for `bootp || tftp`. This can show exactly where the PXE "conversation" stops. If you don't see any incoming packets at all, the issue is almost always upstream — a routing or switching problem, a VLAN mismatch, or a NIC not actually configured to send PXE requests. + + + +## Additional Resources +- [PXE Implementation Guide](/implementation-guide/pxe-boot) +- [Installation Guide](/implementation-guide/installation-guide) +- [How to Create an External Network](/knowledge-base/create-external-network/) +- [Network Diagnostics Tool](/product-guide/networks/network-diagnostics) + +## Feedback + +!!! question "Need Help?" + If you need further assistance or have any questions about this article, please don't hesitate to reach out to the [VergeOS Support Team](/support). diff --git a/mkdocs.yml b/mkdocs.yml index 1f99b454..e4917a8f 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -148,6 +148,7 @@ nav: - Node Sizing: implementation-guide/sizing.md - Network Design: implementation-guide/network-design.md - Switch Config Guide: implementation-guide/switch-configuration.md + - PXE-boot Setup Guide: implementation-guide/pxe-boot.md - Installation: - Pre-Installation: implementation-guide/pre-installation.md - Bootable Media: implementation-guide/install-media.md