feat: enable localdns hosts plugin to cache critical AKS FQDNs #7639
Pull request overview
This PR introduces a periodic “local DNS cache” for mcr.microsoft.com that writes resolved IPs into /etc/hosts.testing and wires CoreDNS/localdns to consult that file before forwarding, improving reliability and latency for MCR image pulls when LocalDNS is enabled (especially in the scriptless path).
Changes:
- Adds the `mcr-hosts-setup` script, systemd service, and timer into the VHD build pipeline and node provisioning flow, including a new `shouldEnableMCRHostsSetup` helper and CSE wiring to enable the timer when LocalDNS (scriptless) is enabled.
- Updates the localdns CoreDNS template and associated tests to add a `hosts /etc/hosts.testing` plugin block so MCR lookups can be served from the generated hosts file before going to Azure DNS.
- Adds targeted shellspec coverage for the new `mcr-hosts-setup` behavior and for enabling its timer, and refreshes baked `CustomData` blobs used in VHD-related tests to include the new artifacts.
Reviewed changes
Copilot reviewed 31 out of 83 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `vhdbuilder/packer/vhd-image-builder-mariner*.json`, `vhdbuilder/packer/vhd-image-builder-flatcar*.json`, `vhdbuilder/packer/vhd-image-builder-cvm.json`, `vhdbuilder/packer/vhd-image-builder-base.json` | Ensures new mcr-hosts-setup.sh, .service, and .timer artifacts are copied into /home/packer during various VHD builds so they can be installed into the image. |
| `vhdbuilder/packer/vhd-image-builder-arm64-gen2.json` | Same as above for the ARM64 Gen2 image; one object is slightly misformatted compared to the rest of the JSON. |
| `vhdbuilder/packer/packer_source.sh` | Copies mcr-hosts-setup.sh to /opt/azure/containers and installs the corresponding systemd service and timer units into /etc/systemd/system with appropriate permissions. |
| `parts/linux/cloud-init/artifacts/mcr-hosts-setup.sh` | New script that resolves A/AAAA records for mcr.microsoft.com via dig and writes them into /etc/hosts.testing, logging both summary counts and the concrete IPs. |
| `parts/linux/cloud-init/artifacts/mcr-hosts-setup.service` | One-shot systemd unit that runs the mcr-hosts-setup.sh script after network-online.target is reached. |
| `parts/linux/cloud-init/artifacts/mcr-hosts-setup.timer` | Systemd timer that triggers mcr-hosts-setup.service at boot and every 5 minutes thereafter, with jitter and ordering relative to localdns.service. |
| `parts/linux/cloud-init/artifacts/cse_config.sh` | Adds shouldEnableMCRHostsSetup, which uses systemctlEnableAndStart mcr-hosts-setup.timer 30 to enable/start the timer and logs descriptive messages. |
| `parts/linux/cloud-init/artifacts/cse_main.sh` | Integrates shouldEnableMCRHostsSetup into the base provisioning flow, calling it when SHOULD_ENABLE_LOCALDNS is true so the timer is only enabled alongside LocalDNS scriptless corefile generation. |
| `pkg/agent/baker.go` | Extends the LocalDNS CoreDNS template so that, when $isRootDomain is true, a hosts /etc/hosts.testing { fallthrough } block is inserted before the Azure DNS forwarder. |
| `pkg/agent/baker_test.go` | Updates expected localdns corefile strings in tests to include the new hosts /etc/hosts.testing stanza, ensuring the template change is validated. |
| `spec/parts/linux/cloud-init/artifacts/mcr_hosts_setup_spec.sh` | New shellspec tests that (by re-simulating the logic) verify hosts file generation and content based on mocked dig output; currently they do not execute the real script, which has maintainability implications. |
| `spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh` | Adds tests to ensure shouldEnableMCRHostsSetup echoes the expected messages and calls systemctlEnableAndStart mcr-hosts-setup.timer 30. |
| `pkg/agent/testdata/CustomizedImage*/CustomData` | Refreshes gzipped CustomData payloads to include the new artifacts and behavior, keeping VHD-related tests aligned with the new provisioning logic. |
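
The hosts-file generation described for mcr-hosts-setup.sh in the table above could look roughly like the sketch below. It is an illustration, not the script from this PR: `render_hosts_entries` and `write_hosts_file` are hypothetical names, and the only external behavior assumed is that `dig +short` prints one record per line, possibly including CNAME targets, which the filter drops. The IPv4 pattern is deliberately loose and does not range-check octets.

```shell
#!/usr/bin/env bash
# Sketch only: function names and paths are illustrative, not taken from the PR.

# Read resolver output on stdin (one record per line) and print "IP FQDN"
# lines for every IPv4 or IPv6 literal. Anything else, such as a CNAME
# target, is dropped.
render_hosts_entries() {
    local fqdn="$1" line
    local ipv4_re='^[0-9]{1,3}(\.[0-9]{1,3}){3}$'
    local ipv6_re='^[0-9a-fA-F:]*:[0-9a-fA-F:]*$'
    while IFS= read -r line || [ -n "$line" ]; do
        if [[ "$line" =~ $ipv4_re || "$line" =~ $ipv6_re ]]; then
            printf '%s %s\n' "$line" "$fqdn"
        fi
    done
}

# Write the rendered entries to a temp file and rename it into place, so a
# reader never sees a half-written hosts file.
write_hosts_file() {
    local target="$1" fqdn="$2" tmp
    tmp="$(mktemp "${target}.XXXXXX")" || return 1
    if render_hosts_entries "$fqdn" > "$tmp"; then
        chmod 0644 "$tmp"
        mv "$tmp" "$target"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Example (needs network access and dig):
#   { dig +short A mcr.microsoft.com; dig +short AAAA mcr.microsoft.com; } \
#       | write_hosts_file /etc/hosts.testing mcr.microsoft.com
```

Keeping the rendering step free of network calls means it can be exercised with canned input, which is one way to test the real logic rather than a re-simulation of it.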
aks-node-controller/proto/aksnodeconfig/v1/localdns_config.proto
Outdated
    {{- if $isRootDomain}}
    # Check /etc/localdns/hosts first for critical AKS FQDNs (mcr.microsoft.com, packages.aks.azure.com, etc.)
    hosts /etc/localdns/hosts {
        fallthrough
    }
    {{- end}}
    {{- if $isRootDomain}}
This enables the CoreDNS hosts plugin for the root domain unconditionally, but the PR also introduces LocalDNSProfile.EnableHostsPlugin and a separate service that populates /etc/localdns/hosts. As written, the plugin will be enabled even when the hosts file/population service isn’t present or the feature is meant to be disabled, which risks localdns startup/runtime errors and makes the new EnableHostsPlugin flag ineffective. Suggest gating this block on EnableHostsPlugin (and/or ensuring an empty hosts file is always created before localdns starts).
aks-node-controller/proto/aksnodeconfig/v1/localdns_config.proto
Outdated
    if [ "${SHOULD_ENABLE_LOCALDNS}" = "true" ]; then
        # Write hosts file BEFORE starting LocalDNS so it has entries to serve
        # Enable aks-hosts-setup timer to periodically resolve and cache critical AKS FQDN DNS addresses
        logs_to_events "AKS.CSE.enableAKSHostsSetup" enableAKSHostsSetup || exit $ERR_SYSTEMCTL_START_FAIL
We should always enable the hosts systemd unit so the hosts file is always available, and mount the hosts file in the Corefile only if enableHostsPlugin == true is passed in.
    # This function enables and starts the aks-hosts-setup timer.
    # The timer periodically resolves critical AKS FQDN DNS records and populates /etc/localdns/hosts.
    enableAKSHostsSetup() {
Do not make this fail; just log the error and create an empty hosts file.
When the enabling parameter is passed in, fail if the hosts file is empty.
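
The policy suggested in this comment could be sketched as follows. This is a hedged illustration under stated assumptions, not code from the PR: `ensure_hosts_file`, `hosts_file_has_entries`, and `prepare_hosts_plugin` are hypothetical names, and the hosts file path and the "true"/"false" flag are passed in as plain arguments.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not taken from the PR.

# Make sure the hosts file exists, even if empty, so a Corefile that
# references it never points at a missing path.
ensure_hosts_file() {
    local hosts_file="$1"
    mkdir -p "$(dirname "$hosts_file")" || return 1
    [ -e "$hosts_file" ] || : > "$hosts_file"
}

# Succeed only if the file has at least one line that is neither blank
# nor a comment.
hosts_file_has_entries() {
    local hosts_file="$1"
    [ -f "$hosts_file" ] && grep -Eqv '^[[:space:]]*(#|$)' "$hosts_file"
}

# An empty file is tolerated by default; it is an error only when the
# plugin was explicitly enabled.
prepare_hosts_plugin() {
    local hosts_file="$1" plugin_enabled="$2"
    ensure_hosts_file "$hosts_file" || return 1
    if [ "$plugin_enabled" = "true" ] && ! hosts_file_has_entries "$hosts_file"; then
        echo "hosts plugin enabled but $hosts_file has no entries" >&2
        return 1
    fi
    return 0
}
```

For example, `prepare_hosts_plugin /etc/localdns/hosts false` would create an empty file and succeed, while the same call with `true` would fail until the file has at least one entry.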
    // LocalDNSProfile represents localdns configuration for agentpool nodes.
    type LocalDNSProfile struct {
        EnableLocalDNS bool `json:"enableLocalDNS,omitempty"`
        EnableHostsPlugin bool `json:"enableHostsPlugin,omitempty"`
EnableHostsPlugin is added to the public LocalDNSProfile API surface, but it isn’t used to control Corefile generation or provisioning behavior (the hosts plugin is enabled unconditionally). Either wire this flag through the LocalDNS Corefile template + CSE enablement logic, or remove it to avoid exposing a non-functional/misleading API field.
| EnableHostsPlugin bool `json:"enableHostsPlugin,omitempty"` |
    if [ "${SHOULD_ENABLE_LOCALDNS}" = "true" ]; then
        # Write hosts file BEFORE starting LocalDNS so it has entries to serve
        # Enable aks-hosts-setup timer to periodically resolve and cache critical AKS FQDN DNS addresses
        logs_to_events "AKS.CSE.enableAKSHostsSetup" enableAKSHostsSetup || exit $ERR_SYSTEMCTL_START_FAIL
My comment is gone for some reason.
I think you said something like: have enableAKSHostsSetup always run, and mount the hosts plugin in the Corefile only if enableHostsPlugin == true.
…etup success

Generate two localdns Corefiles at Go template time: one with the hosts plugin (for caching critical AKS FQDNs) and one without. At provisioning time, if EnableHostsPlugin is true, attempt enableAKSHostsSetup; use the hosts-enabled Corefile on success or fall back to the no-hosts variant on failure. This follows the same dual-config pattern used for containerd GPU/no-GPU configs.
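
The selection between the two Corefile variants could be sketched as below. This is an assumption-laden illustration, not the PR's code: `select_corefile` is a hypothetical name, the two file paths are placeholders, and the setup step is passed in as an arbitrary command standing in for enableAKSHostsSetup.

```shell
#!/usr/bin/env bash
# Sketch only: select_corefile and its arguments are hypothetical.

# Print the path of the Corefile variant to use.
#   $1     "true" if the hosts plugin was requested
#   $2     path of the Corefile that includes the hosts plugin
#   $3     path of the Corefile without it
#   $4...  setup command to try (stand-in for enableAKSHostsSetup)
select_corefile() {
    local enable_hosts_plugin="$1" with_hosts="$2" without_hosts="$3"
    shift 3
    # Send the setup command's output to stderr so that stdout carries
    # only the chosen path.
    if [ "$enable_hosts_plugin" = "true" ] && "$@" >&2; then
        printf '%s\n' "$with_hosts"
    else
        printf '%s\n' "$without_hosts"
    fi
}

# Example with placeholder paths:
#   corefile="$(select_corefile "$ENABLE_HOSTS_PLUGIN" \
#       /tmp/corefile.hosts /tmp/corefile.nohosts enableAKSHostsSetup)"
```

Because the setup command only influences which path is printed, a failure there degrades to the no-hosts variant instead of aborting provisioning.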
…re graceful fallback

- Add file-existence checks for aks-hosts-setup.sh and aks-hosts-setup.timer so older VHDs (or build modes that omit them) skip with a warning instead of failing provisioning
- Replace exit with return so cse_main.sh fallback logic is reachable
- Return failure on initial DNS resolution error so the caller falls back to the corefile without the hosts plugin
- Add ShellSpec tests for missing artifacts, script failure, and timer failure
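
A minimal sketch of the behavior this commit describes might look like the following. The function name `try_enable_hosts_setup` and its argument layout are hypothetical, and returning non-zero for missing artifacts is an assumption made here so the caller can fall back; the timer start is passed in as a command, standing in for whatever helper the real script uses.

```shell
#!/usr/bin/env bash
# Sketch only: try_enable_hosts_setup is a hypothetical name.

#   $1     path of the setup script
#   $2     path of the timer unit file
#   $3...  command that enables and starts the timer
try_enable_hosts_setup() {
    local script="$1" timer_unit="$2"
    shift 2
    # Older images may not ship the artifacts: warn and return rather than
    # exit, so the caller's fallback logic stays reachable.
    if [ ! -f "$script" ] || [ ! -f "$timer_unit" ]; then
        echo "WARNING: hosts setup artifacts missing; skipping" >&2
        return 1
    fi
    # Run once up front so the hosts file is populated before it is used.
    if ! bash "$script"; then
        echo "WARNING: initial hosts resolution failed" >&2
        return 1
    fi
    if ! "$@"; then
        echo "WARNING: could not enable the hosts setup timer" >&2
        return 1
    fi
    return 0
}
```

Each failure path uses `return`, so a caller such as `if try_enable_hosts_setup ...; then ...; else ...; fi` keeps running and can pick the Corefile without the hosts plugin.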