feat: enable localdns hosts plugin to cache critical AKS FQDNs #7639

Open
saewoni wants to merge 48 commits into main from sakwa/localdns_poc

saewoni commented Jan 12, 2026

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Requirements:

  • uses conventional commit messages
  • includes documentation
  • adds unit tests
  • tested upgrade from previous version
  • commits are GPG signed and Github marks them as verified

Special notes for your reviewer:

Release note:

none

Copilot AI left a comment

Pull request overview

This PR introduces a periodic “local DNS cache” for mcr.microsoft.com that writes resolved IPs into /etc/hosts.testing and wires CoreDNS/localdns to consult that file before forwarding, improving reliability and latency for MCR image pulls when LocalDNS is enabled (especially in the scriptless path).

Changes:

  • Adds mcr-hosts-setup script, systemd service, and timer into the VHD build pipeline and node provisioning flow, including a new shouldEnableMCRHostsSetup helper and CSE wiring to enable the timer when LocalDNS (scriptless) is enabled.
  • Updates the localdns CoreDNS template and associated tests to add a hosts /etc/hosts.testing plugin block so MCR lookups can be served from the generated hosts file before going to Azure DNS.
  • Adds targeted shellspec coverage for the new mcr-hosts-setup behavior and for enabling its timer, and refreshes baked CustomData blobs used in VHD-related tests to include the new artifacts.

Reviewed changes

Copilot reviewed 31 out of 83 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| vhdbuilder/packer/vhd-image-builder-mariner*.json, vhdbuilder/packer/vhd-image-builder-flatcar*.json, vhdbuilder/packer/vhd-image-builder-cvm.json, vhdbuilder/packer/vhd-image-builder-base.json | Ensures new mcr-hosts-setup.sh, .service, and .timer artifacts are copied into /home/packer during various VHD builds so they can be installed into the image. |
| vhdbuilder/packer/vhd-image-builder-arm64-gen2.json | Same as above for the ARM64 Gen2 image; one object is slightly misformatted compared to the rest of the JSON. |
| vhdbuilder/packer/packer_source.sh | Copies mcr-hosts-setup.sh to /opt/azure/containers and installs the corresponding systemd service and timer units into /etc/systemd/system with appropriate permissions. |
| parts/linux/cloud-init/artifacts/mcr-hosts-setup.sh | New script that resolves A/AAAA records for mcr.microsoft.com via dig and writes them into /etc/hosts.testing, logging both summary counts and the concrete IPs. |
| parts/linux/cloud-init/artifacts/mcr-hosts-setup.service | One-shot systemd unit that runs the mcr-hosts-setup.sh script after network-online.target is reached. |
| parts/linux/cloud-init/artifacts/mcr-hosts-setup.timer | Systemd timer that triggers mcr-hosts-setup.service at boot and every 5 minutes thereafter, with jitter and ordering relative to localdns.service. |
| parts/linux/cloud-init/artifacts/cse_config.sh | Adds shouldEnableMCRHostsSetup, which uses systemctlEnableAndStart mcr-hosts-setup.timer 30 to enable/start the timer and logs descriptive messages. |
| parts/linux/cloud-init/artifacts/cse_main.sh | Integrates shouldEnableMCRHostsSetup into the base provisioning flow, calling it when SHOULD_ENABLE_LOCALDNS is true so the timer is only enabled alongside LocalDNS scriptless corefile generation. |
| pkg/agent/baker.go | Extends the LocalDNS CoreDNS template so that, when $isRootDomain is true, a hosts /etc/hosts.testing { fallthrough } block is inserted before the Azure DNS forwarder. |
| pkg/agent/baker_test.go | Updates expected localdns corefile strings in tests to include the new hosts /etc/hosts.testing stanza, ensuring the template change is validated. |
| spec/parts/linux/cloud-init/artifacts/mcr_hosts_setup_spec.sh | New shellspec tests that (by re-simulating the logic) verify hosts file generation and content based on mocked dig output; currently they do not execute the real script, which has maintainability implications. |
| spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh | Adds tests to ensure shouldEnableMCRHostsSetup echoes the expected messages and calls systemctlEnableAndStart mcr-hosts-setup.timer 30. |
| pkg/agent/testdata/CustomizedImage*/CustomData | Refreshes gzipped CustomData payloads to include the new artifacts and behavior, keeping VHD-related tests aligned with the new provisioning logic. |
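
For illustration, the hosts-file generation described for mcr-hosts-setup.sh could look roughly like the sketch below. This is not the script from the PR: the function name `render_hosts_entries` and its filtering rules are assumptions made here, and the filter is deliberately loose (it only drops answers that cannot be IP literals, such as CNAME targets).

```shell
# Hypothetical helper (not from the PR): read resolver answers, one per line,
# on stdin and print "IP FQDN" hosts-file lines for the FQDN given as $1.
render_hosts_entries() {
    fqdn="$1"
    while IFS= read -r answer; do
        case "$answer" in
            # Empty lines, or anything containing a character that cannot
            # appear in an IP literal (for example a CNAME target), are skipped.
            ''|*[!0-9A-Fa-f:.]*) continue ;;
            # Loose match: dotted quad for IPv4, at least one colon for IPv6.
            *.*.*.*|*:*) printf '%s %s\n' "$answer" "$fqdn" ;;
        esac
    done
}
```

On a node, the input would presumably come from something like `{ dig +short A mcr.microsoft.com; dig +short AAAA mcr.microsoft.com; } | render_hosts_entries mcr.microsoft.com`. Keeping the rendering step free of network calls is also what would let a shellspec test exercise the real function with mocked dig output instead of re-simulating the logic.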

saewoni changed the title from "Sakwa/localdns poc" to "feat: add hosts plugin to" on Jan 30, 2026
saewoni changed the title from "feat: add hosts plugin to" to "feat(localdns): enable mcr-hosts-setup timer for DNS caching" on Jan 30, 2026
saewoni changed the title from "feat(localdns): enable mcr-hosts-setup timer for DNS caching" to "feat: enable localdns hosts plugin to cache critical AKS FQDNs" on Jan 30, 2026
saewoni marked this pull request as ready for review on February 2, 2026 17:50

Copilot AI left a comment

Pull request overview

Copilot reviewed 47 out of 106 changed files in this pull request and generated 3 comments.

Copilot AI review requested due to automatic review settings February 14, 2026 02:31

Copilot AI left a comment

Pull request overview

Copilot reviewed 34 out of 88 changed files in this pull request and generated 13 comments.

Comment on lines 1890 to 1896
{{- if $isRootDomain}}
# Check /etc/localdns/hosts first for critical AKS FQDNs (mcr.microsoft.com, packages.aks.azure.com, etc.)
hosts /etc/localdns/hosts {
fallthrough
}
{{- end}}
{{- if $isRootDomain}}

Copilot AI Feb 14, 2026

This enables the CoreDNS hosts plugin for the root domain unconditionally, but the PR also introduces LocalDNSProfile.EnableHostsPlugin and a separate service that populates /etc/localdns/hosts. As written, the plugin will be enabled even when the hosts file/population service isn’t present or the feature is meant to be disabled, which risks localdns startup/runtime errors and makes the new EnableHostsPlugin flag ineffective. Suggest gating this block on EnableHostsPlugin (and/or ensuring an empty hosts file is always created before localdns starts).

Contributor Author

working on it

if [ "${SHOULD_ENABLE_LOCALDNS}" = "true" ]; then
# Write hosts file BEFORE starting LocalDNS so it has entries to serve
# Enable aks-hosts-setup timer to periodically resolve and cache critical AKS FQDN DNS addresses
logs_to_events "AKS.CSE.enableAKSHostsSetup" enableAKSHostsSetup || exit $ERR_SYSTEMCTL_START_FAIL
Member

We should always enable the hosts systemd unit so the hosts file is always available, and only mount the hosts file in the Corefile if enableHostsPlugin == true is passed in.


# This function enables and starts the aks-hosts-setup timer.
# The timer periodically resolves critical AKS FQDN DNS records and populates /etc/localdns/hosts.
enableAKSHostsSetup() {
Member

Do not make this fail. Just log the error and create an empty hosts file.
When the enabling parameter is passed in, fail if the hosts file is empty.
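
One possible reading of the two review comments above, as a minimal sketch rather than the PR's actual change: populating the hosts file is best effort and never fails on its own, while a separate check fails only when the feature was explicitly requested and the file is still empty. The function names `populate_hosts_best_effort` and `require_hosts_if_enabled` are hypothetical, and creating the file only when it is missing (instead of truncating it) is an assumption.

```shell
# Hypothetical names; a sketch of the requested behaviour, not code from the PR.
# $1 = hosts file path, remaining args = the command that populates it.
populate_hosts_best_effort() {
    hosts_file="$1"
    shift
    if ! "$@"; then
        echo "WARNING: hosts setup failed; continuing with $hosts_file as-is" >&2
    fi
    # Assumption: create the file only if it is missing, so older entries survive.
    [ -e "$hosts_file" ] || : > "$hosts_file"
    return 0
}

# $1 = value of the enabling flag ("true" or anything else), $2 = hosts file path.
# Succeeds when the feature is off, or when it is on and the file has content.
require_hosts_if_enabled() {
    enabled="$1" hosts_file="$2"
    [ "$enabled" = "true" ] || return 0
    [ -s "$hosts_file" ]
}
```

A caller would run the first function unconditionally and treat a failure of the second as the signal that the hosts plugin cannot be used on this node.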

Copilot AI review requested due to automatic review settings February 17, 2026 05:16

Copilot AI left a comment

Pull request overview

Copilot reviewed 33 out of 87 changed files in this pull request and generated 1 comment.

// LocalDNSProfile represents localdns configuration for agentpool nodes.
type LocalDNSProfile struct {
EnableLocalDNS bool `json:"enableLocalDNS,omitempty"`
EnableHostsPlugin bool `json:"enableHostsPlugin,omitempty"`

Copilot AI Feb 17, 2026

EnableHostsPlugin is added to the public LocalDNSProfile API surface, but it isn’t used to control Corefile generation or provisioning behavior (the hosts plugin is enabled unconditionally). Either wire this flag through the LocalDNS Corefile template + CSE enablement logic, or remove it to avoid exposing a non-functional/misleading API field.

Suggested change
EnableHostsPlugin bool `json:"enableHostsPlugin,omitempty"`

if [ "${SHOULD_ENABLE_LOCALDNS}" = "true" ]; then
# Write hosts file BEFORE starting LocalDNS so it has entries to serve
# Enable aks-hosts-setup timer to periodically resolve and cache critical AKS FQDN DNS addresses
logs_to_events "AKS.CSE.enableAKSHostsSetup" enableAKSHostsSetup || exit $ERR_SYSTEMCTL_START_FAIL
Member

My comment is gone for some reason.

Contributor Author

I think you said something like: have enableAKSHostsSetup always run, and mount the hosts plugin in the Corefile only if enableHostsPlugin == true.

Contributor Author

Is it this? #7639 (comment)

…etup success

Generate two localdns Corefiles at Go template time: one with the hosts
plugin (for caching critical AKS FQDNs) and one without. At provisioning
time, if EnableHostsPlugin is true, attempt enableAKSHostsSetup; use the
hosts-enabled Corefile on success or fall back to the no-hosts variant on
failure. This follows the same dual-config pattern used for containerd
GPU/no-GPU configs.
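
The provisioning-time choice described in this commit message could be sketched as follows. This is an illustration under assumptions, not the code in cse_main.sh: the function name `select_corefile` and the file names in the usage note are hypothetical, and the setup step is passed in as a command so the selection logic can be shown on its own.

```shell
# Hypothetical helper: print the path of the Corefile variant to use.
# $1 = EnableHostsPlugin value, $2 = Corefile with the hosts plugin,
# $3 = Corefile without it, remaining args = the hosts setup command.
select_corefile() {
    enable_hosts_plugin="$1" with_hosts="$2" without_hosts="$3"
    shift 3
    # The setup command runs only when the flag is true; its stdout is sent to
    # stderr so that the only thing printed on stdout is the chosen path.
    if [ "$enable_hosts_plugin" = "true" ] && "$@" >&2; then
        printf '%s\n' "$with_hosts"
    else
        printf '%s\n' "$without_hosts"
    fi
}
```

For example, `select_corefile "$flag" corefile.with-hosts corefile.no-hosts enableAKSHostsSetup` would yield the hosts-enabled variant only when the flag is true and setup succeeded; every other combination falls back to the no-hosts variant.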
…re graceful fallback

- Add file-existence checks for aks-hosts-setup.sh and aks-hosts-setup.timer
  so older VHDs (or build modes that omit them) skip with a warning instead
  of failing provisioning
- Replace exit with return so cse_main.sh fallback logic is reachable
- Return failure on initial DNS resolution error so the caller falls back
  to the corefile without the hosts plugin
- Add ShellSpec tests for missing artifacts, script failure, and timer failure
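
The graceful-fallback behaviour listed above might be structured roughly as in the sketch below. It is written under assumptions: `enable_hosts_setup` and its parameters are hypothetical names, the timer start is passed in as a command, and returning non-zero for missing artifacts (so the caller picks the Corefile without the hosts plugin) is a guess at the intended contract.

```shell
# Sketch only; names are hypothetical and not taken from the PR.
# $1 = setup script path, $2 = timer unit file path,
# remaining args = command that enables and starts the timer.
enable_hosts_setup() {
    setup_script="$1" timer_unit="$2"
    shift 2
    for artifact in "$setup_script" "$timer_unit"; do
        if [ ! -f "$artifact" ]; then
            echo "WARNING: $artifact missing (older VHD?); skipping hosts setup" >&2
            return 1   # assumption: non-zero lets the caller fall back
        fi
    done
    # Run the initial resolution once before the timer takes over.
    if ! sh "$setup_script"; then
        echo "WARNING: initial DNS resolution failed" >&2
        return 1
    fi
    if ! "$@"; then
        echo "WARNING: could not start the hosts setup timer" >&2
        return 1
    fi
    return 0
}
```

Because every failure path uses `return` rather than `exit`, a caller such as `enable_hosts_setup ... || use_no_hosts_corefile` (the latter again a hypothetical name) keeps running after a failure, which is the point of the second bullet.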
Copilot AI review requested due to automatic review settings February 17, 2026 20:43

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 39 out of 150 changed files in this pull request and generated no new comments.

3 participants