feat: lockfile-driven reproducible installs for Artifactory proxies #401
chkp-roniz wants to merge 2 commits into microsoft:main from
Pull request overview
This PR makes installs reproducible in Artifactory-proxied / air-gapped environments by treating the lockfile as the source of truth for dependency provenance (host), and tightening ARTIFACTORY_ONLY enforcement.
Changes:
- Record the actual resolved download host in the lockfile (including Artifactory proxy path) and prefer that host during re-installs.
- Add `ARTIFACTORY_ONLY` lockfile conflict detection and prevent "direct source" cached reuse under `ARTIFACTORY_ONLY`.
- Close `ARTIFACTORY_ONLY` enforcement gaps for virtual packages; add unit tests and changelog entries.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `src/apm_cli/commands/install.py` | Uses lockfile host for transitive downloads, adds ARTIFACTORY_ONLY conflict detection + cache enforcement, improves output. |
| `src/apm_cli/deps/github_downloader.py` | Adds get_resolved_host() and enforces ARTIFACTORY_ONLY for virtual file/collection/subdir packages. |
| `src/apm_cli/deps/lockfile.py` | Adds host_override plumbing so lockfile can store resolved hosts; backward-compatible tuple parsing. |
| `src/apm_cli/drift.py` | Makes build_download_ref() prefer lockfile host when rebuilding download refs. |
| `tests/unit/test_artifactory_support.py` | Adds unit tests for lockfile host override, build_download_ref host preference, and conflict detection logic. |
| `CHANGELOG.md` | Adds Unreleased entries describing the new behavior. |
…ackages

The lockfile now records the actual download host (including Artifactory proxy path) so that subsequent installs fetch from the exact same source without requiring ARTIFACTORY_BASE_URL to be set. This makes the lockfile the single source of truth for package provenance.

Key changes:
- Lockfile host field stores the resolved proxy host+path (e.g. art.example.com/artifactory/apm) instead of the original github.com
- build_download_ref prefers lockfile host over manifest host
- ARTIFACTORY_ONLY conflict detection: hard error when lockfile has github.com deps but ARTIFACTORY_ONLY=1 is set
- ARTIFACTORY_ONLY enforcement for all virtual package types (files, collections, subdirectories); closes a gap where subdirectory packages bypassed the check and fell through to direct git clone
- Public get_resolved_host() API on GitHubPackageDownloader

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
e1e669b to f705aaf
- build_download_ref preserves locked_dep.resolved_ref when no commit SHA is available (Artifactory downloads)
- Add tests for no-commit + pinned ref path and host-only override
- Update authentication docs with lockfile reproducibility section
- CHANGELOG entries include PR number (microsoft#401)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…stalls) Co-authored-by: danielmeppiel <51440732+danielmeppiel@users.noreply.github.com> Agent-Logs-Url: https://github.com/microsoft/apm/sessions/5847a44c-f37c-4545-9101-da09e3205a8c
…) against main's auth overhaul Co-authored-by: danielmeppiel <51440732+danielmeppiel@users.noreply.github.com> Agent-Logs-Url: https://github.com/microsoft/apm/sessions/a8d26cd1-ef4a-4d8d-af04-0b5838730481
danielmeppiel left a comment
Review: Request Changes
Hey @chkp-roniz — thank you for this contribution! The feature goal is spot-on: enterprise teams using registry proxies absolutely need reproducible installs from the lockfile alone. This is a real user pain point and we want to solve it.
That said, we recently merged a significant Auth + Logging Architecture Overhaul (#393/#394) that redesigned how APM handles host identity, token routing, and credential resolution. This PR conflicts with that architecture in ways that would create security vulnerabilities and structural regression. I want to walk through the concerns holistically and point toward a path that delivers the same feature cleanly.
1. 🔴 Host Identity vs. Download Routing — The Core Architecture Issue
The central problem is that get_resolved_host() returns compound "host/path" strings (e.g., "art.example.com/artifactory/apm") that get stored in the lockfile's host field and then flow through the auth system. But host in APM's architecture means a pure FQDN used for:
- AuthResolver.classify_host() → determines host kind (github/ghe_cloud/ghes/ado/generic)
- AuthResolver.resolve(host, org) → selects the correct authentication token
- git credential fill → expects host=<FQDN> per the git-credential protocol
When "art.example.com/artifactory/apm" reaches classify_host(), it returns kind="generic", which maps to TOKEN_PRECEDENCE["modules"] → GITHUB_APM_PAT. This means your GitHub PAT gets sent to the Artifactory server instead of ARTIFACTORY_APM_TOKEN. We confirmed this by tracing the execution path through auth.py.
Similarly, git credential fill with host=art.example.com/artifactory/apm is malformed and silently fails per the git-credential protocol.
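To make the git-credential point concrete, here is a minimal sketch of the key=value input that `git credential fill` reads on stdin. In that protocol `host` carries only the hostname (plus port, if any) and any path goes in a separate `path` attribute. The helper name `credential_fill_input` is hypothetical and is not part of APM.

```python
from urllib.parse import urlsplit


def credential_fill_input(url: str) -> str:
    """Build stdin text for `git credential fill` (hypothetical helper, not APM code)."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.hostname:
        raise ValueError(f"expected an absolute URL, got {url!r}")
    # host is the bare hostname, with the port appended only when one was given
    host = parts.hostname if parts.port is None else f"{parts.hostname}:{parts.port}"
    lines = [f"protocol={parts.scheme}", f"host={host}"]
    # The path is a separate attribute; it must never be folded into host
    path = parts.path.strip("/")
    if path:
        lines.append(f"path={path}")
    # A blank line terminates the attribute list
    return "\n".join(lines) + "\n\n"


print(credential_fill_input("https://art.example.com/artifactory/apm"), end="")
```

Keeping the path out of `host` is what lets a credential helper match on the server name alone.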
The fix is architectural: "where to download from" and "host identity for authentication" are two different things. The lockfile should store them separately:
```yaml
# Current PR (compound, breaks auth):
host: "art.example.com/artifactory/apm"

# Proposed (split, auth-safe):
host: "art.example.com"             # Pure FQDN → AuthResolver works correctly
registry_prefix: "artifactory/apm"  # Download routing → URL construction
```

This way classify_host("art.example.com") works correctly, git credential fill gets a valid host, and the download path construction has the prefix it needs.
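A minimal sketch of that split, assuming the compound value is always `host[/path]` with an optional scheme. `split_registry_host` is a hypothetical name used for illustration; nothing in this thread confirms how APM would implement it.

```python
from typing import Tuple


def split_registry_host(compound: str) -> Tuple[str, str]:
    """Split 'host/path' into (fqdn, registry_prefix). Hypothetical helper."""
    # Drop an optional scheme such as 'https://'
    without_scheme = compound.split("://", 1)[-1]
    # Everything before the first slash is the host; the rest is the prefix
    host, _, prefix = without_scheme.partition("/")
    return host.lower(), prefix.strip("/")


fqdn, prefix = split_registry_host("art.example.com/artifactory/apm")
print(fqdn)    # art.example.com
print(prefix)  # artifactory/apm
```

With the two parts stored separately, only `fqdn` would ever be handed to auth code, and `prefix` would only be used when building download URLs.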
2. 🔴 Registry-Agnostic Architecture — This Must Be Generic
All env vars are ARTIFACTORY_* and the code is Artifactory-branded throughout — but the underlying mechanism is generic: any HTTP VCS archive proxy could work. APM is positioning as the package manager for the AI agent ecosystem — a neutral standard. Our public API surface cannot be coupled to one vendor.
Currently Artifactory logic is scattered across ~250 lines in github_downloader.py (8 methods: _parse_artifactory_base_url, _should_use_artifactory_proxy, _is_artifactory_only, _get_artifactory_headers, _download_artifactory_archive, etc.), plus helper functions in github_host.py and a token purpose in token_manager.py. There is no registry abstraction.
What we need: A RegistryProxy class that centralizes registry concerns:
```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class RegistryConfig:
    """Registry proxy configuration (Artifactory, Nexus, GitHub Packages, etc.)."""
    url: str              # APM_REGISTRY_URL
    host: str             # Extracted FQDN for auth routing
    prefix: str           # URL path prefix for download routing
    token: Optional[str]  # APM_REGISTRY_TOKEN
    enforce_only: bool    # APM_REGISTRY_ONLY


class RegistryProxy:
    """Routing, auth, and URL construction for registry proxies."""
    # Interface stubs only; bodies are left to the implementation.
    def should_proxy(self, dep_ref) -> bool: ...
    def get_config(self) -> Optional[RegistryConfig]: ...
    def get_auth_headers(self) -> Dict[str, str]: ...
    def build_archive_url(self, owner, repo, ref) -> str: ...
    def validate_lockfile_deps(self, lockfile) -> List[str]: ...  # conflict detection
```

The canonical env vars become generic:
| New (Canonical) | Old (Alias, deprecated) | Purpose |
|---|---|---|
| APM_REGISTRY_URL | ARTIFACTORY_BASE_URL | Proxy URL |
| APM_REGISTRY_TOKEN | ARTIFACTORY_APM_TOKEN | Auth token |
| APM_REGISTRY_ONLY | ARTIFACTORY_ONLY | Enforce proxy-only |
The ARTIFACTORY_* names remain as backward-compatible aliases — no one's broken. But internal architecture, lockfile format (registry_prefix not artifactory_prefix), and documentation all use the generic model. This ensures the feature works equally well for Nexus, GitHub Packages, Azure Artifacts, or any HTTP archive proxy.
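One way the alias fallback could look, as a sketch only: the canonical name wins, and the legacy name is honored with a deprecation warning. The variable names come from the table above; `read_registry_setting` and the `_LEGACY_ALIASES` mapping are hypothetical.

```python
import os
import warnings
from typing import Mapping, Optional

# Canonical name -> deprecated alias, as listed in the table above
_LEGACY_ALIASES = {
    "APM_REGISTRY_URL": "ARTIFACTORY_BASE_URL",
    "APM_REGISTRY_TOKEN": "ARTIFACTORY_APM_TOKEN",
    "APM_REGISTRY_ONLY": "ARTIFACTORY_ONLY",
}


def read_registry_setting(name: str, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the canonical value if set, else the deprecated alias (hypothetical helper)."""
    env = os.environ if env is None else env
    value = env.get(name)
    if value:
        return value
    legacy = _LEGACY_ALIASES.get(name)
    legacy_value = env.get(legacy) if legacy else None
    if legacy_value:
        # Old names keep working, but users are nudged toward the generic ones
        warnings.warn(f"{legacy} is deprecated; use {name}", DeprecationWarning, stacklevel=2)
        return legacy_value
    return None
```

Passing `env` explicitly keeps the lookup easy to unit test without mutating the process environment.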
Download mechanics (zip extraction, retry logic, progress tracking) stay on GitHubPackageDownloader — the extraction is about routing, config, and auth, not HTTP plumbing.
3. 🟠 Lockfile as Trust Boundary — Supply Chain Security
This PR makes the lockfile the source of truth for where packages are downloaded from. That's powerful, but it means a malicious lockfile edit (e.g., in a PR from an external contributor) can redirect all package downloads silently. This is a known attack vector — npm lockfile injection is well-documented.
APM already has content_hash (SHA-256) in the lockfile, which is great! But currently it's only verified on cached installs. For fresh downloads, the hash is computed after download and stored. An attacker who modifies both host and content_hash bypasses all checks.
npm solved this with SRI integrity hashes verified before extraction. We should do the same:
- Verify `content_hash` against the downloaded content before accepting it
- Make `content_hash` mandatory for any dependency with a non-default host
This isn't necessarily a blocker for the host-persistence feature, but it should be part of the same effort or land immediately before/after.
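A minimal sketch of a pre-extraction check, assuming the lockfile hash is a hex-encoded SHA-256 of the raw downloaded bytes. That assumption may not match how APM actually computes `content_hash`, and `verify_content_hash` is a hypothetical name.

```python
import hashlib
import hmac


def verify_content_hash(payload: bytes, expected_hex: str) -> None:
    """Raise ValueError unless payload hashes to expected_hex (hypothetical helper)."""
    actual = hashlib.sha256(payload).hexdigest()
    # compare_digest avoids leaking where the two digests first differ
    if not hmac.compare_digest(actual, expected_hex.strip().lower()):
        raise ValueError(f"content hash mismatch: expected {expected_hex}, got {actual}")


archive = b"example archive bytes"
locked_hash = hashlib.sha256(archive).hexdigest()
verify_content_hash(archive, locked_hash)  # passes silently; extraction may proceed
```

The point of running this before extraction is that a tampered archive is rejected without any of its contents reaching disk.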
4. 🟡 Separation of Concerns
A few structural items that would make this much cleaner:
The installed_packages tuple: This is now a 6-element positional tuple with len()-branching for backward compat, and the same extraction boilerplate copy-pasted at 3 append() sites. We have 60+ dataclasses in this codebase — this should be an InstalledPackage dataclass with a from_graph_node() classmethod. That eliminates positional bugs and makes adding fields safe.
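As a rough illustration of the dataclass idea, the sketch below uses made-up field names, since the real tuple layout is not shown in this thread. It offers a `from_legacy_tuple()` constructor rather than the `from_graph_node()` classmethod suggested above, because the graph node type is not described here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class InstalledPackage:
    # All field names are illustrative assumptions, not APM's actual layout
    dep_ref: object
    resolved_commit: Optional[str] = None
    depth: int = 1
    resolved_by: Optional[str] = None
    host_override: Optional[str] = None

    @classmethod
    def from_legacy_tuple(cls, entry: Sequence[object]) -> "InstalledPackage":
        """Accept older, shorter tuples; missing trailing fields take their defaults."""
        if not entry:
            raise ValueError("empty installed-package entry")
        return cls(*entry[:5])


old = InstalledPackage.from_legacy_tuple(("owner/repo", "abc123", 1, None))
new = InstalledPackage.from_legacy_tuple(("owner/repo", None, 1, None, "art.example.com"))
print(old.host_override)  # None
print(new.host_override)  # art.example.com
```

Named fields with defaults mean a new attribute can be added without touching every call site or branching on `len()`.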
Registry-only validation in install.py: The ~30-line conflict detection block reads env vars directly and uses dep.host in (None, "github.com") — but _is_artifactory_only() already exists on the downloader, and this check misses GHE Cloud (*.ghe.com), GHES, and ADO hosts. This validation belongs on the new RegistryProxy class (via validate_lockfile_deps()), and should use classify_host() rather than raw string matching.
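To show why raw string matching falls short, here is a sketch that flags every dependency whose locked host differs from the configured registry host, instead of comparing against a fixed list. It is a simpler stand-in, not the `classify_host()` approach recommended above. `LockedDep` and `find_registry_conflicts` are hypothetical, and the sketch assumes a missing host means `github.com` and that hosts are already pure FQDNs.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class LockedDep:
    """Hypothetical stand-in for a lockfile entry."""
    repo_url: str
    host: Optional[str]


def find_registry_conflicts(deps: Iterable[LockedDep], registry_host: str) -> List[str]:
    """Return repo_urls locked to any host other than the registry."""
    wanted = registry_host.lower()
    # Assumption: an absent host defaults to github.com
    return [d.repo_url for d in deps if (d.host or "github.com").lower() != wanted]


deps = [
    LockedDep("acme/direct", "github.com"),
    LockedDep("acme/implicit", None),
    LockedDep("acme/cloud", "acme.ghe.com"),
    LockedDep("acme/proxied", "art.example.com"),
]
print(find_registry_conflicts(deps, "art.example.com"))
# ['acme/direct', 'acme/implicit', 'acme/cloud']
```

A check written as `host in (None, "github.com")` would let the `acme.ghe.com` entry through, which is the gap described above.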
Duplicate host override: build_download_ref() in drift.py patches the host, AND install.py's download callback also patches the host. One location for host resolution, please — build_download_ref() is the right place since it already handles lockfile→dep_ref patching.
What We'd Love to See
Here's a path that delivers your feature cleanly:
- Split the bug fix out: The `ARTIFACTORY_ONLY` enforcement for virtual packages (files, collections, subdirectories) is independently valuable and has no architectural concerns. Can you open a separate PR for just that? We'd love to merge it quickly.
- For the lockfile host persistence, converge with the auth overhaul and go registry-agnostic:
  - Extract a `RegistryProxy` class from the scattered Artifactory methods in `github_downloader.py`; it centralizes config, routing, auth headers, and validation
  - Introduce generic env vars (`APM_REGISTRY_URL`, `APM_REGISTRY_TOKEN`, `APM_REGISTRY_ONLY`) with `ARTIFACTORY_*` as deprecated aliases
  - Split `host` (FQDN) from `registry_prefix` in `LockedDependency`, which keeps auth routing clean
  - Keep `build_download_ref()` as the single point for lockfile→dep_ref patching (including host)
  - Introduce `InstalledPackage` dataclass to replace the growing tuple
  - Move registry-only validation into `RegistryProxy.validate_lockfile_deps()`, using `classify_host()` for host classification
  - Add pre-download content_hash verification (or we can pair this as a companion PR)
- Tests: Your test coverage is solid (230+ lines). The reworked architecture would need tests for FQDN/prefix separation, auth routing correctness, and the generic registry model.
We're happy to pair on this or provide more detailed guidance on the auth and registry integration points. This is a feature we want — it just needs to align with the architecture we've been building. Thank you for pushing enterprise use cases forward! 🚀
Note: One of the doc changes is independently good — fixing apm.lock → apm.lock.yaml in the Artifactory note. Feel free to submit that as a tiny doc-fix PR too.
Summary
This PR enhances the Artifactory VCS support added in #354 to make the lockfile the single source of truth for package provenance — ensuring reproducible, auditable installs in enterprise and air-gapped environments.
Why This Matters
The Lockfile Integrity Problem
PR #354 introduced JFrog Artifactory as a first-class package source. However, when a package was installed through an Artifactory proxy, the lockfile recorded `host: github.com`, the original host rather than the actual download source. This created several problems:

- Broken reproducibility: A developer installs via Artifactory with `ARTIFACTORY_BASE_URL`. They commit the lockfile. A colleague runs `apm install`, but the lockfile says `github.com`, so APM tries to fetch directly from GitHub instead of Artifactory. In an air-gapped network, this fails silently or unexpectedly.
- Supply chain opacity: The lockfile couldn't answer "where did this package actually come from?", a critical audit question in regulated environments. A package fetched through a corporate-approved proxy was indistinguishable from one fetched directly from the internet.
- Air-gap leaks: With `ARTIFACTORY_ONLY=1`, virtual subdirectory packages (e.g., `github/awesome-copilot/skills/review-and-refactor`) bypassed the enforcement check and fell through to direct git clone, breaking the air-gap guarantee.
- Stale lockfile ambiguity: If a team transitions from direct GitHub access to Artifactory-only, there was no detection of the mismatch between the lockfile (locked to `github.com`) and the new policy (`ARTIFACTORY_ONLY=1`). Installs would fail with confusing downloader errors instead of a clear remediation path.

The Principle
The lockfile must be self-contained. It should capture everything needed to reproduce the exact same install, including where each package was fetched from. No environment variables should be required for a lockfile-driven reinstall. This is the same principle that makes `package-lock.json`, `Cargo.lock`, and `poetry.lock` reliable in their ecosystems.

Changes
1. Lockfile records actual download host (`lockfile.py`, `install.py`)
When a package is installed through an Artifactory proxy, the lockfile now stores the full proxy path in the `host` field:
The `host` field stores `hostname/repo-path` so that `{host}/{repo_url}` reconstructs the full download URL. The `repo_url` remains unchanged (`owner/repo`) for consistent identity and key matching.

2. Lockfile host drives re-installs (`drift.py`)
`build_download_ref()` now prefers the lockfile's `host` over the manifest's `dep_ref.host`. This means:
- `dep_ref.host` + env var routing (existing behavior)
- `--update`: Ignores lockfile, re-resolves from manifest (existing behavior)

This also applies to the transitive dependency download callback in `install.py`.

3. ARTIFACTORY_ONLY conflict detection (`install.py`)
When `ARTIFACTORY_ONLY=1` is set but the lockfile contains dependencies locked to `github.com`, APM now exits with a clear error:
Additionally, cached packages with `github.com` in the lockfile are not silently reused when `ARTIFACTORY_ONLY` is active; they are forced through the download path.

4. ARTIFACTORY_ONLY enforcement for virtual packages (`github_downloader.py`)
Closes a gap in #354 where virtual file, collection, and subdirectory packages bypassed the `ARTIFACTORY_ONLY` check:
- `is_virtual_file()`: now blocked when `ARTIFACTORY_ONLY` is set without a proxy
- `is_virtual_collection()`: same
- `is_virtual_subdirectory()`: already had partial handling, now also blocks when proxy is unavailable

5. Public `get_resolved_host()` API (`github_downloader.py`)
New public method on `GitHubPackageDownloader` that returns the actual download host for a dependency (e.g., the Artifactory proxy path). This replaces direct access to private `_parse_artifactory_base_url()` / `_should_use_artifactory_proxy()` from `install.py`.

Files Changed
- `src/apm_cli/commands/install.py`
- `src/apm_cli/deps/github_downloader.py`: `get_resolved_host()` API, virtual package ARTIFACTORY_ONLY enforcement
- `src/apm_cli/deps/lockfile.py`: `host_override` param in `from_dependency_ref` and `from_installed_packages`
- `src/apm_cli/drift.py`: `build_download_ref` prefers lockfile host, works without `resolved_commit`
- `tests/unit/test_artifactory_support.py`
- `CHANGELOG.md`

Test Plan
- `ARTIFACTORY_BASE_URL` → lockfile records Artifactory host
- `apm_modules`, run `apm install` without env vars → fetches from Artifactory using lockfile host
- `github.com` + `ARTIFACTORY_ONLY=1` → clear error with remediation
- `ARTIFACTORY_ONLY=1` without `ARTIFACTORY_BASE_URL` → virtual subdirectory packages blocked (not silently cloned)
- `github.com`, no regressions
- `installed_packages` tuples still work, older APM can read new lockfiles

Backward Compatibility
- `host` field can now contain `hostname/path` values (e.g., `art.example.com/artifactory/apm`). Older APM versions read this as a plain string: `LockedDependency.from_dict()` stores whatever value is there. The field is not used for path computation, so older versions are unaffected.
- `from_installed_packages()` accepts both 4-element and 5-element tuples via `entry[:4]` / `len(entry) > 4` pattern.

🤖 Generated with Claude Code