
Sitemap discovery misses root-domain sitemap when scoring a subdirectory URL #83

@dacharyc

Summary

When the scored URL is a subdirectory (e.g. https://www.swift.org/documentation/), afdocs checks for a sitemap at <base-url>/sitemap.xml — in this case https://www.swift.org/documentation/sitemap.xml. If that path returns 404, afdocs falls back to testing only the root URL and emits a single-page-sample diagnostic, even when a valid sitemap exists at https://www.swift.org/sitemap.xml.

Steps to reproduce

npx afdocs check https://www.swift.org/documentation/ --sampling deterministic --max-links 50 --format json --score

Expected: afdocs discovers and samples pages from the sitemap at https://www.swift.org/sitemap.xml.

Actual: discoverySources: ["fallback"], testedPages: 1, single-page-sample diagnostic fires.

Root cause

The discovery sequence:

  1. Checks robots.txt at https://www.swift.org/robots.txt — found, but no Sitemap: directive
  2. Tries https://www.swift.org/documentation/sitemap.xml — 404
  3. Falls back to testing only the root URL

Step 2 is path-scoped to the base URL. It never tries https://www.swift.org/sitemap.xml — the conventional root-domain location.

Expected behavior

When the base URL is a subdirectory and the path-relative sitemap returns 404, fall back to checking <scheme>://<host>/sitemap.xml (and the <scheme>://<host>/sitemap-index.xml variants) before giving up on sitemap discovery.

Sitemaps are almost never placed under a subdirectory path; they're nearly always at the root. The path-scoped check has a low hit rate, while the root-domain fallback would recover a significant fraction of cases like this one.
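
A minimal sketch of the proposed candidate ordering, in TypeScript. The function name sitemapCandidates and the exact filename list are assumptions for illustration and are not taken from the afdocs source; the sketch only shows the path-scoped location being tried first, followed by the root-domain locations.

```typescript
// Hypothetical helper, not part of afdocs' actual API.
// Returns sitemap URLs to probe, in order: the path-scoped location first,
// then the root-domain locations as a fallback.
function sitemapCandidates(baseUrl: string): string[] {
  const base = new URL(baseUrl);

  // Treat the base path as a directory so the scoped sitemap lands inside it.
  const dir =
    base.pathname.slice(-1) === "/" ? base.pathname : base.pathname + "/";

  const candidates: string[] = [
    // Current behavior: <base-url>/sitemap.xml
    base.origin + dir + "sitemap.xml",
    // Proposed fallback: <scheme>://<host>/sitemap.xml and the index variant
    base.origin + "/sitemap.xml",
    base.origin + "/sitemap-index.xml",
  ];

  // When the base URL is already the root, the scoped and root entries
  // coincide, so drop duplicates while preserving order.
  return candidates.filter((url, i) => candidates.indexOf(url) === i);
}
```

For https://www.swift.org/documentation/ this yields the scoped path first, then https://www.swift.org/sitemap.xml and https://www.swift.org/sitemap-index.xml. A caller would stop at the first candidate that returns a valid sitemap.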

Notes

  • https://www.swift.org/sitemap.xml returns HTTP 200 with 446 URLs, 47 of which are under /documentation/
  • https://www.swift.org/documentation/sitemap.xml returns HTTP 404
  • The robots.txt (User-agent: *, Disallow: /builds/) has no Sitemap: line
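
If the root sitemap is used as a fallback, the tool may want to sample only the entries under the scored subdirectory (the 47 of 446 noted above). The sketch below shows one way to do that filtering. It is an assumption about desirable behavior, not something afdocs is known to do, and urlsUnderBase is a hypothetical name.

```typescript
// Hypothetical helper, not part of afdocs' actual API.
// Keeps only sitemap entries on the same origin as the base URL whose path
// is the base path itself or sits underneath it.
// Assumes every entry is an absolute URL, as the sitemap protocol requires.
function urlsUnderBase(sitemapUrls: string[], baseUrl: string): string[] {
  const base = new URL(baseUrl);

  // Normalize to a trailing slash so "/documentation-old/" does not match
  // a base path of "/documentation".
  const prefix =
    base.pathname.slice(-1) === "/" ? base.pathname : base.pathname + "/";
  const bare = prefix.slice(0, -1); // same path without the trailing slash

  return sitemapUrls.filter((raw) => {
    const url = new URL(raw);
    if (url.origin !== base.origin) return false;
    return url.pathname === bare || url.pathname.indexOf(prefix) === 0;
  });
}
```

Comparing on a slash-terminated prefix rather than a raw string prefix avoids accidentally including sibling directories that merely share a leading name.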
