Merged
19 commits
e4b61c9
Lay foundation for cross-platform savePath + write sandbox
vasylenko May 13, 2026
3bda7d8
Unit-test src/sandbox.ts containment logic
vasylenko May 13, 2026
2b8957d
Wire sandbox enforcement into MCP fetch_markdown handler
vasylenko May 13, 2026
239f0a9
Add MCP integration tests for the write sandbox
vasylenko May 13, 2026
67ac2c0
CI: extend test job to a 3-OS matrix (ubuntu/macos/windows)
vasylenko May 13, 2026
c502a3f
Force LF line endings in working tree on all platforms
vasylenko May 13, 2026
93e6c7b
Document 0.6.0: README, SPEC, CHANGELOG
vasylenko May 13, 2026
9ed1255
Bump version to 0.6.0
vasylenko May 13, 2026
0a1752c
Fix Windows CI failures: tsx-shim spawn + path-escaping in error reason
vasylenko May 13, 2026
3e2ec9b
Address code-review findings: contract docs + fail-fast hardening
vasylenko May 14, 2026
2726aef
Add opt-in live e2e tests against real production URLs
vasylenko May 14, 2026
c70c2ec
Wire live e2e tests into the default suite
vasylenko May 14, 2026
32902c6
Add live e2e tests into the default suite
vasylenko May 14, 2026
0802c4c
Drop ephemeral 'Suggestion N from the code-review' comments
vasylenko May 14, 2026
a01887f
Raise CLI timeout-test cold-start budget 1500ms -> 3000ms
vasylenko May 14, 2026
5ecb2ad
Trim verbose comments and drop unit tests duplicated by integration
vasylenko May 14, 2026
67ffc6b
Drop redundant section dividers and duplicate comments
vasylenko May 14, 2026
6b00ecc
Document MARKFETCH_ALLOWED_WRITE_ROOTS replace-not-merge as deliberate
vasylenko May 14, 2026
b65d09f
Bump version from 0.5.0 to 0.6.0 in package-lock.json
vasylenko May 14, 2026
3 changes: 3 additions & 0 deletions .gitattributes
@@ -0,0 +1,3 @@
# Keep text files LF on all platforms. Windows runners would otherwise
# autocrlf .md fixtures to CRLF and break snapshot tests.
* text=auto eol=lf
14 changes: 13 additions & 1 deletion .github/workflows/ci.yml
@@ -9,7 +9,14 @@ on:

jobs:
test:
runs-on: ubuntu-latest
# fail-fast: false lets a failure on one OS surface all of the
# others' results in the same run, instead of cancelling siblings.
# Cheaper to read than re-running individually for each.
strategy:
fail-fast: false
matrix:
os: [ubuntu-latest, macos-latest, windows-latest]
runs-on: ${{ matrix.os }}
permissions:
contents: read
steps:
@@ -24,4 +31,9 @@ jobs:
run: npm ci

- name: Run tests
# `npm test` runs `tsx --test tests/*.test.ts`. The glob is
# shell-expanded; cmd.exe (Windows default) doesn't expand it,
# so we pin Git Bash — preinstalled on the windows-latest runner.
# No-op on Linux/macOS where bash is already the default.
shell: bash
run: npm test
12 changes: 11 additions & 1 deletion CHANGELOG.md
@@ -7,8 +7,18 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

## [0.6.0] - 2026-05-14

### Added
- Cross-platform absolute-path validation for `savePath` (MCP). Windows-style paths (`C:\foo`, `C:/foo`, `\\server\share`, `\foo`) are now accepted alongside POSIX absolute paths. The schema delegates to `path.isAbsolute()` so it stays correct on whichever platform the process runs.
- Write sandbox restricting MCP `savePath` writes. By default the allowed set is `realpath(os.tmpdir())` ∪ `realpath(process.cwd())`. Symlinks are resolved via `fs.realpath` before the containment check, so a planted symlink inside the sandbox cannot be used to escape. Violations return the new `save_forbidden` error code; no file is created. The CLI is intentionally unrestricted (human at the shell is the security boundary; the LLM via MCP is the threat surface).
- `MARKFETCH_ALLOWED_WRITE_ROOTS` env var: platform-delimiter-separated list of absolute paths (`:` on POSIX, `;` on Windows). When set, **replaces** the default allowed roots entirely. Validated at startup with fail-fast-on-stderr semantics matching existing env-var conventions (`MARKFETCH_TIMEOUT_MS`, `MARKFETCH_MAX_BYTES`, `MARKFETCH_USER_AGENT`).
- `save_forbidden` error code (8th in the contract): returned when a `savePath` resolves outside the configured allowed write roots.
- CI test job now runs on `ubuntu-latest`, `macos-latest`, and `windows-latest`. `shell: bash` is set on the `npm test` step so the test-glob expands consistently across runners.

### Changed
- Resolved code smell SonarQube findings (S4325 redundant `Document` casts, S6594 `String#match` → `RegExp#exec`) — no behavior change, all 50 tests pass. ([c993938](https://github.com/vasylenko/markfetch/commit/c9939385edfbe95f7f34a24ba8e33e5a74ac07f4))
- MCP `savePath` schema replaced the literal `startsWith('/')` constraint with `z.string().refine(path.isAbsolute)`. **Breaking change for MCP callers that previously wrote outside `os.tmpdir()` or `process.cwd()`** — they will now receive `save_forbidden` and must either move the target inside the default roots or set `MARKFETCH_ALLOWED_WRITE_ROOTS`. CLI behavior is unchanged (no sandbox there).
- Resolved code smell SonarQube findings (S4325 redundant `Document` casts, S6594 `String#match` → `RegExp#exec`) — no behavior change, all tests pass. ([c993938](https://github.com/vasylenko/markfetch/commit/c9939385edfbe95f7f34a24ba8e33e5a74ac07f4))
- Documentation and inline comments cleaned up across README, SPEC, source, and test descriptions. Text-only, no runtime change. ([#2](https://github.com/vasylenko/markfetch/pull/2))

## [0.5.0] - 2026-05-12
59 changes: 57 additions & 2 deletions README.md
@@ -59,7 +59,7 @@ gemini mcp add -s user markfetch npx -y markfetch
| Generic Playwright / Puppeteer | ✓ | – | – | – |
| `mcp-server-fetch` (Python) | – | basic | – | – |
| CloudFlare `/markdown` | ✓ | ✓ | – | paid |
| **`markfetch`** | **✓** | **✓** | **✓ (7 codes)** | **✓** |
| **`markfetch`** | **✓** | **✓** | **✓ (8 codes)** | **✓** |

- **Real-browser HTTP/2 + Chrome fingerprint.** ALPN-negotiated h2, `User-Agent`, `Sec-CH-UA-*`, `Sec-Fetch-*`, `Accept-*`. A Chrome UA with no client hints is a *stronger* automation signal than curl — `markfetch` sends the full coherent set, derived from the UA at startup so an override stays internally consistent.

@@ -108,6 +108,19 @@ Flags:

Errors go to stderr with the same `[code] message` shape the MCP tool returns (see the table below), and the process exits with a non-zero status. The same env vars (`MARKFETCH_TIMEOUT_MS`, `MARKFETCH_MAX_BYTES`, `MARKFETCH_USER_AGENT`) apply in both modes.

Errors carry one of eight deterministic codes:

| Code | Meaning |
|---|---|
| `network_error` | DNS / TCP / TLS failure, or an unexpected internal error from the fetcher. |
| `http_error` | Upstream returned a non-2xx status. |
| `timeout` | Per-request budget `MARKFETCH_TIMEOUT_MS` exceeded. |
| `unsupported_content_type` | Response was not `text/html` or `application/xhtml+xml`. |
| `extraction_failed` | Readability returned no article content (typical for pure client-rendered SPAs). |
| `too_large` | Response body or extracted markdown exceeded `MARKFETCH_MAX_BYTES`. |
| `save_failed` | `savePath` was given but `writeFile` failed (parent directory missing, permission denied, etc.). |
| `save_forbidden` | `savePath` resolves outside the allowed write roots — see [Write sandbox](#write-sandbox). MCP-only; the CLI has no sandbox. |
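
A caller that wants to branch on these codes can split the `[code] message` string back into its parts. The following is a minimal sketch and not part of markfetch: `parseMarkfetchError` and `KNOWN_CODES` are hypothetical names, and the single space after the closing bracket is an assumption read off the shape shown above.

```typescript
// Hypothetical helper, not part of markfetch: splits "[code] message" into parts.
const KNOWN_CODES = [
  "network_error",
  "http_error",
  "timeout",
  "unsupported_content_type",
  "extraction_failed",
  "too_large",
  "save_failed",
  "save_forbidden",
] as const;

type KnownCode = (typeof KNOWN_CODES)[number];

interface ParsedError {
  code: KnownCode;
  message: string;
}

// Returns null when the text does not start with a known "[code] " prefix.
function parseMarkfetchError(text: string): ParsedError | null {
  // Assumption: exactly one space separates the bracketed code from the message.
  const match = /^\[([a-z_]+)\] ([\s\S]*)$/.exec(text);
  if (match === null) return null;
  const code = KNOWN_CODES.find((c) => c === match[1]);
  return code === undefined ? null : { code, message: match[2] ?? "" };
}

// The message text here is a made-up sample, not markfetch's real wording.
const parsed = parseMarkfetchError("[save_forbidden] path is outside the allowed write roots");
console.log(parsed?.code);
```

Checking the code against a closed list means an unrecognised prefix is reported as "not a markfetch error" rather than being passed through as if it were one.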

## What it is not

- **Not a crawler.** No recursion, no `robots.txt` parsing, no rate-limit orchestration. One URL in, one document out.
@@ -138,9 +151,51 @@ Pass overrides via the `env` block of your MCP client config:
}
```

### Write sandbox

MCP `savePath` writes are confined to a set of allowed root directories. By default the allowed set is `os.tmpdir()` ∪ `process.cwd()` (each resolved via `fs.realpath` once at startup). A `savePath` outside that set returns `save_forbidden` and no file is created.

Override the default set with `MARKFETCH_ALLOWED_WRITE_ROOTS` — a list of absolute paths separated by the platform's path delimiter (`:` on POSIX, `;` on Windows). When set, the override **replaces** the defaults entirely — it does not merge. To keep `os.tmpdir()` or `process.cwd()` accessible, list them yourself; the example below shows `/tmp` for that reason. A malformed value (non-absolute entry, or a directory that doesn't exist) fails fast on stderr at startup.

```json
{
"mcpServers": {
"markfetch": {
"command": "npx",
"args": ["-y", "markfetch"],
"env": {
"MARKFETCH_ALLOWED_WRITE_ROOTS": "/Users/me/markfetch-out:/tmp"
}
}
}
}
```

On Windows, use backslashes and `;` as the delimiter:

```json
{
"mcpServers": {
"markfetch": {
"command": "npx",
"args": ["-y", "markfetch"],
"env": {
"MARKFETCH_ALLOWED_WRITE_ROOTS": "C:\\Users\\me\\markfetch-out;C:\\Users\\me\\AppData\\Local\\Temp"
}
}
}
}
```
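
The replace-not-merge rule and the fail-fast validation described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `src/sandbox.ts`: `parseAllowedRootsSketch` is a hypothetical name, and treating a set-but-empty value as an error is a guess about behaviour the README does not spell out.

```typescript
import { statSync } from "node:fs";
import { tmpdir } from "node:os";
import { delimiter, isAbsolute } from "node:path";

// Hypothetical sketch; the real logic lives in src/sandbox.ts and may differ.
function parseAllowedRootsSketch(
  raw: string | undefined,
  defaults: readonly string[],
  sep: string = delimiter, // ":" on POSIX, ";" on Windows
): string[] {
  // Unset: fall back to the defaults.
  if (raw === undefined) return [...defaults];

  const entries = raw.split(sep).filter((entry) => entry !== "");
  if (entries.length === 0) {
    // Assumption: a set-but-empty value is treated as a configuration error.
    throw new Error("MARKFETCH_ALLOWED_WRITE_ROOTS is set but lists no paths");
  }
  for (const entry of entries) {
    if (!isAbsolute(entry)) {
      throw new Error(`not an absolute path: ${entry}`);
    }
    // statSync throws when the path does not exist, which also fails fast.
    if (!statSync(entry).isDirectory()) {
      throw new Error(`not a directory: ${entry}`);
    }
  }
  // Replace, not merge: the defaults are deliberately dropped here.
  return entries;
}

// With the variable set to the temp dir only, process.cwd() is no longer allowed.
const roots = parseAllowedRootsSketch(tmpdir(), [process.cwd()]);
console.log(roots);
```

Returning only the listed entries is what makes the override a replacement: a caller who still wants the temp dir or the working directory has to name it explicitly, as the JSON examples above do.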

Notes:

- **The sandbox is MCP-only by design.** The CLI is unrestricted — a human at the shell is the security boundary, and the markfetch CLI doesn't run any sandbox check at all. The asymmetry exists because the MCP tool is driven by a language model, which may be steered by content from a page it just fetched.
- **Symlinks pointing outside are blocked.** Each candidate `savePath` is resolved via `fs.realpath` to its real destination before the containment check, so a symlink planted inside the sandbox cannot be used to escape.
- **Containment is case-insensitive on Windows** (`C:\Users\Bob` and `c:\users\bob` are the same path).
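
The symlink and case-sensitivity notes above can be made concrete with a small sketch. It is not the implementation in `src/sandbox.ts`: the names `isContained`, `resolveReal`, and `isWriteAllowed` are hypothetical, the synchronous `fs` calls are used only to keep the example short, and falling back to the parent directory when the target file does not exist yet is an assumption.

```typescript
import { lstatSync, mkdtempSync, realpathSync, symlinkSync } from "node:fs";
import { tmpdir } from "node:os";
import * as path from "node:path";

// Hypothetical names throughout; the real check lives in src/sandbox.ts.

// True when `target` is strictly inside `root` (both already real paths).
function isContained(
  root: string,
  target: string,
  caseInsensitive: boolean = process.platform === "win32",
): boolean {
  const norm = (p: string) => (caseInsensitive ? p.toLowerCase() : p);
  const rel = path.relative(norm(root), norm(target));
  if (rel === "" || path.isAbsolute(rel)) return false; // the root itself, or another drive
  return rel !== ".." && !rel.startsWith(".." + path.sep);
}

// Resolve symlinks. Assumption: if the file does not exist yet, resolve its
// parent directory instead and re-attach the file name.
function resolveReal(target: string): string {
  let present = true;
  try {
    lstatSync(target);
  } catch {
    present = false;
  }
  // realpathSync throws on a dangling symlink; the caller treats that as a refusal.
  if (present) return realpathSync(target);
  return path.join(realpathSync(path.dirname(target)), path.basename(target));
}

function isWriteAllowed(savePath: string, realRoots: readonly string[]): boolean {
  try {
    const real = resolveReal(savePath);
    return realRoots.some((root) => isContained(root, real));
  } catch {
    return false; // unresolvable paths are refused rather than guessed at
  }
}

// Demo: a symlink planted inside the sandbox that points outside it.
const sandboxRoot = realpathSync(mkdtempSync(path.join(tmpdir(), "sandbox-")));
const outsideDir = realpathSync(mkdtempSync(path.join(tmpdir(), "outside-")));
// "junction" only matters on Windows; other platforms ignore the type argument.
symlinkSync(outsideDir, path.join(sandboxRoot, "escape"), "junction");

console.log(isWriteAllowed(path.join(sandboxRoot, "page.md"), [sandboxRoot])); // true
console.log(isWriteAllowed(path.join(sandboxRoot, "escape", "page.md"), [sandboxRoot])); // false
```

Comparing with `path.relative` rather than a plain string prefix avoids treating a sibling such as `/srv/outside` as if it were inside `/srv/out`.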

## Develop

Requires Node.js ≥ 24.
Requires Node.js ≥ 24. Tested on Linux, macOS, and Windows in CI.

When iterating on CLI changes, `tsx src/index.ts <url>` and `tsx src/index.ts --help` route through the same argv-discriminated dispatcher as the built `dist/index.js` — no rebuild needed between edits.

11 changes: 6 additions & 5 deletions docs/SPEC.md
@@ -13,25 +13,27 @@ URL
→ caller markdown body, or "Saved N bytes to /path" confirmation
```

Errors throw `MarkfetchError` uniformly from core; adapters catch once. Codes: `network_error`, `http_error`, `timeout`, `unsupported_content_type`, `extraction_failed`, `too_large`, `save_failed`. CLI emits `[code] message` to stderr and exits 1; MCP emits `{ isError: true, content: [{ text: "[code] message" }] }`.
Errors throw `MarkfetchError` uniformly from core; adapters catch once. Codes: `network_error`, `http_error`, `timeout`, `unsupported_content_type`, `extraction_failed`, `too_large`, `save_failed`; plus `save_forbidden`, emitted by the MCP adapter only (before `fetchMarkdown` runs — see "Asymmetric write sandbox" under Core Decisions). CLI emits `[code] message` to stderr and exits 1; MCP emits `{ isError: true, content: [{ text: "[code] message" }] }`.

## Core Decisions

- **Argv-discriminated dispatch.** `argv.length === 2` (bare invocation) routes to MCP — preserving every existing client config, which all spawn with zero args. Any argument routes to CLI. No `--mcp` flag, no separate `markfetch-mcp` bin, no `isTTY` sniffing.

- **Lazy adapter imports.** The dispatcher uses `await import()` to load exactly one adapter. The only `console.log` in the project lives in `cli.ts`; under MCP, `cli.ts` never loads, so stdout-discipline is enforced by the module graph — not by linter or convention.

- **Core throws, adapters translate.** All 7 error codes surface from `core.ts` — five are thrown explicitly as `MarkfetchError`; `network_error`, `timeout`, and (sometimes) `http_error` are translated by `classifyError` from underlying-API errors (undici TypeErrors, AbortSignal timeouts). New codes need an `ErrorCode` union member + a throw site; adapters don't change.
- **Core throws, adapters translate.** Seven of the eight error codes surface from `core.ts` — five are thrown explicitly as `MarkfetchError`; `network_error`, `timeout`, and (sometimes) `http_error` are translated by `classifyError` from underlying-API errors (undici TypeErrors, AbortSignal timeouts). The eighth code, `save_forbidden`, is the exception — it's emitted by the MCP adapter before `fetchMarkdown` is invoked (see "Asymmetric write sandbox" below). New core codes need an `ErrorCode` union member + a throw site; adapters don't change.

- **HTTP/2 + coherent Chrome fingerprint.** Wire protocol, headers, and UA must agree — a Chrome UA over HTTP/1.1 or without `Sec-CH-UA-*` is *more* suspicious than curl. `Sec-CH-UA-*` is derived from `MARKFETCH_USER_AGENT` at startup so override-coherence is mechanical.

- **Single-channel MCP response.** `content[0].text` only. Several major MCP clients (Claude Code CLI, VS Code/Copilot) forward only `structuredContent` to the model and drop `content[]` when both are present — a single-channel response keeps the markdown reachable from those clients.

- **Whole document or `too_large`.** No pagination. Partial content lets the agent reason over truncated bodies without knowing they're truncated. `savePath` / `-o` is the escape valve for genuinely large documents.

- **Asymmetric `savePath`.** MCP requires absolute paths (zod `startsWith("/")`); CLI accepts relative and resolves against `process.cwd()`. CLI has a stable cwd the user typed `cd` into; MCP servers run in whatever cwd the client picks.
- **Asymmetric `savePath`.** MCP requires absolute paths via zod `refine(path.isAbsolute)` — accepts platform-appropriate shapes on POSIX (`/foo`) and Windows (`C:\foo`, `C:/foo`, `\\server\share`, `\foo`). CLI accepts relative and resolves against `process.cwd()`. CLI has a stable cwd the user typed `cd` into; MCP servers run in whatever cwd the client picks.

- **Stderr is fatal-only.** Per-request MCP errors round-trip through `{ isError }`; only startup misconfig / unrecoverable crashes touch stderr. CLI is its own session, so its per-request errors *are* fatal for that session. Regression guard: `tests/server.test.ts:436`.
- **Asymmetric write sandbox.** MCP `savePath` writes are confined to `realpath(os.tmpdir())` ∪ `realpath(process.cwd())` by default; the env var `MARKFETCH_ALLOWED_WRITE_ROOTS` (path-delimiter-separated) replaces the defaults. CLI writes anywhere the human's shell permits — no sandbox check. The asymmetry reflects the threat model: an LLM driving the MCP tool may be steered by content from the page it just fetched; a human typing into a shell is the security boundary. Symlinks are resolved via `fs.realpath` before the containment check, so a planted symlink inside the sandbox cannot escape. Containment compare is case-insensitive on `process.platform === "win32"`. Implementation lives in `src/sandbox.ts` (a leaf module — no imports from siblings — so it's unit-testable without spinning up the MCP server). Known limitation: TOCTOU between `realpath` and `writeFile` is not closed — acceptable for a single-user developer tool.

- **Stderr is fatal-only.** Per-request MCP errors round-trip through `{ isError }`; only startup misconfig / unrecoverable crashes touch stderr. CLI is its own session, so its per-request errors *are* fatal for that session. Regression guard: `tests/server.test.ts:336`.
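
The path shapes listed under "Asymmetric `savePath`" above can be checked directly against Node's two `path` flavours. This is a small sketch for illustration: `absoluteOn` is a hypothetical helper, and it calls `path.posix` and `path.win32` explicitly so both rule sets can be compared on any host, whereas the schema itself uses the platform-default `path.isAbsolute`.

```typescript
import { posix, win32 } from "node:path";

// Hypothetical helper: reports how each platform's rules classify a path.
function absoluteOn(p: string): { posix: boolean; win32: boolean } {
  return { posix: posix.isAbsolute(p), win32: win32.isAbsolute(p) };
}

// The accepted shapes from the bullet above, plus two that should be rejected.
const samples = [
  "/foo",
  "C:\\foo",
  "C:/foo",
  "\\\\server\\share",
  "\\foo",
  "foo/bar",
  "~/notes.md",
];

for (const sample of samples) {
  console.log(JSON.stringify(sample), absoluteOn(sample));
}
```

One consequence worth noting: a drive-letter path such as `C:\foo` is not absolute under POSIX rules, so a server running on Linux or macOS rejects it at the schema, while `/foo` counts as absolute under both rule sets.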

## Ideas for future

@@ -41,4 +43,3 @@ Errors throw `MarkfetchError` uniformly from core; adapters catch once. Codes: `
- **Cookie reuse across redirects within a single fetch.** Currently none. Trigger: a target serves content only after a session-cookie redirect.
- **Proxy support** (`MARKFETCH_PROXY_URL`) and **`Accept-Language` control** (`MARKFETCH_ACCEPT_LANGUAGE`). Trigger: corporate proxy / locale-specific content.
- **Single-binary distribution.** Bun's `build --compile`, Node SEA, or similar. Trigger: `npx` first-run latency feedback, or an offline / airgapped need.
- **Windows-friendly `savePath` schema.** Currently Unix-shaped (`startsWith("/")`). Trigger: someone needs this on Windows.
4 changes: 2 additions & 2 deletions package-lock.json

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion package.json
@@ -1,6 +1,6 @@
{
"name": "markfetch",
"version": "0.5.0",
"version": "0.6.0",
"description": "Fetch a URL, return clean markdown. MCP server and CLI for AI agents.",
"license": "MIT",
"author": {
2 changes: 1 addition & 1 deletion src/cli.ts
@@ -23,7 +23,7 @@ program
"Fetch a URL and return clean markdown.\n" +
"Run with no arguments to start the MCP stdio server.",
)
.version("0.5.0")
.version("0.6.0")
.argument("<url>", "absolute http(s) URL to fetch")
.option(
"-o, --output <path>",
6 changes: 5 additions & 1 deletion src/core.ts
@@ -154,7 +154,8 @@ export type ErrorCode =
| "unsupported_content_type"
| "extraction_failed"
| "too_large"
| "save_failed";
| "save_failed"
| "save_forbidden";

export class MarkfetchError extends Error {
constructor(
@@ -405,6 +406,9 @@ function convertToMarkdown(article: {
// extraction_failed, too_large, save_failed
// (The first three may also come from underlying APIs and be translated by
// classifyError — adapters MUST run classifyError(err) in their catch blocks.)
// Note: the MCP adapter additionally emits save_forbidden (the 8th code in
// the contract) before fetchMarkdown is invoked — this function never throws
// it. See src/sandbox.ts and src/mcp.ts.
export async function fetchMarkdown(input: {
url: string;
savePath?: string;
22 changes: 18 additions & 4 deletions src/mcp.ts
@@ -13,6 +13,12 @@ import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";
import { fetchMarkdown, classifyError, type ErrorCode } from "./core.js";
import { isAbsolute } from "node:path";
import { buildAllowedRoots, checkPath } from "./sandbox.js";

// Built once at startup. Bad config throws and surfaces on stderr (same
// fail-fast convention as intEnv() in core.ts).
const ALLOWED_ROOTS = await buildAllowedRoots(process.env);

function errorResult(code: ErrorCode, message: string) {
return {
@@ -21,13 +27,13 @@ function errorResult(code: ErrorCode, message: string) {
};
}

const server = new McpServer({ name: "markfetch", version: "0.5.0" });
const server = new McpServer({ name: "markfetch", version: "0.6.0" });

server.registerTool(
"fetch_markdown",
{
description:
"Fetch a single public HTTP/S URL and return its main article content as clean markdown. Best for articles, documentation, blog posts, news, and reference pages. Non-HTML responses return `unsupported_content_type`. Pure client-rendered SPAs with no extractable static HTML return `extraction_failed`; SPAs that ship server-rendered or SEO-prerendered HTML will extract whatever static content they expose. Also supports saving the markdown to a file, e.g., to bypass client tool-result size limits or to reuse later.",
"Fetch a single public HTTP/S URL and return its main article content as clean markdown. Best for articles, documentation, blog posts, news, and reference pages. Non-HTML responses return `unsupported_content_type`. Pure client-rendered SPAs with no extractable static HTML return `extraction_failed`; SPAs that ship server-rendered or SEO-prerendered HTML will extract whatever static content they expose. Also supports saving the markdown to a file, e.g., to bypass client tool-result size limits or to reuse later. Saved files must land inside the allowed write roots (defaults: system temp dir and the server's working directory; configurable via `MARKFETCH_ALLOWED_WRITE_ROOTS`); paths outside return `save_forbidden`.",
inputSchema: {
url: z
.string()
@@ -37,14 +43,22 @@ server.registerTool(
),
savePath: z
.string()
.startsWith("/")
.refine(isAbsolute, "savePath must be an absolute filesystem path")
.optional()
.describe(
"Optional. When provided, the fetched markdown is written to this absolute filesystem path and the response becomes a small confirmation. Use this when the markdown might exceed your client's tool-result inline cap. Must be an absolute path starting with '/'; relative paths and tilde-paths ('~/...') are rejected by the schema. Existing files are overwritten; the parent directory must exist (caller's responsibility). The file is written only on fetch success — fetch / extraction / size-cap errors return a [code] string and never touch the file.",
"Optional. When provided, the fetched markdown is written to this absolute filesystem path and the response becomes a small confirmation. Use this when the markdown might exceed your client's tool-result inline cap. Must be an absolute path on the host platform (e.g., `/foo/bar.md` on POSIX; `C:\\foo\\bar.md` or `\\\\server\\share\\bar.md` on Windows); relative paths and tilde paths (`~/...`) are rejected by the schema. Writes are confined to an allow-listed sandbox — defaults are the system temp dir (`os.tmpdir()`) and the server's working directory; operators can override with `MARKFETCH_ALLOWED_WRITE_ROOTS` (path-delimiter-separated). A `savePath` outside the allowed roots returns `save_forbidden` and no file is created. Existing files are overwritten; the parent directory must exist (caller's responsibility). The file is written only on fetch success — fetch / extraction / size-cap errors return a `[code]` string and never touch the file.",
),
},
},
async ({ url, savePath }) => {
// Sandbox gate (MCP-only; CLI is intentionally unbounded). Runs before
// fetchMarkdown so a forbidden path short-circuits the fetch.
if (savePath !== undefined) {
const check = await checkPath(savePath, ALLOWED_ROOTS);
if (!check.ok) {
return errorResult("save_forbidden", check.reason);
}
}
try {
const { markdown, bytes, savedTo } = await fetchMarkdown({
url,