GitHub - dotcommander/defuddle: Go library and CLI for extracting web page content — articles, metadata, and clean text from any URL

Introduction

Defuddle Go is a port of the Defuddle TypeScript library. It extracts clean, readable content from any web page — stripping away navigation, ads, sidebars, and other clutter so you're left with just the article.

Available as both a Go library and a drop-in CLI tool compatible with the original Defuddle CLI.

Installation

CLI

Download a pre-built binary from the releases page, or install with Go:

go install github.com/dotcommander/defuddle/cmd/defuddle@latest

Library

Add Defuddle Go to your module with go get:

go get github.com/dotcommander/defuddle

Requires Go 1.26 or higher.

Quick Start

Extract the main content from any web page in just a few lines:

d, err := defuddle.NewDefuddle(htmlString, nil)
if err != nil {
    log.Fatal(err)
}

result, err := d.Parse(context.Background())
if err != nil {
    log.Fatal(err)
}

fmt.Println(result.Title)
fmt.Println(result.Content)

Or fetch and parse a URL directly:

result, err := defuddle.ParseFromURL(ctx, "https://example.com/article", nil)

Extracting Content

From HTML

Pass raw HTML and receive structured content with metadata:

d, err := defuddle.NewDefuddle(html, &defuddle.Options{
    URL: "https://example.com/article",
})
if err != nil {
    log.Fatal(err)
}

result, err := d.Parse(context.Background())
if err != nil {
    log.Fatal(err)
}

fmt.Printf("Title:       %s\n", result.Title)
fmt.Printf("Author:      %s\n", result.Author)
fmt.Printf("Published:   %s\n", result.Published)
fmt.Printf("Description: %s\n", result.Description)
fmt.Printf("Word Count:  %d\n", result.WordCount)
fmt.Printf("Language:    %s\n", result.Language)

From a URL

ParseFromURL handles HTTP fetching, encoding detection, and parsing in one call:

result, err := defuddle.ParseFromURL(ctx, "https://example.com/article", &defuddle.Options{
    Markdown: true,
})

Markdown Output

Convert extracted content to Markdown for storage, indexing, or LLM consumption:

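d, err := defuddle.NewDefuddle(html, &defuddle.Options{Markdown: true})
if err != nil {
    log.Fatal(err)
}
result, err := d.Parse(ctx)

// When Markdown is enabled, Content is returned as Markdown
fmt.Println(result.Content)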

To receive both HTML and Markdown in the same response:

d, err := defuddle.NewDefuddle(html, &defuddle.Options{
    SeparateMarkdown: true,
})

result, err := d.Parse(ctx)

fmt.Println(result.Content)          // HTML
fmt.Println(*result.ContentMarkdown) // Markdown
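
ContentMarkdown is a *string, so dereferencing it directly panics if it is nil. Assuming the field is left nil when SeparateMarkdown is not set (the Result table only says it is populated with SeparateMarkdown), a small guard avoids that. The stringOr helper below is hypothetical and not part of the library.

```go
package main

import "fmt"

// stringOr is a hypothetical helper: it returns *p, or fallback when p is nil.
func stringOr(p *string, fallback string) string {
	if p == nil {
		return fallback
	}
	return *p
}

func main() {
	md := "# Title"
	fmt.Println(stringOr(&md, "(no markdown)"))
	fmt.Println(stringOr(nil, "(no markdown)"))
}
```

With this in place, the last line of the snippet above could read fmt.Println(stringOr(result.ContentMarkdown, "")).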

Site-Specific Extractors

Defuddle automatically detects popular platforms and applies specialized extraction logic. No configuration needed — if the URL matches, the right extractor activates.

Platform	Content Type
ChatGPT	Conversations with role-separated messages
Claude	Conversations with human/assistant turns
Gemini	Google AI conversations
Grok	xAI conversations
GitHub	Issues and pull requests with comments
Hacker News	Posts and threaded comment discussions
Reddit	Posts with comment trees
Substack	Newsletter articles
Twitter / X	Tweets and threads
X Articles	Long-form articles (Draft.js)
YouTube	Video metadata and descriptions

Custom Extractors

Implement the BaseExtractor interface to add support for any site:

type MyExtractor struct {
    *extractors.ExtractorBase
}

func NewMyExtractor(doc *goquery.Document, url string, schema any) extractors.BaseExtractor {
    return &MyExtractor{ExtractorBase: extractors.NewExtractorBase(doc, url, schema)}
}

func (e *MyExtractor) Name() string     { return "MyExtractor" }
func (e *MyExtractor) CanExtract() bool { return true }

func (e *MyExtractor) Extract() *extractors.ExtractorResult {
    doc := e.GetDocument()
    content, _ := doc.Find(".article-body").Html()
    return &extractors.ExtractorResult{
        ContentHTML: content,
        Variables:   map[string]string{"site": "My Site"},
    }
}

Register it before parsing:

extractors.Register(extractors.ExtractorMapping{
    Patterns:  []any{"mysite.com"},
    Extractor: NewMyExtractor,
})

Configuration

Options

All options have sensible defaults. Pass nil for zero-config extraction.

opts := &defuddle.Options{
    // Output
    Markdown:         false, // Return content as Markdown
    SeparateMarkdown: false, // Return both HTML and Markdown

    // Content selection
    ContentSelector:  "",    // CSS selector override for main content
    URL:              "",    // Source URL (used for link resolution and domain detection)

    // Removal controls — pointer bools default to true when nil.
    // Use defuddle.PtrBool(false) to explicitly disable.
    RemoveExactSelectors:   nil, // Remove known clutter (ads, nav, social buttons)
    RemovePartialSelectors: nil, // Remove probable clutter (class/id pattern matching)
    RemoveHiddenElements:   nil, // Remove display:none and hidden elements
    RemoveContentPatterns:  nil, // Remove boilerplate (breadcrumbs, related posts, etc.)
    RemoveLowScoring:       nil, // Remove low-scoring non-content blocks
    RemoveImages:           false, // Strip all images from output

    // Element processing
    ProcessCode:      false, // Normalize code blocks with language detection
    ProcessImages:    false, // Optimize images (lazy-load resolution, srcset)
    ProcessHeadings:  false, // Clean heading hierarchy
    ProcessMath:      false, // Normalize MathJax/KaTeX formulas
    ProcessFootnotes: false, // Standardize footnote format
    ProcessRoles:     false, // Convert ARIA roles to semantic HTML

    // HTTP (for ParseFromURL)
    Client:         nil,   // Custom *requests.Client
    MaxConcurrency: 5,     // Parallel limit for ParseFromURLs
    Debug:          false, // Emit debug processing info
}
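
The removal controls are pointer bools so that "not set" can be told apart from "explicitly false". The sketch below illustrates that tri-state pattern in isolation. The options struct, ptrBool, and enabled are hypothetical stand-ins written for this example; only the behavior (nil means true) comes from the comment in the block above.

```go
package main

import "fmt"

// ptrBool returns a pointer to b; it plays the role of defuddle.PtrBool.
func ptrBool(b bool) *bool { return &b }

// enabled resolves a tri-state option: nil means "use the default" (true).
func enabled(p *bool) bool {
	return p == nil || *p
}

// options is a hypothetical two-field stand-in for defuddle.Options.
type options struct {
	RemoveHiddenElements *bool
	RemoveLowScoring     *bool
}

func main() {
	opts := options{RemoveLowScoring: ptrBool(false)}
	fmt.Println(enabled(opts.RemoveHiddenElements)) // nil, so the default applies
	fmt.Println(enabled(opts.RemoveLowScoring))     // explicitly disabled
}
```

A plain bool could not express this, because its zero value (false) would be indistinguishable from an option the caller never touched.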

Content Selector

Override automatic content detection with a CSS selector:

d, err := defuddle.NewDefuddle(html, &defuddle.Options{
    ContentSelector: "article.post-body",
})

The Extraction Pipeline

Defuddle processes content through a multi-stage pipeline:

HTML Input
 |
 v
1. Schema.org         -- Extract JSON-LD structured data
2. Site Detection      -- Match URL to specialized extractor
3. Shadow DOM          -- Flatten shadow roots and resolve React SSR
4. Selector Removal    -- Strip known clutter by CSS selector
5. Content Scoring     -- Score nodes and identify main content
6. Content Patterns    -- Remove boilerplate (breadcrumbs, related posts, newsletters)
7. Standardization     -- Normalize headings, footnotes, code blocks, images, math
8. Markdown            -- Convert to Markdown (if requested)
 |
 v
Result

The pipeline includes an automatic retry cascade: if initial extraction yields fewer than 50 words, Defuddle progressively relaxes removal filters to recover content from heavily decorated pages.

The Result Object

Field	Type	Description
`Title`	`string`	Article title
`Author`	`string`	Article author
`Description`	`string`	Article description or summary
`Domain`	`string`	Website domain
`Favicon`	`string`	Website favicon URL
`Image`	`string`	Main article image URL
`Published`	`string`	Publication date
`Language`	`string`	Content language (BCP 47)
`Site`	`string`	Website name
`Content`	`string`	Cleaned HTML (or Markdown if enabled)
`ContentMarkdown`	`*string`	Markdown version (with `SeparateMarkdown`)
`WordCount`	`int`	Word count of extracted content
`ParseTime`	`int64`	Parse duration in milliseconds
`SchemaOrgData`	`any`	Schema.org structured data
`Variables`	`map[string]string`	Extractor-specific variables
`MetaTags`	`[]MetaTag`	Document meta tags
`ExtractorType`	`*string`	Which extractor was used
`DebugInfo`	`*debug.Info`	Debug processing steps (with `Debug`)

CLI Usage

The defuddle command provides a fast interface for content extraction, fully compatible with the original TypeScript CLI.

Extracting Content

# From a URL
defuddle parse https://example.com/article

# From a local file
defuddle parse article.html

# As Markdown
defuddle parse https://example.com/article --markdown

# As JSON with all metadata
defuddle parse https://example.com/article --json

# Extract a single field
defuddle parse https://example.com/article --property title

Saving Output

defuddle parse https://example.com/article --markdown --output article.md

Authentication and Proxies

# Custom headers
defuddle parse https://example.com --header "Authorization: Bearer token123"

# Through a proxy
defuddle parse https://example.com --proxy http://localhost:8080

# Custom timeout
defuddle parse https://slow-site.com --timeout 120s

All CLI Options

Option	Short	Description
`--output`	`-o`	Output file path (default: stdout)
`--markdown`	`-m`	Convert content to Markdown
`--json`	`-j`	Output as JSON with metadata
`--property`	`-p`	Extract a specific property
`--header`	`-H`	Custom header (repeatable)
`--proxy`		Proxy URL
`--user-agent`		Custom user agent
`--timeout`		Request timeout (default: 30s)
`--debug`		Enable debug output

Examples

The examples/ directory contains ready-to-run programs:

go run ./examples/basic              # Simple extraction
go run ./examples/markdown           # HTML to Markdown
go run ./examples/advanced           # Full option usage
go run ./examples/extractors         # Site-specific extraction
go run ./examples/custom_extractor   # Building a custom extractor

Testing

# Run all tests
go test ./...

# With race detection
go test -race ./...

# Benchmarks
go test -bench=. -benchmem ./...

Credits

Defuddle by Steph Ango (@kepano) — the original TypeScript library
Defuddle CLI by Steph Ango — the original CLI tool
Inspired by Mozilla's Readability algorithm

License

Defuddle Go is open-sourced software licensed under the MIT license.
