Defuddle Go is a port of the Defuddle TypeScript library. It extracts clean, readable content from any web page — stripping away navigation, ads, sidebars, and other clutter so you're left with just the article.
Available as both a Go library and a drop-in CLI tool compatible with the original Defuddle CLI.
Download a pre-built binary from the releases page, or install with Go:
```shell
go install github.com/dotcommander/defuddle/cmd/defuddle@latest
```

To use Defuddle Go as a library, add it to your module with `go get`:

```shell
go get github.com/dotcommander/defuddle
```

Requires Go 1.26 or higher.
Extract the main content from any web page in just a few lines:
```go
d, err := defuddle.NewDefuddle(htmlString, nil)
if err != nil {
    log.Fatal(err)
}

result, err := d.Parse(context.Background())
if err != nil {
    log.Fatal(err)
}

fmt.Println(result.Title)
fmt.Println(result.Content)
```

Or fetch and parse a URL directly:

```go
result, err := defuddle.ParseFromURL(ctx, "https://example.com/article", nil)
```

Pass raw HTML and receive structured content with metadata:
```go
d, err := defuddle.NewDefuddle(html, &defuddle.Options{
    URL: "https://example.com/article",
})
if err != nil {
    log.Fatal(err)
}

result, err := d.Parse(context.Background())
if err != nil {
    log.Fatal(err)
}

fmt.Printf("Title: %s\n", result.Title)
fmt.Printf("Author: %s\n", result.Author)
fmt.Printf("Published: %s\n", result.Published)
fmt.Printf("Description: %s\n", result.Description)
fmt.Printf("Word Count: %d\n", result.WordCount)
fmt.Printf("Language: %s\n", result.Language)
```

`ParseFromURL` handles HTTP fetching, encoding detection, and parsing in one call:
```go
result, err := defuddle.ParseFromURL(ctx, "https://example.com/article", &defuddle.Options{
    Markdown: true,
})
```

Convert extracted content to Markdown for storage, indexing, or LLM consumption:
```go
result, err := d.Parse(ctx)
// When Markdown is enabled, Content is returned as Markdown.
fmt.Println(result.Content)
```

To receive both HTML and Markdown in the same response:
```go
d, err := defuddle.NewDefuddle(html, &defuddle.Options{
    SeparateMarkdown: true,
})
result, err := d.Parse(ctx)

fmt.Println(result.Content)          // HTML
fmt.Println(*result.ContentMarkdown) // Markdown
```

Defuddle automatically detects popular platforms and applies specialized extraction logic. No configuration needed — if the URL matches, the right extractor activates.
| Platform | Content Type |
|---|---|
| ChatGPT | Conversations with role-separated messages |
| Claude | Conversations with human/assistant turns |
| Gemini | Google AI conversations |
| Grok | xAI conversations |
| GitHub | Issues and pull requests with comments |
| Hacker News | Posts and threaded comment discussions |
| Reddit | Posts with comment trees |
| Substack | Newsletter articles |
| Twitter / X | Tweets and threads |
| X Articles | Long-form articles (Draft.js) |
| YouTube | Video metadata and descriptions |
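Detection is keyed off the URL's host. Conceptually it works like the sketch below, a simplified stdlib-only illustration rather than the library's actual matcher (the pattern map and extractor names here are made up for the example):

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// matchExtractor sketches host-based dispatch: each pattern is a
// domain suffix, and the first matching pattern selects an extractor.
// Illustrative only; the real registry lives in the extractors package.
func matchExtractor(raw string, patterns map[string]string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return ""
	}
	for domain, name := range patterns {
		if strings.HasSuffix(u.Hostname(), domain) {
			return name
		}
	}
	return "" // no specialized extractor; fall back to generic scoring
}

func main() {
	patterns := map[string]string{
		"news.ycombinator.com": "HackerNews",
		"substack.com":         "Substack",
	}
	fmt.Println(matchExtractor("https://news.ycombinator.com/item?id=1", patterns))
}
```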
Implement the BaseExtractor interface to add support for any site:
```go
type MyExtractor struct {
    *extractors.ExtractorBase
}

func NewMyExtractor(doc *goquery.Document, url string, schema any) extractors.BaseExtractor {
    return &MyExtractor{ExtractorBase: extractors.NewExtractorBase(doc, url, schema)}
}

func (e *MyExtractor) Name() string     { return "MyExtractor" }
func (e *MyExtractor) CanExtract() bool { return true }

func (e *MyExtractor) Extract() *extractors.ExtractorResult {
    doc := e.GetDocument()
    content, _ := doc.Find(".article-body").Html()
    return &extractors.ExtractorResult{
        ContentHTML: content,
        Variables:   map[string]string{"site": "My Site"},
    }
}
```

Register it before parsing:

```go
extractors.Register(extractors.ExtractorMapping{
    Patterns:  []any{"mysite.com"},
    Extractor: NewMyExtractor,
})
```

All options have sensible defaults. Pass nil for zero-config extraction.
```go
opts := &defuddle.Options{
    // Output
    Markdown:         false, // Return content as Markdown
    SeparateMarkdown: false, // Return both HTML and Markdown

    // Content selection
    ContentSelector: "", // CSS selector override for main content
    URL:             "", // Source URL (used for link resolution and domain detection)

    // Removal controls — pointer bools default to true when nil.
    // Use defuddle.PtrBool(false) to explicitly disable.
    RemoveExactSelectors:   nil, // Remove known clutter (ads, nav, social buttons)
    RemovePartialSelectors: nil, // Remove probable clutter (class/id pattern matching)
    RemoveHiddenElements:   nil, // Remove display:none and hidden elements
    RemoveContentPatterns:  nil, // Remove boilerplate (breadcrumbs, related posts, etc.)
    RemoveLowScoring:       nil, // Remove low-scoring non-content blocks
    RemoveImages:           false, // Strip all images from output

    // Element processing
    ProcessCode:      false, // Normalize code blocks with language detection
    ProcessImages:    false, // Optimize images (lazy-load resolution, srcset)
    ProcessHeadings:  false, // Clean heading hierarchy
    ProcessMath:      false, // Normalize MathJax/KaTeX formulas
    ProcessFootnotes: false, // Standardize footnote format
    ProcessRoles:     false, // Convert ARIA roles to semantic HTML

    // HTTP (for ParseFromURL)
    Client:         nil,   // Custom *requests.Client
    MaxConcurrency: 5,     // Parallel limit for ParseFromURLs
    Debug:          false, // Emit debug processing info
}
```

Override automatic content detection with a CSS selector:
```go
d, err := defuddle.NewDefuddle(html, &defuddle.Options{
    ContentSelector: "article.post-body",
})
```

Defuddle processes content through a multi-stage pipeline:
```
HTML Input
    |
    v
1. Schema.org        -- Extract JSON-LD structured data
2. Site Detection    -- Match URL to specialized extractor
3. Shadow DOM        -- Flatten shadow roots and resolve React SSR
4. Selector Removal  -- Strip known clutter by CSS selector
5. Content Scoring   -- Score nodes and identify main content
6. Content Patterns  -- Remove boilerplate (breadcrumbs, related posts, newsletters)
7. Standardization   -- Normalize headings, footnotes, code blocks, images, math
8. Markdown          -- Convert to Markdown (if requested)
    |
    v
Result
```
The pipeline includes an automatic retry cascade: if initial extraction yields fewer than 50 words, Defuddle progressively relaxes removal filters to recover content from heavily-decorated pages.
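The same removal filters can also be relaxed manually through the pointer-bool options (`RemoveExactSelectors`, `RemoveHiddenElements`, and so on). Their nil-versus-explicit-false convention can be sketched without the library; `ptrBool` and `effective` below are illustrative stand-ins, with `defuddle.PtrBool` being the real helper:

```go
package main

import "fmt"

// ptrBool mirrors the defuddle.PtrBool helper: it returns a pointer
// so a caller can distinguish "unset" (nil) from an explicit false.
func ptrBool(v bool) *bool { return &v }

// effective resolves a pointer-bool option against its default (true).
func effective(opt *bool) bool {
	if opt == nil {
		return true // nil means "not set": the default applies
	}
	return *opt
}

func main() {
	fmt.Println(effective(nil))            // true  (default applies)
	fmt.Println(effective(ptrBool(false))) // false (explicitly disabled)
}
```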
| Field | Type | Description |
|---|---|---|
| `Title` | `string` | Article title |
| `Author` | `string` | Article author |
| `Description` | `string` | Article description or summary |
| `Domain` | `string` | Website domain |
| `Favicon` | `string` | Website favicon URL |
| `Image` | `string` | Main article image URL |
| `Published` | `string` | Publication date |
| `Language` | `string` | Content language (BCP 47) |
| `Site` | `string` | Website name |
| `Content` | `string` | Cleaned HTML (or Markdown if enabled) |
| `ContentMarkdown` | `*string` | Markdown version (with `SeparateMarkdown`) |
| `WordCount` | `int` | Word count of extracted content |
| `ParseTime` | `int64` | Parse duration in milliseconds |
| `SchemaOrgData` | `any` | Schema.org structured data |
| `Variables` | `map[string]string` | Extractor-specific variables |
| `MetaTags` | `[]MetaTag` | Document meta tags |
| `ExtractorType` | `*string` | Which extractor was used |
| `DebugInfo` | `*debug.Info` | Debug processing steps (with `Debug`) |
The defuddle command provides a fast interface for content extraction, fully compatible with the original TypeScript CLI.
```shell
# From a URL
defuddle parse https://example.com/article

# From a local file
defuddle parse article.html

# As Markdown
defuddle parse https://example.com/article --markdown

# As JSON with all metadata
defuddle parse https://example.com/article --json

# Extract a single field
defuddle parse https://example.com/article --property title

# Write Markdown to a file
defuddle parse https://example.com/article --markdown --output article.md

# Custom headers
defuddle parse https://example.com --header "Authorization: Bearer token123"

# Through a proxy
defuddle parse https://example.com --proxy http://localhost:8080

# Custom timeout
defuddle parse https://slow-site.com --timeout 120s
```

| Option | Short | Description |
|---|---|---|
| `--output` | `-o` | Output file path (default: stdout) |
| `--markdown` | `-m` | Convert content to Markdown |
| `--json` | `-j` | Output as JSON with metadata |
| `--property` | `-p` | Extract a specific property |
| `--header` | `-H` | Custom header (repeatable) |
| `--proxy` | | Proxy URL |
| `--user-agent` | | Custom user agent |
| `--timeout` | | Request timeout (default: 30s) |
| `--debug` | | Enable debug output |
The `examples/` directory contains ready-to-run programs:

```shell
go run ./examples/basic             # Simple extraction
go run ./examples/markdown          # HTML to Markdown
go run ./examples/advanced          # Full option usage
go run ./examples/extractors        # Site-specific extraction
go run ./examples/custom_extractor  # Building a custom extractor
```

```shell
# Run all tests
go test ./...

# With race detection
go test -race ./...

# Benchmarks
go test -bench=. -benchmem ./...
```

- Defuddle by Steph Ango (@kepano) — the original TypeScript library
- Defuddle CLI by Steph Ango — the original CLI tool
- Inspired by Mozilla's Readability algorithm
Defuddle Go is open-sourced software licensed under the MIT license.