This is a powerful web crawler designed to extract and process content from websites. It can:
- Extract text content
- Download images
- Generate markdown files
- Translate content
- Create summaries
这是一个功能强大的网页爬虫,用于从网站提取和处理内容。它可以:
- 提取文本内容
- 下载图片
- 生成markdown文件
- 翻译内容
- 创建摘要
- Content Extraction: Extracts text, images, and metadata from web pages
- Markdown Generation: Creates well-structured markdown files
- Translation: Automatically translates content to Chinese
- Summary Generation: Creates concise summaries of extracted content
- Image Handling: Downloads and organizes images
- Multi-page Support: Can process multiple URLs in sequence
- Error Handling: Robust error handling and logging
- 内容提取: 从网页中提取文本、图片和元数据
- Markdown生成: 创建结构良好的markdown文件
- 翻译功能: 自动将内容翻译成中文
- 摘要生成: 创建提取内容的简明摘要
- 图片处理: 下载并组织图片
- 多页面支持: 可以顺序处理多个URL
- 错误处理: 强大的错误处理和日志记录
- Run the script:
python3 web_crawler.py - Enter the URL you want to crawl
- View results in the generated folder
- 运行脚本:
python3 web_crawler.py - 输入要爬取的URL
- 在生成的文件夹中查看结果
- Built with Python 3
- Uses BeautifulSoup for HTML parsing
- Leverages OpenAI API for translation and summarization
- Handles various content types and structures
- 基于Python 3构建
- 使用BeautifulSoup进行HTML解析
- 利用OpenAI API进行翻译和摘要生成
- 处理各种内容类型和结构
- Python 3.x
- BeautifulSoup4
- requests
- openai
- Python 3.x
- BeautifulSoup4
- requests
- openai
Each crawled website creates a folder containing:
output.md: Original contentoutput_translated.md: Translated contentsummary.md: Generated summaryimages/: Downloaded images
每个爬取的网站会创建一个包含以下内容的文件夹:
output.md: 原始内容output_translated.md: 翻译后的内容summary.md: 生成的摘要images/: 下载的图片
- Automatic English to Chinese translation
- Preserves original formatting
- Handles technical terms accurately
- 自动英译中
- 保留原始格式
- 准确处理技术术语
- Downloads all images from the page
- Organizes them in an images folder
- 下载页面中的所有图片
- 将它们组织在images文件夹中
- 在markdown中使用正确的语法保留图片引用:

- Creates concise summaries of main content
- Highlights key points
- Preserves important technical details
- 创建主要内容的简明摘要
- 突出关键点
- 保留重要的技术细节