webspider

An open WEB spider platform, inspired by LinkTiger and PageFreezer.

Related projects are listed here.

Description

Open WEB spider aims to solve the common task of downloading the entire content of a website while allowing on-the-fly post-processing of that content. Planned features:

extract text from HTML and PDF documents
process only documents whose names or content types match given patterns
extract data using XPath expressions from XHTML pages and from HTML pages that are not well-formed
maintain a website graph (links between ancestor and successor pages)
process websites behind authentication (HTTP Basic/Digest, form-based authentication)
handle failures and restart processing from the point where the application was aborted
provide an extension API for document type handlers and protocol handlers
process website pages concurrently
minimize traffic by using bzip/gzip encoding when possible and by avoiding downloading the same link more than once
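
Two of the planned features above, the website graph and never downloading the same link twice, fit naturally into one data structure. The following is only a minimal sketch of that idea, not code from this project: the class `CrawlFrontier` and its methods `offer`, `poll`, and `successors` are hypothetical names chosen for illustration. It uses only `java.net.URI` and the standard collections.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: schedules each link at most once and records the link graph. */
public class CrawlFrontier {
    private final Set<URI> seen = new HashSet<>();             // every link ever scheduled
    private final Deque<URI> queue = new ArrayDeque<>();       // links waiting to be fetched
    private final Map<URI, Set<URI>> graph = new HashMap<>();  // page -> pages it links to

    /** Resolves "." and ".." path segments and drops the fragment, so equivalent links compare equal. */
    static URI canonical(URI link) {
        String s = link.normalize().toString();
        int hash = s.indexOf('#');
        return URI.create(hash < 0 ? s : s.substring(0, hash));
    }

    /**
     * Records the edge from -> to (when from is not null) and schedules "to"
     * unless it was already scheduled. Returns true if "to" is new.
     */
    public boolean offer(URI from, URI to) {
        URI target = canonical(to);
        if (from != null) {
            graph.computeIfAbsent(canonical(from), k -> new HashSet<>()).add(target);
        }
        if (!seen.add(target)) {
            return false; // already scheduled: do not download it again
        }
        queue.add(target);
        return true;
    }

    /** Returns the next link to fetch, or null when nothing is left. */
    public URI poll() {
        return queue.poll();
    }

    /** Returns the pages that the given page links to (empty if none were recorded). */
    public Set<URI> successors(URI page) {
        return graph.getOrDefault(canonical(page), Collections.emptySet());
    }
}
```

A crawler built this way would call `offer(page, link)` for every link found on a downloaded page and `poll()` to pick the next download. A concurrent version would need thread-safe collections or external locking, which this sketch leaves out.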

Supported protocols:

HTTP(S)
FTP
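
The planned protocol-handler extension API, together with the protocol list above, suggests dispatching on the URI scheme. The sketch below shows one way that could look. It is an assumption made for illustration, not the project's actual API: `ProtocolRegistry`, `ProtocolHandler`, `register`, and `handlerFor` are hypothetical names.

```java
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: maps URI schemes such as "http" or "ftp" to pluggable handlers. */
public class ProtocolRegistry {

    /** Hypothetical extension point: fetches the raw bytes behind a URI. */
    public interface ProtocolHandler {
        byte[] fetch(URI uri) throws IOException;
    }

    private final Map<String, ProtocolHandler> handlers = new HashMap<>();

    /** Registers a handler for a scheme; schemes are case-insensitive, so store them lowercased. */
    public void register(String scheme, ProtocolHandler handler) {
        handlers.put(scheme.toLowerCase(Locale.ROOT), handler);
    }

    /** Looks up the handler for a URI; empty for relative URIs and unregistered schemes. */
    public Optional<ProtocolHandler> handlerFor(URI uri) {
        String scheme = uri.getScheme();
        if (scheme == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(handlers.get(scheme.toLowerCase(Locale.ROOT)));
    }
}
```

With this shape, supporting a new protocol would mean registering one more handler rather than changing the crawler itself.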

Shoshkin/webspider
