
Shoshkin/webspider

 
 


webspider

An open web spider platform, inspired by LinkTiger and PageFreezer.

Related projects are listed here.

Description

Open web spider aims to solve the common task of downloading the entire content of a website and to allow on-the-fly post-processing of that content. Planned features:

  • extract text from HTML/PDF documents
  • process only documents whose names or content types match given patterns
  • extract data using XPath expressions from XHTML pages and from HTML pages that are not well-formed
  • maintain a website graph (links between ancestor and successor pages)
  • process websites behind authentication (HTTP Basic/Digest, form-based authentication)
  • handle failures and restart processing from the point where the application was aborted
  • provide an extension API for document type handlers and protocol handlers
  • process website pages concurrently
  • minimize traffic by using bzip/gzip encoding when possible and by avoiding downloading the same link more than once
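
As an illustration of three of the planned features (pattern matching, the website graph, and never downloading the same link twice), here is a minimal Python sketch. It is not this project's code or API. The names `crawl` and `fetch_links` and the `example.com` site are hypothetical, and the network is replaced by an in-memory dictionary so the example runs offline.

```python
from collections import deque
from fnmatch import fnmatchcase
from urllib.parse import urldefrag, urljoin


def crawl(start_url, fetch_links, patterns=("*",)):
    """Breadth-first crawl that fetches each URL at most once.

    fetch_links is a hypothetical callback: it takes a URL and returns
    the raw href values found on that page.
    Returns a graph mapping each visited page to its successor pages.
    """
    seen = {start_url}           # URLs already queued, so none is fetched twice
    graph = {}                   # page URL -> list of successor URLs
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        successors = []
        for href in fetch_links(url):
            # Resolve relative links and drop the "#fragment" part.
            link, _fragment = urldefrag(urljoin(url, href))
            # Keep only links matching at least one glob pattern.
            if not any(fnmatchcase(link, p) for p in patterns):
                continue
            if link not in successors:
                successors.append(link)
            if link not in seen:
                seen.add(link)
                queue.append(link)
        graph[url] = successors
    return graph


# Stand-in for real HTTP requests (assumption, for illustration only).
site = {
    "http://example.com/": ["a.html", "b.pdf", "http://other.org/x"],
    "http://example.com/a.html": ["/", "b.pdf#page=2"],
    "http://example.com/b.pdf": [],
}
fetched = []


def fetch_links(url):
    fetched.append(url)          # record every "download"
    return site.get(url, [])


graph = crawl("http://example.com/", fetch_links,
              patterns=("http://example.com/*",))
```

In this sketch `b.pdf` is linked from two pages (once with a fragment) but appears in `fetched` only once, and the link to `other.org` is skipped because it does not match the pattern. A real spider would also need to persist `seen` and `queue` to support restarting after a failure.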

Supported protocols:

  • HTTP(S)
  • FTP
