mrsladoje/filesearchman

🔍 pdfsearchman

A search engine for PDF documents with TF-IDF ranking, wildcard autocomplete, and boolean queries.

✨ Features

  • Smart Search - TF-IDF ranking with contextual scoring
  • Boolean Operators - AND, OR, NOT support with nested queries
  • Phrase Search - Exact phrase matching with "quoted terms"
  • Autocomplete - Wildcard * completion with popularity ranking
  • Page Linking - Cross-reference detection ("see page X")
  • PDF Export - Results automatically saved to rezultati.pdf
  • Did You Mean? - Suggestions for typos and low-result queries
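
To make the Smart Search bullet concrete, here is a minimal sketch of TF-IDF ranking over page texts. It is not the code from this repository: the `rank_pages` function, the whitespace tokenizer, and the smoothed IDF formula are all assumptions chosen for illustration, and the contextual scoring mentioned above is left out.

```python
import math
from collections import Counter


def rank_pages(pages, query):
    """Rank pages for a query with plain TF-IDF.

    pages: list of page texts (hypothetical input, one string per PDF page).
    Returns a list of (page_index, score), best match first.
    """
    tokenized = [page.lower().split() for page in pages]
    n_pages = len(tokenized)

    # Document frequency: on how many pages does each word appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    results = []
    for index, tokens in enumerate(tokenized):
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if counts[term] == 0:
                continue
            tf = counts[term] / len(tokens)
            # Smoothed IDF (an assumption) so a word found on every page
            # still contributes a small positive weight.
            idf = math.log((1 + n_pages) / (1 + df[term])) + 1
            score += tf * idf
        if score > 0:
            results.append((index, score))

    results.sort(key=lambda item: item[1], reverse=True)
    return results


pages = [
    "the trie stores words",
    "tf idf ranks pages by tf idf weight",
    "pages link to other pages",
]
for page_index, score in rank_pages(pages, "pages"):
    print(page_index, round(score, 3))
```

Pages that contain none of the query terms are dropped, and a page where the term makes up a larger share of the text ranks higher.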

🚀 Quick Start

# Index your PDF
python searching_util.py  # Creates data.pkl

# Start searching
python search.py

🎯 Search Syntax

word                    # Simple search
"exact phrase"          # Phrase search
word1 AND word2         # Both terms
word1 OR word2          # Either term
word1 NOT word2         # Exclude term
auto*                   # Autocomplete
(word1 OR word2) AND word3  # Complex boolean (infix notation)
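
One common way to evaluate infix boolean queries like the last example is to convert them to postfix with the shunting-yard algorithm and then combine sets of matching pages. The sketch below assumes that approach; it is not taken from this repository. The `index` mapping, the function names, and the precedence table are hypothetical, and `NOT` is treated as a binary "exclude" operator to match `word1 NOT word2` above. Quoted phrases and wildcards are not handled.

```python
import re

# Assumed precedence: AND and NOT bind tighter than OR, all left-associative.
PRECEDENCE = {"OR": 1, "AND": 2, "NOT": 2}


def to_postfix(query):
    """Convert an infix boolean query to postfix (shunting-yard)."""
    output, stack = [], []
    for token in re.findall(r"\(|\)|[^\s()]+", query):
        if token in PRECEDENCE:
            while (stack and stack[-1] != "("
                   and PRECEDENCE[stack[-1]] >= PRECEDENCE[token]):
                output.append(stack.pop())
            stack.append(token)
        elif token == "(":
            stack.append(token)
        elif token == ")":
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            if not stack:
                raise ValueError("unbalanced parentheses")
            stack.pop()  # discard the matching "("
        else:
            output.append(token.lower())  # search term
    while stack:
        operator = stack.pop()
        if operator == "(":
            raise ValueError("unbalanced parentheses")
        output.append(operator)
    return output


def evaluate(postfix, index):
    """Evaluate a postfix query against index: word -> set of page numbers."""
    stack = []
    for token in postfix:
        if token in PRECEDENCE:
            if len(stack) < 2:
                raise ValueError("operator is missing an operand")
            right, left = stack.pop(), stack.pop()
            if token == "AND":
                stack.append(left & right)
            elif token == "OR":
                stack.append(left | right)
            else:  # NOT: pages matching left but not right
                stack.append(left - right)
        else:
            stack.append(index.get(token, set()))
    if len(stack) != 1:
        raise ValueError("malformed query")
    return stack[0]


index = {"trie": {1, 2}, "rank": {2, 3}, "pdf": {2, 4}}
print(evaluate(to_postfix("(trie OR rank) AND pdf"), index))
print(evaluate(to_postfix("trie NOT rank"), index))
```

Operators are recognized only in upper case, so a lowercase `or` is searched as an ordinary word.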

🏗️ Architecture

  • Trie - Efficient word storage and prefix matching
  • TF-IDF - Document relevance scoring
  • PageRank-style - Cross-reference boosting
  • Context Window - Surrounding word analysis for better ranking
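
As an illustration of the Trie component, the sketch below stores words with an occurrence count and completes a prefix such as `auto*`, most frequent words first. The class and method names are hypothetical, and using raw word frequency as the "popularity" signal is an assumption; the repository may define popularity differently.

```python
class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}  # character -> TrieNode
        self.count = 0      # how many times a word ending here was inserted


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        """Add one occurrence of word."""
        node = self.root
        for char in word:
            node = node.children.setdefault(char, TrieNode())
        node.count += 1

    def complete(self, prefix, limit=5):
        """Return up to `limit` stored words starting with prefix,
        most frequent first, ties broken alphabetically."""
        node = self.root
        for char in prefix:
            node = node.children.get(char)
            if node is None:
                return []

        # Iterative depth-first walk of the subtree below the prefix.
        matches = []
        stack = [(node, prefix)]
        while stack:
            current, word = stack.pop()
            if current.count:
                matches.append((word, current.count))
            for char, child in current.children.items():
                stack.append((child, word + char))

        matches.sort(key=lambda match: (-match[1], match[0]))
        return [word for word, _ in matches[:limit]]


trie = Trie()
for word in ["auto", "autocomplete", "autocomplete", "author", "index"]:
    trie.insert(word)
print(trie.complete("auto"))
```

Finding the prefix node costs one step per character of the prefix, no matter how many words are stored, which is why a trie suits prefix queries.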

📦 Dependencies

PyPDF2, reportlab, sty

Serbian UI · PDF exports · Cross-reference aware

About

Efficient engine for searching through a PDF file
