This project implements a complete search engine system with three integrated components: a web crawler, an indexer, and a querier. The system was constructed through a design-to-implementation pipeline, beginning with specific requirements and progressing through detailed design specifications to final implementation. By utilizing efficient data structures and algorithms, the Tiny Search Engine can crawl web pages, build inverted indexes, and process complex queries with AND/OR operators to deliver ranked search results.
- Breadth-First Traversal: Systematically explores web pages starting from a seed URL up to a specified maximum depth.
- Hashtable-Based Deduplication: Uses a hashtable to track visited URLs, preventing duplicate crawling and infinite loops.
- Bag Data Structure: Implements a bag to manage the queue of pages to be crawled, ensuring efficient page processing.
- Respectful Crawling: Includes a 1-second delay between fetches to avoid overwhelming servers.
- Inverted Index Construction: Builds an efficient inverted index mapping each word to its occurrences across all documents.
- Hashtable Storage: Utilizes hashtables with word keys mapping to counters that track word frequency per document.
- Normalized Word Processing: Converts all words to lowercase and filters out words shorter than 3 characters.
- Persistent Storage: Saves the complete index to disk for use by the querier component.
- Boolean Query Processing: Supports complex queries with AND/OR operators, where AND takes precedence over OR.
- Ranking Algorithm: Scores documents based on word frequency and returns results in descending order of relevance.
- Interactive Interface: Provides a command-line interface for real-time query processing.
- Query Validation: Ensures proper syntax and handles edge cases in user input.
- Modular Architecture: Each component is independently testable and maintainable.
- Separation of Concerns: Clear interfaces between crawler, indexer, and querier components.
- Robust Error Handling: Comprehensive validation and graceful failure modes.
- Memory Management: Careful allocation and deallocation using consistent patterns.
- Starts from a seed URL and explores linked pages within the CS50 server domain.
- Saves each crawled page to a designated directory with unique document IDs.
- Maintains crawling boundaries with configurable maximum depth (≤ 10 for project scope).
- Reads crawled pages from the crawler's output directory.
- Processes HTML content to extract and normalize words.
- Constructs an inverted index and writes it to a structured file format.
- Loads the index file created by the indexer.
- Processes user queries with support for AND/OR boolean operations.
- Returns ranked results with document URLs and relevance scores.
- Shared utilities for page directory management, word processing, and index operations.
- Leverages the CS50 library for fundamental data structures (hashtables, counters, bags).
-
Crawling Phase:
- Initialize with a seed URL and maximum depth parameter.
- Use breadth-first search to discover and fetch web pages.
- Save each page with a unique document ID for indexer processing.
-
Indexing Phase:
- Read all crawled pages from the specified directory.
- Extract and normalize words from HTML content.
- Build an inverted index mapping words to document frequencies.
-
Querying Phase:
- Load the pre-built index into memory.
- Parse user queries and validate syntax.
- Process boolean operations and rank results by relevance.
- GCC compiler with C99 support
- CS50 library (libcs50) included in the project
- Unix-like environment for Makefile compilation
-
Build the system:
make
-
Run the crawler:
./crawler/crawler [seedURL] [pageDirectory] [maxDepth]
Example:
./crawler/crawler http://cs50tse.cs.dartmouth.edu/tse/letters/ ./data/letters 2
-
Run the indexer:
./indexer/indexer [pageDirectory] [indexFilename]
Example:
./indexer/indexer ./data/letters ./data/letters.index
-
Run the querier:
./querier/querier [pageDirectory] [indexFilename]
Example:
./querier/querier ./data/letters ./data/letters.index
Then enter queries interactively:
Query? computer science and dartmouth
- Each component includes comprehensive testing scripts
- Run
make testin any component directory for automated testing - Valgrind integration for memory leak detection
This project was developed as part of the Dartmouth CS50 curriculum, Winter 2025.
This project is distributed under the MIT License. Refer to the LICENSE file for detailed terms and conditions.