24 changes: 11 additions & 13 deletions README.md
@@ -1,10 +1,8 @@
# Adobe India Hackathon: Document Intelligence (Round 1B)
# Document Intelligence System

This repository contains the solution for Round 1B of the "Connecting the Dots" Hackathon by Team Sentinels.
This repository contains an intelligent document analysis system that extracts and prioritizes the most relevant sections from a collection of PDFs based on a specific user persona and their job-to-be-done. The solution is designed to run entirely offline, leveraging a hybrid approach that combines structural analysis, semantic understanding, and keyword relevance.

The goal is to build a system that acts as an intelligent document analyst, extracting and prioritizing the most relevant sections from a collection of PDFs based on a specific user persona and their job-to-be-done. The solution is designed to run entirely offline, leveraging a hybrid approach that combines structural analysis, semantic understanding, and keyword relevance.

## Team Sentinels
## Contributors

- [Saksham Kumar](https://github.com/sakshamkumar04)
- [Aloukik Joshi](https://github.com/aloukikjoshi)
@@ -19,7 +17,7 @@ Our solution employs a multi-stage, hybrid pipeline that combines structural doc

### 1. Parallel Structure Extraction

- Reuses the high-performance outline extractor from Round 1A.
- Uses a high-performance outline extractor for document processing.
- Processes all input PDFs in parallel, with each document handled by a separate CPU core.
- Extracts all titles and headings (H1, H2, H3) for each document, serving as candidate sections for relevance ranking.
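
The fan-out described above can be sketched as follows. This is an illustrative, self-contained sketch rather than the project's code: `extract_outline` is a stand-in for the real worker (`r1a.enhanced_pdf_extractor.process_pdf_enhanced`) and returns dummy data.

```python
# Illustrative sketch of the parallel extraction step -- not the
# project's code. extract_outline stands in for the real worker
# (r1a.enhanced_pdf_extractor.process_pdf_enhanced) and returns
# dummy data so the example is self-contained.
import os
from concurrent.futures import ProcessPoolExecutor, as_completed

def extract_outline(pdf_path: str) -> dict:
    # Placeholder worker; the real one parses the PDF with PyMuPDF.
    return {"file": os.path.basename(pdf_path), "headings": ["H1", "H2", "H3"]}

def extract_all(pdf_paths: list) -> dict:
    outlines = {}
    # One task per PDF; the pool spreads them across CPU cores.
    with ProcessPoolExecutor() as pool:
        futures = {pool.submit(extract_outline, p): p for p in pdf_paths}
        for fut in as_completed(futures):
            outlines[futures[fut]] = fut.result()
    return outlines

if __name__ == "__main__":
    print(extract_all(["PDFs/a.pdf", "PDFs/b.pdf"]))
```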

@@ -50,7 +48,7 @@ Our solution employs a multi-stage, hybrid pipeline that combines structural doc
### 5. Final Output Generation

- Aggregates results, including ranked section titles and refined subsection text.
- Formats output into `challenge1b_output.json` with all required metadata.
- Formats output into `output.json` with all required metadata.
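
A minimal sketch of this output step, assuming the metadata visible in `main.py` below; only `processing_timestamp` is confirmed by this PR, and the other field names are illustrative guesses rather than the project's actual schema.

```python
# Illustrative sketch of the final output step -- not the project's
# actual code. "extracted_sections" and "subsection_analysis" are
# assumed field names; only "processing_timestamp" is confirmed by
# main.py in this PR.
import json
import os
from datetime import datetime, timezone

def write_output(out_dir: str, ranked_sections: list, refined_subsections: list) -> None:
    os.makedirs(out_dir, exist_ok=True)
    payload = {
        "metadata": {
            # The real pipeline fills these from input.json.
            "persona": "Example Persona",
            "job_to_be_done": "Example job",
            "processing_timestamp": datetime.now(timezone.utc).isoformat(),
        },
        "extracted_sections": ranked_sections,
        "subsection_analysis": refined_subsections,
    }
    with open(os.path.join(out_dir, "output.json"), "w", encoding="utf-8") as fh:
        json.dump(payload, fh, indent=2, ensure_ascii=False)
```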

## Models and Libraries Used

@@ -63,7 +61,7 @@ Our solution employs a multi-stage, hybrid pipeline that combines structural doc
- `torch`: Framework for running the sentence-transformer model.
- `sentence-transformers`: For loading and using the embedding model.
- `rank_bm25`: For keyword-based relevance scoring.
- `pymupdf`: For efficient PDF text extraction (reused from Round 1A).
- `pymupdf`: For efficient PDF text extraction.
- `numpy` & `pandas`: For numerical operations and data handling.

All dependencies are listed in [requirements.txt](requirements.txt) and are installed within the Docker container.
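
The hybrid ranking these libraries support (BM25 keyword scores blended with embedding similarity) can be illustrated with a toy version. This is NOT the project's scorer: the real pipeline uses `rank_bm25` and a sentence-transformers model, which are replaced here by simple bag-of-words stand-ins so the sketch runs with the standard library only.

```python
# Toy illustration of hybrid keyword + semantic ranking -- not the
# project's scorer. Both scores are bag-of-words stand-ins for the
# real rank_bm25 and sentence-transformers components.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity over term-count vectors (embedding stand-in).
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_rank(query: str, sections: list, alpha: float = 0.5) -> list:
    # Blend keyword overlap and cosine similarity, highest score first.
    q = Counter(query.lower().split())
    scored = []
    for text in sections:
        d = Counter(text.lower().split())
        keyword = sum(min(q[t], d[t]) for t in q) / max(len(q), 1)  # crude BM25 stand-in
        semantic = cosine(q, d)
        scored.append((alpha * keyword + (1 - alpha) * semantic, text))
    return [text for _, text in sorted(scored, key=lambda s: s[0], reverse=True)]
```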
@@ -87,7 +85,7 @@ Before running the solution, ensure your directories are organized as follows:
```
root/
├── input/
│ ├── challenge1b_input.json
│ ├── input.json
│ └── PDFs/ # All required PDF documents
├── output/ # Results will be written here (create empty)
├── Dockerfile
@@ -100,15 +98,15 @@ root/
After building the image:

1. Create an `input` directory in the root directory of your project. Refer to the expected directory structure above.
1. Place `challenge1b_input.json` and a sub-folder `PDFs` with all required documents in your local `input` directory.
1. Place `input.json` and a sub-folder `PDFs` with all required documents in your local `input` directory.
1. (Optional) Create an empty local directory for results (e.g., `output`).
1. Run the following command from the directory containing your `input` and `output` folders:

```sh
docker run --rm -v $(pwd)/input:/app/input -v $(pwd)/output:/app/output --network none document-intelligence:somerandomidentifier
```

The script inside the container will process the input JSON and PDF collection, and generate `challenge1b_output.json` in `/app/output`.
The script inside the container will process the input JSON and PDF collection, and generate `output.json` in `/app/output`.

---

@@ -173,8 +171,8 @@ This solution was realized with the support of Gemini, Perplexity, and GitHub Ch

## Copyright

© Team Sentinels (Saksham Kumar, Aloukik Joshi, Nihal Pandey).
All rights reserved. Team members possess exclusive rights to this solution, along with Adobe for the purpose of the competition.
© Contributors (Saksham Kumar, Aloukik Joshi, Nihal Pandey).
All rights reserved.
Unauthorized copying, distribution, or use of this code or documentation is strictly prohibited and may result in legal action.

---
18 changes: 9 additions & 9 deletions main.py
@@ -1,15 +1,15 @@
#!/usr/bin/env python3
"""
Round-1B | Persona-Aware PDF Pipeline
Document Intelligence | Persona-Aware PDF Pipeline
------------------------------------
CLI
python main.py <input_dir> <output_dir>

• Reads challenge1b_input.json from <input_dir>.
• Runs the original Round-1A extractor *in parallel* on every PDF
• Reads input.json from <input_dir>.
• Runs the PDF outline extractor *in parallel* on every PDF
(each worker opens the file from bytes once).
• Embeds + ranks sections with Granite-107M embeddings and BM25.
• Refines best sections (thread-pool) and writes challenge1b_output.json
• Refines best sections (thread-pool) and writes output.json
to <output_dir>.
"""

@@ -23,7 +23,7 @@
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed


# Round-1B modules
# Document Intelligence modules
from semantic_analyzer import SemanticAnalyzer
from relevance_scorer import RelevanceScorer
from subsection_extractor import SubsectionExtractor
@@ -34,20 +34,20 @@

# ───────────────────────── Helpers ──────────────────────────
def _extract_outline_blob(pdf_path: str) -> Dict:
    """Worker: run Round-1A on a PDF file path."""
    """Worker: run PDF outline extraction on a PDF file path."""
    # This import is intentionally local to the worker process
    from r1a.enhanced_pdf_extractor import process_pdf_enhanced
    return process_pdf_enhanced(pdf_path)


def _load_input(inp_dir: str) -> Dict:
    with open(os.path.join(inp_dir, "challenge1b_input.json"), encoding="utf-8") as fh:
    with open(os.path.join(inp_dir, "input.json"), encoding="utf-8") as fh:
        return json.load(fh)


def _write_output(out_dir: str, data: Dict) -> None:
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "challenge1b_output.json"), "w", encoding="utf-8") as fh:
    with open(os.path.join(out_dir, "output.json"), "w", encoding="utf-8") as fh:
        json.dump(data, fh, indent=2, ensure_ascii=False)


@@ -63,7 +63,7 @@ def run_pipeline(input_dir: str, output_dir: str) -> None:
"processing_timestamp": datetime.now(timezone.utc).isoformat()
}

# 1. Parallel outline extraction (Round-1A)
# 1. Parallel outline extraction
pdf_dir = os.path.join(input_dir, "PDFs")
outlines: Dict[str, Dict] = {}
pdf_paths = [os.path.join(pdf_dir, d["filename"]) for d in req["documents"]]