Adding more to the graph. #3
Merged
Handle semicolon-delimited author metadata and persist extra local-source info.
- src/datasets/indexing/neo4j/biblio_import.py: add _split_authors helper and import Tuple; ensure a Scholar node and WROTE relationship are created for each individual author when author fields contain multiple names.
- src/datasets/indexing/neo4j/mark_local_sources.py: add _split_authors that accepts lists or semicolon-delimited strings and use it in _extract_fields; include folder_name and pdf_filenames in _source_stats, local_props, and the Cypher update to store more provenance about local PDF copies.
- src/datasets/indexing/bibliography/complete_pipeline_example.py: minor whitespace/comment formatting changes around run_specific_file calls.

These changes improve author handling for multi-author strings and record additional local source metadata for downstream querying in Neo4j.
Introduce batch_pipeline_runner.py to scan PDFs under src/datasets/raw_data/cairo_genizah/academic_literature and resume the processing pipeline (OCR → structured LLM → enhanced/secondary). The CLI supports dry-run/status, subdirectory limiting, forcing enhanced reruns, skipping un-OCR'd files, and switching Gemini vs Ollama. It discovers per-PDF metadata, prints a status table, runs BookOCRService / StructuredJSONLLM / SecondaryLLMProcessor and re-attaches full text, and emits a final summary with succeeded/skipped/failed outcomes.
Add a backfill script and provenance tagging across Neo4j importers. Introduces src/datasets/indexing/neo4j/backfill_data_sources.py to stamp 'pgp' onto existing Princeton-origin nodes and relationships (with a --dry-run option) so pre-existing null data_sources are corrected. Update biblio_import to append 'biblio' to relevant nodes and relationships when creating/merging. Update enhanced_kg_import to append 'extracted' to created/merged nodes/relationships and add a place-name resolver to canonicalize Place names against Princeton variants. Update knowlege_graph_poc to append 'pgp' provenance to Princeton-created nodes/relationships. All changes use coalesce(...) + [...] patterns to avoid duplicating tags and preserve existing provenance.
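The tag-appending behaviour described above can be sketched as follows. This is a minimal illustration, not the PR's actual query: the Cypher text, the `Place {name: $name}` match, and the helper name `append_tag` are assumptions chosen for the example. The Python function mirrors what the Cypher expression is meant to compute so the logic can be checked without a database.

```python
from typing import List, Optional

# Hypothetical Cypher fragment: append $tag to data_sources only when it is
# not already present. coalesce(...) turns a missing property into [].
APPEND_TAG_CYPHER = """
MATCH (n:Place {name: $name})
SET n.data_sources = CASE
    WHEN $tag IN coalesce(n.data_sources, []) THEN n.data_sources
    ELSE coalesce(n.data_sources, []) + [$tag]
END
"""


def append_tag(existing: Optional[List[str]], tag: str) -> List[str]:
    """Pure-Python mirror of the Cypher CASE expression above."""
    current = list(existing or [])  # same role as coalesce(n.data_sources, [])
    if tag in current:
        return current              # tag already recorded: leave list unchanged
    return current + [tag]          # otherwise append, preserving earlier tags
```

Running the same import twice then leaves `data_sources` unchanged on the second pass, which is the property the commit message relies on.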
Add place metadata extraction and enrichment to KG import, plus a geocoding utility and stronger LLM extraction prompts.
- src/datasets/indexing/neo4j/enhanced_kg_import.py: build a places_by_name lookup from people_locations, pass it into triplet normalization, filter out non-geographic/place-like and trivial BookArticle titles, attach subject/object place metadata (city/country/region) to rows, and write those fields into Place nodes and relationships. Added _enrich_place_node to idempotently stamp city/country/region onto existing Place nodes and execute enrichment after import.
- src/datasets/indexing/neo4j/geocode_places.py: new script to geocode Place nodes with Nominatim (geopy), including historical name overrides, country→region mapping, polite rate limiting, dry-run/limit/all flags, and writing lat/lng/city/country/region/osm_display_name back to Neo4j.
- src/models/llm/secondary_llm_processing.py: update the people/locations extraction prompt to require full person names, richer place fields (city/country/region), and stricter rules (exclude archives/libraries/collections as places, ban vague compound places, and require proper BookArticle titles) to improve data quality.

Overall goal: improve geographic coverage and accuracy in the KG by enriching Place nodes, automating geocoding, and reducing bad/extraneous place/title/person extractions from the LLM.
Add robust handling for institutions and historical places across Neo4j import and geocoding code. Key changes:
- Detect institution names and reroute LLM-extracted Place nodes to an AFFILIATED_WITH relationship instead of LIVED_IN for modern academic/library affiliations (Enhanced KG importer). Added _is_institution_name and a new _write_person_affiliated_with write function.
- Mark Princeton-derived Place nodes as place_type='historical' when imported and set place_type='historical' on various write paths to separate historical places from modern institutions.
- Major geocoder improvements (geocode_places.py): Hebrew-script handling, transliteration fallback, Hebrew prefix stripping, curated historical overrides, Princeton DMS coordinate parsing, region/OSM-type heuristics to reject non-geographic or out-of-region results, importance/suspect flags, Google Maps integration (optional) with fallback to Nominatim, separate institution geocoding logic, and dry-run/CLI enhancements (places/institutions flags).
- In the Neo4j POC importer, set place_type='historical' for PGP imports and related writes.
- Update LLM secondary processing prompt and REL_MAP to include AFFILIATED_WITH and allow TRAVELED_TO to target Place or Institution.
- requirements.txt: add googlemaps dependency.

Overall goal: reduce geocoding false positives for Genizah data, correctly separate modern affiliations from historical residences, and improve geocoding accuracy for Hebrew/Arabic names and institutional names.
Add robust source_book tracking and people/place upserts, and introduce two enrichment tools.
- enhanced_kg_import.py: introduce _src_books() helper to append the current source book to node/relationship .source_books and thread the $book parameter through many _write_* methods; merge people_locations and context_analysis people/locations when building lookups; deduplicate name variants; ensure people and places are upserted even when there are no KG triplets, via the new _write_people_locations(), and return counts; small refactors and safer checks.
- new enrich_fragment_people.py: CLI tool that scans *_enhanced.json files to create Fragment-[:MENTIONS_PERSON]->Person edges using two strategies (name-in-description = definite, single-person page co-occurrence = possible), normalises shelfmarks, and writes to Neo4j (or dry-run).
- new enrich_node_relations.py: larger entity-centric enrichment pass (extract + import) that builds focused page-context across books, calls LLM backends (LM Studio or Gemini), parses JSON triplets, writes per-entity enrichment files, and can import validated triplets into Neo4j. Includes prompt trimming, allowed-relation filtering, and safe MERGE writes that attach source book metadata.

These changes ensure extracted provenance (source_books) is preserved, avoid silently dropping people/places extracted by the LLM, and add separate enrichment passes for fragment–person links and entity-focused relation extraction.
Pull request overview
This PR expands the Cairo Genizah Neo4j knowledge-graph pipeline by adding additional enrichment/import utilities (geocoding, enrichment passes, provenance backfills) and by tightening LLM extraction guidance to reduce non-geographic “locations” and improve relationship extraction.
Changes:
- Refines LLM prompts and Neo4j triplet mapping to better separate geographic places vs institutions and to introduce affiliation relationships.
- Adds/extends multiple Neo4j utilities: geocoding for Place/Institution nodes, enrichment passes, and provenance tagging/backfill via data_sources.
- Improves local-source indexing and Princeton/biblio import behavior (author splitting, additional node properties, provenance tagging).
Reviewed changes
Copilot reviewed 11 out of 12 changed files in this pull request and generated 23 comments.
| File | Description |
|---|---|
| src/models/llm/secondary_llm_processing.py | Tightens people/location and KG-triplet prompts; adds AFFILIATED_WITH mapping in Neo4j push helper. |
| src/datasets/indexing/neo4j/mark_local_sources.py | Adds author splitting and stores additional local file/folder metadata on BookArticle nodes. |
| src/datasets/indexing/neo4j/knowlege_graph_poc.py | Adds data_sources provenance stamping and imports document→person mention edges. |
| src/datasets/indexing/neo4j/geocode_places.py | New script to geocode Place/Institution nodes via Nominatim/Google with historical-name overrides. |
| src/datasets/indexing/neo4j/enrich_node_relations.py | New 2-stage (extract→JSON, import→Neo4j) entity-anchored relationship enrichment tool. |
| src/datasets/indexing/neo4j/enrich_fragment_people.py | New script to mine *_enhanced.json for Fragment→Person mention edges. |
| src/datasets/indexing/neo4j/enhanced_kg_import.py | New importer for *_enhanced.json triplets into Neo4j with provenance tagging and label normalization. |
| src/datasets/indexing/neo4j/biblio_import.py | Splits semicolon-delimited authors into individual Scholar nodes; stamps data_sources on nodes/edges. |
| src/datasets/indexing/neo4j/backfill_data_sources.py | New script to backfill data_sources on existing nodes/relationships. |
| src/datasets/indexing/bibliography/complete_pipeline_example.py | Updates example pipeline calls/formatting. |
| src/datasets/indexing/bibliography/batch_pipeline_runner.py | New resumable batch runner for OCR → structured → enhanced pipeline stages. |
| requirements.txt | Adds googlemaps dependency. |
Comment on lines 692 to 696
| 'LIVED_IN': ('Person', 'Place'), | ||
| 'TRAVELED_TO': ('Person', 'Place'), | ||
| 'AFFILIATED_WITH': ('Person', 'Institution'), | ||
| 'TRAVELED_TO': ('Person', None), # Place or Institution | ||
| 'WROTE': ('Person', None), # object type varies | ||
| 'MENTIONED_IN': (None, 'Fragment'), |
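How a map like this might be used to validate a triplet can be sketched as below. The entries are copied from the added lines of the snippet above; treating `None` as "any label" is an assumption inferred from the inline comments, and the helper name `triplet_allowed` is hypothetical.

```python
from typing import Dict, Optional, Tuple

# Relation -> (expected subject label, expected object label).
# None is assumed to mean "any label is acceptable on this side".
REL_MAP: Dict[str, Tuple[Optional[str], Optional[str]]] = {
    "LIVED_IN": ("Person", "Place"),
    "AFFILIATED_WITH": ("Person", "Institution"),
    "TRAVELED_TO": ("Person", None),   # Place or Institution
    "WROTE": ("Person", None),         # object type varies
    "MENTIONED_IN": (None, "Fragment"),
}


def triplet_allowed(relation: str, subject_label: str, object_label: str) -> bool:
    """Return True when the relation is known and both labels fit the map."""
    expected = REL_MAP.get(relation)
    if expected is None:
        return False  # unknown relation types are rejected outright
    exp_subject, exp_object = expected
    subject_ok = exp_subject is None or exp_subject == subject_label
    object_ok = exp_object is None or exp_object == object_label
    return subject_ok and object_ok
```

Note that a Python dict literal containing the same key twice keeps only the last value, so if both `TRAVELED_TO` entries shown in the diff ended up in the final file, the `('Person', None)` entry would be the one in effect.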
Comment on lines +69 to +83
| def _split_authors(raw: Any) -> List[str]: | ||
| """Return a flat list of individual author names. | ||
| | ||
| Handles both a list of strings and a single semicolon-delimited string, | ||
| e.g. "French, Mary ;Goldie, Rebecca ;Nichols, Emma". | ||
| """ | ||
| if isinstance(raw, str): | ||
| raw = [raw] | ||
| authors: List[str] = [] | ||
| for item in raw: | ||
| for part in str(item).split(';'): | ||
| name = part.strip() | ||
| if name: | ||
| authors.append(name) | ||
| return authors |
Comment on lines +579 to +583
| Also imports Fragment -[:SENT_TO]-> Person and | ||
| Fragment -[:AUTHORED_BY]-> Person from the `destination` / `origin` | ||
| columns where those values resolve to known Person nodes (as opposed to | ||
| Place nodes which are already handled by import_documents). | ||
| """ |
Comment on lines +606 to +610
| def _names(field): | ||
| raw = doc_data.get(field) or '' | ||
| return [n.strip() for n in raw.replace(';', ',').split(',') | ||
| if n.strip()] | ||
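One thing worth checking with this helper is how it treats values in "Surname, Given" form, since it splits on commas as well as semicolons. The sketch below reproduces the snippet's logic as a standalone function (`names`, taking `doc_data` as a parameter) and adds a hypothetical alternative, `names_prefer_semicolon`. Whether the `destination` / `origin` columns ever contain "Surname, Given" values is an assumption, not something the PR states.

```python
from typing import Dict, List


def names(doc_data: Dict[str, str], field: str) -> List[str]:
    """Same logic as the PR snippet: ';' and ',' are both separators."""
    raw = doc_data.get(field) or ''
    return [n.strip() for n in raw.replace(';', ',').split(',') if n.strip()]


def names_prefer_semicolon(doc_data: Dict[str, str], field: str) -> List[str]:
    """Hypothetical variant: split on ';' when present, otherwise on ','."""
    raw = doc_data.get(field) or ''
    separator = ';' if ';' in raw else ','
    return [n.strip() for n in raw.split(separator) if n.strip()]


doc = {"origin": "French, Mary; Goldie, Rebecca"}
split_everywhere = names(doc, "origin")
# split_everywhere == ['French', 'Mary', 'Goldie', 'Rebecca']
split_on_semicolon = names_prefer_semicolon(doc, "origin")
# split_on_semicolon == ['French, Mary', 'Goldie, Rebecca']
```

The variant still cannot tell a single "Surname, Given" value from a two-name comma list, so it only helps when multi-name fields consistently use semicolons.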
Comment on lines 363 to +366
| Extract knowledge-graph triplets connecting these entities. Use ONLY these relation types: | ||
| - LIVED_IN : person permanently resided in a place | ||
| - TRAVELED_TO : person traveled to or visited a place | ||
| - LIVED_IN : historical person permanently resided in a geographic place | ||
| - AFFILIATED_WITH : modern scholar holds a position at a university, library, or institution | ||
| - TRAVELED_TO : person traveled to or visited a place or institution |
Comment on lines +739 to +742
| def _write_fragment_mentions(tx, row: Dict): | ||
| """Fragment -[:MENTIONS / :MENTIONS_PLACE]-> Person/Place""" | ||
| rel = row["relation"] | ||
| tgt_label = row["object_label"] |
Comment on lines +746 to +750
| # Fragment nodes are keyed by canonical_shelfmark; if we only have a | ||
| # display form here, just store it — it may already exist from Princeton. | ||
| tx.run(f""" | ||
| MERGE (f:Fragment {{shelfmark: $shelfmark}}) | ||
| SET f.data_sources = CASE |
Comment on lines +55 to +60
| HISTORICAL_OVERRIDES: dict[str, str] = { | ||
| "Fustat": "Old Cairo, Cairo, Egypt", | ||
| "Fusṭāṭ": "Old Cairo, Cairo, Egypt", | ||
| "al-Fustat": "Old Cairo, Cairo, Egypt", | ||
| "Qayrawān": "Kairouan, Tunisia", | ||
| "Qayrawan": "Kairouan, Tunisia", |
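The table above lists diacritic and plain spellings as separate keys. One possible alternative, sketched here under the assumption that folding diacritics and case is acceptable for these names, is to normalise both the keys and the lookup value. The names `fold` and `resolve_query` are hypothetical, and the dictionary is a shortened copy of the entries visible in the snippet.

```python
import unicodedata
from typing import Dict

# Shortened copy of the override table; only plain spellings are listed.
OVERRIDES: Dict[str, str] = {
    "Fustat": "Old Cairo, Cairo, Egypt",
    "al-Fustat": "Old Cairo, Cairo, Egypt",
    "Qayrawan": "Kairouan, Tunisia",
}


def fold(name: str) -> str:
    """Strip combining marks and case so 'Fusṭāṭ' and 'fustat' compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold().strip()


# Pre-fold the keys once so each lookup is a single dict access.
_FOLDED_OVERRIDES: Dict[str, str] = {fold(k): v for k, v in OVERRIDES.items()}


def resolve_query(place_name: str) -> str:
    """Return the modern geocoder query for a historical name, else the name itself."""
    return _FOLDED_OVERRIDES.get(fold(place_name), place_name)
```

Prefixed forms such as "al-Fustat" still need their own entry, because folding does not remove the article.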
Comment on lines +764 to +773
| # Fix macOS SSL certificate verification errors | ||
| import ssl | ||
| try: | ||
| import certifi | ||
| ssl_ctx = ssl.create_default_context(cafile=certifi.where()) | ||
| except ImportError: | ||
| logger.warning("certifi not found — falling back to unverified SSL (pip install certifi to fix)") | ||
| ssl_ctx = ssl.create_default_context() | ||
| ssl_ctx.check_hostname = False | ||
| ssl_ctx.verify_mode = ssl.CERT_NONE |
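The fallback above turns certificate verification off whenever certifi is missing. A more conservative shape, sketched here as one possible alternative rather than the PR's code, keeps verification on by default and requires an explicit opt-in to disable it. The function name `build_ssl_context` and the `allow_insecure` parameter are hypothetical.

```python
import logging
import ssl

logger = logging.getLogger(__name__)


def build_ssl_context(allow_insecure: bool = False) -> ssl.SSLContext:
    """Build an SSL context that verifies certificates unless explicitly told not to."""
    try:
        import certifi  # optional dependency; provides a CA bundle
        ctx = ssl.create_default_context(cafile=certifi.where())
    except ImportError:
        # Fall back to the system trust store; verification stays enabled.
        logger.warning("certifi not found; using the system CA store")
        ctx = ssl.create_default_context()

    if allow_insecure:
        # check_hostname must be disabled before verify_mode can be CERT_NONE.
        logger.warning("TLS verification disabled by explicit request")
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx
```

geopy geocoders accept an `ssl_context` argument, so the returned context could be passed to the Nominatim client in place of `ssl_ctx`.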
Comment on lines +342 to +347
| parser.add_argument( | ||
| "--skip-unocred", | ||
| action="store_true", | ||
| help="Skip any PDF that hasn't been OCR'd yet (avoids GCP Vision API costs).", | ||
| default=True | ||
| ) |
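With `action="store_true"` and `default=True`, the parsed value is `True` whether or not the flag is passed, so the option cannot be switched off from the command line. A minimal sketch of one way to keep the default while allowing an override uses `argparse.BooleanOptionalAction` (Python 3.9+), which also generates a `--no-skip-unocred` form. The function names below are hypothetical and the parser is reduced to this single argument.

```python
import argparse


def build_original_parser() -> argparse.ArgumentParser:
    """Reproduces the snippet above: the flag can only ever be True."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--skip-unocred",
        action="store_true",
        default=True,
        help="Skip any PDF that hasn't been OCR'd yet.",
    )
    return parser


def build_toggle_parser() -> argparse.ArgumentParser:
    """Alternative: default stays True, --no-skip-unocred turns it off."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--skip-unocred",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="Skip any PDF that hasn't been OCR'd yet.",
    )
    return parser


original = build_original_parser()
toggle = build_toggle_parser()
always_true = original.parse_args([]).skip_unocred             # True
still_true = original.parse_args(["--skip-unocred"]).skip_unocred  # True
turned_off = toggle.parse_args(["--no-skip-unocred"]).skip_unocred  # False
```

An equivalent design on older Python versions would be a separate `--include-unocred` flag with `store_true` and `default=False`.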