Adding more to the graph. by isaacmg · Pull Request #3 · AIStream-Peelout/historical-document-analysis

isaacmg · 2026-05-20T00:31:43Z

No description provided.

Handle semicolon-delimited author metadata and persist extra local-source info.

- src/datasets/indexing/neo4j/biblio_import.py: add _split_authors helper and import Tuple; ensure a Scholar node and WROTE relationship are created for each individual author when author fields contain multiple names.
- src/datasets/indexing/neo4j/mark_local_sources.py: add _split_authors that accepts lists or semicolon-delimited strings and use it in _extract_fields; include folder_name and pdf_filenames in _source_stats, local_props, and the Cypher update to store more provenance about local PDF copies.
- src/datasets/indexing/bibliography/complete_pipeline_example.py: minor whitespace/comment formatting changes around run_specific_file calls.

These changes improve author handling for multi-author strings and record additional local source metadata for downstream querying in Neo4j.

Introduce batch_pipeline_runner.py to scan PDFs under src/datasets/raw_data/cairo_genizah/academic_literature and resume the processing pipeline (OCR → structured LLM → enhanced/secondary). The CLI supports dry-run/status, subdirectory limiting, forcing enhanced reruns, skipping un-OCR'd files, and switching Gemini vs Ollama. It discovers per-PDF metadata, prints a status table, runs BookOCRService / StructuredJSONLLM / SecondaryLLMProcessor and re-attaches full text, and emits a final summary with succeeded/skipped/failed outcomes.

Add a backfill script and provenance tagging across Neo4j importers. Introduces src/datasets/indexing/neo4j/backfill_data_sources.py to stamp 'pgp' onto existing Princeton-origin nodes and relationships (with a --dry-run option) so pre-existing null data_sources are corrected. Update biblio_import to append 'biblio' to relevant nodes and relationships when creating/merging. Update enhanced_kg_import to append 'extracted' to created/merged nodes/relationships and add a place-name resolver to canonicalize Place names against Princeton variants. Update knowlege_graph_poc to append 'pgp' provenance to Princeton-created nodes/relationships. All changes use coalesce(...) + [...] patterns to avoid duplicating tags and preserve existing provenance.
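The append-without-duplicating pattern described above can be sketched as follows. This is a minimal illustration and not the PR's actual query: the Cypher text, the node variable `n`, the `$tag` parameter, and the helper `append_tag` are all assumptions made for the example. The Python function mirrors the Cypher expression so the behaviour can be checked without a database.

```python
from typing import List, Optional

# Hypothetical Cypher fragment (not copied from the PR): treat a null list as
# empty, and append the tag only when it is not already present.
APPEND_TAG_CYPHER = """
SET n.data_sources = CASE
    WHEN $tag IN coalesce(n.data_sources, []) THEN n.data_sources
    ELSE coalesce(n.data_sources, []) + [$tag]
END
"""


def append_tag(existing: Optional[List[str]], tag: str) -> List[str]:
    """Pure-Python model of the Cypher expression above."""
    current = list(existing or [])  # plays the role of coalesce(n.data_sources, [])
    if tag in current:
        return current              # tag already recorded: leave the list unchanged
    return current + [tag]          # otherwise append it, preserving earlier tags


print(append_tag(None, "pgp"))
print(append_tag(["pgp"], "biblio"))
print(append_tag(["pgp", "biblio"], "pgp"))
```

Running the same statement twice leaves the list unchanged, which is what makes a backfill of this shape safe to re-run.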

Add place metadata extraction and enrichment to KG import, plus a geocoding utility and stronger LLM extraction prompts.

- src/datasets/indexing/neo4j/enhanced_kg_import.py: build a places_by_name lookup from people_locations, pass it into triplet normalization, filter out non-geographic/place-like and trivial BookArticle titles, attach subject/object place metadata (city/country/region) to rows, and write those fields into Place nodes and relationships. Added _enrich_place_node to idempotently stamp city/country/region onto existing Place nodes and execute enrichment after import.
- src/datasets/indexing/neo4j/geocode_places.py: new script to geocode Place nodes with Nominatim (geopy), including historical name overrides, country→region mapping, polite rate limiting, dry-run/limit/all flags, and writing lat/lng/city/country/region/osm_display_name back to Neo4j.
- src/models/llm/secondary_llm_processing.py: update the people/locations extraction prompt to require full person names, richer place fields (city/country/region), and stricter rules (exclude archives/libraries/collections as places, ban vague compound places, and require proper BookArticle titles) to improve data quality.

Overall goal: improve geographic coverage and accuracy in the KG by enriching Place nodes, automating geocoding, and reducing bad/extraneous place/title/person extractions from the LLM.

Add robust handling for institutions and historical places across Neo4j import and geocoding code. Key changes:

- Detect institution names and reroute LLM-extracted Place nodes to an AFFILIATED_WITH relationship instead of LIVED_IN for modern academic/library affiliations (Enhanced KG importer). Added _is_institution_name and a new _write_person_affiliated_with write function.
- Mark Princeton-derived Place nodes as place_type='historical' when imported and set place_type='historical' on various write paths to separate historical places from modern institutions.
- Major geocoder improvements (geocode_places.py): Hebrew-script handling, transliteration fallback, Hebrew prefix stripping, curated historical overrides, Princeton DMS coordinate parsing, region/OSM-type heuristics to reject non-geographic or out-of-region results, importance/suspect flags, Google Maps integration (optional) with fallback to Nominatim, separate institution geocoding logic, and dry-run/CLI enhancements (places/institutions flags).
- In the Neo4j POC importer, set place_type='historical' for PGP imports and related writes.
- Update LLM secondary processing prompt and REL_MAP to include AFFILIATED_WITH and allow TRAVELED_TO to target Place or Institution.
- requirements.txt: add googlemaps dependency.

Overall goal: reduce geocoding false positives for Genizah data, correctly separate modern affiliations from historical residences, and improve geocoding accuracy for Hebrew/Arabic names and institutional names.
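The institution rerouting described above might look roughly like the sketch below. The keyword list and both function names are hypothetical; the PR's own _is_institution_name is not shown on this page and may use different rules.

```python
import re

# Hypothetical keyword list, chosen only for illustration.
_INSTITUTION_KEYWORDS = (
    "university", "library", "college", "institute",
    "seminary", "museum", "academy",
)
_INSTITUTION_RE = re.compile(
    r"\b(" + "|".join(_INSTITUTION_KEYWORDS) + r")\b", re.IGNORECASE
)


def is_institution_name(name: str) -> bool:
    """Return True when the name contains an institution-like keyword."""
    return bool(_INSTITUTION_RE.search(name or ""))


def relation_for(place_name: str) -> str:
    """Pick the relationship type for a Person -> place-like target."""
    if is_institution_name(place_name):
        return "AFFILIATED_WITH"  # modern academic/library affiliation
    return "LIVED_IN"             # historical residence in a geographic place


print(relation_for("Cambridge University Library"))
print(relation_for("Fustat"))
```

A keyword heuristic of this kind is cheap but will miss institutions whose names carry no such keyword, so it is best treated as a first-pass filter.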

Add robust source_book tracking and people/place upserts, and introduce two enrichment tools.

- enhanced_kg_import.py: introduce _src_books() helper to append the current source book to node/relationship .source_books and thread the $book parameter through many _write_* methods; merge people_locations and context_analysis people/locations when building lookups; deduplicate name variants; ensure people and places are upserted even when there are no KG triplets, via the new _write_people_locations(), and return counts; small refactors and safer checks.
- new enrich_fragment_people.py: CLI tool that scans *_enhanced.json files to create Fragment-[:MENTIONS_PERSON]->Person edges using two strategies (name-in-description = definite, single-person page co-occurrence = possible), normalises shelfmarks, and writes to Neo4j (or dry-run).
- new enrich_node_relations.py: larger entity-centric enrichment pass (extract + import) that builds focused page-context across books, calls LLM backends (LM Studio or Gemini), parses JSON triplets, writes per-entity enrichment files, and can import validated triplets into Neo4j. Includes prompt trimming, allowed-relation filtering, and safe MERGE writes that attach source book metadata.

These changes ensure extracted provenance (source_books) is preserved, avoid silently dropping people/places extracted by the LLM, and add separate enrichment passes for fragment–person links and entity-focused relation extraction.

Copilot

Pull request overview

This PR expands the Cairo Genizah Neo4j knowledge-graph pipeline by adding additional enrichment/import utilities (geocoding, enrichment passes, provenance backfills) and by tightening LLM extraction guidance to reduce non-geographic “locations” and improve relationship extraction.

Changes:

Refines LLM prompts and Neo4j triplet mapping to better separate geographic places vs institutions and to introduce affiliation relationships.
Adds/extends multiple Neo4j utilities: geocoding for Place/Institution nodes, enrichment passes, and provenance tagging/backfill via data_sources.
Improves local-source indexing and Princeton/biblio import behavior (author splitting, additional node properties, provenance tagging).

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 23 comments.


File	Description
src/models/llm/secondary_llm_processing.py	Tightens people/location and KG-triplet prompts; adds AFFILIATED_WITH mapping in Neo4j push helper.
src/datasets/indexing/neo4j/mark_local_sources.py	Adds author splitting and stores additional local file/folder metadata on BookArticle nodes.
src/datasets/indexing/neo4j/knowlege_graph_poc.py	Adds `data_sources` provenance stamping and imports document→person mention edges.
src/datasets/indexing/neo4j/geocode_places.py	New script to geocode Place/Institution nodes via Nominatim/Google with historical-name overrides.
src/datasets/indexing/neo4j/enrich_node_relations.py	New 2-stage (extract→JSON, import→Neo4j) entity-anchored relationship enrichment tool.
src/datasets/indexing/neo4j/enrich_fragment_people.py	New script to mine *_enhanced.json for Fragment→Person mention edges.
src/datasets/indexing/neo4j/enhanced_kg_import.py	New importer for *_enhanced.json triplets into Neo4j with provenance tagging and label normalization.
src/datasets/indexing/neo4j/biblio_import.py	Splits semicolon-delimited authors into individual Scholar nodes; stamps `data_sources` on nodes/edges.
src/datasets/indexing/neo4j/backfill_data_sources.py	New script to backfill `data_sources` on existing nodes/relationships.
src/datasets/indexing/bibliography/complete_pipeline_example.py	Updates example pipeline calls/formatting.
src/datasets/indexing/bibliography/batch_pipeline_runner.py	New resumable batch runner for OCR → structured → enhanced pipeline stages.
requirements.txt	Adds `googlemaps` dependency.


            'LIVED_IN':        ('Person',      'Place'),
-            'TRAVELED_TO':     ('Person',      'Place'),
+            'AFFILIATED_WITH': ('Person',      'Institution'),
+            'TRAVELED_TO':     ('Person',      None),      # Place or Institution
            'WROTE':           ('Person',      None),      # object type varies
            'MENTIONED_IN':    (None,          'Fragment'),
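One way a map of this shape can be used to validate triplets is sketched below. It assumes that `None` means "any label is accepted" on that side, which matches the inline comments but is not confirmed by the diff; `is_valid_triplet` is a hypothetical helper and only the entries visible in the hunk are reproduced.

```python
from typing import Dict, Optional, Tuple

# Entries copied from the hunk above; the real REL_MAP has more relations.
REL_MAP: Dict[str, Tuple[Optional[str], Optional[str]]] = {
    "LIVED_IN":        ("Person", "Place"),
    "AFFILIATED_WITH": ("Person", "Institution"),
    "TRAVELED_TO":     ("Person", None),   # Place or Institution
    "WROTE":           ("Person", None),   # object type varies
    "MENTIONED_IN":    (None, "Fragment"),
}


def is_valid_triplet(relation: str, subject_label: str, object_label: str) -> bool:
    """Check a triplet's labels against REL_MAP; None is treated as a wildcard."""
    expected = REL_MAP.get(relation)
    if expected is None:
        return False  # unknown relation type
    exp_subject, exp_object = expected
    subject_ok = exp_subject is None or exp_subject == subject_label
    object_ok = exp_object is None or exp_object == object_label
    return subject_ok and object_ok


print(is_valid_triplet("TRAVELED_TO", "Person", "Institution"))
print(is_valid_triplet("LIVED_IN", "Person", "Institution"))
```

Under this wildcard reading, TRAVELED_TO accepts any object label, not only Place or Institution; enforcing exactly those two would need an explicit set of allowed labels.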


+def _split_authors(raw: Any) -> List[str]:
+    """Return a flat list of individual author names.
+
+    Handles both a list of strings and a single semicolon-delimited string,
+    e.g. "French, Mary ;Goldie, Rebecca ;Nichols, Emma".
+    """
+    if isinstance(raw, str):
+        raw = [raw]
+    authors: List[str] = []
+    for item in raw:
+        for part in str(item).split(';'):
+            name = part.strip()
+            if name:
+                authors.append(name)
+    return authors


+        Also imports Fragment -[:SENT_TO]-> Person and
+        Fragment -[:AUTHORED_BY]-> Person from the `destination` / `origin`
+        columns where those values resolve to known Person nodes (as opposed to
+        Place nodes which are already handled by import_documents).
+        """


+                def _names(field):
+                    raw = doc_data.get(field) or ''
+                    return [n.strip() for n in raw.replace(';', ',').split(',')
+                            if n.strip()]
+


 Extract knowledge-graph triplets connecting these entities. Use ONLY these relation types:
-  - LIVED_IN        : person permanently resided in a place
-  - TRAVELED_TO     : person traveled to or visited a place
+  - LIVED_IN        : historical person permanently resided in a geographic place
+  - AFFILIATED_WITH : modern scholar holds a position at a university, library, or institution
+  - TRAVELED_TO     : person traveled to or visited a place or institution


+    def _write_fragment_mentions(tx, row: Dict):
+        """Fragment -[:MENTIONS / :MENTIONS_PLACE]-> Person/Place"""
+        rel        = row["relation"]
+        tgt_label  = row["object_label"]


+        # Fragment nodes are keyed by canonical_shelfmark; if we only have a
+        # display form here, just store it — it may already exist from Princeton.
+        tx.run(f"""
+            MERGE (f:Fragment {{shelfmark: $shelfmark}})
+            SET f.data_sources = CASE


+HISTORICAL_OVERRIDES: dict[str, str] = {
+    "Fustat":         "Old Cairo, Cairo, Egypt",
+    "Fusṭāṭ":         "Old Cairo, Cairo, Egypt",
+    "al-Fustat":      "Old Cairo, Cairo, Egypt",
+    "Qayrawān":       "Kairouan, Tunisia",
+    "Qayrawan":       "Kairouan, Tunisia",
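The table above lists both the diacritic and plain spellings of each name. A possible alternative, sketched here under the assumption that diacritics never distinguish two different places in this dataset, is to fold names before lookup so one key covers both spellings. `fold`, `OVERRIDES_FOLDED`, and `geocode_query` are hypothetical names; only the override values come from the diff.

```python
import unicodedata


def fold(name: str) -> str:
    """Strip combining marks and case so 'Fusṭāṭ' and 'Fustat' compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold().strip()


# Override values taken from the hunk above, keyed by folded name.
OVERRIDES_FOLDED = {
    fold("Fustat"):    "Old Cairo, Cairo, Egypt",
    fold("al-Fustat"): "Old Cairo, Cairo, Egypt",
    fold("Qayrawan"):  "Kairouan, Tunisia",
}


def geocode_query(name: str) -> str:
    """Return the modern query string for a historical name, else the name itself."""
    return OVERRIDES_FOLDED.get(fold(name), name)


print(geocode_query("Fusṭāṭ"))
print(geocode_query("Qayrawān"))
print(geocode_query("Alexandria"))
```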


+    # Fix macOS SSL certificate verification errors
+    import ssl
+    try:
+        import certifi
+        ssl_ctx = ssl.create_default_context(cafile=certifi.where())
+    except ImportError:
+        logger.warning("certifi not found — falling back to unverified SSL (pip install certifi to fix)")
+        ssl_ctx = ssl.create_default_context()
+        ssl_ctx.check_hostname = False
+        ssl_ctx.verify_mode = ssl.CERT_NONE
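The fallback branch in this hunk turns off both hostname checking and certificate verification. A more conservative variant, sketched below, falls back to the interpreter's default trust store and keeps verification on. `build_ssl_context` is a hypothetical name, and whether the default store resolves the macOS certificate errors depends on the local Python installation.

```python
import logging
import ssl

logger = logging.getLogger(__name__)


def build_ssl_context() -> ssl.SSLContext:
    """Build a verifying SSL context, preferring certifi's CA bundle."""
    try:
        import certifi
    except ImportError:
        # Keep verification enabled and rely on the default trust store.
        logger.warning("certifi not found; using the default trust store")
        return ssl.create_default_context()
    return ssl.create_default_context(cafile=certifi.where())


ctx = build_ssl_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED, ctx.check_hostname)
```

Either branch returns a context with `CERT_REQUIRED` and hostname checking enabled, so a missing dependency surfaces as a connection error instead of silently weakening transport security.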


+    parser.add_argument(
+        "--skip-unocred",
+        action="store_true",
+        help="Skip any PDF that hasn't been OCR'd yet (avoids GCP Vision API costs).",
+        default=True
+    )
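With `action="store_true"` and `default=True`, the parsed value is `True` whether or not the flag is passed, so skipping can never be turned off from the command line. If skip-by-default with an opt-out is the intent (an assumption), one option on Python 3.9+ is `argparse.BooleanOptionalAction`, which also generates a `--no-skip-unocred` form. `build_parser` is a hypothetical wrapper used to keep the sketch self-contained.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Batch pipeline runner (sketch)")
    parser.add_argument(
        "--skip-unocred",
        action=argparse.BooleanOptionalAction,  # adds --no-skip-unocred as well
        default=True,
        help="Skip any PDF that hasn't been OCR'd yet (avoids GCP Vision API costs).",
    )
    return parser


parser = build_parser()
print(parser.parse_args([]).skip_unocred)
print(parser.parse_args(["--skip-unocred"]).skip_unocred)
print(parser.parse_args(["--no-skip-unocred"]).skip_unocred)
```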



isaacmg added 8 commits May 15, 2026 01:49

Update biblio_import.py

84e5c9f

Create enhanced_kg_import.py

7f41fed

isaacmg requested a review from Copilot May 29, 2026 21:45

Copilot started reviewing on behalf of isaacmg May 29, 2026 21:45

Update geocode_places.py

104e014

Copilot AI reviewed May 29, 2026


isaacmg added 3 commits May 29, 2026 17:51

Update geocode_places.py

fd03e31

Revert "Update geocode_places.py"

f636641

This reverts commit fd03e31.

Update geocode_places.py

7e3a58f

isaacmg merged commit 7e3a58f into master Jun 2, 2026

isaacmg merged 12 commits into master from normalization_graph_work
