Add EntityEncoder for entity-aware late chunking #14

@429er

Description

Motivation

Entity clustering and identity resolution for knowledge graph construction are common pain points that would benefit from context-aware entity embeddings.

Current approaches are limited:

  • String matching misses variations ("Apple Inc." vs "AAPL" vs "Apple")
  • Simple entity embeddings lack context for disambiguation
  • Rule-based systems don't scale across domains

Late chunking can solve this by extracting entity embeddings that preserve full document context, enabling better clustering and identity construction.

Proposed Solution

Add an EntityEncoder class that applies late chunking principles to named entities instead of sentences:

  1. Tokenize full document (preserving context)
  2. Run through transformer to get token embeddings
  3. Identify entity spans via NER
  4. Mean-pool token embeddings within entity boundaries
  5. Return entity embeddings with metadata
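The pooling in steps 2–4 can be sketched in a few lines of numpy. This is a minimal illustration, not the proposed implementation: the token embedding matrix and the entity's token span are assumed to come from a transformer forward pass and an NER tagger, respectively.

```python
import numpy as np

def pool_entity_embedding(token_embeddings: np.ndarray,
                          token_start: int, token_end: int) -> np.ndarray:
    """Mean-pool the contextual token embeddings inside an entity's
    half-open token span [token_start, token_end) -- step 4 above.
    Because the embeddings were produced from the full document, the
    pooled vector carries document context, unlike embedding the
    entity string in isolation."""
    span = token_embeddings[token_start:token_end]
    return span.mean(axis=0)

# Toy stand-in for transformer output: 6 tokens, hidden size 4.
token_embeddings = np.arange(24, dtype=float).reshape(6, 4)
# Suppose NER says the entity covers tokens 2..4.
entity_vec = pool_entity_embedding(token_embeddings, 2, 4)
```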

Use Cases

1. Entity Identity Construction (Primary)

# Cluster entity mentions as same canonical entity
mentions = [
    "Apple Inc. announced new products",
    "AAPL stock rose today", 
    "Apple said revenue increased"
]

df, embeddings = encoder.encode(mentions)
# Cluster similar embeddings → identify "Apple Inc." = "AAPL" = "Apple"
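The "cluster similar embeddings" step could use any off-the-shelf clusterer (HDBSCAN, agglomerative, etc.). As a self-contained sketch, here is a greedy single-pass cosine-similarity clustering over the entity embeddings; the function name and threshold are illustrative, not part of the proposed API:

```python
import numpy as np

def cluster_embeddings(embeddings: np.ndarray, threshold: float = 0.8) -> list[int]:
    """Greedy clustering by cosine similarity: each vector joins the
    first existing cluster whose centroid is similar enough, otherwise
    it starts a new cluster. A simple stand-in for a real clusterer."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centroids: list[np.ndarray] = []
    labels: list[int] = []
    for vec in normed:
        sims = [float(vec @ c) / float(np.linalg.norm(c)) for c in centroids]
        if sims and max(sims) >= threshold:
            best = int(np.argmax(sims))
            centroids[best] = centroids[best] + vec  # running centroid sum
            labels.append(best)
        else:
            centroids.append(vec.copy())
            labels.append(len(centroids) - 1)
    return labels

# Toy vectors: the first two point the same way ("Apple Inc." / "AAPL"),
# the third is orthogonal (an unrelated entity).
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
labels = cluster_embeddings(toy)
```

Mentions sharing a cluster label would then be merged into one canonical entity.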

2. Entity Disambiguation

# Different embeddings for same string in different contexts
doc1 = "Tim Cook joined Apple in 1998"  # → "Apple" (tech company)
doc2 = "I ate an apple for lunch"       # → "apple" (fruit)
# Context-aware embeddings distinguish these automatically

3. Cross-Document Coreference

# Link entity mentions across documents
doc1 = "The company announced record profits"
doc2 = "Microsoft said it would expand"  
# Can link "the company" → "Microsoft" via embedding similarity
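Linking an underspecified mention like "the company" reduces to a nearest-neighbor lookup over candidate entity embeddings. A minimal sketch (the helper name is hypothetical; the vectors stand in for EntityEncoder output):

```python
import numpy as np

def link_mention(mention_vec: np.ndarray, candidate_vecs: np.ndarray) -> int:
    """Return the index of the candidate entity whose embedding has the
    highest cosine similarity to the mention embedding."""
    m = mention_vec / np.linalg.norm(mention_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    return int(np.argmax(c @ m))

# Toy candidates; index 1 points in nearly the same direction as the mention.
candidates = np.array([[1.0, 0.0, 0.0],
                       [0.1, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])
mention = np.array([0.0, 0.9, 0.1])
best = link_mention(mention, candidates)
```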

4. Knowledge Graph Construction

  • Extract entities with contextual embeddings
  • Cluster mentions into canonical entities
  • Build entity nodes with rich context
  • Link entities based on semantic similarity

Design Considerations

Entity Detection Strategy

Option 1: Built-in NER head (if the model supports token classification)

  • Use model's native NER capabilities
  • Fastest, most integrated

Option 2: External NER tagger (spaCy, Stanza)

  • More flexible, can swap NER systems
  • Better for specialized domains

Option 3: Pre-annotated spans (user provides)

  • Maximum control
  • Useful when NER already done upstream

Entity Types

  • Support filtering by type (PER, ORG, LOC, MISC, etc.)
  • Different strategies per type if needed
  • Option to extract all vs. specific types

Overlapping/Nested Entities

  • How to handle "New York City Police Department"?
    • Nested: "New York City" (LOC) inside "New York City Police Department" (ORG)
  • Strategy: Prefer longest span? Extract both? Configurable?
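The "prefer longest span" option from the list above can be made concrete: keep only maximal spans and drop any span strictly contained in a longer one. A sketch of one possible strategy (not a committed design):

```python
def resolve_nested(spans: list[tuple[int, int, str]]) -> list[tuple[int, int, str]]:
    """Keep only maximal (start_char, end_char, label) spans: any span
    strictly contained inside a longer span is dropped. This implements
    the 'prefer longest span' strategy; 'extract both' would simply
    return the input unchanged."""
    keep = []
    for s in spans:
        contained = any(
            o[0] <= s[0] and s[1] <= o[1] and (o[1] - o[0]) > (s[1] - s[0])
            for o in spans
        )
        if not contained:
            keep.append(s)
    return keep

# "New York City" (LOC, chars 0-13) nested inside
# "New York City Police Department" (ORG, chars 0-31).
spans = [(0, 31, "ORG"), (0, 13, "LOC")]
resolved = resolve_nested(spans)
```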

API Sketch

from afterthoughts import EntityEncoder

# Initialize with NER-capable model or separate NER tagger
encoder = EntityEncoder(
    "dslim/bert-base-NER",  # Model with NER head
    # OR
    ner_tagger="spacy",     # External NER system
)

# Extract entity embeddings
documents = ["Apple Inc. partnered with Google in California."]
df, embeddings = encoder.encode(
    documents,
    entity_types=["ORG", "LOC"],  # Optional: filter by type
    return_text=True,
)

# DataFrame columns:
# - idx: Global entity index
# - document_idx: Source document
# - entity_idx: Entity index within document  
# - entity_type: NER label (PER, ORG, LOC, etc.)
# - entity_text: Entity surface form
# - start_char: Character offset start
# - end_char: Character offset end
# - num_tokens: Token count in entity

# embeddings: numpy array of entity embeddings

Implementation Approach

Phase 1: Core Functionality

  • Create EntityEncoder class following Encoder patterns
  • Support pre-annotated entity spans (simplest)
  • Entity span → token span mapping
  • Mean-pool tokens within entity boundaries
  • Return DataFrame + embeddings
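The "entity span → token span mapping" bullet is the fiddly part of Phase 1. A sketch of how it could work, assuming per-token character offsets like those a Hugging Face fast tokenizer returns with `return_offsets_mapping=True` (where special tokens conventionally get the offset `(0, 0)`):

```python
def char_span_to_token_span(offsets: list[tuple[int, int]],
                            start_char: int, end_char: int) -> tuple[int, int]:
    """Map a character-level entity span to a half-open token span,
    given per-token (start_char, end_char) offsets. Special tokens
    ([CLS], [SEP], padding) have zero-width offsets and are skipped."""
    token_start = token_end = None
    for i, (s, e) in enumerate(offsets):
        if s == e:                           # zero-width special token
            continue
        if e > start_char and s < end_char:  # token overlaps the entity
            if token_start is None:
                token_start = i
            token_end = i + 1
    if token_start is None:
        raise ValueError("entity span covers no tokens")
    return token_start, token_end

# Hand-written offsets for "Apple Inc. announced":
# [CLS], "Apple", "Inc", ".", "announced", [SEP]
offsets = [(0, 0), (0, 5), (6, 9), (9, 10), (11, 20), (0, 0)]
span = char_span_to_token_span(offsets, 0, 10)  # chars 0-10 = "Apple Inc."
```

The resulting token span is exactly what the mean-pooling step consumes.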

Phase 2: NER Integration

  • Add built-in NER head support (if model has token classification)
  • Add external NER tagger option (spaCy, Stanza, etc.)
  • Entity type filtering

Phase 3: Advanced Features

  • Handle nested/overlapping entities
  • Entity-level chunking strategies (if entity span exceeds max_length)
  • Batch processing optimizations
  • Integration examples with KG libraries

Related Work

Similar to late chunking for sentences, but applied to entity boundaries:

  • Günther et al. (2024) showed that preserving full-document context improves chunk embeddings
  • Same principle applies to entities: full context → extract entity spans
  • Particularly valuable for disambiguation and clustering

Priority

Medium-High - Addresses a real pain point in KG construction and entity resolution workflows. A natural extension of late chunking from sentence boundaries to entity boundaries.

Labels

enhancement (New feature or request)