Motivation
Entity clustering and identity resolution for knowledge graph construction are common pain points that could benefit from context-aware entity embeddings.
Current approaches are limited:
- String matching misses variations ("Apple Inc." vs "AAPL" vs "Apple")
- Simple entity embeddings lack context for disambiguation
- Rule-based systems don't scale across domains
Late chunking can solve this by extracting entity embeddings that preserve full document context, enabling better clustering and identity construction.
Proposed Solution
Add an EntityEncoder class that applies late chunking principles to named entities instead of sentences:
- Tokenize full document (preserving context)
- Run through transformer to get token embeddings
- Identify entity spans via NER
- Mean-pool token embeddings within entity boundaries
- Return entity embeddings with metadata
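The pooling step above can be sketched with plain NumPy. Here `token_embeddings` come from a single full-document forward pass and `token_offsets` from a tokenizer's character-offset mapping (e.g. a Hugging Face fast tokenizer's `return_offsets_mapping`); the function name and signature are illustrative, not part of the proposed API:

```python
import numpy as np

def pool_entity_embeddings(token_embeddings, token_offsets, entity_spans):
    """Mean-pool token embeddings that fall inside each entity's char span.

    token_embeddings: (num_tokens, dim) array from a full-document forward pass
    token_offsets:    list of (start_char, end_char) per token
    entity_spans:     list of (start_char, end_char) per detected entity
    """
    pooled = []
    for ent_start, ent_end in entity_spans:
        # A token belongs to the entity if its char span overlaps the entity span
        idx = [
            i for i, (tok_start, tok_end) in enumerate(token_offsets)
            if tok_start < ent_end and tok_end > ent_start
        ]
        pooled.append(token_embeddings[idx].mean(axis=0))
    return np.stack(pooled)
```

Because the forward pass covers the whole document, each pooled vector reflects the entity's surrounding context, not just its surface form.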
Use Cases
1. Entity Identity Construction (Primary)
```python
# Cluster entity mentions as the same canonical entity
mentions = [
    "Apple Inc. announced new products",
    "AAPL stock rose today",
    "Apple said revenue increased",
]
df, embeddings = entity_encoder.encode(mentions)
# Cluster similar embeddings → identify "Apple Inc." = "AAPL" = "Apple"
```
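The clustering step itself would sit outside the encoder. A minimal sketch, assuming nothing beyond NumPy, is a single greedy pass over the embeddings (the threshold and the first-member-as-representative strategy are illustrative choices, not part of the proposal):

```python
import numpy as np

def cluster_entities(embeddings, threshold=0.85):
    """Greedy single-pass clustering: assign each embedding to the most
    similar existing cluster if it clears the cosine threshold, otherwise
    start a new cluster. Each cluster is represented by its first member."""
    # Normalize so dot products are cosine similarities
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    reps, labels = [], []
    for vec in unit:
        sims = [vec @ r for r in reps]
        if sims and max(sims) >= threshold:
            labels.append(int(np.argmax(sims)))
        else:
            labels.append(len(reps))
            reps.append(vec)
    return labels
```

For production use, agglomerative clustering or a vector index would scale better; the point is only that context-aware embeddings make a simple similarity threshold meaningful.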
2. Entity Disambiguation
```python
# Different embeddings for the same string in different contexts
doc1 = "Tim Cook joined Apple in 1998"  # → "Apple" (tech company)
doc2 = "I ate an apple for lunch"       # → "apple" (fruit)
# Context-aware embeddings distinguish these automatically
```
3. Cross-Document Coreference
```python
# Link entity mentions across documents
doc1 = "The company announced record profits"
doc2 = "Microsoft said it would expand"
# Can link "the company" → "Microsoft" via embedding similarity
```
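Assuming entity embeddings are already available, the linking step could be a nearest-neighbor lookup with a cosine-similarity floor; the function below is a sketch, not the proposed API:

```python
import numpy as np

def link_mention(mention_emb, candidate_embs, threshold=0.8):
    """Return the index of the most similar candidate entity, or None
    if no candidate clears the cosine-similarity threshold."""
    q = mention_emb / np.linalg.norm(mention_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    sims = c @ q
    best = int(np.argmax(sims))
    return best if sims[best] >= threshold else None
```

Returning `None` below the threshold matters for coreference: a vague mention like "the company" should stay unlinked when no candidate is a confident match.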
4. Knowledge Graph Construction
- Extract entities with contextual embeddings
- Cluster mentions into canonical entities
- Build entity nodes with rich context
- Link entities based on semantic similarity
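The first two steps can be sketched without a graph library, assuming cluster labels from an earlier clustering pass; `build_entity_nodes` and its record fields are illustrative, not the proposed API:

```python
from collections import defaultdict

def build_entity_nodes(mentions, labels):
    """Group entity mentions into canonical entity nodes by cluster label.

    mentions: list of dicts with at least 'entity_text' and 'document_idx'
    labels:   one cluster label per mention (e.g. from an embedding clusterer)
    """
    nodes = defaultdict(lambda: {"aliases": set(), "documents": set()})
    for mention, label in zip(mentions, labels):
        nodes[label]["aliases"].add(mention["entity_text"])
        nodes[label]["documents"].add(mention["document_idx"])
    return dict(nodes)
```

Each node accumulates the surface forms and source documents of its mentions, which is the raw material for entity nodes with rich context.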
Design Considerations
Entity Detection Strategy
Option 1: Built-in NER head (if model supports)
- Use model's native NER capabilities
- Fastest, most integrated
Option 2: External NER tagger (spaCy, Stanza)
- More flexible, can swap NER systems
- Better for specialized domains
Option 3: Pre-annotated spans (user provides)
- Maximum control
- Useful when NER already done upstream
Entity Types
- Support filtering by type (PER, ORG, LOC, MISC, etc.)
- Different strategies per type if needed
- Option to extract all vs. specific types
Overlapping/Nested Entities
- How to handle "New York City Police Department"?
- Nested: "New York City" (LOC) inside "New York City Police Department" (ORG)
- Strategy: Prefer longest span? Extract both? Configurable?
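A "prefer longest span" default with an opt-out for nested extraction could look like this; the function and its behavior are a sketch of one possible answer, not a decided design:

```python
def resolve_nested_spans(spans, keep_nested=False):
    """Resolve overlapping entity spans.

    spans: list of (start_char, end_char, label) tuples
    By default keep only the longest span among overlapping ones;
    with keep_nested=True, return all spans unchanged.
    """
    if keep_nested:
        return list(spans)
    # Longest spans first; earlier start wins ties
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    for start, end, label in ordered:
        # Keep the span only if it overlaps nothing already kept
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)
```

For the example above, the default would keep "New York City Police Department" (ORG) and drop the nested "New York City" (LOC), while `keep_nested=True` preserves both.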
API Sketch
```python
from afterthoughts import EntityEncoder

# Initialize with a NER-capable model or a separate NER tagger
encoder = EntityEncoder(
    "dslim/bert-base-NER",  # Model with NER head
    # OR
    # ner_tagger="spacy",   # External NER system
)

# Extract entity embeddings
documents = ["Apple Inc. partnered with Google in California."]
df, embeddings = encoder.encode(
    documents,
    entity_types=["ORG", "LOC"],  # Optional: filter by type
    return_text=True,
)
```
The returned DataFrame has one row per extracted entity:
- idx: global entity index
- document_idx: source document
- entity_idx: entity index within the document
- entity_type: NER label (PER, ORG, LOC, etc.)
- entity_text: entity surface form
- start_char: character offset start
- end_char: character offset end
- num_tokens: token count in the entity

embeddings: NumPy array of entity embeddings, one row per entity.
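Assuming end-exclusive, Python-slice-style character offsets (an assumption the proposal would need to confirm), a row for the example document might look like this; all values, including the token count, are hypothetical:

```python
# Hypothetical row for the first entity in the example document;
# offsets are assumed end-exclusive (Python slice convention)
documents = ["Apple Inc. partnered with Google in California."]
row = {
    "idx": 0,
    "document_idx": 0,
    "entity_idx": 0,
    "entity_type": "ORG",
    "entity_text": "Apple Inc.",
    "start_char": 0,
    "end_char": 10,
    "num_tokens": 3,  # hypothetical subword count
}
# The char offsets recover the surface form from the source document
span = documents[row["document_idx"]][row["start_char"]:row["end_char"]]
assert span == row["entity_text"]
```

Guaranteeing that offsets round-trip back to the source text keeps downstream consumers (highlighting, KG provenance) simple.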
Implementation Approach
Phase 1: Core Functionality
- EntityEncoder class following Encoder patterns
Phase 2: NER Integration
Phase 3: Advanced Features
Related Work
Similar to late chunking for sentences, but applied to entity boundaries:
- Günther et al. 2024 showed context preservation improves embeddings
- Same principle applies to entities: full context → extract entity spans
- Particularly valuable for disambiguation and clustering
Priority
Medium-High - Addresses real pain point in KG construction and entity resolution workflows. Natural extension of late chunking principles to a different chunking strategy.