Add EntityEncoder for entity-aware late chunking #14

@429er

Description

Motivation

Entity clustering and identity resolution for knowledge graph construction are common pain points that would benefit from context-aware entity embeddings.

Current approaches are limited:

  • String matching misses variations ("Apple Inc." vs "AAPL" vs "Apple")
  • Simple entity embeddings lack context for disambiguation
  • Rule-based systems don't scale across domains

Late chunking can solve this by extracting entity embeddings that preserve full document context, enabling better clustering and identity construction.

Proposed Solution

Add an EntityEncoder class that applies late chunking principles to named entities instead of sentences:

  1. Tokenize full document (preserving context)
  2. Run through transformer to get token embeddings
  3. Identify entity spans via NER
  4. Mean-pool token embeddings within entity boundaries
  5. Return entity embeddings with metadata
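The pooling in steps 2–4 can be sketched in a few lines of numpy. This is a minimal illustration, not the proposed implementation: the token embedding matrix and the entity's token span are assumed to come from a transformer forward pass and an NER tagger, respectively.

```python
import numpy as np

def pool_entity_embedding(token_embeddings: np.ndarray,
                          token_start: int, token_end: int) -> np.ndarray:
    """Mean-pool the contextual token embeddings inside an entity's
    half-open token span [token_start, token_end) -- step 4 above.
    Because the embeddings were produced from the full document, the
    pooled vector carries document context, unlike embedding the
    entity string in isolation."""
    span = token_embeddings[token_start:token_end]
    return span.mean(axis=0)

# Toy stand-in for transformer output: 6 tokens, hidden size 4.
token_embeddings = np.arange(24, dtype=float).reshape(6, 4)
# Suppose NER says the entity covers tokens 2..4.
entity_vec = pool_entity_embedding(token_embeddings, 2, 4)
```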

Use Cases

1. Entity Identity Construction (Primary)

# Cluster entity mentions as same canonical entity
mentions = [
    "Apple Inc. announced new products",
    "AAPL stock rose today", 
    "Apple said revenue increased"
]

df, embeddings = encoder.encode(mentions)
# Cluster similar embeddings → identify "Apple Inc." = "AAPL" = "Apple"
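The "cluster similar embeddings" step could use any off-the-shelf clusterer (HDBSCAN, agglomerative, etc.). As a self-contained sketch, here is a greedy single-pass cosine-similarity clustering over the entity embeddings; the function name and threshold are illustrative, not part of the proposed API:

```python
import numpy as np

def cluster_embeddings(embeddings: np.ndarray, threshold: float = 0.8) -> list[int]:
    """Greedy clustering by cosine similarity: each vector joins the
    first existing cluster whose centroid is similar enough, otherwise
    it starts a new cluster. A simple stand-in for a real clusterer."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centroids: list[np.ndarray] = []
    labels: list[int] = []
    for vec in normed:
        sims = [float(vec @ c) / float(np.linalg.norm(c)) for c in centroids]
        if sims and max(sims) >= threshold:
            best = int(np.argmax(sims))
            centroids[best] = centroids[best] + vec  # running centroid sum
            labels.append(best)
        else:
            centroids.append(vec.copy())
            labels.append(len(centroids) - 1)
    return labels

# Toy vectors: the first two point the same way ("Apple Inc." / "AAPL"),
# the third is orthogonal (an unrelated entity).
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
labels = cluster_embeddings(toy)
```

Mentions sharing a cluster label would then be merged into one canonical entity.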

2. Entity Disambiguation

# Different embeddings for same string in different contexts
doc1 = "Tim Cook joined Apple in 1998"  # → "Apple" (tech company)
doc2 = "I ate an apple for lunch"       # → "apple" (fruit)
# Context-aware embeddings distinguish these automatically

3. Cross-Document Coreference

# Link entity mentions across documents
doc1 = "The company announced record profits"
doc2 = "Microsoft said it would expand"  
# Can link "the company" → "Microsoft" via embedding similarity
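Linking an underspecified mention like "the company" reduces to a nearest-neighbor lookup over candidate entity embeddings. A minimal sketch (the helper name is hypothetical; the vectors stand in for EntityEncoder output):

```python
import numpy as np

def link_mention(mention_vec: np.ndarray, candidate_vecs: np.ndarray) -> int:
    """Return the index of the candidate entity whose embedding has the
    highest cosine similarity to the mention embedding."""
    m = mention_vec / np.linalg.norm(mention_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    return int(np.argmax(c @ m))

# Toy candidates; index 1 points in nearly the same direction as the mention.
candidates = np.array([[1.0, 0.0, 0.0],
                       [0.1, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])
mention = np.array([0.0, 0.9, 0.1])
best = link_mention(mention, candidates)
```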

4. Knowledge Graph Construction

  • Extract entities with contextual embeddings
  • Cluster mentions into canonical entities
  • Build entity nodes with rich context
  • Link entities based on semantic similarity

Design Considerations

Entity Detection Strategy

Option 1: Built-in NER head (if the model supports token classification)

  • Use model's native NER capabilities
  • Fastest, most integrated

Option 2: External NER tagger (spaCy, Stanza)

  • More flexible, can swap NER systems
  • Better for specialized domains

Option 3: Pre-annotated spans (user provides)

  • Maximum control
  • Useful when NER already done upstream

Entity Types

  • Support filtering by type (PER, ORG, LOC, MISC, etc.)
  • Different strategies per type if needed
  • Option to extract all vs. specific types

Overlapping/Nested Entities

  • How to handle "New York City Police Department"?
    • Nested: "New York City" (LOC) inside "New York City Police Department" (ORG)
  • Strategy: Prefer longest span? Extract both? Configurable?
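The "prefer longest span" option from the list above can be made concrete: keep only maximal spans and drop any span strictly contained in a longer one. A sketch of one possible strategy (not a committed design):

```python
def resolve_nested(spans: list[tuple[int, int, str]]) -> list[tuple[int, int, str]]:
    """Keep only maximal (start_char, end_char, label) spans: any span
    strictly contained inside a longer span is dropped. This implements
    the 'prefer longest span' strategy; 'extract both' would simply
    return the input unchanged."""
    keep = []
    for s in spans:
        contained = any(
            o[0] <= s[0] and s[1] <= o[1] and (o[1] - o[0]) > (s[1] - s[0])
            for o in spans
        )
        if not contained:
            keep.append(s)
    return keep

# "New York City" (LOC, chars 0-13) nested inside
# "New York City Police Department" (ORG, chars 0-31).
spans = [(0, 31, "ORG"), (0, 13, "LOC")]
resolved = resolve_nested(spans)
```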

API Sketch

from afterthoughts import EntityEncoder

# Initialize with NER-capable model or separate NER tagger
encoder = EntityEncoder(
    "dslim/bert-base-NER",  # Model with NER head
    # OR
    ner_tagger="spacy",     # External NER system
)

# Extract entity embeddings
documents = ["Apple Inc. partnered with Google in California."]
df, embeddings = encoder.encode(
    documents,
    entity_types=["ORG", "LOC"],  # Optional: filter by type
    return_text=True,
)

# DataFrame columns:
# - idx: Global entity index
# - document_idx: Source document
# - entity_idx: Entity index within document  
# - entity_type: NER label (PER, ORG, LOC, etc.)
# - entity_text: Entity surface form
# - start_char: Character offset start
# - end_char: Character offset end
# - num_tokens: Token count in entity

# embeddings: numpy array of entity embeddings

Implementation Approach

Phase 1: Core Functionality

  • Create EntityEncoder class following Encoder patterns
  • Support pre-annotated entity spans (simplest)
  • Entity span → token span mapping
  • Mean-pool tokens within entity boundaries
  • Return DataFrame + embeddings
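The "entity span → token span mapping" bullet is the fiddly part of Phase 1. A sketch of how it could work, assuming per-token character offsets like those a Hugging Face fast tokenizer returns with `return_offsets_mapping=True` (where special tokens conventionally get the offset `(0, 0)`):

```python
def char_span_to_token_span(offsets: list[tuple[int, int]],
                            start_char: int, end_char: int) -> tuple[int, int]:
    """Map a character-level entity span to a half-open token span,
    given per-token (start_char, end_char) offsets. Special tokens
    ([CLS], [SEP], padding) have zero-width offsets and are skipped."""
    token_start = token_end = None
    for i, (s, e) in enumerate(offsets):
        if s == e:                           # zero-width special token
            continue
        if e > start_char and s < end_char:  # token overlaps the entity
            if token_start is None:
                token_start = i
            token_end = i + 1
    if token_start is None:
        raise ValueError("entity span covers no tokens")
    return token_start, token_end

# Hand-written offsets for "Apple Inc. announced":
# [CLS], "Apple", "Inc", ".", "announced", [SEP]
offsets = [(0, 0), (0, 5), (6, 9), (9, 10), (11, 20), (0, 0)]
span = char_span_to_token_span(offsets, 0, 10)  # chars 0-10 = "Apple Inc."
```

The resulting token span is exactly what the mean-pooling step consumes.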

Phase 2: NER Integration

  • Add built-in NER head support (if model has token classification)
  • Add external NER tagger option (spaCy, Stanza, etc.)
  • Entity type filtering

Phase 3: Advanced Features

  • Handle nested/overlapping entities
  • Entity-level chunking strategies (if entity span exceeds max_length)
  • Batch processing optimizations
  • Integration examples with KG libraries

Related Work

Similar to late chunking for sentences, but applied to entity boundaries:

  • Günther et al. (2024) showed that preserving full-document context improves chunk embeddings
  • Same principle applies to entities: full context → extract entity spans
  • Particularly valuable for disambiguation and clustering

Priority

Medium-High - Addresses a real pain point in KG construction and entity resolution workflows. A natural extension of late chunking from sentence boundaries to entity boundaries.

Labels

enhancement (New feature or request)