Interactive toolkit for exploring word embeddings: Word2Vec, GloVe, nearest neighbors, analogies, and 2D visualizations.
Built with a robust, modular architecture for enhanced stability and maintainability.
- Modular Architecture: Concerns separated across `services`, `presentation`, `visualization`, and `data` layers for clean, testable code.
- Lazy Model Loading: Models load on demand via `ModelManager`, improving startup speed and memory efficiency.
- Centralized Configuration: All settings, paths, and URLs managed in `src/core/config.py`.
- Enhanced Logging: Detailed process logs go to file; essential user messages go to the console.
- Automatic Download of pre-trained models with integrity checks:
- Word2Vec GoogleNews (3M words, 300d)
- GloVe 6B (400K words, 50/100/200/300d variants)
- Smart Caching – models saved in Gensim binary format for instant subsequent loads
- Nearest Neighbors – find semantically similar words with visual similarity bars
- Analogies – solve `king - man + woman = ?` with 2D vector visualization (`-v` flag) using PCA/t-SNE
- Semantic Clusters – 2D vector visualization of seed words + neighbors using PCA/t-SNE
- Evaluation – test on Google Analogy Test Set (19,544 questions) with semantic/syntactic breakdown
- Interactive CLI – intuitive shell with contextual help and demo mode
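The lazy-loading pattern behind `ModelManager` can be sketched as follows. This is a minimal illustration: the class name comes from the feature list above, but the internals here are assumed, and the real loaders would wrap Gensim's `KeyedVectors` loading and caching.

```python
class ModelManager:
    """Load each model only on first request, then reuse it (lazy loading)."""

    def __init__(self, loaders):
        self._loaders = loaders  # name -> zero-arg callable that loads a model
        self._cache = {}         # name -> loaded model

    def get(self, name):
        if name not in self._cache:          # first access: load and memoize
            self._cache[name] = self._loaders[name]()
        return self._cache[name]             # later accesses: instant
```

In the actual project, each loader would call something like `KeyedVectors.load_word2vec_format(...)` once and then use Gensim's native `save`/`load` for the binary cache.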
git clone https://github.com/IlyaShaposhnikov/embedding-visualizer.git
cd embedding-visualizer
python -m venv .venv
source .venv/Scripts/activate  # Windows (Git Bash)
source .venv/bin/activate # Linux/macOS
pip install -r requirements.txt
python main.py

On first run, models download automatically (~1.5 GB for Word2Vec, ~800 MB for GloVe). Subsequent runs use cached binaries.
>>> nn king 5 # 5 nearest neighbors of 'king'
>>> ana paris france berlin -v tsne # paris - france = ? - berlin + visualization (t-SNE)
>>> vc king queen computer 3 pca # clusters for 3 seeds with 3 neighbors each (PCA)
>>> eval # evaluate model on Google Analogy Test Set
>>> demo                   # full demonstration (neighbors, analogies, clusters)

| Command | Description |
|---|---|
| `use <model>` | Switch active model: `word2vec` or `glove` |
| `nn <word> [topn]` | Nearest neighbors (default `topn=5`) |
| `ana <w1> <w2> <w3> [topn] [-v] [m]` | Solve analogy `w1 - w2 = ? - w3`<br>• `-v`: visualize vector relationships<br>• `[m]`: method `pca` or `tsne` (default `pca`)<br>→ Auto-saved to `data/visualizations/` |
| `vc <w1> [w2 ...] [n] [m]` | Visualize semantic clusters:<br>• Seeds: `w1...` (min 1)<br>• `[n]`: neighbors per seed (default 3, max 20)<br>• `[m]`: method `pca` or `tsne` (default `pca`)<br>→ Auto-saved to `data/visualizations/` |
| `eval` | Evaluate current model on Google Analogy Test Set |
| `model` | Show model info (vocab size, dimension, memory usage) |
| `demo` | Run full demonstration (nearest neighbors, analogies, clusters for both models) |
| `help` | Show command reference |
| `exit` / `quit` | Exit program |

Projection methods:

- PCA – Linear projection that preserves global structure by mapping vectors onto axes of maximum variance. Fast, but may blur local clusters.
- t-SNE – Non-linear projection that preserves local neighborhoods by modeling pairwise similarities. Slower, but reveals fine-grained clusters at the cost of global geometry.
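The 2D projection step can be sketched with scikit-learn, which is assumed here as the backend (the actual code in `src/visualization/projections.py` may differ in details):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def project_2d(vectors: np.ndarray, method: str = "pca") -> np.ndarray:
    """Project an (n, d) matrix of word vectors to (n, 2) for plotting."""
    if method == "pca":
        # Linear: axes of maximum variance, preserves global structure
        return PCA(n_components=2).fit_transform(vectors)
    # t-SNE: perplexity must be smaller than the number of samples
    perplexity = min(30, len(vectors) - 1)
    return TSNE(n_components=2, perplexity=perplexity,
                init="pca", random_state=0).fit_transform(vectors)
```

Fixing `random_state` makes t-SNE plots reproducible across runs, which matters because t-SNE is stochastic.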
Word2Vec (GoogleNews) | NEAREST NEIGHBORS: 'king'
────────────────────────────────────────────────────────────
1. queen | 0.7660 | ================
2. prince | 0.7421 | ===============
3. kings | 0.7285 | ===============
4. monarch | 0.7123 | ==============
5. crown | 0.6987 | ==============
────────────────────────────────────────────────────────────
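The similarity bars in the output above can be rendered with a helper like this; the exact scale the project uses is assumed (here, one `=` per 0.05 of cosine similarity):

```python
def similarity_bar(score: float, width: int = 20) -> str:
    """Render a cosine-similarity score in [0, 1] as a bar of '=' characters."""
    return "=" * round(score * width)

# Illustrative values taken from the sample output above
for rank, (word, score) in enumerate([("queen", 0.7660), ("prince", 0.7421)], 1):
    print(f"{rank}. {word:<12} | {score:.4f} | {similarity_bar(score)}")
```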
Word2Vec (GoogleNews) | ANALOGY: king - man = ? - woman
────────────────────────────────────────────────────────────
# Solution Similarity
────────────────────────────────────────────────────────────
1. queen 0.7660
2. monarch 0.7421
3. prince 0.7285
────────────────────────────────────────────────────────────
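Under the hood, an analogy query is plain vector arithmetic followed by a cosine-similarity search over the vocabulary. A toy illustration with made-up 3-dimensional vectors (not real embeddings; the actual queries go through Gensim):

```python
import numpy as np

# Tiny made-up vocabulary; real embeddings are 100-300 dimensional
vocab = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.9, 0.1, 0.8]),
    "man":    np.array([0.1, 0.9, 0.1]),
    "woman":  np.array([0.1, 0.1, 0.9]),
    "prince": np.array([0.8, 0.9, 0.2]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c):
    """Answer 'a - b + c = ?' by nearest cosine neighbor, excluding the inputs."""
    target = vocab[a] - vocab[b] + vocab[c]
    candidates = (w for w in vocab if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vocab[w], target))

print(analogy("king", "man", "woman"))  # -> queen
```

With real models, the same query is a single call: `kv.most_similar(positive=["king", "woman"], negative=["man"], topn=3)`.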
embedding-visualizer/
├── data/                                   # Models (downloaded on first run) & visualizations
│ ├── GoogleNews-vectors-negative300.bin # Word2Vec (3.4 GB)
│ ├── glove.6B.100d.txt # GloVe (331 MB)
│ └── visualizations/ # Auto-saved plots
├── logs/ # Directory for log files
│ └── embedding_visualizer.log # Application logs
├── src/
│ ├── core/ # Core application
│ │ ├── config.py # Centralized configuration
│ │ └── logging_config.py # Logging setup
│ ├── data/ # Data handling
│ │ └── data_extraction.py # Data extraction from models
│ ├── services/ # Business logic
│ │ ├── embedding.py # Core operations
│ │ └── evaluation.py # Evaluation logic
│ ├── visualization/ # Visualization
│ │ ├── data_preparation.py # Data preparation for visualization
│ │ ├── projections.py # Data projection
│ │ ├── plotting.py # Plotting logic
│ │ ├── clusters.py # Cluster visualization
│ │ └── analogies.py # Analogy visualization
│ ├── presentation/ # Presentation and formatting
│ │ └── formatting.py # Result formatting
│ ├── cli.py # Interactive shell & command parsing
│ ├── download.py # Model download with size verification & mirrors
│ ├── evaluate.py # Google Analogy Test Set evaluation
│ ├── models.py # Model loading with caching
│ ├── queries.py # Core operations interface
│ └── visualize.py # Facade for visualization functions
├── main.py # Entry point
├── requirements.txt # Dependencies
├── README.md # Project documentation (English)
└── README.ru.md # Project documentation (Russian)
Word2Vec download may fail with:
Too many users have viewed or downloaded this file recently...
Solution: Use the suggested mirror
GloVe 6B.100d has a limited vocabulary (~400K words) and only 100 dimensions:

- Geographic names are often missing → some semantic sections are skipped during `eval`
- Expected accuracy: ~2-5% overall (vs. 65-75% for Word2Vec GoogleNews 300d)
- Recommendation: Use Word2Vec for serious evaluation; GloVe 100d is suitable for demonstrations
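The semantic/syntactic breakdown reported by `eval` follows the test set's section naming: syntactic sections are prefixed with `gram` (e.g. `gram1-adjective-to-adverb`), all others are semantic. A sketch of the aggregation (function name assumed):

```python
def breakdown(section_scores):
    """Aggregate {section_name: (correct, total)} into semantic/syntactic accuracy."""
    def acc(pairs):
        correct = sum(c for c, _ in pairs)
        total = sum(t for _, t in pairs)
        return correct / total if total else 0.0
    # Google Analogy Test Set convention: syntactic sections start with "gram"
    sem = [v for k, v in section_scores.items() if not k.startswith("gram")]
    syn = [v for k, v in section_scores.items() if k.startswith("gram")]
    return {"semantic": acc(sem), "syntactic": acc(syn)}
```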
| Model | RAM required | Load time (first) | Load time (cached) |
|---|---|---|---|
| Word2Vec GoogleNews | ~4.2 GB | 3-5 min | ~10 sec |
| GloVe 6B.100d | ~1.2 GB | 1-2 min | ~5 sec |
Ensure ≥6 GB free RAM for comfortable usage with both models.
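The RAM figures above can be sanity-checked with a back-of-envelope estimate for the raw float32 vectors alone (vocabulary tables and Python object overhead come on top):

```python
def embedding_ram_gb(vocab_size: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw size of a float32 embedding matrix in GiB."""
    return vocab_size * dim * bytes_per_value / 1024**3

# Word2Vec GoogleNews: 3M words x 300 dims -> ~3.35 GiB of raw vectors,
# consistent with the ~4.2 GB total once loader overhead is added
print(f"{embedding_ram_gb(3_000_000, 300):.2f} GiB")
```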
- Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. ICLR.
- Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. EMNLP.
- Word2Vec GoogleNews: Trained on Google News corpus (3B words), 3M vocabulary, 300d vectors
- GloVe 6B: Trained on Wikipedia 2014 + Gigaword 5 (6B tokens), 400K vocabulary
- Google Analogy Test Set: 19,544 questions (8,869 semantic + 10,675 syntactic)