This repository contains tooling, documentation, and experiments for clustering and interpreting a personal collection of ~7,000 images enriched with captions, metadata, and precomputed embeddings. The goal is to surface recurring visual motifs and themes that reflect personal visual preferences.
- Organize the dataset into meaningful clusters using both existing embeddings and any additional representations that improve structure.
- Leverage LLM-assisted sensemaking to label clusters and summarize prevailing concepts.
- Iterate on quantitative evaluations that validate cluster quality and capture trends over time.
- `docs/` – project documentation, changelog, research notes, and experiment journal.
- `src/analysis/` – Python modules for preprocessing, clustering, and visualization helpers.
- `requirements.txt` – curated dependency list for the experimentation environment.
- `data/` – generated artifacts (ignored by git); created by the preprocessing/clustering scripts.
Note: `eagle_images_rows.csv` is ignored by git due to its size and sensitivity. Place it at the repo root before running any scripts.
- Create and activate a virtual environment (e.g., `python -m venv .venv && source .venv/bin/activate`).
- Install dependencies: `pip install -r requirements.txt`.
- Ensure the CSV export from Eagle (`eagle_images_rows.csv`) lives at the project root.
All modules live under `src/analysis`. Add the directory to `PYTHONPATH` when executing modules, e.g. `PYTHONPATH=src python -m analysis.preprocess --help`.

The planned workflow is:
- Run `analysis.preprocess` to parse JSON columns, materialize embeddings as NumPy arrays, and serialize a cleaned metadata table.
- Experiment with `analysis.cluster` to perform UMAP dimensionality reduction followed by HDBSCAN (or other algorithms) and persist experiment outputs.
- Summarize results, plots, and LLM interpretations in `docs/cluster-journal.md`.
Personal research project – no license specified yet.