NolanKoblischke/AION-Search

AION-Search: Semantic search for galaxy images using AI-generated captions

arXiv | Project Page | HuggingFace Demo | License: MIT | Citation

AION-Search is a text-based search engine for galaxy images, trained on GPT-4.1-mini descriptions and summaries of HSC and Legacy Survey galaxies.

🔭 Use AION-Search now!
Try the live web demo: AION-Search App

📚 Check out our results
More details at: Project Page

📦 Explore Our Datasets
Access all data products (embeddings and captions): HuggingFace Datasets

Quick Start

Installation

git clone https://github.com/NolanKoblischke/AION-Search.git
cd AION-Search
pip install -e . # or uv pip install -e .

Requirements

This package requires an OpenAI API key for generating text embeddings.

  1. Get your API key at platform.openai.com
  2. Create a .env file in the project root:
    OPENAI_API_KEY=<your-api-key>
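To check that the key is visible to Python before running anything that calls OpenAI, something like the following sketch can help. It uses only the standard library. `load_env_file` and `require_openai_key` are hypothetical helper names, not part of the `aionsearch` package, and the package may already handle `.env` loading on its own.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Read KEY=VALUE lines from a .env file into os.environ.

    Hypothetical helper for illustration; not part of aionsearch.
    Returns a dict of the pairs found in the file.
    """
    loaded = {}
    env_path = Path(path)
    if not env_path.is_file():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        # Variables already set in the environment take precedence.
        os.environ.setdefault(key, value)
        loaded[key] = value
    return loaded


def require_openai_key():
    """Return OPENAI_API_KEY or raise a clear error if it is missing."""
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; add it to .env or export it in your shell"
        )
    return key
```

Calling `load_env_file()` from the project root and then `require_openai_key()` fails early with a readable message if the key is missing. The `python-dotenv` package is a common alternative to a hand-written parser.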

Usage

from aionsearch import AIONSearchClipModel

# Load pretrained model from HuggingFace
model = AIONSearchClipModel.from_pretrained()

# Project AION image embeddings into shared space
# Placeholder: replace ... with the embedding of an image computed using github.com/PolymathicAI/AION
aion_embedding = ...
projected_image = model.image_projector(aion_embedding)  # (batch, 768) -> (batch, 1024)

# Project OpenAI text embeddings into shared space
# Placeholder: replace ... with the embedding of the query text from text-embedding-3-large
text_embedding = ...
projected_text = model.text_projector(text_embedding)    # (batch, 3072) -> (batch, 1024)

# Compute similarity for semantic search
similarity = projected_image @ projected_text.T

See examples/quick_start.ipynb for a complete walkthrough that downloads a galaxy image, generates embeddings with AION, and performs text-to-image similarity search.
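To turn the similarity scores from the usage snippet into a ranked list of results, one option is to L2-normalize both sides so the dot product becomes a cosine similarity, then sort. The sketch below uses plain Python lists so it runs without the model. `l2_normalize` and `rank_images` are hypothetical names, and whether the projectors already return normalized vectors is an assumption to check against the notebook. In practice the same steps would be done with tensor operations.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def rank_images(text_vec, image_vecs, top_k=3):
    """Rank image vectors by cosine similarity to one text vector.

    Hypothetical helper: both inputs are assumed to already live in
    the shared (projected) embedding space.
    Returns a list of (image_index, score) pairs, best match first.
    """
    query = l2_normalize(text_vec)
    scored = []
    for idx, vec in enumerate(image_vecs):
        unit = l2_normalize(vec)
        score = sum(a * b for a, b in zip(query, unit))
        scored.append((idx, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy 2-D vectors standing in for 1024-D projected embeddings.
    text = [1.0, 0.0]
    images = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
    for idx, score in rank_images(text, images):
        print(idx, round(score, 3))
```

Normalizing first keeps vector magnitude from dominating the ranking. If the projected embeddings are already unit length, the extra step leaves the ranking unchanged.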


Research Code

The research/ directory contains the paper-facing experiment code and small frozen artifacts needed to reproduce the main analyses where redistribution is practical.

The research environment can be installed from research/pyproject.toml:

cd research
uv sync

Some large inputs and external services are intentionally not vendored into the Git repository. The experiment READMEs document which public Hugging Face artifacts, raw survey products, or API credentials are required for each workflow.


Citation

If you find this work useful, please cite:

@misc{koblischke2025semantic,
      title={Semantic Search for 100M+ Galaxy Images Using AI-Generated Captions}, 
      author={Nolan Koblischke and Liam Parker and Francois Lanusse and Jo Bovy and Irina Espejo and Shirley Ho},
      year={2025},
      eprint={2512.11982},
      archivePrefix={arXiv},
      primaryClass={astro-ph.IM},
      url={https://arxiv.org/abs/2512.11982}, 
}
