Merged
19 commits
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -137,6 +137,7 @@ celerybeat.pid
# Environments
.env
.envrc
.omop_emb/
.venv
env/
venv/
Expand Down
30 changes: 22 additions & 8 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,36 +1,50 @@
# omop-emb
Embedding layer for OMOP CDM.

`omop-emb` now separates model metadata from embedding storage:

- model metadata is stored locally in SQLite (`metadata.db`)
- embedding vectors are stored by the selected backend (`pgvector` or `faiss`)
- OMOP concept metadata remains in the OMOP CDM database

## Installation

`omop-emb` now exposes backend-specific optional dependencies so installation
can match the embedding backend you actually intend to use.

```bash
pip install "omop-emb[postgres]"
pip install "omop-emb[pgvector]"
pip install "omop-emb[faiss]"
pip install "omop-emb[all]"
```

Notes:

- `postgres` installs the PostgreSQL/pgvector dependencies.
- `pgvector` installs the PostgreSQL/pgvector dependencies.
- `faiss` installs the FAISS-based backend dependencies. This currently includes CPU support only.
- `all` installs both backend stacks for development or mixed environments.
- A plain `pip install omop-emb` installs the shared core package only.
- PostgreSQL-specific embedding dependencies are now optional, but `omop-emb`
still requires some database backend for OMOP access and model registration.
- PostgreSQL-specific embedding dependencies are optional, but `omop-emb`
still requires OMOP CDM database access.
- Non-PostgreSQL database backends have not yet been tested.

## Runtime Configuration

Common environment variables:

- `OMOP_EMB_BACKEND`: backend name (`pgvector` or `faiss`) used by the backend factory.
- `OMOP_EMB_BASE_STORAGE_DIR`: local base directory for `omop-emb` artifacts, including local metadata (`metadata.db`) and FAISS files. If unset, `omop-emb` defaults to `./.omop_emb` in the current working directory.
- `OMOP_DATABASE_URL`: SQLAlchemy URL for the OMOP CDM database.

Extended documentation can be found [here](https://AustralianCancerDataNetwork.github.io/omop-emb).

# Project Roadmap

- [x] Interface for postgres storage of vectors
- [x] Interface for PostgreSQL storage of vectors
- [x] Interface for FAISS storage of embeddings
- [ ] Extensive unit testing
- [ ] Backend testing
- [ ] Corruption and restoration of DB testing
- [x] Extensive unit testing
- [x] Backend testing
- [x] Corruption and restoration of DB testing
- [ ] Support non-Flat indices for each backend
- [ ] `faiss` GPU support
- [ ] [`pgvectorscale`](https://github.com/timescale/pgvectorscale) support
Expand Down
23 changes: 15 additions & 8 deletions docs/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,11 +5,12 @@
The package currently supports:

- dynamic embedding model registration
- multiple embedding models can be stored in the respective backend
- model metadata is stored locally in SQLite (`metadata.db`)
- multiple embedding models can be tracked per backend and index type
- embedding and lookup for OMOP concepts
- supports various backends with a PostgreSQL linker
- supports various storage backends
- [pgvector](https://github.com/pgvector/pgvector): storage in the original OMOP database
- [FAISS](https://github.com/facebookresearch/faiss): efficient storage on disk for low-RAM applications
- [FAISS](https://github.com/facebookresearch/faiss): on-disk vector storage and index files
- Extension to [`omop-alchemy`](https://AustralianCancerDataNetwork.github.io/OMOP_Alchemy/) to support new tables
- CLI scripts to add embeddings to an already existing OMOP CDM

Expand All @@ -18,7 +19,7 @@ The package currently supports:
Install the backend you actually want to use:

```bash
pip install "omop-emb[postgres]"
pip install "omop-emb[pgvector]"
pip install "omop-emb[faiss]"
pip install "omop-emb[all]"
```
Expand All @@ -28,15 +29,21 @@ A plain `pip install omop-emb` installs only the shared core package.
At runtime, backend choice should also be explicit. The intended direction is:

- install-time choice via extras
- runtime choice via config such as `OMOP_EMB_BACKEND=postgres` or `OMOP_EMB_BACKEND=faiss` or passing it as an argument to the respective interface (e.g. see [CLI reference](usage/cli.md))
- runtime choice via config such as `OMOP_EMB_BACKEND=pgvector` or `OMOP_EMB_BACKEND=faiss`, or by passing the backend as an argument to the respective interface (see, for example, the [CLI reference](usage/cli.md))

Recommended runtime environment variables:

- **`OMOP_EMB_BACKEND`**: `pgvector` or `faiss`
- **`OMOP_EMB_BASE_STORAGE_DIR`**: base directory for local metadata and FAISS artifacts; defaults to `omop_emb/.omop_emb` in the root directory of the package.
- **`OMOP_DATABASE_URL`**: OMOP CDM database URL

!!! info "Important caveats"

- `omop-emb` depends on an OMOP PostgreSQL database for storage of embeddings (pgvector) or to keep track of already embedded concepts.
!!! info "Important caveats"
- `omop-emb` depends on OMOP CDM database access for concept metadata and filtering.
    - Current operational and test coverage is PostgreSQL-focused. Support for other databases is planned.


## Documentation overview
- [Installation](usage/installation.md)
- [Backend Selection](usage/backend-selection.md)
- [Embedding storage backends](usage/backend-selection.md)
- [CLI Reference](usage/cli.md)
25 changes: 16 additions & 9 deletions docs/usage/backend-selection.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Backend Selection
# Backend Selection for embeddings

`omop-emb` now has a backend abstraction layer so embedding storage and
retrieval can be selected explicitly instead of being inferred implicitly from
Expand All @@ -8,10 +8,11 @@ whatever happens to be installed.

The current backend factory recognizes:

- `pgvector`: The [pgvector](https://github.com/pgvector/pgvector) extension to a standard postgres database to store embeddings directly in the database.
- `pgvector`: The [pgvector](https://github.com/pgvector/pgvector) extension to a standard PostgreSQL database to store embeddings directly in the database.
- `faiss`: The [FAISS](https://github.com/facebookresearch/faiss) storage solution for on-disk storage.

The default backend name is currently `postgres`.
There is no implicit default backend name. You must pass one explicitly or set
`OMOP_EMB_BACKEND`.
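
The "no implicit default" rule can be sketched in a few lines. This is an illustrative helper only: `resolve_backend_name` and `SUPPORTED_BACKENDS` are hypothetical names, not part of the `omop-emb` API, and the real factory may validate differently.

```python
import os
from typing import Mapping, Optional

# Backend names taken from the documentation above.
SUPPORTED_BACKENDS = ("pgvector", "faiss")


def resolve_backend_name(
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Hypothetical helper: explicit argument first, then OMOP_EMB_BACKEND, else fail."""
    env = os.environ if env is None else env
    name = explicit or env.get("OMOP_EMB_BACKEND")
    if not name:
        # No silent fallback: the caller has to choose a backend.
        raise ValueError(
            "No backend selected: pass one explicitly or set OMOP_EMB_BACKEND."
        )
    if name not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"Unknown backend {name!r}; expected one of {SUPPORTED_BACKENDS}."
        )
    return name
```

Failing loudly when nothing is configured is what prevents a silent switch between backend implementations.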

## Runtime selection

Expand All @@ -23,20 +24,26 @@ The intended pattern is:
Examples:

```bash
export OMOP_EMB_BACKEND=postgres
export OMOP_EMB_BACKEND=pgvector
export OMOP_EMB_BACKEND=faiss
export OMOP_EMB_BASE_STORAGE_DIR=$PWD/.omop_emb
```

You can also pass the backend name directly in Python.

Storage directory behavior:

- If `OMOP_EMB_BASE_STORAGE_DIR` is unset and no explicit path is passed, `omop-emb` defaults to `./.omop_emb` in the current working directory.
- If a path includes `~`, it is expanded (for example `~/.omop_emb`).
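
The lookup order described here (explicit path, then `OMOP_EMB_BASE_STORAGE_DIR`, then `./.omop_emb`) might look roughly like the sketch below. `resolve_storage_dir` is a hypothetical name used for illustration and is not taken from the package.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_storage_dir(
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Path:
    """Hypothetical helper mirroring the documented storage-directory rules."""
    env = os.environ if env is None else env
    raw = explicit or env.get("OMOP_EMB_BASE_STORAGE_DIR")
    if raw:
        # Expand a leading "~" to the user's home directory.
        return Path(raw).expanduser()
    # Documented default: ./.omop_emb in the current working directory.
    return Path.cwd() / ".omop_emb"
```

Because the default is relative to the current working directory, running the same command from two different directories would use two different metadata stores unless the variable or an explicit path is set.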

## Python factory

The backend factory lives in `omop_emb.backends`:

```python
from omop_emb.backends import get_embedding_backend

backend = get_embedding_backend("postgres")
backend = get_embedding_backend("pgvector")
backend = get_embedding_backend("faiss")
```

Expand Down Expand Up @@ -78,10 +85,10 @@ At the moment:
- the backend abstraction and backend factory exist
- PostgreSQL and FAISS backend classes exist
- the production CLI path still targets the PostgreSQL embedding workflow
- PostgreSQL-specific embedding dependencies are optional, but a database
backend is still required for OMOP access and model registration
- model registration is intended to remain shared and database-backed even when
FAISS is used for vector storage and retrieval
- PostgreSQL-specific embedding dependencies are optional, but OMOP database
access is still required for concept metadata
- model registration metadata is stored locally in SQLite (`metadata.db`) under
`OMOP_EMB_BASE_STORAGE_DIR`
- database backends other than PostgreSQL have not yet been tested

So this page documents the selection model and Python interface shape now, even
Expand Down
130 changes: 122 additions & 8 deletions docs/usage/cli.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,14 +11,17 @@ At present, the production CLI path is PostgreSQL-oriented and stores embeddings

## Prerequisites

- **Installation**: install the PostgreSQL backend dependencies:
- **Installation**: install the backend dependencies you plan to use:

```bash
pip install "omop-emb[postgres]"
pip install "omop-emb[pgvector]"
# or
pip install "omop-emb[faiss]"
```

- **Database**: Postgres implementation of OMOP CDM. See [`omop-graph` documentation](reference-missing) for information how to setup.
- **Environment**: `OMOP_DATABASE_URL` must be exported or existing in the .env file (e.g., `postgresql://user:pass@localhost:5432/omop`).
- **Database**: PostgreSQL implementation of OMOP CDM. See the [`omop-graph` documentation](https://AustralianCancerDataNetwork.github.io/omop-graph) for information on how to set it up.
- **Environment**: `OMOP_DATABASE_URL` must be exported or present in `.env` (e.g., `postgresql://user:pass@localhost:5432/omop`).
- **Backend config**: set `OMOP_EMB_BACKEND` (`pgvector` or `faiss`) and optionally `OMOP_EMB_BASE_STORAGE_DIR`.
- **Connectivity**: Access to an OpenAI-compatible embeddings endpoint. *Currently only Ollama is supported*.

!!! note "Backend Scope"
Expand All @@ -37,8 +40,6 @@ omop-emb add-embeddings --api-base <URL> --api-key <KEY> [OPTIONS]
where `[OPTIONS]` are optional arguments that can be specified as described below.


### Command Options

### Command Options

| Option | Short | Type | Default | Description |
Expand All @@ -48,8 +49,121 @@ where `[OPTIONS]` are optional arguments that can be specified as described belo
| **`--index-type`** | | `IndexType` | `FLAT` | The storage index for the embeddings for retrieval. Currently supported: `FLAT`. |
| **`--batch-size`** | `-b` | `Integer` | `100` | Number of concepts to process in each chunk. |
| **`--model`** | `-m` | `String` | `text-embedding-3-small` | Name of the embedding model to use for generating vectors. |
| **`--backend`** | | `Literal['pgvector', 'faiss']` | `None` | Embedding backend to use (can be replaced by `OMOP_EMB_BACKEND` env var). Requires the respective backend installed using `pip install omop-emb[pgvector or faiss]` |
| **`--faiss-base-dir`** | | `String` | `None` | Optional base directory for FAISS backend storage. |
| **`--backend`** | | `Literal['pgvector', 'faiss']` | `None` | Embedding backend to use (can be replaced by `OMOP_EMB_BACKEND`). Requires the corresponding optional dependency. |
| **`--storage-base-dir`** | | `String` | `None` | Optional base directory for backend storage and local metadata registry (`metadata.db`). |
| **`--standard-only`** | | `Boolean` | `False` | If set, only generate embeddings for OMOP standard concepts (`standard_concept = 'S'`). |
| **`--vocabulary`** | | `List[String]` | `None` | Filter to embed concepts only from specific OMOP vocabularies. |
| **`--num-embeddings`** | `-n` | `Integer` | `None` | Limit the number of concepts processed (useful for testing). |

## Environment Variables

- `OMOP_DATABASE_URL`: OMOP CDM database connection string.
- `OMOP_EMB_BACKEND`: backend selector used when `--backend` is not provided.
- `OMOP_EMB_BASE_STORAGE_DIR`: local storage root for metadata and file-based artifacts. If unset, `omop-emb` defaults to `./.omop_emb` in the current working directory.

Paths that include `~` are expanded automatically.

---

## `export-pgvector`

Export pgvector embedding tables to CSV files plus a manifest so they can be restored later.

### Usage
```bash
omop-emb export-pgvector --output-dir <SNAPSHOT_DIR> [OPTIONS]
```

### Options

| Option | Short | Type | Default | Description |
| :--- | :--- | :--- | :--- | :--- |
| **`--output-dir`** | `-o` | `String` | **Required** | Directory where snapshot files are written. |
| **`--storage-base-dir`** | | `String` | `None` | Optional path to local metadata registry (`metadata.db`). If unset, falls back to `OMOP_EMB_BASE_STORAGE_DIR`, otherwise defaults to `./.omop_emb` in the current working directory. Paths with `~` are expanded. |
| **`--model`** | `-m` | `List[String]` | `None` | Optional model-name filter. Repeat to export specific models only. |
| **`--index-type`** | | `IndexType` | `None` | Optional index-type filter. |

### Output

The command writes:

- `manifest.json`: snapshot metadata and table mapping
- One CSV per embedding table named `<storage_identifier>.csv`
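
A quick sanity check of a snapshot directory can be written without knowing the manifest schema. The sketch below only relies on the file layout listed above; `list_snapshot_tables` is a hypothetical helper, and it deliberately does not parse `manifest.json` because its fields are not documented here.

```python
from pathlib import Path
from typing import List, Union


def list_snapshot_tables(snapshot_dir: Union[str, Path]) -> List[str]:
    """Return the CSV base names found in a snapshot directory.

    Hypothetical helper: checks that manifest.json is present and lists
    the <storage_identifier>.csv files next to it.
    """
    root = Path(snapshot_dir)
    if not (root / "manifest.json").is_file():
        raise FileNotFoundError(f"manifest.json not found in {root}")
    # Each CSV is named after the storage identifier of one embedding table.
    return sorted(path.stem for path in root.glob("*.csv"))
```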

---

## `import-pgvector`

Restore pgvector embedding tables from files previously created by `export-pgvector`.

### Usage
```bash
omop-emb import-pgvector --input-dir <SNAPSHOT_DIR> [OPTIONS]
```

### Options

| Option | Short | Type | Default | Description |
| :--- | :--- | :--- | :--- | :--- |
| **`--input-dir`** | `-i` | `String` | **Required** | Directory containing `manifest.json` and CSV files. |
| **`--storage-base-dir`** | | `String` | `None` | Optional path to local metadata registry (`metadata.db`). If unset, falls back to `OMOP_EMB_BASE_STORAGE_DIR`, otherwise defaults to `./.omop_emb` in the current working directory. Paths with `~` are expanded. |
| **`--replace`** | | `Boolean` | `False` | If set, truncate destination embedding tables before import. |
| **`--batch-size`** | `-b` | `Integer` | `5000` | Number of rows inserted per SQL batch. |

### Notes

- Import re-registers pgvector models into local metadata before loading rows.
- Import uses upsert semantics (`ON CONFLICT (concept_id) DO UPDATE`) unless `--replace` is set.
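
To illustrate what upsert semantics mean for repeated imports, here is a small self-contained sketch. It uses SQLite (version 3.24 or newer) rather than PostgreSQL so it runs anywhere; the table `demo_embeddings` and the helper `import_rows` are made up for the example and do not reflect the actual pgvector schema or import code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo_embeddings ("
    "concept_id INTEGER PRIMARY KEY, embedding TEXT NOT NULL)"
)

# Same conflict clause as described above: an existing concept_id is updated.
UPSERT = """
INSERT INTO demo_embeddings (concept_id, embedding) VALUES (?, ?)
ON CONFLICT (concept_id) DO UPDATE SET embedding = excluded.embedding
"""


def import_rows(rows, replace=False):
    """Illustrative import: optionally empty the table, then upsert all rows."""
    if replace:
        # Stand-in for truncating the destination table (SQLite has no TRUNCATE).
        conn.execute("DELETE FROM demo_embeddings")
    conn.executemany(UPSERT, rows)
    conn.commit()


import_rows([(1, "[0.1, 0.2]"), (2, "[0.3, 0.4]")])
import_rows([(2, "[0.9, 0.9]")])  # updates concept 2, leaves concept 1 untouched
```

Without `--replace`, rows already in the destination that are absent from the snapshot stay in place; with it, the table ends up containing only the snapshot rows.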

---

## `migrate-legacy-pgvector-registry`

Migrate legacy pgvector registry rows from a source database table into the local metadata registry (`metadata.db`).

This command is intended for compatibility with older setups that kept registry metadata in the database instead of the local metadata store.

### Usage
```bash
omop-emb migrate-legacy-pgvector-registry [OPTIONS]
```

### Options

| Option | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| **`--storage-base-dir`** | `String` | `None` | Optional path to local metadata registry location. If unset, falls back to `OMOP_EMB_BASE_STORAGE_DIR`, otherwise defaults to `./.omop_emb` in the current working directory. |
| **`--source-database-url`** | `String` | `OMOP_DATABASE_URL` | Source database URL containing the legacy registry table. |
| **`--legacy-table`** | `String` | `model_registry` | Name of the legacy registry table in the source database. |
| **`--dry-run`** | `Boolean` | `False` | Show what would be migrated without writing changes. |
| **`--drop-legacy-registry`** | `Boolean` | `False` | Drop the legacy table after successful migration. |

### Recommended Migration Flow

1. Validate what will migrate:

```bash
omop-emb migrate-legacy-pgvector-registry --dry-run
```

2. Run the migration:

```bash
omop-emb migrate-legacy-pgvector-registry
```

3. Optionally remove the legacy table after verification:

```bash
omop-emb migrate-legacy-pgvector-registry --drop-legacy-registry
```

### Field Mapping

The migration command supports these legacy field names when reading rows:

- model name: `model_name`
- dimensions: `dimensions`
- index type: `index_type` (fallback: `index_method`)
- storage identifier: `storage_identifier` (fallback: `table_name`)
- metadata: `details` (fallback: `metadata`)
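
The fallback rules in this list amount to a "first available key wins" lookup. A minimal sketch, assuming legacy rows arrive as plain dictionaries; `FIELD_CANDIDATES` and `normalize_legacy_row` are hypothetical names and not the migration command's actual code:

```python
from typing import Any, Dict, Mapping, Tuple

# Target field -> legacy column names to try, in order of preference.
FIELD_CANDIDATES: Dict[str, Tuple[str, ...]] = {
    "model_name": ("model_name",),
    "dimensions": ("dimensions",),
    "index_type": ("index_type", "index_method"),
    "storage_identifier": ("storage_identifier", "table_name"),
    "details": ("details", "metadata"),
}


def normalize_legacy_row(row: Mapping[str, Any]) -> Dict[str, Any]:
    """Map a legacy registry row onto the current field names."""
    normalized: Dict[str, Any] = {}
    for target, candidates in FIELD_CANDIDATES.items():
        # Take the first candidate column that is present and not None.
        normalized[target] = next(
            (row[key] for key in candidates if row.get(key) is not None),
            None,
        )
    return normalized
```

A row that only has the older column names (`index_method`, `table_name`, `metadata`) is mapped to the same shape as a row that already uses the current names.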
21 changes: 8 additions & 13 deletions docs/usage/installation.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@ requires database-backed OMOP access and model registration.
## PostgreSQL backend

```bash
pip install "omop-emb[postgres]"
pip install "omop-emb[pgvector]"
```

Use this when you want the current pgvector/PostgreSQL-backed embedding store
Expand All @@ -33,7 +33,7 @@ pip install "omop-emb[faiss]"
Use this when you want the FAISS backend dependencies available.

Even in this case, a database backend is still required for OMOP concept
metadata and model registration.
metadata access.

## Everything

Expand All @@ -52,20 +52,15 @@ install-time.
Examples:

```bash
export OMOP_EMB_BACKEND=postgres
export OMOP_EMB_BACKEND=pgvector
export OMOP_EMB_BACKEND=faiss
export OMOP_EMB_BASE_STORAGE_DIR=$PWD/.omop_emb
```

That avoids silent fallback between backend implementations.

## Current database support caveat
`OMOP_EMB_BASE_STORAGE_DIR` controls where `omop-emb` stores local metadata
(`metadata.db`) and file-based backend artifacts (such as FAISS files).
If it is not set, `omop-emb` defaults to `./.omop_emb` in the current working directory.
If a provided path includes `~`, it is expanded automatically.

PostgreSQL-specific embedding dependencies are now optional, but the broader
system has not yet been tested against non-PostgreSQL database backends.

So the current position is:

- PostgreSQL embedding infrastructure is optional
- a database backend is still always required
- database backends other than PostgreSQL should currently be treated as
unverified
1 change: 1 addition & 0 deletions mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,7 @@ nav:
- Home: index.md
- Getting Started:
- Installation: usage/installation.md
- Embedding Storage: usage/backend-selection.md
- "CLI Reference": usage/cli.md

plugins:
Expand Down