A small container to get an OMOP CDM database running quickly, with support for both PostgreSQL and SQL Server.
Drop your data into data/, and run the container.
You can configure the container or CLI using the following environment variables:
- DB_HOST: The hostname of the database. Default is db.
- DB_PORT: The port number of the database. Default is 5432.
- DB_USER: The username for the database. Default is postgres.
- DB_PASSWORD: The password for the database. Default is password.
- DB_NAME: The name of the database. Default is omop.
- DIALECT: The type of database to use. Default is postgresql, but can also be mssql.
- SCHEMA_NAME: The name of the schema to be created/used in the database. Default is public.
- DATA_DIR: The directory containing the data CSV files. Default is data.
- SYNTHETIC: Load synthetic data (boolean). Default is false.
- SYNTHETIC_NUMBER: Size of synthetic data, 100 or 1000. Default is 100.
- DELIMITER: The delimiter used to separate data. Default is tab, but can also be a comma (,).
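As an illustration of overriding these defaults, the sketch below writes an env file that points omop-lite at a SQL Server instance. The variable names come from the list above; every value (the host name sqlserver, port 1433, the sa user, the password, and the dbo schema) and the file name omop-lite.env are placeholders chosen for this example, not settings shipped with the project.

```shell
# Write a hypothetical env file; all values below are placeholders.
cat > omop-lite.env <<'EOF'
DB_HOST=sqlserver
DB_PORT=1433
DB_USER=sa
DB_PASSWORD=change-me
DB_NAME=omop
DIALECT=mssql
SCHEMA_NAME=dbo
EOF
```

Docker reads one KEY=value pair per line from such a file, so it could then be passed to the container with docker run --env-file omop-lite.env -v ./data:/data ghcr.io/health-informatics-uon/omop-lite, assuming the database host is reachable from the container.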
pip install omop-lite
python omop-lite --help
docker run -v ./data:/data ghcr.io/health-informatics-uon/omop-lite
# docker-compose.yml
services:
  omop-lite:
    image: ghcr.io/health-informatics-uon/omop-lite
    volumes:
      - ./data:/data
    depends_on:
      - db
  db:
    image: postgres:latest
    environment:
      - POSTGRES_DB=omop
      - POSTGRES_PASSWORD=password
    ports:
      - "5432:5432"
To install using Helm:
# Install the chart from the OCI registry
helm install omop-lite oci://ghcr.io/health-informatics-uon/charts/omop-lite --version 0.2.2
The Helm chart deploys OMOP Lite as a Kubernetes Job that creates an OMOP CDM in a database. You can customise the installation using a values file:
# values.yaml
env:
  dbHost: postgres
  dbPort: "5432"
  dbUser: postgres
  dbPassword: postgres
  dbName: omop_helm
  dialect: postgresql
  schemaName: public
  synthetic: "false"
Install with custom values:
helm install omop-lite omop-lite/omop-lite -f values.yaml
If you need synthetic data, some is provided in the synthetic directory. It provides a small amount of data to load quickly.
To load the synthetic data, run the container with the SYNTHETIC environment variable set to true. SYNTHETIC_NUMBER selects the dataset:
- 100 is fake data.
- 1000 is Synthea 1k data.
You can provide your own data for loading into the tables by placing your files in the data/ directory. This should contain .csv files matching the data tables (DRUG_STRENGTH.csv, CONCEPT.csv, etc.).
To match the vocabulary files from Athena, this data should be tab-separated, but saved with a .csv file extension.
You can override the delimiter with the DELIMITER environment variable.
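Before loading, it can help to confirm that your files really are tab-separated. The following is a minimal sketch, not part of omop-lite: check_tab_delimited is a hypothetical helper that looks only at the header row of each .csv file in a directory and reports any file whose header contains no tab character. It assumes every table has more than one column.

```shell
# Hypothetical helper: flag .csv files whose header row contains no tab.
check_tab_delimited() {
  dir="${1:-data}"   # directory to scan; defaults to data
  rc=0
  for f in "$dir"/*.csv; do
    [ -e "$f" ] || continue   # skip the unexpanded glob when no files match
    if ! head -n 1 "$f" | grep -q "$(printf '\t')"; then
      echo "not tab-separated: $f"
      rc=1
    fi
  done
  return "$rc"
}
```

Calling check_tab_delimited data returns a non-zero status if any file is flagged, which suggests either converting that file or setting DELIMITER to match it.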
Adding a tsvector column to the concept table and an index on that column makes full-text search queries on the concept table run much faster.
This can be configured by setting FTS_CREATE to be non-empty in the environment.
Postgres does vector search too!
To enable this on omop-lite, you can use the compose-omop-ts.yml compose file with
docker compose -f compose-omop-ts.yml
To do this, you need to have embeddings/embeddings.parquet, containing concept_ids and embeddings.
This uses pgvector to create an embeddings table.
If you're a developer and want to iterate on omop-lite quickly, there's a small subset of the vocabularies sufficient to build in synthetic/.
If you wish to test the vector search, there are matching embeddings in embeddings/embeddings.parquet.