An RDKit extension for DuckDB

This repository is based on https://github.com/duckdb/extension-template. Check that template out if you want to build and ship your own DuckDB extension.

This extension, duckdb_rdkit, integrates RDKit into DuckDB so that you can do cheminformatics work directly in DuckDB.

Currently supported functionality:

Types

Mol: the internal duckdb_rdkit representation of an RDKit molecule.
- Currently only SMILES can be converted to Mol. This can be done with mol_from_smiles, or by a cast (e.g. inserting a SMILES string into a column that expects Mol, or writing 'CC'::mol).

Important

The duckdb_rdkit molecule representation carries additional metadata, so RDKit cannot read it directly and you will get an error if you try. Use mol_to_rdkit_mol to convert the duckdb_rdkit molecule representation into one that is RDKit compatible.

File formats

SDF

There are two ways to query .sdf files with SQL. Either can be used to extract, transform, and load data into a DuckDB file for faster subsequent queries, or to query the SDF directly to explore the data.
- read_sdf('path/to/sdf/file', COLUMNS={column_name: LogicalType}): with the read_sdf function, the properties of interest in the SDF file are defined explicitly. If a record does not have a specified property, a NULL value is returned for it. The 'Mol' type tells the extension to extract the molecules in the records and return them.
  - Example: SELECT * FROM read_sdf('path/to/file', COLUMNS={desired_col: 'VARCHAR', mol: 'Mol'});
- Automatic detection of SDF files: when a query targets a file with the .sdf extension, the query is executed against that file.
  
  In this case, the extension guesses the schema. If the schema is not homogeneous, automatic detection may miss certain properties in the SDF, so it is better to use the read_sdf function to make sure the property of interest is extracted. This is not a problem if the schema is uniform throughout the SDF file.
  
  The molecule column is named mol.
  - Example: SELECT mol, id FROM 'test.sdf';
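
When the set of properties varies between scripts, it can be convenient to generate the read_sdf call instead of typing it by hand. The following is a minimal sketch, not part of the extension: build_read_sdf_query is a hypothetical helper name, it assumes that property names are plain identifiers needing no quoting, and it only builds the SQL text (running it still requires a connection with duckdb_rdkit loaded).

```python
def build_read_sdf_query(sdf_path, columns):
    """Return a SELECT over read_sdf with an explicit COLUMNS struct.

    `columns` maps property names to type names, e.g. {"mol": "Mol"}.
    Hypothetical helper: it only formats SQL text, it does not run it.
    """
    if not columns:
        raise ValueError("at least one column is required")
    # Single quotes inside a SQL string literal are escaped by doubling them.
    quoted_path = sdf_path.replace("'", "''")
    column_spec = ", ".join(
        f"{name}: '{type_name}'" for name, type_name in columns.items()
    )
    return f"SELECT * FROM read_sdf('{quoted_path}', COLUMNS={{{column_spec}}});"


query = build_read_sdf_query("test.sdf", {"id": "VARCHAR", "mol": "Mol"})
print(query)
```

The resulting string can then be passed to con.sql(...) in the Python client, or pasted into the CLI.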

Searches

is_exact_match(mol1, mol2): exact structure search. Returns true if the two molecules are the same. Chirality-sensitive matching is not enabled.
- Note: if you need very specific exact-match behavior with regard to stereochemistry or tautomers, the RegistrationHash (https://rdkit.org/docs/source/rdkit.Chem.RegistrationHash.html) might be an option to consider. You would need to write the hashes to your DB, and then you can do a simple VARCHAR-based search on those columns.
is_substruct(mol1, mol2): returns true if mol2 is a substructure of mol1.

Molecule conversion functions

mol_from_smiles(SMILES): returns a molecule for a SMILES string. Returns NULL if a molecule cannot be made from the SMILES.
mol_to_smiles(mol): returns the SMILES string for an RDKit molecule
mol_to_rdkit_mol(mol): returns the binary RDKit molecule in hexadecimal representation
- duckdb_rdkit has its own binary representation of molecules, which differs from RDKit’s format. Use this function to extract a molecule from duckdb_rdkit and convert it into a format compatible with RDKit. The returned value can be passed to RDKit's Chem.Mol function for further processing in Python.
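
As a rough illustration of the Python side of that hand-off, the sketch below decodes the value and passes it to Chem.Mol. It rests on an assumption that is not confirmed by this README: that the value reaches Python as a plain string of hexadecimal digits with no prefix. Both function names are hypothetical, and RDKit is imported lazily so that the decoding step can be checked without RDKit installed.

```python
def rdkit_pickle_from_hex(hex_string):
    """Decode hex text into raw bytes.

    Assumption: the value is a plain string of hex digits (no '\\x' or '0x'
    prefix); adjust the decoding if your client returns something else.
    """
    return bytes.fromhex(hex_string.strip())


def rdkit_mol_from_hex(hex_string):
    """Build an RDKit molecule from the hex text; requires RDKit."""
    from rdkit import Chem  # lazy import: only needed for this step

    # Chem.Mol accepts RDKit's binary molecule format as bytes.
    return Chem.Mol(rdkit_pickle_from_hex(hex_string))
```

With RDKit installed, rdkit_mol_from_hex(value) should give a molecule that the usual RDKit functions accept, provided the assumption about the encoding holds.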

Molecule descriptors

mol_logp(mol): returns the Wildman-Crippen LogP estimate for a molecule
mol_exactmw(mol): returns the exact molecular weight
mol_amw(mol): returns the approximate molecular weight
mol_tpsa(mol): returns the topological polar surface area
mol_hba(mol): returns the number of H-bond acceptors
mol_hbd(mol): returns the number of H-bond donors
mol_num_rotatable_bonds(mol): returns the number of rotatable bonds
mol_qed(mol): returns the quantitative estimate of drug-likeness (QED) of the molecule
- Currently this only implements the "mean weight" of the ADS parameters from the paper "Quantifying the chemical beauty of drugs" by Bickerton et al.
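
Several descriptors are often computed in one pass over a table. The sketch below generates such a query from the function names listed above. It is only an illustration: descriptor_query is a hypothetical helper, and the table name molecules and the column name mol in the example are assumptions about your schema.

```python
# Descriptor functions documented in this README.
DESCRIPTOR_FUNCTIONS = (
    "mol_logp", "mol_exactmw", "mol_amw", "mol_tpsa",
    "mol_hba", "mol_hbd", "mol_num_rotatable_bonds", "mol_qed",
)


def descriptor_query(table, descriptors=DESCRIPTOR_FUNCTIONS, mol_column="mol"):
    """Return SQL that computes each requested descriptor as its own column."""
    unknown = [name for name in descriptors if name not in DESCRIPTOR_FUNCTIONS]
    if unknown:
        raise ValueError(f"unknown descriptor functions: {unknown}")
    # Alias each column by dropping the "mol_" prefix, e.g. mol_logp -> logp.
    select_list = ", ".join(
        f"{name}({mol_column}) AS {name[len('mol_'):]}" for name in descriptors
    )
    return f"SELECT {select_list} FROM {table};"


print(descriptor_query("molecules", ["mol_logp", "mol_tpsa"]))
```

Restricting the names to a fixed tuple keeps typos from reaching the database as confusing "function not found" errors.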

Getting started

Unfortunately, I haven't found a way to make installing the duckdb_rdkit extension as easy as the INSTALL and LOAD workflow that other duckdb extensions may offer.

I have only been able to successfully compile and test the extension on linux_amd64 and osx_arm64.

You can download the binaries from the releases, or build the extension from source. The compiled duckdb_rdkit binary is not signed, so your OS may warn you about running unverified applications.

See the Building section and the Running the extension section below.

Building

Building duckdb_rdkit

First, clone this repository with --recurse-submodules to pull in the duckdb and extension-ci-tools submodules:

git clone --recurse-submodules https://github.com/bodowd/duckdb_rdkit.git

To build the extension, you need to have RDKit installed. The instructions below are derived from this post on the RDKit blog. The easiest way to install RDKit is with conda, and I used miniforge.

After installing conda, you can create a new conda environment and then install the packages needed. linux_conda_env.yml or osx_conda_env.yml can be used to create a conda environment for building the extension.

# create a conda env, activate it, and then install the build dependencies into it:
conda create -n rdkit_dev
conda activate rdkit_dev
# or use osx_conda_env.yml if you are on OSX
conda env update -f linux_conda_env.yml

After installing the prerequisite software, you can run:

GEN=ninja make

This compiles duckdb and the extension. You will find the build output in the build folder.

For further information on building duckdb from source, you can visit https://duckdb.org/docs/dev/building/overview.html

Running the extension

In the CLI

If you want to run the duckdb binary that you built from source in this repository, just run ./build/release/duckdb. It already has the extension loaded.

If you downloaded the compiled binaries from the releases, you will need to tell duckdb where to find the RDKit shared object files. Otherwise, you may see an error like this:

./duckdb: error while loading shared libraries: libRDKitDescriptors.so.1: cannot open shared object file: No such file or directory

If you have your conda env activated:

# LINUX
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
# OSX
export DYLD_LIBRARY_PATH=$CONDA_PREFIX/lib:$DYLD_LIBRARY_PATH

If you don't have your conda env activated, you will need to find where your installation has placed these files, for example ~/miniforge3/envs/my_rdkit_env/lib, and add that path to LD_LIBRARY_PATH on Linux or DYLD_LIBRARY_PATH on OSX.
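
If you launch duckdb from a script rather than an interactive shell, the same path adjustment can be made on a copy of the environment. The following is a minimal sketch using only the Python standard library. rdkit_library_env is a hypothetical helper name, and the lib directory in the usage note is the example path from above, not a location the extension requires.

```python
import os
import sys


def rdkit_library_env(lib_dir, platform=sys.platform, base_env=None):
    """Return a copy of the environment with lib_dir first on the loader path.

    Uses DYLD_LIBRARY_PATH on OSX ("darwin") and LD_LIBRARY_PATH otherwise.
    """
    env = dict(os.environ if base_env is None else base_env)
    variable = "DYLD_LIBRARY_PATH" if platform == "darwin" else "LD_LIBRARY_PATH"
    existing = env.get(variable, "")
    # Prepend so the RDKit libraries are found before any other copies.
    env[variable] = f"{lib_dir}:{existing}" if existing else lib_dir
    return env
```

For example, subprocess.run(["./duckdb", "-unsigned"], env=rdkit_library_env(os.path.expanduser("~/miniforge3/envs/my_rdkit_env/lib"))) would start the CLI with the adjusted path while leaving your shell environment untouched.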

If you want to use a different duckdb binary, one that does not already have the extension installed and loaded, you need to tell duckdb where the extension is and also allow it to run unsigned extensions.

Warning: I was not able to get the extension to run on the linux CLI binary downloaded from duckdb's website. That seems to have been compiled for linux_amd64_gcc4, and I was not successful compiling the extension for that.

Run duckdb with the -unsigned flag to allow unsigned extensions. More information here: https://duckdb.org/docs/extensions/overview.html#unsigned-extensions

duckdb -unsigned

Then load the extension with the path to the duckdb_extension file:

LOAD 'path/to/duckdb_rdkit.duckdb_extension';

Now confirm that the extension is working:

-- should return true
SELECT is_exact_match('C', 'C');

-- should return false
SELECT is_exact_match('C', 'CO');

In the python client

Warning: On Linux, I was unable to get the client I installed via pip to load the extension because it only seems to support loading extensions compiled for linux_amd64_gcc4. I was able to get it loaded in duckdb installed via conda though. See duckdb's website for more information.

See the duckdb documentation for instructions on installing the python client.

You may need to tell duckdb where to find the RDKit shared object files.

# LINUX
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
# OSX
export DYLD_LIBRARY_PATH=$CONDA_PREFIX/lib:$DYLD_LIBRARY_PATH

Then test it out:

import duckdb

# The extension binary is unsigned, so unsigned extensions must be allowed.
con = duckdb.connect(config={"allow_unsigned_extensions": "true"})
con.install_extension('/path/to/duckdb_rdkit.duckdb_extension')
con.load_extension('/path/to/duckdb_rdkit.duckdb_extension')

# should return true
print(con.sql("SELECT is_exact_match('C', 'C');").fetchall())
# should return false
print(con.sql("SELECT is_exact_match('C', 'CO');").fetchall())

Running the tests

Different tests can be created for DuckDB extensions. The primary way of testing DuckDB extensions should be the SQL tests in ./test/sql. These SQL tests can be run using:

make test
