Initial Support for SDF Files

Trent Hauck
Developer

It's now possible to read SDF files with Exon and BioBear.

SDF Files

The "SDF" in SDF files stands for "Structure Data File". It's sorta like "ATM machine": the "file" is already in the name.

As you might imagine, this file format contains chemical structures and additional data. There may be multiple such records in a single file. For example, here is a record from the ChEMBL database:

CHEMBL153534
RDKit 2D

16 17 0 0 0 0 0 0 0 0999 V2000
7.6140 -22.2702 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.7047 -23.1991 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.1806 -22.5282 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
6.9604 -22.7690 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.8790 -23.2163 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
8.2791 -21.1119 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
7.5280 -21.4445 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
8.4225 -22.4364 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
8.8353 -21.7198 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.2035 -23.8527 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
4.0534 -23.2163 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.9776 -23.5889 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.6406 -22.4938 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3.6406 -23.9215 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
8.4397 -20.3035 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
9.6552 -21.6280 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2 3 2 0
3 4 1 0
4 1 1 0
5 2 1 0
6 7 1 0
7 1 2 0
8 1 1 0
9 8 2 0
10 12 1 0
11 5 2 3
12 4 2 0
13 11 1 0
14 11 1 0
15 6 1 0
16 9 1 0
6 9 1 0
10 2 1 0
M END
> <chembl_id>
CHEMBL153534

$$$$

This record starts with header information, followed by the atom and bond information, then any additional data (chembl_id in this case). The record ends with $$$$.
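
To make that layout concrete, here is a minimal sketch in plain Python that splits one record into the parts described above. It is not how Exon parses SDF files; the function name `parse_sdf_record` and the two-atom toy record are made up for illustration, and the sketch assumes a V2000 record whose counts line uses the standard fixed-width columns (three characters each for the atom and bond counts).

```python
def parse_sdf_record(record: str) -> dict:
    """Split one V2000 SDF record into header, counts, and data fields.

    Hypothetical helper for illustration only, not part of Exon or BioBear.
    """
    lines = record.splitlines()

    # The first three lines form the header block.
    header = "\n".join(lines[:3])

    # The fourth line is the counts line: atom and bond counts are
    # fixed-width fields of three characters each.
    counts = lines[3]
    atom_count = int(counts[0:3])
    bond_count = int(counts[3:6])

    # Skip the atom and bond blocks, then scan for "> <name>" data items
    # until the "$$$$" record terminator.
    data = {}
    i = 4 + atom_count + bond_count
    while i < len(lines) and not lines[i].startswith("$$$$"):
        line = lines[i]
        i += 1
        if not line.startswith(">") or "<" not in line:
            continue  # property lines such as "M  END", or blank lines
        name = line[line.index("<") + 1 : line.rindex(">")]
        values = []
        # A data item's value runs until the next blank line.
        while i < len(lines) and lines[i].strip() and not lines[i].startswith("$$$$"):
            values.append(lines[i])
            i += 1
        data[name] = "\n".join(values)

    return {
        "header": header,
        "atom_count": atom_count,
        "bond_count": bond_count,
        "data": data,
    }


# A toy two-atom record (not real ChEMBL data) with properly spaced columns.
TOY_RECORD = "\n".join([
    "toy_molecule",
    "  example",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.5000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "M  END",
    "> <chembl_id>",
    "EXAMPLE1",
    "",
    "$$$$",
])

parsed = parse_sdf_record(TOY_RECORD)
print(parsed["atom_count"], parsed["bond_count"], parsed["data"])
```

The returned keys mirror the pieces named in the paragraph above: the header, the atom and bond counts, and the additional data.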

Reading SDF Files

Depending on which library you're using, you have different options for working with SDF files.

BioBear

If you're working with BioBear, install the latest version from PyPI:

pip install -U biobear

Then you can read an SDF file like this:

from biobear import new_session
from pathlib import Path

session = new_session()

sdf_file = Path("data/chembl_34.sdf.gz")
result = session.read_sdf_file(sdf_file.as_posix()).to_polars()
print(len(result))

# Output:
# 2409270
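
If you want to sanity-check a record count without any library, one option is to count the $$$$ terminator lines directly. The sketch below uses only the Python standard library; `count_sdf_records` is a hypothetical helper, and it assumes every record in the file ends with a $$$$ line.

```python
import gzip


def count_sdf_records(path: str) -> int:
    """Count records in a gzip-compressed SDF file by counting `$$$$` lines.

    Hypothetical helper for illustration; assumes each record is
    terminated by a line starting with `$$$$`.
    """
    count = 0
    # Stream the file line by line so large files are never held in memory.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.startswith("$$$$"):
                count += 1
    return count
```

Calling `count_sdf_records("data/chembl_34.sdf.gz")` on the same file should agree with the count above if the assumption holds.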

Exon

If you're working with Exon, you'll use SQL. First register the table:

CREATE EXTERNAL TABLE sdf
STORED AS SDF
LOCATION 'data/chembl_34.sdf.gz'
OPTIONS (compression 'gzip');

Then you can query the table:

SELECT COUNT(*) FROM sdf;
-- Output: 2409270

File Schema

An SDF table has the following schema:

Column Name | Type | Description
header | Utf8 | The header of the record
atom_count | UInt32 | The number of atoms in the record
bond_count | UInt32 | The number of bonds in the record
data | Struct | Additional data in the record

The schema of the data column is inferred from the first record in the file. So for example, given the SDF file above, the data column would have the following schema:

Column Name | Type | Description
chembl_id | Utf8 | The chembl_id field from the record

That field could then be accessed like this:

SELECT data.chembl_id FROM sdf LIMIT 1;
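
To illustrate what inferring the schema from the first record can mean in practice, here is a rough plain-Python sketch. It is not Exon's implementation: the helper names are made up, the molfile blocks in the toy input are trimmed for brevity, and the choice to drop fields that are absent from the first record (and to fill missing ones with None) is an assumption about how such a scheme might behave, not something this post confirms.

```python
import re

# Matches data header lines such as "> <chembl_id>".
DATA_HEADER = re.compile(r"^>.*<([^>]+)>")


def split_records(text: str) -> list[str]:
    """Split SDF text into records on `$$$$` terminator lines."""
    records, current = [], []
    for line in text.splitlines():
        if line.startswith("$$$$"):
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records


def data_fields(record: str) -> dict[str, str]:
    """Collect the `> <name>` data items of one record."""
    fields: dict[str, str] = {}
    name = None
    for line in record.splitlines():
        match = DATA_HEADER.match(line)
        if match:
            name = match.group(1)
            fields[name] = ""
        elif name is not None:
            if line.strip():
                fields[name] = f"{fields[name]}\n{line}" if fields[name] else line
            else:
                name = None  # a blank line ends the current data item
    return fields


def infer_schema(records: list[str]) -> list[str]:
    """Take the field names of the first record as the schema."""
    return list(data_fields(records[0]))


def project(record: str, schema: list[str]) -> dict:
    """Keep only schema fields; missing ones become None (an assumption)."""
    fields = data_fields(record)
    return {name: fields.get(name) for name in schema}


# Toy input, molfile blocks trimmed; the second record has an extra field.
SDF_TEXT = """\
mol_a

M  END
> <chembl_id>
ID_A

$$$$
mol_b

M  END
> <chembl_id>
ID_B

> <note>
extra field

$$$$
"""

records = split_records(SDF_TEXT)
schema = infer_schema(records)
rows = [project(record, schema) for record in records]
print(schema, rows)
```

Under these assumptions, a field such as `note` that only appears in later records never becomes part of the schema, so it is worth checking that the first record of a file carries every field you care about.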

What's Missing?

As mentioned, this is an initial implementation. There are some things missing which we plan to add in the future:

  • There is lots of room for optimization
  • Support for canonicalization of SMILES
  • Support for generating SMILES from coordinates
  • Support for writing SDF files
  • Functions for chemical descriptors

If you have any feedback or suggestions, please let us know.