Initial Support for SDF Files
It's now possible to read SDF files with Exon and BioBear.
SDF Files
The "SDF" in SDF files stands for "Structure Data File", so saying "SDF file" is redundant in the same way "ATM machine" is.
As you might imagine, this file format contains chemical structures and additional data. There may be multiple such records in a single file. For example, here is a record from the ChEMBL database:
CHEMBL153534
RDKit 2D
16 17 0 0 0 0 0 0 0 0999 V2000
7.6140 -22.2702 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.7047 -23.1991 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.1806 -22.5282 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
6.9604 -22.7690 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.8790 -23.2163 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
8.2791 -21.1119 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
7.5280 -21.4445 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
8.4225 -22.4364 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
8.8353 -21.7198 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.2035 -23.8527 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
4.0534 -23.2163 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.9776 -23.5889 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.6406 -22.4938 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3.6406 -23.9215 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
8.4397 -20.3035 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
9.6552 -21.6280 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2 3 2 0
3 4 1 0
4 1 1 0
5 2 1 0
6 7 1 0
7 1 2 0
8 1 1 0
9 8 2 0
10 12 1 0
11 5 2 3
12 4 2 0
13 11 1 0
14 11 1 0
15 6 1 0
16 9 1 0
6 9 1 0
10 2 1 0
M END
> <chembl_id>
CHEMBL153534
$$$$
This record starts with header information, followed by the atom and bond information, then any additional data (chembl_id in this case). The record ends with $$$$.
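To make the record layout concrete, here is a minimal pure-Python sketch that splits SDF text on the $$$$ delimiter and pulls out the header, the atom and bond counts, and the data items. It is not how Exon or BioBear parse the format. It assumes V2000 records, where the counts line stores the atom count in columns 1-3 and the bond count in columns 4-6, and the names `split_records`, `parse_record`, and the `SAMPLE` records are made up for illustration.

```python
def split_records(text):
    """Yield each record as a list of lines, splitting on the $$$$ delimiter."""
    record = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            yield record
            record = []
        else:
            record.append(line)


def parse_record(lines):
    """Parse one V2000 record into header, counts, and data items."""
    header = "\n".join(lines[:3])      # three header lines
    counts = lines[3]                  # fixed-width counts line
    atom_count = int(counts[0:3])
    bond_count = int(counts[3:6])

    data = {}
    name = None
    # Everything after the atom and bond blocks: property lines, then data items.
    for line in lines[4 + atom_count + bond_count:]:
        if line.startswith(">"):
            # Data header such as "> <chembl_id>": take the text between < and >.
            name = line[line.index("<") + 1:line.rindex(">")]
            data[name] = []
        elif line.strip() and name is not None:
            data[name].append(line)    # value line(s) for the current item
        else:
            name = None                # blank line or "M  END" closes the item

    return {
        "header": header,
        "atom_count": atom_count,
        "bond_count": bond_count,
        "data": {key: "\n".join(value) for key, value in data.items()},
    }


# Two made-up records, only for illustration.
SAMPLE = "\n".join([
    "example-1",
    "  sketch",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.5000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "M  END",
    "> <example_id>",
    "example-1",
    "",
    "$$$$",
    "example-2",
    "  sketch",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
    "> <example_id>",
    "example-2",
    "",
    "$$$$",
    "",
])

records = [parse_record(lines) for lines in split_records(SAMPLE)]
for record in records:
    print(record["atom_count"], record["bond_count"], record["data"])
```

The sketch ignores the atom and bond lines themselves and does no error handling, so it is only meant to show where each part of a record lives.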
Reading SDF Files
Depending on which library you're using, you have different options for working with SDF files.
BioBear
If you're working with BioBear, install the latest version from PyPI:
pip install -U biobear
Then you can read an SDF file like this:
from biobear import new_session
from pathlib import Path
session = new_session()
sdf_file = Path("data/chembl_34.sdf.gz")
result = session.read_sdf_file(sdf_file.as_posix()).to_polars()
print(len(result))
# Output:
# 2409270
Exon
If you're working with Exon, you'll use SQL. First register the table:
CREATE EXTERNAL TABLE sdf
STORED AS SDF
LOCATION 'data/chembl_34.sdf.gz'
OPTIONS (compression 'gzip');
Then you can query the table:
SELECT COUNT(*) FROM sdf;
-- Output: 2409270
File Schema
An SDF table has the following schema:
Column Name | Type | Description |
---|---|---|
header | Utf8 | The header of the record |
atom_count | UInt32 | The number of atoms in the record |
bond_count | UInt32 | The number of bonds in the record |
data | Struct | Additional data in the record |
The schema of the data column is inferred from the first record in the file. For example, given the SDF file above, the data column would have the following schema:
Column Name | Type | Description |
---|---|---|
chembl_id | Utf8 | The chembl_id field from the record |
That field can then be queried like this:
SELECT data.chembl_id FROM sdf LIMIT 1;
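As a rough illustration of what first-record inference implies, the following Python sketch derives the data field names from the first record only and reshapes every record to that set. The helper names `infer_data_fields` and `conform` and the sample records are hypothetical. How Exon actually treats a field that is missing from the first record is not stated here, so the behavior below (such fields are dropped, and absent values become None) is an assumption.

```python
def infer_data_fields(records):
    """Field names for the data struct, taken from the first record only."""
    if not records:
        return []
    return list(records[0]["data"].keys())


def conform(record, fields):
    """Reshape one record's data to the inferred fields.

    Assumption: values missing from a record become None, and fields that
    the first record did not have are dropped.
    """
    return {name: record["data"].get(name) for name in fields}


# Made-up records; the second carries a field the first one lacks.
records = [
    {"header": "example-1", "data": {"example_id": "example-1"}},
    {"header": "example-2", "data": {"example_id": "example-2", "note": "extra"}},
]

fields = infer_data_fields(records)
rows = [conform(record, fields) for record in records]
print(fields)
print(rows)
```

If your files carry different data items from record to record, it may be worth checking that the first record contains every field you plan to query.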
What's Missing?
As mentioned, this is an initial implementation. Some things are still missing, which we plan to add in the future:
- There is lots of room for optimization
- Support for canonicalization of SMILES
- Support for generating SMILES from coordinates
- Support for writing SDF files
- Functions for chemical descriptors
If you have any feedback or suggestions, please let us know.