WTT-02 Preview Release
We are excited to announce the preview release of our latest tool, WTT-02, designed specifically for cheminformatics users. WTT-02 is the second major tool in the WHERE TRUE Tools suite, and it brings SDF I/O, featurization, and molecule filtering directly into SQL.
What is WTT-02?
WTT-02 is a cheminformatics tool that helps users streamline their workflows. With WTT-02, you can:
- Read and write SDF files with glob and compression support.
- Featurize molecules for machine learning workflows using chemical descriptors, Morgan fingerprints, and other related features.
- Subset datasets by substructure or fingerprint similarity.
- Run ETL on PubChem datasets from within SQL.
For more information, see the documentation.
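To make "fingerprint similarity" concrete, here is a minimal Python sketch of the Tanimoto coefficient on boolean fingerprints: the number of bits set in both fingerprints divided by the number set in either. This illustrates the metric only. How WTT-02 computes it internally is not covered here, and the 8-bit fingerprints are hypothetical placeholders (real Morgan fingerprints are much longer).

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two equal-length boolean fingerprints."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    # Bits set in both fingerprints (intersection).
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    # Bits set in at least one fingerprint (union).
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    # Two all-zero fingerprints have an empty union; return 0.0 by convention here.
    return both / either if either else 0.0


# Hypothetical 8-bit fingerprints, for illustration only.
query = [True, False, True, True, False, False, True, False]
candidate = [True, False, True, False, False, False, True, True]

score = tanimoto(query, candidate)
print(score)  # 3 shared bits / 5 bits set in either
```

With a similarity cutoff of 0.7, this candidate would be filtered out, while an identical fingerprint (similarity 1.0) would be kept.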
A Minimal Example
Imagine you have a set of SDF files that you would like to filter by substructure and fingerprint similarity, featurize using Morgan fingerprints and molecular descriptors, and finally write to Parquet for use in a machine learning workflow. With WTT-02, you can perform all of these tasks in a single query.
COPY (
    SELECT _Name AS name, smiles, features.*
    FROM (
        SELECT featurize(smiles) AS features, _Name, smiles
        FROM read_sd_file('*.sdf')
        WHERE substructure(smiles, 'c1ccccc1')
          AND tanimoto_similarity(smiles, 'c1ccccc1') > 0.7
    )
) TO 's3://my-bucket/my-file.parquet' (FORMAT PARQUET);
name | smiles | mw | fsp3 | n_lipinski_hba | n_lipinski_hbd | n_rings | n_hetero_atoms | n_heavy_atoms | n_rotatable_bonds | morgan_fp |
---|---|---|---|---|---|---|---|---|---|---|
name1 | c1ccccc1 | 78.04695 | 0.0 | 0 | 0 | 1 | 0 | 6 | 0 | [false, ...] |
name2 | c1ccccc1 | 46.041866 | 1.0 | 1 | 1 | 0 | 1 | 3 | 0 | [false, ...] |
name2 | c1ccccc1 | 18.010565 | 0.0 | 1 | 2 | 0 | 1 | 1 | 0 | [false, ...] |
And with that, you have a featurized dataset ready for machine learning or your data warehouse.
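As a rough sketch of what "ready for machine learning" can look like downstream, the Python below flattens one featurized row into a single numeric vector: descriptor columns first, then fingerprint bits as 0.0/1.0. The column names come from the output table above. The `row_to_vector` helper and the 4-bit fingerprint are hypothetical, since the table truncates the real `morgan_fp` values.

```python
# Descriptor columns as they appear in the output table.
DESCRIPTOR_COLUMNS = [
    "mw", "fsp3", "n_lipinski_hba", "n_lipinski_hbd", "n_rings",
    "n_hetero_atoms", "n_heavy_atoms", "n_rotatable_bonds",
]


def row_to_vector(row):
    """Flatten one featurized row (a dict) into a list of floats."""
    descriptors = [float(row[col]) for col in DESCRIPTOR_COLUMNS]
    # Encode each fingerprint bit as 1.0 (set) or 0.0 (unset).
    bits = [1.0 if bit else 0.0 for bit in row["morgan_fp"]]
    return descriptors + bits


# Descriptor values from the first table row; the fingerprint is a
# hypothetical 4-bit placeholder, not real WTT-02 output.
example_row = {
    "name": "name1", "smiles": "c1ccccc1",
    "mw": 78.04695, "fsp3": 0.0,
    "n_lipinski_hba": 0, "n_lipinski_hbd": 0, "n_rings": 1,
    "n_hetero_atoms": 0, "n_heavy_atoms": 6, "n_rotatable_bonds": 0,
    "morgan_fp": [False, True, False, False],
}

vector = row_to_vector(example_row)
print(len(vector))  # 8 descriptors + 4 fingerprint bits
```

Keeping the identifier columns (`name`, `smiles`) out of the vector lets the same rows be joined back to predictions later.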
If your data already lives in a Postgres database, see our guide to querying Postgres with Exon-DuckDB. The same idea applies: you can quickly export data based on substructure or similarity.