
WTT-02 Preview Release

Trent Hauck
Developer

We are excited to announce the preview release of our latest tool, WTT-02, designed specifically for cheminformatics users. WTT-02 is the second major tool in the WHERE TRUE Tools suite and comes with a range of features to simplify your work.

What is WTT-02?

WTT-02 is a cheminformatics tool that helps users streamline their workflows. With WTT-02, you can:

  • Input and output SDF files with glob and compression support.
  • Featurize molecules for machine learning workflows using chemical descriptors, Morgan fingerprints, and other related features.
  • Subset datasets by substructure or fingerprint similarity.
  • Get in-SQL ETL support for PubChem datasets.

For more information, see the documentation.

A Minimal Example

Imagine you have a set of SDF files that you would like to filter by substructure and fingerprint similarity, featurize using Morgan fingerprints and molecular descriptors, and finally write to Parquet for use in a machine learning workflow. With WTT-02, you can perform all of these tasks in a single query.

COPY (
    SELECT _Name AS name, smiles, features.*
    FROM (
        SELECT featurize(smiles) AS features, _Name, smiles
        FROM read_sd_file('*.sdf')
        WHERE substructure(smiles, 'c1ccccc1')
          AND tanimoto_similarity(smiles, 'c1ccccc1') > 0.7
    )
) TO 's3://my-bucket/my-file.parquet' (FORMAT PARQUET);

name   smiles    mw         fsp3  n_lipinski_hba  n_lipinski_hbd  n_rings  n_hetero_atoms  n_heavy_atoms  n_rotatable_bonds  morgan_fp
name1  c1ccccc1  78.04695   0.0   0               0               1        0               6              0                  [false, ...]
name2  c1ccccc1  46.041866  1.0   1               1               0        1               3              0                  [false, ...]
name2  c1ccccc1  18.010565  0.0   1               2               0        1               1              0                  [false, ...]

And with that you have a featurized dataset ready for machine learning or your data warehouse.
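
For readers unfamiliar with the similarity filter in the query, here is a minimal pure-Python sketch of the standard Tanimoto coefficient applied with the same 0.7 cutoff. It is not WTT-02's implementation, and it does not say which fingerprint WTT-02 compares internally. The `tanimoto` helper, the fingerprint bit sets, and the molecule names are all hypothetical, chosen only for illustration.

```python
def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints score 0.
        return 0.0
    return len(a & b) / union


# Hypothetical fingerprints, each written as the set of bits that are switched on.
query = {1, 4, 7, 9}
candidates = {
    "mol_a": {1, 4, 7, 9, 12},  # shares 4 of 5 distinct bits with the query
    "mol_b": {1, 2, 3},         # shares 1 of 6 distinct bits with the query
}

THRESHOLD = 0.7  # same cutoff as in the query above

# Keep only candidates whose similarity to the query exceeds the threshold.
hits = {}
for name, fp in candidates.items():
    score = tanimoto(fp, query)
    if score > THRESHOLD:
        hits[name] = round(score, 3)

print(hits)
```

The coefficient is the number of shared on-bits divided by the number of distinct on-bits across both fingerprints, so a cutoff of 0.7 keeps only molecules whose fingerprints overlap heavily with the query.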

If your data is already in a Postgres database, see our guide to querying Postgres with Exon-DuckDB. The same idea applies, and you can quickly export data based on substructure or similarity.