
BioBear on Conda Forge

· One min read
Trent Hauck
Developer

Conda has become a popular way to distribute software in the bioinformatics community. We're excited to announce that we've added BioBear to Conda Forge. You can now install BioBear with the following command:

conda install -c conda-forge biobear

Pip still works as well:

pip install biobear
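
If you want a quick sanity check that the install worked, a minimal sketch looks like the following. The annotations.gff path is just a placeholder for any GFF file you have locally; the reader API is the same one used in the examples later on this blog.

import biobear as bb

# Placeholder path -- point this at any local GFF file.
reader = bb.GFFReader("annotations.gff")

# Stream the records as Arrow record batches and count the rows.
batches = reader.to_arrow()
total_rows = sum(batch.num_rows for batch in batches)

print(f"Parsed {total_rows} GFF records")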

ExonR

· One min read
Trent Hauck
Developer

We're excited to release ExonR, an R package that provides an ergonomic R interface to Exon.

Usage

As a quick example, we can read in a GFF file from S3 and convert it to a GRanges object.

library(exonr)
library(GenomicRanges)

rdr <- read_gff_file("s3://bucket/path/to/file.gff")

df <- as.data.frame(rdr)

gr <- GRanges(
  seqnames = df$seqname,
  ranges = IRanges(start = df$start, end = df$end)
)

Installation

If you're on an x64 machine, you should be able to install a pre-built binary from our R-universe.

install.packages('exonr', repos = c('https://wheretrue.r-universe.dev', 'https://cloud.r-project.org'))

This should also build on other platforms, but you'll need to have Rust installed. The main Rust website has a guide for installing Rust. Once Rust is installed, you can install ExonR from source with the command above.

Spring & Summer 2023 Updates

· 7 min read
Trent Hauck
Developer

It's been around six weeks since our last public update. Here's what we've done:

  1. We've refactored our DuckDB extension into a standalone Rust library named Exon. This is a Rust-based library designed to implement a SQL engine that's aware of scientific data types and workflows.
  2. The DuckDB extension now utilizes Exon and has been renamed as Exon-DuckDB. This can be used with Python, R, and C++, wherever DuckDB is used.
  3. We've also released an open-source Python package, BioBear, which connects Exon with PyArrow and Polars.
  4. Exon, along with its DuckDB extension and BioBear, now supports Object Stores such as S3 and GCS. Local paths like ./path/to/file can be replaced with cloud storage paths like s3://bucket/path/to/file or gs://bucket/path/to/file, and the code will function the same.
  5. We've set up a documentation site with additional examples and tutorials, including Delta Lake integration and neighborhood analysis.
  6. Finally, the libraries have been made open-source under permissive licenses. We hope that this will prompt others to use and contribute to the project, aiding in the development of a community around these tools.

Exon

Exon is a library written in Rust that builds on DataFusion to provide a SQL engine proficient in handling scientific data types and workflows. It's designed for embedding into other applications and is the cornerstone of our DuckDB extension and BioBear packages.

This core infrastructure allows easy data management via Arrow and/or SQL and ensures adaptability across different execution environments. BioBear is specifically a Python package, but the Exon-DuckDB extension is adaptable, operating wherever DuckDB is used, such as Python, R, and JavaScript. We're currently exploring other language bindings, so feel free to reach out if you'd like to see a specific language supported.

See the Exon documentation, which has links to the installable crate, the Rust docs, and more.

Demonstration

For a brief demonstration, let's use Exon to scan a GFF file (CRT output from a JGI dataset) and join the CRISPR array annotations with the repeat units that fall entirely within each array. If Rust isn't your focus and you're primarily interested in using this code as a library, skip ahead to the BioBear section.

Fully Formed Example
use arrow::util::pretty::pretty_format_batches;
use datafusion::error::DataFusionError;
use datafusion::prelude::*;
use exon::context::ExonSessionExt;

#[tokio::main]
async fn main() -> Result<(), DataFusionError> {
    let ctx = SessionContext::new_exon();

    let path = "./exon-examples/data/Ga0604745_crt.gff";
    let sql = format!("CREATE EXTERNAL TABLE gff STORED AS GFF LOCATION '{}';", path);

    ctx.sql(&sql).await?;

    let df = ctx
        .sql(
            r#"SELECT crispr.seqname, crispr.start, crispr.end, repeat.start, repeat.end
               FROM (SELECT * FROM gff WHERE type = 'CRISPR') AS crispr
               JOIN (SELECT * FROM gff WHERE type = 'repeat_unit') AS repeat
                   ON crispr.seqname = repeat.seqname
                   AND crispr.start <= repeat.start
                   AND crispr.end >= repeat.end
               ORDER BY crispr.seqname, crispr.start, crispr.end, repeat.start, repeat.end
               LIMIT 10"#,
        )
        .await?;

    // Show the logical plan.
    let logical_plan = df.logical_plan();
    assert_eq!(
        format!("\n{:?}", logical_plan),
        r#"
Limit: skip=0, fetch=10
  Sort: crispr.seqname ASC NULLS LAST, crispr.start ASC NULLS LAST, crispr.end ASC NULLS LAST, repeat.start ASC NULLS LAST, repeat.end ASC NULLS LAST
    Projection: crispr.seqname, crispr.start, crispr.end, repeat.start, repeat.end
      Inner Join: Filter: crispr.seqname = repeat.seqname AND crispr.start <= repeat.start AND crispr.end >= repeat.end
        SubqueryAlias: crispr
          Projection: gff.seqname, gff.source, gff.type, gff.start, gff.end, gff.score, gff.strand, gff.phase, gff.attributes
            Filter: gff.type = Utf8("CRISPR")
              TableScan: gff
        SubqueryAlias: repeat
          Projection: gff.seqname, gff.source, gff.type, gff.start, gff.end, gff.score, gff.strand, gff.phase, gff.attributes
            Filter: gff.type = Utf8("repeat_unit")
              TableScan: gff"#,
    );

    // Uncomment to show the physical plan, though it's obviously messier.
    // let physical_plan = df.create_physical_plan().await?;
    // println!("Physical plan: {:?}", physical_plan);

    // Collect the results as a vector of Arrow RecordBatches.
    let results = df.collect().await?;
    let formatted_results = pretty_format_batches(results.as_slice())?;
    assert_eq!(
        format!("\n{}", formatted_results),
        r#"
+------------------+-------+------+-------+-----+
| seqname          | start | end  | start | end |
+------------------+-------+------+-------+-----+
| Ga0604745_000026 | 1     | 3473 | 1     | 37  |
| Ga0604745_000026 | 1     | 3473 | 73    | 109 |
| Ga0604745_000026 | 1     | 3473 | 147   | 183 |
| Ga0604745_000026 | 1     | 3473 | 219   | 255 |
| Ga0604745_000026 | 1     | 3473 | 291   | 327 |
| Ga0604745_000026 | 1     | 3473 | 365   | 401 |
| Ga0604745_000026 | 1     | 3473 | 437   | 473 |
| Ga0604745_000026 | 1     | 3473 | 510   | 546 |
| Ga0604745_000026 | 1     | 3473 | 582   | 618 |
| Ga0604745_000026 | 1     | 3473 | 654   | 690 |
+------------------+-------+------+-------+-----+"#
    );

    Ok(())
}

BioBear

BioBear is a Python package that extends Exon, providing it with a Pythonic interface. For instance, you can read a GFF file from S3, and then load it into a PyArrow RecordBatch Reader and/or a Polars DataFrame.

PyArrow's strength lies in its ability to render our domain-specific data highly portable. For instance, suppose you have a Python job that produces a GFF file, perhaps the output of Prodigal. You can create an Arrow RecordBatch Reader that streams data from S3 into a Delta Lake table.

import biobear as bb
from deltalake import write_deltalake

batch_reader = bb.GFFReader("./job/output/test.gff").to_arrow()

write_deltalake("s3a://bucket/delta/gff", batch_reader, storage_options={"AWS_S3_ALLOW_UNSAFE_RENAME": 'true'})
# aws s3 ls s3://bucket/delta/gff
# PRE _delta_log/
# 2023-06-24 11:00:19 8509556 0-ccedd844-373f-4e70-a99e-2167535fcf35-0.parquet

You can also use BioBear to query local indexed files. For example, you can query a BCF file for records on chromosome 1.

import biobear as bb

reader = bb.BCFIndexedReader("example.bcf")
df = reader.query("1")

In this case, the query is executed by the BCFIndexedReader, which uses the index to only read the records on chromosome 1. This is much faster than reading the entire file and filtering in memory.
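
If you want to hold on to that slice, here's a minimal sketch of writing it back out. It assumes the df returned by query() is a Polars DataFrame (BioBear is built around PyArrow and Polars), and the output filename is just a placeholder.

import biobear as bb

reader = bb.BCFIndexedReader("example.bcf")

# The index is used to read only the chromosome 1 records.
df = reader.query("1")

# Assuming a Polars DataFrame, persist the slice with the standard Polars API.
df.write_parquet("chr1_variants.parquet")
print(df.shape)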

Object Stores

Scientific computing is increasingly done in the cloud -- the data that gets generated from workflows is plopped somewhere in your favorite object store (e.g. S3). This has made it less expensive to store the output, but it adds an additional layer of complexity when trying to do analysis.

To try to mitigate this complexity, WHERE TRUE's tools now support reading from object stores. For example, we could rewrite the last example to read from S3:

import biobear as bb
from deltalake import write_deltalake

batch_reader = bb.GFFReader("s3://bucket/job/output/test.gff").to_arrow()
write_deltalake("s3a://bucket/delta/gff", batch_reader, storage_options={"AWS_S3_ALLOW_UNSAFE_RENAME": 'true'})

Put this in a lambda and you're about 3/4ths of the way to a cloud-based bioinformatics data platform.
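
For the curious, a minimal sketch of such a Lambda handler is below. The bucket names, the event shape, and the Delta table location are all hypothetical; the body just reuses the BioBear and deltalake calls from the snippet above.

import biobear as bb
from deltalake import write_deltalake

def handler(event, context):
    # Hypothetical event shape: {"gff_key": "job/output/test.gff"}
    gff_path = f"s3://bucket/{event['gff_key']}"

    # Stream the GFF from S3 as Arrow record batches.
    batch_reader = bb.GFFReader(gff_path).to_arrow()

    # Append the batches to a (hypothetical) Delta Lake table.
    write_deltalake(
        "s3a://bucket/delta/gff",
        batch_reader,
        mode="append",
        storage_options={"AWS_S3_ALLOW_UNSAFE_RENAME": "true"},
    )

    return {"status": "ok", "source": gff_path}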

Or, if DuckDB is more your speed, you can do:

SELECT *
FROM read_gff('s3://bucket/job/output/test.gff')

Better Home and Docs Site

Well, you're here, aren't you? Kidding, of course, but we've deployed an update to the main site and the docs site. The docs have more examples: we gave the example earlier of how to use BioBear to write a Delta Lake table from a GFF file, and we also have an example of how to use Exon-DuckDB to do neighborhood analysis on a metagenomics sample.

Have a gander at our main site here.

Open Source

The ultimate goal of WHERE TRUE Technologies is to not exist. We want to build tools and popularize techniques that diminish the need for the majority of specialized bioinformatics file formats in favor of more general-purpose formats that are easier to work with, but don't make tradeoffs on performance or flexibility.

To get to that state, we've open-sourced our libraries under permissive licenses. We hope this will encourage others to adopt and contribute to the project, and help us build a community around these tools.

You can review the licenses for each of the libraries at the following links:

WTT-02 Preview Release

· 2 min read
Trent Hauck
Developer

We are excited to announce the preview release of our latest tool, WTT-02, designed specifically for Cheminformatics users. WTT-02 is the second major tool in the WHERE TRUE Tools suite and comes packed with a range of powerful features to simplify your work.

What is WTT-02?

WTT-02 is a Cheminformatics tool that provides a range of features to help users streamline their workflows. With WTT-02, you can:

  • Input and output SDF files with glob and compression support.
  • Easily featurize machine learning workflows using chemical descriptors, Morgan fingerprints, and other related features.
  • Subset datasets by substructure or fingerprint similarity.
  • In-SQL ETL support for PubChem datasets.

For more information see the documentation.

A Minimal Example

Imagine you have a set of SDF files that you would like to filter based on a substructure and fingerprint similarity, featurize using Morgan fingerprints and molecular descriptors, and finally, write to parquet for use in a machine learning workflow. With WTT-02, you can perform all of these tasks in a single query.

COPY (
    SELECT _Name as name, smiles, features.*
    FROM (
        SELECT featurize(smiles) AS features, _Name, smiles
        FROM read_sd_file('*.sdf')
        WHERE substructure(smiles, 'c1ccccc1') AND tanimoto_similarity(smiles, 'c1ccccc1') > 0.7
    )
) TO 's3://my-bucket/my-file.parquet' (FORMAT PARQUET);
The result is a table with one row per molecule (name1, name2, ...) and columns name, smiles, mw, fsp3, n_lipinski_hba, n_lipinski_hbd, n_rings, n_hetero_atoms, n_heavy_atoms, n_rotatable_bonds, and morgan_fp.

And with that you have a featurized dataset ready for machine learning or your data warehouse.

Also, if you already have your data in a Postgres database, see our guide for querying Postgres with Exon-DuckDB. The same idea applies, and you can quickly export data based on substructure or similarity.

Exon-DuckDB v0.3 Release

· 4 min read
Trent Hauck
Developer

We're happy to announce the release of Exon-DuckDB v0.3. This release includes a number of improvements as well as some bug fixes. See the very bottom of this page for a video walkthrough of the new features.

Dedicated Docs Site

If you're reading this, you've found our new documentation site. This will be the place for API references, tutorials, and other documentation.

Alignment Functions

We've added an initial set of functions that can perform alignment and generate both the score and the CIGAR alignment string. The underlying algorithm is the state-of-the-art Wavefront aligner, which affords very fast alignments.

See the alignment function documentation for more information.

Alignment Scoring

For example, to get the alignment score between two sequences:

SELECT alignment_score('ACGT', 'ACGT');

Or, to find the top 10 sequences that align to a given sequence:

SELECT *
FROM sequences
ORDER BY alignment_score(sequences.sequence, 'ACGT') DESC
LIMIT 10;
Info: Sequence alignment tools are currently only available on Mac and Linux.

Alignment CIGAR

You can also get the CIGAR that represents the alignment between two sequences:

SELECT alignment_string('ACGT', 'ACGT');

Schema-on-Read for VCF Files

VCFs are a popular format for storing variant data. We've added schema-on-read for VCF files, with the schema inferred from the header. This allows you to query VCF files without having to define a schema ahead of time.

For example, given a VCF file with a header like:

##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=TEST,Number=1,Type=Integer,Description="Testing Tag">
##FORMAT=<ID=TT,Number=A,Type=Integer,Description="Testing Tag, with commas and \"escapes\" and escaped escapes combined with \\\"quotes\\\\\"">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=test,Description="Testing filter">
##contig=<ID=1,assembly=b37,length=249250621>
##contig=<ID=2,assembly=b37,length=249250621>
##contig=<ID=3,assembly=b37,length=198022430>
##contig=<ID=4,assembly=b37,length=191154276>
##reference=file:///lustre/scratch105/projects/g1k/ref/main_project/human_g1k_v37.fasta
##readme=AAAAAA
##readme=BBBBBB
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=STR,Number=1,Type=String,Description="Test string type">

Querying that file will generate an info struct that looks like:

SELECT info.*  -- 'dot-star' unpacks the struct into individual columns
FROM read_vcf_file_records('my.vcf.gz')
LIMIT 5;
┌──────┬──────────────┬────────┬────┬───────┬──────┐
│ TEST │ DP4          │ AC     │ AN │ INDEL │ STR  │
├──────┼──────────────┼────────┼────┼───────┼──────┤
│      │              │ [2]    │ 4  │ false │      │
│      │              │ [2]    │ 4  │ false │      │
│      │ [1, 2, 3, 4] │ [2]    │ 4  │ true  │ test │
│ 5    │ [1, 2, 3, 4] │ [1, 1] │ 3  │ false │      │
│      │              │ [2]    │ 4  │ false │      │
└──────┴──────────────┴────────┴────┴───────┴──────┘
Info: VCF support is currently only available on Mac and Linux.

Querying your LIMS System

One of the many things people have asked for is the ability to work with their LIMS data using our tooling. Moreover, many folks use Benchling, which exposes a data warehouse in PostgreSQL.

And while this isn't necessarily a feature of Exon-DuckDB, it is a feature by virtue of our technology choices.

For example, say, using the above, you'd like to query a table in your LIMS database that contains DNA sequences and find any sequences 'similar' to a query. Rather than copying everything locally, writing it to a bioinformatics file format, and then running a tool, you can query the database directly and even export the results straight to a bioinformatics file format.

Let's say you have a table in your LIMS database called sequences with an id column and a sequence column.

First, load the necessary extensions and connect to the database with a connection string.

INSTALL 'postgres_scanner';
LOAD 'postgres_scanner';

LOAD 'exondb';

-- export PGPASSWORD=postgres
CALL POSTGRES_ATTACH('postgres://postgres:postgres@localhost:5432/postgres');

Then you can query the table directly:

COPY (
    SELECT id, sequence
    FROM sequences
    ORDER BY alignment_score(sequences.sequence, 'ACGT') DESC
    LIMIT 10
) TO 'sequences.fasta' (FORMAT FASTA);

This will export the top 10 sequences to a FASTA file for further use.

Conclusion

We're excited to see what you build with Exon-DuckDB. If you have any questions, comments, or feedback, please reach out to us at thauck@wheretrue.com.

Video

Install Improvements

· One min read
Trent Hauck
Developer

Over the last week, we've made a couple of improvements to how Exon-DuckDB is distributed. We've added Google login, and we've made it easier to install Exon-DuckDB.

Google Login

It's pretty simple. If you go to the login page, there are two new buttons for login and sign-up, respectively. Give it a shot if you're so inclined: https://wheretrue.com/login.

Exon-DuckDB on PyPI

Again, pretty simple: $ pip install exondb. For more information on how to use it, see the documentation and/or the installation instructions.
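
As a quick smoke test after installing, a minimal session might look like the sketch below. The FASTA path is a placeholder, and the WTT_01_LICENSE environment variable is the same one used in the counting example elsewhere on this blog.

import os

import exondb

# A license key may be required, as in the other examples on this blog.
os.environ["WTT_01_LICENSE"] = "XXX"

con = exondb.get_connection()

# Placeholder path -- any local FASTA file will do.
count = con.execute("SELECT COUNT(*) FROM 'example.fasta'").fetchone()
print(count)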

Exon-DuckDB Snippet: Find Contigs with 2+ PFAMs of Interest

· 2 min read
Trent Hauck
Developer

This is a quick snippet for how you can use Exon-DuckDB to find contigs that have multiple PFAMs of interest. This has application in metagenomic discovery of things like CRISPR systems and antibiotic resistance genes.

Create a Table of Annotations

CREATE TABLE gff AS
SELECT
    reference_sequence_name,
    gff_parse_attributes(attributes) attributes
FROM read_gff_raw('IowNatreferecore.gff');
  • read_gff_raw reads the GFF. If the file were well-formed, we could use read_gff, but this GFF is malformed at the source, so we use read_gff_raw.
  • gff_parse_attributes converts a GFF attributes string into key-value pairs.
  • This dataset has 12.7 million rows and is easily processed on a laptop.

Create a Table of PFAMs

CREATE TABLE pfams AS
SELECT
    target_name,
    accession[:7] AS pfam
FROM read_hmm_dom_tbl_out('hmmout')
WHERE evalue < 1e-10
  • read_hmm_dom_tbl_out reads the domain hmm output.
  • accession[:7] munges the raw output from PF00000.12 to PF00000
  • evalue filtering gives us significant hits.

Combine Tables

CREATE TABLE contig_pfams AS
SELECT
    reference_sequence_name AS contig_id,
    LIST(pfam) AS contig_pfams
FROM gff
JOIN pfams
    ON gff.attributes['locus_tag'][1] = pfams.target_name
GROUP BY reference_sequence_name
HAVING LEN(LIST(pfam)) > 5
  • Aggregating into a LIST means one row per contig, with a LIST of PFAMs in the other column.
  • The join key is within attributes -- the 'locus_tag' key.
  • HAVING applies a heuristic that we only want contigs with a bit of length.

Discover Cool Things

SELECT *
FROM contig_pfams
WHERE list_contains(contig_pfams, 'PF17893')
AND list_contains(contig_pfams, 'PF01867')
AND list_contains(contig_pfams, 'PF09827')
  • list_contains returns true if the list in the first argument contains the string in the second argument.
  • Easy to generalize to other CRISPR types or clusters of genes more generally.
  • All done locally with just SQL.

FASTQs, Barcodes, and Partitions

· 6 min read
Trent Hauck
Developer

FASTQs and Barcodes

The process of multiplexing in NGS is to combine many samples into one pool and sequence that pool. This is a useful strategy to amortize sequencing run costs over those samples. The rub, of course, is that almost anytime something is multiplexed, it needs to be demultiplexed. Moreover, for organizations running many sequencing experiments, the underlying data needs to be cataloged in accordance with data engineering best practices to allow for easy retrieval.

This can be an onerous process, but with Exon-DuckDB's help, you can quickly combine the reads, barcodes, and other metadata; do pre-ingestion QC analytics; group the reads into barcode-based partitions; and finally upload that data into a larger cloud warehouse.

If you'd like to follow along, I'm using data from QIIME2.

wget -O "sample-metadata.tsv" \
"https://data.qiime2.org/2018.4/tutorials/moving-pictures/sample_metadata.tsv"
wget -O "barcodes.fastq.gz" \
"https://data.qiime2.org/2018.4/tutorials/moving-pictures/emp-single-end-sequences/barcodes.fastq.gz"
wget -O "sequences.fastq.gz" \
"https://data.qiime2.org/2018.4/tutorials/moving-pictures/emp-single-end-sequences/sequences.fastq.gz"

Ingestion without Indigestion

Looking at this data, there are three files: a metadata TSV, a gzipped barcodes FASTQ, and a gzipped sequences FASTQ. This data is easy enough to load. Let's pop into the DuckDB CLI, load Exon-DuckDB, and have a look.

Because the two FASTQ files end with "fastq.gz", Exon-DuckDB can detect the format and compression.

CREATE TABLE barcodes AS
SELECT * FROM 'barcodes.fastq.gz';

CREATE TABLE reads AS
SELECT * FROM 'sequences.fastq.gz';

And having a quick peek, we can see what we're working with.

-- description is empty in these examples
SELECT * EXCLUDE(description) FROM barcodes LIMIT 5;
┌─────────────────────────────────────┬──────────────┬────────────────┐
│ name                                │ sequence     │ quality_scores │
│ varchar                             │ varchar      │ varchar        │
├─────────────────────────────────────┼──────────────┼────────────────┤
│ HWI-EAS440_0386:1:23:17547:1423#0/1 │ ATGCAGCTCAGT │ IIIIIIIIIIIH   │
│ HWI-EAS440_0386:1:23:14818:1533#0/1 │ CCCCTCAGCGGC │ DDD@D?@B<<+/   │
│ HWI-EAS440_0386:1:23:14401:1629#0/1 │ GACGAGTCAGTC │ GGEGDGGGGGDG   │
│ HWI-EAS440_0386:1:23:15259:1649#0/1 │ AGCAGTCGCGAT │ IIIIIIIIIIII   │
│ HWI-EAS440_0386:1:23:13748:2482#0/1 │ AGCACACCTACA │ GGGGBGGEEGGD   │
└─────────────────────────────────────┴──────────────┴────────────────┘
SELECT MIN(LENGTH(sequence)), MAX(LENGTH(sequence)) FROM barcodes;
┌─────────────────────────┬─────────────────────────┐
│ min(length("sequence")) │ max(length("sequence")) │
│          int64          │          int64          │
├─────────────────────────┼─────────────────────────┤
│                      12 │                      12 │
└─────────────────────────┴─────────────────────────┘

The barcodes are in the sequence column, and they're all the same length. And for the actual reads?

-- isn't EXCLUDE cool
SELECT name, sequence, quality_scores FROM reads LIMIT 5;
┌──────────────────────┬──────────────────────┬──────────────────────────────────────────────────────┐
│ name                 │ sequence             │ quality_scores                                       │
│ varchar              │ varchar              │ varchar                                              │
├──────────────────────┼──────────────────────┼──────────────────────────────────────────────────────┤
│ HWI-EAS440_0386:1:… │ TACGNAGGATCCGAGCGT… │ IIIE)EEEEEEEEGFIIGIIIHIHHGIIIGIIHHHGIIHGHEGDGIFIGE… │
│ HWI-EAS440_0386:1:… │ CCCCNCAGCGGCAAAAAT… │ 64<2$24;1)/:*B<?BBDDBBD<>BDD######################… │
│ HWI-EAS440_0386:1:… │ TACGNAGGATCCGAGCGT… │ GGGC'ACC8;;>;HHHHGHDHHHHHEEHHEHHHHHECHEEEHCHFHHHAG… │
│ HWI-EAS440_0386:1:… │ TACGNAGGATCCGAGCGT… │ IIIE)DEE?CCCBIIIIIIGIIIIIHIIIIIIIIIHIIIHFIDIGIIIHH… │
│ HWI-EAS440_0386:1:… │ TACGNAGGATCCGAGCGT… │ GGGC'?CC4<5<6HHHHHHH@@HGGEEGHHFHHHHBBHGFFGCGBBGG@D… │
└──────────────────────┴──────────────────────┴──────────────────────────────────────────────────────┘

The TSV is a nasty little bugger, with a line for the header, prepended with #, and a second line, also prepended with #, that specifies an arbitrary type system. Anyway, thankfully DuckDB has a robust set of functions for ingesting CSV/TSV data.

CREATE TABLE metadata AS
SELECT *
FROM read_csv('sample-metadata.tsv', skip=2, sep='\t', columns={
    'sample_id': 'VARCHAR',
    'barcode': 'VARCHAR',
    'linker_primer_sequence': 'VARCHAR',
    'body_site': 'VARCHAR',
    'year': 'INTEGER',
    'month': 'INTEGER',
    'day': 'INTEGER',
    'subject': 'VARCHAR',
    'reported_antibiotic_usage': 'VARCHAR',
    'days_since_experiment_start': 'INTEGER',
    'description': 'VARCHAR'});

That wasn't so bad. And now that the data is "in", we can use SQL to manipulate it before exporting it.

A Big Table

To facilitate analysis, let's join this data together and augment the quality scores by transforming them from ASCII scores to a list of integers.

CREATE TABLE analysis_table AS
SELECT r.*,
    m.sample_id,
    m.body_site,
    m.subject,
    b.sequence barcode,

    -- e.g. [14, 23, 27, 25, 25, 35, 31, 27, 33, 35, 35, 33, ...]
    quality_score_string_to_list(r.quality_scores) qc_scores

FROM reads r
INNER JOIN barcodes b
    ON r.name = b.name
LEFT JOIN metadata m
    ON b.sequence = m.barcode;

You could stop here and ship this off to your cloud service provider for storage in parquet, but we can do some analysis here and enjoy the quick feedback loop of local work.

For example, we can quickly find out that there are about 40k undetermined reads, and maybe they came from subject-1.

SELECT subject, COUNT(*) FROM analysis_table GROUP BY subject;
┌───────────┬──────────────┐
│  subject  │ count_star() │
│  varchar  │    int64     │
├───────────┼──────────────┤
│ subject-1 │       106684 │
│ subject-2 │       157194 │
│           │        38703 │
└───────────┴──────────────┘

Or maybe we'd like to filter our analysis table to cases where the last 5 quality scores average above 35 (i.e. we expect some drop-off in quality across the read, but if it's drastic, we omit the read).

SELECT *
FROM analysis_table
WHERE list_avg(qc_scores[-5:]) > 35;

Ok, so entering the home stretch, we've got our data in a "big table". We've done some analysis. Now let's move to the final task of storage and how partitioning can simplify retrieval.

Partitioning with DuckDB and Reading with Exon-DuckDB

The idea of partitioning is to split the data into, well, partitions based on some value that is well aligned with access patterns. For example, for web log data, you might partition by day, since if there was some site issue last week, you'd want to quickly grab that data without going through everything else. Here, we'll partition by barcode, but it's on you to make the decision for your data.

The syntax from DuckDB is quite simple for generating partitions.

COPY (
SELECT * FROM analysis_table
) TO 'analysis_table' (FORMAT parquet, PARTITION_BY barcode);

Now we have a directory structure that looks like this:

$ tree analysis_table | head
analysis_table
├── barcode=AAAAAAAAAAAA
│   ├── data_0.parquet
│   ├── data_1.parquet
│   └── data_2.parquet
├── barcode=AAAAAATTTAGG
│   ├── data_1.parquet
│   └── data_2.parquet
├── barcode=AAAAAGAGCTTA
│   ├── data_0.parquet
...

And that's all well and good; if you need a specific barcode, you can just grab the directory. Or again you can copy this to your cloud provider.

There's one step missing, though: say you wanted to run the FASTQ for one barcode through a tool, and that tool, like seemingly all bioinformatics tools, can't work with Parquet? Exon-DuckDB can help there too.

COPY (
SELECT * FROM parquet_scan('analysis_table/barcode=AAAAAAAAAAAA/*')
) TO 'AAAAAAAAAAAA.fastq.gz' (FORMAT fastq)

And then having a quick look...

$ gzcat AAAAAAAAAAAA.fastq.gz | head -n 4
@HWI-EAS440_0386:4:11:17168:10699#0/1
NAAAAAAAAAAAAAAAAAAAAAAAACAAACAAAAAAAATCTAAAAAAACTACAAACAAAAATCAAACAAACAAACAAAAATACAACATCTAAATATAAAATTCAAACACACTACATTACACTAGAAACCCTCTCTTCACTACTCACTAACAC
+
########################################################################################################################################################

That's it. Thanks for playing along. If you'd like to learn more about WTT, the company, or Exon-DuckDB, the product, please get in touch via the email in the footer.

Kernel of WTT

· 4 min read
Trent Hauck
Developer

I wanted to briefly explain what WHERE TRUE Technologies is, what we're building, and why I'm excited about it.

WHERE TRUE Technologies is a company I've formed to commercialize scientific software that I think can have an impact for both teams and individuals. There are two software tools currently in development, Exon-DuckDB and WTT-02, targeting bioinformatics and cheminformatics, respectively.

The main idea for the products is that engineers and scientists working with complex data need both to be able to quickly explore datasets locally (micro) and integrate these datasets into modern data infrastructure (macro) to augment a company's competitive advantage with data.

In my opinion, SQL is the best path there. With recent advancements in DB tooling, it's possible to have top-tier speed locally and data warehouse interop without sacrificing the expressiveness and evolvability that scientists need to ask questions and engineers need to build systems.

I'll share more information about Exon-DuckDB and WTT-02 as they get closer to release, but in the meantime, here's a kernel of what excites me.

Three Approaches to Counting Sequences

Here's a simple problem that comes up a lot: count the number of sequences in a FASTA file. Here, we'll use the canonical Swiss-Prot FASTA (uniprot_sprot.fasta).

The first approach might be grep, the grey-beard's favorite.

$ grep -c '^>' uniprot_sprot.fasta # counts lines that start with '>'

Or if you're a pythonista, you might do the following.

from Bio import SeqIO

count = 0
for record in SeqIO.parse('uniprot_sprot.fasta', 'fasta'):
    count += 1
print(count)

So, pray tell, what does that look like with Exon-DuckDB?

import exondb
import os

os.environ["WTT_01_LICENSE"] = "XXX"

con = exondb.get_connection()
count = con.execute("SELECT COUNT(*) FROM 'uniprot_sprot.fasta'").fetchone()

print(count)

Comparing runtimes on a local Intel Mac, grep is the fastest, then a bit slower is Exon-DuckDB, then finally Python at about 3x the duration. The rub, of course, is that grep isn't doing anything interesting with the sequences. It has no notion of sequence or header; it just checks whether the line starts with '>'. (The graph looks better with the light theme selected.)

[Chart: 10 Runs for Each Tool on Swiss-Prot -- Tool (grep, BioPython, Exon-DuckDB) vs. Seconds]
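
If you'd like to reproduce a rough version of that comparison yourself, here's a minimal timing sketch. It assumes the same uniprot_sprot.fasta file and the exondb API from the snippet above, and it times a single run of each approach rather than the 10 runs behind the chart.

import os
import subprocess
import time

from Bio import SeqIO
import exondb

FASTA = "uniprot_sprot.fasta"
os.environ.setdefault("WTT_01_LICENSE", "XXX")  # license key, as in the snippet above

def timed(label, fn):
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {result} sequences in {time.perf_counter() - start:.2f}s")

# grep: count header lines only.
timed("grep", lambda: int(subprocess.run(
    ["grep", "-c", "^>", FASTA], capture_output=True, text=True
).stdout.strip()))

# BioPython: parse every record.
timed("BioPython", lambda: sum(1 for _ in SeqIO.parse(FASTA, "fasta")))

# Exon-DuckDB: count via SQL.
con = exondb.get_connection()
timed("Exon-DuckDB", lambda: con.execute(f"SELECT COUNT(*) FROM '{FASTA}'").fetchone()[0])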