Kernel of WTT

Trent Hauck
Developer

I wanted to briefly explain what WHERE TRUE Technologies is, what we're building, and why I'm excited about it.

WHERE TRUE Technologies is a company I've formed to commercialize scientific software that I think can have an impact for both teams and individuals. There are two software tools currently in development, Exon-DuckDB and WTT-02, targeting bioinformatics and cheminformatics, respectively.

The main idea behind both products is that engineers and scientists working with complex data need to do two things: explore datasets quickly on their own machines (micro), and integrate those datasets into modern data infrastructure (macro) so that data adds to a company's competitive advantage.

In my opinion, SQL is the best path there. With recent advancements in DB tooling, it's possible to have top-tier speed locally and data warehouse interop without sacrificing the expressiveness and evolvability that scientists need to ask questions and engineers need to build systems.

I'll share more information about Exon-DuckDB and WTT-02 as they get closer to release, but in the meantime, here's a kernel of what excites me.

Three Approaches to Counting Sequences

Here's a simple problem that comes up a lot: count the number of sequences in a FASTA file. For this, we'll use the canonical Swiss-Prot FASTA, uniprot_sprot.fasta.

The first approach might be grep, the grey-beard's favorite.

$ grep -c '^>' uniprot_sprot.fasta # counts lines that start with '>'

Or if you're a pythonista, you might do the following.

from Bio import SeqIO

# Parse every record and count them one at a time.
count = 0
for record in SeqIO.parse('uniprot_sprot.fasta', 'fasta'):
    count += 1
print(count)

So, pray tell, what does that look like with Exon-DuckDB?

import exondb
import os

os.environ["WTT_01_LICENSE"] = "XXX"

con = exondb.get_connection()
count = con.execute("SELECT COUNT(*) FROM 'uniprot_sprot.fasta'").fetchone()

print(count)

Comparing runtimes on a local Intel Mac, grep is the fastest, Exon-DuckDB is a bit slower, and Python comes last at about 3x the duration. The rub, of course, is that grep isn't doing anything interesting with the sequences. It has no notion of a sequence or a header; it only checks whether a line starts with '>'. (The graph looks better with the light theme selected.)
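To make the difference between matching lines and understanding records concrete, here is a minimal sketch in plain Python. The function names and the two-record example are made up for illustration, and the parser is deliberately simplistic; it is not how grep, BioPython, or Exon-DuckDB are implemented.

```python
import io


def count_header_lines(handle):
    """Count lines that start with '>', which is all `grep -c '^>'` does."""
    return sum(1 for line in handle if line.startswith(">"))


def iter_records(handle):
    """Yield (header, sequence) pairs; a toy parser, not a full FASTA reader."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are joined into a single string per record.
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record input, not real Swiss-Prot data.
EXAMPLE = ">seq1 first\nMKT\nAIL\n>seq2 second\nGGS\n"

n_lines = count_header_lines(io.StringIO(EXAMPLE))
records = list(iter_records(io.StringIO(EXAMPLE)))
print(n_lines, records)
```

Both approaches agree on the count, but only the second one leaves you with headers and sequences you could filter, join, or aggregate afterwards.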

[Figure: "10 Runs for Each Tool on Swiss-Prot". Runtime in seconds (y-axis, 0 to 4) by tool (x-axis: grep, BioPython, Exon-DuckDB).]
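For anyone who wants to run a similar comparison, a repeated-timing harness might look roughly like the sketch below. This is an assumption about how such a benchmark could be set up, not the script used for the numbers in this post; `time_runs` and the stand-in workload are hypothetical, and in practice you would pass a function that runs one of the three counting approaches.

```python
import statistics
import time


def time_runs(func, runs=10):
    """Call func `runs` times and return the wall-clock duration of each call."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def stand_in_workload():
    # Placeholder so the sketch runs anywhere; swap in a real counting function.
    return sum(range(100_000))


durations = time_runs(stand_in_workload, runs=10)
print(f"median: {statistics.median(durations):.6f}s over {len(durations)} runs")
```

Reporting the median across runs keeps one slow outlier, such as a cold file cache on the first run, from dominating the comparison.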