5 posts tagged with "Exon-DuckDB"

Exon-DuckDB v0.3 Release

· 4 min read
Trent Hauck
Developer

We're happy to announce the release of Exon-DuckDB v0.3. This release includes a number of improvements as well as some bug fixes. See the very bottom of this page for a video walkthrough of the new features.

Dedicated Docs Site

If you're reading this, you've found our new documentation site. This will be the place for API references, tutorials, and other documentation.

Alignment Functions

We've added an initial set of functions that can perform alignment and generate both the score and the CIGAR alignment string. The underlying algorithm is the state-of-the-art Wavefront aligner, which affords very fast alignments.

See the alignment function documentation for more information.

Alignment Scoring

For example, to get the alignment score between two sequences:

SELECT alignment_score('ACGT', 'ACGT');

Or, to find the top 10 sequences that align to a given sequence:

SELECT *
FROM sequences
ORDER BY alignment_score(sequences.sequence, 'ACGT') DESC
LIMIT 10;

info: Sequence alignment tools are currently only available on Mac and Linux.

Alignment CIGAR

You can also get the CIGAR that represents the alignment between two sequences:

SELECT alignment_string('ACGT', 'ACGT');
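
Score and CIGAR share the same two-argument signature, so the two functions compose naturally in a single query. A minimal sketch (the sequences here are made up for illustration):

SELECT alignment_score('ACGTTACGT', 'ACGTACGT') AS score,
       alignment_string('ACGTTACGT', 'ACGTACGT') AS cigar;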

Schema-on-Read for VCF Files

VCFs are a popular format for storing variant data. We've added schema-on-read for VCF files, with the schema inferred from the header. This allows you to query VCF files without having to define a schema ahead of time.

For example, given a VCF file with a header like:

##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=TEST,Number=1,Type=Integer,Description="Testing Tag">
##FORMAT=<ID=TT,Number=A,Type=Integer,Description="Testing Tag, with commas and \"escapes\" and escaped escapes combined with \\\"quotes\\\\\"">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=test,Description="Testing filter">
##contig=<ID=1,assembly=b37,length=249250621>
##contig=<ID=2,assembly=b37,length=249250621>
##contig=<ID=3,assembly=b37,length=198022430>
##contig=<ID=4,assembly=b37,length=191154276>
##reference=file:///lustre/scratch105/projects/g1k/ref/main_project/human_g1k_v37.fasta
##readme=AAAAAA
##readme=BBBBBB
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=STR,Number=1,Type=String,Description="Test string type">

Querying that file will generate an info struct that looks like:

SELECT info.*  -- 'dot-star' unpacks the struct into individual columns
FROM read_vcf_file_records('my.vcf.gz')
LIMIT 5;
 TEST |     DP4      |   AC   | AN | INDEL | STR
------+--------------+--------+----+-------+------
      |              | [2]    |  4 | false |
      |              | [2]    |  4 | false |
      | [1, 2, 3, 4] | [2]    |  4 | true  | test
    5 | [1, 2, 3, 4] | [1, 1] |  3 | false |
      |              | [2]    |  4 | false |
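
Because the INFO fields are typed from the header, you can also select or filter on individual fields without unpacking the whole struct. A minimal sketch against the same file (INDEL is declared as a Flag in the header, so it surfaces as a boolean):

SELECT info.DP4, info.AC, info.AN
FROM read_vcf_file_records('my.vcf.gz')
WHERE info.INDEL
LIMIT 5;
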
info: VCF support is currently only available on Mac and Linux.

Querying your LIMS System

One of the many things people have asked for is the ability to work with their LIMS data using our tooling. Moreover, many folks use Benchling, which exposes its data warehouse as a PostgreSQL database.

And while this isn't necessarily a feature of Exon-DuckDB, it is a feature by virtue of our technology choices.

For example, say you'd like to query a table in your LIMS database that contains DNA sequences and find any sequences 'similar' to a query sequence. Rather than copying everything locally, writing it to a bioinformatics file format, and then running a separate tool, you can query the database directly and even export the results directly to a bioinformatics file format.

Let's say you have a table in your LIMS database called sequences with an id column and a sequence column.

First, load the necessary extensions and connect to the database with a connection string.

INSTALL 'postgres_scanner';
LOAD 'postgres_scanner';

LOAD 'exondb';

-- export PGPASSWORD=postgres
CALL POSTGRES_ATTACH('postgres://postgres:postgres@localhost:5432/postgres');

Then you can query the table directly:

COPY (
SELECT id, sequence
FROM sequences
ORDER BY alignment_score(sequences.sequence, 'ACGT') DESC
LIMIT 10
) TO 'sequences.fasta' (FORMAT FASTA);

This will export the top 10 sequences to a FASTA file for further use.

Conclusion

We're excited to see what you build with Exon-DuckDB. If you have any questions, comments, or feedback, please reach out to us at thauck@wheretrue.com.

Video

Install Improvements

· One min read
Trent Hauck
Developer

Over the last week, we've made a couple of improvements to how Exon-DuckDB is distributed. We've added Google login, and we've made it easier to install Exon-DuckDB.

Google Login

It's pretty simple. If you go to the login page, there are two new buttons for login and sign-up, respectively. Give it a shot if you're so inclined: https://wheretrue.com/login.

Exon-DuckDB on PyPI

Again, pretty simple: $ pip install exondb. For more information on how to use it, see the documentation and/or the installation instructions.

Exon-DuckDB Snippet: Find Contigs with 2+ PFAMs of Interest

· 2 min read
Trent Hauck
Developer

This is a quick snippet showing how you can use Exon-DuckDB to find contigs that have multiple PFAMs of interest. This has applications in the metagenomic discovery of things like CRISPR systems and antibiotic resistance genes.

Create a Table of Annotations

CREATE TABLE gff AS
SELECT
reference_sequence_name,
gff_parse_attributes(attributes) attributes
FROM read_gff_raw('IowNatreferecore.gff');
  • read_gff_raw reads the GFF. If it were a well-formed GFF, we could use read_gff, but this one is malformed at the source, so we use read_gff_raw.
  • gff_parse_attributes converts a GFF attributes string into key-value pairs (see the sketch after this list).
  • This dataset has 12.7 million rows and is easily processed on a laptop.
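
To see what gff_parse_attributes produces, here is a quick sketch that pulls out the locus_tag key the join below relies on:

SELECT reference_sequence_name, attributes['locus_tag'][1] AS locus_tag
FROM gff
LIMIT 5;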

Create a Table of PFAMs

CREATE TABLE pfams AS
SELECT
target_name,
accession[:7] AS pfam
FROM read_hmm_dom_tbl_out('hmmout')
WHERE evalue < 1e-10;
  • read_hmm_dom_tbl_out reads the HMM domain table output.
  • accession[:7] truncates the raw accession from PF00000.12 to PF00000 (see the sanity check after this list).
  • Filtering on evalue keeps only significant hits.
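
If you want to convince yourself of the slicing behavior, this one-liner runs standalone:

SELECT 'PF00000.12'[:7] AS pfam;  -- PF00000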

Combine Tables

CREATE TABLE contig_pfams AS
SELECT
reference_sequence_name AS contig_id,
LIST(pfam) AS contig_pfams
FROM gff
JOIN pfams
ON gff.attributes['locus_tag'][1] = pfams.target_name
GROUP BY reference_sequence_name
HAVING LEN(LIST(pfam)) > 5;
  • Aggregating with LIST gives one row per contig, with a LIST of its PFAMs in the other column.
  • The join key lives inside attributes -- the 'locus_tag' key.
  • HAVING applies a heuristic: we only keep contigs with more than five PFAM hits, a rough proxy for contigs of reasonable length.

Discover Cool Things

SELECT *
FROM contig_pfams
WHERE list_contains(contig_pfams, 'PF17893')
AND list_contains(contig_pfams, 'PF01867')
AND list_contains(contig_pfams, 'PF09827');
  • list_contains returns true if the list in the first argument contains the string in the second argument.
  • Easy to generalize to other CRISPR types or, more generally, to other clusters of genes (see the sketch after this list).
  • All done locally with just SQL.
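
One way to generalize is to keep the PFAMs of interest in a single list and use DuckDB's list_has_all; a sketch of the same filter:

SELECT *
FROM contig_pfams
WHERE list_has_all(contig_pfams, ['PF17893', 'PF01867', 'PF09827']);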

FASTQs, Barcodes, and Partitions

· 6 min read
Trent Hauck
Developer

FASTQs and Barcodes

The process of multiplexing in NGS is to combine many samples into one pool and sequence that pool. This is a useful strategy to amortize sequencing run costs over those samples. The rub, of course, is that almost anytime something is multiplexed, it needs to be demultiplexed. Moreover, for organizations running many sequencing experiments, the underlying data needs to be cataloged in accordance with data engineering best practices to allow for easy retrieval.

This can be an onerous process, but with Exon-DuckDB's help, you can quickly combine the reads, barcodes, and other metadata; do pre-ingestion QC analytics; group the reads into barcode-based partitions; and finally upload that data into a larger cloud warehouse.

If you'd like to follow along, I'm using data from QIIME2.

wget -O "sample-metadata.tsv" \
"https://data.qiime2.org/2018.4/tutorials/moving-pictures/sample_metadata.tsv"
wget -O "barcodes.fastq.gz" \
"https://data.qiime2.org/2018.4/tutorials/moving-pictures/emp-single-end-sequences/barcodes.fastq.gz"
wget -O "sequences.fastq.gz" \
"https://data.qiime2.org/2018.4/tutorials/moving-pictures/emp-single-end-sequences/sequences.fastq.gz"

Ingestion without Indigestion

Looking at this data, there are three files: a metadata TSV, a gzipped barcodes FASTQ, and a gzipped sequences FASTQ. This data is easy enough to load. Let's pop into the DuckDB CLI, load Exon-DuckDB, and have a look.

Because the two FASTQ files end with "fastq.gz", Exon-DuckDB can detect the format and compression.

CREATE TABLE barcodes AS
SELECT * FROM 'barcodes.fastq.gz';

CREATE TABLE reads AS
SELECT * FROM 'sequences.fastq.gz';

And having a quick peek, we can see what we're working with.

-- description is empty in these examples
SELECT * EXCLUDE(description) FROM barcodes LIMIT 5;
┌─────────────────────────────────────┬──────────────┬────────────────┐
│ name │ sequence │ quality_scores │
│ varchar │ varchar │ varchar │
├─────────────────────────────────────┼──────────────┼────────────────┤
│ HWI-EAS440_0386:1:23:17547:1423#0/1 │ ATGCAGCTCAGT │ IIIIIIIIIIIH │
│ HWI-EAS440_0386:1:23:14818:1533#0/1 │ CCCCTCAGCGGC │ DDD@D?@B<<+/ │
│ HWI-EAS440_0386:1:23:14401:1629#0/1 │ GACGAGTCAGTC │ GGEGDGGGGGDG │
│ HWI-EAS440_0386:1:23:15259:1649#0/1 │ AGCAGTCGCGAT │ IIIIIIIIIIII │
│ HWI-EAS440_0386:1:23:13748:2482#0/1 │ AGCACACCTACA │ GGGGBGGEEGGD │
└─────────────────────────────────────┴──────────────┴────────────────┘
SELECT MIN(LENGTH(sequence)), MAX(LENGTH(sequence)) FROM barcodes;
┌─────────────────────────┬─────────────────────────┐
│ min(length("sequence")) │ max(length("sequence")) │
│ int64 │ int64 │
├─────────────────────────┼─────────────────────────┤
│ 12 │ 12 │
└─────────────────────────┴─────────────────────────┘

The barcodes are in the sequence column, and they're all the same length. And for the actual reads?

-- isn't EXCLUDE cool?
SELECT * EXCLUDE(description) FROM reads LIMIT 5;
┌──────────────────────┬──────────────────────┬──────────────────────────────────────────────────────┐
│ name │ sequence │ quality_scores │
│ varchar │ varchar │ varchar │
├──────────────────────┼──────────────────────┼──────────────────────────────────────────────────────┤
│ HWI-EAS440_0386:1:… │ TACGNAGGATCCGAGCGT… │ IIIE)EEEEEEEEGFIIGIIIHIHHGIIIGIIHHHGIIHGHEGDGIFIGE… │
│ HWI-EAS440_0386:1:… │ CCCCNCAGCGGCAAAAAT… │ 64<2$24;1)/:*B<?BBDDBBD<>BDD######################… │
│ HWI-EAS440_0386:1:… │ TACGNAGGATCCGAGCGT… │ GGGC'ACC8;;>;HHHHGHDHHHHHEEHHEHHHHHECHEEEHCHFHHHAG… │
│ HWI-EAS440_0386:1:… │ TACGNAGGATCCGAGCGT… │ IIIE)DEE?CCCBIIIIIIGIIIIIHIIIIIIIIIHIIIHFIDIGIIIHH… │
│ HWI-EAS440_0386:1:… │ TACGNAGGATCCGAGCGT… │ GGGC'?CC4<5<6HHHHHHH@@HGGEEGHHFHHHHBBHGFFGCGBBGG@D… │
└──────────────────────┴──────────────────────┴──────────────────────────────────────────────────────┘

The TSV is a nasty little bugger with a line for the header, prepended with #, and a second line, also prepended with #, that specifies an arbitrary type system. Anyways, thankfully DuckDB has a robust set of functions for ingesting CSV/TSV data.

CREATE TABLE metadata AS
SELECT *
FROM read_csv('sample-metadata.tsv', skip=2, sep='\t', columns={
'sample_id': 'VARCHAR',
'barcode': 'VARCHAR',
'linker_primer_sequence': 'VARCHAR',
'body_site': 'VARCHAR',
'year': 'INTEGER',
'month': 'INTEGER',
'day': 'INTEGER',
'subject': 'VARCHAR',
'reported_antibiotic_usage': 'VARCHAR',
'days_since_experiment_start': 'INTEGER',
'description': 'VARCHAR'});

That wasn't so bad. And now that the data is "in", we can use SQL to manipulate it before exporting it.

A Big Table

To facilitate analysis, let's join this data together and augment the quality scores by transforming them from ASCII scores to a list of integers.

CREATE TABLE analysis_table AS
SELECT r.*,
m.sample_id,
m.body_site,
m.subject,
b.sequence barcode,

-- e.g. [14, 23, 27, 25, 25, 35, 31, 27, 33, 35, 35, 33, ...]
quality_score_string_to_list(r.quality_scores) qc_scores

FROM reads r
INNER JOIN barcodes b
ON r.name = b.name
LEFT JOIN metadata m
ON b.sequence = m.barcode;

You could stop here and ship this off to your cloud service provider for storage in parquet, but we can do some analysis here and enjoy the quick feedback loop of local work.

For example, we can quickly find out that there are about 40k undetermined reads, and maybe they came from subject-1.

SELECT subject, COUNT(*) FROM analysis_table GROUP BY subject;
┌───────────┬──────────────┐
│  subject  │ count_star() │
│  varchar  │    int64     │
├───────────┼──────────────┤
│ subject-1 │       106684 │
│ subject-2 │       157194 │
│           │        38703 │
└───────────┴──────────────┘

Or maybe we'd like to filter our analysis table to reads where the last 5 quality scores average above 35 (i.e., we expect some drop-off in quality across the read, but if it's drastic, we omit the read).

SELECT *
FROM analysis_table
WHERE list_avg(qc_scores[-5:]) > 35;
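
In the same spirit of pre-ingestion QC, here's a sketch of a per-barcode summary (everything it references comes from analysis_table above):

SELECT barcode,
COUNT(*) AS n_reads,
AVG(list_avg(qc_scores)) AS mean_read_quality
FROM analysis_table
GROUP BY barcode
ORDER BY n_reads DESC
LIMIT 10;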

Ok, so entering the home stretch, we've got our data in a "big table". We've done some analysis. Now let's move to the final task of storage and how partitioning can simplify retrieval.

Partitioning with DuckDB and Reading with Exon-DuckDB

The idea of partitioning is to split the data into, well, partitions based on some value that is well aligned with access patterns. For example, for web log data, you might partition by day: if there was a site issue last week, you'd want to quickly grab that data without going through everything else. Here, we'll partition by barcode, but it's on you to make the decision for your data.

DuckDB's syntax for generating partitions is quite simple.

COPY (
SELECT * FROM analysis_table
) TO 'analysis_table' (FORMAT parquet, PARTITION_BY barcode);

Now we have a directory structure that looks like this:

$ tree analysis_table | head
analysis_table
├── barcode=AAAAAAAAAAAA
│ ├── data_0.parquet
│ ├── data_1.parquet
│ └── data_2.parquet
├── barcode=AAAAAATTTAGG
│ ├── data_1.parquet
│ └── data_2.parquet
├── barcode=AAAAAGAGCTTA
│ ├── data_0.parquet
...

And that's all well and good; if you need a specific barcode, you can just grab the directory. Or again you can copy this to your cloud provider.
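
If the cloud is the destination, you can also write the same partitioned layout straight to object storage from inside DuckDB via the httpfs extension. A sketch, where the bucket name is hypothetical and S3 credentials are assumed to already be configured:

INSTALL httpfs;
LOAD httpfs;

COPY (
SELECT * FROM analysis_table
) TO 's3://my-bucket/analysis_table' (FORMAT parquet, PARTITION_BY barcode);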

There's one step missing, though: say you wanted to run the FASTQ for one barcode through a tool, and that tool, like seemingly every bioinformatics tool, can't work with Parquet. Exon-DuckDB can help there too.

COPY (
SELECT * FROM parquet_scan('analysis_table/barcode=AAAAAAAAAAAA/*')
) TO 'AAAAAAAAAAAA.fastq.gz' (FORMAT fastq);

And then having a quick look...

$ gzcat AAAAAAAAAAAA.fastq.gz | head -n 4
@HWI-EAS440_0386:4:11:17168:10699#0/1
NAAAAAAAAAAAAAAAAAAAAAAAACAAACAAAAAAAATCTAAAAAAACTACAAACAAAAATCAAACAAACAAACAAAAATACAACATCTAAATATAAAATTCAAACACACTACATTACACTAGAAACCCTCTCTTCACTACTCACTAACAC
+
########################################################################################################################################################

That's it. Thanks for playing along. If you'd like to learn more about WTT, the company, or Exon-DuckDB, the product, please get in touch via the email in the footer.