Exon-DuckDB v0.3 Release
We're happy to announce the release of Exon-DuckDB v0.3. This release includes a number of improvements as well as some bug fixes. See the very bottom of this page for a video walkthrough of the new features.
- Dedicated Docs Site
- Alignment Functions
- Schema-on-Read for VCF Files
- Querying your LIMS System
- Conclusion
- Video
Dedicated Docs Site
If you're reading this, you've found our new documentation site. This will be the place for API references, tutorials, and other documentation.
Alignment Functions
We've added an initial set of functions that can perform alignment and generate both the score and the CIGAR alignment string. The underlying algorithm is the state-of-the-art Wavefront aligner, which affords very fast alignments.
See the alignment function documentation for more information.
Alignment Scoring
For example, to get the alignment score between two sequences:
SELECT alignment_score('ACGT', 'ACGT');
Or, to find the top 10 sequences that align to a given sequence:
SELECT *
FROM sequences
ORDER BY alignment_score(sequences.sequence, 'ACGT') DESC
LIMIT 10;
Sequence alignment tools are currently only available on Mac and Linux.
Alignment CIGAR
You can also get the CIGAR that represents the alignment between two sequences:
SELECT alignment_string('ACGT', 'ACGT');
Schema-on-Read for VCF Files
VCF's are a popular format for storing variant data. We've added schema-on-read for VCF files inferred from the header. This allows you to query VCF files without having to define a schema ahead of time.
For example, given a VCF file with a header like:
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=TEST,Number=1,Type=Integer,Description="Testing Tag">
##FORMAT=<ID=TT,Number=A,Type=Integer,Description="Testing Tag, with commas and \"escapes\" and escaped escapes combined with \\\"quotes\\\\\"">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=test,Description="Testing filter">
##contig=<ID=1,assembly=b37,length=249250621>
##contig=<ID=2,assembly=b37,length=249250621>
##contig=<ID=3,assembly=b37,length=198022430>
##contig=<ID=4,assembly=b37,length=191154276>
##reference=file:///lustre/scratch105/projects/g1k/ref/main_project/human_g1k_v37.fasta
##readme=AAAAAA
##readme=BBBBBB
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=STR,Number=1,Type=String,Description="Test string type">
Querying that file will generate an info
struct that looks like:
SELECT info.* -- 'dot-star' unpacks the struct into individual columns
FROM read_vcf_file_records('my.vcf.gz')
LIMIT 5;
TEST | DP4 | AC | AN | INDEL | STR |
---|---|---|---|---|---|
[2] | 4 | false | |||
[2] | 4 | false | |||
[1, 2, 3, 4] | [2] | 4 | true | test | |
5 | [1, 2, 3, 4] | [1, 1] | 3 | false | |
[2] | 4 | false |
VCF support is currently only available on Mac and Linux.
Querying your LIMS System
One of the many things people have asked for is the ability to work with their LIMS data using our tooling. Moreover many folks use Benchling which an exposes a Data Warehouse in PostgreSQL.
And while this isn't necessarily a feature of Exon-DuckDB, it is a feature by virtue of our technology choices.
For example, say using the above you'd like to query a table in your LIMS database that contains DNA sequences, and you'd like to find any sequences 'similar' to a query. Rather than copy everything locally, write it to a bioinformatic file format, then using a tool, you can query directly and even export directly to a bioinformatic file format.
Let's say you have a table in your LIMS database called sequences
with an id
column and a sequence
column.
First, load the necessary extensions and connect to the database with a connection string.
INSTALL 'postgres_scanner';
LOAD 'postgres_scanner';
LOAD 'exondb';
-- export PGPASSWORD=postgres
CALL POSTGRES_ATTACH('postgres://postgres:postgres@localhost:5432/postgres');
Then you can query the table directly:
COPY (
SELECT id, sequence
FROM sequences
ORDER BY alignment_score(sequences.sequence, 'ACGT') DESC
LIMIT 10
) TO 'sequences.fasta' (FORMAT FASTA);
This will export the top 10 sequences to a FASTA file for further use.
Conclusion
We're excited to see what you build with Exon-DuckDB. If you have any questions, comments, or feedback, please reach out to us at thauck@wheretrue.com.