It's been around six weeks since our last public update. Here's what we've done:
- We've refactored our DuckDB extension into a standalone Rust library named Exon, a SQL engine that's aware of scientific data types and workflows.
- The DuckDB extension now builds on Exon and has been renamed Exon-DuckDB. It can be used from Python, R, C++, and anywhere else DuckDB runs.
- We've also released an open-source Python package, BioBear, which connects Exon with PyArrow and Polars.
- Exon, its DuckDB extension, and BioBear now support object stores such as S3 and GCS. A local path like ./path/to/file can be swapped for a cloud storage path like s3://bucket/path/to/file or gs://bucket/path/to/file, and the code works the same; see the short sketch after this list.
- We've set up a documentation site with additional examples and tutorials, including Delta Lake integration and neighborhood analysis.
- Finally, the libraries have been made open-source under permissive licenses. We hope that this will prompt others to use and contribute to the project, aiding in the development of a community around these tools.
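As a minimal sketch of that path swap using BioBear (the bucket and file names here are hypothetical):

import biobear as bb

# The same reader code works for a local path...
local_reader = bb.GFFReader("./path/to/file.gff")

# ...and for an object-store path; only the path string changes
remote_reader = bb.GFFReader("s3://bucket/path/to/file.gff")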
Exon
Exon is a Rust library that builds on DataFusion to implement a SQL engine that understands scientific data types and workflows. It's designed to be embedded in other applications and is the cornerstone of our Exon-DuckDB extension and BioBear package.
This core infrastructure makes it easy to manage data via Arrow and/or SQL, and it adapts across execution environments. BioBear is specifically a Python package, but the Exon-DuckDB extension runs wherever DuckDB does, including Python, R, and JavaScript. We're currently exploring other language bindings, so feel free to reach out if you'd like to see a specific language supported.
See the Exon documentation, which has links to the installable crate, the Rust docs, and more.
Demonstration
For a brief demonstration, let's use Exon to scan a GFF file (the output of CRT on a JGI dataset) and join the CRISPR array annotations with the repeat units that fall entirely within each array. If Rust isn't your focus and you're mainly interested in using this functionality as a library, skip ahead to the BioBear section.
Fully Formed Example
use arrow::util::pretty::pretty_format_batches;
use datafusion::error::DataFusionError;
use datafusion::prelude::*;
use exon::context::ExonSessionExt;

#[tokio::main]
async fn main() -> Result<(), DataFusionError> {
    // Create a DataFusion session context with Exon's scientific formats registered.
    let ctx = SessionContext::new_exon();

    // Register the GFF output of CRT as an external table named `gff`.
    let path = "./exon-examples/data/Ga0604745_crt.gff";
    let sql = format!("CREATE EXTERNAL TABLE gff STORED AS GFF LOCATION '{}';", path);

    ctx.sql(&sql).await?;

    // Join each CRISPR array with the repeat units that fall entirely within it.
    let df = ctx
        .sql(
            r#"SELECT crispr.seqname, crispr.start, crispr.end, repeat.start, repeat.end
               FROM (SELECT * FROM gff WHERE type = 'CRISPR') AS crispr
               JOIN (SELECT * FROM gff WHERE type = 'repeat_unit') AS repeat
                 ON crispr.seqname = repeat.seqname
                 AND crispr.start <= repeat.start
                 AND crispr.end >= repeat.end
               ORDER BY crispr.seqname, crispr.start, crispr.end, repeat.start, repeat.end
               LIMIT 10"#,
        )
        .await?;

    // Inspect the logical plan DataFusion built for the query.
    let logical_plan = df.logical_plan();
    assert_eq!(
        format!("\n{:?}", logical_plan),
        r#"
Limit: skip=0, fetch=10
  Sort: crispr.seqname ASC NULLS LAST, crispr.start ASC NULLS LAST, crispr.end ASC NULLS LAST, repeat.start ASC NULLS LAST, repeat.end ASC NULLS LAST
    Projection: crispr.seqname, crispr.start, crispr.end, repeat.start, repeat.end
      Inner Join: Filter: crispr.seqname = repeat.seqname AND crispr.start <= repeat.start AND crispr.end >= repeat.end
        SubqueryAlias: crispr
          Projection: gff.seqname, gff.source, gff.type, gff.start, gff.end, gff.score, gff.strand, gff.phase, gff.attributes
            Filter: gff.type = Utf8("CRISPR")
              TableScan: gff
        SubqueryAlias: repeat
          Projection: gff.seqname, gff.source, gff.type, gff.start, gff.end, gff.score, gff.strand, gff.phase, gff.attributes
            Filter: gff.type = Utf8("repeat_unit")
              TableScan: gff"#,
    );

    // Execute the query and pretty-print the resulting record batches.
    let results = df.collect().await?;
    let formatted_results = pretty_format_batches(results.as_slice())?;

    assert_eq!(
        format!("\n{}", formatted_results),
        r#"
+------------------+-------+------+-------+-----+
| seqname          | start | end  | start | end |
+------------------+-------+------+-------+-----+
| Ga0604745_000026 | 1     | 3473 | 1     | 37  |
| Ga0604745_000026 | 1     | 3473 | 73    | 109 |
| Ga0604745_000026 | 1     | 3473 | 147   | 183 |
| Ga0604745_000026 | 1     | 3473 | 219   | 255 |
| Ga0604745_000026 | 1     | 3473 | 291   | 327 |
| Ga0604745_000026 | 1     | 3473 | 365   | 401 |
| Ga0604745_000026 | 1     | 3473 | 437   | 473 |
| Ga0604745_000026 | 1     | 3473 | 510   | 546 |
| Ga0604745_000026 | 1     | 3473 | 582   | 618 |
| Ga0604745_000026 | 1     | 3473 | 654   | 690 |
+------------------+-------+------+-------+-----+"#
    );

    Ok(())
}
BioBear
BioBear is a Python package that wraps Exon with a Pythonic interface. For instance, you can read a GFF file from S3, then load it into a PyArrow RecordBatchReader and/or a Polars DataFrame.
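For example, here's a minimal sketch of the Polars route, assuming a hypothetical local file and going through PyArrow's standard RecordBatchReader.read_all() rather than a BioBear-specific Polars method:

import biobear as bb
import polars as pl

# Stream the GFF records as Arrow batches, materialize them, then hand off to Polars
batch_reader = bb.GFFReader("./path/to/annotations.gff").to_arrow()
table = batch_reader.read_all()  # note: read_all() loads every batch into memory
df = pl.from_arrow(table)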
PyArrow's strength is making our domain-specific data highly portable. For instance, suppose you have a Python job that produces a GFF file, perhaps the output of Prodigal. You can create an Arrow RecordBatchReader that streams the records into a Delta Lake table on S3.
import biobear as bb
from deltalake import write_deltalake

# Stream the job's local GFF output into a Delta Lake table on S3
batch_reader = bb.GFFReader("./job/output/test.gff").to_arrow()
write_deltalake(
    "s3a://bucket/delta/gff",
    batch_reader,
    storage_options={"AWS_S3_ALLOW_UNSAFE_RENAME": "true"},
)
You can also use BioBear to query local indexed files. For example, you can query a BCF file for records on chromosome 1.
import biobear as bb
reader = bb.BCFIndexedReader("example.bcf")
df = reader.query("1")
In this case, the query is executed by the BCFIndexedReader, which uses the index to only read the records on chromosome 1. This is much faster than reading the entire file and filtering in memory.
Object Stores
Scientific computing is increasingly done in the cloud -- the data generated by workflows gets plopped somewhere in your favorite object store (e.g. S3). This has made it cheaper to store the output, but it adds another layer of complexity when it's time to do analysis.
To help mitigate that complexity, WHERE TRUE's tools now support reading from object stores. For example, we could rewrite the last example to read from S3:
import biobear as bb
from deltalake import write_deltalake

# Same pipeline, but now the GFF input is read directly from S3
batch_reader = bb.GFFReader("s3://bucket/job/output/test.gff").to_arrow()
write_deltalake(
    "s3a://bucket/delta/gff",
    batch_reader,
    storage_options={"AWS_S3_ALLOW_UNSAFE_RENAME": "true"},
)
Put this in a Lambda and you're about three-quarters of the way to a cloud-based bioinformatics data platform.
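Here's a minimal sketch of what that Lambda could look like; the event shape, bucket names, and keys are all hypothetical:

import biobear as bb
from deltalake import write_deltalake

def handler(event, context):
    # Hypothetical event shape: {"bucket": "...", "key": "job/output/test.gff"}
    gff_path = f"s3://{event['bucket']}/{event['key']}"

    # Stream the new GFF file into the Delta table as an append
    batch_reader = bb.GFFReader(gff_path).to_arrow()
    write_deltalake(
        "s3a://bucket/delta/gff",
        batch_reader,
        mode="append",
        storage_options={"AWS_S3_ALLOW_UNSAFE_RENAME": "true"},
    )

    return {"status": "ok"}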
Or, if DuckDB is more your speed, you can do:
SELECT *
FROM read_gff('s3://bucket/job/output/test.gff')
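And because Exon-DuckDB runs wherever DuckDB does, the same query works from Python's duckdb module. A sketch, assuming the Exon-DuckDB extension is already installed and loaded in the session:

import duckdb

con = duckdb.connect()

# read_gff is provided by the Exon-DuckDB extension (assumed already loaded)
con.sql("SELECT * FROM read_gff('s3://bucket/job/output/test.gff') LIMIT 5").show()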
Better Home and Docs Site
Well, you're here, aren't you... kidding of course, but we've deployed an update to the main site and the docs site. The docs now have more examples: we showed earlier how to use BioBear to write a Delta Lake table from a GFF file, and the docs also walk through using Exon-DuckDB to do neighborhood analysis on a metagenomics sample.
Have a gander at our main site here.
Open Source
The ultimate goal of WHERE TRUE Technologies is to not exist. We want to build tools and popularize techniques that diminish the need for most specialized bioinformatics file formats in favor of general-purpose formats that are easier to work with but don't sacrifice performance or flexibility.
To get to that state, we've open-sourced our libraries under permissive licenses. We hope this encourages others to adopt and contribute to the project, and helps us build a community around these tools.
You can review the licenses for each of the libraries at the following links: