Skip to main content

7 posts tagged with "biobear"

View All Tags

Initial Support for SDF Files

· 4 min read
Trent Hauck
Trent Hauck
Developer

It's now possible to read SDF files with Exon and BioBear.

SDF Files

The "SDF" in SDF files stands for "Structure Data File". It's sorta like ATM machines.

As you might imagine, this file format contains chemical structures and additional data. There may be multiple such records in a single file. For example, from the ChemBL database.

CHEMBL153534
RDKit 2D

16 17 0 0 0 0 0 0 0 0999 V2000
7.6140 -22.2702 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.7047 -23.1991 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.1806 -22.5282 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
6.9604 -22.7690 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.8790 -23.2163 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
8.2791 -21.1119 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
7.5280 -21.4445 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
8.4225 -22.4364 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
8.8353 -21.7198 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.2035 -23.8527 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
4.0534 -23.2163 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.9776 -23.5889 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.6406 -22.4938 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3.6406 -23.9215 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
8.4397 -20.3035 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
9.6552 -21.6280 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2 3 2 0
3 4 1 0
4 1 1 0
5 2 1 0
6 7 1 0
7 1 2 0
8 1 1 0
9 8 2 0
10 12 1 0
11 5 2 3
12 4 2 0
13 11 1 0
14 11 1 0
15 6 1 0
16 9 1 0
6 9 1 0
10 2 1 0
M END
> <chembl_id>
CHEMBL153534

$$$$

This record starts with header information, followed by the atom and bond information, then any additional data (chembl_id in this case). The record ends with $$$$.

Reading SDF Files

Depending on which library you're using, you have different options for working with SDF files.

BioBear

If you're working with BioBear, install the latest version from PyPI:

pip install -U biobear

Then you can read an SDF file like this:

from biobear import new_session
from pathlib import Path

session = new_session()

sdf_file = Path("data/chembl_34.sdf.gz")
result = session.read_sdf_file(sdf_file.as_posix()).to_polars()
print(len(result))

# Output:
# 2409270

Exon

If you're working with Exon, you'll use SQL. First register the table:

CREATE EXTERNAL TABLE sdf
STORED AS SDF
LOCATION 'data/chembl_34.sdf.gz'
OPTIONS (compression 'gzip');

Then you can query the table:

SELECT COUNT(*) FROM sdf;
-- Output: 2409270

File Schema

An SDF table has the following schema:

Column NameTypeDescription
headerUtf8The header of the record
atom_countUInt32The number of atoms in the record
bond_countUInt32The number of bonds in the record
dataStructAdditional data in the record

The data column is inferred from the underlying file format based on the first record. So for example, given the SDF file above, the data column would have the following schema:

Column NameTypeDescription
chembl_idUtf8The chembl_id field from the record

Which could be accessed like this:

SELECT data.chembl_id FROM sdf LIMIT 1;

What's Missing?

As mentioned, this is an initial implementation. There are some things missing which which we plan to add in the future:

  • There is lots of room for optimization
  • Support for canonicalization of SMILES
  • Support for generating SMILES from coordinates
  • Support for writing SDF files
  • Functions for chemical descriptors

If you have any feedback or suggestions, please let us know.

Exon Release Notes

· 4 min read
Trent Hauck
Trent Hauck
Developer

We've released a new version of Exon, and it's Python API BioBear. This release includes some new features, improvements, and bug fixes. In particular, we'd like to highlight two things:

  1. Read and write support for FASTA and FASTQ files
  2. Support for integer encoding of dna and protein sequences from FASTA files

Read and Write Support for FASTA and FASTQ Files

Previously it'd been possible to create tables from FASTA and FASTQ files via CREATE EXTERNAL TABLE and query them directly via their respective fasta_scan and fastq_scan table functions. However, it wasn't possible to write data back to these formats until now. With this release, you can now write data back to FASTA and FASTQ files using SQL's built-in COPY command.

For example, to read the SwissProt FASTA file and write it back to a new file. First create the table:

CREATE EXTERNAL TABLE swissprot
STORED AS FASTA
COMPRESSION TYPE GZIP
LOCATION 'uniprot_sprot.fasta.gz'

Then copy the data to a new file:

COPY swissprot TO 'uniprot_sprot_copy.fasta' STORED AS FASTA

Or in BioBear:

import biobear as bb

session = bb.new_session()

session.execute(
"""
CREATE EXTERNAL TABLE swissprot
STORED AS FASTA
COMPRESSION TYPE GZIP
LOCATION 'uniprot_sprot.fasta.gz'
"""
)

session.execute("COPY swissprot TO 'uniprot_sprot_copy.fasta' STORED AS FASTA")

In effect, this example decompressed the file. You can also copy from query results:

session.execute(
"""
COPY (
SELECT *
FROM swissprot
WHERE NOT starts_with(sequence, 'M')
) TO 'uniprot_sprot_weird_aa.fasta' STORED AS FASTA
"""
)

Compression

You can specify the compression type in the COPY's OPTIONS clause. For example, to copy the SwissProt FASTA file to ZSTD compressed file:

session.execute(  # our session from above
"""
COPY swissprot TO 'uniprot_sprot_copy.fasta.zst'
STORED AS FASTA OPTIONS ('compression' 'zstd')
"""
)

Integer Encoding of DNA and Protein Sequences

Second, we've added support for reading DNA and protein sequences as integer-encoded arrays. This can be useful for ML applications as sequences are often encoded as integers for training. For example, to read the SwissProt FASTA file and encode the sequences as integers:

CREATE EXTERNAL TABLE swissprot
STORED AS FASTA
COMPRESSION TYPE GZIP
LOCATION 'uniprot_sprot.fasta.gz'
OPTIONS (fasta_sequence_data_type 'integer_encode_protein')

Or in BioBear to get a Polars DataFrame:

import biobear as bb

session = bb.new_session()

session.execute(
"""
CREATE EXTERNAL TABLE swissprot
STORED AS FASTA
COMPRESSION TYPE GZIP
LOCATION 'uniprot_sprot.fasta.gz'
OPTIONS (fasta_sequence_data_type 'integer_encode_protein')
"""
)

df = session.sql("SELECT * FROM swissprot").to_polars()
df.head()
| id                   | description                     | sequence       |
|----------------------|---------------------------------|----------------|
| sp|Q6GZX4|001R_FRG3G | Putative transcription factor … | [12, 1, … 11] |
| sp|Q6GZX3|002L_FRG3G | Uncharacterized protein 002L O… | [12, 18, … 22] |
| sp|Q197F8|002R_IIV3 | Uncharacterized protein 002R O… | [12, 1, … 3] |
| sp|Q197F7|003L_IIV3 | Uncharacterized protein 003L O… | [12, 23, … 9] |
| sp|Q6GZX2|003R_FRG3G | Uncharacterized protein 3R OS=… | [12, 1, … 18] |
| … | … | … |
| sp|Q6UY62|Z_SABVB | RING finger protein Z OS=Sabia… | [12, 7, … 4] |
| sp|P08105|Z_SHEEP | Putative uncharacterized prote… | [12, 18, … 9] |
| sp|Q88470|Z_TACVF | RING finger protein Z OS=Tacar… | [12, 7, … 11] |
| sp|A9JR22|Z_TAMVU | RING finger protein Z OS=Tamia… | [12, 7, … 15] |
| sp|B2ZDY1|Z_WWAVU | RING finger protein Z OS=White… | [12, 7, … 1] |

In addition to integer_encode_protein, you can also use:

  • utf8 for the UTF-8 encoding (this is the default)
  • integer_encode_dna for integer encoding of DNA sequences
  • large_utf8 for larger sequences like chromosomes or genomes

Setting the Encoding for all FASTA Files

If you want to set the encoding for the entire session, you can use the SET command:

session.execute("SET fasta_sequence_data_type = 'integer_encode_protein'")

Then all subsequent CREATE EXTERNAL TABLE commands will use this encoding.

Conclusion

We hope you enjoy these new features and improvements. Please let us know if you have any questions or feedback.

To get started with BioBear, run:

pip install -U biobear

Exon and BioBear Updates

· 7 min read
Trent Hauck
Trent Hauck
Developer

This post will walk through some of the recent updates to Exon, and show how they can be used with BioBear and/or ExonR.

These updates include:

  • Addition of several table functions to scan various bioinformatic files and query some indexed files. Using these table functions requires less boilerplate SQL code, so are a nice quality of life improvement.
  • Exon has gain the ability to do index seeks on GFF files for region queries. Now users can access specific regions across large annotations sets without resorting to costly sequential scans.
  • Flow Cytometry Standard (FCS) files are now supported, and can help scientists go very quickly from the FCS binary format to a DataFrame for analysis.
  • Various optimizations led to performance improvements for bgzipped files.
  • New utility table types where added to align more simply to file extensions: FNA (.fna), FAA (.faa), FQ (.fq).

Table of Contents

Table Functions

The addition of table functions is mostly a quality of life improvement, but it's good to live good, so to speak. A table function returns a table that is incorporated into the larger query plan. For example in

SELECT COUNT(*) FROM gff_scan('gff/test.gff.gz');

gff_scan creates a table which is then incorporated into the overall query to count the contents of that file. This is the same thing as doing:

CREATE EXTERNAL TABLE gff_file
STORED AS GFF
COMPRESSION TYPE GZIP
LOCATION 'gff/test.gff.gz';

SELECT COUNT(*) FROM gff_file;

As you can see, it's considerably less typing. Or, if we were to use BioBear, we can see it's ease:

import biobear as bb

sess = bb.connect()
df = sess.sql("SELECT * FROM gff_scan('gff/test.gff.gz')").to_polars()
print(df.head())

There's the obvious downside in that the table function can't be reused like the table, which is not in the database's catalog. However, given these are backed by an external file, which won't change, it's often moot.

Scanning and Querying

As of this release, there are two types of table functions available: scanning, for bulk scanning of all underlying records, and querying, for specific region access for indexed files.

Please see the table function docs, for what's specific available, but generally all current files have scan support, and indexed files are VCF, SAM, and GFF.

Example Scanning Function

Scanning functions take a couple of options. First, they all require the location of the table. This can be a single file or a folder. Second, if applicable, they can take the compression of the underlying file. The table function will also try to use infer the compression type, but this only works for single files. If you pass a folder/prefix, you must pass a compression type unless the files are actually uncompressed.

So, we already saw one example, but we can also can other file types and potentially include them in larger queries. For example,

SELECT *
FROM fasta_scan('fasta/test.fasta') AS f
JOIN gff_scan('gff/test.gff.gz') AS g
ON f.id = g.seqname;

Again, see the table function docs for the list of available functions, and more examples.

Example Query Function

Querying indexed files can save a lot of time because only the bytes that belong to the query region are included. The newly adding bam_indexed_scan, vcf_indexed_scan, and gff_indexed_scan, allow querying those respective files.

With BioBear, we an easily write a script which can search regions and save the results as parquet.

def query(path: str, region: str):
sess = bb.connect()
df = sess.sql(f"SELECT * FROM vcf_indexed_scan('{path}', '{region}')").to_polars()
df.write_parquet(f"{region}.parquet")

if __name__ == "__main__":
query("vcf/index.vcf.gz", "1")

Indexed GFF Files

As mentioned, Exon can now do index seeks on GFF files to markedly reduce query times when a specific sequence or region is needed.

This access pattern comes up often. For example, in metagenomics to get the neighboring annotations for an enzyme of interest. Or in human health, to query regions near some pathogenic variant.

Similar to SAM and VCF index support in Exon, there are two ways to run queries. First, the gff_indexed_scan, like so:

SELECT COUNT(*)
FROM gff_indexed_scan('gff-index/gencode.v38.polyAs.gff.gz', 'chr1:1000000-2000000')

You must have an index file associated with the gff file at gff-index/gencode.v38.polyAs.gff.gz.tbi.

Similarly, you can make a table, then query it to your heart's content.

CREATE EXTERNAL TABLE indexed_gff
STORED AS INDEXED_GFF
COMPRESSION TYPE GZIP
LOCATION 'gff-index/gencode.v38.polyAs.gff.gz';

And to query:

SELECT COUNT(*)
FROM indexed_gff
WHERE gff_region_filter('chr1:1000000-2000000', seqname) = TRUE;

Flow Cytometry Standard (FCS) Files

FCS files are the output from flow cytometry experiments. They are binary files, and can be quite large. Exon can now can incorporate them in queries, and used with BioBear and ExonR by extension. For example, to scan a file with BioBear and load it into a DataFrame:

import biobear as bb

sess = bb.connect()
df = sess.sql("SELECT * FROM fcs_scan('fcs/Guava Muse.fcs')").to_polars()
print(df.head())

And that will output a table that looks like:

Forward Scatter (FSC-HLin)Forward Scatter Width (FSC-W)Yellow Fluorescence (YEL-HLin)Yellow Fluorescence Width (YEL-W…TimeCell Size (FSC-HLog)Viability (YEL-HLog)Nucleation (RED-HLog)
481.9313057.584.2256017.535964.02.6829851.9254442.597557
1293.61962918.75394.1403518.7552745.03.1118072.5956513.668868
763.8361825.7586.9099965.7577479.02.8831.939072.878045
1.1607680.752.7866090.75126783.00.0647450.4450762.346986
338.32446310.7538.32130110.75141965.02.5293331.583442.531971

As a next step, you could select specific columns (via the SELECT clause) or rows (via the WHERE clause) to further refine the query, and then pass the results to a plotting library or other analysis tool.

Faster bgzip Reading

Reading some compressed formats as gotten faster due to build improvements. For example, comparing The last version to this one when reading a larger VCF file, we see that the new version is about 1.5x faster than the last.

hyperfine './exon-benchmarks-v0.5.0 vcf-query -p WGS.vcf.gz -r chr1:10000-10000000' './exon-benchmarks-latest vcf-query -p WGS.vcf.gz -r chr1:10000-10000000'
Benchmark 1: ./exon-benchmarks-v0.5.0 vcf-query -p WGS.vcf.gz -r chr1:10000-10000000
Time (mean ± σ): 1.430 s ± 0.055 s [User: 5.784 s, System: 0.625 s]
Range (min … max): 1.361 s … 1.553 s 10 runs

Benchmark 2: ./exon-benchmarks-latest vcf-query -p WGS.vcf.gz -r chr1:10000-10000000
Time (mean ± σ): 952.6 ms ± 69.7 ms [User: 3319.4 ms, System: 513.8 ms]
Range (min … max): 895.1 ms … 1117.4 ms 10 runs

Summary
'./exon-benchmarks-latest vcf-query -p WGS.vcf.gz -r chr1:10000-10000000' ran
1.50 ± 0.12 times faster than './exon-benchmarks-v0.5.0 vcf-query -p WGS.vcf.gz -r chr1:10000-10000000'

New Utility File Types

Because sequence files often have different extensions, e.g. .faa for a FASTA file consisting of Amino Acids, it can be annoying to manage that difference via the extension vs having a specific table type.

So now that are a few new table types to simplify things:

File TypeExtensionDescription
FAA.faaAmino Acid FASTA
FNA.fnaNucleotide FASTA
FQ.fqFASTQ

Conclusion

We hope these updates make Exon, BioBear and ExonR, more useful for your bioinformatics needs. If you have any questions, please feel free to reach out.

BAM Files & BioBear

· 9 min read
Trent Hauck
Trent Hauck
Developer

This post highlights a couple of updates to the BioBear project to support SQL sessions and improvements to BAM file support in Exon so it can scale to larger full scans (like 14G of BAM files from a 10X Genomics experiment).

BioBear Update

The main update to BioBear is that it now exposes Exon's inner query engine allowing for more streamlined application in ETL and ML use-cases. If you work in Biotech/Pharma, you probably watch your normie friends use tools like polars, DataFusion, and/or DuckDB to simplify their data processing with a mix of jealousy and contempt. They reap the benefits of faster/cheaper workflows, while you schlep GFFs and BAM files between CLI tools, workflows, and scripts.

With this BioBear release it becomes simpler to use Exon's query engine in your ETL jobs, ML pipelines, and other execution methods (like AWS Lambdas).

So what does that look like. First, install the latest BioBear version:

pip install -U biobear

Then, you can connect to a local Exon session:

import biobear as bb

session = bb.connect()

With that session, you can work with bioinformatic data along side CSV, Parquet, and JSON data. As an example, imagine the case where you wanted to combine and summarize the contents of a Metagenomics NGS sample that had been run through primary and secondary analysis.

For example here we have a sample, Ga0451106 from IMG/M. Let's say we want to store the gene annotations from the primary analysis in S3 as a Parquet file. We also want to be conservative with out storage duplication, especially sequences, so we'll join the protein FASTA on a PFAM GFF, so that we only duplicate the protein sequences that are annotated with PFAMs we care about given our end goal.

Write Gene Annotations to Parquet

Taking the session from earlier, create an external table backed by the GFF file, then run a COPY query to write the results.

# Register the store we'll be writing to
session.register_object_store_from_url('s3://wtt-01-dist-prd')

# Create the external table, this could also be on S3
session.sql("""
CREATE EXTERNAL TABLE gene_annotations
STORED AS GFF
LOCATION 's3://wtt-01-dist-prd/TenflaDSM28944/IMG_Data/Ga0451106_prodigal.gff'
""")

df = session.sql("""
COPY gene_annotations
TO 's3://wtt-01-dist-prd/gene_annotations/sample=Ga0455106/gene_annotations.parquet' (FORMAT parquet)
""")

df.to_polars()
# shape: (1, 1)
# ┌───────┐
# │ count │
# │ --- │
# │ u64 │
# ╞═══════╡
# │ 7998 │
# └───────┘

Say we did have the GFF in S3, we could load it into polars directly:

# Register the store we'll be reading from
session.register_object_store_from_url('s3://wtt-01-dist-prd')

# Create the external table, this could also be on S3
session.sql("""
CREATE EXTERNAL TABLE gene_annotations_s3
STORED AS GFF LOCATION 's3://wtt-01-dist-prd/TenflaDSM28944/IMG_Data/Ga0451106_prodigal.gff'
""")

df = session.sql("SELECT * FROM gene_annotations_s3").to_polars()
df.head()
# shape: (5, 9)
# ┌──────────────┬─────────────────┬──────┬───────┬───┬────────────┬────────┬───────┬───────────────────────────────────┐
# │ seqname ┆ source ┆ type ┆ start ┆ … ┆ score ┆ strand ┆ phase ┆ attributes │
# │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ str ┆ str ┆ i64 ┆ ┆ f32 ┆ str ┆ str ┆ list[struct[2]] │
# ╞══════════════╪═════════════════╪══════╪═══════╪═══╪════════════╪════════╪═══════╪═══════════════════════════════════╡
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS ┆ 2 ┆ … ┆ 54.5 ┆ - ┆ 0 ┆ [{"ID",["Ga0451106_01_2_238"]}, … │
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS ┆ 228 ┆ … ┆ 114.0 ┆ - ┆ 0 ┆ [{"ID",["Ga0451106_01_228_941"]}… │
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS ┆ 1097 ┆ … ┆ 224.399994 ┆ + ┆ 0 ┆ [{"ID",["Ga0451106_01_1097_2257"… │
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS ┆ 2261 ┆ … ┆ 237.699997 ┆ + ┆ 0 ┆ [{"ID",["Ga0451106_01_2261_3787"… │
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS ┆ 3784 ┆ … ┆ 114.400002 ┆ + ┆ 0 ┆ [{"ID",["Ga0451106_01_3784_4548"… │
# └──────────────┴─────────────────┴──────┴───────┴───┴────────────┴────────┴───────┴───────────────────────────────────┘

Subsetting our Sequences

As mentioned, it's important here to be cognizant of storage costs and thus reduce duplication into parquet except for amino acid sequences that likely contain a certain PFAM domain.

Similar to the GFF, we can create an external table for both the protein FASTA and the PFAM GFF:

# Create the external table, this could also be on S3
session.sql("""
CREATE EXTERNAL TABLE protein_fasta
STORED AS FASTA LOCATION 'TenflaDSM28944/IMG_Data/Ga0451106_proteins.faa'
""")

session.sql("""
CREATE EXTERNAL TABLE pfam_gff
STORED AS GFF LOCATION 'TenflaDSM28944/IMG_Data/Ga0451106_pfam.gff'
""")

# Register the store we'll be writing to
session.register_object_store_from_url('s3://wtt-01-dist-prd')

gff = session.sql("""SELECT * FROM pfam_gff""").to_polars()
# shape: (5, 9)
# ┌──────────────────────────────┬─────────────────────────────┬─────────┬───────┬───┬────────────┬────────┬───────┬───────────────────────────────────┐
# │ seqname ┆ source ┆ type ┆ start ┆ … ┆ score ┆ strand ┆ phase ┆ attributes │
# │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ str ┆ str ┆ i64 ┆ ┆ f32 ┆ str ┆ str ┆ list[struct[2]] │
# ╞══════════════════════════════╪═════════════════════════════╪═════════╪═══════╪═══╪════════════╪════════╪═══════╪═══════════════════════════════════╡
# │ Ga0451106_01_1000235_1003480 ┆ HMMER 3.1b2 (February 2015) ┆ PF14684 ┆ 673 ┆ … ┆ 100.900002 ┆ . ┆ null ┆ [{"ID",["Ga0451106_01_1000235_10… │
# │ Ga0451106_01_1000235_1003480 ┆ HMMER 3.1b2 (February 2015) ┆ PF14685 ┆ 753 ┆ … ┆ 81.900002 ┆ . ┆ null ┆ [{"ID",["Ga0451106_01_1000235_10… │
# │ Ga0451106_01_1000235_1003480 ┆ HMMER 3.1b2 (February 2015) ┆ PF03572 ┆ 875 ┆ … ┆ 70.400002 ┆ . ┆ null ┆ [{"ID",["Ga0451106_01_1000235_10… │
# │ Ga0451106_01_1000235_1003480 ┆ HMMER 3.1b2 (February 2015) ┆ PF07676 ┆ 459 ┆ … ┆ 16.799999 ┆ . ┆ null ┆ [{"ID",["Ga0451106_01_1000235_10… │
# │ Ga0451106_01_1000235_1003480 ┆ HMMER 3.1b2 (February 2015) ┆ PF07676 ┆ 408 ┆ … ┆ 15.3 ┆ . ┆ null ┆ [{"ID",["Ga0451106_01_1000235_10… │
# └──────────────────────────────┴─────────────────────────────┴─────────┴───────┴───┴────────────┴────────┴───────┴───────────────────────────────────┘

# Join the protein FASTA and PFAM GFF and write the results to S3
df = session.sql("""
COPY (
SELECT DISTINCT fasta.id, fasta.description, fasta.sequence
FROM protein_fasta AS fasta
INNER JOIN pfam_gff AS pfam
ON fasta.id = pfam.seqname
WHERE pfam.type = 'PF13561' -- Enoyl-(Acyl carrier protein) reductase
) TO 's3://wtt-01-dist-prd/sequences/pfam=PF13561/sample=Ga0455106/sequences.parquet' (FORMAT parquet)
""").to_polars()

print(df)
# shape: (1, 1)
# ┌───────┐
# │ count │
# │ --- │
# │ u64 │
# ╞═══════╡
# │ 49 │
# └───────┘

Exon BAM Update

The other update to highlight here is that, Exon, and now by extension BioBear, has undergone optimizations that allow projection and predicate pushdowns to enable more efficient queries on BAM files.

Projection Pushdown

For example, say we have we have a couple of BAM files from this 10X Genomics experiment. For simplicity lets assume one file for one sample from the treatment and one from the control, and we just want to look at the read name, the start position, and the map quality for QC purposes.

We can create an external table for the directory containing the BAM files - a listing table allows us to query the files in a directory as one table. Stop for a moment, and consider what the approach to working with multiple SAM files looks like with samtools. You'd have to write a script to loop over the files, then parse the output, and then combine the results. With Exon, you can just query the directory as a table.

$ ls -lh bam-files
-rw-r--r--@ 1 thauck staff 6.7G Sep 29 15:33 CytAssist_FFPE_Human_Colon_Post_Xenium_Rep1_possorted_genome_bam.bam
-rw-r--r-- 1 thauck staff 2.2M Sep 29 15:33 CytAssist_FFPE_Human_Colon_Post_Xenium_Rep1_possorted_genome_bam.bam.bai
-rw-r--r-- 1 thauck staff 7.0G Sep 29 15:55 CytAssist_FFPE_Human_Colon_Rep1_possorted_genome_bam.bam
-rw-r--r-- 1 thauck staff 2.3M Sep 29 15:55 CytAssist_FFPE_Human_Colon_Rep1_possorted_genome_bam.bam.bai

As you can see, this are a couple of larger files, totaling just under 14G for the main data files, and around 5M for the associated indexes.

import biobear as bb

session = bb.connect()

session.sql("""
CREATE EXTERNAL TABLE ten_x_experiment STORED AS BAM LOCATION 'bam-files/'
""")

# See which columns are available
columns = session.sql("""
DESCRIBE ten_x_experiment
""").to_polars().select('column_name')
# print(columns)

df = session.sql("""
SELECT name, start, mapping_quality
FROM ten_x_experiment
""").to_polars()
# print(df.shape)
# (287822789, 3)

Reading about 285,000,000 records took around 3 minutes to load on my laptop, an M2 MacBook Air with 24GB of RAM. And while I recognize running on a local laptop isn't the most scientific, it does represent how many practitioners will work with data. In fact, I think that's an exciting thing about Exon and related tools: building off increased compute power for personal computers to enable local analysis of large datasets.

Predicate Pushdown

We just pushed down the projection, i.e. which fields are "physically read", but we can also push down region filters, to determine which records are read in the first place, before hitting the query's predicate in the first place. This enables quickly subsetting files to just the region of interest.

import biobear as bb

session = bb.connect()

# Note that we've changed to an INDEXED_BAM table
session.sql("""
CREATE EXTERNAL TABLE ten_x_experiment_index STORED AS INDEXED_BAM LOCATION 'bam-files/'
""")

df = session.sql("""
SELECT name, start, mapping_quality
FROM ten_x_experiment_index
WHERE bam_region_filter('chr1:1-1000000', reference, start, "end")
""").to_polars()
# print(df)
# shape: (39_090, 3)
# ┌───────────────────────────────────┬────────┬─────────────────┐
# │ name ┆ start ┆ mapping_quality │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i32 ┆ str │
# ╞═══════════════════════════════════╪════════╪═════════════════╡
# │ A00519:1665:H3LNHDSX7:3:2631:254… ┆ 69486 ┆ null │
# │ A00519:1665:H3LNHDSX7:3:2406:262… ┆ 69486 ┆ null │
# │ A00519:1665:H3LNHDSX7:3:2247:326… ┆ 69486 ┆ null │
# │ A00519:1665:H3LNHDSX7:3:2340:113… ┆ 69486 ┆ null │
# │ … ┆ … ┆ … │
# │ A00519:1665:H3LNHDSX7:3:2657:215… ┆ 999932 ┆ 0 │
# │ A00519:1665:H3LNHDSX7:3:2607:125… ┆ 999932 ┆ 0 │
# │ A00519:1665:H3LNHDSX7:3:2346:582… ┆ 999932 ┆ 0 │
# │ A00519:1665:H3LNHDSX7:3:2240:126… ┆ 999937 ┆ 0 │
# └───────────────────────────────────┴────────┴─────────────────┘

Conclusion

This post highlight two recent updates to our tools:

  1. BioBear now exposes Exon's query engine allowing for more streamlined application in ETL and ML use-cases.
  2. Exon now supports projection and predicate pushdowns to enable more efficient queries on BAM files.

As always, feel free to reach out with questions, comments, or suggestions.

Highlighting some releases in July 2023

· 5 min read
Trent Hauck
Trent Hauck
Developer

We wanted to highlight three updates that we made in July 2023.

First, Exon now will rewrite queries to scan multiple files in parallel. This enables considerable speedups for queries over large datasets -- both for local files and object stores (e.g. S3, GCS, etc.).

Second, we're releasing an initial version of an MzML reader in our suite of tools. This is our first step towards supporting proteomics and metabolomics data in exon. This format is subject to change as we get feedback from users, so please let us know what you think!

Third, we're adding support for pandas dataframes in BioBear. This is a common request, so we're excited to be able to support it.

Parallel Scan

Not only is going fast fun, but it's also important for enabling the fast iteration cycles necessary for innovation. To that end, Exon and related tools will now automatically parallelize scans over multiple files.

For example, if you have a directory with FASTA files, you can now do the following:

import biobear as bb
from pathlib import Path

data = Path("./data/") # folder with fasta files
# note: FastaReader is now deprecated, use session.read_fasta_file(data)
df = FastaReader(data).to_polars() # need to install polars

And Exon will distribute reading the files across available cores. This is a huge win for data lakes, where you can now read multiple files in parallel without having to manually shard your data. E.g. here we see a 4x speed-up when reading 8 files.

drawing

MzML Reader

MzML is a very common format in metabolomics and proteomics. At its core are measurements of the spectrum from the mass spec, but there's also a Controlled Vocabulary that describes various metadata about the experiment.

Similar to other readers, you instantiate a reader and then depending on what you want to do with it, you can convert it to arrow or polars with BioBear, or a data.frame with R.

import biobear as bb

# object stores and local files are supported
df = bb.MzMLReader("mzml-data/GNPS00002_A3_p.mzML").to_polars()

df.head()
# shape: (5, 6)
# ┌───────────────────────────────────┬───────────────────────────────────┬───────────────────────────────────┬────────────┬───────────────────────────────────┬────────────────┐
# │ id ┆ mz ┆ intensity ┆ wavelength ┆ cv_params ┆ precursor_list │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ struct[1] ┆ struct[1] ┆ f32 ┆ object ┆ object │
# ╞═══════════════════════════════════╪═══════════════════════════════════╪═══════════════════════════════════╪════════════╪═══════════════════════════════════╪════════════════╡
# │ controllerType=0 controllerNumbe… ┆ {[80.948143, 80.993134, … 957.16… ┆ {[5416.88916, 7712.663574, … 480… ┆ null ┆ [('MS:1000579', {'accession': 'M… ┆ None │
# │ controllerType=0 controllerNumbe… ┆ {[80.284782, 81.017242, … 1176.3… ┆ {[4701.711426, 5527.709961, … 95… ┆ null ┆ [('MS:1000579', {'accession': 'M… ┆ None │
# │ controllerType=0 controllerNumbe… ┆ {[80.948288, 81.017365, … 1193.9… ┆ {[11152.501953, 9900.050781, … 8… ┆ null ┆ [('MS:1000579', {'accession': 'M… ┆ None │
# │ controllerType=0 controllerNumbe… ┆ {[80.948143, 81.017372, … 1074.1… ┆ {[6677.85791, 11401.181641, … 12… ┆ null ┆ [('MS:1000579', {'accession': 'M… ┆ None │
# │ controllerType=0 controllerNumbe… ┆ {[80.948174, 81.017387, … 968.84… ┆ {[6586.833984, 13542.833008, … 6… ┆ null ┆ [('MS:1000579', {'accession': 'M… ┆ None │
# └───────────────────────────────────┴───────────────────────────────────┴───────────────────────────────────┴────────────┴───────────────────────────────────┴────────────────┘

For R users, see the read_mzml_file function. And for exon-duckdb users see

Please see the docs for more information about MzML files, and again, please let us know if you have any feedback!

A Sneak Peak for Querying

While it's not quite ready yet for BioBear and ExonR, in Exon, we've added a set of MzML specific functions for querying. For example, you can filter spectra based on the presence of a specific peaks.

For example, this will filter a set of spectra to only those that have a peak at 100.0 m/z that's within 0.1 m/z of the peak.

SELECT id
FROM mzml -- same table as above
WHERE contains_peak(mz.mz, 100.0, 0.1) = true

This sort of declarative informatics is part of the Exon vision, which we'll be talking more about in the future.

A Note on Performance

Similar to BioBear comparing favorably to other scientific data libraries, we see modest out-performance for MzML file reading.

The following compares reading using a few difference libraries in Python on an MzML file with about 8k spectra.

drawing

Pandas Support in BioBear

There's not much to say besides that places where you could use to_polars you can now also use to_pandas. E.g.

import biobear as bb
from pathlib import Path

data = Path("./data/uniref50.fasta")
# note: FastaReader is now deprecated, use session.read_fasta_file(data)
df = FastaReader(data).to_pandas() # need to install pandas

df.head()
# id description sequence
# 0 UniRef50_A0A5A9P0L4 peptidylprolyl isomerase n=1 Tax=Triplophysa t... MEEITQIKKRLSQTVRLEGKEDLLSKKDSITNLKTEEHVSVKKMVI...
# 1 UniRef50_A0A410P257 Glycogen synthase n=2 Tax=Candidatus Velamenic... MKAIAWLIVLTFLPEQVAWAVDYNLRGALHGAVAPLVSAATVATDG...
# 2 UniRef50_A0A8J3NBY6 Gln_amidase domain-containing protein n=2 Tax=... MEILGRNLPRILGNLVKTIKTAPVRVVARRGARTLTQKEFGKYLGS...
# 3 UniRef50_Q8WZ42 Titin n=3053 Tax=cellular organisms TaxID=1315... MTTQAPTFTQPLQSVVVLEGSTATFEAHISGFPVPEVSWFRDGQVI...
# 4 UniRef50_A0A401TRQ8 Ig-like domain-containing protein (Fragment) n... PPSFIHKPDPQEVLPGSNVKFTSVVTGTAPLKVSWFKGTTELVAGR...

Conclusion

We hope you enjoy the new features and please let us know if you have any feedback! We'll be back next month with more updates.

BioBear on Conda Forge

· One min read
Trent Hauck
Trent Hauck
Developer

Conda has become a popular way to distribute software in the bioinformatics community. We're excited to announce that we've added BioBear to Conda Forge. You can now install BioBear with the following command:

conda install -c conda-forge biobear

Pip still works as well:

pip install biobear

Spring & Summer 2023 Updates

· 7 min read
Trent Hauck
Trent Hauck
Developer

It's been around six weeks since our last public update. Here's what we've done:

  1. We've refactored our DuckDB extension into a standalone Rust library named Exon. This is a Rust-based library designed to implement a SQL engine that's aware of scientific data types and workflows.
  2. The DuckDB extension now utilizes Exon and has been renamed as Exon-DuckDB. This can be used with Python, R, and C++, wherever DuckDB is used.
  3. We've also released an open-source Python package, BioBear, which connects Exon with PyArrow and Polars.
  4. Exon, along with its DuckDB extension and BioBear, now supports Object Stores such as S3 and GCS. Local paths like ./path/to/file can be replaced with cloud storage paths like s3://bucket/path/to/file or gs://bucket/path/to/file, and the code will function the same.
  5. We've set up a documentation site with additional examples and tutorials, including Delta Lake integration and neighborhood analysis, similar to a map that helps users navigate.
  6. Finally, the libraries have been made open-source under permissive licenses. We hope that this will prompt others to use and contribute to the project, aiding in the development of a community around these tools.

Exon

Exon is a library written in Rust that takes advantage of DataFusion to develop a SQL engine proficient in handling scientific data types and workflows. It's designed for embedding into other applications and is the cornerstone of our DuckDB extension and BioBear packages.

This core infrastructure allows easy data management via Arrow and/or SQL and ensures adaptability across different execution environments. BioBear is specifically a Python package, but the Exon-DuckDB extension is adaptable, operating wherever DuckDB is used, such as Python, R, and JavaScript. We're currently exploring other language bindings, so feel free to reach out if you'd like to see a specific language supported.

See the Exon documentation, which has links to the installable crate, the Rust docs, and more.

Demonstration

For a brief demonstration, let's examine how we can use Exon to scan a GFF file, an output of CRT from a JGI dataset, and join the CRISPR array annotations with repeat units entirely within the array. If Rust isn't your focus and you're primarily interested in utilizing this code as a library, move forward to the BioBear section.

Fully Formed Example
use arrow::util::pretty::pretty_format_batches;
use datafusion::error::DataFusionError;
use datafusion::prelude::*;
use exon::context::ExonSessionExt;

#[tokio::main]
async fn main() -> Result<(), DataFusionError> {
let ctx = SessionContext::new_exon();

let path = "./exon-examples/data/Ga0604745_crt.gff";
let sql = format!(
"CREATE EXTERNAL TABLE gff STORED AS GFF LOCATION '{}';",
path
);

ctx.sql(&sql).await?;

let df = ctx
.sql(
r#"SELECT crispr.seqname, crispr.start, crispr.end, repeat.start, repeat.end
FROM (SELECT * FROM gff WHERE type = 'CRISPR') AS crispr
JOIN (SELECT * FROM gff WHERE type = 'repeat_unit') AS repeat
ON crispr.seqname = repeat.seqname
AND crispr.start <= repeat.start
AND crispr.end >= repeat.end

ORDER BY crispr.seqname, crispr.start, crispr.end, repeat.start, repeat.end
LIMIT 10"#,
)
.await?;

// Show the logical plan.
let logical_plan = df.logical_plan();
assert_eq!(
format!("\n{:?}", logical_plan),
r#"
Limit: skip=0, fetch=10
Sort: crispr.seqname ASC NULLS LAST, crispr.start ASC NULLS LAST, crispr.end ASC NULLS LAST, repeat.start ASC NULLS LAST, repeat.end ASC NULLS LAST
Projection: crispr.seqname, crispr.start, crispr.end, repeat.start, repeat.end
Inner Join: Filter: crispr.seqname = repeat.seqname AND crispr.start <= repeat.start AND crispr.end >= repeat.end
SubqueryAlias: crispr
Projection: gff.seqname, gff.source, gff.type, gff.start, gff.end, gff.score, gff.strand, gff.phase, gff.attributes
Filter: gff.type = Utf8("CRISPR")
TableScan: gff
SubqueryAlias: repeat
Projection: gff.seqname, gff.source, gff.type, gff.start, gff.end, gff.score, gff.strand, gff.phase, gff.attributes
Filter: gff.type = Utf8("repeat_unit")
TableScan: gff"#,
);

// Uncomment to show the physical plan, though it's obviously messier.
// let physical_plan = df.create_physical_plan().await?;
// println!("Physical plan: {:?}", physical_plan);

// Collect the results as a vector of Arrow RecordBatches.
let results = df.collect().await?;
let formatted_results = pretty_format_batches(results.as_slice())?;
assert_eq!(
format!("\n{}", formatted_results),
r#"
+------------------+-------+------+-------+-----+
| seqname | start | end | start | end |
+------------------+-------+------+-------+-----+
| Ga0604745_000026 | 1 | 3473 | 1 | 37 |
| Ga0604745_000026 | 1 | 3473 | 73 | 109 |
| Ga0604745_000026 | 1 | 3473 | 147 | 183 |
| Ga0604745_000026 | 1 | 3473 | 219 | 255 |
| Ga0604745_000026 | 1 | 3473 | 291 | 327 |
| Ga0604745_000026 | 1 | 3473 | 365 | 401 |
| Ga0604745_000026 | 1 | 3473 | 437 | 473 |
| Ga0604745_000026 | 1 | 3473 | 510 | 546 |
| Ga0604745_000026 | 1 | 3473 | 582 | 618 |
| Ga0604745_000026 | 1 | 3473 | 654 | 690 |
+------------------+-------+------+-------+-----+"#
);

Ok(())
}

BioBear

BioBear is a Python package that extends Exon, providing it with a Pythonic interface. For instance, you can read a GFF file from S3, and then load it into a PyArrow RecordBatch Reader and/or a Polars DataFrame.

PyArrow's strength lies in its ability to render our domain-specific data highly portable. For instance, suppose you have a Python job that produces a GFF file, perhaps the output of Prodigal. You can create an Arrow RecordBatch Reader that streams data from S3 into a Delta Lake table.

import biobear as bb
from deltalake import write_deltalake

batch_reader = bb.GFFReader("./job/output/test.gff").to_arrow()

write_deltalake("s3a://bucket/delta/gff", batch_reader, storage_options={"AWS_S3_ALLOW_UNSAFE_RENAME": 'true'})
# aws s3 ls s3://bucket/delta/gff
# PRE _delta_log/
# 2023-06-24 11:00:19 8509556 0-ccedd844-373f-4e70-a99e-2167535fcf35-0.parquet

You can also use BioBear to query local indexed files. For example, you can query a BCF file for records on chromosome 1.

import biobear as bb

reader = bb.BCFIndexedReader("example.bcf")
df = reader.query("1")

In this case, the query is executed by the BCFIndexedReader, which uses the index to only read the records on chromosome 1. This is much faster than reading the entire file and filtering in memory.

Object Stores

Scientific computing is increasingly done in the cloud -- the data that gets generated from workflows is plopped somewhere in you favorite object store (e.g. S3). This has made it less expensive to store the output but adds an additional layer of complexity when trying to do analysis.

To try to mitigate this complexity WHERE TRUE's tools now support reading from object stores. For example, we could rewrite the last example to read from S3:

import biobear as bb
from deltalake import write_deltalake

batch_reader = bb.GFFReader("s3://bucket/job/output/test.gff").to_arrow()
write_deltalake("s3a://bucket/delta/gff", batch_reader, storage_options={"AWS_S3_ALLOW_UNSAFE_RENAME": 'true'})

Put this in a lambda and you're about 3/4ths of the way to a cloud-based bioinformatics data platform.

Or, if DuckDB is more your speed, you can do:

SELECT *
FROM read_gff('s3://bucket/job/output/test.gff')

Better Home and Docs Site

We'll, you're here aren't you... kidding of course, but we've deployed an update to the main site and the docs site. The docs have more examples. We gave the example earlier of how to use BioBear to write a Delta Lake table from a GFF file. We also have an example of how to use Exon-DuckDB to do neighborhood analysis on a metagenomics sample.

Have a gander at our main site here.

Open Source

The ultimate goal of WHERE TRUE Technologies is to not exist. We want to build tools and popularize techniques that diminish the need for the majority of specialized bioinformatics file formats in favor of more general-purpose formats that are easier to work with, but don't make tradeoffs on performance or flexibility.

To get to that state, we've opened sourced our libraries under permissive licenses. We hope that this will encourage others to adopt and contribute to the project, and help us build a community around these tools.

You can review the licenses for each of the libraries at the following links: