BAM Files & BioBear
This post highlights a couple of updates to the BioBear project to support SQL sessions, along with improvements to BAM file support in Exon so it can scale to larger full scans (like 14G of BAM files from a 10X Genomics experiment).
BioBear Update
The main update to BioBear is that it now exposes Exon's inner query engine, allowing for more streamlined application in ETL and ML use cases. If you work in Biotech/Pharma, you probably watch your normie friends use tools like polars, DataFusion, and/or DuckDB to simplify their data processing with a mix of jealousy and contempt. They reap the benefits of faster, cheaper workflows, while you schlep GFFs and BAM files between CLI tools, workflows, and scripts.
With this BioBear release, it becomes simpler to use Exon's query engine in your ETL jobs, ML pipelines, and other execution environments (like AWS Lambda).
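To make that concrete, here's a minimal sketch of what a serverless deployment could look like. The handler shape and event keys are hypothetical, not a prescribed BioBear pattern; the only BioBear calls used are the ones introduced below.

import biobear as bb

def handler(event, context):
    # Hypothetical Lambda handler: one short-lived session per invocation.
    session = bb.connect()

    # Register whichever bucket the event points at (hypothetical event key).
    session.register_object_store_from_url(event["bucket_url"])

    # Run a COPY (or any other SQL statement) supplied by the event.
    session.sql(event["query"])

    return {"status": "ok"}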
So what does that look like? First, install the latest BioBear version:
pip install -U biobear
Then, you can connect to a local Exon session:
import biobear as bb
session = bb.connect()
With that session, you can work with bioinformatic data alongside CSV, Parquet, and JSON data.
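For instance, a plain Parquet table can be registered right next to the bioinformatic formats. A small sketch; the sample sheet path and its columns are hypothetical, and it assumes the standard DataFusion file formats are exposed through the session:

# Register a plain Parquet file next to the bioinformatic formats.
# 'sample_sheet.parquet' and its contents are hypothetical.
session.sql("""
CREATE EXTERNAL TABLE sample_sheet
STORED AS PARQUET LOCATION 'sample_sheet.parquet'
""")

df = session.sql("SELECT * FROM sample_sheet LIMIT 5").to_polars()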
As an example, imagine you wanted to combine and summarize the contents of a metagenomics NGS sample that had been run through primary and secondary analysis. Here we have a sample, Ga0451106, from IMG/M. Let's say we want to store the gene annotations from the primary analysis in S3 as a Parquet file. We also want to be conservative about duplicating storage, especially for sequences, so we'll join the protein FASTA against a PFAM GFF so that we only duplicate the protein sequences annotated with PFAMs we care about, given our end goal.
Write Gene Annotations to Parquet
Taking the session from earlier, create an external table backed by the GFF file, then run a COPY query to write the results.
# Register the store we'll be writing to
session.register_object_store_from_url('s3://wtt-01-dist-prd')
# Create the external table, this could also be on S3
session.sql("""
CREATE EXTERNAL TABLE gene_annotations
STORED AS GFF
LOCATION 'TenflaDSM28944/IMG_Data/Ga0451106_prodigal.gff'
""")
df = session.sql("""
COPY gene_annotations
TO 's3://wtt-01-dist-prd/gene_annotations/sample=Ga0451106/gene_annotations.parquet' (FORMAT parquet)
""")
df.to_polars()
# shape: (1, 1)
# ┌───────┐
# │ count │
# │ --- │
# │ u64 │
# ╞═══════╡
# │ 7998 │
# └───────┘
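To sanity-check the write, the Parquet file can be registered back into the same session. A hedged sketch, assuming STORED AS PARQUET is available (DataFusion's built-in format):

# Register the freshly written Parquet back into the session and
# confirm the row count matches what COPY reported.
session.sql("""
CREATE EXTERNAL TABLE gene_annotations_parquet
STORED AS PARQUET
LOCATION 's3://wtt-01-dist-prd/gene_annotations/sample=Ga0451106/gene_annotations.parquet'
""")

session.sql("SELECT COUNT(*) FROM gene_annotations_parquet").to_polars()
# Expecting the same 7998 rows written above.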
If we already had the GFF in S3, we could load it into polars directly:
# Register the store we'll be reading from
session.register_object_store_from_url('s3://wtt-01-dist-prd')
# Create the external table backed by the GFF on S3
session.sql("""
CREATE EXTERNAL TABLE gene_annotations_s3
STORED AS GFF LOCATION 's3://wtt-01-dist-prd/TenflaDSM28944/IMG_Data/Ga0451106_prodigal.gff'
""")
df = session.sql("SELECT * FROM gene_annotations_s3").to_polars()
df.head()
# shape: (5, 9)
# ┌──────────────┬─────────────────┬──────┬───────┬───┬────────────┬────────┬───────┬───────────────────────────────────┐
# │ seqname ┆ source ┆ type ┆ start ┆ … ┆ score ┆ strand ┆ phase ┆ attributes │
# │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ str ┆ str ┆ i64 ┆ ┆ f32 ┆ str ┆ str ┆ list[struct[2]] │
# ╞══════════════╪═════════════════╪══════╪═══════╪═══╪════════════╪════════╪═══════╪═══════════════════════════════════╡
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS ┆ 2 ┆ … ┆ 54.5 ┆ - ┆ 0 ┆ [{"ID",["Ga0451106_01_2_238"]}, … │
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS ┆ 228 ┆ … ┆ 114.0 ┆ - ┆ 0 ┆ [{"ID",["Ga0451106_01_228_941"]}… │
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS ┆ 1097 ┆ … ┆ 224.399994 ┆ + ┆ 0 ┆ [{"ID",["Ga0451106_01_1097_2257"… │
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS ┆ 2261 ┆ … ┆ 237.699997 ┆ + ┆ 0 ┆ [{"ID",["Ga0451106_01_2261_3787"… │
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS ┆ 3784 ┆ … ┆ 114.400002 ┆ + ┆ 0 ┆ [{"ID",["Ga0451106_01_3784_4548"… │
# └──────────────┴─────────────────┴──────┴───────┴───┴────────────┴────────┴───────┴───────────────────────────────────┘
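The attributes column comes back as a list of structs, which polars can flatten without round-tripping through strings. A sketch; the struct field names are an assumption here, so check df.schema for the real ones:

import polars as pl

# Flatten the list-of-structs attributes column: one row per attribute,
# then lift the struct fields into columns.
attrs = (
    df.select("seqname", "start", "attributes")
      .explode("attributes")   # one row per attribute struct
      .unnest("attributes")    # struct fields become columns
)
print(attrs.head())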
Subsetting our Sequences
As mentioned, we want to be cognizant of storage costs, so we'll only duplicate amino acid sequences into Parquet when they likely contain a certain PFAM domain.
As with the gene annotations, we can create external tables for both the protein FASTA and the PFAM GFF:
# Create the external table, this could also be on S3
session.sql("""
CREATE EXTERNAL TABLE protein_fasta
STORED AS FASTA LOCATION 'TenflaDSM28944/IMG_Data/Ga0451106_proteins.faa'
""")
session.sql("""
CREATE EXTERNAL TABLE pfam_gff
STORED AS GFF LOCATION 'TenflaDSM28944/IMG_Data/Ga0451106_pfam.gff'
""")
# Register the store we'll be writing to
session.register_object_store_from_url('s3://wtt-01-dist-prd')
gff = session.sql("""SELECT * FROM pfam_gff""").to_polars()
gff.head()
# shape: (5, 9)
# ┌──────────────────────────────┬─────────────────────────────┬─────────┬───────┬───┬────────────┬────────┬───────┬───────────────────────────────────┐
# │ seqname ┆ source ┆ type ┆ start ┆ … ┆ score ┆ strand ┆ phase ┆ attributes │
# │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ str ┆ str ┆ i64 ┆ ┆ f32 ┆ str ┆ str ┆ list[struct[2]] │
# ╞══════════════════════════════╪═════════════════════════════╪═════════╪═══════╪═══╪════════════╪════════╪═══════╪═══════════════════════════════════╡
# │ Ga0451106_01_1000235_1003480 ┆ HMMER 3.1b2 (February 2015) ┆ PF14684 ┆ 673 ┆ … ┆ 100.900002 ┆ . ┆ null ┆ [{"ID",["Ga0451106_01_1000235_10… │
# │ Ga0451106_01_1000235_1003480 ┆ HMMER 3.1b2 (February 2015) ┆ PF14685 ┆ 753 ┆ … ┆ 81.900002 ┆ . ┆ null ┆ [{"ID",["Ga0451106_01_1000235_10… │
# │ Ga0451106_01_1000235_1003480 ┆ HMMER 3.1b2 (February 2015) ┆ PF03572 ┆ 875 ┆ … ┆ 70.400002 ┆ . ┆ null ┆ [{"ID",["Ga0451106_01_1000235_10… │
# │ Ga0451106_01_1000235_1003480 ┆ HMMER 3.1b2 (February 2015) ┆ PF07676 ┆ 459 ┆ … ┆ 16.799999 ┆ . ┆ null ┆ [{"ID",["Ga0451106_01_1000235_10… │
# │ Ga0451106_01_1000235_1003480 ┆ HMMER 3.1b2 (February 2015) ┆ PF07676 ┆ 408 ┆ … ┆ 15.3 ┆ . ┆ null ┆ [{"ID",["Ga0451106_01_1000235_10… │
# └──────────────────────────────┴─────────────────────────────┴─────────┴───────┴───┴────────────┴────────┴───────┴───────────────────────────────────┘
# Join the protein FASTA and PFAM GFF and write the results to S3
df = session.sql("""
COPY (
SELECT DISTINCT fasta.id, fasta.description, fasta.sequence
FROM protein_fasta AS fasta
INNER JOIN pfam_gff AS pfam
ON fasta.id = pfam.seqname
WHERE pfam.type = 'PF13561' -- Enoyl-(Acyl carrier protein) reductase
) TO 's3://wtt-01-dist-prd/sequences/pfam=PF13561/sample=Ga0451106/sequences.parquet' (FORMAT parquet)
""").to_polars()
print(df)
# shape: (1, 1)
# ┌───────┐
# │ count │
# │ --- │
# │ u64 │
# ╞═══════╡
# │ 49 │
# └───────┘
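Note the pfam=.../sample=... layout in the output path: it's Hive-style partitioning, so as more samples land under the same prefix they can be read back as one frame. A sketch, assuming AWS credentials are available to polars in the environment:

import polars as pl

# Glob every sample under the PF13561 prefix and let hive partitioning
# recover `pfam` and `sample` as columns.
sequences = pl.scan_parquet(
    "s3://wtt-01-dist-prd/sequences/pfam=PF13561/**/*.parquet",
    hive_partitioning=True,
).collect()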
Exon BAM Update
The other update to highlight is that Exon, and by extension BioBear, has been optimized to support projection and predicate pushdowns, enabling more efficient queries on BAM files.
Projection Pushdown
For example, say we have a couple of BAM files from this 10X Genomics experiment. For simplicity, let's assume one file for one sample from the treatment and one from the control, and we just want to look at the read name, the start position, and the mapping quality for QC purposes.
We can create an external table for the directory containing the BAM files - a listing table allows us to query the files in a directory as one table. Stop for a moment and consider what working with multiple BAM files looks like with samtools: you'd have to write a script to loop over the files, parse the output, and then combine the results. With Exon, you can just query the directory as a table.
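For contrast, a hedged sketch of that manual route using pysam (one of several ways you might script it):

import glob
import pysam

# The manual route: open each BAM, iterate every read, and stitch the
# per-file results together yourself.
rows = []
for path in glob.glob("bam-files/*.bam"):
    with pysam.AlignmentFile(path, "rb") as bam:
        for read in bam:
            rows.append((read.query_name, read.reference_start, read.mapping_quality))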
$ ls -lh bam-files
-rw-r--r--@ 1 thauck staff 6.7G Sep 29 15:33 CytAssist_FFPE_Human_Colon_Post_Xenium_Rep1_possorted_genome_bam.bam
-rw-r--r-- 1 thauck staff 2.2M Sep 29 15:33 CytAssist_FFPE_Human_Colon_Post_Xenium_Rep1_possorted_genome_bam.bam.bai
-rw-r--r-- 1 thauck staff 7.0G Sep 29 15:55 CytAssist_FFPE_Human_Colon_Rep1_possorted_genome_bam.bam
-rw-r--r-- 1 thauck staff 2.3M Sep 29 15:55 CytAssist_FFPE_Human_Colon_Rep1_possorted_genome_bam.bam.bai
As you can see, these are a couple of large files, totaling just under 14G for the main data files and around 5M for the associated indexes.
import biobear as bb
session = bb.connect()
session.sql("""
CREATE EXTERNAL TABLE ten_x_experiment STORED AS BAM LOCATION 'bam-files/'
""")
# See which columns are available
columns = session.sql("""
DESCRIBE ten_x_experiment
""").to_polars().select('column_name')
# print(columns)
df = session.sql("""
SELECT name, start, mapping_quality
FROM ten_x_experiment
""").to_polars()
# print(df.shape)
# (287822789, 3)
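With the projected frame in hand, a quick QC rollup is a few lines of polars. A sketch; the aggregation choice is just illustrative:

import polars as pl

# Illustrative QC rollup: how many reads fall into each mapping quality bucket.
qc = (
    df.group_by("mapping_quality")
      .agg(pl.len().alias("n_reads"))
      .sort("n_reads", descending=True)
)
print(qc.head())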
Reading about 285,000,000 records took around 3 minutes on my laptop, an M2 MacBook Air with 24GB of RAM. And while I recognize running on a local laptop isn't the most scientific benchmark, it does represent how many practitioners work with data. In fact, I think that's an exciting thing about Exon and related tools: building on the increased compute power of personal computers to enable local analysis of large datasets.
Predicate Pushdown
We just pushed down the projection, i.e. which fields are "physically read", but we can also push down region filters to determine which records are read in the first place, before they ever reach the query's predicate. This enables quickly subsetting files to just the region of interest.
import biobear as bb
session = bb.connect()
# Note that we've changed to an INDEXED_BAM table
session.sql("""
CREATE EXTERNAL TABLE ten_x_experiment_index STORED AS INDEXED_BAM LOCATION 'bam-files/'
""")
df = session.sql("""
SELECT name, start, mapping_quality
FROM ten_x_experiment_index
WHERE bam_region_filter('chr1:1-1000000', reference, start, "end")
""").to_polars()
# print(df)
# shape: (39_090, 3)
# ┌───────────────────────────────────┬────────┬─────────────────┐
# │ name ┆ start ┆ mapping_quality │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i32 ┆ str │
# ╞═══════════════════════════════════╪════════╪═════════════════╡
# │ A00519:1665:H3LNHDSX7:3:2631:254… ┆ 69486 ┆ null │
# │ A00519:1665:H3LNHDSX7:3:2406:262… ┆ 69486 ┆ null │
# │ A00519:1665:H3LNHDSX7:3:2247:326… ┆ 69486 ┆ null │
# │ A00519:1665:H3LNHDSX7:3:2340:113… ┆ 69486 ┆ null │
# │ … ┆ … ┆ … │
# │ A00519:1665:H3LNHDSX7:3:2657:215… ┆ 999932 ┆ 0 │
# │ A00519:1665:H3LNHDSX7:3:2607:125… ┆ 999932 ┆ 0 │
# │ A00519:1665:H3LNHDSX7:3:2346:582… ┆ 999932 ┆ 0 │
# │ A00519:1665:H3LNHDSX7:3:2240:126… ┆ 999937 ┆ 0 │
# └───────────────────────────────────┴────────┴─────────────────┘
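Because the region string is just a SQL argument, it's easy to parameterize. A small hypothetical helper (not part of the BioBear API):

def reads_in_region(session, table, region):
    # Hypothetical convenience wrapper around the region-filtered projection.
    return session.sql(f"""
        SELECT name, start, mapping_quality
        FROM {table}
        WHERE bam_region_filter('{region}', reference, start, "end")
    """).to_polars()

chr1_reads = reads_in_region(session, "ten_x_experiment_index", "chr1:1-1000000")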
Conclusion
This post highlights two recent updates to our tools:
- BioBear now exposes Exon's query engine, allowing for more streamlined application in ETL and ML use cases.
- Exon now supports projection and predicate pushdowns to enable more efficient queries on BAM files.
As always, feel free to reach out with questions, comments, or suggestions.