BAM Files & BioBear
This post highlights a couple of updates to the BioBear project to support SQL sessions, along with improvements to BAM file support in Exon so it can scale to larger full scans (like 14G of BAM files from a 10X Genomics experiment).
BioBear Update
The main update to BioBear is that it now exposes Exon's inner query engine, allowing for more streamlined application in ETL and ML use cases. If you work in Biotech/Pharma, you probably watch your normie friends use tools like polars, DataFusion, and/or DuckDB to simplify their data processing with a mix of jealousy and contempt. They reap the benefits of faster, cheaper workflows, while you schlep GFFs and BAM files between CLI tools, workflows, and scripts.
With this BioBear release, it becomes simpler to use Exon's query engine in your ETL jobs, ML pipelines, and other execution environments (like AWS Lambda).
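To make that concrete, here's a minimal sketch of what a serverless deployment could look like. The handler shape and event keys are hypothetical, not a prescribed BioBear pattern; the only BioBear calls used are the ones introduced below.

import biobear as bb

def handler(event, context):
    # Hypothetical Lambda handler: one short-lived session per invocation.
    session = bb.connect()

    # Register whichever bucket the event points at (hypothetical event key).
    session.register_object_store_from_url(event["bucket_url"])

    # Run a COPY (or any other SQL statement) supplied by the event.
    session.sql(event["query"])

    return {"status": "ok"}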
So what does that look like? First, install the latest BioBear version:
pip install -U biobear
Then, you can connect to a local Exon session:
import biobear as bb
session = bb.connect()
With that session, you can work with bioinformatic data alongside CSV, Parquet, and JSON data.
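For instance, a plain Parquet table can be registered right next to the bioinformatic formats. A small sketch; the sample sheet path and its columns are hypothetical, and it assumes the standard DataFusion file formats are exposed through the session:

# Register a plain Parquet file next to the bioinformatic formats.
# 'sample_sheet.parquet' and its contents are hypothetical.
session.sql("""
CREATE EXTERNAL TABLE sample_sheet
STORED AS PARQUET LOCATION 'sample_sheet.parquet'
""")

df = session.sql("SELECT * FROM sample_sheet LIMIT 5").to_polars()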
As an example, imagine you wanted to combine and summarize the contents of a metagenomics NGS sample that had been run through primary and secondary analysis. Here we have a sample, Ga0451106, from IMG/M. Let's say we want to store the gene annotations from the primary analysis in S3 as a Parquet file. We also want to be conservative about duplicating storage, especially for sequences, so we'll join the protein FASTA against a PFAM GFF so that we only duplicate the protein sequences annotated with PFAMs we care about, given our end goal.
Write Gene Annotations to Parquet
Taking the session from earlier, create an external table backed by the GFF file, then run a COPY query to write the results.
# Register the store we'll be writing to
session.register_object_store_from_url('s3://wtt-01-dist-prd')
# Create the external table, this could also be on S3
session.sql("""
CREATE EXTERNAL TABLE gene_annotations
STORED AS GFF
LOCATION 'TenflaDSM28944/IMG_Data/Ga0451106_prodigal.gff'
""")
df = session.sql("""
COPY gene_annotations
TO 's3://wtt-01-dist-prd/gene_annotations/sample=Ga0451106/gene_annotations.parquet' (FORMAT parquet)
""")
df.to_polars()
# shape: (1, 1)
# ┌───────┐
# │ count │
# │ --- │
# │ u64 │
# ╞═══════╡
# │ 7998 │
# └───────┘
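To sanity-check the write, the Parquet file can be registered back into the same session. A hedged sketch, assuming STORED AS PARQUET is available (DataFusion's built-in format):

# Register the freshly written Parquet back into the session and
# confirm the row count matches what COPY reported.
session.sql("""
CREATE EXTERNAL TABLE gene_annotations_parquet
STORED AS PARQUET
LOCATION 's3://wtt-01-dist-prd/gene_annotations/sample=Ga0451106/gene_annotations.parquet'
""")

session.sql("SELECT COUNT(*) FROM gene_annotations_parquet").to_polars()
# Expecting the same 7998 rows written above.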
If we already had the GFF in S3, we could load it into polars directly:
# Register the store we'll be reading from
session.register_object_store_from_url('s3://wtt-01-dist-prd')
# Create the external table backed by the GFF on S3
session.sql("""
CREATE EXTERNAL TABLE gene_annotations_s3
STORED AS GFF LOCATION 's3://wtt-01-dist-prd/TenflaDSM28944/IMG_Data/Ga0451106_prodigal.gff'
""")
df = session.sql("SELECT * FROM gene_annotations_s3").to_polars()
df.head()
# shape: (5, 9)
# ┌──────────────┬─────────────────┬──────┬───────┬───┬────────────┬────────┬───────┬───────────────────────────────────┐
# │ seqname ┆ source ┆ type ┆ start ┆ … ┆ score ┆ strand ┆ phase ┆ attributes │
# │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ str ┆ str ┆ i64 ┆ ┆ f32 ┆ str ┆ str ┆ list[struct[2]] │
# ╞══════════════╪═════════════════╪══════╪═══════╪═══╪════════════╪════════╪═══════╪═══════════════════════════════════╡
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS ┆ 2 ┆ … ┆ 54.5 ┆ - ┆ 0 ┆ [{"ID",["Ga0451106_01_2_238"]}, … │
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS ┆ 228 ┆ … ┆ 114.0 ┆ - ┆ 0 ┆ [{"ID",["Ga0451106_01_228_941"]}… │
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS ┆ 1097 ┆ … ┆ 224.399994 ┆ + ┆ 0 ┆ [{"ID",["Ga0451106_01_1097_2257"… │
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS ┆ 2261 ┆ … ┆ 237.699997 ┆ + ┆ 0 ┆ [{"ID",["Ga0451106_01_2261_3787"… │
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS ┆ 3784 ┆ … ┆ 114.400002 ┆ + ┆ 0 ┆ [{"ID",["Ga0451106_01_3784_4548"… │
# └──────────────┴─────────────────┴──────┴───────┴───┴────────────┴────────┴───────┴───────────────────────────────────┘
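The attributes column comes back as a list of structs, which polars can flatten without round-tripping through strings. A sketch; the struct field names are an assumption here, so check df.schema for the real ones:

import polars as pl

# Flatten the list-of-structs attributes column: one row per attribute,
# then lift the struct fields into columns.
attrs = (
    df.select("seqname", "start", "attributes")
      .explode("attributes")   # one row per attribute struct
      .unnest("attributes")    # struct fields become columns
)
print(attrs.head())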
Subsetting our Sequences
As mentioned, we want to be cognizant of storage costs, so we'll only duplicate amino acid sequences into Parquet when they likely contain a certain PFAM domain.
As with the gene annotations, we can create external tables for both the protein FASTA and the PFAM GFF:
# Create the external table, this could also be on S3
session.sql("""
CREATE EXTERNAL TABLE protein_fasta
STORED AS FASTA LOCATION 'TenflaDSM28944/IMG_Data/Ga0451106_proteins.faa'
""")
session.sql("""
CREATE EXTERNAL TABLE pfam_gff
STORED AS GFF LOCATION 'TenflaDSM28944/IMG_Data/Ga0451106_pfam.gff'
""")
# Register the store we'll be writing to
session.register_object_store_from_url('s3://wtt-01-dist-prd')
gff = session.sql("""SELECT * FROM pfam_gff""").to_polars()
gff.head()
# shape: (5, 9)
# ┌──────────────────────────────┬─────────────────────────────┬─────────┬───────┬───┬────────────┬────────┬───────┬───────────────────────────────────┐
# │ seqname ┆ source ┆ type ┆ start ┆ … ┆ score ┆ strand ┆ phase ┆ attributes │
# │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ str ┆ str ┆ i64 ┆ ┆ f32 ┆ str ┆ str ┆ list[struct[2]] │
# ╞══════════════════════════════╪═════════════════════════════╪═════════╪═══════╪═══╪════════════╪════════╪═══════╪═══════════════════════════════════╡
# │ Ga0451106_01_1000235_1003480 ┆ HMMER 3.1b2 (February 2015) ┆ PF14684 ┆ 673 ┆ … ┆ 100.900002 ┆ . ┆ null ┆ [{"ID",["Ga0451106_01_1000235_10… │
# │ Ga0451106_01_1000235_1003480 ┆ HMMER 3.1b2 (February 2015) ┆ PF14685 ┆ 753 ┆ … ┆ 81.900002 ┆ . ┆ null ┆ [{"ID",["Ga0451106_01_1000235_10… │
# │ Ga0451106_01_1000235_1003480 ┆ HMMER 3.1b2 (February 2015) ┆ PF03572 ┆ 875 ┆ … ┆ 70.400002 ┆ . ┆ null ┆ [{"ID",["Ga0451106_01_1000235_10… │
# │ Ga0451106_01_1000235_1003480 ┆ HMMER 3.1b2 (February 2015) ┆ PF07676 ┆ 459 ┆ … ┆ 16.799999 ┆ . ┆ null ┆ [{"ID",["Ga0451106_01_1000235_10… │
# │ Ga0451106_01_1000235_1003480 ┆ HMMER 3.1b2 (February 2015) ┆ PF07676 ┆ 408 ┆ … ┆ 15.3 ┆ . ┆ null ┆ [{"ID",["Ga0451106_01_1000235_10… │
# └──────────────────────────────┴─────────────────────────────┴─────────┴───────┴───┴────────────┴────────┴───────┴───────────────────────────────────┘
# Join the protein FASTA and PFAM GFF and write the results to S3
df = session.sql("""
COPY (
SELECT DISTINCT fasta.id, fasta.description, fasta.sequence
FROM protein_fasta AS fasta
INNER JOIN pfam_gff AS pfam
ON fasta.id = pfam.seqname
WHERE pfam.type = 'PF13561' -- Enoyl-(Acyl carrier protein) reductase
) TO 's3://wtt-01-dist-prd/sequences/pfam=PF13561/sample=Ga0451106/sequences.parquet' (FORMAT parquet)
""").to_polars()
print(df)
# shape: (1, 1)
# ┌───────┐
# │ count │
# │ --- │
# │ u64 │
# ╞═══════╡
# │ 49 │
# └───────┘
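Note the pfam=.../sample=... layout in the output path: it's Hive-style partitioning, so as more samples land under the same prefix they can be read back as one frame. A sketch, assuming AWS credentials are available to polars in the environment:

import polars as pl

# Glob every sample under the PF13561 prefix and let hive partitioning
# recover `pfam` and `sample` as columns.
sequences = pl.scan_parquet(
    "s3://wtt-01-dist-prd/sequences/pfam=PF13561/**/*.parquet",
    hive_partitioning=True,
).collect()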
Exon BAM Update
The other update to highlight is that Exon, and by extension BioBear, has been optimized to support projection and predicate pushdowns, enabling more efficient queries on BAM files.
Projection Pushdown
For example, say we have a couple of BAM files from this 10X Genomics experiment. For simplicity, let's assume one file for one sample from the treatment and one from the control, and we just want to look at the read name, the start position, and the mapping quality for QC purposes.
We can create an external table for the directory containing the BAM files - a listing table allows us to query the files in a directory as one table. Stop for a moment and consider what working with multiple BAM files looks like with samtools: you'd have to write a script to loop over the files, parse the output, and then combine the results. With Exon, you can just query the directory as a table.
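For contrast, a hedged sketch of that manual route using pysam (one of several ways you might script it):

import glob
import pysam

# The manual route: open each BAM, iterate every read, and stitch the
# per-file results together yourself.
rows = []
for path in glob.glob("bam-files/*.bam"):
    with pysam.AlignmentFile(path, "rb") as bam:
        for read in bam:
            rows.append((read.query_name, read.reference_start, read.mapping_quality))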
$ ls -lh bam-files
-rw-r--r--@ 1 thauck staff 6.7G Sep 29 15:33 CytAssist_FFPE_Human_Colon_Post_Xenium_Rep1_possorted_genome_bam.bam
-rw-r--r-- 1 thauck staff 2.2M Sep 29 15:33 CytAssist_FFPE_Human_Colon_Post_Xenium_Rep1_possorted_genome_bam.bam.bai
-rw-r--r-- 1 thauck staff 7.0G Sep 29 15:55 CytAssist_FFPE_Human_Colon_Rep1_possorted_genome_bam.bam
-rw-r--r-- 1 thauck staff 2.3M Sep 29 15:55 CytAssist_FFPE_Human_Colon_Rep1_possorted_genome_bam.bam.bai
As you can see, these are a couple of large files, totaling just under 14G for the main data files and around 5M for the associated indexes.
import biobear as bb
session = bb.connect()
session.sql("""
CREATE EXTERNAL TABLE ten_x_experiment STORED AS BAM LOCATION 'bam-files/'
""")
# See which columns are available
columns = session.sql("""
DESCRIBE ten_x_experiment
""").to_polars().select('column_name')
# print(columns)
df = session.sql("""
SELECT name, start, mapping_quality
FROM ten_x_experiment
""").to_polars()
# print(df.shape)
# (287822789, 3)
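With the projected frame in hand, a quick QC rollup is a few lines of polars. A sketch; the aggregation choice is just illustrative:

import polars as pl

# Illustrative QC rollup: how many reads fall into each mapping quality bucket.
qc = (
    df.group_by("mapping_quality")
      .agg(pl.len().alias("n_reads"))
      .sort("n_reads", descending=True)
)
print(qc.head())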
Reading about 285,000,000 records took around 3 minutes on my laptop, an M2 MacBook Air with 24GB of RAM. And while I recognize running on a local laptop isn't the most scientific benchmark, it does represent how many practitioners work with data. In fact, I think that's an exciting thing about Exon and related tools: building on the increased compute power of personal computers to enable local analysis of large datasets.
Predicate Pushdown
We just pushed down the projection, i.e. which fields are "physically read", but we can also push down region filters to determine which records are read in the first place, before they ever reach the query's predicate. This enables quickly subsetting files to just the region of interest.
import biobear as bb
session = bb.connect()
# Note that we've changed to an INDEXED_BAM table
session.sql("""
CREATE EXTERNAL TABLE ten_x_experiment_index STORED AS INDEXED_BAM LOCATION 'bam-files/'
""")
df = session.sql("""
SELECT name, start, mapping_quality
FROM ten_x_experiment_index
WHERE bam_region_filter('chr1:1-1000000', reference, start, "end")
""").to_polars()
# print(df)
# shape: (39_090, 3)
# ┌───────────────────────────────────┬────────┬─────────────────┐
# │ name ┆ start ┆ mapping_quality │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i32 ┆ str │
# ╞═══════════════════════════════════╪════════╪═════════════════╡
# │ A00519:1665:H3LNHDSX7:3:2631:254… ┆ 69486 ┆ null │
# │ A00519:1665:H3LNHDSX7:3:2406:262… ┆ 69486 ┆ null │
# │ A00519:1665:H3LNHDSX7:3:2247:326… ┆ 69486 ┆ null │
# │ A00519:1665:H3LNHDSX7:3:2340:113… ┆ 69486 ┆ null │
# │ … ┆ … ┆ … │
# │ A00519:1665:H3LNHDSX7:3:2657:215… ┆ 999932 ┆ 0 │
# │ A00519:1665:H3LNHDSX7:3:2607:125… ┆ 999932 ┆ 0 │
# │ A00519:1665:H3LNHDSX7:3:2346:582… ┆ 999932 ┆ 0 │
# │ A00519:1665:H3LNHDSX7:3:2240:126… ┆ 999937 ┆ 0 │
# └───────────────────────────────────┴────────┴─────────────────┘
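Because the region string is just a SQL argument, it's easy to parameterize. A small hypothetical helper (not part of the BioBear API):

def reads_in_region(session, table, region):
    # Hypothetical convenience wrapper around the region-filtered projection.
    return session.sql(f"""
        SELECT name, start, mapping_quality
        FROM {table}
        WHERE bam_region_filter('{region}', reference, start, "end")
    """).to_polars()

chr1_reads = reads_in_region(session, "ten_x_experiment_index", "chr1:1-1000000")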
Conclusion
This post highlights two recent updates to our tools:
- BioBear now exposes Exon's query engine, allowing for more streamlined application in ETL and ML use cases.
- Exon now supports projection and predicate pushdowns to enable more efficient queries on BAM files.
As always, feel free to reach out with questions, comments, or suggestions.