BioBear Quick Start
This page describes how to get started with BioBear.
Installation
BioBear is available from PyPI and can be installed with pip:
pip install biobear
If you'd like to use BioBear with Polars, you'll need to install Polars as well:
pip install biobear polars
Usage
There are two ways to use BioBear: with an Exon session or with a reader. The Exon session provides a more SQL-like interface to the data, while the readers provide a more programmatic interface.
The Exon session is more flexible and can be used to query data from any of the supported file types. The readers are limited to full table scans, but are easier to use.
Session
✅ Prefer the session over the readers: the session is more flexible, and the readers will eventually be phased out.
BioBear's session is a wrapper around an Exon session that provides convenience methods for working with data in a more SQL-native way. See the Exon Session page for more details, or scroll down to the Examples section on this page.
Readers
⚠️ Readers are being phased out. Use the appropriate method on the Session instead.
The various subclasses of Reader have a few methods in common:
- to_arrow(): convert the data to an Arrow RecordBatchReader
- to_arrow_scanner(): convert the data to an Arrow scanner
- to_polars(): read the data into a polars DataFrame
Based on what you're trying to do, you can use any of these methods to get your data into the format you need. For example, if you want to write your data to a Delta Lake table, you can use to_arrow() to get an Arrow RecordBatchReader and then use the Delta Lake Python package to write the data out. See the Delta Lake Integration page for more details.
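As a rough sketch of that flow (the table path is illustrative, and this assumes a deltalake version whose write_deltalake accepts an Arrow RecordBatchReader):
import biobear as bb
from deltalake import write_deltalake

# Stream FASTQ records as Arrow batches and write them to a local Delta table
reader = bb.FastqReader("test.fastq").to_arrow()
write_deltalake("./fastq_delta", reader)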
Indexed Datasets
⚠️ Indexed Datasets are being phased out. Use the appropriate method on the Session instead.
BioBear supports some indexed file types that can be queried by genomic region. To use these, create an indexed reader, then call query() with a region string (e.g. chr1:1-100), as in the sketch below. The last example on this page shows this in full.
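A minimal sketch (the file name here is hypothetical):
import biobear as bb

# Requires the index file (example.vcf.gz.tbi) to be next to the data file
reader = bb.VCFIndexedReader("example.vcf.gz")
rbr = reader.query("chr1:1-100")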
Examples
Session
Create an Exon session:
import biobear as bb
session = bb.connect()
Create an external table backed by an object in an S3 bucket:
# Create the external table from an object in S3 (a local path would also work here)
session.execute("""
CREATE EXTERNAL TABLE gene_annotations
STORED AS GFF
LOCATION 's3://wtt-01-dist-prd/TenflaDSM28944/IMG_Data/Ga0451106_prodigal.gff'
""")
Query the table and get the results as a polars DataFrame:
result = session.sql("""
SELECT seqname, score, start, "end" FROM gene_annotations WHERE score > 50
""")
df = result.to_polars()
Copy the results to a local parquet file:
session.sql("""
COPY (SELECT seqname, score, start, "end" FROM gene_annotations) TO './gene_annotations.parquet' (FORMAT parquet)
""").collect()
For more details on the Exon session, see the Exon Session page.
Full File Scans
There is a set of helper classes that can be used to read data from various file types. These classes are subclasses of Reader and can be used to read data into a polars DataFrame, a pandas DataFrame, or an Arrow RecordBatchReader.
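The common Reader methods cover Arrow and Polars directly; for pandas, one route (a sketch that goes through pyarrow's RecordBatchReader.read_pandas) is:
import biobear as bb

# Materialize the Arrow stream into a pandas DataFrame
df = bb.FastqReader("test.fastq").to_arrow().read_pandas()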
Read a FASTQ file as a pyarrow RecordBatchReader:
import biobear as bb
reader = bb.FastqReader("test.fastq").to_arrow()
print(reader)
# <pyarrow.lib.RecordBatchReader object at 0x7f9b1c0b4b70>
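Because this is a standard pyarrow RecordBatchReader, the usual Arrow APIs apply. For example, you can materialize it into a Table (note this loads the whole file into memory):
table = reader.read_all()
print(table.num_rows)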
Read a FASTQ file and convert to a polars DataFrame:
import biobear as bb
df = bb.FastqReader("test.fastq").to_polars()
print(df.head())
# ┌─────────┬───────────────────────┬───────────────────────────────────┬───────────────────────────────────┐
# │ name ┆ description ┆ sequence ┆ quality │
# │ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ str ┆ str ┆ str │
# ╞═════════╪═══════════════════════╪═══════════════════════════════════╪═══════════════════════════════════╡
# │ SEQ_ID ┆ This is a description ┆ GATTTGGGGTTCAAAGCAGTATCGATCAAATA… ┆ !''*((((***+))%%%++)(%%%%).1***-… │
# │ SEQ_ID2 ┆ null ┆ GATTTGGGGTTCAAAGCAGTATCGATCAAATA… ┆ !''*((((***+))%%%++)(%%%%).1***-… │
# └─────────┴───────────────────────┴───────────────────────────────────┴───────────────────────────────────┘
Read a gzipped FASTQ file:
import biobear as bb
from biobear.compression import Compression
df = bb.FastqReader("test.fastq.gz", compression=Compression.GZIP).to_polars()
print(df.head())
# ┌─────────┬─────────────┬───────────────────────────────────┬───────────────────────────────────┐
# │ name ┆ description ┆ sequence ┆ quality │
# │ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ str ┆ str ┆ str │
# ╞═════════╪═════════════╪═══════════════════════════════════╪═══════════════════════════════════╡
# │ SEQ_ID ┆ null ┆ GATTTGGGGTTCAAAGCAGTATCGATCAAATA… ┆ !''*((((***+))%%%++)(%%%%).1***-… │
# │ SEQ_ID2 ┆ null ┆ GATTTGGGGTTCAAAGCAGTATCGATCAAATA… ┆ !''*((((***+))%%%++)(%%%%).1***-… │
# └─────────┴─────────────┴───────────────────────────────────┴───────────────────────────────────┘
# The compression type is also inferred from the extension of the file
df = bb.FastqReader("test.fastq.gz").to_polars()
print(df.head())
# ┌─────────┬─────────────┬───────────────────────────────────┬───────────────────────────────────┐
# │ name ┆ description ┆ sequence ┆ quality │
# │ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ str ┆ str ┆ str │
# ╞═════════╪═════════════╪═══════════════════════════════════╪═══════════════════════════════════╡
# │ SEQ_ID ┆ null ┆ GATTTGGGGTTCAAAGCAGTATCGATCAAATA… ┆ !''*((((***+))%%%++)(%%%%).1***-… │
# │ SEQ_ID2 ┆ null ┆ GATTTGGGGTTCAAAGCAGTATCGATCAAATA… ┆ !''*((((***+))%%%++)(%%%%).1***-… │
# └─────────┴─────────────┴───────────────────────────────────┴───────────────────────────────────┘
Query an indexed VCF file into an Arrow RecordBatchReader:
import biobear as bb
# Will error if vcf_file.vcf.gz.tbi is not present
rbr = bb.VCFIndexedReader("vcf_file.vcf.gz").query("1")
print(rbr)
# <pyarrow.lib.RecordBatchReader at 0x127cf6ca0>
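To continue the analysis in Polars, one option (a sketch) is to drain the stream into an Arrow Table first:
import polars as pl

# read_all() consumes the RecordBatchReader into an in-memory Table
df = pl.from_arrow(rbr.read_all())
print(df.head())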
📄️ Exon Session: While BioBear exposes functions to read various data formats directly, it also provides an Exon session for working with data in a more SQL-native way.
📄️ Delta Lake Integration: A goal of BioBear, and WTT writ large, is to bring bioinformatics data handling into the Data Lake era. "Data silos" is a bit trite, but it's true that bioinformatics data is often stored in a variety of formats and locations, which makes it difficult to combine metadata from different sources with the underlying experimental data for analysis.
📄️ DuckDB Integration: Using BioBear with DuckDB is straightforward. First, use BioBear to generate an Arrow RecordBatchReader from your data, then use the DuckDB Python package to read from that reader, as in the sketch after this list.
📄️ GenomicRanges Integration: GenomicRanges is a Python package that provides a convenient way to work with genomic ranges. BioBear can be used to read data from a GFF/GTF file quickly and convert it to a GenomicRanges object for further analysis.
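As a sketch of that DuckDB pattern (this relies on DuckDB's replacement scans, which resolve the Python variable reader by name; note that a RecordBatchReader can only be consumed once):
import biobear as bb
import duckdb

reader = bb.FastqReader("test.fastq").to_arrow()
con = duckdb.connect()

# DuckDB's replacement scan finds `reader` in the enclosing Python scope
df = con.sql("SELECT name, sequence FROM reader").df()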