Exon SQL Reference

At its core, Exon supports a fairly common SQL dialect, however there's a number of additional features that are worth knowing to get the most out of Exon.

Table Functions
- Scan Functions
  - Example(s)
- Indexed Scan Functions
  - Example(s)
  - Indexed FASTA Files
Scalar Functions
Creating A Table
- Compression Type
- Examples
Exporting Data
- FASTA Schema
- FASTQ Schema
File Formats

Table Functions

Exon exposes a number of table functions that can be used to read data directly from files or groups of files. It's a faster alternative to creating an external table and then querying it.

Scan Functions

These functions are used to scan a dataset and return a table.

Function Name	Required Arguments	Optional Arguments	Description
`fasta_scan`	`location`	`compression_type`	Scans a FASTA dataset.
`fastq_scan`	`location`	`compression_type`	Scans a FASTQ dataset.
`gff_scan`	`location`	`compression_type`	Scans a GFF dataset.
`gtf_scan`	`location`	`compression_type`	Scans a GTF dataset.
`bed_scan`	`location`	`compression_type`	Scans a BED dataset.
`hmm_dom_tab_scan`	`location`	`compression_type`	Scans a HMMER DOMTAB dataset.
`genbank_scan`	`location`	`compression_type`	Scans a GenBank dataset.
`sam_scan`	`location`	`compression_type`	Scans a SAM dataset.
`bam_scan`	`location`	-	Scans a BAM dataset.
`vcf_scan`	`location`	`compression_type`	Scans a VCF dataset.
`bcf_scan`	`location`	-	Scans a BCF dataset.
`fcs_scan`	`location`	-	Scans an FCS dataset.

Example(s)

Select all data from a FASTA file:

SELECT *
FROM fasta_scan('s3://bucket/file.fasta');

Indexed Scan Functions

These functions are used to scan indexed file types. They take the location (like file path) and a region literal and/or a region file path to filter by.

Function Name	Required Arguments	Optional Arguments	Description
`bam_indexed_scan`	`location`, `region`	-	Filters a BAM dataset by region.
`vcf_indexed_scan`	`location`, `region`	-	Filters a VCF dataset by region.
`gff_indexed_scan`	`location`, `region`	-	Filters a GFF dataset by region.
`fasta_indexed_scan`	`location`, `region`	-	Filters a FASTA dataset by region.
`fasta_indexed_scan`	`location`, `region_file`	-	Filters a FASTA dataset by regions, includes all regions in the file.

Example(s)

Select all data from a group of BAM files in S3 that overlap the region chr1:1000-2000:

SELECT *
FROM bam_indexed_scan('s3://bucket/experiment123/', 'chr1:1000-2000');

Indexed FASTA Files

You can work with indexed FASTA files. This is useful for quickly accessing specific sequences or regions in large sequence sets.

⚠️ You must have an index file associated with the FASTA file. The index file must have the same name as the FASTA file, but with a .fai extension. For example, if your FASTA file is test.fasta, the index file must be test.fasta.fai.

When querying an indexed FASTA file, you have two options:

Option One: You can pass one region directly to the fasta_indexed_scan function, like so:

SELECT *
FROM fasta_indexed_scan('fasta/test.fasta', 'chr1:1000000-2000000');

In this case, the function will return all sequences that overlap the region chr1:1000000-2000000.

Option Two: You can pass a regions file to the fasta_indexed_scan function, like so:

SELECT *
FROM fasta_indexed_scan('fasta/test.fasta', 'regions.txt');

Where regions.txt is a file with one region per line, e.g.:

chr1:1000000-2000000
chr2:1000000-2000000

This also works on Object Stores, like S3. You'll just need to pass the full path to the file, e.g. s3://my-bucket/fasta/test.fasta and s3://my-bucket/regions.txt.

SELECT *
FROM fasta_indexed_scan('s3://my-bucket/fasta/test.fasta', 's3://my-bucket/regions.txt');

The id of the returned sequence will be the region name, and the sequence will be the sequence of the region.

Scalar Functions

These function can be applied in the SELECT clause to transform data.

`gc_content`

For an input sequence, returns the GC content as a float.

SELECT gc_content('ATGC');

`quality_scores_to_string`

For an input list of quality scores on the Phred scale, returns a string of the quality scores.

SELECT quality_scores_to_string([2, 3, 4])
-- Returns #$%

`quality_scores_to_list`

For an input string of quality scores, returns a list of quality scores on the Phred scale.

SELECT quality_scores_to_list('#$%')
-- Returns [2, 3, 4]

`trim_polya`

For an input sequence, returns the sequence with any poly-A tail removed.

SELECT trim_polya('ATCGAAAA')
-- Returns ATCG

`alignment_score`

For an input sequence and a reference sequence, returns the (local) alignment score.

SELECT alignment_score('ATCG', 'ATCGG')
-- Returns 4

Creating A Table

Tables backed by file(s) can be created using the CREATE EXTERNAL TABLE statement. The syntax is as follows:

CREATE EXTERNAL TABLE
[ IF NOT EXISTS ]
<TABLE_NAME>
STORED AS <FILE_TYPE>
[ COMPRESSION TYPE <GZIP |  ZSTD> ]
[ PARTITIONED BY (<column list>) ]
[ OPTIONS (<key_value_list>) ]
LOCATION <literal>

Breaking this down into its components:

IF NOT EXISTS - If the table already exists, don't throw an error. Optional.
<TABLE_NAME> - The name of the table to create. This needs to be the fully qualified name, including the database name and schema name.
<FILE_TYPE> - The type of file(s) that the table is backed by. See File Formats for more information.
COMPRESSION TYPE - The compression type of the file(s). Optional.
PARTITIONED BY (<column list>) - A comma separated list of columns to partition the table by. Optional.
OPTIONS (<key_value_list>) - A comma separated list of key value pairs. Optional.
LOCATION <literal> - The location of the file(s) backing the table. This can be a local file path, or a URI to a file in S3 or GCS.

Different file types have different functionality, so see the File Formats section for more information.

Compression Type

For the most part, Exon supports GZIP, ZSTD, and UNCOMPRESSED (omitted). In some cases, Exon will interpret GZIP and block gzip, especially when it is germane to the file type.

Examples

Create a table backed by a BAM file:

CREATE EXTERNAL TABLE my_bam
STORED AS BAM
LOCATION 's3://bucket/file.bam';

Create a table backed by a VCF file that is stored in GCS and compressed with GZIP:

CREATE EXTERNAL TABLE my_vcf
STORED AS VCF
COMPRESSION TYPE GZIP
LOCATION 'gs://bucket/file.vcf.gz';

Create a table backed by a object store directory containing a number of indexed VCF files:

CREATE EXTERNAL TABLE my_vcf
STORED AS INDEXED_VCF
COMPRESSION TYPE GZIP
LOCATION 's3://bucket/vcf_files/';

Create a table backed by an object store directory containing a number of FASTA files, where the object store layout follows hive partitioning. E.g. s3://bucket/fasta_files/reference_name=chr1/file.fa

CREATE EXTERNAL TABLE my_fasta
STORED AS FASTA
PARTITIONED BY (reference_name)
LOCATION 's3://bucket/fasta_files/';

Exporting Data

It is possible to export data via the SQL COPY statement for FASTA and FASTQ files. For example, to copy a FASTA table to a local file:

COPY fasta_file TO 'test.fasta' STORED AS FASTA;

Or to copy a query,

COPY (SELECT * FROM fasta_file WHERE reference_name = 'chr1') TO 'test.fasta' STORED AS FASTA;

Or to copy a FASTQ table to a local file with GZIP compression:

COPY fastq_file TO 'test.fastq.gz' STORED AS FASTQ OPTIONS (compression 'GZIP');

FASTA Schema

When exporting a FASTA file, the schema is as follows:

Column Name	Type	Nullable	Description
id	String	NO	The unique identifier for the sequence
description	String	YES	The description of the sequence
sequence	String	NO	The sequence of the sequence

FASTQ Schema

When exporting a FASTQ file, the schema is as follows:

Column Name	Type	Nullable	Description
name	String	NO	The name of the read
description	String	YES	The description of the read
sequence	String	NO	The sequence of the read
quality_scores	String	NO	The quality scores of the read

File Formats

Depending on the underlying file type a table represents, Exon will expose different functionality. The following sections describe the functionality exposed by each file type.

Currently, Exon supports the following file types:

File Type	Description
`BAM`	A BAM file.
`BCF`	A BCF file.
`BED`	A BED file.
`FASTA`	A FASTA file (.fasta).
`FASTQ`	A FASTQ file (.fastq).
`FCS`	An FCS file.
`SDF`	An SDF (structure-data) file.
`GENBANK`	A GenBank file.
`GFF`	A GFF file.
`GTF`	A GTF file.
`HMMDOMTAB`	A HMMER DOMTAB file.
`INDEXED_BAM`	A BAM file with an associated index file.
`INDEXED_VCF`	A VCF file with an associated index file.
`INDEXED_VGFF`	A GFF file with an associated index file.
`MZML`	An mzML file.
`SAM`	A SAM file.
`VCF`	A VCF file.

FASTA Files

FASTA files can be queried individually or as a collection of files.

Controlling the FASTA File Extension

By default, Exon with look for likes that end with .fasta (or .fasta with a compression extension like .fasta.gz). If you're using something else, you can pass the file_extension option to the CREATE EXTERNAL TABLE statement.

CREATE EXTERNAL TABLE my_fasta
STORED AS FASTA
OPTIONS (file_extension 'faa')
LOCATION 's3://bucket/file.faa';

This will cause Exon to look for files that end with .faa (or .faa with a compression extension like .faa.gz).

Data Types for FASTA Sequences

When creating a table from a FASTA file, you can control the data type used to store the sequence. By default, Exon will use a utf8 data type. However, this is not the only option. You can also use:

large_utf8 - A large UTF8 data type. This is useful for storing very large sequences (e.g. full chromosomes).
integer_encode_protein - An integer encoded protein sequence. This is useful for loading sequences for ML models without needing to go through the overhead of utf8 encoding.
integer_encode_dna - An integer encoded DNA sequence. This is useful for loading sequences for ML models without needing to go through the overhead of utf8 encoding.

To set the data type, you can use the exon.fasta_sequence_data_type variable. For example, to set the data type to large_utf8:

SET exon.fasta_sequence_data_type = 'large_utf8';

This will cause all subsequent FASTA tables to use the large_utf8 data type for the sequence column. Alternatively, you can set the data type on a per table basis by passing the sequence_data_type option to the CREATE EXTERNAL TABLE statement.

CREATE EXTERNAL TABLE my_fasta
STORED AS FASTA
OPTIONS (sequence_data_type 'integer_encode_protein')
LOCATION 's3://bucket/file.fasta';

FASTQ Files

FASTQ files are used for reads from the sequencer. Similar to FASTA, you can use pass the file_extension option to override the default file extension.

CREATE EXTERNAL TABLE my_fastq
STORED AS FASTQ
OPTIONS (file_extension 'fq')
LOCATION 's3://bucket/file.fq';

SAM Files

SAM files are associated with the SAM file type. They are similar to BAM files except they can only be scanned, and not indexed.

Similar to BAM file you can control tag parsing:

SET exon.sam_parse_tags = true;

will cause the tags to be parsed and stored as a struct rather than a list of key value pairs, which is the default.

BED Files

A BED file may contain a variable number of fields. By default, Exon will assume that the file has 12 fields. If your file has a different number of fields, you can pass the n_fields option to the CREATE EXTERNAL TABLE statement.

CREATE EXTERNAL TABLE my_bed
STORED AS BED
OPTIONS ('bed.n_fields' '3')
LOCATION 's3://bucket/file.bed';

If you don't supply the n_fields option, Exon will assume the file has 12 fields and fill in any missing fields with nulls.

BAM Files

BAM Files can be associated with a BAM or INDEXED_BAM file types. BAM files may or may not be indexed, but INDEXED_BAM files must have an associated index file. INDEXED_BAM is used a a failsafe if you know your file is indexed, but Exon is unable to detect it which could lead to very slow queries on very large files.

Querying an Indexed BAM File

When you have a BAM or INDEXED_BAM table that has an associated index, you can make use of that index by filtering for the region in the WHERE clause.

For example, say I had a table called my_bam that was backed by a file in S3 called s3://bucket/file.bam and an index called s3://bucket/file.bam.bai. I could query for all reads in the region chr1:1000-2000 like so:

SELECT *
FROM my_bam
WHERE bam_region_filter('chr1:1000-2000', reference, start, end) = true

If I only cared about the reference sequence name, I could omit the start and end columns:

SELECT *
FROM my_bam
WHERE bam_region_filter('chr1', reference) = true

Controlling Tag Parsing

By default, Exon will use a list of key value pairs to store the tags from a BAM file. You can however pass:

SET exon.bam_parse_tags = true;

And Exon will infer the schema of the tags and store them as a struct. This is useful because you get "typed" tags, but it can be slower and use more memory.

VCF Files

VCF Files can be associated with a VCF or INDEXED_VCF file types. VCF files may or may not be indexed, but INDEXED_VCF files must have an associated index file. INDEXED_VCF is used a a failsafe if you know your file is indexed, but Exon is unable to detect it which could lead to very slow queries on very large files.

Query an Indexed VCF File

Similar to querying an indexed BAM file, you can query an indexed VCF file by using the vcf_region_filter function. It takes similar arguments to bam_region_filter, except rather than the start and end columns, it takes the pos column.

SELECT *
FROM my_vcf
WHERE vcf_region_filter('chr1:1000-2000', reference, pos) = true

Controlling `INFO` and `FORMAT` (Genotypes) Parsing

VCF files can contain a large number of INFO and FORMAT fields, and parsing them all can be expensive. By default, Exon will parse all INFO and FORMAT and struct and list of structs, respectively. However, you can control if they're parsed or just returned as strings by using the following options:

SET exon.vcf_parse_info = true;
SET exon.vcf_parse_format = true;

MzML Files

MzML files can be associated with a MZML file type. And they have the following schema:

Column Name	Type	Nullable	Description
id	String	NO	The unique identifier for the spectrum
mz	Struct{mz: FLOAT64[]}	YES	The m/z value of the spectrum
intensity	Struct{intensity: FLOAT64[]}	YES	The intensity value of the spectrum
wavelength	Struct{wavelength: FLOAT64[]}	YES	The wavelength value of the spectrum
cv_params	Map<STRING, STRING>	YES	The CV parameters of the spectrum
precursor_list	Struct	YES	The precursor list of the spectrum

Example

To select all data:

SELECT *
FROM mzml_table;

To select five example M/Z arrays:

SELECT mz.mz AS mz
FROM mzml_table
LIMIT 5;

Exon SQL Reference

Table Functions​

Scan Functions​

Example(s)​

Indexed Scan Functions​

Example(s)​

Indexed FASTA Files​

Scalar Functions​

gc_content​

quality_scores_to_string​

quality_scores_to_list​

trim_polya​

alignment_score​

Creating A Table​

Compression Type​

Examples​

Exporting Data​

FASTA Schema​

FASTQ Schema​

File Formats​

FASTA Files​

Controlling the FASTA File Extension​

Data Types for FASTA Sequences​

FASTQ Files​

SAM Files​

BED Files​

BAM Files​

Querying an Indexed BAM File​

Controlling Tag Parsing​

VCF Files​

Query an Indexed VCF File​

Controlling INFO and FORMAT (Genotypes) Parsing​

MzML Files​

Example​

Table Functions

Scan Functions

Example(s)

Indexed Scan Functions

Example(s)

Indexed FASTA Files

Scalar Functions

`gc_content`

`quality_scores_to_string`

`quality_scores_to_list`

`trim_polya`

`alignment_score`

Creating A Table

Compression Type

Examples

Exporting Data

FASTA Schema

FASTQ Schema

File Formats

FASTA Files

Controlling the FASTA File Extension

Data Types for FASTA Sequences

FASTQ Files

SAM Files

BED Files

BAM Files

Querying an Indexed BAM File

Controlling Tag Parsing

VCF Files

Query an Indexed VCF File

Controlling `INFO` and `FORMAT` (Genotypes) Parsing

MzML Files

Example