Exon SQL Reference
At its core, Exon supports a fairly common SQL dialect, however there's a number of additional features that are worth knowing to get the most out of Exon.
Table Functions
Exon exposes a number of table functions that can be used to read data directly from files or groups of files. It's a faster alternative to creating an external table and then querying it.
Scan Functions
These functions are used to scan a dataset and return a table.
Function Name | Required Arguments | Optional Arguments | Description |
---|---|---|---|
fasta_scan | location | compression_type | Scans a FASTA dataset. |
fastq_scan | location | compression_type | Scans a FASTQ dataset. |
gff_scan | location | compression_type | Scans a GFF dataset. |
gtf_scan | location | compression_type | Scans a GTF dataset. |
bed_scan | location | compression_type | Scans a BED dataset. |
hmm_dom_tab_scan | location | compression_type | Scans a HMMER DOMTAB dataset. |
genbank_scan | location | compression_type | Scans a GenBank dataset. |
sam_scan | location | compression_type | Scans a SAM dataset. |
bam_scan | location | - | Scans a BAM dataset. |
vcf_scan | location | compression_type | Scans a VCF dataset. |
bcf_scan | location | - | Scans a BCF dataset. |
fcs_scan | location | - | Scans an FCS dataset. |
Example(s)
Select all data from a FASTA file:
SELECT *
FROM fasta_scan('s3://bucket/file.fasta');
Indexed Scan Functions
These functions are used to scan indexed file types. They take the location (like file path) and a region literal and/or a region file path to filter by.
Function Name | Required Arguments | Optional Arguments | Description |
---|---|---|---|
bam_indexed_scan | location , region | - | Filters a BAM dataset by region. |
vcf_indexed_scan | location , region | - | Filters a VCF dataset by region. |
gff_indexed_scan | location , region | - | Filters a GFF dataset by region. |
fasta_indexed_scan | location , region | - | Filters a FASTA dataset by region. |
fasta_indexed_scan | location , region_file | - | Filters a FASTA dataset by regions, includes all regions in the file. |
Example(s)
Select all data from a group of BAM files in S3 that overlap the region chr1:1000-2000
:
SELECT *
FROM bam_indexed_scan('s3://bucket/experiment123/', 'chr1:1000-2000');
Indexed FASTA Files
You can work with indexed FASTA files. This is useful for quickly accessing specific sequences or regions in large sequence sets.
⚠️ You must have an index file associated with the FASTA file. The index file must have the same name as the FASTA file, but with a
.fai
extension. For example, if your FASTA file istest.fasta
, the index file must betest.fasta.fai
.
When querying an indexed FASTA file, you have two options:
Option One: You can pass one region directly to the fasta_indexed_scan
function, like so:
SELECT *
FROM fasta_indexed_scan('fasta/test.fasta', 'chr1:1000000-2000000');
In this case, the function will return all sequences that overlap the region chr1:1000000-2000000
.
Option Two: You can pass a regions file to the fasta_indexed_scan
function, like so:
SELECT *
FROM fasta_indexed_scan('fasta/test.fasta', 'regions.txt');
Where regions.txt
is a file with one region per line, e.g.:
chr1:1000000-2000000
chr2:1000000-2000000
This also works on Object Stores, like S3. You'll just need to pass the full path to the file, e.g. s3://my-bucket/fasta/test.fasta
and s3://my-bucket/regions.txt
.
SELECT *
FROM fasta_indexed_scan('s3://my-bucket/fasta/test.fasta', 's3://my-bucket/regions.txt');
The id of the returned sequence will be the region name, and the sequence will be the sequence of the region.
Scalar Functions
These function can be applied in the SELECT
clause to transform data.
gc_content
For an input sequence, returns the GC content as a float.
SELECT gc_content('ATGC');
quality_scores_to_string
For an input list of quality scores on the Phred scale, returns a string of the quality scores.
SELECT quality_scores_to_string([2, 3, 4])
-- Returns #$%
quality_scores_to_list
For an input string of quality scores, returns a list of quality scores on the Phred scale.
SELECT quality_scores_to_list('#$%')
-- Returns [2, 3, 4]
trim_polya
For an input sequence, returns the sequence with any poly-A tail removed.
SELECT trim_polya('ATCGAAAA')
-- Returns ATCG
alignment_score
For an input sequence and a reference sequence, returns the (local) alignment score.
SELECT alignment_score('ATCG', 'ATCGG')
-- Returns 4
Creating A Table
Tables backed by file(s) can be created using the CREATE EXTERNAL TABLE
statement. The syntax is as follows:
CREATE EXTERNAL TABLE
[ IF NOT EXISTS ]
<TABLE_NAME>
STORED AS <FILE_TYPE>
[ COMPRESSION TYPE <GZIP | ZSTD> ]
[ PARTITIONED BY (<column list>) ]
[ OPTIONS (<key_value_list>) ]
LOCATION <literal>
Breaking this down into its components:
IF NOT EXISTS
- If the table already exists, don't throw an error. Optional.<TABLE_NAME>
- The name of the table to create. This needs to be the fully qualified name, including the database name and schema name.<FILE_TYPE>
- The type of file(s) that the table is backed by. See File Formats for more information.COMPRESSION TYPE
- The compression type of the file(s). Optional.PARTITIONED BY (<column list>)
- A comma separated list of columns to partition the table by. Optional.OPTIONS (<key_value_list>)
- A comma separated list of key value pairs. Optional.LOCATION <literal>
- The location of the file(s) backing the table. This can be a local file path, or a URI to a file in S3 or GCS.
Different file types have different functionality, so see the File Formats section for more information.
Compression Type
For the most part, Exon supports GZIP
, ZSTD
, and UNCOMPRESSED
(omitted). In some cases, Exon will interpret GZIP
and block gzip, especially when it is germane to the file type.
Examples
Create a table backed by a BAM file:
CREATE EXTERNAL TABLE my_bam
STORED AS BAM
LOCATION 's3://bucket/file.bam';
Create a table backed by a VCF file that is stored in GCS and compressed with GZIP:
CREATE EXTERNAL TABLE my_vcf
STORED AS VCF
COMPRESSION TYPE GZIP
LOCATION 'gs://bucket/file.vcf.gz';
Create a table backed by a object store directory containing a number of indexed VCF files:
CREATE EXTERNAL TABLE my_vcf
STORED AS INDEXED_VCF
COMPRESSION TYPE GZIP
LOCATION 's3://bucket/vcf_files/';
Create a table backed by an object store directory containing a number of FASTA files, where the object store layout follows hive partitioning. E.g. s3://bucket/fasta_files/reference_name=chr1/file.fa
CREATE EXTERNAL TABLE my_fasta
STORED AS FASTA
PARTITIONED BY (reference_name)
LOCATION 's3://bucket/fasta_files/';
Exporting Data
It is possible to export data via the SQL COPY
statement for FASTA and FASTQ files. For example, to copy a FASTA table to a local file:
COPY fasta_file TO 'test.fasta' STORED AS FASTA;
Or to copy a query,
COPY (SELECT * FROM fasta_file WHERE reference_name = 'chr1') TO 'test.fasta' STORED AS FASTA;
Or to copy a FASTQ table to a local file with GZIP compression:
COPY fastq_file TO 'test.fastq.gz' STORED AS FASTQ OPTIONS (compression 'GZIP');
FASTA Schema
When exporting a FASTA file, the schema is as follows:
Column Name | Type | Nullable | Description |
---|---|---|---|
id | String | NO | The unique identifier for the sequence |
description | String | YES | The description of the sequence |
sequence | String | NO | The sequence of the sequence |
FASTQ Schema
When exporting a FASTQ file, the schema is as follows:
Column Name | Type | Nullable | Description |
---|---|---|---|
name | String | NO | The name of the read |
description | String | YES | The description of the read |
sequence | String | NO | The sequence of the read |
quality_scores | String | NO | The quality scores of the read |
File Formats
Depending on the underlying file type a table represents, Exon will expose different functionality. The following sections describe the functionality exposed by each file type.
Currently, Exon supports the following file types:
File Type | Description |
---|---|
BAM | A BAM file. |
BCF | A BCF file. |
BED | A BED file. |
FASTA | A FASTA file (.fasta). |
FASTQ | A FASTQ file (.fastq). |
FCS | An FCS file. |
SDF | An SDF (structure-data) file. |
GENBANK | A GenBank file. |
GFF | A GFF file. |
GTF | A GTF file. |
HMMDOMTAB | A HMMER DOMTAB file. |
INDEXED_BAM | A BAM file with an associated index file. |
INDEXED_VCF | A VCF file with an associated index file. |
INDEXED_VGFF | A GFF file with an associated index file. |
MZML | An mzML file. |
SAM | A SAM file. |
VCF | A VCF file. |
FASTA Files
FASTA files can be queried individually or as a collection of files.
Controlling the FASTA File Extension
By default, Exon with look for likes that end with .fasta
(or .fasta
with a compression extension like .fasta.gz
). If you're using something else, you can pass the file_extension
option to the CREATE EXTERNAL TABLE
statement.
CREATE EXTERNAL TABLE my_fasta
STORED AS FASTA
OPTIONS (file_extension 'faa')
LOCATION 's3://bucket/file.faa';
This will cause Exon to look for files that end with .faa
(or .faa
with a compression extension like .faa.gz
).
Data Types for FASTA Sequences
When creating a table from a FASTA file, you can control the data type used to store the sequence. By default, Exon will use a utf8
data type. However, this is not the only option. You can also use:
large_utf8
- A large UTF8 data type. This is useful for storing very large sequences (e.g. full chromosomes).integer_encode_protein
- An integer encoded protein sequence. This is useful for loading sequences for ML models without needing to go through the overhead of utf8 encoding.integer_encode_dna
- An integer encoded DNA sequence. This is useful for loading sequences for ML models without needing to go through the overhead of utf8 encoding.
To set the data type, you can use the exon.fasta_sequence_data_type
variable. For example, to set the data type to large_utf8
:
SET exon.fasta_sequence_data_type = 'large_utf8';
This will cause all subsequent FASTA
tables to use the large_utf8
data type for the sequence column. Alternatively, you can set the data type on a per table basis by passing the sequence_data_type
option to the CREATE EXTERNAL TABLE
statement.
CREATE EXTERNAL TABLE my_fasta
STORED AS FASTA
OPTIONS (sequence_data_type 'integer_encode_protein')
LOCATION 's3://bucket/file.fasta';
FASTQ Files
FASTQ files are used for reads from the sequencer. Similar to FASTA
, you can use pass the file_extension
option to override the default file extension.
CREATE EXTERNAL TABLE my_fastq
STORED AS FASTQ
OPTIONS (file_extension 'fq')
LOCATION 's3://bucket/file.fq';
SAM Files
SAM files are associated with the SAM file type. They are similar to BAM files except they can only be scanned, and not indexed.
Similar to BAM file you can control tag parsing:
SET exon.sam_parse_tags = true;
will cause the tags to be parsed and stored as a struct rather than a list of key value pairs, which is the default.
BED Files
A BED file may contain a variable number of fields. By default, Exon will assume that the file has 12 fields. If your file has a different number of fields, you can pass the n_fields
option to the CREATE EXTERNAL TABLE
statement.
CREATE EXTERNAL TABLE my_bed
STORED AS BED
OPTIONS ('bed.n_fields' '3')
LOCATION 's3://bucket/file.bed';
If you don't supply the n_fields
option, Exon will assume the file has 12 fields and fill in any missing fields with nulls.
BAM Files
BAM Files can be associated with a BAM
or INDEXED_BAM
file types. BAM
files may or may not be indexed, but INDEXED_BAM
files must have an associated index file. INDEXED_BAM
is used a a failsafe if you know your file is indexed, but Exon is unable to detect it which could lead to very slow queries on very large files.
Querying an Indexed BAM File
When you have a BAM
or INDEXED_BAM
table that has an associated index, you can make use of that index by filtering for the region in the WHERE
clause.
For example, say I had a table called my_bam
that was backed by a file in S3 called s3://bucket/file.bam
and an index called s3://bucket/file.bam.bai
. I could query for all reads in the region chr1:1000-2000
like so:
SELECT *
FROM my_bam
WHERE bam_region_filter('chr1:1000-2000', reference, start, end) = true
If I only cared about the reference sequence name, I could omit the start and end columns:
SELECT *
FROM my_bam
WHERE bam_region_filter('chr1', reference) = true
Controlling Tag Parsing
By default, Exon will use a list of key value pairs to store the tags from a BAM file. You can however pass:
SET exon.bam_parse_tags = true;
And Exon will infer the schema of the tags and store them as a struct. This is useful because you get "typed" tags, but it can be slower and use more memory.
VCF Files
VCF Files can be associated with a VCF
or INDEXED_VCF
file types. VCF
files may or may not be indexed, but INDEXED_VCF
files must have an associated index file. INDEXED_VCF
is used a a failsafe if you know your file is indexed, but Exon is unable to detect it which could lead to very slow queries on very large files.
Query an Indexed VCF File
Similar to querying an indexed BAM file, you can query an indexed VCF file by using the vcf_region_filter
function. It takes similar arguments to bam_region_filter
, except rather than the start
and end
columns, it takes the pos
column.
SELECT *
FROM my_vcf
WHERE vcf_region_filter('chr1:1000-2000', reference, pos) = true
Controlling INFO
and FORMAT
(Genotypes) Parsing
VCF files can contain a large number of INFO
and FORMAT
fields, and parsing them all can be expensive. By default, Exon will parse all INFO
and FORMAT
and struct and list of structs, respectively. However, you can control if they're parsed or just returned as strings by using the following options:
SET exon.vcf_parse_info = true;
SET exon.vcf_parse_format = true;
MzML Files
MzML files can be associated with a MZML
file type. And they have the following schema:
Column Name | Type | Nullable | Description |
---|---|---|---|
id | String | NO | The unique identifier for the spectrum |
mz | Struct{mz: FLOAT64[]} | YES | The m/z value of the spectrum |
intensity | Struct{intensity: FLOAT64[]} | YES | The intensity value of the spectrum |
wavelength | Struct{wavelength: FLOAT64[]} | YES | The wavelength value of the spectrum |
cv_params | Map<STRING, STRING> | YES | The CV parameters of the spectrum |
precursor_list | Struct | YES | The precursor list of the spectrum |
Example
To select all data:
SELECT *
FROM mzml_table;
To select five example M/Z arrays:
SELECT mz.mz AS mz
FROM mzml_table
LIMIT 5;