Skip to main content

One post tagged with "r"

View All Tags

Table Partitions, ExonR, and GitHub Sponsors

ยท 4 min read
Trent Hauck
Trent Hauck
Developer

In addition to launching the public preview of Exome last week, we've also made some updates to Exon.

First, Exon now supports Hive style partitions. This means you can augment your table definitions with partition information. This is useful because:

  1. It allows you to augment your file schemas with additional metadata. For example, the experiment date, or the sample type.
  2. It allows you to query subsets of your data without having to scan the entire table. For example, you can query only the data from a specific sample.

Second, ExonR has been updated to support a SQL session allowing you to create and query tables directly from R, rather than just scan entire files.

Finally, as more of an administrative note, we've added a GitHub Sponsors page. If you find our open source software useful, please consider sponsoring us. Thanks!

Table Partitionsโ€‹

Table partitions are encoded in the file path when you store the data and additionally in the table definition.

For example, say you had a bunch of S3 objects at:

s3://bucket/fasta_table/sample=sampleA/file1.fasta
s3://bucket/fasta_table/sample=sampleA/file2.fasta
s3://bucket/fasta_table/sample=sampleA/file3.fasta
s3://bucket/fasta_table/sample=sampleB/file4.fasta

You could create a table definition like:

CREATE EXTERNAL TABLE fasta_table
STORED AS FASTA
PARTITIONED BY (sample)
LOCATION 's3://bucket/fasta_table'

Then you could query the data like:

SELECT *
FROM fasta_table
WHERE sample = 'sampleA'

And what happens is super useful, Exon will subset the files it scans based on the predicated and augment the FASTA table with the sample column.

So you end up with something like:

iddescriptionsequencesample
seq1......sampleA
seq2......sampleA

Moreover, this makes it straight forward to join with other data. For example, if you had a table of sample metadata:

CREATE EXTERNAL TABLE sample_metadata
STORED AS PARQUET
LOCATION 's3://bucket/sample_metadata/'

You could join the two tables like:

SELECT *
FROM fasta_table
JOIN sample_metadata
ON fasta_table.sample = sample_metadata.sample

Thus simplifying your analysis by allowing deeper analysis without complex ETL. Moveover, you can add additional partitions without having to reprocess your data, just add the new files to the appropriate location and Exon will pick them up.

ExonR SQL Sessionโ€‹

R is R first programming language love ๐Ÿ˜‰. So we continue to build out the R interface to Exon. The latest addition is the ability to work with Exon sessions directly from R. Previously you were only able to scan entire files, but now you can create and query tables directly from R.

For example, you can create a table like, then query it through DuckDB or just directly through a DataFrame.

First create the table from a local file.

library(exonr)

session <- ExonRSessionContext$new()
session$execute("CREATE EXTERNAL TABLE gene_annotations STORED AS GFF LOCATION 'annotations.gff'")

rdf <- session$sql("SELECT seqname, source, type, start, \"end\", score, strand, phase FROM gene_annotations")

Then for DuckDB, convert the result to an Arrow table and then load it into DuckDB.

arrow_table <- rdf$to_arrow()

con <- dbConnect(duckdb::duckdb())

arrow::to_duckdb(arrow_table, table_name = "gene_annotations", con = con)

result <- dbGetQuery(con, "SELECT * FROM gene_annotations")

df <- as.data.frame(result)

Or just work with the DataFrame directly.

rdf <- session$sql("SELECT seqname, source, type, start, \"end\", score, strand, phase FROM gene_annotations")
arrow_table <- rdf$to_arrow()

df <- data.frame(arrow_table)
# seqname source type start end score strand phase
# 1 sq0 caat gene 8 13 NA + <NA>
# 2 sq0 caat gene 8 13 NA + <NA>
# 3 sq0 caat gene 8 13 NA + <NA>
# 4 sq0 caat gene 8 13 NA + <NA>
# 5 sq0 caat gene 8 13 NA + <NA>
# 6 sq0 caat gene 8 13 NA + <NA>

GitHub Sponsorsโ€‹

We've added a GitHub Sponsors page. If you find our open source software useful, please consider sponsoring us.

Learn More: WHERE TRUE Technologies GitHub Sponsors

Wrapping Upโ€‹

In this post we covered two new features of Exon, table partitions and the ExonR SQL session. We hope you find these features useful and we look forward to hearing your feedback.

As always, if you have any questions or feedback, please reach out to us at support@wheretrue.com.