
FASTQs, Barcodes, and Partitions

6 min read
Trent Hauck
Developer

FASTQs and Barcodes

Multiplexing in NGS combines many samples into a single pool, which is then sequenced together. This is a useful strategy to amortize sequencing run costs over those samples. The rub, of course, is that almost anytime something is multiplexed, it needs to be demultiplexed. Moreover, for organizations engaging in many sequencing experiments, the underlying data needs to be cataloged in accordance with data engineering best practices to allow for easy retrieval.

This can be an onerous process, but with Exon-DuckDB's help, you can quickly combine the reads, barcodes, and other metadata; do pre-ingestion QC analytics; group the reads into barcode-based partitions; and finally upload that data into a larger cloud warehouse.

If you'd like to follow along, I'm using data from QIIME2.

wget -O "sample-metadata.tsv" \
"https://data.qiime2.org/2018.4/tutorials/moving-pictures/sample_metadata.tsv"
wget -O "barcodes.fastq.gz" \
"https://data.qiime2.org/2018.4/tutorials/moving-pictures/emp-single-end-sequences/barcodes.fastq.gz"
wget -O "sequences.fastq.gz" \
"https://data.qiime2.org/2018.4/tutorials/moving-pictures/emp-single-end-sequences/sequences.fastq.gz"

Ingestion without Indigestion

Looking at this data, there are three files: a metadata TSV, a gzipped barcodes FASTQ, and a gzipped sequences FASTQ. This data is easy enough to load. Let's pop into the DuckDB CLI, load Exon-DuckDB, and have a look.
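
(If you haven't loaded the extension yet, it's the usual DuckDB LOAD statement; the exact name or path of the Exon-DuckDB extension depends on how you installed it, so treat this as a sketch.)

-- the path below is a placeholder; point it at your Exon-DuckDB build
-- (a locally built extension may also require starting the CLI with -unsigned)
LOAD 'path/to/exon.duckdb_extension';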

Because the two FASTQ files end with "fastq.gz", Exon-DuckDB can detect the format and compression.

CREATE TABLE barcodes AS
SELECT * FROM 'barcodes.fastq.gz';

CREATE TABLE reads AS
SELECT * FROM 'sequences.fastq.gz';

And taking a quick peek, we can see what we're working with.

-- description is empty in these examples
SELECT * EXCLUDE(description) FROM barcodes LIMIT 5;
┌─────────────────────────────────────┬──────────────┬────────────────┐
│ name │ sequence │ quality_scores │
│ varchar │ varchar │ varchar │
├─────────────────────────────────────┼──────────────┼────────────────┤
│ HWI-EAS440_0386:1:23:17547:1423#0/1 │ ATGCAGCTCAGT │ IIIIIIIIIIIH │
│ HWI-EAS440_0386:1:23:14818:1533#0/1 │ CCCCTCAGCGGC │ DDD@D?@B<<+/ │
│ HWI-EAS440_0386:1:23:14401:1629#0/1 │ GACGAGTCAGTC │ GGEGDGGGGGDG │
│ HWI-EAS440_0386:1:23:15259:1649#0/1 │ AGCAGTCGCGAT │ IIIIIIIIIIII │
│ HWI-EAS440_0386:1:23:13748:2482#0/1 │ AGCACACCTACA │ GGGGBGGEEGGD │
└─────────────────────────────────────┴──────────────┴────────────────┘
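
While we're here, a quick bit of pre-ingestion QC. This is plain DuckDB SQL, nothing Exon-specific, just a check I'd add: how is the pool distributed across barcodes?

-- top barcodes by read count; an easy way to spot one barcode hogging the pool
SELECT sequence AS barcode,
       COUNT(*) AS n_reads
FROM barcodes
GROUP BY sequence
ORDER BY n_reads DESC
LIMIT 5;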
SELECT MIN(LENGTH(sequence)), MAX(LENGTH(sequence)) FROM barcodes;
┌─────────────────────────┬─────────────────────────┐
min(length("sequence"))max(length("sequence"))
│ int64 │ int64 │
├─────────────────────────┼─────────────────────────┤
│ 12 │ 12 │
└─────────────────────────┴─────────────────────────┘

The barcodes are in the sequence column, and they're all the same length. And for the actual reads?

-- isn't EXCLUDE cool
SELECT * EXCLUDE(description) FROM reads LIMIT 5;
┌──────────────────────┬──────────────────────┬──────────────────────────────────────────────────────┐
│ name │ sequence │ quality_scores │
│ varchar │ varchar │ varchar │
├──────────────────────┼──────────────────────┼──────────────────────────────────────────────────────┤
│ HWI-EAS440_0386:1:… │ TACGNAGGATCCGAGCGT… │ IIIE)EEEEEEEEGFIIGIIIHIHHGIIIGIIHHHGIIHGHEGDGIFIGE… │
│ HWI-EAS440_0386:1:… │ CCCCNCAGCGGCAAAAAT… │ 64<2$24;1)/:*B<?BBDDBBD<>BDD######################… │
│ HWI-EAS440_0386:1:… │ TACGNAGGATCCGAGCGT… │ GGGC'ACC8;;>;HHHHGHDHHHHHEEHHEHHHHHECHEEEHCHFHHHAG… │
│ HWI-EAS440_0386:1:… │ TACGNAGGATCCGAGCGT… │ IIIE)DEE?CCCBIIIIIIGIIIIIHIIIIIIIIIHIIIHFIDIGIIIHH… │
│ HWI-EAS440_0386:1:… │ TACGNAGGATCCGAGCGT… │ GGGC'?CC4<5<6HHHHHHH@@HGGEEGHHFHHHHBBHGFFGCGBBGG@D… │
└──────────────────────┴──────────────────────┴──────────────────────────────────────────────────────┘

The TSV is a nasty little bugger: the header line is prepended with #, and a second line, also prepended with #, specifies an arbitrary type system. Thankfully, DuckDB has a robust set of options for ingesting CSV/TSV data, so we can skip those two lines and supply the column names and types ourselves.

CREATE TABLE metadata AS
SELECT *
FROM read_csv('sample-metadata.tsv', skip=2, sep='\t', columns={
    'sample_id': 'VARCHAR',
    'barcode': 'VARCHAR',
    'linker_primer_sequence': 'VARCHAR',
    'body_site': 'VARCHAR',
    'year': 'INTEGER',
    'month': 'INTEGER',
    'day': 'INTEGER',
    'subject': 'VARCHAR',
    'reported_antibiotic_usage': 'VARCHAR',
    'days_since_experiment_start': 'INTEGER',
    'description': 'VARCHAR'});

That wasn't so bad. And now that the data is "in", we can use SQL to manipulate it before exporting it.
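
Before building the big table, one more check I'd run (again, plain SQL over the tables we just created): how many reads have a barcode that actually matches a sample in the metadata?

-- matched_reads counts reads whose barcode appears in the sample metadata;
-- the gap between the two numbers is the undetermined reads
SELECT COUNT(*) AS total_reads,
       COUNT(m.sample_id) AS matched_reads
FROM barcodes b
LEFT JOIN metadata m
    ON b.sequence = m.barcode;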

A Big Table

To facilitate analysis, let's join this data together and augment the quality scores by decoding them from their ASCII-encoded string into a list of integers.

CREATE TABLE analysis_table AS
SELECT r.*,
       m.sample_id,
       m.body_site,
       m.subject,
       b.sequence barcode,

       -- e.g. [14, 23, 27, 25, 25, 35, 31, 27, 33, 35, 35, 33, ...]
       quality_score_string_to_list(r.quality_scores) qc_scores

FROM reads r
INNER JOIN barcodes b
    ON r.name = b.name
LEFT JOIN metadata m
    ON b.sequence = m.barcode;

You could stop here and ship this off to your cloud service provider for storage in parquet, but we can do some analysis here and enjoy the quick feedback loop of local work.
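
(If you did want to stop here, it's a single COPY. The s3:// URL below is just a placeholder, and it assumes DuckDB's httpfs extension and your S3 credentials are configured.)

-- placeholder bucket and path; requires the httpfs extension and credentials
COPY analysis_table
TO 's3://your-bucket/analysis_table.parquet' (FORMAT parquet);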

Back to the local analysis. For example, we can quickly find out that there are about 40k undetermined reads (reads whose barcode didn't match any sample in the metadata), and maybe some of those really belong to subject-1.

SELECT subject, COUNT(*) FROM analysis_table GROUP BY subject;
┌───────────┬──────────────┐
│  subject  │ count_star() │
│  varchar  │    int64     │
├───────────┼──────────────┤
│ subject-1 │       106684 │
│ subject-2 │       157194 │
│           │        38703 │
└───────────┴──────────────┘

Or maybe we'd like to filter our analysis table to reads where the last 5 quality scores average above 35 (i.e. we expect some drop-off in quality across the read, but if it's drastic, let's omit the read).

SELECT *
FROM analysis_table
WHERE list_avg(qc_scores[-5:]) > 35;
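
To see how much that filter trims, swap the SELECT * for a count:

-- how many reads keep an average tail quality above 35?
SELECT COUNT(*) AS passing_reads
FROM analysis_table
WHERE list_avg(qc_scores[-5:]) > 35;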

Ok, so entering the home stretch, we've got our data in a "big table". We've done some analysis. Now let's move to the final task of storage and how partitioning can simplify retrieval.

Partitioning with DuckDB and Reading with Exon-DuckDB

The idea of partitioning is to split the data into, well, partitions based on some value that is well aligned with access patterns. For example, for web log data you might partition by day: if there was some site issue last week, you'd want to quickly grab that day's data without going through everything else. Here, we'll partition by barcode, but it's on you to make the decision for your data.

DuckDB's syntax for generating partitions is quite simple.

COPY (
    SELECT * FROM analysis_table
) TO 'analysis_table' (FORMAT parquet, PARTITION_BY barcode);

Now we have a directory structure that looks like this:

$ tree analysis_table | head
analysis_table
├── barcode=AAAAAAAAAAAA
│   ├── data_0.parquet
│   ├── data_1.parquet
│   └── data_2.parquet
├── barcode=AAAAAATTTAGG
│   ├── data_1.parquet
│   └── data_2.parquet
├── barcode=AAAAAGAGCTTA
│   ├── data_0.parquet
...

And that's all well and good; if you need a specific barcode, you can just grab the directory. Or again you can copy this to your cloud provider.
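
You also don't have to pick a directory by hand. DuckDB can read the partitioned dataset back with hive_partitioning enabled (a standard DuckDB option), which turns the barcode directory name into a regular column you can filter on:

-- hive_partitioning exposes the barcode directory as a column
SELECT COUNT(*) AS n_reads
FROM parquet_scan('analysis_table/*/*.parquet', hive_partitioning = 1)
WHERE barcode = 'AAAAAAAAAAAA';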

There's one step missing, though: say you wanted to run the reads for one barcode through a tool, and that tool, like seemingly every bioinformatics tool, can't work with parquet? Exon-DuckDB can help there too.

COPY (
    SELECT * FROM parquet_scan('analysis_table/barcode=AAAAAAAAAAAA/*')
) TO 'AAAAAAAAAAAA.fastq.gz' (FORMAT fastq);

And then having a quick look...

$ gzcat AAAAAAAAAAAA.fastq.gz | head -n 4
@HWI-EAS440_0386:4:11:17168:10699#0/1
NAAAAAAAAAAAAAAAAAAAAAAAACAAACAAAAAAAATCTAAAAAAACTACAAACAAAAATCAAACAAACAAACAAAAATACAACATCTAAATATAAAATTCAAACACACTACATTACACTAGAAACCCTCTCTTCACTACTCACTAACAC
+
########################################################################################################################################################

That's it. Thanks for playing along. If you'd like to learn more about WTT, the company, or Exon-DuckDB, the product, please get in touch via the email in the footer.