Exon-DuckDB Snippet: Find Contigs with 2+ PFAMs of Interest
· 2 min read
This is a quick snippet for how you can use Exon-DuckDB to find contigs that have multiple PFAMs of interest. This has application in metagenomic discovery of things like CRISPR systems and antibiotic resistance genes.
Create a Table of Annotations
CREATE TABLE gff AS
SELECT
reference_sequence_name,
gff_parse_attributes(attributes) attributes
FROM read_gff_raw('IowNatreferecore.gff');
read_gff_raw
reads the GFF. If it was actually a gff, we could use read_gff. But this GFF is malformed from the source, so we useread_gff_raw
.gff_parse_attributes
is converting a gff attributes string into key value pairs.- This dataset has 12.7 million rows, and easily processed on a laptop.
Create a Table of PFAMs
CREATE TABLE pfams AS
SELECT
target_name,
accession[:7] AS pfam
FROM read_hmm_dom_tbl_out('hmmout')
WHERE evalue < 1e-10
read_hmm_dom_tbl_out
reads the domain hmm output.accession[:7]
munges the raw output from PF00000.12 to PF00000evalue
filtering gives us significant hits.
Combine Tables
CREATE TABLE contig_pfams AS
SELECT
reference_sequence_name AS contig_id,
LIST(pfam) AS contig_pfams
FROM gff
JOIN pfams
ON gff.attributes['locus_tag'][1] = pfams.target_name
GROUP BY reference_sequence_name
HAVING LEN(LIST(pfam)) > 5
- Aggregating into a LIST means one row per contig with a LIST of PFAMs in the other column
- Join Key is within attributes -- the 'locus_tag' key
- HAVING applies a heuristic that we only want contigs with a bit of length.
Discover Cool Things
SELECT *
FROM contig_pfams
WHERE list_contains(contig_pfams, 'PF17893')
AND list_contains(contig_pfams, 'PF01867')
AND list_contains(contig_pfams, 'PF09827')
list_contains
returns true if the list in the first argument contains the string in the second argument.- Easy to generalize to other CRISPR types or clusters of genes more generally.
- All done locally with just SQL.