Skip to main content

Exon-DuckDB Snippet: Find Contigs with 2+ PFAMs of Interest

· 2 min read
Trent Hauck
Trent Hauck
Developer

This is a quick snippet for how you can use Exon-DuckDB to find contigs that have multiple PFAMs of interest. This has application in metagenomic discovery of things like CRISPR systems and antibiotic resistance genes.

Create a Table of Annotations

CREATE TABLE gff AS
SELECT
reference_sequence_name,
gff_parse_attributes(attributes) attributes
FROM read_gff_raw('IowNatreferecore.gff');
  • read_gff_raw reads the GFF. If it was actually a gff, we could use read_gff. But this GFF is malformed from the source, so we use read_gff_raw.
  • gff_parse_attributes is converting a gff attributes string into key value pairs.
  • This dataset has 12.7 million rows, and easily processed on a laptop.

Create a Table of PFAMs

CREATE TABLE pfams AS
SELECT
target_name,
accession[:7] AS pfam
FROM read_hmm_dom_tbl_out('hmmout')
WHERE evalue < 1e-10
  • read_hmm_dom_tbl_out reads the domain hmm output.
  • accession[:7] munges the raw output from PF00000.12 to PF00000
  • evalue filtering gives us significant hits.

Combine Tables

CREATE TABLE contig_pfams AS
SELECT
reference_sequence_name AS contig_id,
LIST(pfam) AS contig_pfams
FROM gff
JOIN pfams
ON gff.attributes['locus_tag'][1] = pfams.target_name
GROUP BY reference_sequence_name
HAVING LEN(LIST(pfam)) > 5
  • Aggregating into a LIST means one row per contig with a LIST of PFAMs in the other column
  • Join Key is within attributes -- the 'locus_tag' key
  • HAVING applies a heuristic that we only want contigs with a bit of length.

Discover Cool Things

SELECT *
FROM contig_pfams
WHERE list_contains(contig_pfams, 'PF17893')
AND list_contains(contig_pfams, 'PF01867')
AND list_contains(contig_pfams, 'PF09827')
  • list_contains returns true if the list in the first argument contains the string in the second argument.
  • Easy to generalize to other CRISPR types or clusters of genes more generally.
  • All done locally with just SQL.