Data Prep for Bioinformatic ML
This video is a quick walkthrough of:
- What a join is, why it's important for bioinformatics analysis, and how to recognize when you could be using SQL vs Python.
- How to use SQL to easily combine contigs, GFFs, and amino acid sequences into a dataset for a back-translation pLM from amino acids to CDSs.
- How to extract CDSs from contigs in a strand-aware manner.
- How to use contig metadata for Group K-Fold Cross Validation to avoid leakage.
The second part starts around the 5 minute mark.
If you'd like to try biobear, you can install it with pip:
pip install biobear
Or feel free to reach out if you have any questions. Otherwise, the code from the video is below.
Tabular Example
For example, given the table:
gene_id | contig_id |
---|---|
1 | 1 |
2 | 1 |
Where each row represents a gene and the contig it is on, we can join this table with another table that has information about the contigs:
contig_id | length |
---|---|
1 | 1000 |
To get a table that looks like this:
gene_id | contig_id | length |
---|---|---|
1 | 1 | 1000 |
2 | 1 | 1000 |
So, with a `genes` table and a `contigs` table, this would look like:
SELECT
genes.gene_id,
genes.contig_id,
contigs.length
FROM
genes
JOIN
contigs
ON
genes.contig_id = contigs.contig_id
What This Looks Like in Python
It's important to recognize this pattern in Python with dictionaries. You may already be using it without realizing it.
# Mapping of gene_id to contig_id
genes = {
    1: 1,
    2: 1
}

# Mapping of contig_id to length
contigs = {
    1: 1000
}
These two dictionaries can then be "joined":
for gene_id, contig_id in genes.items():
    length = contigs[contig_id]
    print(gene_id, contig_id, length)
This is straightforward, but it can run into memory issues when the datasets are large; SQL databases are optimized for exactly this type of operation. It is also fairly inflexible: adding additional data, doing aggregations, etc. can require material code changes.
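As a small illustration of that inflexibility (the extra field is hypothetical, not from the video): attaching one more attribute to each gene changes the value type, and with it every loop that reads the mapping.

# Hypothetical extension: each gene now carries (contig_id, strand),
# so every consumer of the mapping has to change with it
genes = {
    1: (1, "+"),
    2: (1, "-"),
}

for gene_id, (contig_id, strand) in genes.items():
    length = contigs[contig_id]
    print(gene_id, contig_id, strand, length)

In SQL, the same change is just an extra column in the table and an extra name in the SELECT list.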
DataFrames
DataFrames are a common data structure found in Python libraries like Pandas (and base R). Conceptually they are similar to tables in SQL, but can run into the same memory and performance issues as dictionaries.
import pandas as pd
genes = pd.DataFrame({
    'gene_id': [1, 2],
    'contig_id': [1, 1]
})

contigs = pd.DataFrame({
    'contig_id': [1],
    'length': [1000]
})
result = pd.merge(genes, contigs, on='contig_id')
print(result)
Types of Joins
There are several types of joins, but the two most common are:
- Inner Join
- Left Join
Inner Join
An inner join is the most common type of join. It only returns rows where there is a match in both tables. In the example above, this means only genes whose contig appears in the contigs table are returned.
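To see the difference between join types in pandas, imagine extending the toy genes table with a row whose contig has no match (this extra row is hypothetical here, but it mirrors the left join result shown below). With an inner join, which is the default for `pd.merge`, that row simply disappears:

# Hypothetical extra row: gene 3 sits on contig 2, which is not in contigs
genes = pd.DataFrame({
    'gene_id': [1, 2, 3],
    'contig_id': [1, 1, 2]
})

# Inner join (the pd.merge default): gene 3 is dropped from the result
inner = pd.merge(genes, contigs, on='contig_id', how='inner')
print(inner)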
Left Join
A left join returns all rows from the left table, and the matched rows from the right table. If there is no match, the result is NULL. In the example above, this means all genes are returned, even one (like gene 3 below) whose contig has no entry in the contigs table.
SELECT
genes.gene_id,
genes.contig_id,
contigs.length
FROM
genes
LEFT JOIN
contigs
ON
genes.contig_id = contigs.contig_id
gene_id | contig_id | length |
---|---|---|
1 | 1 | 1000 |
2 | 1 | 1000 |
3 | 2 | NULL |
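Back in dictionary terms, a left join is roughly what you get by swapping the `[]` lookup for `.get()`, which returns `None` (the NULL analog) instead of raising a `KeyError`. A small sketch, reusing the toy data:

# Left join analog with dictionaries: .get() returns None for contig 2,
# which has no entry, instead of raising a KeyError
genes = {1: 1, 2: 1, 3: 2}
contigs = {1: 1000}

for gene_id, contig_id in genes.items():
    length = contigs.get(contig_id)
    print(gene_id, contig_id, length)

The pandas equivalent is `pd.merge(genes, contigs, on='contig_id', how='left')`, with the missing length coming back as NaN.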
pLM Example with BioBear
Joining CDSs with amino acid sequences and metadata for the purpose of training a pLM.
End goals:
- A table with the AA sequence and DNA sequence for each CDS
- Additional metadata for each CDS; in our case, the contig, so we can do group k-fold cross-validation by contig
import polars as pl
Start a New Session and Load Data
Load our GFF.
import biobear as bb
session = bb.new_session()
session.execute(
"""
CREATE EXTERNAL TABLE gff
STORED AS gff
LOCATION './12803.assembled.gff';
"""
)
session.sql(
"""
SELECT seqname, source, start, "end", strand, phase, attributes.locus_tag[1] locus_tag
FROM gff
WHERE type = 'CDS'
LIMIT 5
"""
).to_polars().glimpse()
Load our contigs. This part is somewhat tricky: the FASTA only gives us the full contig sequence, which contains the CDSs, so we'll need to join against the GFF and extract the CDS sequences in a minute.
session.execute(
"""
CREATE EXTERNAL TABLE contigs
STORED AS fasta OPTIONS (file_extension 'fna')
LOCATION './12803.assembled.fna';
"""
)
session.sql(
"""
SELECT id, sequence FROM contigs LIMIT 5
"""
).to_polars().glimpse()
Load our protein sequences.
session.execute(
"""
CREATE EXTERNAL TABLE protein
STORED AS fasta OPTIONS (file_extension 'faa')
LOCATION './12803.assembled.faa';
"""
)
session.sql(
"""
SELECT id, sequence FROM protein LIMIT 5
"""
).to_polars().glimpse()
Data is loaded, now we can join and extract the CDS sequences.
df = session.sql(
"""
SELECT seqname AS contig,
    CASE
        WHEN strand = '1' THEN substr(contigs.sequence, start, "end" - start + 1)
        ELSE reverse_complement(substr(contigs.sequence, start, "end" - start + 1))
    END AS sequence,
    protein.sequence AS aa_sequence,
    gff.start,
    gff."end",
    gff.strand,
    "end" - start + 1 AS length
FROM gff
JOIN contigs
    ON gff.seqname = contigs.id
JOIN protein
    ON gff.attributes.locus_tag[1] = protein.id
"""
).to_polars()
df.head()
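A quick sanity check worth running here (not from the video): if the coordinates and strand handling are right, each extracted CDS length should be a multiple of 3, barring partial or pseudogene calls.

# Rough sanity check: CDS lengths should generally be divisible by 3
bad = df.filter(pl.col("length") % 3 != 0)
print(f"{bad.height} CDS records with length not divisible by 3")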
Now we can use scikit-learn to do a group k-fold cross-validation using the contig as the group.
from sklearn.model_selection import GroupKFold
X = df["aa_sequence"]
Y = df["sequence"]
contigs = df["contig"]
group_kfold = GroupKFold(n_splits=5)
for train_index, test_index in group_kfold.split(X, Y, groups=contigs):
    print("TRAIN:", train_index)
    print("TEST:", test_index)

    X_train, X_test = X[train_index], X[test_index]

    # Do something with the data
    break
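Finally, an optional check (not in the video) that the grouping really prevents leakage, i.e. no contig ends up on both sides of a split:

# Verify the split is leakage-free: train and test contigs must be disjoint
train_contigs = set(contigs[train_index])
test_contigs = set(contigs[test_index])
assert train_contigs.isdisjoint(test_contigs)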