
Data Prep for Bioinformatic ML

· 5 min read
Trent Hauck
Developer

This video is a quick walkthrough of:

  1. What a join is, why it's important for bioinformatics analysis, and how to recognize when you could be using SQL instead of Python.
  2. How to use SQL to easily combine contigs, GFFs, and amino acid sequences into a dataset for a back-translation pLM from amino acids to CDSs.
    1. Shows how to extract CDSs from contigs in a strand-aware manner.
    2. Shows how to use contig metadata for Group K-Fold Cross Validation to avoid leakage.

The second part starts around the 5-minute mark.

If you'd like to try biobear, you can install it with pip:

pip install biobear

Or feel free to reach out if you have any questions. Otherwise, the code from the video is below.

Tabular Example

For example, given the table:

gene_id  contig_id
1        1
2        1

Where each row represents a gene and the contig it is on, we can join this table with another table that has information about the contigs:

contig_id  length
1          1000

To get a table that looks like this:

gene_id  contig_id  length
1        1          1000
2        1          1000

So, with a genes table and a contigs table, this would look like:

SELECT
    genes.gene_id,
    genes.contig_id,
    contigs.length
FROM
    genes
JOIN
    contigs
ON
    genes.contig_id = contigs.contig_id

What This Looks Like in Python

It's important to recognize this pattern in Python with dictionaries. You may already be using this pattern without realizing it.

# Mapping of gene_id to contig_id
genes = {
    1: 1,
    2: 1,
}

# Mapping of contig_id to length
contigs = {
    1: 1000,
}

These two dictionaries can then be "joined":

for gene_id, contig_id in genes.items():
    length = contigs[contig_id]
    print(gene_id, contig_id, length)

This is straightforward, but it can run into memory issues if the datasets are large; SQL databases are optimized for this type of operation. It's also fairly inflexible: adding additional data, doing aggregations, and so on can require material code changes.
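
To see that inflexibility concretely, here's a small sketch (not from the video) of a simple extension, counting genes per contig with the dictionaries above. The loop has to be restructured to accumulate state, whereas in SQL it's a single GROUP BY.

# Hypothetical extension: count genes per contig using the dictionaries above.
gene_counts = {}
for gene_id, contig_id in genes.items():
    gene_counts[contig_id] = gene_counts.get(contig_id, 0) + 1

for contig_id, n_genes in gene_counts.items():
    print(contig_id, contigs[contig_id], n_genes)

# Roughly the SQL equivalent:
# SELECT contig_id, count(*) FROM genes GROUP BY contig_id;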

DataFrames

DataFrames are a common data structure found in Python libraries like Pandas (and built into base R). Conceptually they are similar to tables in SQL, but they can run into the same memory and performance issues as dictionaries.

import pandas as pd

genes = pd.DataFrame({
    'gene_id': [1, 2],
    'contig_id': [1, 1],
})

contigs = pd.DataFrame({
    'contig_id': [1],
    'length': [1000],
})

result = pd.merge(genes, contigs, on='contig_id')
print(result)

Types of Joins

There are several types of joins, but the two most common are:

  1. Inner Join
  2. Left Join

Inner Join

An inner join is the most common type of join. It only returns rows where there is a match in both tables. In the example above, this means that only genes on a contig present in the contigs table would be returned.

Left Join

A left join returns all rows from the left table and the matched rows from the right table; where there is no match, the right table's columns are NULL. In the example above, this means all genes would be returned, even those on a contig that isn't in the contigs table.

SELECT
    genes.gene_id,
    genes.contig_id,
    contigs.length
FROM
    genes
LEFT JOIN
    contigs
ON
    genes.contig_id = contigs.contig_id

gene_id  contig_id  length
1        1          1000
2        1          1000
3        2          NULL
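
For reference, the same distinction shows up in Pandas through the how argument to merge. This is a small sketch reusing the contigs DataFrame from above, with a hypothetical third gene added on a contig that is missing from contigs:

genes = pd.DataFrame({
    'gene_id': [1, 2, 3],
    'contig_id': [1, 1, 2],  # gene 3 sits on a contig not present in `contigs`
})

# An inner join drops gene 3; a left join keeps it with a missing (NaN) length.
inner = pd.merge(genes, contigs, on='contig_id', how='inner')
left = pd.merge(genes, contigs, on='contig_id', how='left')

print(inner)
print(left)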

pLM Example with BioBear

Joining CDSs with amino acids and metadata for the purpose of training a pLM.

End goals:

  1. A table with the AA sequence and DNA sequence for each CDS
  2. Additional metadata for each CDS (in our case, the contig, for group k-fold cross-validation)

import polars as pl

Start a New Session and Load Data

Load our GFF.

import biobear as bb

session = bb.new_session()

session.execute(
    """
    CREATE EXTERNAL TABLE gff
    STORED AS gff
    LOCATION './12803.assembled.gff';
    """
)

session.sql(
    """
    SELECT seqname, source, start, "end", strand, phase,
           attributes.locus_tag[1] AS locus_tag
    FROM gff
    WHERE type = 'CDS'
    LIMIT 5
    """
).to_polars().glimpse()

Load our contigs. This part is somewhat tricky: we only have the full contig sequences, which contain the CDSs, so we'll need to join and extract the CDS sequences in a minute.

session.execute(
    """
    CREATE EXTERNAL TABLE contigs
    STORED AS fasta OPTIONS (file_extension 'fna')
    LOCATION './12803.assembled.fna';
    """
)

session.sql(
    """
    SELECT id, sequence FROM contigs LIMIT 5
    """
).to_polars().glimpse()

Load our protein sequences.

session.execute(
    """
    CREATE EXTERNAL TABLE protein
    STORED AS fasta OPTIONS (file_extension 'faa')
    LOCATION './12803.assembled.faa';
    """
)

session.sql(
    """
    SELECT id, sequence FROM protein LIMIT 5
    """
).to_polars().glimpse()

The data is loaded; now we can join and extract the CDS sequences.

df = session.sql(
    """
    SELECT seqname AS contig,
           CASE
               WHEN strand = '1' THEN substr(contigs.sequence, start, "end" - start + 1)
               ELSE reverse_complement(substr(contigs.sequence, start, "end" - start + 1))
           END AS sequence,
           protein.sequence AS aa_sequence,
           gff.start,
           gff."end",
           gff.strand,
           "end" - start + 1 AS length
    FROM gff
    JOIN contigs
        ON gff.seqname = contigs.id
    JOIN protein
        ON gff.attributes.locus_tag[1] = protein.id
    """
).to_polars()

df.head()
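
As a quick sanity check (not in the video), you can spot-check that translating an extracted CDS reproduces its amino acid sequence. This sketch assumes Biopython is installed and that the CDSs use the standard genetic code:

from Bio.Seq import Seq

# Spot-check the first row: translate the strand-aware CDS and compare it
# to the protein sequence that came from the .faa file.
row = df.row(0, named=True)
translated = str(Seq(row["sequence"]).translate(to_stop=True))
print(translated == row["aa_sequence"])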

Now we can use scikit-learn to do a group k-fold cross-validation using the contig as the group.

from sklearn.model_selection import GroupKFold

X = df["aa_sequence"]
Y = df["sequence"]
contigs = df["contig"]

group_kfold = GroupKFold(n_splits=5)

for train_index, test_index in group_kfold.split(X, Y, groups=contigs):
    print("TRAIN:", train_index)
    print("TEST:", test_index)

    X_train, X_test = X[train_index], X[test_index]

    # Do something with the data

    break
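
Because the groups are contigs, no contig should appear in both splits; a quick assertion (not in the video) makes the no-leakage property explicit for the fold above:

# Verify there is no contig-level leakage between the train and test split.
train_contigs = set(contigs[train_index])
test_contigs = set(contigs[test_index])
assert train_contigs.isdisjoint(test_contigs)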