Data Prep for Bioinformatic ML
This video is a quick walkthrough of:
- What a join is, why it's important for bioinformatics analysis, and how to recognize when you could be using SQL vs Python.
- How to use SQL to easily combine contigs, GFFs, and amino acid sequences into a dataset for a back-translation pLM from amino acids to CDSs.
- How to extract CDSs from contigs in a strand-aware manner.
- How to use contig metadata for Group K-Fold Cross Validation to avoid leakage.
The second part starts around the 5 minute mark.
If you'd like to try biobear, you can install it with pip:
pip install biobear
Or feel free to reach out if you have any questions. Otherwise, the code from the video is below.
Tabular Example
For example, given the table:
gene_id | contig_id |
---|---|
1 | 1 |
2 | 1 |
Where each row represents a gene and the contig it is on, we can join this table with another table that has information about the contigs:
contig_id | length |
---|---|
1 | 1000 |
To get a table that looks like this:
gene_id | contig_id | length |
---|---|---|
1 | 1 | 1000 |
2 | 1 | 1000 |
So, with a `genes` table and a `contigs` table, this would look like:
SELECT
genes.gene_id,
genes.contig_id,
contigs.length
FROM
genes
JOIN
contigs
ON
genes.contig_id = contigs.contig_id
What This Looks Like in Python
It's important to recognize this pattern in Python with dictionaries. You may already be using it without realizing it.
# Mapping of gene_id to contig_id
genes = {
    1: 1,
    2: 1
}

# Mapping of contig_id to length
contigs = {
    1: 1000
}
These two dictionaries can then be "joined":
for gene_id, contig_id in genes.items():
    length = contigs[contig_id]
    print(gene_id, contig_id, length)
This is straightforward, but it can run into memory issues when the datasets are large; SQL databases are optimized for exactly this type of operation. It is also fairly inflexible: adding additional data, doing aggregations, etc. can require material code changes.
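As a small illustration of that inflexibility (the extra field is hypothetical, not from the video): attaching one more attribute to each gene changes the value type, and with it every loop that reads the mapping.

# Hypothetical extension: each gene now carries (contig_id, strand),
# so every consumer of the mapping has to change with it
genes = {
    1: (1, "+"),
    2: (1, "-"),
}

for gene_id, (contig_id, strand) in genes.items():
    length = contigs[contig_id]
    print(gene_id, contig_id, strand, length)

In SQL, the same change is just an extra column in the table and an extra name in the SELECT list.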
DataFrames
DataFrames are a common data structure found in Python libraries like Pandas (and base R). Conceptually they are similar to tables in SQL, but can run into the same memory and performance issues as dictionaries.
import pandas as pd
genes = pd.DataFrame({
    'gene_id': [1, 2],
    'contig_id': [1, 1]
})

contigs = pd.DataFrame({
    'contig_id': [1],
    'length': [1000]
})
result = pd.merge(genes, contigs, on='contig_id')
print(result)
Types of Joins
There are several types of joins, but the two most common are:
- Inner Join
- Left Join
Inner Join
An inner join is the most common type of join. It only returns rows where there is a match in both tables. In the example above, this means only genes whose contig appears in the contigs table are returned.
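To see the difference between join types in pandas, imagine extending the toy genes table with a row whose contig has no match (this extra row is hypothetical here, but it mirrors the left join result shown below). With an inner join, which is the default for `pd.merge`, that row simply disappears:

# Hypothetical extra row: gene 3 sits on contig 2, which is not in contigs
genes = pd.DataFrame({
    'gene_id': [1, 2, 3],
    'contig_id': [1, 1, 2]
})

# Inner join (the pd.merge default): gene 3 is dropped from the result
inner = pd.merge(genes, contigs, on='contig_id', how='inner')
print(inner)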
Left Join
A left join returns all rows from the left table, and the matched rows from the right table. If there is no match, the result is NULL. In the example above, this means all genes are returned, even one (like gene 3 below) whose contig has no entry in the contigs table.
SELECT
genes.gene_id,
genes.contig_id,
contigs.length
FROM
genes
LEFT JOIN
contigs
ON
genes.contig_id = contigs.contig_id
gene_id | contig_id | length |
---|---|---|
1 | 1 | 1000 |
2 | 1 | 1000 |
3 | 2 | NULL |
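Back in dictionary terms, a left join is roughly what you get by swapping the `[]` lookup for `.get()`, which returns `None` (the NULL analog) instead of raising a `KeyError`. A small sketch, reusing the toy data:

# Left join analog with dictionaries: .get() returns None for contig 2,
# which has no entry, instead of raising a KeyError
genes = {1: 1, 2: 1, 3: 2}
contigs = {1: 1000}

for gene_id, contig_id in genes.items():
    length = contigs.get(contig_id)
    print(gene_id, contig_id, length)

The pandas equivalent is `pd.merge(genes, contigs, on='contig_id', how='left')`, with the missing length coming back as NaN.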
pLM Example with BioBear
Joining CDSs with amino acid sequences and metadata for the purpose of training a pLM.
End goals:
- A table with the AA sequence and DNA sequence for each CDS
- Additional metadata for each CDS; in our case, the contig, so we can do group k-fold cross-validation by contig
import polars as pl
Start a New Session and Load Data
Load our GFF.
import biobear as bb
session = bb.new_session()
session.execute(
"""
CREATE EXTERNAL TABLE gff
STORED AS gff
LOCATION './12803.assembled.gff';
"""
)
session.sql(
"""
SELECT seqname, source, start, "end", strand, phase, attributes.locus_tag[1] locus_tag
FROM gff
WHERE type = 'CDS'
LIMIT 5
"""
).to_polars().glimpse()
Load our contigs. This part is somewhat tricky: the FASTA only gives us the full contig sequence, which contains the CDSs, so we'll need to join against the GFF and extract the CDS sequences in a minute.
session.execute(
"""
CREATE EXTERNAL TABLE contigs
STORED AS fasta OPTIONS (file_extension 'fna')
LOCATION './12803.assembled.fna';
"""
)
session.sql(
"""
SELECT id, sequence FROM contigs LIMIT 5
"""
).to_polars().glimpse()
Load our protein sequences.
session.execute(
"""
CREATE EXTERNAL TABLE protein
STORED AS fasta OPTIONS (file_extension 'faa')
LOCATION './12803.assembled.faa';
"""
)
session.sql(
"""
SELECT id, sequence FROM protein LIMIT 5
"""
).to_polars().glimpse()
Data is loaded, now we can join and extract the CDS sequences.
df = session.sql(
"""
SELECT seqname AS contig,
    CASE
        WHEN strand = '1' THEN substr(contigs.sequence, start, "end" - start + 1)
        ELSE reverse_complement(substr(contigs.sequence, start, "end" - start + 1))
    END AS sequence,
    protein.sequence AS aa_sequence,
    gff.start,
    gff."end",
    gff.strand,
    "end" - start + 1 AS length
FROM gff
JOIN contigs
    ON gff.seqname = contigs.id
JOIN protein
    ON gff.attributes.locus_tag[1] = protein.id
"""
).to_polars()
df.head()
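A quick sanity check worth running here (not from the video): if the coordinates and strand handling are right, each extracted CDS length should be a multiple of 3, barring partial or pseudogene calls.

# Rough sanity check: CDS lengths should generally be divisible by 3
bad = df.filter(pl.col("length") % 3 != 0)
print(f"{bad.height} CDS records with length not divisible by 3")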
Now we can use scikit-learn to do a group k-fold cross-validation using the contig as the group.
from sklearn.model_selection import GroupKFold
X = df["aa_sequence"]
Y = df["sequence"]
contigs = df["contig"]
group_kfold = GroupKFold(n_splits=5)
for train_index, test_index in group_kfold.split(X, Y, groups=contigs):
    print("TRAIN:", train_index)
    print("TEST:", test_index)

    X_train, X_test = X[train_index], X[test_index]

    # Do something with the data
    break
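Finally, an optional check (not in the video) that the grouping really prevents leakage, i.e. no contig ends up on both sides of a split:

# Verify the split is leakage-free: train and test contigs must be disjoint
train_contigs = set(contigs[train_index])
test_contigs = set(contigs[test_index])
assert train_contigs.isdisjoint(test_contigs)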