Syncmers
Minimizers are widely used in bioinformatics by softwares such as Minimap2 and Kraken2. Recently, the concept of syncmers was proposed as an alternative to minimizers. To quote the paper:
Syncmers are defined here as a family of alternative methods which select k-mers by inspecting the position of the smallest-valued substring of length s < k within the k-mer.
Basically what this means is:
- Take a kmer of length
k
. - Check for the smallest substring (by value) of length
s
in the kmer. - If the location of this substring fulfills a given criteria, classify the kmer as a syncmer.
A very simplified example is a nucleotide ATCG
of length 4. Let k = 3
and s = 2
. Let's assume that our criteria is that the smallest valued substring must be located at the start of the kmer.
We can generate two kmers and for each of them, we check if the smallest valued substring of length s = 2
is located at the start of the kmer.
ATCG # nucleotide sequence.
ATC # kmer_1
TCG # kmer_2
We see that:
ATC
hasAT
as its smallest valued substring andAT
is located at the start.ATC
is a syncmer.TCG
hasCG
as its smallest valued substring andCG
is not located at the start.TCG
is not a syncmer.
Closed syncmers
In this section, we'll go through closed syncmers where the location criteria is that the smallest value substring must be located at either the start or end of the kmer. Expanding our previous example with this location criteria, we get
ATCG # nucleotide sequence.
ATC # kmer_1
TCG # kmer_2
We see that:
ATC
hasAT
as its smallest valued substring andAT
is located at the start.ATC
is a closed syncmer.TCG
hasCG
as its smallest valued substring andCG
is located at the end.TCG
is a closed syncmer.
Implementation
In the example below, we won't bother with nucleotide encoding. Rather, we'll just find closed syncmers by iterate over the kmers, using the .windows()
function and check the location of the smallest valued substring.
fn get_closed_syncmers<'a>(seq: &'a [u8], k: usize, s: usize) -> Vec<&'a [u8]> { assert!(k <= seq.len()); assert!(s <= k); let closed_syncmers: Vec<&[u8]> = seq .windows(k) // Generate kmers. .filter_map(|kmer| { let (smallest_index, _) = kmer .windows(s) // Substrings of length s for a given kmer. .enumerate() .min_by_key(|substring| substring.1) // Find smallest valued substring. .unwrap(); // Location criteria. if smallest_index == 0 || smallest_index == (k - s) { return Some(kmer); } return None; }) .collect(); return closed_syncmers; } fn main() { // Generate one single kmer. Syncmer "AT" is not in the start or end. assert_eq!(get_closed_syncmers(b"TTATT", 5, 2), Vec::<&[u8]>::new()); // The example from the introduction to this chapter. assert_eq!(get_closed_syncmers(b"ATCG", 3, 2), vec![b"ATC", b"TCG"]); // Example from the syncmer paper. assert_eq!(get_closed_syncmers(b"GGCAAGTGACA", 5, 2), vec![b"GGCAA", b"AAGTG", b"AGTGA", b"GTGAC"]); }