KeBaB: $k$-mer based breaking for finding super-maximal exact matches

#1 KeBaB: $k$ -mer based breaking for finding super-maximal exact matches [PDF] [Copy] [Kimi] [REL]

Authors: Nathaniel K. Brown, Anas Alhadi, Nour Allam, Dove Begleiter, Nithin Bharathi Kabilan Karpagavalli, Suchith Sridhar Khajjayam, Hamza Wahed, Travis Gagie

Suppose we have a tool for finding super-maximal exact matches (SMEMs) and we want to use it to find all the long SMEMs between a noisy long read $P$ and a highly repetitive pangenomic reference $T$ . Notice that if $L \geq k$ and the $k$ -mer $P [i..i + k - 1]$ does not occur in $T$ then no SMEM of length at least $L$ contains $P [i..i + k - 1]$ . Therefore, if we have a Bloom filter for the distinct $k$ -mers in $T$ and we want to find only SMEMs of length $L \geq k$ , then when given $P$ we can break it into maximal substrings consisting only of $k$ -mers the filter says occur in $T$ -- which we call pseudo-SMEMs -- and search only the ones of length at least $L$ . If $L$ is reasonably large and we can choose $k$ well then the Bloom filter should be small (because $T$ is highly repetitive) but the total length of the pseudo-SMEMs we search should also be small (because $P$ is noisy). Now suppose we are interested only in the longest $t$ SMEMs of length at least $L$ between $P$ and $T$ . Notice that once we have found $t$ SMEMs of length at least $\ell$ then we need only search for SMEMs of length greater than $\ell$ . Therefore, if we sort the pseudo-SMEMs into non-increasing order by length, then we can stop searching once we have found $t$ SMEMs at least as long as the next pseudo-SMEM we would search. Our preliminary experiments indicate that these two admissible heuristics may significantly speed up SMEM-finding in practice.

Subject: Data Structures and Algorithms

Publish: 2025-02-27 18:08:10 UTC

Stared Paper(s):

#1 KeBaB: $k$ -mer based breaking for finding super-maximal exact matches

2502.20338

#1 KeBaB: kk-mer based breaking for finding super-maximal exact matches [PDF] [Copy] [Kimi] [REL]

#1 KeBaB: $k$ -mer based breaking for finding super-maximal exact matches [PDF] [Copy] [Kimi] [REL]