Total: 1
Suppose we have a tool for finding super-maximal exact matches (SMEMs) and we want to use it to find all the long SMEMs between a noisy long read P and a highly repetitive pangenomic reference T. Notice that if L≥k and the k-mer P[i..i+k−1] does not occur in T then no SMEM of length at least L contains P[i..i+k−1]. Therefore, if we have a Bloom filter for the distinct k-mers in T and we want to find only SMEMs of length L≥k, then when given P we can break it into maximal substrings consisting only of k-mers the filter says occur in T -- which we call pseudo-SMEMs -- and search only the ones of length at least L. If L is reasonably large and we can choose k well then the Bloom filter should be small (because T is highly repetitive) but the total length of the pseudo-SMEMs we search should also be small (because P is noisy). Now suppose we are interested only in the longest t SMEMs of length at least L between P and T. Notice that once we have found t SMEMs of length at least ℓ then we need only search for SMEMs of length greater than ℓ. Therefore, if we sort the pseudo-SMEMs into non-increasing order by length, then we can stop searching once we have found t SMEMs at least as long as the next pseudo-SMEM we would search. Our preliminary experiments indicate that these two admissible heuristics may significantly speed up SMEM-finding in practice.