Comparing the genomes of different
species—or different members of the same species—is the basis of a great
deal of modern biology. DNA sequences that are conserved across species
are likely to be functionally important, while variations between
members of the same species can indicate different susceptibilities to
disease.
The
basic algorithm for determining how much two sequences of symbols have
in common—the "edit distance" between them—is now more than 40 years
old. And for more than 40 years,
computer science researchers have been trying to improve upon it, without much success.

At the ACM Symposium on Theory of Computing (STOC) next week, MIT
researchers will report that, in all likelihood, that's because the
algorithm is as good as it gets. If a widely held assumption about
computational complexity is correct, then the problem of measuring the
difference between two genomes—or texts, or speech samples, or anything
else that can be represented as a string of symbols—can't be solved more
efficiently.

In a sense, that's disappointing, since a computer running the
existing algorithm would take 1,000 years to exhaustively compare two
human genomes. But it also means that computer scientists can stop
agonizing about whether they can do better.
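The "basic algorithm" referred to above is the classic dynamic-programming procedure for edit distance, often attributed to Wagner and Fischer. It fills in a table with one cell for every pair of prefixes of the two sequences, so its running time grows with the product of the two lengths; for two human genomes of roughly three billion characters each, that product is what makes an exhaustive comparison so slow. Below is a minimal sketch in Python, assuming unit costs for insertions, deletions, and substitutions; the function name and the example strings are illustrative, not taken from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (Wagner-Fischer style).

    Runs in time proportional to len(a) * len(b), assuming each
    insertion, deletion, and substitution costs 1.
    """
    m, n = len(a), len(b)
    # prev holds the previous row of the DP table: the distances from
    # the current prefix of `a` (one character shorter) to every prefix of `b`.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            curr[j] = min(
                prev[j] + 1,                            # delete a[i-1]
                curr[j - 1] + 1,                        # insert b[j-1]
                prev[j - 1] + (a[i - 1] != b[j - 1]),   # substitute, or match for free
            )
        prev = curr
    return prev[n]


if __name__ == "__main__":
    print(edit_distance("GATTACA", "GCATGCU"))  # prints 4
```

Keeping only two rows of the table at a time makes the memory use modest, but the running time is still quadratic in the sequence lengths, and that quadratic cost is what the researchers argue cannot be substantially beaten if the complexity assumption they invoke holds.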