A new generation of faster methods to find DNA sequence matches was developed since 2000. They are tailored for short (<100 bp) reads alignment and 10–1000 times faster than general algorithms, such as BLAST. However, reads coming from the new sequencing technologies are not short any more, which makes efficiently aligning long reads against a long reference sequence a new challenge to the development of alignment tools.
This paper presents Burrows-Wheeler Aligner’s Smith-Waterman Alignment (BWA-SW), a novel algorithm to align long sequences up to 1Mb against a large sequence database (e.g. the human genome) with a few gigabytes of memory. The algorithm is as accurate as SSAHA2, more accurate than BLAST, and is several to tens of times faster than both.
Similar to other algorithm papers, if you are familiar with the background, it’s easy for you to grasp the core algorithm within a few minutes. So I will just recommend several supporting material here:
- First, Heng Li et al.'s previous paper "Fast and accurate short read alignment with Burrows–Wheeler transform" gives a more detailed explanation of the building blocks of this algorithm, such as Prefix trie and Burrows–Wheeler transform. It is a good idea to read this paper in advance.
- Second, if you still have trouble in understanding how FM-index works, there is an excellent diagrammatic presentation in Alex's post "FM-Indexes and Backwards Search".
- Finally, if you have no idea what is "read alignment" but still interesting in this exquisite application of FM-index. You can read these two slides [1, 2] at first.