A team of scientists from the Singapore Centre for Environmental Life Sciences Engineering (SCELSE) and NTU School of Biological Sciences (SBS) has developed a new DNA sequence alignment algorithm that is faster and more accurate than existing methods.

Sequence alignment

One of the fundamental techniques in molecular and evolutionary biology is sequence alignment, where scientists compare DNA, RNA, or protein sequences. It essentially involves lining up two or more biological strands and comparing them to find which parts are similar and which are different, all with the aid of a computer programme. Sequence alignment helps scientists predict what new genes or proteins might do based on similar known ones. It can also help scientists find the parts of genes or proteins that do specific jobs, or show how closely related certain species are. This technique is essential in research areas such as cancer therapy development, vaccine design, and pathogen identification, where knowing which parts of a DNA or protein sequence to target and treat is vital.

To make the analysis of longer sequences more manageable, current computer algorithms break the query and reference sequences into smaller fixed-length pieces called k-mers, where k represents the number of DNA base pairs in each piece. For example, in the DNA sequence AGAT, the 2-mers (k = 2) would be AG, GA and AT. The computer programme will then align k-mers between the query and reference sequences to detect differences.
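The k-mer idea can be sketched in a few lines of Python. This is only an illustration of the general concept described above, not code from any particular alignment tool, and the function name kmers is a hypothetical one chosen for this example.

```python
def kmers(sequence: str, k: int) -> list[str]:
    """Return every overlapping piece of length k, read left to right."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # A sequence of length n contains n - k + 1 overlapping k-mers
    # (none at all if the sequence is shorter than k).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


print(kmers("AGAT", 2))  # ['AG', 'GA', 'AT']
```

Each piece overlaps the next by k - 1 bases, which is why the four-letter sequence AGAT yields three 2-mers.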

However, this process comes with certain challenges. If the reference sequence has too many differences, i.e. mutations, from the query sequence, it is possible that none of the k-mers will match, preventing any desired positions from being found.

On the other hand, if the reference sequence contains many repeated regions and the k-mers are too short, there may be too many matches. This can result in a tedious and time-consuming task of having to sort through all these matches in order to identify the optimal positions. These issues can arise in different parts of a sequence when using the same fixed k-mer length.
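Both problems can be reproduced with a toy example. The sketch below is a simplified illustration with made-up sequences, assuming exact k-mer matching through a lookup table; the helper names index_kmers and seed_hits are hypothetical, and real aligners are considerably more elaborate.

```python
from collections import defaultdict


def index_kmers(reference: str, k: int) -> dict[str, list[int]]:
    """Map each k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def seed_hits(query: str, reference: str, k: int) -> list[tuple[int, int]]:
    """List (query position, reference position) pairs with identical k-mers."""
    index = index_kmers(reference, k)
    hits = []
    for q_pos in range(len(query) - k + 1):
        for r_pos in index.get(query[q_pos:q_pos + k], []):
            hits.append((q_pos, r_pos))
    return hits


# Problem 1: too many mutations, so no k-mer matches.
reference = "ACGTTGCAAGCT"
mutated_query = "ACGATGCTAGCA"  # differs from the reference at positions 3, 7 and 11
print(len(seed_hits(mutated_query, reference, 4)))  # 0

# Problem 2: a repetitive reference and a short k give many candidate positions.
repetitive_reference = "ACGTACGTACGT"
print(len(seed_hits("ACGT", repetitive_reference, 2)))  # 9
print(len(seed_hits("ACGT", repetitive_reference, 4)))  # 3
```

In the first case every 4-mer of the query contains a mutated base, so nothing matches. In the second, shortening k from 4 to 2 triples the number of candidate positions that would need to be sorted through.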

A new algorithm for the future

To address these issues with current sequence alignment algorithms, a team of researchers from NTU SBS, led by Assistant Professor Anni Zhang, has introduced a new algorithm, X-Mapper, that produces significantly more accurate and consistent results. Instead of using fixed-length k-mers, X-Mapper uses k-mers of various lengths, called x-mers, and inserts gaps inside x-mers to support mutations. Compared to existing methods, the team has shown that X-Mapper makes 11 to 24 times fewer alignment errors when analysing human DNA, and has a 3 to 579 times lower inconsistency rate when working with bacterial DNA.
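To give a feel for why a gap inside a seed can tolerate a mutation, here is a minimal sketch of a generic gapped (or "spaced") seed. It is not X-Mapper's actual implementation, which the article does not describe in detail. The mask notation ('1' for a position that must match, '0' for a gap that is ignored) and the function name gapped_seeds are assumptions made purely for illustration.

```python
def gapped_seeds(sequence: str, mask: str) -> list[str]:
    """Extract seeds with a mask: '1' keeps a base, '0' is a gap that is skipped."""
    if not mask or set(mask) - {"0", "1"}:
        raise ValueError("mask must be a non-empty string of '0' and '1'")
    kept = [i for i, flag in enumerate(mask) if flag == "1"]
    span = len(mask)
    return [
        "".join(sequence[start + i] for i in kept)
        for start in range(len(sequence) - span + 1)
    ]


reference = "ACGTTGCA"
query = "ACATTGCA"  # one mutation relative to the reference, at position 2

# A contiguous 4-mer (mask "1111") fails to match at the start of the sequences...
print(gapped_seeds(reference, "1111")[0], gapped_seeds(query, "1111")[0])  # ACGT ACAT

# ...while a seed with a gap over the mutated position (mask "1101") still matches.
print(gapped_seeds(reference, "1101")[0], gapped_seeds(query, "1101")[0])  # ACT ACT
```

Changing the length of the mask changes the length of the seed, so the same sketch also hints at how seeds of different lengths could be tried on different parts of a sequence.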

In another example of X-Mapper’s advantage over existing tools, the team demonstrated that when used to align bacterial DNA sequences, X-Mapper achieved a remarkably low false positive rate of less than 2%. In contrast, current algorithms produced a much higher false positive rate of 53%, where sequences were aligned to the wrong bacterial species.

With its vastly improved accuracy and consistency, X-Mapper would ensure better results and more efficient workflow processes in biological and clinical studies that use DNA sequences. This will help researchers work faster while producing higher-quality results.

The team plans to apply X-Mapper to more computational biology software, including long-read sequence alignment and large language models. They also look forward to using their new algorithm in the fight against cancer, particularly in research that analyses cancer genomes to find ideal mutation targets for cancer treatment and cancer vaccine development.

If you think this research could contribute to your work, please contact the team at anni.zhang@ntu.edu.sg.