Probe generation
Probes, also referred to as ‘masks’ in the LexicHash paper, consist of a fixed number (m = 20,000 by default) of k-mers (k ≤ 32, k = 31 by default), which capture DNA sequences by prefix matching. A probe consists of two parts—the p-bp prefix and the remaining bases. To enable probes to match all possible reference and query sequences by prefix matching, all permutations of p-mers are generated as the base prefix set. The length of the prefix (p) is calculated to ensure that the total number of possible prefixes (4p) does not exceed the total number of probes (m). For instance, when m = 20,000, p = 7. Then, the base prefixes are duplicated to reach the number of probes m. Next, the suffixes of probes are randomly generated. The p′-bp (p′ = p + 1) prefixes are required to be distinct to enable fast locating of the probe by the prefix of a k-mer through an array data structure. Lastly, these m k-mers serve as probes that can match all potential sequence regions in both reference genomes and query sequences.
Seed computation
The k-mers are encoded as 64-bit unsigned integers, with a binary coding scheme of A = 00, C = 01, G = 10 and T = 11. Following the original implementation of LexicHash, the hash value between a probe and a k-mer is computed using a bitwise XOR operation. The value of this hash is that smaller hash values indicate longer common prefixes between the probe and the k-mer, whereas a hash value of zero means the probe and the k-mer are identical.
For each reference genome or query sequence, all k-mers from both forward and backward strands are compared with probes that share the same p-bp prefix. Each probe retains only the k-mer with the minimum hash value, which might in principle be located at more than one position. The k-mers of low-complexity are discarded by DUST algorithm37 with a loose score threshold of 50. A 64-bit integer encodes each position’s information, including the genome batch index (17 bits, as described below), genome index (17 bits), position (28 bits), strand (1 bit) and seed direction flag (1 bit, as described below). Ultimately, each probe captures one k-mer (as a seed) across the entire reference genome or query sequence, which shares the longest prefix with the probe.
However, because of the random distribution of captured k-mers (seeds), the distances between these seeds vary and there is no guarantee of consistent distance between successive seeds. Consequently, some regions of the sequence might remain uncovered by seeds, leading to what are known as ‘seed deserts’. These deserts are problematic because they can cause sequences homologous to the regions to fail to align. To address this issue, regions identified as seed deserts, which are longer than a certain threshold (100 bp by default), are extended by 1 kb both upstream and downstream. A second round of seed capture is then performed in these extended regions and new seeds are spaced about x bp (50 by default) apart within the region are added to the index of the corresponding probes. After filling these seed deserts, the total number of seeds may exceed the initial value of m.
Indexing
The input of LexicMap is a list of microbial genomes, with the sequences of each genome stored in separate FASTA files. These files can be in plain or compressed formats such as gzip, xz, zstd or bzip2. Each file must have a distinct genome identifier in the file name. To limit memory consumption, genomes are indexed in batches and all batches are merged at the end (Supplementary Fig. 5a).
In each batch, genomes with any sequence larger than a genome size threshold (15 Mb by default), such as nonisolate assemblies, are skipped. On the other hand, if only the total length of sequences exceeds the threshold, the genomes are split into multiple chunks and alignments from these chunks will be merged in the searching step. Additionally, unwanted sequences within genomes, such as plasmids, can be optionally discarded using regular expressions to match sequence names. Multiple sequences (contigs or scaffolds) of a genome are concatenated with 1-kb intervals of Ns to reduce the scale of sequences to be indexed. The original coordinates will be restored after sequence alignment. The complete genome or the concatenated contigs are then used to compute seeds with generated probes, as described above. Before this, any degenerated bases are converted to their corresponding alphabet-first bases (for example, N is converted to A). The genome sequence is then saved in a bit-packed format (2 bits per base), along with associated genome information (genome ID, size and sequence IDs and lengths of all contigs) to enable fast random subsequence extraction. Simultaneously, seeds and their positional information are appended to their corresponding probes and saved as seed files. After processing all genomes in the batch, these seed files are merged using an external sorting method once all batches have been indexed.
Seed data storage and variable-length seed matching
After indexing with all n reference genomes, each probe captures n or more seeds (k-mers, encoded as 64-bit integers), with each seed potentially having one or more positions (also represented as 64-bit integers), and there are m probes in total (20,000 by default). To scale to millions of prokaryotic genomes, the storage of seed data needs to be both compact and efficient for querying. Because the seeds of different probes are independent, the seed data are saved into c chunk files to enable parallel querying (Supplementary Fig. 5a). In each chunk file, the seed data of approximately m/c probes are simply concatenated. For each probe, all seeds are sorted in alphabetical order and the varint-GB38 algorithm is used to compress every two seeds along with their associated position counts. Because seeds captured by the same probe share common prefixes, the differences between two successive seeds are small, as are the position counts. As a result, two values can be stored using as little as 3 bytes instead of 16.
Unlike the approach in the LexicHash paper, where captured k-mers from each probe are stored in a prefix tree in main memory, here, they are alphabetically sorted and saved in a list-like structure within files. To enable efficient variable-length prefix matching of seeds, an index is created for each seed data file (Supplementary Fig. 5a,b). This index functions similarly to a table of contents in a dictionary, storing a list of marker k-mers along with their offsets (pages) in the seed data file. Each marker k-mer is the first one with a specific p″-bp subsequence (p″ = 6 by default) following the p-bp prefix (all seeds of a probe share the same p-bp prefix). For a query sequence, one k-mer is captured by each probe and searched within the corresponding probe’s seed data to return seeds that share a minimum length of prefix with the query k-mer. For example, searching with CATGCT for seeds (with p = 2, p″ = 1) that have at least 4 bp of common prefixes is equivalent to finding seeds in the range of CATGAA to CATGTT. The process starts by extracting the p″-bp subsequence from CATGAA (in this case, T) to locate the marker k-mer (for example, CATCAC) with the same p″-bp subsequence in the same region. The offset information (page) of this marker k-mer is then used as the starting point for scanning seeds within the k-mer range.
The index structure described above is extended to support the suffix matching of seeds. During the indexing phase, after a seed k-mer is saved into the seed data of its corresponding probe, the k-mer is reversed and added to the seed data of the probe that shares the longest prefix with the reversed k-mer. Additionally, the last bit of each position data is used as a seed direction flag, indicating that the seed k-mer is reversed. As a result, all seeds are doubled; there are ‘forward seeds’ for prefix matching and ‘reversed seeds’ for suffix matching, which is achieved through prefix matching of the reversed seeds. In the seed matching process, two rounds of matching are performed. The first round involves prefix matching (as described above) in the forward seed data. In the second round, the query k-mer is reversed and searched in the reversed seed data of the probe that shares the longest prefix with the reversed k-mer.
Chaining
After searching all captured k-mers from a query in the seed data of all probes, seed pairs (anchors) with different matched prefix and suffix lengths l are returned. These anchors indicate matched regions between the query sequences [y, y + l − 1] on the strand \({s}_{y}\) and reference genome [x, x + l − 1] on strand \({s}_{x}\). First, the anchors are grouped by the genome IDs. Within each group, the anchors are sorted by the following criteria: (1) ascending order of start positions in the query sequence; (2) descending order of end positions in the query sequence; and (3) ascending order of start positions in the target genome. Only one anchor is kept for duplicated anchors and any inner anchors that are nested within other anchors are removed. Then, overlapped anchors with no gaps are merged to a longer one, whereas, for these with gaps, only the nonoverlapped part of the second anchor is used to compute the weight, as described below. Next, a chaining function modified from Minimap2 is applied to chain all possible colinear anchors:
$$f\left(i\right)=\max\left\{\mathop{\max}\limits_{\begin{array}{c}i> j\ge 1\\{x}_{i}-G < {x}_{j}\le{x}_{i}\\{y}_{i}-G <{y}_{j}\le {y}_{i}\\ {d}_{{ji}}\le D\end{array}}\left\{\;f\left(j\right)+{w}_{i}-\gamma \left({g}_{{ji}}\right)\right\},{w}_{i}\right\}$$
(1)
Here, \(f\left(i\right)\) and \(f\left(j\right)\) are scores for anchors \({a}_{i}\) and \({a}_{j}\), respectively. The anchor weight \({w}_{i}=0.1\times {l}_{i}^{2}\), where \({l}_{i}\) is the length of the anchor \({a}_{i}\) and \(P\le {l}_{i}\le k\), with P being the minimum anchor length (15 by default). The seed distance \({d}_{{ji}}=\max \left\{\left|{y}_{i}-{y}_{j}\right|,\,\left|{x}_{i}-{x}_{j}\right|\right\}\) and D is the maximum seed distance (1,000 by default). The seed gap \({g}_{{ji}}=\left|\left|{\;y}_{i}-{y}_{j}\right|-\left|{x}_{i}-{x}_{j}\right|\right|\) and G is the maximum gap (50 by default). The gap penalty is calculated as \(\gamma \left(g\right)=0.1\times g+0.5\,{\log }_{2}(g)\). Additionally, a filter is applied to avoid a nonmonotonic increase or decrease in coordinates. For cases where there is only one anchor, the length of the anchor must be no less than a threshold P′ (17 by default).
Alignment
In contrast to the case of minimizers or closed syncmers39, seeds in LexicMap do not have a small window guarantee. Hence, a pseudoalignment algorithm is further used to find similar regions from the chained region, extended by 1 kb in the upstream and downstream directions. The k-mers (k = 31) on both strands of the query sequence are stored in a prefix tree and k-mers of a target region are queried in the prefix tree to return matches of \({{l}_{i}}^{{\prime} }\,\) (\({11\le {l}_{i}}^{{\prime} }\le 31\)) in the extended chained region. After removing duplicated anchors as above, the anchors are chained with the score function:
$$f{\,\prime} \left(i\right)=\max\left\{\mathop{\max}\limits_{\begin{array}{c}i> j\ge 1\\ {x}_{i}-G{\prime}<{x}_{j}\le {x}_{i}\\{y}_{i}-G{\prime}< {y}_{j}\le {y}_{i}\end{array}}\left\{\;f{\;\prime}\left(j\right)+{l{\prime}}_{i}-{g{\prime} }_{{ji}}\right\},{l{\prime} }_{i}\right\}$$
(2)
where \(G{\prime}\) is the maximum gap (20 by default) and seed distance \({g{\prime} }_{{ji}}=\left|\left|{y}_{i}-{y}_{j}\right|-\left|{x}_{i}-{x}_{j}\right|\right|\). The chaining is banded (100 bp or 50 anchors by default). Furthermore, chained regions across the interval regions between contigs are split. Next, regions are further extended in both ends, using a similar pseudoalignment algorithm based on 2-mer matches.
Similar regions between the reference and query sequences are further aligned with a reimplemented wavefront alignment algorithm (https://github.com/shenwei356/wfa; version 0.4.0), with a gap-affine penalty (match: 0, mismatch: 4, gap opening: 6 and gap extension: 2) and the adaptive reduction heuristic (minimum wavefront length: 10 and maximum distance: 50). Each alignment’s bit score and expect value (e value) are computed with Karlin–Altschul parameter40 values for substitution scores of 2 and −3 (gap opening cost: 5, gap extension cost: 2, lambda: 0.625 and K: 0.41) from BLASTn’s source code.
Tools, reference genome datasets and query sequences
LexicMap version 0.7.0, BLAST++ version 2.15.0, MMseqs2 version 16.747c6, Minimap2 version 2.28-r1209, Ropebwt3 version 3.9-r259, COBS (iqbal-lab-org fork version 0.3.0; https://github.com/iqbal-lab-org/cobs) and Phylign (AllTheBacteria fork https://github.com/AllTheBacteria/Phylign; commit 9fc65e6) were used for testing. The GTDB r214 complete dataset with 402,538 prokaryotic assemblies was downloaded with genome_updater (https://github.com/pirovc/genome_updater; version 0.6.3). The GenBank + RefSeq dataset with 2,340,672 prokaryotic assemblies was downloaded with genome_updater on February 15, 2024. The AllTheBacteria version 0.2 high-quality genome dataset with 1,858,610 bacteria genomes and the Phylign index were downloaded from GitHub (https://github.com/AllTheBacteria/AllTheBacteria; archived at https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/indexes/phylign/). Four query datasets were used for evaluating alignment accuracy and performance: a preprotein translocase subunit secY gene CDS sequence from an E. faecalis strain (NZ_CP064374.1_cds_WP_002359350.1_906), a 16S rRNA gene rrsB from E. coli str. K-12 substr. MG1655 (NC_000913.3: 4166659–4168200), a plasmid from a Serratia nevei strain (CP115019.1) and 1,033 AMR genes from the Bacterial Antimicrobial Resistance Reference Gene Database (PRJNA313047). All files were stored on a network-attached storage server equipped with HDD disks.
Simulating mutations
Ten bacterial genomes of common species (more details in Supplementary Table 5) with genome sizes ranging from 2.1 to 6.3 Mb were used for simulating queries with Badread version 0.4.1 and SeqKit41 version 2.8.2. The command ‘badread simulate –seed 1 –reference $ref –quantity 30x –length $qlen,0 –identity $ident,$ident,0 –error_model random –qscore_model ideal –glitches 0,0,0 –junk_reads 0 –random_reads 0 –chimeras 0 –start_adapter_seq ‘’ –end_adapter_seq ‘’ | seqkit seq -g -m $qlen | seqkit grep -i -s -v -p NNNNNNNNNNNNNNNNNNNN -o $ref.i$ident.q$qlen.fastq.gz’ simulated reads with a genome coverage of 30×, a mean length of 250, 500, 1,000 or 2,000 bp and mean percentage identities ranging from 80% to 100% with SNPs and indels. LexicMap, BLASTn, MMseqs2, Minimap2, Ropebwt3 and COBS were used to search simulated reads against the reference genomes. Each tool built an index for each genome and searched or aligned reads with the corresponding index. LexicMap built indices and aligned with default parameters (m = 20,000, k = 31, P = 15, P′ = 17). BLASTn built indices with default parameters and aligned reads with the default task mode ‘megablast’ using a word size of 28 (default) or 15 and the out format 6 was set to output tabular results. MMseqs2 built indices and aligned reads with the nucleotide mode and the format mode 4 was set to output tabular results. Minimap2 aligned reads with the ‘map-ont’ mode. Ropebwt3 built indices (including .fmd, .ssa and .fmd.len.gz files) with default parameters and aligned with the local alignment mode (‘ropebwt3 sw’). COBS built compact indices with k = 31 and a false-positive rate of 0.3 and searched reads with a k-mer coverage threshold of 0.33. For all reads, the source location was tracked. Alignment rate was determined by counting the proportion of reads that had an alignment position overlapping the source location (except for COBS, where it was defined as the proportion of reads detected as being present in the genome). Unless otherwise specified, all tests were performed in cluster nodes running with Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40 GHz with RAM of 500 or 2,000 GB.
Scalability testing
Seven genome sets with 1, 10, 100, 1,000, 10,000, 100,000 and 1,000,000 nonoverlapping prokaryotic genomes randomly chosen from the GenBank + RefSeq dataset were used for the scalability test. The queries consisted of the secY and a 16S rRNA rrsB gene sequences mentioned above. LexicMap, BLASTn, MMseqs2, Minimap2, Ropebwt3 and COBS built an index for each genome set and searched or aligned the queries against the corresponding index, using the options in the previous tests. LexicMap returned all possible matches by default, while other tools were explicitly set to return all matches. All tools used 48 threads. A Python script (https://github.com/shenwei356/memusg) was used to record the time and peak memory usage. Sequence alignment and searching were repeated four times on different cluster nodes over separate weeks and the average time and memory consumption were used for plotting.
Benchmarking
For indexing, LexicMap built indices with the genome batch size 5,000 (default) for GTDB and 25,000 for the AllTheBacteria dataset. BLASTn, MMseqs2 and Minimap2 build indices with parameters as mentioned above. The Phylign index was built with default parameters, including k = 31 and a false-positive rate of 0.3 for the COBS index; because the index building involved three workflows with multiple steps in multiple cluster nodes, the memory and time could not be measured accurately.
The four query datasets were used for sequence alignment. LexicMap returned all possible matches by default and other tools were set to return all possible matches according to the sequence number in a database. All tools used 48 threads. The main parameters of Phylign included threads = 48, cobs_kmer_thres = 0.33, minimap_preset = ‘asm20’, nb_best_hits = 5,000,000 and max_ram_gb = 100. For the cluster mode, the maximum number of Slurm jobs was set to 100.
The sequence alignment results from the four tools were divided into three categories according to query coverage and percentage identity and the genome numbers of each category were counted for comparison. The metrics of Minimap2 and Phylign were computed by sam2tsv.py (https://gist.github.com/apcamargo/2b7ca3032c1e80333adc1e54f47a0966). Alignments of high similarity were those with a query coverage of ≥90% (genes) or 70% (plasmids) and percentage identity of ≥90%. Alignments of low similarity are those with a query coverage of <50% (genes) or 30% (plasmid) or percentage identity of <80%. The remaining alignments were marked as medium similarity.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Source link