Reduce large DNA datasets without sacrificing information.
It's better to read the PDF version; GitHub isn't rendering some of the LaTeX properly.
For studies with DNA datasets, big data is good but computationally expensive. This method aims to reduce the size of the dataset without losing the benefits. It does this by improving the nucleotide diversity of the sample.
An example is provided which uses the SARS-CoV-2 virus (responsible for the 2019 pandemic, with >9M sequences on GenBank) to show how to implement this method and how well it performs.
The provided code is in R and the example uses some command line tools (datasets, SeqKit, EMBOSS, MAFFT), but the method can be implemented in any language.
According to the central limit theorem, larger and larger datasets approximate the normal distribution better, and biological data such as DNA sequences are no different. That is to say, the majority of the sequences share similar properties, while a minority have dissimilar properties found at the extreme ends of the distribution. However, given that the population of different species can range from a few hundred to several trillion, it becomes difficult to study the entire distribution. Hence the focus of big data in bioinformatics is to "capture" the extreme ends of the distribution by sequencing more and more samples. With over 3.7 billion nucleotide sequences hosted by GenBank in 2024, big data has facilitated the development of numerous fields, such as evolutionary biology, molecular biology, metagenomics, medicine, and forensic investigations. However, it has several caveats that limit its use by the average researcher.
As big data becomes bigger, so do the computational power and storage requirements, which the average researcher may not have access to, and this restricts their contributions to their fields. Hence a major focus of research is on making work with big data accessible, whether by developing better and cheaper hardware or by developing software with better time complexity. However, a point of focus that often goes unseen is the data itself: reducing the dataset in such a manner that data of significance (i.e. corresponding to the extreme ends of the normal distribution) is not lost.
While there are some methods available for this purpose, perhaps the simplest is to improve the nucleotide diversity of the sample. In addition to making the analysis computationally feasible, improving the diversity can change the corresponding distribution from normal to uniform. In a normal distribution, there is a chance of data of significance resembling, and being regarded as, random noise; this chance is significantly reduced in a uniform distribution.
Overview: For a set of sequences, determine a distance matrix and then determine the nucleotide diversity of those sequences whose distance is greater than some proportion of the maximum distance. The unaligned sequences corresponding to the set of sequences with the maximum diversity comprise the reduced dataset.
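The overview can be sketched in a few lines of Python (the repository's own code is in R; this is only an illustration, not the provided implementation). Several things here are assumptions: the distance is taken to be the proportion of differing sites between aligned sequences, a sequence is kept for a given proportion `p` if it takes part in at least one pair whose distance exceeds `p * max(D)`, and the function names and the candidate `proportions` are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(seqs):
    """Symmetric matrix of pairwise p-distances (assumed distance measure)."""
    n = len(seqs)
    D = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        D[i][j] = D[j][i] = p_distance(seqs[i], seqs[j])
    return D


def diversity(D, idx):
    """Mean pairwise distance among the sequences in idx (0.0 if fewer than two)."""
    pairs = list(combinations(idx, 2))
    if not pairs:
        return 0.0
    return sum(D[i][j] for i, j in pairs) / len(pairs)


def reduce_dataset(seqs, proportions=(0.0, 0.25, 0.5, 0.75)):
    """Return (indices, diversity) of the candidate subset with the highest diversity.

    Assumption: for each proportion p, a sequence is kept if at least one of
    its pairwise distances exceeds p * max(D).
    """
    D = distance_matrix(seqs)
    n = len(seqs)
    m = max(max(row) for row in D)
    # Start from the full dataset so the result is never less diverse than the input.
    best_idx, best_pi = list(range(n)), diversity(D, range(n))
    for p in proportions:
        keep = [i for i in range(n)
                if any(D[i][j] > p * m for j in range(n) if j != i)]
        pi = diversity(D, keep)
        if pi > best_pi:
            best_idx, best_pi = keep, pi
    return best_idx, best_pi


# Toy aligned sequences; the returned indices map back to the unaligned originals.
seqs = ["AAAA", "AAAA", "AAAT", "TTTT"]
kept, pi = reduce_dataset(seqs)
```

For these four toy sequences, the proportion 0.75 drops the third sequence, which raises the mean pairwise distance from about 0.54 to about 0.67. In practice the distance matrix would come from an alignment tool rather than this character-by-character comparison.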
Let
Using the maximum distance ($m = \max(D)$) and
Then,
It is recommended to cluster the sequences first and then improve the diversity within each cluster. Sequences can be clustered using existing classifications (e.g. Pangolin for SARS-CoV-2) or clustering software such as MeShClust, MMseqs, etc. Since both distance matrix and nucleotide diversity computations are
It is also recommended to focus on shorter sequences rather than longer ones, e.g. individual genes rather than the whole genome. This is because it can reduce
See "Example/ExamplePipeline.Rmd" for details.
The original dataset comprised 183,780 sequences of length ~4 kb. The data was divided into 1,849 clusters. The improvements are shown in Table 1 and Figure 1, while Figure 2 shows the changes in the distributions of nucleotide diversity and number of sequences.
| Property | Original Dataset | Reduced Dataset | Change (%) |
|---|---|---|---|
| mean nucleotide diversity | 0.0009177 | 0.0074139 | 708 |
| min nucleotide diversity | 0.0000000 | 0.0000000 | 0 |
| max nucleotide diversity | 0.1621207 | 0.6161369 | 280 |
| ------------------------------ | ------------------------ | ------------------------ | ------------------------ |
| mean seqs per cluster | 99 | 3 | 3200 |
| min seqs per cluster | 3 | 2 | 50 |
| max seqs per cluster | 4,749 | 34 | 13,868 |
| ------------------------------ | ------------------------ | ------------------------ | ------------------------ |
| Total number of seqs | 183,780 | 6,673 | 2,654 |
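The Change (%) values in Table 1 appear consistent with two different conventions: a relative increase over the original for the diversity rows, and a relative reduction expressed against the reduced dataset for the sequence-count rows. This reading is inferred from the numbers, not stated in the source, and the function names below are hypothetical. A quick Python check:

```python
def pct_increase(original, reduced):
    """Relative increase over the original value, in percent."""
    return (reduced - original) / original * 100


def pct_reduction(original, reduced):
    """Relative reduction expressed against the reduced value, in percent."""
    return (original - reduced) / reduced * 100


# Diversity rows (mean and max) from Table 1.
mean_diversity_change = round(pct_increase(0.0009177, 0.0074139))
max_diversity_change = round(pct_increase(0.1621207, 0.6161369))

# Sequence-count rows from Table 1.
total_seqs_change = round(pct_reduction(183780, 6673))
max_cluster_change = round(pct_reduction(4749, 34))
```

Under this reading, the total of 2,654% corresponds to the original dataset being roughly 27.5 times larger than the reduced one.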
Consider a set of aligned sequences
Nei's nucleotide diversity is calculated by determining the number of nucleotide differences for all pairs of sequences and normalizing the average difference by the sequence length. Since it performs
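As a minimal sketch of the calculation described above (in Python, whereas the repository's code is in R), the function below counts the differences for every pair of aligned sequences, averages them over the pairs, and divides by the sequence length. The function name is hypothetical, and treating a gap character as just another symbol is an assumption, not something the source specifies.

```python
from itertools import combinations


def nucleotide_diversity(aligned):
    """Average pairwise differences per site for aligned, equal-length sequences."""
    n = len(aligned)
    if n < 2:
        return 0.0
    length = len(aligned[0])
    # Total number of differing sites summed over all n * (n - 1) / 2 pairs.
    total_diff = sum(
        sum(a != b for a, b in zip(s, t))
        for s, t in combinations(aligned, 2)
    )
    n_pairs = n * (n - 1) // 2
    return total_diff / n_pairs / length


# Three toy sequences of length 4: pairwise differences are 1, 4 and 3.
example_pi = nucleotide_diversity(["AAAA", "AAAT", "TTTT"])
```

Because every pair of sequences is compared, the number of comparisons grows quadratically with the number of sequences, which is why the calculation becomes expensive for large clusters.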
The ideal method to maximize
Our approach relies on constructing a distance matrix and then determining
Although it's not amazing, our method is much faster than the ideal approach. Figure 3 shows the comparison between the two methods as well as the actual data from the Example. Observe that in the ideal approach, after around
In the example data, the clusters with fewer than 4k sequences seem to follow the curve predicted for the proposed method, while the remaining clusters do not, and it is not clear why. It might be due to some threshold effect that causes a function to switch algorithms depending on the data size, but we were unable to find such an effect. Regardless, the example data suggests that the proposed method is more efficient than predicted.


