Ryu's algorithm relies on the variable-order de Bruijn graph (voDBG) to generate the assembly.
A voDBG contains all possible de Bruijn graphs (DBGs) we can generate from the reads. This property means that, unlike the fixed-order DBG, the node labels are of arbitrary length (limited only by the length of the longest read).
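To make the idea of variable-length node labels concrete, here is a toy C++ sketch. It assumes that the frequency of a node is the number of times its label occurs in the reads, and it counts occurrences with a naive scan. The function name `label_frequency` is hypothetical and is not part of Ryu, which would not scan the reads this way in practice.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy illustration only: count the occurrences of `label` across all reads,
// including overlapping ones. A label of length k plays the role of a node
// in the order-k DBG; letting k vary gives the variable-order view.
std::size_t label_frequency(const std::vector<std::string>& reads,
                            const std::string& label) {
    assert(!label.empty() && "a node label must contain at least one symbol");
    std::size_t count = 0;
    for (const std::string& read : reads) {
        std::size_t pos = read.find(label);
        while (pos != std::string::npos) {
            ++count;
            pos = read.find(label, pos + 1);  // step by one to allow overlaps
        }
    }
    return count;
}
```

With the reads `ACGTAC` and `CGTACG`, the label `AC` occurs 3 times, its right extension `ACG` occurs 2 times, and `ACGT` occurs once. Dropping the leftmost symbol of `ACGT` gives `CGT`, which occurs 2 times. In this toy model, extending a label can only keep or lower its frequency, and contracting it can only keep or raise it.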
Ryu requires two parameters,
The general idea of the assembly algorithm is as follows: we pick an arbitrary voDBG node whose frequency lies between
When we reach a point where we cannot extend the node further, we start to left-contract it while its frequency remains below
We can reduce the number of misassemblies if we set
We refer to our assembler as Ryu (龍), the Japanese word for dragon. Finnish speakers may see the reason for this name: it combines the "lo-hi" thresholds that drive the assembly with the "snake"-like sequence of right extensions and left contractions that the algorithm performs through the voDBG.
- CLI11: Command line parser for C++
- kseq: FASTQ parser
- grlBWT: BWT construction library
- xxHash: Hashing library
- CMake 3.10 or greater
- SDSL-lite library
- C++17
- GCC or Clang
So far, we have tested the software on Linux/macOS with GCC 13.2.0 and Clang 17.0.0. In principle, any compiler that supports C++17 should work.
First, clone this repository, enter the resulting folder, and type:
mkdir build && cd build
Compile the source code:
cmake .. && make
Ryu will automatically search for the SDSL-lite installation folder in the default paths (/home/your_usr or /usr).
If the library is not there, pass its path on the command line:
cmake .. -DSDSL_LIBRARY_PATH=/your/path/to/the/SDSL
For better cross-compatibility, you may want to disable architecture-specific optimizations.
You can do this by modifying the CMakeLists.txt file or setting:
cmake .. -DARCH_OPTIM=OFF
The first step is to preprocess the reads:
ryu index input_reads.fastq -o input_reads.idx
Ryu will extract the DNA sequences from the input FASTQ/FASTA file and store them in compressed form in input_reads.idx. It will also perform homopolymer compression on the reads (also commonly known as run-length encoding).
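As an illustration of what homopolymer compression does to a read, here is a minimal C++ sketch. The names `HpcRead`, `hp_compress`, and `hp_expand` are hypothetical and are not taken from Ryu's source code. Whether Ryu keeps the run lengths is also an assumption made here so that the example can show the round trip.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical container: one symbol per homopolymer run, plus the run lengths.
struct HpcRead {
    std::string bases;                // e.g. "ACGT"
    std::vector<std::uint32_t> runs;  // e.g. {3, 2, 1, 4}
};

// Collapse every run of identical symbols into a single symbol,
// remembering how long each run was.
HpcRead hp_compress(const std::string& read) {
    HpcRead out;
    for (char c : read) {
        if (!out.bases.empty() && out.bases.back() == c) {
            ++out.runs.back();       // same symbol: extend the current run
        } else {
            out.bases.push_back(c);  // new symbol: start a new run
            out.runs.push_back(1);
        }
    }
    return out;
}

// Invert hp_compress by repeating each symbol according to its run length.
std::string hp_expand(const HpcRead& hpc) {
    assert(hpc.bases.size() == hpc.runs.size());
    std::string out;
    for (std::size_t i = 0; i < hpc.bases.size(); ++i) {
        out.append(hpc.runs[i], hpc.bases[i]);
    }
    return out;
}
```

For the read `AAACCGTTTT`, `hp_compress` yields the bases `ACGT` with run lengths 3, 2, 1, 4, and `hp_expand` restores the original read.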
The assembly requires the range
One way is to let the software compute it from the genome size, sequencing error rate, and reconstruction confidence. For instance, suppose the reads have an error rate of 0.01, the genome is 200 Mb, and we want 99% confidence in the contig reconstruction.
Then the command is:
ryu assemble -o assembled_lhtig -e 0.01 -q 0.99 -g 200mb -i input_reads.idx
The other way is to specify the range directly:
ryu assemble -o assembled_lhtig -f 12 20 -i input_reads.idx
Where
Note: the options -e, -c, and -g cannot be combined with -f. Use either that set of options or -f, not both.
The script test/pipeline.sh runs a small test: it downloads a small subsample of E. coli reads, indexes it, and assembles it.
To run the test, install Ryu (as described in Installation) and make the script executable:
cd test; chmod +x pipeline.sh
Then run the test:
./pipeline.sh
Ryu is not a complete genome assembler, as it does not perform several steps of a typical assembly pipeline:
- Read correction
- Graph trimming
- Scaffolding
This means that, for complex organisms, assembly contiguity may not match that of established long-read assemblers that use OLC (Overlap-Layout-Consensus) graphs. However, Ryu’s accuracy should be comparable to, or even better than, these tools. Additionally, Ryu offers significantly lower running time and memory usage.
We aim to make Ryu a more comprehensive tool that can avoid at least the first two steps in future releases.
Ryu is also not designed for:
- Polyploid genomes
- Nanopore reads
- Older PacBio reads
Although Ryu can technically process these data types, the resulting assemblies will likely be more fragmented. We are working on updating our assembly model to better support polyploid genomes.
If you use Ryu in your research, please cite the paper in the CITATION file in the root directory.
Please open an issue to report problems or bugs.