Ryu: A genome assembler based on variable-order de Bruijn graphs

Ryu's algorithm relies on the variable-order de Bruijn graph (voDBG) to generate the assembly.

A voDBG contains all possible de Bruijn graphs (DBGs) we can generate from the reads. This property means that, unlike the fixed-order DBG, the node labels are of arbitrary length (limited only by the length of the longest read).
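To make the "all orders at once" idea concrete, the following minimal Python sketch enumerates the node labels of the fixed-order DBG for every order $k$ over a toy read set. The function name `dbg_nodes` and the reads are hypothetical, and treating a label's frequency as its number of occurrences in the reads is an assumption. This brute-force enumeration is only an illustration, not the data structure Ryu uses.

```python
from collections import Counter

def dbg_nodes(reads, k):
    """Count every length-k substring (a node label of the order-k DBG)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

reads = ["ACGTAC", "CGTACG", "GTACGT"]  # toy reads, made up for illustration
max_order = max(len(r) for r in reads)

# One node table per order: together they cover every fixed-order DBG
# that can be built from these reads.
all_orders = {k: dbg_nodes(reads, k) for k in range(1, max_order + 1)}

for k, nodes in all_orders.items():
    print(k, dict(nodes))
```

A variable-order graph lets a traversal move between these tables, lengthening or shortening the current label, instead of committing to a single $k$ up front.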

Ryu requires two parameters, $\ell$ and $h$, that indicate the minimum and maximum frequencies (respectively) a node in the assembly can have. In this context, belonging to the assembly means that the node's label appears as a substring in at least one contig.

The general idea of the assembly algorithm is as follows: we pick an arbitrary voDBG node whose frequency lies between $\ell$ and $h$, and continuously right-extend its context as long as its frequency remains above $\ell$.

When we reach a point where we cannot extend the node further, we start to left-contract it while its frequency remains below $h$. Once we stop the left contractions, we begin to right-extend again. We iterate over this process until we reach a node we have already visited, or no further extensions or contractions exist. We collapse the graph labels we visit during the traversal into a single sequence, which later becomes a contig in the assembly.
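The traversal described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Ryu's implementation: a node is represented by its label, its frequency is taken to be the number of occurrences of the label in the reads (counted by brute force), and the names `count`, `right_extension`, and `traverse` are hypothetical.

```python
def count(reads, label):
    """Frequency of a label: its number of occurrences across all reads (assumed definition)."""
    return sum(
        1
        for r in reads
        for i in range(len(r) - len(label) + 1)
        if r.startswith(label, i)
    )

def right_extension(reads, label, lo):
    """Return a one-symbol right extension with frequency >= lo, or None."""
    for c in "ACGT":
        if count(reads, label + c) >= lo:
            return label + c
    return None

def traverse(reads, seed, lo, hi):
    """Alternate right extensions and left contractions, spelling one contig."""
    label, contig, visited = seed, seed, {seed}
    while True:
        progressed = False
        # Right-extend while some extension keeps its frequency >= lo.
        while (nxt := right_extension(reads, label, lo)) is not None:
            if nxt in visited:
                return contig
            visited.add(nxt)
            contig += nxt[-1]
            label = nxt
            progressed = True
        # Left-contract while the shorter label keeps its frequency <= hi.
        while len(label) > 1 and count(reads, label[1:]) <= hi:
            label = label[1:]
            if label in visited:
                return contig
            visited.add(label)
            progressed = True
        if not progressed:  # no extension and no contraction left
            return contig

# Toy reads: the four length-6 windows of the made-up sequence GATTACAGC.
reads = ["GATTAC", "ATTACA", "TTACAG", "TACAGC"]
print(traverse(reads, "TTA", lo=2, hi=3))  # TTACAG
```

In this run the label grows from `TTA` to `TTACA`, contracts to `CA`, and extends once more to `CAG` before no move remains. On such a tiny input the contraction can shrink the label to a single symbol; with realistic coverage, short labels are far more frequent than $h$, which stops the contraction much earlier.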

We can reduce the number of misassemblies if we set $h/2< \ell <h$, which implies that, for every voDBG node, there is only one right extension that contributes to more than 50% of its frequency.
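A short argument for this implication, writing $f(v)$ for the frequency of a node $v$ (the notation $f$ is introduced here for illustration, and we assume that frequency counts occurrences in the reads). Every occurrence of a right extension $va$ is also an occurrence of $v$, and two different symbols $a \neq b$ extend different occurrences of $v$. For a node with $f(v) \le h$ this gives

$$f(va) + f(vb) \le f(v) \le h < 2\ell.$$

Hence $va$ and $vb$ cannot both have frequency at least $\ell$, and an extension with $f(va) \ge \ell$ satisfies $f(va) > h/2 \ge f(v)/2$.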

Why Ryu?

We refer to our assembler as Ryu (龍), the Japanese word for dragon. Finnish speakers may see the reason for this name: the Finnish word for dragon, lohikäärme, combines the "lo-hi" thresholds that drive the assembly with käärme ("snake"), which evokes the snake-like sequence of right extensions and left contractions that the algorithm follows through the voDBG.

External tools that are already included in this project

CLI11: Command line parser for C++
kseq: FASTQ parser
grlBWT: BWT construction library
xxHash: Hashing library

Prerequisites

CMake 3.10 or greater
SDSL-lite library
C++17
GCC or Clang

So far, we have tested the software on Linux/macOS with GCC 13.2.0 and Clang 17.0.0. In principle, any compiler that supports C++17 should work.

Installation

First, clone this repository, enter the resulting folder, and type:

mkdir build && cd build

Compile the source code:

cmake .. && make

Ryu will automatically search for the SDSL library installation folder in one of the default paths (/home/your_usr or /usr).

If the library is not there, pass its path on the command line:

cmake .. -DSDSL_LIBRARY_PATH=/your/path/to/the/SDSL

For better cross-compatibility, you may want to disable architecture-specific optimizations.

You can do this by modifying the CMakeLists.txt file or setting:

cmake .. -DARCH_OPTIM=OFF

Steps for assembling a genome

1. Preprocessing the reads

The first step is to preprocess the reads:

ryu index input_reads.fastq -o input_reads.idx

Ryu will extract the DNA sequences from the input FASTQ/FASTA file and compress them in input_reads.idx. It will also perform homopolymer compression on the reads (also commonly known as run-length encoding).
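As an illustration of what homopolymer compression does to a single read, here is a minimal Python sketch. The function name `homopolymer_compress` is hypothetical, and the sketch simply discards run lengths; whether and how Ryu keeps them in its index is not described here.

```python
from itertools import groupby

def homopolymer_compress(seq):
    """Collapse each run of identical symbols into a single symbol."""
    return "".join(symbol for symbol, _run in groupby(seq))

print(homopolymer_compress("AAACCGTTTTA"))  # ACGTA
```

After this transformation, no two adjacent symbols in a read are equal.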

2. Perform the assembly

The assembly requires the range $[\ell, h]$, which we can specify in two ways.

The first is to let the software compute it from the genome size, sequencing error rate, and reconstruction confidence. For instance, suppose the reads have an error rate of 0.01, the genome is 200 Mb, and we want 99% confidence in the contig reconstruction.

Then the command should be:

ryu assemble -o assembled_lhtig -e 0.01 -q 0.99 -g 200mb -i input_reads.idx

The other way is to specify the range directly:

ryu assemble -o assembled_lhtig -f 12 20 -i input_reads.idx

Where $\ell=12$ and $h=20$. Notice that the range has to satisfy $\ell>h/2$.

Note: the parameter set -e, -q, -g and the parameter -f are mutually exclusive. You can only use one of them.
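Before launching a long run with -f, the constraint on the range can be checked with a few lines of Python. This is a hypothetical helper, not part of Ryu: the name `valid_range` is made up, and it tests the strict condition $h/2 < \ell < h$ given in the algorithm overview.

```python
def valid_range(lo, hi):
    """Return True when h/2 < l < h holds for the pair (l, h) = (lo, hi)."""
    return hi / 2 < lo < hi

print(valid_range(12, 20))  # True
print(valid_range(10, 20))  # False
```

The second call fails because $\ell = 10$ is not strictly greater than $h/2 = 10$.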

Test

A test can be found in test/pipeline.sh, where a small subsample of reads from E. coli is downloaded, indexed, and assembled.

To conduct the test, please install Ryu (as described in Installation) and give executable permissions:

cd test; chmod +x pipeline.sh

You can then run the test with:

./pipeline.sh

Limitations

Ryu is not a complete genome assembler, as it does not perform several steps that are typical of assembly pipelines:

Read correction
Graph trimming
Scaffolding

This means that, for complex organisms, assembly contiguity may not match that of established long-read assemblers that use OLC (Overlap-Layout-Consensus) graphs. However, Ryu’s accuracy should be comparable to, or even better than, these tools. Additionally, Ryu offers significantly lower running time and memory usage.

We aim to make Ryu a more comprehensive tool that can avoid at least the first two steps in future releases.

Ryu is also not designed for:

Polyploid genomes
Nanopore reads
Older PacBio reads

Although Ryu can technically process these data types, the resulting assemblies will likely be more fragmented. We are working on updating our assembly model to better support polyploid genomes.

Citation

If you use Ryu in your research, please cite the paper in the CITATION file in the root directory.

Contact

Please open an issue to report problems or bugs.
