ddiazdom/Ryu

Ryu: A genome assembler based on variable-order de Bruijn graphs

Ryu's algorithm relies on the variable-order de Bruijn graph (voDBG) to generate the assembly.

A voDBG contains all possible de Bruijn graphs (DBGs) we can generate from the reads. This property means that, unlike the fixed-order DBG, the node labels are of arbitrary length (limited only by the length of the longest read).
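To make the "all possible DBGs" idea concrete, here is a minimal sketch that enumerates the node labels of every order-k DBG for a toy read set. The function names (`kmer_counts`, `all_orders`) are hypothetical, and this naive enumeration only illustrates the definition; it is not how Ryu stores the graph.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every length-k substring (k-mer) across the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def all_orders(reads):
    """Map each order k to the node labels (and frequencies) of the order-k DBG.

    The largest order is bounded by the length of the longest read.
    """
    longest = max(len(r) for r in reads)
    return {k: kmer_counts(reads, k) for k in range(1, longest + 1)}


# Toy input (illustrative only).
reads = ["ACGTAC", "CGTACG"]
orders = all_orders(reads)
```

Each entry `orders[k]` holds the nodes of one fixed-order DBG; conceptually, the voDBG lets the assembler move between these orders during a traversal.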

Ryu requires two parameters, $\ell$ and $h$, that indicate the minimum and maximum frequencies (respectively) a node in the assembly can have. In this context, belonging to the assembly means that the node's label appears as a substring in at least one contig.

The general idea of the assembly algorithm is as follows: we pick an arbitrary voDBG node whose frequency lies between $\ell$ and $h$, and continuously right-extend its context as long as its frequency remains above $\ell$.

When we reach a point where we cannot extend the node further, we start to left-contract it while its frequency remains below $h$. Once we stop the left contractions, we begin to right-extend again. We iterate over this process until we reach a node we have already visited, or no further extensions or contractions exist. We collapse the graph labels we visit during the traversal into a single sequence, which later becomes a contig in the assembly.
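The traversal described above can be sketched on a toy read set as follows. This is an illustration under stated assumptions, not Ryu's implementation: frequencies are counted by naive substring search, both thresholds are treated as inclusive, ties between extensions are broken by highest frequency, and the names `freq`, `right_extension`, and `traverse` are hypothetical.

```python
def freq(reads, s):
    """Number of occurrences of s as a substring of the reads (naive scan)."""
    return sum(
        1
        for r in reads
        for i in range(len(r) - len(s) + 1)
        if r.startswith(s, i)
    )


def right_extension(reads, label, lo):
    """Return the symbol whose extension keeps the frequency >= lo, or None."""
    best = None
    for c in "ACGT":
        f = freq(reads, label + c)
        if f >= lo and (best is None or f > best[1]):
            best = (c, f)
    return best[0] if best else None


def traverse(reads, seed, lo, hi):
    """Alternate right extensions and left contractions starting from seed."""
    label, contig, visited = seed, seed, {seed}
    while True:
        c = right_extension(reads, label, lo)
        if c is not None:
            # Right extension: the new symbol also grows the contig.
            label += c
            contig += c
        elif len(label) > 1 and freq(reads, label[1:]) <= hi:
            # Left contraction: drop the first symbol; the contig is unchanged.
            label = label[1:]
        else:
            break  # no extension or contraction is possible
        if label in visited:
            break  # reached a node we have already visited
        visited.add(label)
    return contig


# Three overlapping toy reads sampled from "AACCGGTT" (illustrative only).
reads = ["AACCGG", "ACCGGT", "CCGGTT"]
contig = traverse(reads, "ACCG", lo=2, hi=3)
```

In this toy run, the seed `ACCG` is extended to `ACCGG`, contracted to `CCGG`, and only then extended with `T`, which shows how a contraction can unlock a further extension.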

We can reduce the number of misassemblies if we set $h/2< \ell <h$. With this setting, any right extension that stays within the range accounts for more than 50% of the node's frequency, so every voDBG node in the assembly has at most one such extension.
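A short derivation makes the role of the condition explicit. The notation is an assumption introduced here: $f(v)$ is the frequency of a node $v$ in the assembly (so $f(v) \le h$), and $f(va)$ is the frequency of its right extension with symbol $a$. Since each occurrence of $v$ is followed by at most one symbol, the extension frequencies sum to at most $f(v)$.

```latex
% Assumed notation: f(v) <= h, and \sum_a f(va) <= f(v).
% Suppose two distinct extensions a != b both reach the lower threshold.
\begin{align*}
f(va) \ge \ell,\quad f(vb) \ge \ell,\quad a \ne b
  &\implies f(va) + f(vb) \ge 2\ell > h \ge f(v), \\
  &\text{which contradicts } f(va) + f(vb) \le f(v).
\end{align*}
% Hence at most one right extension has frequency >= \ell,
% and it satisfies f(va) >= \ell > h/2 >= f(v)/2.
```

For example, with $\ell=12$ and $h=20$, two admissible extensions would need a combined frequency of at least 24, which exceeds the node's maximum of 20.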

Why Ryu?

We refer to our assembler as Ryu (龍), the Japanese word for dragon. Finnish speakers may see the reason for this name: the Finnish word for dragon, lohikäärme, combines "lohi", which echoes the "lo-hi" thresholds that drive the assembly, with "käärme" (snake), which echoes the snake-like sequence of right extensions and left contractions of the algorithm through the voDBG.

External tools that are already included in this project

  1. CLI11: Command line parser for C++
  2. kseq: FASTQ parser
  3. grlBWT: BWT construction library
  4. xxHash: Hashing library

Prerequisites

  1. CMake 3.10 or greater
  2. SDSL-lite library
  3. C++17
  4. GCC or Clang

So far, we have tested the software on Linux/macOS with GCC 13.2.0 and Clang 17.0.0. In principle, any compiler that supports C++17 should work.

Installation

First, clone this repository, enter the resulting folder, and type:

mkdir build && cd build

Compile the source code:

cmake .. && make

The build will automatically search for the SDSL-lite installation folder in the default paths (/home/your_usr or /usr).

If the library is installed elsewhere, pass its path on the command line:

cmake .. -DSDSL_LIBRARY_PATH=/your/path/to/the/SDSL 

For better cross-compatibility, you may want to disable architecture-specific optimizations.

You can do this by modifying the CMakeLists.txt file or setting:

cmake .. -DARCH_OPTIM=OFF

Steps for assembling a genome

1. Preprocessing the reads

The first step is to preprocess the reads:

ryu index input_reads.fastq -o input_reads.idx 

Ryu will extract the DNA sequences from the input FASTQ/FASTA file and compress them in input_reads.idx. It will also perform homopolymer compression on the reads (also commonly known as run-length encoding).
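As an illustration of what homopolymer compression does to a read, here is a minimal sketch. The function names are hypothetical, and whether Ryu keeps the run lengths alongside the compressed sequence is not stated here, so `run_length_encode` is shown only as the variant that would retain them.

```python
from itertools import groupby


def homopolymer_compress(seq):
    """Collapse each run of identical bases into a single base."""
    return "".join(base for base, _ in groupby(seq))


def run_length_encode(seq):
    """Same runs, but keep each run's length as (base, length) pairs."""
    return [(base, len(list(run))) for base, run in groupby(seq)]


compressed = homopolymer_compress("AAACCGTTTT")
runs = run_length_encode("AAACCG")
```

Collapsing runs removes differences in homopolymer length between reads, which is a common source of sequencing errors.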

2. Perform the assembly

The assembly requires the range $[\ell, h]$, which we can specify in two ways.

The first way is to let the software compute it from the genome size, sequencing error rate, and reconstruction confidence. For instance, suppose the reads have an error rate of 0.01, the genome is 200 Mb, and we want 99% confidence in the contig reconstruction.

Then the command should be:

ryu assemble -o assembled_lhtig -e 0.01 -q 0.99 -g 200mb -i input_reads.idx

The other way is to specify the range directly:

ryu assemble -o assembled_lhtig -f 12 20 -i input_reads.idx

Here, $\ell=12$ and $h=20$. Note that the range has to satisfy $\ell>h/2$ (in this example, $12>10$).
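Before launching a run with an explicit range, the constraint can be checked with a one-line helper. This is a hypothetical convenience function, not part of Ryu, and it assumes the full condition $h/2<\ell<h$ with strict inequalities.

```python
def is_valid_range(lo, hi):
    """Return True when hi/2 < lo < hi, the condition on the frequency range."""
    return hi / 2 < lo < hi


ok = is_valid_range(12, 20)        # the range used in the command above
too_low = is_valid_range(10, 20)   # lo equals hi/2, so the condition fails
```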

Note: the parameter set -e, -c, -g and the option -f are mutually exclusive. You can only use one of the two.

Test

A test can be found in test/pipeline.sh, where a small subsample of reads from E. coli is downloaded, indexed, and assembled.

To conduct the test, please install Ryu (as described in Installation) and give executable permissions:

cd test; chmod +x pipeline.sh

You can then run the test:

./pipeline.sh

Limitations

Ryu is not a complete genome assembler, as it does not perform several steps that assemblers typically include:

  • Read correction
  • Graph trimming
  • Scaffolding

This means that, for complex organisms, assembly contiguity may not match that of established long-read assemblers that use OLC (Overlap-Layout-Consensus) graphs. However, Ryu’s accuracy should be comparable to, or even better than, these tools. Additionally, Ryu offers significantly lower running time and memory usage.

We aim to make Ryu a more comprehensive tool that can avoid at least the first two steps in future releases.

Ryu is also not designed for:

  • Polyploid genomes
  • Nanopore reads
  • Older PacBio reads

Although Ryu can technically process these data types, the resulting assemblies will likely be more fragmented. We are working on updating our assembly model to better support polyploid genomes.

Citation

If you use Ryu in your research, please cite the paper in the CITATION file in the root directory.

Contact

Please open an issue to report problems or bugs.
