Hi Andrey,
I am a graduate student majoring in bioinformatics. I previously asked about using IsoQuant on a test with 2 samples (a paired normal and disease sample). I am reaching out again because I am now scaling the analysis up to a larger cohort. (I am using ONT long-read sequencing data.)
In the current run, I attempted to process 21 samples simultaneously using a merged GTF file and the 21 corresponding BAM files. Because of the high depth of the merged data, coverage at many loci exceeded tens of millions of reads, which drastically increased the computation time. Since this seemed infeasible, I re-ran the analysis with max_coverage_normal_chr set to 5,000,000, keeping the chrM coverage cutoff at its default value. Despite this adjustment, the run still took 17 days to complete.
I suspect that the primary cause is the accumulation of reads at specific chromosomal positions when multiple samples are processed together. However, I am concerned that imposing a coverage cutoff might discard quantitative information and lead to underestimated expression levels for highly expressed genes or transcripts in the final output matrix.
To address these issues, I would like to ask for your expert opinion on the following:
Per-sample runs: Would it be a feasible alternative to run IsoQuant on each sample individually, and then merge the resulting matrix files in Seurat for downstream processing and batch correction?
Coverage cutoff: Do you think the results obtained with max_coverage_normal_chr set to 5,000,000 are reliable for downstream analysis, or would this cutoff significantly bias the quantification?
Alternative solutions: Are there any other recommended strategies to optimize performance for a 21-sample cohort without compromising data integrity?
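To make the first question concrete, here is a minimal sketch of the per-sample merge I have in mind. It assumes that each per-sample run can be reduced to a simple mapping from feature ID to count; the function name merge_counts and the toy IDs are hypothetical and are not part of IsoQuant or Seurat. Features missing from a sample are filled with 0.

```python
from typing import Dict, List, Tuple


def merge_counts(
    per_sample: Dict[str, Dict[str, float]],
) -> Tuple[List[str], List[str], List[List[float]]]:
    """Outer-join per-sample count tables into one feature x sample matrix.

    per_sample maps a sample name to {feature_id: count} (assumed layout).
    Returns (feature_ids, sample_names, matrix), with one matrix row per
    feature and one column per sample, both in sorted order.
    """
    samples = sorted(per_sample)
    # Union of all feature IDs seen in any sample.
    features = sorted({f for counts in per_sample.values() for f in counts})
    # A feature absent from a sample is recorded as a zero count.
    matrix = [[per_sample[s].get(f, 0.0) for s in samples] for f in features]
    return features, samples, matrix


if __name__ == "__main__":
    # Toy data with hypothetical transcript IDs.
    toy = {
        "sample_01": {"tx_A": 120, "tx_B": 15},
        "sample_02": {"tx_B": 30, "tx_C": 4},
    }
    features, samples, matrix = merge_counts(toy)
    print("\t".join(["feature_id"] + samples))
    for feature, row in zip(features, matrix):
        print("\t".join([feature] + [str(v) for v in row]))
```

I assume this would only work cleanly for reference-annotated IDs, since the IDs of novel transcripts may not match between independent runs.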
Thanks for reading my question.