This repository contains the core analysis code for the TPHP project. The associated study, “Spatial distribution of the proteome in the human body and in cancers”, has been published in Nature. Using data-independent acquisition mass spectrometry, the study quantified 13,609 proteins across 2,856 samples, covering 58 major tissue types, 251 tissue subtypes and 25 cancer types, thereby establishing a spatially resolved quantitative landscape of the human proteome in healthy, developmental and cancer states.
The dataset can be interactively explored through the ProteinTalks database.
The code in this repository performs per-cancer, per-protein comparisons of tumor versus paired non-tumor samples using linear mixed-effects regression, enabling systematic analysis of oncogenic proteome changes across tissues. Results are exported as RDS objects for downstream statistical analysis, visualization and integration with the public proteome database.
- tumor_dysregulation_analysis.R: main analysis script
- data/tumor_compare_data.parquet: input table (paired tumor/non-tumor)
- output/compare_report_output.rds: main result
- output/package_versions.txt: R and package versions used
The workflow is expected to run on all common desktop and server platforms that support R. It was tested on Windows 11 (x64).
- R (recommended R 4.0+)
- CRAN packages
  - tidyverse
  - arrow
  - lme4
  - lmerTest
  - plyr
Exact tested versions are recorded in output/package_versions.txt.
- CPU-only execution (no GPU required)
- Recommended RAM: ≥8 GB (larger datasets may require more)
- No non-standard hardware required
git clone <REPO_URL>
cd <REPO_DIR>
Install R for the operating system from CRAN.
Optionally, install packages manually:
install.packages(c("tidyverse","arrow","lme4","lmerTest","plyr"))
This project is script-based; no installation step is required beyond installing the dependencies.
From the repository root:
Rscript tumor_dysregulation_analysis.R
Default input path:
data/tumor_compare_data.parquet
After a successful run, the script writes:
- output/compare_report_output.rds: an R list DEA with:
  - DEA$Diff.report: full per-(cancer, protein) model results
  - DEA$Diff.report.filter: filtered results using Hedges' g >= 0.5 and p_adj_BH < 0.05
- output/package_versions.txt: R version and package versions used.
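The structure described above can be explored with base R alone. The sketch below does not touch the real output file: it builds a small mock list with made-up values, writes it to a temporary RDS file, and reads it back. The identifier columns `cancer_abbr` and `protein` in the report are assumptions for illustration, as is the mapping of "Hedges' g" onto the `g_adj` column. The filter follows the thresholds as literally stated, so negative effects are not retained here.

```r
# Mock object standing in for output/compare_report_output.rds.
# All values are invented for illustration only.
DEA_mock <- list(
  Diff.report = data.frame(
    cancer_abbr = c("A", "A", "B"),    # hypothetical identifier column
    protein     = c("P1", "P2", "P1"), # hypothetical identifier column
    g_adj       = c(0.9, 0.2, 0.7),
    p_adj_BH    = c(0.01, 0.30, 0.20),
    stringsAsFactors = FALSE
  )
)

# Apply the documented thresholds to derive the filtered table.
DEA_mock$Diff.report.filter <- subset(
  DEA_mock$Diff.report,
  g_adj >= 0.5 & p_adj_BH < 0.05
)

# Round-trip through an RDS file, as a downstream script would.
path <- tempfile(fileext = ".rds")
saveRDS(DEA_mock, path)
DEA <- readRDS(path)

str(DEA, max.level = 1)
print(DEA$Diff.report.filter)
```

With the real result, replacing `path` by `"output/compare_report_output.rds"` should give the same two list elements.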
Runtime scales primarily with the number of model fits (cancer types × proteins); the cost of each fit grows with the number of samples in that cancer type.
As a rough guide, on a standard desktop CPU (8–16 threads, 16–32 GB RAM), throughput is ~25 model fits per second, corresponding to < 10 minutes per cancer type under typical settings.
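The per-cancer estimate can be checked with simple arithmetic. The sketch below assumes that every one of the 13,609 quantified proteins is fitted for a single cancer type, which is an upper bound rather than something the script guarantees, and uses the quoted throughput of about 25 fits per second.

```r
# Back-of-the-envelope runtime for one cancer type.
n_proteins   <- 13609  # proteins quantified in the study (assumed all are fitted)
fits_per_sec <- 25     # approximate throughput quoted above

seconds <- n_proteins / fits_per_sec
minutes <- seconds / 60

cat(sprintf("%.0f s, about %.1f min per cancer type\n", seconds, minutes))
```

Under these assumptions the estimate stays just below the 10-minute figure.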
The script expects a Parquet table with metadata columns:
- patient_ID (string)
- sample_type (must include NT for non-tumor and T for tumor)
- cancer_abbr (string)
- cancer_subtype (string; may be constant within a subset)
- Gender (categorical)
- Age (integer-like)
- Dataset (categorical)
All remaining columns are treated as numeric features (proteins).
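Before running the analysis on a new table, it can help to check these expectations explicitly. The following is a minimal sketch, not part of the repository: `check_input` is a hypothetical helper, and the toy table with columns `PROT_A` and `PROT_B` is invented. A real table could be loaded first with `arrow::read_parquet("data/tumor_compare_data.parquet")`.

```r
# Metadata columns named in this README.
meta_cols <- c("patient_ID", "sample_type", "cancer_abbr", "cancer_subtype",
               "Gender", "Age", "Dataset")

# Hypothetical helper: validates the layout and returns the protein columns.
check_input <- function(tbl) {
  missing_cols <- setdiff(meta_cols, names(tbl))
  if (length(missing_cols) > 0) {
    stop("Missing metadata columns: ", paste(missing_cols, collapse = ", "))
  }
  if (!all(c("NT", "T") %in% tbl$sample_type)) {
    stop("sample_type must include both 'NT' and 'T'")
  }
  protein_cols <- setdiff(names(tbl), meta_cols)
  is_num <- vapply(tbl[protein_cols], is.numeric, logical(1))
  if (!all(is_num)) {
    stop("Non-numeric feature columns: ",
         paste(protein_cols[!is_num], collapse = ", "))
  }
  protein_cols
}

# Toy table with two patients, each with a paired NT/T sample.
toy <- data.frame(
  patient_ID     = c("p1", "p1", "p2", "p2"),
  sample_type    = c("NT", "T", "NT", "T"),
  cancer_abbr    = "X",
  cancer_subtype = "X1",
  Gender         = c("F", "F", "M", "M"),
  Age            = c(60L, 60L, 55L, 55L),
  Dataset        = "D1",
  PROT_A         = c(10.1, 11.3, 9.8, 10.9),
  PROT_B         = c(7.2, 7.0, 6.9, 7.4),
  stringsAsFactors = FALSE
)

protein_cols <- check_input(toy)
print(protein_cols)
```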
For each cancer_abbr and each protein, the script fits a mixed-effects model:
value ~ 1 + sample_type + (optional covariates) + (1 | patient_ID)
Optional covariates among {cancer_subtype, Gender, Age_c, Dataset} are included only when estimable within the cancer/protein subset.
The reported tumor effect is the fixed-effect coefficient for sample_typeT.
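The model specification above can be illustrated on simulated data. This is a sketch under stated assumptions, not the script's actual code: `build_formula` is a hypothetical helper, "estimable" is approximated as "the covariate takes at least two distinct values in the subset" (the real check may also consider collinearity), and `Age_c` is assumed to be mean-centered age. The fit is skipped if lmerTest is not installed.

```r
# Hypothetical helper: keep only covariates that vary within the subset.
build_formula <- function(d, candidates = c("cancer_subtype", "Gender",
                                            "Age_c", "Dataset")) {
  varies <- vapply(candidates, function(v) {
    v %in% names(d) && length(unique(stats::na.omit(d[[v]]))) > 1
  }, logical(1))
  rhs <- c("1", "sample_type", candidates[varies], "(1 | patient_ID)")
  stats::as.formula(paste("value ~", paste(rhs, collapse = " + ")))
}

# Simulated subset: 30 patients, each with a paired NT and T sample.
set.seed(1)
n_pat <- 30
patient_effect <- rnorm(n_pat, sd = 0.5)
d <- data.frame(
  patient_ID     = rep(sprintf("p%02d", seq_len(n_pat)), each = 2),
  sample_type    = factor(rep(c("NT", "T"), times = n_pat),
                          levels = c("NT", "T")),
  cancer_subtype = "S1",  # constant, so it is dropped from the model
  Gender         = rep(rep(c("F", "M"), length.out = n_pat), each = 2),
  Age            = rep(seq(40, by = 1, length.out = n_pat), each = 2),
  Dataset        = "D1"   # constant, so it is dropped from the model
)
d$Age_c <- d$Age - mean(d$Age)  # assumed definition of Age_c
d$value <- 10 + 1.0 * (d$sample_type == "T") +
  rep(patient_effect, each = 2) + rnorm(nrow(d), sd = 0.5)

f <- build_formula(d)
print(f)

if (requireNamespace("lmerTest", quietly = TRUE)) {
  fit <- lmerTest::lmer(f, data = d)
  # Row of the fixed-effects table for the tumor contrast (T vs NT).
  tumor_row <- summary(fit)$coefficients["sample_typeT", ]
  print(tumor_row)
} else {
  message("lmerTest is not installed; skipping the model fit")
}
```

Setting `NT` as the first factor level is what makes the coefficient appear as `sample_typeT`, i.e. tumor relative to non-tumor.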
Each row in DEA$Diff.report corresponds to one (cancer, protein) fit and includes:
- effect, se, t, df (degrees of freedom)
- p, p_adj_BH (p value and its Benjamini-Hochberg-adjusted counterpart)
- sigma (residual SD)
- es_adj = effect / sigma (standardized effect)
- g_adj = J(df) * es_adj (small-sample adjusted standardized effect)
- formula (the exact model formula used)
- is_singular, re_var_patient (fit diagnostics)
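The derived columns can be reproduced from `effect`, `sigma`, `df` and `p` in base R. The sketch below rests on two assumptions that this README does not confirm: J(df) is taken to be the common approximation 1 - 3 / (4 * df - 1) of the Hedges small-sample correction, and the Benjamini-Hochberg adjustment is applied across all rows of the toy table at once (the script may instead adjust within each cancer type). The input values are invented.

```r
# Assumed form of the small-sample correction factor J(df).
hedges_j <- function(df) 1 - 3 / (4 * df - 1)

# Toy per-fit results (invented numbers).
report <- data.frame(
  effect = c(1.2, -0.3, 0.8),
  sigma  = c(1.0, 0.6, 2.0),
  df     = c(20, 35, 12),
  p      = c(0.001, 0.20, 0.04)
)

report$es_adj   <- report$effect / report$sigma           # standardized effect
report$g_adj    <- hedges_j(report$df) * report$es_adj    # shrunk toward zero
report$p_adj_BH <- p.adjust(report$p, method = "BH")      # BH-adjusted p values

print(report)
```

Because J(df) is below 1 for any finite df, `g_adj` is always slightly smaller in magnitude than `es_adj`, with the difference vanishing as df grows.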