skerk001/README.md

Samir Kerkar

Healthcare data scientist working in causal inference, quasi-experimental program evaluation, and applied ML for managed care populations. Four years of pharmacist-led intervention studies at a 60,000-patient health system; now supporting publication-track outcomes work for an FDA-cleared cardiac device.

M.S. Data Science, UC San Diego (HDSI, starting Fall 2026) · B.S. Mathematics, UC Irvine

Available: Full-time · Contract · Remote-friendly

Samir2000VIP@gmail.com · LinkedIn · Irvine, CA


Current — Ventric Health

Clinical Research Data Scientist (Remote) — Analytical infrastructure and outcomes analyses supporting device validation and post-market evidence generation for an FDA-cleared cardiac device.


Selected Work — Desert Oasis Healthcare (2022–2026)

Data scientist across 20+ facilities, ~60,000 managed care patients. Methods chosen to defend causal claims a reviewer would actually question.

  • COPD pharmacist-led program — Propensity-matched cohort (n = 997) + difference-in-differences. $83.50 PMPM cost reduction (p = 0.0027); secondary PDC analysis showed ≈20% adherence improvement (p = 0.013).
  • Post-discharge pharmacist intervention (n = 878) — Negative binomial regression (selected over Poisson after overdispersion diagnostics). Adjusted IRR = 0.78 (p = 0.02) — 22% reduction in 30-day all-cause readmissions.
  • Hospitalization-risk models — Random forest / XGBoost over ~40 features (claims, EHR, pharmacy, lab) with SMOTE. AUROC = 0.78 (k-fold CV) on CHF / COPD cohorts. Monthly risk-stratified lists drove clinical-pharmacist outreach.
  • Patient-experience NLP — ~40K free-text comments / year. DistilBERT sentiment (~90% on a 5K labeled set) + LDA topic modeling, operationalized into provider- and facility-level Power BI dashboards.
  • AFib anticoagulation care-gap study — ICD-10 cohort identification (n = 1,381); 77 guideline-discordant cases flagged for pharmacist review. Contributed to a poster at the ASHP National Conference.
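Two pieces of arithmetic sit behind the post-discharge bullet above: the overdispersion check that motivates negative binomial over Poisson, and the reading of an IRR of 0.78 as a 22% reduction. The sketch below is a minimal illustration in plain Python, not the study code. The function names and the toy counts are hypothetical, and the variance-to-mean ratio is a crude unadjusted check, whereas diagnostics on a fitted model would condition on covariates.

```python
from statistics import mean, variance


def dispersion_ratio(counts):
    """Sample variance divided by sample mean.

    Under a Poisson model the variance equals the mean, so a ratio
    well above 1 suggests overdispersion.
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("mean count is zero; ratio is undefined")
    return variance(counts) / m


def irr_to_percent_reduction(irr):
    """Convert an incidence rate ratio into a percent reduction in the event rate."""
    return (1.0 - irr) * 100.0


# Hypothetical per-patient readmission counts (illustrative only, not study data).
toy_counts = [0, 0, 0, 0, 1, 0, 0, 3, 0, 0, 5, 0]

ratio = dispersion_ratio(toy_counts)        # well above 1 for this toy sample
reduction = irr_to_percent_reduction(0.78)  # IRR reported in the bullet above

print(f"variance/mean ratio: {ratio:.2f}")
print(f"percent reduction from IRR 0.78: {reduction:.0f}%")
```

A ratio near 1 would be consistent with Poisson. Values well above 1, as in the toy sample, are the kind of signal that points toward a negative binomial specification.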

Publications

Improvements in HF-Related Utilization Outcomes Following Large-Scale Screening for LVEDP as Part of Routine Primary Care (under peer review) — Co-authored; sole DOHC-side analyst on the post-implementation utilization comparison. Year-over-year reductions in urgent care (p < 0.001) and ED (p = 0.006) utilization following screening rollout.


Projects

  • CausalCare — Causal inference on ICU mortality. Five-method stack (PSM, IPW, AIPW, Double ML, Causal Forest) on DoWhy's identify → estimate → refute workflow; method agreement as a robustness check rather than a single point estimate.
  • GenomicsGPT — Variant interpretation at scale. XGBoost / LightGBM ensemble over 1.69M ClinVar variants. Leakage-corrected AUC = 0.985; feature ablation defends against gene-name memorization (consequence + LoF alone = 0.97; gene-only = 0.78). SHAP audit + LLM-generated ACMG/AMP reports.
  • ClinicalRAG — RAG over 220 clinical documents with retrieval and refusal as first-class metrics: 97.6% recall, 85.7% citation rate, 95.2% abstention accuracy.
  • Diabetic Retinopathy — Custom CNN, 5-class grading. Weighted F1 = 0.94, outperforming ResNet-50 and VGG-16 on the same split. Grad-CAM confirms clinically meaningful attention.

Stack

Causal & statistics — PSM, DiD, IPW, AIPW, Double ML, Causal Forest, negative binomial & other GLMs, survival analysis, PMPM modeling
ML / DL — XGBoost, LightGBM, scikit-learn, TensorFlow / Keras, SHAP
LLM / NLP — RAG, LangChain, ChromaDB, HuggingFace Transformers
Healthcare data — EHR, claims, pharmacy, ICD-10, HCC, HIPAA-compliant handling
Languages & delivery — Python, SQL, R, Power BI, FastAPI, Git


Outside work: 2500+ rated chess · basketball · piano

Pinned

  1. diabetic-retinopathy-classification (Public)

    CNN-based 5-class diabetic retinopathy severity classification from retinal fundus images (F1 = 0.94)

  2. gene-cancer-prediction (Public)

    ML classification of AML vs. ALL leukemia subtypes from gene expression data (F1 = 0.95)

    Jupyter Notebook

  3. clinical-rag (Public)

    RAG system for clinical question answering over 220 discharge summaries with hallucination guardrails, citation tracking, and chunking strategy evaluation (97.6% condition recall)

    Python

  4. genomicsgpt (Public)

    ML + LLM pipeline for genetic variant pathogenicity prediction (AUC 0.9949, 1.69M ClinVar variants) with SHAP explainability and clinical report generation via Llama 3 / Claude

    Jupyter Notebook

  5. CausalCare (Public)

    Causal inference analysis of ICU beta-blocker treatment effects using propensity matching, IPW, doubly robust estimation, Double ML, and Causal Forest on eICU data

    Python