Synthetic RNA-seq Cohorts: A Feasibility Study for Privacy-Aware Collaboration on Sensitive Genomic Data

Transcriptomic data, especially when paired with clinical metadata, is invaluable across the drug discovery workflow. The problem: bulk RNA-seq count matrices retain genotypic signals that survive standard de-identification, creating real re-identification risk. A wave of state-level genomic privacy laws, a proposed NIH controlled-access data policy overhaul, and tightening IRB and DUA requirements are making cross-institutional sharing harder and slower.

Synthetic data is a credible path forward, but only if the synthetic cohort preserves the biology, not just the schema: differential expression (DE) concordance, ML accuracy, and pathway activation. Whether it does is the core question we set out to answer with dbTwin. We partnered with Decode Health to validate synthetic clinico-transcriptomic data on a real-world sepsis bulk RNA-seq cohort, and the results are clear: dbTwin synthetic cohorts are biologically faithful, retaining the deep analytical structure of the real data while maintaining sample-level diversity. Full white paper below.

