Data Definition Engine

Status: under development. Licenses: MIT (code and models), CC BY 4.0 (content).

The Data Definition Engine (DDE) is an open-source tool that automatically generates CDISC regulatory submission artifacts from a structured clinical trial protocol expressed in the Unified Study Definitions Model (USDM). It is developed as part of the CDISC 360i Program by the Define-XML Generation Project Team.

USDM is a TransCelerate / CDISC standard that represents a clinical trial's complete protocol in a machine-readable JSON format. It captures study objectives, arms, visits, eligibility criteria, and assessments in a structured, vendor-neutral way. It is the input to this tool and the "source of truth" for the study. It was co-developed through a formal partnership between CDISC and TransCelerate BioPharma as part of TransCelerate's Digital Data Flow (DDF) initiative.


Background

Clinical trials submitted to regulatory agencies (such as the FDA) must include standardized metadata files that describe the structure, content, and meaning of all datasets. Producing these files, most notably Define-XML, has traditionally been a manual, error-prone, and time-consuming process.

The CDISC 360i program aims to automate this process end-to-end: starting from a machine-readable study protocol (USDM), the DDE derives all the metadata needed to generate regulatory submission artifacts. It eliminates the gap between protocol design and data submission by using the same source of truth throughout the study lifecycle.


How It Works

The DDE implements a three-stage pipeline:

USDM Study Design JSON
         │
         ▼
┌───────────────────┐
│      LOADER       │  create_define_json.py
│                   │  • Reads the USDM protocol
│                   │  • Fetches Biomedical Concepts
│                   │  • Retrieves Dataset Specializations
│                   │  • Calls the CDISC Library API
└────────┬──────────┘
         │
         ▼  DDS JSON (define.json)  ◄── central intermediate model
         │
┌────────┴──────────┐
│     GENERATOR     │  define_generator.py
│                   │  • Reads the DDS JSON
│                   │  • Builds Define-XML v2.1 elements
│                   │  • Writes the output XML
└────────┬──────────┘
         │
         ▼
  Define-XML v2.1 (.xml)
  HTML rendering  (.html)

Loaders extract metadata from various sources and populate the central Data Definition Specification (DDS) model (a JSON file). Generators consume that DDS JSON and produce the final submission artifacts.

Why a central JSON model? No single source has all the metadata needed for a full submission. The DDS acts as an aggregation layer, combining protocol content, biomedical concept definitions, controlled terminology, and any manually filled gaps into one validated model.
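
The aggregation idea can be sketched as a merge of several partial metadata sources into one dictionary that also records where each value came from. This is only an illustration: the function name `merge_sources`, the source labels, and the field names are hypothetical and are not taken from the DDE code base or the DDS schema.

```python
from typing import Any, Dict, List, Tuple


def merge_sources(sources: List[Tuple[str, Dict[str, Any]]]) -> Dict[str, Dict[str, Any]]:
    """Merge partial metadata dicts; earlier sources win, None counts as missing.

    Returns {field: {"value": ..., "source": ...}} so provenance is kept.
    """
    merged: Dict[str, Dict[str, Any]] = {}
    for source_name, fields in sources:
        for key, value in fields.items():
            if value is None:
                continue  # a gap in this source; a later source may fill it
            if key not in merged:
                merged[key] = {"value": value, "source": source_name}
    return merged


# Hypothetical inputs, ordered from highest to lowest precedence.
protocol = {"study_name": "LZZT", "domain_label": None}
library = {"domain_label": "Vital Signs", "study_name": "ignored"}
manual = {"origin": "Collected"}

dds = merge_sources([("usdm", protocol), ("cdisc_library", library), ("manual", manual)])
print(dds["domain_label"])
```

Keeping the source name next to each value is one simple way to make later gap-filling and review traceable; the real DDS model may track provenance differently.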


Study Artifacts

| Artifact | Format | Description | Status |
|---|---|---|---|
| SDTM Define-XML | .xml | Metadata file describing SDTM dataset structure, variables, and controlled terminology for FDA submission | ✅ Implemented |
| ADaM Define-XML | .xml | Same for Analysis Datasets (ADaM) | 🔜 Planned |
| ODM CRFs | .xml | Case Report Form definitions in ODM format for data collection | 🔜 Planned |
| Trial Design Datasets | .json | TA, TD, TE, TI, TM, TS, TV datasets describing study design | 🔜 Planned |
| Dataset-JSON shells | .json | Empty dataset templates in Dataset-JSON format | 🔜 Planned |

Prerequisites

  • Python 3.8+
  • A CDISC Library API key, required to fetch Biomedical Concepts and Dataset Specializations during loading. Request one at CDISC Library.
  • Git (to clone the repository)

Installation

# Clone the repository
git clone https://github.com/cdisc-org/data-definition-engine.git
cd data-definition-engine

# Install loader dependencies
pip install -r src/define-xml/requirements.txt

# Install generator dependencies
pip install -r src/generators/define/requirements.txt

Set your CDISC Library API key by creating a .env file in src/define-xml/:

# src/define-xml/.env
CDISC_API_KEY=your_api_key_here

Or pass it directly on the command line with --cdisc_api_key.
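
The lookup order described above (command-line value first, then the CDISC_API_KEY environment variable) could be implemented along these lines. This is a sketch under stated assumptions, not the loader's actual code: the helper name `resolve_api_key` is hypothetical, and how the loader reads the .env file into the environment is not shown here.

```python
import argparse
import os
from typing import List, Mapping, Optional


def resolve_api_key(argv: List[str], env: Optional[Mapping[str, str]] = None) -> str:
    """Return the key from --cdisc_api_key, else from CDISC_API_KEY in env."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--cdisc_api_key", default=None)
    # parse_known_args ignores the loader's other arguments in this sketch
    args, _unknown = parser.parse_known_args(argv)
    key = args.cdisc_api_key or env.get("CDISC_API_KEY")
    if not key:
        raise SystemExit("No CDISC Library API key: pass --cdisc_api_key or set CDISC_API_KEY")
    return key


print(resolve_api_key(["--cdisc_api_key", "from-cli"], env={"CDISC_API_KEY": "from-env"}))
print(resolve_api_key([], env={"CDISC_API_KEY": "from-env"}))
```

Failing early with a clear message avoids a confusing authentication error later, when the first CDISC Library request is made.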


Usage

Step 1: Run the USDM Loader

The loader reads a USDM protocol JSON file, enriches it with metadata from the CDISC Library, and writes a DDS JSON file.

Windows (PowerShell): use a backtick ` for line continuation instead of \.

# bash / macOS / Linux
cd src/define-xml

python create_define_json.py \
  --usdm_file ../../data/protocol/LZZT/usdm/pilot_LLZT_protocol.json \
  --output_template ./output/define.json \
  --sdtmct 2024-09-27

# Windows PowerShell
cd src/define-xml

python create_define_json.py `
  --usdm_file ..\..\data\protocol\LZZT\usdm\pilot_LLZT_protocol.json `
  --output_template .\output\define.json `
  --sdtmct 2024-09-27

All arguments:

| Argument | Required | Default | Description |
|---|---|---|---|
| --usdm_file | Yes | - | Path to the USDM input JSON file |
| --output_template | Yes | - | Path for the output DDS JSON file |
| --sdtmct | Yes | - | SDTM Controlled Terminology date (yyyy-mm-dd) |
| --sdtmig | No | 3.4 | SDTM Implementation Guide version |
| --studyversion | No | 0 | Study version index in the USDM file (0-based) |
| --studydesign | No | 0 | Study design index (0-based) |
| --docversion | No | 0 | Document version index (0-based) |
| --cdisc_api_key | No | env var | CDISC Library API key (falls back to CDISC_API_KEY) |
| --cosmosversion | No | v2 | CDISC Cosmos API version |
| --validate | No | - | Validate output against a LinkML YAML schema (uses define.yaml if no path given) |
| --validation_report | No | - | Path to write an Excel validation report (required with --validate) |
| --patch_file | No | - | Generate a YAML patch file listing all placeholder/null fields |
| --apply_patch | No | - | Apply a completed patch file to fill in placeholder values |
| --debug | No | False | Save intermediate dictionaries as JSON files for inspection |
| --cacert | No | - | Path to a CA bundle (.pem) for SSL verification; use when behind a corporate proxy |
| --no_ssl_verify | No | False | Disable SSL certificate verification (use only in trusted environments) |
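
Because --sdtmct expects a date in yyyy-mm-dd form, a wrapper script may want to reject malformed values before calling the loader. The following is a minimal sketch using argparse; the `ct_date` helper is hypothetical, and it only checks the format, not whether a Controlled Terminology package exists for that date.

```python
import argparse
from datetime import datetime


def ct_date(text: str) -> str:
    """argparse type: accept only real calendar dates written as yyyy-mm-dd."""
    try:
        parsed = datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected yyyy-mm-dd, got {text!r}")
    # strptime also accepts "2024-9-27"; require the zero-padded form
    if parsed.strftime("%Y-%m-%d") != text:
        raise argparse.ArgumentTypeError(f"expected zero-padded yyyy-mm-dd, got {text!r}")
    return text


parser = argparse.ArgumentParser()
parser.add_argument("--sdtmct", required=True, type=ct_date)
parser.add_argument("--studyversion", type=int, default=0)  # 0-based index

args = parser.parse_args(["--sdtmct", "2024-09-27"])
print(args.sdtmct, args.studyversion)
```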

Tip: On the first run, use --patch_file gaps.yaml to generate a list of all fields that could not be derived automatically. Fill in the values, then re-run with --apply_patch gaps.yaml.
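
The gap-filling workflow in the tip can be pictured as two operations: collect the location of every empty field, then write supplied values back to those locations. The sketch below shows that idea on a nested dictionary. It is an assumption-laden illustration: the function names, the sample model, and the path representation are hypothetical, and the real patch file is YAML with a layout not described here.

```python
from typing import Any, Dict, List, Tuple, Union

Path = Tuple[Union[str, int], ...]


def find_gaps(node: Any, path: Path = ()) -> List[Path]:
    """Return the paths of all None values in nested dicts/lists."""
    if node is None:
        return [path]
    gaps: List[Path] = []
    if isinstance(node, dict):
        for key, value in node.items():
            gaps.extend(find_gaps(value, path + (key,)))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            gaps.extend(find_gaps(value, path + (index,)))
    return gaps


def apply_patch(model: Any, patch: Dict[Path, Any]) -> None:
    """Write each patch value into the model in place at its path."""
    for path, value in patch.items():
        target = model
        for step in path[:-1]:
            target = target[step]
        target[path[-1]] = value


# Hypothetical DDS-like fragment with two unfilled fields.
model = {"itemGroups": [{"name": "VS", "label": None,
                         "items": [{"name": "VSORRES", "origin": None}]}]}
gaps = find_gaps(model)
print(gaps)
apply_patch(model, {gaps[0]: "Vital Signs", gaps[1]: "Collected"})
print(find_gaps(model))
```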


Step 2: Run the Define-XML Generator

The generator reads the DDS JSON file and produces a Define-XML v2.1 file.

# bash / macOS / Linux
cd src/generators/define

python define_generator.py \
  --template ../../define-xml/output/define.json \
  --define ./output/define.xml

# Windows PowerShell
cd src\generators\define

python define_generator.py `
  --template ..\..\define-xml\output\define.json `
  --define .\output\define.xml

All arguments:

| Argument | Short | Required | Default | Description |
|---|---|---|---|---|
| --template | -t | Yes | - | Path to the DDS JSON input file |
| --define | -d | No | (built-in default) | Path for the output Define-XML .xml file |
| --validate | -s | No | False | Schema-validate the generated XML after writing |
| --log-level | -l | No | INFO | Logging level: DEBUG, INFO, WARNING, ERROR, CRITICAL |

Processing is logged to define_generator.log.
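
Conceptually, the generator step maps entries in the DDS JSON onto XML elements. The sketch below shows that mapping with the Python standard library only. It is not the generator's implementation (which uses odmlib): the DDS keys (`oid`, `name`, `itemGroups`) are hypothetical, and the output omits most attributes and elements that a valid Define-XML v2.1 file requires.

```python
import xml.etree.ElementTree as ET

ODM_NS = "http://www.cdisc.org/ns/odm/v1.3"
ET.register_namespace("", ODM_NS)  # serialize ODM as the default namespace


def build_item_groups(dds: dict) -> ET.Element:
    """Turn a DDS-like dict into a MetaDataVersion element with ItemGroupDefs."""
    mdv = ET.Element(f"{{{ODM_NS}}}MetaDataVersion",
                     {"OID": dds["oid"], "Name": dds["name"]})
    for group in dds["itemGroups"]:
        ET.SubElement(mdv, f"{{{ODM_NS}}}ItemGroupDef",
                      {"OID": group["oid"], "Name": group["name"]})
    return mdv


# Hypothetical DDS-like input with two datasets.
dds = {"oid": "MDV.LZZT", "name": "LZZT",
       "itemGroups": [{"oid": "IG.DM", "name": "DM"}, {"oid": "IG.VS", "name": "VS"}]}
element = build_item_groups(dds)
xml_text = ET.tostring(element, encoding="unicode")
print(xml_text)
```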


Full Pipeline Example

# bash / macOS / Linux, from the repository root

# 1. Load: USDM → DDS JSON
cd src/define-xml
python create_define_json.py \
  --usdm_file ../../data/protocol/LZZT/usdm/pilot_LLZT_protocol.json \
  --output_template ../../output/define.json \
  --sdtmct 2024-09-27 \
  --validate \
  --validation_report ../../output/validation_report.xlsx

# 2. Generate: DDS JSON → Define-XML
# src/generators/define is three levels below the repository root
cd ../generators/define
python define_generator.py \
  --template ../../../output/define.json \
  --define ../../../output/define.xml \
  --validate

# Windows PowerShell, from the repository root

# 1. Load: USDM → DDS JSON
cd src\define-xml
python create_define_json.py `
  --usdm_file ..\..\data\protocol\LZZT\usdm\pilot_LLZT_protocol.json `
  --output_template ..\..\output\define.json `
  --sdtmct 2024-09-27 `
  --validate `
  --validation_report ..\..\output\validation_report.xlsx

# 2. Generate: DDS JSON → Define-XML
# src\generators\define is three levels below the repository root
cd ..\generators\define
python define_generator.py `
  --template ..\..\..\output\define.json `
  --define ..\..\..\output\define.xml `
  --validate

The resulting output/define.xml is your SDTM Define-XML v2.1 submission file. To render it as HTML for human review, apply the bundled XSL stylesheet:

# Using xsltproc (Linux/macOS) or Saxon (Windows)
xsltproc src/generators/define/define2-1.xsl output/define.xml > output/define.html
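
For a quick check without rendering HTML, the datasets declared in a Define-XML file can be listed with the Python standard library. This is a minimal sketch: the helper name `list_datasets` and the inline sample document are hypothetical, and the sample is far from a complete Define-XML file. It relies only on the ODM v1.3 namespace and the ItemGroupDef / ItemRef element names.

```python
import xml.etree.ElementTree as ET

ODM_NS = {"odm": "http://www.cdisc.org/ns/odm/v1.3"}


def list_datasets(define_xml: str) -> list:
    """Return (Name, ItemRef count) for each ItemGroupDef in a Define-XML string."""
    root = ET.fromstring(define_xml)
    result = []
    for group in root.iterfind(".//odm:ItemGroupDef", ODM_NS):
        items = group.findall("odm:ItemRef", ODM_NS)
        result.append((group.get("Name"), len(items)))
    return result


# Tiny stand-in document; a real define.xml has many more elements and attributes.
sample = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3">
  <Study OID="S.1"><MetaDataVersion OID="MDV.1" Name="demo">
    <ItemGroupDef OID="IG.DM" Name="DM"><ItemRef ItemOID="IT.DM.USUBJID" Mandatory="Yes"/></ItemGroupDef>
    <ItemGroupDef OID="IG.VS" Name="VS"/>
  </MetaDataVersion></Study>
</ODM>"""
print(list_datasets(sample))
```

For a file on disk, `ET.parse(path).getroot()` could replace the `ET.fromstring` call.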

Project Structure

data-definition-engine/
│
├── data/                         # Sample study data for development and testing
│   ├── protocol/LZZT/usdm/       # CDISC pilot study LZZT in USDM format
│   └── metadata_xlsx/LZZT/       # SDTM and ADaM metadata spreadsheets (LZZT)
│
├── documents/
│   ├── Solution_Overview.md      # Architecture design document
│   └── glossary.md               # Definitions of key terms
│
├── HowTos/                       # Guides and GIF walkthroughs
│
└── src/
    ├── define-xml/               # LOADER: USDM → DDS JSON
    │   ├── create_define_json.py # Main loader script
    │   ├── define.yaml           # LinkML schema for the DDS model
    │   └── requirements.txt
    │
    └── generators/
        └── define/               # GENERATOR: DDS JSON → Define-XML
            ├── define_generator.py
            ├── define2-1.xsl     # XSL stylesheet for HTML rendering
            ├── requirements.txt
            └── tests/
                └── fixtures/     # Sample DDS JSON and expected XML/HTML outputs

Key Concepts

| Term | Definition |
|---|---|
| USDM (Unified Study Definitions Model) | A TransCelerate / CDISC standard that represents a complete clinical trial protocol as structured, machine-readable JSON. It is the primary input to the DDE. |
| CDISC 360i | A CDISC initiative to make the full clinical trial lifecycle, from protocol to submission, machine-readable and interoperable. |
| DDS (Data Definition Specification) | The central intermediate JSON model in the DDE pipeline. It aggregates metadata from all sources and acts as the single input for all generators. |
| Define-XML | An XML file submitted alongside clinical trial datasets that describes their structure, variables, permitted values, and controlled terminology. Required by the FDA. Define-XML v2.1 is an extension of ODM v1.3.2. |
| Biomedical Concepts (BCs) | Standardized, reusable definitions of clinical observations (e.g., "Heart Rate") maintained in the CDISC Library. |
| Dataset Specializations (DSSs) | CDISC Library mappings that describe how a Biomedical Concept is represented in a specific SDTM domain. |
| CDISC Library | CDISC's REST API providing access to controlled terminology, SDTM variables, Biomedical Concepts, and Dataset Specializations. |
| odmlib | A Python library for creating and parsing CDISC ODM and Define-XML documents, used internally by the generator. |
| LinkML | A modeling language used to define and validate the DDS JSON schema (define.yaml). |
| VLM (Variable Level Metadata) | Metadata that applies to specific values within a variable (e.g., rules that only apply when VSTESTCD = "SYSBP"). |
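
The VLM idea, metadata that applies only when a condition on another variable holds, can be sketched as a small rule lookup keyed on the test code variable VSTESTCD. The rule structure, the `vlm_for` helper, and the metadata values below are hypothetical and do not reflect how the DDS model or Define-XML encodes where clauses.

```python
from typing import Any, Dict, List, Optional

# Hypothetical value-level rules for VSORRES, keyed on the test code.
VLM_RULES: List[Dict[str, Any]] = [
    {"where": {"VSTESTCD": "SYSBP"}, "metadata": {"dataType": "integer", "unit": "mmHg"}},
    {"where": {"VSTESTCD": "TEMP"}, "metadata": {"dataType": "float", "unit": "C"}},
]


def vlm_for(record: Dict[str, Any],
            rules: List[Dict[str, Any]] = VLM_RULES) -> Optional[Dict[str, Any]]:
    """Return the metadata of the first rule whose where-conditions all match."""
    for rule in rules:
        if all(record.get(var) == val for var, val in rule["where"].items()):
            return rule["metadata"]
    return None  # no value-level rule applies to this record


print(vlm_for({"VSTESTCD": "SYSBP", "VSORRES": "120"}))
print(vlm_for({"VSTESTCD": "PULSE", "VSORRES": "70"}))
```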

Current Status & Roadmap

The project is in active development, currently completing Phase 2 of the CDISC 360i Program.

Phase 1 (complete):

  • USDM loader (create_define_json.py)
  • SDTM Define-XML generator (define_generator.py)
  • DDS JSON schema (define.yaml)

Phase 2 (in progress):

  • ADaM Define-XML generator
  • ODM CRF generator
  • Dataset-JSON shell generator
  • Incremental loading with metadata provenance tracking
  • Quality and conformance checks

This project is provided "as is" without warranty or guarantee of suitability for any particular purpose. Expect breaking changes as the new ADaM models and additional generators are developed.


Contributing

Contributions are welcome. Please read CONTRIBUTING.md before submitting pull requests. All contributions must follow the Code of Conduct and will fall under the project licenses below.


License

Code and Models

Licensed under the MIT License.

Content (documentation, etc.)

Licensed under CC-BY-4.0.

When re-using content, please cite as:

Content based on Data Definition Engine (GitHub) used under the CC-BY-4.0 license.

