ISARIC Data Schema¶
Beyond providing the question bank that drives BRIDGE CRF generation, ARC defines the ISARIC data schema — the standardized output format used across the ISARIC data ecosystem, including DataHub and other tools. Any dataset converted into this schema can be pooled and analysed alongside other ISARIC studies without additional harmonization work.
The schema follows an entity-attribute-value design: a fixed core table holds the small set of fields expected for every patient, while a flexible long table holds all other observations as (attribute, value) pairs.
All date fields are strings in ISO 8601 format. Four forms are valid: a full datetime (YYYY-MM-DDThh:mm:ss), a full date (YYYY-MM-DD), a year-month (YYYY-MM), or a year only (YYYY). All other fields are strings unless a type is stated otherwise.
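As a rough illustration of the four accepted date forms, the following sketch classifies a string by its precision. The function name `date_precision` is hypothetical and not part of ARC or ADTL. The check is pattern-based only, so it does not reject impossible calendar dates such as 2021-02-30.

```python
import re
from typing import Optional

# Matches YYYY, YYYY-MM, YYYY-MM-DD or YYYY-MM-DDThh:mm:ss (the forms listed above).
_ISO_FORMS = re.compile(
    r"(?P<year>\d{4})"
    r"(?:-(?P<month>0[1-9]|1[0-2])"
    r"(?:-(?P<day>0[1-9]|[12]\d|3[01])"
    r"(?:T(?P<time>(?:[01]\d|2[0-3]):[0-5]\d:[0-5]\d))?"
    r")?)?",
    re.ASCII,
)


def date_precision(value: str) -> Optional[str]:
    """Return 'year', 'month', 'day' or 'datetime'; None if no form matches.

    Hypothetical helper: pattern check only, no calendar validation.
    """
    match = _ISO_FORMS.fullmatch(value)
    if match is None:
        return None
    if match.group("time"):
        return "datetime"
    if match.group("day"):
        return "day"
    if match.group("month"):
        return "month"
    return "year"
```

A caller could use the returned precision to decide how to compare or display partial dates, rather than padding them to a full date.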
Schema overview¶
- Core table (wide format)
One row per patient. Captures fixed, patient-level fields: identifiers, demographics, admission, and outcome. The core schema is considered stable; only additions are permitted, and only for fields expected to be present for most patients. Sparse indicator data (symptoms, comorbidities, etc.) belongs in the long table.
Required fields:
| Field | Type | Description |
|---|---|---|
| subjid | string | Patient Identification Number (PIN). Note that subjid identifies an encounter, not necessarily a unique patient across all encounters. |
| siteid | string | Site that collected the data. |
| dataset_id | string | Dataset identifier. |
| dataset_disease | string | Primary disease/syndrome for the dataset (e.g. "COVID-19"). The same value applies to every patient in a dataset. |
| demog_sex | enum | "Male", "Female", "Other", "Not specified/Unknown" |
| demog_age_days | integer ≥ 0 | Age in days. |
| demog_country_iso3 | string | ISO 3166-1 alpha-3 country code (e.g. "GBR"). |
| pres_adm | enum | "Yes", "No", "Unknown" |
| pres_date | date | Most recent presentation/admission date at this facility. |
| outco_outcome | enum | One of: "Discharged alive", "Still hospitalised", "Transfer to other facility", "Death", "Palliative care", "Discharged against medical advice", "Alive not admitted", "Hospitalized" |
| outco_date | date | Outcome date. |
- Long table (long format)
One row per observation per patient. Covers all ARC variables not included in the core table (symptoms, vital signs, lab results, medications, imaging, and more), using ARC variable names as the attribute field.
Required fields:
| Field | Description |
|---|---|
| subjid | Patient PIN (links back to core). |
| dataset_id | Dataset identifier. |
| phase | Healthcare encounter phase when the event occurred. One of "presentation", "pre_observation", "during_observation", "follow_up", "outcome". |
| attribute | ARC variable name for the observation (e.g. "adsym_fever", "vital_rr"). Where an attribute with the same or substantially similar semantics exists in ARC, that name must be used. |
| attribute_status | Data collection status. "VAL": value collected and present in value/value_num; "UNK": unknown; "NI": no information; "NASK": not asked; "NA": not applicable. |
Optional fields:
| Field | Description |
|---|---|
| value | String/categorical value. Y/N/NK attributes should be stored here as strings ("Yes"/"No"/"Unknown") to allow future extension with additional codes. |
| value_num | Numeric value (float). Used for measurements such as temperature, blood pressure, or heart rate. |
| date | Date of the observation. |
| duration | Duration of the event in days (integer). |
| attribute_unit | Unit of the recorded value. Omit if the attribute has no unit. |
| arcver | ARC version that the attribute belongs to. Omit if the attribute is not present in any ARC version. |
| event_id | ID linking attributes that belong to a single event (e.g. the name, dosage, and route of a single medication administration). |
| reldate_adm | Relative day since admission (integer). |
Each row must have either value or value_num (not both) populated.
At the analysis stage, the subset of long-table rows with attributes that appear only once per patient can be pivoted into wide format and merged with the core table for easier access.
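The pivot-and-merge step described above can be sketched in plain Python. This is an illustration under stated assumptions, not project code: the function names are hypothetical, rows are represented as dictionaries, the join key is assumed to be the pair (dataset_id, subjid), and the example rows omit most fields for brevity.

```python
from collections import Counter


def pivot_single_occurrence(long_rows):
    """Pivot attributes that never repeat within a patient into wide form.

    Returns {(dataset_id, subjid): {attribute: value}}. Hypothetical helper.
    """
    counts = Counter(
        (r["dataset_id"], r["subjid"], r["attribute"]) for r in long_rows
    )
    # An attribute is skipped if any patient has more than one row for it.
    repeated = {attr for (_, _, attr), n in counts.items() if n > 1}
    wide = {}
    for r in long_rows:
        if r["attribute"] in repeated:
            continue
        # A row carries either value_num or value, so prefer whichever is set.
        val = r["value_num"] if r.get("value_num") is not None else r.get("value")
        wide.setdefault((r["dataset_id"], r["subjid"]), {})[r["attribute"]] = val
    return wide


def merge_with_core(core_rows, wide):
    """Left-join the pivoted columns onto the core rows."""
    return [
        {**c, **wide.get((c["dataset_id"], c["subjid"]), {})} for c in core_rows
    ]


# Illustrative rows only; real tables carry all required fields.
core = [{"subjid": "P001", "dataset_id": "ds1", "demog_sex": "Female"}]
long_rows = [
    {"subjid": "P001", "dataset_id": "ds1", "attribute": "adsym_fever", "value": "Yes"},
    {"subjid": "P001", "dataset_id": "ds1", "attribute": "vital_rr", "value_num": 22.0},
    {"subjid": "P001", "dataset_id": "ds1", "attribute": "vital_rr", "value_num": 18.0},
]
merged = merge_with_core(core, pivot_single_occurrence(long_rows))
```

Here vital_rr is recorded twice for the same patient, so it stays in the long table, while adsym_fever becomes a column on the merged row.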
The phases in the long table correspond to the data capture schema that ARC is structured around.
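To make the long-table constraints concrete, here is a minimal sketch of a row check covering the required fields, the allowed phase and attribute_status values, and the value/value_num rule. The function `check_long_row` is hypothetical and is not a substitute for the JSON schema. It applies the value/value_num rule literally to every row; whether rows with a status other than "VAL" should be exempt is not stated here, so that part may need adjusting.

```python
PHASES = {"presentation", "pre_observation", "during_observation", "follow_up", "outcome"}
STATUSES = {"VAL", "UNK", "NI", "NASK", "NA"}
REQUIRED = ("subjid", "dataset_id", "phase", "attribute", "attribute_status")


def check_long_row(row):
    """Return a list of problems found in one long-table row (empty if none)."""
    problems = []
    for field in REQUIRED:
        if row.get(field) in (None, ""):
            problems.append(f"missing required field: {field}")
    phase = row.get("phase")
    if phase not in (None, "") and phase not in PHASES:
        problems.append(f"unknown phase: {phase!r}")
    status = row.get("attribute_status")
    if status not in (None, "") and status not in STATUSES:
        problems.append(f"unknown attribute_status: {status!r}")
    has_value = row.get("value") not in (None, "")
    has_num = row.get("value_num") is not None
    # Both set or both empty violates "either value or value_num (not both)".
    if has_value == has_num:
        problems.append("exactly one of value / value_num must be populated")
    return problems


good = {
    "subjid": "P001",
    "dataset_id": "ds1",
    "phase": "presentation",
    "attribute": "adsym_fever",
    "attribute_status": "VAL",
    "value": "Yes",
}
bad = {**good, "phase": "admission", "value_num": 1.0}
```

Calling `check_long_row(good)` yields an empty list, while `check_long_row(bad)` reports the unrecognised phase and the doubly populated value.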
Schema files¶
The JSON schema files that formally define and validate these two tables live in the schemas/ directory:
- schemas/isaric-core.json: validates the core (wide) table.
- schemas/arc_{version}_isaric_long.schema.json: validates the long (narrow) table. This file is auto-generated from the current ARC variable list by schemas/isaric_schema.py each time a new ARC version is released.
Converting data to the ISARIC schema¶
The tool used to transform source datasets into the ISARIC schema is ADTL (Another Data Transformation Language). ADTL reads a TOML parser file which describes how each field in the source data maps to the ISARIC schema, then writes the two output tables.
There are two paths for generating a parser, depending on how the source data was collected:
Data collected via BRIDGE / REDCap¶
If data was collected using a CRF built with BRIDGE,
the REDCap export already uses ARC variable names. A parser for this data can
be auto-generated from the ARC file using schemas/draft_parser.py:
python schemas/draft_parser.py
This produces a file schemas/global_arc_{version}_parser.toml that covers
all ARC variables and handles the REDCap checkbox/radio/list field encoding
conventions. For a study that uses a defined preset, pass the preset name:
python schemas/draft_parser.py --preset "preset_ARChetype Disease CRF_Covid"
The generated file will contain TODO: FILL THIS IN markers for
dataset-specific fields (such as dataset_id and dataset_disease)
that cannot be inferred automatically. Every marker must be replaced with a real value before the parser can be used.
Once edited, the parser can be used to convert the REDCap export into the ISARIC schema:
adtl parse <your parser file> <your-data-file.csv> --include-transform schemas/isaric-transformations.py
The --include-transform option is required: it supplies the ISARIC-specific
transformations that the auto-generated parser relies on.
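Before running the conversion, it can help to confirm that no placeholder markers remain in the edited parser. The sketch below scans text for the TODO: FILL THIS IN marker mentioned above. The helper name and the sample text are made up for illustration, and the sample does not claim to reflect the real layout of a generated parser file.

```python
MARKER = "TODO: FILL THIS IN"


def unfilled_markers(parser_text):
    """Return (line_number, stripped_line) pairs that still contain the marker."""
    return [
        (number, line.strip())
        for number, line in enumerate(parser_text.splitlines(), start=1)
        if MARKER in line
    ]


# Stand-in text for demonstration; in practice, read your parser file, e.g.
# pathlib.Path("<your parser file>").read_text(encoding="utf-8").
sample = 'a = 1\nb = "TODO: FILL THIS IN"\nc = 2\n'
remaining = unfilled_markers(sample)
```

An empty result means every marker has been replaced; otherwise the line numbers point to the entries that still need editing.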
Data collected using other tools¶
If data was not collected via a BRIDGE CRF, you will need to write a custom parser. The parser is a TOML file that maps your source column names and data formats to the ISARIC schema fields.
A worked example covering a COVID-19 study is available in
docs/examples/. See Writing a Custom Parser for a full walkthrough.
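To give a feel for what a custom mapping has to achieve, the following Python sketch splits one source record into a core row and a set of long rows. It is not ADTL syntax and does not use ADTL. The source column names, the sex coding, and the `convert` function are all hypothetical, and only a few core fields are filled in.

```python
# Hypothetical source columns (left) mapped to ISARIC fields (right).
CORE_MAP = {"patient_id": "subjid", "hospital": "siteid", "country": "demog_country_iso3"}
LONG_MAP = {"fever": "adsym_fever", "resp_rate": "vital_rr"}
SEX_CODES = {"M": "Male", "F": "Female"}  # assumed coding in the source data


def convert(source_row, dataset_id, phase="presentation"):
    """Split one source record into (core_row, long_rows)."""
    core = {"dataset_id": dataset_id}
    for src, dst in CORE_MAP.items():
        core[dst] = source_row[src]
    core["demog_sex"] = SEX_CODES.get(source_row.get("sex"), "Not specified/Unknown")

    long_rows = []
    for src, attribute in LONG_MAP.items():
        raw = source_row.get(src)
        if raw in (None, ""):
            continue  # nothing recorded for this attribute
        row = {
            "subjid": core["subjid"],
            "dataset_id": dataset_id,
            "phase": phase,
            "attribute": attribute,
            "attribute_status": "VAL",
        }
        # Numbers go to value_num, everything else to value (never both).
        if isinstance(raw, (int, float)):
            row["value_num"] = float(raw)
        else:
            row["value"] = raw
        long_rows.append(row)
    return core, long_rows


source = {"patient_id": "P001", "hospital": "S01", "country": "GBR",
          "sex": "F", "fever": "Yes", "resp_rate": 22}
core_row, long_rows = convert(source, "ds1")
```

In a real conversion the same decisions (which columns are core fields, which become long-table attributes, and how source codes translate to schema values) are expressed in the TOML parser file instead.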
Further reading¶
ADTL documentation — full reference for the parser format and CLI.
Writing a Custom Parser — step-by-step tutorial for writing a custom parser.
Data Capture Schema — the clinical schema that defines the observation phases.