ISARIC Data Schema

Beyond providing the question bank that drives BRIDGE CRF generation, ARC defines the ISARIC data schema — the standardized output format used across the ISARIC data ecosystem, including DataHub and other tools. Any dataset converted into this schema can be pooled and analysed alongside other ISARIC studies without additional harmonization work.

The schema follows an entity-attribute-value design: a fixed core table holds the small set of fields expected for every patient, while a flexible long table holds all other observations as (attribute, value) pairs.

All date fields are strings in ISO 8601 format. Any of the following precisions is valid: full datetime (YYYY-MM-DDThh:mm:ss), full date (YYYY-MM-DD), year-month (YYYY-MM), or year only (YYYY). All other fields are strings unless a type is stated.
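As a minimal sketch of the date rule above, the following standard-library check accepts the four listed precisions and rejects anything else. The function name `is_valid_isaric_date` is hypothetical and not part of ARC or ADTL; the JSON schema files remain the authoritative validators.

```python
import re
from datetime import datetime

# The four accepted precisions, from most to least specific.
_FORMATS = ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d", "%Y-%m", "%Y")

# strptime tolerates unpadded fields such as "2020-3-5",
# so the zero-padded shape is enforced first.
_SHAPE = re.compile(r"\d{4}(-\d{2}(-\d{2}(T\d{2}:\d{2}:\d{2})?)?)?")


def is_valid_isaric_date(text: str) -> bool:
    """Return True if text is a datetime, date, year-month, or year string."""
    if not _SHAPE.fullmatch(text):
        return False
    for fmt in _FORMATS:
        try:
            datetime.strptime(text, fmt)
            return True
        except ValueError:
            continue
    return False
```

The regular expression handles the layout, while `datetime.strptime` rejects impossible calendar values such as month 13 or 30 February.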

Schema overview

Core table (wide format)

One row per patient. Captures fixed, patient-level fields: identifiers, demographics, admission, and outcome. The core schema is considered stable; only additions are permitted, and only for fields expected to be present for most patients. Sparse indicator data (symptoms, comorbidities, etc.) belongs in the long table.

Required fields:

Field	Type	Description
`subjid`	string	Patient Identification Number (PIN). Note that `subjid` identifies an encounter, not necessarily a unique patient across all encounters.
`siteid`	string	Site that collected the data.
`dataset_id`	string	Dataset identifier.
`dataset_disease`	string	Primary disease/syndrome for the dataset (e.g. `"COVID-19"`). The same value applies to every patient in a dataset.
`demog_sex`	enum	`"Male"`, `"Female"`, `"Other"`, `"Not specified/Unknown"`
`demog_age_days`	integer ≥ 0	Age in days.
`demog_country_iso3`	string	ISO 3166-1 alpha-3 country code (e.g. `"GBR"`).
`pres_adm`	enum	`"Yes"`, `"No"`, `"Unknown"`
`pres_date`	date	Most recent presentation/admission date at this facility.
`outco_outcome`	enum	One of: `"Discharged alive"`, `"Still hospitalised"`, `"Transfer to other facility"`, `"Death"`, `"Palliative care"`, `"Discharged against medical advice"`, `"Alive not admitted"`, `"Hospitalized"`
`outco_date`	date	Outcome date.
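To make the table concrete, here is a sketch of one core row as a Python dictionary, with a small check of the required fields and enum values listed above. The helper `check_core_row` and all values in `example_row` are illustrative assumptions; the check is not a substitute for validating against the JSON schema.

```python
REQUIRED_CORE_FIELDS = (
    "subjid", "siteid", "dataset_id", "dataset_disease", "demog_sex",
    "demog_age_days", "demog_country_iso3", "pres_adm", "pres_date",
    "outco_outcome", "outco_date",
)

# Allowed values copied from the required-fields table.
CORE_ENUMS = {
    "demog_sex": {"Male", "Female", "Other", "Not specified/Unknown"},
    "pres_adm": {"Yes", "No", "Unknown"},
    "outco_outcome": {
        "Discharged alive", "Still hospitalised", "Transfer to other facility",
        "Death", "Palliative care", "Discharged against medical advice",
        "Alive not admitted", "Hospitalized",
    },
}


def check_core_row(row: dict) -> list[str]:
    """Return a list of problems found in one core row (empty if none)."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_CORE_FIELDS
        if name not in row
    ]
    for name, allowed in CORE_ENUMS.items():
        if name in row and row[name] not in allowed:
            problems.append(f"{name} has unexpected value: {row[name]!r}")
    age = row.get("demog_age_days")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if age is not None and (isinstance(age, bool) or not isinstance(age, int) or age < 0):
        problems.append("demog_age_days must be an integer >= 0")
    return problems


# Illustrative values only; identifiers here are made up.
example_row = {
    "subjid": "SITE01-0001",
    "siteid": "SITE01",
    "dataset_id": "example-dataset",
    "dataset_disease": "COVID-19",
    "demog_sex": "Female",
    "demog_age_days": 12410,
    "demog_country_iso3": "GBR",
    "pres_adm": "Yes",
    "pres_date": "2020-03-15",
    "outco_outcome": "Discharged alive",
    "outco_date": "2020-03-24",
}
```

Calling `check_core_row(example_row)` returns an empty list; removing a field or using a value outside an enum adds a message describing the problem.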

Long table (long format)

One row per observation per patient. Covers all ARC variables not included in the core table — symptoms, vital signs, lab results, medications, imaging, and more — using ARC variable names as the attribute field.

Required fields:

Field	Description
`subjid`	Patient PIN (links back to core).
`dataset_id`	Dataset identifier.
`phase`	Healthcare encounter phase when the event occurred. One of `"presentation"`, `"pre_observation"`, `"during_observation"`, `"follow_up"`, `"outcome"`.
`attribute`	ARC variable name for the observation (e.g. `"adsym_fever"`, `"vital_rr"`). Where an attribute with the same or substantially similar semantics exists in ARC, that name must be used.
`attribute_status`	Data collection status. `"VAL"` — value collected and present in `value`/`value_num`; `"UNK"` — unknown; `"NI"` — no information; `"NASK"` — not asked; `"NA"` — not applicable.

Optional fields:

Field	Description
`value`	String/categorical value. Y/N/NK attributes should be stored here as strings (`"Yes"`/`"No"`/`"Unknown"`) to allow future extension with additional codes.
`value_num`	Numeric value (float). Used for measurements such as temperature, blood pressure, or heart rate.
`date`	Date of the observation.
`duration`	Duration of the event in days (integer).
`attribute_unit`	Unit of the recorded value. Omit if the attribute has no unit.
`arcver`	ARC version that the attribute belongs to. Omit if the attribute is not present in any ARC version.
`event_id`	ID linking attributes that belong to a single event (e.g. the name, dosage, and route of a single medication administration).
`reldate_adm`	Relative day since admission (integer).

Each row must have exactly one of `value` or `value_num` populated, never both.
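The row-level rules for the long table can be sketched as a small check. `check_long_row` is a hypothetical helper, and the example rows use made-up identifiers. The sketch applies the value rule literally to every row; how that rule interacts with a status other than `"VAL"` is not covered here, so treat the generated JSON schema as authoritative.

```python
PHASES = {"presentation", "pre_observation", "during_observation", "follow_up", "outcome"}
STATUSES = {"VAL", "UNK", "NI", "NASK", "NA"}
REQUIRED_LONG_FIELDS = ("subjid", "dataset_id", "phase", "attribute", "attribute_status")


def _populated(row: dict, key: str) -> bool:
    """A field counts as populated if it is present and not None or empty."""
    return row.get(key) not in (None, "")


def check_long_row(row: dict) -> list[str]:
    """Return a list of problems found in one long-table row (empty if none)."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_LONG_FIELDS
        if not _populated(row, name)
    ]
    if _populated(row, "phase") and row["phase"] not in PHASES:
        problems.append(f"unknown phase: {row['phase']!r}")
    if _populated(row, "attribute_status") and row["attribute_status"] not in STATUSES:
        problems.append(f"unknown attribute_status: {row['attribute_status']!r}")
    # Exactly one of the two value columns may be populated.
    if _populated(row, "value") == _populated(row, "value_num"):
        problems.append("exactly one of value / value_num must be populated")
    return problems


# A categorical observation goes in `value`, a measurement in `value_num`.
fever_row = {
    "subjid": "SITE01-0001", "dataset_id": "example-dataset",
    "phase": "presentation", "attribute": "adsym_fever",
    "attribute_status": "VAL", "value": "Yes",
}
resp_rate_row = {
    "subjid": "SITE01-0001", "dataset_id": "example-dataset",
    "phase": "during_observation", "attribute": "vital_rr",
    "attribute_status": "VAL", "value_num": 22.0,
}
```

Both example rows pass the check. A row carrying both `value` and `value_num`, or neither, is reported.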

At the analysis stage, the subset of long-table rows with attributes that appear only once per patient can be pivoted into wide format and merged with the core table for easier access.
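The pivot described above can be sketched in plain Python. This is one possible reading, not the project's own implementation: an attribute is pivoted only if no patient has it more than once, and all rows are assumed to come from a single dataset so that `subjid` alone can serve as the key. The function name `pivot_single_occurrence` is hypothetical.

```python
from collections import Counter


def pivot_single_occurrence(core_rows: list[dict], long_rows: list[dict]) -> list[dict]:
    """Merge once-per-patient long-table attributes into copies of the core rows."""
    counts = Counter((r["subjid"], r["attribute"]) for r in long_rows)
    # Attributes recorded more than once for any patient stay in long format.
    repeated = {attribute for (_, attribute), n in counts.items() if n > 1}

    wide = {row["subjid"]: dict(row) for row in core_rows}
    for r in long_rows:
        if r["attribute"] in repeated or r["subjid"] not in wide:
            continue
        # Prefer the numeric value when present, otherwise the string value.
        value = r["value_num"] if r.get("value_num") is not None else r.get("value")
        wide[r["subjid"]][r["attribute"]] = value
    return list(wide.values())


# Illustrative data: fever is recorded once per patient, respiratory rate twice for P1.
core = [{"subjid": "P1", "demog_sex": "Female"}, {"subjid": "P2", "demog_sex": "Male"}]
long = [
    {"subjid": "P1", "attribute": "adsym_fever", "value": "Yes"},
    {"subjid": "P2", "attribute": "adsym_fever", "value": "No"},
    {"subjid": "P1", "attribute": "vital_rr", "value_num": 22.0},
    {"subjid": "P1", "attribute": "vital_rr", "value_num": 18.0},
]
merged = pivot_single_occurrence(core, long)
```

Here `merged` gains an `adsym_fever` column for both patients, while the repeated `vital_rr` measurements are left out of the wide table and remain available in long format.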

The phases in the long table correspond to the data capture schema that ARC is structured around.

Schema files

The JSON schema files that formally define and validate these two tables live in the schemas/ directory:

schemas/isaric-core.json — validates the core (wide) table.
schemas/arc_{version}_isaric_long.schema.json — validates the long (narrow) table. This file is auto-generated from the current ARC variable list by schemas/isaric_schema.py each time a new ARC version is released.

Converting data to the ISARIC schema

The tool used to transform source datasets into the ISARIC schema is ADTL (Another Data Transformation Language). ADTL reads a TOML parser file that describes how each field in the source data maps to the ISARIC schema, then writes the two output tables.

There are two paths for generating a parser, depending on how the source data was collected:

Data collected via BRIDGE / REDCap

If data was collected using a CRF built with BRIDGE, the REDCap export already uses ARC variable names. A parser for this data can be auto-generated from the ARC file using schemas/draft_parser.py:

python schemas/draft_parser.py

This produces a file schemas/global_arc_{version}_parser.toml that covers all ARC variables and handles the REDCap checkbox/radio/list field encoding conventions. For a study that uses a defined preset, pass the preset name:

python schemas/draft_parser.py --preset "preset_ARChetype Disease CRF_Covid"

The generated file contains `TODO: FILL THIS IN` markers for dataset-specific fields (such as `dataset_id` and `dataset_disease`) that cannot be inferred automatically. These must be filled in before the parser can be used.
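Before running a conversion, it can help to confirm that no markers remain. The sketch below scans parser text for the marker string and reports the line numbers where it still appears. `find_unfilled_markers` is a hypothetical helper, and the sample text is made up for illustration; it does not reflect the layout of the real generated file.

```python
MARKER = "TODO: FILL THIS IN"


def find_unfilled_markers(text: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) for every line still holding the marker."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if MARKER in line
    ]


# Made-up sample content, used only to exercise the function.
sample = "\n".join([
    'some_setting = "already set"',
    'dataset_id = "TODO: FILL THIS IN"',
    'dataset_disease = "TODO: FILL THIS IN"',
])
remaining = find_unfilled_markers(sample)
```

For a real parser, read the file contents with `pathlib.Path(...).read_text()` and pass them to the function; an empty result means every marker has been replaced.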

Once edited, the parser can be used to convert the REDCap export into the ISARIC schema:

adtl parse <your parser file> <your-data-file.csv> --include-transform schemas/isaric-transformations.py

The `--include-transform` option is required: it supplies the ISARIC-specific transformations that the auto-generated parser relies on.

Data collected using other tools

If data was not collected via a BRIDGE CRF, you will need to write a custom parser. The parser is a TOML file that maps your source column names and data formats to the ISARIC schema fields.

A worked example covering a COVID-19 study is available in docs/examples/. See Writing a Custom Parser for a full walkthrough.