Writing a Custom Parser

This tutorial walks through converting a clinical dataset into the two ISARIC output tables described in ISARIC Data Schema. The conversion tool is ADTL (Another Data Transformation Language), which reads a TOML parser file that you will write to describe how your source columns map to the schema.

Install ADTL and read through its introductory documentation before continuing:

pip install adtl

The example files used throughout this tutorial live in docs/examples/:

File                  Description
example_data.csv      Synthetic COVID-19 source dataset (5 patients)
example_parser.toml   Completed parser: the end result of this tutorial
covid-study-core.csv  Expected core table output
covid-study-long.csv  Expected long table output

What you are building

Running ADTL with the example_parser.toml file and synthetic data produces two CSV files. The core table has one row per patient, with fixed demographic and outcome columns:

subjid  dataset_id    demog_sex  demog_age_days  demog_country_iso3  outco_outcome       outco_date
C001    COVID-STUDY   Male       20088           GBR                 Discharged alive    2023-01-17
C002    COVID-STUDY   Female     26298           DEU                 Death               2023-01-28
...

The long table has one row per observation per patient. Instead of one column per variable, it uses a single attribute column to name the observation, and value or value_num to hold its value:

subjid  attribute           value  value_num  phase         date        attribute_status
C001    adsym_fever         Yes               presentation  2023-01-10  VAL
C001    vital_highesttem_c         38.1       presentation  2023-01-10  VAL
C001    comor_hypertensi    Yes               presentation  2023-01-10  VAL
...

Every row in the long table also carries an attribute_status: VAL - a value was recorded; UNK - unknown; NI - no information; NASK - not asked; NA - not applicable. This matters for pooled analyses because a missing row could mean “not asked” or “asked but unknown” — the status distinguishes between them.

The parser file is what tells ADTL how to turn your source columns into this structure. The rest of this tutorial builds it up step by step.

Depending on how closely your dataset resembles one generated using a BRIDGE CRF and REDCap, you may wish to start from the auto-generated parser produced by the draft_parser.py script in the schemas/ directory and edit that file rather than writing one from scratch. Running adtl check with the auto-generated parser and your source data shows which fields are missing. It does not verify the mappings themselves, so check the output data carefully to make sure it has been transformed correctly.

Step 1: Find where your data goes

Before writing any mapping rules, work out where each of your source columns belongs in the ISARIC schema. There are two questions to answer for each field:

Core or long?

The core table is for fields that apply once per patient: identifiers, demographics, admission details, and the final outcome. It is deliberately short, so matching the fields in your dataset to the core table variables should not take long; the variables are listed in ISARIC Data Schema.

Everything else (symptoms, vital signs, lab results, treatments, complications) goes in the long table, with the variable name specified in the attribute column.

What is the ISARIC attribute name?

The long table uses ARC variable names in the attribute column. To find the right name for your field, search ARC.csv for a matching concept. For example, if your dataset has a column called comorbid_hypertension, search for “hypertension”.
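One way to run such a search is a few lines of Python. This is only a sketch: the helper name search_rows is made up for this tutorial, and the two-column sample is an invented excerpt used to keep the example self-contained, not the real layout of ARC.csv.

```python
import csv


def search_rows(lines, term):
    """Return every CSV row in which any cell contains `term` (case-insensitive)."""
    needle = term.lower()
    return [
        row
        for row in csv.reader(lines)
        if any(needle in cell.lower() for cell in row)
    ]


# Invented excerpt for illustration only; the real ARC.csv has its own columns.
sample = [
    "Variable,Question",
    "comor_hypertensi,Hypertension",
    "adsym_fever,History of fever",
]

matches = search_rows(sample, "hypertension")
for row in matches:
    print(row)
```

To search the real file, pass an open file handle instead of the sample list, for example open("ARC.csv", newline="", encoding="utf-8").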

This returns comor_hypertensi — the ARC variable name to use as the attribute value in your parser.

The full ARC variable list, with descriptions and answer options, is in ARC.csv at the root of this repository.

Note

Sometimes there is no single ARC attribute that matches your source field exactly. A source column called comps_bacterial_pneumonia (a yes/no field) does not map to a single ARC attribute — instead, it maps to compl_pneum (was pneumonia present?) and separately to compl_pneum_type (type of pneumonia). The complications section below shows how to handle this.

If there is no good match for your field, you should contact the ISARIC team about how best to proceed.

The source data

example_data.csv represents a COVID-19 hospital study with five patients. A selection of columns is shown below:

usubjid,studyid,siteid_final,country_iso,slider_sex,age,date_admit,date_outcome,outcome,...
C001,COVID-STUDY,SITE-GBR-01,GBR,Male,55,2023-01-10,2023-01-17,discharge,...
C002,COVID-STUDY,SITE-DEU-01,DEU,Female,72,2023-01-11,2023-01-28,death,...
C003,COVID-STUDY,SITE-USA-01,USA,Male,38,2023-01-12,2023-01-19,discharge,...
C004,COVID-STUDY,SITE-GBR-02,GBR,Female,61,2023-01-13,NA,ongoing care,...
C005,COVID-STUDY,SITE-ESP-01,ESP,Male,48,2023-01-14,2023-01-21,transferred,...

Missing values are represented as NA throughout. Boolean fields use TRUE / FALSE.

Comparing the source columns to the ISARIC schema reveals several things that need handling:

Source                                              ISARIC                        How to handle
NA for missing values                               Omit the field / row          emptyFields = "NA"
TRUE / FALSE                                        "Yes" / "No"                  Reusable value mapping
age in years                                        demog_age_days in days        Unit conversion
outcome as free text                                Fixed set of allowed strings  Value mapping + ignoreMissingKey
Treatments in both treat_* and icu_treat_* columns  Single attribute row          combinedType = "firstNonNull"

Step 2: Set up the parser file

Create a new file (e.g. my-study-parser.toml) and start with the metadata block. The name value determines the output filenames:

[adtl]
  name        = "covid-study"
  description = "Example COVID-19 study parser"
  emptyFields = "NA"

emptyFields tells ADTL which string in your source data represents a missing value. Any field containing that string will be treated as absent — no output row will be produced. If your data uses blank cells instead of a placeholder, omit this line.

Next, declare the two output tables. These lines tell ADTL what kind of table each is and where to find the schema file it should validate against:

[adtl.tables.core]
  kind        = "groupBy"
  groupBy     = "subjid"
  aggregation = "lastNotNull"
  schema      = "../../schemas/isaric-core.json"

[adtl.tables.long]
  kind          = "oneToMany"
  schema        = "../../schemas/arc_1.2.2_isaric_long.schema.json"
  discriminator = "attribute"
  common = { subjid = { field = "usubjid" }, dataset_id = { field = "studyid" }, arcver = "1.2.2" }

kind = "groupBy" collapses any duplicate source rows for the same patient into one output row, keeping the last non-null value for each field. kind = "oneToMany" expands each source row into multiple output rows — one per [[long]] block that produces a non-null value.

The common setting lists fields that should appear on every long table row. Putting subjid and dataset_id here means you do not have to repeat them in every observation block.

Replace 1.2.2 with the ARC version you are targeting. The schema paths are relative to the parser file.
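To make the groupBy behaviour described above concrete, the following Python sketch mimics a "lastNotNull" aggregation on plain dictionaries. It is an illustration of the idea, not ADTL's implementation; the function name and the duplicated C001 rows are hypothetical.

```python
def group_last_not_null(rows, key):
    """Collapse rows that share `key`, keeping the last non-null value per field."""
    merged = {}
    for row in rows:
        target = merged.setdefault(row[key], {})
        for field, value in row.items():
            if value is not None:  # nulls never overwrite an earlier value
                target[field] = value
    return list(merged.values())


# Hypothetical case: the same patient appears on two source rows.
rows = [
    {"subjid": "C001", "demog_sex": "Male", "outco_date": None},
    {"subjid": "C001", "demog_sex": None, "outco_date": "2023-01-17"},
]

core = group_last_not_null(rows, "subjid")
print(core)
```

Both source rows contribute to the single output row: demog_sex comes from the first and outco_date from the second.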

Step 3: Map the core table

The [core] section maps your source columns to the core table fields. The simplest case is a direct column-to-field mapping:

[core]
  subjid             = { field = "usubjid" }
  siteid             = { field = "siteid_final" }
  dataset_id         = { field = "studyid" }
  dataset_disease    = "COVID-19"
  demog_country_iso3 = { field = "country_iso" }
  pres_adm           = "Unknown"
  pres_date          = { field = "date_admit" }
  outco_date         = { field = "date_outcome" }

subjid = { field = "usubjid" } means: take the value from the source column named usubjid (referred to as the field) and write it to the subjid column in the core table.

Values without a field key are written as-is for every patient. dataset_disease = "COVID-19" and pres_adm = "Unknown" are examples: the disease is the same for all patients in this dataset, and admission status was not explicitly collected.

When the values need translating

The core schema requires demog_sex to be "Male" or "Female", exactly. The source data happens to use the same strings — but an explicit values mapping is still good practice because it documents the intent, rejects unexpected values like "M" or "F", and makes it easy to add "Other" later if needed:

[core.demog_sex]
  field  = "slider_sex"
  values = { Male = "Male", Female = "Female" }

When the units are different

The schema requires demog_age_days as an integer number of days, but the source records age in years. ADTL converts units automatically when you specify source_unit (the unit your data was collected in) and unit (the target unit that the ISARIC schema requires):

[core.demog_age_days]
  field       = "age"
  unit        = "days"
  source_unit = "years"

Note

On TOML syntax

The above TOML code-block is equivalent to

[core]
  demog_age_days = { field = "age", unit = "days", source_unit = "years" }

Which format you choose is largely dependent on personal preference and readability. If, like in the example below, there are many sub-keys for a single field, the sub-table format is often easier to read as it doesn’t disappear off the edge of the screen.

If using an IDE such as VSCode to edit your parser, there are auto-formatters available such as Even Better TOML which will automatically format your parser file for you and highlight any syntax errors.
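Returning to the years-to-days conversion in [core.demog_age_days]: the expected core table shown at the start of this tutorial (55 years becomes 20088 days, 72 years becomes 26298 days) is consistent with a year of 365.25 days and truncation to an integer. The sketch below reproduces that arithmetic. The constant and the truncation rule are inferred from those two rows only, so treat them as assumptions rather than a description of ADTL's internals.

```python
DAYS_PER_YEAR = 365.25  # assumption inferred from the example output


def years_to_days(age_years):
    """Convert an age in years to whole days, truncating any fraction."""
    return int(age_years * DAYS_PER_YEAR)


print(years_to_days(55))  # C001 in the example data
print(years_to_days(72))  # C002 in the example data
```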

When the outcome is recorded as free text

Clinical outcome can be recorded in many ways across different sites — "discharge", "released", "cured (confirmed by a negative covid test)" — but the ISARIC schema only accepts a fixed set of strings. A values map converts each source string to the correct schema value.

By default, if a source value is not found in the values map, ADTL silently ignores it. Setting ignoreMissingKey = true changes this: unmapped values pass through unchanged, and ADTL will flag them at validation time if they are not valid schema values. This is useful when you cannot know in advance every possible free-text string a site might enter:

[core.outco_outcome]
  field            = "outcome"
  ignoreMissingKey = true
  [core.outco_outcome.values]
    discharge                                    = "Discharged alive"
    released                                     = "Discharged alive"
    "released with home care"                    = "Discharged alive"
    "cured (confirmed by a negative covid test)" = "Discharged alive"
    "recovery (confirmed by a negative test)"    = "Discharged alive"
    "ongoing care"                               = "Still hospitalised"
    transferred                                  = "Transfer to other facility"
    "moved to facility"                          = "Transfer to other facility"
    death                                        = "Death"

Multiple source strings can map to the same schema value. The sub-table syntax ([core.outco_outcome.values]) is used here instead of inline braces because the mapping is too long to fit on one line.
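The lookup rules described above can be pictured with a small Python sketch. It only models the behaviour as this tutorial describes it (unmapped values are dropped by default and passed through when ignoreMissingKey is set); the function name map_value and the string "left against advice" are hypothetical.

```python
OUTCOMES = {
    "discharge": "Discharged alive",
    "ongoing care": "Still hospitalised",
    "transferred": "Transfer to other facility",
    "death": "Death",
}


def map_value(raw, values, ignore_missing_key=False):
    """Translate `raw` through `values`; decide what happens to unmapped input."""
    if raw in values:
        return values[raw]
    # Unmapped: pass through unchanged if requested, otherwise drop it.
    return raw if ignore_missing_key else None


print(map_value("death", OUTCOMES))
print(map_value("left against advice", OUTCOMES))
print(map_value("left against advice", OUTCOMES, ignore_missing_key=True))
```

With pass-through enabled, the unmapped string reaches the output and fails schema validation, which is what makes it visible for review.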

Step 4: Map the long table

Each observation type gets its own [[long]] block. The minimum each block needs is an attribute name, a value source, a phase and an attribute_status.

Reusing phase and date across many blocks

Most observations belong to one of two healthcare encounter phases in this dataset: presentation (at admission) or outcome (at discharge). Rather than writing the phase and date on every single block, define them once as reusable references:

[adtl.defs.phase_presentation]
  phase = "presentation"
  date  = { field = "date_admit" }

[adtl.defs.phase_outcome]
  phase = "outcome"
  date  = { field = "date_outcome" }

Any [[long]] block can then include e.g. ref = "phase_presentation" to inherit both phase and date from the definition.

String and boolean observations (symptoms, comorbidities)

In this example dataset, boolean fields — where the source value is TRUE or FALSE — are common. These need two things: a mapping from TRUE/FALSE to "Yes"/"No", and an attribute_status to record whether the data was actually collected.

Define the value mapping once as a reusable def:

[adtl.defs."Y/N/NK"]
  values = { TRUE = "Yes", FALSE = "No" }

Then reference it in each observation block:

[[long]]
  attribute        = "adsym_fever"
  value            = { field = "symptoms_history_of_fever", ref = "Y/N/NK" }
  attribute_status = { field = "symptoms_history_of_fever", apply = { function = "attribute_status_fill" } }
  ref              = "phase_presentation"

[[long]]
  attribute        = "comor_hypertensi"
  value            = { field = "comorbid_hypertension", ref = "Y/N/NK" }
  attribute_status = { field = "comorbid_hypertension", apply = { function = "attribute_status_fill" } }
  ref              = "phase_presentation"

ref = "Y/N/NK" expands the values map inside the value field. ref = "phase_presentation" expands into phase and date at the block level. ADTL applies these substitutions before producing output.

The attribute_status_fill function is defined in schemas/isaric_transformations.py (not built into ADTL itself). It determines the status code from the raw source value:

  • A null value (absent, or matched by emptyFields) → row is suppressed entirely

  • A pre-defined status code (UNK, NI, NASK, NA) → passed through as-is

  • Any other non-null value (including TRUE or FALSE) → VAL

This same pattern — ref = "Y/N/NK" for the value, attribute_status_fill for the status — applies to every boolean field: symptoms, comorbidities, treatments, and complications.
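The three rules listed above for attribute_status_fill can be sketched as follows. This is not the code in schemas/isaric_transformations.py, only a model of the documented behaviour; the name status_fill_sketch is made up, and a null is represented as None on the assumption that emptyFields matching has already happened.

```python
STATUS_CODES = {"UNK", "NI", "NASK", "NA"}


def status_fill_sketch(raw):
    """Derive an attribute_status from a raw source value."""
    if raw is None:
        return None  # null: the row is suppressed entirely
    if raw in STATUS_CODES:
        return raw  # pre-defined status code: passed through as-is
    return "VAL"  # any other non-null value, including TRUE or FALSE


for raw in (None, "UNK", "TRUE", "FALSE"):
    print(repr(raw), "->", status_fill_sketch(raw))
```

Note that with emptyFields = "NA", as in this tutorial's parser, a source value of "NA" is already treated as null before the function sees it.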

Numeric observations (vital signs, lab values)

For numeric measurements, use value_num instead of value, and add attribute_unit to record the unit:

[[long]]
  attribute        = "vital_highesttem_c"
  value_num        = { field = "vs_temp" }
  attribute_unit   = "°C"
  attribute_status = { field = "vs_temp", apply = { function = "attribute_status_fill" } }
  ref              = "phase_presentation"

[[long]]
  attribute        = "labs_crp_mgl"
  attribute_unit   = "mg/L"
  value_num        = { field = "lab_crp" }
  attribute_status = { field = "lab_crp", apply = { function = "attribute_status_fill" } }
  ref              = "phase_outcome"

Here the vital signs use the phase_presentation definition and the lab values use phase_outcome. Your data may differ depending on when the measurements were taken; adjust the ref accordingly.

When the same data is in two source columns

Some studies record treatments separately for general ward and ICU patients. Rather than producing two rows for the same attribute, combinedType = "firstNonNull" merges them: ADTL evaluates the list of fields in order and uses the first non-null result.

For non-ICU patients, the icu_treat_* column is "NA" (null), so the ward column is used. For patient C002 (who was in the ICU), the ward column is FALSE, which has no entry in the values map and therefore evaluates to null, while the ICU column is TRUE, so the ICU value takes effect.

The [long.attribute_status] block mirrors the same field order so the status always reflects the same source column as the selected value:

[[long]]
  attribute = "medi_medtype"
  ref       = "phase_outcome"
  [long.value]
    combinedType = "firstNonNull"
    fields = [
      { field = "treat_corticosteroids",     values = { "TRUE" = "Corticosteroid" } },
      { field = "icu_treat_corticosteroids", values = { "TRUE" = "Corticosteroid" } },
    ]
  [long.attribute_status]
    combinedType = "firstNonNull"
    fields = [
      { field = "treat_corticosteroids",     apply = { function = "attribute_status_fill" } },
      { field = "icu_treat_corticosteroids", apply = { function = "attribute_status_fill" } },
    ]
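The selection logic for the value can be modelled in Python as below. This is a sketch of the behaviour described in this section, not ADTL's own code: the function name first_non_null is hypothetical, and the two rows are simplified stand-ins for a ward patient and an ICU patient.

```python
CORTICOSTEROID_FIELDS = [
    {"field": "treat_corticosteroids", "values": {"TRUE": "Corticosteroid"}},
    {"field": "icu_treat_corticosteroids", "values": {"TRUE": "Corticosteroid"}},
]


def first_non_null(row, fields, empty="NA"):
    """Evaluate each field spec in order and return the first non-null result."""
    for spec in fields:
        raw = row.get(spec["field"])
        if raw is None or raw == empty:
            continue  # missing in the source
        value = spec["values"].get(raw) if "values" in spec else raw
        if value is not None:  # values with no entry in the map count as null
            return value
    return None


ward_patient = {"treat_corticosteroids": "TRUE", "icu_treat_corticosteroids": "NA"}
icu_patient = {"treat_corticosteroids": "FALSE", "icu_treat_corticosteroids": "TRUE"}

print(first_non_null(ward_patient, CORTICOSTEROID_FIELDS))
print(first_non_null(icu_patient, CORTICOSTEROID_FIELDS))
```

Both rows yield "Corticosteroid", but from different source columns.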

When an observation has its own date

The ICU admission block cannot use the presentation or outcome phase refs, because its date (icu_in) is different from both date_admit and date_outcome. Define the phase and date inline instead:

[[long]]
  attribute        = "crito_icu"
  value            = { field = "slider_icu_ever", ref = "Y/N/NK" }
  attribute_status = { field = "slider_icu_ever", apply = { function = "attribute_status_fill" } }
  phase            = "during_observation"
  date             = { field = "icu_in" }
  duration         = { field = "icu_in", apply = { function = "durationDays", params = ["$icu_out"] } }

The duration field records the ICU length of stay in days. durationDays computes the number of days between the value of icu_in and the column named in params ($icu_out — the $ prefix means “look up this column in the same source row”). For patients without an ICU admission, both columns are "NA" (null), so duration is left empty.

ADTL ships with a number of built-in functions similar to durationDays, which can be found in the ADTL documentation.
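As a rough picture of what the duration field in the crito_icu block holds, the sketch below computes the day difference between two ISO dates. The function name, the example dates, and the choice of end minus start without counting the first day are all assumptions made for illustration; the ADTL documentation is the authority on durationDays itself.

```python
from datetime import date


def duration_days(start_iso, end_iso):
    """Return the number of days from start to end, or None if either is missing."""
    if start_iso is None or end_iso is None:
        return None  # no ICU admission: duration stays empty
    return (date.fromisoformat(end_iso) - date.fromisoformat(start_iso)).days


# Hypothetical icu_in / icu_out values.
print(duration_days("2023-01-12", "2023-01-20"))
print(duration_days(None, None))
```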

When one source field maps to multiple attributes

Sometimes a single yes/no source column corresponds to more than one ARC attribute. The source column comps_bacterial_pneumonia is an example: the ARC 1.2.2 schema does not have a single attribute for “bacterial pneumonia as a complication”. Instead, it separates the concept into two attributes: compl_pneum (was pneumonia present?) and compl_pneum_type (what was the etiology?).

Write two [[long]] blocks from the same source column. For compl_pneum_type, only map TRUE — ADTL silently skips rows where the source value has no entry in the values map, so patients where comps_bacterial_pneumonia = FALSE will not get a compl_pneum_type row:

[[long]]
  attribute        = "compl_pneum"
  value            = { field = "comps_bacterial_pneumonia", ref = "Y/N/NK" }
  attribute_status = { field = "comps_bacterial_pneumonia", apply = { function = "attribute_status_fill" } }
  ref              = "phase_outcome"

[[long]]
  attribute        = "compl_pneum_type"
  value            = { field = "comps_bacterial_pneumonia", values = { "TRUE" = "Bacterial" } }
  attribute_status = { field = "comps_bacterial_pneumonia", apply = { function = "attribute_status_fill" } }
  ref              = "phase_outcome"

Step 5: Run the parser and check the output

Before running against a full dataset, use adtl check to catch problems early. This validates that all field names in the parser exist in your data, and warns about source columns that are not mapped:

adtl check docs/examples/example_parser.toml docs/examples/example_data.csv

Once you are happy, run the parser to produce the output files:

adtl parse docs/examples/example_parser.toml docs/examples/example_data.csv

This creates two files in the current directory — covid-study-core.csv and covid-study-long.csv — and prints a validation summary:

|table          |valid  |total  |percentage_valid|
|---------------|-------|-------|----------------|
|core           |4      |5      |80.000000%      |
|long           |109    |109    |100.000000%     |

Understanding validation errors

A row that fails validation is still written to the output file, with adtl_valid = False and an explanation in the adtl_error column. No data is lost. In this example, patient C004 fails:

data must contain ['subjid', 'siteid', 'dataset_id', 'dataset_disease',
'demog_sex', 'demog_age_days', 'demog_country_iso3', 'pres_adm',
'pres_date', 'outco_outcome', 'outco_date'] properties

C004’s date_outcome is "NA" — the patient is still hospitalised, so no outcome date was recorded. Because emptyFields = "NA", ADTL omits outco_date from the output row entirely. The core schema marks outco_date as required, so the row fails validation even though the data itself is correct.

This is expected for ongoing-care patients. At the analysis stage you would decide whether to include or exclude such rows. The long table is unaffected because it validates each observation row independently.
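If you want to separate valid and invalid rows at the analysis stage, something like the following would do it. It is a sketch that assumes adtl_valid is written to the CSV as the text True or False; the function name is hypothetical and the in-memory sample is an abbreviated stand-in for covid-study-core.csv.

```python
import csv


def split_by_validity(lines):
    """Split CSV rows into (valid, invalid) lists using the adtl_valid column."""
    valid, invalid = [], []
    for row in csv.DictReader(lines):
        (valid if row["adtl_valid"] == "True" else invalid).append(row)
    return valid, invalid


# Abbreviated stand-in for the real output file.
sample = [
    "subjid,adtl_valid,adtl_error",
    "C001,True,",
    "C004,False,data must contain [...] properties",
]

valid_rows, invalid_rows = split_by_validity(sample)
for row in invalid_rows:
    print(row["subjid"], row["adtl_error"])
```

Keeping the invalid rows in a separate list, rather than discarding them, preserves the option of including ongoing-care patients later.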

For large datasets, add --parallel for a significant speed improvement:

adtl parse docs/examples/example_parser.toml large-study-data.csv --parallel

Going further

The patterns above cover the most common cases. Below are a few more that appear in real-world datasets.

Repeated columns

If the source data has multiple follow-up visits as separate columns (e.g. fu_fever_1 through fu_fever_5), use a for loop instead of five identical blocks:

[[long]]
  phase       = "follow_up"
  date        = { field = "fu_date_{n}" }
  attribute   = "adsym_fever"
  value       = { field = "fu_fever_{n}", ref = "Y/N/NK" }
  for.n.range = [1, 5]

This will expand out into 5 blocks when run, and will create a long table row for each follow-up visit that has a non-null value.
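The expansion can be pictured as plain template substitution. The Python sketch below is an illustration rather than ADTL's implementation: it treats the range as inclusive at both ends, which matches the five blocks described above, and the function names are hypothetical.

```python
def substitute(node, var, n):
    """Replace every "{var}" placeholder in a nested block with the number n."""
    if isinstance(node, str):
        return node.replace("{" + var + "}", str(n))
    if isinstance(node, dict):
        return {key: substitute(value, var, n) for key, value in node.items()}
    if isinstance(node, list):
        return [substitute(item, var, n) for item in node]
    return node


def expand_range(block, var, bounds):
    """Produce one copy of `block` per value in the inclusive range `bounds`."""
    start, stop = bounds
    return [substitute(block, var, n) for n in range(start, stop + 1)]


template = {
    "phase": "follow_up",
    "date": {"field": "fu_date_{n}"},
    "attribute": "adsym_fever",
    "value": {"field": "fu_fever_{n}", "ref": "Y/N/NK"},
}

blocks = expand_range(template, "n", [1, 5])
for block in blocks:
    print(block["date"]["field"], block["value"]["field"])
```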

Linking related observations

Some ARC forms (medications and pathogen testing, for example) can have multiple entries per patient per day. A patient might receive two different medications on the same date, so the date alone is not enough to tell those entries apart in the long table. Related observations in the long table (e.g. the name, dose, and route of a single medication) need to be linked by a shared event_id.

ADTL can generate this ID automatically using the generate key. It produces a UUID5, which is deterministic: the same inputs always produce the same ID. The fields listed in values are combined to generate the ID, so they must together uniquely identify the event. In the example below, subjid + medi_date + drug_name is sufficient — two different medications given to the same patient on the same day will have different names, giving each its own ID:

[[long]]
  attribute = "medi_medname"
  value     = { field = "drug_name" }
  event_id  = { generate = { type = "uuid5", values = ["subjid", "medi_date", "drug_name"] } }

[[long]]
  attribute = "medi_dose"
  value_num = { field = "drug_dose_mg" }
  event_id  = { generate = { type = "uuid5", values = ["subjid", "medi_date", "drug_name"] } }
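The determinism of UUID5 is easy to check with Python's standard uuid module. The sketch below is not how ADTL builds its IDs: the namespace, the "|" separator, and the patient and drug values are placeholders chosen for illustration.

```python
import uuid

NAMESPACE = uuid.NAMESPACE_URL  # placeholder namespace, not ADTL's


def event_id(*parts):
    """Derive a deterministic UUID5 from the given identifying values."""
    return str(uuid.uuid5(NAMESPACE, "|".join(parts)))


# Hypothetical values for subjid, medi_date and drug_name.
name_row = event_id("C001", "2023-01-11", "dexamethasone")
dose_row = event_id("C001", "2023-01-11", "dexamethasone")
other_drug = event_id("C001", "2023-01-11", "remdesivir")

print(name_row == dose_row)    # same inputs, same ID
print(name_row == other_drug)  # different drug, different ID
```

Because the ID depends only on its inputs, the name and dose rows for one medication are linked without any shared state between the two [[long]] blocks.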

It is good practice to create a reusable definition for each group of related blocks (all medication blocks, for example), so that the same event ID generation logic is applied consistently across every related observation.

That might look something like this:

[adtl.defs.medication]
  phase    = "during_observation"
  date     = { field = "medi_date" }
  duration = { field = "medi_numdays" }

  [adtl.defs.medication.event_id]
    generate = { type = "uuid5", values = ["subjid", "medi_date", "drug_name"] }

[[long]]
  ref       = "medication"
  attribute = "medi_medname"
  value     = { field = "drug_name" }

[[long]]
  ref       = "medication"
  attribute = "medi_dose"
  value_num = { field = "drug_dose_mg" }

Further reading