.. _writing-a-parser: Writing a Custom Parser ======================= This tutorial walks through converting a clinical dataset into the two ISARIC output tables described in :ref:`isaric-data-schema`. The conversion tool is `ADTL `_ (Another Data Transformation Language), which reads a TOML **parser file** that you will write to describe how your source columns map to the schema. Install ADTL and read through its introductory documentation before continuing: .. code-block:: bash pip install adtl The example files used throughout this tutorial live in ``docs/examples/``: .. list-table:: :header-rows: 1 :widths: 35 65 * - File - Description * - ``example_data.csv`` - Synthetic COVID-19 source dataset (5 patients) * - ``example_parser.toml`` - Completed parser — the end result of this tutorial * - ``covid-study-core.csv`` - Expected core table output * - ``covid-study-long.csv`` - Expected long table output .. _what-you-build: What you are building --------------------- Running ADTL with the ``example_parser.toml`` file and synthetic data produces two CSV files. The **core table** has one row per patient, with fixed demographic and outcome columns: .. code-block:: text subjid dataset_id demog_sex demog_age_days demog_country_iso3 outco_outcome outco_date C001 COVID-STUDY Male 20088 GBR Discharged alive 2023-01-17 C002 COVID-STUDY Female 26298 DEU Death 2023-01-28 ... The **long table** has one row per *observation* per patient. Instead of one column per variable, it uses a single ``attribute`` column to name the observation, and ``value`` or ``value_num`` to hold its value: .. code-block:: text subjid attribute value value_num phase date attribute_status C001 adsym_fever Yes presentation 2023-01-10 VAL C001 vital_highesttem_c 38.1 presentation 2023-01-10 VAL C001 comor_hypertensi Yes presentation 2023-01-10 VAL ... Every row in the long table also carries an ``attribute_status``: ``VAL`` - a value was recorded; ``UNK`` - unknown; ``NI`` - no information; ``NASK`` - not asked; ``NA`` - not applicable. This matters for pooled analyses because a missing row could mean "not asked" or "asked but unknown" — the status distinguishes between them. The parser file is what tells ADTL how to turn your source columns into this structure. The rest of this tutorial builds it up step by step. Depending on how closely your dataset resembles one generated using a BRIDGE CRF & REDCap, you may wish to start with the auto-generated parser produced by the `draft_parser.py` script in the ``schemas/`` directory, and edit that file rather than writing one from scratch. Running ``adtl check`` with the auto-generated parser and your source data will show you which fields are missing; however, it won't look at the mapping so you should check the output data carefully to make sure it has been transformed correctly. .. _step-1-find-data-target: Step 1: Find where your data goes ---------------------------------- Before writing any mapping rules, work out where each of your source columns belongs in the ISARIC schema. There are two questions to answer for each field: **Core or long?** The core table is for fields that apply once per patient: identifiers, demographics, admission details, and the final outcome. Everything else — symptoms, vital signs, lab results, treatments, complications — goes in the long table. The core table is deliberately short, so finding the fields in your dataset which correspond to the core table variables should not take long; you can find the fields in :ref:`isaric-data-schema`. Everything else goes in the long table, with the variable name specified in the ``attribute`` column. **What is the ISARIC attribute name?** The long table uses ARC variable names in the ``attribute`` column. To find the right name for your field, search ``ARC.csv`` for a matching concept. For example, if your dataset has a column called ``comorbid_hypertension``, search for "hypertension". This returns ``comor_hypertensi`` — the ARC variable name to use as the ``attribute`` value in your parser. The full ARC variable list, with descriptions and answer options, is in ``ARC.csv`` at the root of this repository. .. note:: Sometimes there is no single ARC attribute that matches your source field exactly. A source column called ``comps_bacterial_pneumonia`` (a yes/no field) does not map to a single ARC attribute — instead, it maps to ``compl_pneum`` (was pneumonia present?) and separately to ``compl_pneum_type`` (type of pneumonia). The :ref:`complications section ` below shows how to handle this. If there is no good match for your field, you should contact the ISARIC team about how best to proceed. .. _the-source-data: The source data ~~~~~~~~~~~~~~~ ``example_data.csv`` represents a COVID-19 hospital study with five patients. A selection of columns is shown below: .. code-block:: text usubjid,studyid,siteid_final,country_iso,slider_sex,age,date_admit,date_outcome,outcome,... C001,COVID-STUDY,SITE-GBR-01,GBR,Male,55,2023-01-10,2023-01-17,discharge,... C002,COVID-STUDY,SITE-DEU-01,DEU,Female,72,2023-01-11,2023-01-28,death,... C003,COVID-STUDY,SITE-USA-01,USA,Male,38,2023-01-12,2023-01-19,discharge,... C004,COVID-STUDY,SITE-GBR-02,GBR,Female,61,2023-01-13,NA,ongoing care,... C005,COVID-STUDY,SITE-ESP-01,ESP,Male,48,2023-01-14,2023-01-21,transferred,... Missing values are represented as ``NA`` throughout. Boolean fields use ``TRUE`` / ``FALSE``. Comparing the source columns to the ISARIC schema reveals several things that need handling: .. list-table:: :header-rows: 1 :widths: 35 30 35 * - Source - ISARIC - How to handle * - ``NA`` for missing values - Omit the field / row - ``emptyFields = "NA"`` * - ``TRUE`` / ``FALSE`` - ``"Yes"`` / ``"No"`` - Reusable value mapping * - ``age`` in years - ``demog_age_days`` in days - Unit conversion * - ``outcome`` as free text - Fixed set of allowed strings - Value mapping + ``ignoreMissingKey`` * - Treatments in both ``treat_*`` and ``icu_treat_*`` columns - Single ``attribute`` row - ``combinedType = "firstNonNull"`` .. _step-2-setup-parser-file: Step 2: Set up the parser file ------------------------------- Create a new file (e.g. ``my-study-parser.toml``) and start with the metadata block. The ``name`` value determines the output filenames: .. code-block:: toml [adtl] name = "covid-study" description = "Example COVID-19 study parser" emptyFields = "NA" ``emptyFields`` tells ADTL which string in your source data represents a missing value. Any field containing that string will be treated as absent — no output row will be produced. If your data uses blank cells instead of a placeholder, omit this line. Next, declare the two output tables. These lines tell ADTL what kind of table each is and where to find the schema file it should validate against: .. code-block:: toml [adtl.tables.core] kind = "groupBy" groupBy = "subjid" aggregation = "lastNotNull" schema = "../../schemas/isaric-core.json" [adtl.tables.long] kind = "oneToMany" schema = "../../schemas/arc_1.2.2_isaric_long.schema.json" discriminator = "attribute" common = { subjid = { field = "usubjid" }, dataset_id = { field = "studyid" }, arcver = "1.2.2" } ``kind = "groupBy"`` collapses any duplicate source rows for the same patient into one output row, keeping the last non-null value for each field. ``kind = "oneToMany"`` expands each source row into multiple output rows — one per ``[[long]]`` block that produces a non-null value. The ``common`` setting lists fields that should appear on every long table row. Putting ``subjid`` and ``dataset_id`` here means you do not have to repeat them in every observation block. Replace ``1.2.2`` with the ARC version you are targeting. The schema paths are relative to the parser file. .. _step-3-map-core: Step 3: Map the core table --------------------------- The ``[core]`` section maps your source columns to the core table fields. The simplest case is a direct column-to-field mapping: .. code-block:: toml [core] subjid = { field = "usubjid" } siteid = { field = "siteid_final" } dataset_id = { field = "studyid" } dataset_disease = "COVID-19" demog_country_iso3 = { field = "country_iso" } pres_adm = "Unknown" pres_date = { field = "date_admit" } outco_date = { field = "date_outcome" } ``subjid = { field = "usubjid" }`` means: take the value from the source column named ``usubjid`` (referred to as the ``field``) and write it to the ``subjid`` column in the core table. Values without a ``field`` key are written as-is for every patient. ``dataset_disease = "COVID-19"`` and ``pres_adm = "Unknown"`` are examples: the disease is the same for all patients in this dataset, and admission status was not explicitly collected. **When the values need translating** The core schema requires ``demog_sex`` to be ``"Male"`` or ``"Female"``, exactly. The source data happens to use the same strings — but an explicit ``values`` mapping is still good practice because it documents the intent, rejects unexpected values like ``"M"`` or ``"F"``, and makes it easy to add ``"Other"`` later if needed: .. code-block:: toml [core.demog_sex] field = "slider_sex" values = { Male = "Male", Female = "Female" } **When the units are different** The schema requires ``demog_age_days`` as an integer number of days, but the source records age in years. ADTL handles unit conversion automatically, when you specify the ``source_unit`` (the units your data was collected in) and the target ``unit`` (what the ISARIC schema requires): .. code-block:: toml [core.demog_age_days] field = "age" unit = "days" source_unit = "years" .. note:: On TOML syntax The above TOML code-block is equivalent to .. code-block:: toml [core] demog_age_days = { field = "age", unit = "days", source_unit = "years" } Which format you choose is largely dependent on personal preference and readability. If, like in the example below, there are many sub-keys for a single field, the sub-table format is often easier to read as it doesn't disappear off the edge of the screen. If using an IDE such as VSCode to edit your parser, there are auto-formatters available such as `Even Better TOML `_ which will automatically format your parser file for you and highlight any syntax errors. **When the outcome is recorded as free text** Clinical outcome can be recorded in many ways across different sites — ``"discharge"``, ``"released"``, ``"cured (confirmed by a negative covid test)"`` — but the ISARIC schema only accepts a fixed set of strings. A ``values`` map converts each source string to the correct schema value. By default, if a source value is not found in the ``values`` map, ADTL silently ignores it. Setting ``ignoreMissingKey = true`` changes this: unmapped values pass through unchanged, and ADTL will flag them at validation time if they are not valid schema values. This is useful when you cannot know in advance every possible free-text string a site might enter: .. code-block:: toml [core.outco_outcome] field = "outcome" ignoreMissingKey = true [core.outco_outcome.values] discharge = "Discharged alive" released = "Discharged alive" "released with home care" = "Discharged alive" "cured (confirmed by a negative covid test)" = "Discharged alive" "recovery (confirmed by a negative test)" = "Discharged alive" "ongoing care" = "Still hospitalised" transferred = "Transfer to other facility" "moved to facility" = "Transfer to other facility" death = "Death" Multiple source strings can map to the same schema value. The sub-table syntax (``[core.outco_outcome.values]``) is used here instead of inline braces because the mapping is too long to fit on one line. .. _step-4-map-long: Step 4: Map the long table --------------------------- Each observation type gets its own ``[[long]]`` block. The minimum each block needs is an ``attribute`` name, a value source, a phase and an ``attribute_status``. .. _reuse-definitions: **Reusing phase and date across many blocks** Most observations belong to one of two healthcare encounter phases in this dataset: presentation (at admission) or outcome (at discharge). Rather than writing the phase and date on every single block, define them once as reusable references: .. code-block:: toml [adtl.defs.phase_presentation] phase = "presentation" date = { field = "date_admit" } [adtl.defs.phase_outcome] phase = "outcome" date = { field = "date_outcome" } Any ``[[long]]`` block can then include e.g. ``ref = "phase_presentation"`` to inherit both ``phase`` and ``date`` from the definition. .. _string-bool-observations: **String and boolean observations (symptoms, comorbidities)** In this example dataset, boolean fields — where the source value is ``TRUE`` or ``FALSE`` — are common. These need two things: a mapping from ``TRUE``/``FALSE`` to ``"Yes"``/``"No"``, and an ``attribute_status`` to record whether the data was actually collected. Define the value mapping once as a reusable def: .. code-block:: toml [adtl.defs."Y/N/NK"] values = { TRUE = "Yes", FALSE = "No" } Then reference it in each observation block: .. code-block:: toml [[long]] attribute = "adsym_fever" value = { field = "symptoms_history_of_fever", ref = "Y/N/NK" } attribute_status = { field = "symptoms_history_of_fever", apply = { function = "attribute_status_fill" } } ref = "phase_presentation" [[long]] attribute = "comor_hypertensi" value = { field = "comorbid_hypertension", ref = "Y/N/NK" } attribute_status = { field = "comorbid_hypertension", apply = { function = "attribute_status_fill" } } ref = "phase_presentation" ``ref = "Y/N/NK"`` expands the ``values`` map inside the ``value`` field. ``ref = "phase_presentation"`` expands into ``phase`` and ``date`` at the block level. ADTL applies these substitutions before producing output. The ``attribute_status_fill`` function is defined in ``schemas/isaric_transformations.py`` (not built into ADTL itself). It determines the status code from the raw source value: - A null value (absent, or matched by ``emptyFields``) → row is suppressed entirely - A pre-defined status code (``UNK``, ``NI``, ``NASK``, ``NA``) → passed through as-is - Any other non-null value (including ``TRUE`` or ``FALSE``) → ``VAL`` This same pattern — ``ref = "Y/N/NK"`` for the value, ``attribute_status_fill`` for the status — applies to every boolean field: symptoms, comorbidities, treatments, and complications. .. _numeric-observation: **Numeric observations (vital signs, lab values)** For numeric measurements, use ``value_num`` instead of ``value``, and add ``attribute_unit`` to record the unit: .. code-block:: toml [[long]] attribute = "vital_highesttem_c" value_num = { field = "vs_temp" } attribute_unit = "°C" attribute_status = { field = "vs_temp", apply = { function = "attribute_status_fill" } } ref = "phase_presentation" [[long]] attribute = "labs_crp_mgl" attribute_unit = "mg/L" value_num = { field = "lab_crp" } attribute_status = { field = "lab_crp", apply = { function = "attribute_status_fill" } } ref = "phase_outcome" Vital signs are assigned to the ``phase_presentation`` phase; lab values to ``phase_outcome``. This may differ for you, depending on the timing of your measurements. Adjust the ``ref`` accordingly. .. _two-sources-section: **When the same data is in two source columns** Some studies record treatments separately for general ward and ICU patients. Rather than producing two rows for the same attribute, ``combinedType = "firstNonNull"`` merges them: ADTL evaluates the list of fields in order and uses the first non-null result. For non-ICU patients, the ``icu_treat_*`` column is ``"NA"`` (null), so the ward column is used. For patient C002 (who was in the ICU), the ward column is ``FALSE`` but the ICU column is ``TRUE`` — so the ICU value takes effect. The ``[long.attribute_status]`` block mirrors the same field order so the status always reflects the same source column as the selected value: .. code-block:: toml [[long]] attribute = "medi_medtype" ref = "phase_outcome" [long.value] combinedType = "firstNonNull" fields = [ { field = "treat_corticosteroids", values = { "TRUE" = "Corticosteroid" } }, { field = "icu_treat_corticosteroids", values = { "TRUE" = "Corticosteroid" } }, ] [long.attribute_status] combinedType = "firstNonNull" fields = [ { field = "treat_corticosteroids", apply = { function = "attribute_status_fill" } }, { field = "icu_treat_corticosteroids", apply = { function = "attribute_status_fill" } }, ] .. _own-date-section: **When an observation has its own date** The ICU admission block cannot use the presentation or outcome phase refs, because its date (``icu_in``) is different from both ``date_admit`` and ``date_outcome``. Define the phase and date inline instead: .. code-block:: toml [[long]] attribute = "crito_icu" value = { field = "slider_icu_ever", ref = "Y/N/NK" } attribute_status = { field = "slider_icu_ever", apply = { function = "attribute_status_fill" } } phase = "during_observation" date = { field = "icu_in" } duration = { field = "icu_in", apply = { function = "durationDays", params = ["$icu_out"] } } The ``duration`` field records the ICU length of stay in days. ``durationDays`` computes the number of days between the value of ``icu_in`` and the column named in ``params`` (``$icu_out`` — the ``$`` prefix means "look up this column in the same source row"). For patients without an ICU admission, both columns are ``"NA"`` (null), so ``duration`` is left empty. ADTL ships with a number of built-in functions similar to `durationDays`, which can be found in the `ADTL documentation `_. .. _complications-section: **When one source field maps to multiple attributes** Sometimes a single yes/no source column corresponds to more than one ARC attribute. The source column ``comps_bacterial_pneumonia`` is an example: the ARC 1.2.2 schema does not have a single attribute for "bacterial pneumonia as a complication". Instead, it separates the concept into two attributes: ``compl_pneum`` (was pneumonia present?) and ``compl_pneum_type`` (what was the etiology?). Write two ``[[long]]`` blocks from the same source column. For ``compl_pneum_type``, only map ``TRUE`` — ADTL silently skips rows where the source value has no entry in the ``values`` map, so patients where ``comps_bacterial_pneumonia = FALSE`` will not get a ``compl_pneum_type`` row: .. code-block:: toml [[long]] attribute = "compl_pneum" value = { field = "comps_bacterial_pneumonia", ref = "Y/N/NK" } attribute_status = { field = "comps_bacterial_pneumonia", apply = { function = "attribute_status_fill" } } ref = "phase_outcome" [[long]] attribute = "compl_pneum_type" value = { field = "comps_bacterial_pneumonia", values = { "TRUE" = "Bacterial" } } attribute_status = { field = "comps_bacterial_pneumonia", apply = { function = "attribute_status_fill" } } ref = "phase_outcome" .. _step-5-run-and-check: Step 5: Run the parser and check the output -------------------------------------------- Before running against a full dataset, use ``adtl check`` to catch problems early. This validates that all field names in the parser exist in your data, and warns about source columns that are not mapped: .. code-block:: bash adtl check docs/examples/example_parser.toml docs/examples/example_data.csv Once you are happy, run the parser to produce the output files: .. code-block:: bash adtl parse docs/examples/example_parser.toml docs/examples/example_data.csv This creates two files in the current directory — ``covid-study-core.csv`` and ``covid-study-long.csv`` — and prints a validation summary: .. code-block:: text |table |valid |total |percentage_valid| |---------------|-------|-------|----------------| |core |4 |5 |80.000000% | |long |109 |109 |100.000000% | .. _understanding-errors: **Understanding validation errors** A row that fails validation is still written to the output file, with ``adtl_valid = False`` and an explanation in the ``adtl_error`` column. No data is lost. In this example, patient C004 fails: .. code-block:: text data must contain ['subjid', 'siteid', 'dataset_id', 'dataset_disease', 'demog_sex', 'demog_age_days', 'demog_country_iso3', 'pres_adm', 'pres_date', 'outco_outcome', 'outco_date'] properties C004's ``date_outcome`` is ``"NA"`` — the patient is still hospitalised, so no outcome date was recorded. Because ``emptyFields = "NA"``, ADTL omits ``outco_date`` from the output row entirely. The core schema marks ``outco_date`` as required, so the row fails validation even though the data itself is correct. This is expected for ongoing-care patients. At the analysis stage you would decide whether to include or exclude such rows. The long table is unaffected because it validates each observation row independently. For large datasets, add ``--parallel`` for a significant speed improvement: .. code-block:: bash adtl parse docs/examples/example_parser.toml large-study-data.csv --parallel Going further ------------- The patterns above cover the most common cases. Below are a few more that appear in real-world datasets. .. _repeated-columns: **Repeated columns** If the source data has multiple follow-up visits as separate columns (e.g. ``fu_fever_1`` through ``fu_fever_5``), use a ``for`` loop instead of five identical blocks: .. code-block:: toml [[long]] phase = "follow_up" date = { field = "fu_date_{n}" } attribute = "adsym_fever" value = { field = "fu_fever_{n}", ref = "Y/N/NK" } for.n.range = [1, 5] This will expand out into 5 blocks when run, and will create a long table row for each follow-up visit that has a non-null value. .. _event-id: **Linking related observations** Some ARC forms — medications and pathogen testing, for example — can have multiple entries per patient per day. A patient might receive two different medications on the same date, so the date alone is not enough to tell those entries apart in the long table. Related observations in the long table ( e.g. the name, dose, and route of a single medication) need to be linked by a shared ``event_id``. ADTL can generate this ID automatically using the ``generate`` key. It produces a UUID5, which is deterministic: the same inputs always produce the same ID. The fields listed in ``values`` are combined to generate the ID, so they must together uniquely identify the event. In the example below, ``subjid`` + ``medi_date`` + ``drug_name`` is sufficient — two different medications given to the same patient on the same day will have different names, giving each its own ID: .. code-block:: toml [[long]] attribute = "medi_medname" value = { field = "drug_name" } event_id = { generate = { type = "uuid5", values = ["subjid", "medi_date", "drug_name"] } } [[long]] attribute = "medi_dose" value_num = { field = "drug_dose_mg" } event_id = { generate = { type = "uuid5", values = ["subjid", "medi_date", "drug_name"] } } Good practise would be to create a reusable definition for, e.g., all medication-related blocks, so that the same event ID generation logic is applied consistently across all related observations. That might look something like this: .. code-block:: toml [adtl.defs.medication] phase = "during_observation" date = { field = "medi_date" } duration = { field = "medi_numdays" } [adtl.defs.medication.event_id] generate = { type = "uuid5", values = ["subjid", "medi_date", "drug_name"] } [[long]] ref = "medication" attribute = "medi_medname" value = { field = "drug_name" } [[long]] ref = "medication" attribute = "medi_dose" value_num = { field = "drug_dose_mg" } Further reading --------------- - `ADTL documentation `_ — full reference for all mapping rules, CLI options, and the Python API. - :ref:`isaric-data-schema` — overview of the ISARIC core and long schema tables.