bennettoxford/opensafely-variable-survey

OpenSAFELY Variable Survey

This repository contains work by @bennettoxford/team-rsi to survey how "variables" are used by researchers in OpenSAFELY studies.

A "variable" is a demographic or clinical feature of interest within an Electronic Health Record (EHR) that is used within an EHR study.

The work in this repository covers studies written using the deprecated cohort-extractor framework and those using its replacement, ehrQL.

cohort-extractor studies

The main entry point for cohort-extractor related code in this repository is main.py, which is most conveniently run using the just command, just run.

just run fetch will fetch all variables from all studies in all repositories in the opensafely organisation. This requires a suitable GitHub Personal Access Token (PAT); more information is available in the Developer Documentation.

just run notebooks will start a local Marimo notebook server, from which the names.py and definitions.py notebooks can be accessed. These notebooks contain the analysis of variable names and definitions performed as part of this work, and can be used as a starting point for any future analysis.

ehrQL studies

We catalogue the variables defined in ehrql datasets across public studies in the opensafely GitHub organisation. The main script searches for dataset pipelines declared in each study's project.yaml, clones the matching repositories, evaluates the dataset definitions with a spoofed runtime environment, and emits structured summaries of every variable that is declared.

Execution

# Run the full ehrQL extraction process
just ehrql

# Run the ehrQL extraction process with verbose logging
just ehrql --verbose

# Run the ehrQL extraction process with a specific study
just ehrql study-name

How It Works

  1. Invokes the GitHub CLI (gh) to search for project.yaml files that reference ehrql dataset generation commands within the opensafely organisation.
  2. Clones each matching repository into .ehrql_repo_cache/ (one shallow clone per commit SHA) and extracts the dataset authoring files listed in project.yaml.
  3. Imports each dataset module inside a controlled environment that spoofs external inputs (files, command-line arguments, parameters, and selected ehrql behaviours) so that variable definitions can be resolved without the original study artefacts.
  4. Records every discovered variable including the dataset file, variable name, inferred ehrql series type, source line number (where available), and two deterministic hashes of the compiled query-model node.
  5. Writes the aggregated results to ehrql_variables.json and captures the hashed query-model lookup table in ehrql_qm_dump.json.
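
Step 3 can be illustrated with a minimal sketch of importing a Python file while sys.argv is temporarily replaced. This is not the code in ehrql_extractor.py; the function name import_with_spoofed_argv, the stand-in dataset source, and the --index-date option are all hypothetical, and the real collector also spoofs files, parameters, and selected ehrql behaviours.

```python
import importlib.util
import sys
import tempfile
from pathlib import Path
from unittest import mock


def import_with_spoofed_argv(module_path, fake_argv):
    """Import a Python file while sys.argv is temporarily replaced (hypothetical helper)."""
    spec = importlib.util.spec_from_file_location("spoofed_dataset", str(module_path))
    module = importlib.util.module_from_spec(spec)
    # sys.argv is restored automatically when the context manager exits.
    with mock.patch.object(sys, "argv", fake_argv):
        spec.loader.exec_module(module)
    return module


# Stand-in for a study file that expects a command-line option (illustrative only).
SOURCE = (
    "import argparse\n"
    "parser = argparse.ArgumentParser()\n"
    "parser.add_argument('--index-date')\n"
    "args = parser.parse_args()\n"
    "index_date = args.index_date\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "dataset_definition.py"
    path.write_text(SOURCE)
    module = import_with_spoofed_argv(
        path, ["dataset_definition.py", "--index-date", "2020-01-01"]
    )

print(module.index_date)
```

Without the patched sys.argv, the stand-in module would parse whatever arguments the collector itself was started with, which is the kind of early termination the spoofed arguments are meant to avoid.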

Key files

  • ehrql_extractor.py – end-to-end collector used to discover and execute dataset definitions.
  • spoofed_data/ – configurable fixtures used to satisfy dataset expectations (JSON, CSV/Arrow data, parameters, and CLI arguments).
  • ehrql_variables.json / ehrql_qm_dump.json – outputs generated by the collector.

Output Format

ehrql_variables.json captures a timestamped snapshot of every processed project:

{
  "generated_at": "YYYY-MM-DD HH:MM:SS",
  "projects": {
    "opensafely/example-study": {
      "sha": "commitsha",
      "files": {
        "analysis/dataset.py": [
          ["variable_name", "StrPatientSeries", 42, "expr_hash", "expr_hash_no_codes"]
        ]
      }
    }
  }
}
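
Each entry in a file's list has five elements, in this order: variable_name, series_type, line_no, expr_hash, expr_hash_no_codes.

  • series_type reflects the runtime type reported by ehrql (e.g. StrPatientSeries).
  • line_no stores the best-effort Python source line for the assignment.
  • expr_hash is a deterministic SHA-256 hash of the ehrql query-model node; expr_hash_no_codes is the same hash computed with code sets removed, which normalises across codelist variants.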
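
The relationship between the two hashes can be sketched as follows. This is an illustration only: it assumes a node can be rendered as a string in which code sets appear as frozenset({...}) literals, which may not match how the collector serialises query-model nodes. The names CODE_SET and expr_hashes are hypothetical.

```python
import hashlib
import re

# Assumption: code sets show up as frozenset({...}) literals in a node's string form.
CODE_SET = re.compile(r"frozenset\(\{[^}]*\}\)")


def expr_hashes(node_repr):
    """Return (expr_hash, expr_hash_no_codes) for a node's string form."""
    full = hashlib.sha256(node_repr.encode("utf-8")).hexdigest()
    # Replace every code set with a fixed placeholder before hashing again.
    stripped = CODE_SET.sub("frozenset({<codes>})", node_repr)
    no_codes = hashlib.sha256(stripped.encode("utf-8")).hexdigest()
    return full, no_codes


# Two made-up node strings that differ only in their code sets.
a = "In(lhs=SelectColumn(name='snomedct_code'), rhs=Value(frozenset({'111', '222'})))"
b = "In(lhs=SelectColumn(name='snomedct_code'), rhs=Value(frozenset({'333'})))"

hash_a, no_codes_a = expr_hashes(a)
hash_b, no_codes_b = expr_hashes(b)

print(hash_a == hash_b)          # the full hashes differ
print(no_codes_a == no_codes_b)  # the code-free hashes match
```

Under these assumptions, two variables that apply the same logic to different codelists get different expr_hash values but share an expr_hash_no_codes value.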

The ehrql_variables.json file is currently copied manually into this repository and made available via a tool deployed here.

ehrql_qm_dump.json maps every expr_hash_no_codes to the corresponding query model string. This may be useful for future debugging, but the file is very large and so is not committed to git.
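
As a usage sketch for the format described above, the snippet below flattens a snapshot into one record per variable and groups variables that share an expr_hash_no_codes. The sample data, the shortened hash values, and the function names iter_variables and group_by_shape are made up for illustration; a real run would load ehrql_variables.json instead of the inline string.

```python
import json
from collections import defaultdict

# Made-up sample in the documented layout; hashes are shortened placeholders.
SNAPSHOT = json.loads("""
{
  "generated_at": "2025-01-01 00:00:00",
  "projects": {
    "opensafely/example-study": {
      "sha": "aaa111",
      "files": {
        "analysis/dataset.py": [
          ["age", "IntPatientSeries", 10, "h1", "n1"],
          ["has_asthma", "BoolPatientSeries", 12, "h2", "n2"]
        ]
      }
    },
    "opensafely/other-study": {
      "sha": "bbb222",
      "files": {
        "analysis/dataset_definition.py": [
          ["has_copd", "BoolPatientSeries", 7, "h3", "n2"]
        ]
      }
    }
  }
}
""")


def iter_variables(snapshot):
    """Yield one dict per variable across all projects and files."""
    for project, info in snapshot["projects"].items():
        for filename, entries in info["files"].items():
            for name, series_type, line_no, expr_hash, no_codes in entries:
                yield {
                    "project": project,
                    "file": filename,
                    "name": name,
                    "series_type": series_type,
                    "line_no": line_no,
                    "expr_hash": expr_hash,
                    "expr_hash_no_codes": no_codes,
                }


def group_by_shape(snapshot):
    """Group (project, file, name) tuples by expr_hash_no_codes."""
    groups = defaultdict(list)
    for row in iter_variables(snapshot):
        key = (row["project"], row["file"], row["name"])
        groups[row["expr_hash_no_codes"]].append(key)
    return dict(groups)


groups = group_by_shape(SNAPSHOT)
for shape, members in groups.items():
    if len(members) > 1:
        print(shape, members)
```

The keys of the resulting dictionary are the same values that ehrql_qm_dump.json is keyed by, so a group of interest can be looked up there to inspect its query model string.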

Spoofed Inputs and Adaptation

Many studies reference generated files, runtime parameters, or CLI arguments that are not committed to their repositories. The collector fills these gaps by providing stand-ins under spoofed_data/:

  • args.json – fake command-line arguments keyed by repository. Add new entries if a study terminates early because an expected option is missing.
  • parameters.json – values returned when ehrql.get_parameter() is invoked. Repository-specific overrides can be provided alongside a default.
  • json_data.json and csv_data.csv – minimal payloads returned whenever datasets open JSON or CSV files (including the gzipped and Arrow derivatives generated automatically).
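
One way to combine a default with repository-specific overrides is sketched below. The layout shown (a "default" key next to keys named after repositories) is an assumption made for this example and may differ from the actual structure of parameters.json; the function name spoofed_parameter and the parameter names are also hypothetical.

```python
# Assumed layout: a "default" block plus optional per-repository blocks.
PARAMETERS = {
    "default": {"index_date": "2020-01-01", "n_years": 5},
    "opensafely/example-study": {"index_date": "2021-06-01"},
}


def spoofed_parameter(parameters, repo, name):
    """Return the repository-specific value for name, falling back to the default."""
    merged = {**parameters.get("default", {}), **parameters.get(repo, {})}
    # A KeyError here signals that a new entry is needed in the fixture.
    return merged[name]


print(spoofed_parameter(PARAMETERS, "opensafely/example-study", "index_date"))
print(spoofed_parameter(PARAMETERS, "opensafely/example-study", "n_years"))
print(spoofed_parameter(PARAMETERS, "opensafely/unlisted-study", "index_date"))
```

Merging the two dictionaries means a repository only needs to list the values that differ from the default.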

Update the CSV or JSON fixtures as new columns or properties are required. The .csv.gz and .arrow files are regenerated from csv_data.csv automatically by the main script each time it is run.
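
Regenerating the gzipped derivative can be sketched with the standard library alone. This is an illustration, not the main script's implementation: the function name regenerate_gzip and the sample columns are hypothetical, and the Arrow derivative is left out because it needs a third-party library.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def regenerate_gzip(csv_path):
    """Write a gzipped copy of csv_path next to it and return the new path."""
    gz_path = csv_path.with_name(csv_path.name + ".gz")
    with open(csv_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "csv_data.csv"
    # Made-up fixture content for the demonstration.
    csv_path.write_text("patient_id,value\n1,a\n")
    gz_path = regenerate_gzip(csv_path)
    gz_name = gz_path.name
    with gzip.open(gz_path, "rt") as handle:
        round_trip = handle.read()

print(gz_name)
```

Deriving the compressed file on every run keeps csv_data.csv as the single fixture that needs editing.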

Planned Improvements

  • Currently only patient-level datasets are supported. Add support:
    • for event-level datasets,
    • for measures.
  • Ignore unmodified template studies that only define a sex variable.
  • Improve stack traces so errors reference the original dataset source line.
  • Increase line-number coverage for variable assignments.
    • Sometimes the line number is for a different file, so the output should be the filename and line number.
    • Some variables still have no line number at all.
  • Improve the fuzzy matching:
    • where() followed by sort_by() gives the same result as sort_by() followed by where(), but the current variant analysis treats them as different.
    • Multiple where() commands can be chained together. Two variables that differ only in the order of their where() statements currently show as different.

Developer docs

Please see the additional information.

About

Extraction and analysis of study variables in OpenSAFELY studies
