This repository contains work by @bennettoxford/team-rsi to survey how "variables" are used by researchers in OpenSAFELY studies.
A "variable" is a demographic or clinical feature of interest within an Electronic Health Record (EHR) that is used within an EHR study.
The work in this repository covers studies written using the deprecated cohort-extractor framework and those using its replacement, ehrQL.
The main entry point for cohort-extractor-related code in this repository is main.py, which is most conveniently run using the just command: just run.
just run fetch will fetch all variables from all studies in all repositories in the opensafely organisation.
This requires a suitable GitHub Personal Access Token (PAT);
more information on this is available in the Developer Documentation.
just run notebooks will start a local Marimo notebook server,
from which the names.py and definitions.py notebooks can be accessed.
These notebooks contain the analysis of variable names and definitions performed as part of this work,
and can be used as a starting point for any desired future analysis.
We catalogue the variables defined in ehrql datasets across public studies in
the opensafely GitHub organisation. The main script searches for dataset
pipelines declared in each study's project.yaml, clones the matching
repositories, evaluates the dataset definitions with a spoofed runtime
environment, and emits structured summaries of every variable that is declared.
# Run the full ehrQL extraction process
just ehrql
# Run the ehrQL extraction process with verbose logging
just ehrql --verbose
# Run the ehrQL extraction process with a specific study
just ehrql study-name

- Invokes the GitHub CLI (gh) to search for project.yaml files that reference ehrql dataset generation commands within the opensafely organisation.
- Clones each matching repository into .ehrql_repo_cache/ (one shallow clone per commit SHA) and extracts the dataset authoring files listed in project.yaml.
- Imports each dataset module inside a controlled environment that spoofs external inputs (files, command-line arguments, parameters, and selected ehrql behaviours) so that variable definitions can be resolved without the original study artefacts.
- Records every discovered variable including the dataset file, variable name, inferred ehrql series type, source line number (where available), and two deterministic hashes of the compiled query-model node.
- Writes the aggregated results to ehrql_variables.json and captures the hashed query-model lookup table in ehrql_qm_dump.json.
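The spoofing step can be illustrated with a minimal sketch. This is not the collector's actual code: the helper names spoofed_argv and import_with_spoofed_args, and the stand-in dataset file, are hypothetical. It only shows one way a module that parses command-line options at import time might be imported with substitute arguments.

```python
import contextlib
import importlib.util
import sys
import tempfile
from pathlib import Path


@contextlib.contextmanager
def spoofed_argv(argv):
    """Temporarily replace sys.argv, restoring the original afterwards."""
    original = sys.argv
    sys.argv = list(argv)
    try:
        yield
    finally:
        sys.argv = original


def import_with_spoofed_args(path, argv):
    """Import the module at `path` as if it had been run with `argv`."""
    spec = importlib.util.spec_from_file_location("dataset_under_test", path)
    module = importlib.util.module_from_spec(spec)
    with spoofed_argv(argv):
        spec.loader.exec_module(module)
    return module


# Hypothetical stand-in for a dataset definition that reads a CLI option
# at import time (real dataset definitions would also import ehrql).
SOURCE = """
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--index-date")
args = parser.parse_args()
index_date = args.index_date
"""

with tempfile.TemporaryDirectory() as tmp:
    dataset_path = Path(tmp) / "dataset_definition.py"
    dataset_path.write_text(SOURCE)
    module = import_with_spoofed_args(
        dataset_path, ["dataset_definition.py", "--index-date", "2020-01-01"]
    )

print(module.index_date)
```

Using a context manager keeps the substitution scoped to a single import, so one study's fake arguments cannot leak into the next.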
- ehrql_extractor.py: end-to-end collector used to discover and execute dataset definitions.
- spoofed_data/: configurable fixtures used to satisfy dataset expectations (JSON, CSV/Arrow data, parameters, and CLI arguments).
- ehrql_variables.json / ehrql_qm_dump.json: outputs generated by the collector.
ehrql_variables.json captures a timestamped snapshot of every processed
project:
{
  "generated_at": "YYYY-MM-DD HH:MM:SS",
  "projects": {
    "opensafely/example-study": {
      "sha": "commitsha",
      "files": {
        "analysis/dataset.py": [
          ["variable_name", "StrPatientSeries", 42, "expr_hash", "expr_hash_no_codes"]
        ]
      }
    }
  }
}

- series_type reflects the runtime type reported by ehrql (e.g. StrPatientSeries).
- line_no stores the best-effort Python source line for the assignment.
- expr_hash is a deterministic SHA-256 hash of the ehrql query-model node; expr_hash_no_codes removes code sets to normalise across codelist variants.
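As an illustration of consuming this file, the sketch below flattens the nested structure into one record per variable. It assumes each array holds, in order, the variable name, series_type, line_no, expr_hash and expr_hash_no_codes, as the example and field descriptions above suggest. The function name iter_variables and the record keys are hypothetical.

```python
import json

# Inline sample mirroring the documented structure; a real run would load
# ehrql_variables.json from disk instead.
SAMPLE = """
{
  "generated_at": "YYYY-MM-DD HH:MM:SS",
  "projects": {
    "opensafely/example-study": {
      "sha": "commitsha",
      "files": {
        "analysis/dataset.py": [
          ["variable_name", "StrPatientSeries", 42, "expr_hash", "expr_hash_no_codes"]
        ]
      }
    }
  }
}
"""


def iter_variables(snapshot):
    """Yield one flat record per variable in the snapshot."""
    for project, info in snapshot["projects"].items():
        for path, rows in info["files"].items():
            # Assumed field order, taken from the example above.
            for name, series_type, line_no, expr_hash, expr_hash_no_codes in rows:
                yield {
                    "project": project,
                    "sha": info["sha"],
                    "file": path,
                    "variable": name,
                    "series_type": series_type,
                    "line_no": line_no,
                    "expr_hash": expr_hash,
                    "expr_hash_no_codes": expr_hash_no_codes,
                }


records = list(iter_variables(json.loads(SAMPLE)))
for record in records:
    print(record["project"], record["file"], record["variable"], record["line_no"])
```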
The ehrql_variables.json file is currently manually copy/pasted into this repository
and made available via a tool deployed here.
ehrql_qm_dump.json maps every expr_hash_no_codes to the corresponding query-model
string. This can be useful for debugging, but the file is very large and so is
not committed to Git.
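To show why two hashes are kept, here is a toy sketch of hashing a query node with and without its code set. It does not use the real ehrql query model: nodes are plain dictionaries, and the names canonical and node_hash are hypothetical. The collector's own serialisation may differ.

```python
import hashlib
import json


def canonical(node, drop_codes=False):
    """Return a JSON-serialisable, order-stable form of a toy query node."""
    if isinstance(node, dict):
        return {key: canonical(value, drop_codes) for key, value in node.items()}
    if isinstance(node, (set, frozenset)):
        # Code sets are sorted for stability, or replaced by a placeholder.
        return "<codes>" if drop_codes else sorted(node)
    if isinstance(node, (list, tuple)):
        return [canonical(value, drop_codes) for value in node]
    return node


def node_hash(node, drop_codes=False):
    """SHA-256 of the canonical JSON form of the node."""
    payload = json.dumps(canonical(node, drop_codes), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Two toy variables that differ only in their codelist.
asthma_v1 = {"op": "exists", "table": "clinical_events", "codes": frozenset({"A1", "A2"})}
asthma_v2 = {"op": "exists", "table": "clinical_events", "codes": frozenset({"A1", "A2", "A3"})}

# Lookup table in the spirit of ehrql_qm_dump.json: hash -> readable form.
qm_dump = {
    node_hash(node, drop_codes=True): json.dumps(canonical(node, drop_codes=True), sort_keys=True)
    for node in (asthma_v1, asthma_v2)
}

print(node_hash(asthma_v1) == node_hash(asthma_v2))
print(node_hash(asthma_v1, drop_codes=True) == node_hash(asthma_v2, drop_codes=True))
print(len(qm_dump))
```

In this sketch the full hashes differ while the code-free hashes match, so both variants collapse to a single entry in the lookup table.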
Many studies reference generated files, runtime parameters, or CLI arguments that
are not committed to their repositories. The collector fills these gaps by
providing stand-ins under spoofed_data/:
- args.json: fake command-line arguments keyed by repository. Add new entries if a study terminates early because an expected option is missing.
- parameters.json: values returned when ehrql.get_parameter() is invoked. Repository-specific overrides can be provided alongside a default.
- json_data.json and csv_data.csv: minimal payloads returned whenever datasets open JSON or CSV files (including the gzipped and Arrow derivatives generated automatically).
Update the CSV or JSON fixtures as new columns or properties are required. The .csv.gz and .arrow
files are regenerated from csv_data.csv automatically by the main script each time it is run.
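The fallback and regeneration behaviour described above might look roughly like the following sketch. The layout of parameters.json (a "default" key beside repository names), the parameter name index_date, and the helper names lookup and regenerate_gzip are all assumptions made for illustration. Only the gzip derivative is shown, since the Arrow file would need a third-party library.

```python
import csv
import gzip
import json
import tempfile
from pathlib import Path


def lookup(fixtures, repo):
    """Return the repository-specific entry, else the "default" entry."""
    return fixtures.get(repo, fixtures.get("default"))


def regenerate_gzip(csv_path):
    """Write a gzipped copy next to the CSV file and return its path."""
    gz_path = csv_path.with_name(csv_path.name + ".gz")
    gz_path.write_bytes(gzip.compress(csv_path.read_bytes()))
    return gz_path


with tempfile.TemporaryDirectory() as tmp:
    spoofed = Path(tmp)

    # Assumed layout: one "default" entry plus optional per-repository overrides.
    (spoofed / "parameters.json").write_text(json.dumps({
        "default": {"index_date": "2020-01-01"},
        "opensafely/example-study": {"index_date": "2021-01-01"},
    }))
    (spoofed / "csv_data.csv").write_text("patient_id,sex\n1,female\n")

    params = json.loads((spoofed / "parameters.json").read_text())
    override = lookup(params, "opensafely/example-study")
    fallback = lookup(params, "opensafely/some-other-study")

    gz_path = regenerate_gzip(spoofed / "csv_data.csv")
    gz_name = gz_path.name
    with gzip.open(gz_path, "rt", newline="") as handle:
        rows = list(csv.reader(handle))

print(override, fallback, gz_name, rows)
```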
- Currently this is just for patient-level datasets. Add support:
  - for event-level datasets,
  - for measures.
- Ignore unmodified template studies that only define a sex variable.
- Improve stack traces so errors reference the original dataset source line.
- Increase line-number coverage for variable assignments.
  - Sometimes the line number is for a different file, so the output should be the filename and line number.
  - Some variables still have no line number at all.
- Improve the fuzzy matching:
  - where() followed by sort_by() result in the same things, but the current variant analysis treats these as different things.
  - Multiple where() commands can be chained together. Two variables that only differ in the order of the where() statements currently show as different things.
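One possible approach to the where() ordering item is to canonicalise each run of consecutive where() steps before hashing. The sketch below is an assumption about how that could work, not part of the current collector. Operations are modelled as simple (kind, detail) tuples rather than real ehrql query-model nodes, and normalise and variant_key are hypothetical names.

```python
import hashlib
from itertools import groupby


def normalise(ops):
    """Sort each run of consecutive where() steps; leave other steps in place."""
    result = []
    for kind, run in groupby(ops, key=lambda op: op[0]):
        run = list(run)
        # Filters within one run all apply together, so their order
        # does not affect which rows survive.
        result.extend(sorted(run) if kind == "where" else run)
    return result


def variant_key(ops):
    """Hash the normalised operation chain."""
    text = repr(normalise(ops))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Two toy chains that differ only in the order of their where() steps.
chain_a = [("where", "date >= 2020-01-01"), ("where", "code in asthma"), ("sort_by", "date")]
chain_b = [("where", "code in asthma"), ("where", "date >= 2020-01-01"), ("sort_by", "date")]
# A chain with a genuinely different condition.
chain_c = [("where", "code in copd"), ("where", "date >= 2020-01-01"), ("sort_by", "date")]

print(variant_key(chain_a) == variant_key(chain_b))
print(variant_key(chain_a) == variant_key(chain_c))
```

Sorting only within consecutive runs is deliberately conservative: it never moves a where() step across a sort_by() or any other operation.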
Please see the additional information.