OpenSAFELY Variable Survey

This repository contains work by @bennettoxford/team-rsi to survey how "variables" are used by researchers in OpenSAFELY studies.

A "variable" is a demographic or clinical feature of interest within an Electronic Health Record (EHR) that is used within an EHR study.

The work in this repository covers studies written using the deprecated cohort-extractor framework and those using its replacement, ehrQL.

cohort-extractor studies

The main entry point for cohort-extractor code in this repository is main.py, which is most conveniently run using the just command just run.

just run fetch will fetch all variables from all studies in all repositories in the opensafely organisation. This requires a suitable GitHub Personal Access Token (PAT); more information is available in the Developer Documentation.

just run notebooks will start a local Marimo notebook server, from which the names.py and definitions.py notebooks can be accessed. These notebooks contain the analysis of variable names and definitions performed as part of this work, and can be used as a starting point for any future analysis.

ehrQL studies

We catalogue the variables defined in ehrQL datasets across public studies in the opensafely GitHub organisation. The main script searches for dataset pipelines declared in each study's project.yaml, clones the matching repositories, evaluates the dataset definitions with a spoofed runtime environment, and emits structured summaries of every variable that is declared.

Execution

# Run the full ehrQL extraction process
just ehrql

# Run the ehrQL extraction process with verbose logging
just ehrql --verbose

# Run the ehrQL extraction process with a specific study
just ehrql study-name

How It Works

Invokes the GitHub CLI (gh) to search for project.yaml files that reference ehrql dataset generation commands within the opensafely organisation.
Clones each matching repository into .ehrql_repo_cache/ (one shallow clone per commit SHA) and extracts the dataset authoring files listed in project.yaml.
Imports each dataset module inside a controlled environment that spoofs external inputs (files, command-line arguments, parameters, and selected ehrql behaviours) so that variable definitions can be resolved without the original study artefacts.
Records every discovered variable including the dataset file, variable name, inferred ehrql series type, source line number (where available), and two deterministic hashes of the compiled query-model node.
Writes the aggregated results to ehrql_variables.json and captures the hashed query-model lookup table in ehrql_qm_dump.json.
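
The spoofing step can be illustrated with a small, standalone sketch. It is not the code in ehrql_extractor.py: the helper name run_with_spoofed_args, the --cohort option, and the stand-in dataset file are all hypothetical, and only command-line arguments are spoofed here (the real collector also spoofs files, parameters, and selected ehrql behaviours).

```python
import runpy
import sys
import tempfile
from pathlib import Path
from unittest import mock


def run_with_spoofed_args(module_path, fake_args):
    """Execute a Python file with sys.argv temporarily replaced.

    Returns the module's global namespace so that top-level names
    (for example a dataset object) can be inspected afterwards.
    """
    argv = [str(module_path), *fake_args]
    with mock.patch.object(sys, "argv", argv):
        return runpy.run_path(str(module_path))


# Hypothetical stand-in for a dataset definition that reads a CLI option
# at import time and would fail without it.
source = "import sys\ncohort = sys.argv[sys.argv.index('--cohort') + 1]\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "dataset_definition.py"
    path.write_text(source)
    namespace = run_with_spoofed_args(path, ["--cohort", "adults"])

print(namespace["cohort"])
```

Patching sys.argv only for the duration of the import keeps one study's fake arguments from leaking into the next.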

Key files

ehrql_extractor.py – end-to-end collector used to discover and execute dataset definitions.
spoofed_data/ – configurable fixtures used to satisfy dataset expectations (JSON, CSV/Arrow data, parameters, and CLI arguments).
ehrql_variables.json / ehrql_qm_dump.json – outputs generated by the collector.

Output Format

ehrql_variables.json captures a timestamped snapshot of every processed project:

{
  "generated_at": "YYYY-MM-DD HH:MM:SS",
  "projects": {
    "opensafely/example-study": {
      "sha": "commitsha",
      "files": {
        "analysis/dataset.py": [
          ["variable_name", "StrPatientSeries", 42, "expr_hash", "expr_hash_no_codes"]
        ]
      }
    }
  }
}

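Each entry in a file's list is a positional array of the form [variable_name, series_type, line_no, expr_hash, expr_hash_no_codes].
series_type reflects the runtime type reported by ehrql (e.g. StrPatientSeries).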
line_no stores the best-effort Python source line for the assignment.
expr_hash is a deterministic SHA-256 hash of the ehrql query-model node; expr_hash_no_codes removes code sets to normalise across codelist variants.
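
As a sketch of how the file might be consumed, the snippet below flattens the nested structure into one record per variable. The helper name iter_variables is an assumption for illustration, and the field order is taken from the example above.

```python
import json

# Positional fields of each variable entry, in the order shown in the example.
FIELDS = ("variable_name", "series_type", "line_no", "expr_hash", "expr_hash_no_codes")


def iter_variables(snapshot):
    """Yield one flat dict per variable in an ehrql_variables.json snapshot."""
    for project, info in snapshot["projects"].items():
        for file_path, entries in info["files"].items():
            for entry in entries:
                record = {"project": project, "sha": info["sha"], "file": file_path}
                record.update(zip(FIELDS, entry))
                yield record


# Inline copy of the example document; in practice this would be loaded
# from ehrql_variables.json with json.load().
example = json.loads("""
{
  "generated_at": "YYYY-MM-DD HH:MM:SS",
  "projects": {
    "opensafely/example-study": {
      "sha": "commitsha",
      "files": {
        "analysis/dataset.py": [
          ["variable_name", "StrPatientSeries", 42, "expr_hash", "expr_hash_no_codes"]
        ]
      }
    }
  }
}
""")

rows = list(iter_variables(example))
print(rows[0]["project"], rows[0]["series_type"], rows[0]["line_no"])
```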

The ehrql_variables.json file is currently copied manually into this repository and made available via a tool deployed here.

ehrql_qm_dump.json maps every expr_hash_no_codes to the corresponding query-model string, which may be useful for debugging. The file is very large and so is not committed to git.
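
One possible use of the two outputs together is to find variables in different studies that share the same logic once codelists are ignored. The sketch below assumes the dump is a flat mapping from hash to string. The rows, hashes, and placeholder query-model strings are made up for illustration.

```python
from collections import defaultdict

# Hypothetical (project, variable_name, expr_hash_no_codes) rows, as might be
# extracted from ehrql_variables.json.
rows = [
    ("opensafely/study-a", "age", "hash-1"),
    ("opensafely/study-b", "age_at_index", "hash-1"),
    ("opensafely/study-b", "has_asthma", "hash-2"),
]

# Hypothetical excerpt of ehrql_qm_dump.json: hash -> query-model string.
qm_dump = {
    "hash-1": "<query-model string for hash-1>",
    "hash-2": "<query-model string for hash-2>",
}


def group_by_hash(rows):
    """Group (project, variable_name) pairs by expr_hash_no_codes."""
    groups = defaultdict(list)
    for project, name, hash_no_codes in rows:
        groups[hash_no_codes].append((project, name))
    return dict(groups)


groups = group_by_hash(rows)

# Keep only hashes that occur more than once, i.e. definitions that are
# shared after codelists are normalised away.
shared = {h: members for h, members in groups.items() if len(members) > 1}

for hash_no_codes, members in shared.items():
    print(qm_dump[hash_no_codes], members)
```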

Spoofed Inputs and Adaptation

Many studies reference generated files, runtime parameters, or CLI arguments that are not committed to their repositories. The collector fills these gaps by providing stand-ins under spoofed_data/:

args.json – fake command-line arguments keyed by repository. Add new entries if a study terminates early because an expected option is missing.
parameters.json – values returned when ehrql.get_parameter() is invoked. Repository-specific overrides can be provided alongside a default.
json_data.json and csv_data.csv – minimal payloads returned whenever datasets open JSON or CSV files (including the gzipped and Arrow derivatives generated automatically).
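
The default-plus-override lookup for parameters could look roughly like the sketch below. The layout of parameters.json shown here (a "default" key alongside repository-name keys), the index_date parameter, and the helper name spoofed_parameter are assumptions for illustration. Check the actual file before relying on this shape.

```python
import json

# Assumed layout: a "default" block plus optional per-repository overrides.
parameters = json.loads("""
{
  "default": {"index_date": "2020-01-01"},
  "opensafely/example-study": {"index_date": "2021-06-01"}
}
""")


def spoofed_parameter(parameters, repo, name):
    """Return the repository-specific value if present, else the default."""
    overrides = parameters.get(repo, {})
    if name in overrides:
        return overrides[name]
    return parameters["default"][name]


print(spoofed_parameter(parameters, "opensafely/example-study", "index_date"))
print(spoofed_parameter(parameters, "opensafely/other-study", "index_date"))
```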

Update the CSV or JSON fixtures as new columns or properties are required. The .csv.gz and .arrow files are regenerated from csv_data.csv automatically by the main script each time it is run.
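
For reference, regenerating the gzipped fixture needs only the standard library. The sketch below is not the collector's own code: the helper name regenerate_gzip and the column names are hypothetical, and the Arrow derivative is omitted because it would need a third-party library.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def regenerate_gzip(csv_path):
    """Write a gzipped copy of csv_path next to it and return the new path."""
    gz_path = csv_path.with_name(csv_path.name + ".gz")
    with csv_path.open("rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path


csv_text = "patient_id,age\n1,42\n"  # hypothetical fixture content

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "csv_data.csv"
    csv_path.write_bytes(csv_text.encode("utf-8"))
    gz_path = regenerate_gzip(csv_path)
    # Read the compressed copy back to confirm it matches the source.
    round_trip = gzip.decompress(gz_path.read_bytes()).decode("utf-8")

print(gz_path.name)
```

Deriving the compressed file from the CSV on every run means only csv_data.csv needs to be edited by hand.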

Planned Improvements

Developer docs

Please see the additional information in DEVELOPERS.md.
