Commit 7a8d8e3

Merge pull request #34 from mitre/t9-mitre-reference-medians
Add NHANES reference medians, updating behavior of sd.recenter; closes mitre#9
2 parents 7842e55 + 34bbbbf commit 7a8d8e3

10 files changed

+29698
-69
lines changed

DESCRIPTION

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
Package: growthcleanr
22
Type: Package
33
Title: Growth Measurements Cleaner
4-
Version: 1.2.3
4+
Version: 1.2.5
55
Authors@R: c(
66
person("Daymont","Carrie",,"[email protected]",c("aut","cre")),
77
person("Grundmeier","Robert",,"[email protected]","aut"),

NEWS.md

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,21 @@
11
# growthcleanr
22

3+
## [1.2.5] - 2021-02-26
4+
5+
### Changed
6+
7+
- Updated behavior of `sd.recenter` option to include new NHANES reference
8+
medians and explicit specification with "NHANES" or "derive"
9+
(https://github.com/mitre/growthcleanr/issues/9)
10+
- Switched `README.md` to be generated from `README.Rmd` w/knitr (thanks
11+
@mcanouil) (#17)
12+
- Switched to use `file.path()` more consistently in `R/growth.R`
13+
14+
### Added
15+
16+
- Added `inst/extdata/nhanes-reference-medians.csv`, reference medians for
17+
recentering derived from NHANES (described in README)
18+
319
## [1.2.4] - 2021-01-14
420

521
### Changed

R/extdata.R

Lines changed: 15 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -43,6 +43,20 @@ NULL
4343
#'
4444
NULL
4545

46+
#' NHANES reference medians
47+
#'
48+
#' Contains reference median values for default recentering, derived from NHANES
49+
#' years 2009-2018
50+
#'
51+
#' @name nhanes-reference-medians
52+
#'
53+
#' @section nhanes-reference-medians.csv:
54+
#'
55+
#' Used in function `cleangrowth()`
56+
#'
57+
#'
58+
NULL
59+
4660
#' Tanner Growth Velocity Table
4761
#'
4862
#' Part of default CDC-derived tables
@@ -177,4 +191,4 @@ NULL
177191
#' Used to test function `ext_bmiz()`
178192
#'
179193
#'
180-
NULL
194+
NULL

R/growth.R

Lines changed: 88 additions & 26 deletions
Original file line numberDiff line numberDiff line change
@@ -1604,10 +1604,24 @@ cleanbatch <- function(data.df,
16041604
#' considering excluding all measurements. Defaults to 2.
16051605
#' @param error.load.threshold threshold of percentage of excluded measurement count to included measurement
16061606
#' count that must be exceeded before excluding all measurements of either parameter. Defaults to 0.5.
1607-
#' @param sd.recenter Data frame or table with median SD-scores per day of life
1608-
#' by gender and parameter. Columns in the table must include param, sex,
1609-
#' agedays, and sd.median. If not supplied, the median values will be
1610-
#' calculated using the growth data that is being cleaned. Defaults to NA.
1607+
#' @param sd.recenter specifies how to recenter medians. May be a data frame or
1608+
#' table w/median SD-scores per day of life by gender and parameter, or "NHANES"
1609+
#' or "derive" as a character vector.
1610+
#' \itemize{
1611+
#' \item If `sd.recenter` is specified as a data set, use the data set
1612+
#' \item If `sd.recenter` is specified as "`nhanes`", use NHANES reference medians
1613+
#' \item If `sd.recenter` is specified as "`derive`", derive from input
1614+
#' \item If `sd.recenter` is not specified or `NA`:
1615+
#' \itemize{
1616+
#' \item If the input set has at least 5,000 observations, derive medians from input
1617+
#' \item If the input set has fewer than 5,000 observations, use NHANES
1618+
#' }
1619+
#' }
1620+
#'
1621+
#' If specifying a data set, columns must include param, sex, agedays, and sd.median
1622+
#' (referred to elsewhere as "modified Z-score"), and those medians will be used
1623+
#' for recentering. A summary of how the NHANES reference medians were derived is
1624+
#' available in README.md. Defaults to NA.
16111625
#' @param sdmedian.filename Name of file to save sd.median data calculated on the input dataset to as CSV.
16121626
#' Defaults to "", for which this data will not be saved. Use for extracting medians for parallel processing
16131627
#' scenarios other than the built-in parallel option.
@@ -1739,9 +1753,8 @@ cleangrowth <- function(subjid,
17391753
# recode column names to match syntactic style ("." rather than "_" in variable names)
17401754
tanner_ht_vel_path <- ifelse(
17411755
ref.data.path == "",
1742-
system.file("extdata/tanner_ht_vel.csv", package = "growthcleanr"),
1743-
paste(ref.data.path, "tanner_ht_vel.csv", sep =
1744-
"")
1756+
system.file(file.path("extdata", "tanner_ht_vel.csv"), package = "growthcleanr"),
1757+
file.path(ref.data.path, "tanner_ht_vel.csv")
17451758
)
17461759

17471760
tanner.ht.vel <- fread(tanner_ht_vel_path)
@@ -1756,16 +1769,14 @@ cleangrowth <- function(subjid,
17561769

17571770
who_max_ht_vel_path <- ifelse(
17581771
ref.data.path == "",
1759-
system.file("extdata/who_ht_maxvel_3sd.csv", package = "growthcleanr"),
1760-
paste(ref.data.path, "who_ht_maxvel_3sd.csv", sep =
1761-
"")
1772+
system.file(file.path("extdata", "who_ht_maxvel_3sd.csv"), package = "growthcleanr"),
1773+
file.path(ref.data.path, "who_ht_maxvel_3sd.csv")
17621774
)
17631775

17641776
who_ht_vel_3sd_path <- ifelse(
17651777
ref.data.path == "",
1766-
system.file("extdata/who_ht_vel_3sd.csv", package = "growthcleanr"),
1767-
paste(ref.data.path, "who_ht_vel_3sd.csv", sep =
1768-
"")
1778+
system.file(file.path("extdata", "who_ht_vel_3sd.csv"), package = "growthcleanr"),
1779+
file.path(ref.data.path, "who_ht_vel_3sd.csv")
17691780
)
17701781
who.max.ht.vel <- fread(who_max_ht_vel_path)
17711782
who.ht.vel <- fread(who_ht_vel_3sd_path)
@@ -1902,24 +1913,62 @@ cleangrowth <- function(subjid,
19021913
cat(sprintf("[%s] Re-centering data...\n", Sys.time()))
19031914

19041915
# see function definition below for explanation of the re-centering process
1905-
# returns a data table indexed by param, sex, agedays
1916+
# returns a data table indexed by param, sex, agedays. can use NHANES reference
1917+
# data, derive from input, or use user-supplied data.
19061918
if (!is.data.table(sd.recenter)) {
1907-
sd.recenter <- data.all[exclude < 'Exclude', sd_median(param, sex, agedays, sd.orig)]
1908-
if (sdmedian.filename != "") {
1909-
write.csv(sd.recenter, sdmedian.filename, row.names = F)
1919+
# Use NHANES medians if the string "nhanes" is specified instead of a data.table
1920+
# or if sd.recenter is not specified as "derive" and N < 5000.
1921+
if ((is.character(sd.recenter) & tolower(sd.recenter) == "nhanes") |
1922+
(!(is.character(sd.recenter) & tolower(sd.recenter) == "derive") & (data.all[, .N] < 5000))) {
1923+
nhanes_reference_medians_path <- ifelse(
1924+
ref.data.path == "",
1925+
system.file(file.path("extdata", "nhanes-reference-medians.csv"), package = "growthcleanr"),
1926+
file.path(ref.data.path, "nhanes-reference-medians.csv")
1927+
)
1928+
sd.recenter <- fread(nhanes_reference_medians_path)
19101929
if (!quietly)
19111930
cat(
19121931
sprintf(
1913-
"[%s] Wrote re-centering medians to %s...\n",
1914-
Sys.time(),
1915-
sdmedian.filename
1932+
"[%s] Using NHANES reference medians...\n",
1933+
Sys.time()
19161934
)
19171935
)
1936+
} else {
1937+
# Derive medians from input data
1938+
sd.recenter <- data.all[exclude < 'Exclude', sd_median(param, sex, agedays, sd.orig)]
1939+
if (!quietly)
1940+
cat(
1941+
sprintf(
1942+
"[%s] Using re-centering medians derived from input...\n",
1943+
Sys.time()
1944+
)
1945+
)
1946+
if (sdmedian.filename != "") {
1947+
write.csv(sd.recenter, sdmedian.filename, row.names = F)
1948+
if (!quietly)
1949+
cat(
1950+
sprintf(
1951+
"[%s] Wrote re-centering medians to %s...\n",
1952+
Sys.time(),
1953+
sdmedian.filename
1954+
)
1955+
)
1956+
}
19181957
}
19191958
} else {
1920-
# ensure passed-in medians are sorted correctly
1921-
setkey(sd.recenter, param, sex, agedays)
1959+
# Use specified data
1960+
if (!quietly)
1961+
cat(
1962+
sprintf(
1963+
"[%s] Using specified re-centering medians...\n",
1964+
Sys.time()
1965+
)
1966+
)
19221967
}
1968+
1969+
# ensure recentering medians are sorted correctly
1970+
setkey(sd.recenter, param, sex, agedays)
1971+
19231972
# add sd.recenter to data, and recenter
19241973
setkey(data.all, param, sex, agedays)
19251974
data.all <- sd.recenter[data.all]
@@ -1938,6 +1987,19 @@ cleangrowth <- function(subjid,
19381987
)
19391988
}
19401989

1990+
# notification: ensure awareness of small subsets in data
1991+
if (!quietly) {
1992+
year.counts <- data.all[, .N, floor(agedays / 365.25)]
1993+
if (year.counts[N < 100, .N] > 0) {
1994+
cat(
1995+
sprintf(
1996+
"[%s] Note: input data has at least one age-year with < 100 subjects...\n",
1997+
Sys.time()
1998+
)
1999+
)
2000+
}
2001+
}
2002+
19412003
# safety check: treat observations where tbc.sd cannot be calculated as missing
19422004
data.all[is.na(tbc.sd), exclude := 'Missing']
19432005

@@ -2025,22 +2087,22 @@ read_anthro <- function(path = "", cdc.only = F) {
20252087
# set correct path based on input reference table path (if any)
20262088
weianthro_path <- ifelse(
20272089
path == "",
2028-
system.file(file.path("extdata","weianthro.txt"), package = "growthcleanr"),
2090+
system.file(file.path("extdata", "weianthro.txt"), package = "growthcleanr"),
20292091
file.path(path, "weianthro.txt")
20302092
)
20312093
lenanthro_path <- ifelse(
20322094
path == "",
2033-
system.file(file.path("extdata","lenanthro.txt"), package = "growthcleanr"),
2095+
system.file(file.path("extdata", "lenanthro.txt"), package = "growthcleanr"),
20342096
file.path(path, "lenanthro.txt")
20352097
)
20362098
bmianthro_path <- ifelse(
20372099
path == "",
2038-
system.file(file.path("extdata","bmianthro.txt"), package = "growthcleanr"),
2100+
system.file(file.path("extdata", "bmianthro.txt"), package = "growthcleanr"),
20392101
file.path(path, "bmianthro.txt")
20402102
)
20412103
growth_cdc_ext_path <- ifelse(
20422104
path == "",
2043-
system.file(file.path("extdata","growthfile_cdc_ext.csv"), package = "growthcleanr"),
2105+
system.file(file.path("extdata", "growthfile_cdc_ext.csv"), package = "growthcleanr"),
20442106
file.path(path, "growthfile_cdc_ext.csv")
20452107
)
20462108
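
For review purposes, the recentering selection logic added to `cleangrowth()` in the hunk above can be restated as a small standalone function. This is a minimal sketch, not code from the package: `choose_recenter_source` and its `n.obs` argument are hypothetical names, and it uses base R `is.data.frame()` where the package checks `is.data.table()`.

```r
# Hypothetical helper (not part of growthcleanr): reports which source of
# recentering medians would be chosen for a given sd.recenter value and
# number of input observations.
choose_recenter_source <- function(sd.recenter = NA, n.obs) {
  # a supplied table always wins
  if (is.data.frame(sd.recenter)) {
    return("specified")
  }

  # TRUE only for a single, non-missing string matching `word` (case-insensitive)
  is.keyword <- function(x, word) {
    is.character(x) && length(x) == 1 && !is.na(x) && tolower(x) == word
  }

  if (is.keyword(sd.recenter, "nhanes")) {
    return("nhanes")
  }
  if (is.keyword(sd.recenter, "derive")) {
    return("derive")
  }

  # not specified (or NA): small inputs fall back to the NHANES medians
  if (n.obs < 5000) "nhanes" else "derive"
}

choose_recenter_source(NA, n.obs = 1200)
choose_recenter_source("derive", n.obs = 1200)
```

Using the scalar `&&` with an explicit `!is.na()` guard is a design choice of this sketch; the committed code uses the vectorized `&` and `|` operators.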

README.Rmd

Lines changed: 96 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -246,6 +246,17 @@ devtools::install_github("carriedaymont/growthcleanr", ref="main")
246246
Note that `ref="main"` is required; the default branch is "main", and must be
247247
referred to explicitly.
248248

249+
If you are unable to install `devtools`, a similar function is available in the
250+
`remotes` package:
251+
252+
```{r, eval = FALSE}
253+
install.packages("remotes")
254+
remotes::install_github("carriedaymont/growthcleanr", ref="main")
255+
```
256+
257+
Note that `ref="main"` is required; the default branch is "main", and must be
258+
referred to explicitly.
259+
249260
### Source-level install for developers
250261

251262
If you want to work with and potentially change the `growthcleanr` code itself,
@@ -507,8 +518,9 @@ The following options change the behavior of the growthcleanr algorithm.
507518
and included as valid measurements for cleaning.
508519

509520
- `sd.extreme` - default `25`; a very extreme value check on modified
510-
(recentered) Z-scores used as a first-pass elimination of clearly implausable
521+
(recentered) Z-scores used as a first-pass elimination of clearly implausible
511522
values, often due to misplaced decimals.
523+
512524
- `z.extreme` - default `25`; similar usage as `sd.extreme`, for absolute
513525
Z-scores.
514526

@@ -555,10 +567,23 @@ techniques.
555567
- `flag.both` - in case of two measurements with at least one beyond
556568
thresholds, flag both instead of one (as in default)
557569

558-
- `sd.recenter` - defaults to NA; data frame or table w/median SD-scores per day
559-
of life by gender and parameter. Columns must include param, sex, agedays, and
560-
sd.median (referred to elsewhere as "modified Z-score"). By default, median
561-
values will be calculated using growth data to be cleaned.
570+
- `sd.recenter` - default `NA`; specifies how to recenter medians. May be a data frame
571+
or table w/median SD-scores per day of life by gender and parameter, or "`nhanes`"
572+
or "`derive`" as a character vector.
573+
574+
- If `sd.recenter` is specified as a data set, use the data set
575+
- If `sd.recenter` is specified as "`nhanes`", use NHANES reference medians
576+
- If `sd.recenter` is specified as "`derive`", derive from input
577+
- If `sd.recenter` is not specified or `NA`:
578+
- If the input set has at least 5,000 observations, derive medians from input
579+
- If the input set has fewer than 5,000 observations, use NHANES
580+
581+
If specifying a data set, columns must include param, sex, agedays, and sd.median
582+
(referred to elsewhere as "modified Z-score"), and those medians will be used for
583+
centering. This data set must include a row for every ageday present in the dataset
584+
to be cleaned; the NHANES reference medians include a row for every ageday in the
585+
range (731-7305 days). A summary of how the NHANES reference medians were derived is
586+
below under [NHANES reference data](#nhanes).
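
The column and coverage requirements above can be checked before calling `cleangrowth()`. The following is a minimal sketch in base R, assuming plain data frames; `check_recenter_table` is a hypothetical helper, not part of `growthcleanr`, and checking every param/sex/agedays combination (rather than agedays alone) is an assumption based on how the medians table is keyed.

```r
# Hypothetical helper (not part of growthcleanr): validates a user-supplied
# recentering table and returns the param/sex/agedays combinations present
# in the input data that have no matching median row.
check_recenter_table <- function(medians, input) {
  required <- c("param", "sex", "agedays", "sd.median")
  missing.cols <- setdiff(required, names(medians))
  if (length(missing.cols) > 0) {
    stop("missing columns: ", paste(missing.cols, collapse = ", "))
  }

  keys <- c("param", "sex", "agedays")
  needed <- unique(as.data.frame(input)[, keys, drop = FALSE])
  have <- unique(as.data.frame(medians)[, keys, drop = FALSE])
  have$.found <- TRUE

  # left join: rows of `needed` without a median get NA in .found
  merged <- merge(needed, have, by = keys, all.x = TRUE)
  merged[is.na(merged$.found), keys, drop = FALSE]
}

# illustrative data only
medians <- data.frame(param = "HEIGHTCM", sex = 0,
                      agedays = c(731, 732), sd.median = c(0.1, 0.2))
input <- data.frame(param = "HEIGHTCM", sex = 0, agedays = c(731, 733))
check_recenter_table(medians, input)
```

An empty result means every combination in the input is covered.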
562587

563588
### Operational options
564589

@@ -959,20 +984,77 @@ for `cleangrowth()`.
959984

960985
## <a name="related"></a>Related tools
961986

962-
The CDC provides a
963-
[SAS Program for the 2000 CDC Growth Charts](https://www.cdc.gov/nccdphp/dnpao/growthcharts/resources/sas.htm)
964-
which can also be used to identify biologically implausible values using a different
965-
approach, as also implemented for `growthcleanr` in the function `ext_bmiz()`, described
966-
above under [Computing BMI percentiles and Z-scores](#bmi).
987+
The CDC provides a [SAS Program for the 2000 CDC Growth
988+
Charts](https://www.cdc.gov/nccdphp/dnpao/growthcharts/resources/sas.htm) which can
989+
also be used to identify biologically implausible values using a different approach, as
990+
also implemented for `growthcleanr` in the function `ext_bmiz()`, described above under
991+
[Computing BMI percentiles and Z-scores](#bmi).
967992

968993
[GrowthViz](https://github.com/mitre/GrowthViz) provides insights into how
969-
`growthcleanr` assesses data, packaged in a Jupyter notebook. It ships with the
970-
same `syngrowth` synthetic example dataset as `growthcleanr`, with results
971-
included.
994+
`growthcleanr` assesses data, packaged in a Jupyter notebook. It ships with the same
995+
`syngrowth` synthetic example dataset as `growthcleanr`, with results included.
996+
997+
## <a name="nhanes"></a>NHANES reference medians
998+
999+
`growthcleanr` [releases](https://github.com/carriedaymont/growthcleanr/releases) up to
1000+
1.2.4 offered two options for recentering medians, either the default of deriving
1001+
medians from the input set, or supplying an externally-defined set of medians. These
1002+
left out an option for researchers working with either small datasets or with data
1003+
which might otherwise not be representative of the population, as deriving medians from
1004+
the input set in those cases might be problematic. To provide a standard default
1005+
reference to address these latter cases, a set of medians was derived from the
1006+
[National Health and Nutrition Examination
1007+
Survey](https://wwwn.cdc.gov/nchs/nhanes/Default.aspx) (NHANES). A summary of that
1008+
process is below. As of release 1.2.5, the default behavior is:
1009+
1010+
- If `sd.recenter` is specified as a data set, use the data set
1011+
- If `sd.recenter` is specified as `nhanes`, use NHANES
1012+
- If `sd.recenter` is specified as `derive`, derive from input
1013+
- If `sd.recenter` is not specified or `NA`:
1014+
- If the input set has at least 5,000 observations, derive medians from input
1015+
- If the input set has fewer than 5,000 observations, use NHANES
1016+
1017+
With the verbose `cleangrowth()` option `quietly = FALSE`, the recentering medians
1018+
approach used will be noted in the output. If the input set has fewer than 100
1019+
observations for any age-year, this will also be noted in the output.
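
The age-year check described above can be reproduced outside of `cleangrowth()`. This is a minimal base R sketch of the same count; `small_age_years` and its `min.n` argument are hypothetical names, and the package itself computes the counts with `data.table`.

```r
# Hypothetical helper (not part of growthcleanr): returns the age-years
# (floor of agedays / 365.25) that have fewer than `min.n` observations.
# Age-years with no observations at all are not reported.
small_age_years <- function(agedays, min.n = 100) {
  counts <- table(floor(agedays / 365.25))
  as.integer(names(counts)[as.vector(counts) < min.n])
}

# illustrative data: 150 observations in age-year 2, 20 in age-year 3
agedays <- c(rep(800, 150), rep(1200, 20))
small_age_years(agedays)
```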
1020+
1021+
The NHANES reference medians are based primarily on data from NHANES 2009-2010 through
1022+
2017-2018, including approximately 39,000 heights/lengths and weights from children and
1023+
adolescents between the ages of 0 months and <240 months. The
1024+
[L, M, and S
1025+
parameters](https://www.cdc.gov/growthcharts/percentile_data_files.htm) for the [CDC
1026+
growth charts](https://www.cdc.gov/nccdphp/dnpao/growthcharts/resources/sas.htm) were
1027+
used as the reference to calculate weight and height SD scores for the NHANES 2009-2010
1028+
through 2017-2018 samples. Based on the distributions of age-days in children at 0
1029+
months, an age adjustment was made based on the median number of days among these
1030+
infants. This adjustment was made after consultation with the National Center for
1031+
Health Statistics confirmed that a general assumption of ages occurring at the midpoint
1032+
of the indicated integer month of age did not apply to children recorded as 0 months,
1033+
and uses 0.75 months instead.
1034+
1035+
Weights were supplemented with a random sample of birthweights from NCHS's [Vital
1036+
Statistics Natality Birth
1037+
Data](https://www.nber.org/research/data/vital-statistics-natality-birth-data) for 2018. These had sample weights assigned so that the sum of the sample weights for the
1038+
sample equalled the sum of the sample weights for each month for infants in NHANES, as
1039+
NHANES is a multi-stage complex survey. The reference data was then smoothed using the
1040+
`svysmooth()` function in the R
1041+
[`survey`](https://cran.r-project.org/web/packages/survey/index.html) package to
1042+
estimate the weight and height SD scores for each day up to 7,305 days, with a
1043+
bandwidth chosen to balance between over- and under-fitting, and interpolation between
1044+
the estimates from this function was used to obtain an estimate for each day of age.
1045+
Predictions from a regression model fit to smoothed height SDs between 23 and 365 days
1046+
(the youngest child in NHANES had an estimated age in days of 23) were used to extend
1047+
smoothed height SD scores to children between 1 and 22 days of age.
9721048

9731049
## <a name="changes"></a>Changes
9741050

975-
For a detailed history of released versions, see `NEWS.md`.
1051+
For a detailed history of released versions, see `NEWS.md`. Tagged releases, starting
1052+
with 1.2.3 in January 2021, are listed [at
1053+
GitHub](https://github.com/carriedaymont/growthcleanr/releases).
1054+
1055+
In release 1.2.5 in February 2021, the default behavior of recentering medians
1056+
changed as described in [NHANES reference medians](#nhanes). To confirm prior
1057+
results based on derived medians, specify the `sd.recenter` option "derive".
9761058

9771059
In release 1.2.4 in January 2021, an update was made to the WHO height velocity 3sd
9781060
files to correct a small number of errors:
