Documentation

Standard procedure for dealing with p-values and confidence intervals

Most datasets come with pre-computed effect sizes and standard errors. Some (especially databases of clinical trials and large sets of scraped studies) report p-values instead. In clinical trial databases, results are entered by the investigators and the data entry fields are not standardised, so these sources need more extensive clean-up and a procedure for deriving z-values.

Cleaning up data

We clean up confidence interval levels by setting values above 100 or below 0 to missing, and by treating values below 1 as proportions rather than percentages, multiplying them by 100. This means that, for example, 0.95 is treated as a 95% confidence interval. For missing values in the EU CTR database, we assume 95%.
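A minimal Python sketch of this clean-up rule is given below. The function name `clean_ci_level` and the `default` parameter are hypothetical, and applying the default to out-of-range values (not only to values that were missing to begin with) is an assumption about the order of the steps.

```python
import math
from typing import Optional


def clean_ci_level(level: Optional[float], default: Optional[float] = None) -> Optional[float]:
    """Return the CI level as a percentage, or `default` if it is unusable."""
    if level is None or math.isnan(level):
        return default
    if level < 0 or level > 100:
        # Out of range: set to missing (assumption: the default then applies).
        return default
    if level < 1:
        # Treated as a proportion, e.g. 0.95 becomes 95.
        return level * 100
    return level
```

For EU CTR rows this would be called as `clean_ci_level(x, default=95.0)`; for other sources the default stays `None`.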

In clinical trial datasets we classify the effect measure by text matching into a small set of classes (Mean Difference, Odds Ratio, Risk Ratio, Hazard Ratio, Geometric Ratio, Risk Difference, Difference in Percentages, Ratio/Other Ratio, and Other). This string-based mapping is an arbitrary choice. The unstandardised field in these datasets takes many hundreds of distinct values.
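The sketch below illustrates what such a text-matching classifier could look like. The class names come from the list above, but the regular expressions and their order are purely illustrative assumptions, not the patterns actually used.

```python
import re

# (class, pattern) pairs, checked in order; specific classes come before generic ones.
# All patterns here are hypothetical examples.
_MEASURE_RULES = [
    ("Hazard Ratio", r"hazard"),
    ("Odds Ratio", r"odds"),
    ("Risk Ratio", r"risk\s+ratio|relative\s+risk"),
    ("Geometric Ratio", r"geometric"),
    ("Risk Difference", r"risk\s+difference"),
    ("Difference in Percentages", r"difference.*(percent|%)"),
    ("Mean Difference", r"mean\s+difference|difference\s+in\s+means"),
    ("Ratio/Other Ratio", r"ratio"),
]


def classify_measure(label):
    """Map a free-text effect measure label to one of the fixed classes."""
    text = (label or "").lower()
    for measure_class, pattern in _MEASURE_RULES:
        if re.search(pattern, text):
            return measure_class
    return "Other"
```

Because the first matching rule wins, a generic pattern such as `ratio` has to come last, otherwise it would absorb the more specific ratio classes.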

Confidence intervals

When working with CIs, ratio measures were analysed on the log scale (if and only if the lower bound, upper bound, and point estimate were all strictly positive) and non-ratio measures were always kept on the raw scale.
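As a rough illustration, the scale choice could be written as follows. The function name and the `is_ratio` flag are hypothetical, and keeping a ratio measure on the raw scale when the positivity condition fails is an assumption, since the text does not say what happens to such rows.

```python
import math


def to_analysis_scale(estimate, lower, upper, is_ratio):
    """Return (estimate, lower, upper, scale) on the scale used for analysis."""
    if is_ratio and estimate > 0 and lower > 0 and upper > 0:
        return math.log(estimate), math.log(lower), math.log(upper), "log"
    # Non-ratio measures always stay on the raw scale.
    # Assumption: ratio measures that fail the positivity check do too.
    return estimate, lower, upper, "raw"
```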

For z-statistics derived from confidence intervals, we computed the critical value from the stated CI percentage and the reported number of sides (if missing, we defaulted to a two-sided interval). To avoid undefined critical values (e.g. when the stated level is 100% or extremely close to it), we bounded the implied \(\alpha\) below at a small value. We then computed two one-sided standard error estimates, one from the upper bound and one from the lower bound (the distance from the point estimate to that bound divided by the critical value), and used their average when both were finite.
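The following sketch shows one way to implement this step with the Python standard library. Several details are assumptions that the text does not specify: the floor `ALPHA_FLOOR = 1e-12`, the additional upper clamp on \(\alpha\), falling back to the single finite standard error when only one bound is usable, and a null value of 0 on the analysis scale (so that \(z\) is the estimate divided by the standard error).

```python
import math
from statistics import NormalDist

ALPHA_FLOOR = 1e-12  # hypothetical bound; the text only says "a small value"


def critical_value(ci_percent, sides=2):
    """Normal critical value for a stated CI level and number of sides."""
    alpha = 1.0 - ci_percent / 100.0
    # Keep alpha away from 0 (and, as an extra guard, from 1).
    alpha = min(max(alpha, ALPHA_FLOOR), 1.0 - ALPHA_FLOOR)
    # Anything other than an explicit 1 is treated as two-sided.
    tail = alpha if sides == 1 else alpha / 2.0
    return -NormalDist().inv_cdf(tail)


def z_from_ci(estimate, lower, upper, ci_percent=95.0, sides=2):
    """Return (z, se); inputs must already be on the analysis scale."""
    crit = critical_value(ci_percent, sides)
    if not crit > 0:
        return math.nan, math.nan
    candidates = [(upper - estimate) / crit, (estimate - lower) / crit]
    finite = [s for s in candidates if math.isfinite(s)]
    if not finite:
        return math.nan, math.nan
    # Average of both estimates; assumption: use the single one if only one is finite.
    se = sum(finite) / len(finite)
    if se <= 0:
        return math.nan, se
    return estimate / se, se  # assumes the null value is 0 on this scale
```

For example, `z_from_ci(1.0, 0.02, 1.98)` gives a standard error close to 0.5 and a z-statistic close to 2.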

We computed a simple symmetry diagnostic for the confidence interval on the chosen scale, defined as the smaller of the two half-widths divided by the larger. We treated intervals as “symmetric enough” only if this ratio exceeded 0.8. This threshold is arbitrary and affects when we trust the CI-derived z-statistic.
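A short sketch of the symmetry diagnostic, assuming the estimate and bounds are already on the chosen scale. The function names are hypothetical, and returning `None` when a half-width is non-finite or non-positive is an assumption about how unusable intervals are represented.

```python
import math

SYMMETRY_THRESHOLD = 0.8  # the threshold stated in the text


def ci_symmetry(estimate, lower, upper):
    """Smaller half-width divided by the larger one, or None if undefined."""
    lower_half = estimate - lower
    upper_half = upper - estimate
    if not (math.isfinite(lower_half) and math.isfinite(upper_half)):
        return None
    if lower_half <= 0 or upper_half <= 0:
        return None
    return min(lower_half, upper_half) / max(lower_half, upper_half)


def is_symmetric_enough(symmetry):
    """True only when the ratio is available and strictly exceeds the threshold."""
    return symmetry is not None and symmetry > SYMMETRY_THRESHOLD
```

A perfectly symmetric interval scores 1, while an interval whose upper half-width is twice its lower half-width scores 0.5 and fails the check.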

We also flagged potentially non-Wald intervals by text matching on the parameter-type column, where available, searching for keywords such as median, Hodges, posterior/Bayes, exact, Fieller, bootstrap, and permutation. Any row flagged this way was treated as unreliable for CI-based inference, regardless of its numeric properties.
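A keyword flag of this kind can be sketched with a single case-insensitive regular expression. The keyword list is taken from the text, which presents it as examples ("such as"), so the real list may be longer. Matching on substrings rather than whole words is an assumption.

```python
import re

# Keywords listed in the text; substring matching is an assumption.
_NON_WALD = re.compile(
    r"median|hodges|posterior|bayes|exact|fieller|bootstrap|permutation",
    flags=re.IGNORECASE,
)


def is_flagged_non_wald(param_type):
    """True if the free-text parameter type suggests a non-Wald interval."""
    if not param_type:
        return False  # column missing or empty: nothing to flag
    return _NON_WALD.search(param_type) is not None
```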

p-values

The default approach, used for several datasets that report p-values, especially those scraped from PubMed/Medline, is to treat p-values as two-sided and take the upper-tail standard normal quantile corresponding to \(p/2\). In EU CTR, when the result is marked as one-sided, we use the corresponding one-sided normal quantile instead; missing sidedness is treated as two-sided. ClinicalTrials.gov does not provide a separate p-value sidedness field, so its p-values are treated as two-sided.
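A minimal sketch of the conversion, using only the Python standard library. The function name is hypothetical, and returning NaN for p-values outside the open unit interval (including a reported p of exactly 0) is an assumption, since the text does not describe how such values are handled.

```python
import math
from statistics import NormalDist


def z_from_p(p, sides=2):
    """Upper-tail standard normal quantile for a reported p-value."""
    if p is None or math.isnan(p):
        return math.nan
    # Missing or unrecognised sidedness is treated as two-sided.
    tail = p if sides == 1 else p / 2.0
    if not 0.0 < tail < 1.0:
        return math.nan  # assumption: no finite quantile is stored here
    # -inv_cdf(tail) equals inv_cdf(1 - tail) but keeps precision for tiny p.
    return -NormalDist().inv_cdf(tail)
```

With this definition a two-sided p of 0.05 and a one-sided p of 0.025 both map to a z-statistic of about 1.96.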

When a source reports t-statistics and degrees of freedom, we first derive the two-sided p-value from the t distribution as twice the upper-tail probability beyond the absolute observed t-statistic. We then store the signed normal-equivalent statistic by applying the sign of the t-statistic to the upper-tail standard normal quantile corresponding to half that two-sided p-value.
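In symbols, writing \(T_\nu\) for a t-distributed variable with \(\nu\) degrees of freedom (notation introduced here only for illustration) and \(\Phi\) for the standard normal distribution function, the conversion described above is:

\[
p = 2\,\Pr\left(T_\nu > |t|\right), \qquad z = \operatorname{sign}(t)\,\Phi^{-1}\left(1 - \frac{p}{2}\right)
\]

Because the t distribution has heavier tails than the normal, the resulting \(|z|\) is smaller than \(|t|\) for finite \(\nu\) and approaches it as \(\nu\) grows.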

For the clinical trial databases (ClinicalTrials.gov and EU CTR) we sometimes found ambiguity in how p-values were recorded. Some significant results are reported with p-values close to 1, suggesting that the authors reported the cumulative distribution function value, \(\Phi(z)\), rather than a tail probability. Therefore, for each row we also calculated the alternative p-value \(2\cdot\min\{p,1-p\}\). When a CI-derived z-statistic was available, we chose whichever of the two p-values led to a z-value closest to the CI-derived z.
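The choice between the reported and the alternative p-value could be sketched as below. The function names are hypothetical. Comparing absolute z-values (because the p-derived z carries no sign at this stage) and keeping the reported p-value when no CI-derived z is available are both assumptions.

```python
import math
from statistics import NormalDist


def _abs_z(p):
    """Unsigned z for a two-sided p-value; infinite for p <= 0, zero for p >= 1."""
    if p <= 0.0:
        return math.inf
    if p >= 1.0:
        return 0.0
    return -NormalDist().inv_cdf(p / 2.0)


def resolve_p(p, z_ci=None):
    """Pick the reported p or 2*min(p, 1-p), whichever agrees better with z_ci."""
    if z_ci is None or not math.isfinite(z_ci):
        return p  # assumption: nothing to compare against, keep p as reported
    p_alt = 2.0 * min(p, 1.0 - p)
    return min((p, p_alt), key=lambda q: abs(_abs_z(q) - abs(z_ci)))
```

For instance, a reported p of 0.99 next to a CI-derived z of 2.3 resolves to the alternative value of about 0.02, whose z of roughly 2.33 matches the interval far better than the z of about 0.01 implied by 0.99.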

To avoid dropping rows, we retained unsigned z-statistics when the sign could not be determined. If the sign of the effect was known, we applied it to the “p-derived” z; otherwise we used the absolute value.

Choice of a single z-value when multiple candidates available

When \(z\) can be derived from both CIs and p-values, we choose a single “best” z-statistic per row as follows. We prefer the CI-derived \(z\) when the interval looks like a plausible Wald interval on the chosen scale, meaning: (1) finite \(z\) and standard error, (2) positive standard error, (3) not flagged as non-Wald by keyword search in measure label, and (4) either missing symmetry information or symmetry above 0.8. Otherwise we use the p-derived \(z\).
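The selection rule can be summarised in a few lines of Python. The function name, argument names, and the returned source label are hypothetical; missing values are assumed to be represented as `None` or NaN.

```python
import math


def _is_finite(x):
    return x is not None and math.isfinite(x)


def choose_z(z_ci, se_ci, flagged_non_wald, symmetry, z_p):
    """Return (z, source) where source is "ci" or "p"."""
    symmetry_missing = symmetry is None or math.isnan(symmetry)
    ci_ok = (
        _is_finite(z_ci) and _is_finite(se_ci)       # (1) finite z and SE
        and se_ci > 0                                # (2) positive SE
        and not flagged_non_wald                     # (3) no non-Wald keyword
        and (symmetry_missing or symmetry > 0.8)     # (4) symmetry missing or > 0.8
    )
    return (z_ci, "ci") if ci_ok else (z_p, "p")
```

Note that missing symmetry information does not disqualify the CI-derived value, whereas a computed symmetry of 0.8 or lower does.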