validation_correction
A Python package for measurement error correction in regression analysis using validation data.
Installation
pip install validation_correction
Usage
The package provides a simple interface for correcting measurement error in both linear and logistic regression using validation data. The correction is implemented using a bootstrap procedure that:
- Resamples the validation data to estimate misclassification probabilities
- Applies these probabilities to a bootstrap sample of the research data
- Repeats this process to obtain valid confidence intervals
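The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the package's actual implementation: it assumes a binary predictor x with a misclassified version w, simulated data, and a simple rescaling correction (the naive slope divided by the difference in P(x = 1 | w) between the two w groups). The helpers resample, corrected_slope, and simulate are hypothetical names introduced only for this example.

```python
import random
import statistics

def resample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def corrected_slope(research, validation):
    """Slope of y on binary x from (y, w) pairs plus validation (w, x) pairs."""
    # P(x = 1 | w = k), estimated from the validation pairs
    p = {}
    for k in (0, 1):
        xs = [x for w, x in validation if w == k]
        p[k] = sum(xs) / len(xs)
    # Naive slope: difference in mean y between the two observed w groups
    y1 = statistics.fmean(y for y, w in research if w == 1)
    y0 = statistics.fmean(y for y, w in research if w == 0)
    # Rescale by how much the true x actually differs between the w groups
    return (y1 - y0) / (p[1] - p[0])

def simulate(n, rng, flip=0.2, slope=2.0):
    """Simulated (y, w, x) rows: w equals x except for a `flip` share of rows."""
    rows = []
    for _ in range(n):
        x = int(rng.random() < 0.5)
        w = 1 - x if rng.random() < flip else x
        y = 1.0 + slope * x + rng.gauss(0.0, 1.0)
        rows.append((y, w, x))
    return rows

rng = random.Random(0)
research = [(y, w) for y, w, _ in simulate(2000, rng)]    # x is not observed here
validation = [(w, x) for _, w, x in simulate(500, rng)]   # both w and x observed

# Resample both datasets, re-estimate, and repeat
estimates = []
for _ in range(200):
    estimates.append(
        corrected_slope(resample(research, rng), resample(validation, rng))
    )

# Percentile interval: cut points at 2.5%, 5%, ..., 97.5%
cuts = statistics.quantiles(estimates, n=40)
lo, hi = cuts[0], cuts[-1]
print(f"corrected slope 95% CI: [{lo:.2f}, {hi:.2f}]")
```

Resampling the validation data inside the loop is what lets the interval reflect uncertainty in the estimated misclassification probabilities as well as sampling variation in the research data.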
Linear Regression with Mismeasured Predictor
import pandas as pd
from validation_correction import validation_correction

research_data = pd.read_csv("research_data.csv")
validation_data = pd.read_csv("validation_data.csv")

# Corrected model: w is the mismeasured version of x; z is measured without error
result = validation_correction.ols(
    formula="y ~ w || x + z",
    data=research_data,
    val_data=validation_data,
    bootstrap=True,
    n_boots=1000,
)

# Naive model: standard formula, no validation data
naive_result = validation_correction.ols(
    formula="y ~ w + z",
    data=research_data,
    val_data=None,
)

print(result)

# Compare naive and corrected estimates, then inspect the bootstrap distributions
validation_correction.plot_coefficients(naive_result, result)
validation_correction.plot_bootstrap_distributions()
Logistic Regression with Mismeasured Outcome
# u is the mismeasured version of the outcome y
result = validation_correction.logit(
    formula="u||y ~ x + z",
    data=research_data,
    val_data=validation_data,
    bootstrap=True,
)

print(result)
Formula Specification
The package uses a special formula syntax to specify the relationship between true and mismeasured variables:
- For mismeasured predictors:
  - Format: y ~ w || x + z
  - Where x is the true variable and w is its mismeasured version
  - Additional covariates (z) are measured without error
- For mismeasured outcomes (not yet implemented):
  - Format: u || y ~ x + z
  - Where y is the true outcome and u is its mismeasured version
  - Predictors go on the right side of the ~
- For naive regression (no correction):
  - Standard formula format: y ~ w + z
  - No || operator needed
  - Set val_data=None
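To make the syntax concrete, here is a small sketch of how such a formula could be split into its parts. It is an illustration only: parse_formula and the keys of the dictionary it returns are hypothetical and are not taken from the package, whose own parser may behave differently (for example around whitespace or interaction terms).

```python
def parse_formula(formula):
    """Split 'y ~ w || x + z' style formulas into named parts (hypothetical helper)."""
    lhs, rhs = (side.strip() for side in formula.split("~", 1))
    spec = {
        "outcome": lhs,
        "predictors": [],
        "mismeasured": None,  # the error-prone version (w or u)
        "true": None,         # the true variable (x or y)
        "role": None,         # "predictor", "outcome", or None for a naive formula
    }

    # Left side: 'u || y' marks a mismeasured outcome
    if "||" in lhs:
        proxy, true = (s.strip() for s in lhs.split("||", 1))
        spec.update(outcome=true, mismeasured=proxy, true=true, role="outcome")

    # Right side: terms are separated by '+'; 'w || x' marks a mismeasured predictor
    for term in rhs.split("+"):
        term = term.strip()
        if "||" in term:
            proxy, true = (s.strip() for s in term.split("||", 1))
            spec.update(mismeasured=proxy, true=true, role="predictor")
            spec["predictors"].append(true)
        else:
            spec["predictors"].append(term)
    return spec

print(parse_formula("y ~ w || x + z"))
print(parse_formula("u || y ~ x + z"))
print(parse_formula("y ~ w + z"))
```

In this sketch a naive formula is recognised simply by the absence of ||, which matches the convention of passing val_data=None for uncorrected fits.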
Visualization
The package provides two types of visualizations:
- Coefficient Comparison Plot:
  validation_correction.plot_coefficients(naive_result, corrected_result)
  - Shows point estimates and confidence intervals for both naive and corrected models
  - Useful for comparing the magnitude and direction of bias
- Bootstrap Distribution Plot:
  validation_correction.plot_bootstrap_distributions()
  - Shows the distribution of coefficient estimates from bootstrap samples
  - Includes 95% confidence interval markers
  - Must run regression with bootstrap=True first
Bootstrap confidence intervals:
- Uses percentile method (2.5th and 97.5th percentiles)
- Accessible via result['[0.025]'] and result['[0.975]']
- Number of bootstrap iterations controlled by n_boots parameter
- Bootstrap is required for measurement error correction
Data Requirements
- Main dataset (data): Must contain all variables in the formula
- Validation dataset (val_data): Must contain both the true and mismeasured versions of the relevant variable
- Both datasets should be pandas DataFrames
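A quick pre-flight check can catch a missing column before a long bootstrap run. The sketch below is not part of the package: missing_columns is a hypothetical helper, and it works on plain lists of names so that data.columns or val_data.columns from a pandas DataFrame can be passed in directly.

```python
def missing_columns(required, available):
    """Return the required names that are absent from `available`, in order."""
    present = set(available)
    return [name for name in required if name not in present]

# Column names as they might come from val_data.columns.
# This validation set deliberately lacks the true variable x.
validation_columns = ["w", "z"]

# The validation data needs both the mismeasured (w) and true (x) versions
print(missing_columns(["w", "x"], validation_columns))  # ['x']
```

The same helper could be applied to the main dataset with whichever variable names the formula requires there.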
License
This project is licensed under the MIT License - see the LICENSE file for details.