Imputation deals with how we process missing values in our dataset.
If values are missing at random, we can assume a generative model form for
- Simple imputation replaces missing values with the featureโs average or majority.
- Regressed imputation trains a regression model for each feature using non-missing data points, then fills in missing values with predictions. The model learns with ๐ Expectation Maximization: use current values to find regression model, use regression model to fill in missing values, use new values to train a better model, and so on.
However, most data arenโt missing at random; in this case, itโs better to add indicator features for whether thereโs a missing value as well as imputing with one of the strategies above. For a categorical feature, treat missing as its own class.
Info
If there are many categories, we can perform dimensionality reduction with clustering, PCA, or autoencoders.