A Comparison of Imputation Methods for Categorical Data

Citation Data: SSRN, ISSN: 1556-5068

Publication Year: 2023

Metrics Details

Usage
319
- Abstract Views
  254
- Downloads
  65
Ratings
- Download Rank
  702,318
  - SSRN
    702,318

Article Description

Objectives: Missing data are commonplace in clinical databases, which are increasingly used for research. These databases contain mainly categorical variables, yet it remains unclear which imputation method is best for categorical data.

Materials and methods: We used data extracted from paper-based maternal health records from Kawempe National Referral Hospital, Uganda. We compared five imputation methods for categorical data in an empirical analysis: Mode, K-Nearest Neighbors (KNN), Random Forest (RF), Sequential Hot-Deck (SHD), and Multiple Imputation by Chained Equations (MICE). In the first approach, random missing data were injected into the complete dataset at proportions from 5% to 50%. The missing values were imputed by each of the five methods, which were then compared by precision score. In the second approach, the complete dataset was split into training and testing sets, and random missing data (5% to 50%) were injected into the training set only. The imputation methods were then compared by the predictive accuracy for the outcome variable of four classifiers evaluated on a single testing set. The consistency of performance of the imputation methods was assessed with Kendall’s W test.

Results: KNN imputation had the highest precision score at all levels of missing data (Kendall’s W = 0.853, p = 0.0000842). However, the methods performed differently across the proportions of missing data in the four classifiers.

Conclusions: KNN imputation is the best method for predicting missing values in categorical variables. No single imputation method yields the highest predictive accuracy at all proportions of missing data.
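
The first comparison approach described in the abstract (inject random missing values into a complete dataset, impute them, then score the imputed cells against the known truth) can be sketched as follows. This is a minimal illustration and not the authors' code. It covers only the Mode method, the toy records and column names are hypothetical, and the score used here (the share of masked cells recovered exactly) is an assumption, since the abstract does not define its precision score.

```python
import random
from collections import Counter


def inject_missing(rows, proportion, rng):
    """Copy `rows` and set a random `proportion` of all cells to None.

    Returns the damaged copy and the list of (row_index, column) cells masked.
    """
    cells = [(i, col) for i, row in enumerate(rows) for col in row]
    masked = rng.sample(cells, round(proportion * len(cells)))
    damaged = [dict(row) for row in rows]
    for i, col in masked:
        damaged[i][col] = None
    return damaged, masked


def mode_impute(rows):
    """Fill each None with the most frequent observed value of its column."""
    columns = list(rows[0].keys())
    modes = {}
    for col in columns:
        observed = [row[col] for row in rows if row[col] is not None]
        # If a column has no observed values at all, leave its gaps as None.
        modes[col] = Counter(observed).most_common(1)[0][0] if observed else None
    return [
        {col: (modes[col] if row[col] is None else row[col]) for col in columns}
        for row in rows
    ]


def recovery_rate(truth, imputed, masked):
    """Share of masked cells whose imputed value equals the true value.

    Assumed stand-in for the paper's precision score.
    """
    if not masked:
        return 0.0
    hits = sum(truth[i][col] == imputed[i][col] for i, col in masked)
    return hits / len(masked)


# Hypothetical complete dataset with two categorical variables.
rng = random.Random(0)
complete = [
    {
        "delivery": rng.choices(["vaginal", "caesarean"], weights=[8, 2])[0],
        "referred": rng.choices(["no", "yes"], weights=[7, 3])[0],
    }
    for _ in range(200)
]

for proportion in (0.05, 0.25, 0.50):
    damaged, masked = inject_missing(complete, proportion, rng)
    score = recovery_rate(complete, mode_impute(damaged), masked)
    print(f"{proportion:.0%} missing: {score:.3f} of masked cells recovered")
```

Swapping `mode_impute` for another imputer with the same input and output shape would allow the kind of side-by-side comparison the abstract describes, with the same masked cells scored for every method.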

Bibliographic Details

DOI: 10.2139/ssrn.4574180

SSRN ID: 4574180

URL ID: http://www.scopus.com/inward/record.url?partnerID=HzOxMe3b&scp=85172435303&origin=inward; http://dx.doi.org/10.2139/ssrn.4574180; https://dx.doi.org/10.2139/ssrn.4574180; https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4574180; https://ssrn.com/abstract=4574180

AUTHOR(S)

Shaheen M.Z. Memon; Ignace H. Kabano; Robert Wamala

PUBLISHER(S)

Elsevier BV

TAG(S)

Multidisciplinary; Imputation; categorical variables; precision score; single imputation; multiple imputation
