Distinguishing the Forest from the TREES: A Comparison of Tree-Based Data Mining Methods
By Richard A. Derrig, Louise A. Francis
Decision trees, also referred to as classification and regression trees or C&RT, are among the most commonly used data mining techniques. Several new decision tree methods are based on ensembles or networks of trees and carry names like TreeNet and Random Forest. Viaene et al. compared several data mining procedures, including tree methods and logistic regression, for modeling expert opinion of fraud/no fraud using a small fixed data set of fraud indicators or “red flags.” They found that simple logistic regression matched expert opinion on fraud/no fraud as well as the more sophisticated procedures did. In this paper we introduce some publicly available regression tree approaches and explain how they are used to model four proxies for fraud in insurance claim data. We find that all of the methods extract some explanatory value, or lift, from the available variables, with significant differences in fit among the methods and among the four targets. All modeling outcomes are compared to logistic regression, as in Viaene et al., and some model/software combinations do significantly better than the logistic model.
Keywords: Fraud, data mining, ROC curve, claim investigation, decision trees