Note: The work below was originally published in March 2017 as my Insight Data Science project.
As an Insight Data Science Fellow, I worked on a consulting project for an online company that helps its users solve complex financial questions with tutorials, tools, and calculators. This financial company had recently partnered with a loan company and was advertising for them on its site. To increase ROI for its advertising partner, the finance company needed to drive interested users to sign up for a specific product (in this case, a loan). My task, therefore, was to determine which factors best predict whether a user will click on a loan web advertisement and submit their contact information (a lead converter) versus not (a non-converter). The company will use these insights, provided in a written report, to target users who are more likely to convert, thereby increasing the number of converters and the ROI for its advertising partner.
One month’s worth of online user data was available, beginning when the advertisement was added to the finance company’s website. The data was organized by event type, i.e., the kind of action occurring on a webpage: loading a page, submitting information to a calculator, refreshing a loan calculator, and so on. Each online user received a unique identification code. Along with these event types, the data included user location, timestamp, device, browser, operating system, and webpage information such as referring page, search engine used, and type of page visited (loan, tax, property tax, affordability, and closing pages, among others). From these variables, I engineered additional features such as weekday, time of day, and total number of events per user. In all, there were approximately 2.7 million actions performed by over 450,000 users.
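The analysis itself was done in R, but the feature engineering step can be sketched in pure Python. The event names, timestamp format, and helper function below are hypothetical, chosen only to illustrate deriving weekday, hour, and per-user event counts from a raw event log:

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw event log: (user_id, event_type, ISO timestamp).
events = [
    ("u1", "page_load",        "2017-03-06T09:15:00"),
    ("u1", "calculator_input", "2017-03-06T09:17:30"),
    ("u2", "page_load",        "2017-03-08T21:05:00"),
]

def engineer_features(events):
    """Derive weekday, hour, and per-user event counts from raw events."""
    counts = Counter(uid for uid, _, _ in events)  # total events per user
    rows = []
    for uid, etype, ts in events:
        t = datetime.fromisoformat(ts)
        rows.append({
            "user_id": uid,
            "event_type": etype,
            "weekday": t.strftime("%A"),   # e.g. "Monday"
            "hour": t.hour,
            "user_event_count": counts[uid],
        })
    return rows

features = engineer_features(events)
```

Each derived row can then be joined back to the user's conversion label for modeling.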
First, I statistically compared converters and non-converters on these features using t-tests and chi-square tests. These tests let me explore the data further and identify features likely to be useful in subsequent predictive models. Overall, there were significant differences between converters and non-converters on features such as browser type, device, operating system, day of the week, and the number of specific event types. For example, converters performed significantly more events such as inputting information into a calculator, refreshing a loan rates table, and clicking on a loan offer, and performed more actions overall than non-converters. One can therefore infer that the more actions a user performed, the more engaged they were with the company’s website, presumably increasing the likelihood of conversion.
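As a rough illustration of the group comparison (the project itself used R), here is a minimal t-statistic in Welch's form, which does not assume equal group sizes or variances; both are wildly unequal between converters and non-converters here. The sample numbers are invented:

```python
from math import sqrt

def welch_t(a, b):
    """Welch's t statistic for two samples with unequal sizes/variances."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / sqrt(va / na + vb / nb)

# Toy example: converters perform more actions than non-converters.
converters     = [12, 15, 11, 14, 13]
non_converters = [4, 6, 5, 7, 5, 6, 4]
t = welch_t(converters, non_converters)   # large positive t
```

In practice one would use a statistics library (e.g. R's `t.test`) to also get degrees of freedom and a p-value.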
While there were a large number of non-converters (~455,000), only 180 users converted (i.e., clicked on the web advertisement and submitted their contact information), presenting a tremendous group imbalance. Data is generally considered imbalanced when one outcome occurs less than 15% of the time. Unfortunately, machine learning algorithms are highly sensitive to disproportionate group sizes, leading to biased predictions and misleading accuracy levels, quite simply because the algorithm has too little information about the minority group to make accurate predictions. Fortunately, imbalanced data sets are quite common, and a number of methods exist to account for the imbalance. However, there is no one “correct” method; each has its own advantages and disadvantages, and one may work better than others for a particular data set.
To prepare for prediction, my first step was feature selection. I applied a robust feature selection algorithm that removed highly correlated and linearly dependent features (removing unnecessary overlap), as well as features with near-zero variance. Before correcting for the imbalance, I ran a few popular algorithms to get a feel for the data: logistic regression, decision trees, a boosted generalized linear model (GLM), and random forest. Logistic regression and the boosted GLM presuppose that features are linearly related to the outcome, while the tree-based algorithms classify the target variable by recursively splitting observations on the most relevant features. I divided the data into training (N = 365,434) and testing (N = 91,358) sets and used 10-fold cross-validation for all algorithms. Rather than raw accuracy, I used the area under the ROC curve (AUC) to select the best predictive algorithm. Interestingly, logistic regression showed the greatest predictive ability at an AUC of 0.73, while the other three algorithms performed approximately at chance level. However, additional accuracy metrics revealed a problem: the logistic model achieved its accuracy by classifying all test observations as non-converters! For example, precision, which measures how often the algorithm is correct when it predicts that a user will convert, was zero. The uncorrected logistic model, while accurate, is therefore useless for determining who will convert.
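To see why accuracy alone was misleading, consider a toy Python sketch (the project used R) of a classifier that simply labels everyone as a non-converter, with conversion rates similar in spirit to this data set:

```python
def accuracy(y_true, y_pred):
    """Fraction of all predictions that are correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision(y_true, y_pred, positive=1):
    """Of the users predicted as converters, what fraction actually converted?"""
    predicted_pos = [(t, p) for t, p in zip(y_true, y_pred) if p == positive]
    if not predicted_pos:
        return 0.0  # the model never predicted a converter at all
    return sum(t == positive for t, _ in predicted_pos) / len(predicted_pos)

# Toy test set: 1 converter (1) among 999 non-converters (0).
y_true = [1] + [0] * 999
y_all_majority = [0] * 1000          # classify everyone as a non-converter

acc = accuracy(y_true, y_all_majority)    # 0.999: looks great
prec = precision(y_true, y_all_majority)  # 0.0:   useless for finding converters
```

A model can thus be 99.9% "accurate" while never identifying a single converter, which is exactly what the uncorrected logistic model did.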
Over- and Undersampling Methods
Results for the uncorrected analyses should be interpreted with caution, given the significant group imbalance and low precision. Two methods to account for disproportionate group sizes are oversampling and undersampling. Oversampling replicates observations from the minority class to balance the data. Undersampling, on the other hand, reduces the number of observations from the majority class. Both methods have trade-offs: oversampling loses no data but may lead to overfitting, since it simply replicates minority observations, while undersampling can reduce running time for complex models but discards potentially important information about the majority class.
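A minimal Python sketch of both corrections, assuming each observation is just a labeled tuple (the actual features don't matter for the balancing logic itself):

```python
import random

def oversample(minority, majority, seed=0):
    """Replicate minority rows (with replacement) until the classes balance."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

def undersample(minority, majority, seed=0):
    """Keep only as many majority rows as there are minority rows."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + minority

# Toy imbalanced data: 1000 non-converters, 5 converters.
majority = [("non_converter", i) for i in range(1000)]
minority = [("converter", i) for i in range(5)]

balanced_over = oversample(minority, majority)    # 1000 of each class
balanced_under = undersample(minority, majority)  # 5 of each class
```

In practice these corrections are applied only to the training folds, never to the test set, so the evaluation still reflects the true class distribution.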
I re-ran the four models with both oversampling and undersampling corrections. While there was no appreciable change in AUC for the logistic regression, both sampling procedures improved the AUC of the tree, forest, and boosted GLM algorithms by at least 20%. The most predictive model at this point was the logistic regression with oversampling correction, at an AUC of 0.74. Precision also increased dramatically to 0.7, indicating real power in correctly predicting who is likely to convert. The most important variables in this model included Internet Explorer use, Android OS, location, and visits to specific types of webpages.
Synthetic Minority Oversampling Technique (SMOTE)
I applied one final imbalance correction: the synthetic minority oversampling technique (SMOTE). As its name suggests, this method generates additional observations for the minority group. These are not mere replications, however, but synthetically (artificially) generated data points based on feature-space similarities among minority samples, using bootstrapping and k-nearest-neighbor techniques. Interestingly, SMOTE yielded the most predictive version of each algorithm, with the random forest demonstrating the highest performance of all (AUC = 0.78). Precision also increased to 0.8, meaning that 80% of the users the algorithm identified as converters actually converted. Here, the important variables included user actions, time, Chrome use, and visits to finance-related pages.
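SMOTE's core idea can be sketched in a few lines of Python (the project used an R implementation; this toy version assumes small 2-D numeric minority points and interpolates between each sampled point and one of its k nearest minority neighbors):

```python
import random

def smote(minority, k=2, n_new=4, seed=0):
    """Toy SMOTE: create a synthetic point on the line segment between a
    minority point and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbors of x among the other minority points
        neighbors = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Toy 2-D minority class; synthetic points land between existing ones.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority)
```

Because each synthetic point lies between two real minority points rather than duplicating one, SMOTE tends to overfit less than plain oversampling.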
Overall, the logistic regression yielded the most consistent model, with little change in predictive performance after imbalance correction. The other three algorithms, however, showed increasing predictive performance with each sampling technique, particularly SMOTE. This suggests that tree-based algorithms are quite sensitive to data imbalance and improve with additional observations from the minority class. Nevertheless, these results suggest that the data may best be characterized linearly, even under imbalance. Imbalance correction also significantly improved precision, and therefore the algorithms' ability to correctly identify converters. These results underscore the importance of addressing group size disparity, and of assessing multiple accuracy metrics when judging how "good" a model is.
Given that all models performed at a similar level with the SMOTE correction, I compared the top ten important variables across all four algorithms and found consistent features with the highest predictive power: users who submit input on online tools and calculators, Chrome users, day of the week, and visits to financing-related webpages. Therefore, in my final report to the finance company, I can suggest that the company increase its conversion rate by targeting Chrome users and placing loan advertisements on pages that include interactive tools and calculators, as well as on financing-related pages, at the beginning of the week (i.e., Monday). Future data collection and analysis are necessary to experimentally test these insights via A/B testing.
To view and/or download my R code, visit my Github page.