*Source: https://research.fb.com/the-facebook-field-guide-to-machine-learning-video-series/*

Let us continue with our first machine learning model!

*What is Machine Learning?*

What exactly is machine learning? According to Arthur Samuel, a pioneer in computer science and artificial intelligence, machine learning is "the field of study that gives computers the capability to learn without being explicitly programmed." Put more simply, machine learning involves automating and improving a computer's learning process based on its "experience", without explicit programming by humans. By "experience", we mean feeding the computer quality data and then **training** our machine (i.e., computer) by building a model using the data and different algorithms. The chosen algorithm depends on the type of data we are using and the kinds of questions we'd like to answer.

As psychologists who are well-versed in statistics, we are familiar with building models. In our right-wing authoritarianism (RWA) example, we built a multiple linear regression model to discover how well our variables predict the level of RWA. But our model-building always raises a fundamental question: **how generalizable are our results to future samples, or the general population?** We attempt to address this issue by employing various random sampling methods, but as we know, we cannot control for everything. Perhaps a quirk in our dataset that we did not consider is incorporated into our model and ultimately yields a model with low prediction accuracy. How do we discover whether our models are truly predictive or not?

Therein lies the advantage of machine learning: when we build our model (i.e., algorithm), we only use a subset of our observations, called the **training data**. Once we have built our model, we feed the remaining observations, or the **testing data**, to that model and assess how well the predicted values generated by our model match the testing data. By using training and testing data, we are able to assess how accurate our model is likely to be for future data (and generate accuracy statistics). While different algorithms may be used for your model (e.g., regression, decision trees, support vector machines, neural networks, etc.), all of these approaches use training data to build the model and testing data to assess its accuracy.

Typically, your training data should be 75 to 80% of your total data, with the remainder (i.e., 20 to 25%) comprising your testing data. The best method also involves **randomly** splitting your data into training and testing sets. Finally, rather than splitting your data once, you can partition it into *k* folds and rotate which fold is held out, so the model is fit and evaluated *k* times; this procedure is called *K-Fold Cross Validation.* In this case, a training set is used to fit the model, a validation set is used to estimate prediction error for model selection, and the test set is used to assess the generalization error of the final chosen model.

In k-fold cross-validation:

1. Partition the dataset into k equal-sized partitions.

2. Select one partition as the validation data.

3. Use the remaining k-1 as the training data.

4. Train the model and determine accuracy from the validation data.

5. Repeat the process k times, selecting a different partition each time for the validation data.

6. Average the accuracy results.

*Source: https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6/*
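The six steps above can be sketched in plain Python. This is only an illustration under stated assumptions: `k_fold_scores`, `fit_mean`, and `mean_abs_error` are hypothetical names, the data are made-up numbers, and the toy "model" simply predicts the mean of the training values.

```python
import random
from statistics import mean


def k_fold_scores(data, k, fit, score, seed=0):
    """Return one validation score per fold (steps 1 to 5)."""
    rows = list(data)
    random.Random(seed).shuffle(rows)           # randomize before partitioning
    folds = [rows[i::k] for i in range(k)]      # step 1: k near-equal partitions
    scores = []
    for i in range(k):                          # step 5: rotate the held-out fold
        validation = folds[i]                   # step 2: one fold for validation
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]  # step 3
        model = fit(training)                   # step 4: train the model ...
        scores.append(score(model, validation))  # ... and score it on the held-out fold
    return scores


# Hypothetical toy model: predict the training mean for every observation.
def fit_mean(training):
    return mean(training)


def mean_abs_error(model, validation):
    return mean(abs(v - model) for v in validation)


values = [float(x) for x in range(1, 21)]       # made-up data
fold_scores = k_fold_scores(values, k=5, fit=fit_mean, score=mean_abs_error)
average_score = mean(fold_scores)               # step 6: average the results
print(len(fold_scores), round(average_score, 3))
```

In practice a library routine would normally replace this loop; the sketch is only meant to make the rotation of the validation fold explicit.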

There are alternative methods of cross-validation (e.g., leave-one-out, leave-k-out, etc.). Nevertheless, cross-validation is useful because every data point is used for both training and validation across the folds, yielding a less biased estimate of model performance (which also explains why it's a great technique for smaller data sets).

If your model fits the training data too closely, it may not predict new, unseen data well. This is known as **overfitting**: the model has low bias on the training data but high variance, because it has partly learned the noise in that sample. However, the opposite may also occur: your model does not fit your training data well enough and also cannot be generalized to new data. This is called **underfitting** (a model that is too biased), and it usually occurs when not enough predictors were included or an inappropriate algorithm was used. Cross-validation is one significant way to detect and guard against overfitting (in practice, overfitting is likelier to occur than underfitting).

*Source: https://medium.com/greyatom/what-is-underfitting-and-overfitting-in-machine-learning-and-how-to-deal-with-it-6803a989c76*
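To make the contrast concrete, here is a small sketch that fits polynomials of increasing degree to noisy synthetic data using NumPy. Everything in it is an assumption for illustration (the sine-shaped "true" relationship, the noise level, the chosen degrees). Training error can only fall as the degree rises, whereas error on held-out data typically stops improving, or worsens, once the model starts fitting noise.

```python
import numpy as np

rng = np.random.default_rng(0)


def true_curve(x):
    """Hypothetical 'true' relationship between x and y."""
    return np.sin(3 * x)


x_train = np.linspace(-1, 1, 20)
x_test = np.linspace(-0.95, 0.95, 20)
y_train = true_curve(x_train) + rng.normal(scale=0.3, size=x_train.size)
y_test = true_curve(x_test) + rng.normal(scale=0.3, size=x_test.size)


def mse(coefs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coefs, x) - y) ** 2))


results = {}
for degree in (1, 3, 12):   # too simple, moderate, very flexible
    coefs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coefs, x_train, y_train), mse(coefs, x_test, y_test))
    print(degree, results[degree])
```

Comparing the two numbers printed for each degree shows the pattern to look for: a large gap between training and testing error is the usual symptom of overfitting, while high error on both suggests underfitting.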

As you can imagine, training a machine learning model requires more observations than the standard statistical models that psychologists use. A general rule of thumb is that you need **10 observations per predictor** in your training set; remember that this number does not include your test set. This can be problematic for some psychological researchers and is a potential disadvantage of machine learning, where significant amounts of data are necessary to train your models.
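As a quick worked example of this rule of thumb, the sketch below computes the total sample needed so that the training set alone meets the 10-per-predictor target. The function name `minimum_total_sample` and its parameters are hypothetical, and the 20% test share is simply the split discussed above.

```python
import math


def minimum_total_sample(n_predictors, per_predictor=10, test_percent=20):
    """Total observations needed so the training set meets the rule of thumb."""
    n_train = n_predictors * per_predictor
    # The training set is only (100 - test_percent)% of the full sample.
    return math.ceil(n_train * 100 / (100 - test_percent))


# With 10 predictors: 100 training observations, so 125 observations overall.
print(minimum_total_sample(10))  # 125
```

Note that each dummy-coded column counts as its own predictor, so categorical variables with many levels raise the requirement quickly.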

Now, let us return to our RWA dataset and create a simple machine learning model. We begin by importing our libraries and cleaned data set:

In our last tutorial, the following *a priori* predictors were included in our multiple regression analysis: extraversion, agreeableness, conscientiousness, emotional stability, openness to experience, education, gender, religion, voting status, and married status. For comparability, we will include these same predictors in the current model.

As before, we will also need to dummy code our categorical variables, and drop one of the newly created dummy-coded variables to prevent issues of multicollinearity.
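A minimal sketch of this dummy-coding step, assuming pandas and its `get_dummies` function. The rows, column names, and category labels below are hypothetical stand-ins, not the tutorial's dataset.

```python
import pandas as pd

# Hypothetical stand-in rows; names and labels are assumptions.
df = pd.DataFrame({
    "extraversion": [4, 9, 12, 7],
    "gender": ["female", "male", "other", "male"],
    "married": ["never", "currently", "previously", "never"],
})

# drop_first=True drops one dummy column per categorical variable,
# leaving that category as the reference level.
encoded = pd.get_dummies(df, columns=["gender", "married"], drop_first=True)
print(encoded.columns.tolist())
```

Here "female" and "currently" become the reference categories because they sort first alphabetically; the remaining dummy columns are interpreted relative to them.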

Now we introduce a new step that will allow us to build a machine learning model, by splitting our entire sample into training data (which allows us to build our model) and testing data (the data which will be used to assess the prediction accuracy of our model):
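A minimal sketch of such a split, assuming scikit-learn's `train_test_split`. The arrays are synthetic stand-ins for the predictors and the RWA score, and `random_state=0` is an arbitrary choice made only so the split is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # hypothetical predictors
y = rng.normal(size=100)        # hypothetical outcome

# Hold out 20% of the rows, chosen at random, as testing data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```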

*Training and Testing Data*

As the code demonstrates, we have created four dataframes: two frames that will allow us to create an algorithm (training), and two frames that will allow us to apply that algorithm and test it (testing). Also, we have requested that the testing dataframe be 20% (or 0.2) of the total dataset size.

Another necessary step is 'scaling' all of our predictors. Put another way, we standardize each variable so that it has a mean of zero and unit variance. This puts the variances of our predictors (i.e., features) in the same range. If a predictor's variance is orders of magnitude larger than the variance of the other predictors, that predictor might dominate the others and the model will place additional emphasis on it (which we do not want). Fortunately, feature scaling is easily handled:
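A minimal sketch of feature scaling, assuming scikit-learn's `StandardScaler` and synthetic stand-in data. Fitting the scaler on the training data only, and then reusing it for the testing data, is a common practice assumed here rather than something the text specifies.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two hypothetical predictors on very different scales.
X_train = rng.normal(loc=[8, 50], scale=[3, 20], size=(80, 2))
X_test = rng.normal(loc=[8, 50], scale=[3, 20], size=(20, 2))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and SD from training data
X_test_scaled = scaler.transform(X_test)        # reuse the training mean and SD

print(X_train_scaled.mean(axis=0).round(3), X_train_scaled.std(axis=0).round(3))
```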

*Feature Scaling*

Now we can build our model using the training data and conduct a multiple regression analysis:
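A minimal sketch of this step, assuming scikit-learn's `LinearRegression`. The training data are synthetic, with made-up coefficients, so the printed statistics will not match the tutorial's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 3))                       # hypothetical predictors
y_train = 2.0 + X_train @ np.array([1.5, 0.0, -0.5]) + rng.normal(size=80)

# Fit a model
model = LinearRegression().fit(X_train, y_train)

r2_train = model.score(X_train, y_train)                 # coefficient of determination
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
print("R^2 (training):", r2_train)
```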

*Build our Model*

The code above is allowing us to do a number of things:

The "Fit a model" code allows us to fit a multiple linear regression model using our training data (both X_train and y_train).

We can request various statistics to assess the results of that model, such as the y-intercept, the coefficients of each of our predictors, and the coefficient of determination (R^2). In this case, our training algorithm has yielded an **R^2 = 0.07**, meaning that our model accounts for 7% of the variance in right-wing authoritarian personality scores. As I'm sure you agree, that's not a very accurate model.

Maybe one reason our model was not highly accurate is that it's biased, unbeknownst to us. One way to reduce the bias introduced by a single training/testing split is to use cross-validation! Therefore, let's re-run our prediction model using 10-fold cross-validation:
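A minimal sketch of 10-fold cross-validation, assuming scikit-learn's `cross_val_score` with `scoring="r2"`. The data are synthetic stand-ins, so the scores will differ from the tutorial's.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                            # hypothetical predictors
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=200)

# cv=10 fits the model 10 times, each time scoring it on a different held-out fold.
scores = cross_val_score(LinearRegression(), X, y, cv=10, scoring="r2")
print("Cross-validated R^2 scores:", scores.round(5))
print("Mean R^2:", scores.mean())
```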

*10-Fold Cross Validation*

Calling on scikit-learn's "cross_val_score" function, we are able to run 10-fold cross-validation, which yields 10 coefficients of determination:

Cross-validated R^2 scores: [0.05572, 0.08719, 0.07625, 0.02015, 0.05791, 0.05714, 0.07982, 0.04607, 0.06844, 0.05562]

You can then average these coefficients to generate a single statistic for the 10-fold cross-validation procedure, which in this case is **0.0604**. Because the cross-validated R^2 is no better than the 0.07 from our single split, it appears that our original training/testing split was not the problem: the model itself simply explains little variance.

Let us continue by running our test data through our trained regression model:

The graph depicts our actual RWA scores against what our model has predicted. The closer the points are to the line, the more accurate our prediction is. As evident below, our model is not very good at predicting the RWA total score:

Finally, we can compute metrics to compare the accuracy of our predicted RWA scores to the actual RWA scores we had set aside when we created our test sets (i.e., y_test).

There are a number of metrics one can use to measure prediction accuracy. A popular one is the root mean square error (RMSE), as it expresses the typical size of the prediction error in the units of the outcome variable (which the coefficient of determination does not tell us). **A lower RMSE value indicates better prediction accuracy.** Our current model yielded the following prediction accuracy metrics:

Mean Absolute Error: 7.599542

Mean Squared Error: 107.963569

Root Mean Squared Error: 10.390551
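For reference, these three metrics can be computed by hand as in the sketch below. The function name `regression_errors` and the toy values are hypothetical; the last lines only check that the reported RMSE is the square root of the reported MSE.

```python
import math


def regression_errors(actual, predicted):
    """Return (MAE, MSE, RMSE) for paired actual and predicted values."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(r) for r in residuals) / len(residuals)
    mse = sum(r * r for r in residuals) / len(residuals)
    return mae, mse, math.sqrt(mse)


# Hypothetical toy values, not the tutorial's data.
mae, mse, rmse = regression_errors([10, 20, 30, 40], [12, 18, 33, 35])
print(mae, mse, round(rmse, 4))

# RMSE is the square root of MSE, so the reported figures can be cross-checked.
reported_rmse = math.sqrt(107.963569)
print(round(reported_rmse, 4))  # 10.3906, consistent with the reported 10.390551
```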

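RMSE is expressed in the units of the outcome variable (here, the RWA total score), not in the units of the predictor variables (which themselves generally range from 2 to 14). A typical error of roughly 10 points on the RWA total score, together with the low R^2, is not indicative of a highly accurate model.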

As a final test, we can run a traditional hypothesis-testing analysis (one that we psychologists are familiar with) to generate *p*-values and see which predictors significantly predict RWA total scores (as we did in the previous tutorial):

Overall, this tutorial offers an introduction to common machine learning concepts such as training/testing data, cross-validation, and prediction accuracy metrics.

However, our overall model was really not a good first effort in predicting right-wing authoritarian personality scores. One very important aspect of model building includes **determining which predictors to include in your model**. In the current example, we included predictors *a priori*, as psychologists commonly do. However, one advantage of machine learning is utilizing techniques that will allow us to figure out which variables may be the most predictive in reference to our dependent variable. The next tutorial will focus on common machine learning techniques used for feature selection.

Until next time!

To view and/or download my Python Jupyter notebook, visit my GitHub page.