A simple function to create nice correlation plots
When we ran a previous Machine Learning workshop we did an example of linear and multiple linear regression in R. This correlation plot function has been created on the back of feedback received from the group. This function can be freely used and plugged into your R environment to simplify and visualise correlations between your data and features.
Loading the correlation plot package
The package I used for creating my correlation plots was the corrplot package. It can be installed and loaded into the R workspace by using the syntax below:
#install.packages("corrplot")
library("corrplot")
At this point I would encourage you to check out help for the corrplot function, as it allows you to pass a multitude of parameters to the function.
Deconstructing the function
As mentioned previously, this plotting function has a multitude of uses, but all the parameters can be off-putting to a newbie! This was me 6 years ago, vigorously typing ‘how to do this with R relating to x’ into Google.
The function I have created uses the functionality of the corrplot package, but it simplifies the inputs. I will include the function in stages to explain each step. However, if you just want to use the function and are not bothered with the underpinnings, then skip the following section.
Step 1 – Function Parameters
Parameters of the function are as below:
dd_corrplot <- function(df_numeric_vals,
method_corrplot,
colour_min,
colour_middle,
colour_max="green") {
The parameters to pass to the function are:
- df_numeric_vals this means a data frame of numeric values only, so any categorical (factor) data needs to be stripped out before passing the data frame to the function;
- method_corrplot this is a numeric range from 1 – 5. So, for a shaded correlation plot you would use 1. Further examples of the various options will be discussed when I describe how the if statement works;
- colour_min this is the colour at the end of the gradient used for the strong negative correlations. An example of an input here would be “green”;
- colour_middle this is the middle range colour, normally I set this equal to (=) “white”;
- colour_max this is the colour of the strong positive correlations. It defaults to “green” if it is not supplied.
For information on the strength of correlations, refer to this simple guide.
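Since the first parameter expects numeric columns only, here is a minimal sketch of one way to strip out the categorical columns beforehand. It uses the built-in iris data purely as a stand-in, and the names mixed_df and numeric_only_df are my own illustrations rather than part of the function:

```r
# Stand-in data: iris ships with R and mixes four numeric
# columns with one factor column (Species).
mixed_df <- iris

# Flag the columns for which is.numeric() returns TRUE.
is_numeric_col <- vapply(mixed_df, is.numeric, logical(1))

# Keep only the flagged columns; drop = FALSE keeps the result a data frame.
numeric_only_df <- mixed_df[, is_numeric_col, drop = FALSE]

# Inspect the result: Species should no longer be present.
str(numeric_only_df)
```

The resulting data frame can then be passed as df_numeric_vals.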
Step 2 – Creating the condition (IF) statement to select correlation plot type
The below conditional statement uses the input of the function e.g. 1-5 to select the type of chart to display. This is included in the code block below:
  library(corrplot) # make sure corrplot is available inside the function
  # Translate the numeric option (1-5) into the method name corrplot expects
  if (method_corrplot == 1) {
    type_var <- "shade"
    method_corrplot <- type_var
  } else if (method_corrplot == 2) {
    type_var <- "number"
    method_corrplot <- type_var
  } else if (method_corrplot == 3) {
    type_var <- "pie"
    method_corrplot <- type_var
  } else if (method_corrplot == 4) {
    type_var <- "ellipse"
    method_corrplot <- type_var
  } else if (method_corrplot == 5) {
    type_var <- "circle"
    method_corrplot <- type_var
  } else {
    # Any other value falls back to a shaded plot
    type_var <- "shade"
    method_corrplot <- type_var
  }
What does this do, then? Firstly, nested in the function, I make sure that the corrplot library is referenced so that the correlation plot functionality can be used.
The next series of steps repeat this method:
- Each branch says that if the method_corrplot parameter of the function equals 1, 2, 3, etc., then the relevant type of correlation plot is selected. Any other value falls back to "shade".
- The type_var variable is set equal to the string stated. These strings link directly back to the method parameter of the corrplot function, which expects a type of correlation plot such as "shade" or "number".
- Finally, method_corrplot is overwritten with the value of type_var. In essence, the numeric value passed into the parameter (e.g. 1) is converted into the character string that corrplot is expecting.
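The same number-to-string mapping can also be sketched with a lookup vector instead of a chain of if statements. This is an alternative illustration, not the code used in the function, and the helper name resolve_method is hypothetical:

```r
# Hypothetical helper: maps the numeric option (1-5) to a corrplot method name.
resolve_method <- function(method_corrplot) {
  # Position in this vector corresponds to the numeric option.
  methods <- c("shade", "number", "pie", "ellipse", "circle")

  # match() returns NA when the input is not one of 1, 2, 3, 4, 5.
  idx <- match(method_corrplot, seq_along(methods))

  # Fall back to "shade" for anything outside the expected range.
  if (is.na(idx)) "shade" else methods[idx]
}

resolve_method(3)
resolve_method(9)
```

The fallback mirrors the final else branch, so an unexpected input still produces a shaded plot.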
Step 3 – Hacking the corrplot function
As specified in the previous sections, the corrplot function has a lot of inputs and is in need of simplifying, so that is exactly what I have tried to do. The call to corrplot is the last step in my simpler function, which takes its many parameters and reduces them to just 5 input parameters:
corrplot(cor(df_numeric_vals, use = 'all.obs'), method = method_corrplot,
order = "AOE",
addCoef.col = 'black',
number.cex = 0.5,
tl.cex = 0.6,
tl.col = 'black',
col= colorRampPalette(c(colour_min, colour_middle, colour_max))(200),
cl.cex = 0.3)
}
Let’s explain this function.
So, the corrplot function is the main driver for this, and the nested cor call is just as important, as this is the command that creates the correlation matrix.
The settings are to use the df_numeric_vals data frame as the data for the function. The use = 'all.obs' setting tells cor to use all observations in the data frame, which also means it will raise an error if the data frame contains missing values. The method = method_corrplot setting uses the if statement I created in step 2 to select the relevant chart from the input. The order = "AOE" setting orders the variables by the angular order of the eigenvectors, addCoef.col = 'black' sets the coefficient values to always show in black, and tl.col = 'black' does the same for the colour of the labels. The colours of the correlation plot come from the colorRampPalette function, which creates a 200-step gradient scale from the colour inputs I explained in step 1.
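To see what the two nested pieces do on their own, here is a small sketch using only base R. The built-in mtcars data is a stand-in of my choosing, and the variable names are illustrative:

```r
# Stand-in numeric data from the built-in mtcars dataset.
numeric_df <- mtcars[c("mpg", "disp", "hp", "wt")]

# cor() returns a square matrix with one row and column per variable.
corr_matrix <- cor(numeric_df, use = "all.obs")
dim(corr_matrix)

# With a missing value, use = "all.obs" stops with an error...
with_gap <- numeric_df
with_gap$mpg[1] <- NA
all_obs_result <- tryCatch(
  cor(with_gap, use = "all.obs"),
  error = function(e) "error"
)

# ...whereas use = "complete.obs" drops the incomplete row first.
complete_result <- cor(with_gap, use = "complete.obs")

# colorRampPalette() returns a function; calling it with 200
# produces 200 colours interpolated between the three inputs.
palette_200 <- colorRampPalette(c("steelblue2", "white", "whitesmoke"))(200)
length(palette_200)
```

If your data has gaps, swapping in "complete.obs" is one option to consider, although the function as written uses "all.obs".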
Utilising the function
The example dataset I will use here is the mpg dataset provided by the ggplot2 package. Load the R script provided towards the end of the last section first, as this will create the function in R’s environment. Next, add this code to the end to look at the different charts that can be created from the data:
##------------------CREATE DATASET---------------------------------------
library(ggplot2) # provides the mpg dataset
# Columns 3, 5, 8 and 9 are the numeric variables displ, cyl, cty and hwy
numeric_df <- data.frame(mpg[c(3,5,8,9)])
##------------------USE FUNCTION-----------------------------------------
dd_corrplot(numeric_df,1, "steelblue2","white", "whitesmoke")
dd_corrplot(numeric_df,2, "steelblue2","black", "black")
dd_corrplot(numeric_df,3, "steelblue2","white", "whitesmoke")
dd_corrplot(numeric_df,4, "steelblue2","white", "whitesmoke")
dd_corrplot(numeric_df,5, "steelblue2","white", "whitesmoke")
The outputs of the charts depend on the correlation plot type selected (1-5) and the colour ranges chosen. You can choose any colour, and I would recommend using the command colours() in the R console or a script to pull up the list of colours native to R.
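As a quick sketch of how that list can be searched, the snippet below filters the built-in colour names and checks that a chosen name is valid. The variable names are illustrative:

```r
# Full vector of colour names built into R.
all_colours <- colours()
head(all_colours)

# Narrow the list down to names containing "blue".
blue_colours <- grep("blue", all_colours, value = TRUE)
head(blue_colours)

# Check that a colour name is valid before passing it to the function.
is_valid_colour <- "steelblue2" %in% all_colours
is_valid_colour
```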
How about these visualisations:
Keep reading our blog posts and if you need any answers to any of your problems, then please let us know. Like the correlation plot and our working group, it is all about relationships!