In this session, you learnt about the various types of bivariate and multivariate analyses. These include the following:

  • Analysis between two numerical variables: The most important thing to remember is that correlation and scatter plots are the standard tools for analysing a pair of numerical variables. The correlation coefficient indicates how strongly two numerical variables are linearly related, whereas a scatter plot shows the actual relationship between them visually.

As you can observe in the correlation matrix above, among all the combinations in the data set, there is a high correlation between petal length and sepal length, and between petal width and petal length.
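A correlation matrix of this kind can be computed with pandas. The following is a minimal sketch: the column names and the six rows of measurements are made up for illustration and are not the values from the data set used in the session.

```python
import pandas as pd

# Illustrative measurements only (made-up rows, not the real data set).
flowers = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8, 7.1, 6.5],
    "sepal_width": [3.5, 3.0, 2.5, 2.7, 3.0, 3.0],
    "petal_length": [1.4, 1.4, 4.9, 4.0, 5.9, 5.5],
    "petal_width": [0.2, 0.2, 1.5, 1.2, 2.1, 1.8],
})

# Pairwise Pearson correlation between every pair of numerical columns.
corr_matrix = flowers.corr()
print(corr_matrix.round(2))

# Correlation coefficient for one specific pair of columns.
r = flowers["petal_length"].corr(flowers["petal_width"])
print(f"petal_length vs petal_width: {r:.2f}")
```

If matplotlib is installed, a call such as flowers.plot.scatter(x="petal_length", y="petal_width") should draw the matching scatter plot for one pair of columns.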

  • Analysis between numerical and categorical variables: This shows how a particular numerical variable varies across the categories of a categorical variable. A box plot is usually the best way to look at a numerical variable with respect to a categorical variable. However, box plots may sometimes not be useful, either because of a huge difference between the maximum and minimum values in the data set or because the values of the numerical variable are heavily concentrated in a narrow range. In such cases, comparing the mean, median or quartiles of the numerical variable across the categories is often more informative. Take a look at the example shown below.

As you can see in the box plot above (already discussed in the bank marketing case study), customers with a higher salary range are more likely to give a positive response.
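The mean/median/quartile approach mentioned above can be sketched with a pandas group-by. The column names salary and response, and every value in the table, are assumptions made for illustration and do not come from the bank marketing data set.

```python
import pandas as pd

# Hypothetical customers; the values are invented for illustration.
customers = pd.DataFrame({
    "salary": [20000, 60000, 100000, 50000, 70000, 120000, 16000, 55000],
    "response": ["no", "no", "yes", "no", "yes", "yes", "no", "yes"],
})

# Mean and median salary within each response category.
salary_summary = customers.groupby("response")["salary"].agg(["mean", "median"])
print(salary_summary)

# Lower and upper quartiles of salary within each response category.
salary_quartiles = (
    customers.groupby("response")["salary"].quantile([0.25, 0.75]).unstack()
)
print(salary_quartiles)
```

In this toy table the median salary is higher for the "yes" group than for the "no" group, which is the kind of pattern a box plot of salary by response would also show.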

  • Correlation vs causation: This is a very important concept in data analysis, which states that correlation does not imply causation. Even when two variables are very highly correlated, one need not cause the other.
  • Analysis between two categorical variables: A bar graph is the best approach to analysing two categorical variables.

One interesting example, also covered in the bank marketing case study, is that the bank mostly contacted people in the 30-50 age group, although people in the 60+ age group had the highest positive response rate of all the age groups. This is a very important inference that the bank can draw, i.e., it should contact more individuals in the 60+ age group.
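A comparison of this kind, between how often each age group was contacted and how often it responded positively, can be sketched with a pandas cross-tabulation. The age_group and response columns and all the rows below are hypothetical and only mimic the pattern described above.

```python
import pandas as pd

# Hypothetical contact records; invented to mimic the pattern described above.
contacts = pd.DataFrame({
    "age_group": ["<30", "30-50", "30-50", "30-50", "30-50",
                  "50-60", "60+", "60+", "60+"],
    "response": ["no", "no", "yes", "no", "no",
                 "no", "yes", "yes", "no"],
})

# How many people were contacted in each age group.
contact_counts = contacts["age_group"].value_counts()
print(contact_counts)

# Share of positive responses within each age group:
# normalize="index" makes every row of the cross-tab sum to 1.
response_rate = pd.crosstab(
    contacts["age_group"], contacts["response"], normalize="index"
)["yes"]
print(response_rate.round(2))
```

If matplotlib is installed, response_rate.plot.bar() should draw the corresponding bar graph.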

  • Multivariate analysis: Multivariate analysis yields very specific information about a data set. It involves analysing more than two variables at a time. For instance, a heat map is a good way to look at three variables at once. In multivariate analysis, it is essential to group the data by the variables of interest and draw inferences from the result.

As you have already seen in the bank marketing case study, single people with tertiary education are more likely to respond positively to the term deposit offer, whereas married individuals and those who have completed only up to primary education are the least likely to do so.
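The grouping behind such a heat map can be sketched with a pandas pivot table. The column names (marital, education, response_flag) and all the rows are assumptions for illustration; response_flag is a hypothetical 0/1 encoding of the response, so its mean within a cell is that cell's positive response rate.

```python
import pandas as pd

# Hypothetical customers; response_flag is 1 for a positive response, else 0.
customers = pd.DataFrame({
    "marital": ["single", "single", "married", "married",
                "single", "married", "married", "single"],
    "education": ["tertiary", "tertiary", "primary", "primary",
                  "primary", "tertiary", "tertiary", "primary"],
    "response_flag": [1, 1, 0, 0, 0, 1, 0, 1],
})

# Rows: education, columns: marital status, cell values: mean response_flag,
# i.e. the positive response rate for each education/marital combination.
rate_table = pd.pivot_table(
    customers,
    index="education",
    columns="marital",
    values="response_flag",
    aggfunc="mean",
)
print(rate_table)
```

A table like rate_table can then be passed to a heat map function, for example seaborn's heatmap with annot=True, to colour each cell by its response rate.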

In the next segment, Rahim will summarise the module on Exploratory Data Analysis.
