Date: 2021-04-12

This article is excerpted from the course "Fundamental Machine Learning," part of the Machine Learning Specialist certification program from Arcitura Education. It is the fifth part of the 13-part series, "Using machine learning algorithms, practices and patterns."

Continuing from part four of the series, this next article introduces two more machine learning data exploration techniques: the associativity computation and the graphical summary computation. As explained in part four, these techniques are documented in a standard pattern profile format to ensure consistency.

Associativity computation: Overview

How can the existence of relationship(s) between variables in a data set be determined?

Problem. Understanding a data set and developing a model from it both require finding connections between variables. Failing to do so results in ineffective models that include irrelevant variables as predictors.

Solution. The connection between variables is expressed as a relationship between them and is quantified by applying proven statistical techniques.

Application. Numerical variables in the data set are taken in pairs, and the measures of association (correlation and covariance) are calculated for each pair.
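The pairing described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the course's own implementation: the function names (`covariance`, `pearson`, `pairwise_association`), the column names and the readings are all hypothetical, and the sample covariance (dividing by n - 1) is an assumed convention.

```python
from itertools import combinations
from math import sqrt


def covariance(x, y):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))


def pairwise_association(columns):
    """Return {(name_a, name_b): (covariance, correlation)} for every pair."""
    return {
        (a, b): (covariance(columns[a], columns[b]), pearson(columns[a], columns[b]))
        for a, b in combinations(columns, 2)
    }


# Hypothetical readings, invented for illustration only.
data = {
    "temperature_c": [18.0, 22.0, 26.0, 30.0],
    "ice_creams_sold": [40.0, 55.0, 70.0, 95.0],
    "umbrellas_sold": [12.0, 9.0, 7.0, 2.0],
}

for pair, (cov, corr) in pairwise_association(data).items():
    print(pair, f"covariance={cov:.2f}", f"correlation={corr:.3f}")
```

With these invented numbers, temperature and ice cream sales yield a positive coefficient, while temperature and umbrella sales yield a negative one. In practice a variable of interest would usually be fixed and compared against the others rather than enumerating every pair.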

Associativity computation: Explained

Problem. In a data set, there is generally a set of variables that influence each other in some way. While the central tendency computation and variability computation patterns give insight into a single variable (univariate analysis), they say nothing about these inter-variable dependencies (bivariate analysis). Understanding those dependencies is important for choosing the best predictors for a machine learning problem (Figure 1).

Figure 1. A data set contains values of ice cream sold for different temperature readings recorded over three days (1). A technique needs to be applied in order to find out if the number of ice cream sales is related to the temperature readings (2, 3).

Solution. The inter-variable relationship is quantified by taking the variables in pairs, so that each variable is compared against every other variable in the data set. This quantification produces a numerical value that can then be used to choose the variables with the strongest relationship. Normally, a variable of interest, such as the variable whose value needs to be predicted, is chosen and compared against the other variables.

Application. The measures of association quantify the relationship between two variables in a data set. They include correlation and covariance.

Correlation is the degree of linear association between two variables, measured using a correlation coefficient. The relationship is considered linear when the scatter plot of the variables' values approximates a straight line, which means that one variable changes at a constant rate with respect to the other. The strength of the correlation is the absolute value of the correlation coefficient and ranges from 0 to 1. The direction of the correlation is given by the sign of the coefficient.
A positive sign indicates a direct relationship, meaning that as the values of one variable increase, the values of the other tend to increase as well. A negative sign indicates an inverse relationship, meaning that as the values of one variable increase, the values of the other tend to decrease. A correlation coefficient near zero implies little or no linear correlation between the variables' values.

The presence of correlation does not establish causation. Correlation is only a mathematical association between the variables, not evidence that one causes the other. Regardless, visualizing a scatter plot and calculating a correlation coefficient can provide useful insights about the data.

Pearson's product-moment coefficient is a commonly used correlation coefficient for measuring the linear correlation between two variables. Non-linear associations may also exist between variables, in which case Spearman's rank correlation can be used, provided that the relationship is monotonic. In a monotonic relationship, as one variable increases the other either never decreases or never increases, although not necessarily at a constant rate. Variables that first increase and then decrease, or vice versa, do not have a monotonic relationship.

Both the Pearson and the Spearman correlation coefficients range from -1 to +1 and are interpreted in the same manner. The Pearson correlation coefficient is affected by outliers because it takes into account the actual magnitude of the values. Spearman's correlation coefficient instead converts the original values to ranks before the calculation. As a result, it is far less sensitive to outliers, because the actual magnitude of the values is ignored.

Like correlation, covariance is a measure of how two variables change together. However, unlike correlation, its value can be any negative or positive number and is expressed in the product of the units of the two variables.
Because it is not standardized, the value of covariance depends on the units used, meaning the covariance of a measurement recorded in inches differs from the covariance of the same measurement recorded in centimeters. The value of correlation, however, is standardized and is not affected by the units used (Figure 2).

Figure 2. A data set contains values of ice cream sold for different temperature readings recorded over three days (1). The measures of association are calculated in order to determine whether the number of ice cream sales is related to the temperature readings (2). Based on the value of correlation, it is concluded that there is a strong positive relationship between the number of ice cream sales and the temperature readings, which means that as the temperature increases, more ice cream is sold, and vice versa (3).
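Two of the properties discussed in this section, the rank-based nature of Spearman's coefficient and the unit dependence of covariance, can be checked with a small standard-library sketch. The helper names (`ranks`, `spearman` and so on) and all of the values are hypothetical and chosen only for illustration. Spearman's coefficient is computed here as the Pearson coefficient of the ranked values, with tied values receiving the average of their ranks.

```python
from math import sqrt


def covariance(x, y):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))


def ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranked values."""
    return pearson(ranks(x), ranks(y))


# Hypothetical monotonic data whose last y value is an extreme outlier.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_outlier = [2.0, 4.0, 6.0, 8.0, 100.0]
print("pearson :", round(pearson(x, y_outlier), 3))
print("spearman:", round(spearman(x, y_outlier), 3))

# Hypothetical heights in inches, converted to centimeters (1 in = 2.54 cm).
heights_in = [60.0, 64.0, 68.0, 72.0]
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [50.0, 58.0, 66.0, 80.0]
print("covariance (in):", round(covariance(heights_in, weights_kg), 2))
print("covariance (cm):", round(covariance(heights_cm, weights_kg), 2))
print("correlation (in):", round(pearson(heights_in, weights_kg), 3))
print("correlation (cm):", round(pearson(heights_cm, weights_kg), 3))
```

Because the outlier does not change the ordering of the values, the Spearman coefficient stays at 1 while the Pearson coefficient falls well below it. Converting inches to centimeters multiplies the covariance by 2.54 but leaves the correlation unchanged.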

Graphical summary computation: Overview

How can intuition about a data set be developed beyond computation of simple descriptive statistics?

Problem. Generating descriptive statistics, such as numerical summaries, helps quantify various aspects of a data set. However, these techniques alone fail to capture trends or patterns hidden in the data set that humans can easily identify by eye.

Solution. Charts are generated from the data set so that human visual perception can pick out the trends and patterns hidden in it.

Application. Various graphical summaries are generated from the data set, including bar charts, histograms, scatter plots, cross-tabulations and box-and-whisker plots.
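As a rough illustration, the numbers behind three of these summaries can be computed with the Python standard library before any plotting library draws them. The sketch below is an assumption-laden example rather than a prescribed method: the function names, the bin width, the "low"/"high" demand threshold and all of the readings are hypothetical, and the quartiles use the "inclusive" method of `statistics.quantiles`, which is only one of several conventions for box-and-whisker plots.

```python
from collections import Counter
from statistics import quantiles


def histogram_counts(values, bin_width):
    """Count values per bin; each key is the lower edge of its bin."""
    return Counter((v // bin_width) * bin_width for v in values)


def cross_tabulation(rows, cols):
    """Count how often each (row category, column category) pair occurs."""
    return Counter(zip(rows, cols))


def five_number_summary(values):
    """Minimum, first quartile, median, third quartile and maximum."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical readings, invented for illustration only.
temperatures = [18, 22, 26, 30, 21, 24, 29]
sales = [40, 55, 70, 95, 60, 65, 80]
day_type = ["weekday", "weekday", "weekend", "weekend",
            "weekday", "weekend", "weekday"]
# Assumed threshold: more than 65 ice creams sold counts as "high" demand.
demand = ["high" if s > 65 else "low" for s in sales]

# Text rendering of a histogram of temperatures with 5-degree bins.
counts = histogram_counts(temperatures, bin_width=5)
for edge in sorted(counts):
    print(f"{edge:>3}-{edge + 4}: {'#' * counts[edge]}")

# Cross-tabulation of day type against demand level.
for (day, level), n in sorted(cross_tabulation(day_type, demand).items()):
    print(f"{day:<8} {level:<5} {n}")

# The five values a box-and-whisker plot of sales would be drawn from.
print(five_number_summary(sales))
```

A plotting library would turn the same counts and quartiles into bars, a grid and a box with whiskers. Computing them directly makes it clear what each chart encodes.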