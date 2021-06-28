This article is excerpted from the course "Fundamental Machine Learning," part of the Machine Learning Specialist certification program from Arcitura Education. It is the tenth part of the 13-part series, "Using machine learning algorithms, practices and patterns."

The category discovery and pattern discovery techniques are two unsupervised learning techniques that can be applied to solve machine learning (ML) problems where the objective is to find similar groups in the data, rather than the value of some target variable. They can also be applied to carry out data mining tasks. As explained in part 4, these techniques are documented in a standard pattern profile format.

Category discovery: Overview

Requirement. How can data be categorized into meaningful groups when the groups are not known in advance?

Problem. Knowing plausible categories to which data might belong is not always possible, which makes it impossible to use classification algorithms to categorize data into relevant categories.

Solution. A clustering model is built that automatically groups similar data points into the same categories based on the intrinsic similarities between data attributes.

Application. Clustering algorithms, such as K-means, K-medians and hierarchical clustering, build the clustering model that organizes the data points into homogeneous categories.

Category discovery: Explained

Problem. With data mining and exploratory data analysis, little information is available about the makeup of the data at hand. One example is a data set that describes the shopping habits of online customers of an e-commerce retailer. Because the groups to which different instances may belong are unknown and no grouping examples exist, supervised machine learning techniques cannot be used. The objective, therefore, may be to find out whether any natural customer groups exist, with the end goal of finding the characteristics of each group (Figure 1).

Figure 1: A data set contains data about the spending behavior of customers in a retail store (1). To understand the data, groups of customers who behave in a similar way need to be found, which calls for a machine learning technique (2). Applying the technique should result in similar customers being grouped together (3).

Solution. The data set is exposed to clustering, an unsupervised machine learning technique. This involves dividing data into different groups so the data in each group has similar properties. No prior learning of categories is required. Instead, categories are implicitly generated and subsequently named and interpreted, based on the data groupings. How the data is grouped depends on the type of algorithm used, because each algorithm uses a different technique to identify clusters. Once the groups are found, instances belonging to the same group are considered to be similar to each other. These groups can be further analyzed to gain a better understanding of their makeup and to determine why instances were allocated to different groups.

Clustering results can be used to preprocess data for semi-supervised learning, where class labels are first created based on the resulting groups, and the instances belonging to each group are then assigned the corresponding class labels.
The labeled data can then be used for classification. While clustering automatically creates homogeneous groups, the machine-generated labels often carry no real meaning. Humans must analyze the properties of each group and create meaningful labels that suit the nature of the data analysis task, the business domain or the individuals to whom the data mining results must be communicated.

Application. K-means is a common clustering algorithm that uses distance as a measure for creating clusters of homogeneous items. K is a user-defined number that denotes how many clusters to create, and "means" refers to the center point of each cluster, or centroid. The centroid is the point around which the similar items that make up a cluster are located. It is computed as the mean of the locations of all items in the cluster, in a multidimensional space that has one dimension per feature of the items being clustered. The value of K must be set within 1 ≤ K ≤ n, where n is the total number of items in the data set. K-means is similar to K-nearest neighbors in that it generally uses the same Euclidean distance calculation to determine closeness between the centroid and the items (represented as points) and in that it requires the user to specify a K value, although K counts clusters in K-means and neighbors in K-nearest neighbors (Figure 2).

Figure 2: This scatter plot displays the application of the K-means algorithm. Each group contains data points that are similar in nature, but the groups are different from each other.

Operating in an iterative fashion, K-means begins with less homogeneous groups of instances and modifies each group during each iteration to increase homogeneity within the group. The process continues until the groups stop changing, with the aim of maximizing homogeneity within the groups and heterogeneity between the groups.
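The iterative procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function names, the random choice of initial centroids, the fixed seed, the iteration cap and the sample customer points are all assumptions made for the example and do not come from the course material.

```python
import math
import random


def euclidean(a, b):
    """Straight-line distance between two equal-length numeric points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(dim) / len(points) for dim in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    """Return (centroids, assignments) for a list of numeric tuples."""
    if not 1 <= k <= len(points):
        raise ValueError("K must satisfy 1 <= K <= n")
    rng = random.Random(seed)
    # Assumption: start from K randomly chosen data points.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the grouping is stable
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    return centroids, assignments


# Hypothetical two-feature customer data with two visibly separate groups.
customers = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.2),
             (8.0, 8.5), (8.3, 8.0), (7.8, 8.2)]
centroids, labels = k_means(customers, k=2)
print(centroids, labels)
```

The cluster numbers returned in `labels` are arbitrary, which is why a person still has to inspect each centroid and give the group a meaningful name.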
The category discovery pattern requires the application of the feature encoding pattern, because category discovery involves distance measurement, which requires all features to be numerical. Applying the feature standardization pattern also ensures that large-magnitude features do not overshadow smaller-magnitude features when distances are measured. In addition, applying the feature discretization pattern helps reduce the feature dimensionality, which contributes to faster execution and increased generalizability of the model (Figure 3).

Figure 3: A data set contains data about the spending behavior of customers in a retail store (1). To understand the data, customers who behave similarly need to be found, so the K-means algorithm is applied with K set to 3 (2). This results in similar customers being grouped together into three groups (3). By looking at the centroid of each group, meaningful labels are then allocated to each group (4).
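To see why encoding and standardization matter for distance-based clustering, consider the following sketch. It assumes one-hot encoding as the feature encoding step and z-score scaling as the standardization step. Both are common choices, but the article does not say which specific methods the patterns prescribe, and the customer values are invented for illustration.

```python
import math
import statistics


def one_hot(values):
    """Encode categorical values as lists of 0.0/1.0 flags, one per category."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]


def standardize(column):
    """Rescale a numeric column to zero mean and unit standard deviation."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)
    if std == 0:
        return [0.0 for _ in column]  # a constant column adds no information
    return [(x - mean) / std for x in column]


def euclidean(a, b):
    """Straight-line distance between two equal-length numeric points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical customers: annual spend, visits per month, preferred channel.
spend = [52000.0, 51000.0, 20000.0]
visits = [2.0, 12.0, 2.0]
channel = ["web", "store", "web"]

# Raw numeric features: spend is thousands of times larger than visits.
raw = list(zip(spend, visits))

# Standardized numeric features followed by the one-hot channel flags.
scaled = [
    [s, v] + flags
    for s, v, flags in zip(standardize(spend), standardize(visits), one_hot(channel))
]

# In the raw data the distance between customers 0 and 1 is almost entirely
# the spend gap, even though their visit counts differ sharply.
print(euclidean(raw[0], raw[1]))
# After standardization both features contribute on a comparable scale.
print(euclidean(scaled[0], scaled[1]))
```

With the raw values, the difference in visits barely registers next to the difference in spend. Once each column is rescaled, a clustering algorithm such as K-means can weigh both behaviors.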

Pattern discovery: Overview

Requirement. How can repeated sequences be found in large data sets made up of a number of features without any previous examples of such sequences?

Problem. The discovery of naturally occurring groups within data helps in understanding the structure of the data. However, it does not help find meaningful repeating patterns within the data that can represent business opportunities or threats.

Solution. An associative model is developed that identifies patterns within the data in the form of rules; these rules signify the relationship between data items.

Application. Associative rule learning algorithms, such as Apriori and Eclat, are employed to build an associative model that extracts rules (patterns) based on how frequently certain data items appear together.
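As a rough illustration of how rules can be extracted from co-occurrence frequency, the sketch below follows the level-wise idea behind Apriori: count how often item sets appear together (support), keep the frequent ones, and derive rules whose confidence passes a threshold. It is a simplified teaching version under stated assumptions, not the reference algorithm. The function names, the thresholds and the shopping baskets are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search: return {itemset: support} for frequent itemsets."""
    n = len(transactions)
    baskets = [frozenset(t) for t in transactions]

    def support(itemset):
        # Fraction of baskets that contain every item in the itemset.
        return sum(itemset <= b for b in baskets) / n

    items = sorted({item for b in baskets for item in b})
    level = [frozenset([i]) for i in items]
    frequent = {}
    while level:
        kept = {s: support(s) for s in level}
        kept = {s: v for s, v in kept.items() if v >= min_support}
        frequent.update(kept)
        # Join frequent sets into candidates one item larger, then prune
        # any candidate that has an infrequent subset.
        size = len(next(iter(kept))) + 1 if kept else 0
        candidates = {a | b for a in kept for b in kept if len(a | b) == size}
        level = [
            c for c in candidates
            if all(frozenset(sub) in kept for sub in combinations(c, size - 1))
        ]
    return frequent


def rules(frequent, min_confidence):
    """Derive (antecedent, consequent, confidence) rules from frequent itemsets."""
    found = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                # Confidence: how often the rule holds when the antecedent appears.
                confidence = sup / frequent[antecedent]
                if confidence >= min_confidence:
                    found.append((antecedent, itemset - antecedent, confidence))
    return found


# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
freq = frequent_itemsets(baskets, min_support=0.6)
strong_rules = rules(freq, min_confidence=0.9)
for antecedent, consequent, confidence in strong_rules:
    print(sorted(antecedent), "->", sorted(consequent), confidence)
```

In these made-up baskets, every basket that contains beer also contains diapers, so the rule from beer to diapers is the only one that clears the 0.9 confidence threshold.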