Date: 2021-04-22

This article is excerpted from the course "Fundamental Machine Learning," part of the Machine Learning Specialist certification program from Arcitura Education. It is the sixth part of the 13-part series, "Using machine learning algorithms, practices and patterns."

Continuing from part five of the series, this article examines two more of the common machine learning patterns, the feature selection pattern and the feature extraction data reduction pattern.

Feature selection: Overview

How can only the most relevant set of features be extracted from a data set for model development?

Development of a simple yet effective machine learning model requires the ability to select only the features that carry the maximum prediction potential. However, when faced with a data set comprising a large number of features, a trial-and-error approach leads to loss of time and processing resources.

The data set is analyzed methodically, and only a subset of features is kept for model selection, thereby keeping the model simple yet effective.

Established feature selection techniques, such as forward selection, backward elimination and decision-tree induction, are applied to the data set to help filter out the features that do not significantly contribute toward building an effective yet simple model.

Feature selection: Explained

Problem

A data set normally consists of multiple features. These features become the input for learning a model from the data and for the subsequent predictions. Although all features can be fed to the model training process, they seldom contribute equally toward predicting the target value, and some may not contribute at all. Training on every feature regardless leads to excessive use of processing resources and increased costs, especially if the analytics platform is cloud-based. Developing a model with a large number of features also results in a complex model that is slower to execute and is prone to overfitting (Figure 1).

Figure 1: A data set is prepared that consists of a large number of features (1). The data set is then used to train a model (2, 3). The resulting model has reduced accuracy, takes longer to train and carry out predictions, and suffers from overfitting (4).

Solution

Only the most relevant features are selected by determining the predictors, or features, that carry the maximum potential for developing an effective model. Although different methods exist, the underlying technique generally evaluates the predictive power of different features and then selects the ones with the greatest predictive usefulness. Selecting the most relevant subset of features also keeps the model simple, which contributes toward model interpretability, and helps in understanding the process that generated the data in the first place.

Application

Forward selection, backward elimination and decision-tree induction techniques are applied for feature selection.

Forward selection is a top-down approach where all features are excluded at the start and are then added in a step-by-step manner (Figure 2). Each newly added feature is evaluated numerically, and only value-bearing features are kept. The evaluation is done via either correlation or information gain measures.
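As a rough illustration of forward selection with a correlation measure, the following Python sketch starts with no features and greedily adds the one that most improves a subset score. The article does not say which correlation score to use, so the sketch assumes a correlation-based merit heuristic that rewards correlation with the target and penalizes redundancy among the chosen features. The function names, the `min_gain` stopping threshold and the toy data are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def merit(subset, features, target):
    """Assumed score: high feature-target correlation, low feature-feature correlation."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_target = sum(abs(pearson(features[f], target)) for f in subset) / k
    pairs = list(combinations(subset, 2))
    r_between = 0.0
    if pairs:
        r_between = sum(abs(pearson(features[a], features[b])) for a, b in pairs) / len(pairs)
    return k * r_target / sqrt(k + k * (k - 1) * r_between)


def forward_selection(features, target, min_gain=1e-3):
    """Start with zero features; keep adding the best candidate while the score improves."""
    selected, best_score = [], 0.0
    remaining = list(features)
    while remaining:
        scored = [(merit(selected + [f], features, target), f) for f in remaining]
        score, candidate = max(scored)
        if score - best_score < min_gain:
            break  # no remaining feature adds enough value
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = score
    return selected


# Toy data (hypothetical): the target depends on signal_a and signal_b only.
signal_a = [1, 1, -1, -1, 1, 1, -1, -1]
signal_b = [1, -1, 1, -1, 1, -1, 1, -1]
features = {
    "signal_a": signal_a,
    "signal_b": signal_b,
    "near_copy_of_a": [1, 1, -1, -1, 1, 1, -1, 0],   # redundant with signal_a
    "noise": [a * b for a, b in zip(signal_a, signal_b)],  # uncorrelated with target
}
target = [2 * a + b for a, b in zip(signal_a, signal_b)]

selected = forward_selection(features, target)
print(selected)
```

On this toy data the redundant and uncorrelated columns are left out. Swapping `merit` for an information gain measure, or for a model's validation score, would leave the selection loop unchanged.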
Figure 2: An example of the forward selection process that starts with zero features and ends with the selection of three features.

Backward elimination is a bottom-up approach where all features are included by default at the start and are then removed in a step-by-step manner (Figure 3). After numerical evaluation, the features providing the least value are removed. Both forward selection and backward elimination are heuristics-based in that they work on a trial-and-error approach.

Figure 3: An example of the backward elimination process that starts with five features and ends with the selection of two features.

Decision-tree induction is an algorithm-driven technique whereby a treelike structure is constructed (Figure 4).

Figure 4: A data set composed of five features. Using decision-tree induction, the three most value-bearing features are selected.

Each non-leaf node performs a test on a feature, and each leaf node represents a predicted class. The algorithm chooses only the most relevant feature at each non-leaf node. The complete tree therefore represents the subset of most value-bearing attributes (Figure 5).

Figure 5: A data set is prepared that consists of a large number of features (1). The analytics engine mechanism is used to assist with feature selection by exposing the data set to the decision-tree induction technique (2). This results in a subset of the original training data set with only the most relevant features (3). This data set is then used to train a new model (4, 5). The resulting model has increased accuracy, takes a shorter time to train and carry out predictions, and only slightly suffers from overfitting (6).
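The decision-tree induction idea can be sketched in a few lines of Python. The article does not name a specific tree algorithm, so this example assumes an ID3-style tree over categorical features that splits on information gain. It grows the tree and returns the set of features tested at non-leaf nodes, which is the selected subset. The feature names, labels and `min_gain` threshold are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Reduction in label entropy obtained by splitting the rows on one feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def tree_features(rows, labels, candidates, min_gain=1e-9, used=None):
    """Grow an ID3-style tree and collect the features tested at non-leaf nodes."""
    used = set() if used is None else used
    if len(set(labels)) <= 1 or not candidates:
        return used  # leaf node: one class left, or nothing left to test
    best = max(candidates, key=lambda f: information_gain(rows, labels, f))
    if information_gain(rows, labels, best) < min_gain:
        return used  # leaf node: no candidate separates the classes
    used.add(best)  # non-leaf node: test the most relevant feature
    rest = [f for f in candidates if f != best]
    for value in {row[best] for row in rows}:
        branch = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        tree_features([r for r, _ in branch], [l for _, l in branch], rest, min_gain, used)
    return used


# Toy data (hypothetical): the class depends only on employed and good_credit.
rows = [
    {"employed": e, "good_credit": c, "likes_tea": t, "country": "X"}
    for e, c, t in product([0, 1], repeat=3)
]
labels = ["approve" if r["employed"] and r["good_credit"] else "reject" for r in rows]

selected = tree_features(rows, labels, ["employed", "good_credit", "likes_tea", "country"])
print(sorted(selected))
```

Features that never appear in a test, here the constant `country` column and the irrelevant `likes_tea` column, are the ones filtered out before the new model is trained.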