The primary motivator of big data initiatives is the ability to institute an advanced analytics program that can help improve business operations and lead to increased marketing and sales opportunities. And companies are increasingly tapping automated machine learning algorithms to aid in the big data analytics process.
The growing interest in machine learning techniques is coupled with a lowered barrier to entry due to the wider availability of analytics tools supporting them. Yet it's easy to become overly enthusiastic about machine learning without really understanding what the algorithms are intended to accomplish, let alone what they do and how they do it. To that end, let's look at three examples of machine learning methods, their objectives and how selected algorithms work.
1. Classification puts data in its place
We'll start with classification, whose objective is to predict an outcome by assigning the entities in a data set to separate classes. Uses for classification algorithms include junk email detection and healthcare risk analysis. In the former, after scanning the text of an email and tagging recognized words and phrases, the email's "signature" can be fed into a classification algorithm to determine whether it qualifies as spam. In the latter, a patient's vital statistics, health history, activity levels and demographic data can be run through an algorithm to assign a risk score for particular diseases.
One method of classification is a decision tree, which is similar to a flow chart in providing a hierarchical sequence of information -- in this case, "tests" of different parameters in the data entities being classified. The tests could be as simple as yes-no questions or include a broader set of differentiating variables. At each level of a decision tree, the tests are applied to the data to further the classification until the bottom of the tree is reached, with data entities separated into distinct classes.
A decision tree is built using a machine learning algorithm. Given a predefined set of classes, the algorithm iteratively searches for the variables that provide the greatest differentiation among the data entities being classified. Once such variables are found and decision rules are determined, the existing data set is split into two or more groups based on the rules. The data analysis is done recursively on each of the resulting subsets until all the decision rules relevant to the classification process are identified.
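The recursive splitting described above can be sketched in a few lines of Python. This is a minimal illustration rather than a production algorithm, and it rests on assumptions the article doesn't make: "greatest differentiation" is measured with Gini impurity, every variable is numeric and tested against a threshold, and the function names and the toy spam data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a group of class labels (0.0 means the group is pure)."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_split(rows, labels):
    """Find the (feature_index, threshold) test that lowers impurity the most.

    Returns None when no test improves on the unsplit group.
    """
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[f] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a test that separates nothing is useless
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, threshold), score
    return best

def build_tree(rows, labels):
    """Recursively split the data; a leaf is the majority class of its group."""
    split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, threshold = split
    left = [(r, lab) for r, lab in zip(rows, labels) if r[f] <= threshold]
    right = [(r, lab) for r, lab in zip(rows, labels) if r[f] > threshold]
    return {
        "feature": f,
        "threshold": threshold,
        "left": build_tree([r for r, _ in left], [lab for _, lab in left]),
        "right": build_tree([r for r, _ in right], [lab for _, lab in right]),
    }

def classify(tree, row):
    """Walk the tests from the root down to a leaf and return its class."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree

# Hypothetical training data: [flagged words, embedded links] per email.
emails = [[0, 1], [1, 0], [5, 3], [7, 1], [6, 4], [0, 0]]
classes = ["ham", "ham", "spam", "spam", "spam", "ham"]
tree = build_tree(emails, classes)
print(classify(tree, [8, 2]))
```

On this toy data, a single test on the first variable is enough to separate the two classes, so the tree has one decision rule and two leaves. Real data sets usually need deeper trees and some limit on depth to avoid fitting noise.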
2. Clustering herds data sets together
Examples of machine learning methods also include clustering. The goal of a cluster analysis algorithm is to consider entities in a single large pool and formulate smaller groups that share similar characteristics. For example, a cable television company that wants to determine the demographic breakdown of viewers watching different networks can do so by creating clusters based on available data about subscribers and what they're watching. A restaurant chain might cluster its clientele based on menu selections by geographic location and then tweak its menus accordingly.
In general, clustering algorithms examine a designated number of data characteristics and map each data entity to a corresponding point in a dimensional plot. The algorithms then look to group elements together based on their relative proximity to one another in the plot.
A commonly used type is the k-means clustering algorithm. Such algorithms split a set of data entities into clustered groups, with k representing the number of groups created. The algorithms refine the assignment of entities to different clusters by iteratively calculating the mean point, or centroid, of each cluster. The centroids become the focal points of the iterations: each pass reassigns every data entity to its nearest centroid and then recalculates the centroids from the new groupings. The algorithm repeats until the assignments stop changing and the centroids no longer "move."
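The k-means loop can be sketched as follows. This is a simplified illustration under stated assumptions: proximity is Euclidean distance, the starting centroids are simply the first k points (real implementations choose them more carefully), and the function names and sample points are hypothetical.

```python
import math

def mean_point(points):
    """Centroid of a group: the coordinate-wise mean of its points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def k_means(points, k, max_iterations=100):
    """Group points into k clusters; returns (centroids, clusters)."""
    centroids = [tuple(p) for p in points[:k]]  # naive initialization (assumption)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # the centroids no longer move
        centroids = new_centroids
    return centroids, clusters

# Hypothetical two-dimensional data with two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, k=2)
print(clusters)
```

The result depends on the starting centroids, so a run can settle on groupings that are stable but not the best possible. Running the algorithm several times from different starting points is a common way to reduce that risk.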
3. Affinity analysis builds relationships
Affinity analysis is another approach to mining and analyzing data that can be done via machine learning. Its aim is to discover correlations among data attributes or processing events. For example, it's used by retailers in market-basket analysis applications to identify items often purchased at the same time; an online retailer could use the results to guide product placement on its website.
Cybersecurity efforts also commonly incorporate affinity analysis. Sequences of network transactions that precede cyberattacks are analyzed to identify patterns of transactions occurring within close proximity to one another. The correlated events can then be used to formulate prescriptive analytics applications designed to pre-empt similar attacks.
One of the popular algorithms used for affinity analysis is called Apriori. It looks for correlations -- formally called association rules -- among data attribute values in transactional database records. Like the other algorithms mentioned, Apriori works iteratively. Each phase increases the number of variables -- the itemset size, in analytics parlance -- that the algorithm considers in an effort to find as many correlations as possible in the data being analyzed.
To uncover data affinities, the Apriori algorithm computes a pair of metrics: "support," which divides the number of database records that contain the set of specified variables by the total number of records, and "confidence," which calculates the probability that a record will include one of the desired attribute values when it contains the others.
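Written as formulas, the two metrics look like this. The notation is an assumption made for illustration: T is the set of database records, X and Y are sets of attribute values, and the rule being tested is "records that contain X also contain Y."

```latex
\[
\mathrm{support}(X) = \frac{\left|\{\, t \in T : X \subseteq t \,\}\right|}{|T|},
\qquad
\mathrm{confidence}(X \Rightarrow Y) = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}
\]

% Hypothetical example: 5 records, 4 containing beer, 3 containing beer and chips.
\[
\mathrm{support}(\{\text{beer}, \text{chips}\}) = \frac{3}{5} = 0.6,
\qquad
\mathrm{confidence}(\{\text{beer}\} \Rightarrow \{\text{chips}\}) = \frac{3/5}{4/5} = 0.75
\]
```

In the made-up example, beer and chips appear together in 60% of all records, and 75% of the records that contain beer also contain chips.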
Correlations with support and confidence measurements above predefined thresholds are logged as association rules; for example, "95% of the time when a customer buys beer, she also buys potato chips." After an iteration is completed, the itemsets that fell below the support threshold are pruned, because no larger itemset containing them can meet it. The remaining itemsets are then expanded and the process is repeated until no additional associations are found.
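The iterative process can be sketched in Python as a simplified, level-by-level version of Apriori. Several things here are assumptions for illustration: the function name, the thresholds and the sample baskets are hypothetical; the sketch prunes candidate itemsets, which is the usual formulation of the algorithm; and it derives the rules in a final pass once all frequent itemsets are known, rather than logging them during each iteration.

```python
from itertools import combinations

def apriori(transactions, min_support, min_confidence):
    """Return (frequent itemsets with their support, association rules)."""
    transactions = [frozenset(t) for t in transactions]
    total = len(transactions)

    def support(itemset):
        # Share of records that contain every item in the itemset.
        return sum(itemset <= t for t in transactions) / total

    # Start with single items that meet the support threshold.
    items = {item for t in transactions for item in t}
    current = {s for s in (frozenset([i]) for i in items) if support(s) >= min_support}

    frequent = {}
    size = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        size += 1
        # Expand: join surviving itemsets into candidates one item larger.
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Prune: drop any candidate that has an infrequent subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, size - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}

    # Turn each frequent itemset into rules that meet the confidence threshold.
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = sup / frequent[antecedent]
                if conf >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, sup, conf))
    return frequent, rules

# Hypothetical market-basket records.
baskets = [
    {"beer", "chips"},
    {"beer", "chips", "salsa"},
    {"beer", "diapers"},
    {"chips", "salsa"},
    {"beer", "chips", "diapers"},
]
frequent, rules = apriori(baskets, min_support=0.4, min_confidence=0.7)
for antecedent, consequent, sup, conf in rules:
    print(sorted(antecedent), "->", sorted(consequent), round(sup, 2), round(conf, 2))
```

The pruning step is what keeps the search manageable: a three-item set is only counted against the records if all of its two-item subsets already met the support threshold.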
In addition to these examples of machine learning algorithms and approaches, many other algorithms can be used to achieve similar analytics results. Don't stop here: Creating a full inventory of available algorithms will help guide your data scientists and business analysts in choosing the right machine learning methods for their analytics applications.
More from David Loshin: Predictive analytics requires a solid data mining plan