Supervised learning tends to get the most publicity in discussions of artificial intelligence techniques since it's often the last step used to create the AI models for things like image recognition, better predictions, product recommendation and lead scoring.
In contrast, unsupervised learning tends to work behind the scenes earlier in the AI development lifecycle: It is often used to set the stage for supervised learning's magic to unfold, much like the grunt work that allows a manager to shine.
Introduction to machine learning techniques
Technically speaking, the terms supervised and unsupervised learning refer to whether the raw data used to create algorithms has been prelabeled or not.
Supervised learning. In supervised learning, data scientists feed algorithms labeled training data and define the variables they want the algorithm to assess for correlations. Both the input and the output of the algorithm are specified in the training data. For example, to train an algorithm to infer whether a picture has a cat in it using supervised learning, data scientists create a label for each picture used in the training data indicating whether the image contains a cat or not.
Unsupervised learning. In an unsupervised learning approach, the algorithm is trained on unlabeled data. It scans through data sets looking for any meaningful connection. This approach is useful when you don't know what you're looking for. If you showed this algorithm many thousands or millions of pictures, it might come to categorize a subset of the pictures as images of what humans would recognize as felines.
An algorithm trained on labeled data of cats versus canines, in contrast, will be able to identify images of cats with a high degree of confidence. But if the supervised learning project takes a million labeled images to develop the model, the machine-generated prediction requires a lot of human effort.
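The contrast between the two approaches can be sketched in a few lines of Python. This is a minimal illustration, not a production image classifier: each "picture" is assumed to have already been reduced to two made-up numeric features, the nearest-centroid classifier stands in for a supervised algorithm, and k-means stands in for an unsupervised one. All function names and data values are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def centroid(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def assign(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))

def fit_supervised(points, labels):
    """Supervised: labels are given, so learn one centroid per label."""
    model = {}
    for name in set(labels):
        model[name] = centroid([p for p, l in zip(points, labels) if l == name])
    return model

def predict(model, point):
    """Return the label whose centroid is closest to the point."""
    return min(model, key=lambda name: sq_dist(point, model[name]))

def fit_kmeans(points, k, iterations=20, seed=0):
    """Unsupervised: no labels, so group similar points with k-means."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[assign(p, centers)].append(p)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
    return centers

# Hypothetical feature vectors: three cat-like and three dog-like pictures.
points = [(1.0, 1.2), (0.8, 1.0), (1.2, 0.9), (5.0, 5.1), (4.8, 5.3), (5.2, 4.9)]
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]

model = fit_supervised(points, labels)       # needs the human-made labels
prediction = predict(model, (1.1, 1.0))

centers = fit_kmeans(points, k=2)            # never sees the labels
clusters = [assign(p, centers) for p in points]
```

The supervised model can name its answer ("cat") because humans supplied the names. The k-means result only says which pictures belong together; a person still has to look at each group and decide what it represents.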
Semi-supervised learning. Data scientists can take a sort of shortcut called semi-supervised learning that combines both approaches. Semi-supervised learning describes a specific workflow in which unsupervised learning algorithms are used to automatically generate labels, which can be fed into supervised learning algorithms. In this approach, humans manually label some images, unsupervised learning guesses the labels for others, and then all these labels and images are fed to supervised learning algorithms to create an AI model.
Semi-supervised machine learning can lower the cost of labeling the large data sets used in machine learning. "If you can get humans to label 0.01% of your millions of samples, then the computer can leverage those labels to significantly increase its predictive accuracy," said Aaron Kalb, co-founder and CDO of Alation, an enterprise data catalog platform.
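One common way to realize this workflow is self-training, also called pseudo-labeling. The sketch below is an assumption about how such a loop might look, not a description of any vendor's system: a tiny nearest-mean classifier is fit on the few human labels, it labels only the unlabeled points it is confident about, and then it is refit on the enlarged set. The one-dimensional feature values, the `margin` parameter and the function names are all hypothetical, and the code assumes at least two classes.

```python
def class_means(values, labels):
    """Mean feature value per label (a minimal nearest-mean classifier)."""
    sums, counts = {}, {}
    for v, l in zip(values, labels):
        sums[l] = sums.get(l, 0.0) + v
        counts[l] = counts.get(l, 0) + 1
    return {l: sums[l] / counts[l] for l in sums}

def predict(means, value):
    """Return the label whose mean is closest to the value."""
    return min(means, key=lambda l: abs(value - means[l]))

def self_train(labeled, unlabeled, rounds=5, margin=1.0):
    """Pseudo-label confident unlabeled points, refit, and repeat."""
    values = [v for v, _ in labeled]
    labels = [l for _, l in labeled]
    remaining = list(unlabeled)
    for _ in range(rounds):
        means = class_means(values, labels)
        confident, uncertain = [], []
        for v in remaining:
            dists = sorted(abs(v - m) for m in means.values())
            # Confident when the nearest class is clearly closer than the runner-up.
            if dists[1] - dists[0] >= margin:
                confident.append(v)
            else:
                uncertain.append(v)
        if not confident:
            break
        for v in confident:
            values.append(v)
            labels.append(predict(means, v))  # machine-generated label
        remaining = uncertain
    return class_means(values, labels), remaining

# Two human-labeled samples and seven unlabeled ones (made-up numbers).
labeled = [(1.0, "cat"), (9.0, "dog")]
unlabeled = [1.5, 2.0, 2.5, 8.0, 8.5, 7.5, 5.2]

means, still_unlabeled = self_train(labeled, unlabeled)
```

Points that sit between the classes, such as 5.2 here, stay unlabeled instead of receiving a guess. In practice those are good candidates to send back to human labelers.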
Reinforcement learning. Another machine learning approach is reinforcement learning. Typically used to teach a machine to complete a sequence of steps, reinforcement learning is different from both supervised and unsupervised learning. Data scientists program an algorithm to perform a task, giving it positive or negative cues as it works out how to do the task. The programmer sets the rules for the rewards but leaves it to the algorithm to decide on its own what steps it needs to take to maximize the reward -- and therefore complete the task.
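The reward-driven loop described above can be illustrated with tabular Q-learning, one well-known reinforcement learning algorithm. The environment here is an assumption made up for the example: a five-cell corridor in which the agent starts at the left end, receives a positive cue (+1) on reaching the right end, and a small negative cue (-0.01) for every other step. The programmer defines only those rewards; the agent works out which moves to make.

```python
import random

def train_q_table(n_states=5, episodes=300, alpha=0.5, gamma=0.9,
                  epsilon=0.2, seed=0):
    """Tabular Q-learning on a corridor; the goal is the rightmost cell."""
    rng = random.Random(seed)
    moves = (-1, +1)                      # action 0 = left, action 1 = right
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):              # cap on steps per episode
            if rng.random() < epsilon:
                action = rng.randrange(2)  # explore: random move
            else:                          # exploit: best known move
                action = 0 if q[state][0] > q[state][1] else 1
            nxt = min(max(state + moves[action], 0), goal)
            # Rewards set by the programmer: +1 at the goal, -0.01 otherwise.
            reward = 1.0 if nxt == goal else -0.01
            future = 0.0 if nxt == goal else gamma * max(q[nxt])
            # Nudge the estimate toward reward plus discounted future value.
            q[state][action] += alpha * (reward + future - q[state][action])
            state = nxt
            if state == goal:
                break
    return q

q = train_q_table()
# Read off the learned behavior for every non-goal cell.
policy = ["right" if q[s][1] > q[s][0] else "left" for s in range(len(q) - 1)]
```

Nothing in the code says "walk right." That behavior emerges because moving right is the sequence of steps that maximizes the total reward.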
Choosing unsupervised vs. supervised machine learning
Shivani Rao, senior applied researcher at LinkedIn, said the best practices for adopting supervised or unsupervised machine learning are often dictated by the circumstances, the assumptions you can make about the data, and the application.
The choice of using supervised learning versus unsupervised machine learning algorithms can also change over time, Rao said. "Often in early stages of the model building process, data is unlabeled, and one can expect labeled data in the later stages of modeling."
For example, for a problem that predicts if a LinkedIn member will watch a course video, the very first model will be based on an unsupervised technique. Once these recommendations are served, a metric recording whether someone clicks on the recommendation provides new data to generate a label.
LinkedIn also uses this technique for tagging online courses with skills that a student might want to acquire. Human labelers such as an author, publisher or student can provide a very precise and accurate list of skills that the course teaches, but it is not possible for them to provide an exhaustive list of such skills. Hence, this data can be thought of as incompletely tagged. These types of problems can use semi-supervised techniques to help build a more exhaustive set of tags.
Bharath Thota, vice president of data science for the advanced analytics practice at Kearney, a global strategy and management consulting firm, said that practical considerations also tend to govern his team's choice of using supervised or unsupervised learning.
"We choose supervised learning for applications when labeled data is available and the goal is to predict or classify future observations," Thota said. "We use unsupervised learning when labeled data is not available and the goal is to build strategies by identifying patterns or segments from the data."
Alation takes a similar approach to developing models, said Andrea Levy, the company's data science lead.
"Supervised models make a lot of sense when labels are easy to acquire or gathered as a natural part of the product," she said. For example, in an online marketplace setting, a click or purchase can indicate interest -- and can be used as labeled training data.
In Levy's view, the goal of unsupervised learning is to find the natural structure in data that has already been seen -- but hasn't been categorized or labeled. Unsupervised models are often used in data exploration and dimensionality reduction, which involves finding more efficient ways to represent data for a given type of problem.
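As a rough illustration of dimensionality reduction, the sketch below finds the first principal component of a small two-column data set and represents each row by a single number along that direction. It uses power iteration on the covariance matrix so that it runs with only the Python standard library; a real project would more likely call a numerical library. The data values and function names are made up for the example.

```python
import math

def top_principal_component(rows, iterations=100):
    """Return (mean, direction) of the first principal component."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalize.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def project(rows, mean, direction):
    """Replace each row with its coordinate along the direction."""
    return [sum((r[j] - mean[j]) * direction[j] for j in range(len(mean)))
            for r in rows]

# Two strongly correlated columns: the second is roughly twice the first.
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]

mean, direction = top_principal_component(rows)
reduced = project(rows, mean, direction)   # one number per row instead of two
```

Because the two columns move together, one coordinate along the learned direction preserves almost all of the variation in the original pairs. That is the sense in which the reduced data is a more efficient representation.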
Alation's Kalb said one of his favorite examples of using unsupervised learning to find patterns was a model dubbed Muthuball created by then-Stanford undergraduate Muthu Alagappan, in which topological clusters of basketball data revealed transformative new ways for NBA coaches and managers to think about team composition. This analysis revealed that coaches might benefit from thinking about thirteen virtual positions rather than the five they are used to.
Alation data scientists use unsupervised learning internally for a variety of applications, Kalb said. For example, they have developed a human-computer collaboration process for translating arcane data object names into human language (e.g., "na_gr_rvnu_ps" into "North American Gross Revenue from Professional Services"). In this case, the machines guess, humans confirm and machines learn.
"You could think of it as semi-supervised learning in an iterative loop, creating a virtuous cycle of increased accuracy," Kalb said.
Michael Kim, vice president at AArete, a global consultancy specializing in data-informed performance, agreed that unsupervised learning techniques can help categorize data or cluster data to demonstrate patterns not readily seen by human experts. The technique is a very powerful way to test initial hypotheses or help frame up future supervised learning models, he said. The downside, he added, is that it can be hard to interpret for operational decision-making.
Kim finds that supervised-learning-trained models are easier to interpret, as the results are framed as probabilities or odds of an outcome. The tradeoff is that supervised methods are subject to a lot more bias as there are preconceived notions of what the inputs or outputs should be.
Supervised, unsupervised and semi-supervised learning at Zillow
Zillow, the housing service, uses supervised learning to provide personalized recommendations to its customers. The underlying recommender model is supervised by feedback signals such as home page views and home saves. Also, the Zestimate, an estimate of a house price, is calculated as a classical supervised learning problem where the labels come from the recent sale prices in real estate transactions.
"Labeled data often comes at a cost, but it provides important supervision signals for model training," said Sangdi Lin, senior applied scientist at Zillow.
Unlabeled data, by contrast, is largely available data -- and it also is useful in model building, Lin said. "We use unsupervised learning to understand the underlying data pattern and distribution when labels are not provided."
Unsupervised learning has been used at Zillow, for example, to understand the characteristics of different customer segments such as users at different home shopping stages (e.g., early exploration stage or ready to transact stage).
Semi-supervised learning is used to fill in the cracks when labeled data is scarce, Lin said. By using some labeled data for supervision together with unlabeled data to capture the underlying data patterns, the approach improves how well the model generalizes.
For example, Lin's team used semi-supervised learning in a project where they extracted key phrases from listing descriptions to provide home insights for customers. They started with unsupervised key phrase extraction techniques, then incorporated supervision signals from both the human annotators and the customer engagement of the key phrase landing page to further improve the model accuracy.
Lin said that they sometimes use the various approaches across different parts of the model development lifecycle. For example, data exploration with the help of unsupervised learning techniques is often conducted at an early stage of a data science project. A data scientist might apply unsupervised clustering techniques and various visualization methods to understand the best way to frame a recommendation problem to train a supervised learning algorithm. In other cases, data scientists may discover clusters and find they can get better results by training different supervised learning models on each separate cluster rather than a single model for all the data. Alternatively, they might create cluster labels for training one model.
Anomaly detection, another unsupervised learning technique, is used by Zillow to improve the quality of data that is later fed into supervised learning algorithms, Lin said.
"Identifying anomalies and improving the training data quality can often result in improved accuracy of machine learning models," he said. Zillow has used this approach to significantly improve the accuracy of home price estimation models.
Conclusion: 5 unsupervised learning techniques
At a high level, supervised learning techniques tend to focus on either regression (fitting a model to a collection of data points to predict a numeric value) or classification (does an image have a cat or not?).
Unsupervised learning techniques often complement the work of supervised learning using a variety of ways to slice and dice raw data sets, including the following:
1. Data clustering. Data points with similar characteristics are grouped together to help understand and explore data more efficiently. For example, Zillow uses data clustering methods to identify user segments and discover similar listings.
2. Dimensionality reduction. Each variable in a data set is considered a separate dimension. However, many models work better by analyzing a specific relationship between variables. A simple example of dimensionality reduction is using profit as a single dimension, which represents income minus expenses -- two separate dimensions. However, more sophisticated types of new variables can be generated using algorithms such as principal component analysis (PCA), autoencoders (e.g., for converting words of text into vectors) or t-distributed stochastic neighbor embedding (t-SNE).
Zillow's Lin said that dimensionality reduction can help reduce the curse of overfitting, in which a model works well on the data it was trained on but does not generalize well to new data. This technique also enables Zillow to visualize high-dimensional data in a 2D or 3D space, which humans can easily understand. For example, Zillow uses dimensionality reduction to visualize the way a home recommendation algorithm represents the relationship between multiple home attributes.
3. Anomaly or outlier detection. Unsupervised learning can help identify data points that fall out of the regular data distribution. Identifying and removing the anomalies as a data preparation step may improve the performance of machine learning models.
4. Transfer learning. These algorithms utilize a model that was trained on a related but different task. For example, transfer learning techniques would make it easy to fine-tune a classifier trained on Wikipedia articles to tag arbitrary new types of text with the right topics. LinkedIn's Rao said this is one of the most efficient and quickest ways to solve a data problem where there are no labels.
5. Graph-based algorithms. These techniques attempt to build a graph that captures the relationship between the data points, said Rao. For example, if each data point represents a LinkedIn member with skills, then a graph can be used to represent members, where the edge indicates the skill overlap between members. Graph algorithms can also help transfer labels from known data points to unknown but strongly related data points. Unsupervised learning can also be used for building a graph between entities of different types (the source and the target). The stronger the edge, the higher the affinity of the source node to the target node. For example, LinkedIn uses them to match members with courses based on skills.
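The graph-based idea in the last item can be sketched as follows. This is a simplified illustration under stated assumptions, not LinkedIn's implementation: members are connected when the Jaccard overlap of their skill sets passes a threshold, and labels then spread from labeled members to unlabeled neighbors by weighted vote. The member names, skills, labels and the 0.3 threshold are all hypothetical.

```python
def jaccard(a, b):
    """Skill overlap between two members: shared skills / all skills."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def build_graph(skills, threshold=0.3):
    """Connect members whose skill overlap meets the threshold."""
    members = sorted(skills)
    graph = {m: {} for m in members}
    for i, a in enumerate(members):
        for b in members[i + 1:]:
            weight = jaccard(skills[a], skills[b])
            if weight >= threshold:
                graph[a][b] = weight
                graph[b][a] = weight
    return graph

def propagate_labels(graph, known, rounds=10):
    """Give each unlabeled node the label with the largest total edge
    weight among its already-labeled neighbors, round by round."""
    labels = dict(known)
    for _ in range(rounds):
        updates = {}
        for node, neighbors in graph.items():
            if node in labels:
                continue
            votes = {}
            for other, weight in neighbors.items():
                if other in labels:
                    label = labels[other]
                    votes[label] = votes.get(label, 0.0) + weight
            if votes:
                updates[node] = max(votes, key=votes.get)
        if not updates:
            break
        labels.update(updates)
    return labels

skills = {
    "ana": {"python", "pandas", "sql"},
    "ben": {"python", "pandas", "statistics"},
    "cho": {"pandas", "statistics", "r"},
    "dev": {"figma", "sketch", "typography"},
    "eli": {"figma", "typography", "css"},
}
known = {"ana": "data", "dev": "design"}   # only two members are labeled

graph = build_graph(skills)
labels = propagate_labels(graph, known)
```

In this example "cho" is not directly connected to any labeled member. The label arrives in the second round through "ben", which is how a graph can carry labels to data points that are only indirectly related to the known ones.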
Supervised vs. unsupervised learning in finance
Tom Shea, founder and CEO of OneStream Software, a corporate performance management platform, said supervised learning is often used in finance for building highly precise models, whereas unsupervised techniques are better suited for back-of-the-envelope types of tasks.
In supervised learning projects, data scientists will work with finance teams to utilize their domain expertise on key products, pricing and competitive insights as a critical element for demand forecasting. The domain expertise is particularly germane in more granular levels of forecasting needs where every region, product and even SKU have unique experiences and require intuition. These types of models derived from supervised learning can help to improve forecast accuracy and resulting inventory holding metrics.
Shea sees unsupervised learning being used to improve regional or divisional management jobs that don't require the direct domain knowledge of supervised learning. For example, unsupervised learning could help identify the normal rate of spending among a group of related items and the outliers. This is particularly useful in analyzing large transactional data sets (orders, expenses, invoicing) as well as helping increase accuracy during the financial close processes.
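A simple way to picture this kind of spending analysis is a robust outlier check that needs no labels. The sketch below is one possible approach, assumed for illustration: it treats the median as the "normal" amount and flags values whose modified z-score, based on the median absolute deviation, exceeds 3.5 (a commonly used cutoff). The expense figures are invented.

```python
from statistics import median

def flag_outliers(amounts, threshold=3.5):
    """Flag amounts whose modified z-score exceeds the threshold.

    The score uses the median and the median absolute deviation (MAD),
    which are less distorted by extreme values than the mean and
    standard deviation.
    """
    typical = median(amounts)
    mad = median([abs(a - typical) for a in amounts])
    if mad == 0:
        return []  # no spread to measure against
    return [a for a in amounts
            if 0.6745 * abs(a - typical) / mad > threshold]

# Hypothetical expense amounts for a group of related items.
expenses = [120.0, 135.0, 128.0, 131.0, 126.0, 122.0, 940.0, 129.0]

normal_level = median(expenses)
outliers = flag_outliers(expenses)
```

No one had to tell the algorithm what a suspicious expense looks like. It infers the normal range from the data itself and surfaces the entries that fall far outside it for a person to review.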