Supervised learning is an approach to creating artificial intelligence (AI) in which a program is given labeled input data along with the expected output results. The AI system is told specifically what to look for, and the model is trained until it can detect the underlying patterns and relationships, enabling it to yield good results when presented with never-before-seen data.
Supervised learning is well suited to classification and regression problems, such as determining which category a news article belongs to or predicting the volume of sales on a given future date. The aim of supervised learning is to make sense of data relative to specific, externally defined measurements. In contrast, unsupervised learning tries to make sense of the data in itself: There are no external measurements or guidelines, and the algorithm has to comprehend the data and detect patterns or similarities on its own.
How does supervised learning work?
Like all machine learning algorithms, supervised learning is based on training. During its training phase, the system is fed massive amounts of labeled data, which tell it what output should be produced for each specific input value. The trained model is then presented with test data to verify the result of the training and measure its accuracy.
In neural network algorithms, the supervised learning process is improved by constantly measuring the resulting output of the model and fine-tuning the system to get closer to its target accuracy. The level of accuracy obtainable depends on two things: the data available and the algorithm in use.
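The train-then-test cycle described above can be sketched in a few lines. This is a minimal sketch, not a production training loop: the one-parameter model, toy data and learning rate below are all invented for illustration.

```python
# Toy supervised training loop: fit y = w * x by gradient descent,
# then evaluate on held-out test data. All values are illustrative.

train_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # true relationship: y = 2x
test_data = [(4.0, 8.0), (5.0, 10.0)]

w = 0.0        # model parameter, initially untrained
lr = 0.01      # learning rate

for _ in range(1000):                 # training phase
    for x, y in train_data:
        error = w * x - y             # measure how far the output is from the target
        w -= lr * error * x           # fine-tune the parameter to reduce the error

# evaluation phase: mean squared error on data the model has never seen
test_mse = sum((w * x - y) ** 2 for x, y in test_data) / len(test_data)
```

Because the toy data follows y = 2x exactly, the parameter converges to 2 and the test error approaches zero; real data is noisier and the gap between training and test error is what matters.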
High accuracy is not necessarily a good indicator on its own; it could also mean the model is suffering from overfitting -- i.e., it is overtuned to its particular training data set. Such a model might perform well in test scenarios but fail miserably when presented with real-world challenges. To avoid overfitting, the test data must be different from the training data, ensuring the model is not drawing answers from its previous experience but is instead making generalized inferences.
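The danger can be illustrated with a toy model that simply memorizes its training set. The data and decision rules below are invented; the point is only to show perfect training accuracy coexisting with poor accuracy on held-out data.

```python
# Overfitting illustration: a memorizing model vs. a generalizing rule.
# True rule: label = 1 when x >= 5. One training point (4.9 -> 1) is label noise.

def nearest_neighbor_predict(train, x):
    """Predict by copying the label of the closest training point (pure memorization)."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def threshold_predict(x):
    """A simple generalizing rule: label 1 if x >= 5, else 0."""
    return 1 if x >= 5 else 0

train = [(1.0, 0), (2.0, 0), (3.0, 0), (4.9, 1), (6.0, 1), (7.0, 1), (8.0, 1)]

# The memorizer reproduces its training data exactly -- including the noise.
train_acc = sum(nearest_neighbor_predict(train, x) == y for x, y in train) / len(train)

# Held-out test data drawn from the true rule.
test = [(4.5, 0), (4.7, 0), (5.1, 1), (9.0, 1)]
nn_test_acc = sum(nearest_neighbor_predict(train, x) == y for x, y in test) / len(test)
thr_test_acc = sum(threshold_predict(x) == y for x, y in test) / len(test)
```

The memorizing model scores 100% on its own training set but only 50% on the test set, while the simple rule generalizes; only the held-out evaluation reveals the difference.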
The training data must also be balanced and cleaned. Garbage or duplicate data will skew the AI's understanding -- hence data scientists must be careful with the data the model is trained on.
The diversity of the data determines how well the AI will perform when presented with new cases; if there are not enough samples in the training data set, the model will falter and will fail to yield any reliable answers.
The algorithm, on the other hand, determines how that data can be put to use. For instance, deep learning models with billions of parameters can be trained on enormous data sets to reach unprecedented levels of accuracy, as demonstrated by OpenAI's GPT-3.
Classification and regression
Supervised learning algorithms primarily generate two kinds of results: classification and regression.
A classification algorithm tries to determine the class or the category of the data it is presented with. For instance, object recognition algorithms are classification problems, where the AI is tasked to determine what category of objects the item it is presented with belongs to. Character recognition, email spam classification, sentiment analysis and drug classification are examples of problems requiring the AI to determine what class the data belongs to.
Often, an object belongs to several categories at once, and the AI must determine which categories apply and how much confidence it has in each prediction.
Regression tasks are different: They expect the model to produce a numerical value. Examples include predicting click-through rates in online ads, estimating real estate prices and determining how much a customer would be willing to pay for a certain product.
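The contrast between the two output types can be sketched as follows. The spam centroids and price coefficients are assumed values chosen for illustration, not learned from real data.

```python
# Classification: the output is a discrete category.
# A nearest-centroid rule over a single feature (e.g., count of "free" in an email).
centroids = {"spam": 5.0, "ham": 0.5}    # assumed mean feature value per class

def classify(x):
    """Return the class whose centroid is closest to the input feature."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))

# Regression: the output is a continuous number.
# A linear model predicting price from floor area (coefficients assumed learned).
def predict_price(area_sqm, w=3000.0, b=20000.0):
    return w * area_sqm + b
```

The classifier answers "which bucket?" while the regressor answers "how much?" -- the same labeled-training machinery, but different output types.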
Supervised learning algorithms
Common supervised machine learning algorithms include the following:
- linear regression
- logistic regression
- artificial neural networks (ANNs)
- linear discriminant analysis
- decision trees
- similarity learning
- Bayesian logic
- random forests
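As a concrete instance of the first algorithm in the list, here is a minimal pure-Python sketch of simple linear regression fit with the closed-form least-squares solution, on invented data:

```python
# Simple linear regression: fit y = slope * x + intercept by least squares.

def fit_linear(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]          # exactly y = 2x + 1
slope, intercept = fit_linear(xs, ys)
```

Real libraries solve the same problem in matrix form for many features at once, but the principle -- minimizing squared error against labeled targets -- is identical.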
When choosing a supervised learning algorithm, a few factors should be considered. The first is the algorithm's bias and variance, as there is a fine line between a model that is flexible enough and one that is too flexible. Another is the complexity of the model or function the system is trying to learn. Additionally, the heterogeneity, accuracy, redundancy and linearity of the data should be analyzed before choosing an algorithm.
Supervised vs. unsupervised learning
In unsupervised learning, by contrast, an algorithm is given only input data as a training set, without corresponding output values. There are no correct output values; instead, the algorithm is free to explore the data and surface interesting findings. Unsupervised learning is popular in applications of clustering (the act of uncovering groups within data) and association (the act of predicting rules that describe the data).
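Clustering, the first application mentioned, can be sketched with a tiny one-dimensional k-means. The points and starting centers below are invented; note that no labels appear anywhere, which is what makes this unsupervised.

```python
# Minimal 1-D k-means: group unlabeled points around k moving centers.

def kmeans_1d(points, centers, iters=10):
    clusters = [[] for _ in centers]
    for _ in range(iters):
        # assignment step: attach each point to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # update step: move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]        # two obvious groups, no labels given
centers, clusters = kmeans_1d(points, centers=[0.0, 10.0])
```

The algorithm discovers the two groups on its own; a supervised model would instead have required each point to arrive with a group label attached.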
Supervised learning models have some advantages over the unsupervised approach, but they also have limitations. For example, the systems are more likely to make judgments that humans can relate to, because humans have provided the basis for those decisions. However, supervised learning systems have trouble dealing with new information. If a retrieval-based system with categories for cars and trucks is presented with a bicycle, the bicycle would incorrectly be lumped into one category or the other. If the AI system were generative, however, it might not know what the bicycle is, but it would be able to recognize it as belonging to a separate category.
Uses and examples
Consider the news categorization problem from earlier. One approach is to determine what category each piece of news belongs to, such as business, finance, technology or sports. To solve this problem, a supervised model would be the best fit. Humans would present the model with various news articles and their categories and have the model learn what kind of news belongs to each category. This way, the model becomes capable of recognizing the news category of any article it looks at based on its previous training experience.
However, humans might also conclude that classifying news into predetermined categories is not sufficiently informative or flexible, since some articles cut across categories -- covering, say, climate change technologies or workforce problems within an industry. There are billions of news articles, and separating them into 40 or 50 categories may be an oversimplification. A better approach would be to find the similarities between articles and group the news accordingly: looking at news clusters, where similar articles are grouped together and there are no predetermined categories.
This is what unsupervised learning achieves: It determines the patterns and similarities within the data, as opposed to relating it to some external measurement.
Supervised learning may be the ideal solution for many AI problems. However, it requires huge amounts of correctly labeled data to reach acceptable performance levels, and such data may not always be available. Unsupervised learning does not suffer from this problem and can work with unlabeled data as well.
In cases where supervised learning is needed but there is a lack of quality data, semisupervised learning may be the appropriate method. This learning model resides between supervised and unsupervised learning; it accepts data that is partially labeled -- i.e., the majority of the data lacks labels.
Semisupervised learning determines the correlations between the data points -- just as unsupervised learning does -- and then uses the labeled data to assign labels to those data points. Finally, the entire model is trained on the newly applied labels.
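That pseudo-labeling loop can be sketched as follows. The nearest-neighbor labeler, nearest-centroid model and toy data are all illustrative assumptions, not a reference implementation of any particular library.

```python
# Self-training sketch: label the unlabeled points from the few labeled ones,
# then retrain a model on the enlarged, fully labeled set.

labeled = [(1.0, "a"), (9.0, "b")]             # small labeled set
unlabeled = [1.5, 2.0, 8.0, 8.5]               # majority of the data lacks labels

def nearest_label(known, x):
    """Borrow the label of the closest labeled point."""
    return min(known, key=lambda pair: abs(pair[0] - x))[1]

# Step 1: propagate labels to the unlabeled points (pseudo-labeling).
pseudo = [(x, nearest_label(labeled, x)) for x in unlabeled]

# Step 2: retrain on everything -- here, a nearest-centroid classifier.
full = labeled + pseudo
centroids = {
    lab: sum(x for x, l in full if l == lab) / sum(1 for x, l in full if l == lab)
    for lab in {"a", "b"}
}

def classify(x):
    return min(centroids, key=lambda lab: abs(centroids[lab] - x))
```

With only two labeled points, the retrained centroids end up reflecting all six points, which is the payoff of the semisupervised approach.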
Semisupervised learning has proven to yield accurate results and is applicable to many real-world problems where the small amount of labeled data would prevent supervised learning algorithms from functioning properly. As a rule of thumb, a data set with at least 25% labeled data is suitable for semisupervised learning.
Facial recognition, for instance, is well suited to semisupervised learning: The vast number of images of different people is first clustered by similarity, and a small set of labeled pictures then gives an identity to each cluster of photos.