
Using small data sets for machine learning models sees growth

While massive data sets allow for easy training, developers are using new techniques to mine and transfer data that allow for training on limited labeled information.

Early generations of machine learning tools required massive data sets to get useful results, which limited the types of machine learning models that could be created. Currently, however, researchers and vendors are developing new AI technologies that use a variety of techniques to reduce the amount of data required.

"Few-shot" and "n-shot" training approaches can train models with small data sets for machine learning algorithms. Researchers are also exploring "zero-shot" techniques that can learn from related data or descriptions of what to look for in the data -- without any designated data sets. The development of training models requiring limited data can make it easier for an enterprise with less data to create and develop AI strategies.  

Grow more vs. know more

Nate Nichols, distinguished principal at AI company Narrative Science, said there are two broad approaches to finding success with small training data sets -- grow more or know more.

Grow-more approaches, like simulations or generative adversarial networks (GANs), grow more data for the model to learn from. Simulations are great for problems related to user behavior or physical processes. Microsoft AI & Research's AirSim simulator, for example, can generate artificial video of pedestrians walking in all kinds of lighting and weather conditions for AI models to train on. GANs have been used to generate fake videos, but they also show promise in generating better training data. Adversarial networks work well for situations that naturally have an adversarial component, such as fraud.
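The sketch below illustrates the grow-more idea with a deliberately tiny GAN in PyTorch: a generator learns to produce samples that a discriminator cannot tell apart from "real" records, and the generated samples can then pad out a small training set. The two-dimensional Gaussian data is a toy stand-in, not AirSim or any production system.

```python
# Minimal sketch of a GAN used to "grow more" training data (PyTorch).
# The real data here is a toy 2-D Gaussian standing in for an enterprise dataset.
import torch
from torch import nn

latent_dim, data_dim = 8, 2

generator = nn.Sequential(
    nn.Linear(latent_dim, 32), nn.ReLU(),
    nn.Linear(32, data_dim),
)
discriminator = nn.Sequential(
    nn.Linear(data_dim, 32), nn.ReLU(),
    nn.Linear(32, 1),
)

loss_fn = nn.BCEWithLogitsLoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

def real_batch(n=64):
    # Stand-in for sampling a batch of real training records
    return torch.randn(n, data_dim) * 0.5 + torch.tensor([2.0, -1.0])

for step in range(2000):
    # Train the discriminator to separate real from generated samples
    real = real_batch()
    fake = generator(torch.randn(real.size(0), latent_dim)).detach()
    d_loss = loss_fn(discriminator(real), torch.ones(real.size(0), 1)) + \
             loss_fn(discriminator(fake), torch.zeros(fake.size(0), 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Train the generator to fool the discriminator
    fake = generator(torch.randn(64, latent_dim))
    g_loss = loss_fn(discriminator(fake), torch.ones(64, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

# Generated samples can now augment a small real dataset
synthetic = generator(torch.randn(100, latent_dim)).detach()
print(synthetic.mean(dim=0))
```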

Many enterprises sit on vast troves of unlabeled data. Few-shot approaches could also help clean and label such data sets for machine learning modeling, in effect growing more data. The ability to learn with limited labeled data opens new product possibilities and lets enterprises put large pools of otherwise unusable data to innovative use.

“Enterprises typically sit on large pools of data, but most of this data does not have labels and cannot be used to build a model,” said Bethann Noble, director of product marketing for machine learning at Cloudera.
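One common way to put such unlabeled pools to work is semi-supervised self-training, in which a model trained on the few labeled examples pseudo-labels the data it is confident about and retrains on the result. The scikit-learn sketch below uses synthetic data and is a generic illustration, not Cloudera's approach.

```python
# Minimal sketch of semi-supervised self-training: a model trained on the few
# labeled examples pseudo-labels the unlabeled pool and retrains (scikit-learn).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Pretend only about 5% of the data is labeled; unlabeled points get label -1
rng = np.random.default_rng(0)
y_partial = y.copy()
y_partial[rng.random(len(y)) > 0.05] = -1

# The base classifier is retrained on its own high-confidence predictions
model = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
model.fit(X, y_partial)

print("accuracy on the fully labeled ground truth:",
      accuracy_score(y, model.predict(X)))
```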

Know-more approaches, like transfer learning or pretrained models, rely on the model learning from a broader set of data than just the training data. Transfer learning involves building a new model on top of a previous model that was already trained on existing data. The model then doesn't need to solve your actual problem from scratch, but just needs to learn the difference between the problem it was trained on originally and the problem you're training it on now, which often requires less training data, Nichols said.

Facebook has seen success with transfer learning in its translation systems. Once its system performed well in translating from English to Spanish (which has a lot of data), the company was able to get it to translate from English to Urdu using significantly less data. Adjacent problems such as face recognition (VGG-Face), finding objects in a photo (Mask R-CNN) or language understanding (word2vec) are prime candidates for bootstrapping via know-more techniques like transfer learning or pretrained models.
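The sketch below shows the usual transfer learning recipe with a torchvision ResNet-18 pretrained on ImageNet: freeze the backbone and train only a small replacement head on the new task. The three-class task and the dummy image batch are hypothetical placeholders; in practice the head would be trained on an organization's own small labeled set.

```python
# Minimal sketch of transfer learning: reuse a model pretrained on a large
# dataset (ImageNet) and retrain only a small classification head (PyTorch).
import torch
from torch import nn
from torchvision import models

# Load a backbone pretrained on ImageNet
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained weights so the small dataset only trains the new head
for param in backbone.parameters():
    param.requires_grad = False

# Replace the final layer with a head for a hypothetical 3-class task
backbone.fc = nn.Linear(backbone.fc.in_features, 3)

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One training step on a dummy batch standing in for the small labeled set
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 3, (8,))
loss = loss_fn(backbone(images), labels)
optimizer.zero_grad(); loss.backward(); optimizer.step()
print(float(loss))
```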

Active learning is another know-more approach that seeks to create new feedback loops for labeling data more efficiently, Noble said. In active learning, the model identifies the data points it has the most difficulty with and requests labels for them, which a human can step in and provide. Because those are the examples the model stands to learn the most from, labeling them first reduces the total amount of annotation needed.
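A minimal sketch of that loop, using uncertainty sampling: at each round the model asks for a label on the pool example whose top predicted probability is closest to chance. The data is synthetic and the "human annotator" is simulated by looking up the true label, so this illustrates the mechanic rather than a production workflow.

```python
# Minimal sketch of active learning via uncertainty sampling: request a human
# label for the pool example the current model is least sure about (scikit-learn).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
labeled_idx = list(range(10))            # start with 10 labeled examples
pool_idx = list(range(10, len(X)))       # the rest form the unlabeled pool

model = LogisticRegression(max_iter=1000)
for _ in range(20):
    model.fit(X[labeled_idx], y[labeled_idx])

    # Uncertainty = how far the top predicted probability is from certainty
    probs = model.predict_proba(X[pool_idx])
    uncertainty = 1.0 - probs.max(axis=1)
    query = pool_idx[int(np.argmax(uncertainty))]

    # A human annotator would supply this label; here we just look it up
    labeled_idx.append(query)
    pool_idx.remove(query)

model.fit(X[labeled_idx], y[labeled_idx])
print("labels used:", len(labeled_idx), "accuracy:", model.score(X, y))
```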

Meta-learning is another approach that shifts the focus from training a model to training a model how to learn from small data sets. In traditional machine learning, the focus is on collecting many examples of a single class. In meta-learning, the focus changes to collecting many tasks, which indirectly implies collecting data for many diverse classes. Meta-learning will become more practical for enterprises as the underlying algorithms mature in use cases such as product classification or rare disease classification, where the data exhibits many classes but each class has only a few examples, Noble said.
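The data plumbing behind meta-learning is episodic sampling: rather than one large training set, the learner repeatedly sees small N-way, K-shot "episodes," each with a few support examples to learn from and a few query examples to be tested on. The sketch below samples such episodes from a toy data set of 20 classes; the features are random placeholders.

```python
# Minimal sketch of episodic sampling for meta-learning: instead of one large
# training set, the learner sees many small N-way, K-shot "tasks" (episodes).
import numpy as np

def sample_episode(features, labels, n_way=5, k_shot=1, n_query=5, rng=None):
    """Return a tiny support/query split drawn from n_way randomly chosen classes."""
    if rng is None:
        rng = np.random.default_rng()
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support_x, support_y, query_x, query_y = [], [], [], []
    for episode_label, c in enumerate(classes):
        idx = rng.permutation(np.flatnonzero(labels == c))
        support_x.append(features[idx[:k_shot]])
        support_y += [episode_label] * k_shot
        query_x.append(features[idx[k_shot:k_shot + n_query]])
        query_y += [episode_label] * n_query
    return (np.concatenate(support_x), np.array(support_y),
            np.concatenate(query_x), np.array(query_y))

# Toy dataset: 20 classes with 30 examples each, 16-dimensional features
rng = np.random.default_rng(0)
features = rng.normal(size=(600, 16))
labels = np.repeat(np.arange(20), 30)

sx, sy, qx, qy = sample_episode(features, labels, n_way=5, k_shot=1, rng=rng)
print(sx.shape, qx.shape)   # (5, 16) support, (25, 16) query
```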

Shining a light on small data sets for machine learning

There are a wide variety of problems developers can run into when working with smaller data sets, but they can reduce these concerns by taking a first-principles modeling approach, said Greg Makowski, head of data science solutions at FogHorn, an IoT platform provider. For example, if you know the physics or chemistry equations that govern a process, you don't need as many samples to develop effective machine learning models.
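A toy illustration of the first-principles idea: when a known equation (here the ideal gas law) explains most of a signal, a small model only has to learn the residual correction, so far fewer samples are needed. The data and equation are stand-ins, not FogHorn's implementation.

```python
# Minimal sketch of a first-principles approach: a known physics equation does
# most of the work, and a small model learns only the residual correction.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Toy "sensor" data: ideal gas law plus a small unknown nonlinearity and noise
n = 40                                  # deliberately few samples
temperature = rng.uniform(250, 400, n)  # K
volume = rng.uniform(1.0, 5.0, n)       # m^3
moles = 2.0
R = 8.314
pressure = moles * R * temperature / volume           # physics prediction
observed = pressure * (1 + 0.02 * np.sin(volume)) + rng.normal(0, 50, n)

# Fit a small model only on the residual the physics does not explain
X = np.column_stack([temperature, volume])
residual_model = Ridge().fit(X, observed - pressure)

def predict(temp, vol):
    """Prediction = physics term + learned correction."""
    physics = moles * R * temp / vol
    correction = residual_model.predict(np.column_stack([temp, vol]))
    return physics + correction

print(predict(np.array([300.0]), np.array([2.0])))
```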

In particular, developers and researchers struggle to create models that account for outliers. Anomalies, by definition, don't happen often, which means the data sets are small and often lack variety. Few-shot learning will help data science teams reduce the burdens of gathering a large set of the right data and paying for the compute to train a model on it. Those are both hard and expensive undertakings, Nichols said. If few-shot learning really starts working on many tasks, it will vastly expand the number of tasks to which machine learning can be applied.

Start with smaller problems

Arijit Sengupta, CEO of automated machine learning platform Aible, said that developers are likely to see the best results with limited data by finding ways to break a project into smaller problems or smaller models. One approach is to create very focused AI for a very specific use case such as a product type, country or vertical market.

“When you narrow the problem down, you can train the AI on a smaller data set and know that you’ve covered most of the examples for that very focused case,” he said.

The traditional AI approach is to deploy many large models, but that’s very expensive and time-consuming. With advances in automated machine learning and model deployment, it’s now possible to take many small models and stitch them together into an overall predictive model. It’s less exotic than some of the new techniques, but these approaches are better understood, and they create immediate value, Sengupta said. 
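A rough sketch of the many-small-models idea: train one simple classifier per narrow segment (a hypothetical region column here) and route each prediction to the matching model. The data and segmentation are made up for illustration; this is not Aible's product.

```python
# Minimal sketch of stitching small, focused models together: one model per
# narrow segment (here a hypothetical "region" column) handles its own predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=600, n_features=8, random_state=0)
segment = np.random.default_rng(0).choice(["us", "eu", "apac"], size=len(X))

# Train one small model per segment on that segment's (smaller) data set
models = {
    seg: LogisticRegression(max_iter=1000).fit(X[segment == seg], y[segment == seg])
    for seg in np.unique(segment)
}

def predict(x_row, seg):
    """Route a single example to the focused model for its segment."""
    return models[seg].predict(x_row.reshape(1, -1))[0]

print(predict(X[0], segment[0]))
```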

Another promising approach is to use evolutionary techniques that start with a simple model and refine it through simulation and population-based learning, which helps when traditional models fall short or data sets are too small to apply AI.

“This is substantially faster and more efficient than other methods and produces more optimal models with much less starting data,” said Bret Greenstein, head of AI & Analytics for Cognizant Digital Business.
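A toy sketch of population-based search: candidate model configurations are scored, the fittest survive each generation, and random mutations refine them. The decision-tree hyperparameters and synthetic data are placeholders; this is a generic illustration, not Cognizant's method.

```python
# Toy sketch of an evolutionary / population-based search: candidate model
# configurations are scored, the fittest survive, and mutations refine them.
import random
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

def fitness(config):
    model = DecisionTreeClassifier(max_depth=config["max_depth"],
                                   min_samples_leaf=config["min_samples_leaf"],
                                   random_state=0)
    return cross_val_score(model, X, y, cv=3).mean()

def mutate(config):
    return {"max_depth": max(1, config["max_depth"] + random.choice([-1, 0, 1])),
            "min_samples_leaf": max(1, config["min_samples_leaf"] + random.choice([-1, 0, 1]))}

random.seed(0)
population = [{"max_depth": random.randint(1, 8),
               "min_samples_leaf": random.randint(1, 10)} for _ in range(10)]

for generation in range(5):
    scored = sorted(population, key=fitness, reverse=True)
    survivors = scored[:3]                       # keep the fittest configurations
    population = survivors + [mutate(random.choice(survivors)) for _ in range(7)]

best = max(population, key=fitness)
print(best, fitness(best))
```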

Trulia learns from new documents

Most machine learning models require either sufficient labeled data or related unsupervised data to learn a stable data distribution. That stability is what ensures models generalize well and deliver the expected performance at test time.

Jyoti Prakash Maheswari, applied scientist at Trulia, the real estate services company owned by Zillow, is applying AI and machine learning to document understanding for scanned transaction documents. Given the large variation across these documents and the increasing frequency of new document types, it can be challenging to secure enough annotated data for every type. Of the large number of document types, only a few occur with the frequency typically required to train a model, Maheswari said.

Techniques like transfer learning and multi-modal knowledge transfer help learn key data features and apply them to new domains. Weakly supervised and semi-supervised techniques help with both data creation and training. Data augmentation and active learning with humans in the loop are also very promising techniques for smart data generation.
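Of those techniques, data augmentation is the easiest to show in a few lines: generate label-preserving variants of each training image so a small data set goes further. The torchvision sketch below assumes a hypothetical listing.jpg file and standard crop, flip and color-jitter transforms; it is a generic example, not Trulia's pipeline.

```python
# Minimal sketch of data augmentation: generate label-preserving variants of
# each training image so a small dataset goes further (torchvision).
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

# "listing.jpg" is a hypothetical training image; each call yields a new variant
image = Image.open("listing.jpg").convert("RGB")
variants = [augment(image) for _ in range(5)]
print(variants[0].shape)   # torch.Size([3, 224, 224])
```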

Trulia, and by extension Zillow, has been using these techniques to create training data sets from the images and text descriptions associated with real estate listings. The company has used transfer learning to train scene classification and real estate attribute recognition models, while unsupervised and self-supervised learning techniques provide creative ways to learn data representations and real estate word embeddings, and to extract keywords from listing descriptions.
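As a rough illustration of the word-embedding piece, the gensim sketch below trains word2vec-style vectors on a few toy listing descriptions and queries for related terms. Real corpora would be far larger, and this is not Trulia's or Zillow's actual model.

```python
# Minimal sketch of learning real-estate word embeddings from listing text and
# using them as simple similarity/keyword features (gensim word2vec).
from gensim.models import Word2Vec

# Toy stand-ins for listing descriptions; real corpora would be far larger
descriptions = [
    "bright two bedroom condo with updated kitchen and hardwood floors",
    "spacious family home with large backyard pool and two car garage",
    "cozy studio near downtown with renovated bathroom and new appliances",
]
tokenized = [d.split() for d in descriptions]

model = Word2Vec(sentences=tokenized, vector_size=32, window=3, min_count=1, epochs=50)

# Embeddings can feed downstream models or surface related listing keywords
print(model.wv.most_similar("kitchen", topn=3))
```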

Explainability required

Few-shot and n-shot training algorithms require an understanding of deep learning architectures and mathematical formulations. Putting these algorithms into practice requires defining the target, constructing a deep learning architecture and properly placing the chosen type of learning within that architecture. Mainstream deep learning libraries, such as Keras and PyTorch, currently do not support these algorithms in a way that can go into production immediately.

Today, the available algorithms come as modules and classes that must be combined with handwritten scripts. In the future, the algorithms may be built into tools as APIs that can go directly into production. For now, it’s important to incorporate better explainability into these APIs as well, said Bradley Hayes, CTO at Circadence, a cybersecurity learning company.

“Until we can develop explainable AI techniques to enable us to examine the models' underlying logic, even if only in an intuitive sense, it will be irresponsible to place our trust in their ability to act on our behalf,” Hayes said.
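One simple, widely used explainability technique is permutation feature importance: shuffle each input feature and measure how much the model's performance drops. The scikit-learn sketch below applies it to a random forest on synthetic data; it is a generic illustration, not a recommendation of any specific tool.

```python
# Minimal sketch of one simple explainability technique: permutation feature
# importance, which measures how much shuffling each input hurts the model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for i, importance in enumerate(result.importances_mean):
    print(f"feature {i}: {importance:.3f}")
```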

In the long run, learning from limited data will benefit from new algorithms and approaches that combine the power of deep learning with the explicit reasoning and semantics of traditional AI.
