This content is part of the Conference Coverage: Strata + Hadoop World 2016: Hadoop and Spark in spotlight

It's back to the future for machine learning applications of big data

Machine learning techniques are far from new. What is new, however, is the number of parallelized data processing platforms available for applications of big data.

The recent rush of machine learning technology and products is formidable, but machine learning techniques are far from new. What's new is the number of parallelized data processing platforms becoming available for machine learning applications of big data.

At the recent Strata + Hadoop World conference in San Jose, Calif., data specialists said the complexity of predictive machine learning algorithms and models, as well as the sheer numbers of such models, can limit use of machine learning in large corporations. They also discussed tools that help them address these limits.

"The power of the machine learning techniques scales with the data, but training times can increase exponentially," said Ryan Michaluk, a data scientist at Allstate Insurance Co. in Northbrook, Ill.

With more sophisticated models and growing masses of digital data to process, Michaluk added, iterative machine learning became a bottleneck in his part of the organization. As a result, models ran on samples rather than full or near-full data sets, which compromised accuracy and predictability to some degree.

He said that using Hadoop data pooling is a sensible step toward addressing size issues with models and data, but that machine learning problems can remain hard to solve. "Some algorithms parallelize trivially -- some don't," he said.
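Michaluk's distinction can be illustrated with a minimal Python sketch. It is not drawn from Allstate's or Skytree's code; the function names, the toy data and the one-parameter model y = w * x are all assumptions made for illustration. Scoring a fixed model splits cleanly across data chunks, while the training loop is sequential across iterations because each step needs the weight produced by the step before it.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_squared_error(chunk, w):
    """Squared error of the toy model y = w * x on one chunk of (x, y) pairs."""
    return sum((y - w * x) ** 2 for x, y in chunk)


def parallel_squared_error(chunks, w, workers=4):
    """Trivially parallel: every chunk is scored independently, then summed.

    Threads are used only to show the structure; a real cluster would
    send each chunk to a different machine.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda c: chunk_squared_error(c, w), chunks))


def gradient_descent(data, steps=200, lr=0.01):
    """Sequential across iterations: step t needs the weight from step t-1."""
    w = 0.0
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(-2 * x * (y - w * x) for x, y in data) / len(data)
        w -= lr * grad
    return w


# Hypothetical data generated from y = 3x, split into four chunks.
data = [(x, 3.0 * x) for x in range(1, 9)]
chunks = [data[i:i + 2] for i in range(0, len(data), 2)]

total_error = parallel_squared_error(chunks, w=2.0)
fitted_w = gradient_descent(data)
```

Within a single iteration the gradient sum could itself be split across chunks, which is roughly how distributed platforms approach iterative training. The iterations themselves still have to run one after another.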

Data size and model complexity are limiters

Michaluk said his group began using Hadoop along with machine learning software from Skytree Inc. in San Jose to cut the time needed for parallel model development.

He and his colleagues are now able to take existing learning models and run them on larger sets of data, which can lead to better predictions. These models can improve decision making around pricing, fraud prevention, underwriting, marketing and webpage design.

Michaluk said the insurance industry's bread-and-butter work with actuarial tables long ago made it a hotspot for use of statistical machine learning algorithms that predict outcomes.

But data size, model complexity and the number of iterations that were required to successfully train models had become processing limitations. He indicated that newly available big data processing platforms can streamline and, thus, expand use of machine learning.

"Things you couldn't even try before, you now can do. The biggest thing for me is instead of watching the computer do these iterations, I have more time to use to solve other problems," he said.

Time frames for modeling

For Lou Carvalheira, advanced analytics manager in IT vendor Cisco Systems Inc.'s customer intelligence unit, which is also based in San Jose, machine learning has underpinned analytics for many years. That notion is so familiar that "it is not something we speak about anymore," he said.

What is new in the quest to identify potential buyers, he continued, is that "we are finding ways to scale processing. Machine learning is empowered by the fact that you can now process much more data. You use a tremendous amount of computation power, not just one computer."

But Cisco has many business partners, resellers and marketing initiatives to support. The time it takes to run literally thousands of learning models became a challenge for Carvalheira. Time sensitivity comes from the fact that these analyses are supposed to lead to concrete actions. So it is important, he said, to quickly identify buyer characteristics that teams can actually act on. But there needs to be time in the sales cycle to get analytical information to marketing and sales personnel. They in turn must create product packages that appeal to customers.

"You create a probability measure of who will buy and how much they will spend. The combination can get pretty powerful. The problem we had was in creating as many predictive models as we needed in a sufficient amount of time that would allow the customization of the actions," Carvalheira said. 

To close the gap, Carvalheira and his Cisco colleagues worked with H2O (formerly 0xdata), a Mountain View, Calif., maker of a distributed machine learning platform for analytics.

In effect, said Carvalheira, H2O has an improved version of MapReduce, the processing framework that breaks down computational work into distributed jobs and that was part and parcel of the original Hadoop. "It is optimized for statistical techniques," he said of H2O. "If you think about the movement of [customer relationship management], we have been doing this type of identification of value forever -- predicting what companies are going to buy next. Now, the tools we have for doing that have been changing."
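For readers unfamiliar with the MapReduce pattern mentioned above, the following Python sketch shows its three stages in a single process. It is an illustration only, not Hadoop's or H2O's API; the function names and the customer spending records are hypothetical. In a real cluster, each partition would be mapped on a different node and the grouped keys would be reduced in parallel.

```python
from collections import defaultdict


def map_record(record):
    """Map step: emit a (key, value) pair for one input record."""
    customer, amount = record
    yield customer, amount


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reduce step: collapse the values for one key into a single result."""
    return key, sum(values)


def map_reduce(partitions):
    """Run map over every record in every partition, then shuffle and reduce."""
    pairs = [
        pair
        for partition in partitions
        for record in partition
        for pair in map_record(record)
    ]
    return dict(reduce_group(k, vs) for k, vs in shuffle(pairs).items())


# Hypothetical purchase records, split into two partitions.
partitions = [
    [("acme", 100), ("globex", 40)],
    [("acme", 25), ("initech", 10)],
]
totals = map_reduce(partitions)
```

Because the map step looks at one record at a time and the reduce step looks at one key at a time, both stages can be spread across many machines. That independence is what makes the pattern a fit for large data sets.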

Spark of interest

While it has use in a variety of jobs, the Apache Spark data processing engine has often seen use in what might be described as new-age machine learning applications -- the recommendation engines found on many websites being a prime example.

Industry analyst Krishna Roy at The 451 Group has placed H2O and Skytree in a field of machine learning startups that includes Ayasdi, BigML, Nutonian and others. These startups in turn vie with established advanced analytics offerings from larger companies such as IBM, Microsoft and SAS Institute.

Both H2O and Skytree began life well before the rise of Apache Spark as a machine learning platform. Both companies have announced Spark support, each suggesting that Spark engines running on distributed clusters are basically complementary to their own undertakings.

Jack Vaughan is SearchDataManagement's news and site editor. Follow us on Twitter: @sDataManagement.
