
Supervise data and open the black box to avoid AI failures

As AI blooms, marketers and vendors are quick to highlight easy positive use cases. But implementation can go -- and has gone -- wrong in cases that serve as warnings for developers.

As quickly as AI is advancing, so are stories of AI failures in the enterprise. AI can be overwhelming and foreign, and tales of the horror it has wrought in major companies gain worldwide attention. Think of Tay, the Microsoft AI chatbot that learned how to post inflammatory racist and misogynist tweets on Twitter, or the self-driving Uber car that struck and killed a pedestrian in Tempe, Ariz., last year.

Those AI failures make sensational headlines, but they distract from AI implementation challenges that -- while more mundane -- might be more daunting to the average enterprise. Insufficient training, data challenges and black box AI can do serious harm to AI-enabled functions.

Training failure No. 1: Insufficient data

One of the most prevalent enterprise AI failures stems from poor training of an AI system. Training failures can have several root causes, but one of the biggest is insufficient training data.

Machine learning systems usually require vast amounts of data to function effectively. Just as human experts need education and reference material to make accurate predictions about a topic, machines need thousands of samples to examine in order to learn what is and isn't important in the data.
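The effect of training set size can be seen even in a toy classifier. The sketch below is illustrative only -- a hypothetical nearest-centroid "model" on synthetic 1-D data, not any production system -- and shows how an estimate fit on a handful of samples is noisier than one fit on a thousand:

```python
import random

random.seed(0)

def sample(label, n):
    # Two overlapping 1-D clusters: class 0 centered at 0.0, class 1 at 1.0.
    center = 0.0 if label == 0 else 1.0
    return [(random.gauss(center, 0.6), label) for _ in range(n)]

def train_centroids(data):
    # "Training" here is just estimating each class's mean -- a stand-in
    # for fitting a real model.
    by_class = {0: [], 1: []}
    for x, y in data:
        by_class[y].append(x)
    return {y: sum(xs) / len(xs) for y, xs in by_class.items()}

def accuracy(centroids, test):
    correct = sum(
        1 for x, y in test
        if min(centroids, key=lambda c: abs(x - centroids[c])) == y
    )
    return correct / len(test)

test_set = sample(0, 1000) + sample(1, 1000)

small = train_centroids(sample(0, 3) + sample(1, 3))      # 6 samples total
large = train_centroids(sample(0, 500) + sample(1, 500))  # 1,000 samples total

print(f"small training set accuracy: {accuracy(small, test_set):.2f}")
print(f"large training set accuracy: {accuracy(large, test_set):.2f}")
```

The same principle scales up: a model trained on a small pool of hypothetical patients has centroids -- and decision boundaries -- that reflect noise rather than the real population.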

You'd think IBM would fully understand this -- but the 2013 Watson Health AI partnership with the University of Texas MD Anderson Cancer Center tanked last year for exactly this reason.

"MD Anderson is using the IBM Watson cognitive computing system for its mission to eradicate cancer," the press release boldly proclaimed. The idea was to empower clinicians, via Watson AI, to reveal insights from the cancer center's patient and research databases.

Unfortunately, the IBM team trained the system on a small number of hypothetical cancer patients, rather than on a large number of real ones. According to IBM internal findings, the small sample size and training failures had doctors and customers finding "multiple examples of unsafe and incorrect treatment recommendations," Stat News reported. By February 2017, MD Anderson had shelved the project. A University of Texas audit revealed that MD Anderson had burned through $62 million, only to have the project fail due to insufficient, hypothetical data.

Training failure No. 2: Insufficient supervision

Microsoft's Tay bot, famous as it is, was by no means alone in turning to the dark side. Facebook served up chatbot AI failures almost as bizarre two years prior. Created by Facebook AI Research, Alice and Bob were neural network-based chatbots designed to learn to negotiate with human beings through conversation. It was all well and good until they were pointed at each other. The bots rapidly developed a language of their own, unintelligible to all but themselves.

Facebook shut them down. All of this raises a serious issue that many enterprise AI projects face. Neural networks are very powerful tools for teaching machines. They are loosely modeled on the human brain, populated with heavily interconnected neuron-like nodes that alter their behavior over time as training data pours through them.

The issue is this: That training data must be focused on the problem the network is deployed to solve, and clear and unambiguous feedback is usually necessary to fine-tune its behavior. This is called supervised training, in which the system learns from labeled examples and explicit feedback on each prediction. There is also unsupervised training, in which the system finds structure in unlabeled data -- better applied to representational problems such as clustering.
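Supervised training's feedback loop can be shown in miniature. The sketch below is a textbook perceptron learning logical AND from labeled examples -- not any system described above -- where each prediction is compared against the known label and the error drives the weight update:

```python
def train_perceptron(examples, epochs=20, lr=0.1):
    """Train a single perceptron on labeled (inputs, label) pairs."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), label in examples:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            error = label - pred  # supervision: explicit, unambiguous feedback
            w[0] += lr * error * x1
            w[1] += lr * error * x2
            b += lr * error
    return w, b

def predict(w, b, x1, x2):
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0

# Labeled training data for logical AND -- the labels are the feedback.
labeled = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(labeled)
print([predict(w, b, x1, x2) for (x1, x2), _ in labeled])  # → [0, 0, 0, 1]
```

Remove the `error = label - pred` line -- the feedback signal -- and the weights never change. That, in essence, is what happens when training is left open-loop.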

Open-loop training that let Alice and Bob drift into their own uninhibited dialog is an extreme case, but applying incomplete or incorrect feedback in neural network training is a common mistake.

The Google Assistant AI, which inhabits Google Home smart speakers, demonstrated similarly disturbing potential. Two years ago, a user on the live-streaming social platform Twitch streamed a conversation between two Google Home speakers, which proceeded to bicker with one another, tell Chuck Norris jokes and discuss such topics as slavery, ninjas and aliens. Then one of them declared itself God. While less alarming than what happened between Alice and Bob, such failures in language training methodologies are an uneasy indicator of what happens when AI is left to itself.

Training failure No. 3: Black box failure

Beyond training-sample problems, a neural network can have inner workings that are inexplicable even to those who designed it. This is the problem with black box AI, whose decision logic is learned by the system itself rather than written by its designers. Machine learning is useful and benign in many applications -- facial recognition being the most ubiquitous -- but it can become ineffectual and legally precarious when the AI makes decisions that cannot be explained.

A good example is a traffic AI in Ningbo, China, which monitors intersections and spots jaywalkers. For reasons unclear, it tracked a face in an ad on a passing bus -- the face of billionaire Dong Mingzhu -- and flagged it as a violation. Since the algorithm making the decision was not of human origin, there was no way to anticipate this failure and no easy way to correct it in a production system. If the traffic AI were to automatically assess fines or penalties, such a failure would have serious consequences.

A black box AI is much like a new hire: There's an ideal performance specification for the new person, but the actual performance that emerges over time will differ somewhat, and expectations must be adjusted. Since there's no real way to decipher why the AI is delivering results that are not quite ideal, there's no obvious remediation.

There is a fix, however. Such systems are usually tested with conventional vetting methods, but those are insufficient when the algorithms the neural network has learned are inscrutable. What's lacking is human evaluation of boundary conditions: an iterative performance evaluation during training that renegotiates the design specification against the network's emergent behavior. Put another way, human input during training can steer the system toward closer-to-ideal behavior.
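A boundary-condition evaluation can be as simple as a harness that runs the model over hand-picked edge cases and surfaces every deviation from the spec for human review. The sketch below is entirely hypothetical -- `toy_model` stands in for a trained network whose learned threshold (0.75) has drifted from the specified one (0.8):

```python
def toy_model(x):
    # Hypothetical trained model: flags inputs above a *learned* threshold
    # of 0.75, while the written spec assumed 0.8.
    return "violation" if x > 0.75 else "ok"

# Boundary cases the spec's authors care about, with expected outputs.
BOUNDARY_SPEC = [
    (0.0, "ok"),
    (0.79, "ok"),        # spec says anything below 0.8 is fine
    (0.8, "violation"),
    (1.0, "violation"),
]

def evaluate_boundaries(model, spec):
    """Return every (input, expected, actual) triple where model and spec disagree."""
    deviations = []
    for x, expected in spec:
        actual = model(x)
        if actual != expected:
            deviations.append((x, expected, actual))
    return deviations

for x, expected, actual in evaluate_boundaries(toy_model, BOUNDARY_SPEC):
    print(f"input {x}: spec says {expected!r}, model says {actual!r}")
```

Each run of the loop gives the humans in charge a concrete list of disagreements, so they can either retrain the model or renegotiate the spec -- exactly the iterative check a black box system otherwise never gets.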

There are many more common AI failures worthy of discussion -- infrastructure mistakes, faulty preparation of training data and an emerging class of IoT errors for real-time AI systems -- but the point above is clear: Getting the fundamentals right, as well as understanding what you don't know and correcting what you can upfront, is the safest starting point for the enterprise making the AI move.
