In 1970, the AI luminary Marvin Minsky predicted that researchers would develop an artificial intelligence on par with human intelligence within the decade. That milestone, as we know, was not met and the achievement of artificial general intelligence has proved elusive. Every breakthrough in AI, it seems, is a reminder of how little we understand the miraculous machine between our ears.
In its quest for artificial general intelligence, however, the field has made remarkable advances in recent years, due largely to three factors: improved algorithms, in particular neural networks; an explosion of data that can be used for training algorithms; and increased computing power.
To set realistic expectations of AI -- without missing opportunities -- it is important to fully understand the algorithms, both their capabilities and limitations.
In this article, we explore two algorithms that have propelled the field of AI forward -- convolutional neural networks (CNNs) and recurrent neural networks (RNNs). We will cover what they are, how they differ, how they work, what their limitations are and where they complement each other.
But first, some basics.
ANNs, CNNs, RNNs: What are neural networks?
The neural network was widely recognized at the time of its invention as a major breakthrough in the field. Taking a hint from how the neurons in our brains work, neural network architecture introduced an algorithm that allowed the computer to fine-tune its decision-making -- in other words, to learn.
An artificial neural network, or ANN, consists of many perceptrons. In its simplest form, a perceptron is a function that takes two inputs, multiplies each by a weight (random at first), adds the products together with a bias value, passes the sum through an activation function and outputs the result. The weights and the bias value are adjustable, and they determine the perceptron's output for any two given input values.
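The perceptron described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the sigmoid activation, the function names and the sample input values are assumptions chosen for the example.

```python
import math
import random

def sigmoid(z):
    """Activation function: squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(x1, x2, w1, w2, bias):
    """Weighted sum of two inputs plus a bias, passed through the activation."""
    return sigmoid(w1 * x1 + w2 * x2 + bias)

random.seed(0)  # fixed seed so the random starting weights are repeatable
w1 = random.uniform(-1, 1)
w2 = random.uniform(-1, 1)
bias = 0.0

# Hypothetical input values, chosen only for illustration.
output = perceptron(0.5, -1.0, w1, w2, bias)
print(output)
```

Changing `w1`, `w2` or `bias` changes the output for the same two inputs, which is exactly what training later exploits.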
Bias in artificial neurons
"In both artificial and biological networks, when neurons process the input they receive, they decide whether the output should be passed onto the next layer as input. The decision of whether or not to send information on is called bias and it's determined by an activation function built into the system. For example, an artificial neuron may only pass an output signal onto the next layer if its inputs (which are actually voltages) sum to a value above some particular threshold value."
-- Linda Tucci
This architecture was genius: combining the perceptrons generated layers of adjustable variables that could take on almost any task. The problem, though, was what numbers to pick for the weights and the bias values to make a correct calculation.
This was taken care of via a mechanism called "backpropagation." The ANN is given an input, and the result is compared to the expected output. The difference between the desired output and the actual output is then propagated backward through the network via a mathematical calculation that determines how each perceptron's weights and bias should be adjusted to bring the result closer to the desired one.
This procedure -- where the AI is trained -- is repeated until a satisfying level of accuracy is reached.
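The compare-adjust-repeat cycle can be sketched for a single perceptron. The toy task (learning logical OR), the sigmoid activation, the learning rate and the number of passes are all assumptions made for illustration; a full ANN applies the same chain-rule adjustment layer by layer.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical toy task: learn logical OR from four labeled examples.
samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

random.seed(1)
w1, w2, bias = random.uniform(-1, 1), random.uniform(-1, 1), 0.0
learning_rate = 0.5  # assumed step size

def mean_squared_error():
    """Average squared gap between actual and desired outputs."""
    return sum((sigmoid(w1 * a + w2 * b + bias) - t) ** 2
               for (a, b), t in samples) / len(samples)

error_before = mean_squared_error()

for epoch in range(5000):
    for (x1, x2), target in samples:
        output = sigmoid(w1 * x1 + w2 * x2 + bias)
        # Difference between the actual and the desired output.
        error = output - target
        # Chain rule: how the squared error changes with the weighted sum
        # (the constant factor 2 is folded into the learning rate).
        delta = error * output * (1.0 - output)
        # Nudge each adjustable value against its share of the error.
        w1 -= learning_rate * delta * x1
        w2 -= learning_rate * delta * x2
        bias -= learning_rate * delta

error_after = mean_squared_error()
predictions = [round(sigmoid(w1 * a + w2 * b + bias)) for (a, b), _ in samples]
print(error_before, error_after, predictions)
```

The error shrinks as the passes accumulate, and the loop could just as well stop once it falls below a chosen threshold instead of running a fixed number of times.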
A neural network like this works great for simple statistical predictions, such as predicting a person's favorite football team, given the person's age, gender and geographical location. But how can AI be used for more difficult tasks such as image recognition? The answer raises the question of how we feed the data into the network in the first place.
Convolutional neural networks
What we see as an image on a computer is actually a set of color values, distributed over a certain width and height. What we see as shapes and objects appears as an array of numbers to the machine. Convolutional neural networks make sense of this data through two mechanisms: filters and pooling layers.
"A filter is a matrix of randomized numbers. In a CNN, filters are multiplied against matrix representations of parts of the image, effectively scanning the picture pixel by pixel and getting the average value of all adjacent pixels, thereby detecting the most important features," explained Ajay Divakaran, the senior technical director of the Vision and Learning Laboratory in SRI International's Center for Vision Technologies, a nonprofit scientific research institute.
"This information is passed through a pooling layer, which condenses the acquired feature map into its most essential information," he added. This last step greatly reduces the size of the data and makes the neural network much faster. The resulting information is then fed into the neural network.
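To make the filter and pooling steps concrete, here is a minimal sketch in plain Python. The tiny 5x5 image and the hand-picked edge filter are assumptions for illustration; in a real CNN the filter values start out random and are learned during training. Each output value is a weighted sum of one image patch, and max pooling (one common pooling choice) keeps the strongest response in each block.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image; each output is a weighted sum of a patch."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def max_pool(feature_map, size=2):
    """Keep only the largest value in each non-overlapping size x size block."""
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in range(0, len(feature_map[0]) - size + 1, size)]
            for r in range(0, len(feature_map) - size + 1, size)]

# Hypothetical 5x5 grayscale image: dark (0) on the left, bright (1) on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Hand-picked filter that responds to a dark-to-bright vertical edge.
kernel = [[-1, 1],
          [-1, 1]]

feature_map = convolve2d(image, kernel)  # 4x4; every row is [0, 2, 0, 0]
pooled = max_pool(feature_map)           # 2x2; [[2, 0], [2, 0]]
print(feature_map)
print(pooled)
```

The pooled map is a quarter of the size of the feature map yet still records that an edge was found on the left side, which is the size reduction the quote describes.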
A CNN consists of several layers of perceptrons, and the filters effectively build a network that understands more and more of the image with every passing layer. While the first layer understands the outlines and borders, the second layer starts understanding shapes, and the third one understands objects. The power of this model is its capability to recognize objects largely regardless of where in the picture they appear. Tolerance to rotation is not built in and generally has to be learned from training images that show objects at varied angles.
CNNs are great at recognizing objects, animals and people, but what if we want to understand what is happening in the pictures?
For instance, consider a picture of a ball in the air. How can we know if the ball is thrown and going up, or if it is falling? Answering this question would require more information than a single picture -- we would need a video. The sequence of the pictures would determine if the ball is going up or down. But how can we make neural networks remember the information they had previously worked on and work that into their calculation?
Recurrent neural networks
The problem of remembering goes beyond videos -- in fact, many natural language understanding algorithms, which typically deal only with text, require some sort of remembering, such as the topic of the discussion or the previous words in the sentence.
Recurrent neural networks were designed to tackle exactly this problem. An RNN feeds the result of each step back into itself as part of the input to the next step, so that earlier inputs become a part of the final answer.
To illustrate, assume we want to translate the following sentence: "What date is it?" The algorithm feeds each word separately into the neural network, and by the time it arrives at the word "it," its output is already influenced by the word "What."
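The word-by-word feedback can be sketched with a one-number hidden state. Everything numeric here is an assumption for illustration: the word codes, the weights and the tanh activation are hand-picked, whereas a real RNN learns its weights and represents words as vectors.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, bias):
    """One recurrent step: mix the new input with the previous hidden state."""
    return math.tanh(w_x * x + w_h * h_prev + bias)

def run_rnn(inputs, w_x=0.8, w_h=0.5, bias=0.0):
    """Feed the inputs in one at a time, carrying the hidden state forward."""
    h = 0.0  # empty memory before the first word
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, bias)
        states.append(h)
    return states

# Hypothetical numeric codes for the words.
codes = {"What": 1.0, "When": -1.0, "date": 0.3, "is": 0.2, "it": 0.1}

states_what = run_rnn([codes[w] for w in ["What", "date", "is", "it"]])
states_when = run_rnn([codes[w] for w in ["When", "date", "is", "it"]])

# Same final word, different first word: the final outputs still differ.
print(states_what[-1], states_when[-1])
```

Both runs end on the same word, "it," yet the final states differ because the first word is still present in the hidden state that was carried forward.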
RNNs do have a problem, though. In the previous example, the words that are fed last into the network have a higher influence on the result (in our case, the words "is it?"). Those two words alone do not give us much understanding of the full sentence -- the algorithm is suffering from "memory loss." This issue has not gone unnoticed, and variants such as Long Short-Term Memory (LSTM) networks were designed to mitigate it.
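The "memory loss" can be measured in a toy setting. The sketch below assumes a one-number recurrent state with hand-picked weights and a hypothetical 20-step input; it changes either the first or the last input and compares how much the final state moves. It illustrates the plain recurrent update only, not an LSTM.

```python
import math

def final_state(inputs, w_x=0.8, w_h=0.5):
    """Run a minimal recurrent update over the inputs and return the last state."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
    return h

sequence = [0.1] * 20  # hypothetical 20-step input
baseline = final_state(sequence)

changed_first = [1.0] + sequence[1:]   # alter only the earliest input
changed_last = sequence[:-1] + [1.0]   # alter only the latest input

effect_of_first = abs(final_state(changed_first) - baseline)
effect_of_last = abs(final_state(changed_last) - baseline)
print(effect_of_first, effect_of_last)
```

With these assumed weights, the change to the first input is scaled down at every step and has all but vanished by the end, while the same change to the last input shifts the result substantially.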
CNNs vs. RNNs: Strengths and weaknesses
Having seen how each network was designed, we can now point out the strengths and weaknesses of each.
"CNNs are preferred in interpreting visual data, sparse data or data that does not come in sequence," explained Prasanna Arikala, CTO of Kore.ai, an enterprise virtual assistant platform. "Recurrent neural networks, on the other hand, are designed to recognize sequential or temporal data. They do better predictions considering the order or sequence of the data as they relate to previous or the next data nodes."
Applications where CNNs are particularly useful include face detection, medical analysis, drug discovery and image analysis, Arikala said. RNNs are useful for language translation, entity extraction, conversational intelligence, sentiment analysis and speech analysis.
Because RNNs rely on the previous state to predict the future state, they "make sense for the stock market, as predicting where a stock would be headed depends a lot on where it has been earlier," he said.
However, as we learned earlier, when scanning a picture, a CNN's filter takes the adjacent pixels into account as it works. Could it not use the same mechanism for adjacent words?
"It is not that such an approach would not work at all," Divakaran explained. "[But] it's a needlessly roundabout approach." According to Divakaran, trying to use the spatial modeling capabilities of the CNN to capture what is basically a temporal phenomenon is suboptimal by definition and requires much more effort and memory to accomplish the same task.
CNNs vs. RNNs: Complementary models
But there are cases where the two models complement each other. Arikala shared an interesting case.
"For some of the Asian languages like Chinese, Japanese and Korean where characters are like special images, we use deep neural networks built with a combination of CNN and RNN for intent detection and sentiment analysis," he said.
In these so-called logographic languages, some characters can translate to one or several English words, while others only mean something when they are suffixed to other characters, changing the meaning of the original character.
"The reason why a combination of neural networks works here is that we do character tokenization in logographic languages compared to [using] Treebank/WordNet tokenization in other languages," Arikala explained. "A combination of CNN and LSTM works much better than pure RNN."
Fred Navruzov, the data science lead at Competera, an AI company that helps retailers set optimal prices, agreed that the models can cooperate instead of compete with each other.
"Nowadays, the boundaries between CNN and RNN usage are somewhat blurred, as you can combine those architectures into CRNN for increased effectiveness in solving specific tasks like video tagging or gesture recognition," he said. In an analysis of a sequence of video frames, for example, RNN can be used to capture temporal information and the CNN can be used to extract spatial features from single frames.