To set realistic expectations of AI -- without missing opportunities -- it is important to understand algorithms, both their capabilities and limitations.

In this article, we explore two algorithms that have propelled the field of AI forward -- convolutional neural networks (CNNs) and recurrent neural networks (RNNs). We will cover what they are, how they work, what their limitations are and where they complement each other.

But first, a brief summary of the main differences between a CNN and an RNN.

CNNs are commonly used in solving problems related to spatial data, such as images. RNNs are better suited to analyzing temporal, sequential data, such as text or videos.

A CNN has a different architecture from an RNN. CNNs are "feed-forward neural networks" that use filters and pooling layers, whereas RNNs feed results back into the network (more on this point below).

In CNNs, the size of the input and the resulting output are fixed. That is, a CNN receives images of a fixed size and outputs a predicted label for each, along with the confidence level of its prediction. In RNNs, the size of the input and the resulting output may vary.

Use cases for CNNs include facial recognition, medical analysis and classification. Use cases for RNNs include text translation, natural language processing, sentiment analysis and speech analysis.

Convolutional neural networks

What we see as images in a computer is actually a set of color values, distributed over a certain width and height. What we see as shapes and objects appears as an array of numbers to the machine. Convolutional neural networks make sense of this data through filters, followed by pooling layers.

"A filter is a matrix of randomized numbers. In a CNN, filters are multiplied against matrix representations of parts of the image, effectively scanning the picture pixel by pixel and getting the average value of all adjacent pixels, thereby detecting the most important features," explained Ajay Divakaran, senior technical director of the Vision and Learning Laboratory in the Center for Vision Technologies at SRI International, a nonprofit scientific research institute. The filter values start out random and are adjusted during training so that each filter responds to a useful pattern.

"This information is passed through a pooling layer, which condenses the acquired feature map into its most essential information," he added. This last step greatly reduces the size of the data and makes the neural network much faster. The resulting information is then fed into the rest of the neural network.

A CNN consists of several layers of perceptrons, and the filters effectively build a network that understands more and more of the image with every passing layer. While the first layer understands the outlines and borders, the second layer starts understanding shapes, and the third one understands objects. The strength of this model is its ability to recognize objects largely regardless of where in the picture they appear and, when trained on sufficiently varied examples, with some tolerance for rotation.

CNNs are great at recognizing objects, animals and people, but what if we want to understand what is happening in the pictures? For instance, consider a picture of a ball in the air. How can we know if the ball has been thrown and is going up or if it is falling? Answering this question would require more information than a single picture -- we would need a video. The sequence of the pictures would determine if the ball is going up or down.
But how can we make neural networks remember the information they had previously worked on and work that into their calculation?
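Before turning to that question, the filter and pooling steps described above can be made concrete with a minimal sketch in plain Python. The 5x5 image, the hand-picked vertical-edge filter and the function names `convolve2d` and `max_pool2d` are assumptions made for illustration. In a real CNN, the filter values are learned during training and the arithmetic is handled by an optimized library.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` and sum the elementwise products at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


def max_pool2d(feature_map, size=2):
    """Keep only the largest value in each non-overlapping size x size block."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            block = [feature_map[i + di][j + dj]
                     for di in range(size) for dj in range(size)]
            row.append(max(block))
        pooled.append(row)
    return pooled


# Toy grayscale image: dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]

# Hand-picked filter that responds to a dark-to-bright vertical edge (assumption).
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, kernel)  # 4x4 map, each row is [0, 2, 0, 0]
pooled = max_pool2d(feature_map)         # 2x2 summary: [[2, 0], [2, 0]]
print(feature_map)
print(pooled)
```

The feature map is nonzero only where the edge sits, and pooling shrinks the 4x4 map to 2x2 while keeping that strongest response.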

Recurrent neural networks

The problem of remembering goes beyond videos -- in fact, many natural language understanding algorithms (which typically deal only with text) require some sort of remembering, such as the topic of the discussion or the previous words in the sentence.

Recurrent neural networks were designed to tackle exactly this problem. This algorithm feeds its result back into itself, making it a part of the final answer. To illustrate, assume we want to translate the following sentence: "What date is it?" The algorithm feeds each word separately into the neural network, and by the time it arrives at the word "it," its output is already influenced by the word "What."

RNNs do have a problem, though. In the previous example, the words that are fed last into the network have a higher influence on the result (in our case, the words "is it?"). Those two words do not give us much understanding of the full sentence -- the algorithm is suffering from "memory loss." This issue has not gone unnoticed, and newer algorithms such as Long Short-Term Memory (LSTM) were designed to address it.

The diagram below, from Wikimedia Commons, shows a one-unit RNN. From bottom to top: input state, hidden state, output state. U, V and W are the weights of the network. The compressed diagram is on the left and the unfolded version is on the right.
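The feedback loop and the "memory loss" described in this section can be sketched with a one-unit RNN in plain Python. This is a toy model under stated assumptions: the weights are hand-picked rather than trained, inputs are single numbers rather than words, and mapping U to the input, V to the recurrent connection and W to the output is an assumed reading of the diagram's labels. The function name `run_rnn` is hypothetical.

```python
import math


def run_rnn(inputs, U, V, W, h0=0.0):
    """Run a one-unit RNN over `inputs`; return (outputs, final hidden state)."""
    h = h0
    outputs = []
    for x in inputs:
        # The new hidden state mixes the current input with the previous hidden state.
        h = math.tanh(U * x + V * h)
        outputs.append(W * h)
    return outputs, h


U, V, W = 1.0, 0.5, 1.0  # hand-picked, untrained weights (assumption)

# The same network handles sequences of different lengths.
short_outputs, _ = run_rnn([1.0, 0.0], U, V, W)
long_outputs, _ = run_rnn([1.0, 0.0, 0.0, 0.0, 0.0, 0.0], U, V, W)

# A signal fed in first has mostly faded by the end of the sequence,
# while the same signal fed in last dominates the final hidden state.
_, h_signal_first = run_rnn([1.0, 0.0, 0.0, 0.0, 0.0, 0.0], U, V, W)
_, h_signal_last = run_rnn([0.0, 0.0, 0.0, 0.0, 0.0, 1.0], U, V, W)
print(h_signal_first, h_signal_last)
```

With these weights, the final hidden state is far smaller when the signal arrives first than when it arrives last, which mirrors the way the last words of a sentence outweigh the first ones in a plain RNN.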

CNNs vs. RNNs: Strengths and weaknesses

Having seen how each network was designed, we can now point out the strengths and weaknesses of each.

"CNNs are preferred in interpreting visual data, sparse data or data that does not come in sequence," explained Prasanna Arikala, CTO at Kore.ai, a chatbot development company. "Recurrent neural networks, on the other hand, are designed to recognize sequential or temporal data. They do better predictions considering the order or sequence of the data as they relate to previous or the next data nodes."

Applications where CNNs are particularly useful include face detection, medical analysis, drug discovery and image analysis, Arikala said. RNNs are useful for language translation, entity extraction, conversational intelligence, sentiment analysis and speech analysis. Because RNNs rely on the previous state to predict the future state, they "make sense for the stock market, as predicting where a stock would be headed depends a lot on where it has been earlier," he said.

However, as we learned earlier, when scanning a picture, a CNN's filter takes the adjacent pixels into account as it works. Could it not use the same mechanism for adjacent words?

"It is not that such an approach would not work at all," Divakaran explained. "[But] it's a needlessly roundabout approach."

According to Divakaran, trying to use the spatial modeling capabilities of the CNN to capture what is basically a temporal phenomenon is suboptimal by definition and requires much more effort and memory to accomplish the same task.
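One way to see why a filter is an indirect tool for sequences is to slide it over a one-dimensional sequence. The sketch below is an illustration, not Divakaran's own argument: the function name `conv1d`, the three-value filter and the number sequence standing in for words are all assumptions. It shows that each output depends only on the few positions under the filter, so anything outside that window has no effect unless the filter is widened or more layers are stacked.

```python
def conv1d(sequence, kernel):
    """Slide `kernel` along `sequence` and sum the elementwise products at each position."""
    k = len(kernel)
    return [
        sum(sequence[i + d] * kernel[d] for d in range(k))
        for i in range(len(sequence) - k + 1)
    ]


kernel = [0.25, 0.5, 0.25]  # hand-picked filter spanning three positions (assumption)

original = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
changed = [9.0] + original[1:]  # only the first element differs

out_original = conv1d(original, kernel)
out_changed = conv1d(changed, kernel)

# Only the output whose window covers the first element changes.
print(out_original)
print(out_changed)
```

Here the change to the first element affects only the first output. In an RNN, by contrast, every earlier input can in principle reach later outputs through the hidden state.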

CNNs vs. RNNs: Complementary models

But there are cases where the two models complement each other. Arikala shared an interesting case.

"For some of the Asian languages like Chinese, Japanese and Korean, where characters are like special images, we use deep neural networks built with a combination of CNN and RNN for intent detection and sentiment analysis," he said.

In these so-called logographic languages, some characters can translate to one or several English words, while others only mean something when they are suffixed to other characters, changing the meaning of the original character.

"The reason why a combination of neural networks works here is that we do character tokenization in logographic languages compared to [using] Treebank/WordNet tokenization in other languages," Arikala explained. "A combination of CNN and LSTM works much better than pure RNN."

Fred Navruzov, the data science lead at Competera, an AI company that helps retailers set optimal prices, agreed that the models can cooperate instead of competing with each other.

"Nowadays, the boundaries between CNN and RNN usage are somewhat blurred, as you can combine those architectures into CRNN for increased effectiveness in solving specific tasks like video tagging or gesture recognition," he said.

In an analysis of a sequence of video frames, for example, an RNN can be used to capture temporal information, while a CNN can be used to extract spatial features from single frames.
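The division of labor in that last example can be sketched with the ball-in-the-air scenario from earlier. This is a toy illustration under loud assumptions, not a trained CRNN: each frame is a single column of five pixels (index 0 is the top), the ball is two bright pixels, all weights are hand-set, and the recurrent stage uses no nonlinearity so the arithmetic stays readable. The names `frame_features` and `classify_motion` are hypothetical.

```python
def conv1d(column, kernel):
    """Slide `kernel` along `column` and sum the elementwise products at each position."""
    k = len(kernel)
    return [
        sum(column[i + d] * kernel[d] for d in range(k))
        for i in range(len(column) - k + 1)
    ]


def max_pool1d(values, size=2):
    """Keep the largest value in each non-overlapping block of `size` values."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


def frame_features(column):
    """CNN-style stage: filter one frame, then pool it into [top, bottom] responses."""
    return max_pool1d(conv1d(column, [1, 1]), size=2)


def classify_motion(frames):
    """RNN-style stage: carry a hidden state (memory, change) from frame to frame."""
    memory, change = 0.0, 0.0
    for column in frames:
        top, bottom = frame_features(column)
        height = top - bottom     # input weights: +1 for top, -1 for bottom
        change = height - memory  # compares this frame with the remembered one
        memory = height           # remember this frame for the next step
    return "up" if change > 0 else "down"


# Four frames of a ball rising from the bottom of the column to the top.
rising = [
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
]
falling = rising[::-1]  # the same frames in reverse order

print(classify_motion(rising), classify_motion(falling))
```

Both clips contain exactly the same frames, so the per-frame features alone cannot tell them apart. Only the state carried across frames separates "up" from "down".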