2019-07-28 | Author: www.8455.com




Recurrent Neural Networks with Swift and Accelerate
6 APRIL 2017


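  • Understanding LSTM Networks
    • Recurrent Neural Networks
    • The Problem of Long-Term Dependencies
    • LSTM Networks
    • The Core Idea Behind LSTMs
    • Step-by-Step LSTM Walk Through
    • Variants on Long Short Term Memory
    • Conclusion
    • Acknowledgments
  1. What is it?
  2. Why?
  3. What does it do?
  4. How?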

With new neural network architectures popping up every now and then, it’s hard to keep track of them all. Knowing all the abbreviations being thrown around (DCIGN, BiLSTM, DCGAN, anyone?) can be a bit overwhelming at first.

In this blog post we’ll use a recurrent neural network (RNN) to teach the iPhone to play the drums. It will sound something like this:

figure 1 LSTM.JPG

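This article is translated from Christopher Olah's blog post “Understanding LSTM Networks”.

This article is mainly based on “Understanding LSTM Networks” from colah's blog. It includes a translation plus some of my own modest understanding.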

So I decided to compose a cheat sheet containing many of those architectures. Most of these are neural networks, some are completely different beasts. Though all of these architectures are presented as novel and unique, when I drew the node structures… their underlying relations started to make more sense.

The timing still needs a little work but it definitely sounds like someone playing the drums!
We’ll teach the computer to play drums without explaining what makes a good rhythm, or what even a kick drum or a hi-hat is. The RNN will learn how to drum purely from examples of existing drum patterns.
The reason we’re using a recurrent network for this task is that this type of neural network is very good at understanding sequences of things, in this case sequences of MIDI notes.
Apple’s BNNS and Metal CNN libraries don’t support recurrent neural networks at the moment, but no worries: we can get pretty far already with just a few matrix multiplications.
As usual we train the neural network on the Mac (using TensorFlow and Python), and then copy what it has learned into the iOS app. In the iOS app we’ll use the Accelerate framework to handle the math.
In this post I’m only going to show the relevant bits of the code. The full source is on GitHub, so look there to follow along.
What is an RNN?
A regular neural network, also known as a feed-forward network, is a simple pipeline: the input data goes into one end and comes out the other end as a prediction of some kind, often in the form of a probability distribution.
The interesting thing about a recurrent neural network is that it has an additional input and output, and these two are connected. The new input gets its data from the RNN’s output, so the network feeds back into itself, which is where the name “recurrent” comes from.

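The article “[Translation] Understanding LSTM and Its Diagrams” may help further.

LSTM (Long Short-Term Memory) is a recurrent neural network over time, well suited to processing and predicting important events that are separated by relatively long gaps and delays in a time series.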

figure 2 LSTM.JPG

Understanding LSTM Networks


One problem with drawing them as node maps: it doesn’t really show how they’re used. For example, variational autoencoders (VAE) may look just like autoencoders (AE), but the training process is actually quite different. The use-cases for trained networks differ even more, because VAEs are generators, where you insert noise to get a new sample. AEs simply map whatever they get as input to the closest training sample they “remember”. I should add that this overview in no way clarifies how each of the different node types works internally (but that’s a topic for another day).

1. tf.contrib.rnn.BasicLSTMCell(num_units, forget_bias=1.0, state_is_tuple=True, activation=None, reuse=None)


Recurrent Neural Networks




Humans don't start their thinking from scratch every second. As you read this essay, you understand each word based on your understanding of previous words. You don't throw everything away and start thinking from scratch again. Your thoughts have persistence.

Traditional neural networks can't do this, and it seems like a major shortcoming. For example, imagine you want to classify what kind of event is happening at every point in a movie. It's unclear how a traditional neural network could use its reasoning about previous events in the film to inform later ones.

Recurrent neural networks address this issue. They are networks with loops in them, allowing information to persist.

In the above diagram, a chunk of neural network, \(A\), looks at some input \(x_t\) and outputs a value \(h_t\). A loop allows information to be passed from one step of the network to the next.

These loops make recurrent neural networks seem kind of mysterious. However, if you think a bit more, it turns out that they aren't all that different than a normal neural network. A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor. Consider what happens if we unroll the loop:


This chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists. They're the natural architecture of neural network to use for such data.
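The unrolled view can be sketched in a few lines of plain Python. This is a minimal illustration rather than any library's implementation: the function names, the weight names `w_xh`, `w_hh`, `b_h` and the toy values are all assumptions made for the example. The same cell is applied at every step, and the only thing passed along the chain is the hidden state.

```python
import math

def rnn_step(x, h_prev, w_xh, w_hh, b_h):
    """One step of a minimal RNN: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    h = []
    for j in range(len(b_h)):
        total = b_h[j]
        total += sum(w_xh[j][i] * x[i] for i in range(len(x)))
        total += sum(w_hh[j][k] * h_prev[k] for k in range(len(h_prev)))
        h.append(math.tanh(total))
    return h

def run_rnn(xs, h0, w_xh, w_hh, b_h):
    """Unroll the loop: apply the same cell to each input, passing the state along."""
    h = h0
    outputs = []
    for x in xs:
        h = rnn_step(x, h, w_xh, w_hh, b_h)
        outputs.append(h)
    return outputs

# Toy weights (hypothetical values): 1 input feature, 2 hidden units.
w_xh = [[0.5], [-0.3]]
w_hh = [[0.1, 0.2], [0.0, 0.4]]
b_h = [0.0, 0.0]
states = run_rnn([[1.0], [0.0], [0.0]], [0.0, 0.0], w_xh, w_hh, b_h)
```

Only the first input is non-zero here, yet the later states are still non-zero: the information from the first step persists because each step receives the previous state.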

And they certainly are used! In the last few years, there has been incredible success applying RNNs to a variety of problems: speech recognition, language modeling, translation, image captioning… The list goes on. I'll leave discussion of the amazing feats one can achieve with RNNs to Andrej Karpathy's excellent blog post, The Unreasonable Effectiveness of Recurrent Neural Networks. But they really are pretty amazing.

Essential to these successes is the use of “LSTMs,” a very special kind of recurrent neural network which works, for many tasks, much much better than the standard version. Almost all exciting results based on recurrent neural networks are achieved with them. It's these LSTMs that this essay will explore.


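Ordinary neural networks do not account for the lasting influence of data: in general, data fed in earlier affects data fed in later. RNNs were proposed to address the inability of traditional neural networks to capture how previous events affect later ones, and they do so by adding loops to the network.

“LSTMs” are a very special kind of recurrent neural network which works, for many tasks, much better than the standard version.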

It should be noted that while most of the abbreviations used are generally accepted, not all of them are. RNNs sometimes refer to recursive neural networks, but most of the time they refer to recurrent neural networks. That’s not the end of it though, in many places you’ll find RNN used as placeholder for any recurrent architecture, including LSTMs, GRUs and even the bidirectional variants. AEs suffer from a similar problem from time to time, where VAEs and DAEs and the like are called simply AEs. Many abbreviations also vary in the amount of “N”s to add at the end, because you could call it a convolutional neural network but also simply a convolutional network (resulting in CNN or CN).

Feed-forward network vs. recurrent network

Type: type


The Problem of Long-Term Dependencies

One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using previous video frames to inform the understanding of the present frame. If RNNs could do this, they'd be extremely useful. But can they? It depends.

Sometimes, we only need to look at recent information to perform the present task. For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in “the clouds are in the sky,” we don't need any further context: it's pretty obvious the next word is going to be sky. In such cases, where the gap between the relevant information and the place that it's needed is small, RNNs can learn to use the past information.


But there are also cases where we need more context. Consider trying to predict the last word in the text “I grew up in France… I speak fluent French.” Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back. It's entirely possible for the gap between the relevant information and the point where it is needed to become very large.

Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.


In theory, RNNs are absolutely capable of handling such “long-term dependencies.” A human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don't seem to be able to learn them. The problem was explored in depth by Hochreiter (1991) [German] and Bengio, et al. (1994), who found some pretty fundamental reasons why it might be difficult.

Thankfully, LSTMs don't have this problem!

The Problem of Long-Term Dependencies [1]

RNN models can connect previous information to the present task, for example using previous video frames to inform the understanding of the present frame.

Sometimes we only need recent information to perform the present task. For example, consider a language model that tries to predict the next word from the previous ones. To predict the last word of “the clouds are in the sky”, we need no further context: the next word is clearly “sky”. In this case the RNN needs only the past n pieces of information with n small, that is, the gap between the relevant information and the place that it’s needed is small.

There are also cases that need much more context. Consider trying to predict the last word in the text “I grew up in France… I speak fluent French.” The recent words “I speak fluent” suggest that the next word is probably the name of a language, but to narrow it down to a specific language we need the context of France. The RNN then needs the past n pieces of information with n large, that is, the gap between the relevant information and the point where it is needed becomes very large.

Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.

In theory, RNNs can handle such “long-term dependencies”, but in practice they cannot learn them: once the amount of past information n becomes too large, RNNs no longer work. The problem was explored in depth by Hochreiter (1991) [German] and Bengio, et al. (1994), who found some pretty fundamental reasons why it might be difficult.

LSTM models can handle the problem of “long-term dependencies”.

Composing a complete list is practically impossible, as new architectures are invented all the time. Even published architectures can be hard to find when you are looking for them, and sometimes you simply overlook some. So while this list may provide you with some insights into the world of AI, please do not take it to be comprehensive, especially if you read this post long after it was written.

I said that RNNs are good at understanding sequences. For this purpose, the RNN keeps track of some internal state. This state is what the RNN has remembered of the sequence it has seen so far. The extra input/output is for sending this internal state from the previous timestep into the next timestep.
To make the iPhone play drums, we train the RNN on a sequence of MIDI notes that represent different drum patterns. We look at just one element from this sequence at a time — this is called a timestep. At each timestep, we teach the RNN to predict the next note from the sequence.
Essentially, we’re training the RNN to remember all the drum patterns that are in the sequence. It remembers this data in its internal state, but also in the weights that connect the input x and the predicted output y to this state.


Basic LSTM recurrent network cell.

The implementation is based on: http://arxiv.org/abs/1409.2329.

We add forget_bias (default: 1) to the biases of the forget gate in order to reduce the scale of forgetting in the beginning of the training.

It does not allow cell clipping, a projection layer, and does not use peep-hole connections: it is the basic baseline.

For advanced models, please use the full tf.nn.rnn_cell.LSTMCell that follows.
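The effect of forget_bias can be checked by hand. The sketch below is plain Python, not TensorFlow code, and the helper name `forget_gate` is made up for the illustration. It assumes the bias is simply added to the forget gate's pre-activation before the sigmoid, as the description above says. With a pre-activation of zero (roughly what freshly initialized weights give), the gate moves from 0.5 to about 0.73, so more of the old cell state is kept early in training.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forget_gate(pre_activation, forget_bias=1.0):
    """Forget gate value with the extra bias added before the sigmoid."""
    return sigmoid(pre_activation + forget_bias)

without_bias = forget_gate(0.0, forget_bias=0.0)  # 0.5: half of the state is forgotten
with_bias = forget_gate(0.0, forget_bias=1.0)     # about 0.73: less is forgotten
```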


LSTM Networks


Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in following work (see note 1). They work tremendously well on a large variety of problems, and are now widely used.

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.


LSTMs also have this chain like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.


Don't worry about the details of what's going on. We'll walk through the LSTM diagram step by step later. For now, let's just try to get comfortable with the notation we'll be using.

In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations.


For each of the architectures depicted in the picture, I wrote a very, very brief description. You may find some of these to be useful if you’re quite familiar with some architectures, but you aren’t familiar with a particular one.


Init docstring:

Initialize the basic LSTM cell.

The Core Idea Behind LSTMs


The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.

The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It's very easy for information to just flow along it unchanged.


The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation.


The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!”

An LSTM has three of these gates, to protect and control the cell state.
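A gate of this kind is small enough to write out directly. The following is a minimal sketch in plain Python, with the function name `apply_gate` and the sample numbers chosen only for illustration: the sigmoid turns each raw score into a value between zero and one, and the pointwise multiplication scales the corresponding component by that value.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def apply_gate(gate_logits, values):
    """Scale each value by a sigmoid gate in [0, 1] (pointwise multiplication)."""
    return [sigmoid(g) * v for g, v in zip(gate_logits, values)]

# Very negative logit: gate nearly closed. Zero logit: half open.
# Very positive logit: gate nearly fully open.
gated = apply_gate([-20.0, 0.0, 20.0], [3.0, 3.0, 3.0])
```

The three results are close to 0, exactly 1.5, and close to 3, matching “let nothing through”, “let half through” and “let everything through”.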


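LSTM-based systems can learn tasks such as translating languages, controlling robots, image analysis, document summarization, speech recognition, image recognition, handwriting recognition, controlling chatbots, predicting diseases, click-through rates and stock prices, and composing music.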

The weights in an RNN


num_units: int, The number of units in the LSTM cell.
forget_bias: float, The bias added to forget gates (see above).
state_is_tuple: If True, accepted and returned states are 2-tuples of the c_state and m_state. If False, they are concatenated along the column axis. The latter behavior will soon be deprecated.
activation: Activation function of the inner states. Default: tanh.
reuse: (optional) Python boolean describing whether to reuse variables in an existing scope. If not True, and the existing scope already has the given variables, an error is raised.

Step-by-Step LSTM Walk Through

The first step in our LSTM is to decide what information we're going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer.” It looks at \(h_{t-1}\) and \(x_t\), and outputs a number between \(0\) and \(1\) for each number in the cell state \(C_{t-1}\). A \(1\) represents “completely keep this” while a \(0\) represents “completely get rid of this.”

Let's go back to our example of a language model trying to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.
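Written as a formula, the forget gate described above takes the following form. The weight matrix \(W_f\) and bias \(b_f\) are names assumed here for the layer's learned parameters (they follow common notation but are not defined in the text), \(\sigma\) is the sigmoid function, and \([h_{t-1}, x_t]\) is the concatenation of the two input vectors.

```latex
f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)
```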

The next step is to decide what new information we're going to store in the cell state. This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we'll update. Next, a tanh layer creates a vector of new candidate values, \(\tilde{C}_t\), that could be added to the state. In the next step, we'll combine these two to create an update to the state.

In the example of our language model, we'd want to add the gender of the new subject to the cell state, to replace the old one we're forgetting.
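The two parts of this step can be written as a pair of formulas. As before, \(W_i\), \(b_i\), \(W_C\) and \(b_C\) are assumed names for the learned parameters of the two layers, while \(i_t\) is the output of the input gate layer and \(\tilde{C}_t\) is the vector of candidate values.

```latex
\begin{aligned}
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \\
\tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)
\end{aligned}
```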

It's now time to update the old cell state, \(C_{t-1}\), into the new cell state \(C_t\). The previous steps already decided what to do, we just need to actually do it.

We multiply the old state by \(f_t\), forgetting the things we decided to forget earlier. Then we add \(i_t * \tilde{C}_t\). These are the new candidate values, scaled by how much we decided to update each state value.

In the case of the language model, this is where we'd actually drop the information about the old subject's gender and add the new information, as we decided in the previous steps.
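The update rule is just an element-by-element multiply and add. Here is a minimal sketch in plain Python, where the function name and the sample numbers are made up for the example and the gate values are given directly instead of being computed by sigmoid layers.

```python
def update_cell_state(c_prev, f, i, c_candidate):
    """C_t = f_t * C_{t-1} + i_t * C~_t, computed element by element."""
    return [f_k * c_k + i_k * cand_k
            for c_k, f_k, i_k, cand_k in zip(c_prev, f, i, c_candidate)]

# First component: keep the old value untouched (f = 1, i = 0).
# Second component: drop the old value and write the candidate (f = 0, i = 1).
c_new = update_cell_state(c_prev=[2.0, -1.0], f=[1.0, 0.0],
                          i=[0.0, 1.0], c_candidate=[0.5, 0.25])
```

The first component stays at 2.0 and the second becomes 0.25, which is the “forget the old subject, store the new one” behaviour described above.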

Finally, we need to decide what we're going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we're going to output. Then, we put the cell state through \(\tanh\) (to push the values to be between \(-1\) and \(1\)) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.

For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that's what is coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that's what follows next.
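Putting the four stages together, one full LSTM step might look like the sketch below. It is a plain-Python illustration, not the code of any library: the helper names, the parameter dictionary layout (`W_f`, `b_f`, and so on) and the all-zero toy weights are assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(weights, bias, inputs):
    """Affine layer: one row of weights per output unit."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step: forget, choose new input, update the state, output."""
    z = h_prev + x  # list concatenation, i.e. [h_{t-1}, x_t]
    f = [sigmoid(v) for v in dense(params["W_f"], params["b_f"], z)]       # forget gate
    i = [sigmoid(v) for v in dense(params["W_i"], params["b_i"], z)]       # input gate
    cand = [math.tanh(v) for v in dense(params["W_c"], params["b_c"], z)]  # candidates
    o = [sigmoid(v) for v in dense(params["W_o"], params["b_o"], z)]       # output gate
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, cand)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

# Hypothetical toy parameters: 1 hidden unit, 1 input feature, all weights zero.
zeros = {"W": [[0.0, 0.0]], "b": [0.0]}
params = {f"{kind}_{gate}": zeros[kind] for gate in "fico" for kind in ("W", "b")}
h, c = lstm_step(x=[1.0], h_prev=[0.0], c_prev=[2.0], params=params)
```

With all-zero weights every gate sits at 0.5 and the candidate is 0, so the cell state is halved from 2.0 to 1.0 and the output is half of tanh(1.0). Trained weights would move the gates away from 0.5 depending on the input.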

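In 2015, Google substantially improved speech recognition on Android phones and other devices with an LSTM trained with CTC, using a method published by Jürgen Schmidhuber's lab in 2006. Baidu also uses CTC; Apple's iPhone uses LSTM in QuickType and Siri; Microsoft uses LSTM not only for speech recognition but also for generating virtual conversational avatars and writing program code. Amazon Alexa talks to you at home through a bidirectional LSTM, and Google uses LSTM even more widely: it generates image captions, replies to email automatically, is included in the new smart assistant Allo, and has significantly improved the quality of Google Translate (since 2016). A large share of the computing resources in Google's data centers is now spent on LSTM tasks.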

Of course, we don’t want the RNN to just remember existing drum patterns — we want it to come up with new drums on its own.
To do that, we will mess a little with the RNN’s memory: we reset the internal state by filling it up with random numbers — but we don’t change the weights. From then on the model will no longer correctly predict the next note in the sequence because we “erased” its memory of where it was.
Now when we ask the RNN to “predict” the next notes in the sequence, it will come up with new, original drum patterns. These are still based on its knowledge of what “good drums” are (because we did not erase the learned weights), but they are no longer verbatim replications of the training patterns.
The data
I mentioned we’re training on drum patterns. The dataset I used consists of a large number of MIDI files. When you open such a MIDI file in GarageBand or Logic Pro it looks like this:

2. tf.nn.dynamic_rnn(cell, inputs, sequence_length=None, initial_state=None, dtype=None, parallel_iterations=None, swap_memory=False, time_major=False, scope=None)


Variants on Long Short Term Memory


What I've described so far is a pretty normal LSTM. But not all LSTMs are the same as the above. In fact, it seems like almost every paper involving LSTMs uses a slightly different version. The differences are minor, but it's worth mentioning some of them.

One popular LSTM variant, introduced by Gers & Schmidhuber (2000), is adding “peephole connections.” This means that we let the gate layers look at the cell state.




The above diagram adds peepholes to all the gates, but many papers will give some peepholes and not others.

Another variation is to use coupled forget and input gates. Instead of separately deciding what to forget and what we should add new information to, we make those decisions together. We only forget when we're going to input something in its place. We only input new values to the state when we forget something older.
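In formula form, coupling the two gates means the input gate is no longer separate but is replaced by \(1 - f_t\). Using the same symbols as in the walk-through, the cell state update then reads:

```latex
C_t = f_t * C_{t-1} + (1 - f_t) * \tilde{C}_t
```

Whatever fraction of the old value is forgotten is exactly the fraction of the new candidate that is written in its place.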


A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho, et al. (2014). It combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular.
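A single GRU step can be sketched as follows. This is a simplified plain-Python illustration under stated assumptions: biases are omitted, the names `w_z`, `w_r`, `w_h` are chosen for the example, and the interpolation is written so that the update gate weights the candidate state (some libraries use the opposite convention).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(weights, inputs):
    return [sum(w * v for w, v in zip(row, inputs)) for row in weights]

def gru_step(x, h_prev, w_z, w_r, w_h):
    """One GRU step (biases omitted for brevity)."""
    concat = h_prev + x                                  # [h_{t-1}, x_t]
    z = [sigmoid(v) for v in matvec(w_z, concat)]        # update gate
    r = [sigmoid(v) for v in matvec(w_r, concat)]        # reset gate
    gated_prev = [rk * hk for rk, hk in zip(r, h_prev)]
    cand = [math.tanh(v) for v in matvec(w_h, gated_prev + x)]  # candidate state
    # Single state vector: interpolate between the old state and the candidate.
    return [(1.0 - zk) * hk + zk * ck for zk, hk, ck in zip(z, h_prev, cand)]

# Hypothetical toy weights: 1 hidden unit, 1 input feature, all zero.
h_new = gru_step(x=[1.0], h_prev=[0.8],
                 w_z=[[0.0, 0.0]], w_r=[[0.0, 0.0]], w_h=[[0.0, 0.0]])
```

There is no separate cell state here: the one vector returned by `gru_step` plays both roles, which is what makes the model simpler than the standard LSTM.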

These are only a few of the most notable LSTM variants. There are lots of others, like Depth Gated RNNs by Yao, et al. (2015). There are also completely different approaches to tackling long-term dependencies, like Clockwork RNNs by Koutnik, et al. (2014).

Which of these variants is best? Do the differences matter? Greff, et al. (2015) do a nice comparison of popular variants, finding that they're all about the same. Jozefowicz, et al. (2015) tested more than ten thousand RNN architectures, finding some that worked better than LSTMs on certain tasks.


Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997).

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.

Feed forward neural networks (FF or FFNN) and perceptrons (P) are very straightforward: they feed information from the front to the back (input and output, respectively). Neural networks are often described as having layers, where each layer consists of either input, hidden or output cells in parallel. Cells within a single layer are never connected to each other, and in general two adjacent layers are fully connected (every neuron from one layer to every neuron in the other layer). The simplest somewhat practical network has two input cells and one output cell, which can be used to model logic gates. One usually trains FFNNs through back-propagation, giving the network paired datasets of “what goes in” and “what we want to have coming out”. This is called supervised learning, as opposed to unsupervised learning where we only give it input and let the network fill in the blanks. The error being back-propagated is often some variation of the difference between the network's output and the desired output (like MSE or just the linear difference). Given that the network has enough hidden neurons, it can theoretically always model the relationship between the input and output. Practically their use is a lot more limited but they are popularly combined with other networks to form new networks.
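The two-input, one-output case is easy to try out. Below is a minimal sketch in plain Python of such a cell acting as a logic gate. The step activation and the hand-picked weights are assumptions for the illustration. Nothing is trained here, so back-propagation is not shown.

```python
def perceptron(inputs, weights, bias):
    """Two-input, one-output cell with a step activation."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

# Hand-picked (not trained) weights that make the cell behave like AND.
AND_WEIGHTS, AND_BIAS = [1.0, 1.0], -1.5
truth_table = {(a, b): perceptron([a, b], AND_WEIGHTS, AND_BIAS)
               for a in (0, 1) for b in (0, 1)}
```

Only the input (1, 1) pushes the weighted sum above zero, so the table matches logical AND. Changing the bias to -0.5 would turn the same cell into OR.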


Type: function




Earlier, I mentioned the remarkable results people are achieving with RNNs. Essentially all of these are achieved using LSTMs. They really work a lot better for most tasks!

Written down as a set of equations, LSTMs look pretty intimidating. Hopefully, walking through them step by step in this essay has made them a bit more approachable.

LSTMs were a big step in what we can accomplish with RNNs. It's natural to wonder: is there another big step? A common opinion among researchers is: “Yes! There is a next step and it's attention!” The idea is to let every step of an RNN pick information to look at from some larger collection of information. For example, if you are using an RNN to create a caption describing an image, it might pick a part of the image to look at for every word it outputs. In fact, Xu, et al. (2015) do exactly this – it might be a fun starting point if you want to explore attention! There's been a number of really exciting results using attention, and it seems like a lot more are around the corner…

Attention isn't the only exciting thread in RNN research. For example, Grid LSTMs by Kalchbrenner, et al. (2015) seem extremely promising. Work using RNNs in generative models – such as Gregor, et al. (2015), Chung, et al. (2015), or Bayer & Osendorfer (2015) – also seems very interesting. The last few years have been an exciting time for recurrent neural networks, and the coming ones promise to only be more so!


Rosenblatt, Frank. “The perceptron: a probabilistic model for information storage and organization in the brain.” Psychological review 65.6 (1958): 386.

A MIDI pattern in GarageBand


Creates a recurrent neural network specified by RNNCell cell.

Performs fully dynamic unrolling of inputs.

Inputs may be a single Tensor where the maximum time is either the first or second dimension (see the parameter time_major). Alternatively, it may be a (possibly nested) tuple of Tensors, each of them having matching batch and time dimensions. The corresponding output is either a single Tensor having the same number of time steps and batch size, or a (possibly nested) tuple of such tensors, matching the nested structure of cell.output_size.

The parameter sequence_length is optional and is used to copy-through state and zero-out outputs when past a batch element's sequence length. So it's more for correctness than performance.

cell: An instance of RNNCell.
inputs: The RNN inputs.

If `time_major == False` (default), this must be a `Tensor` of shape:
  `[batch_size, max_time, ...]`, or a nested tuple of such

If `time_major == True`, this must be a `Tensor` of shape:
  `[max_time, batch_size, ...]`, or a nested tuple of such

This may also be a (possibly nested) tuple of Tensors satisfying this property.  The first two dimensions must match across all the inputs, but otherwise the ranks and other shape components may differ. In this case, input to `cell` at each time-step will replicate the structure of these tuples, except for the time dimension (from which the time is taken).

The input to cell at each time step will be a Tensor or (possibly nested) tuple of Tensors each with dimensions [batch_size, ...].
sequence_length: (optional) An int32/int64 vector sized [batch_size].
initial_state: (optional) An initial state for the RNN. If cell.state_size is an integer, this must be a Tensor of appropriate type and shape [batch_size, cell.state_size]. If cell.state_size is a tuple, this should be a tuple of tensors having shapes [batch_size, s] for s in cell.state_size.
dtype: (optional) The data type for the initial state and expected output. Required if initial_state is not provided or RNN state has a heterogeneous dtype.
parallel_iterations: (Default: 32). The number of iterations to run in parallel. Those operations which do not have any temporal dependency and can be run in parallel, will be. This parameter trades off time for space. Values >> 1 use more memory but take less time, while smaller values use less memory but computations take longer.
swap_memory: Transparently swap the tensors produced in forward inference but needed for back prop from GPU to CPU. This allows training RNNs which would typically not fit on a single GPU, with very minimal (or no) performance penalty.
time_major: The shape format of the inputs and outputs Tensors.
If true, these Tensors must be shaped [max_time, batch_size, depth].
If false, these Tensors must be shaped [batch_size, max_time, depth].
Using time_major = True is a bit more efficient because it avoids
transposes at the beginning and end of the RNN calculation. However,
most TensorFlow data is batch-major, so by default this function
accepts input and emits output in batch-major form.
scope: VariableScope for the created subgraph; defaults to "rnn".
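To make the two layouts concrete, here is a small sketch in plain Python (no TensorFlow) of the transpose that `time_major = True` lets the RNN skip. The helper name `to_time_major` and the nested-list representation are assumptions for illustration only; they are not part of the TensorFlow API.

```python
def to_time_major(batch_major):
    """Swap the first two axes of a nested list shaped [batch_size, max_time, depth]."""
    # zip(*...) regroups the per-example sequences by timestep.
    return [list(step) for step in zip(*batch_major)]


# Two sequences (batch_size = 2), three timesteps, depth 1.
batch_major = [
    [[1.0], [2.0], [3.0]],
    [[4.0], [5.0], [6.0]],
]
time_major = to_time_major(batch_major)
print(len(time_major), len(time_major[0]))  # 3 2
```

After the swap, `time_major[t]` holds the inputs of every example in the batch at timestep `t`, which is the order in which a recurrent loop consumes them.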




I'm grateful to a number of people for helping me better understand LSTMs, commenting on the visualizations, and providing feedback on this post.

I'm very grateful to my colleagues at Google for their helpful feedback, especially Oriol Vinyals, Greg Corrado, Jon Shlens, Luke Vilnis, and Ilya Sutskever. I'm also thankful to many other friends and colleagues for taking the time to help me, including Dario Amodei, and Jacob Steinhardt. I'm especially thankful to Kyunghyun Cho for extremely thoughtful correspondence about my diagrams.

Before this post, I practiced explaining LSTMs during two seminar series I taught on neural networks. Thanks to everyone who participated in those for their patience with me, and for their feedback.

  1. In addition to the original authors, a lot of people contributed to the modern LSTM. A non-comprehensive list is: Felix Gers, Fred Cummins, Santiago Fernandez, Justin Bayer, Daan Wierstra, Julian Togelius, Faustino Gomez, Matteo Gagliolo, and Alex Graves.

Step-by-Step LSTM Walk Through

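The first step is to decide which information to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer”. It looks at h_{t-1} and x_t and outputs a number between 0 and 1 for each value in the cell state C_{t-1}. An output of 1 means “keep this completely”, while 0 means “discard this completely”. Take a language model that tries to predict the next word from all the preceding context. In such a problem, the cell state might include the gender of the current subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.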

Figure: the forget gate layer.
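A minimal sketch of the forget gate described above, in plain Python. It assumes the usual form f_t = sigmoid(W_f · [h_{t-1}, x_t] + b_f); the names `W_f` and `b_f` and all the numbers are illustrative assumptions, not values from a trained model.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forget_gate(W_f, b_f, h_prev, x_t):
    """Return f_t: one value between 0 and 1 per entry of the cell state."""
    concat = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    return [
        sigmoid(sum(w * v for w, v in zip(row, concat)) + b)
        for row, b in zip(W_f, b_f)
    ]


# Hypothetical weights: 2 cell-state entries, h of size 2, x of size 1.
W_f = [[0.5, -0.5, 1.0], [10.0, 10.0, 10.0]]
b_f = [0.0, 0.0]
f_t = forget_gate(W_f, b_f, h_prev=[1.0, 1.0], x_t=[0.0])
print(f_t)
```

The first entry comes out at exactly 0.5 (half of that cell-state value is kept), while the second is very close to 1 (that value is kept almost entirely).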


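Next we decide what new information to store in the cell state. This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we will update. Next, a tanh layer creates a vector of new candidate values, C̃_t, that could be added to the state. In the next step, we combine these two to update the cell state.
In our language model example, we want to add the gender of the new subject to the cell state, to replace the gender of the old subject that we are forgetting.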

Figure: the input gate layer and the tanh layer.

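Now it is time to update the old cell state C_{t-1} into the new cell state C_t. The previous steps already decided what to do; we just need to actually do it.

We multiply the old state C_{t-1} by f_t, forgetting the things we decided to forget earlier. Then we add i_t * C̃_t. These are the new candidate values, scaled by how much we decided to update each state value.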


Figure: updating the cell state.
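The update rule above, C_t = f_t * C_{t-1} + i_t * C̃_t, is applied element by element. A small sketch with made-up gate values (the function name and numbers are assumptions for illustration):

```python
def update_cell_state(c_prev, f_t, i_t, c_candidate):
    """C_t = f_t * C_{t-1} + i_t * candidate, elementwise."""
    return [f * c + i * g for c, f, i, g in zip(c_prev, f_t, i_t, c_candidate)]


c_t = update_cell_state(
    c_prev=[2.0, -1.0],
    f_t=[1.0, 0.0],          # keep the first entry, forget the second
    i_t=[0.0, 1.0],          # write nothing to the first, fully write the second
    c_candidate=[0.5, 0.5],
)
print(c_t)  # [2.0, 0.5]
```

The first entry passes through untouched, while the second is completely replaced by its candidate value.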

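Finally, we need to decide what to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer that decides which parts of the cell state to output. Then we put the cell state through tanh (pushing the values to between -1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.

For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that is what comes next. For example, it might output whether the subject is singular or plural, so that we know which form the verb should be conjugated into if that is what follows. In short, the sigmoid layer filters the cell state, and the filtered cell state determines the output h_t.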




The green bars represent the notes that are being played. The note C1 is a kick drum, D1 is a snare drum, G#1 is a hi-hat, and so on. The drum patterns in the dataset are all 1 measure (or 4 beats) long.
In a MIDI file the notes are stored as a series of events:
NOTE ON time: 0 channel: 0 note: 36 velocity: 80
NOTE ON time: 0 channel: 0 note: 46 velocity: 80
NOTE OFF time: 120 channel: 0 note: 36 velocity: 64
NOTE OFF time: 0 channel: 0 note: 46 velocity: 64
NOTE ON time: 120 channel: 0 note: 44 velocity: 80
NOTE OFF time: 120 channel: 0 note: 44 velocity: 64
NOTE ON time: 120 channel: 0 note: 38 velocity: 80
NOTE OFF time: 120 channel: 0 note: 38 velocity: 64
NOTE ON time: 0 channel: 0 note: 44 velocity: 80
. . . and so on . . .


A pair (outputs, state) where:

outputs: The RNN output `Tensor`.

  If time_major == False (default), this will be a `Tensor` shaped:
    `[batch_size, max_time, cell.output_size]`.

  If time_major == True, this will be a `Tensor` shaped:
    `[max_time, batch_size, cell.output_size]`.

  Note, if `cell.output_size` is a (possibly nested) tuple of integers
  or `TensorShape` objects, then `outputs` will be a tuple having the
  same structure as `cell.output_size`, containing Tensors having shapes
  corresponding to the shape data in `cell.output_size`.

state: The final state.  If `cell.state_size` is an int, this
  will be shaped `[batch_size, cell.state_size]`.  If it is a
  `TensorShape`, this will be shaped `[batch_size] + cell.state_size`.
  If it is a (possibly nested) tuple of ints or `TensorShape`, this will
  be a tuple having the corresponding shapes.

Variants on Long Short Term Memory

One popular LSTM variant, introduced by Gers & Schmidhuber (2000), is adding “peephole connections.” This means that we let the gate layers look at the cell state.

Figure: peephole connections.

Another variation is to use coupled forget and input gates. Instead of separately deciding what to forget and what new information to add, we make those decisions together. We only forget when we’re going to input something in its place. We only input new values to the state when we forget something older.
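With coupled gates the input gate is simply 1 - f_t, so the update becomes C_t = f_t * C_{t-1} + (1 - f_t) * C̃_t. A minimal sketch with made-up values (the function name and numbers are assumptions for illustration):

```python
def coupled_update(c_prev, f_t, c_candidate):
    """Whatever fraction is forgotten is exactly the fraction that gets written."""
    return [f * c + (1.0 - f) * g for c, f, g in zip(c_prev, f_t, c_candidate)]


c_t = coupled_update(c_prev=[2.0, -1.0], f_t=[1.0, 0.25], c_candidate=[0.5, 3.0])
print(c_t)  # [2.0, 2.0]
```

The first entry forgets nothing and therefore accepts nothing new; the second keeps a quarter of its old value and takes three quarters of the candidate.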



A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho, et al. (2014). It combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular.
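As a rough sketch of how a GRU step fits together, here is a scalar version in plain Python. It assumes the commonly cited form (update gate z, reset gate r, candidate state, then a blend of old and new state) and leaves out bias terms; the weight pairs and inputs are made-up values, not from any trained model.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gru_step(h_prev, x_t, w_z, w_r, w_h):
    """One GRU step for a scalar state; each w_* is a (weight_on_h, weight_on_x) pair."""
    z = sigmoid(w_z[0] * h_prev + w_z[1] * x_t)   # update gate
    r = sigmoid(w_r[0] * h_prev + w_r[1] * x_t)   # reset gate
    h_candidate = math.tanh(w_h[0] * (r * h_prev) + w_h[1] * x_t)
    # The single update gate both forgets old state and admits the new candidate.
    return (1.0 - z) * h_prev + z * h_candidate


h = 0.0
for x in [1.0, 0.5, -1.0]:
    h = gru_step(h, x, w_z=(0.1, 0.8), w_r=(0.3, -0.2), w_h=(0.5, 1.0))
print(h)
```

Note that there is no separate cell state: the hidden state h is the only thing carried from one step to the next.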



These are only a few of the most notable LSTM variants. There are lots of others, like Depth Gated RNNs by Yao, et al. (2015). There are also some completely different approaches to tackling long-term dependencies, like Clockwork RNNs by Koutnik, et al. (2014).

To begin playing a note there is a NOTE ON event, to stop playing there is a NOTE OFF event. The duration of the note is determined by the amount of time between NOTE ON and NOTE OFF. For us, the duration of the notes isn’t really important because drum sounds are short — they aren’t sustained like a flute or violin. All we care about is the NOTE ON events, which tell us when new drum sounds begin.
Each NOTE ON event includes a few different bits of data, but for our purposes we only need to know the timestamp and the note number.
The note number is an integer that represents the drum sound. For example, 36 is the number for note C in octave 1, which is the kick drum. (The General MIDI standard defines which note number is mapped to which percussion instrument.)
The timestamp for an event is a “delta” time, which means it is the number of ticks we should wait before processing this event. For the MIDI files in our dataset, there are 480 ticks per beat. So if we play the drums at 120 beats-per-minute, then one second has 960 ticks in it. This is not really important to remember; just know that for each note in the drum pattern there’s also a delay measured in ticks.
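The tick arithmetic above is easy to check with a few lines of Python. The function names are assumptions for illustration; only the 480 ticks per beat and the 120 beats-per-minute tempo come from the text.

```python
TICKS_PER_BEAT = 480  # resolution of the MIDI files in the dataset


def ticks_per_second(beats_per_minute, ticks_per_beat=TICKS_PER_BEAT):
    return ticks_per_beat * beats_per_minute / 60


def ticks_to_seconds(ticks, beats_per_minute, ticks_per_beat=TICKS_PER_BEAT):
    return ticks / ticks_per_second(beats_per_minute, ticks_per_beat)


print(ticks_per_second(120))       # 960.0
print(ticks_to_seconds(240, 120))  # 0.25
```

So a delay of 240 ticks at 120 beats-per-minute is a quarter of a second, or half a beat.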
Our input sequence to the RNN then has the following form:
(note, ticks) (note, ticks) (note, ticks) . . .


TypeError: If cell is not an instance of RNNCell.
ValueError: If inputs is None or an empty list.



  1. Attention: Xu, et al. (2015)
  2. Grid LSTMs: Kalchbrenner, et al. (2015)
  3. RNN in generative models: Gregor, et al. (2015), Chung, et al. (2015), Bayer & Osendorfer (2015)

Radial basis function (RBF) networks are FFNNs with radial basis functions as activation functions. There’s nothing more to it. Doesn’t mean they don’t have their uses, but most FFNNs with other activation functions don’t get their own name. This mostly has to do with inventing them at the right time.

At every timestep we insert a (note, ticks) pair into the RNN and it will try to predict the next (note, ticks) pair from the same sequence. For the example above, the sequence is:
(36, 0) (46, 0) (44, 240) (38, 240) (44, 0) . . .


  1. Understanding LSTM Networks-colah's blog

Broomhead, David S., and David Lowe. Radial basis functions, multi-variable functional interpolation and adaptive networks. No. RSRE-MEMO-4148. ROYAL SIGNALS AND RADAR ESTABLISHMENT MALVERN (UNITED KINGDOM), 1988.

That’s a kick drum (36) and an open hi-hat (46) on the first beat, followed by a pedal hi-hat (44) after 240 ticks, followed by a snare drum (38) and a pedal hi-hat (44) after another 240 ticks, and so on.
The dataset I used for training has 2700 of these MIDI files. I glued them together into one big sequence of 52260 (note, ticks) pairs. Just think of this sequence as a ginormous drum solo. This is the sequence we’ll try to make the RNN remember.
Note: This dataset of drum patterns comes from a commercial drum kit plug-in for use in audio production tools such as Logic Pro. I was looking for a fun dataset for training an RNN when I realized I had a large library of drum patterns in MIDI format sitting in a folder on my computer… and so the RNN drummer was born. Unfortunately, it also means this dataset is copyrighted and I can’t distribute it with the GitHub project. If you want to train the RNN yourself, you’ll need to find your own collection of drum patterns in MIDI format — I can’t give you mine.

One-hot encoding
You’ve seen that the MIDI note numbers are regular integers. We’ll be using the note numbers between 35 and 60, which is the range reserved in the General MIDI standard for percussion instruments.
The ticks are also integers, between 0 and 1920. (That’s how many ticks go into one measure and each MIDI file in the dataset is only one measure long.)
However, we can’t just feed integers into our neural network. In machine learning when you encode something using an integer (or a floating-point value), you imply there is an order to it: the number 55 is bigger than the number 36.
But this is not true for our MIDI notes: the drum sound represented by MIDI note number 55 is not “bigger” than the drum sound with number 36. These numbers represent completely different things — one is a kick drum, the other a cymbal.
Instead of truly being numbers on some continuous scale, our MIDI notes are examples of what’s called categorical variables. It’s better to encode that kind of data using one-hot encoding rather than integers (or floats).
For the sake of giving an example, let’s say that our entire dataset only uses five unique note numbers:
36 kick drum
38 snare drum
42 closed hi-hat
48 tom
55 cymbal

We can then encode any given note number using a 5-element vector. Each index in this vector corresponds to one of those five drum sounds. A kick drum (note 36) would be encoded as:
[ 1, 0, 0, 0, 0 ]

Hopfield network (HN) is a network where every neuron is connected to every other neuron; it is a completely entangled plate of spaghetti as even all the nodes function as everything. Each node is input before training, then hidden during training and output afterwards. The networks are trained by setting the value of the neurons to the desired pattern after which the weights can be computed. The weights do not change after this. Once trained for one or more patterns, the network will always converge to one of the learned patterns because the network is only stable in those states. Note that it does not always conform to the desired state (it’s not a magic black box sadly). It stabilises in part due to the total “energy” or “temperature” of the network being reduced incrementally during training. Each neuron has an activation threshold which scales to this temperature, which if surpassed by summing the input causes the neuron to take the form of one of two states (usually -1 or 1, sometimes 0 or 1). Updating the network can be done synchronously or more commonly one by one. If updated one by one, a fair random sequence is created to organise which cells update in what order (fair random being all options (n) occurring exactly once every n items). This is so you can tell when the network is stable (done converging): once every cell has been updated and none of them changed, the network is stable (annealed). These networks are often called associative memory because they converge to the state most similar to the input; if humans see half a table we can imagine the other half, and this network will converge to a table if presented with half noise and half a table.
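To make the recall behaviour concrete, here is a small sketch of a Hopfield network in plain Python. It makes several simplifying assumptions that go beyond the description above: weights are computed with the Hebbian rule, the activation threshold is fixed at 0 with no temperature, and neurons are updated in a fixed order rather than a fair random one. All names are illustrative.

```python
def train_hopfield(patterns):
    """Hebbian weights: w[i][j] = sum over patterns of p[i] * p[j], zero diagonal."""
    n = len(patterns[0])
    return [
        [0 if i == j else sum(p[i] * p[j] for p in patterns) for j in range(n)]
        for i in range(n)
    ]


def recall(weights, state, max_sweeps=10):
    """Update neurons one by one until a full sweep changes nothing."""
    state = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i, row in enumerate(weights):
            total = sum(w * s for w, s in zip(row, state))
            new_value = 1 if total >= 0 else -1
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:
            break  # stable: every neuron was visited and none flipped
    return state


stored = [1, -1, 1, -1, 1, -1]
weights = train_hopfield([stored])
noisy = [1, -1, 1, -1, 1, 1]  # last value flipped
print(recall(weights, noisy))  # [1, -1, 1, -1, 1, -1]
```

The corrupted input settles back into the stored pattern, which is the associative-memory behaviour described above.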

while a snare drum would be encoded as:
[ 0, 1, 0, 0, 0 ]

Hopfield, John J. “Neural networks and physical systems with emergent collective computational abilities.” Proceedings of the national academy of sciences 79.8 (1982): 2554-2558.

and so on… It’s called “one-hot” because the vector is all zeros except for a one at the index that represents the thing you’re encoding. Now all these vectors have the same “length” and there is no longer an ordering relationship between them.
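The five-note example above can be sketched in a few lines of Python. The helper name `one_hot` is an assumption for illustration and is not taken from convert_midi.py.

```python
NOTES = [36, 38, 42, 48, 55]  # kick, snare, closed hi-hat, tom, cymbal


def one_hot(value, vocabulary):
    """Return a vector of zeros with a single 1 at the position of `value`."""
    vec = [0] * len(vocabulary)
    vec[vocabulary.index(value)] = 1
    return vec


print(one_hot(36, NOTES))  # [1, 0, 0, 0, 0]
print(one_hot(38, NOTES))  # [0, 1, 0, 0, 0]
```

The position in the vector only identifies which drum sound it is; it says nothing about one sound being larger or smaller than another.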
We do the same thing for the ticks, and then combine these two one-hot encoded vectors into one big vector called x:


Markov chains (MC or discrete time Markov Chain, DTMC) are kind of the predecessors to BMs and HNs. They can be understood as follows: from this node where I am now, what are the odds of me going to any of my neighbouring nodes? They are memoryless (i.e. Markov Property) which means that every state you end up in depends completely on the previous state. While not really a neural network, they do resemble neural networks and form the theoretical basis for BMs and HNs. MC aren’t always considered neural networks, as goes for BMs, RBMs and HNs. Markov chains aren’t always fully connected either.
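A tiny sketch of the "where do I go from here?" idea in plain Python. The two-node chain and its probabilities are made up for illustration; the point is that the next node is chosen from the current node alone.

```python
import random

# Hypothetical transition odds: TRANSITIONS[a][b] is the chance of moving from a to b.
TRANSITIONS = {
    "A": {"A": 0.1, "B": 0.9},
    "B": {"A": 0.5, "B": 0.5},
}


def next_state(current, rng):
    """Pick the next node using only the current node's outgoing odds."""
    neighbours = list(TRANSITIONS[current])
    weights = [TRANSITIONS[current][n] for n in neighbours]
    return rng.choices(neighbours, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so the walk is repeatable
state = "A"
walk = [state]
for _ in range(5):
    state = next_state(state, rng)
    walk.append(state)
print(walk)
```

Nothing about the earlier part of the walk is passed to `next_state`, which is what memoryless means here.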

One-hot encoded input vector

Hayes, Brian. “First links in the Markov chain.” American Scientist 101.2 (2013): 252.

In the full dataset there are 17 unique note numbers and 209 unique tick values, so this vector consists of 226 elements. (Of those elements, 224 are 0 and two are 1.)
The sequence that we present to the RNN does not really consist of (note, ticks) pairs but is a list of these one-hot encoded vectors:
[ 0, 0, 1, 0, 0, 0, ..., 0 ]
[ 1, 0, 0, 0, 0, 0, ..., 0 ]
[ 0, 0, 0, 1, 0, 0, ..., 1 ]
. . . and so on . . .

Because there are 52260 notes in the dataset, the entire training sequence is made up of 52260 of those 226-element vectors.
The script convert_midi.py reads the MIDI files from the dataset and outputs a new file X.npy that contains this 52260×226 matrix with the full training sequence. (The script also saves two lookup tables that tell us which note numbers and tick values correspond to the positions in the one-hot vectors.)
Note: You may be wondering why we’re one-hot encoding the ticks too as these are numerical variables and not categorical. A timespan of 200 ticks definitely means that it’s twice as long as 100 ticks. Fair question. I figured I would keep things simple and encode the note numbers and ticks in the same way. This is not necessarily the most efficient way to encode the durations of the notes but it’s good enough for this blog post.


Long Short-Term Memory (huh?!)
The kind of recurrent neural network we’re using is something called an LSTM or Long Short-Term Memory. It looks like this on the inside:

Boltzmann machines (BM) are a lot like HNs, but: some neurons are marked as input neurons and others remain “hidden”. The input neurons become output neurons at the end of a full network update. It starts with random weights and learns through back-propagation, or more recently through contrastive divergence (a Markov chain is used to determine the gradients between two informational gains). Compared to a HN, the neurons mostly have binary activation patterns. As hinted by being trained by MCs, BMs are stochastic networks. The training and running process of a BM is fairly similar to a HN: one sets the input neurons to certain clamped values after which the network is set free (it doesn’t get a sock). While free the cells can get any value and we repetitively go back and forth between the input and hidden neurons. The activation is controlled by a global temperature value, which if lowered lowers the energy of the cells. This lower energy causes their activation patterns to stabilise. The network reaches an equilibrium given the right temperature.


Hinton, Geoffrey E., and Terrence J. Sejnowski. “Learning and relearning in Boltzmann machines.” Parallel distributed processing: Explorations in the microstructure of cognition 1 (1986): 282-317.

The gates inside an LSTM cell


The vector x is a single input that we feed into the network. It’s one of those 226-element vectors from the training sequence that combines the note number and the delay in ticks for a single drum sound.
The output y is the prediction that is computed by the LSTM. This is also a 226-element vector but this time it contains a probability distribution over the possible note numbers and tick values. The goal of training the LSTM is to get an output y that is (mostly) equal to the next element from the training sequence.
Recall that a recurrent network has “internal state” that acts as its memory. The internal state of the LSTM is given by two vectors: c and h. The c vector helps the LSTM to remember the sequence of MIDI notes it has seen so far, and h is used to predict the next notes in the sequence.
At every time step we compute new values for c and h, and then feed these back into the network so they are used as inputs for the next timestep.
The most interesting feature of the LSTM is that it has gates that can be either 0 (closed) or 1 (open). The gates determine how data flows through the LSTM layer.
The gates perform different jobs:
The “input” gate i determines whether the input x is added to the memory vector c. If this gate is closed, the input is basically ignored.
The g gate determines how much of input x gets added to c if the input gate is open.
The “output” gate o determines what gets put into the new value of h.
The “forget” gate f is used to reset parts of the memory c.

Restricted Boltzmann machines (RBM) are remarkably similar to BMs (surprise) and therefore also similar to HNs. The biggest difference between BMs and RBMs is that RBMs are more usable because they are more restricted. They don’t trigger-happily connect every neuron to every other neuron but only connect every different group of neurons to every other group, so no input neurons are directly connected to other input neurons and no hidden to hidden connections are made either. RBMs can be trained like FFNNs with a twist: instead of passing data forward and then back-propagating, you forward pass the data and then backward pass the data (back to the first layer). After that you train with forward-and-back-propagation.

The inputs x and h are connected to these gates using weights — Wxf, Whf, etc. When we train the LSTM, what it learns are the values of those weights. (It does not learn the values of h or c.)
Thanks to this mechanism with the gates, the LSTM can remember things over the long term, and it can even choose to forget things it no longer considers important.
Confused how this works? It doesn’t matter. Exactly how or why these gates work the way they do isn’t very important for this blog post. (If you really want to know, read the paper.) Just know this particular scheme has proven to work very well for remembering long sequences.
Our job is to make the network learn the optimal values for the weights between x and h and these gates, and for the weights between h and y.
The math
To implement an LSTM any sane person would use a tool such as Keras which lets you simply write layer = LSTM(). However, we are going to do it the hard way, using primitive TensorFlow operations.
The reason for doing it the hard way, is that we’re going to have to implement this math ourselves in the iOS app, so it’s useful to understand the formulas that are being used.
The formulas needed to implement the inner logic of the LSTM layer look like this:
f = tf.sigmoid(tf.matmul(x[t], Wxf) + tf.matmul(h[t - 1], Whf) + bf)
i = tf.sigmoid(tf.matmul(x[t], Wxi) + tf.matmul(h[t - 1], Whi) + bi)
o = tf.sigmoid(tf.matmul(x[t], Wxo) + tf.matmul(h[t - 1], Who) + bo)
g = tf.tanh(tf.matmul(x[t], Wxg) + tf.matmul(h[t - 1], Whg) + bg)

Smolensky, Paul. Information processing in dynamical systems: Foundations of harmony theory. No. CU-CS-321-86. COLORADO UNIV AT BOULDER DEPT OF COMPUTER SCIENCE, 1986.

What goes on here is less intimidating than it first appears. Let’s look at the line for the f gate in detail:
f = tf.sigmoid(
    tf.matmul(x[t], Wxf) +         # 1
    tf.matmul(h[t - 1], Whf) +     # 2
    bf                             # 3
)

This computes whether the f gate is open (1) or closed (0). Step-by-step this is what it does:
First multiply the input x for the current timestep with the matrix Wxf. This matrix contains the weights of the connections between x and f.


Also multiply the input h with the weights matrix Whf. In these formulas, t is the index of the timestep. Because h feeds back into the network we use the value of h from the previous timestep, given by h[t - 1].

Autoencoders (AE) are somewhat similar to FFNNs as AEs are more like a different use of FFNNs than a fundamentally different architecture. The basic idea behind autoencoders is to encode information (as in compress, not encrypt) automatically, hence the name. The entire network always resembles an hourglass like shape, with smaller hidden layers than the input and output layers. AEs are also always symmetrical around the middle layer(s) (one or two depending on an even or odd amount of layers). The smallest layer(s) is|are almost always in the middle, the place where the information is most compressed (the chokepoint of the network). Everything up to the middle is called the encoding part, everything after the middle the decoding and the middle (surprise) the code. One can train them using backpropagation by feeding input and setting the error to be the difference between the input and what came out. AEs can be built symmetrically when it comes to weights as well, so the encoding weights are the same as the decoding weights.

Add a bias value bf.

Bourlard, Hervé, and Yves Kamp. “Auto-association by multilayer perceptrons and singular value decomposition.” Biological cybernetics 59.4-5 (1988): 291-294.

Finally, take the logistic sigmoid of the whole thing. The sigmoid function returns 0, 1, or a value in between.

The same thing happens for the other gates, except that for g we use a hyperbolic tangent function to get a number between -1 and 1 (instead of 0 and 1). Each gate has its own set of weight matrices and bias values.
Once we know which gates are open and which are closed, we can compute the new values of the internal state c and h:
c[t] = f * c[t - 1] + i * g
h[t] = o * tf.tanh(c[t])

We put the new values of c and h into c[t] and h[t], so that these will be used as the inputs for the next timestep.
Now that we know the new value for the state vector h, we can use this to predict the output y for this timestep:
y = tf.matmul(h[t], Why) + by

Sparse autoencoders (SAE) are in a way the opposite of AEs. Instead of teaching a network to represent a bunch of information in less “space” or nodes, we try to encode information in more space. So instead of the network converging in the middle and then expanding back to the input size, we blow up the middle. These types of networks can be used to extract many small features from a dataset. If one were to train a SAE the same way as an AE, you would in almost all cases end up with a pretty useless identity network (as in what comes in is what comes out, without any transformation or decomposition). To prevent this, instead of feeding back the input, we feed back the input plus a sparsity driver. This sparsity driver can take the form of a threshold filter, where only a certain error is passed back and trained, the other error will be “irrelevant” for that pass and set to zero. In a way this resembles spiking neural networks, where not all neurons fire all the time (and points are scored for biological plausibility).

This prediction performs yet another matrix multiplication, this time using the weights Why between h and y. (This is a simple affine function like the one that happens in a fully-connected layer.)
Recall that our input x is a vector with 226 elements that contains two separate data items: the MIDI note number and the delay in ticks. This means we also need to predict the note and tick values separately, and so we use two softmax functions, each on a separate portion of the y vector:
y_note[t] = tf.nn.softmax(y[:num_midi_notes])
y_tick[t] = tf.nn.softmax(y[num_midi_notes:])
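To see what the two softmaxes do without TensorFlow, here is a plain-Python sketch. The 5-element `y` (3 note slots followed by 2 tick slots) is a made-up stand-in for the real 226-element vector, and `split_prediction` is a hypothetical helper name.

```python
import math


def softmax(values):
    shifted = [v - max(values) for v in values]  # subtract the max for numerical stability
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def split_prediction(y, num_midi_notes):
    """Turn one raw output vector into two separate probability distributions."""
    y_note = softmax(y[:num_midi_notes])
    y_tick = softmax(y[num_midi_notes:])
    return y_note, y_tick


y = [2.0, 0.0, 0.0, 1.0, 1.0]  # hypothetical raw outputs
y_note, y_tick = split_prediction(y, num_midi_notes=3)
print(y_note, y_tick)
```

Each half sums to 1 on its own, so the network makes one prediction for which drum comes next and an independent one for how long to wait.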

Marc’Aurelio Ranzato, Christopher Poultney, Sumit Chopra, and Yann LeCun. “Efficient learning of sparse representations with an energy-based model.” Proceedings of NIPS. 2007.

And that’s in a nutshell how the math in the LSTM layer works. To read more about these formulas, see the Wikipedia page.
Note: Even though the above LSTM formulas are taken from the Python training script and use TensorFlow to do the computations, we need to implement exactly the same formulas in the iOS app. But instead of TensorFlow, we’ll use the Accelerate framework for that.

Too many matrices!
As you know, when a neural network is trained it will learn values for the weights and biases. The same is true here: the LSTM will learn the values of Wxf, Whf, bf, Why, by, and so on. Notice that this is 9 different matrices and 5 different bias values.
We can be clever and actually combine these matrices into one big matrix:


Variational autoencoders (VAE) have the same architecture as AEs but are “taught” something else: an approximated probability distribution of the input samples. It’s a bit back to the roots as they are a bit more closely related to BMs and RBMs. They do however rely on Bayesian mathematics regarding probabilistic inference and independence, as well as a re-parametrisation trick to achieve this different representation. The inference and independence parts make sense intuitively, but they rely on somewhat complex mathematics. The basics come down to this: take influence into account. If one thing happens in one place and something else happens somewhere else, they are not necessarily related. If they are not related, then the error propagation should consider that. This is a useful approach because neural networks are large graphs (in a way), so it helps if you can rule out influence from some nodes to other nodes as you dive into deeper layers.

Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).

Combining the weight matrices into a big matrix

We first put the value of x for this timestep and the value of h of the previous timestep into a new vector (plus the constant 1, which gets multiplied with the bias). Likewise, we put all the weights and biases into one matrix. And then we multiply these two together.
This does the exact same thing as the eight matrix multiplies from before. The big advantage is that we now have to manage only a single weight matrix for x and h (and no bias value, since that is part of this big matrix too).
We can simplify the computation for the gates to just this:
combined = tf.concat([x[t], h[t - 1], tf.ones(1)], axis=0)
gates = tf.matmul(combined, Wx)
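To convince yourself that folding the bias into the big matrix changes nothing, here is a plain-Python check for a single gate output. The weights and inputs are made-up numbers, and the function names are assumptions for illustration.

```python
def separate(x, h_prev, w_x, w_h, b):
    """Pre-activation as two dot products plus a bias."""
    return (
        sum(a * w for a, w in zip(x, w_x))
        + sum(a * w for a, w in zip(h_prev, w_h))
        + b
    )


def combined(x, h_prev, w_big):
    """Same value from one dot product over [x, h_prev, 1]."""
    stacked = x + h_prev + [1.0]
    return sum(a * w for a, w in zip(stacked, w_big))


x, h_prev = [1.0, 2.0], [0.5]
w_x, w_h, b = [0.25, -1.0], [4.0], 0.5
w_big = w_x + w_h + [b]  # the bias becomes the last weight

print(separate(x, h_prev, w_x, w_h, b))  # 0.75
print(combined(x, h_prev, w_big))        # 0.75
```

The constant 1 appended to the input picks up the bias, so one multiplication replaces two multiplications and an addition. The real matrix simply stacks one such column per gate element.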


And then compute the new values of c and h as follows:
c[t] = tf.sigmoid(gates[0])*c[t - 1] + tf.sigmoid(gates[1])*tf.tanh(gates[3])
h[t] = tf.sigmoid(gates[2])*tf.tanh(c[t])

Denoising autoencoders (DAE) are AEs where we don’t feed just the input data, but we feed the input data with noise (like making an image more grainy). We compute the error the same way though, so the output of the network is compared to the original input without noise. This encourages the network not to learn details but broader features, as learning smaller features often turns out to be “wrong” due to it constantly changing with noise.

These two formulas for c and h didn’t really change — I just moved the sigmoid and tanh functions here.
Now when we train the LSTM we only have to deal with two weight matrices: Wx, which is the big matrix I showed you here, and Wy, the matrix for the weights between h and y. Those two matrices are the learned parameters that get loaded into the iOS app.
OK, let’s recap where we are now:
We’ve got a dataset of 52260 one-hot encoded vectors that describe MIDI notes and their timing. Together, these 52260 vectors make up a very long sequence of drum patterns.

Vincent, Pascal, et al. “Extracting and composing robust features with denoising autoencoders.” Proceedings of the 25th international conference on Machine learning. ACM, 2008.

We want to train the LSTM to memorize this sequence. In other words, for every note of the sequence the LSTM should be able to correctly predict the note that follows.

We have the formulas for computing what happens in an LSTM layer. It takes an input x, which is one of these vectors describing a single drum sound, and two state vectors h and c. The LSTM then computes new values for h and c, as well as a prediction y for what the next note in the sequence will be.


Now we need to put this all together to train the recurrent network. This will give us two matrices Wx and Wy that describe the weights of the connections between the different parts of the LSTM.
And then we can use those weights in the iOS app to play new drum patterns.
Note: The GitHub repo only contains a few drum patterns since I am not allowed to distribute the full dataset. So unless you have your own library of drum patterns, there isn’t much use in doing the training yourself. However, you can still run the iOS app, as the trained weights are included in the Xcode project.
That said, if you really want to, you can run the lstm.py script to train the neural network on the included drum patterns (see the README file for instructions). Don’t get your hopes up though — because there isn’t nearly enough data to train on, the model won’t be very good.

Deep belief networks (DBN) is the name given to stacked architectures of mostly RBMs or VAEs. These networks have been shown to be effectively trainable stack by stack, where each AE or RBM only has to learn to encode the previous network. This technique is also known as greedy training, where greedy means making locally optimal solutions to get to a decent but possibly not optimal answer. DBNs can be trained through contrastive divergence or back-propagation and learn to represent the data as a probabilistic model, just like regular RBMs or VAEs. Once trained or converged to a (more) stable state through unsupervised learning, the model can be used to generate new data. If trained with contrastive divergence, it can even classify existing data because the neurons have been taught to look for different features.

A few notes about training
Training an LSTM isn’t very different from training any other neural network. We use backpropagation with an SGD (Stochastic Gradient Descent) optimizer and we train until the loss is low enough.
However, the recurrent nature of this network, where the outputs h and c are always connected to the inputs h and c, makes backpropagation a little tricky. We don’t want to get stuck in an infinite loop!
The way to deal with this is a technique called backpropagation through time, where we backpropagate through all the steps of the entire training sequence.
In the interest of keeping this blog post short, I’m not going to explain the entire training procedure here. You can find the complete implementation in lstm.py in the function train().
However, I do want to mention a few things:
The learning capacity of the LSTM is determined by the size of the h and c vectors. A size of 200 units (or neurons if you will) seems to work well. More units might work even better but at some point you’ll get diminishing returns, and you’re better off stacking multiple LSTMs on top of each other (making the network deeper rather than wider).

Bengio, Yoshua, et al. “Greedy layer-wise training of deep networks.” Advances in neural information processing systems 19 (2007): 153.

It’s not practical to backpropagate through all 52260 steps of the training sequence, even though that would give the best results. Instead, we only go back 200 timesteps. After a bit of experimentation this seemed like a reasonable number. To achieve this, the training script actually sticks 200 LSTM units together and processes the training sequence in chunks of 200 notes at a time.
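
To make the chunking idea concrete, here is a minimal sketch in plain Python. It is not the code from lstm.py: the function name chunk_sequence and the stand-in list of integers are assumptions made for illustration, and the real script may handle the final partial chunk differently.

```python
def chunk_sequence(notes, unroll_steps=200):
    """Split a long sequence into consecutive chunks of at most unroll_steps items.

    Backpropagation then only runs inside one chunk at a time,
    which is the "truncated" part of truncated backpropagation through time.
    """
    return [notes[i:i + unroll_steps] for i in range(0, len(notes), unroll_steps)]


# Stand-in for the real training sequence (hypothetical data: just integers).
sequence = list(range(52260))
chunks = chunk_sequence(sequence)

print(len(chunks), len(chunks[0]), len(chunks[-1]))
```

With 52260 steps and chunks of 200, the last chunk is shorter than the others, so a real training loop has to either pad it or drop it.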

Every so often the training script computes the percentage of predictions it has correct. It does this on the training set (there is no validation set) so take it with a grain of salt, but it’s a good indicator of whether the training is still making progress or not.

The final model took a few hours to train on my iMac but that’s because it doesn’t have a GPU that TensorFlow can use (sad face). I let the training script run until the learning seemed to have stalled (the accuracy and loss did not improve), then I pressed Ctrl+C, lowered the learning rate in the script, and resumed training from the last checkpoint.

Convolutional neural networks (CNN or deep convolutional neural networks, DCNN) are quite different from most other networks. They are primarily used for image processing but can also be used for other types of input such as audio. A typical use case for CNNs is where you feed the network images and the network classifies the data, e.g. it outputs “cat” if you give it a cat picture and “dog” when you give it a dog picture. CNNs tend to start with an input “scanner” which is not intended to parse all the training data at once. For example, to input an image of 200 x 200 pixels, you wouldn’t want a layer with 40 000 nodes. Rather, you create a scanning input layer of say 20 x 20 which you feed the first 20 x 20 pixels of the image (usually starting in the upper left corner). Once you have passed that input (and possibly used it for training) you feed it the next 20 x 20 pixels: you move the scanner one pixel to the right. Note that one wouldn’t move the input 20 pixels (or whatever the scanner width is) over; you’re not dissecting the image into blocks of 20 x 20, but rather you’re crawling over it. This input data is then fed through convolutional layers instead of normal layers, where not all nodes are connected to all nodes. Each node only concerns itself with close neighbouring cells (how close depends on the implementation, but usually not more than a few). These convolutional layers also tend to shrink as they become deeper, mostly by easily divisible factors of the input (so 20 would probably go to a layer of 10 followed by a layer of 5). Powers of two are very commonly used here, as they can be divided cleanly and completely by definition: 32, 16, 8, 4, 2, 1. Besides these convolutional layers, they also often feature pooling layers. Pooling is a way to filter out details: a commonly found pooling technique is max pooling, where we take say 2 x 2 pixels and pass on the pixel with the largest amount of red.
To apply CNNs to audio, you basically feed in the input audio waves and inch over the length of the clip, segment by segment. Real world implementations of CNNs often glue an FFNN to the end to further process the data, which allows for highly non-linear abstractions. These networks are called DCNNs but the names and abbreviations of the two are often used interchangeably.
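
The “scanner” and max pooling described above can be sketched in a few lines of plain Python. This only shows how the windows are cut out and how pooling discards detail; there are no learned filters here, and the function names and the tiny 4 x 4 image are made up for the example.

```python
def sliding_windows(image, size):
    """Yield every size x size patch, moving the window one pixel at a time."""
    rows, cols = len(image), len(image[0])
    for top in range(rows - size + 1):
        for left in range(cols - size + 1):
            yield [row[left:left + size] for row in image[top:top + size]]


def max_pool(image, size=2):
    """Keep only the largest value from each non-overlapping size x size block."""
    rows, cols = len(image), len(image[0])
    return [
        [max(image[r + dr][c + dc] for dr in range(size) for dc in range(size))
         for c in range(0, cols - size + 1, size)]
        for r in range(0, rows - size + 1, size)
    ]


# A tiny single-channel "image" (hypothetical values).
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]

patches = list(sliding_windows(image, 3))  # crawl a 3 x 3 scanner over the image
pooled = max_pool(image)                   # shrink 4 x 4 down to 2 x 2
print(len(patches), pooled)
```

By the same counting, a 20 x 20 scanner crawling over a 200 x 200 image one pixel at a time visits 181 x 181 positions, which is why neighbouring windows overlap so heavily.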

The model that is included in the GitHub repo has an accuracy score of about 92%, which means 8 in every 100 notes from the training sequence are remembered incorrectly. Once the model reached 92% accuracy, it didn’t seem to want to go much further than that, so we’ve probably reached the capacity of our model.
An accuracy of “only” 92% is good enough for our purposes: we don’t want the LSTM to literally remember every example from the training data, just enough to get a sense of what it means to play the drums.
So how good is it?
Don’t fire the drummer from your band just yet. :–)

LeCun, Yann, et al. “Gradient-based learning applied to document recognition.” Proceedings of the IEEE 86.11 (1998): 2278-2324.

The question is: has the recurrent neural network really learned anything from the training data, or does it just output random notes?
Here’s an MP3 of randomly chosen notes and durations from the training data. It doesn’t sound like real drums at all.
Compare it with this recording that was produced by the LSTM. It’s definitely much more realistic! (In fact, it sounds a lot like the kid down the street practicing.)
Of course, the model we’re using is very basic. It’s a single LSTM layer with “only” 200 neurons. No doubt there are much better ways to train a computer to play the drums. One way is to make the network deeper by stacking multiple LSTMs. This should improve the performance by a lot!
The weights that are learned by the model take up 1.5 MB of storage. The dataset, on the other hand, is only 1.3 MB! That doesn’t seem very efficient. But just having the dataset does not mean you know how to drum — the weights are more than just a way to remember the training data, they also “understand” in some way what it means to play the drums.
The cool thing is that our neural network doesn’t really know anything about music: we just gave it examples and it has learned drumming from that (to some extent anyway). The point I’m trying to make with this blog post is that if we can make a recurrent neural network learn to drum, then we can teach it to understand any kind of sequential data.

Deconvolutional networks (DN), also called inverse graphics networks (IGNs), are reversed convolutional neural networks. Imagine feeding a network the word “cat” and training it to produce cat-like pictures, by comparing what it generates to real pictures of cats. DNs can be combined with FFNNs just like regular CNNs, but this is about the point where the line is drawn for coming up with new abbreviations. They may be referred to as deep deconvolutional neural networks, but you could argue that when you stick FFNNs to the back and the front of DNs you have yet another architecture which deserves a new name. Note that in most applications one wouldn’t actually feed text-like input to the network, but more likely a binary classification input vector. Think <0, 1> being cat, <1, 0> being dog and <1, 1> being cat and dog. The pooling layers commonly found in CNNs are often replaced with similar inverse operations, mainly interpolation and extrapolation with biased assumptions (if a pooling layer uses max pooling, you can invent exclusively lower new data when reversing it).

Zeiler, Matthew D., et al. “Deconvolutional networks.” Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010.

Deep convolutional inverse graphics networks (DCIGN) have a somewhat misleading name, as they are actually VAEs but with CNNs and DNs for the respective encoders and decoders. These networks attempt to model “features” in the encoding as probabilities, so that they can learn to produce a picture with a cat and a dog together, having only ever seen one of the two in separate pictures. Similarly, you could feed one a picture of a cat with your neighbours’ annoying dog in it, and ask it to remove the dog, without it ever having done such an operation. Demos have shown that these networks can also learn to model complex transformations on images, such as changing the source of light or the rotation of a 3D object. These networks tend to be trained with back-propagation.

Kulkarni, Tejas D., et al. “Deep convolutional inverse graphics network.” Advances in Neural Information Processing Systems. 2015.

Generative adversarial networks (GAN) are from a different breed of networks; they are twins: two networks working together. GANs consist of any two networks (although often a combination of FFs and CNNs), with one tasked to generate content and the other to judge content. The discriminating network receives either training data or generated content from the generative network. How well the discriminating network was able to correctly predict the data source is then used as part of the error for the generating network. This creates a form of competition where the discriminator is getting better at distinguishing real data from generated data and the generator is learning to become less predictable to the discriminator. This works well in part because even quite complex noise-like patterns are eventually predictable but generated content similar in features to the input data is harder to learn to distinguish. GANs can be quite difficult to train, as you don’t just have to train two networks (either of which can pose its own problems) but their dynamics need to be balanced as well. If prediction or generation becomes too good compared to the other, a GAN won’t converge as there is intrinsic divergence.

Goodfellow, Ian, et al. “Generative adversarial nets.” Advances in Neural Information Processing Systems. 2014.
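
One common way to turn “how well the discriminator predicted the data source” into an error signal is binary cross-entropy. The sketch below assumes that choice, together with the frequently used variant in which the generator is rewarded when the discriminator labels its output as real. The function names are made up, and the probabilities are plain numbers rather than the outputs of actual networks.

```python
import math


def bce(prediction, target):
    """Binary cross-entropy for one predicted probability and a 0/1 target."""
    eps = 1e-12  # keeps log() away from zero
    p = min(max(prediction, eps), 1.0 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))


def discriminator_loss(d_real, d_fake):
    """The discriminator wants to say 1 for real data and 0 for generated data."""
    return bce(d_real, 1) + bce(d_fake, 0)


def generator_loss(d_fake):
    """The generator wants the discriminator to say 1 for generated data."""
    return bce(d_fake, 1)


# d_real / d_fake: the discriminator's "this is real" probability (hypothetical values).
confident = discriminator_loss(d_real=0.9, d_fake=0.1)
guessing = discriminator_loss(d_real=0.5, d_fake=0.5)
fooled = generator_loss(d_fake=0.9)
caught = generator_loss(d_fake=0.1)
print(confident, guessing, fooled, caught)
```

The two losses pull in opposite directions on the same number, d_fake, which is the competition described above.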

Recurrent neural networks (RNN) are FFNNs with a time twist: they are not stateless; they have connections between passes, connections through time. Neurons are fed information not just from the previous layer but also from themselves from the previous pass. This means that the order in which you feed the input and train the network matters: feeding it “milk” and then “cookies” may yield different results compared to feeding it “cookies” and then “milk”. One big problem with RNNs is the vanishing (or exploding) gradient problem where, depending on the activation functions used, information rapidly gets lost over time, just like very deep FFNNs lose information in depth. Intuitively this wouldn’t be much of a problem because these are just weights and not neuron states, but the weights through time are actually where the information from the past is stored; if the weight reaches a value of 0 or 1 000 000, the previous state won’t be very informative. RNNs can in principle be used in many fields, as most forms of data that don’t actually have a timeline (i.e. unlike sound or video) can be represented as a sequence. A picture or a string of text can be fed one pixel or character at a time, so the time dependent weights are used for what came before in the sequence, not actually for what happened x seconds before. In general, recurrent networks are a good choice for advancing or completing information, such as autocompletion.

Elman, Jeffrey L. “Finding structure in time.” Cognitive science 14.2 (1990): 179-211.
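
The “milk” and “cookies” point is easy to check with a single recurrent unit. The sketch below is a toy: the weights are hand-picked rather than trained, and encoding the two words as 1.0 and -1.0 is purely an assumption made for the example.

```python
import math


def rnn_step(x, h, w_in=0.5, w_rec=0.9, bias=0.0):
    """One pass of a single recurrent unit: mix the new input with the old state."""
    return math.tanh(w_in * x + w_rec * h + bias)


def run(sequence):
    """Feed a whole sequence through the unit and return the final state."""
    h = 0.0
    for x in sequence:
        h = rnn_step(x, h)
    return h


milk, cookies = 1.0, -1.0  # hypothetical encodings of the two inputs

milk_first = run([milk, cookies])
cookies_first = run([cookies, milk])
print(milk_first, cookies_first)
```

Both runs see exactly the same two inputs, but because the state from the first pass leaks into the second, the final states differ.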

Long / short term memory (LSTM) networks try to combat the vanishing / exploding gradient problem by introducing gates and an explicitly defined memory cell. These are inspired mostly by circuitry, not so much biology. Each neuron has a memory cell and three gates: input, output and forget. The function of these gates is to safeguard the information by stopping or allowing the flow of it. The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate takes the job on the other end and determines how much the next layer gets to know about the state of this cell. The forget gate seems like an odd inclusion at first but sometimes it’s good to forget: if it’s learning a book and a new chapter begins, it may be necessary for the network to forget some characters from the previous chapter. LSTMs have been shown to be able to learn complex sequences, such as writing like Shakespeare or composing primitive music. Note that each of these gates has a weight to a cell in the previous neuron, so they typically require more resources to run.

Hochreiter, Sepp, and Jürgen Schmidhuber. “Long short-term memory.” Neural computation 9.8 (1997): 1735-1780.
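
Here is what the three gates do to a single memory cell, written out as a sketch in plain Python. It follows the commonly used formulation without peephole connections; the parameter dictionary, its key names and the hand-picked weights are assumptions for illustration, not values from a trained network.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a single-unit LSTM.

    params maps a gate name to (input weight, recurrent weight, bias).
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))         # how much new information to store
    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    o = sigmoid(pre("output"))        # how much of the cell to reveal
    g = math.tanh(pre("candidate"))   # the new information itself

    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


# Hand-picked weights: input gate shut, forget gate wide open.
remember = {
    "input": (0.0, 0.0, -50.0),
    "forget": (0.0, 0.0, 50.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = lstm_step(x=5.0, h_prev=0.0, c_prev=0.7, params=remember)
print(h, c)
```

With the input gate shut and the forget gate open, the cell simply carries its old value of 0.7 forward no matter what the input is. That is the mechanism that lets information survive many time steps.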

Gated recurrent units (GRU) are a slight variation on LSTMs. They have one fewer gate and are wired slightly differently: instead of input, output and forget gates, they have an update gate and a reset gate. The update gate determines both how much information to keep from the last state and how much information to let in from the previous layer. The reset gate functions much like the forget gate of an LSTM but it’s located slightly differently. GRUs always send out their full state; they don’t have an output gate. In most cases, they function very similarly to LSTMs, with the biggest difference being that GRUs are slightly faster and easier to run (but also slightly less expressive). In practice these tend to cancel each other out, as you need a bigger network to regain some expressiveness which then in turn cancels out the performance benefits. In some cases where the extra expressiveness is not needed, GRUs can outperform LSTMs.

Chung, Junyoung, et al. “Empirical evaluation of gated recurrent neural networks on sequence modeling.” arXiv preprint arXiv:1412.3555 (2014).
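
The same kind of sketch for a single-unit GRU shows the difference in wiring. One caveat: papers disagree on whether the update gate multiplies the old state or the new candidate. The version below uses h = (1 - z) * h_prev + z * candidate, and the parameter names and weights are again hand-picked assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gru_step(x, h_prev, params):
    """One step of a single-unit GRU.

    params maps a name to (input weight, recurrent weight, bias).
    """
    w_x, w_h, b = params["update"]
    z = sigmoid(w_x * x + w_h * h_prev + b)   # blend between old state and candidate

    w_x, w_h, b = params["reset"]
    r = sigmoid(w_x * x + w_h * h_prev + b)   # how much old state feeds the candidate

    w_x, w_h, b = params["candidate"]
    candidate = math.tanh(w_x * x + w_h * (r * h_prev) + b)

    # No output gate: the full state is what the next layer sees.
    return (1.0 - z) * h_prev + z * candidate


# Hand-picked weights: update gate shut, so the old state is kept.
keep = {
    "update": (0.0, 0.0, -50.0),
    "reset": (0.0, 0.0, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}

h = gru_step(x=5.0, h_prev=0.3, params=keep)
print(h)
```

There is no separate cell state here: the single value h plays the role of both memory and output.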

Neural Turing machines (NTM) can be understood as an abstraction of LSTMs and an attempt to un-black-box neural networks (and give us some insight into what is going on in there). Instead of coding a memory cell directly into a neuron, the memory is separated. It’s an attempt to combine the efficiency and permanency of regular digital storage with the efficiency and expressive power of neural networks. The idea is to have a content-addressable memory bank and a neural network that can read from and write to it. The “Turing” in Neural Turing Machines comes from them being Turing complete: the ability to read and write and change state based on what it reads means it can represent anything a Universal Turing Machine can represent.

Graves, Alex, Greg Wayne, and Ivo Danihelka. “Neural turing machines.” arXiv preprint arXiv:1410.5401 (2014).
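
“Content-addressable” means the network looks memory up by similarity to a key rather than by an index. A rough sketch of such a read follows, assuming cosine similarity followed by a softmax. The sharpness parameter, the function names and the toy memory bank are all assumptions for illustration; writing and location-based addressing are left out.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def content_read(memory, key, sharpness=10.0):
    """Read from memory by similarity: a soft, weighted mix of all rows."""
    scores = [sharpness * cosine(row, key) for row in memory]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # softmax, shifted for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    width = len(memory[0])
    read = [sum(w * row[j] for w, row in zip(weights, memory)) for j in range(width)]
    return read, weights


# A toy memory bank with three slots (hypothetical contents).
memory = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
key = [0.9, 0.1, 0.0]

read, weights = content_read(memory, key)
print(weights, read)
```

Because the read is a weighted mix rather than a hard lookup, it stays differentiable, which is what allows the controller network to be trained with back-propagation.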

Bidirectional recurrent neural networks, bidirectional long / short term memory networks and bidirectional gated recurrent units (BiRNN, BiLSTM and BiGRU respectively) are not shown on the chart because they look exactly the same as their unidirectional counterparts. The difference is that these networks are not just connected to the past, but also to the future. As an example, unidirectional LSTMs might be trained to predict the word “fish” by being fed the letters one by one, where the recurrent connections through time remember the last value. A BiLSTM would also be fed the next letter in the sequence on the backward pass, giving it access to future information. This trains the network to fill in gaps instead of advancing information, so instead of expanding an image on the edge, it could fill a hole in the middle of an image.

Schuster, Mike, and Kuldip K. Paliwal. “Bidirectional recurrent neural networks.” IEEE Transactions on Signal Processing 45.11 (1997): 2673-2681.

Deep residual networks (DRN) are very deep FFNNs with extra connections passing input from one layer to a later layer (often 2 to 5 layers) as well as the next layer. Instead of trying to find a solution for mapping some input to some output across say 5 layers, the network is forced to learn to map some input to some output + some input. Basically, it adds an identity to the solution, carrying the older input over and serving it freshly to a later layer. It has been shown that these networks are very effective at learning patterns up to 150 layers deep, much more than the regular 2 to 5 layers one could expect to train. However, it has been proven that these networks are in essence just RNNs without the explicit time based construction and they’re often compared to LSTMs without gates.

He, Kaiming, et al. “Deep residual learning for image recognition.” arXiv preprint arXiv:1512.03385 (2015).
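
The “some output + some input” idea fits in a few lines. This sketch uses a single dense layer with a ReLU as the part that has to be learned, which is an assumption made for brevity (real residual blocks usually wrap two or more convolutional layers), and the function names are made up.

```python
def dense_relu(x, weights, biases):
    """A plain fully connected layer followed by a ReLU."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]


def residual_block(x, weights, biases):
    """Output = learned transformation of x, plus x itself (the skip connection)."""
    fx = dense_relu(x, weights, biases)
    return [f + xi for f, xi in zip(fx, x)]


x = [1.0, -2.0]

# With all-zero weights the learned part contributes nothing...
zero_w = [[0.0, 0.0], [0.0, 0.0]]
zero_b = [0.0, 0.0]
unchanged = residual_block(x, zero_w, zero_b)

# ...while non-zero weights only have to learn the difference from the input.
some_w = [[0.5, 0.0], [0.0, 0.5]]
adjusted = residual_block(x, some_w, zero_b)
print(unchanged, adjusted)
```

A block that has learned nothing still passes its input through untouched, so stacking many of them does not destroy the signal the way stacking many plain layers can.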

Echo state networks (ESN) are yet another different type of (recurrent) network. This one sets itself apart from others by having random connections between the neurons (i.e. not organised into neat sets of layers), and they are trained differently. Instead of feeding input and back-propagating the error, we feed the input, forward it and update the neurons for a while, and observe the output over time. The input and the output layers have a slightly unconventional role as the input layer is used to prime the network and the output layer acts as an observer of the activation patterns that unfold over time. During training, only the connections between the observer and the (soup of) hidden units are changed.

Jaeger, Herbert, and Harald Haas. “Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication.” science 304.5667 (2004): 78-80.

Extreme learning machines (ELM) are basically FFNNs but with random connections. They look very similar to LSMs and ESNs, but they are not recurrent nor spiking. They also do not use backpropagation. Instead, they start with random weights and train the weights in a single step according to the least-squares fit (lowest error across all functions). This results in a much less expressive network but it’s also much faster than backpropagation.

Cambria, Erik, et al. “Extreme learning machines [trends & controversies].” IEEE Intelligent Systems 28.6 (2013): 30-59.

Liquid state machines (LSM) are similar soups, looking a lot like ESNs. The real difference is that LSMs are a type of spiking neural network: sigmoid activations are replaced with threshold functions and each neuron is also an accumulating memory cell. So when updating a neuron, the value is not set to the sum of the neighbours, but rather added to itself. Once the threshold is reached, it releases its energy to other neurons. This creates a spiking-like pattern, where nothing happens for a while until a threshold is suddenly reached.

Maass, Wolfgang, Thomas Natschläger, and Henry Markram. “Real-time computing without stable states: A new framework for neural computation based on perturbations.” Neural computation 14.11 (2002): 2531-2560.
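
The accumulate-then-fire behaviour of a single spiking neuron can be sketched as follows. This is a deliberately stripped-down model: there is no leak, no refractory period and no network around it, and the function name and threshold value are assumptions.

```python
def run_neuron(inputs, threshold=1.0):
    """Accumulate incoming values; emit a spike (1) when the threshold is reached."""
    potential = 0.0
    spikes = []
    for value in inputs:
        potential += value          # added to itself, not overwritten
        if potential >= threshold:
            spikes.append(1)
            potential = 0.0         # energy is released, start accumulating again
        else:
            spikes.append(0)
    return spikes


spikes = run_neuron([0.4, 0.4, 0.4, 0.4, 0.4, 0.4])
print(spikes)
```

The input is perfectly steady, yet the output is silent most of the time and fires only when enough has accumulated.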

Support vector machines (SVM) find optimal solutions for classification problems. Classically they were only capable of categorising linearly separable data; say finding which images are of Garfield and which of Snoopy, with any other outcome not being possible. During training, SVMs can be thought of as plotting all the data (Garfields and Snoopys) on a graph (2D) and figuring out how to draw a line between the data points. This line would separate the data, so that all Snoopys are on one side and the Garfields on the other. This line moves to an optimal line in such a way that the margins between the data points and the line are maximised on both sides. Classifying new data would be done by plotting a point on this graph and simply looking on which side of the line it is (Snoopy side or Garfield side). Using the kernel trick, they can be taught to classify n-dimensional data. This entails plotting points in a 3D plot, allowing it to distinguish between Snoopy, Garfield AND Simon’s cat, or even higher dimensions distinguishing even more cartoon characters. SVMs are not always considered neural networks.

Cortes, Corinna, and Vladimir Vapnik. “Support-vector networks.” Machine learning 20.3 (1995): 273-297.
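
Once the separating line has been found, classifying a new point is just a matter of checking which side of the line it falls on. The sketch below skips training entirely: the line (weights and bias) is hand-picked rather than learned, and the function names and 2D points are hypothetical.

```python
import math


def score(point, weights, bias):
    """Signed value of w . x + b; the sign tells us the side of the line."""
    return sum(w * x for w, x in zip(weights, point)) + bias


def classify(point, weights, bias):
    return "Snoopy" if score(point, weights, bias) >= 0 else "Garfield"


def distance_to_line(point, weights, bias):
    """Geometric distance from the point to the line: the quantity SVMs maximise."""
    norm = math.sqrt(sum(w * w for w in weights))
    return abs(score(point, weights, bias)) / norm


# A hand-picked line x + y = 3 (an assumption, not a trained model).
weights, bias = (1.0, 1.0), -3.0

first = classify((2.0, 2.0), weights, bias)
second = classify((1.0, 1.0), weights, bias)
margin = distance_to_line((2.0, 2.0), weights, bias)
print(first, second, margin)
```

Training an SVM amounts to choosing the weights and bias so that the smallest of these distances over the training points is as large as possible.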

And finally, Kohonen networks (KN, also self organising (feature) map, SOM, SOFM) “complete” our zoo. KNs utilise competitive learning to classify data without supervision. Input is presented to the network, after which the network assesses which of its neurons most closely match that input. These neurons are then adjusted to match the input even better, dragging along their neighbours in the process. How much the neighbours are moved depends on the distance of the neighbours to the best matching units. KNs are sometimes not considered neural networks either.

Kohonen, Teuvo. “Self-organized formation of topologically correct feature maps.” Biological cybernetics 43.1 (1982): 59-69.
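
A single competitive-learning update might look like the sketch below, for a one-dimensional row of neurons. The Gaussian neighbourhood function is an assumption (other fall-off shapes are used as well), as are the function name, the learning rate and the toy data.

```python
import math


def som_update(units, sample, learning_rate=0.5, radius=1.0):
    """One update of a 1D Kohonen map.

    Finds the best matching unit, then pulls every unit towards the sample,
    with less pull the further a unit sits from the winner on the map.
    """
    distances = [sum((u - s) ** 2 for u, s in zip(unit, sample)) for unit in units]
    bmu = distances.index(min(distances))

    updated = []
    for index, unit in enumerate(units):
        influence = math.exp(-((index - bmu) ** 2) / (2 * radius ** 2))
        updated.append([
            u + learning_rate * influence * (s - u)
            for u, s in zip(unit, sample)
        ])
    return bmu, updated


# Three neurons in a row, each holding a one-dimensional weight (hypothetical).
units = [[0.0], [1.0], [2.0]]
sample = [1.2]

bmu, updated = som_update(units, sample)
print(bmu, updated)
```

The middle neuron wins and moves closest to the sample, while its two neighbours are dragged part of the way there.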

Any feedback and criticism is welcome. At the Asimov Institute we do deep learning research and development, so be sure to follow us on twitter for future updates and posts! Thank you for reading!

[Update 15 september 2016] I would like to thank everybody for their insights and corrections, all feedback is hugely appreciated. I will add links and a couple more suggested networks in a future update, stay tuned.

[Update 29 september 2016] Added links and citations to all the original papers. A follow up post is planned, since I found at least 9 more architectures. I will not include them in this post for better consistency in terms of content.

[Update 30 november 2017] Looking for a poster of the neural network zoo? Click here