Contents

Four main questions:

Reposted from:

Recurrent Neural Networks with Swift and
Accelerate

6 APRIL 2017

- Understanding LSTM Networks
- Recurrent Neural Networks
- The Problem of Long-Term Dependencies
- LSTM Networks
- The Core Idea Behind LSTMs
- Step-by-Step LSTM Walk Through
- Variants on Long Short Term Memory
- Conclusion
- Acknowledgments

- What is it?
- Why does it exist?
- What does it do?
- How does it work?

With new neural network architectures popping up every now and then, it’s hard to keep track of them all. Knowing all the abbreviations being thrown around (DCIGN, BiLSTM, DCGAN, anyone?) can be a bit overwhelming at first.

In this blog post we’ll use a recurrent neural network (RNN) to teach
the iPhone to **play the drums**. It will sound something like this:

figure 1 LSTM.JPG

*This article translates Christopher Olah's blog post* Understanding LSTM Networks *on colah's blog, with some of my own modest commentary added.*

So I decided to compose a cheat sheet containing many of those architectures. Most of these are neural networks, some are completely different beasts. Though all of these architectures are presented as novel and unique, when I drew the node structures… their underlying relations started to make more sense.

The timing still needs a little work but it definitely sounds like
someone playing the drums!

We’ll teach the computer to play drums without explaining what makes a
good rhythm, or what even a kick drum or a hi-hat is. The RNN will learn
how to drum purely from examples of existing drum patterns.

The reason we’re using a *recurrent* network for this task is that this
type of neural network is very good at understanding sequences of
things, in this case sequences of MIDI notes.

Apple’s BNNS and Metal
CNN
libraries don’t support recurrent neural networks at the moment, but no
worries: we can get pretty far already with just a few matrix
multiplications.

As usual we train the neural network on the Mac (using TensorFlow and
Python), and then copy what it has learned into the iOS app. In the iOS
app we’ll use the Accelerate framework to handle the math.

In this post I’m only going to show the relevant bits of the code. The
full source is on
GitHub,
so look there to follow along.

What is an RNN?

A regular neural
network,
also known as a *feed-forward* network, is a simple pipeline: the input
data goes into one end and comes out the other end as a prediction of
some kind, often in the form of a probability distribution.

The interesting thing about a *recurrent* neural network is that it has
an additional input and output, and these two are connected. The new
input gets its data from the RNN’s output, so the network feeds back
into itself, which is where the name “recurrent” comes from.
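The extra input/output loop described above fits in a few lines of NumPy. This is a minimal sketch of one vanilla RNN timestep, not the code from the app; the weight names and sizes are illustrative:

```python
import numpy as np

def rnn_step(x, h_prev, Wx, Wh, b):
    """One timestep of a vanilla RNN: the previous hidden state h_prev
    is the 'extra input' that feeds the network's output back into itself."""
    return np.tanh(x @ Wx + h_prev @ Wh + b)

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 8
Wx = rng.standard_normal((input_size, hidden_size)) * 0.1
Wh = rng.standard_normal((hidden_size, hidden_size)) * 0.1
b  = np.zeros(hidden_size)

h = np.zeros(hidden_size)           # initial internal state
for t in range(5):                  # process 5 timesteps of a sequence
    x = rng.standard_normal(input_size)
    h = rnn_step(x, h, Wx, Wh, b)   # the state carries over to the next step

print(h.shape)  # (8,)
```

The only difference from a feed-forward layer is the `h_prev @ Wh` term: the network's own previous output is mixed back into the current step.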

*[Translator's note] The annotated article "Understanding LSTM and its diagrams" may further help with comprehension.*

## What is an LSTM?

The following definition comes from Baidu Baike:

LSTM (Long Short-Term Memory) is a recurrent neural network suited to processing and predicting important events separated by relatively long intervals and delays in a time series.

figure 2 LSTM.JPG

# Understanding LSTM Networks

## Why were LSTMs created?

One problem with drawing them as node maps: it doesn’t really show how they’re used. For example, variational autoencoders (VAE) may look just like autoencoders (AE), but the training process is actually quite different. The use-cases for trained networks differ even more, because VAEs are generators, where you insert noise to get a new sample. AEs, simply map whatever they get as input to the closest training sample they “remember”. I should add that this overview is in no way clarifying how each of the different node types work internally (but that’s a topic for another day).

## 1. tf.contrib.rnn.BasicLSTMCell(num_units, forget_bias=1.0, state_is_tuple=True, activation=None, reuse=None)

## Recurrent Neural Networks


Humans don't start their thinking from scratch every second. As you read this essay, you understand each word based on your understanding of previous words. You don't throw everything away and start thinking from scratch again. Your thoughts have persistence.

Traditional neural networks can't do this, and it seems like a major shortcoming. For example, imagine you want to classify what kind of event is happening at every point in a movie. It's unclear how a traditional neural network could use its reasoning about previous events in the film to inform later ones.

Recurrent neural networks address this issue. They are networks with loops in them, allowing information to persist.


In the above diagram, a chunk of neural network, (A), looks at some input (x_t) and outputs a value (h_t). A loop allows information to be passed from one step of the network to the next.

These loops make recurrent neural networks seem kind of mysterious. However, if you think a bit more, it turns out that they aren't all that different than a normal neural network. A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor. Consider what happens if we unroll the loop:


This chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists. They're the natural architecture of neural network to use for such data.

And they certainly are used! In the last few years, there have been incredible success applying RNNs to a variety of problems: speech recognition, language modeling, translation, image captioning… The list goes on. I'll leave discussion of the amazing feats one can achieve with RNNs to Andrej Karpathy's excellent blog post, The Unreasonable Effectiveness of Recurrent Neural Networks. But they really are pretty amazing.

Essential to these successes is the use of “LSTMs,” a very special kind of recurrent neural network which works, for many tasks, much much better than the standard version. Almost all exciting results based on recurrent neural networks are achieved with them. It's these LSTMs that this essay will explore.

### RNN

Ordinary neural networks do not account for the persistent influence of data: inputs fed to the network earlier often affect how later inputs should be interpreted. To capture this — the fact that a `previous event affect the later ones` — RNNs add a loop to the network. The figure below shows an RNN.

RNN

An RNN is essentially a chain of copies of an ordinary neural network, each passing information on to the next, as shown below:

RNN chain structure

"LSTMs", a very special kind of recurrent neural network which works, for many tasks, much much better than the standard version.

It should be noted that while most of the abbreviations used are generally accepted, not all of them are. RNNs sometimes refer to recursive neural networks, but most of the time they refer to recurrent neural networks. That’s not the end of it though, in many places you’ll find RNN used as placeholder for any recurrent architecture, including LSTMs, GRUs and even the bidirectional variants. AEs suffer from a similar problem from time to time, where VAEs and DAEs and the like are called simply AEs. Many abbreviations also vary in the amount of “N”s to add at the end, because you could call it a convolutional neural network but also simply a convolutional network (resulting in CNN or CN).

Feed-forward network vs. recurrent network

### Type: type

## The Problem of Long-Term Dependencies


One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using previous video frames might inform the understanding of the present frame. If RNNs could do this, they'd be extremely useful. But can they? It depends.

Sometimes, we only need to look at recent information to perform the present task. For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in “the clouds are in the sky,” we don't need any further context – it's pretty obvious the next word is going to be sky. In such cases, where the gap between the relevant information and the place that it's needed is small, RNNs can learn to use the past information.


But there are also cases where we need more context. Consider trying to predict the last word in the text “I grew up in France… I speak fluent French.” Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back. It's entirely possible for the gap between the relevant information and the point where it is needed to become very large.

Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.


In theory, RNNs are absolutely capable of handling such “long-term dependencies.” A human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don't seem to be able to learn them. The problem was explored in depth by Hochreiter (1991) [German] and Bengio, et al. (1994), who found some pretty fundamental reasons why it might be difficult.

Thankfully, LSTMs don't have this problem!

### The Problem of Long-Term Dependencies

RNN models can `connect previous information to the present task,such as using previous video frames might inform the understanding of the present frame.`

How do RNNs achieve this? It depends on the situation.

Sometimes we only need to look at recent information to perform the present task. For example, consider a language model trying to predict the next word from the previous ones. To predict the last word of “the clouds are in the sky”, we don't need any further background (context) – obviously the next word will be sky. In this case, training the RNN only requires the past n pieces of information, with n small: `the gap between the relevant information and the place that it’s needed is small`.

But there are also situations that require much more context. Suppose we try to predict the last word of a long sentence: `Consider trying to predict the last word in the text “I grew up in France… I speak fluent French.”` The recent information `I speak fluent French` suggests that the next word is probably the name of a language, but to narrow it down to a specific language we need the background about France. Training the RNN then requires the past n pieces of information, with n sufficiently large: `the gap between the relevant information and the point where it is needed to become very large`.

Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.

In theory, RNNs can handle “long-term dependencies,” but in practice RNNs cannot learn such problems: when the amount of required past information n grows too large, RNNs stop working. The problem was explored in depth by Hochreiter (1991) [German] and Bengio, et al. (1994), who found some pretty fundamental reasons why it might be difficult.

**LSTMs can handle the “long-term dependencies” problem.**

Composing a complete list is practically impossible, as new architectures are invented all the time. Even if published it can still be quite challenging to find them even if you’re looking for them, or sometimes you just overlook some. So while this list may provide you with some insights into the world of AI, please, by no means take this list for being comprehensive; especially if you read this post long after it was written.

I said that RNNs are good at understanding sequences. For this purpose,
the RNN keeps track of some *internal state*. This state is what the RNN
has remembered of the sequence it has seen so far. The extra
input/output is for sending this internal state from the previous
timestep into the next timestep.

To make the iPhone play drums, we train the RNN on a sequence of MIDI
notes that represent different drum patterns. We look at just one element
from this sequence at a time — this is called a *timestep*. At each
timestep, we teach the RNN to predict the next note from the sequence.

Essentially, we’re training the RNN to remember all the drum patterns
that are in the sequence. It remembers this data in its internal state,
but also in the weights that connect the input **x** and the predicted
output **y** to this state.
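Training pairs for this kind of next-step prediction come from shifting the sequence by one position. A tiny sketch (the note values here are made up, not from the actual dataset):

```python
import numpy as np

# A toy "drum pattern": each number stands for a MIDI note index.
sequence = np.array([36, 42, 38, 42, 36, 36, 38, 42])

# At each timestep the network sees x[t] and is taught
# to predict the next element, y[t] = sequence[t + 1].
x = sequence[:-1]
y = sequence[1:]

for xt, yt in zip(x, y):
    print(f"input {xt} -> target {yt}")
```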

### Docstring:

Basic LSTM recurrent network cell.

The implementation is based on: http://arxiv.org/abs/1409.2329.

We add forget_bias (default: 1) to the biases of the forget gate in order to reduce the scale of forgetting in the beginning of the training.

It does not allow cell clipping or a projection layer, and does not use peephole connections: it is the basic baseline.

For advanced models, please use the full @{tf.nn.rnn_cell.LSTMCell} that follows.
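The effect of `forget_bias` can be seen in isolation with plain NumPy. This is not the TensorFlow implementation, just a sketch of the one term the docstring describes: with near-zero initial weights, the forget gate's pre-activation is about 0, so the bias alone decides how much the cell keeps at the start of training.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# With zero-initialized weights the pre-activation is 0, so the forget
# gate outputs sigmoid(forget_bias). A bias of 1 starts the gate near
# 0.73, i.e. the cell begins training by mostly *keeping* its state.
pre_activation = 0.0
for forget_bias in (0.0, 1.0):
    f = sigmoid(pre_activation + forget_bias)
    print(forget_bias, round(float(f), 3))   # 0.0 0.5, then 1.0 0.731
```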

## LSTM Networks


Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in following work. They work tremendously well on a large variety of problems, and are now widely used.

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.


LSTMs also have this chain like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.


Don't worry about the details of what's going on. We'll walk through the LSTM diagram step by step later. For now, let's just try to get comfortable with the notation we'll be using.


In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denote its content being copied and the copies going to different locations.

## What does an LSTM do?

For each of the architectures depicted in the picture, I wrote a *very,
very* brief description. You may find some of these useful if you're
familiar with most architectures but not with a particular one.

### Init docstring:

Initialize the basic LSTM cell.

## The Core Idea Behind LSTMs


The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.

The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It's very easy for information to just flow along it unchanged.


The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation.


The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!”

An LSTM has three of these gates, to protect and control the cell state.
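A gate as just described — a sigmoid layer followed by a pointwise multiplication — can be sketched in a few lines. The sizes and weight names here are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
h_prev = rng.standard_normal(4)          # previous hidden state
x      = rng.standard_normal(3)          # current input
W      = rng.standard_normal((7, 4)) * 0.5
b      = np.zeros(4)

# Sigmoid layer: one number in (0, 1) per component of the cell state,
# describing how much of that component to let through.
gate = sigmoid(np.concatenate([h_prev, x]) @ W + b)

cell_state = rng.standard_normal(4)
gated = gate * cell_state                # pointwise multiplication
print(gate.shape, gated.shape)
```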

### Applications

LSTM-based systems can learn tasks such as translating languages, controlling robots, image analysis, document summarization, speech recognition, image recognition, handwriting recognition, driving chatbots, predicting diseases, click-through rates, and stocks, and synthesizing music, among others.

The weights in an RNN

### Args:

**num_units**: int, The number of units in the LSTM cell.

**forget_bias**: float, The bias added to forget gates (see above).

**state_is_tuple**: If True, accepted and returned states are 2-tuples of the `c_state` and `m_state`. If False, they are concatenated along the column axis. The latter behavior will soon be deprecated.

**activation**: Activation function of the inner states. Default: `tanh`.

**reuse**: (optional) Python boolean describing whether to reuse variables in an existing scope. If not `True`, and the existing scope already has the given variables, an error is raised.

## Step-by-Step LSTM Walk Through


The first step in our LSTM is to decide what information we're going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer.” It looks at (h_{t-1}) and (x_t), and outputs a number between 0 and 1 for each number in the cell state (C_{t-1}). A 1 represents “completely keep this” while a 0 represents “completely get rid of this.”

Let's go back to our example of a language model trying to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.


The next step is to decide what new information we're going to store in the cell state. This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we'll update. Next, a tanh layer creates a vector of new candidate values, (\tilde{C}_t), that could be added to the state. In the next step, we'll combine these two to create an update to the state.

In the example of our language model, we'd want to add the gender of the new subject to the cell state, to replace the old one we're forgetting.


It's now time to update the old cell state, (C_{t-1}), into the new cell state (C_t). The previous steps already decided what to do, we just need to actually do it.

We multiply the old state by (f_t), forgetting the things we decided to forget earlier. Then we add (i_t * \tilde{C}_t). This is the new candidate values, scaled by how much we decided to update each state value.

In the case of the language model, this is where we'd actually drop the information about the old subject's gender and add the new information, as we decided in the previous steps.


Finally, we need to decide what we're going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we're going to output. Then, we put the cell state through (\tanh) (to push the values to be between (-1) and (1)) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.

For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that's what is coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that's what follows next.
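The four steps above (forget gate, input gate, cell-state update, output gate) fit in a short NumPy function. This is a sketch of the equations, not a production implementation; layer sizes and the stacked-weight layout are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, W, b):
    """One LSTM timestep. W maps [h_prev, x] to four stacked layers:
    forget gate f, input gate i, candidate C~, and output gate o."""
    n = h_prev.shape[0]
    z = np.concatenate([h_prev, x]) @ W + b
    f = sigmoid(z[0*n:1*n])          # what to throw away from C_prev
    i = sigmoid(z[1*n:2*n])          # which values to update
    C_tilde = np.tanh(z[2*n:3*n])    # new candidate values
    o = sigmoid(z[3*n:4*n])          # what parts of the state to output
    C = f * C_prev + i * C_tilde     # update the cell state
    h = o * np.tanh(C)               # filtered output
    return h, C

rng = np.random.default_rng(2)
hidden, inputs = 5, 3
W = rng.standard_normal((hidden + inputs, 4 * hidden)) * 0.1
b = np.zeros(4 * hidden)

h = np.zeros(hidden)
C = np.zeros(hidden)
for t in range(4):
    h, C = lstm_step(rng.standard_normal(inputs), h, C, W, b)
print(h.shape, C.shape)
```

Note how the cell state `C` is only touched by pointwise operations (the conveyor belt), while the four learned layers all read from `[h_prev, x]`.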

### Where LSTMs stand today

In 2015, Google greatly improved speech recognition on Android phones and other devices using a CTC-trained LSTM, applying a method published by Jürgen Schmidhuber's lab in 2006. Baidu also uses CTC; Apple's iPhone uses LSTMs in QuickType and Siri; Microsoft uses LSTMs not only for speech recognition but also for generating virtual conversational avatars and writing program code, among other tasks. Amazon Alexa talks with you at home through bidirectional LSTMs, and Google applies LSTMs even more broadly: generating image captions, auto-replying to email, powering the new smart assistant Allo, and markedly improving the quality of Google Translate (starting in 2016). A significant fraction of the compute in Google's data centers now runs LSTM workloads.

Of course, we don’t want the RNN to just *remember* existing drum
patterns — we want it to come up with new drums on its own.

To do that, we will mess a little with the RNN’s memory: we reset the
internal state by filling it up with random numbers — but we don’t
change the weights. From then on the model will no longer correctly
predict the next note in the sequence because we “erased” its memory of
where it was.

Now when we ask the RNN to “predict” the next notes in the sequence, it
will come up with new, *original* drum patterns. These are still based
on its knowledge of what “good drums” are (because we did not erase the
learned weights), but they are no longer verbatim replications of the
training patterns.
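The generation trick — keep the trained weights, randomize the state, then keep predicting — can be sketched as follows. Here random weights stand in for the learned ones, so the output pattern is meaningless; in the real app the weights come from training:

```python
import numpy as np

rng = np.random.default_rng(3)
num_notes, hidden = 10, 16

# Stand-ins for *learned* weights: in the real app these come from
# training on MIDI drum patterns and are left untouched here.
Wxh = rng.standard_normal((num_notes, hidden)) * 0.1
Whh = rng.standard_normal((hidden, hidden)) * 0.1
Why = rng.standard_normal((hidden, num_notes)) * 0.1

h = rng.standard_normal(hidden)      # "erase" the memory: random state
note = 0
pattern = []
for t in range(8):
    x = np.zeros(num_notes)
    x[note] = 1.0                    # one-hot encoding of the previous note
    h = np.tanh(x @ Wxh + h @ Whh)   # advance the internal state
    note = int(np.argmax(h @ Why))   # "predict" the next note
    pattern.append(note)
print(pattern)
```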

The data

I mentioned we’re training on drum patterns. The dataset I used consists
of a large number of MIDI files. When you open such a MIDI file in
GarageBand or Logic Pro it looks like this:

## 2. tf.nn.dynamic_rnn(cell, inputs, sequence_length=None, initial_state=None, dtype=None, parallel_iterations=None, swap_memory=False, time_major=False, scope=None)

## Variants on Long Short Term Memory


What I've described so far is a pretty normal LSTM. But not all LSTMs are the same as the above. In fact, it seems like almost every paper involving LSTMs uses a slightly different version. The differences are minor, but it's worth mentioning some of them.

One popular LSTM variant, introduced by Gers & Schmidhuber (2000), is adding “peephole connections.” This means that we let the gate layers look at the cell state.


The above diagram adds peepholes to all the gates, but many papers will give some peepholes and not others.

Another variation is to use coupled forget and input gates. Instead of separately deciding what to forget and what we should add new information to, we make those decisions together. We only forget when we're going to input something in its place. We only input new values to the state when we forget something older.


A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho, et al. (2014). It combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular.


These are only a few of the most notable LSTM variants. There are lots of others, like Depth Gated RNNs by Yao, et al. (2015). There are also some completely different approaches to tackling long-term dependencies, like Clockwork RNNs by Koutnik, et al. (2014).

Which of these variants is best? Do the differences matter? Greff, et al. (2015) do a nice comparison of popular variants, finding that they're all about the same. Jozefowicz, et al. (2015) tested more than ten thousand RNN architectures, finding some that worked better than LSTMs on certain tasks.
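The GRU's merged gating can be sketched the same way as the LSTM step earlier; a single update gate interpolates between the old and candidate states, so there is no separate cell state. Weight names and sizes are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wz, Wr, Wh):
    """One GRU timestep: a single update gate z replaces the LSTM's
    separate forget/input gates, and there is no separate cell state."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(hx @ Wz)                          # update gate
    r = sigmoid(hx @ Wr)                          # reset gate
    h_tilde = np.tanh(np.concatenate([r * h_prev, x]) @ Wh)
    return (1 - z) * h_prev + z * h_tilde         # interpolate old vs. new

rng = np.random.default_rng(4)
hidden, inputs = 6, 2
Wz = rng.standard_normal((hidden + inputs, hidden)) * 0.1
Wr = rng.standard_normal((hidden + inputs, hidden)) * 0.1
Wh = rng.standard_normal((hidden + inputs, hidden)) * 0.1

h = np.zeros(hidden)
for t in range(3):
    h = gru_step(rng.standard_normal(inputs), h, Wz, Wr, Wh)
print(h.shape)
```

The coupling is visible in the last line of `gru_step`: the network only writes new information (`z * h_tilde`) in exact proportion to what it forgets (`(1 - z) * h_prev`).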

## How does an LSTM work?

Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997).

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.

The standard RNN thus has a chain of repeating neural network modules; each module can be very simple, for example containing only a single tanh layer, as shown below:

LSTM

The module structure can also be much more complex, as shown below:


Next we will walk through each part of the LSTM diagram. Before doing so, we first need to understand what each shape and symbol in the diagram means.

Diagram notation

In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denote its content being copied and the copies going to different locations.

**Feed forward neural networks (FF or FFNN) and perceptrons (P)** are
very straight forward, they feed information from the front to the back
(input and output, respectively). Neural networks are often described as
having layers, where each layer consists of either input, hidden or
output cells in parallel. A layer alone never has connections and in
general two adjacent layers are fully connected (every neuron from one
layer to every neuron in another layer). The simplest somewhat practical
network has two input cells and one output cell, which can be used to
model logic gates. One usually trains FFNNs through back-propagation,
giving the network paired datasets of “what goes in” and “what we want
to have coming out”. This is called supervised learning, as opposed to
unsupervised learning where we only give it input and let the network
fill in the blanks. The error being back-propagated is often some
variation of the difference between the input and the output (like MSE
or just the linear difference). Given that the network has enough hidden
neurons, it can theoretically always model the relationship between the
input and output. Practically their use is a lot more limited but they
are popularly combined with other networks to form new networks.

### Type: function

## Conclusion


Earlier, I mentioned the remarkable results people are achieving with RNNs. Essentially all of these are achieved using LSTMs. They really work a lot better for most tasks!

Written down as a set of equations, LSTMs look pretty intimidating. Hopefully, walking through them step by step in this essay has made them a bit more approachable.

LSTMs were a big step in what we can accomplish with RNNs. It's natural to wonder: is there another big step? A common opinion among researchers is: “Yes! There is a next step and it's attention!” The idea is to let every step of an RNN pick information to look at from some larger collection of information. For example, if you are using an RNN to create a caption describing an image, it might pick a part of the image to look at for every word it outputs. In fact, Xu, et al. (2015) do exactly this – it might be a fun starting point if you want to explore attention! There's been a number of really exciting results using attention, and it seems like a lot more are around the corner…

Attention isn't the only exciting thread in RNN research. For example, Grid LSTMs by Kalchbrenner, et al. (2015) seem extremely promising. Work using RNNs in generative models – such as Gregor, et al. (2015), Chung, et al. (2015), or Bayer & Osendorfer (2015) – also seems very interesting. The last few years have been an exciting time for recurrent neural networks, and the coming ones promise to only be more so!

#### The Core Idea Behind LSTMs

The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.

The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged.

The LSTM can remove or add information to the cell state, carefully regulated by structures called gates.

Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation.

forget gate layer

The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!”

An LSTM has three of these gates, to protect and control the cell state.

*Rosenblatt, Frank. “The perceptron: a probabilistic model for
information storage and organization in the brain.” Psychological review
65.6 (1958): 386.*


A MIDI pattern in GarageBand

### Docstring:

Creates a recurrent neural network specified by RNNCell `cell`.

Performs fully dynamic unrolling of `inputs`.

`Inputs` may be a single `Tensor` where the maximum time is either the first or second dimension (see the parameter `time_major`). Alternatively, it may be a (possibly nested) tuple of Tensors, each of them having matching batch and time dimensions. The corresponding output is either a single `Tensor` having the same number of time steps and batch size, or a (possibly nested) tuple of such tensors, matching the nested structure of `cell.output_size`.

The parameter `sequence_length` is optional and is used to copy-through state and zero-out outputs when past a batch element's sequence length. So it's more for correctness than performance.

Args:

**cell**: An instance of RNNCell.

**inputs**: The RNN inputs.

```
If `time_major == False` (default), this must be a `Tensor` of shape:
`[batch_size, max_time, ...]`, or a nested tuple of such
elements.
If `time_major == True`, this must be a `Tensor` of shape:
`[max_time, batch_size, ...]`, or a nested tuple of such
elements.
This may also be a (possibly nested) tuple of Tensors satisfying this property. The first two dimensions must match across all the inputs, but otherwise the ranks and other shape components may differ. In this case, input to `cell` at each time-step will replicate the structure of these tuples, except for the time dimension (from which the time is taken).
```

The input to `cell` at each time step will be a `Tensor` or (possibly nested) tuple of Tensors each with dimensions `[batch_size, ...]`.

**sequence_length**: (optional) An int32/int64 vector sized `[batch_size]`.

**initial_state**: (optional) An initial state for the RNN. If `cell.state_size` is an integer, this must be a `Tensor` of appropriate type and shape `[batch_size, cell.state_size]`. If `cell.state_size` is a tuple, this should be a tuple of tensors having shapes `[batch_size, s] for s in cell.state_size`.

**dtype**: (optional) The data type for the initial state and expected
output. Required if initial_state is not provided or RNN state has a
heterogeneous dtype.

**parallel_iterations**: (Default: 32). The number of iterations to run
in parallel. Those operations which do not have any temporal dependency
and can be run in parallel, will be. This parameter trades off time for
space. Values >> 1 use more memory but take less time, while
smaller values use less memory but computations take longer.

**swap_memory**: Transparently swap the tensors produced in forward
inference but needed for back prop from GPU to CPU. This allows training
RNNs which would typically not fit on a single GPU, with very minimal
(or no) performance penalty.

**time_major**: The shape format of the `inputs` and `outputs` Tensors. If true, these `Tensors` must be shaped `[max_time, batch_size, depth]`. If false, these `Tensors` must be shaped `[batch_size, max_time, depth]`. Using `time_major = True` is a bit more efficient because it avoids transposes at the beginning and end of the RNN calculation. However, most TensorFlow data is batch-major, so by default this function accepts input and emits output in batch-major form.

**scope**: VariableScope for the created subgraph; defaults to "rnn".

## Acknowledgments

I'm grateful to a number of people for helping me better understand LSTMs, commenting on the visualizations, and providing feedback on this post.

I'm very grateful to my colleagues at Google for their helpful feedback, especially Oriol Vinyals, Greg Corrado, Jon Shlens, Luke Vilnis, and Ilya Sutskever. I'm also thankful to many other friends and colleagues for taking the time to help me, including Dario Amodei and Jacob Steinhardt. I'm especially thankful to Kyunghyun Cho for extremely thoughtful correspondence about my diagrams.

Before this post, I practiced explaining LSTMs during two seminar series I taught on neural networks. Thanks to everyone who participated in those for their patience with me, and for their feedback.

- In addition to the original authors, a lot of people contributed to the modern LSTM. A non-comprehensive list is: Felix Gers, Fred Cummins, Santiago Fernandez, Justin Bayer, Daan Wierstra, Julian Togelius, Faustino Gomez, Matteo Gagliolo, and Alex Graves.↩

#### Step-by-Step LSTM Walk Through

The first step is to decide what information to throw away from the cell state. This decision is made by a sigmoid layer called the "forget gate layer". Looking at h_{t-1} and x_{t}, it outputs a number between 0 and 1 for each value in the cell state C_{t-1}. An output of 1 means "completely keep this part of the cell state", while 0 means "completely get rid of it". For example, suppose we use a language model to predict the next word based on all the previous context. In such a problem, the cell state might include the gender of the current subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.

The figure below shows the "forget gate layer":

[figure: forget gate layer]

The next step is to decide what new information to store in the cell state. This has two parts. First, a sigmoid layer called the "input gate layer" decides which values we'll update. Next, a tanh layer creates a vector of new candidate values, C̃_{t}, that could be added to the state. In the following step we combine these two to produce the update to the cell state.

In our language model example, we'd want to add the gender of the new subject to the cell state, replacing the old gender information we're throwing away.

The figure below shows the "input gate layer" and the tanh layer:

[figure: input gate layer and tanh layer]

It's now time to update the old cell state C_{t-1} into the new cell state C_{t}. The previous steps already decided what to do; we just need to actually do it.

We multiply the old state C_{t-1} by f_{t}, forgetting the things we decided to forget earlier. Then we add i_{t} * C̃_{t}. These are the new candidate values, scaled by how much we decided to update each state value.

In the case of the language model, this is where we actually drop the information about the old subject's gender and add the new information, as shown below:

[figure: updating the cell state]

Finally, we need to decide what we're going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides which parts of the cell state we're going to output. Then we put the cell state through tanh (pushing the values to be between -1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.

For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that's what comes next. For example, it might output whether the subject is singular or plural, so that we know what form the verb should take if that's what follows. The sigmoid layer thus filters the cell state, and the output h_{t} is derived from this filtered version.

This process is illustrated below:

[figure: computing the output]
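The walk-through above corresponds to the standard LSTM equations. Written out (with \(\sigma\) the sigmoid function and \(*\) element-wise multiplication):

```latex
f_t         = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)   % forget gate
i_t         = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)   % input gate
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)    % candidate values
C_t         = f_t * C_{t-1} + i_t * \tilde{C}_t        % cell state update
o_t         = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)   % output gate
h_t         = o_t * \tanh(C_t)                         % new hidden state
```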

The green bars represent the notes that are being played. The note C1 is
a kick drum, D1 is a snare drum, G#1 is a hi-hat, and so on. The drum
patterns in the dataset are all 1 measure (or 4 beats) long.

In a MIDI file the notes are stored as a series of *events*:

NOTE ON time: 0 channel: 0 note: 36 velocity: 80
NOTE ON time: 0 channel: 0 note: 46 velocity: 80
NOTE OFF time: 120 channel: 0 note: 36 velocity: 64
NOTE OFF time: 0 channel: 0 note: 46 velocity: 64
NOTE ON time: 120 channel: 0 note: 44 velocity: 80
NOTE OFF time: 120 channel: 0 note: 44 velocity: 64
NOTE ON time: 120 channel: 0 note: 38 velocity: 80
NOTE OFF time: 120 channel: 0 note: 38 velocity: 64
NOTE ON time: 0 channel: 0 note: 44 velocity: 80
. . . and so on . . .

### Returns:

A pair (outputs, state) where:

```
outputs: The RNN output `Tensor`.
If time_major == False (default), this will be a `Tensor` shaped:
`[batch_size, max_time, cell.output_size]`.
If time_major == True, this will be a `Tensor` shaped:
`[max_time, batch_size, cell.output_size]`.
Note, if `cell.output_size` is a (possibly nested) tuple of integers
or `TensorShape` objects, then `outputs` will be a tuple having the
same structure as `cell.output_size`, containing Tensors having shapes
corresponding to the shape data in `cell.output_size`.
state: The final state. If `cell.state_size` is an int, this
will be shaped `[batch_size, cell.state_size]`. If it is a
`TensorShape`, this will be shaped `[batch_size] + cell.state_size`.
If it is a (possibly nested) tuple of ints or `TensorShape`, this will
be a tuple having the corresponding shapes.
```
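As a rough illustration of the `sequence_length` semantics described above (copy-through state, zero-out outputs), here is a NumPy sketch; `dynamic_rnn_sketch` and `step_fn` are hypothetical stand-ins, not the TensorFlow implementation:

```python
import numpy as np

def dynamic_rnn_sketch(step_fn, inputs, sequence_length, state):
    """Mimic the sequence_length behavior described above: past a batch
    element's length, outputs are zeroed and the state is copied through.
    inputs has shape [batch_size, max_time, depth] (time_major == False).
    step_fn(x, state) -> (output, new_state) is a stand-in for the RNN cell."""
    batch_size, max_time, _ = inputs.shape
    outputs = []
    for t in range(max_time):
        out, new_state = step_fn(inputs[:, t, :], state)
        active = (t < sequence_length)[:, None]     # which rows are still valid
        out = np.where(active, out, 0.0)            # zero-out finished rows
        state = np.where(active, new_state, state)  # copy-through old state
        outputs.append(out)
    return np.stack(outputs, axis=1), state

# Toy "cell": accumulate the inputs into the state, emit the running sum.
step = lambda x, s: (x + s, x + s)
x = np.ones((2, 4, 3))  # two sequences, four timesteps, three features
out, final = dynamic_rnn_sketch(step, x, np.array([4, 2]), np.zeros((2, 3)))
# Row 1 has length 2: its outputs at t = 2, 3 are zero, and its final
# state keeps the value it had at t = 1.
```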

#### Variants on Long Short Term Memory

One popular LSTM variant, introduced by Gers & Schmidhuber (2000), is adding “peephole connections.” This means that we let the gate layers look at the cell state.

[figure: peephole connections]

Another variation is to use coupled forget and input gates. Instead of separately deciding what to forget and what we should add new information to, we make those decisions together. We only forget when we're going to input something in its place. We only input new values to the state when we forget something older.

[figure: coupled forget and input gates]

A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho, et al. (2014). It combines the forget and input gates into a single "update gate." It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular.

[figure: GRU]

These are only a few of the most notable LSTM variants. There are lots of others, like Depth Gated RNNs by Yao, et al. (2015). There are also some completely different approaches to tackling long-term dependencies, like Clockwork RNNs by Koutnik, et al. (2014).

To begin playing a note there is a NOTE ON event, to stop playing there
is a NOTE OFF event. The duration of the note is determined by the
amount of time between NOTE ON and NOTE OFF. For us, the duration of the
notes isn’t really important because drum sounds are short — they aren’t
sustained like a flute or violin. All we care about is the NOTE ON
events, which tell us when new drum sounds begin.

Each NOTE ON event includes a few different bits of data, but for our
purposes we only need to know the *timestamp* and the *note number*.

The note number is an integer that represents the drum sound. For
example, 36 is the number for note C in octave 1, which is the kick
drum. (The General MIDI
standard
defines which note number is mapped to which percussion instrument.)

The timestamp for an event is a “delta” time, which means it is the
number of *ticks* we should wait before processing this event. For the
MIDI files in our dataset, there are 480 ticks per beat. So if we play
the drums at 120 beats-per-minute, then one second has 960 ticks in it.
This is not really important to remember; just know that for each note
in the drum pattern there’s also a delay measured in ticks.
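As a quick sanity check of the arithmetic above (the function name here is just for illustration):

```python
TICKS_PER_BEAT = 480  # resolution of the MIDI files in this dataset

def ticks_per_second(bpm):
    """How many ticks elapse per second of audio at the given tempo."""
    beats_per_second = bpm / 60.0
    return beats_per_second * TICKS_PER_BEAT

# At 120 beats-per-minute, one second contains 960 ticks.
print(ticks_per_second(120))  # → 960.0
```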

Our input sequence to the RNN then has the following form:

(note, ticks) (note, ticks) (note, ticks) . . .

### Raises:

**TypeError**: If `cell` is not an instance of RNNCell.

**ValueError**: If inputs is None or an empty list.

#### Future directions

Directions for LSTM research going forward:

- Attention: Xu, et al. (2015)
- Grid LSTMs: Kalchbrenner, et al. (2015)
- RNNs in generative models: Gregor, et al. (2015), Chung, et al. (2015), Bayer & Osendorfer (2015)

**Radial basis function (RBF)** networks are FFNNs with radial basis
functions as activation functions. There’s nothing more to it. Doesn’t
mean they don’t have their uses, but most FFNNs with other activation
functions don’t get their own name. This mostly has to do with inventing
them at the right time.

At every timestep we insert a (note, ticks) pair into the RNN and it will try to predict the next (note, ticks) pair from the same sequence. For the example above, the sequence is:

(36, 0) (46, 0) (44, 240) (38, 240) (44, 0) . . .

## References

- Understanding LSTM Networks - colah's blog

*Broomhead, David S., and David Lowe. Radial basis functions,
multi-variable functional interpolation and adaptive networks. No.
RSRE-MEMO-4148. ROYAL SIGNALS AND RADAR ESTABLISHMENT MALVERN (UNITED
KINGDOM), 1988.*

Original Paper PDF

That’s a kick drum (36) and an open hi-hat (46) on the first beat,
followed by a pedal hi-hat (44) after 240 ticks, followed by a snare
drum (38) and a pedal hi-hat (44) after another 240 ticks, and so on.

The dataset I used for training has 2700 of these MIDI files. I glued
them together into one big sequence of 52260 (note, ticks)

pairs. Just think of this sequence as a ginormous drum
solo.
This is the sequence we’ll try to make the RNN remember.

**Note:** This dataset of drum patterns comes from a commercial drum kit
plug-in for use in audio production tools such as Logic Pro. I was
looking for a fun dataset for training an RNN when I realized I had a
large library of drum patterns in MIDI format sitting in a folder on my
computer… and so the RNN drummer was born. Unfortunately, it also means
this dataset is copyrighted and I can’t distribute it with the GitHub
project. If you want to train the RNN yourself, you’ll need to find your
own collection of drum patterns in MIDI format — I can’t give you mine.

One-hot encoding

You’ve seen that the MIDI note numbers are regular integers. We’ll be
using the note numbers between 35 and 60, which is the range reserved in
the General MIDI standard for percussion instruments.

The ticks are also integers, between 0 and 1920. (That’s how many ticks
go into one measure and each MIDI file in the dataset is only one
measure long.)

However, we can’t just feed integers into our neural network. In machine
learning when you encode something using an integer (or a floating-point
value), you imply there is an order to it: the number 55 is bigger than
the number 36.

But this is not true for our MIDI notes: the drum sound represented by
MIDI note number 55 is not “bigger” than the drum sound with number 36.
These numbers represent completely different things — one is a kick
drum, the other a cymbal.

Instead of truly being numbers on some continuous scale, our MIDI notes
are examples of what’s called *categorical variables*. It’s better to
encode that kind of data using **one-hot encoding** rather than integers
(or floats).

For the sake of giving an example, let’s say that our entire dataset
only uses five unique note numbers:

36 kick drum
38 snare drum
42 closed hi-hat
48 tom
55 cymbal

We can then encode any given note number using a 5-element vector. Each
index in this vector corresponds to one of those five drum sounds. A
kick drum (note 36) would be encoded as:

[ 1, 0, 0, 0, 0 ]

A **Hopfield network (HN)** is a network where every neuron is connected
to every other neuron; it is a completely entangled plate of spaghetti
as even all the nodes function as everything. Each node is input before
training, then hidden during training and output afterwards. The
networks are trained by setting the value of the neurons to the desired
pattern after which the weights can be computed. The weights do not
change after this. Once trained for one or more patterns, the network
will always converge to one of the learned patterns because the network
is only stable in those states. Note that it does not always conform to
the desired state (it’s not a magic black box sadly). It stabilises in
part due to the total “energy” or “temperature” of the network being
reduced incrementally during training. Each neuron has an activation
threshold which scales to this temperature, which if surpassed by
summing the input causes the neuron to take the form of one of two
states (usually -1 or 1, sometimes 0 or 1). Updating the network can be
done synchronously or more commonly one by one. If updated one by one, a
fair random sequence is created to organise which cells update in what
order (fair random being all options (n) occurring exactly once every n
items). This is so you can tell when the network is stable (done
converging): once every cell has been updated and none of them changed,
the network is stable (annealed). These networks are often called
associative memory because they converge to the most similar state as
the input; if humans see half a table we can imagine the other half, and
this network will converge to a table if presented with half noise and
half a table.

while a snare drum would be encoded as:

[ 0, 1, 0, 0, 0 ]

*Hopfield, John J. “Neural networks and physical systems with emergent
collective computational abilities.” Proceedings of the national academy
of sciences 79.8 (1982): 2554-2558.*

Original Paper
PDF

and so on… It’s called “one-hot” because the vector is all zeros except
for a one at the index that represents the thing you’re encoding. Now
all these vectors have the same “length” and there is no longer an
ordering relationship between them.
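The encoding can be sketched in a few lines of NumPy (the helper names here are illustrative, not from the actual conversion script):

```python
import numpy as np

# The five-note toy vocabulary from the example above.
notes = [36, 38, 42, 48, 55]
note_to_index = {n: i for i, n in enumerate(notes)}

def one_hot(note):
    """All zeros except for a one at the index representing this note."""
    v = np.zeros(len(notes))
    v[note_to_index[note]] = 1.0
    return v

print(one_hot(36))  # kick drum  → [1. 0. 0. 0. 0.]
print(one_hot(38))  # snare drum → [0. 1. 0. 0. 0.]
```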

We do the same thing for the ticks, and then combine these two one-hot
encoded vectors into one big vector called **x**:

**Markov chains (MC or discrete time Markov Chain, DTMC)** are kind of
the predecessors to BMs and HNs. They can be understood as follows: from
this node where I am now, what are the odds of me going to any of my
neighbouring nodes? They are memoryless (i.e. Markov Property) which
means that every state you end up in depends completely on the previous
state. While not really a neural network, they do resemble neural
networks and form the theoretical basis for BMs and HNs. MC aren’t
always considered neural networks, as goes for BMs, RBMs and HNs. Markov
chains aren’t always fully connected either.
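To make the idea concrete, a Markov chain is nothing more than a table of transition odds (a toy two-state sketch, not tied to any of the networks above):

```python
import numpy as np

# Row i gives the odds of moving from state i to every state; each row
# sums to 1. The next state depends only on the current one (the Markov
# property described above).
P = np.array([[0.9, 0.1],   # from state 0: mostly stay put
              [0.5, 0.5]])  # from state 1: fifty-fifty

rng = np.random.default_rng(0)
state = 0
for _ in range(10):
    state = rng.choice(2, p=P[state])  # sample the next state
```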

One-hot encoded input vector

*Hayes, Brian. “First links in the Markov chain.” American Scientist
101.2 (2013): 252.*

Original Paper
PDF

In the full dataset there are 17 unique note numbers and 209 unique tick
values, so this vector consists of 226 elements. (Of those elements, 224
are 0 and two are 1.)

The sequence that we present to the RNN does not really consist of (note, ticks) pairs but is a list of these one-hot encoded vectors:

[ 0, 0, 1, 0, 0, 0, ..., 0 ]
[ 1, 0, 0, 0, 0, 0, ..., 0 ]
[ 0, 0, 0, 1, 0, 0, ..., 1 ]
. . . and so on . . .

Because there are 52260 notes in the dataset, the entire training
sequence is made up of 52260 of those 226-element vectors.

The script
convert_midi.py
reads the MIDI files from the dataset and outputs a new file **X.npy**
that contains this 52260×226 matrix with the full training sequence.
(The script also saves two lookup tables that tell us which note numbers
and tick values correspond to the positions in the one-hot vectors.)

**Note:** You may be wondering why we’re one-hot encoding the ticks too
as these are numerical variables and not categorical. A timespan of 200
ticks definitely means that it’s twice as long as 100 ticks. Fair
question. I figured I would keep things simple and encode the note
numbers and ticks in the same way. This is not necessarily the most
efficient way to encode the durations of the notes but it’s good enough
for this blog post.

Long Short-Term Memory (huh?!)

The kind of recurrent neural network we’re using is something called an
LSTM or Long Short-Term Memory. It looks like this on the inside:

**Boltzmann machines (BM)** are a lot like HNs, but: some neurons are
marked as input neurons and others remain “hidden”. The input neurons
become output neurons at the end of a full network update. It starts
with random weights and learns through back-propagation, or more
recently through contrastive divergence (a Markov chain is used to
determine the gradients between two informational gains). Compared to a
HN, the neurons mostly have binary activation patterns. As hinted by
being trained by MCs, BMs are stochastic networks. The training and
running process of a BM is fairly similar to a HN: one sets the input
neurons to certain clamped values after which the network is set free
(it doesn’t get a sock). While free the cells can get any value and we
repetitively go back and forth between the input and hidden neurons. The
activation is controlled by a global temperature value, which if lowered
lowers the energy of the cells. This lower energy causes their
activation patterns to stabilise. The network reaches an equilibrium
given the right temperature.

*Hinton, Geoffrey E., and Terrence J. Sejnowski. “Learning and relearning
in Boltzmann machines.” Parallel distributed processing: Explorations in
the microstructure of cognition 1 (1986): 282-317.*

Original Paper
PDF

The gates inside an LSTM cell

The vector **x** is a single input that we feed into the network. It’s
one of those 226-element vectors from the training sequence that
combines the note number and the delay in ticks for a single drum
sound.

The output **y** is the prediction that is computed by the LSTM. This is
also a 226-element vector but this time it contains a probability
distribution over the possible note numbers and tick values. The goal of
training the LSTM is to get an output **y** that is (mostly) equal to
the next element from the training sequence.

Recall that a recurrent network has “internal state” that acts as its
memory. The internal state of the LSTM is given by two vectors: **c**
and **h**. The **c** vector helps the LSTM to remember the sequence of
MIDI notes it has seen so far, and **h** is used to predict the next
notes in the sequence.

At every time step we compute new values for **c** and **h**, and then
feed these back into the network so they are used as inputs for the next
timestep.

The most interesting feature of the LSTM is that it has *gates* that can
be either 0 (closed) or 1 (open). The gates determine how data flows
through the LSTM layer.

The gates perform different jobs:

The “input” gate **i** determines whether the input **x** is added to
the memory vector **c**. If this gate is closed, the input is basically
ignored.

The **g** gate determines how much of input **x** gets added to **c** if
the input gate is open.

The “output” gate **o** determines what gets put into the new value of
**h**.

The “forget” gate **f** is used to reset parts of the memory **c**.

**Restricted Boltzmann machines (RBM)** are remarkably similar to BMs
(surprise) and therefore also similar to HNs. The biggest difference
between BMs and RBMs is that RBMs are more usable because they are
more restricted. They don’t trigger-happily connect every neuron to
every other neuron but only connect every different group of neurons to
every other group, so no input neurons are directly connected to other
input neurons and no hidden to hidden connections are made either. RBMs
can be trained like FFNNs with a twist: instead of passing data forward
and then back-propagating, you forward pass the data and then backward
pass the data (back to the first layer). After that you train with
forward-and-back-propagation.

The inputs **x** and **h** are connected to these gates using weights —
**Wxf**, **Whf**, etc. When we train the LSTM, what it learns are the
values of those weights. (It does not learn the values of **h** or
**c**.)

Thanks to this mechanism with the gates, the LSTM can remember things
over the long term, and it can even choose to forget things it no longer
considers important.

Confused how this works? It doesn’t matter. Exactly how or why these
gates work the way they do isn’t very important for this blog post. (If
you really want to know,read the
paper.)
Just know this particular scheme has proven to work very well for
remembering long sequences.

Our job is to make the network learn the optimal values for the weights
between **x** and **h** and these gates, and for the weights between
**h** and **y**.

The math

To implement an LSTM any sane person would use a tool such as
Keras, which lets you simply write `layer = LSTM()`. However, we are
going to do it the hard way, using primitive TensorFlow operations.

The reason for doing it the hard way is that we’re going to have to
implement this math ourselves in the iOS app, so it’s useful to
understand the formulas that are being used.

The formulas needed to implement the inner logic of the LSTM layer look
like this:

f = tf.sigmoid(tf.matmul(x[t], Wxf) + tf.matmul(h[t - 1], Whf) + bf)
i = tf.sigmoid(tf.matmul(x[t], Wxi) + tf.matmul(h[t - 1], Whi) + bi)
o = tf.sigmoid(tf.matmul(x[t], Wxo) + tf.matmul(h[t - 1], Who) + bo)
g = tf.tanh(tf.matmul(x[t], Wxg) + tf.matmul(h[t - 1], Whg) + bg)

*Smolensky, Paul. Information processing in dynamical systems:
Foundations of harmony theory. No. CU-CS-321-86. COLORADO UNIV AT
BOULDER DEPT OF COMPUTER SCIENCE, 1986.*

Original Paper
PDF

What goes on here is less intimidating than it first appears. Let’s look
at the line for the **f** gate in detail:

f = tf.sigmoid(
    tf.matmul(x[t], Wxf) +      # 1
    tf.matmul(h[t - 1], Whf) +  # 2
    bf                          # 3
)

This computes whether the **f** gate is open (1) or closed (0).
Step-by-step this is what it does:

First multiply the input **x** for the current timestep with the matrix
**Wxf**. This matrix contains the weights of the connections between
**x** and **f**.

Also multiply the input **h** with the weights matrix **Whf**. In these
formulas, `t` is the index of the timestep. Because **h** feeds back
into the network we use the value of **h** from the previous timestep,
given by `h[t - 1]`.

**Autoencoders (AE)** are somewhat similar to FFNNs as AEs are more like
a different use of FFNNs than a fundamentally different architecture.
The basic idea behind autoencoders is to encode information (as in
compress, not encrypt) automatically, hence the name. The entire network
always resembles an hourglass-like shape, with smaller hidden layers
than the input and output layers. AEs are also always symmetrical around
the middle layer(s) (one or two depending on an even or odd amount of
layers). The smallest layers are almost always in the middle, the
place where the information is most compressed (the chokepoint of the
network). Everything up to the middle is called the encoding part,
everything after the middle the decoding and the middle (surprise) the
code. One can train them using backpropagation by feeding input and
setting the error to be the difference between the input and what came
out. AEs can be built symmetrically when it comes to weights as well, so
the encoding weights are the same as the decoding weights.

Add a bias value **bf**.

*Bourlard, Hervé, and Yves Kamp. “Auto-association by multilayer
perceptrons and singular value decomposition.” Biological cybernetics
59.4-5 (1988): 291-294.*

Original Paper
PDF

Finally, take the *logistic sigmoid* of the whole thing. The sigmoid
function squashes the result to a value between 0 and 1.

The same thing happens for the other gates, except that for **g** we use
a hyperbolic tangent function to get a number between -1 and 1 (instead
of 0 and 1). Each gate has its own set of weight matrices and bias
values.

Once we know which gates are open and which are closed, we can compute
the new values of the internal state **c** and **h**:

c[t] = f * c[t - 1] + i * g
h[t] = o * tf.tanh(c[t])

We put the new values of **c** and **h** into `c[t]` and `h[t]`, so
that these will be used as the inputs for the next timestep.

Now that we know the new value for the state vector **h**, we can use
this to predict the output **y** for this timestep:

y = tf.matmul(h[t], Why) + by

**Sparse autoencoders (SAE)** are in a way the opposite of AEs. Instead
of teaching a network to represent a bunch of information in less
“space” or nodes, we try to encode information in more space. So instead
of the network converging in the middle and then expanding back to the
input size, we blow up the middle. These types of networks can be used
to extract many small features from a dataset. If one were to train a
SAE the same way as an AE, you would in almost all cases end up with a
pretty useless identity network (as in what comes in is what comes out,
without any transformation or decomposition). To prevent this, instead
of feeding back the input, we feed back the input plus a sparsity
driver. This sparsity driver can take the form of a threshold filter,
where only a certain error is passed back and trained, the other error
will be “irrelevant” for that pass and set to zero. In a way this
resembles spiking neural networks, where not all neurons fire all the
time (and points are scored for biological plausibility).

This prediction performs yet another matrix multiplication, this time
using the weights **Why** between **h** and **y**. (This is a simple
affine function like the one that happens in a fully-connected layer.)

Recall that our input **x** is a vector with 226 elements that contains
two separate data items: the MIDI note number and the delay in ticks.
This means we also need to predict the note and tick values separately,
and so we use two softmax functions, each on a separate portion of the
**y** vector:

y_note[t] = tf.nn.softmax(y[:num_midi_notes])
y_tick[t] = tf.nn.softmax(y[num_midi_notes:])

*Marc’Aurelio Ranzato, Christopher Poultney, Sumit Chopra, and Yann
LeCun. “Efficient learning of sparse representations with an
energy-based model.” Proceedings of NIPS. 2007.*

Original Paper
PDF

And that’s in a nutshell how the math in the LSTM layer works. To read
more about these formulas, see the Wikipedia
page.

**Note:** Even though the above LSTM formulas are taken from the Python
training script and use TensorFlow to do the computations, we need to
implement *exactly* the same formulas in the iOS app. But instead of
TensorFlow, we’ll use the Accelerate framework for that.

Too many matrices!

As you know, when a neural network is trained it will learn values for
the weights and biases. The same is true here: the LSTM will learn the
values of **Wxf**, **Whf**, **bf**,**Why**, **by**, and so on. Notice
that this is 9 different matrices and 5 different bias values.

We can be clever and actually combine these matrices into one big
matrix:

**Variational autoencoders (VAE)** have the same architecture as AEs but
are “taught” something else: an approximated probability distribution of
the input samples. It’s a bit back to the roots as they are a bit more
closely related to BMs and RBMs. They do however rely on Bayesian
mathematics regarding probabilistic inference and independence, as well
as a re-parametrisation trick to achieve this different representation.
The inference and independence parts make sense intuitively, but they
rely on somewhat complex mathematics. The basics come down to this: take
influence into account. If one thing happens in one place and something
else happens somewhere else, they are not necessarily related. If they
are not related, then the error propagation should consider that. This
is a useful approach because neural networks are large graphs (in a
way), so it helps if you can rule out influence from some nodes to other
nodes as you dive into deeper layers.

*Kingma, Diederik P., and Max Welling. “Auto-encoding variational
bayes.” arXiv preprint arXiv:1312.6114 (2013).*

Original Paper PDF

Combining the weight matrices into a big matrix

We first put the value of **x** for this timestep and the value of **h**
of the previous timestep into a new vector (plus the constant 1, which
gets multiplied with the bias). Likewise, we put all the weights and
biases into one matrix. And then we multiply these two together.

This does the exact same thing as the eight matrix multiplies from
before. The big advantage is that we now have to manage only a single
weight matrix for **x** and **h**(and no bias value, since that is part
of this big matrix too).

We can simplify the computation for the gates to just this:

combined = tf.concat([x[t], h[t - 1], tf.ones(1)], axis=0)
gates = tf.matmul(combined, Wx)

And then compute the new values of **c** and **h** as follows:

c[t] = tf.sigmoid(gates[0]) * c[t - 1] + tf.sigmoid(gates[1]) * tf.tanh(gates[3])
h[t] = tf.sigmoid(gates[2]) * tf.tanh(c[t])
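The combined-matrix step can also be sketched in plain NumPy (illustrative names; this mirrors what the iOS app will later do with the Accelerate framework), assuming `Wx` has shape `[input_size + num_units + 1, 4 * num_units]` with the bias folded into the last row:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, Wx):
    """One LSTM timestep using the single combined weight matrix Wx."""
    combined = np.concatenate([x, h_prev, [1.0]])  # the constant 1 hits the bias row
    gates = combined @ Wx                          # one big matrix multiply
    f, i, o, g = np.split(gates, 4)                # f, i, o, g in that order
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c
```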

**Denoising autoencoders (DAE)** are AEs where we don’t feed just the
input data, but we feed the input data with noise (like making an image
more grainy). We compute the error the same way though, so the output of
the network is compared to the original input without noise. This
encourages the network not to learn details but broader features, as
learning smaller features often turns out to be “wrong” due to it
constantly changing with noise.

These two formulas for **c** and **h** didn’t really change — I just
moved the sigmoid and tanh functions here.

Now when we train the LSTM we only have to deal with two weight
matrices: **Wx**, which is the big matrix I showed you here, and **Wy**,
the matrix for the weights between **h** and **y**. Those two
matrices are the learned parameters that get loaded into the iOS app.

Training

OK, let’s recap where we are now:

We’ve got a dataset of 52260 one-hot encoded vectors that describe MIDI
notes and their timing. Together, these 52260 vectors make up a very
long sequence of drum patterns.

*Vincent, Pascal, et al. “Extracting and composing robust features with
denoising autoencoders.” Proceedings of the 25th international
conference on Machine learning. ACM, 2008.*

Original Paper
PDF

We want to train the LSTM to memorize this sequence. In other words, for every note of the sequence the LSTM should be able to correctly predict the note that follows.

We have the formulas for computing what happens in an LSTM layer. It
takes an input **x**, which is one of these vectors describing a single
drum sound, and two state vectors **h** and **c**. The LSTM then
computes new values for **h** and **c**, as well as a prediction **y**
for what the next note in the sequence will be.

Now we need to put this all together to train the recurrent network.
This will give us two matrices **Wx** and **Wy** that describe the
weights of the connections between the different parts of the LSTM.

And then we can use those weights in the iOS app to play new drum
patterns.

**Note:** The GitHub
repo
only contains a few drum patterns since I am not allowed to distribute
the full dataset. So unless you have your own library of drum patterns,
there isn’t much use in doing the training yourself. However, you *can*
still run the iOS app, as the trained weights are included in the Xcode
project.

That said, if you really want to, you can run the **lstm.py** script to
train the neural network on the included drum patterns (see the
README
file for instructions). Don’t get your hopes up though — because there
isn’t nearly enough data to train on, the model won’t be very good.

**Deep belief networks (DBN)** is the name given to stacked
architectures of mostly RBMs or VAEs. These networks have been shown to
be effectively trainable stack by stack, where each AE or RBM only has
to learn to encode the previous network. This technique is also known as
greedy training, where greedy means making locally optimal solutions to
get to a decent but possibly not optimal answer. DBNs can be trained
through contrastive divergence or back-propagation and learn to
represent the data as a probabilistic model, just like regular RBMs or
VAEs. Once trained or converged to a (more) stable state through
unsupervised learning, the model can be used to generate new data. If
trained with contrastive divergence, it can even classify existing data
because the neurons have been taught to look for different features.
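
The greedy, stack-by-stack procedure can be sketched like this. The `ToyLayer` is a hypothetical stand-in for a real RBM or autoencoder; its "encoding" is deliberately trivial, only the training loop matters here:

```python
import numpy as np

class ToyLayer:
    """Stand-in for an RBM/autoencoder; 'encoding' here is just centering."""
    def fit(self, X):
        self.mean = X.mean(axis=0)
    def encode(self, X):
        return X - self.mean

def train_greedily(layers, data):
    """Each layer is trained in isolation on the output of the stack so far:
    a locally optimal (greedy) choice rather than a joint optimum."""
    representation = data
    for layer in layers:
        layer.fit(representation)
        representation = layer.encode(representation)
    return representation
```

After this unsupervised pre-training, the whole stack can optionally be fine-tuned end to end with backpropagation.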

A few notes about training

Training an LSTM isn’t very different from training any other neural
network. We use backpropagation with an SGD (Stochastic Gradient
Descent) optimizer and we train until the loss is low enough.

However, the nature of this network being recurrent — where the outputs
**h** and **c** are always connected to the inputs **h** and **c** —
makes backpropagation a little tricky. We don’t want to get stuck in an
infinite loop!

The way to deal with this is a technique called *backpropagation through time*, where we backpropagate through all the steps of the entire training sequence.

In the interest of keeping this blog post short, I’m not going to
explain the entire training procedure here. You can find the complete
implementation
in **lstm.py**, in the function train().

However, I do want to mention a few things:

The learning capacity of the LSTM is determined by the size of the **h**
and **c** vectors. A size of 200 units (or neurons if you will) seems to
work well. More units might work even better but at some point you’ll
get diminishing returns, and you’re better off stacking multiple LSTMs
on top of each other (making the network deeper rather than wider).

*Bengio, Yoshua, et al. “Greedy layer-wise training of deep networks.”
Advances in neural information processing systems 19 (2007): 153.*

Original Paper
PDF

It’s not practical to backpropagate through all 52260 steps of the training sequence, even though that would give the best results. Instead, we only go back 200 timesteps. After a bit of experimentation this seemed like a reasonable number. To achieve this, the training script actually sticks 200 LSTM units together and processes the training sequence in chunks of 200 notes at a time.
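
The chunking itself is simple. Here is a minimal sketch; the real lstm.py also carries the **h** and **c** state over from one chunk to the next, and the details below are assumptions:

```python
def chunks(sequence, steps=200):
    """Split a long training sequence into windows for truncated
    backpropagation through time. Each window yields `steps` inputs and
    the `steps` notes that follow them as the prediction targets."""
    for start in range(0, len(sequence) - steps, steps):
        yield sequence[start:start + steps], sequence[start + 1:start + steps + 1]
```

Because every target is just the input shifted by one note, no manual labeling is needed: the sequence supervises itself.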

Every so often the training script computes the percentage of predictions it has correct. It does this on the training set (there is no validation set) so take it with a grain of salt, but it’s a good indicator of whether the training is still making progress or not.
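
That accuracy check amounts to something like the following sketch (not the script's exact code):

```python
import numpy as np

def prediction_accuracy(logits, targets):
    """Fraction of timesteps where the highest-scoring note is the note
    that actually came next in the training sequence."""
    return np.mean(np.argmax(logits, axis=-1) == np.asarray(targets))
```

Since it is measured on the training set, it tells you about memorization progress, not generalization.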

The final model took a few hours to train on my iMac but that’s because it doesn’t have a GPU that TensorFlow can use (sad face). I let the training script run until the learning seemed to have stalled (the accuracy and loss did not improve), then I pressed Ctrl C, lowered the learning rate in the script, and resumed training from the last checkpoint.

**Convolutional neural networks (CNN or deep convolutional neural
networks, DCNN)** are quite different from most other networks. They are
primarily used for image processing but can also be used for other types
of input such as audio. A typical use case for CNNs is where you feed
the network images and the network classifies the data, e.g. it
outputs “cat” if you give it a cat picture and “dog” when you give it a
dog picture. CNNs tend to start with an input “scanner” which is not
intended to parse all the training data at once. For example, to input
an image of 200 x 200 pixels, you wouldn’t want a layer with 40 000
nodes. Rather, you create a scanning input layer of, say, 20 x 20, to which you feed the first 20 x 20 pixels of the image (usually starting in the upper left corner). Once you have passed that input (and possibly used it for training), you feed it the next 20 x 20 pixels: you move the scanner one pixel to the right. Note that one wouldn’t move the input 20 pixels (or
whatever scanner width) over, you’re not dissecting the image into
blocks of 20 x 20, but rather you’re crawling over it. This input data
is then fed through convolutional layers instead of normal layers, where
not all nodes are connected to all nodes. Each node only concerns itself
with close neighbouring cells (how close depends on the implementation,
but usually not more than a few). These convolutional layers also tend
to shrink as they become deeper, mostly by easily divisible factors of
the input (so 20 would probably go to a layer of 10 followed by a layer
of 5). Powers of two are very commonly used here, as they can be divided
cleanly and completely by definition: 32, 16, 8, 4, 2, 1. Besides these
convolutional layers, they also often feature pooling layers. Pooling is
a way to filter out details: a commonly found pooling technique is max
pooling, where we take say 2 x 2 pixels and pass on the pixel with the
most amount of red. To apply CNNs for audio, you basically feed the
input audio waves and inch over the length of the clip, segment by
segment. Real world implementations of CNNs often glue an FFNN to the
end to further process the data, which allows for highly non-linear
abstractions. These networks are called DCNNs but the names and
abbreviations between these two are often used interchangeably.
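
Max pooling, for example, fits in a couple of lines. This is a sketch for a single-channel image; odd edge rows and columns are simply trimmed:

```python
import numpy as np

def max_pool_2x2(image):
    """Keep only the strongest activation in each 2 x 2 block,
    halving the width and height of the feature map."""
    h, w = image.shape
    return image[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
```
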

The model that is included in the GitHub
repo
has an accuracy score of about 92%, which means 8 in every 100 notes
from the training sequence are remembered wrong. Once the model reached
92% accuracy, it didn’t seem to want to go much further than that, so
we’ve probably reached the capacity of our model.

An accuracy of “only” 92% is good enough for our purposes: we don’t want
the LSTM to literally remember every example from the training data,
just enough to get a sense of what it means to play the drums.

So how good is it?

Don’t fire the drummer from your band just yet. :–)

*LeCun, Yann, et al. “Gradient-based learning applied to document
recognition.” Proceedings of the IEEE 86.11 (1998): 2278-2324.*

Original Paper PDF

The question is: has the recurrent neural network *really* learned
anything from the training data, or does it just output random notes?

Here’s an
MP3
of randomly chosen notes and durations from the training data. It
doesn’t sound like real drums at all.

Compare it with this
recording
that was produced by the LSTM. It’s definitely much more realistic! (In
fact, it sounds a lot like the kid down the street practicing.)

Of course, the model we’re using is very basic. It’s a single LSTM layer
with “only” 200 neurons. No doubt there are much better ways to train a
computer to play the drums. One way is to make the network deeper by
stacking multiple LSTMs. This should improve the performance by a lot!

The weights that are learned by the model take up 1.5 MB of storage. The
dataset, on the other hand, is only 1.3 MB! That doesn’t seem very
efficient. But just having the dataset does not mean you know how to
drum — the weights are more than just a way to remember the training
data, they also “understand” in some way what it means to play the
drums.

The cool thing is that our neural network doesn’t really know anything
about music: we just gave it examples and it has learned drumming from
that (to some extent anyway). The point I’m trying to make with this
blog post is that if we can make a recurrent neural network learn to
drum, then we can teach it to understand any kind of sequential data.

**Deconvolutional networks (DN)**, also called inverse graphics networks
(IGNs), are reversed convolutional neural networks. Imagine feeding a
network the word “cat” and training it to produce cat-like pictures, by
comparing what it generates to real pictures of cats. DNs can be combined with FFNNs just like regular CNNs, but this is about the point where the line is drawn on coming up with new abbreviations. They may be referenced as deep deconvolutional neural networks, but you could argue that when you stick FFNNs to the back and the front of DNs you have yet another architecture which deserves a new name. Note that
in most applications one wouldn’t actually feed text-like input to the
network, more likely a binary classification input vector. Think <0,
1> being cat, <1, 0> being dog and <1, 1> being cat and
dog. The pooling layers commonly found in CNNs are often replaced with
similar inverse operations, mainly interpolation and extrapolation with
biased assumptions (if a pooling layer uses max pooling, you can invent
exclusively lower new data when reversing it).

*Zeiler, Matthew D., et al. “Deconvolutional networks.” Computer Vision
and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010.*

Original Paper
PDF

**Deep convolutional inverse graphics networks (DCIGN)** have a somewhat
misleading name, as they are actually VAEs but with CNNs and DNNs for
the respective encoders and decoders. These networks attempt to model
“features” in the encoding as probabilities, so that it can learn to
produce a picture with a cat and a dog together, having only ever seen
one of the two in separate pictures. Similarly, you could feed it a
picture of a cat with your neighbours’ annoying dog on it, and ask it to
remove the dog, without ever having done such an operation. Demos have
shown that these networks can also learn to model complex
transformations on images, such as changing the source of light or the
rotation of a 3D object. These networks tend to be trained with
back-propagation.

*Kulkarni, Tejas D., et al. “Deep convolutional inverse graphics
network.” Advances in Neural Information Processing Systems. 2015.*

Original Paper PDF

**Generative adversarial networks (GAN)** are from a different breed of
networks, they are twins: two networks working together. GANs consist of
any two networks (although often a combination of FFs and CNNs), with one tasked to generate content and the other to judge it. The
discriminating network receives either training data or generated
content from the generative network. How well the discriminating network
was able to correctly predict the data source is then used as part of
the error for the generating network. This creates a form of competition
where the discriminator is getting better at distinguishing real data
from generated data and the generator is learning to become less
predictable to the discriminator. This works well in part because even
quite complex noise-like patterns are eventually predictable but
generated content similar in features to the input data is harder to
learn to distinguish. GANs can be quite difficult to train, as you don’t
just have to train two networks (either of which can pose its own problems) but their dynamics need to be balanced as well. If prediction or generation becomes too good compared to the other, a GAN won’t converge, as there is intrinsic divergence.
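
The competition comes down to two opposing loss functions, roughly as below. This sketch uses the common non-saturating generator loss, which is an assumption; the original paper's minimax formulation differs slightly:

```python
import numpy as np

EPS = 1e-9  # avoid log(0)

def discriminator_loss(d_real, d_fake):
    """The discriminator wants real samples scored near 1, fakes near 0."""
    return -np.mean(np.log(d_real + EPS) + np.log(1.0 - d_fake + EPS))

def generator_loss(d_fake):
    """The generator wants its fakes to be scored as real."""
    return -np.mean(np.log(d_fake + EPS))
```

Training alternates gradient steps on the two losses, which is exactly where the balancing act comes from.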

*Goodfellow, Ian, et al. “Generative adversarial nets.” Advances in
Neural Information Processing Systems. 2014.*

Original Paper PDF

**Recurrent neural networks (RNN)** are FFNNs with a time twist: they
are not stateless; they have connections between passes, connections
through time. Neurons are fed information not just from the previous
layer but also from themselves from the previous pass. This means that
the order in which you feed the input and train the network matters:
feeding it “milk” and then “cookies” may yield different results
compared to feeding it “cookies” and then “milk”. One big problem with
RNNs is the vanishing (or exploding) gradient problem
where, depending on the activation functions used, information
rapidly gets lost over time, just like very deep FFNNs lose information
in depth. Intuitively this wouldn’t be much of a problem because these
are just weights and not neuron states, but the weights through time are actually where the information from the past is stored; if the weight
reaches a value of 0 or 1 000 000, the previous state won’t be very
informative. RNNs can in principle be used in many fields as most forms
of data that don’t actually have a timeline (i.e. unlike sound or video)
can be represented as a sequence. A picture or a string of text can be
fed one pixel or character at a time, so the time dependent weights are
used for what came before in the sequence, not actually from what
happened x seconds before. In general, recurrent networks are a good
choice for advancing or completing information, such as autocompletion.
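
The "milk then cookies" point is easy to demonstrate: one recurrent step mixes the current input with the previous state, so feeding the same inputs in a different order generally ends in a different state. A minimal sketch with random (illustrative) weights:

```python
import numpy as np

def rnn_step(x, h, Wxh, Whh, b):
    """One recurrent pass: the new state depends on the input AND the
    previous state, which is what makes input order matter."""
    return np.tanh(Wxh @ x + Whh @ h + b)

rng = np.random.default_rng(42)
d, n = 3, 4
Wxh, Whh, b = rng.standard_normal((n, d)), rng.standard_normal((n, n)), rng.standard_normal(n)
milk, cookies = rng.standard_normal(d), rng.standard_normal(d)
h0 = np.zeros(n)
milk_first    = rnn_step(cookies, rnn_step(milk, h0, Wxh, Whh, b), Wxh, Whh, b)
cookies_first = rnn_step(milk, rnn_step(cookies, h0, Wxh, Whh, b), Wxh, Whh, b)
```
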

*Elman, Jeffrey L. “Finding structure in time.” Cognitive science 14.2
(1990): 179-211.*

Original Paper PDF

**Long / short term memory (LSTM)** networks try to combat the vanishing
/ exploding gradient problem by introducing gates and an explicitly
defined memory cell. These are inspired mostly by circuitry, not so much
biology. Each neuron has a memory cell and three gates: input, output
and forget. The function of these gates is to safeguard the information
by stopping or allowing the flow of it. The input gate determines how
much of the information from the previous layer gets stored in the cell.
The output gate takes the job on the other end and determines how much
of the next layer gets to know about the state of this cell. The forget
gate seems like an odd inclusion at first but sometimes it’s good to
forget: if it’s learning a book and a new chapter begins, it may be
necessary for the network to forget some characters from the previous
chapter. LSTMs have been shown to be able to learn complex sequences,
such as writing like Shakespeare or composing primitive music. Note that
each of these gates has a weight to a cell in the previous neuron, so
they typically require more resources to run.

*Hochreiter, Sepp, and Jürgen Schmidhuber. “Long short-term memory.”
Neural computation 9.8 (1997): 1735-1780.*

Original Paper
PDF

**Gated recurrent units (GRU)** are a slight variation on LSTMs. They
have one less gate and are wired slightly differently: instead of an
input, output and a forget gate, they have an update gate. This update
gate determines both how much information to keep from the last state
and how much information to let in from the previous layer. The reset
gate functions much like the forget gate of an LSTM but it’s located
slightly differently. They always send out their full state; they don’t have an output gate. In most cases, they function very similarly to
LSTMs, with the biggest difference being that GRUs are slightly faster
and easier to run (but also slightly less expressive). In practice these
tend to cancel each other out, as you need a bigger network to regain
some expressiveness which then in turn cancels out the performance
benefits. In some cases where the extra expressiveness is not needed,
GRUs can outperform LSTMs.
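
A sketch of a single GRU step; how the weights are packed is an assumption:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h, Wz, Wr, Wh):
    """One GRU timestep. There is no output gate: the full state h is
    what the next layer (and the next timestep) sees."""
    v = np.concatenate([x, h])
    z = sigmoid(Wz @ v)                   # update gate: keep old vs. take new
    r = sigmoid(Wr @ v)                   # reset gate: how much past to use
    h_cand = np.tanh(Wh @ np.concatenate([x, r * h]))
    return (1.0 - z) * h + z * h_cand
```

Compared to the LSTM step there is no separate cell vector **c**, which is where the speed advantage comes from.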

*Chung, Junyoung, et al. “Empirical evaluation of gated recurrent neural
networks on sequence modeling.” arXiv preprint arXiv:1412.3555
(2014).*

Original Paper PDF

**Neural Turing machines (NTM)** can be understood as an abstraction of
LSTMs and an attempt to un-black-box neural networks (and give us some
insight in what is going on in there). Instead of coding a memory cell
directly into a neuron, the memory is separated. It’s an attempt to
combine the efficiency and permanency of regular digital storage and the
efficiency and expressive power of neural networks. The idea is to have
a content-addressable memory bank and a neural network that can read and
write from it. The “Turing” in Neural Turing Machines comes from them
being Turing complete: the ability to read and write and change state
based on what it reads means it can represent anything a Universal
Turing Machine can represent.

*Graves, Alex, Greg Wayne, and Ivo Danihelka. “Neural turing machines.”
arXiv preprint arXiv:1410.5401 (2014).*

Original Paper PDF

**Bidirectional recurrent neural networks, bidirectional long / short
term memory networks and bidirectional gated recurrent units (BiRNN,
BiLSTM and BiGRU respectively)** are not shown on the chart because they
look exactly the same as their unidirectional counterparts. The
difference is that these networks are not just connected to the past,
but also to the future. As an example, unidirectional LSTMs might be
trained to predict the word “fish” by being fed the letters one by one,
where the recurrent connections through time remember the last value. A
BiLSTM would also be fed the next letter in the sequence on the backward
pass, giving it access to future information. This trains the network to
fill in gaps instead of advancing information, so instead of expanding
an image on the edge, it could fill a hole in the middle of an image.

*Schuster, Mike, and Kuldip K. Paliwal. “Bidirectional recurrent neural
networks.” IEEE Transactions on Signal Processing 45.11 (1997):
2673-2681.*

Original Paper
PDF

**Deep residual networks (DRN)** are very deep FFNNs with extra
connections passing input from one layer to a later layer (often 2 to 5
layers) as well as the next layer. Instead of trying to find a solution
for mapping some input to some output across say 5 layers, the network is forced to learn to map some input to some output plus that same input. Basically, it adds an identity to the solution, carrying the older input
over and serving it freshly to a later layer. It has been shown that
these networks are very effective at learning patterns up to 150 layers
deep, much more than the regular 2 to 5 layers one could expect to
train. However, it has been proven that these networks are in essence
just RNNs without the explicit time based construction and they’re often
compared to LSTMs without gates.

*He, Kaiming, et al. “Deep residual learning for image recognition.”
arXiv preprint arXiv:1512.03385 (2015).*

Original Paper PDF

**Echo state networks (ESN)** are yet another different type of
(recurrent) network. This one sets itself apart from others by having
random connections between the neurons (i.e. not organised into neat
sets of layers), and they are trained differently. Instead of feeding
input and back-propagating the error, we feed the input, forward it
and update the neurons for a while, and observe the output over time.
The input and the output layers have a slightly unconventional
role as the input layer is used to prime the network and the output
layer acts as an observer of the activation patterns that unfold over
time. During training, only the connections between the observer and the
(soup of) hidden units are changed.

*Jaeger, Herbert, and Harald Haas. “Harnessing nonlinearity: Predicting
chaotic systems and saving energy in wireless communication.” science
304.5667 (2004): 78-80.*

Original Paper
PDF

**Extreme learning machines (ELM)** are basically FFNNs but with random
connections. They look very similar to LSMs and ESNs, but they are not
recurrent nor spiking. They also do not use backpropagation. Instead,
they start with random weights and train the weights in a single step
according to the least-squares fit (lowest error across all functions).
This results in a much less expressive network but it’s also much faster
than backpropagation.
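
The whole training procedure fits in a few lines. This is a sketch; real ELM implementations vary in activation function and regularisation:

```python
import numpy as np

def train_elm(X, y, hidden=50, seed=0):
    """Random, untrained input weights; only the output weights are fit,
    in one least-squares step instead of backpropagation."""
    rng = np.random.default_rng(seed)
    W_in = rng.standard_normal((X.shape[1], hidden))
    b = rng.standard_normal(hidden)
    H = np.tanh(X @ W_in + b)                      # random hidden features
    W_out, *_ = np.linalg.lstsq(H, y, rcond=None)  # the single training "step"
    return W_in, b, W_out

def predict_elm(X, W_in, b, W_out):
    return np.tanh(X @ W_in + b) @ W_out
```
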

*Cambria, Erik, et al. “Extreme learning machines [trends &
controversies].” IEEE Intelligent Systems 28.6 (2013): 30-59.*

Original Paper
PDF

**Liquid state machines (LSM)** are similar soups, looking a lot like
ESNs. The real difference is that LSMs are a type of spiking neural
networks: sigmoid activations are replaced with threshold functions and
each neuron is also an accumulating memory cell. So when updating a
neuron, the value is not set to the sum of the neighbours, but rather
added to itself. Once the threshold is reached, it releases its energy
to other neurons. This creates a spiking like pattern, where nothing
happens for a while until a threshold is suddenly reached.
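
The accumulate-and-fire behaviour in a few lines. This is a toy sketch; real LSM neuron models, such as leaky integrate-and-fire, also leak charge over time:

```python
class SpikingNeuron:
    """Accumulate-and-fire: inputs add to an internal charge; nothing is
    emitted until the threshold is crossed, then the energy is released."""
    def __init__(self, threshold=1.0):
        self.threshold = threshold
        self.charge = 0.0

    def step(self, incoming):
        self.charge += incoming
        if self.charge >= self.threshold:
            spike = self.charge
            self.charge = 0.0
            return spike      # fires: release the accumulated energy
        return 0.0            # stays silent below the threshold
```
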

*Maass, Wolfgang, Thomas Natschläger, and Henry Markram. “Real-time
computing without stable states: A new framework for neural computation
based on perturbations.” Neural computation 14.11 (2002): 2531-2560.*

Original Paper
PDF

**Support vector machines (SVM)** find optimal solutions for
classification problems. Classically they were only capable of
categorising linearly separable data; say finding which images are of
Garfield and which of Snoopy, with any other outcome not being possible.
During training, SVMs can be thought of as plotting all the data
(Garfields and Snoopys) on a graph (2D) and figuring out how to draw a
line between the data points. This line would separate the data, so that
all Snoopys are on one side and the Garfields on the other. This line
moves to an optimal line in such a way that the margins between the data
points and the line are maximised on both sides. Classifying new data
would be done by plotting a point on this graph and simply looking on
which side of the line it is (Snoopy side or Garfield side). Using the
kernel trick, they can be taught to classify n-dimensional data. This
entails plotting points in a 3D plot, allowing it to distinguish between
Snoopy, Garfield AND Simon’s cat, or even higher dimensions
distinguishing even more cartoon characters. SVMs are not always
considered neural networks.
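
The kernel-trick idea in miniature: XOR-style data has no separating line in 2D, but after mapping the points into a third dimension a flat plane does the job. The lifting function and the plane below are hand-picked for illustration; a real SVM finds the maximum-margin separator via its kernel:

```python
import numpy as np

# Four XOR-style points: no straight line separates class 0 from class 1 in 2D.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

def lift(X):
    """Map into 3D by adding a product feature, where a plane can separate."""
    return np.column_stack([X, X[:, 0] * X[:, 1]])

# In the lifted space the plane with normal w and offset b separates the classes.
w, b = np.array([1.0, 1.0, -2.0]), -0.5
scores = lift(X) @ w + b   # positive on one side of the plane, negative on the other
```
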

*Cortes, Corinna, and Vladimir Vapnik. “Support-vector networks.”
Machine learning 20.3 (1995): 273-297.*

Original Paper
PDF

And finally, **Kohonen networks (KN, also self organising (feature) map,
SOM, SOFM)** “complete” our zoo. KNs utilise competitive learning to
classify data without supervision. Input is presented to the network,
after which the network assesses which of its neurons most closely match
that input. These neurons are then adjusted to match the input even
better, dragging along their neighbours in the process. How much the
neighbours are moved depends on the distance of the neighbours to the
best matching units. KNs are sometimes not considered neural networks
either.
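
One competitive-learning step looks roughly like this; the grid shape, learning rate, and Gaussian neighbourhood are assumptions:

```python
import numpy as np

def som_update(weights, x, lr=0.5, radius=1.0):
    """Find the best matching unit (BMU) and pull it toward the input x,
    dragging its grid neighbours along with a strength that falls off
    with their distance from the BMU on the grid."""
    dist = np.linalg.norm(weights - x, axis=-1)          # weights: (rows, cols, dim)
    bmu = np.unravel_index(np.argmin(dist), dist.shape)
    rows, cols = np.indices(dist.shape)
    grid_dist2 = (rows - bmu[0]) ** 2 + (cols - bmu[1]) ** 2
    influence = np.exp(-grid_dist2 / (2.0 * radius ** 2))
    return weights + lr * influence[..., None] * (x - weights)
```

Repeating this over many inputs, while shrinking `lr` and `radius`, is what unfolds the map over the data.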

*Kohonen, Teuvo. “Self-organized formation of topologically correct
feature maps.” Biological cybernetics 43.1 (1982): 59-69.*

Original Paper
PDF

Any feedback and criticism is welcome. At the Asimov Institute we do deep learning research and development, so be sure to follow us on twitter for future updates and posts! Thank you for reading!

[Update 15 september 2016] I would like to thank everybody for their insights and corrections, all feedback is hugely appreciated. I will add links and a couple more suggested networks in a future update, stay tuned.

[Update 29 september 2016] Added links and citations to all the original papers. A follow up post is planned, since I found at least 9 more architectures. I will not include them in this post for better consistency in terms of content.

[Update 30 november 2017] Looking for a poster of the neural network zoo? Click here
