【RNN】深扒能提高预测准确率的beam search

假装爱读书的Flora

2019年07月31日 14:13

收录于文集

共4篇

本文要点：

seq-to-seq 论文作者没提的那些事
Beam Search里面的小心机

大家都知道seq-to-seq，输入是长度为T的sequence，可以输出为长度T‘的sequence，两者不需要相等，使得机器翻译更上一层楼。它由两个LSTM型的RNN模型组成，一个是encoder，一个是decoder。

大家都烂熟于心了，所以今天不讲seq-to-seq论文中已经讲明白的事情，而是讲讲它没细提但是默默在github里面实现的东西，以及不加入训练却能提高inference准确率的神奇的beam search：）

seq-to-seq 论文作者没提的那些事

第一、固定长度的向量具体指LSTM的什么

论文中说用多个LSTM层当encoder，将输入句子浓缩成一个固定长度的向量，再将这个向量作为输入，放进decoder让它生成输出句子

原话：The idea is to use one LSTM to read the input sequence, one timestep at a time, to obtain large fixeddimensional vector representation, and then to use another LSTM to extract the output sequence from that vector (fig. 1)

论文中从头到尾没说这个固定长度的向量到底是啥，看代码知道：

encoder类返回的是(outputs, states)的tuple，这边outputs指的是每一个timestep的h_t, states的话假设使用LSTM cell指的是最后一个timestep做完之后的(h_t, c_t) tuple

第二、桥梁这个类

论文中并没有来得及提到而实际github中能发现：

encoder输出的向量V_e，和decoder输入的向量V_d，假设V_e维度是[batch size, m], V_d维度是[batch size, n]，m和n是可以不一样的，使用一个叫桥梁的类，可以实现不同的逻辑，将V_e转化为V_d

https://github.com/google/seq2seq/blob/master/seq2seq/models/bridges.py#L41-L77

这个类的两个输入：

encoder_outputs: A namedtuple 上面已经讲了这具体是啥

decoder_state_size: decoder的维度也就是上面说的n

桥梁有三种：

ZeroBridge（~~不管encoder妈妈给我吃什么我都只考0分，~~返回[batch, n]的零矩阵）

PassThroughBridge（不管encoder给我吃什么都原样递给decoder，前提是m=n）

InitialStateBridge（一个FC层也就是个[m, n]的矩阵，把[Batch, m]乘成[Batch, n]）https://github.com/google/seq2seq/blob/master/seq2seq/models/bridges.py#L112-L166

Beam Search里面的小心机

论文中说为了提高infer时的准确率，只在训练结束需要预测的时候使用的beam search

什么是beam search呢？

原文就用了三句话讲它，其中一句：We search for the most likely translation using a simple left-to-right beam search decoder which maintains a small number B of partial hypotheses, where a partial hypothesis is a prefix of some translation.

想象一下：假设不用beam search，我们预测时，在每个timestep，只保留概率最大的那个词作为这个位置上我们预测会出现的词，每一个timestep都这么做，直到遇到EOS(也就是预测到这是句尾)或者到达我们提前设置好的最大句子长度。

这么做有点像DFS，从树顶上开始，每一层找到最大的P往下走，走到最底下就返回结果

那么beam search就有点像加了限制的BFS，当Beam width B=2时，每一层保留前两个最有可能的prefix带着往下走，直到到达最大句子长度（词语个数或者说time step）或者中间遇到EOS(也就是预测到这是句尾)

聚焦到这个函数基本上就知道bs在干嘛：

https://github.com/google/seq2seq/blob/master/seq2seq/inference/beam_search.py#L143

def choose_top_k(scores_flat, config):

"""Chooses the top-k beams as successors.

"""

next_beam_scores, word_indices = tf.nn.top_k(scores_flat, k=config.beam_width)

return next_beam_scores, word_indices

感觉是通过考虑更多的可能，来提升预测准确率的一个算法

再看看beam search相关的2个类，方便我们理解接下来的函数：

class BeamSearchState(

namedtuple("BeamSearchState", ["log_probs", "finished", "lengths"])):

"""State for a single step of beam search.

这个类代表一个timestep里面的状态，含三个属性：

log_probs 直到本timestep，每一个timestep中所有可能出现的词语的P

finished 本timestep哪些树枝不会继续向下长，已经遇到EOS或者max

lengths 本timestep每个树枝的prefix的长度（也就是走到现在的词语个数）

class BeamSearchStepOutput(

namedtuple("BeamSearchStepOutput",

["scores", "predicted_ids", "beam_parent_ids"])):

这个类代表一个timestep完成后的输出，含三个属性：

scores 本timestep每个备选树枝的打分分数

predicted_ids 本timestep选中的前B个词语

beam_parent_ids 被选中的词语们的父亲们，用来最后追溯，完整回复出走的这个路径

最后我们来看看seq-to-seq悄咪咪干啥了：

论文中没有提到，这个beam search每一轮的prefix句子其实不是直接按照概率选出来的前N。

github代码中根据另一篇论文Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation（https://arxiv.org/pdf/1609.08144.pdf），对裸概率做了处理，根据不同长度的prefix进行了一个标准化的操作，对过短的输出句子给出了惩罚。

以下是该论文中的原话：

We introduce two important refinements to the pure max-probability based beam search algorithm: a coverage penalty and length normalization. With length normalization, we aim to account for the fact that we have to compare hypotheses of different length. Without some form of length-normalizationregular beam search will favor shorter results over longer ones on average since a negative log-probability is added at each step, yielding lower (more negative) scores for longer sentences

意思就是不标准化输出prefix的长度的话，越长的句子logP越低，越不容易被选上

他们使用了一个函数来做这个惩罚：

α ∈ [0.6 − 0.7]

看看代码：

def length_penalty(sequence_lengths, penalty_factor):

Args:

sequence_lengths: The sequence length of all hypotheses, a tensor

of shape [beam_size, vocab_size].

penalty_factor: A scalar that weights the length penalty.

Returns:

The length penalty factor, a tensor fo shape [beam_size].

"""

return tf.div((5. + tf.to_float(sequence_lengths))**penalty_factor, (5. + 1.)**penalty_factor)

是不是闻到了玄学调参的味道?

def hyp_score(log_probs, sequence_lengths, config):

"""Calculates scores for beam search hypotheses.

"""

length_penality_ = length_penalty( sequence_lengths=sequence_lengths,

penalty_factor=config.length_penalty_weight)

score = log_probs / length_penality_

return score

将logP用length_penalty标准化一下

最最后，我们通过代码实现来看看beam search的一个timestep具体怎么走?

https://github.com/google/seq2seq/blob/master/seq2seq/inference/beam_search.py#L196-L279

1. 取出上一步的每个假设的句子长度和每个句子是否已经结束的信息

prediction_lengths = beam_state.lengths

previously_finished = beam_state.finished

2. 把输出词典中的每个词加到现有的每个假设句子prefix上，计算这一步的每个新假设句子的logP （这个logP的维度应该是[beam_width, vocab_size]）

probs = tf.nn.log_softmax(logits)

probs = mask_probs(probs, config.eos_token, previously_finished)

total_probs = tf.expand_dims(beam_state.log_probs, 1) + probs

3. 由于本轮可能有些句子遇到EOS终止，这边更新一下每个新假设句子的长度

lengths_to_add = tf.one_hot( [config.eos_token] * config.beam_width,

config.vocab_size, 0, 1)

add_mask = (1 - tf.to_int32( previously_finished ) )

lengths_to_add = tf.expand_dims( add_mask, 1 ) * lengths_to_add

new_prediction_lengths = tf.expand_dims( prediction_lengths,1 ) + lengths_to_add

4. 用上面讲的带玄学惩罚的算分函数算分

scores = hyp_score( log_probs=total_probs, sequence_lengths=new_prediction_lengths,

config=config )

# score的维度[beam_size, vocab_size]

scores_flat = tf.reshape(scores, [-1])

#在树顶的时候，只考虑第一个假设

scores_flat = tf.cond(tf.convert_to_tensor(time_) > 0, lambda: scores_flat,

lambda: scores[0])

5. 根据算出来的分以及一个选词函数successors选出这个位置预测的词们

next_beam_scores, word_indices = config.choose_successors_fn(scores_flat, config)

next_beam_scores.set_shape([config.beam_width])

word_indices.set_shape([config.beam_width])

6. 得到本轮预测词语们（也就是选好保留哪些branchs）之后，找到他们对应的logP啊，本轮对应句子是否结束之类的

total_probs_flat = tf.reshape(total_probs, [-1], name="total_probs_flat")

next_beam_probs = tf.gather(total_probs_flat, word_indices)

next_beam_probs.set_shape([config.beam_width])

next_word_ids = tf.mod(word_indices, config.vocab_size)

next_beam_ids = tf.div(word_indices, config.vocab_size)

# Append new ids to current predictions

next_finished = tf.logical_or(

tf.gather(beam_state.finished, next_beam_ids),

tf.equal(next_word_ids, config.eos_token))

7. 根据选中的branchs，更新句子长度：被选中的而还没有结束的branch 长度加1

lengths_to_add = tf.to_int32(tf.not_equal(next_word_ids, config.eos_token))

lengths_to_add = (1 - tf.to_int32(next_finished)) * lengths_to_add

next_prediction_len = tf.gather(beam_state.lengths, next_beam_ids)

next_prediction_len += lengths_to_add

8. 打包返回进入下一步计算

next_state = BeamSearchState(

log_probs=next_beam_probs,

lengths=next_prediction_len,

finished=next_finished)

output = BeamSearchStepOutput(

scores=next_beam_scores,

predicted_ids=next_word_ids,

beam_parent_ids=next_beam_ids)

P.S. 原文里面beam size B用的2声称就够好了，然而发现pointer generator模型中用到了4

感兴趣的话找机会讲pointer generator这个模型，可以生成重写型的文本摘要（不是摘取式）

Reference:

Seq-to-seq github https://github.com/google/seq2seq

Seq-to-seq论文 https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf

Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation论文 https://arxiv.org/pdf/1609.08144.pdf

Tensorflow github https://github.com/tensorflow/tensorflow

若想了解更多深度学习内容，不妨关注~假装爱读书的Flora~

有希望看到的内容也可以告诉我~

机器学习机器翻译深度学习自然语言处理 TensorFlow 文本摘要 seq-to-seq 束搜索算法 beam search

本文为我原创，未经授权禁止转载

cv3206132

分享至

投诉或建议