【RNN】深扒能提高预测准确率的beam search
假装爱读书的Flora
2019年07月31日 14:13
收录于文集
共4篇

本文要点:

  • seq-to-seq 论文作者没提的那些事

  • Beam Search里面的小心机


大家都知道seq-to-seq,输入是长度为T的sequence,可以输出为长度T‘的sequence,两者不需要相等,使得机器翻译更上一层楼。它由两个LSTM型的RNN模型组成,一个是encoder,一个是decoder。

大家都烂熟于心了,所以今天不讲seq-to-seq论文中已经讲明白的事情,而是讲讲它没细提但是默默在github里面实现的东西,以及不加入训练却能提高inference准确率的神奇的beam search:)


seq-to-seq 论文作者没提的那些事

第一、固定长度的向量具体指LSTM的什么

论文中说用多个LSTM层当encoder,将输入句子浓缩成一个固定长度的向量,再将这个向量作为输入,放进decoder让它生成输出句子

原话:The idea is to use one LSTM to read the input sequence, one timestep at a time, to obtain large fixeddimensional vector representation, and then to use another LSTM to extract the output sequence from that vector (fig. 1)

论文中从头到尾没说这个固定长度的向量到底是啥,看代码知道:

encoder类返回的是(outputs, states)的tuple,这边outputs指的是每一个timestep的h_t, states的话假设使用LSTM cell指的是最后一个timestep做完之后的(h_t, c_t) tuple

第二、桥梁这个类

论文中并没有来得及提到而实际github中能发现:

encoder输出的向量V_e,和decoder输入的向量V_d,假设V_e维度是[batch size, m], V_d维度是[batch size, n],m和n是可以不一样的,使用一个叫桥梁的类,可以实现不同的逻辑,将V_e转化为V_d

https://github.com/google/seq2seq/blob/master/seq2seq/models/bridges.py#L41-L77

这个类的两个输入:

encoder_outputs: A namedtuple 上面已经讲了这具体是啥

decoder_state_size: decoder的维度 也就是上面说的n

桥梁有三种:

ZeroBridge不管encoder妈妈给我吃什么我都只考0分,返回[batch, n]的零矩阵)

PassThroughBridge(不管encoder给我吃什么都 原样递给decoder,前提是m=n)

InitialStateBridge(一个FC层也就是个[m, n]的矩阵, 把[Batch, m]乘成[Batch, n])https://github.com/google/seq2seq/blob/master/seq2seq/models/bridges.py#L112-L166


Beam Search里面的小心机

论文中说为了提高infer时的准确率,只在训练结束需要预测的时候使用的beam search

什么是beam search呢?

原文就用了三句话讲它,其中一句:We search for the most likely translation using a simple left-to-right beam search decoder which maintains a small number B of partial hypotheses, where a partial hypothesis is a prefix of some translation.

想象一下:假设不用beam search,我们预测时,在每个timestep,只保留概率最大的那个词作为这个位置上我们预测会出现的词,每一个timestep都这么做,直到遇到EOS(也就是预测到这是句尾)或者到达我们提前设置好的最大句子长度。

这么做有点像DFS,从树顶上开始,每一层找到最大的P往下走,走到最底下就返回结果

那么beam search就有点像加了限制的BFS,当Beam width B=2时,每一层保留前两个最有可能的prefix带着往下走,直到到达最大句子长度(词语个数或者说time step)或者中间遇到EOS(也就是预测到这是句尾)

聚焦到这个函数基本上就知道bs在干嘛:

https://github.com/google/seq2seq/blob/master/seq2seq/inference/beam_search.py#L143

def choose_top_k(scores_flat, config):

    "&#​34;"Chooses the top-k beams as successors.

    "&#​34;"

    next_beam_scores, word_indices = tf.nn.top_k(scores_flat, k=config.beam_width)

    return next_beam_scores, word_indices

感觉是通过考虑更多的可能,来提升预测准确率的一个算法

再看看beam search相关的2个类,方便我们理解接下来的函数:

class BeamSearchState(

    namedtuple("BeamSearchState&#​34;, ["log_probs&#​34;, "finished&#​34;, "lengths&#​34;])):

  "&#​34;"State for a single step of beam search.

这个类代表一个timestep里面的状态,含三个属性:

log_probs 直到本timestep,每一个timestep中所有可能出现的词语的P

finished 本timestep哪些树枝不会继续向下长,已经遇到EOS或者max

lengths 本timestep每个树枝的prefix的长度(也就是走到现在的词语个数)

class BeamSearchStepOutput(

    namedtuple("BeamSearchStepOutput&#​34;,

               ["scores&#​34;, "predicted_ids&#​34;, "beam_parent_ids&#​34;])):

这个类代表一个timestep完成后的输出,含三个属性:

scores 本timestep每个备选树枝的打分分数

predicted_ids 本timestep选中的前B个词语

beam_parent_ids 被选中的词语们的父亲们,用来最后追溯,完整回复出走的这个路径

最后我们来看看seq-to-seq悄咪咪干啥了:

论文中没有提到,这个beam search每一轮的prefix句子其实不是直接按照概率选出来的前N。

github代码中根据另一篇论文Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation(https://arxiv.org/pdf/1609.08144.pdf),对裸概率做了处理,根据不同长度的prefix进行了一个标准化的操作,对过短的输出句子给出了惩罚。

以下是该论文中的原话:

We introduce two important refinements to the pure max-probability based beam search algorithm: a coverage penalty and length normalization. With length normalization, we aim to account for the fact that we have to compare hypotheses of different length. Without some form of length-normalizationregular beam search will favor shorter results over longer ones on average since a negative log-probability is added at each step, yielding lower (more negative) scores for longer sentences

意思就是不标准化输出prefix的长度的话,越长的句子logP越低,越不容易被选上

他们使用了一个函数来做这个惩罚:

α ∈ [0.6 − 0.7]

看看代码:

def length_penalty(sequence_lengths, penalty_factor):

   Args:

    sequence_lengths: The sequence length of all hypotheses, a tensor

      of shape [beam_size, vocab_size].

    penalty_factor: A scalar that weights the length penalty.

  Returns:

    The length penalty factor, a tensor fo shape [beam_size].

   "&#​34;"

  return tf.div((5. + tf.to_float(sequence_lengths))**penalty_factor, (5. + 1.)**penalty_factor)

是不是闻到了玄学调参的味道?

def hyp_score(log_probs, sequence_lengths, config):

    "&#​34;"Calculates scores for beam search hypotheses.

    "&#​34;"

    length_penality_ = length_penalty( sequence_lengths=sequence_lengths,

                                                            penalty_factor=config.length_penalty_weight)

    score = log_probs / length_penality_

    return score

将logP用length_penalty标准化一下

最最后,我们通过代码实现来看看beam search的一个timestep具体怎么走?

https://github.com/google/seq2seq/blob/master/seq2seq/inference/beam_search.py#L196-L279

1. 取出上一步的每个假设的句子长度和每个句子是否已经结束的信息

    prediction_lengths = beam_state.lengths

    previously_finished = beam_state.finished

2. 把输出词典中的每个词加到现有的每个假设句子prefix上,计算这一步的每个新假设句子的logP (这个logP的维度应该是[beam_width, vocab_size])

    probs = tf.nn.log_softmax(logits)

    probs = mask_probs(probs, config.eos_token, previously_finished)

    total_probs = tf.expand_dims(beam_state.log_probs, 1) + probs

3. 由于本轮可能有些句子遇到EOS终止,这边更新一下每个新假设句子的长度

    lengths_to_add = tf.one_hot( [config.eos_token] * config.beam_width, 

                                                    config.vocab_size, 0, 1)

    add_mask = (1 - tf.to_int32( previously_finished ) )

    lengths_to_add = tf.expand_dims( add_mask, 1 ) * lengths_to_add

    new_prediction_lengths = tf.expand_dims( prediction_lengths,1 ) + lengths_to_add

4. 用上面讲的带玄学惩罚的算分函数算分

    scores = hyp_score( log_probs=total_probs, sequence_lengths=new_prediction_lengths,

                                      config=config )

    # score的维度[beam_size, vocab_size]

    scores_flat = tf.reshape(scores, [-1])

    #在树顶的时候,只考虑第一个假设

    scores_flat = tf.cond(tf.convert_to_tensor(time_) > 0, lambda: scores_flat, 

                                                                                         lambda: scores[0])

5. 根据算出来的分以及一个选词函数successors选出这个位置预测的词们

    next_beam_scores, word_indices = config.choose_successors_fn(scores_flat, config)

    next_beam_scores.set_shape([config.beam_width])

    word_indices.set_shape([config.beam_width])

6. 得到本轮预测词语们(也就是选好保留哪些branchs)之后,找到他们对应的logP啊,本轮对应句子是否结束之类的

    total_probs_flat = tf.reshape(total_probs, [-1], name="total_probs_flat&#​34;)

    next_beam_probs = tf.gather(total_probs_flat, word_indices)

    next_beam_probs.set_shape([config.beam_width])

    next_word_ids = tf.mod(word_indices, config.vocab_size)

    next_beam_ids = tf.div(word_indices, config.vocab_size)

    # Append new ids to current predictions

    next_finished = tf.logical_or(

        tf.gather(beam_state.finished, next_beam_ids),

        tf.equal(next_word_ids, config.eos_token))

7. 根据选中的branchs,更新句子长度:被选中的而还没有结束的branch 长度加1

    lengths_to_add = tf.to_int32(tf.not_equal(next_word_ids, config.eos_token))

    lengths_to_add = (1 - tf.to_int32(next_finished)) * lengths_to_add

    next_prediction_len = tf.gather(beam_state.lengths, next_beam_ids)

    next_prediction_len += lengths_to_add

8.  打包返回进入下一步计算

    next_state = BeamSearchState(

        log_probs=next_beam_probs,

        lengths=next_prediction_len,

        finished=next_finished)

    output = BeamSearchStepOutput(

        scores=next_beam_scores,

        predicted_ids=next_word_ids,

        beam_parent_ids=next_beam_ids)

P.S. 原文里面beam size B用的2声称就够好了,然而发现pointer generator模型中用到了4

感兴趣的话找机会讲pointer generator这个模型,可以生成重写型的文本摘要(不是摘取式)


Reference: 

Seq-to-seq github https://github.com/google/seq2seq 

Seq-to-seq论文 https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf

Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation论文 https://arxiv.org/pdf/1609.08144.pdf

Tensorflow github https://github.com/tensorflow/tensorflow


若想了解更多深度学习内容,不妨关注~假装爱读书的Flora​~  

有希望看到的内容也可以告诉我~