
本文要点:
seq-to-seq 论文作者没提的那些事
Beam Search里面的小心机
大家都知道seq-to-seq,输入是长度为T的sequence,可以输出为长度T‘的sequence,两者不需要相等,使得机器翻译更上一层楼。它由两个LSTM型的RNN模型组成,一个是encoder,一个是decoder。
大家都烂熟于心了,所以今天不讲seq-to-seq论文中已经讲明白的事情,而是讲讲它没细提但是默默在github里面实现的东西,以及不加入训练却能提高inference准确率的神奇的beam search:)
seq-to-seq 论文作者没提的那些事
第一、固定长度的向量具体指LSTM的什么
论文中说用多个LSTM层当encoder,将输入句子浓缩成一个固定长度的向量,再将这个向量作为输入,放进decoder让它生成输出句子
原话:The idea is to use one LSTM to read the input sequence, one timestep at a time, to obtain large fixeddimensional vector representation, and then to use another LSTM to extract the output sequence from that vector (fig. 1)
论文中从头到尾没说这个固定长度的向量到底是啥,看代码知道:
encoder类返回的是(outputs, states)的tuple,这边outputs指的是每一个timestep的h_t, states的话假设使用LSTM cell指的是最后一个timestep做完之后的(h_t, c_t) tuple
第二、桥梁这个类
论文中并没有来得及提到而实际github中能发现:
encoder输出的向量V_e,和decoder输入的向量V_d,假设V_e维度是[batch size, m], V_d维度是[batch size, n],m和n是可以不一样的,使用一个叫桥梁的类,可以实现不同的逻辑,将V_e转化为V_d
https://github.com/google/seq2seq/blob/master/seq2seq/models/bridges.py#L41-L77
这个类的两个输入:
encoder_outputs: A namedtuple 上面已经讲了这具体是啥
decoder_state_size: decoder的维度 也就是上面说的n
桥梁有三种:
ZeroBridge(不管encoder妈妈给我吃什么我都只考0分,返回[batch, n]的零矩阵)
PassThroughBridge(不管encoder给我吃什么都 原样递给decoder,前提是m=n)
InitialStateBridge(一个FC层也就是个[m, n]的矩阵, 把[Batch, m]乘成[Batch, n])https://github.com/google/seq2seq/blob/master/seq2seq/models/bridges.py#L112-L166
Beam Search里面的小心机
论文中说为了提高infer时的准确率,只在训练结束需要预测的时候使用的beam search
什么是beam search呢?
原文就用了三句话讲它,其中一句:We search for the most likely translation using a simple left-to-right beam search decoder which maintains a small number B of partial hypotheses, where a partial hypothesis is a prefix of some translation.
想象一下:假设不用beam search,我们预测时,在每个timestep,只保留概率最大的那个词作为这个位置上我们预测会出现的词,每一个timestep都这么做,直到遇到EOS(也就是预测到这是句尾)或者到达我们提前设置好的最大句子长度。
这么做有点像DFS,从树顶上开始,每一层找到最大的P往下走,走到最底下就返回结果
那么beam search就有点像加了限制的BFS,当Beam width B=2时,每一层保留前两个最有可能的prefix带着往下走,直到到达最大句子长度(词语个数或者说time step)或者中间遇到EOS(也就是预测到这是句尾)
聚焦到这个函数基本上就知道bs在干嘛:
https://github.com/google/seq2seq/blob/master/seq2seq/inference/beam_search.py#L143
def choose_top_k(scores_flat, config):
"""Chooses the top-k beams as successors.
"""
next_beam_scores, word_indices = tf.nn.top_k(scores_flat, k=config.beam_width)
return next_beam_scores, word_indices
感觉是通过考虑更多的可能,来提升预测准确率的一个算法
再看看beam search相关的2个类,方便我们理解接下来的函数:
class BeamSearchState(
namedtuple("BeamSearchState", ["log_probs", "finished", "lengths"])):
"""State for a single step of beam search.
这个类代表一个timestep里面的状态,含三个属性:
log_probs 直到本timestep,每一个timestep中所有可能出现的词语的P
finished 本timestep哪些树枝不会继续向下长,已经遇到EOS或者max
lengths 本timestep每个树枝的prefix的长度(也就是走到现在的词语个数)
class BeamSearchStepOutput(
namedtuple("BeamSearchStepOutput",
["scores", "predicted_ids", "beam_parent_ids"])):
这个类代表一个timestep完成后的输出,含三个属性:
scores 本timestep每个备选树枝的打分分数
predicted_ids 本timestep选中的前B个词语
beam_parent_ids 被选中的词语们的父亲们,用来最后追溯,完整回复出走的这个路径
最后我们来看看seq-to-seq悄咪咪干啥了:
论文中没有提到,这个beam search每一轮的prefix句子其实不是直接按照概率选出来的前N。
github代码中根据另一篇论文Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation(https://arxiv.org/pdf/1609.08144.pdf),对裸概率做了处理,根据不同长度的prefix进行了一个标准化的操作,对过短的输出句子给出了惩罚。
以下是该论文中的原话:
We introduce two important refinements to the pure max-probability based beam search algorithm: a coverage penalty and length normalization. With length normalization, we aim to account for the fact that we have to compare hypotheses of different length. Without some form of length-normalizationregular beam search will favor shorter results over longer ones on average since a negative log-probability is added at each step, yielding lower (more negative) scores for longer sentences
意思就是不标准化输出prefix的长度的话,越长的句子logP越低,越不容易被选上
他们使用了一个函数来做这个惩罚:

α ∈ [0.6 − 0.7]
看看代码:
def length_penalty(sequence_lengths, penalty_factor):
Args:
sequence_lengths: The sequence length of all hypotheses, a tensor
of shape [beam_size, vocab_size].
penalty_factor: A scalar that weights the length penalty.
Returns:
The length penalty factor, a tensor fo shape [beam_size].
"""
return tf.div((5. + tf.to_float(sequence_lengths))**penalty_factor, (5. + 1.)**penalty_factor)
是不是闻到了玄学调参的味道?
def hyp_score(log_probs, sequence_lengths, config):
"""Calculates scores for beam search hypotheses.
"""
length_penality_ = length_penalty( sequence_lengths=sequence_lengths,
penalty_factor=config.length_penalty_weight)
score = log_probs / length_penality_
return score
将logP用length_penalty标准化一下
最最后,我们通过代码实现来看看beam search的一个timestep具体怎么走?
https://github.com/google/seq2seq/blob/master/seq2seq/inference/beam_search.py#L196-L279
1. 取出上一步的每个假设的句子长度和每个句子是否已经结束的信息
prediction_lengths = beam_state.lengths
previously_finished = beam_state.finished
2. 把输出词典中的每个词加到现有的每个假设句子prefix上,计算这一步的每个新假设句子的logP (这个logP的维度应该是[beam_width, vocab_size])
probs = tf.nn.log_softmax(logits)
probs = mask_probs(probs, config.eos_token, previously_finished)
total_probs = tf.expand_dims(beam_state.log_probs, 1) + probs
3. 由于本轮可能有些句子遇到EOS终止,这边更新一下每个新假设句子的长度
lengths_to_add = tf.one_hot( [config.eos_token] * config.beam_width,
config.vocab_size, 0, 1)
add_mask = (1 - tf.to_int32( previously_finished ) )
lengths_to_add = tf.expand_dims( add_mask, 1 ) * lengths_to_add
new_prediction_lengths = tf.expand_dims( prediction_lengths,1 ) + lengths_to_add
4. 用上面讲的带玄学惩罚的算分函数算分
scores = hyp_score( log_probs=total_probs, sequence_lengths=new_prediction_lengths,
config=config )
# score的维度[beam_size, vocab_size]
scores_flat = tf.reshape(scores, [-1])
#在树顶的时候,只考虑第一个假设
scores_flat = tf.cond(tf.convert_to_tensor(time_) > 0, lambda: scores_flat,
lambda: scores[0])
5. 根据算出来的分以及一个选词函数successors选出这个位置预测的词们
next_beam_scores, word_indices = config.choose_successors_fn(scores_flat, config)
next_beam_scores.set_shape([config.beam_width])
word_indices.set_shape([config.beam_width])
6. 得到本轮预测词语们(也就是选好保留哪些branchs)之后,找到他们对应的logP啊,本轮对应句子是否结束之类的
total_probs_flat = tf.reshape(total_probs, [-1], name="total_probs_flat")
next_beam_probs = tf.gather(total_probs_flat, word_indices)
next_beam_probs.set_shape([config.beam_width])
next_word_ids = tf.mod(word_indices, config.vocab_size)
next_beam_ids = tf.div(word_indices, config.vocab_size)
# Append new ids to current predictions
next_finished = tf.logical_or(
tf.gather(beam_state.finished, next_beam_ids),
tf.equal(next_word_ids, config.eos_token))
7. 根据选中的branchs,更新句子长度:被选中的而还没有结束的branch 长度加1
lengths_to_add = tf.to_int32(tf.not_equal(next_word_ids, config.eos_token))
lengths_to_add = (1 - tf.to_int32(next_finished)) * lengths_to_add
next_prediction_len = tf.gather(beam_state.lengths, next_beam_ids)
next_prediction_len += lengths_to_add
8. 打包返回进入下一步计算
next_state = BeamSearchState(
log_probs=next_beam_probs,
lengths=next_prediction_len,
finished=next_finished)
output = BeamSearchStepOutput(
scores=next_beam_scores,
predicted_ids=next_word_ids,
beam_parent_ids=next_beam_ids)
P.S. 原文里面beam size B用的2声称就够好了,然而发现pointer generator模型中用到了4
感兴趣的话找机会讲pointer generator这个模型,可以生成重写型的文本摘要(不是摘取式)
Reference:
Seq-to-seq github https://github.com/google/seq2seq
Seq-to-seq论文 https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf
Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation论文 https://arxiv.org/pdf/1609.08144.pdf
Tensorflow github https://github.com/tensorflow/tensorflow
若想了解更多深度学习内容,不妨关注~假装爱读书的Flora~
有希望看到的内容也可以告诉我~