Machine Translation: How Does Google Translate Handle Almost Every Language? (Part 4) (original full text: 13,204 Chinese characters)

# Tokenize the output sentences (output language)
output_tokenizer = Tokenizer(num_words=MAX_NUM_WORDS, filters='')
output_tokenizer.fit_on_texts(output_sentences + output_sentences_inputs)
output_integer_seq = output_tokenizer.texts_to_sequences(output_sentences)
output_input_integer_seq = output_tokenizer.texts_to_sequences(output_sentences_inputs)
print(output_input_integer_seq)

# Word-to-index dictionary for the output language
word2idx_outputs = output_tokenizer.word_index
print('Total unique words in the output: %s' % len(word2idx_outputs))

# +1 because index 0 is reserved for padding
num_words_output = len(word2idx_outputs) + 1
max_out_len = max(len(sen) for sen in output_integer_seq)
print("Length of longest sentence in the output: %g" % max_out_len)

Output:
Total unique words in the output: 9511
Length of longest sentence in the output: 12

The lengths of the longest sentences in both languages can now be checked against the histogram above. The histogram also shows that the English sentences are generally shorter: on average they contain fewer words than their French translations.
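To make the tokenization step more concrete, here is a minimal pure-Python sketch of the two things the tokenizer is used for above: building a word-to-integer dictionary and converting sentences into integer sequences. It is not Keras' implementation. The helper names `fit_word_index` and `texts_to_sequences` and the toy French sentences are illustrative assumptions, and details such as the `num_words` cap and character filters are left out.

```python
from collections import Counter

def fit_word_index(texts):
    """Build a word -> integer dictionary (hypothetical helper, not the Keras API).

    More frequent words get smaller indices. Numbering starts at 1,
    which leaves 0 free to be used as the padding value later.
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace every known word with its integer index."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

# Toy data, assumed for illustration only
sentences = ["je suis ici <eos>", "je pars <eos>"]

word_index = fit_word_index(sentences)
sequences = texts_to_sequences(sentences, word_index)

num_words = len(word_index) + 1               # +1 for the reserved index 0
max_len = max(len(seq) for seq in sequences)  # longest sentence, in tokens

print(word_index)
print(sequences)
print(num_words, max_len)
```

The last two values play the same roles as `num_words_output` and `max_out_len` in the article's code.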
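Next, the inputs need to be padded. Both inputs and outputs are padded because sentences in the text vary in length, while the long short-term memory (LSTM) network expects all input sequences to have the same length. The sentences therefore have to be converted into fixed-length vectors, and padding is one way to do that.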
# Padding the encoder input
encoder_input_sequences = pad_sequences(input_integer_seq, maxlen=max_input_len)
print("encoder_input_sequences.shape:", encoder_input_sequences.shape)

# Padding the decoder inputs
decoder_input_sequences = pad_sequences(output_input_integer_seq, maxlen=max_out_len, padding='post')
print("decoder_input_sequences.shape:", decoder_input_sequences.shape)

# Padding the decoder outputs
decoder_output_sequences = pad_sequences(output_integer_seq, maxlen=max_out_len, padding='post')
print("decoder_output_sequences.shape:", decoder_output_sequences.shape)

Output:
encoder_input_sequences.shape: (20000, 6)
decoder_input_sequences.shape: (20000, 12)
decoder_output_sequences.shape: (20000, 12)

The input contains 20,000 sentences (English), each padded to a length of 6, so the input has shape (20000, 6). Likewise, the output contains 20,000 sentences (French), each padded to a length of 12, so the output has shape (20000, 12). The decoder inputs and the decoder outputs both have this shape.
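The effect of padding can be reproduced with a few lines of plain Python. The sketch below is a simplified stand-in for `pad_sequences`, not its actual implementation: the function name `pad` and the toy sequences are assumptions made for illustration, and only the `maxlen` and `padding` behaviour is imitated.

```python
def pad(sequences, maxlen, padding="pre"):
    """Pad every sequence with zeros up to maxlen (simplified, hypothetical helper).

    padding="pre"  puts the zeros in front of the tokens,
    padding="post" puts the zeros after the tokens.
    Sequences longer than maxlen keep only their last maxlen tokens.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]
        zeros = [0] * (maxlen - len(seq))
        padded.append(zeros + seq if padding == "pre" else seq + zeros)
    return padded

# Toy integer sequences of different lengths, assumed for illustration
toy_sequences = [[5, 8], [3, 1, 4]]

pre_padded = pad(toy_sequences, maxlen=4)                    # encoder-style padding
post_padded = pad(toy_sequences, maxlen=4, padding="post")   # decoder-style padding

print(pre_padded)
print(post_padded)
print((len(pre_padded), len(pre_padded[0])))  # (number of sentences, maxlen)
```

After padding, every row has the same length, which is why the arrays in the article end up with shapes such as (20000, 6) and (20000, 12).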
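As you may recall, the original sentence at index 180 is "join us". The tokenizer splits it into the two words join and us and converts them to integers. The integer sequence for the sentence at index 180 of the input list is then pre-padded: four zeros are added at the start of the sequence.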
print("encoder_input_sequences[180]:", encoder_input_sequences[180])

Output:
encoder_input_sequences[180]: [  0   0   0   0 464  59]

To verify that the integer values of join and us are 464 and 59 respectively, look the words up in the word2idx_inputs dictionary, as shown below:
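print(word2idx_inputs["join"])
print(word2idx_inputs["us"])

Output:
464
59

It is worth noting that the decoder uses post-padding instead, which means the zeros are added at the end of the sentence, whereas for the encoder the zeros are placed at the beginning. The reason is that the encoder's output is based on the words that appear at the end of the sentence, so the original words are kept at the end and the zeros go in front. The decoder, on the other hand, processes a sentence from its start, so post-padding is applied to the decoder inputs and outputs.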
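The lookup can also be run in the opposite direction to confirm that padding does not change which words a sequence holds. The sketch below uses only the two dictionary entries quoted in the text (join is 464 and us is 59). The dictionary `toy_word2idx`, the helper `decode`, and the post-padded variant are assumptions added for illustration, not part of the article's code.

```python
# Only the two entries mentioned in the text; the real dictionary is much larger
toy_word2idx = {"join": 464, "us": 59}

# Invert the dictionary so integers can be mapped back to words
toy_idx2word = {idx: word for word, idx in toy_word2idx.items()}

def decode(padded_sequence):
    """Map integers back to words, skipping the padding value 0 (hypothetical helper)."""
    return [toy_idx2word[i] for i in padded_sequence if i != 0]

pre_padded = [0, 0, 0, 0, 464, 59]    # encoder style: real words end the sequence
post_padded = [464, 59, 0, 0, 0, 0]   # decoder style: real words start the sequence

print(decode(pre_padded))
print(decode(post_padded))
```

Both variants decode to the same two words. The only difference is where the zeros sit: in front of the words for the encoder, and after them for the decoder.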
Word Embeddings
[Illustration. Image source: Unsplash]
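Before words can be fed to a deep learning model, they must first be converted into numeric vector representations. We have already converted the words into integers. So what is the difference between this integer representation and word embeddings?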