Blazing Fast! ByteDance Open-Sources LightSeq, a Sequence Inference Engine (Part 4)


2. Cache refreshing accounts for 10% and 6% respectively, also a sizable share, but it is difficult to optimize further. Going forward, the amount of cache could be reduced (e.g., by using fewer decoder layers or lower cache precision) to cut latency further; see the first sketch after this list.
3. The remaining operations account for 8% and 6% in total, including Layer Normalization, beam search, and GPU memory reads/writes of intermediate results; a simplified LayerNorm kernel sketch also follows this list.
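To make the "lower cache precision" idea in point 2 concrete, here is a minimal CUDA sketch, not LightSeq's actual code, of appending one decoding step's freshly computed self-attention key/value activations to a half-precision cache instead of a float one, roughly halving the bytes moved on every cache refresh. The kernel name `append_kv_cache_fp16` and the flat per-step cache layout are hypothetical.

```cuda
// Hypothetical sketch: store the decoder's key/value cache in fp16 instead of
// fp32 to halve cache-refresh memory traffic. Not LightSeq's actual kernel.
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Append one decoding step's fp32 keys/values to a persistent fp16 cache.
// n = batch_size * num_heads * head_dim for this step; each step occupies
// one contiguous slab of n elements in the cache.
__global__ void append_kv_cache_fp16(const float* __restrict__ new_kv,
                                     __half* __restrict__ kv_cache,
                                     int step, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    kv_cache[(size_t)step * n + i] = __float2half(new_kv[i]);
  }
}
```

The trade-off is a small precision loss on the cached keys and values in exchange for half the memory traffic, which is usually acceptable at inference time.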
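For the Layer Normalization mentioned in point 3 (LightSeq's own kernel is linked in [11]), the sketch below is a simplified illustrative re-implementation, not the referenced kernel: one thread block normalizes one token, and the mean/variance reductions, normalization, and scale-and-shift are fused into a single kernel so that intermediate results stay in registers and shared memory instead of being written back to GPU memory. It assumes the block size is a power of two with at most 1024 threads.

```cuda
// Illustrative fused LayerNorm kernel (simplified; not the kernel in [11]).
#include <cuda_runtime.h>

__global__ void layer_norm(const float* __restrict__ x,
                           const float* __restrict__ gamma,
                           const float* __restrict__ beta,
                           float* __restrict__ out,
                           int hidden_dim, float eps) {
  // One thread block normalizes one token (one row of [tokens, hidden_dim]).
  const float* row_in = x + (size_t)blockIdx.x * hidden_dim;
  float* row_out = out + (size_t)blockIdx.x * hidden_dim;

  __shared__ float s_buf[1024];  // assumes blockDim.x <= 1024, power of two
  __shared__ float s_mean, s_var;

  // Block-wide sum for the mean.
  float local = 0.f;
  for (int i = threadIdx.x; i < hidden_dim; i += blockDim.x) local += row_in[i];
  s_buf[threadIdx.x] = local;
  __syncthreads();
  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride) s_buf[threadIdx.x] += s_buf[threadIdx.x + stride];
    __syncthreads();
  }
  if (threadIdx.x == 0) s_mean = s_buf[0] / hidden_dim;
  __syncthreads();

  // Block-wide sum of squared deviations for the variance.
  local = 0.f;
  for (int i = threadIdx.x; i < hidden_dim; i += blockDim.x) {
    float d = row_in[i] - s_mean;
    local += d * d;
  }
  s_buf[threadIdx.x] = local;
  __syncthreads();
  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride) s_buf[threadIdx.x] += s_buf[threadIdx.x + stride];
    __syncthreads();
  }
  if (threadIdx.x == 0) s_var = s_buf[0] / hidden_dim;
  __syncthreads();

  // Normalize, scale, and shift in the same kernel to avoid extra memory trips.
  float inv_std = rsqrtf(s_var + eps);
  for (int i = threadIdx.x; i < hidden_dim; i += blockDim.x)
    row_out[i] = (row_in[i] - s_mean) * inv_std * gamma[i] + beta[i];
}
```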
The profiling visualization shows that LightSeq has already been optimized close to the limit, greatly improving inference speed.
Links:
GitHub repository:
[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
[2] Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
[3] Brown, Tom B., et al. "Language models are few-shot learners." arXiv preprint arXiv:2005.14165 (2020).
[4] WMT2020,
[5] Li, Jiwei, Will Monroe, and Dan Jurafsky. "A simple, fast diverse decoding algorithm for neural generation." arXiv preprint arXiv:1611.08562 (2016).
[6] TurboTransformers,
[7] FasterTransformer,
[8] NVIDIA Triton Inference Server,
[9] LightSeq proto, /tree/master/proto
[10] LightSeq performance benchmark report, /blob/master/docs/performance.md
[11] LightSeq Layer Normalization, /blob/master/kernels/transformerKernels.cu.cc#L269
[12] cuBLAS,
[13] GPT-2, "Language Models are Unsupervised Multitask Learners."