Blazing Fast! ByteDance Open-Sources LightSeq, a Sequence Inference Engine (Part 4)


2. Cache refreshing accounts for 10% and 6% respectively, also a sizable share, but it is difficult to optimize further. Going forward, the amount of cache could be reduced (e.g., by using fewer decoder layers or lower cache precision) to cut latency further; see the first sketch after this list.
3. The remaining operations account for 8% and 6% in total, including Layer Normalization, beam search, and GPU memory reads/writes of intermediate results; a simplified LayerNorm kernel sketch also follows this list.
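To make the "lower cache precision" idea in point 2 concrete, here is a minimal CUDA sketch, not LightSeq's actual code, of appending one decoding step's freshly computed self-attention key/value activations to a half-precision cache instead of a float one, roughly halving the bytes moved on every cache refresh. The kernel name `append_kv_cache_fp16` and the flat per-step cache layout are hypothetical.

```cuda
// Hypothetical sketch: store the decoder's key/value cache in fp16 instead of
// fp32 to halve cache-refresh memory traffic. Not LightSeq's actual kernel.
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Append one decoding step's fp32 keys/values to a persistent fp16 cache.
// n = batch_size * num_heads * head_dim for this step; each step occupies
// one contiguous slab of n elements in the cache.
__global__ void append_kv_cache_fp16(const float* __restrict__ new_kv,
                                     __half* __restrict__ kv_cache,
                                     int step, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    kv_cache[(size_t)step * n + i] = __float2half(new_kv[i]);
  }
}
```

The trade-off is a small precision loss on the cached keys and values in exchange for half the memory traffic, which is usually acceptable at inference time.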
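For the Layer Normalization mentioned in point 3 (LightSeq's own kernel is linked in [11]), the sketch below is a simplified illustrative re-implementation, not the referenced kernel: one thread block normalizes one token, and the mean/variance reductions, normalization, and scale-and-shift are fused into a single kernel so that intermediate results stay in registers and shared memory instead of being written back to GPU memory. It assumes the block size is a power of two with at most 1024 threads.

```cuda
// Illustrative fused LayerNorm kernel (simplified; not the kernel in [11]).
#include <cuda_runtime.h>

__global__ void layer_norm(const float* __restrict__ x,
                           const float* __restrict__ gamma,
                           const float* __restrict__ beta,
                           float* __restrict__ out,
                           int hidden_dim, float eps) {
  // One thread block normalizes one token (one row of [tokens, hidden_dim]).
  const float* row_in = x + (size_t)blockIdx.x * hidden_dim;
  float* row_out = out + (size_t)blockIdx.x * hidden_dim;

  __shared__ float s_buf[1024];  // assumes blockDim.x <= 1024, power of two
  __shared__ float s_mean, s_var;

  // Block-wide sum for the mean.
  float local = 0.f;
  for (int i = threadIdx.x; i < hidden_dim; i += blockDim.x) local += row_in[i];
  s_buf[threadIdx.x] = local;
  __syncthreads();
  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride) s_buf[threadIdx.x] += s_buf[threadIdx.x + stride];
    __syncthreads();
  }
  if (threadIdx.x == 0) s_mean = s_buf[0] / hidden_dim;
  __syncthreads();

  // Block-wide sum of squared deviations for the variance.
  local = 0.f;
  for (int i = threadIdx.x; i < hidden_dim; i += blockDim.x) {
    float d = row_in[i] - s_mean;
    local += d * d;
  }
  s_buf[threadIdx.x] = local;
  __syncthreads();
  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride) s_buf[threadIdx.x] += s_buf[threadIdx.x + stride];
    __syncthreads();
  }
  if (threadIdx.x == 0) s_var = s_buf[0] / hidden_dim;
  __syncthreads();

  // Normalize, scale, and shift in the same kernel to avoid extra memory trips.
  float inv_std = rsqrtf(s_var + eps);
  for (int i = threadIdx.x; i < hidden_dim; i += blockDim.x)
    row_out[i] = (row_in[i] - s_mean) * inv_std * gamma[i] + beta[i];
}
```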
The profiling visualization shows that LightSeq has already been optimized close to the limit, greatly improving inference speed.
Links:
GitHub repository:
[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
[2] Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
[3] Brown, Tom B., et al. "Language models are few-shot learners." arXiv preprint arXiv:2005.14165 (2020).
[4] WMT2020,
[5] Li, Jiwei, Will Monroe, and Dan Jurafsky. "A simple, fast diverse decoding algorithm for neural generation." arXiv preprint arXiv:1611.08562 (2016).
[6] TurboTransformers,
[7] FasterTransformer,
[8] NVIDIA Triton Inference Server,
[9] LightSeq proto, /tree/master/proto
[10] LightSeq performance benchmark report, /blob/master/docs/performance.md
[11] LightSeq Layer Normalization, /blob/master/kernels/transformerKernels.cu.cc#L269
[12] cuBLAS,
[13] GPT-2, "Language Models are Unsupervised Multitask Learners."