Towards faster inference of transformers: Strategies for accelerating decoding processes
Pages: 1-72
2024
Metrics Details
- Usage: 138
- Downloads: 82
- Abstract Views: 56
Thesis / Dissertation Description
This thesis investigates the acceleration and optimization of Transformer inference, a subject of growing importance with the rise of Large Language Models (LLMs). The study addresses the two inherent properties of Transformers that dominate inference cost: the quadratic complexity of the attention mechanism and the sequential nature of autoregressive decoding. The research is structured into three parts. The first part enhances the learning capability of non-autoregressive Transformers, achieving a 15.0x speedup on machine translation tasks. The second part focuses on lossless acceleration through speculative decoding, where the proposed algorithm, Glide with CAPE, accelerates 33-billion-parameter LLMs by approximately 2.5x. The third part reduces the complexity of the attention mechanism to a constant via a Markov autoregressive Transformer, without significantly compromising model performance. Together, these contributions tackle the core computational challenges of Transformer inference and pave the way for more efficient deployment of LLMs in real-world applications.
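To make the second part's mechanism concrete, below is a minimal, self-contained sketch of greedy speculative decoding: a cheap draft model proposes a block of tokens, and the expensive target model verifies them, accepting the longest agreeing prefix, so several tokens can be committed per large-model step. This is a generic illustration, not the thesis's Glide with CAPE algorithm; `draft_model` and `target_model` are hypothetical toy stand-ins for real networks, and real implementations verify the whole block in a single parallel forward pass and typically use probabilistic acceptance rather than exact greedy matching.

```python
import random

# Toy vocabulary and models. These are hypothetical stand-ins for a small
# draft network and a large target network; both map a token sequence to
# a single greedy next-token prediction.
VOCAB_SIZE = 8

def draft_model(tokens):
    # Cheap, imperfect predictor: a deterministic pseudo-random guess.
    rng = random.Random(sum(tokens) * 31 + len(tokens))
    return rng.randrange(VOCAB_SIZE)

def target_model(tokens):
    # Expensive, authoritative predictor (a fixed deterministic rule here).
    return (sum(tokens) * 2 + len(tokens)) % VOCAB_SIZE

def speculative_decode(prompt, max_new_tokens=16, k=4):
    """Greedy speculative decoding: the draft proposes k tokens, the target
    verifies them, and the longest agreeing prefix is accepted. In a real
    system the verification is one parallel forward pass of the target."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Draft proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(tokens)
        for _ in range(k):
            nxt = draft_model(ctx)
            proposal.append(nxt)
            ctx.append(nxt)
        # 2. Target checks each proposed position; accept until first mismatch.
        accepted = 0
        for i, nxt in enumerate(proposal):
            if target_model(tokens + proposal[:i]) == nxt:
                accepted += 1
            else:
                break
        tokens.extend(proposal[:accepted])
        # 3. Commit one token from the target itself, so decoding always
        # advances even when the draft is rejected immediately.
        tokens.append(target_model(tokens))
    return tokens[len(prompt):]

if __name__ == "__main__":
    print(speculative_decode([1, 2, 3]))
```

Each loop iteration advances by between 1 and k+1 tokens at roughly the cost of one target-model pass, which is the mechanism behind speedups like the approximately 2.5x reported above.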