Towards faster inference of transformers: Strategies for accelerating decoding processes
Pages: 1-72
2024
Metrics Details
- Usage: 138
- Downloads: 82
- Abstract Views: 56
Thesis / Dissertation Description
This thesis investigates the acceleration and optimization of Transformer inference, a subject of growing importance with the rise of Large Language Models (LLMs). The study addresses the two inherent properties of Transformers that dominate inference cost: the quadratic complexity of the attention mechanism and the sequential nature of autoregressive decoding. The research is structured into three parts. The first part enhances the learning capability of non-autoregressive Transformers, achieving a 15.0x speedup on machine translation tasks. The second part focuses on lossless acceleration through speculative decoding, where the proposed algorithm, Glide with CAPE, accelerates 33-billion-parameter LLMs by approximately 2.5x. The third part reduces the complexity of the attention mechanism to a constant via a Markov autoregressive Transformer, without significantly compromising model performance. Together, these contributions tackle the core computational challenges of Transformer inference and pave the way for more efficient deployment of LLMs in real-world applications.
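To make the second part's mechanism concrete, below is a minimal, self-contained sketch of greedy speculative decoding: a cheap draft model proposes a block of tokens, and the expensive target model verifies them, accepting the longest agreeing prefix, so several tokens can be committed per large-model step. This is a generic illustration, not the thesis's Glide with CAPE algorithm; `draft_model` and `target_model` are hypothetical toy stand-ins for real networks, and real implementations verify the whole block in a single parallel forward pass and typically use probabilistic acceptance rather than exact greedy matching.

```python
import random

# Toy vocabulary and models. These are hypothetical stand-ins for a small
# draft network and a large target network; both map a token sequence to
# a single greedy next-token prediction.
VOCAB_SIZE = 8

def draft_model(tokens):
    # Cheap, imperfect predictor: a deterministic pseudo-random guess.
    rng = random.Random(sum(tokens) * 31 + len(tokens))
    return rng.randrange(VOCAB_SIZE)

def target_model(tokens):
    # Expensive, authoritative predictor (a fixed deterministic rule here).
    return (sum(tokens) * 2 + len(tokens)) % VOCAB_SIZE

def speculative_decode(prompt, max_new_tokens=16, k=4):
    """Greedy speculative decoding: the draft proposes k tokens, the target
    verifies them, and the longest agreeing prefix is accepted. In a real
    system the verification is one parallel forward pass of the target."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Draft proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(tokens)
        for _ in range(k):
            nxt = draft_model(ctx)
            proposal.append(nxt)
            ctx.append(nxt)
        # 2. Target checks each proposed position; accept until first mismatch.
        accepted = 0
        for i, nxt in enumerate(proposal):
            if target_model(tokens + proposal[:i]) == nxt:
                accepted += 1
            else:
                break
        tokens.extend(proposal[:accepted])
        # 3. Commit one token from the target itself, so decoding always
        # advances even when the draft is rejected immediately.
        tokens.append(target_model(tokens))
    return tokens[len(prompt):]

if __name__ == "__main__":
    print(speculative_decode([1, 2, 3]))
```

Each loop iteration advances by between 1 and k+1 tokens at roughly the cost of one target-model pass, which is the mechanism behind speedups like the approximately 2.5x reported above.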