Optimizing LLM Deployments through Inference Backends

Authors

  • Ashish Bansal, USA

DOI:

https://doi.org/10.47363/JAICC/2024(3)E128

Keywords:

Generative AI, LLM Inference, LLM Inference Backends

Abstract

As model size increases, especially in the area of natural language processing, models expand past the number of parameters that can fit in a single GPU's memory, requiring multiple GPUs to run inference without sacrificing performance. But as teams scale across more GPUs, the cost to serve inference scales too, often outpacing what a company can afford and undermining the profitability of these implementations for the organisation. Another consideration is that many LLM-based services run in real time, so low latency is a must to deliver great user experiences.

Teams today need an efficient way to serve inference from LLMs that allows infrastructure to scale across multiple GPUs without breaking the bank. In this paper we discuss different LLM inference backends that can be utilised to serve these LLMs for high performance, low cost, and more robust implementations. The major LLMs known today include billions of parameters. Whether it is GPT-3 from OpenAI, Claude from Anthropic, or Llama from Meta, generative AI models are continuously growing in parameter count, and this brings its own challenges in implementing these technologies for real-world applications and running them efficiently on available hardware.
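
As an illustration of what such a backend looks like in practice, the sketch below shows how an open-source inference backend such as vLLM can serve a model too large for a single GPU by sharding it across several GPUs with tensor parallelism. The model name, GPU count, and sampling settings are illustrative assumptions, not a configuration prescribed by this paper.

```python
# Minimal sketch: serving an LLM with the vLLM inference backend,
# sharding the model across multiple GPUs via tensor parallelism.
# Model name, tensor_parallel_size, and sampling values are assumptions
# chosen for illustration only.
from vllm import LLM, SamplingParams

# Shard the weights across 4 GPUs so a model that does not fit in one
# GPU's memory can still be served with low latency.
llm = LLM(
    model="meta-llama/Llama-2-13b-hf",  # hypothetical example checkpoint
    tensor_parallel_size=4,             # number of GPUs to split across
)

sampling = SamplingParams(temperature=0.7, max_tokens=128)

prompts = ["Explain why multi-GPU inference backends reduce serving cost."]
outputs = llm.generate(prompts, sampling)

for out in outputs:
    print(out.outputs[0].text)
```

Backends of this kind combine model sharding with batching and memory management on the server side, which is why they can keep latency low while spreading a single large model across multiple GPUs.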

Author Biography

  • Ashish Bansal, USA

Published

2024-07-29

How to Cite

Optimizing LLM Deployments through Inference Backends. (2024). Journal of Artificial Intelligence & Cloud Computing, 3(4), 1-4. https://doi.org/10.47363/JAICC/2024(3)E128
