Optimizing LLM Deployments through Inference Backends
DOI: https://doi.org/10.47363/JAICC/2024(3)E128

Keywords: Generative AI, LLM Inference, LLM Inference Backends

Abstract
As model sizes increase, especially in natural language processing, models grow beyond the number of parameters that fit in a single GPU's memory, requiring multiple GPUs to run inference without sacrificing performance. But as teams scale across more GPUs, the cost of serving inference scales too, often outpacing what a company can afford and making these deployments unprofitable. Another consideration is that many LLM-based services run in real time, so low latency is essential for a good user experience.
Teams today need an efficient way to serve inference from LLMs that allows infrastructure to scale across multiple GPUs without breaking the bank. In this paper we discuss different LLM inference backends that can be used to serve these models with high performance, low cost, and more robust implementations. The major LLMs known today contain billions of parameters. Whether it is GPT-3 from OpenAI, Claude from Anthropic, or Llama from Meta, generative AI models continue to grow in parameter count, and this brings its own challenges for deploying these technologies in real-world applications and running them on devices.
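As a concrete illustration of the multi-GPU serving problem described above, the sketch below uses vLLM, one widely used inference backend, to shard a model across GPUs with tensor parallelism. vLLM is chosen here only as an example; the model name and tensor_parallel_size value are placeholder assumptions, not a claim about the specific backends or configurations covered in this paper.

```python
# Minimal sketch: serving an open LLM across multiple GPUs with vLLM.
# The model name and tensor_parallel_size below are illustrative placeholders.
from vllm import LLM, SamplingParams

# tensor_parallel_size=2 shards the model weights across 2 GPUs,
# allowing a model that exceeds one GPU's memory to be served.
llm = LLM(model="meta-llama/Llama-2-13b-hf", tensor_parallel_size=2)

sampling = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain tensor parallelism in one sentence."], sampling)

for out in outputs:
    print(out.outputs[0].text)
```

The key design point is that the backend, not the application code, handles weight sharding, batching, and KV-cache management, which is what makes scaling across GPUs practical without a custom serving stack.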
License
Copyright (c) 2024 Journal of Artificial Intelligence & Cloud Computing

This work is licensed under a Creative Commons Attribution 4.0 International License.