Optimizing LLM Inference: Metrics that Matter for Real Time Applications

Authors

  • Sriram Sagi, Duke University, USA.

DOI:

https://doi.org/10.47363/JAICC/2025(4)446

Keywords:

Large Language Models (LLMs), Time to First Token (TTFT), End-to-End Latency, Inter Token Latency (ITL), Tokens Per Second (TPS), Requests Per Second (RPS), Inference Benchmarking, Latency Optimization

Abstract

Large Language Models (LLMs) such as GPT, LLaMA, Claude, and Gemini are increasingly deployed in real-time applications that demand both low-latency and high-throughput inference. Moving these models from research into production systems requires rigorous evaluation of their inference performance. This paper provides a detailed analysis of six performance metrics commonly used to evaluate LLM inference: Time to First Token (TTFT), Generation Time, End-to-End Latency (e2e_latency), Inter Token Latency (ITL), Tokens Per Second (TPS), and Requests Per Second (RPS). Through descriptive analysis and empirical observations, the paper examines the definitions, practical implications, and interrelationships of these metrics. Our research demonstrates the trade-offs between responsiveness and throughput and shows that different applications require different metrics for optimization. The work offers practical guidance to researchers, engineers, and system architects who want to evaluate or enhance LLM systems in latency-critical and shared infrastructure environments.
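
The metric definitions listed in the abstract can be sketched directly from per-request token timestamps. The snippet below is an illustrative minimal sketch, not the paper's implementation; the `RequestTrace` structure and function names are assumptions for demonstration, assuming each metric follows its conventional definition (e.g., TTFT as the gap between request submission and first token arrival).

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    """Hypothetical per-request trace: timestamps in seconds."""
    request_start: float          # when the request was submitted
    token_times: list[float]      # arrival time of each generated token

def ttft(trace: RequestTrace) -> float:
    # Time to First Token: first token arrival minus request start
    return trace.token_times[0] - trace.request_start

def e2e_latency(trace: RequestTrace) -> float:
    # End-to-End Latency: last token arrival minus request start
    return trace.token_times[-1] - trace.request_start

def generation_time(trace: RequestTrace) -> float:
    # Generation Time: time spent streaming tokens after the first
    return trace.token_times[-1] - trace.token_times[0]

def itl(trace: RequestTrace) -> float:
    # Inter Token Latency: mean gap between consecutive tokens
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return sum(gaps) / len(gaps)

def tps(trace: RequestTrace) -> float:
    # Tokens Per Second for a single request
    return len(trace.token_times) / e2e_latency(trace)

def rps(completed_requests: int, window_seconds: float) -> float:
    # Requests Per Second over an observation window (system throughput)
    return completed_requests / window_seconds
```

Under these conventional definitions, a request whose tokens arrive at 0.2 s, 0.3 s, 0.4 s, and 0.5 s after submission has a TTFT of 0.2 s, an end-to-end latency of 0.5 s, and a mean ITL of roughly 0.1 s, which makes the responsiveness/throughput trade-off the abstract describes concrete.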

Author Biography

  • Sriram Sagi, Duke University, USA.

Published

2025-01-16

How to Cite

Optimizing LLM Inference: Metrics that Matter for Real Time Applications. (2025). Journal of Artificial Intelligence & Cloud Computing, 4(1), 1-4. https://doi.org/10.47363/JAICC/2025(4)446
