Optimizing LLM Inference: Metrics that Matter for Real Time Applications

Authors

  • Sriram Sagi, Duke University, USA.

DOI:

https://doi.org/10.47363/JAICC/2025(4)446

Keywords:

Large Language Models (LLMs), Time to First Token (TTFT), End-to-End Latency, Inter Token Latency (ITL), Tokens Per Second (TPS), Requests Per Second (RPS), Inference Benchmarking, Latency Optimization

Abstract

Large Language Models (LLMs) such as GPT, LLaMA, Claude, and Gemini are increasingly deployed in real-time applications that demand both low-latency and high-throughput inference. Moving these models from research into production systems requires rigorous evaluation of their inference performance. This paper provides a detailed analysis of six performance metrics commonly used to evaluate LLM inference: Time to First Token (TTFT), Generation Time, End-to-End Latency (e2e_latency), Inter Token Latency (ITL), Tokens Per Second (TPS), and Requests Per Second (RPS). Through descriptive analysis and empirical observations, the paper examines the definitions, practical implications, and interrelationships of these metrics. Our research demonstrates the trade-offs between responsiveness and throughput and shows that different applications require different metrics for optimization. The work offers practical guidance to researchers, engineers, and system architects who want to evaluate or enhance LLM systems in latency-critical and shared infrastructure environments.
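
The metric definitions listed in the abstract can be sketched directly from per-request token timestamps. The snippet below is an illustrative minimal sketch, not the paper's implementation; the `RequestTrace` structure and function names are assumptions for demonstration, assuming each metric follows its conventional definition (e.g., TTFT as the gap between request submission and first token arrival).

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    """Hypothetical per-request trace: timestamps in seconds."""
    request_start: float          # when the request was submitted
    token_times: list[float]      # arrival time of each generated token

def ttft(trace: RequestTrace) -> float:
    # Time to First Token: first token arrival minus request start
    return trace.token_times[0] - trace.request_start

def e2e_latency(trace: RequestTrace) -> float:
    # End-to-End Latency: last token arrival minus request start
    return trace.token_times[-1] - trace.request_start

def generation_time(trace: RequestTrace) -> float:
    # Generation Time: time spent streaming tokens after the first
    return trace.token_times[-1] - trace.token_times[0]

def itl(trace: RequestTrace) -> float:
    # Inter Token Latency: mean gap between consecutive tokens
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return sum(gaps) / len(gaps)

def tps(trace: RequestTrace) -> float:
    # Tokens Per Second for a single request
    return len(trace.token_times) / e2e_latency(trace)

def rps(completed_requests: int, window_seconds: float) -> float:
    # Requests Per Second over an observation window (system throughput)
    return completed_requests / window_seconds
```

Under these conventional definitions, a request whose tokens arrive at 0.2 s, 0.3 s, 0.4 s, and 0.5 s after submission has a TTFT of 0.2 s, an end-to-end latency of 0.5 s, and a mean ITL of roughly 0.1 s, which makes the responsiveness/throughput trade-off the abstract describes concrete.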

Author Biography

  • Sriram Sagi, Duke University, USA.

Published

2025-01-16

How to Cite

Optimizing LLM Inference: Metrics that Matter for Real Time Applications. (2025). Journal of Artificial Intelligence & Cloud Computing, 4(1), 1-4. https://doi.org/10.47363/JAICC/2025(4)446
