
Fairness in Serving Large Language Models


Speaker: Ying Sheng, PhD student, Computer Science Dept., Stanford University
Date: June 5, 2024

High-demand LLM inference services (e.g., ChatGPT and Bard) support a wide range of requests, from short chat conversations to long document reading. To process client requests fairly, most major LLM inference services impose request rate limits so that no single client can dominate the request queue. However, this rudimentary notion of fairness leads to under-utilization of resources and a poor client experience when spare capacity exists. This talk introduces a definition of fairness for LLM serving based on a cost function that accounts for the number of input and output tokens processed. To achieve fairness in serving, we propose a novel scheduling algorithm, the Virtual Token Counter (VTC), a fair scheduler built on the continuous batching mechanism.
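The core idea can be sketched as follows: each client accumulates a virtual counter of the weighted service it has received (input and output tokens), and the scheduler serves the backlogged client with the smallest counter. This is a minimal illustrative sketch only; the class name, methods, cost weights, and the counter-lift rule for newly active clients are assumptions for illustration, not the authors' implementation.

```python
from collections import deque

class VTCScheduler:
    """Sketch of a Virtual Token Counter-style fair scheduler (illustrative)."""

    def __init__(self, w_input=1.0, w_output=2.0):
        self.counters = {}        # client -> virtual counter of service received
        self.queues = {}          # client -> queue of pending requests
        self.w_input = w_input    # cost weight per input token (assumed value)
        self.w_output = w_output  # cost weight per output token (assumed value)

    def submit(self, client, request):
        if client not in self.queues:
            self.queues[client] = deque()
            # Lift a newly seen client's counter to the minimum over currently
            # backlogged clients, so it cannot gain priority simply by arriving
            # late with a zero counter.
            active = [self.counters[c] for c, q in self.queues.items() if q]
            self.counters[client] = min(active, default=0.0)
        self.queues[client].append(request)

    def next_request(self):
        # Serve the backlogged client with the smallest virtual counter.
        backlogged = [c for c, q in self.queues.items() if q]
        if not backlogged:
            return None
        client = min(backlogged, key=lambda c: self.counters[c])
        return client, self.queues[client].popleft()

    def account(self, client, input_tokens, output_tokens):
        # Charge the client for the service it actually consumed.
        self.counters[client] += (self.w_input * input_tokens
                                  + self.w_output * output_tokens)
```

In this sketch, a client that just consumed a long generation accrues a large counter and is temporarily deprioritized, while a lightly served client is picked next, approximating fair sharing without hard rate limits.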

Bio: Ying Sheng is a Ph.D. candidate in Computer Science at Stanford University, advised by Clark Barrett. Her research focuses on building and deploying large language model applications, emphasizing accessibility, efficiency, programmability, and verifiability. As a core member of the LMSYS Org, she has developed influential open models, datasets, systems, and evaluation tools, such as Vicuna, Chatbot Arena, and SGLang. More information about her can be found at https://sites.google.com/view/yingsheng. 
