Concurrency: the number of users that make requests at the same time.
Metric 1: Token-level Inference Speed
token/second
This metric is defined as the number of output tokens generated per unit of end-to-end latency.
For applications that need to match the average human reading speed, focus on scenarios where the inference speed is at least 5 tokens/second, the approximate average human reading speed.
Other scenarios, such as dialog/chatbot use cases, require faster, near real-time token generation, for example 15 tokens/second. In these scenarios, the number of concurrent users that can be served is lower, and the overall throughput is also lower.
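As a rough illustration of how this per-request metric could be computed, here is a minimal Python sketch. The CompletedRequest record and its field names are hypothetical, not the benchmark's actual data model.

    from dataclasses import dataclass

    @dataclass
    class CompletedRequest:
        """Hypothetical record of one finished request (illustrative only)."""
        num_output_tokens: int       # tokens generated in the response
        end_to_end_latency_s: float  # seconds from submission to the last token

    def token_inference_speed(req: CompletedRequest) -> float:
        """Output tokens generated per unit of end-to-end latency (tokens/second)."""
        return req.num_output_tokens / req.end_to_end_latency_s

    # Example: a 300-token response that took 20 seconds end to end runs at
    # 15 tokens/second, well above the ~5 tokens/second human reading speed.
    print(token_inference_speed(CompletedRequest(300, 20.0)))  # 15.0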
Metric 2: Token-level Throughput
token/second
This metric quantifies the average total number of tokens the server generates per second across all simultaneous user requests. It provides an aggregate measure of the server's capacity and efficiency in serving requests across users.
When inference speed is less critical, such as in offline batch processing tasks, the focus should be on where throughput peaks and therefore where server cost efficiency is highest. Peak throughput indicates the LLM's capacity to handle a large number of concurrent requests, which is ideal for batch processing or background tasks where an immediate response is not essential.
Note: The token-level throughput benchmark was run using the LLMPerf tool. Its throughput computation has a known issue: it includes the time required to encode the generated text when counting tokens.
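A minimal sketch of how aggregate token-level throughput could be estimated over a benchmark window, reusing the hypothetical CompletedRequest record from the sketch above. Unlike the LLMPerf computation noted above, this version does not fold text-encoding time into the measurement window.

    def token_level_throughput(requests: list, window_start_s: float,
                               window_end_s: float) -> float:
        """Total tokens generated across all concurrent requests, divided by
        the wall-clock duration of the benchmark window (tokens/second)."""
        total_tokens = sum(r.num_output_tokens for r in requests)
        return total_tokens / (window_end_s - window_start_s)

    # Example: 40 concurrent requests of about 300 output tokens each, all
    # completed within a 60-second window, give roughly 200 tokens/second.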
Metric 3: Request-level Latency
second
Average time elapsed between request submission and completion of the request, that is, after the last token of the response has been generated.
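A sketch of the request-level latency computation, again using the hypothetical CompletedRequest record, where end_to_end_latency_s is measured from submission to the last generated token.

    def request_level_latency(requests: list) -> float:
        """Average seconds from request submission until the last token is generated."""
        return sum(r.end_to_end_latency_s for r in requests) / len(requests)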
Metric 4: Request-level throughput (RPM)
request/minute
The number of requests served per unit time, in this case per minute.
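Request-level throughput can be sketched the same way; the completed-request count and window duration are assumed to come from the load generator's bookkeeping.

    def requests_per_minute(num_completed_requests: int, window_s: float) -> float:
        """Requests served per minute (RPM) over a benchmark window."""
        return num_completed_requests * 60.0 / window_s

    # Example: 120 requests completed over a 600-second window -> 12 RPM.
    print(requests_per_minute(120, 600.0))  # 12.0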
Important
The performance (inference speed, throughput, latency) of a hosting dedicated AI cluster depends on the traffic scenarios going through the model that it's hosting. Traffic scenarios depend on:
This scenario mimics text generation use cases where the sizes of the prompt and response are unknown ahead of time.
Because the prompt and response lengths are unknown, we've used a stochastic approach in which both lengths follow a normal distribution (a sampling sketch follows this list):
The prompt length follows a normal distribution with a mean of 480 tokens and a standard deviation of 240 tokens.
The response length follows a normal distribution with a mean of 300 tokens and a standard deviation of 150 tokens.
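One way to reproduce this traffic pattern in a load generator is to sample each request's lengths from the stated distributions. The clamping to a minimum of 1 token is an assumption, not something the benchmark specifies.

    import random

    def sample_request_lengths() -> tuple[int, int]:
        """Sample (prompt_tokens, response_tokens) for one simulated request:
        prompt ~ N(480, 240), response ~ N(300, 150), clamped to >= 1 token
        (the clamp is an assumption to avoid non-positive lengths)."""
        prompt_tokens = max(1, round(random.gauss(480, 240)))
        response_tokens = max(1, round(random.gauss(300, 150)))
        return prompt_tokens, response_tokens

    # Example: draw lengths for 1,000 simulated requests.
    workload = [sample_request_lengths() for _ in range(1000)]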
This scenario is for generation-heavy use cases where the model response dominates, for example, a long job description generated from a short bullet list of items. For this case, we set the following token lengths:
Scenario 5 applies only to the embedding models. It mimics embedding generation as part of the data ingestion pipeline of a vector database.
In this scenario, all requests are the same size: 96 documents, each with 512 tokens. An example is a collection of large PDF files, each with 30,000+ words, that a user wants to ingest into a vector database.
The lighter embeddings scenario is similar to scenario 5, except that each request is reduced to 16 documents, each with 512 tokens. Scenario 6 supports smaller files with fewer words.
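A sketch of how request payloads for the two embedding scenarios could be constructed. The function name is hypothetical, and the repeated placeholder word only approximates a 512-token document.

    def build_embedding_request(num_documents: int, tokens_per_document: int = 512) -> list[str]:
        """Build one benchmark request as a list of equally sized documents.
        The repeated placeholder word only approximates the token count;
        real ingestion would chunk source files into ~512-token passages."""
        return ["token " * tokens_per_document for _ in range(num_documents)]

    heavy_request = build_embedding_request(96)   # scenario 5: 96 docs of 512 tokens each
    light_request = build_embedding_request(16)   # scenario 6: 16 docs of 512 tokens each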