Dedicated AI Cluster Performance Benchmarks in Generative AI
Review the hosting dedicated AI cluster benchmarks in OCI Generative AI.
- Review the terms used in the dedicated AI cluster performance benchmarks.
- To see the hosting dedicated AI cluster benchmarks, review each scenario in the Chat and Text Generation Scenarios and Text Embedding Scenarios sections.
Performance Benchmark Terms
Term | Unit | Definition |
---|---|---|
Concurrency | (number) | Number of users that make requests at the same time. |
Metric 1: Token-level Inference Speed | token/second | This metric is defined as the number of output tokens generated per unit of end-to-end latency. For applications that need to match the average human reading speed, focus on scenarios where the speed is 5 tokens/second or more, which is the average human reading speed. In other scenarios that require faster, near real-time token generation, such as 15 tokens/second (for example, dialog/chatbot use cases), the number of concurrent users that can be served is lower, and the overall throughput is lower. |
Metric 2: Token-level Throughput | token/second | This metric quantifies the average total number of tokens generated by the server across all simultaneous user requests. It provides an aggregate measure of server capacity and efficiency to serve requests across users. When inference speed is less critical, such as in offline batch processing tasks, focus on where throughput peaks and therefore server cost efficiency is highest. This indicates the LLM's capacity to handle a high number of concurrent requests, which is ideal for batch processing or background tasks where an immediate response is not essential. Note: The token-level throughput benchmark was done using the LLMPerf tool. The throughput computation has an issue: it includes the time required to encode the generated text for token computation. |
Metric 3: Request-level Latency | second | Average time elapsed between request submission and completion of the request, that is, after the last token of the request is generated. |
Metric 4: Request-level Throughput (RPM) | request/minute | The number of requests served per unit time, in this case per minute. |
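To make the metric definitions above concrete, the following sketch computes all four metrics from a list of per-request timing records. The `RequestRecord` fields and the `summarize` helper are assumptions for illustration only, not the output format of LLMPerf or any other benchmarking tool.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class RequestRecord:
    # Hypothetical per-request benchmark record; field names are assumptions.
    submitted_at: float   # seconds, wall-clock time the request was sent
    completed_at: float   # seconds, wall-clock time the last token arrived
    output_tokens: int    # number of tokens the model generated

def summarize(records: list[RequestRecord]) -> dict[str, float]:
    """Compute the four benchmark metrics from a list of completed requests."""
    latencies = [r.completed_at - r.submitted_at for r in records]
    wall_clock = max(r.completed_at for r in records) - min(r.submitted_at for r in records)
    total_tokens = sum(r.output_tokens for r in records)
    return {
        # Metric 1: output tokens per second of end-to-end latency, averaged per request
        "token_level_inference_speed": mean(
            r.output_tokens / (r.completed_at - r.submitted_at) for r in records
        ),
        # Metric 2: total tokens generated across all requests per second of wall-clock time
        "token_level_throughput": total_tokens / wall_clock,
        # Metric 3: average request latency in seconds
        "request_level_latency": mean(latencies),
        # Metric 4: requests served per minute (RPM)
        "request_level_throughput_rpm": len(records) / wall_clock * 60,
    }
```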
The performance (inference speed, throughput, latency) of a hosting dedicated AI cluster depends on the traffic scenarios going through the model that it's hosting. Traffic scenarios depend on the following factors, illustrated in the sketch after this list:
- The number of concurrent requests.
- The number of tokens in the prompt.
- The number of tokens in the response.
- The variance of the number of prompt and response tokens across requests.
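A traffic scenario can be described as a small sampler over these factors, as in the sketch below. The parameter values are illustrative placeholders only; the benchmarks' actual per-scenario token-length settings are not reproduced here.

```python
import random
from dataclasses import dataclass

@dataclass
class TrafficScenario:
    # Illustrative parameters only; not the published benchmark settings.
    concurrency: int       # number of simultaneous users
    prompt_mean: float     # mean prompt length in tokens
    prompt_std: float      # std dev of prompt length (variance across requests)
    response_mean: float   # mean response length in tokens
    response_std: float    # std dev of response length

    def sample_request(self) -> tuple[int, int]:
        """Draw one (prompt_tokens, response_tokens) pair from normal distributions."""
        prompt = max(1, int(random.gauss(self.prompt_mean, self.prompt_std)))
        response = max(1, int(random.gauss(self.response_mean, self.response_std)))
        return prompt, response

# Example: a hypothetical "unknown lengths" scenario with wide variance.
scenario = TrafficScenario(concurrency=8, prompt_mean=500, prompt_std=250,
                           response_mean=250, response_std=125)
requests = [scenario.sample_request() for _ in range(scenario.concurrency)]
```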
Chat and Text Generation Scenarios
Scenario | Description |
---|---|
Scenario 1: Random Prompt/Response Lengths | This scenario mimics text generation use cases where the sizes of the prompt and response are unknown ahead of time. Because the prompt and response lengths are unknown, we use a stochastic approach where both the prompt and response lengths follow a normal distribution. |
Scenario 2: Retrieval-Augmented Generation (RAG) | The RAG scenario has a very long prompt and a short response. This scenario also mimics summarization use cases. |
Scenario 3: Generation Heavy | This scenario is for generation-heavy (model response heavy) use cases, for example, a long job description generated from a short bullet list of items. For this case, we set specific token lengths for the prompt and response. |
Scenario 4: Chatbot/Dialog | This scenario covers chatbot/dialog use cases where the prompts and responses are shorter. |
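To exercise one of these scenarios at a given concurrency level, a benchmark driver issues requests from a fixed pool of simultaneous workers. The sketch below is not the LLMPerf implementation: `send_chat_request` is a hypothetical stand-in for a real client call against the hosting cluster, and the measured result is request-level throughput (RPM).

```python
import time
from concurrent.futures import ThreadPoolExecutor

def send_chat_request(prompt_tokens: int, response_tokens: int) -> int:
    # Hypothetical placeholder for a real client call; it sleeps to simulate
    # generation time (~15 tokens/second) and returns the token count.
    time.sleep(response_tokens / 15.0)
    return response_tokens

def run_at_concurrency(concurrency: int, request_plan: list[tuple[int, int]]) -> float:
    """Issue the planned requests with a fixed number of simultaneous users
    and return the observed request-level throughput in requests per minute."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(lambda pr: send_chat_request(*pr), request_plan))
    elapsed = time.monotonic() - start
    return len(request_plan) / elapsed * 60
```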
Text Embedding Scenarios
Scenario | Description |
---|---|
Scenario 5: Embeddings | Scenario 5 applies only to the embedding models. It mimics embedding generation as part of the data ingestion pipeline of a vector database. In this scenario, all requests are the same size: 96 documents, each with 512 tokens. An example would be a collection of large PDF files, each with 30,000+ words, that a user wants to ingest into a vector database. |
Scenario 6: Lighter Embeddings Workload | The lighter embeddings scenario is similar to scenario 5, except that the size of each request is reduced to 16 documents, each with 512 tokens. Scenario 6 can support smaller files with fewer words. |
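As a rough illustration of how scenario 5 and scenario 6 workloads could be assembled, the sketch below chunks a long document into 512-token pieces and groups them into requests of 96 or 16 documents. The whitespace-based chunking is a placeholder assumption, not the embedding model's real tokenizer.

```python
def chunk_document(text: str, tokens_per_chunk: int = 512) -> list[str]:
    # Placeholder tokenization: split on whitespace. A real pipeline would use
    # the embedding model's own tokenizer to count tokens accurately.
    words = text.split()
    return [" ".join(words[i:i + tokens_per_chunk])
            for i in range(0, len(words), tokens_per_chunk)]

def batch_requests(chunks: list[str], docs_per_request: int) -> list[list[str]]:
    """Group 512-token chunks into fixed-size requests:
    96 documents per request for scenario 5, 16 for scenario 6."""
    return [chunks[i:i + docs_per_request]
            for i in range(0, len(chunks), docs_per_request)]

# Example: ingest one large document's extracted text under each scenario's request size.
chunks = chunk_document(" ".join(["word"] * 30000))  # stand-in for a 30,000+ word PDF
heavy_requests = batch_requests(chunks, 96)          # scenario 5
light_requests = batch_requests(chunks, 16)          # scenario 6
```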