Scenario 5: Text Embeddings Benchmarks in Generative AI
The text embedding scenario mimics embedding generation as part of the data-ingestion pipeline of a vector database.
This scenario applies only to embedding models. Every request has the same size: 96 documents of 512 tokens each. An example is a collection of large PDF files, each with 30,000+ words, that a user wants to ingest into a vector database.
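The request shape above (96 documents per request, 512 tokens per document) can be sketched as a simple batching step. This is an illustrative sketch, not the benchmark harness: token counting is approximated by whitespace word count, and the helper names are hypothetical; a real pipeline would use the model's tokenizer.

```python
# Sketch of the request shape used in this scenario: batches of 96
# documents, 512 tokens each. Tokens are approximated by whitespace
# words here; helper names are hypothetical.

DOCS_PER_REQUEST = 96   # fixed request size in the benchmark
TOKENS_PER_DOC = 512    # fixed document length in the benchmark

def chunk(items, size):
    """Split a list into fixed-size chunks (last chunk may be shorter)."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def build_requests(text):
    """Turn one large text (e.g. an extracted PDF) into embedding requests."""
    docs = [" ".join(words) for words in chunk(text.split(), TOKENS_PER_DOC)]
    # Group the documents into batches of 96, one batch per request.
    return chunk(docs, DOCS_PER_REQUEST)

# A 30,000-word file yields ceil(30000 / 512) = 59 documents -> 1 request.
requests = build_requests("word " * 30000)
print(len(requests), len(requests[0]))  # -> 1 59
```

A 30,000+ word file therefore fits in a single benchmark-sized request; larger collections simply produce more 96-document batches.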
Review the terms used in the dedicated AI cluster hosting benchmarks. For a list of scenarios and their descriptions, see Text Embedding Scenarios. The text embedding scenario was run in the following regions.
Brazil East (Sao Paulo)
- Model: cohere.embed-english-v3.0, hosted on one Embed Cohere unit of a dedicated AI cluster

  | Concurrency | Request-level latency (seconds) | Request-level throughput (requests per minute, RPM) |
  |---|---|---|
  | 1 | 2.53 | 24 |
  | 8 | 4.35 | 108 |
  | 32 | 14.93 | 120 |
  | 128 | 47.66 | 150 |

- Model: cohere.embed-multilingual-v3.0, hosted on one Embed Cohere unit of a dedicated AI cluster

  | Concurrency | Request-level latency (seconds) | Request-level throughput (requests per minute, RPM) |
  |---|---|---|
  | 1 | 2.25 | 24 |
  | 8 | 4.33 | 120 |
  | 32 | 14.94 | 144 |
  | 128 | 49.21 | 198 |
Germany Central (Frankfurt)
- Model: cohere.embed-english-v3.0, hosted on one Embed Cohere unit of a dedicated AI cluster

  | Concurrency | Request-level latency (seconds) | Request-level throughput (requests per minute, RPM) |
  |---|---|---|
  | 1 | 2.53 | 24 |
  | 8 | 4.35 | 108 |
  | 32 | 14.93 | 120 |
  | 128 | 47.66 | 150 |

- Model: cohere.embed-multilingual-v3.0, hosted on one Embed Cohere unit of a dedicated AI cluster

  | Concurrency | Request-level latency (seconds) | Request-level throughput (requests per minute, RPM) |
  |---|---|---|
  | 1 | 2.25 | 24 |
  | 8 | 4.33 | 120 |
  | 32 | 14.94 | 144 |
  | 128 | 49.21 | 198 |
UK South (London)
- Model: cohere.embed-english-v3.0, hosted on one Embed Cohere unit of a dedicated AI cluster

  | Concurrency | Request-level latency (seconds) | Request-level throughput (requests per minute, RPM) |
  |---|---|---|
  | 1 | 2.53 | 24 |
  | 8 | 4.35 | 108 |
  | 32 | 14.93 | 120 |
  | 128 | 47.66 | 150 |

- Model: cohere.embed-multilingual-v3.0, hosted on one Embed Cohere unit of a dedicated AI cluster

  | Concurrency | Request-level latency (seconds) | Request-level throughput (requests per minute, RPM) |
  |---|---|---|
  | 1 | 2.25 | 24 |
  | 8 | 4.33 | 120 |
  | 32 | 14.94 | 144 |
  | 128 | 49.21 | 198 |
US Midwest (Chicago)
- Model: cohere.embed-english-v3.0, hosted on one Embed Cohere unit of a dedicated AI cluster

  | Concurrency | Request-level latency (seconds) | Request-level throughput (requests per minute, RPM) |
  |---|---|---|
  | 1 | 2.53 | 24 |
  | 8 | 4.35 | 108 |
  | 32 | 14.93 | 120 |
  | 128 | 47.66 | 150 |

- Model: cohere.embed-english-light-v3.0, hosted on one Embed Cohere unit of a dedicated AI cluster

  | Concurrency | Request-level latency (seconds) | Request-level throughput (requests per minute, RPM) |
  |---|---|---|
  | 1 | 1.75 | 30 |
  | 8 | 3.93 | 108 |
  | 32 | 14.44 | 113 |
  | 128 | 48.00 | 120 |

- Model: cohere.embed-multilingual-v3.0, hosted on one Embed Cohere unit of a dedicated AI cluster

  | Concurrency | Request-level latency (seconds) | Request-level throughput (requests per minute, RPM) |
  |---|---|---|
  | 1 | 2.25 | 24 |
  | 8 | 4.33 | 120 |
  | 32 | 14.94 | 144 |
  | 128 | 49.21 | 198 |

- Model: cohere.embed-multilingual-light-v3.0, hosted on one Embed Cohere unit of a dedicated AI cluster

  | Concurrency | Request-level latency (seconds) | Request-level throughput (requests per minute, RPM) |
  |---|---|---|
  | 1 | 1.69 | 42 |
  | 8 | 3.80 | 118 |
  | 32 | 14.26 | 126 |
  | 128 | 37.17 | 138 |
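A rough way to read these tables: with C concurrent clients each issuing back-to-back requests of latency L seconds, the ideal request-level throughput is about C × 60 / L requests per minute. The sketch below (not part of the source benchmarks) compares that ideal against the measured RPM for cohere.embed-english-v3.0; measured throughput sits near the ideal at low concurrency and falls somewhat below it at higher concurrency, as expected when requests queue server-side.

```python
# Compare ideal vs. measured request-level throughput for the
# cohere.embed-english-v3.0 rows above. The ideal assumes each of the
# C concurrent clients issues requests back-to-back with no queuing.

def ideal_rpm(concurrency, latency_s):
    """Upper-bound requests per minute at a given concurrency and latency."""
    return concurrency * 60 / latency_s

# (concurrency, request-level latency in seconds, measured RPM)
rows = [(1, 2.53, 24), (8, 4.35, 108), (32, 14.93, 120), (128, 47.66, 150)]

for c, lat, measured in rows:
    print(f"concurrency={c:3d}  ideal≈{ideal_rpm(c, lat):6.1f} RPM  measured={measured} RPM")
```

For example, at concurrency 1 the ideal is 60 / 2.53 ≈ 23.7 RPM, matching the measured 24 RPM; at concurrency 128 the ideal is about 161 RPM versus 150 RPM measured.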