Scenario 3: Generation-Heavy Benchmarks in Generative AI
The generation-heavy scenario covers use cases where the model's response is much longer than the prompt, for example, generating a long job description from a short bullet list of requirements.
The generation-heavy scenario is run with the following token lengths:
- The prompt length is fixed at 100 tokens.
- The response length is fixed at 1,000 tokens.
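As a rough illustration of this workload, the following is a minimal sketch of a benchmark driver for one concurrency level. It is not the harness used to produce the numbers below: `generate` is a hypothetical stand-in for a call to the hosted model's inference endpoint, and for brevity it sends a single wave of requests per level rather than a sustained load.

```python
import concurrent.futures
import time

RESPONSE_TOKENS = 1000  # fixed response length in this scenario

def generate(prompt: str) -> str:
    """Hypothetical stand-in for a call to the hosted model's inference
    endpoint; replace with your actual client call."""
    time.sleep(0.1)  # simulated network + inference time
    return "tok " * RESPONSE_TOKENS

def timed_request(prompt: str) -> float:
    start = time.perf_counter()
    generate(prompt)
    return time.perf_counter() - start  # request-level latency, seconds

def benchmark(concurrency: int) -> None:
    prompt = "word " * 100  # ~100-token prompt, per the scenario
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        wall_start = time.perf_counter()
        latencies = list(pool.map(timed_request, [prompt] * concurrency))
        wall = time.perf_counter() - wall_start
    mean_latency = sum(latencies) / len(latencies)
    print(f"concurrency {concurrency}:")
    print(f"  request-level latency   {mean_latency:.2f} s")
    print(f"  token-level speed       {RESPONSE_TOKENS / mean_latency:.2f} tokens/s per request")
    print(f"  token-level throughput  {concurrency * RESPONSE_TOKENS / wall:.2f} tokens/s aggregate")
    print(f"  request-level RPM       {concurrency * 60 / wall:.2f}")

for c in (1, 2, 4, 8):
    benchmark(c)
```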
Important
The performance (inference speed, throughput, latency) of a hosting dedicated AI cluster depends on the traffic scenarios going through the model that it's hosting. Traffic scenarios depend on:

1. The number of concurrent requests.
2. The number of tokens in the prompt.
3. The number of tokens in the response.
4. The variance of (2) and (3) across requests.
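As a rough sanity check on how the four reported metrics relate to each other, the back-of-the-envelope relations below reproduce the concurrency-1 row of the first Brazil East table to within a few percent. These relations are approximations, not the benchmark's definitions: they ignore time-to-first-token and batching overhead, so they drift as concurrency grows.

```python
# Approximate relations between the four reported metrics, illustrated with
# the concurrency-1 row of the cohere.command-r-08-2024 table below.
concurrency = 1
inference_speed = 147.84    # token-level inference speed, tokens/second
latency = 8.18              # request-level latency, seconds
response_tokens = 1_000     # fixed response length in this scenario

# Aggregate throughput is roughly one generation stream per concurrent request.
token_throughput = concurrency * inference_speed    # 147.84 vs. 148.54 reported

# Each concurrency slot completes about 60 / latency requests per minute.
rpm = concurrency * 60 / latency                    # 7.33 vs. 7.25 reported

# Pure generation time; the remainder of the 8.18 s latency is mostly prompt
# processing and time-to-first-token.
generation_seconds = response_tokens / inference_speed   # 6.76 s

print(token_throughput, rpm, generation_seconds)
```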
Review the terms used in the hosting dedicated AI cluster benchmarks. For a list of scenarios and their descriptions, see Chat and Text Generation Scenarios. The generation-heavy scenario is benchmarked in the following regions.
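One way you might use these tables, sketched below: given a latency budget, scan a model's rows for the highest concurrency that still meets it. The rows are copied from the cohere.command-r-08-2024 (Small Cohere V2) table in the next section; the budget value is an arbitrary example.

```python
# (concurrency, inference speed, throughput, latency s, RPM) rows copied from
# the cohere.command-r-08-2024 (Small Cohere V2) table below.
rows = [
    (1, 147.84, 148.54, 8.18, 7.25),
    (8, 128.71, 923.73, 9.73, 43.55),
    (64, 95.98, 4124.24, 13.42, 186.47),
    (256, 40.02, 6973.92, 35.71, 305.09),
]

latency_budget_s = 15.0  # example budget
in_budget = [r for r in rows if r[3] <= latency_budget_s]
best = max(in_budget, key=lambda r: r[0])  # highest concurrency within budget
print(f"concurrency {best[0]}: {best[4]} RPM at {best[3]} s per request")
# -> concurrency 64: 186.47 RPM at 13.42 s per request
```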
Brazil East (Sao Paulo)
- Model: cohere.command-r-08-2024 (Cohere Command R 08-2024), hosted on one Small Cohere V2 unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 147.84 | 148.54 | 8.18 | 7.25 |
  | 2 | 146.96 | 292.45 | 10.59 | 11.16 |
  | 4 | 139.14 | 520.57 | 8.46 | 26.20 |
  | 8 | 128.71 | 923.73 | 9.73 | 43.55 |
  | 16 | 122.33 | 1,631.48 | 10.76 | 73.30 |
  | 32 | 114.14 | 2,586.64 | 12.99 | 102.60 |
  | 64 | 95.98 | 4,124.24 | 13.42 | 186.47 |
  | 128 | 69.06 | 6,366.06 | 19.24 | 285.92 |
  | 256 | 40.02 | 6,973.92 | 35.71 | 305.09 |

- Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024), hosted on one Large Cohere V2_2 unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 132.10 | 131.90 | 16.12 | 3.70 |
  | 2 | 130.10 | 256.33 | 15.61 | 7.62 |
  | 4 | 125.23 | 495.22 | 17.36 | 13.61 |
  | 8 | 111.15 | 832.88 | 18.74 | 23.87 |
  | 16 | 104.75 | 1,375.51 | 21.45 | 36.61 |
  | 32 | 100.82 | 2,974.72 | 21.65 | 81.76 |
  | 64 | 79.67 | 4,635.15 | 26.36 | 131.98 |
  | 128 | 60.49 | 6,290.61 | 37.00 | 171.76 |
  | 256 | 31.69 | 7,010.75 | 62.48 | 196.58 |

- Model: meta.llama-3.2-90b-vision-instruct (Meta Llama 3.2 90B Vision, text input only), hosted on one Large Generic V2 unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 49.15 | 48.33 | 20.37 | 2.90 |
  | 2 | 48.73 | 96.67 | 20.57 | 2.90 |
  | 4 | 48.17 | 186.67 | 20.85 | 11.20 |
  | 8 | 47.53 | 373.33 | 21.20 | 22.40 |
  | 16 | 46.69 | 720.00 | 21.75 | 43.20 |
  | 32 | 41.65 | 1,279.99 | 24.54 | 76.80 |
  | 64 | 41.92 | 1,279.98 | 47.75 | 76.80 |
  | 128 | 41.93 | 1,279.96 | 91.49 | 76.80 |
  | 256 | 41.88 | 1,279.93 | 166.93 | 76.80 |

- Model: meta.llama-3.2-11b-vision-instruct (Meta Llama 3.2 11B Vision, text input only), hosted on one Small Generic V2 unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 106.36 | 105.00 | 9.41 | 6.30 |
  | 2 | 104.89 | 206.67 | 9.55 | 12.40 |
  | 4 | 101.93 | 400.00 | 9.84 | 24.00 |
  | 8 | 98.89 | 773.33 | 10.17 | 46.40 |
  | 16 | 91.20 | 1,439.99 | 11.07 | 86.40 |
  | 32 | 72.13 | 2,239.98 | 14.03 | 134.40 |
  | 64 | 72.29 | 2,293.30 | 27.49 | 137.60 |
  | 128 | 72.30 | 2,239.89 | 53.75 | 134.39 |
  | 256 | 72.27 | 2,239.84 | 102.37 | 134.39 |

- Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 405B), hosted on one Large Generic 4 unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 31.28 | 26.55 | 18.50 | 3.24 |
  | 2 | 30.79 | 50.88 | 16.14 | 7.12 |
  | 4 | 29.46 | 93.36 | 18.15 | 12.09 |
  | 8 | 28.20 | 170.20 | 19.40 | 21.40 |
  | 16 | 26.37 | 271.80 | 17.73 | 40.56 |
  | 32 | 25.24 | 419.13 | 21.06 | 55.06 |
  | 64 | 22.19 | 755.43 | 24.38 | 98.29 |
  | 128 | 17.43 | 1,248.19 | 29.45 | 168.00 |
  | 256 | 11.27 | 1,794.88 | 44.85 | 236.65 |

- Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 70B), hosted on one Large Generic unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 95.37 | 52.01 | 19.56 | 3.07 |
  | 2 | 92.77 | 101.29 | 20.04 | 5.98 |
  | 4 | 91.60 | 191.83 | 20.34 | 11.32 |
  | 8 | 86.83 | 338.87 | 21.51 | 19.97 |
  | 16 | 78.12 | 547.34 | 23.92 | 32.23 |
  | 32 | 64.77 | 1,111.24 | 28.91 | 65.46 |
  | 64 | 50.52 | 1,722.11 | 37.23 | 101.48 |
  | 128 | 31.29 | 2,123.49 | 60.17 | 125.12 |
  | 256 | 14.93 | 2,002.12 | 126.87 | 117.98 |

- Model: meta.llama-3-70b-instruct (Meta Llama 3 70B), hosted on one Large Generic unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 50.18 | 50.14 | 20.43 | 2.94 |
  | 2 | 49.28 | 97.61 | 20.78 | 5.72 |
  | 4 | 48.22 | 186.82 | 21.32 | 10.94 |
  | 8 | 47.20 | 365.89 | 21.75 | 21.43 |
  | 16 | 44.69 | 650.22 | 22.89 | 38.03 |
  | 32 | 37.29 | 989.98 | 27.31 | 58.04 |
  | 64 | 29.53 | 1,621.76 | 32.68 | 95.08 |
  | 128 | 19.17 | 1,784.76 | 53.14 | 104.56 |
  | 256 | 10.79 | 2,271.18 | 94.78 | 133.05 |

- Model: cohere.command-r-16k v1.2 (Cohere Command R), hosted on one Small Cohere V2 unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 47.20 | 50.32 | 3.53 | 16.65 |
  | 2 | 45.06 | 98.42 | 3.61 | 32.48 |
  | 4 | 43.85 | 165.60 | 3.26 | 63.91 |
  | 8 | 40.56 | 292.22 | 3.04 | 133.20 |
  | 16 | 38.35 | 416.13 | 3.61 | 171.22 |
  | 32 | 28.68 | 557.50 | 4.64 | 219.01 |
  | 64 | 15.19 | 613.72 | 9.65 | 171.83 |
  | 128 | 10.74 | 664.11 | 11.67 | 233.87 |
  | 256 | 5.83 | 721.50 | 22.78 | 253.54 |

- Model: cohere.command-r-plus (Cohere Command R+), hosted on one Large Cohere V2 unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 126.40 | 110.90 | 13.07 | 4.57 |
  | 2 | 122.93 | 213.92 | 13.33 | 8.87 |
  | 4 | 117.03 | 403.27 | 15.32 | 15.26 |
  | 8 | 106.11 | 707.45 | 16.86 | 26.78 |
  | 16 | 98.06 | 1,258.94 | 18.22 | 47.94 |
  | 32 | 86.74 | 2,147.82 | 21.04 | 79.38 |
  | 64 | 72.43 | 3,011.59 | 25.50 | 107.48 |
  | 128 | 55.80 | 5,058.49 | 32.38 | 191.22 |
  | 256 | 36.56 | 5,025.93 | 52.34 | 189.68 |
Germany Central (Frankfurt)
- Model: cohere.command-r-08-2024 (Cohere Command R 08-2024), hosted on one Small Cohere V2 unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 147.84 | 148.54 | 8.18 | 7.25 |
  | 2 | 146.96 | 292.45 | 10.59 | 11.16 |
  | 4 | 139.14 | 520.57 | 8.46 | 26.20 |
  | 8 | 128.71 | 923.73 | 9.73 | 43.55 |
  | 16 | 122.33 | 1,631.48 | 10.76 | 73.30 |
  | 32 | 114.14 | 2,586.64 | 12.99 | 102.60 |
  | 64 | 95.98 | 4,124.24 | 13.42 | 186.47 |
  | 128 | 69.06 | 6,366.06 | 19.24 | 285.92 |
  | 256 | 40.02 | 6,973.92 | 35.71 | 305.09 |

- Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024), hosted on one Large Cohere V2_2 unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 132.10 | 131.90 | 16.12 | 3.70 |
  | 2 | 130.10 | 256.33 | 15.61 | 7.62 |
  | 4 | 125.23 | 495.22 | 17.36 | 13.61 |
  | 8 | 111.15 | 832.88 | 18.74 | 23.87 |
  | 16 | 104.75 | 1,375.51 | 21.45 | 36.61 |
  | 32 | 100.82 | 2,974.72 | 21.65 | 81.76 |
  | 64 | 79.67 | 4,635.15 | 26.36 | 131.98 |
  | 128 | 60.49 | 6,290.61 | 37.00 | 171.76 |
  | 256 | 31.69 | 7,010.75 | 62.48 | 196.58 |

- Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 405B), hosted on one Large Generic 4 unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 31.28 | 26.55 | 18.50 | 3.24 |
  | 2 | 30.79 | 50.88 | 16.14 | 7.12 |
  | 4 | 29.46 | 93.36 | 18.15 | 12.09 |
  | 8 | 28.20 | 170.20 | 19.40 | 21.40 |
  | 16 | 26.37 | 271.80 | 17.73 | 40.56 |
  | 32 | 25.24 | 419.13 | 21.06 | 55.06 |
  | 64 | 22.19 | 755.43 | 24.38 | 98.29 |
  | 128 | 17.43 | 1,248.19 | 29.45 | 168.00 |
  | 256 | 11.27 | 1,794.88 | 44.85 | 236.65 |

- Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 70B), hosted on one Large Generic unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 95.37 | 52.01 | 19.56 | 3.07 |
  | 2 | 92.77 | 101.29 | 20.04 | 5.98 |
  | 4 | 91.60 | 191.83 | 20.34 | 11.32 |
  | 8 | 86.83 | 338.87 | 21.51 | 19.97 |
  | 16 | 78.12 | 547.34 | 23.92 | 32.23 |
  | 32 | 64.77 | 1,111.24 | 28.91 | 65.46 |
  | 64 | 50.52 | 1,722.11 | 37.23 | 101.48 |
  | 128 | 31.29 | 2,123.49 | 60.17 | 125.12 |
  | 256 | 14.93 | 2,002.12 | 126.87 | 117.98 |

- Model: meta.llama-3-70b-instruct (Meta Llama 3 70B), hosted on one Large Generic unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 50.18 | 50.14 | 20.43 | 2.94 |
  | 2 | 49.28 | 97.61 | 20.78 | 5.72 |
  | 4 | 48.22 | 186.82 | 21.32 | 10.94 |
  | 8 | 47.20 | 365.89 | 21.75 | 21.43 |
  | 16 | 44.69 | 650.22 | 22.89 | 38.03 |
  | 32 | 37.29 | 989.98 | 27.31 | 58.04 |
  | 64 | 29.53 | 1,621.76 | 32.68 | 95.08 |
  | 128 | 19.17 | 1,784.76 | 53.14 | 104.56 |
  | 256 | 10.79 | 2,271.18 | 94.78 | 133.05 |

- Model: cohere.command-r-16k v1.2 (Cohere Command R), hosted on one Small Cohere V2 unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 47.20 | 50.32 | 3.53 | 16.65 |
  | 2 | 45.06 | 98.42 | 3.61 | 32.48 |
  | 4 | 43.85 | 165.60 | 3.26 | 63.91 |
  | 8 | 40.56 | 292.22 | 3.04 | 133.20 |
  | 16 | 38.35 | 416.13 | 3.61 | 171.22 |
  | 32 | 28.68 | 557.50 | 4.64 | 219.01 |
  | 64 | 15.19 | 613.72 | 9.65 | 171.83 |
  | 128 | 10.74 | 664.11 | 11.67 | 233.87 |
  | 256 | 5.83 | 721.50 | 22.78 | 253.54 |

- Model: cohere.command-r-plus (Cohere Command R+), hosted on one Large Cohere V2 unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 126.40 | 110.90 | 13.07 | 4.57 |
  | 2 | 122.93 | 213.92 | 13.33 | 8.87 |
  | 4 | 117.03 | 403.27 | 15.32 | 15.26 |
  | 8 | 106.11 | 707.45 | 16.86 | 26.78 |
  | 16 | 98.06 | 1,258.94 | 18.22 | 47.94 |
  | 32 | 86.74 | 2,147.82 | 21.04 | 79.38 |
  | 64 | 72.43 | 3,011.59 | 25.50 | 107.48 |
  | 128 | 55.80 | 5,058.49 | 32.38 | 191.22 |
  | 256 | 36.56 | 5,025.93 | 52.34 | 189.68 |
UK South (London)
- Model: cohere.command-r-08-2024 (Cohere Command R 08-2024), hosted on one Small Cohere V2 unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 147.84 | 148.54 | 8.18 | 7.25 |
  | 2 | 146.96 | 292.45 | 10.59 | 11.16 |
  | 4 | 139.14 | 520.57 | 8.46 | 26.20 |
  | 8 | 128.71 | 923.73 | 9.73 | 43.55 |
  | 16 | 122.33 | 1,631.48 | 10.76 | 73.30 |
  | 32 | 114.14 | 2,586.64 | 12.99 | 102.60 |
  | 64 | 95.98 | 4,124.24 | 13.42 | 186.47 |
  | 128 | 69.06 | 6,366.06 | 19.24 | 285.92 |
  | 256 | 40.02 | 6,973.92 | 35.71 | 305.09 |

- Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024), hosted on one Large Cohere V2_2 unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 132.10 | 131.90 | 16.12 | 3.70 |
  | 2 | 130.10 | 256.33 | 15.61 | 7.62 |
  | 4 | 125.23 | 495.22 | 17.36 | 13.61 |
  | 8 | 111.15 | 832.88 | 18.74 | 23.87 |
  | 16 | 104.75 | 1,375.51 | 21.45 | 36.61 |
  | 32 | 100.82 | 2,974.72 | 21.65 | 81.76 |
  | 64 | 79.67 | 4,635.15 | 26.36 | 131.98 |
  | 128 | 60.49 | 6,290.61 | 37.00 | 171.76 |
  | 256 | 31.69 | 7,010.75 | 62.48 | 196.58 |

- Model: meta.llama-3.2-90b-vision-instruct (Meta Llama 3.2 90B Vision, text input only), hosted on one Large Generic V2 unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 49.15 | 48.33 | 20.37 | 2.90 |
  | 2 | 48.73 | 96.67 | 20.57 | 2.90 |
  | 4 | 48.17 | 186.67 | 20.85 | 11.20 |
  | 8 | 47.53 | 373.33 | 21.20 | 22.40 |
  | 16 | 46.69 | 720.00 | 21.75 | 43.20 |
  | 32 | 41.65 | 1,279.99 | 24.54 | 76.80 |
  | 64 | 41.92 | 1,279.98 | 47.75 | 76.80 |
  | 128 | 41.93 | 1,279.96 | 91.49 | 76.80 |
  | 256 | 41.88 | 1,279.93 | 166.93 | 76.80 |

- Model: meta.llama-3.2-11b-vision-instruct (Meta Llama 3.2 11B Vision, text input only), hosted on one Small Generic V2 unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 106.36 | 105.00 | 9.41 | 6.30 |
  | 2 | 104.89 | 206.67 | 9.55 | 12.40 |
  | 4 | 101.93 | 400.00 | 9.84 | 24.00 |
  | 8 | 98.89 | 773.33 | 10.17 | 46.40 |
  | 16 | 91.20 | 1,439.99 | 11.07 | 86.40 |
  | 32 | 72.13 | 2,239.98 | 14.03 | 134.40 |
  | 64 | 72.29 | 2,293.30 | 27.49 | 137.60 |
  | 128 | 72.30 | 2,239.89 | 53.75 | 134.39 |
  | 256 | 72.27 | 2,239.84 | 102.37 | 134.39 |

- Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 405B), hosted on one Large Generic 4 unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 31.28 | 26.55 | 18.50 | 3.24 |
  | 2 | 30.79 | 50.88 | 16.14 | 7.12 |
  | 4 | 29.46 | 93.36 | 18.15 | 12.09 |
  | 8 | 28.20 | 170.20 | 19.40 | 21.40 |
  | 16 | 26.37 | 271.80 | 17.73 | 40.56 |
  | 32 | 25.24 | 419.13 | 21.06 | 55.06 |
  | 64 | 22.19 | 755.43 | 24.38 | 98.29 |
  | 128 | 17.43 | 1,248.19 | 29.45 | 168.00 |
  | 256 | 11.27 | 1,794.88 | 44.85 | 236.65 |

- Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 70B), hosted on one Large Generic unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 95.37 | 52.01 | 19.56 | 3.07 |
  | 2 | 92.77 | 101.29 | 20.04 | 5.98 |
  | 4 | 91.60 | 191.83 | 20.34 | 11.32 |
  | 8 | 86.83 | 338.87 | 21.51 | 19.97 |
  | 16 | 78.12 | 547.34 | 23.92 | 32.23 |
  | 32 | 64.77 | 1,111.24 | 28.91 | 65.46 |
  | 64 | 50.52 | 1,722.11 | 37.23 | 101.48 |
  | 128 | 31.29 | 2,123.49 | 60.17 | 125.12 |
  | 256 | 14.93 | 2,002.12 | 126.87 | 117.98 |

- Model: meta.llama-3-70b-instruct (Meta Llama 3 70B), hosted on one Large Generic unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 50.18 | 50.14 | 20.43 | 2.94 |
  | 2 | 49.28 | 97.61 | 20.78 | 5.72 |
  | 4 | 48.22 | 186.82 | 21.32 | 10.94 |
  | 8 | 47.20 | 365.89 | 21.75 | 21.43 |
  | 16 | 44.69 | 650.22 | 22.89 | 38.03 |
  | 32 | 37.29 | 989.98 | 27.31 | 58.04 |
  | 64 | 29.53 | 1,621.76 | 32.68 | 95.08 |
  | 128 | 19.17 | 1,784.76 | 53.14 | 104.56 |
  | 256 | 10.79 | 2,271.18 | 94.78 | 133.05 |

- Model: cohere.command-r-16k v1.2 (Cohere Command R), hosted on one Small Cohere V2 unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 47.20 | 50.32 | 3.53 | 16.65 |
  | 2 | 45.06 | 98.42 | 3.61 | 32.48 |
  | 4 | 43.85 | 165.60 | 3.26 | 63.91 |
  | 8 | 40.56 | 292.22 | 3.04 | 133.20 |
  | 16 | 38.35 | 416.13 | 3.61 | 171.22 |
  | 32 | 28.68 | 557.50 | 4.64 | 219.01 |
  | 64 | 15.19 | 613.72 | 9.65 | 171.83 |
  | 128 | 10.74 | 664.11 | 11.67 | 233.87 |
  | 256 | 5.83 | 721.50 | 22.78 | 253.54 |

- Model: cohere.command-r-plus (Cohere Command R+), hosted on one Large Cohere V2 unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 126.40 | 110.90 | 13.07 | 4.57 |
  | 2 | 122.93 | 213.92 | 13.33 | 8.87 |
  | 4 | 117.03 | 403.27 | 15.32 | 15.26 |
  | 8 | 106.11 | 707.45 | 16.86 | 26.78 |
  | 16 | 98.06 | 1,258.94 | 18.22 | 47.94 |
  | 32 | 86.74 | 2,147.82 | 21.04 | 79.38 |
  | 64 | 72.43 | 3,011.59 | 25.50 | 107.48 |
  | 128 | 55.80 | 5,058.49 | 32.38 | 191.22 |
  | 256 | 36.56 | 5,025.93 | 52.34 | 189.68 |
US Midwest (Chicago)
- Model: cohere.command-r-08-2024 (Cohere Command R 08-2024), hosted on one Small Cohere V2 unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 147.84 | 148.54 | 8.18 | 7.25 |
  | 2 | 146.96 | 292.45 | 10.59 | 11.16 |
  | 4 | 139.14 | 520.57 | 8.46 | 26.20 |
  | 8 | 128.71 | 923.73 | 9.73 | 43.55 |
  | 16 | 122.33 | 1,631.48 | 10.76 | 73.30 |
  | 32 | 114.14 | 2,586.64 | 12.99 | 102.60 |
  | 64 | 95.98 | 4,124.24 | 13.42 | 186.47 |
  | 128 | 69.06 | 6,366.06 | 19.24 | 285.92 |
  | 256 | 40.02 | 6,973.92 | 35.71 | 305.09 |

- Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024), hosted on one Large Cohere V2_2 unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 132.10 | 131.90 | 16.12 | 3.70 |
  | 2 | 130.10 | 256.33 | 15.61 | 7.62 |
  | 4 | 125.23 | 495.22 | 17.36 | 13.61 |
  | 8 | 111.15 | 832.88 | 18.74 | 23.87 |
  | 16 | 104.75 | 1,375.51 | 21.45 | 36.61 |
  | 32 | 100.82 | 2,974.72 | 21.65 | 81.76 |
  | 64 | 79.67 | 4,635.15 | 26.36 | 131.98 |
  | 128 | 60.49 | 6,290.61 | 37.00 | 171.76 |
  | 256 | 31.69 | 7,010.75 | 62.48 | 196.58 |

- Model: meta.llama-3.2-90b-vision-instruct (Meta Llama 3.2 90B Vision, text input only), hosted on one Large Generic V2 unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 49.15 | 48.33 | 20.37 | 2.90 |
  | 2 | 48.73 | 96.67 | 20.57 | 2.90 |
  | 4 | 48.17 | 186.67 | 20.85 | 11.20 |
  | 8 | 47.53 | 373.33 | 21.20 | 22.40 |
  | 16 | 46.69 | 720.00 | 21.75 | 43.20 |
  | 32 | 41.65 | 1,279.99 | 24.54 | 76.80 |
  | 64 | 41.92 | 1,279.98 | 47.75 | 76.80 |
  | 128 | 41.93 | 1,279.96 | 91.49 | 76.80 |
  | 256 | 41.88 | 1,279.93 | 166.93 | 76.80 |

- Model: meta.llama-3.2-11b-vision-instruct (Meta Llama 3.2 11B Vision, text input only), hosted on one Small Generic V2 unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 106.36 | 105.00 | 9.41 | 6.30 |
  | 2 | 104.89 | 206.67 | 9.55 | 12.40 |
  | 4 | 101.93 | 400.00 | 9.84 | 24.00 |
  | 8 | 98.89 | 773.33 | 10.17 | 46.40 |
  | 16 | 91.20 | 1,439.99 | 11.07 | 86.40 |
  | 32 | 72.13 | 2,239.98 | 14.03 | 134.40 |
  | 64 | 72.29 | 2,293.30 | 27.49 | 137.60 |
  | 128 | 72.30 | 2,239.89 | 53.75 | 134.39 |
  | 256 | 72.27 | 2,239.84 | 102.37 | 134.39 |

- Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 405B), hosted on one Large Generic 4 unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 31.28 | 26.55 | 18.50 | 3.24 |
  | 2 | 30.79 | 50.88 | 16.14 | 7.12 |
  | 4 | 29.46 | 93.36 | 18.15 | 12.09 |
  | 8 | 28.20 | 170.20 | 19.40 | 21.40 |
  | 16 | 26.37 | 271.80 | 17.73 | 40.56 |
  | 32 | 25.24 | 419.13 | 21.06 | 55.06 |
  | 64 | 22.19 | 755.43 | 24.38 | 98.29 |
  | 128 | 17.43 | 1,248.19 | 29.45 | 168.00 |
  | 256 | 11.27 | 1,794.88 | 44.85 | 236.65 |

- Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 70B), hosted on one Large Generic unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 95.37 | 52.01 | 19.56 | 3.07 |
  | 2 | 92.77 | 101.29 | 20.04 | 5.98 |
  | 4 | 91.60 | 191.83 | 20.34 | 11.32 |
  | 8 | 86.83 | 338.87 | 21.51 | 19.97 |
  | 16 | 78.12 | 547.34 | 23.92 | 32.23 |
  | 32 | 64.77 | 1,111.24 | 28.91 | 65.46 |
  | 64 | 50.52 | 1,722.11 | 37.23 | 101.48 |
  | 128 | 31.29 | 2,123.49 | 60.17 | 125.12 |
  | 256 | 14.93 | 2,002.12 | 126.87 | 117.98 |

- Model: meta.llama-3-70b-instruct (Meta Llama 3 70B), hosted on one Large Generic unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 30.53 | 30.51 | 33.58 | 1.79 |
  | 2 | 29.78 | 59.01 | 34.42 | 3.45 |
  | 4 | 28.88 | 112.35 | 35.48 | 6.58 |
  | 8 | 27.67 | 215.18 | 36.99 | 12.61 |
  | 16 | 24.85 | 364.06 | 40.99 | 21.34 |
  | 32 | 20.51 | 552.34 | 49.60 | 32.35 |
  | 64 | 16.12 | 900.39 | 59.36 | 52.72 |
  | 128 | 10.17 | 980.45 | 100.27 | 57.43 |
  | 256 | 6.30 | 1,334.59 | 162.08 | 78.19 |

- Model: cohere.command-r-16k v1.2 (Cohere Command R), hosted on one Small Cohere V2 unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 47.20 | 50.32 | 3.53 | 16.65 |
  | 2 | 45.06 | 98.42 | 3.61 | 32.48 |
  | 4 | 43.85 | 165.60 | 3.26 | 63.91 |
  | 8 | 40.56 | 292.22 | 3.04 | 133.20 |
  | 16 | 38.35 | 416.13 | 3.61 | 171.22 |
  | 32 | 28.68 | 557.50 | 4.64 | 219.01 |
  | 64 | 15.19 | 613.72 | 9.65 | 171.83 |
  | 128 | 10.74 | 664.11 | 11.67 | 233.87 |
  | 256 | 5.83 | 721.50 | 22.78 | 253.54 |

- Model: cohere.command-r-plus (Cohere Command R+), hosted on one Large Cohere V2 unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 126.40 | 110.90 | 13.07 | 4.57 |
  | 2 | 122.93 | 213.92 | 13.33 | 8.87 |
  | 4 | 117.03 | 403.27 | 15.32 | 15.26 |
  | 8 | 106.11 | 707.45 | 16.86 | 26.78 |
  | 16 | 98.06 | 1,258.94 | 18.22 | 47.94 |
  | 32 | 86.74 | 2,147.82 | 21.04 | 79.38 |
  | 64 | 72.43 | 3,011.59 | 25.50 | 107.48 |
  | 128 | 55.80 | 5,058.49 | 32.38 | 191.22 |
  | 256 | 36.56 | 5,025.93 | 52.34 | 189.68 |

- Model: cohere.command (Cohere Command 52B), hosted on one Large Cohere unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 35.78 | 33.43 | 10.98 | 5.33 |
  | 8 | 31.41 | 99.67 | 13.87 | 16.61 |
  | 32 | 28.49 | 237.10 | 19.48 | 40.24 |
  | 128 | 23.01 | 326.93 | 53.13 | 54.89 |

- Model: cohere.command-light (Cohere Command Light 6B), hosted on one Small Cohere unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 80.38 | 83.61 | 9.19 | 6.34 |
  | 8 | 45.96 | 278.91 | 13.89 | 22.46 |
  | 32 | 23.90 | 493.78 | 27.34 | 41.13 |
  | 128 | 5.12 | 565.06 | 82.15 | 44.89 |

- Model: meta.llama-2-70b-chat (Meta Llama 2 70B), hosted on one Llama2 70 unit of a dedicated AI cluster

  | Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
  | --- | --- | --- | --- | --- |
  | 1 | 18.12 | 17.58 | 21.44 | 2.72 |
  | 8 | 15.96 | 64.28 | 26.83 | 8.91 |
  | 32 | 13.72 | 195.48 | 29.43 | 27.99 |
  | 128 | 8.61 | 541.75 | 48.50 | 71.52 |