The chatbot scenario covers chatbot and dialog use cases in which prompts and responses are short. The prompt length is fixed at 100 tokens and the response length is fixed at 100 tokens.
Important
The performance (inference speed, throughput, and latency) of a hosting dedicated AI cluster depends on the traffic scenarios going through the model that it's hosting. Traffic scenarios depend on the prompt and response lengths and on the number of concurrent requests.
Model: cohere.command-r-08-2024 (Cohere Command R 08-2024) hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 134.80 | 126.97 | 1.56 | 36.46 |
| 2 | 128.71 | 235.26 | 1.57 | 70.05 |
| 4 | 122.01 | 436.12 | 1.63 | 131.04 |
| 8 | 113.84 | 762.01 | 1.81 | 222.59 |
| 16 | 101.20 | 1,177.66 | 1.99 | 347.43 |
| 32 | 83.96 | 2,021.49 | 2.31 | 610.16 |
| 64 | 64.47 | 3,191.72 | 3.07 | 950.61 |
| 128 | 43.12 | 3,772.60 | 4.92 | 1,120.64 |
| 256 | 21.76 | 4,094.46 | 8.56 | 1,212.42 |
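As a quick sanity check on how the reported metrics relate, the following sketch uses the concurrency-1 row of the Cohere Command R 08-2024 table above. This is an illustrative back-of-the-envelope calculation, not part of the benchmark itself; the inferred token accounting (prompt plus response) is an assumption drawn from the numbers.

```python
# Relate the metrics in the concurrency-1 row of the table above.
# Values are copied from the benchmark; the reasoning is illustrative.

latency_s = 1.56      # request-level latency at concurrency 1
rpm = 36.46           # reported request-level throughput (RPM)
throughput = 126.97   # reported token-level throughput (tokens/second)

# With a single concurrent slot, a new request can start at most every
# latency_s seconds, so RPM is bounded by 60 / latency_s.
rpm_upper_bound = 60.0 / latency_s
assert rpm <= rpm_upper_bound  # 36.46 <= ~38.46: the reported RPM is consistent

# Tokens per request implied by the two throughput figures. A result near
# 200 suggests the metric counts prompt + response tokens (100 + 100 in
# this scenario) rather than response tokens alone.
tokens_per_request = throughput / (rpm / 60.0)
print(round(tokens_per_request))  # 209
```

Small gaps between the bound and the reported value are expected from request ramp-up and measurement overhead.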
Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024) hosted on one Large Cohere V2_2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 94.04 | 87.41 | 1.95 | 29.44 |
| 2 | 88.13 | 163.85 | 1.93 | 58.04 |
| 4 | 86.49 | 315.44 | 2.03 | 108.02 |
| 8 | 80.10 | 550.10 | 2.39 | 171.44 |
| 16 | 70.13 | 861.65 | 2.56 | 288.47 |
| 32 | 62.39 | 1,517.61 | 3.06 | 476.62 |
| 64 | 42.36 | 2,139.38 | 3.76 | 753.58 |
| 128 | 29.22 | 3,137.09 | 5.74 | 1,023.88 |
| 256 | 17.13 | 3,229.42 | 9.78 | 1,117.58 |
Model: meta.llama-3.3-70b-instruct (Meta Llama 3.3 (70B)) hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 53.62 | 52.60 | 1.89 | 31.56 |
| 2 | 52.63 | 102.67 | 1.93 | 61.60 |
| 4 | 53.06 | 205.27 | 1.93 | 123.16 |
| 8 | 52.47 | 394.66 | 1.97 | 236.79 |
| 16 | 49.27 | 715.55 | 2.11 | 429.33 |
| 32 | 42.71 | 1,198.53 | 2.46 | 719.12 |
| 64 | 37.25 | 2,017.51 | 2.90 | 1,210.76 |
| 128 | 28.28 | 2,414.71 | 4.15 | 1,448.83 |
| 256 | 18.26 | 2,576.59 | 7.21 | 1,545.96 |
Model: meta.llama-3.2-90b-vision-instruct (Meta Llama 3.2 90B Vision, text input only) hosted on one Large Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 50.20 | 48.67 | 2.05 | 29.20 |
| 2 | 49.53 | 96.67 | 2.06 | 58.00 |
| 4 | 49.08 | 188.00 | 2.12 | 112.80 |
| 8 | 48.40 | 356.00 | 2.23 | 213.60 |
| 16 | 47.26 | 645.33 | 2.44 | 387.20 |
| 32 | 42.22 | 1,077.33 | 2.90 | 646.40 |
| 64 | 44.95 | 1,162.65 | 5.41 | 697.59 |
| 128 | 44.92 | 1,162.64 | 10.84 | 697.58 |
| 256 | 45.02 | 1,162.21 | 21.58 | 697.32 |
Model: meta.llama-3.2-11b-vision-instruct (Meta Llama 3.2 11B Vision, text input only) hosted on one Small Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 111.04 | 109.67 | 0.91 | 65.80 |
| 2 | 108.57 | 212.33 | 0.91 | 127.40 |
| 4 | 105.67 | 408.00 | 0.91 | 244.80 |
| 8 | 102.65 | 408.00 | 1.02 | 461.60 |
| 16 | 96.48 | 1,370.66 | 1.13 | 822.40 |
| 32 | 78.96 | 2,110.49 | 1.42 | 822.40 |
| 64 | 89.80 | 2,522.64 | 2.41 | 1,513.58 |
| 128 | 89.69 | 2,516.96 | 4.94 | 1,510.17 |
| 256 | 90.27 | 2,517.19 | 9.96 | 1,510.31 |
Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) hosted on one Large Generic 2 unit of a dedicated AI cluster
Important
You can host the meta.llama-3.1-405b-instruct model only on a dedicated AI cluster of type Large Generic 2. This type is intended to provide better throughput with less hardware and at a lower cost than its predecessor, Large Generic 4.
The following tables provide benchmarks for the meta.llama-3.1-405b-instruct model hosted on one Large Generic 2 unit and on one Large Generic 4 unit. If your model is currently hosted on a Large Generic 4 unit, compare the tables to decide whether to move the model to the new unit.

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 27.38 | 26.65 | 3.74 | 15.99 |
| 2 | 26.43 | 51.30 | 3.88 | 30.78 |
| 4 | 25.92 | 100.61 | 3.96 | 60.36 |
| 8 | 25.52 | 196.72 | 4.06 | 118.03 |
| 16 | 21.24 | 328.32 | 4.84 | 196.99 |
| 32 | 19.32 | 588.59 | 5.36 | 353.15 |
| 64 | 16.73 | 1,003.22 | 6.29 | 601.93 |
| 128 | 12.56 | 1,433.27 | 8.59 | 859.96 |
| 256 | 8.60 | 1,586.86 | 8.59 | 952.11 |
Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 28.93 | 21.65 | 4.60 | 13.01 |
| 2 | 31.72 | 50.89 | 3.90 | 30.54 |
| 4 | 30.86 | 91.23 | 4.17 | 54.74 |
| 8 | 29.61 | 163.06 | 4.33 | 97.84 |
| 16 | 27.66 | 277.48 | 4.49 | 166.49 |
| 32 | 26.01 | 615.83 | 4.77 | 369.50 |
| 64 | 22.49 | 1,027.87 | 5.67 | 616.77 |
| 128 | 17.22 | 1,527.06 | 7.37 | 616.77 |
| 256 | 10.67 | 1,882.65 | 11.44 | 1,131.71 |
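The sizing decision described in the note above can be made concrete by comparing the two meta.llama-3.1-405b-instruct tables row by row. A minimal sketch, with throughput values copied from the tables:

```python
# Compare token-level throughput of meta.llama-3.1-405b-instruct on the
# Large Generic 2 (LG2) unit vs the Large Generic 4 (LG4) unit, using the
# benchmark rows above. Illustrative only: your traffic may differ.

concurrency     = [1, 2, 4, 8, 16, 32, 64, 128, 256]
throughput_lg2  = [26.65, 51.30, 100.61, 196.72, 328.32,
                   588.59, 1003.22, 1433.27, 1586.86]
throughput_lg4  = [21.65, 50.89, 91.23, 163.06, 277.48,
                   615.83, 1027.87, 1527.06, 1882.65]

# Concurrency levels at which the newer LG2 unit delivers more throughput.
better_on_lg2 = [c for c, a, b in zip(concurrency, throughput_lg2, throughput_lg4)
                 if a > b]
print(better_on_lg2)  # [1, 2, 4, 8, 16]
```

In these benchmarks, Large Generic 2 leads at concurrency 16 and below, while Large Generic 4 pulls ahead at 32 and above; the cost and hardware savings noted above apply regardless, so weigh both against your expected load.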
Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 (70B)) hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 97.11 | 51.67 | 1.98 | 30.14 |
| 2 | 95.38 | 99.17 | 2.04 | 57.87 |
| 4 | 93.91 | 183.96 | 2.10 | 107.50 |
| 8 | 89.79 | 318.53 | 2.23 | 186.09 |
| 16 | 81.05 | 506.12 | 2.47 | 294.03 |
| 32 | 64.15 | 909.40 | 3.18 | 530.15 |
| 64 | 50.35 | 1,405.67 | 4.08 | 818.96 |
| 128 | 33.59 | 1,786.60 | 6.26 | 1,040.74 |
| 256 | 18.77 | 1,866.83 | 11.43 | 1,086.94 |
Model: meta.llama-3-70b-instruct (Meta Llama 3) hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 52.05 | 52.57 | 1.95 | 30.80 |
| 2 | 50.70 | 100.90 | 2.00 | 59.19 |
| 4 | 49.96 | 192.32 | 2.06 | 112.89 |
| 8 | 47.75 | 369.74 | 2.15 | 216.13 |
| 16 | 44.36 | 643.94 | 2.30 | 377.65 |
| 32 | 36.74 | 982.39 | 2.74 | 576.42 |
| 64 | 31.27 | 1,605.80 | 3.23 | 942.49 |
| 128 | 20.59 | 1,841.44 | 4.96 | 1,082.95 |
| 256 | 11.49 | 2,333.32 | 8.88 | 1,368.63 |
Model: cohere.command-r-16k (Cohere Command R) hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 42.36 | 38.82 | 2.23 | 26.07 |
| 2 | 42.49 | 77.95 | 2.18 | 52.86 |
| 4 | 42.15 | 155.04 | 2.15 | 106.28 |
| 8 | 39.72 | 274.21 | 2.33 | 192.82 |
| 16 | 37.28 | 527.72 | 2.36 | 366.20 |
| 32 | 32.87 | 828.91 | 2.88 | 538.91 |
| 64 | 24.48 | 1,175.93 | 3.40 | 816.00 |
| 128 | 19.21 | 1,522.53 | 5.38 | 1,023.93 |
| 256 | 10.11 | 1,668.07 | 8.49 | 1,127.35 |
Model: cohere.command-r-plus (Cohere Command R+) hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 112.29 | 95.11 | 1.82 | 31.65 |
| 2 | 109.27 | 186.61 | 1.91 | 60.55 |
| 4 | 104.19 | 350.17 | 1.98 | 115.70 |
| 8 | 93.66 | 625.10 | 2.24 | 200.55 |
| 16 | 84.60 | 1,087.14 | 2.46 | 354.44 |
| 32 | 68.80 | 1,718.20 | 2.96 | 557.70 |
| 64 | 53.25 | 2,455.21 | 3.53 | 827.78 |
| 128 | 38.02 | 3,366.97 | 5.48 | 1,113.31 |
| 256 | 25.19 | 3,983.61 | 8.35 | 1,322.15 |
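These tables can also drive a simple capacity check: pick the smallest benchmarked concurrency that sustains a target request rate while keeping latency within budget. The sketch below uses the Cohere Command R+ rows above; the target figures are hypothetical, and benchmark numbers are indicative rather than guarantees.

```python
# Hypothetical capacity check against the Cohere Command R+ benchmark rows
# above (concurrency, request-level latency in seconds, RPM). Real traffic
# with different token counts will behave differently.

rows = [
    (1, 1.82, 31.65), (2, 1.91, 60.55), (4, 1.98, 115.70),
    (8, 2.24, 200.55), (16, 2.46, 354.44), (32, 2.96, 557.70),
    (64, 3.53, 827.78), (128, 5.48, 1113.31), (256, 8.35, 1322.15),
]

def pick_concurrency(target_rpm: float, max_latency_s: float):
    """Smallest benchmarked concurrency meeting the RPM target within the
    latency budget, or None if no row qualifies."""
    for conc, latency, rpm in rows:
        if rpm >= target_rpm and latency <= max_latency_s:
            return conc
    return None

print(pick_concurrency(300, 3.0))  # 16
```

If no row qualifies, the single unit cannot meet the target under this scenario and you would scale the cluster instead.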
Germany Central (Frankfurt)
Model: cohere.command-r-08-2024 (Cohere Command R 08-2024) hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 134.80 | 126.97 | 1.56 | 36.46 |
| 2 | 128.71 | 235.26 | 1.57 | 70.05 |
| 4 | 122.01 | 436.12 | 1.63 | 131.04 |
| 8 | 113.84 | 762.01 | 1.81 | 222.59 |
| 16 | 101.20 | 1,177.66 | 1.99 | 347.43 |
| 32 | 83.96 | 2,021.49 | 2.31 | 610.16 |
| 64 | 64.47 | 3,191.72 | 3.07 | 950.61 |
| 128 | 43.12 | 3,772.60 | 4.92 | 1,120.64 |
| 256 | 21.76 | 4,094.46 | 8.56 | 1,212.42 |
Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024) hosted on one Large Cohere V2_2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 94.04 | 87.41 | 1.95 | 29.44 |
| 2 | 88.13 | 163.85 | 1.93 | 58.04 |
| 4 | 86.49 | 315.44 | 2.03 | 108.02 |
| 8 | 80.10 | 550.10 | 2.39 | 171.44 |
| 16 | 70.13 | 861.65 | 2.56 | 288.47 |
| 32 | 62.39 | 1,517.61 | 3.06 | 476.62 |
| 64 | 42.36 | 2,139.38 | 3.76 | 753.58 |
| 128 | 29.22 | 3,137.09 | 5.74 | 1,023.88 |
| 256 | 17.13 | 3,229.42 | 9.78 | 1,117.58 |
Model: meta.llama-3.3-70b-instruct (Meta Llama 3.3 (70B)) hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 53.62 | 52.60 | 1.89 | 31.56 |
| 2 | 52.63 | 102.67 | 1.93 | 61.60 |
| 4 | 53.06 | 205.27 | 1.93 | 123.16 |
| 8 | 52.47 | 394.66 | 1.97 | 236.79 |
| 16 | 49.27 | 715.55 | 2.11 | 429.33 |
| 32 | 42.71 | 1,198.53 | 2.46 | 719.12 |
| 64 | 37.25 | 2,017.51 | 2.90 | 1,210.76 |
| 128 | 28.28 | 2,414.71 | 4.15 | 1,448.83 |
| 256 | 18.26 | 2,576.59 | 7.21 | 1,545.96 |
Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) hosted on one Large Generic 2 unit of a dedicated AI cluster
Important
You can host the meta.llama-3.1-405b-instruct model only on a dedicated AI cluster of type Large Generic 2. This type is intended to provide better throughput with less hardware and at a lower cost than its predecessor, Large Generic 4.
The following tables provide benchmarks for the meta.llama-3.1-405b-instruct model hosted on one Large Generic 2 unit and on one Large Generic 4 unit. If your model is currently hosted on a Large Generic 4 unit, compare the tables to decide whether to move the model to the new unit.

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 27.38 | 26.65 | 3.74 | 15.99 |
| 2 | 26.43 | 51.30 | 3.88 | 30.78 |
| 4 | 25.92 | 100.61 | 3.96 | 60.36 |
| 8 | 25.52 | 196.72 | 4.06 | 118.03 |
| 16 | 21.24 | 328.32 | 4.84 | 196.99 |
| 32 | 19.32 | 588.59 | 5.36 | 353.15 |
| 64 | 16.73 | 1,003.22 | 6.29 | 601.93 |
| 128 | 12.56 | 1,433.27 | 8.59 | 859.96 |
| 256 | 8.60 | 1,586.86 | 8.59 | 952.11 |
Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 28.93 | 21.65 | 4.60 | 13.01 |
| 2 | 31.72 | 50.89 | 3.90 | 30.54 |
| 4 | 30.86 | 91.23 | 4.17 | 54.74 |
| 8 | 29.61 | 163.06 | 4.33 | 97.84 |
| 16 | 27.66 | 277.48 | 4.49 | 166.49 |
| 32 | 26.01 | 615.83 | 4.77 | 369.50 |
| 64 | 22.49 | 1,027.87 | 5.67 | 616.77 |
| 128 | 17.22 | 1,527.06 | 7.37 | 616.77 |
| 256 | 10.67 | 1,882.65 | 11.44 | 1,131.71 |
Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 (70B)) hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 97.11 | 51.67 | 1.98 | 30.14 |
| 2 | 95.38 | 99.17 | 2.04 | 57.87 |
| 4 | 93.91 | 183.96 | 2.10 | 107.50 |
| 8 | 89.79 | 318.53 | 2.23 | 186.09 |
| 16 | 81.05 | 506.12 | 2.47 | 294.03 |
| 32 | 64.15 | 909.40 | 3.18 | 530.15 |
| 64 | 50.35 | 1,405.67 | 4.08 | 818.96 |
| 128 | 33.59 | 1,786.60 | 6.26 | 1,040.74 |
| 256 | 18.77 | 1,866.83 | 11.43 | 1,086.94 |
Model: meta.llama-3-70b-instruct (Meta Llama 3) hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 52.05 | 52.57 | 1.95 | 30.80 |
| 2 | 50.70 | 100.90 | 2.00 | 59.19 |
| 4 | 49.96 | 192.32 | 2.06 | 112.89 |
| 8 | 47.75 | 369.74 | 2.15 | 216.13 |
| 16 | 44.36 | 643.94 | 2.30 | 377.65 |
| 32 | 36.74 | 982.39 | 2.74 | 576.42 |
| 64 | 31.27 | 1,605.80 | 3.23 | 942.49 |
| 128 | 20.59 | 1,841.44 | 4.96 | 1,082.95 |
| 256 | 11.49 | 2,333.32 | 8.88 | 1,368.63 |
Model: cohere.command-r-16k (Cohere Command R) hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 42.36 | 38.82 | 2.23 | 26.07 |
| 2 | 42.49 | 77.95 | 2.18 | 52.86 |
| 4 | 42.15 | 155.04 | 2.15 | 106.28 |
| 8 | 39.72 | 274.21 | 2.33 | 192.82 |
| 16 | 37.28 | 527.72 | 2.36 | 366.20 |
| 32 | 32.87 | 828.91 | 2.88 | 538.91 |
| 64 | 24.48 | 1,175.93 | 3.40 | 816.00 |
| 128 | 19.21 | 1,522.53 | 5.38 | 1,023.93 |
| 256 | 10.11 | 1,668.07 | 8.49 | 1,127.35 |
Model: cohere.command-r-plus (Cohere Command R+) hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 112.29 | 95.11 | 1.82 | 31.65 |
| 2 | 109.27 | 186.61 | 1.91 | 60.55 |
| 4 | 104.19 | 350.17 | 1.98 | 115.70 |
| 8 | 93.66 | 625.10 | 2.24 | 200.55 |
| 16 | 84.60 | 1,087.14 | 2.46 | 354.44 |
| 32 | 68.80 | 1,718.20 | 2.96 | 557.70 |
| 64 | 53.25 | 2,455.21 | 3.53 | 827.78 |
| 128 | 38.02 | 3,366.97 | 5.48 | 1,113.31 |
| 256 | 25.19 | 3,983.61 | 8.35 | 1,322.15 |
Japan Central (Osaka)
Model: cohere.command-r-08-2024 (Cohere Command R 08-2024) hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 134.80 | 126.97 | 1.56 | 36.46 |
| 2 | 128.71 | 235.26 | 1.57 | 70.05 |
| 4 | 122.01 | 436.12 | 1.63 | 131.04 |
| 8 | 113.84 | 762.01 | 1.81 | 222.59 |
| 16 | 101.20 | 1,177.66 | 1.99 | 347.43 |
| 32 | 83.96 | 2,021.49 | 2.31 | 610.16 |
| 64 | 64.47 | 3,191.72 | 3.07 | 950.61 |
| 128 | 43.12 | 3,772.60 | 4.92 | 1,120.64 |
| 256 | 21.76 | 4,094.46 | 8.56 | 1,212.42 |
Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024) hosted on one Large Cohere V2_2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 94.04 | 87.41 | 1.95 | 29.44 |
| 2 | 88.13 | 163.85 | 1.93 | 58.04 |
| 4 | 86.49 | 315.44 | 2.03 | 108.02 |
| 8 | 80.10 | 550.10 | 2.39 | 171.44 |
| 16 | 70.13 | 861.65 | 2.56 | 288.47 |
| 32 | 62.39 | 1,517.61 | 3.06 | 476.62 |
| 64 | 42.36 | 2,139.38 | 3.76 | 753.58 |
| 128 | 29.22 | 3,137.09 | 5.74 | 1,023.88 |
| 256 | 17.13 | 3,229.42 | 9.78 | 1,117.58 |
Model: meta.llama-3.2-90b-vision-instruct (Meta Llama 3.2 90B Vision, text input only) hosted on one Large Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 50.20 | 48.67 | 2.05 | 29.20 |
| 2 | 49.53 | 96.67 | 2.06 | 58.00 |
| 4 | 49.08 | 188.00 | 2.12 | 112.80 |
| 8 | 48.40 | 356.00 | 2.23 | 213.60 |
| 16 | 47.26 | 645.33 | 2.44 | 387.20 |
| 32 | 42.22 | 1,077.33 | 2.90 | 646.40 |
| 64 | 44.95 | 1,162.65 | 5.41 | 697.59 |
| 128 | 44.92 | 1,162.64 | 10.84 | 697.58 |
| 256 | 45.02 | 1,162.21 | 21.58 | 697.32 |
Model: meta.llama-3.2-11b-vision-instruct (Meta Llama 3.2 11B Vision, text input only) hosted on one Small Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 111.04 | 109.67 | 0.91 | 65.80 |
| 2 | 108.57 | 212.33 | 0.91 | 127.40 |
| 4 | 105.67 | 408.00 | 0.91 | 244.80 |
| 8 | 102.65 | 408.00 | 1.02 | 461.60 |
| 16 | 96.48 | 1,370.66 | 1.13 | 822.40 |
| 32 | 78.96 | 2,110.49 | 1.42 | 822.40 |
| 64 | 89.80 | 2,522.64 | 2.41 | 1,513.58 |
| 128 | 89.69 | 2,516.96 | 4.94 | 1,510.17 |
| 256 | 90.27 | 2,517.19 | 9.96 | 1,510.31 |
Model: meta.llama-3.3-70b-instruct (Meta Llama 3.3 (70B)) hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 53.62 | 52.60 | 1.89 | 31.56 |
| 2 | 52.63 | 102.67 | 1.93 | 61.60 |
| 4 | 53.06 | 205.27 | 1.93 | 123.16 |
| 8 | 52.47 | 394.66 | 1.97 | 236.79 |
| 16 | 49.27 | 715.55 | 2.11 | 429.33 |
| 32 | 42.71 | 1,198.53 | 2.46 | 719.12 |
| 64 | 37.25 | 2,017.51 | 2.90 | 1,210.76 |
| 128 | 28.28 | 2,414.71 | 4.15 | 1,448.83 |
| 256 | 18.26 | 2,576.59 | 7.21 | 1,545.96 |
Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) hosted on one Large Generic 2 unit of a dedicated AI cluster
Important
You can host the meta.llama-3.1-405b-instruct model only on a dedicated AI cluster of type Large Generic 2. This type is intended to provide better throughput with less hardware and at a lower cost than its predecessor, Large Generic 4.
The following tables provide benchmarks for the meta.llama-3.1-405b-instruct model hosted on one Large Generic 2 unit and on one Large Generic 4 unit. If your model is currently hosted on a Large Generic 4 unit, compare the tables to decide whether to move the model to the new unit.

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 27.38 | 26.65 | 3.74 | 15.99 |
| 2 | 26.43 | 51.30 | 3.88 | 30.78 |
| 4 | 25.92 | 100.61 | 3.96 | 60.36 |
| 8 | 25.52 | 196.72 | 4.06 | 118.03 |
| 16 | 21.24 | 328.32 | 4.84 | 196.99 |
| 32 | 19.32 | 588.59 | 5.36 | 353.15 |
| 64 | 16.73 | 1,003.22 | 6.29 | 601.93 |
| 128 | 12.56 | 1,433.27 | 8.59 | 859.96 |
| 256 | 8.60 | 1,586.86 | 8.59 | 952.11 |
Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 28.93 | 21.65 | 4.60 | 13.01 |
| 2 | 31.72 | 50.89 | 3.90 | 30.54 |
| 4 | 30.86 | 91.23 | 4.17 | 54.74 |
| 8 | 29.61 | 163.06 | 4.33 | 97.84 |
| 16 | 27.66 | 277.48 | 4.49 | 166.49 |
| 32 | 26.01 | 615.83 | 4.77 | 369.50 |
| 64 | 22.49 | 1,027.87 | 5.67 | 616.77 |
| 128 | 17.22 | 1,527.06 | 7.37 | 616.77 |
| 256 | 10.67 | 1,882.65 | 11.44 | 1,131.71 |
Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 (70B)) hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 97.11 | 51.67 | 1.98 | 30.14 |
| 2 | 95.38 | 99.17 | 2.04 | 57.87 |
| 4 | 93.91 | 183.96 | 2.10 | 107.50 |
| 8 | 89.79 | 318.53 | 2.23 | 186.09 |
| 16 | 81.05 | 506.12 | 2.47 | 294.03 |
| 32 | 64.15 | 909.40 | 3.18 | 530.15 |
| 64 | 50.35 | 1,405.67 | 4.08 | 818.96 |
| 128 | 33.59 | 1,786.60 | 6.26 | 1,040.74 |
| 256 | 18.77 | 1,866.83 | 11.43 | 1,086.94 |
Model: cohere.command-r-16k (Cohere Command R) hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 42.36 | 38.82 | 2.23 | 26.07 |
| 2 | 42.49 | 77.95 | 2.18 | 52.86 |
| 4 | 42.15 | 155.04 | 2.15 | 106.28 |
| 8 | 39.72 | 274.21 | 2.33 | 192.82 |
| 16 | 37.28 | 527.72 | 2.36 | 366.20 |
| 32 | 32.87 | 828.91 | 2.88 | 538.91 |
| 64 | 24.48 | 1,175.93 | 3.40 | 816.00 |
| 128 | 19.21 | 1,522.53 | 5.38 | 1,023.93 |
| 256 | 10.11 | 1,668.07 | 8.49 | 1,127.35 |
Model: cohere.command-r-plus (Cohere Command R+) hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 112.29 | 95.11 | 1.82 | 31.65 |
| 2 | 109.27 | 186.61 | 1.91 | 60.55 |
| 4 | 104.19 | 350.17 | 1.98 | 115.70 |
| 8 | 93.66 | 625.10 | 2.24 | 200.55 |
| 16 | 84.60 | 1,087.14 | 2.46 | 354.44 |
| 32 | 68.80 | 1,718.20 | 2.96 | 557.70 |
| 64 | 53.25 | 2,455.21 | 3.53 | 827.78 |
| 128 | 38.02 | 3,366.97 | 5.48 | 1,113.31 |
| 256 | 25.19 | 3,983.61 | 8.35 | 1,322.15 |
UK South (London)
Model: cohere.command-r-08-2024 (Cohere Command R 08-2024) hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 134.80 | 126.97 | 1.56 | 36.46 |
| 2 | 128.71 | 235.26 | 1.57 | 70.05 |
| 4 | 122.01 | 436.12 | 1.63 | 131.04 |
| 8 | 113.84 | 762.01 | 1.81 | 222.59 |
| 16 | 101.20 | 1,177.66 | 1.99 | 347.43 |
| 32 | 83.96 | 2,021.49 | 2.31 | 610.16 |
| 64 | 64.47 | 3,191.72 | 3.07 | 950.61 |
| 128 | 43.12 | 3,772.60 | 4.92 | 1,120.64 |
| 256 | 21.76 | 4,094.46 | 8.56 | 1,212.42 |
Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024) hosted on one Large Cohere V2_2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 94.04 | 87.41 | 1.95 | 29.44 |
| 2 | 88.13 | 163.85 | 1.93 | 58.04 |
| 4 | 86.49 | 315.44 | 2.03 | 108.02 |
| 8 | 80.10 | 550.10 | 2.39 | 171.44 |
| 16 | 70.13 | 861.65 | 2.56 | 288.47 |
| 32 | 62.39 | 1,517.61 | 3.06 | 476.62 |
| 64 | 42.36 | 2,139.38 | 3.76 | 753.58 |
| 128 | 29.22 | 3,137.09 | 5.74 | 1,023.88 |
| 256 | 17.13 | 3,229.42 | 9.78 | 1,117.58 |
Model: meta.llama-3.3-70b-instruct (Meta Llama 3.3 (70B)) hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 53.62 | 52.60 | 1.89 | 31.56 |
| 2 | 52.63 | 102.67 | 1.93 | 61.60 |
| 4 | 53.06 | 205.27 | 1.93 | 123.16 |
| 8 | 52.47 | 394.66 | 1.97 | 236.79 |
| 16 | 49.27 | 715.55 | 2.11 | 429.33 |
| 32 | 42.71 | 1,198.53 | 2.46 | 719.12 |
| 64 | 37.25 | 2,017.51 | 2.90 | 1,210.76 |
| 128 | 28.28 | 2,414.71 | 4.15 | 1,448.83 |
| 256 | 18.26 | 2,576.59 | 7.21 | 1,545.96 |
Model: meta.llama-3.2-90b-vision-instruct (Meta Llama 3.2 90B Vision, text input only) hosted on one Large Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 50.20 | 48.67 | 2.05 | 29.20 |
| 2 | 49.53 | 96.67 | 2.06 | 58.00 |
| 4 | 49.08 | 188.00 | 2.12 | 112.80 |
| 8 | 48.40 | 356.00 | 2.23 | 213.60 |
| 16 | 47.26 | 645.33 | 2.44 | 387.20 |
| 32 | 42.22 | 1,077.33 | 2.90 | 646.40 |
| 64 | 44.95 | 1,162.65 | 5.41 | 697.59 |
| 128 | 44.92 | 1,162.64 | 10.84 | 697.58 |
| 256 | 45.02 | 1,162.21 | 21.58 | 697.32 |
Model: meta.llama-3.2-11b-vision-instruct (Meta Llama 3.2 11B Vision, text input only) hosted on one Small Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 111.04 | 109.67 | 0.91 | 65.80 |
| 2 | 108.57 | 212.33 | 0.91 | 127.40 |
| 4 | 105.67 | 408.00 | 0.91 | 244.80 |
| 8 | 102.65 | 408.00 | 1.02 | 461.60 |
| 16 | 96.48 | 1,370.66 | 1.13 | 822.40 |
| 32 | 78.96 | 2,110.49 | 1.42 | 822.40 |
| 64 | 89.80 | 2,522.64 | 2.41 | 1,513.58 |
| 128 | 89.69 | 2,516.96 | 4.94 | 1,510.17 |
| 256 | 90.27 | 2,517.19 | 9.96 | 1,510.31 |
Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) hosted on one Large Generic 2 unit of a dedicated AI cluster
Important
You can host the meta.llama-3.1-405b-instruct model only on a dedicated AI cluster of type Large Generic 2. This type is intended to provide better throughput with less hardware and at a lower cost than its predecessor, Large Generic 4.
The following tables provide benchmarks for the meta.llama-3.1-405b-instruct model hosted on one Large Generic 2 unit and on one Large Generic 4 unit. If your model is currently hosted on a Large Generic 4 unit, compare the tables to decide whether to move the model to the new unit.

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 27.38 | 26.65 | 3.74 | 15.99 |
| 2 | 26.43 | 51.30 | 3.88 | 30.78 |
| 4 | 25.92 | 100.61 | 3.96 | 60.36 |
| 8 | 25.52 | 196.72 | 4.06 | 118.03 |
| 16 | 21.24 | 328.32 | 4.84 | 196.99 |
| 32 | 19.32 | 588.59 | 5.36 | 353.15 |
| 64 | 16.73 | 1,003.22 | 6.29 | 601.93 |
| 128 | 12.56 | 1,433.27 | 8.59 | 859.96 |
| 256 | 8.60 | 1,586.86 | 8.59 | 952.11 |
Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 28.93 | 21.65 | 4.60 | 13.01 |
| 2 | 31.72 | 50.89 | 3.90 | 30.54 |
| 4 | 30.86 | 91.23 | 4.17 | 54.74 |
| 8 | 29.61 | 163.06 | 4.33 | 97.84 |
| 16 | 27.66 | 277.48 | 4.49 | 166.49 |
| 32 | 26.01 | 615.83 | 4.77 | 369.50 |
| 64 | 22.49 | 1,027.87 | 5.67 | 616.77 |
| 128 | 17.22 | 1,527.06 | 7.37 | 616.77 |
| 256 | 10.67 | 1,882.65 | 11.44 | 1,131.71 |
Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 (70B)) hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 97.11 | 51.67 | 1.98 | 30.14 |
| 2 | 95.38 | 99.17 | 2.04 | 57.87 |
| 4 | 93.91 | 183.96 | 2.10 | 107.50 |
| 8 | 89.79 | 318.53 | 2.23 | 186.09 |
| 16 | 81.05 | 506.12 | 2.47 | 294.03 |
| 32 | 64.15 | 909.40 | 3.18 | 530.15 |
| 64 | 50.35 | 1,405.67 | 4.08 | 818.96 |
| 128 | 33.59 | 1,786.60 | 6.26 | 1,040.74 |
| 256 | 18.77 | 1,866.83 | 11.43 | 1,086.94 |
Model: meta.llama-3-70b-instruct (Meta Llama 3) hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 52.05 | 52.57 | 1.95 | 30.80 |
| 2 | 50.70 | 100.90 | 2.00 | 59.19 |
| 4 | 49.96 | 192.32 | 2.06 | 112.89 |
| 8 | 47.75 | 369.74 | 2.15 | 216.13 |
| 16 | 44.36 | 643.94 | 2.30 | 377.65 |
| 32 | 36.74 | 982.39 | 2.74 | 576.42 |
| 64 | 31.27 | 1,605.80 | 3.23 | 942.49 |
| 128 | 20.59 | 1,841.44 | 4.96 | 1,082.95 |
| 256 | 11.49 | 2,333.32 | 8.88 | 1,368.63 |
Model: cohere.command-r-16k (Cohere Command R) hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 42.36 | 38.82 | 2.23 | 26.07 |
| 2 | 42.49 | 77.95 | 2.18 | 52.86 |
| 4 | 42.15 | 155.04 | 2.15 | 106.28 |
| 8 | 39.72 | 274.21 | 2.33 | 192.82 |
| 16 | 37.28 | 527.72 | 2.36 | 366.20 |
| 32 | 32.87 | 828.91 | 2.88 | 538.91 |
| 64 | 24.48 | 1,175.93 | 3.40 | 816.00 |
| 128 | 19.21 | 1,522.53 | 5.38 | 1,023.93 |
| 256 | 10.11 | 1,668.07 | 8.49 | 1,127.35 |
Model: cohere.command-r-plus (Cohere Command R+) hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 112.29 | 95.11 | 1.82 | 31.65 |
| 2 | 109.27 | 186.61 | 1.91 | 60.55 |
| 4 | 104.19 | 350.17 | 1.98 | 115.70 |
| 8 | 93.66 | 625.10 | 2.24 | 200.55 |
| 16 | 84.60 | 1,087.14 | 2.46 | 354.44 |
| 32 | 68.80 | 1,718.20 | 2.96 | 557.70 |
| 64 | 53.25 | 2,455.21 | 3.53 | 827.78 |
| 128 | 38.02 | 3,366.97 | 5.48 | 1,113.31 |
| 256 | 25.19 | 3,983.61 | 8.35 | 1,322.15 |
US Midwest (Chicago)
Model: cohere.command-r-08-2024 (Cohere Command R 08-2024) hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 134.80 | 126.97 | 1.56 | 36.46 |
| 2 | 128.71 | 235.26 | 1.57 | 70.05 |
| 4 | 122.01 | 436.12 | 1.63 | 131.04 |
| 8 | 113.84 | 762.01 | 1.81 | 222.59 |
| 16 | 101.20 | 1,177.66 | 1.99 | 347.43 |
| 32 | 83.96 | 2,021.49 | 2.31 | 610.16 |
| 64 | 64.47 | 3,191.72 | 3.07 | 950.61 |
| 128 | 43.12 | 3,772.60 | 4.92 | 1,120.64 |
| 256 | 21.76 | 4,094.46 | 8.56 | 1,212.42 |
Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024) hosted on one Large Cohere V2_2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 94.04 | 87.41 | 1.95 | 29.44 |
| 2 | 88.13 | 163.85 | 1.93 | 58.04 |
| 4 | 86.49 | 315.44 | 2.03 | 108.02 |
| 8 | 80.10 | 550.10 | 2.39 | 171.44 |
| 16 | 70.13 | 861.65 | 2.56 | 288.47 |
| 32 | 62.39 | 1,517.61 | 3.06 | 476.62 |
| 64 | 42.36 | 2,139.38 | 3.76 | 753.58 |
| 128 | 29.22 | 3,137.09 | 5.74 | 1,023.88 |
| 256 | 17.13 | 3,229.42 | 9.78 | 1,117.58 |
Model: meta.llama-3.3-70b-instruct (Meta Llama 3.3 (70B)) hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 53.62 | 52.60 | 1.89 | 31.56 |
| 2 | 52.63 | 102.67 | 1.93 | 61.60 |
| 4 | 53.06 | 205.27 | 1.93 | 123.16 |
| 8 | 52.47 | 394.66 | 1.97 | 236.79 |
| 16 | 49.27 | 715.55 | 2.11 | 429.33 |
| 32 | 42.71 | 1,198.53 | 2.46 | 719.12 |
| 64 | 37.25 | 2,017.51 | 2.90 | 1,210.76 |
| 128 | 28.28 | 2,414.71 | 4.15 | 1,448.83 |
| 256 | 18.26 | 2,576.59 | 7.21 | 1,545.96 |
Model: meta.llama-3.2-90b-vision-instruct (Meta Llama 3.2 90B Vision, text input only) hosted on one Large Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 50.20 | 48.67 | 2.05 | 29.20 |
| 2 | 49.53 | 96.67 | 2.06 | 58.00 |
| 4 | 49.08 | 188.00 | 2.12 | 112.80 |
| 8 | 48.40 | 356.00 | 2.23 | 213.60 |
| 16 | 47.26 | 645.33 | 2.44 | 387.20 |
| 32 | 42.22 | 1,077.33 | 2.90 | 646.40 |
| 64 | 44.95 | 1,162.65 | 5.41 | 697.59 |
| 128 | 44.92 | 1,162.64 | 10.84 | 697.58 |
| 256 | 45.02 | 1,162.21 | 21.58 | 697.32 |
Model: meta.llama-3.2-11b-vision-instruct (Meta Llama 3.2 11B Vision, text input only) hosted on one Small Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 111.04 | 109.67 | 0.91 | 65.80 |
| 2 | 108.57 | 212.33 | 0.91 | 127.40 |
| 4 | 105.67 | 408.00 | 0.91 | 244.80 |
| 8 | 102.65 | 408.00 | 1.02 | 461.60 |
| 16 | 96.48 | 1,370.66 | 1.13 | 822.40 |
| 32 | 78.96 | 2,110.49 | 1.42 | 822.40 |
| 64 | 89.80 | 2,522.64 | 2.41 | 1,513.58 |
| 128 | 89.69 | 2,516.96 | 4.94 | 1,510.17 |
| 256 | 90.27 | 2,517.19 | 9.96 | 1,510.31 |
Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) hosted on one Large Generic 2 unit of a dedicated AI cluster
Important
You can host the meta.llama-3.1-405b-instruct model only on a dedicated AI cluster of type Large Generic 2. This type is intended to provide better throughput with less hardware and at a lower cost than its predecessor, Large Generic 4.
The following tables provide benchmarks for the meta.llama-3.1-405b-instruct model hosted on one Large Generic 2 unit and on one Large Generic 4 unit. If your model is currently hosted on a Large Generic 4 unit, compare the tables to decide whether to move the model to the new unit.

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 27.38 | 26.65 | 3.74 | 15.99 |
| 2 | 26.43 | 51.30 | 3.88 | 30.78 |
| 4 | 25.92 | 100.61 | 3.96 | 60.36 |
| 8 | 25.52 | 196.72 | 4.06 | 118.03 |
| 16 | 21.24 | 328.32 | 4.84 | 196.99 |
| 32 | 19.32 | 588.59 | 5.36 | 353.15 |
| 64 | 16.73 | 1,003.22 | 6.29 | 601.93 |
| 128 | 12.56 | 1,433.27 | 8.59 | 859.96 |
| 256 | 8.60 | 1,586.86 | 8.59 | 952.11 |
Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 28.93 | 21.65 | 4.60 | 13.01 |
| 2 | 31.72 | 50.89 | 3.90 | 30.54 |
| 4 | 30.86 | 91.23 | 4.17 | 54.74 |
| 8 | 29.61 | 163.06 | 4.33 | 97.84 |
| 16 | 27.66 | 277.48 | 4.49 | 166.49 |
| 32 | 26.01 | 615.83 | 4.77 | 369.50 |
| 64 | 22.49 | 1,027.87 | 5.67 | 616.77 |
| 128 | 17.22 | 1,527.06 | 7.37 | 616.77 |
| 256 | 10.67 | 1,882.65 | 11.44 | 1,131.71 |
Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 (70B)) hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 97.11 | 51.67 | 1.98 | 30.14 |
| 2 | 95.38 | 99.17 | 2.04 | 57.87 |
| 4 | 93.91 | 183.96 | 2.10 | 107.50 |
| 8 | 89.79 | 318.53 | 2.23 | 186.09 |
| 16 | 81.05 | 506.12 | 2.47 | 294.03 |
| 32 | 64.15 | 909.40 | 3.18 | 530.15 |
| 64 | 50.35 | 1,405.67 | 4.08 | 818.96 |
| 128 | 33.59 | 1,786.60 | 6.26 | 1,040.74 |
| 256 | 18.77 | 1,866.83 | 11.43 | 1,086.94 |
Model: meta.llama-3-70b-instruct (Meta Llama 3) hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 31.07 | 31.12 | 3.28 | 18.29 |
| 2 | 30.33 | 59.43 | 3.40 | 34.88 |
| 4 | 29.39 | 113.76 | 3.51 | 66.48 |
| 8 | 27.14 | 210.00 | 3.77 | 123.22 |
| 16 | 24.04 | 351.38 | 4.24 | 205.78 |
| 32 | 19.40 | 523.68 | 5.23 | 306.44 |
| 64 | 16.12 | 837.45 | 6.28 | 491.00 |
| 128 | 9.48 | 920.97 | 10.63 | 541.91 |
| 256 | 5.73 | 1,211.95 | 17.79 | 713.19 |
Model: cohere.command-r-16k (Cohere Command R) hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 42.36 | 38.82 | 2.23 | 26.07 |
| 2 | 42.49 | 77.95 | 2.18 | 52.86 |
| 4 | 42.15 | 155.04 | 2.15 | 106.28 |
| 8 | 39.72 | 274.21 | 2.33 | 192.82 |
| 16 | 37.28 | 527.72 | 2.36 | 366.20 |
| 32 | 32.87 | 828.91 | 2.88 | 538.91 |
| 64 | 24.48 | 1,175.93 | 3.40 | 816.00 |
| 128 | 19.21 | 1,522.53 | 5.38 | 1,023.93 |
| 256 | 10.11 | 1,668.07 | 8.49 | 1,127.35 |
Model: cohere.command-r-plus (Cohere Command R+) hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 112.29 | 95.11 | 1.82 | 31.65 |
| 2 | 109.27 | 186.61 | 1.91 | 60.55 |
| 4 | 104.19 | 350.17 | 1.98 | 115.70 |
| 8 | 93.66 | 625.10 | 2.24 | 200.55 |
| 16 | 84.60 | 1,087.14 | 2.46 | 354.44 |
| 32 | 68.80 | 1,718.20 | 2.96 | 557.70 |
| 64 | 53.25 | 2,455.21 | 3.53 | 827.78 |
| 128 | 38.02 | 3,366.97 | 5.48 | 1,113.31 |
| 256 | 25.19 | 3,983.61 | 8.35 | 1,322.15 |
Model: cohere.command (Cohere Command 52 B) hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 34.98 | 28.85 | 3.21 | 17.30 |
| 8 | 29.51 | 119.83 | 5.34 | 71.62 |
| 32 | 27.44 | 293.58 | 5.91 | 177.09 |
| 128 | 25.56 | 482.88 | 6.67 | 291.95 |
Model: cohere.command-light (Cohere Command Light 6 B) hosted on one Small Cohere unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 71.85 | 54.49 | 1.74 | 30.21 |
| 8 | 41.91 | 191.52 | 2.87 | 105.63 |
| 32 | 31.37 | 395.49 | 3.55 | 216.87 |
| 128 | 28.27 | 557.57 | 3.90 | 302.44 |
Model: meta.llama-2-70b-chat (Llama 2 (70 B)) hosted on one Llama2 70 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|