Supported Models

How long Generative AI supports a large language model depends on the model's type and serving mode. The large language models serve user requests in either an on-demand serving mode or a dedicated serving mode. Review the following sections to learn about deprecation times and to decide which serving mode works best for you.

On-demand Serving Mode
  • Users pay for each inference call based on the combined length of the prompt and the response for that request.
  • Available for pretrained foundational models only.
  • The models run on shared GPU resources.
  • The price differs for small and large models.
  • When a new version is released, users get an overlap period during which both versions are supported, so that they can migrate gracefully to the new version.
  • During this overlap period, the old version is in a deprecating state. After the deprecating state ends, Generative AI no longer supports the old version.
Serving Mode  Model           Tag      Duration in Deprecating State
On Demand     cohere.command  Default  1 month
On Demand     cohere.embed    Default  6 months
On Demand     meta.llama-2    Default  1 month
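
The on-demand deprecation windows above can be turned into a concrete end-of-support date. The following is a minimal Python sketch, not part of any Generative AI SDK. The names `add_months`, `on_demand_support_end`, and `ON_DEMAND_DEPRECATION_MONTHS`, as well as the example release date, are hypothetical, and the sketch assumes that the deprecating state starts on the day the new version is released.

```python
import calendar
from datetime import date

# Deprecating-state durations in months for on-demand models,
# copied from the table above. Values might change in the future.
ON_DEMAND_DEPRECATION_MONTHS = {
    "cohere.command": 1,
    "cohere.embed": 6,
    "meta.llama-2": 1,
}


def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`.

    The day is clamped to the last day of the target month,
    so January 31 plus one month gives the last day of February.
    """
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def on_demand_support_end(model: str, new_version_release: date) -> date:
    """Estimate when the old version leaves the deprecating state.

    Assumption: the deprecating window starts on the release date
    of the new version.
    """
    return add_months(new_version_release, ON_DEMAND_DEPRECATION_MONTHS[model])


# Hypothetical release date, for illustration only.
print(on_demand_support_end("cohere.embed", date(2024, 1, 31)))  # 2024-07-31
```

Under these assumptions, an on-demand `cohere.embed` version that is superseded on January 31, 2024 would remain supported until July 31, 2024.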

Dedicated Serving Mode
  • Users pay for dedicated AI clusters that run 24/7 for the duration of their commitment.
  • Available for pretrained out-of-the-box models and custom models.
  • The GPU resources are dedicated to the user and not shared.
  • The price depends on the model and the number of instances that the user selects when they create or update their cluster.
  • Users can expect predictable latency and throughput.
  • When a new version is released, users get an overlap period during which both versions are supported, so that they can migrate gracefully to the new version.
  • Because each hosting dedicated AI cluster hosts only one version of a model, users can choose to stay on the version that their cluster hosts instead of migrating during the overlap period. To do so, users can request long-term support for select model versions.
  • The following table displays the maximum time that users can keep a model version alive.
Serving Mode  Model           Tag                Model Keep Alive Time Window
Dedicated     cohere.command  Default            3 months
Dedicated     cohere.command  Long-term Support  Up to 18 months
Dedicated     cohere.embed    Default            6 months
Dedicated     cohere.embed    Long-term Support  Up to 18 months
Dedicated     meta.llama-2    Default            3 months
Dedicated     meta.llama-2    Long-term Support  Up to 18 months
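
The keep-alive windows above lend themselves to a simple lookup when planning how long a dedicated AI cluster can stay on one model version. The following is a minimal Python sketch under stated assumptions: the names `DEDICATED_KEEP_ALIVE_MONTHS`, `max_keep_alive_months`, and `can_keep_alive` are hypothetical and are not part of any Generative AI API. Long-term support values are treated as upper bounds, because the table lists them as "up to" 18 months.

```python
# Maximum keep-alive windows in months for dedicated serving mode,
# copied from the table above. Long-term support entries are upper
# bounds ("up to 18 months"). Values might change in the future.
DEDICATED_KEEP_ALIVE_MONTHS = {
    ("cohere.command", "default"): 3,
    ("cohere.command", "long-term support"): 18,
    ("cohere.embed", "default"): 6,
    ("cohere.embed", "long-term support"): 18,
    ("meta.llama-2", "default"): 3,
    ("meta.llama-2", "long-term support"): 18,
}


def max_keep_alive_months(model: str, long_term_support: bool = False) -> int:
    """Return the maximum number of months a model version can be kept alive."""
    tag = "long-term support" if long_term_support else "default"
    return DEDICATED_KEEP_ALIVE_MONTHS[(model, tag)]


def can_keep_alive(model: str, months_needed: int,
                   long_term_support: bool = False) -> bool:
    """Check whether the requested duration fits within the keep-alive window."""
    return months_needed <= max_keep_alive_months(model, long_term_support)


# A 12-month requirement exceeds the default window for cohere.command
# but fits within the long-term support window.
print(can_keep_alive("cohere.command", 12))                          # False
print(can_keep_alive("cohere.command", 12, long_term_support=True))  # True
```

A check like this shows when a planned migration timeline exceeds the default window, so that long-term support needs to be requested.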

Note

Deprecation times might change in the future.