Together AI charges for inference based on model size, typically around $0.88 per million tokens for Llama-3.1-70B and up to $3.50 for the massive 405B model. For a production RAG application processing 1 million Llama-70B tokens daily (roughly 30 million a month), you're looking at approximately $26/month on Together, compared to roughly $11/month on DeepInfra or $19 on Groq. It's not the absolute cheapest option, but the premium buys you something critical: a sweet spot between raw speed and model variety.
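To see where those numbers come from, here's the back-of-envelope math in Python, using the per-million-token rates quoted above. The Groq rate is an assumption inferred from the ~$19/month figure, and all of these rates change often, so verify current pricing before budgeting:

# Rough monthly cost at a steady 1M tokens/day.
DAILY_TOKENS = 1_000_000
DAYS_PER_MONTH = 30

rates_per_million = {
    "Together (Llama-3.1-70B)": 0.88,
    "DeepInfra (Llama-3.1-70B)": 0.36,
    "Groq (Llama-3.1-70B)": 0.64,  # assumption: blended rate implied by ~$19/month
}

for provider, rate in rates_per_million.items():
    monthly_cost = DAILY_TOKENS * DAYS_PER_MONTH / 1_000_000 * rate
    print(f"{provider}: ~${monthly_cost:.2f}/month")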
The platform sits comfortably in the "performance tier" of inference providers. By leveraging their own FlashAttention research, they deliver speeds (often 100+ tokens/second for 70B models) that feel instantaneous for chat interfaces, significantly outperforming standard AWS Bedrock or Azure setups. The API is fully OpenAI-compatible, meaning migration is usually a one-line config change. Unlike Groq, which is limited by its specialized hardware to a handful of models, Together hosts a massive registry of over 100 open-source options, including the latest from Qwen, Mixtral, and DeepSeek.
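That migration really is a base-URL swap. Here's a minimal sketch pointing the official openai SDK at Together's OpenAI-compatible endpoint; the model ID is one of Together's hosted models, shown for illustration:

# Migration sketch: reuse the OpenAI SDK, swap only base_url and API key.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",  # Together's OpenAI-compatible endpoint
    api_key=os.environ["TOGETHER_API_KEY"],
)
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(response.choices[0].message.content)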
However, it's not without friction. The serverless experience has a "cold start" problem if you stray from the most popular models. Requesting a niche coding model can sometimes result in a 5-10 second hang while the GPU spins up, which kills the real-time vibe. Additionally, they've tightened their onboarding: there is no true free tier anymore, and you must pre-load $5 just to generate your first token.
The real competition is nuanced. If you need absolute lowest latency for a consumer chatbot, Groq is faster. If you need the rock-bottom price for background batch jobs, DeepInfra is cheaper. But Together AI is the best default for developers who need a reliable, high-speed API that supports the newest open models the day they drop, without managing a GPU cluster yourself.
Skip Together if you are building a "free-to-play" hobby app and can't stomach the initial credit purchase or strict rate limits on low tiers. Use it if you’re a startup that needs near-Groq speeds but requires the flexibility of a wider model catalog.
Pricing
The biggest surprise for new users is the lack of a true free trial. Documentation confirms you must purchase a minimum of $5 in credits to generate your first API key, acting as a gate against spam. Once inside, pricing is competitive but not bargain-bin: Llama-3.1-8B is $0.18/1M tokens, while the 70B variant hovers around $0.88/1M, roughly 2.5x the cost of DeepInfra ($0.36) but comparable to Fireworks. The cost cliff hits hard with the 405B model, which jumps to $3.50/1M tokens, so ensure your router falls back to smaller models whenever possible.
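That fallback advice is easy to encode. Here's a minimal sketch of a cost-aware router that defaults to the 8B model and escalates only when needed; the prompt-length heuristic is a placeholder (swap in your own difficulty signal), while the model IDs are Together's published ones:

# Illustrative cost-aware router: cheapest model first, 405B only on opt-in.
MODELS = {
    "small": "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",    # $0.18/1M
    "medium": "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",  # $0.88/1M
    "large": "meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",  # $3.50/1M
}

def pick_model(prompt: str, allow_405b: bool = False) -> str:
    if allow_405b:
        return MODELS["large"]
    # Crude difficulty proxy; use a real classifier if you have one.
    return MODELS["medium"] if len(prompt) > 2000 else MODELS["small"]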
Technical Verdict
Together's engineering DNA shows in their stack. The Python SDK is a thin, clean wrapper around their REST API, fully typed and stable. Latency is excellent for cached/hot models, often hitting 100+ tokens/second on 70B-parameter models. However, error handling on rate limits (HTTP 429) can be abrupt, serverless cold starts on obscure models are noticeable (5s+), and documentation is functional but sparse on advanced fine-tuning examples.
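Since the SDK can surface those 429s abruptly, it's worth wrapping calls in a simple exponential backoff. A minimal sketch, assuming recent SDK versions expose together.error.RateLimitError (verify this against your installed version):

# Retry wrapper for abrupt HTTP 429s, using exponential backoff.
import os
import time
from together import Together
from together.error import RateLimitError  # assumption: check your SDK version

client = Together(api_key=os.environ["TOGETHER_API_KEY"])

def chat_with_retry(messages, model, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except RateLimitError:
            time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, 8s, ...
    raise RuntimeError(f"Still rate-limited after {max_retries} retries")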
Quick Start
# pip install together
import os
from together import Together
client = Together(api_key=os.environ.get("TOGETHER_API_KEY"))
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Explain quantum computing in one sentence."}],
)
print(response.choices[0].message.content)

Watch Out
- No free trial; requires a $5 minimum credit purchase to generate an API key.
- Serverless cold starts can take 5-10 seconds for less popular models (see the warm-up sketch after this list).
- Rate limits are strict on the default tier and often require manual support tickets to raise.
- The 'blended' pricing on some pages can hide the fact that output tokens are significantly more expensive than input tokens.
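One cheap mitigation for the cold-start item above is a best-effort warm-up request at startup, before user traffic arrives. A minimal sketch, with an illustrative model name; max_tokens=1 keeps the throwaway call nearly free:

# Best-effort warm-up ping to pre-spin a serverless model.
import os
from together import Together

client = Together(api_key=os.environ["TOGETHER_API_KEY"])

def warm_up(model: str) -> None:
    try:
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "ping"}],
            max_tokens=1,  # minimize cost of the throwaway request
        )
    except Exception:
        pass  # a failed warm-up shouldn't block startup

warm_up("meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo")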
