Alibaba GTE text-embedding-v4 costs $0.07 per 1 million tokens, undercutting OpenAI’s large embedding model by nearly 50% while delivering top-tier performance on the MTEB leaderboard. It exists in two distinct forms: a highly efficient managed API and a family of large open-weight models (like gte-Qwen2-7B-instruct) that you can host yourself.
For most teams, the API is the logical choice. It has recently added an OpenAI-compatible endpoint, making migration as simple as changing a base_url (sketched below). In a production workload processing 10 million documents averaging 1,000 tokens each (10 billion total tokens), Alibaba’s API would cost roughly $700. The same workload on OpenAI’s text-embedding-3-large ($0.13/1M) would cost $1,300. While OpenAI’s small model is cheaper ($200 for this workload), Alibaba’s v4 competes with the large variant on quality, making it the value leader for high-accuracy requirements. It also supports Matryoshka-style dimension reduction, letting you store smaller vectors (e.g., 256 dims) while retaining most of the semantic fidelity of the full 1,024 or 2,048 dimensions.
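If you take the compatible-endpoint route, the switch can look roughly like the sketch below: it reuses the official openai Python SDK, swaps in a DashScope key, and requests truncated vectors. The exact base_url (shown here for the international/Singapore region) and whether the dimensions parameter is honored are assumptions to verify against Alibaba Cloud's Model Studio docs.
# pip install openai
import os
from openai import OpenAI

# Assumed international (Singapore) host; /compatible-mode/v1 is the OpenAI-compatible path.
client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],  # DashScope key, not an OpenAI key
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

resp = client.embeddings.create(
    model="text-embedding-v4",
    input="The quick brown fox jumps over the lazy dog",
    dimensions=256,  # Matryoshka-style truncation; assumes the endpoint accepts this parameter
)
print(len(resp.data[0].embedding))  # expect 256 if the dimensions parameter is honored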
The open-weight version, gte-Qwen2-7B-instruct, is a different beast entirely. It offers a massive 32k context window, far exceeding the API’s 8k limit. This makes it uniquely suited for heavy RAG applications where you need to embed entire legal contracts or research papers as single vectors. However, running a 7B parameter embedding model is like using a semi-truck to pick up groceries—it requires significant GPU VRAM (approx. 14GB+) and has higher latency than standard BERT-based models.
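If you do go the self-hosted route, a minimal sketch with sentence-transformers looks like the following; it assumes a recent release that accepts model_kwargs, a GPU with enough VRAM, and that trust_remote_code is acceptable in your environment.
# pip install sentence-transformers
import torch
from sentence_transformers import SentenceTransformer

# Public Hugging Face repo for the open-weight model; FP16 needs roughly 14GB+ of VRAM.
model = SentenceTransformer(
    "Alibaba-NLP/gte-Qwen2-7B-instruct",
    trust_remote_code=True,
    model_kwargs={"torch_dtype": torch.float16},
)
model.max_seq_length = 32768  # use the full long-context window for whole contracts or papers

docs = ["<full text of a long legal contract or research paper>"]
doc_embeddings = model.encode(docs, normalize_embeddings=True)
print(doc_embeddings.shape)  # (1, embedding_dim)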
The downsides are mostly operational. The API’s 8k context limit is restrictive compared to the open model’s 32k capability. Additionally, while the API is available globally (Singapore/US servers), some enterprises may hesitate due to data sovereignty concerns regarding Alibaba Cloud, depending on their internal compliance policies. Documentation can also be fragmented between Hugging Face (for open weights) and Alibaba Cloud (for the API).
If you need state-of-the-art multilingual embeddings and want to slash your OpenAI bill without sacrificing quality, switch to the text-embedding-v4 API. If you have specialized needs—specifically privacy or handling documents longer than 8k tokens—self-host the Qwen2-7B model, provided you have the GPU infrastructure to support it.
Pricing
The text-embedding-v4 API charges a flat $0.07 per 1 million tokens. There is no free tier for the API beyond an initial trial quota (often 1M tokens), but the open-weight models are free to use under Apache 2.0 if you bring your own compute.
Compared to competitors:
- OpenAI text-embedding-3-large: $0.13/1M (Alibaba is ~46% cheaper).
- OpenAI text-embedding-3-small: $0.02/1M (Alibaba is 3.5x more expensive, but much higher quality).
- Cohere Embed v3: $0.10/1M.
The cost cliff is virtually non-existent due to the low per-token rate, but self-hosting the 7B model introduces a 'compute cliff'—you immediately need A10/A100 class GPUs.
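For a concrete sanity check, the arithmetic behind the 10-billion-token workload above is easy to reproduce from the published per-million-token list prices; this is a back-of-the-envelope sketch, not an official cost estimator.
total_tokens = 10_000_000 * 1_000  # 10M documents x 1,000 tokens each = 10B tokens
prices_per_million = {
    "Alibaba text-embedding-v4": 0.07,
    "OpenAI text-embedding-3-large": 0.13,
    "OpenAI text-embedding-3-small": 0.02,
    "Cohere Embed v3": 0.10,
}
for name, price in prices_per_million.items():
    print(f"{name}: ${total_tokens / 1_000_000 * price:,.0f}")
# v4: $700, 3-large: $1,300, 3-small: $200, Cohere: $1,000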
Technical Verdict
The API is production-ready with a new OpenAI-compatible endpoint (/compatible-mode/v1), removing the need for the custom DashScope SDK in many cases. Latency is competitive for the API, though the self-hosted 7B model is noticeably slower than standard 300M parameter encoders. The support for Matryoshka representation learning (truncating dimensions from 2048/1024 down to 128) allows for significant vector DB storage savings without retraining. Multilingual performance is exceptional, particularly for Asian languages.
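In practice the storage savings come from truncating the leading dimensions and re-normalizing, which you can even do client-side on vectors you have already stored; a minimal illustration, where the random vector stands in for a real embedding returned by the API:
import numpy as np

full_vec = np.random.randn(1024)        # stand-in for a full-width embedding from the API
full_vec /= np.linalg.norm(full_vec)
small_vec = full_vec[:256]              # keep only the leading 256 Matryoshka dimensions
small_vec /= np.linalg.norm(small_vec)  # re-normalize so cosine similarity stays meaningful
print(small_vec.shape)                  # (256,)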
Quick Start
# pip install dashscope
import os
import dashscope
from dashscope import TextEmbedding

dashscope.api_key = os.environ["DASHSCOPE_API_KEY"]  # the SDK can also pick this env var up itself

resp = TextEmbedding.call(
    model='text-embedding-v4',  # pass the model name as a string
    input='The quick brown fox jumps over the lazy dog'
)
print(resp.output['embeddings'][0]['embedding'][:5])  # [0.01, -0.02, ...]
Watch Out
- API context is capped at 8,192 tokens, while the open-weight model supports 32,768.
- Self-hosting the 7B model requires substantial VRAM (approx. 14GB+ for FP16), making it expensive to run on standard cloud instances.
- The API default dimension might be 1024, but the model supports up to 2048 or down to 128; you must specify this explicitly to optimize storage.
- Documentation is split between 'DashScope' (API) and Hugging Face (weights), leading to confusion about model capabilities.
