Gemini Embedding 2

Google's first natively multimodal embedding model

Overview

What it is

Gemini Embedding 2 is Google's first natively multimodal embedding model. It maps text, images, video, audio, and documents into a single embedding space, enabling retrieval and classification across different types of media. It is available now in public preview.

Intent

I need it when

Implement semantic classification and clustering across mixed-media datasets

The model natively understands interleaved multimodal input and captures semantic intent across 100+ languages, making it ideal for sentiment analysis, data clustering, and classification tasks that span text, images, video, and audio without requiring intermediate transcriptions or format conversions.
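As a rough illustration of classification on top of embeddings, the sketch below assigns a label by comparing a vector against the centroid of a few labeled examples. The three-dimensional vectors and the labels are made-up stand-ins for real model output, and `classify` is a hypothetical helper, not part of any Gemini SDK.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def classify(embedding, labeled_examples):
    """Return the label whose example centroid is most similar to `embedding`."""
    centroids = {label: centroid(vecs) for label, vecs in labeled_examples.items()}
    return max(centroids, key=lambda label: cosine(embedding, centroids[label]))

# Toy 3-d vectors standing in for embeddings of mixed-media items (assumption).
examples = {
    "positive": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "negative": [[0.1, 0.9, 0.0], [0.0, 0.8, 0.2]],
}
print(classify([0.7, 0.3, 0.0], examples))  # positive
```

Because every modality shares one embedding space, the same comparison would apply whether the labeled examples came from text, images, or audio clips.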

Access state-of-the-art multimodal embeddings with strong speech and video capabilities

Gemini Embedding 2 sets new performance standards for multimodal embeddings, introducing strong native speech capabilities and outperforming leading models in text, image, and video tasks. It is available in public preview via the Gemini API and Vertex AI, with integration support for LangChain, LlamaIndex, Haystack, Weaviate, Qdrant, and ChromaDB.

Optimize embedding storage and performance costs while maintaining quality

Gemini Embedding 2 incorporates Matryoshka Representation Learning (MRL), allowing flexible output dimensions that scale down from the default 3072 to lower dimensions. This enables developers to balance embedding quality against storage and computational costs based on their specific performance requirements.
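The trade-off can be sketched with plain arithmetic. With MRL-style embeddings, a shorter vector is usually obtained by keeping the leading components and re-normalizing. The snippet below assumes that convention, uses 768 as an arbitrary illustrative target size, and assumes 4-byte float storage; the helper names are hypothetical.

```python
import math

FULL_DIM = 3072  # default output dimension stated above

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head

def storage_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for `num_vectors` vectors (float32 assumed by default)."""
    return num_vectors * dim * bytes_per_value

# Synthetic stand-in for a full-size embedding (assumption, not model output).
full = [math.sin(i) for i in range(FULL_DIM)]
small = truncate_embedding(full, 768)

print(len(small))                          # 768
print(storage_bytes(1_000_000, FULL_DIM))  # 12288000000
print(storage_bytes(1_000_000, 768))       # 3072000000
```

Under these assumptions, one million vectors shrink from roughly 12.3 GB to roughly 3.1 GB of raw storage. How much retrieval quality is lost at a given size should be measured on your own data.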

Build multimodal search and retrieval systems that understand diverse content types

Gemini Embedding 2 maps text, images, videos, audio, and documents into a single unified embedding space, enabling semantic search and Retrieval-Augmented Generation (RAG) across all media types simultaneously. This eliminates complex multi-pipeline architectures and allows developers to capture nuanced relationships between different modalities in a single request.
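A minimal sketch of what a single index over mixed media could look like: every item, regardless of modality, is stored as a unit-length vector and ranked against the query by dot product. The file names and three-dimensional vectors are invented for illustration, and a production system would more likely use one of the vector databases listed above.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def search(query_embedding, index, top_k=3):
    """Return the `top_k` (score, item) pairs most similar to the query."""
    q = normalize(query_embedding)
    scored = [
        (sum(a * b for a, b in zip(q, item["embedding"])), item)
        for item in index
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Hypothetical items; the vectors stand in for embeddings from one shared space.
raw_items = [
    ("manual.pdf", "document", [0.1, 0.9, 0.1]),
    ("demo.mp4", "video", [0.8, 0.2, 0.1]),
    ("call.wav", "audio", [0.2, 0.1, 0.9]),
    ("photo.png", "image", [0.7, 0.3, 0.2]),
]
index = [
    {"id": item_id, "modality": modality, "embedding": normalize(vec)}
    for item_id, modality, vec in raw_items
]

query = [0.9, 0.1, 0.0]  # stand-in for an embedded text query
for score, item in search(query, index, top_k=2):
    print(f"{item['id']} ({item['modality']}): {score:.3f}")
# Ranks demo.mp4 first, then photo.png
```

Normalizing once at indexing time keeps the per-query work to a single dot product per item.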

Process complex real-world multimodal data without intermediate conversions

The model natively ingests audio without text transcription, processes PDFs up to 6 pages, handles up to 120 seconds of video, and supports up to 6 images per request. This native multimodal processing eliminates preprocessing steps and captures complex relationships that would be lost in format conversions.
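Since these limits are per request, a client may want to check inputs before sending them. The following is a hedged sketch of such a pre-flight check. The constants come from the figures above, while the function and its parameters are hypothetical and do not correspond to any official SDK.

```python
# Limits as stated in the text above.
MAX_PDF_PAGES = 6
MAX_VIDEO_SECONDS = 120
MAX_IMAGES = 6

def limit_violations(image_count=0, pdf_pages=0, video_seconds=0.0):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if image_count > MAX_IMAGES:
        problems.append(f"{image_count} images exceeds the limit of {MAX_IMAGES}")
    if pdf_pages > MAX_PDF_PAGES:
        problems.append(f"{pdf_pages} PDF pages exceeds the limit of {MAX_PDF_PAGES}")
    if video_seconds > MAX_VIDEO_SECONDS:
        problems.append(
            f"{video_seconds} s of video exceeds the limit of {MAX_VIDEO_SECONDS} s"
        )
    return problems

print(limit_violations(image_count=6, pdf_pages=6, video_seconds=120))  # []
print(len(limit_violations(image_count=7, pdf_pages=10, video_seconds=300)))  # 3
```

Inputs that exceed a limit would need to be split into several requests, for example by chunking a long PDF into groups of at most six pages.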

Drop

Not a fit when

  • You need single-modality embeddings only and do not require multimodal understanding across text, images, video, audio, and documents
  • Your application requires real-time embeddings with sub-millisecond latency; requests that carry up to 120 seconds of video or 6-page PDFs may introduce processing delays
  • You have strict data privacy requirements that prohibit sending content to cloud-based APIs, since Gemini Embedding 2 is accessed via the Gemini API or Vertex AI
  • You need embeddings for languages outside the 100+ languages supported by the model
  • Your use case requires on-device or edge deployment without cloud infrastructure dependencies

Commercials

Pricing

Pricing not specified