Converts multimodal input (text, images, and videos) into vector representations. Supports a single fused vector across modalities, independent per-element vectors, configurable output dimensions, video frame sampling control, and a custom task instruction.
API key used for request authentication. Obtain an API key before calling the API; note that a payment method is required.
The multimodal embedding model used for this request. Available options: octen-vl-embedding, octen-vl-embedding-large
The multimodal content to be vectorized. Supports text, images, videos, and combinations of these. Limits: at most 20 total elements, 5 images, and 1 video per request.
Whether to generate a fused embedding. When true, all elements in contents are fused into a single vector; when false, each element produces an independent vector.
The dimensionality of the output embedding vectors. Defaults to the model's max dimension (octen-vl-embedding: 2048, octen-vl-embedding-large: 4096). Any positive integer ≤ the model's max dimension is allowed.
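The parameters above can be validated client-side before sending a request. The sketch below assembles a request body under stated assumptions: the field names (`model`, `contents`, `fused`, `dimension`) and the payload shape are hypothetical, not confirmed by this page; only the limits (20 elements, 5 images, 1 video, per-model max dimensions) come from the documentation.

```python
# Hypothetical request-body builder for the VL embedding API.
# Field names are assumptions; the numeric limits are from the docs.

MAX_DIMENSION = {"octen-vl-embedding": 2048, "octen-vl-embedding-large": 4096}

def build_payload(model, contents, fused=False, dimension=None):
    """Validate documented limits and assemble a request body dict."""
    if model not in MAX_DIMENSION:
        raise ValueError(f"unknown model: {model}")
    if len(contents) > 20:
        raise ValueError("at most 20 total elements per request")
    if sum(1 for c in contents if "image" in c) > 5:
        raise ValueError("at most 5 images per request")
    if sum(1 for c in contents if "video" in c) > 1:
        raise ValueError("at most 1 video per request")
    # Default to the model's maximum output dimension.
    dim = dimension if dimension is not None else MAX_DIMENSION[model]
    if not (0 < dim <= MAX_DIMENSION[model]):
        raise ValueError("dimension must be a positive integer <= model max")
    return {"model": model, "contents": contents, "fused": fused, "dimension": dim}

payload = build_payload(
    "octen-vl-embedding",
    [{"text": "a red bicycle"}, {"image": "https://example.com/bike.jpg"}],
    fused=True,
)
```

With `fused=True`, the two elements above would be embedded into a single 2048-dimensional vector; with `fused=False`, the response would carry one vector per element.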
Controls the frame sampling density for video inputs. Smaller values reduce the number of extracted frames and lower video token consumption. Range: 0 <= x <= 1
Custom task description used to guide the model in understanding the query intent. Its length counts toward input_tokens and shares the 32,000-token total context limit with contents.
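The two optional controls above can be sketched as client-side checks. Assumptions: the field names (`frame_sampling`, `instruct`) are hypothetical, and `count_tokens` is a stand-in for the model's real tokenizer; the [0, 1] range and the 32,000-token shared context limit are from the documentation.

```python
# Sketch of client-side checks for video frame sampling and a custom
# task instruction. Field names are assumptions; limits are documented.

CONTEXT_LIMIT = 32_000  # tokens shared between instruct and contents

def with_video_options(payload, frame_sampling=None, instruct=None,
                       count_tokens=len):
    """Attach optional fields, enforcing the documented constraints.

    `count_tokens` defaults to character count as a crude stand-in;
    a real client would use the model's own tokenizer.
    """
    if frame_sampling is not None:
        if not 0 <= frame_sampling <= 1:
            raise ValueError("frame_sampling must be within [0, 1]")
        payload["frame_sampling"] = frame_sampling
    if instruct is not None:
        content_tokens = sum(
            count_tokens(c.get("text", "")) for c in payload["contents"]
        )
        if content_tokens + count_tokens(instruct) > CONTEXT_LIMIT:
            raise ValueError("instruct + contents exceed the 32,000-token limit")
        payload["instruct"] = instruct
    return payload

opts = with_video_options(
    {"contents": [{"text": "find clips of cyclists"}]},
    frame_sampling=0.5,
    instruct="Retrieve video segments matching the described activity.",
)
```

Lowering `frame_sampling` trades retrieval granularity for token cost, which matters because video frames and the instruction draw from the same 32,000-token budget.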
Successful VL embedding response