Model Capabilities

Image Understanding

When sending images, we advise against storing request/response history on the server; otherwise, the request may fail. See Disable storing previous request/response on server.

Some models allow images in the input. The model will consider the image context when generating the response.

Constructing the message body - difference from text-only prompt

The request message for image understanding is similar to a text-only prompt. The main difference is that instead of passing content as a plain text string:

JSON

[
  {
    "role": "user",
    "content": "What is in this image?"
  }
]

we pass content as a list of objects:

JSON

[
  {
    "role": "user",
    "content": [
      {
        "type": "input_image",
        "image_url": "data:image/jpeg;base64,<base64_image_string>",
        "detail": "high"
      },
      {
        "type": "input_text",
        "text": "What is in this image?"
      }
    ]
  }
]

The image_url value can also be the image's URL on the Internet.
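As a rough illustration of how the message body above might be assembled from raw image bytes, here is a minimal sketch using only the Python standard library. The helper name build_image_message and its parameters are hypothetical and not part of any SDK; the field names ("input_image", "image_url", "detail", "input_text") are taken from the JSON example above.

```python
import base64


def build_image_message(
    image_bytes: bytes,
    prompt: str,
    mime_type: str = "image/jpeg",
    detail: str = "high",
) -> list[dict]:
    """Build a message list with one image and one text prompt.

    Hypothetical helper for illustration; it mirrors the JSON structure
    shown above and does not call any API.
    """
    # Encode the raw bytes and wrap them in a data URL.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_url = f"data:{mime_type};base64,{encoded}"

    return [
        {
            "role": "user",
            "content": [
                {"type": "input_image", "image_url": data_url, "detail": detail},
                {"type": "input_text", "text": prompt},
            ],
        }
    ]
```

In practice the bytes would typically come from reading a local file in binary mode, with mime_type set to "image/png" for PNG files.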

Image understanding example

import os
from xai_sdk import Client
from xai_sdk.chat import user, image

client = Client(
    api_key=os.getenv("XAI_API_KEY"),
    management_api_key=os.getenv("XAI_MANAGEMENT_API_KEY"),
    timeout=3600,
)

image_url = "https://science.nasa.gov/wp-content/uploads/2023/09/web-first-images-release.png"


chat = client.chat.create(model="grok-4-1-fast-reasoning")
chat.append(
    user(
        "What's in this image?",
        image(image_url=image_url, detail="high"),
    )
)

response = chat.sample()
print(response)

# The response ID can be used to continue the conversation later
print(response.id)

Image input general limits

  • Maximum image size: 20MiB
  • Maximum number of images: No limit
  • Supported image file types: jpg/jpeg or png.
  • Any image/text input order is accepted (e.g. the text prompt can precede the image)
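
The limits above can be checked locally before sending a request. The following is a minimal sketch, not part of any SDK: the function name check_image is hypothetical, the file type is inferred from the file extension only (an assumption; the server may inspect the actual content), and a file of exactly 20 MiB is assumed to be allowed.

```python
from pathlib import Path

# Limits taken from the list above.
MAX_IMAGE_BYTES = 20 * 1024 * 1024  # 20 MiB
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def check_image(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems found; an empty list means no problems.

    Hypothetical pre-flight check; it only looks at the file name and size.
    """
    problems = []

    # Compare the extension case-insensitively (e.g. ".JPG" is accepted).
    extension = Path(filename).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or '(none)'}")

    # Assumption: the 20 MiB limit is inclusive.
    if size_bytes > MAX_IMAGE_BYTES:
        problems.append(f"image too large: {size_bytes} bytes")

    return problems
```

For a file on disk, the size could be obtained with Path(filename).stat().st_size before calling the helper.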