A multimodal message is a message whose content combines more than one type, such as text and images. Its content field holds an array of content blocks, each with a type and the data for that type (e.g., image_url for images) [citation:6].
A multimodal message uses the content-block structure to combine different modalities in a single message, which is how vision-capable LLMs receive images alongside text. The most common use case is passing an image, either as a base64-encoded string or as a publicly accessible URL, inside an image_url content block. The core concept is the ContentBlock: a dictionary whose type key dictates the structure of the rest of the block [citation:6].
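As a rough illustration of the structure described above, the sketch below builds a multimodal message from plain Python dictionaries. It assumes the widely used OpenAI-style layout, in which an image_url block nests its value under a "url" key and a base64 image is passed as a data URL. The helper names (text_block, image_url_block, image_bytes_block, multimodal_message) and the example URL are hypothetical, and the exact block schema may differ between providers and libraries.

```python
import base64


def text_block(text: str) -> dict:
    """Content block for plain text; "type" decides which other keys appear."""
    return {"type": "text", "text": text}


def image_url_block(url: str) -> dict:
    """Content block that points at an image by URL (assumed OpenAI-style shape)."""
    return {"type": "image_url", "image_url": {"url": url}}


def image_bytes_block(data: bytes, mime_type: str = "image/png") -> dict:
    """Content block that embeds raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return image_url_block(f"data:{mime_type};base64,{encoded}")


def multimodal_message(role: str, blocks: list) -> dict:
    """Wrap a list of content blocks in a single message."""
    return {"role": role, "content": blocks}


# Hypothetical inputs: a placeholder URL and the 8-byte PNG signature
# standing in for real image data.
message = multimodal_message(
    "user",
    [
        text_block("Describe these images."),
        image_url_block("https://example.com/cat.png"),
        image_bytes_block(b"\x89PNG\r\n\x1a\n"),
    ],
)
```

The resulting message is an ordinary dictionary, so it can be serialized to JSON or handed to whichever client library is in use, provided that library accepts this block layout.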