
Question 30 of 32.

A user's message contains both text and an image. How do you construct the correct multimodal HumanMessage content for a vision model?

Construct a multimodal HumanMessage where the content field is a list of content blocks: a text block for the user's text and an image_url block for the image (either as a URL or base64-encoded data).

Vision-capable chat models (for example GPT-4V, Claude 3, and Gemini) expect a message whose content is a structured array rather than a plain string. The content argument of HumanMessage accepts a list of dictionaries, each with a type key. For text, use {"type": "text", "text": "your text"}. For images, use {"type": "image_url", "image_url": {"url": "..."}}. The URL can be a public HTTP/HTTPS URL or a data URL that embeds the base64-encoded image, in the form data:<mime type>;base64,<encoded bytes>. LangChain chat model integrations that support image input convert this format to the one the provider expects.
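As a minimal sketch of the block structure described above, the following builds the content list using only the standard library. The helper names (image_block_from_url, image_block_from_bytes, build_content) and the placeholder image bytes are hypothetical and chosen for illustration; only the dictionary shapes come from the explanation.

```python
import base64


def image_block_from_url(url: str) -> dict:
    """Wrap a public HTTP/HTTPS URL in an image_url content block."""
    return {"type": "image_url", "image_url": {"url": url}}


def image_block_from_bytes(data: bytes, mime_type: str = "image/jpeg") -> dict:
    """Encode raw image bytes as a base64 data URL inside an image_url block."""
    encoded = base64.b64encode(data).decode("ascii")
    data_url = f"data:{mime_type};base64,{encoded}"
    return {"type": "image_url", "image_url": {"url": data_url}}


def build_content(text: str, image_block: dict) -> list:
    """Return the content list: one text block followed by one image block."""
    return [{"type": "text", "text": text}, image_block]


# Placeholder bytes standing in for a real image file (assumption for the demo).
fake_png = b"\x89PNG\r\n\x1a\n"

content = build_content(
    "What is in this image?",
    image_block_from_bytes(fake_png, mime_type="image/png"),
)
```

Assuming langchain_core is installed, the resulting list would be passed as HumanMessage(content=content), with HumanMessage imported from langchain_core.messages, and the message sent to a vision-capable chat model.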

Building a Multimodal HumanMessage