feat(core): Parse Multimodal MCP Tool responses (#5529)

Co-authored-by: Luccas Paroni <luccasparoni@google.com>
This commit is contained in:
Luccas Paroni
2025-08-05 16:19:47 -03:00
committed by GitHub
parent b465145229
commit 2778c7d851
4 changed files with 580 additions and 56 deletions

View File

@@ -15,9 +15,11 @@ The Gemini CLI core (`packages/core`) features a robust system for defining, reg
- `execute()`: The core method that performs the tool's action and returns a `ToolResult`.
- **`ToolResult` (`tools.ts`):** An interface defining the structure of a tool's execution outcome:
- `llmContent`: The factual string content to be included in the history sent back to the LLM for context.
- `llmContent`: The factual content to be included in the history sent back to the LLM for context. This can be a simple string or a `PartListUnion` (an array of `Part` objects and strings) for rich content.
- `returnDisplay`: A user-friendly string (often Markdown) or a special object (like `FileDiff`) for display in the CLI.
- **Returning Rich Content:** Tools are not limited to returning simple text. The `llmContent` can be a `PartListUnion`, which is an array that can contain a mix of `Part` objects (for images, audio, etc.) and `string`s. This allows a single tool execution to return multiple pieces of rich content.
- **Tool Registry (`tool-registry.ts`):** A class (`ToolRegistry`) responsible for:
- **Registering Tools:** Holding a collection of all available built-in tools (e.g., `ReadFileTool`, `ShellTool`).
- **Discovering Tools:** It can also discover tools dynamically:

View File

@@ -571,6 +571,56 @@ The MCP integration tracks several states:
This comprehensive integration makes MCP servers a powerful way to extend the Gemini CLI's capabilities while maintaining security, reliability, and ease of use.
## Returning Rich Content from Tools
MCP tools are not limited to returning simple text. You can return rich, multi-part content, including text, images, audio, and other binary data in a single tool response. This allows you to build powerful tools that can provide diverse information to the model in a single turn.
All data returned from the tool is processed and sent to the model as context for its next generation, enabling it to reason about or summarize the provided information.
### How It Works
To return rich content, your tool's response must adhere to the MCP specification for a [`CallToolResult`](https://modelcontextprotocol.io/specification/2025-06-18/server/tools#tool-result). The `content` field of the result should be an array of `ContentBlock` objects. The Gemini CLI will correctly process this array, separating text from binary data and packaging it for the model.
You can mix and match different content block types in the `content` array. The supported block types include:
- `text`
- `image`
- `audio`
- `resource` (embedded content)
- `resource_link`
### Example: Returning Text and an Image
Here is an example of a valid JSON response from an MCP tool that returns both a text description and an image:
```json
{
"content": [
{
"type": "text",
"text": "Here is the logo you requested."
},
{
"type": "image",
"data": "BASE64_ENCODED_IMAGE_DATA_HERE",
"mimeType": "image/png"
},
{
"type": "text",
"text": "The logo was created in 2025."
}
]
}
```
When the Gemini CLI receives this response, it will:
1. Extract all the text and combine it into a single `functionResponse` part for the model.
2. Present the image data as a separate `inlineData` part.
3. Provide a clean, user-friendly summary in the CLI, indicating that both text and an image were received.
This enables you to build sophisticated tools that can provide rich, multi-modal context to the Gemini model.
## MCP Prompts as Slash Commands
In addition to tools, MCP servers can expose predefined prompts that can be executed as slash commands within the Gemini CLI. This allows you to create shortcuts for common or complex queries that can be easily invoked by name.