Introduction
Voice Live supports connecting to remote Model Context Protocol (MCP) servers during a voice session. MCP integration enables the model to discover and invoke tools hosted on external services, such as documentation search, wiki lookup, or custom APIs, and incorporate tool results into spoken responses.
MCP server integration differs from function calling in these ways:
| Aspect | Function calling | MCP server |
|---|---|---|
| Tool execution | Client-side | Server-side (managed by Voice Live) |
| Tool discovery | Client defines tools explicitly | Voice Live auto-discovers tools from MCP endpoint |
| Approval model | Not applicable | Configurable: "always" (default), "never", or per-tool dictionary |
| API version required | 2025-10-01 | 2026-04-10 or later |
Key concepts
- MCPServer definition: Declare one or more MCP endpoints in the session configuration with server_label, server_url, and optional allowed_tools, headers, authorization, and require_approval.
- Tool discovery: On session start, Voice Live calls each MCP server's tool listing endpoint and emits mcp_list_tools events.
- Tool invocation: When the model decides to call an MCP tool, the service handles execution and streams response.mcp_call events.
- Approval flow: When require_approval is set to "always" (the default), the client receives an mcp_approval_request conversation item and must respond with an mcp_approval_response before the call executes. Set require_approval to "never" for automatic execution, or use a per-tool dictionary to mix modes on the same server.
Approval modes
The require_approval property on each MCPServer controls whether tool calls need client-side approval before execution. It accepts a string or a per-tool dictionary.
| Mode | Value | Behavior |
|---|---|---|
| Always (default) | "always" | Every tool call sends an mcp_approval_request to the client. The call doesn't execute until the client responds with mcp_approval_response and approve=true. |
| Never | "never" | Tool calls execute automatically. No approval event is sent. |
| Per-tool | {"always": ["tool_a"], "never": ["tool_b", "tool_c"]} | Each tool is assigned an approval mode individually. Tools not listed in either key default to "always". |
When to use each mode:
- "always": Use for tools that perform write operations, access sensitive data, or incur costs. The voice samples auto-approve subsequent calls to the same server within the same turn to reduce repeated prompts.
- "never": Use for read-only lookups, search APIs, or trusted internal tools where user confirmation adds latency without security benefit.
- Per-tool dictionary: Use when a single MCP server exposes a mix of read-only and write tools. For example, a documentation server might allow search_docs without approval but require approval for submit_feedback.
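The lookup rules in the table above can be sketched as a small client-side helper. This is only an illustration: resolve_approval_mode is a hypothetical name, not part of the SDK, and the rule that a tool listed under both keys falls back to "always" is an assumption made here for safety.

```python
from typing import Union

# Either one mode for the whole server or a per-tool dictionary.
ApprovalSetting = Union[str, dict[str, list[str]]]


def resolve_approval_mode(require_approval: ApprovalSetting, tool_name: str) -> str:
    """Return "always" or "never" for one tool (hypothetical helper, not an SDK call)."""
    # A plain string applies to every tool on the server.
    if isinstance(require_approval, str):
        if require_approval not in ("always", "never"):
            raise ValueError(f"unknown approval mode: {require_approval!r}")
        return require_approval
    always = require_approval.get("always", [])
    never = require_approval.get("never", [])
    # Assumption: a tool listed under both keys is treated as "always".
    if tool_name in never and tool_name not in always:
        return "never"
    # Listed under "always", or not listed at all: both resolve to "always".
    return "always"


if __name__ == "__main__":
    setting = {"always": ["submit_feedback"], "never": ["search_docs"]}
    for tool in ("search_docs", "submit_feedback", "unlisted_tool"):
        print(tool, resolve_approval_mode(setting, tool))
```

A helper like this is useful mainly for predicting which calls will produce an approval prompt; the service itself decides when to send mcp_approval_request.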
Note
In voice scenarios, each approval triggers a conversational prompt. Configure require_approval carefully to balance security with conversation flow. See Voice-native approval for implementation patterns.
For the full MCP event and type reference, see Voice Live API reference.
Learn how to connect remote MCP servers to a Voice Live session using the VoiceLive SDK for Python. This article builds on the Quickstart: Create a Voice Live real-time voice agent with MCP server integration.
Reference documentation | Package (PyPI) | Additional samples on GitHub
Follow the how-to below or get the full sample code:
Prerequisites
- An Azure subscription. Create one for free.
- Python 3.10 or later. If you don't have a suitable version of Python installed, follow the instructions in the VS Code Python Tutorial, which is the easiest way to install Python on your operating system.
- A Microsoft Foundry resource created in one of the supported regions. For more information about region availability, see the Voice Live overview documentation.
- The azure-ai-voicelive package, version 1.2.0 or later (MCP support requires api_version="2026-04-10").
- Assign the Cognitive Services User role to your user account. You can assign roles in the Azure portal under Access control (IAM) > Add role assignment.
Tip
To use Voice Live with MCP, you don't need to deploy an audio model with your Foundry resource. Voice Live is fully managed, and the model is automatically deployed for you. For more information about model availability, see the Voice Live overview documentation.
Prepare the environment
Complete the Voice Live quickstart to set up your environment, configure authentication, and test your first Voice Live conversation.
MCP integration concepts
MCP server definition
Use the MCPServer class to declare each remote MCP endpoint. At minimum, provide server_label (a display name) and server_url (the MCP endpoint URL). Optionally restrict available tools with allowed_tools and configure the approval mode.
Approval modes
Control whether MCP tool calls require user approval before execution:
- require_approval="never": The tool executes automatically when the model invokes it.
- require_approval="always" (default): The client receives an mcp_approval_request and must respond before the tool runs.
- Per-tool dictionary: Set require_approval={"never": ["tool_a"], "always": ["tool_b"]} for granular control.
API version requirement
MCP support requires api_version="2026-04-10" or later. Pass this value in the connect() call.
Define MCP servers
Define the MCP servers that Voice Live can use during the session. Each server is an MCPServer instance added to the tools list in the session configuration.
The following code defines two MCP servers: one with automatic tool execution and one that requires user approval before running.
# Define MCP servers that Voice Live can use during the session.
# Each server is an MCPServer instance added to the tools list.
mcp_tools: list[Tool] = [
    MCPServer(
        server_label="deepwiki",
        server_url="https://mcp.deepwiki.com/mcp",
        allowed_tools=["read_wiki_structure", "ask_question"],
        require_approval="never",
    ),
    MCPServer(
        server_label="azure_doc",
        server_url="https://learn.microsoft.com/api/mcp",
        require_approval="always",
    ),
]
In this sample:
- The deepwiki server allows only the read_wiki_structure and ask_question tools, with require_approval="never" for automatic execution.
- The azure_doc server allows all tools on the endpoint, with require_approval="always" so users can review each call before execution.
Configure the session with MCP tools
Pass the MCP server definitions to the RequestSession tools list alongside your voice, modality, and turn-detection settings.
async def _setup_session(self, mcp_tools: list[Tool]):
    """Configure the VoiceLive session with MCP tools."""
    logger.info("Setting up voice conversation session with MCP tools...")
    # Create voice configuration
    voice_config: Union[AzureStandardVoice, str]
    if "-" in self.voice or ":" in self.voice:
        voice_config = AzureStandardVoice(name=self.voice)
    else:
        voice_config = self.voice
    # Create turn detection configuration
    turn_detection_config = ServerVad(
        threshold=0.5,
        prefix_padding_ms=300,
        silence_duration_ms=500)
    # Create session configuration with MCP tools in the tools list
    session_config = RequestSession(
        modalities=[Modality.TEXT, Modality.AUDIO],
        instructions=self.instructions,
        voice=voice_config,
        input_audio_format=InputAudioFormat.PCM16,
        output_audio_format=OutputAudioFormat.PCM16,
        turn_detection=turn_detection_config,
        input_audio_echo_cancellation=AudioEchoCancellation(),
        input_audio_noise_reduction=AudioNoiseReduction(type="azure_deep_noise_suppression"),
        tools=mcp_tools,
        tool_choice=ToolChoiceLiteral.AUTO,
        input_audio_transcription=AudioInputTranscriptionOptions(
            model="azure-speech" if "realtime" not in self.model.lower() else "whisper-1"
        ),
    )
    # Interim response bridges latency during MCP tool calls, but is only
    # supported on non-realtime model pipelines (e.g. gpt-4o-mini).
    if "realtime" not in self.model.lower():
        session_config.interim_response = LlmInterimResponseConfig(
            triggers=[InterimResponseTrigger.TOOL, InterimResponseTrigger.LATENCY],
            latency_threshold_ms=100,
            instructions="Create friendly interim responses indicating wait time due to "
            "ongoing processing, if any. Do not include in all responses! "
            "Do not say you don't have real-time access to information when "
            "calling tools!",
        )
        logger.info("Interim response enabled for model %s", self.model)
    else:
        logger.info("Interim response skipped - not supported on realtime pipeline (%s)", self.model)
    conn = self.connection
    assert conn is not None
    await conn.session.update(session=session_config)
    logger.info("Session configuration with MCP tools sent")
In this sample:
- RequestSession bundles MCP tools with audio format, voice, and turn detection settings.
- connection.session.update(session=session_config) sends the full configuration to Voice Live.
- Voice Live automatically discovers available tools from each MCP server after the session starts.
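The model-dependent branching in the session setup can be isolated into a pure function, which makes it easy to check without opening a connection. The following is a minimal sketch: plan_session_options is a hypothetical helper, and the plain dictionary stands in for the SDK types (AzureStandardVoice, AudioInputTranscriptionOptions, LlmInterimResponseConfig) rather than reproducing them.

```python
from typing import Any


def plan_session_options(model: str, voice: str) -> dict[str, Any]:
    """Mirror the sample's branching with plain values (illustrative stand-ins only)."""
    is_realtime = "realtime" in model.lower()
    # Names containing "-" or ":" are treated as Azure standard voices;
    # anything else is passed through as a plain string.
    if "-" in voice or ":" in voice:
        voice_config: Any = {"kind": "azure_standard_voice", "name": voice}
    else:
        voice_config = voice
    return {
        "voice": voice_config,
        # Realtime pipelines transcribe with whisper-1, others with azure-speech.
        "transcription_model": "whisper-1" if is_realtime else "azure-speech",
        # Interim responses are only configured on non-realtime pipelines.
        "interim_response": not is_realtime,
    }


if __name__ == "__main__":
    # Both argument values are placeholders chosen for illustration.
    print(plan_session_options("gpt-4o-mini", "some-voice-name"))
```

Keeping these decisions in one place means a change of model name affects the transcription model and the interim-response setting together, as it does in the sample.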
Handle MCP events
Process MCP-specific events in the event loop. The key events are:
- CONVERSATION_ITEM_CREATED with ItemType.MCP_CALL: An MCP tool call was triggered by the model.
- RESPONSE_MCP_CALL_COMPLETED: The MCP call completed successfully.
- RESPONSE_MCP_CALL_FAILED: The MCP call failed.
- CONVERSATION_ITEM_CREATED with ItemType.MCP_APPROVAL_REQUEST: The server is requesting approval for a tool call.
- CONVERSATION_ITEM_CREATED with ItemType.MCP_LIST_TOOLS: Tool discovery completed for a server.
async def _handle_event(self, event):
    """Handle different types of events from VoiceLive, including MCP events."""
    ap = self.audio_processor
    conn = self.connection
    assert ap is not None
    assert conn is not None
    if event.type == ServerEventType.SESSION_UPDATED:
        logger.info("Session ready: %s", event.session.id)
        await write_conversation_log(f"SessionID: {event.session.id}")
        await write_conversation_log(f"Model: {event.session.model}")
        await write_conversation_log(f"Voice: {event.session.voice}")
        await write_conversation_log("")
        self.session_ready = True
        ap.start_capture()
    elif event.type == ServerEventType.INPUT_AUDIO_BUFFER_SPEECH_STARTED:
        logger.info("User started speaking - stopping playback")
        print("Listening...")
        ap.skip_pending_audio()
        # Approval call counter is NOT reset on speech - it tracks the
        # lifecycle of a task (reset on denial or after results are spoken)
        # But approved-servers-this-turn resets when user starts a new topic
        if self._pending_approval is None and self._mcp_call_in_progress <= 0:
            self._approved_servers_this_turn.clear()
        # Clear deferred response flags if no MCP calls are in progress.
        # Prevents stale _needs_response_create from re-triggering result
        # playback after the user interrupts.
        if self._mcp_call_in_progress <= 0:
            self._needs_response_create = False
            self._mcp_results_pending = False
        if self._active_response and not self._response_api_done:
            try:
                await conn.response.cancel()
            except Exception as e:
                if "no active response" not in str(e).lower():
                    logger.warning("Cancel failed: %s", e)
        # If an MCP call is running, mark current calls as stale (user is moving on)
        # and let the user know it's still in progress
        if self._mcp_call_in_progress > 0 and self._pending_approval is None:
            self._stale_mcp_items.update(self._active_mcp_items)
            logger.info("User spoke during MCP call - marking %d calls as stale", len(self._active_mcp_items))
            try:
                await conn.conversation.item.create(
                    item=MessageItem(
                        role="system",
                        content=[InputTextContentPart(
                            text="A tool call is still running in the background. The user just spoke. "
                            "Respond to what the user said. If a tool result arrives later, "
                            "briefly introduce it as a late result from an earlier request."
                        )],
                    )
                )
            except Exception as e:
                logger.warning("Failed to inject MCP status update: %s", e)
    elif event.type == ServerEventType.INPUT_AUDIO_BUFFER_SPEECH_STOPPED:
        logger.info("User stopped speaking")
        print("Processing...")
    elif event.type == ServerEventType.RESPONSE_CREATED:
        logger.info("Assistant response created")
        self._active_response = True
        self._response_api_done = False
    elif event.type == ServerEventType.RESPONSE_AUDIO_DELTA:
        ap.queue_audio(event.delta)
    elif event.type == ServerEventType.RESPONSE_AUDIO_DONE:
        logger.info("Assistant finished speaking")
        print("Ready for next input...")
    elif event.type == ServerEventType.RESPONSE_TEXT_DONE:
        text = event.text if hasattr(event, 'text') else event.get("text", "")
        print(f"Assistant text:\t{text}")
        await write_conversation_log(f"Assistant Text Response:\t{text}")
    elif event.type == ServerEventType.RESPONSE_AUDIO_TRANSCRIPT_DONE:
        transcript = event.transcript if hasattr(event, 'transcript') else event.get("transcript", "")
        print(f"Assistant audio transcript:\t{transcript}")
        await write_conversation_log(f"Assistant Audio Response:\t{transcript}")
    elif event.type == ServerEventType.RESPONSE_DONE:
        logger.info("Response complete")
        await write_conversation_log("--- Response complete ---")
        self._active_response = False
        self._response_api_done = True
        # If an approval prompt needs to be injected, do it now that no response is active
        if self._approval_prompt_needed and self._pending_approval is not None:
            self._approval_prompt_needed = False
            await self._send_approval_voice_prompt(self._pending_approval, conn)
        # If MCP results are pending and all calls are now done, create response
        elif self._mcp_results_pending and self._mcp_call_in_progress <= 0 and self._pending_approval is None:
            self._mcp_results_pending = False
            try:
                await conn.response.create()
            except Exception:
                pass
        # If a response.create was deferred due to collision, retry now
        elif self._needs_response_create:
            self._needs_response_create = False
            try:
                await conn.response.create()
            except Exception:
                pass  # Best-effort retry
    elif event.type == ServerEventType.CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_COMPLETED:
        transcript = event.transcript if hasattr(event, 'transcript') else event.get("transcript", "")
        logger.info("User said: %s", transcript)
        print(f"You said:\t{transcript}")
        await write_conversation_log(f"User Input:\t{transcript}")
        # Interpret as an approval answer if we have a pending approval,
        # whether or not the prompt has finished speaking. This allows the
        # user to barge in with "yes" without waiting for the full prompt.
        if self._pending_approval is not None:
            await self._resolve_voice_approval(transcript, conn)
    elif event.type == ServerEventType.ERROR:
        msg = event.error.message
        # Reset response state - errors can terminate a response without RESPONSE_DONE
        self._active_response = False
        self._response_api_done = True
        if "Cancellation failed: no active response" not in msg:
            if "interim response" in msg.lower():
                logger.warning("Interim response not supported with this model pipeline (non-fatal)")
            elif "active response" in msg.lower():
                logger.debug("Response collision (expected during MCP flow): %s", msg)
            else:
                logger.error("VoiceLive error: %s", msg)
                print(f"Error: {msg}")
                await write_conversation_log(f"ERROR: {msg}")
    # MCP-specific events
    elif event.type == ServerEventType.MCP_LIST_TOOLS_IN_PROGRESS:
        logger.info("MCP list tools in progress for %s", event.item_id)
    elif event.type == ServerEventType.MCP_LIST_TOOLS_COMPLETED:
        logger.info("MCP list tools completed for %s", event.item_id)
        print("MCP tools discovered successfully")
        await write_conversation_log("MCP tools discovered successfully")
    elif event.type == ServerEventType.MCP_LIST_TOOLS_FAILED:
        logger.error("MCP list tools failed for %s", event.item_id)
        print("MCP tool discovery failed")
        await write_conversation_log("ERROR: MCP tool discovery failed")
    elif event.type == ServerEventType.RESPONSE_MCP_CALL_IN_PROGRESS:
        logger.info("MCP call in progress for %s", event.item_id)
        print("MCP tool call in progress...")
        await write_conversation_log(f"MCP call in progress: {event.item_id}")
        self._mcp_call_in_progress += 1
        self._active_mcp_items.add(event.item_id)
        self._start_mcp_stall_timer(conn)
    elif event.type == ServerEventType.RESPONSE_MCP_CALL_COMPLETED:
        item_id = event.item_id
        self._mcp_call_in_progress = max(0, self._mcp_call_in_progress - 1)
        self._active_mcp_items.discard(item_id)
        self._cancel_mcp_stall_timer()
        if item_id in self._handled_mcp_completions:
            logger.debug("Ignoring duplicate MCP completion for %s", item_id)
        else:
            self._handled_mcp_completions.add(item_id)
            is_stale = item_id in self._stale_mcp_items
            self._stale_mcp_items.discard(item_id)
            logger.info("MCP call completed for %s (stale=%s)", item_id, is_stale)
            await write_conversation_log(f"MCP call completed: {item_id} (stale={is_stale})")
            await self._handle_mcp_call_completed(event, conn, is_stale=is_stale)
    elif event.type == ServerEventType.RESPONSE_MCP_CALL_FAILED:
        item_id = event.item_id
        logger.error("MCP call failed for %s", item_id)
        print("MCP tool call failed")
        await write_conversation_log(f"ERROR: MCP call failed: {item_id}")
        self._mcp_call_in_progress = max(0, self._mcp_call_in_progress - 1)
        self._active_mcp_items.discard(item_id)
        self._stale_mcp_items.discard(item_id)
        self._cancel_mcp_stall_timer()
        # Kick the model to inform the user the tool call failed
        try:
            await conn.response.create()
        except Exception as e:
            if "active response" not in str(e).lower():
                logger.warning("Failed to create response after MCP failure: %s", e)
    elif event.type == ServerEventType.CONVERSATION_ITEM_CREATED:
        logger.info("Conversation item created: id=%s, type=%s", event.item.id, event.item.type)
        if event.item.type == ItemType.MCP_LIST_TOOLS:
            logger.info("MCP list tools item: server_label=%s", event.item.server_label)
        elif event.item.type == ItemType.MCP_CALL:
            await self._handle_mcp_call_arguments(event, conn)
        elif event.item.type == ItemType.MCP_APPROVAL_REQUEST:
            await self._handle_mcp_approval_request(event, conn)
    else:
        logger.debug("Unhandled event type: %s", event.type)
In this sample:
- _handle_mcp_call_arguments waits for the full arguments to stream in via RESPONSE_MCP_CALL_ARGUMENTS_DONE, then waits for the response to complete.
- _handle_mcp_call_completed receives the tool output and triggers a new response so the model can incorporate the result into its next spoken reply.
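The bookkeeping that the event handler spreads across several attributes (in-progress calls, stale calls, and duplicate completions) can be sketched as one small class. McpCallTracker is a hypothetical name introduced here for illustration. Unlike the sample, it derives the in-progress count from the set of active item IDs instead of keeping a separate counter.

```python
from typing import Optional


class McpCallTracker:
    """Tracks MCP call lifecycle by item ID (illustrative sketch, not SDK code)."""

    def __init__(self) -> None:
        self.active: set[str] = set()   # calls currently running
        self.stale: set[str] = set()    # calls the user has moved on from
        self.handled: set[str] = set()  # completions already processed

    @property
    def in_progress(self) -> int:
        return len(self.active)

    def started(self, item_id: str) -> None:
        self.active.add(item_id)

    def user_interrupted(self) -> None:
        # Everything running when the user speaks is marked stale.
        self.stale.update(self.active)

    def completed(self, item_id: str) -> Optional[bool]:
        """Return None for a duplicate completion, else whether the result is stale."""
        self.active.discard(item_id)
        if item_id in self.handled:
            return None
        self.handled.add(item_id)
        is_stale = item_id in self.stale
        self.stale.discard(item_id)
        return is_stale


if __name__ == "__main__":
    tracker = McpCallTracker()
    tracker.started("call_1")
    tracker.user_interrupted()
    print(tracker.completed("call_1"), tracker.completed("call_1"))
```

A stale result can still be spoken, but the handler can introduce it as a late answer to an earlier request, which is what the injected system message in the sample asks the model to do.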
Handle approval requests
When a server is configured with require_approval="always", client code must handle the approval flow. Instead of blocking on console input, inject a system message so the model asks the user verbally and parse the spoken response.
async def _handle_mcp_approval_request(self, conversation_created_event, connection):
    """Handle MCP approval request by asking the user via voice."""
    if not isinstance(conversation_created_event, ServerEventConversationItemCreated):
        logger.error("Expected ServerEventConversationItemCreated")
        return
    if not isinstance(conversation_created_event.item, ResponseMCPApprovalRequestItem):
        logger.error("Expected ResponseMCPApprovalRequestItem")
        return
    mcp_approval_item = conversation_created_event.item
    approval_id = mcp_approval_item.id
    server_label = mcp_approval_item.server_label
    function_name = mcp_approval_item.name
    if not approval_id:
        logger.error("MCP approval item missing ID")
        return
    # Auto-deny after too many calls to the same server in one task.
    # This prevents infinite tool-call loops in voice UX.
    MAX_APPROVAL_CALLS_PER_TASK = 3
    current_count = self._approval_call_count.get(server_label, 0)
    if current_count >= MAX_APPROVAL_CALLS_PER_TASK:
        logger.info("Auto-denying %s - reached %d calls this task", function_name, current_count)
        print(f" Auto-denied: {server_label}/{function_name} (max {MAX_APPROVAL_CALLS_PER_TASK} calls reached)")
        try:
            await connection.conversation.item.create(
                item=MCPApprovalResponseRequestItem(approval_request_id=approval_id, approve=False)
            )
        except Exception as e:
            logger.warning("Failed to send auto-deny: %s", e)
        return
    # Auto-approve if user already approved this server earlier in the same turn.
    # This avoids repeated approval prompts for consecutive calls to the same service.
    if server_label in self._approved_servers_this_turn:
        logger.info("Auto-approving %s - server already approved this turn", function_name)
        print(f" Auto-approved: {server_label}/{function_name} (already approved this turn)")
        try:
            await connection.conversation.item.create(
                item=MCPApprovalResponseRequestItem(approval_request_id=approval_id, approve=True)
            )
        except Exception as e:
            logger.warning("Failed to send auto-approve: %s", e)
        return
    # If another approval is already pending, queue this one
    if self._pending_approval is not None:
        logger.info("Queuing approval for %s - another is already pending", function_name)
        self._approval_queue.append({
            "approval_id": approval_id,
            "server_label": server_label,
            "function_name": function_name,
        })
        return
    logger.info("MCP approval request: server=%s tool=%s", server_label, function_name)
    print("\nMCP Approval Request (voice-based):")
    print(f" Server: {server_label} Tool: {function_name}")
    # Store the pending approval. If no response is currently active,
    # send the voice prompt immediately. Otherwise, defer it to
    # RESPONSE_DONE to avoid colliding with an active response.
    self._pending_approval = {
        "approval_id": approval_id,
        "server_label": server_label,
        "function_name": function_name,
    }
    if not self._active_response:
        await self._send_approval_voice_prompt(self._pending_approval, connection)
    else:
        self._approval_prompt_needed = True

async def _send_approval_voice_prompt(self, pending: dict, connection):
    """Inject a system message asking the model to verbally request permission."""
    server = pending["server_label"]
    call_count = self._approval_call_count.get(server, 0)
    self._approval_call_count[server] = call_count + 1
    if call_count == 0:
        prompt = (
            "You MUST ask the user for explicit permission before proceeding. "
            f'Say exactly: "I\'d like to search the {server} service for information. '
            f'Do you approve? Please say yes or no."'
        )
    else:
        prompt = (
            "You MUST ask the user for permission again. "
            'Say exactly: "I need to do one more search to get complete information. '
            'Should I continue? Please say yes or no."'
        )
    try:
        await connection.conversation.item.create(
            item=MessageItem(
                role="system",
                content=[InputTextContentPart(text=prompt)],
            )
        )
        await connection.response.create()
    except Exception as e:
        logger.warning("Failed to send approval voice prompt: %s", e)

async def _resolve_voice_approval(self, transcript: str, connection):
    """Interpret the user's spoken response as approval or denial."""
    pending = self._pending_approval
    if pending is None:
        return
    text = transcript.strip().lower()
    # Match "yes" or "no" as whole words (word boundaries prevent false
    # positives from words like "yesterday" or "nobody").
    # Also accept "stop" and "cancel" as denial.
    approved = bool(re.search(r'\byes\b', text))
    denied = bool(re.search(r'\b(no|stop|cancel)\b', text))
    if not approved and not denied:
        # Ambiguous - ask again via the deferred prompt mechanism
        logger.info("Ambiguous approval response: %s", transcript)
        self._approval_prompt_needed = True
        return
    if approved and denied:
        # Conflicting signals - treat as denial for safety
        approved = False
    # Clear the pending state before sending the response
    self._pending_approval = None
    if approved:
        self._approved_servers_this_turn.add(pending["server_label"])
    else:
        self._approval_call_count.clear()  # Topic is over
        self._approved_servers_this_turn.discard(pending["server_label"])
    approval_response_item = MCPApprovalResponseRequestItem(
        approval_request_id=pending["approval_id"], approve=approved
    )
    try:
        await connection.conversation.item.create(item=approval_response_item)
    except Exception as e:
        logger.error("Failed to send approval response: %s", e)
        return
    logger.info("Voice approval resolved: %s for %s", approved, pending["function_name"])
    print(f" Voice approval: {'Approved' if approved else 'Denied'}")
    await write_conversation_log(f"Voice approval: {'Approved' if approved else 'Denied'} for {pending['server_label']}")
    # Process next queued approval, if any
    await self._process_next_approval(connection)

async def _process_next_approval(self, connection):
    """Pop the next queued approval and ask via voice."""
    if not self._approval_queue:
        return
    next_approval = self._approval_queue.pop(0)
    self._pending_approval = next_approval
    # Send immediately if no response is active, otherwise defer
    if not self._active_response:
        await self._send_approval_voice_prompt(next_approval, connection)
    else:
        self._approval_prompt_needed = True
In this sample:
- The mcp_approval_request event contains server_label, name (tool name), and arguments.
- A system message instructs the model to verbally ask for permission.
- MCPApprovalResponseRequestItem sends the decision back to Voice Live with approve=True or approve=False.
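The order of checks in the approval handler matters: the call limit is tested first, then the per-turn auto-approval, then the queue. A minimal sketch of that decision order as a pure function follows. ApprovalState and decide_approval_action are hypothetical names, and the returned strings are labels chosen for this example rather than SDK values.

```python
from dataclasses import dataclass, field

MAX_APPROVAL_CALLS_PER_TASK = 3  # same limit as the sample


@dataclass
class ApprovalState:
    """Illustrative stand-in for the assistant's approval-related attributes."""
    call_count: dict[str, int] = field(default_factory=dict)
    approved_this_turn: set[str] = field(default_factory=set)
    pending: bool = False


def decide_approval_action(state: ApprovalState, server_label: str) -> str:
    """Return the action for a new approval request, checked in the sample's order."""
    # 1. Too many prompts for this server in one task: deny without asking.
    if state.call_count.get(server_label, 0) >= MAX_APPROVAL_CALLS_PER_TASK:
        return "auto_deny"
    # 2. The user already said yes to this server during the current turn.
    if server_label in state.approved_this_turn:
        return "auto_approve"
    # 3. Another prompt is being asked: wait in the queue.
    if state.pending:
        return "queue"
    # 4. Otherwise ask the user by voice.
    return "prompt"


if __name__ == "__main__":
    state = ApprovalState(approved_this_turn={"azure_doc"})
    print(decide_approval_action(state, "azure_doc"))
```

Because the limit is checked before the auto-approval, a server that was approved earlier in the turn is still cut off once it reaches the maximum number of prompts.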
Resolve voice-based approval
Parse the user's spoken transcript to determine approval. Use word-boundary regex to avoid false positives from words like "yesterday" or "nobody".
elif event.type == ServerEventType.CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_COMPLETED:
    transcript = event.transcript if hasattr(event, 'transcript') else event.get("transcript", "")
    logger.info("User said: %s", transcript)
    print(f"You said:\t{transcript}")
    await write_conversation_log(f"User Input:\t{transcript}")
    # Interpret as an approval answer if we have a pending approval,
    # whether or not the prompt has finished speaking. This allows the
    # user to barge in with "yes" without waiting for the full prompt.
    if self._pending_approval is not None:
        await self._resolve_voice_approval(transcript, conn)
In this sample:
- The transcript from CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_COMPLETED is matched against \byes\b and \b(no|stop|cancel)\b patterns.
- Subsequent calls to the same server within the same turn are auto-approved to avoid repeated prompts.
- After a configurable maximum (for example, 3 approvals), further calls are auto-denied and the model responds with what it has.
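The transcript matching can be tried in isolation. The sketch below reuses the two patterns from the sample. classify_approval and its three return labels are hypothetical names used only for this illustration.

```python
import re


def classify_approval(transcript: str) -> str:
    """Classify a spoken reply as "approve", "deny", or "ambiguous"."""
    text = transcript.strip().lower()
    # Word boundaries keep "yesterday" and "nobody" from matching.
    approved = bool(re.search(r"\byes\b", text))
    denied = bool(re.search(r"\b(no|stop|cancel)\b", text))
    if approved and denied:
        # Conflicting signals are treated as a denial for safety.
        return "deny"
    if approved:
        return "approve"
    if denied:
        return "deny"
    return "ambiguous"


if __name__ == "__main__":
    for reply in ("Yes, go ahead.", "Nobody called yesterday.", "Yes... no, stop."):
        print(repr(reply), classify_approval(reply))
```

Replies such as "sure" or "okay" fall into the ambiguous case with these patterns, so the sample asks again rather than guessing.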
Detect stalls during MCP tool calls
MCP tool calls can take several seconds. Use a repeating timer to proactively inform the user that the assistant is still waiting for results.
MCP_STALL_MAX_NOTIFICATIONS = 3

def _start_mcp_stall_timer(self, connection):
    """Start a repeating timer that verbally updates the user if an MCP call takes too long."""
    self._cancel_mcp_stall_timer()

    async def _stall_loop():
        stall_count = 0
        while self._mcp_call_in_progress > 0 and stall_count < self.MCP_STALL_MAX_NOTIFICATIONS:
            await asyncio.sleep(10)
            if self._mcp_call_in_progress <= 0:
                break
            stall_count += 1
            # Note: MCP calls cannot be cancelled via the API - only honest
            # status updates are possible until the server responds or times out.
            msg = ("The tool call is still running. "
                   "Briefly reassure the user that you're still waiting for results. "
                   "One short sentence only.")
            logger.info("MCP stall notification #%d", stall_count)
            try:
                await connection.conversation.item.create(
                    item=MessageItem(
                        role="system",
                        content=[InputTextContentPart(text=msg)],
                    )
                )
                await connection.response.create()
            except Exception as e:
                if "active response" in str(e).lower():
                    self._needs_response_create = True
                else:
                    logger.debug("Stall notification failed: %s", e)

    self._mcp_stall_task = asyncio.create_task(_stall_loop())

def _cancel_mcp_stall_timer(self):
    """Cancel the MCP stall timer if running."""
    if self._mcp_stall_task and not self._mcp_stall_task.done():
        self._mcp_stall_task.cancel()
    self._mcp_stall_task = None
In this sample:
- A 10-second interval timer injects system messages like "Tell the user you're still waiting" up to 3 times.
- The timer is cancelled when the MCP call completes or the user interrupts with barge-in.
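The timer pattern itself does not depend on Voice Live and can be exercised with a short interval. The following is a minimal sketch under stated assumptions: stall_notifier, is_running, and notify are hypothetical names, and the notify callback stands in for the system message and response.create() calls in the sample.

```python
import asyncio
from typing import Awaitable, Callable


async def stall_notifier(
    is_running: Callable[[], bool],
    notify: Callable[[int], Awaitable[None]],
    interval_s: float = 10.0,
    max_notifications: int = 3,
) -> int:
    """Call notify() every interval_s seconds while work is running; return the count sent."""
    sent = 0
    while is_running() and sent < max_notifications:
        await asyncio.sleep(interval_s)
        # Re-check after sleeping: the call may have finished in the meantime.
        if not is_running():
            break
        sent += 1
        await notify(sent)
    return sent


async def demo() -> tuple[int, list[str]]:
    messages: list[str] = []

    async def notify(n: int) -> None:
        messages.append(f"still waiting #{n}")

    # The simulated call never finishes, so the notifier stops at the cap.
    sent = await stall_notifier(lambda: True, notify, interval_s=0.01, max_notifications=3)
    return sent, messages


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the sample the loop runs as a task created with asyncio.create_task so it can be cancelled when the call completes; the sketch awaits it directly to keep the example short.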
Run the sample
Create the
mcp-quickstart.pyfile with the following code:# ------------------------------------------------------------------------- # Copyright (c) Microsoft Corporation. All rights reserved. # Licensed under the MIT License. # ------------------------------------------------------------------------- """ FILE: mcp-quickstart.py DESCRIPTION: This sample demonstrates how to use the Azure AI Voice Live SDK with MCP (Model Context Protocol) server integration. It shows how to define MCP servers, handle MCP tool call events, and implement an approval flow for tool calls that require user consent. USAGE: python mcp-quickstart.py --use-token-credential Set the environment variables with your own values before running the sample: 1) AZURE_VOICELIVE_ENDPOINT - The Azure VoiceLive endpoint 2) AZURE_VOICELIVE_API_KEY - The Azure VoiceLive API key (if not using token credential) REQUIREMENTS: - azure-ai-voicelive - python-dotenv - pyaudio (for audio capture and playback) - azure-identity (for token credential authentication) """ from __future__ import annotations import os import sys import argparse import asyncio import base64 from datetime import datetime import logging import queue import re import signal from typing import Union, Optional, TYPE_CHECKING, cast from azure.core.credentials import AzureKeyCredential from azure.core.credentials_async import AsyncTokenCredential from azure.identity.aio import AzureCliCredential from azure.ai.voicelive.aio import connect from azure.ai.voicelive.models import ( AudioEchoCancellation, AudioInputTranscriptionOptions, AudioNoiseReduction, AzureStandardVoice, InputAudioFormat, InputTextContentPart, InterimResponseTrigger, ItemType, LlmInterimResponseConfig, MCPApprovalResponseRequestItem, MCPServer, MessageItem, Modality, OutputAudioFormat, RequestSession, ResponseMCPApprovalRequestItem, ResponseMCPCallItem, ServerEventConversationItemCreated, ServerEventResponseMcpCallCompleted, ServerEventType, ServerVad, Tool, ToolChoiceLiteral, ) from 
dotenv import load_dotenv import pyaudio if TYPE_CHECKING: from azure.ai.voicelive.aio import VoiceLiveConnection # Change to the directory where this script is located os.chdir(os.path.dirname(os.path.abspath(__file__))) # Environment variable loading load_dotenv('../.env', override=True) # Set up logging if not os.path.exists('logs'): os.makedirs('logs') timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S") # Conversation log filename (separate from debug log) _script_dir = os.path.dirname(os.path.abspath(__file__)) conversation_logfilename = f"conversation_{timestamp}.log" logging.basicConfig( filename=f'logs/{timestamp}_voicelive.log', filemode="w", format='%(asctime)s:%(name)s:%(levelname)s:%(message)s', level=logging.INFO ) logger = logging.getLogger(__name__) class AudioProcessor: """ Handles real-time audio capture and playback for the voice assistant. Threading Architecture: - Main thread: Event loop and UI - Capture thread: PyAudio input stream reading - Send thread: Async audio data transmission to VoiceLive - Playback thread: PyAudio output stream writing """ loop: asyncio.AbstractEventLoop class AudioPlaybackPacket: """Represents a packet that can be sent to the audio playback queue.""" def __init__(self, seq_num: int, data: Optional[bytes]): self.seq_num = seq_num self.data = data def __init__(self, connection): self.connection = connection self.audio = pyaudio.PyAudio() # Audio configuration - PCM16, 24kHz, mono self.format = pyaudio.paInt16 self.channels = 1 self.rate = 24000 self.chunk_size = 1200 # 50ms # Capture and playback state self.input_stream = None self.playback_queue: queue.Queue[AudioProcessor.AudioPlaybackPacket] = queue.Queue() self.playback_base = 0 self.next_seq_num = 0 self.output_stream: Optional[pyaudio.Stream] = None logger.info("AudioProcessor initialized with 24kHz PCM16 mono audio") def start_capture(self): """Start capturing audio from microphone.""" def _capture_callback(in_data, _frame_count, _time_info, _status_flags): 
audio_base64 = base64.b64encode(in_data).decode("utf-8") asyncio.run_coroutine_threadsafe( self.connection.input_audio_buffer.append(audio=audio_base64), self.loop ) return (None, pyaudio.paContinue) if self.input_stream: return self.loop = asyncio.get_event_loop() try: self.input_stream = self.audio.open( format=self.format, channels=self.channels, rate=self.rate, input=True, frames_per_buffer=self.chunk_size, stream_callback=_capture_callback, ) logger.info("Started audio capture") except Exception: logger.exception("Failed to start audio capture") raise def start_playback(self): """Initialize audio playback system.""" if self.output_stream: return remaining = bytes() def _playback_callback(_in_data, frame_count, _time_info, _status_flags): nonlocal remaining frame_count *= pyaudio.get_sample_size(pyaudio.paInt16) out = remaining[:frame_count] remaining_local = remaining[frame_count:] while len(out) < frame_count: try: packet = self.playback_queue.get_nowait() except queue.Empty: out = out + bytes(frame_count - len(out)) continue if not packet or not packet.data: break if packet.seq_num < self.playback_base: continue num_to_take = frame_count - len(out) out = out + packet.data[:num_to_take] remaining_local = packet.data[num_to_take:] remaining = remaining_local if len(out) >= frame_count: return (out, pyaudio.paContinue) else: return (out, pyaudio.paComplete) try: self.output_stream = self.audio.open( format=self.format, channels=self.channels, rate=self.rate, output=True, frames_per_buffer=self.chunk_size, stream_callback=_playback_callback ) logger.info("Audio playback system ready") except Exception: logger.exception("Failed to initialize audio playback") raise def _get_and_increase_seq_num(self): seq = self.next_seq_num self.next_seq_num += 1 return seq def queue_audio(self, audio_data: Optional[bytes]) -> None: """Queue audio data for playback.""" self.playback_queue.put( AudioProcessor.AudioPlaybackPacket( seq_num=self._get_and_increase_seq_num(), 
                data=audio_data))

    def skip_pending_audio(self):
        """Skip current audio in playback queue."""
        self.playback_base = self._get_and_increase_seq_num()

    def shutdown(self):
        """Clean up audio resources."""
        if self.input_stream:
            self.input_stream.stop_stream()
            self.input_stream.close()
            self.input_stream = None

        logger.info("Stopped audio capture")

        if self.output_stream:
            self.skip_pending_audio()
            self.queue_audio(None)
            self.output_stream.stop_stream()
            self.output_stream.close()
            self.output_stream = None

        logger.info("Stopped audio playback")

        if self.audio:
            self.audio.terminate()

        logger.info("Audio processor cleaned up")


class MCPVoiceAssistant:
    """Voice assistant with MCP server integration."""

    def __init__(
        self,
        endpoint: str,
        credential: Union[AzureKeyCredential, AsyncTokenCredential],
        model: str,
        voice: str,
        instructions: str,
    ):
        self.endpoint = endpoint
        self.credential = credential
        self.model = model
        self.voice = voice
        self.instructions = instructions
        self.connection: Optional["VoiceLiveConnection"] = None
        self.audio_processor: Optional[AudioProcessor] = None
        self.session_ready = False
        self._active_response = False
        self._response_api_done = False
        self._pending_approval: Optional[dict] = None  # Currently active approval request
        self._approval_queue: list[dict] = []  # Queued approvals waiting to be asked
        self._approval_prompt_needed = False  # True when we need to inject the prompt at next RESPONSE_DONE
        self._mcp_call_in_progress = 0  # Count of active MCP tool calls
        self._handled_mcp_completions: set = set()  # Deduplicate MCP completion events
        self._needs_response_create = False  # Retry response.create at next RESPONSE_DONE
        self._approval_call_count: dict[str, int] = {}  # Per-server call count this turn
        self._mcp_item_to_server: dict = {}  # Map MCP item IDs to server_label/function_name
        self._approval_servers: set = set()  # Server labels that require approval
        self._mcp_stall_task: Optional[asyncio.Task] = None  # Timer for MCP stall detection
        self._active_mcp_items: set = set()  # Item IDs of currently in-progress MCP calls
        self._stale_mcp_items: set = set()  # MCP calls the user has moved on from
        self._approved_servers_this_turn: set = set()  # Servers user already approved this turn
        self._mcp_results_pending = False  # True when MCP calls completed but response.create deferred

    async def start(self):
        """Start the voice assistant session with MCP support."""
        try:
            logger.info("Connecting to VoiceLive API with model %s", self.model)

            # <define_mcp_servers>
            # Define MCP servers that Voice Live can use during the session.
            # Each server is an MCPServer instance added to the tools list.
            mcp_tools: list[Tool] = [
                MCPServer(
                    server_label="deepwiki",
                    server_url="https://mcp.deepwiki.com/mcp",
                    allowed_tools=["read_wiki_structure", "ask_question"],
                    require_approval="never",
                ),
                MCPServer(
                    server_label="azure_doc",
                    server_url="https://learn.microsoft.com/api/mcp",
                    require_approval="always",
                ),
            ]
            # </define_mcp_servers>

            # Track which servers require approval for per-turn loop prevention.
            # Servers with require_approval="always" are guarded to avoid
            # repeated approval prompts in voice UX, a design decision to keep
            # the voice conversation flow smooth. Servers with "never" are allowed
            # to make multiple calls (e.g. DeepWiki's read_wiki_structure ->
            # ask_question pattern) since they don't interrupt the user.
            self._approval_servers = {
                s.server_label for s in mcp_tools
                if isinstance(s, MCPServer) and s.require_approval == "always"
            }

            # Connect with api_version="2026-01-01-preview" for MCP support
            async with connect(
                endpoint=self.endpoint,
                credential=self.credential,
                model=self.model,
                api_version="2026-01-01-preview",
            ) as connection:
                self.connection = connection

                # Initialize audio processor
                ap = AudioProcessor(connection)
                self.audio_processor = ap

                # Configure session with MCP tools
                await self._setup_session(mcp_tools)

                # Start audio systems
                ap.start_playback()

                logger.info("Voice assistant with MCP ready! Start speaking...")
                print("\n" + "=" * 70)
                print("VOICE ASSISTANT WITH MCP READY")
                print("Try saying:")
                print(" • 'What is the GitHub repo fastapi about?'")
                print(" • 'Search the Azure documentation for Voice Live API.'")
                print("You may need to approve some MCP tool calls in the console.")
                print("Press Ctrl+C to exit")
                print("=" * 70 + "\n")

                # Process events
                await self._process_events()

        finally:
            if self.audio_processor:
                self.audio_processor.shutdown()

    # <configure_session>
    async def _setup_session(self, mcp_tools: list[Tool]):
        """Configure the VoiceLive session with MCP tools."""
        logger.info("Setting up voice conversation session with MCP tools...")

        # Create voice configuration
        voice_config: Union[AzureStandardVoice, str]
        if "-" in self.voice or ":" in self.voice:
            voice_config = AzureStandardVoice(name=self.voice)
        else:
            voice_config = self.voice

        # Create turn detection configuration
        turn_detection_config = ServerVad(
            threshold=0.5, prefix_padding_ms=300, silence_duration_ms=500)

        # Create session configuration with MCP tools in the tools list
        session_config = RequestSession(
            modalities=[Modality.TEXT, Modality.AUDIO],
            instructions=self.instructions,
            voice=voice_config,
            input_audio_format=InputAudioFormat.PCM16,
            output_audio_format=OutputAudioFormat.PCM16,
            turn_detection=turn_detection_config,
            input_audio_echo_cancellation=AudioEchoCancellation(),
            input_audio_noise_reduction=AudioNoiseReduction(type="azure_deep_noise_suppression"),
            tools=mcp_tools,
            tool_choice=ToolChoiceLiteral.AUTO,
            input_audio_transcription=AudioInputTranscriptionOptions(
                model="azure-speech" if "realtime" not in self.model.lower() else "whisper-1"
            ),
        )

        # Interim response bridges latency during MCP tool calls, but is only
        # supported on non-realtime model pipelines (e.g. gpt-4o-mini).
        if "realtime" not in self.model.lower():
            session_config.interim_response = LlmInterimResponseConfig(
                triggers=[InterimResponseTrigger.TOOL, InterimResponseTrigger.LATENCY],
                latency_threshold_ms=100,
                instructions="Create friendly interim responses indicating wait time due to "
                "ongoing processing, if any. Do not include in all responses! "
                "Do not say you don't have real-time access to information when "
                "calling tools!",
            )
            logger.info("Interim response enabled for model %s", self.model)
        else:
            logger.info("Interim response skipped - not supported on realtime pipeline (%s)", self.model)

        conn = self.connection
        assert conn is not None
        await conn.session.update(session=session_config)

        logger.info("Session configuration with MCP tools sent")
    # </configure_session>

    async def _process_events(self):
        """Process events from the VoiceLive connection."""
        conn = self.connection
        assert conn is not None
        async for event in conn:
            try:
                await self._handle_event(event)
            except Exception:
                logger.exception("Error handling event %s (non-fatal)", getattr(event, 'type', '?'))

    # <handle_mcp_events>
    async def _handle_event(self, event):
        """Handle different types of events from VoiceLive, including MCP events."""
        ap = self.audio_processor
        conn = self.connection
        assert ap is not None
        assert conn is not None

        if event.type == ServerEventType.SESSION_UPDATED:
            logger.info("Session ready: %s", event.session.id)
            await write_conversation_log(f"SessionID: {event.session.id}")
            await write_conversation_log(f"Model: {event.session.model}")
            await write_conversation_log(f"Voice: {event.session.voice}")
            await write_conversation_log("")
            self.session_ready = True
            ap.start_capture()

        elif event.type == ServerEventType.INPUT_AUDIO_BUFFER_SPEECH_STARTED:
            logger.info("User started speaking - stopping playback")
            print("Listening...")
            ap.skip_pending_audio()

            # Approval call counter is NOT reset on speech - it tracks the
            # lifecycle of a task (reset on denial or after results are spoken)
            # But approved-servers-this-turn resets when user starts a new topic
            if self._pending_approval is None and self._mcp_call_in_progress <= 0:
                self._approved_servers_this_turn.clear()

            # Clear deferred response flags if no MCP calls are in progress.
            # Prevents stale _needs_response_create from re-triggering result
            # playback after the user interrupts.
            if self._mcp_call_in_progress <= 0:
                self._needs_response_create = False
                self._mcp_results_pending = False

            if self._active_response and not self._response_api_done:
                try:
                    await conn.response.cancel()
                except Exception as e:
                    if "no active response" not in str(e).lower():
                        logger.warning("Cancel failed: %s", e)

            # If an MCP call is running, mark current calls as stale (user is moving on)
            # and let the user know it's still in progress
            if self._mcp_call_in_progress > 0 and self._pending_approval is None:
                self._stale_mcp_items.update(self._active_mcp_items)
                logger.info("User spoke during MCP call - marking %d calls as stale", len(self._active_mcp_items))
                try:
                    await conn.conversation.item.create(
                        item=MessageItem(
                            role="system",
                            content=[InputTextContentPart(
                                text="A tool call is still running in the background. The user just spoke. "
                                "Respond to what the user said. If a tool result arrives later, "
                                "briefly introduce it as a late result from an earlier request."
                            )],
                        )
                    )
                except Exception as e:
                    logger.warning("Failed to inject MCP status update: %s", e)

        elif event.type == ServerEventType.INPUT_AUDIO_BUFFER_SPEECH_STOPPED:
            logger.info("User stopped speaking")
            print("Processing...")

        elif event.type == ServerEventType.RESPONSE_CREATED:
            logger.info("Assistant response created")
            self._active_response = True
            self._response_api_done = False

        elif event.type == ServerEventType.RESPONSE_AUDIO_DELTA:
            ap.queue_audio(event.delta)

        elif event.type == ServerEventType.RESPONSE_AUDIO_DONE:
            logger.info("Assistant finished speaking")
            print("Ready for next input...")

        elif event.type == ServerEventType.RESPONSE_TEXT_DONE:
            text = event.text if hasattr(event, 'text') else event.get("text", "")
            print(f"Assistant text:\t{text}")
            await write_conversation_log(f"Assistant Text Response:\t{text}")

        elif event.type == ServerEventType.RESPONSE_AUDIO_TRANSCRIPT_DONE:
            transcript = event.transcript if hasattr(event, 'transcript') else event.get("transcript", "")
            print(f"Assistant audio transcript:\t{transcript}")
            await write_conversation_log(f"Assistant Audio Response:\t{transcript}")

        elif event.type == ServerEventType.RESPONSE_DONE:
            logger.info("Response complete")
            await write_conversation_log("--- Response complete ---")
            self._active_response = False
            self._response_api_done = True

            # If an approval prompt needs to be injected, do it now that no response is active
            if self._approval_prompt_needed and self._pending_approval is not None:
                self._approval_prompt_needed = False
                await self._send_approval_voice_prompt(self._pending_approval, conn)
            # If MCP results are pending and all calls are now done, create response
            elif self._mcp_results_pending and self._mcp_call_in_progress <= 0 and self._pending_approval is None:
                self._mcp_results_pending = False
                try:
                    await conn.response.create()
                except Exception:
                    pass
            # If a response.create was deferred due to collision, retry now
            elif self._needs_response_create:
                self._needs_response_create = False
                try:
                    await conn.response.create()
                except Exception:
                    pass  # Best-effort retry

        # <voice_approval_transcription>
        elif event.type == ServerEventType.CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_COMPLETED:
            transcript = event.transcript if hasattr(event, 'transcript') else event.get("transcript", "")
            logger.info("User said: %s", transcript)
            print(f"You said:\t{transcript}")
            await write_conversation_log(f"User Input:\t{transcript}")

            # Interpret as an approval answer if we have a pending approval,
            # whether or not the prompt has finished speaking. This allows the
            # user to barge in with "yes" without waiting for the full prompt.
            if self._pending_approval is not None:
                await self._resolve_voice_approval(transcript, conn)
        # </voice_approval_transcription>

        elif event.type == ServerEventType.ERROR:
            msg = event.error.message
            # Reset response state - errors can terminate a response without RESPONSE_DONE
            self._active_response = False
            self._response_api_done = True
            if "Cancellation failed: no active response" not in msg:
                if "interim response" in msg.lower():
                    logger.warning("Interim response not supported with this model pipeline (non-fatal)")
                elif "active response" in msg.lower():
                    logger.debug("Response collision (expected during MCP flow): %s", msg)
                else:
                    logger.error("VoiceLive error: %s", msg)
                    print(f"Error: {msg}")
                    await write_conversation_log(f"ERROR: {msg}")

        # MCP-specific events
        elif event.type == ServerEventType.MCP_LIST_TOOLS_IN_PROGRESS:
            logger.info("MCP list tools in progress for %s", event.item_id)

        elif event.type == ServerEventType.MCP_LIST_TOOLS_COMPLETED:
            logger.info("MCP list tools completed for %s", event.item_id)
            print("MCP tools discovered successfully")
            await write_conversation_log("MCP tools discovered successfully")

        elif event.type == ServerEventType.MCP_LIST_TOOLS_FAILED:
            logger.error("MCP list tools failed for %s", event.item_id)
            print("MCP tool discovery failed")
            await write_conversation_log("ERROR: MCP tool discovery failed")

        elif event.type == ServerEventType.RESPONSE_MCP_CALL_IN_PROGRESS:
            logger.info("MCP call in progress for %s", event.item_id)
            print("MCP tool call in progress...")
            await write_conversation_log(f"MCP call in progress: {event.item_id}")
            self._mcp_call_in_progress += 1
            self._active_mcp_items.add(event.item_id)
            self._start_mcp_stall_timer(conn)

        elif event.type == ServerEventType.RESPONSE_MCP_CALL_COMPLETED:
            item_id = event.item_id
            self._mcp_call_in_progress = max(0, self._mcp_call_in_progress - 1)
            self._active_mcp_items.discard(item_id)
            self._cancel_mcp_stall_timer()
            if item_id in self._handled_mcp_completions:
                logger.debug("Ignoring duplicate MCP completion for %s", item_id)
            else:
                self._handled_mcp_completions.add(item_id)
                is_stale = item_id in self._stale_mcp_items
                self._stale_mcp_items.discard(item_id)
                logger.info("MCP call completed for %s (stale=%s)", item_id, is_stale)
                await write_conversation_log(f"MCP call completed: {item_id} (stale={is_stale})")
                await self._handle_mcp_call_completed(event, conn, is_stale=is_stale)

        elif event.type == ServerEventType.RESPONSE_MCP_CALL_FAILED:
            item_id = event.item_id
            logger.error("MCP call failed for %s", item_id)
            print("MCP tool call failed")
            await write_conversation_log(f"ERROR: MCP call failed: {item_id}")
            self._mcp_call_in_progress = max(0, self._mcp_call_in_progress - 1)
            self._active_mcp_items.discard(item_id)
            self._stale_mcp_items.discard(item_id)
            self._cancel_mcp_stall_timer()
            # Kick the model to inform the user the tool call failed
            try:
                await conn.response.create()
            except Exception as e:
                if "active response" not in str(e).lower():
                    logger.warning("Failed to create response after MCP failure: %s", e)

        elif event.type == ServerEventType.CONVERSATION_ITEM_CREATED:
            logger.info("Conversation item created: id=%s, type=%s", event.item.id, event.item.type)
            if event.item.type == ItemType.MCP_LIST_TOOLS:
                logger.info("MCP list tools item: server_label=%s", event.item.server_label)
            elif event.item.type == ItemType.MCP_CALL:
                await self._handle_mcp_call_arguments(event, conn)
            elif event.item.type == ItemType.MCP_APPROVAL_REQUEST:
                await self._handle_mcp_approval_request(event, conn)

        else:
            logger.debug("Unhandled event type: %s", event.type)
    # </handle_mcp_events>

    # <handle_approval>
    async def _handle_mcp_approval_request(self, conversation_created_event, connection):
        """Handle MCP approval request by asking the user via voice."""
        if not isinstance(conversation_created_event, ServerEventConversationItemCreated):
            logger.error("Expected ServerEventConversationItemCreated")
            return

        if not isinstance(conversation_created_event.item, ResponseMCPApprovalRequestItem):
            logger.error("Expected ResponseMCPApprovalRequestItem")
            return

        mcp_approval_item = conversation_created_event.item
        approval_id = mcp_approval_item.id
        server_label = mcp_approval_item.server_label
        function_name = mcp_approval_item.name

        if not approval_id:
            logger.error("MCP approval item missing ID")
            return

        # Auto-deny after too many calls to the same server in one task.
        # This prevents infinite tool-call loops in voice UX.
        MAX_APPROVAL_CALLS_PER_TASK = 3
        current_count = self._approval_call_count.get(server_label, 0)
        if current_count >= MAX_APPROVAL_CALLS_PER_TASK:
            logger.info("Auto-denying %s - reached %d calls this task", function_name, current_count)
            print(f" Auto-denied: {server_label}/{function_name} (max {MAX_APPROVAL_CALLS_PER_TASK} calls reached)")
            try:
                await connection.conversation.item.create(
                    item=MCPApprovalResponseRequestItem(approval_request_id=approval_id, approve=False)
                )
            except Exception as e:
                logger.warning("Failed to send auto-deny: %s", e)
            return

        # Auto-approve if user already approved this server earlier in the same turn.
        # This avoids repeated approval prompts for consecutive calls to the same service.
        if server_label in self._approved_servers_this_turn:
            logger.info("Auto-approving %s - server already approved this turn", function_name)
            print(f" Auto-approved: {server_label}/{function_name} (already approved this turn)")
            try:
                await connection.conversation.item.create(
                    item=MCPApprovalResponseRequestItem(approval_request_id=approval_id, approve=True)
                )
            except Exception as e:
                logger.warning("Failed to send auto-approve: %s", e)
            return

        # If another approval is already pending, queue this one
        if self._pending_approval is not None:
            logger.info("Queuing approval for %s - another is already pending", function_name)
            self._approval_queue.append({
                "approval_id": approval_id,
                "server_label": server_label,
                "function_name": function_name,
            })
            return

        logger.info("MCP approval request: server=%s tool=%s", server_label, function_name)
        print(f"\nMCP Approval Request (voice-based):")
        print(f" Server: {server_label} Tool: {function_name}")

        # Store the pending approval. If no response is currently active,
        # send the voice prompt immediately. Otherwise, defer it to
        # RESPONSE_DONE to avoid colliding with an active response.
        self._pending_approval = {
            "approval_id": approval_id,
            "server_label": server_label,
            "function_name": function_name,
        }
        if not self._active_response:
            await self._send_approval_voice_prompt(self._pending_approval, connection)
        else:
            self._approval_prompt_needed = True

    async def _send_approval_voice_prompt(self, pending: dict, connection):
        """Inject a system message asking the model to verbally request permission."""
        server = pending["server_label"]
        call_count = self._approval_call_count.get(server, 0)
        self._approval_call_count[server] = call_count + 1

        if call_count == 0:
            prompt = (
                "You MUST ask the user for explicit permission before proceeding. "
                f'Say exactly: "I\'d like to search the {server} service for information. '
                f'Do you approve? Please say yes or no."'
            )
        else:
            prompt = (
                "You MUST ask the user for permission again. "
                'Say exactly: "I need to do one more search to get complete information. '
                'Should I continue? Please say yes or no."'
            )
        try:
            await connection.conversation.item.create(
                item=MessageItem(
                    role="system",
                    content=[InputTextContentPart(text=prompt)],
                )
            )
            await connection.response.create()
        except Exception as e:
            logger.warning("Failed to send approval voice prompt: %s", e)

    async def _resolve_voice_approval(self, transcript: str, connection):
        """Interpret the user's spoken response as approval or denial."""
        pending = self._pending_approval
        if pending is None:
            return

        text = transcript.strip().lower()

        # Match "yes" or "no" as whole words (word boundaries prevent false
        # positives from words like "yesterday" or "nobody").
        # Also accept "stop" and "cancel" as denial.
        approved = bool(re.search(r'\byes\b', text))
        denied = bool(re.search(r'\b(no|stop|cancel)\b', text))

        if not approved and not denied:
            # Ambiguous - ask again via the deferred prompt mechanism
            logger.info("Ambiguous approval response: %s", transcript)
            self._approval_prompt_needed = True
            return

        if approved and denied:
            # Conflicting signals - treat as denial for safety
            approved = False

        # Clear the pending state before sending the response
        self._pending_approval = None

        if approved:
            self._approved_servers_this_turn.add(pending["server_label"])
        else:
            self._approval_call_count.clear()  # Topic is over
            self._approved_servers_this_turn.discard(pending["server_label"])

        approval_response_item = MCPApprovalResponseRequestItem(
            approval_request_id=pending["approval_id"], approve=approved
        )
        try:
            await connection.conversation.item.create(item=approval_response_item)
        except Exception as e:
            logger.error("Failed to send approval response: %s", e)
            return

        logger.info("Voice approval resolved: %s for %s", approved, pending["function_name"])
        print(f" Voice approval: {'Approved' if approved else 'Denied'}")
        await write_conversation_log(f"Voice approval: {'Approved' if approved else 'Denied'} for {pending['server_label']}")

        # Process next queued approval, if any
        await self._process_next_approval(connection)

    async def _process_next_approval(self, connection):
        """Pop the next queued approval and ask via voice."""
        if not self._approval_queue:
            return

        next_approval = self._approval_queue.pop(0)
        self._pending_approval = next_approval

        # Send immediately if no response is active, otherwise defer
        if not self._active_response:
            await self._send_approval_voice_prompt(next_approval, connection)
        else:
            self._approval_prompt_needed = True
    # </handle_approval>

    # <mcp_stall_detection>
    MCP_STALL_MAX_NOTIFICATIONS = 3

    def _start_mcp_stall_timer(self, connection):
        """Start a repeating timer that verbally updates the user if an MCP call takes too long."""
        self._cancel_mcp_stall_timer()

        async def _stall_loop():
            stall_count = 0
            while self._mcp_call_in_progress > 0 and stall_count < self.MCP_STALL_MAX_NOTIFICATIONS:
                await asyncio.sleep(10)
                if self._mcp_call_in_progress <= 0:
                    break
                stall_count += 1
                # Note: MCP calls cannot be cancelled via the API - only honest
                # status updates are possible until the server responds or times out.
                msg = ("The tool call is still running. "
                       "Briefly reassure the user that you're still waiting for results. "
                       "One short sentence only.")
                logger.info("MCP stall notification #%d", stall_count)
                try:
                    await connection.conversation.item.create(
                        item=MessageItem(
                            role="system",
                            content=[InputTextContentPart(text=msg)],
                        )
                    )
                    await connection.response.create()
                except Exception as e:
                    if "active response" in str(e).lower():
                        self._needs_response_create = True
                    else:
                        logger.debug("Stall notification failed: %s", e)

        self._mcp_stall_task = asyncio.create_task(_stall_loop())

    def _cancel_mcp_stall_timer(self):
        """Cancel the MCP stall timer if running."""
        if self._mcp_stall_task and not self._mcp_stall_task.done():
            self._mcp_stall_task.cancel()
        self._mcp_stall_task = None
    # </mcp_stall_detection>

    async def _handle_mcp_call_completed(self, mcp_call_completed_event, connection, *, is_stale=False):
        """Handle MCP call completed events."""
        if not isinstance(mcp_call_completed_event, ServerEventResponseMcpCallCompleted):
            logger.error("Expected ServerEventResponseMcpCallCompleted")
            return

        logger.info("MCP call completed for %s (stale=%s)", mcp_call_completed_event.item_id, is_stale)
        print("MCP tool call completed successfully")

        # Clean up item mapping
        self._mcp_item_to_server.pop(mcp_call_completed_event.item_id, None)

        # Reset approval counter if no more approvals are pending (task complete)
        if self._pending_approval is None and not self._approval_queue:
            self._approval_call_count.clear()

        # If the user moved on during this call, tell the model it's a late result
        if is_stale:
            try:
                await connection.conversation.item.create(
                    item=MessageItem(
                        role="system",
                        content=[InputTextContentPart(
                            text="This tool result is from an earlier request. The user has "
                            "since moved on. Briefly introduce it as a late result, e.g. "
                            "'By the way, those results from earlier just came in...' "
                            "then share the key findings concisely."
                        )],
                    )
                )
            except Exception as e:
                logger.warning("Failed to inject late-result context: %s", e)

        # Batch response: only call response.create when ALL MCP calls for this
        # turn have completed. This prevents partial results and repeated tool calls.
        if self._mcp_call_in_progress <= 0 and self._pending_approval is None and not self._approval_queue:
            logger.info("All MCP calls complete - creating response")
            try:
                await connection.response.create()
            except Exception as e:
                if "active response" in str(e).lower():
                    self._needs_response_create = True
                else:
                    logger.warning("Failed to create response after MCP calls: %s", e)
        else:
            self._mcp_results_pending = True
            logger.info("MCP calls still in progress (%d) - deferring response", self._mcp_call_in_progress)

    async def _handle_mcp_call_arguments(self, conversation_created_event, connection):
        """Log MCP call details and announce the tool call to the user via voice."""
        if not isinstance(conversation_created_event, ServerEventConversationItemCreated):
            logger.error("Expected ServerEventConversationItemCreated")
            return

        if not isinstance(conversation_created_event.item, ResponseMCPCallItem):
            logger.error("Expected ResponseMCPCallItem")
            return

        mcp_call_item = conversation_created_event.item
        server_label = mcp_call_item.server_label
        function_name = mcp_call_item.name
        logger.info("MCP Call triggered: server_label=%s, function_name=%s", server_label, function_name)
        print(f"MCP tool call: {server_label}/{function_name}")
        self._mcp_item_to_server[mcp_call_item.id] = f"{server_label}/{function_name}"

        # Announce the tool call to the user so they know something is
        # happening while the MCP call runs. Skip for approval-required
        # servers (the approval prompt handles communication) and skip
        # if an approval is already pending.
        if self._pending_approval is None and server_label not in self._approval_servers:
            try:
                await connection.conversation.item.create(
                    item=MessageItem(
                        role="system",
                        content=[InputTextContentPart(
                            text="Briefly tell the user you're looking something up. One short sentence only."
                        )],
                    )
                )
                await connection.response.create()
            except Exception as e:
                if "active response" not in str(e).lower():
                    logger.warning("Failed to create tool announcement: %s", e)


def parse_arguments():
    """Parse command line arguments."""
    parser = argparse.ArgumentParser(
        description="Voice Assistant with MCP using Azure VoiceLive SDK",
    )

    parser.add_argument(
        "--api-key",
        help="Azure VoiceLive API key (or set AZURE_VOICELIVE_API_KEY env var)",
        type=str,
        default=os.environ.get("AZURE_VOICELIVE_API_KEY"),
    )

    parser.add_argument(
        "--endpoint",
        help="Azure VoiceLive endpoint (default: from AZURE_VOICELIVE_ENDPOINT env var)",
        type=str,
        default=os.environ.get("AZURE_VOICELIVE_ENDPOINT", "https://your-resource-name.services.ai.azure.com/"),
    )

    parser.add_argument(
        "--model",
        help="VoiceLive model to use (default: gpt-realtime)",
        type=str,
        default=os.environ.get("AZURE_VOICELIVE_MODEL", "gpt-realtime"),
    )

    parser.add_argument(
        "--voice",
        help="Voice to use for the assistant (default: en-US-Ava:DragonHDLatestNeural)",
        type=str,
        default=os.environ.get("AZURE_VOICELIVE_VOICE", "en-US-Ava:DragonHDLatestNeural"),
    )

    parser.add_argument(
        "--instructions",
        help="System instructions for the AI assistant",
        type=str,
        default=os.environ.get(
            "AZURE_VOICELIVE_INSTRUCTIONS",
            "You are a helpful AI assistant with access to MCP tools. "
            "Always respond in English. "
            "When a user asks a question, use the appropriate tool once to find information, "
            "then summarize the results conversationally. IMPORTANT: Never call the same tool "
            "more than once per user question. After receiving a tool result, always respond "
            "to the user with what you found - do not search again. "
            "Some tools require user approval before they can be used. When you receive a "
            "system message asking you to request permission, you MUST clearly ask the user "
            "for their explicit approval before proceeding. Always wait for the user to say "
            "yes or no. Never skip the approval question or assume permission is granted. "
            "If a tool result arrives after the conversation has moved to a different topic, "
            "briefly introduce it as a late result before sharing the findings.",
        ),
    )

    parser.add_argument(
        "--use-token-credential",
        help="Use Azure token credential instead of API key",
        action="store_true",
        default=False
    )

    parser.add_argument("--verbose", help="Enable verbose logging", action="store_true")

    return parser.parse_args()


async def write_conversation_log(message: str) -> None:
    """Write a message to the conversation log."""
    log_path = os.path.join(_script_dir, 'logs', conversation_logfilename)

    def _write():
        with open(log_path, 'a', encoding='utf-8') as f:
            f.write(message + "\n")

    # Run the blocking file write off the event loop
    await asyncio.to_thread(_write)


def main():
    """Main function."""
    args = parse_arguments()

    if args.verbose:
        logging.getLogger().setLevel(logging.DEBUG)

    if not args.api_key and not args.use_token_credential:
        print("Error: No authentication provided")
        print("Please provide an API key using --api-key or set AZURE_VOICELIVE_API_KEY environment variable,")
        print("or use --use-token-credential for Azure authentication.")
        sys.exit(1)

    credential: Union[AzureKeyCredential, AsyncTokenCredential]
    if args.use_token_credential:
        credential = AzureCliCredential()
        logger.info("Using Azure token credential")
    else:
        credential = AzureKeyCredential(args.api_key)
        logger.info("Using API key credential")

    assistant = MCPVoiceAssistant(
        endpoint=args.endpoint,
        credential=credential,
        model=args.model,
        voice=args.voice,
        instructions=args.instructions,
    )

    def signal_handler(_sig, _frame):
        logger.info("Received shutdown signal")
        raise KeyboardInterrupt()

    signal.signal(signal.SIGINT, signal_handler)
    signal.signal(signal.SIGTERM, signal_handler)

    try:
        asyncio.run(assistant.start())
    except KeyboardInterrupt:
        print("\nVoice assistant with MCP shut down. Goodbye!")
    except Exception as e:
        print("Fatal Error: ", e)


if __name__ == "__main__":
    # Check audio system
    try:
        p = pyaudio.PyAudio()
        input_devices = [
            i for i in range(p.get_device_count())
            if cast(Union[int, float], p.get_device_info_by_index(i).get("maxInputChannels", 0) or 0) > 0
        ]
        output_devices = [
            i for i in range(p.get_device_count())
            if cast(Union[int, float], p.get_device_info_by_index(i).get("maxOutputChannels", 0) or 0) > 0
        ]
        p.terminate()

        if not input_devices:
            print("No audio input devices found. Please check your microphone.")
            sys.exit(1)
        if not output_devices:
            print("No audio output devices found. Please check your speakers.")
            sys.exit(1)

    except Exception as e:
        print(f"Audio system check failed: {e}")
        sys.exit(1)

    print("Voice Assistant with MCP - Azure VoiceLive SDK")
    print("=" * 65)

    main()

Sign in to Azure with the following command:
az login
Run the Python script:
python mcp-quickstart.py
Speak into your microphone. Try asking questions like "What tools do you have?" or "Search the Azure documentation for Voice Live API."
- For the deepwiki server (require_approval="never"), tool calls execute automatically.
- For the azure_doc server (require_approval="always"), you're asked to approve each tool call. The sample asks by voice and waits for a spoken yes or no; the request is also printed to the console.
Press Ctrl+C to stop the session.
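The approval step in this sample depends on reading a spoken yes or no from the transcript. The matching logic can be sketched on its own as follows. This is a minimal standalone illustration that uses only the standard `re` module; the function name `classify_approval` is hypothetical and isn't part of the Voice Live SDK.

```python
import re
from typing import Optional


def classify_approval(transcript: str) -> Optional[bool]:
    """Return True (approve), False (deny), or None (ambiguous).

    Hypothetical helper that mirrors the whole-word matching used in the
    sample's voice approval handler.
    """
    text = transcript.strip().lower()
    # \b word boundaries keep "yesterday" and "nobody" from matching
    approved = bool(re.search(r"\byes\b", text))
    denied = bool(re.search(r"\b(no|stop|cancel)\b", text))
    if not approved and not denied:
        return None  # Neither word heard: the caller should ask again
    if approved and denied:
        return False  # Conflicting signals are treated as a denial
    return approved


if __name__ == "__main__":
    for phrase in ("Yes, go ahead", "No thanks", "Nobody asked yesterday", "yes, no, cancel"):
        print(f"{phrase!r} -> {classify_approval(phrase)}")
```

Treating a mixed answer as a denial means the tool runs only when the user's intent is unambiguous.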
MCP server configuration reference
| Parameter | Required | Description |
|---|---|---|
| server_label | Yes | Display name for the MCP server. |
| server_url | Yes | URL of the remote MCP endpoint. |
| allowed_tools | No | List of tool names the model can call. If omitted, all tools are allowed. |
| require_approval | No | "never", "always" (default), or a per-tool dictionary. |
| headers | No | Extra HTTP headers to include in MCP requests. |
| authorization | No | Authorization token for MCP requests. |
For the complete REST API type definition, see MCPTool in the Voice Live API reference.
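To make the defaults in the table concrete, the following sketch models the same parameters as a plain Python dataclass. `McpServerSettings` and its `tool_allowed` method are hypothetical names used only for illustration; they aren't SDK types, and the shape of the per-tool dictionary isn't validated here because this article doesn't define it.

```python
from dataclasses import dataclass, field
from typing import Optional, Union


@dataclass
class McpServerSettings:
    """Hypothetical stand-in that mirrors the parameter table; not an SDK type."""

    server_label: str                               # Required
    server_url: str                                 # Required
    allowed_tools: Optional[list[str]] = None       # None means all tools are allowed
    require_approval: Union[str, dict] = "always"   # "always" is the default
    headers: dict[str, str] = field(default_factory=dict)
    authorization: Optional[str] = None

    def __post_init__(self) -> None:
        if not self.server_label or not self.server_url:
            raise ValueError("server_label and server_url are required")
        # Only the two string modes are checked; a dictionary is passed through as-is
        if isinstance(self.require_approval, str) and self.require_approval not in ("always", "never"):
            raise ValueError('require_approval must be "always", "never", or a dictionary')

    def tool_allowed(self, tool_name: str) -> bool:
        """Apply the allowed_tools rule: an omitted list allows every tool."""
        return self.allowed_tools is None or tool_name in self.allowed_tools


if __name__ == "__main__":
    deepwiki = McpServerSettings(
        server_label="deepwiki",
        server_url="https://mcp.deepwiki.com/mcp",
        allowed_tools=["read_wiki_structure", "ask_question"],
        require_approval="never",
    )
    azure_doc = McpServerSettings(
        server_label="azure_doc",
        server_url="https://learn.microsoft.com/api/mcp",
    )
    print(deepwiki.tool_allowed("ask_question"), azure_doc.require_approval)
```

In the actual sample, the same values are passed to the SDK's `MCPServer` class as keyword arguments.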
Learn how to connect remote MCP servers to a Voice Live session using the VoiceLive SDK for C#. This article builds on the Quickstart: Create a Voice Live real-time voice agent with MCP server integration.
Reference documentation | Package (NuGet) | Additional samples on GitHub
Follow the how-to below or get the full sample code:
Prerequisites
- An Azure subscription. Create one for free.
- .NET 8.0 SDK or later.
- A Microsoft Foundry resource created in one of the supported regions. For more information about region availability, see the Voice Live overview documentation.
- `Azure.AI.VoiceLive` package version 1.1.0 or later (MCP support requires API version `2026-04-10` or later).
- Assign the `Cognitive Services User` role to your user account. You can assign roles in the Azure portal under Access control (IAM) > Add role assignment.
Tip
To use Voice Live with MCP, you don't need to deploy an audio model with your Foundry resource. Voice Live is fully managed, and the model is automatically deployed for you. For more information about model availability, see the Voice Live overview documentation.
Prepare the environment
Complete the Voice Live quickstart to set up your environment, configure authentication, and test your first Voice Live conversation.
MCP integration concepts
MCP server definition
Use the VoiceLiveMcpServerDefinition class to declare each remote MCP endpoint. At minimum, provide ServerLabel (a display name) and ServerUrl (the MCP endpoint URL). Optionally restrict available tools with AllowedTools and configure the approval mode.
Approval modes
Control whether MCP tool calls require user approval before execution:
- `RequireApproval = "never"`: The tool executes automatically when the model invokes it.
- `RequireApproval = "always"` (default): The client receives an approval request and must respond before the tool runs.
API version requirement
MCP support requires API version 2026-04-10 or later.
Define MCP servers
Define the MCP servers that Voice Live can use during the session. Each server is a VoiceLiveMcpServerDefinition instance added to the tools list in the session configuration.
The following code defines two MCP servers: one with automatic tool execution and one that requires user approval before running.
/// <summary>
/// Define MCP servers that Voice Live can use during the session.
/// Each server is a VoiceLiveMcpServerDefinition instance added to the session options tools list.
/// </summary>
private List<VoiceLiveToolDefinition> DefineMCPServers()
{
var mcpTools = new List<VoiceLiveToolDefinition>
{
new VoiceLiveMcpServerDefinition("deepwiki", "https://mcp.deepwiki.com/mcp")
{
AllowedTools = { "read_wiki_structure", "ask_question" },
RequireApproval = BinaryData.FromString("\"never\""),
},
new VoiceLiveMcpServerDefinition("azure_doc", "https://learn.microsoft.com/api/mcp")
{
RequireApproval = BinaryData.FromString("\"always\""),
},
};
return mcpTools;
}
In this sample:
- The `deepwiki` server allows only the `read_wiki_structure` and `ask_question` tools, with `RequireApproval` set to `"never"` for automatic execution.
- The `azure_doc` server allows all tools on the endpoint, with `RequireApproval` set to `"always"` so users can review each call before execution.
Configure the session with MCP tools
Pass the MCP server definitions to the session options tools list alongside your voice, modality, and turn-detection settings.
private async Task SetupSessionAsync(CancellationToken cancellationToken)
{
_logger.LogInformation("Setting up session with MCP tools...");
var azureVoice = new AzureStandardVoice(_voice);
var turnDetection = new ServerVadTurnDetection
{
Threshold = 0.5f,
PrefixPadding = TimeSpan.FromMilliseconds(300),
SilenceDuration = TimeSpan.FromMilliseconds(500)
};
// Create session options and add MCP servers to the tools list
var sessionOptions = new VoiceLiveSessionOptions
{
InputAudioEchoCancellation = new AudioEchoCancellation(),
Model = _model,
Instructions = _instructions,
Voice = azureVoice,
InputAudioFormat = InputAudioFormat.Pcm16,
OutputAudioFormat = OutputAudioFormat.Pcm16,
TurnDetection = turnDetection
};
// Enable input audio transcription so we receive
// SessionUpdateConversationItemInputAudioTranscriptionCompleted events
// (required for the voice-based approval flow).
sessionOptions.InputAudioTranscription = new AudioInputTranscriptionOptions(
_model.Contains("realtime", StringComparison.OrdinalIgnoreCase) ? "whisper-1" : "azure-speech");
sessionOptions.Modalities.Clear();
sessionOptions.Modalities.Add(InteractionModality.Text);
sessionOptions.Modalities.Add(InteractionModality.Audio);
// Add MCP servers to the tools list
var mcpServers = DefineMCPServers();
foreach (var tool in mcpServers)
{
sessionOptions.Tools.Add(tool);
}
// Track which servers require approval for per-turn loop prevention
_approvalServers = new HashSet<string> { "azure_doc" };
await _session!.ConfigureSessionAsync(sessionOptions, cancellationToken).ConfigureAwait(false);
_logger.LogInformation("Session with MCP tools configured");
}
In this sample:
- `VoiceLiveSessionOptions` bundles MCP tools with audio format, voice, and turn detection settings.
- `ConfigureSessionAsync(options)` sends the full configuration to Voice Live.
- Voice Live automatically discovers available tools from each MCP server after the session starts.
Handle MCP events
Process MCP-specific events in the event loop. The key events include MCP tool call creation, completion, failure, and approval requests.
private async Task HandleSessionUpdateAsync(SessionUpdate serverEvent, CancellationToken cancellationToken)
{
switch (serverEvent)
{
case SessionUpdateSessionUpdated sessionUpdated:
_logger.LogInformation("Session updated");
WriteLog($"SessionID: {sessionUpdated.Session?.Id}");
WriteLog($"Model: {_model}");
WriteLog($"Voice: {_voice}");
WriteLog("");
if (_audioProcessor != null)
await _audioProcessor.StartCaptureAsync().ConfigureAwait(false);
break;
case SessionUpdateInputAudioBufferSpeechStarted:
Console.WriteLine("🎤 Listening...");
if (_audioProcessor != null)
await _audioProcessor.StopPlaybackAsync().ConfigureAwait(false);
if (_responseActive && _canCancelResponse)
{
try { await _session!.CancelResponseAsync(cancellationToken).ConfigureAwait(false); }
catch { }
try { await _session!.ClearStreamingAudioAsync(cancellationToken).ConfigureAwait(false); }
catch { }
}
// Do NOT reset _approvalCallCount here. The counter should only
// reset on task completion (in MCP-call-completed when no pending/queued
// approvals remain) or on explicit denial (in ResolveVoiceApprovalAsync).
// Resetting on every speech-start would let the model retry denied calls.
// Clear deferred response flags if no MCP calls are in progress.
// Prevents stale needsResponseCreate from re-triggering result playback
// after the user interrupts.
if (_mcpCallInProgress <= 0)
{
_needsResponseCreate = false;
_mcpResultsPending = false;
}
// Reset approved-servers-this-turn when user starts a new topic
if (_pendingApproval == null && _mcpCallInProgress <= 0)
_approvedServersThisTurn.Clear();
// If an MCP call is running, ask the user if they want to wait or skip
if (_mcpCallInProgress > 0 && _pendingApproval == null)
{
foreach (var id in _activeMcpItems) _staleMcpItems.Add(id);
_logger.LogInformation("User spoke during MCP call; marking {Count} calls as stale", _activeMcpItems.Count);
try
{
await _session!.SendCommandAsync(BinaryData.FromObjectAsJson(new
{
type = "conversation.item.create",
item = new
{
type = "message",
role = "system",
content = new[] { new { type = "input_text", text = "A tool call is still running in the background. The user just spoke. Respond to what the user said. If a tool result arrives later, briefly introduce it as a late result from an earlier request." } }
}
}), cancellationToken).ConfigureAwait(false);
}
catch (Exception ex) { _logger.LogWarning("Failed to inject MCP status update: {Error}", ex.Message); }
}
break;
case SessionUpdateInputAudioBufferSpeechStopped:
Console.WriteLine("🤔 Processing...");
if (_audioProcessor != null)
await _audioProcessor.StartPlaybackAsync().ConfigureAwait(false);
break;
case SessionUpdateResponseCreated:
_responseActive = true;
_canCancelResponse = true;
break;
case SessionUpdateResponseAudioDelta audioDelta:
if (audioDelta.Delta != null && _audioProcessor != null)
await _audioProcessor.QueueAudioAsync(audioDelta.Delta.ToArray()).ConfigureAwait(false);
break;
case SessionUpdateResponseAudioDone:
Console.WriteLine("🎤 Ready for next input...");
break;
case SessionUpdateResponseDone:
_responseActive = false;
_canCancelResponse = false;
WriteLog("--- Response complete ---");
// If an approval prompt needs to be injected, do it now
if (_approvalPromptNeeded && _pendingApproval != null)
{
_approvalPromptNeeded = false;
await SendApprovalVoicePromptAsync(cancellationToken).ConfigureAwait(false);
}
// If MCP results are pending and all calls are now done, create response
else if (_mcpResultsPending && _mcpCallInProgress <= 0 && _pendingApproval == null)
{
_mcpResultsPending = false;
try { await _session!.SendCommandAsync(BinaryData.FromObjectAsJson(new { type = "response.create" }), cancellationToken).ConfigureAwait(false); }
catch { }
}
else if (_needsResponseCreate)
{
_needsResponseCreate = false;
try { await _session!.SendCommandAsync(BinaryData.FromObjectAsJson(new { type = "response.create" }), cancellationToken).ConfigureAwait(false); }
catch { }
}
break;
case SessionUpdateError errorEvent:
var msg = errorEvent.Error?.Message ?? "";
if (!msg.Contains("no active response", StringComparison.OrdinalIgnoreCase))
{
// Suppress non-fatal interim/collision errors
if (msg.Contains("interim response", StringComparison.OrdinalIgnoreCase))
{
_logger.LogWarning("Interim response not supported with this model pipeline (non-fatal)");
}
else if (msg.Contains("active response", StringComparison.OrdinalIgnoreCase))
{
_logger.LogDebug("Response collision (expected during MCP flow): {Message}", msg);
}
else
{
Console.WriteLine($"❌ Error: {msg}");
WriteLog($"ERROR: {msg}");
}
}
_responseActive = false;
_canCancelResponse = false;
break;
// Transcription event: used for voice-based approval resolution
case SessionUpdateConversationItemInputAudioTranscriptionCompleted transcription:
var transcript = transcription.Transcript ?? "";
_logger.LogInformation("User said: {Transcript}", transcript);
Console.WriteLine($"👤 You said:\t{transcript}");
WriteLog($"User Input:\t{transcript}");
if (_pendingApproval != null)
{
await ResolveVoiceApprovalAsync(transcript, cancellationToken).ConfigureAwait(false);
}
break;
// MCP-specific events
case SessionUpdateMcpListToolsCompleted mcpListDone:
Console.WriteLine("🔧 MCP tools discovered successfully");
WriteLog("MCP tools discovered successfully");
_logger.LogInformation("MCP tools discovered for server");
break;
case SessionUpdateMcpListToolsFailed:
Console.WriteLine("❌ MCP tool discovery failed");
WriteLog("ERROR: MCP tool discovery failed");
break;
case SessionUpdateResponseMcpCallInProgress mcpInProgress:
Console.WriteLine("⏳ MCP tool call in progress...");
WriteLog($"MCP call in progress: {mcpInProgress.ItemId}");
_mcpCallInProgress++;
_activeMcpItems.Add(mcpInProgress.ItemId ?? "");
StartMcpStallTimer(cancellationToken);
break;
case SessionUpdateResponseMcpCallCompleted mcpCompleted:
{
var itemId = mcpCompleted.ItemId ?? "";
_mcpCallInProgress = Math.Max(0, _mcpCallInProgress - 1);
_activeMcpItems.Remove(itemId);
CancelMcpStallTimer();
if (_handledMcpCompletions.Contains(itemId))
{
_logger.LogDebug("Ignoring duplicate MCP completion for {ItemId}", itemId);
}
else
{
_handledMcpCompletions.Add(itemId);
bool isStale = _staleMcpItems.Remove(itemId);
_logger.LogInformation("MCP call completed for {ItemId} (stale={IsStale})", itemId, isStale);
Console.WriteLine("✅ MCP tool call completed successfully");
WriteLog($"MCP call completed: {itemId} (stale={isStale})");
// Clean up item mapping
_mcpItemToServer.Remove(itemId);
// Reset approval counter if no more approvals are pending
if (_pendingApproval == null && _approvalQueue.Count == 0)
_approvalCallCount.Clear();
// If the user moved on during this call, tell the model it's a late result
if (isStale)
{
try
{
await _session!.SendCommandAsync(BinaryData.FromObjectAsJson(new
{
type = "conversation.item.create",
item = new
{
type = "message",
role = "system",
content = new[] { new { type = "input_text", text = "This tool result is from an earlier request. The user has since moved on. Briefly introduce it as a late result, e.g. 'By the way, those results from earlier just came in...' then share the key findings concisely." } }
}
}), cancellationToken).ConfigureAwait(false);
}
catch (Exception ex) { _logger.LogWarning("Failed to inject late-result context: {Error}", ex.Message); }
}
// Batch response: only call response.create when ALL MCP calls for this
// turn have completed. This prevents partial results and repeated tool calls.
if (_pendingApproval == null && _approvalQueue.Count == 0 && _mcpCallInProgress <= 0)
{
try
{
await _session!.SendCommandAsync(BinaryData.FromObjectAsJson(new { type = "response.create" }), cancellationToken).ConfigureAwait(false);
}
catch (Exception ex)
{
if (ex.Message.Contains("active response", StringComparison.OrdinalIgnoreCase))
_needsResponseCreate = true;
else
_logger.LogWarning("Failed to create response after MCP call: {Error}", ex.Message);
}
}
else
{
_mcpResultsPending = true;
_logger.LogInformation("MCP calls still in progress ({Count}); deferring response", _mcpCallInProgress);
}
}
break;
}
case SessionUpdateResponseMcpCallFailed mcpFailed:
{
var failedItemId = mcpFailed.ItemId ?? "";
Console.WriteLine("❌ MCP tool call failed");
WriteLog($"ERROR: MCP call failed: {failedItemId}");
_mcpCallInProgress = Math.Max(0, _mcpCallInProgress - 1);
_activeMcpItems.Remove(failedItemId);
_staleMcpItems.Remove(failedItemId);
CancelMcpStallTimer();
try { await _session!.SendCommandAsync(BinaryData.FromObjectAsJson(new { type = "response.create" }), cancellationToken).ConfigureAwait(false); }
catch { }
break;
}
case SessionUpdateConversationItemCreated itemCreated
when itemCreated.Item is SessionResponseMcpApprovalRequestItem mcpApproval:
await HandleMCPApprovalAsync(mcpApproval, cancellationToken).ConfigureAwait(false);
break;
case SessionUpdateConversationItemCreated itemCreated:
_logger.LogDebug("Conversation item created: {ItemType}", itemCreated.Item?.GetType().Name);
// Track mcp_call items for server mapping and announce non-approval tool calls
if (itemCreated.Item is SessionResponseMcpCallItem mcpCallItem)
{
var serverLabel = mcpCallItem.ServerLabel ?? "";
var functionName = mcpCallItem.Name ?? "";
var mcpItemId = mcpCallItem.Id ?? "";
_logger.LogInformation("MCP Call triggered: server_label={Server}, function_name={Function}", serverLabel, functionName);
Console.WriteLine($"🔧 MCP tool call: {serverLabel}/{functionName}");
if (!string.IsNullOrEmpty(mcpItemId))
_mcpItemToServer[mcpItemId] = $"{serverLabel}/{functionName}";
// Announce the tool call so the user knows something is happening
if (_pendingApproval == null && !_approvalServers.Contains(serverLabel))
{
try
{
await _session!.SendCommandAsync(BinaryData.FromObjectAsJson(new
{
type = "conversation.item.create",
item = new
{
type = "message",
role = "system",
content = new[] { new { type = "input_text", text = "Briefly tell the user you're looking something up. One short sentence only." } }
}
}), cancellationToken).ConfigureAwait(false);
await _session!.SendCommandAsync(BinaryData.FromObjectAsJson(new { type = "response.create" }), cancellationToken).ConfigureAwait(false);
}
catch (Exception ex)
{
if (!ex.Message.Contains("active response", StringComparison.OrdinalIgnoreCase))
_logger.LogWarning("Failed to create tool announcement: {Error}", ex.Message);
}
}
}
break;
default:
_logger.LogDebug("Unhandled event: {EventType}", serverEvent.GetType().Name);
break;
}
}
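The batching rule in the handler above (defer `response.create` until every MCP call for the turn has finished, and ignore duplicate completion events) can be isolated into a few lines. The following is a simplified, language-neutral sketch written in Python, the language of the other quickstart on this page. `McpCallTracker` and its field names are hypothetical, and the stale-result and retry handling from the C# sample is left out.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class McpCallTracker:
    """Hypothetical helper that mirrors the counters in the handler above."""

    in_progress: int = 0
    approvals_waiting: int = 0  # pending plus queued approval requests
    handled: Set[str] = field(default_factory=set)
    results_pending: bool = False

    def on_call_in_progress(self) -> None:
        # Corresponds to the "MCP call in progress" event.
        self.in_progress += 1

    def on_call_completed(self, item_id: str) -> bool:
        """Return True when the client should send response.create now."""
        self.in_progress = max(0, self.in_progress - 1)
        if item_id in self.handled:
            return False  # duplicate completion event, already processed
        self.handled.add(item_id)
        if self.in_progress == 0 and self.approvals_waiting == 0:
            self.results_pending = False
            return True
        # Other calls or approvals are outstanding: defer the response.
        self.results_pending = True
        return False


tracker = McpCallTracker()
tracker.on_call_in_progress()
tracker.on_call_in_progress()
first = tracker.on_call_completed("call_a")   # one call still running
second = tracker.on_call_completed("call_b")  # last call finished
```

Here `first` is `False` and `second` is `True`, so the model speaks once with both results instead of producing a partial answer after the first call. The item IDs `call_a` and `call_b` are made up for the example.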
Handle approval requests
When a server is configured with RequireApproval = "always", client code must handle the approval flow. Instead of blocking on Console.ReadLine(), inject a system message so the model asks the user verbally and parse the spoken transcript for intent.
/// <summary>
/// Handle MCP approval request by asking the user via voice.
/// </summary>
private async Task HandleMCPApprovalAsync(SessionResponseMcpApprovalRequestItem approvalItem, CancellationToken cancellationToken)
{
var approvalId = approvalItem.Id;
var serverLabel = approvalItem.ServerLabel ?? "";
var toolName = approvalItem.Name ?? "";
if (string.IsNullOrEmpty(approvalId))
{
_logger.LogError("MCP approval item missing ID");
return;
}
// If another approval is already pending, queue this one
if (_pendingApproval != null)
{
_logger.LogInformation("Queuing approval for {Tool}; another is already pending", toolName);
_approvalQueue.Enqueue(new ApprovalInfo(approvalId, serverLabel, toolName));
return;
}
const int MaxApprovalCallsPerTask = 3;
_approvalCallCount.TryGetValue(serverLabel, out var currentCount);
if (currentCount >= MaxApprovalCallsPerTask)
{
_logger.LogInformation("Auto-denying {Tool}: reached {Count} calls this task", toolName, currentCount);
Console.WriteLine($" Auto-denied: {serverLabel}/{toolName} (max {MaxApprovalCallsPerTask} calls reached)");
try
{
await _session!.SendCommandAsync(BinaryData.FromObjectAsJson(new
{
type = "conversation.item.create",
item = new
{
type = "mcp_approval_response",
approval_request_id = approvalId,
approve = false
}
}), cancellationToken).ConfigureAwait(false);
}
catch (Exception ex)
{
_logger.LogWarning("Failed to send auto-deny: {Error}", ex.Message);
}
return;
}
// Auto-approve if user already approved this server earlier in the same turn
if (_approvedServersThisTurn.Contains(serverLabel))
{
_logger.LogInformation("Auto-approving {Tool}: server already approved this turn", toolName);
Console.WriteLine($" Auto-approved: {serverLabel}/{toolName} (already approved this turn)");
try
{
await _session!.SendCommandAsync(BinaryData.FromObjectAsJson(new
{
type = "conversation.item.create",
item = new
{
type = "mcp_approval_response",
approval_request_id = approvalId,
approve = true
}
}), cancellationToken).ConfigureAwait(false);
}
catch (Exception ex)
{
_logger.LogWarning("Failed to send auto-approve: {Error}", ex.Message);
}
return;
}
_logger.LogInformation("MCP approval request: server={Server} tool={Tool}", serverLabel, toolName);
Console.WriteLine();
Console.WriteLine($"🔐 MCP Approval Request (voice-based):");
Console.WriteLine($" Server: {serverLabel} Tool: {toolName}");
WriteLog($"Approval request: server={serverLabel} tool={toolName}");
_pendingApproval = new ApprovalInfo(approvalId, serverLabel, toolName);
if (!_responseActive)
{
await SendApprovalVoicePromptAsync(cancellationToken).ConfigureAwait(false);
}
else
{
_approvalPromptNeeded = true;
}
}
/// <summary>
/// Inject a system message asking the model to verbally request permission.
/// </summary>
private async Task SendApprovalVoicePromptAsync(CancellationToken cancellationToken)
{
var pending = _pendingApproval;
if (pending == null) return;
var server = pending.ServerLabel;
_approvalCallCount.TryGetValue(server, out var callCount);
_approvalCallCount[server] = callCount + 1;
string prompt;
if (callCount == 0)
{
prompt = "You MUST ask the user for explicit permission before proceeding. "
+ $"Say exactly: \"I'd like to search the {server} service for information. "
+ "Do you approve? Please say yes or no.\"";
}
else
{
prompt = "You MUST ask the user for permission again. "
+ "Say exactly: \"I need to do one more search to get complete information. "
+ "Should I continue? Please say yes or no.\"";
}
try
{
await _session!.SendCommandAsync(BinaryData.FromObjectAsJson(new
{
type = "conversation.item.create",
item = new
{
type = "message",
role = "system",
content = new[] { new { type = "input_text", text = prompt } }
}
}), cancellationToken).ConfigureAwait(false);
await _session!.SendCommandAsync(BinaryData.FromObjectAsJson(new { type = "response.create" }), cancellationToken).ConfigureAwait(false);
}
catch (Exception ex)
{
_logger.LogWarning("Failed to send approval voice prompt: {Error}", ex.Message);
}
}
In this sample:
- A system message instructs the model to verbally ask for permission.
- An `mcp_approval_response` conversation item sends the decision back to Voice Live with `approve` set to `true` or `false`.
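The order of checks in `HandleMCPApprovalAsync` and the JSON payload it sends can be summarized in a short, language-neutral sketch. It's written in Python with only the standard library. The function names are hypothetical, the cap of 3 is copied from the C# sample, and the approval ID in the usage lines is a made-up placeholder.

```python
import json
from typing import Dict, Set

MAX_APPROVAL_CALLS_PER_TASK = 3  # same cap as the C# sample above


def decide_approval_action(
    server_label: str,
    has_pending_approval: bool,
    call_counts: Dict[str, int],
    approved_servers_this_turn: Set[str],
) -> str:
    """Return "queue", "auto_deny", "auto_approve", or "ask"."""
    if has_pending_approval:
        return "queue"  # only one spoken approval prompt at a time
    if call_counts.get(server_label, 0) >= MAX_APPROVAL_CALLS_PER_TASK:
        return "auto_deny"  # stop runaway tool-call loops
    if server_label in approved_servers_this_turn:
        return "auto_approve"  # the user already said yes this turn
    return "ask"


def approval_response_event(approval_request_id: str, approve: bool) -> str:
    """Serialize the mcp_approval_response item that the sample sends."""
    return json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "mcp_approval_response",
            "approval_request_id": approval_request_id,
            "approve": approve,
        },
    })


# Usage: the fourth request to the same server in one task is denied.
action = decide_approval_action("azure_doc", False, {"azure_doc": 3}, set())
payload = approval_response_event("approval_123", approve=(action == "auto_approve"))
```

Checking the cap before the per-turn auto-approval matters: a server the user approved earlier still can't exceed the per-task limit.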
Resolve voice-based approval
Parse the user's spoken transcript to determine approval. Use word-boundary regex to avoid false positives from words like "yesterday" or "nobody".
/// <summary>
/// Interpret the user's spoken response as approval or denial.
/// </summary>
private async Task ResolveVoiceApprovalAsync(string transcript, CancellationToken cancellationToken)
{
var pending = _pendingApproval;
if (pending == null) return;
var text = transcript.Trim().ToLowerInvariant();
bool approved = Regex.IsMatch(text, @"\byes\b");
bool denied = Regex.IsMatch(text, @"\b(no|stop|cancel)\b");
if (!approved && !denied)
{
// Ambiguous: ask again via the deferred prompt mechanism
_logger.LogInformation("Ambiguous approval response: {Transcript}", transcript);
_approvalPromptNeeded = true;
return;
}
if (approved && denied)
{
// Conflicting signals: treat as denial for safety
approved = false;
}
// Clear the pending state before sending the response
_pendingApproval = null;
if (approved)
_approvedServersThisTurn.Add(pending.ServerLabel);
else
{
_approvalCallCount.Clear();
_approvedServersThisTurn.Remove(pending.ServerLabel);
}
try
{
await _session!.SendCommandAsync(BinaryData.FromObjectAsJson(new
{
type = "conversation.item.create",
item = new
{
type = "mcp_approval_response",
approval_request_id = pending.ApprovalId,
approve = approved,
}
}), cancellationToken).ConfigureAwait(false);
}
catch (Exception ex)
{
_logger.LogError("Failed to send approval response: {Error}", ex.Message);
return;
}
_logger.LogInformation("Voice approval resolved: {Approved} for {Tool}", approved, pending.FunctionName);
Console.WriteLine($" Voice approval: {(approved ? "Approved ✅" : "Denied ❌")}");
WriteLog($"Approval resolved: {(approved ? "APPROVED" : "DENIED")} for {pending.ServerLabel}/{pending.FunctionName}");
// Process next queued approval, if any
await ProcessNextApprovalAsync(cancellationToken).ConfigureAwait(false);
}
/// <summary>
/// Pop the next queued approval and ask via voice.
/// </summary>
private async Task ProcessNextApprovalAsync(CancellationToken cancellationToken)
{
if (_approvalQueue.Count == 0) return;
var next = _approvalQueue.Dequeue();
_pendingApproval = next;
if (!_responseActive)
{
await SendApprovalVoicePromptAsync(cancellationToken).ConfigureAwait(false);
}
else
{
_approvalPromptNeeded = true;
}
}
In this sample:
- The transcript from `SessionUpdateConversationItemInputAudioTranscriptionCompleted` is matched against the `\byes\b` and `\b(no|stop|cancel)\b` patterns. If both match, the response is treated as a denial. If neither matches, the user is asked again.
- Subsequent calls to the same server within the same turn are auto-approved to avoid repeated prompts.
- After a configurable maximum (for example, 3 approvals), further calls are auto-denied and the model responds with what it has.
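To see why the word boundaries matter, the same two patterns can be exercised in isolation. The sketch below is a Python equivalent of the matching step only (Python's `re` module treats `\b` the same way for these inputs). `resolve_voice_approval` is a hypothetical name, and the three-way return value stands in for the approve, deny, and ask-again branches of the C# method.

```python
import re
from typing import Optional

YES_PATTERN = re.compile(r"\byes\b")
NO_PATTERN = re.compile(r"\b(no|stop|cancel)\b")


def resolve_voice_approval(transcript: str) -> Optional[bool]:
    """Return True (approve), False (deny), or None (ambiguous, ask again)."""
    text = transcript.strip().lower()
    approved = bool(YES_PATTERN.search(text))
    denied = bool(NO_PATTERN.search(text))
    if not approved and not denied:
        return None
    if approved and denied:
        return False  # conflicting signals are treated as denial
    return approved


clear_yes = resolve_voice_approval("Yes, go ahead.")
clear_no = resolve_voice_approval("No thanks.")
false_positive_check = resolve_voice_approval("Nobody called yesterday.")
conflicting = resolve_voice_approval("Yes... actually, no.")
```

A plain substring test would read "Nobody called yesterday." as both a yes and a no. With the boundaries in place it matches neither pattern, so the assistant asks again.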
Detect stalls during MCP tool calls
MCP tool calls can take several seconds. Use a repeating timer to proactively inform the user that the assistant is still waiting for results.
private void StartMcpStallTimer(CancellationToken ct)
{
CancelMcpStallTimer();
_mcpStallCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
var token = _mcpStallCts.Token;
_ = Task.Run(async () =>
{
int stallCount = 0;
while (_mcpCallInProgress > 0 && stallCount < 3)
{
await Task.Delay(10000, token).ConfigureAwait(false);
if (_mcpCallInProgress <= 0 || _session == null)
break;
stallCount++;
// MCP calls cannot be cancelled. Only honest status updates are possible.
string msg = "The tool call is still running. Briefly reassure the user that you're still waiting for results. One short sentence only.";
try
{
await _session.SendCommandAsync(BinaryData.FromObjectAsJson(new
{
type = "conversation.item.create",
item = new
{
type = "message",
role = "system",
content = new[] { new { type = "input_text", text = msg } }
}
}), token).ConfigureAwait(false);
await _session.SendCommandAsync(BinaryData.FromObjectAsJson(new { type = "response.create" }), token).ConfigureAwait(false);
}
catch (Exception ex)
{
if (ex.Message.Contains("active response", StringComparison.OrdinalIgnoreCase))
_needsResponseCreate = true;
}
}
}, token);
}
private void CancelMcpStallTimer()
{
if (_mcpStallCts != null)
{
_mcpStallCts.Cancel();
_mcpStallCts.Dispose();
_mcpStallCts = null;
}
}
In this sample:
- A timer with a 10-second interval injects a system message that asks the model to reassure the user that it's still waiting, up to 3 times per timer.
- `CancelMcpStallTimer` stops the timer when an MCP call completes or fails, and starting a new timer cancels the previous one.
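The reminder loop itself is independent of the SDK. The following is a minimal `asyncio` sketch of the same pattern in Python: wait, recheck whether the call is still running, then send a capped number of reminders. `stall_reminders`, `is_call_running`, and `send_reminder` are hypothetical names, and the demo shortens the interval so it finishes quickly.

```python
import asyncio
from typing import Awaitable, Callable, List


async def stall_reminders(
    is_call_running: Callable[[], bool],
    send_reminder: Callable[[int], Awaitable[None]],
    interval_seconds: float = 10.0,
    max_reminders: int = 3,
) -> int:
    """Send up to max_reminders status updates while a tool call is running."""
    sent = 0
    while is_call_running() and sent < max_reminders:
        await asyncio.sleep(interval_seconds)
        if not is_call_running():
            break  # the call finished while we were waiting
        sent += 1
        await send_reminder(sent)
    return sent


async def _demo() -> List[str]:
    log: List[str] = []
    state = {"running": True}

    async def send_reminder(n: int) -> None:
        log.append(f"reminder {n}")
        if n == 2:
            state["running"] = False  # pretend the tool call completed

    await stall_reminders(lambda: state["running"], send_reminder, interval_seconds=0.01)
    return log


reminders = asyncio.run(_demo())
```

In a real client, the coroutine would run as a background task, and cancelling that task plays the role of `CancelMcpStallTimer`.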
Run the sample
Create the MCPQuickstart.cs file with the following code:
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;
using Azure.AI.VoiceLive;
using Azure.Identity;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.Logging;
using NAudio.Wave;

namespace Azure.AI.VoiceLive.Samples
{
    /// <summary>
    /// MCP Quickstart - demonstrates MCP server integration with VoiceLive SDK.
    /// Shows how to define MCP servers, handle MCP tool calls, and implement
    /// an approval flow for tool calls that require user consent.
    /// </summary>
    public class Program
    {
        public static async Task<int> Main(string[] args)
        {
            // Setup configuration
            var configuration = new ConfigurationBuilder()
                .AddJsonFile("appsettings.json", optional: true)
                .AddEnvironmentVariables()
                .Build();
            var apiKey = configuration["VoiceLive:ApiKey"] ?? Environment.GetEnvironmentVariable("AZURE_VOICELIVE_API_KEY");
            var endpoint = configuration["VoiceLive:Endpoint"] ?? Environment.GetEnvironmentVariable("AZURE_VOICELIVE_ENDPOINT") ?? "https://your-resource-name.services.ai.azure.com/";
            var model = configuration["VoiceLive:Model"] ?? Environment.GetEnvironmentVariable("AZURE_VOICELIVE_MODEL") ?? "gpt-realtime";
            var voice = configuration["VoiceLive:Voice"] ?? Environment.GetEnvironmentVariable("AZURE_VOICELIVE_VOICE") ?? "en-US-Ava:DragonHDLatestNeural";
            var instructions = configuration["VoiceLive:Instructions"] ?? "You are a helpful AI assistant with access to MCP tools. Use the tools to help answer user questions. Respond naturally and conversationally. Some tools require user approval before they can be used. When you receive a system message asking you to request permission, you MUST clearly ask the user for their explicit approval before proceeding. Always wait for the user to say yes or no. Never skip the approval question or assume permission is granted. If a tool result arrives after the conversation has moved to a different topic, briefly introduce it as a late result before sharing the findings.";
            var useTokenCredential = args.Length > 0 && args[0] == "--use-token-credential";

            // Setup logging
            using var loggerFactory = LoggerFactory.Create(builder =>
            {
                builder.AddConsole();
                builder.SetMinimumLevel(LogLevel.Information);
            });
            var logger = loggerFactory.CreateLogger<Program>();

            // Validate credentials
            if (string.IsNullOrEmpty(apiKey) && !useTokenCredential)
            {
                Console.WriteLine("❌ Error: No authentication provided");
                Console.WriteLine("Set AZURE_VOICELIVE_API_KEY or use --use-token-credential.");
                return 1;
            }

            // Check audio system
            if (!CheckAudioSystem(logger))
                return 1;

            try
            {
                VoiceLiveClient client;
                if (useTokenCredential)
                {
                    client = new VoiceLiveClient(new Uri(endpoint), new DefaultAzureCredential(), new VoiceLiveClientOptions());
                    logger.LogInformation("Using Azure token credential");
                }
                else
                {
                    client = new VoiceLiveClient(new Uri(endpoint), new AzureKeyCredential(apiKey!), new VoiceLiveClientOptions());
                    logger.LogInformation("Using API key credential");
                }

                using var assistant = new MCPVoiceAssistant(client, model, voice, instructions, loggerFactory);
                using var cts = new CancellationTokenSource();
                Console.CancelKeyPress += (sender, e) =>
                {
                    e.Cancel = true;
                    cts.Cancel();
                };
                await assistant.StartAsync(cts.Token).ConfigureAwait(false);
            }
            catch (OperationCanceledException)
            {
                Console.WriteLine("\n👋 Voice assistant with MCP shut down. Goodbye!");
            }
            catch (Exception ex)
            {
                logger.LogError(ex, "Fatal error");
                Console.WriteLine($"❌ Error: {ex.Message}");
                return 1;
            }
            return 0;
        }

        private static bool CheckAudioSystem(ILogger logger)
        {
            try
            {
                using var waveIn = new WaveInEvent
                {
                    WaveFormat = new WaveFormat(24000, 16, 1),
                    BufferMilliseconds = 50
                };
                waveIn.DataAvailable += (_, __) => { };
                waveIn.StartRecording();
                waveIn.StopRecording();
                var buffer = new BufferedWaveProvider(new WaveFormat(24000, 16, 1))
                {
                    BufferDuration = TimeSpan.FromMilliseconds(200)
                };
                using var waveOut = new WaveOutEvent { DesiredLatency = 100 };
                waveOut.Init(buffer);
                waveOut.Play();
                waveOut.Stop();
                logger.LogInformation("Audio system check passed");
                return true;
            }
            catch (Exception ex)
            {
                Console.WriteLine($"❌ Audio system check failed: {ex.Message}");
                return false;
            }
        }
    }

    /// <summary>
    /// Voice assistant with MCP server integration.
    /// </summary>
    public class MCPVoiceAssistant : IDisposable
    {
        private readonly VoiceLiveClient _client;
        private readonly string _model;
        private readonly string _voice;
        private readonly string _instructions;
        private readonly ILogger<MCPVoiceAssistant> _logger;
        private readonly ILoggerFactory _loggerFactory;
        private VoiceLiveSession? _session;
        private AudioProcessor? _audioProcessor;
        private bool _disposed;
        private bool _responseActive;
        private bool _canCancelResponse;

        // Voice-based MCP approval state
        private record ApprovalInfo(string ApprovalId, string ServerLabel, string FunctionName);
        private ApprovalInfo? _pendingApproval;
        private readonly Queue<ApprovalInfo> _approvalQueue = new();
        private bool _approvalPromptNeeded;
        private int _mcpCallInProgress;
        private readonly HashSet<string> _handledMcpCompletions = new();
        private bool _needsResponseCreate;
        private readonly Dictionary<string, int> _approvalCallCount = new();
        private readonly Dictionary<string, string> _mcpItemToServer = new();
        private HashSet<string> _approvalServers = new();
        private CancellationTokenSource? _mcpStallCts;
        private readonly HashSet<string> _activeMcpItems = new();
        private readonly HashSet<string> _staleMcpItems = new();
        private bool _mcpResultsPending;
        private readonly HashSet<string> _approvedServersThisTurn = new();
        private static readonly string LogFilename = $"conversation_{DateTime.Now:yyyyMMdd_HHmmss}.log";

        public MCPVoiceAssistant(
            VoiceLiveClient client,
            string model,
            string voice,
            string instructions,
            ILoggerFactory loggerFactory)
        {
            _client = client;
            _model = model;
            _voice = voice;
            _instructions = instructions;
            _loggerFactory = loggerFactory;
            _logger = loggerFactory.CreateLogger<MCPVoiceAssistant>();
        }

        public async Task StartAsync(CancellationToken cancellationToken = default)
        {
            try
            {
                _logger.LogInformation("Connecting to VoiceLive API with model {Model}", _model);
                _session = await _client.StartSessionAsync(_model, cancellationToken).ConfigureAwait(false);
                _audioProcessor = new AudioProcessor(_session, _loggerFactory.CreateLogger<AudioProcessor>());
                await SetupSessionAsync(cancellationToken).ConfigureAwait(false);
                await _audioProcessor.StartPlaybackAsync().ConfigureAwait(false);
                await _audioProcessor.StartCaptureAsync().ConfigureAwait(false);
                _logger.LogInformation("Voice assistant with MCP ready!");
                Console.WriteLine();
                Console.WriteLine(new string('=', 70));
                Console.WriteLine("🎤 VOICE ASSISTANT WITH MCP READY");
                Console.WriteLine("Try saying:");
                Console.WriteLine(" • 'What is the GitHub repo fastapi about?'");
                Console.WriteLine(" • 'Search the Azure documentation for Voice Live API.'");
                Console.WriteLine("You may need to approve some MCP tool calls in the console.");
                Console.WriteLine("Press Ctrl+C to exit");
                Console.WriteLine(new string('=', 70));
                Console.WriteLine();
                await ProcessEventsAsync(cancellationToken).ConfigureAwait(false);
            }
            catch (OperationCanceledException)
            {
                _logger.LogInformation("Shutting down...");
            }
            finally
            {
                if (_audioProcessor != null)
                    await _audioProcessor.CleanupAsync().ConfigureAwait(false);
            }
        }

        // <define_mcp_servers>
        /// <summary>
        /// Define MCP servers that Voice Live can use during the session.
        /// Each server is a VoiceLiveMcpServerDefinition instance added to the session options tools list.
        /// </summary>
        private List<VoiceLiveToolDefinition> DefineMCPServers()
        {
            var mcpTools = new List<VoiceLiveToolDefinition>
            {
                new VoiceLiveMcpServerDefinition("deepwiki", "https://mcp.deepwiki.com/mcp")
                {
                    AllowedTools = { "read_wiki_structure", "ask_question" },
                    RequireApproval = BinaryData.FromString("\"never\""),
                },
                new VoiceLiveMcpServerDefinition("azure_doc", "https://learn.microsoft.com/api/mcp")
                {
                    RequireApproval = BinaryData.FromString("\"always\""),
                },
            };
            return mcpTools;
        }
        // </define_mcp_servers>

        // <configure_session>
        private async Task SetupSessionAsync(CancellationToken cancellationToken)
        {
            _logger.LogInformation("Setting up session with MCP tools...");
            var azureVoice = new AzureStandardVoice(_voice);
            var turnDetection = new ServerVadTurnDetection
            {
                Threshold = 0.5f,
                PrefixPadding = TimeSpan.FromMilliseconds(300),
                SilenceDuration = TimeSpan.FromMilliseconds(500)
            };

            // Create session options and add MCP servers to the tools list
            var sessionOptions = new VoiceLiveSessionOptions
            {
                InputAudioEchoCancellation = new AudioEchoCancellation(),
                Model = _model,
                Instructions = _instructions,
                Voice = azureVoice,
                InputAudioFormat = InputAudioFormat.Pcm16,
                OutputAudioFormat = OutputAudioFormat.Pcm16,
                TurnDetection = turnDetection
            };

            // Enable input audio transcription so we receive
            // SessionUpdateConversationItemInputAudioTranscriptionCompleted events
            // (required for the voice-based approval flow).
            sessionOptions.InputAudioTranscription = new AudioInputTranscriptionOptions(
                _model.Contains("realtime", StringComparison.OrdinalIgnoreCase) ? "whisper-1" : "azure-speech");
            sessionOptions.Modalities.Clear();
            sessionOptions.Modalities.Add(InteractionModality.Text);
            sessionOptions.Modalities.Add(InteractionModality.Audio);

            // Add MCP servers to the tools list
            var mcpServers = DefineMCPServers();
            foreach (var tool in mcpServers)
            {
                sessionOptions.Tools.Add(tool);
            }

            // Track which servers require approval for per-turn loop prevention
            _approvalServers = new HashSet<string> { "azure_doc" };
            await _session!.ConfigureSessionAsync(sessionOptions, cancellationToken).ConfigureAwait(false);
            _logger.LogInformation("Session with MCP tools configured");
        }
        // </configure_session>

        private async Task ProcessEventsAsync(CancellationToken cancellationToken)
        {
            try
            {
                await foreach (SessionUpdate serverEvent in _session!.GetUpdatesAsync(cancellationToken).ConfigureAwait(false))
                {
                    await HandleSessionUpdateAsync(serverEvent, cancellationToken).ConfigureAwait(false);
                }
            }
            catch (OperationCanceledException) { }
        }

        // <handle_mcp_events>
        private async Task HandleSessionUpdateAsync(SessionUpdate serverEvent, CancellationToken cancellationToken)
        {
            switch (serverEvent)
            {
                case SessionUpdateSessionUpdated sessionUpdated:
                    _logger.LogInformation("Session updated");
                    WriteLog($"SessionID: {sessionUpdated.Session?.Id}");
                    WriteLog($"Model: {_model}");
                    WriteLog($"Voice: {_voice}");
                    WriteLog("");
                    if (_audioProcessor != null)
                        await _audioProcessor.StartCaptureAsync().ConfigureAwait(false);
                    break;
                case SessionUpdateInputAudioBufferSpeechStarted:
                    Console.WriteLine("🎤 Listening...");
                    if (_audioProcessor != null)
                        await _audioProcessor.StopPlaybackAsync().ConfigureAwait(false);
                    if (_responseActive && _canCancelResponse)
                    {
                        try { await _session!.CancelResponseAsync(cancellationToken).ConfigureAwait(false); }
                        catch { }
                        try { await _session!.ClearStreamingAudioAsync(cancellationToken).ConfigureAwait(false); }
                        catch { }
                    }
                    // Do NOT reset _approvalCallCount here. The counter should only
                    // reset on task completion (in MCP-call-completed when no
pending/queued // approvals remain) or on explicit denial (in ResolveVoiceApprovalAsync). // Resetting on every speech-start would let the model retry denied calls. // Clear deferred response flags if no MCP calls are in progress. // Prevents stale needsResponseCreate from re-triggering result playback // after the user interrupts. if (_mcpCallInProgress <= 0) { _needsResponseCreate = false; _mcpResultsPending = false; } // Reset approved-servers-this-turn when user starts a new topic if (_pendingApproval == null && _mcpCallInProgress <= 0) _approvedServersThisTurn.Clear(); // If an MCP call is running, ask the user if they want to wait or skip if (_mcpCallInProgress > 0 && _pendingApproval == null) { foreach (var id in _activeMcpItems) _staleMcpItems.Add(id); _logger.LogInformation("User spoke during MCP call ā marking {Count} calls as stale", _activeMcpItems.Count); try { await _session!.SendCommandAsync(BinaryData.FromObjectAsJson(new { type = "conversation.item.create", item = new { type = "message", role = "system", content = new[] { new { type = "input_text", text = "A tool call is still running in the background. The user just spoke. Respond to what the user said. If a tool result arrives later, briefly introduce it as a late result from an earlier request." 
} } } }), cancellationToken).ConfigureAwait(false); } catch (Exception ex) { _logger.LogWarning("Failed to inject MCP status update: {Error}", ex.Message); } } break; case SessionUpdateInputAudioBufferSpeechStopped: Console.WriteLine("š¤ Processing..."); if (_audioProcessor != null) await _audioProcessor.StartPlaybackAsync().ConfigureAwait(false); break; case SessionUpdateResponseCreated: _responseActive = true; _canCancelResponse = true; break; case SessionUpdateResponseAudioDelta audioDelta: if (audioDelta.Delta != null && _audioProcessor != null) await _audioProcessor.QueueAudioAsync(audioDelta.Delta.ToArray()).ConfigureAwait(false); break; case SessionUpdateResponseAudioDone: Console.WriteLine("š¤ Ready for next input..."); break; case SessionUpdateResponseDone: _responseActive = false; _canCancelResponse = false; WriteLog("--- Response complete ---"); // If an approval prompt needs to be injected, do it now if (_approvalPromptNeeded && _pendingApproval != null) { _approvalPromptNeeded = false; await SendApprovalVoicePromptAsync(cancellationToken).ConfigureAwait(false); } // If MCP results are pending and all calls are now done, create response else if (_mcpResultsPending && _mcpCallInProgress <= 0 && _pendingApproval == null) { _mcpResultsPending = false; try { await _session!.SendCommandAsync(BinaryData.FromObjectAsJson(new { type = "response.create" }), cancellationToken).ConfigureAwait(false); } catch { } } else if (_needsResponseCreate) { _needsResponseCreate = false; try { await _session!.SendCommandAsync(BinaryData.FromObjectAsJson(new { type = "response.create" }), cancellationToken).ConfigureAwait(false); } catch { } } break; case SessionUpdateError errorEvent: var msg = errorEvent.Error?.Message ?? 
""; if (!msg.Contains("no active response", StringComparison.OrdinalIgnoreCase)) { // Suppress non-fatal interim/collision errors if (msg.Contains("interim response", StringComparison.OrdinalIgnoreCase)) { _logger.LogWarning("Interim response not supported with this model pipeline (non-fatal)"); } else if (msg.Contains("active response", StringComparison.OrdinalIgnoreCase)) { _logger.LogDebug("Response collision (expected during MCP flow): {Message}", msg); } else { Console.WriteLine($"ā Error: {msg}"); WriteLog($"ERROR: {msg}"); } } _responseActive = false; _canCancelResponse = false; break; // Transcription event ā used for voice-based approval resolution case SessionUpdateConversationItemInputAudioTranscriptionCompleted transcription: var transcript = transcription.Transcript ?? ""; _logger.LogInformation("User said: {Transcript}", transcript); Console.WriteLine($"š¤ You said:\t{transcript}"); WriteLog($"User Input:\t{transcript}"); if (_pendingApproval != null) { await ResolveVoiceApprovalAsync(transcript, cancellationToken).ConfigureAwait(false); } break; // MCP-specific events case SessionUpdateMcpListToolsCompleted mcpListDone: Console.WriteLine("š§ MCP tools discovered successfully"); WriteLog("MCP tools discovered successfully"); _logger.LogInformation("MCP tools discovered for server"); break; case SessionUpdateMcpListToolsFailed: Console.WriteLine("ā MCP tool discovery failed"); WriteLog("ERROR: MCP tool discovery failed"); break; case SessionUpdateResponseMcpCallInProgress mcpInProgress: Console.WriteLine("ā³ MCP tool call in progress..."); WriteLog($"MCP call in progress: {mcpInProgress.ItemId}"); _mcpCallInProgress++; _activeMcpItems.Add(mcpInProgress.ItemId ?? ""); StartMcpStallTimer(cancellationToken); break; case SessionUpdateResponseMcpCallCompleted mcpCompleted: { var itemId = mcpCompleted.ItemId ?? 
""; _mcpCallInProgress = Math.Max(0, _mcpCallInProgress - 1); _activeMcpItems.Remove(itemId); CancelMcpStallTimer(); if (_handledMcpCompletions.Contains(itemId)) { _logger.LogDebug("Ignoring duplicate MCP completion for {ItemId}", itemId); } else { _handledMcpCompletions.Add(itemId); bool isStale = _staleMcpItems.Remove(itemId); _logger.LogInformation("MCP call completed for {ItemId} (stale={IsStale})", itemId, isStale); Console.WriteLine("ā MCP tool call completed successfully"); WriteLog($"MCP call completed: {itemId} (stale={isStale})"); // Clean up item mapping _mcpItemToServer.Remove(itemId); // Reset approval counter if no more approvals are pending if (_pendingApproval == null && _approvalQueue.Count == 0) _approvalCallCount.Clear(); // If the user moved on during this call, tell the model it's a late result if (isStale) { try { await _session!.SendCommandAsync(BinaryData.FromObjectAsJson(new { type = "conversation.item.create", item = new { type = "message", role = "system", content = new[] { new { type = "input_text", text = "This tool result is from an earlier request. The user has since moved on. Briefly introduce it as a late result, e.g. 'By the way, those results from earlier just came in...' then share the key findings concisely." } } } }), cancellationToken).ConfigureAwait(false); } catch (Exception ex) { _logger.LogWarning("Failed to inject late-result context: {Error}", ex.Message); } } // Batch response: only call response.create when ALL MCP calls for this // turn have completed. This prevents partial results and repeated tool calls. 
if (_pendingApproval == null && _approvalQueue.Count == 0 && _mcpCallInProgress <= 0) { try { await _session!.SendCommandAsync(BinaryData.FromObjectAsJson(new { type = "response.create" }), cancellationToken).ConfigureAwait(false); } catch (Exception ex) { if (ex.Message.Contains("active response", StringComparison.OrdinalIgnoreCase)) _needsResponseCreate = true; else _logger.LogWarning("Failed to create response after MCP call: {Error}", ex.Message); } } else { _mcpResultsPending = true; _logger.LogInformation("MCP calls still in progress ({Count}) ā deferring response", _mcpCallInProgress); } } break; } case SessionUpdateResponseMcpCallFailed mcpFailed: { var failedItemId = mcpFailed.ItemId ?? ""; Console.WriteLine("ā MCP tool call failed"); WriteLog($"ERROR: MCP call failed: {failedItemId}"); _mcpCallInProgress = Math.Max(0, _mcpCallInProgress - 1); _activeMcpItems.Remove(failedItemId); _staleMcpItems.Remove(failedItemId); CancelMcpStallTimer(); try { await _session!.SendCommandAsync(BinaryData.FromObjectAsJson(new { type = "response.create" }), cancellationToken).ConfigureAwait(false); } catch { } break; } case SessionUpdateConversationItemCreated itemCreated when itemCreated.Item is SessionResponseMcpApprovalRequestItem mcpApproval: await HandleMCPApprovalAsync(mcpApproval, cancellationToken).ConfigureAwait(false); break; case SessionUpdateConversationItemCreated itemCreated: _logger.LogDebug("Conversation item created: {ItemType}", itemCreated.Item?.GetType().Name); // Track mcp_call items for server mapping and announce non-approval tool calls if (itemCreated.Item is SessionResponseMcpCallItem mcpCallItem) { var serverLabel = mcpCallItem.ServerLabel ?? ""; var functionName = mcpCallItem.Name ?? ""; var mcpItemId = mcpCallItem.Id ?? 
""; _logger.LogInformation("MCP Call triggered: server_label={Server}, function_name={Function}", serverLabel, functionName); Console.WriteLine($"š§ MCP tool call: {serverLabel}/{functionName}"); if (!string.IsNullOrEmpty(mcpItemId)) _mcpItemToServer[mcpItemId] = $"{serverLabel}/{functionName}"; // Announce the tool call so the user knows something is happening if (_pendingApproval == null && !_approvalServers.Contains(serverLabel)) { try { await _session!.SendCommandAsync(BinaryData.FromObjectAsJson(new { type = "conversation.item.create", item = new { type = "message", role = "system", content = new[] { new { type = "input_text", text = "Briefly tell the user you're looking something up. One short sentence only." } } } }), cancellationToken).ConfigureAwait(false); await _session!.SendCommandAsync(BinaryData.FromObjectAsJson(new { type = "response.create" }), cancellationToken).ConfigureAwait(false); } catch (Exception ex) { if (!ex.Message.Contains("active response", StringComparison.OrdinalIgnoreCase)) _logger.LogWarning("Failed to create tool announcement: {Error}", ex.Message); } } } break; default: _logger.LogDebug("Unhandled event: {EventType}", serverEvent.GetType().Name); break; } } // </handle_mcp_events> // <handle_approval> /// <summary> /// Handle MCP approval request by asking the user via voice. /// </summary> private async Task HandleMCPApprovalAsync(SessionResponseMcpApprovalRequestItem approvalItem, CancellationToken cancellationToken) { var approvalId = approvalItem.Id; var serverLabel = approvalItem.ServerLabel ?? ""; var toolName = approvalItem.Name ?? 
""; if (string.IsNullOrEmpty(approvalId)) { _logger.LogError("MCP approval item missing ID"); return; } // If another approval is already pending, queue this one if (_pendingApproval != null) { _logger.LogInformation("Queuing approval for {Tool} ā another is already pending", toolName); _approvalQueue.Enqueue(new ApprovalInfo(approvalId, serverLabel, toolName)); return; } const int MaxApprovalCallsPerTask = 3; _approvalCallCount.TryGetValue(serverLabel, out var currentCount); if (currentCount >= MaxApprovalCallsPerTask) { _logger.LogInformation("Auto-denying {Tool} ā reached {Count} calls this task", toolName, currentCount); Console.WriteLine($" Auto-denied: {serverLabel}/{toolName} (max {MaxApprovalCallsPerTask} calls reached)"); try { await _session!.SendCommandAsync(BinaryData.FromObjectAsJson(new { type = "conversation.item.create", item = new { type = "mcp_approval_response", approval_request_id = approvalId, approve = false } }), cancellationToken).ConfigureAwait(false); } catch (Exception ex) { _logger.LogWarning("Failed to send auto-deny: {Error}", ex.Message); } return; } // Auto-approve if user already approved this server earlier in the same turn if (_approvedServersThisTurn.Contains(serverLabel)) { _logger.LogInformation("Auto-approving {Tool} ā server already approved this turn", toolName); Console.WriteLine($" Auto-approved: {serverLabel}/{toolName} (already approved this turn)"); try { await _session!.SendCommandAsync(BinaryData.FromObjectAsJson(new { type = "conversation.item.create", item = new { type = "mcp_approval_response", approval_request_id = approvalId, approve = true } }), cancellationToken).ConfigureAwait(false); } catch (Exception ex) { _logger.LogWarning("Failed to send auto-approve: {Error}", ex.Message); } return; } _logger.LogInformation("MCP approval request: server={Server} tool={Tool}", serverLabel, toolName); Console.WriteLine(); Console.WriteLine($"š MCP Approval Request (voice-based):"); Console.WriteLine($" Server: 
{serverLabel} Tool: {toolName}"); WriteLog($"Approval request: server={serverLabel} tool={toolName}"); _pendingApproval = new ApprovalInfo(approvalId, serverLabel, toolName); if (!_responseActive) { await SendApprovalVoicePromptAsync(cancellationToken).ConfigureAwait(false); } else { _approvalPromptNeeded = true; } } /// <summary> /// Inject a system message asking the model to verbally request permission. /// </summary> private async Task SendApprovalVoicePromptAsync(CancellationToken cancellationToken) { var pending = _pendingApproval; if (pending == null) return; var server = pending.ServerLabel; _approvalCallCount.TryGetValue(server, out var callCount); _approvalCallCount[server] = callCount + 1; string prompt; if (callCount == 0) { prompt = "You MUST ask the user for explicit permission before proceeding. " + $"Say exactly: \"I'd like to search the {server} service for information. " + "Do you approve? Please say yes or no.\""; } else { prompt = "You MUST ask the user for permission again. " + "Say exactly: \"I need to do one more search to get complete information. " + "Should I continue? Please say yes or no.\""; } try { await _session!.SendCommandAsync(BinaryData.FromObjectAsJson(new { type = "conversation.item.create", item = new { type = "message", role = "system", content = new[] { new { type = "input_text", text = prompt } } } }), cancellationToken).ConfigureAwait(false); await _session!.SendCommandAsync(BinaryData.FromObjectAsJson(new { type = "response.create" }), cancellationToken).ConfigureAwait(false); } catch (Exception ex) { _logger.LogWarning("Failed to send approval voice prompt: {Error}", ex.Message); } } // <voice_approval_transcription> /// <summary> /// Interpret the user's spoken response as approval or denial. 
/// </summary> private async Task ResolveVoiceApprovalAsync(string transcript, CancellationToken cancellationToken) { var pending = _pendingApproval; if (pending == null) return; var text = transcript.Trim().ToLowerInvariant(); bool approved = Regex.IsMatch(text, @"\byes\b"); bool denied = Regex.IsMatch(text, @"\b(no|stop|cancel)\b"); if (!approved && !denied) { // Ambiguous ā ask again via the deferred prompt mechanism _logger.LogInformation("Ambiguous approval response: {Transcript}", transcript); _approvalPromptNeeded = true; return; } if (approved && denied) { // Conflicting signals ā treat as denial for safety approved = false; } // Clear the pending state before sending the response _pendingApproval = null; if (approved) _approvedServersThisTurn.Add(pending.ServerLabel); else { _approvalCallCount.Clear(); _approvedServersThisTurn.Remove(pending.ServerLabel); } try { await _session!.SendCommandAsync(BinaryData.FromObjectAsJson(new { type = "conversation.item.create", item = new { type = "mcp_approval_response", approval_request_id = pending.ApprovalId, approve = approved, } }), cancellationToken).ConfigureAwait(false); } catch (Exception ex) { _logger.LogError("Failed to send approval response: {Error}", ex.Message); return; } _logger.LogInformation("Voice approval resolved: {Approved} for {Tool}", approved, pending.FunctionName); Console.WriteLine($" Voice approval: {(approved ? "Approved ā " : "Denied ā")}"); WriteLog($"Approval resolved: {(approved ? "APPROVED" : "DENIED")} for {pending.ServerLabel}/{pending.FunctionName}"); // Process next queued approval, if any await ProcessNextApprovalAsync(cancellationToken).ConfigureAwait(false); } /// <summary> /// Pop the next queued approval and ask via voice. 
/// </summary> private async Task ProcessNextApprovalAsync(CancellationToken cancellationToken) { if (_approvalQueue.Count == 0) return; var next = _approvalQueue.Dequeue(); _pendingApproval = next; if (!_responseActive) { await SendApprovalVoicePromptAsync(cancellationToken).ConfigureAwait(false); } else { _approvalPromptNeeded = true; } } // </voice_approval_transcription> // </handle_approval> // <mcp_stall_detection> private void StartMcpStallTimer(CancellationToken ct) { CancelMcpStallTimer(); _mcpStallCts = CancellationTokenSource.CreateLinkedTokenSource(ct); var token = _mcpStallCts.Token; _ = Task.Run(async () => { int stallCount = 0; while (_mcpCallInProgress > 0 && stallCount < 3) { await Task.Delay(10000, token).ConfigureAwait(false); if (_mcpCallInProgress <= 0 || _session == null) break; stallCount++; // MCP calls cannot be cancelled ā only honest status updates are possible. string msg = "The tool call is still running. Briefly reassure the user that you're still waiting for results. 
One short sentence only."; try { await _session.SendCommandAsync(BinaryData.FromObjectAsJson(new { type = "conversation.item.create", item = new { type = "message", role = "system", content = new[] { new { type = "input_text", text = msg } } } }), token).ConfigureAwait(false); await _session.SendCommandAsync(BinaryData.FromObjectAsJson(new { type = "response.create" }), token).ConfigureAwait(false); } catch (Exception ex) { if (ex.Message.Contains("active response", StringComparison.OrdinalIgnoreCase)) _needsResponseCreate = true; } } }, token); } private void CancelMcpStallTimer() { if (_mcpStallCts != null) { _mcpStallCts.Cancel(); _mcpStallCts.Dispose(); _mcpStallCts = null; } } // </mcp_stall_detection> private static void WriteLog(string message) { try { var logDir = Path.Combine(Directory.GetCurrentDirectory(), "logs"); Directory.CreateDirectory(logDir); var logPath = Path.Combine(logDir, LogFilename); File.AppendAllText(logPath, $"[{DateTime.Now:HH:mm:ss}] {message}{Environment.NewLine}"); } catch (IOException) { } } public void Dispose() { if (_disposed) return; CancelMcpStallTimer(); _audioProcessor?.Dispose(); _session?.Dispose(); _disposed = true; } } /// <summary> /// Audio processor for real-time capture and playback. /// Same pattern as ModelQuickstart - handles PCM16 24kHz mono audio. /// </summary> public class AudioProcessor : IDisposable { private readonly VoiceLiveSession _session; private readonly ILogger<AudioProcessor> _logger; private const int SampleRate = 24000; private const int Channels = 1; private const int BitsPerSample = 16; private WaveInEvent? _waveIn; private WaveOutEvent? _waveOut; private BufferedWaveProvider? 
_playbackBuffer; private bool _isCapturing; private bool _isPlaying; private readonly Channel<byte[]> _audioSendChannel; private readonly ChannelWriter<byte[]> _audioSendWriter; private readonly ChannelReader<byte[]> _audioSendReader; private readonly Channel<byte[]> _audioPlaybackChannel; private readonly ChannelWriter<byte[]> _audioPlaybackWriter; private readonly ChannelReader<byte[]> _audioPlaybackReader; private Task? _audioSendTask; private Task? _audioPlaybackTask; private readonly CancellationTokenSource _cancellationTokenSource; private CancellationTokenSource _playbackCancellationTokenSource; public AudioProcessor(VoiceLiveSession session, ILogger<AudioProcessor> logger) { _session = session; _logger = logger; _audioSendChannel = Channel.CreateUnbounded<byte[]>(); _audioSendWriter = _audioSendChannel.Writer; _audioSendReader = _audioSendChannel.Reader; _audioPlaybackChannel = Channel.CreateUnbounded<byte[]>(); _audioPlaybackWriter = _audioPlaybackChannel.Writer; _audioPlaybackReader = _audioPlaybackChannel.Reader; _cancellationTokenSource = new CancellationTokenSource(); _playbackCancellationTokenSource = new CancellationTokenSource(); } public Task StartCaptureAsync() { if (_isCapturing) return Task.CompletedTask; _isCapturing = true; _waveIn = new WaveInEvent { WaveFormat = new WaveFormat(SampleRate, BitsPerSample, Channels), BufferMilliseconds = 50 }; _waveIn.DataAvailable += (sender, e) => { if (_isCapturing && e.BytesRecorded > 0) { var audioData = new byte[e.BytesRecorded]; Array.Copy(e.Buffer, 0, audioData, 0, e.BytesRecorded); _audioSendWriter.TryWrite(audioData); } }; _waveIn.StartRecording(); _audioSendTask = ProcessAudioSendAsync(_cancellationTokenSource.Token); _logger.LogInformation("Started audio capture"); return Task.CompletedTask; } public Task StartPlaybackAsync() { if (_isPlaying) return Task.CompletedTask; _isPlaying = true; _waveOut = new WaveOutEvent { DesiredLatency = 100 }; _playbackBuffer = new BufferedWaveProvider(new 
WaveFormat(SampleRate, BitsPerSample, Channels)) { BufferDuration = TimeSpan.FromSeconds(10), DiscardOnBufferOverflow = true }; _waveOut.Init(_playbackBuffer); _waveOut.Play(); _playbackCancellationTokenSource = new CancellationTokenSource(); _audioPlaybackTask = ProcessAudioPlaybackAsync(); _logger.LogInformation("Audio playback ready"); return Task.CompletedTask; } public async Task StopPlaybackAsync() { if (!_isPlaying) return; _isPlaying = false; while (_audioPlaybackReader.TryRead(out _)) { } _playbackBuffer?.ClearBuffer(); if (_waveOut != null) { _waveOut.Stop(); _waveOut.Dispose(); _waveOut = null; } _playbackBuffer = null; _playbackCancellationTokenSource.Cancel(); if (_audioPlaybackTask != null) { await _audioPlaybackTask.ConfigureAwait(false); _audioPlaybackTask = null; } } public async Task QueueAudioAsync(byte[] audioData) { if (_isPlaying && audioData.Length > 0) await _audioPlaybackWriter.WriteAsync(audioData).ConfigureAwait(false); } public async Task CleanupAsync() { _isCapturing = false; if (_waveIn != null) { _waveIn.StopRecording(); _waveIn.Dispose(); _waveIn = null; } _audioSendWriter.TryComplete(); if (_audioSendTask != null) await _audioSendTask.ConfigureAwait(false); await StopPlaybackAsync().ConfigureAwait(false); _cancellationTokenSource.Cancel(); _logger.LogInformation("Audio processor cleaned up"); } private async Task ProcessAudioSendAsync(CancellationToken ct) { try { await foreach (var audioData in _audioSendReader.ReadAllAsync(ct).ConfigureAwait(false)) { try { await _session.SendInputAudioAsync(audioData, ct).ConfigureAwait(false); } catch { } } } catch (OperationCanceledException) { } } private async Task ProcessAudioPlaybackAsync() { try { var ct = CancellationTokenSource.CreateLinkedTokenSource( _playbackCancellationTokenSource.Token, _cancellationTokenSource.Token).Token; await foreach (var audioData in _audioPlaybackReader.ReadAllAsync(ct).ConfigureAwait(false)) { if (_playbackBuffer != null && _isPlaying) 
_playbackBuffer.AddSamples(audioData, 0, audioData.Length); } } catch (OperationCanceledException) { } } public void Dispose() { CleanupAsync().Wait(); _cancellationTokenSource.Dispose(); } } }
Sign in to Azure with the following command:
az login
Build and run the application:
dotnet run
Speak into your microphone. Try asking questions like "What tools do you have?" or "Search the Azure documentation for Voice Live API."
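- For the deepwiki server (RequireApproval = "never"), tool calls execute automatically.
- For the azure_doc server (RequireApproval = "always"), you're asked to approve each tool call. In this sample, the assistant asks for permission aloud and you answer yes or no by voice; the request is also printed to the console.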
Press Ctrl+C to stop the session.
MCP server configuration reference
| Parameter | Required | Description |
|---|---|---|
| ServerLabel | Yes | Display name for the MCP server. |
| ServerUrl | Yes | URL of the remote MCP endpoint. |
| AllowedTools | No | List of tool names the model can call. If omitted, all tools are allowed. |
| RequireApproval | No | "never", "always" (default), or a per-tool dictionary. |
| Headers | No | Extra HTTP headers to include in MCP requests. |
| Authorization | No | Authorization token for MCP requests. |
For the complete REST API type definition, see MCPTool in the Voice Live API reference.
Learn how to connect remote MCP servers to a Voice Live session using the VoiceLive SDK for Java. This article builds on the Quickstart: Create a Voice Live real-time voice agent with MCP server integration.
Reference documentation | Package (Maven) | Additional samples on GitHub
Follow the how-to below or get the full sample code:
Prerequisites
- An Azure subscription. Create one for free.
- Java Development Kit (JDK) version 11 or later.
- Apache Maven installed.
- A Microsoft Foundry resource created in one of the supported regions. For more information about region availability, see the Voice Live overview documentation.
- azure-ai-voicelive package version 1.0.0 or later (MCP support requires API version 2026-04-10).
- Assign the Cognitive Services User role to your user account. You can assign roles in the Azure portal under Access control (IAM) > Add role assignment.
Tip
To use Voice Live with MCP, you don't need to deploy an audio model with your Foundry resource. Voice Live is fully managed, and the model is automatically deployed for you. For more information about model availability, see the Voice Live overview documentation.
Prepare the environment
Complete the Voice Live quickstart to set up your environment, configure authentication, and test your first Voice Live conversation.
MCP integration concepts
MCP server definition
Use the MCPServer type to declare each remote MCP endpoint. At minimum, provide serverLabel (a display name) and serverUrl (the MCP endpoint URL). Optionally restrict available tools with allowedTools and configure the approval mode.
Approval modes
Control whether MCP tool calls require user approval before execution:
- requireApproval("never"): The tool executes automatically when the model invokes it.
- requireApproval("always") (default): The client receives an approval request and must respond before the tool runs.
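As a rough sketch of the client side of the "always" mode, the following standalone Java class turns a spoken reply into a decision and builds the JSON for an approval response. The word-matching rules and the message shape (`conversation.item.create` wrapping an `mcp_approval_response` item with `approval_request_id` and `approve`) mirror the C# sample earlier in this article. The class name `ApprovalSketch`, its methods, and the ID `approval_123` are hypothetical and not part of the azure-ai-voicelive SDK.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not an SDK type. */
final class ApprovalSketch {
    enum Decision { APPROVE, DENY, ASK_AGAIN }

    // Whole-word matches, so "know" does not count as "no".
    private static final Pattern YES = Pattern.compile("\\byes\\b");
    private static final Pattern NO = Pattern.compile("\\b(no|stop|cancel)\\b");

    /** Interpret a transcript of the user's spoken reply. */
    static Decision interpret(String transcript) {
        String text = transcript == null ? "" : transcript.trim().toLowerCase(Locale.ROOT);
        boolean approved = YES.matcher(text).find();
        boolean denied = NO.matcher(text).find();
        if (!approved && !denied) {
            return Decision.ASK_AGAIN;   // ambiguous: prompt the user again
        }
        if (denied) {
            return Decision.DENY;        // conflicting signals count as denial
        }
        return Decision.APPROVE;
    }

    /** Build the raw JSON client event that answers an approval request. */
    static String approvalResponseJson(String approvalRequestId, boolean approve) {
        return "{\"type\":\"conversation.item.create\",\"item\":{"
            + "\"type\":\"mcp_approval_response\","
            + "\"approval_request_id\":\"" + escape(approvalRequestId) + "\","
            + "\"approve\":" + approve + "}}";
    }

    // Minimal escaping for backslashes and quotes inside the ID.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        Decision decision = interpret("Yes, go ahead");
        if (decision != Decision.ASK_AGAIN) {
            // "approval_123" is a placeholder for the ID from the approval request item.
            System.out.println(approvalResponseJson("approval_123", decision == Decision.APPROVE));
        }
    }
}
```

In a real session, the resulting string would be sent over the open connection, and the ID would come from the approval request item that the service delivered. Treating a mixed "yes ... no" reply as a denial is the conservative choice the C# sample makes.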
API version requirement
MCP support requires API version 2026-04-10 or later.
Define MCP servers
Define the MCP servers that Voice Live can use during the session. Each server is an MCPServer instance added to the tools list in the session configuration.
The following code defines two MCP servers: one with automatic tool execution and one that requires user approval before running.
/**
 * Define MCP servers that Voice Live can use during the session.
 * Each server is an MCPServer instance added to the session options tools list.
 */
private static List<VoiceLiveToolDefinition> defineMCPServers() {
    List<VoiceLiveToolDefinition> mcpTools = new ArrayList<>();

    // deepwiki: restricted to two tools, executes without approval
    mcpTools.add(new MCPServer("deepwiki", "https://mcp.deepwiki.com/mcp")
        .setAllowedTools(Arrays.asList("read_wiki_structure", "ask_question"))
        .setRequireApproval(BinaryData.fromString("never")));

    // azure_doc: all tools allowed, every call needs approval
    mcpTools.add(new MCPServer("azure_doc", "https://learn.microsoft.com/api/mcp")
        .setRequireApproval(BinaryData.fromString("always")));

    return mcpTools;
}
In this sample:
- The deepwiki server allows only the read_wiki_structure and ask_question tools, with requireApproval set to "never" for automatic execution.
- The azure_doc server allows all tools on the endpoint, with requireApproval set to "always" so users can review each call before execution.
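The allow-list rule described above can be illustrated with a few lines of plain Java. This is only a model of the rule: Voice Live applies the filtering on the service side, and the class `AllowedToolsSketch`, its method, and the tool name `some_other_tool` are hypothetical. The sketch treats a `null` list as "omitted"; how the service handles an empty list is not covered by this article, so the empty case here is an assumption.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of the allowedTools rule; not an SDK type. */
final class AllowedToolsSketch {

    /**
     * A tool is callable when no allow-list was configured (null),
     * or when the allow-list names it explicitly.
     */
    static boolean isCallable(List<String> allowedTools, String toolName) {
        return allowedTools == null || allowedTools.contains(toolName);
    }

    public static void main(String[] args) {
        List<String> deepwiki = Arrays.asList("read_wiki_structure", "ask_question");
        List<String> azureDoc = null; // no allow-list configured

        System.out.println(isCallable(deepwiki, "ask_question"));
        System.out.println(isCallable(deepwiki, "some_other_tool"));
        System.out.println(isCallable(azureDoc, "some_other_tool"));
    }
}
```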
Configure the session with MCP tools
Pass the MCP server definitions to the session options tools list alongside your voice, modality, and turn-detection settings.
/**
 * Create session configuration with MCP servers in the tools list.
 */
private static VoiceLiveSessionOptions createSessionOptions(Config config) {
    ServerVadTurnDetection turnDetection = new ServerVadTurnDetection()
        .setThreshold(0.5)
        .setPrefixPaddingMs(300)
        .setSilenceDurationMs(500)
        .setInterruptResponse(true)
        .setAutoTruncate(true)
        .setCreateResponse(true);

    // Enable input audio transcription so we receive user speech as text
    AudioInputTranscriptionOptionsModel transcriptionModel = config.model.toLowerCase().contains("realtime")
        ? AudioInputTranscriptionOptionsModel.WHISPER_1
        : AudioInputTranscriptionOptionsModel.fromString("azure-speech");
    AudioInputTranscriptionOptions transcriptionOptions =
        new AudioInputTranscriptionOptions(transcriptionModel);

    VoiceLiveSessionOptions options = new VoiceLiveSessionOptions()
        .setInstructions(config.instructions)
        .setVoice(BinaryData.fromObject(new AzureStandardVoice(config.voice)))
        .setModalities(Arrays.asList(InteractionModality.TEXT, InteractionModality.AUDIO))
        .setInputAudioFormat(InputAudioFormat.PCM16)
        .setOutputAudioFormat(OutputAudioFormat.PCM16)
        .setInputAudioSamplingRate(SAMPLE_RATE)
        .setInputAudioNoiseReduction(new AudioNoiseReduction(AudioNoiseReductionType.NEAR_FIELD))
        .setInputAudioEchoCancellation(new AudioEchoCancellation())
        .setInputAudioTranscription(transcriptionOptions)
        .setTurnDetection(turnDetection);

    // Add MCP servers to the tools list
    List<VoiceLiveToolDefinition> mcpServers = defineMCPServers();
    options.setTools(mcpServers);

    return options;
}
In this sample:
- VoiceLiveSessionOptions bundles MCP tools with audio format, voice, and turn detection settings.
- The session configuration is sent to Voice Live after connecting.
- Voice Live automatically discovers available tools from each MCP server after the session starts.
Handle MCP events
Process MCP-specific events in the event loop. The key events include MCP tool call creation, completion, failure, and approval requests.
/**
* Handle incoming server events, including MCP-specific events
* and voice-based approval flow.
*/
private static void handleServerEvent(SessionUpdate event, AudioProcessor audioProcessor,
SessionState state, VoiceLiveSessionAsyncClient session) {
ServerEventType eventType = event.getType();
try {
if (eventType == ServerEventType.SESSION_UPDATED) {
System.out.println("✓ Session updated - starting microphone");
writeLog("Session updated");
audioProcessor.startCapture();
} else if (eventType == ServerEventType.INPUT_AUDIO_BUFFER_SPEECH_STARTED) {
System.out.println("🎤 Listening...");
audioProcessor.skipPendingAudio();
// Cancel any active response to prevent duplicate result playback
// when the user interrupts during MCP result speech (matches C#/Python/JS)
if (state.responseActive) {
session.send(BinaryData.fromString("{\"type\":\"response.cancel\"}"))
.subscribeOn(Schedulers.boundedElastic())
.subscribe(v -> {}, err -> {});
}
// Clear deferred response flags if no MCP calls are in progress.
// Without this, a stale needsResponseCreate from a collision during
// the approval flow causes the model to re-speak results after the
// user interrupts.
if (state.mcpCallInProgress.get() <= 0) {
state.needsResponseCreate = false;
state.mcpResultsPending = false;
}
// Reset approved-servers-this-turn when user starts a new topic
if (state.pendingApproval == null && state.mcpCallInProgress.get() <= 0) {
state.approvedServersThisTurn.clear();
}
// If an MCP call is running and no approval is pending, mark as stale
if (state.mcpCallInProgress.get() > 0 && state.pendingApproval == null) {
state.staleMcpItems.addAll(state.activeMcpItems);
System.out.println("[barge-in] Marking " + state.activeMcpItems.size() + " MCP calls as stale");
sendSystemMessage(session,
"A tool call is still running in the background. The user just spoke. "
+ "Respond to what the user said. If a tool result arrives later, "
+ "briefly introduce it as a late result from an earlier request.")
.subscribeOn(Schedulers.boundedElastic())
.subscribe(v -> {}, err -> {});
}
} else if (eventType == ServerEventType.INPUT_AUDIO_BUFFER_SPEECH_STOPPED) {
System.out.println("🤔 Processing...");
} else if (eventType == ServerEventType.RESPONSE_CREATED) {
state.responseActive = true;
} else if (eventType == ServerEventType.RESPONSE_AUDIO_DELTA) {
if (event instanceof SessionUpdateResponseAudioDelta) {
SessionUpdateResponseAudioDelta audioEvent = (SessionUpdateResponseAudioDelta) event;
byte[] audioData = audioEvent.getDelta();
if (audioData != null && audioData.length > 0) {
audioProcessor.queueAudio(audioData);
}
}
} else if (eventType == ServerEventType.RESPONSE_AUDIO_DONE) {
System.out.println("🎤 Ready for next input...");
} else if (eventType == ServerEventType.RESPONSE_DONE) {
state.responseActive = false;
System.out.println("✅ Response complete");
writeLog("--- Response complete ---");
// If an approval prompt needs to be injected, do it now
if (state.approvalPromptNeeded && state.pendingApproval != null) {
state.approvalPromptNeeded = false;
sendApprovalVoicePrompt(state, session);
// If MCP results are pending and all calls are now done, create response
} else if (state.mcpResultsPending && state.mcpCallInProgress.get() <= 0 && state.pendingApproval == null) {
state.mcpResultsPending = false;
try {
session.send(BinaryData.fromString("{\"type\":\"response.create\"}"))
.subscribeOn(Schedulers.boundedElastic())
.subscribe(v -> {}, err -> {});
} catch (Exception e) {
// best-effort
}
} else if (state.needsResponseCreate) {
// Deferred response.create: retry now that no response is active
state.needsResponseCreate = false;
try {
session.send(BinaryData.fromString("{\"type\":\"response.create\"}"))
.subscribeOn(Schedulers.boundedElastic())
.subscribe(v -> {}, err -> {});
} catch (Exception e) {
// best-effort retry
}
}
} else if (eventType == ServerEventType.CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_COMPLETED) {
String eventJson = BinaryData.fromObject(event).toString();
String transcript = extractJsonField(eventJson, "transcript");
System.out.println("👤 You said:\t" + transcript);
writeLog("User Input:\t" + transcript);
// Interpret as an approval answer if we have a pending approval
if (state.pendingApproval != null) {
resolveVoiceApproval(transcript, state, session);
}
} else if (eventType == ServerEventType.ERROR) {
// Reset response state; errors can terminate a response without RESPONSE_DONE
state.responseActive = false;
if (event instanceof SessionUpdateError) {
String msg = ((SessionUpdateError) event).getError().getMessage();
if (msg.contains("no active response")) {
// suppress
} else if (msg.toLowerCase().contains("interim response")) {
// non-fatal
} else if (msg.toLowerCase().contains("active response")) {
// expected during MCP flow
} else {
System.out.println("❌ Error: " + msg);
writeLog("ERROR: " + msg);
}
}
// MCP-specific events
} else if (eventType == ServerEventType.MCP_LIST_TOOLS_COMPLETED) {
System.out.println("🔧 MCP tools discovered successfully");
writeLog("MCP tools discovered successfully");
} else if (eventType == ServerEventType.MCP_LIST_TOOLS_FAILED) {
System.out.println("❌ MCP tool discovery failed");
writeLog("ERROR: MCP tool discovery failed");
} else if (eventType == ServerEventType.RESPONSE_MCP_CALL_IN_PROGRESS) {
System.out.println("⏳ MCP tool call in progress...");
writeLog("MCP call in progress");
state.mcpCallInProgress.incrementAndGet();
String inProgressJson = BinaryData.fromObject(event).toString();
String inProgressItemId = extractJsonField(inProgressJson, "item_id");
if (inProgressItemId != null) state.activeMcpItems.add(inProgressItemId);
startMcpStallTimer(state, session);
} else if (eventType == ServerEventType.RESPONSE_MCP_CALL_COMPLETED) {
String eventJson = BinaryData.fromObject(event).toString();
String itemId = extractJsonField(eventJson, "item_id");
state.mcpCallInProgress.updateAndGet(v -> Math.max(0, v - 1));
if (itemId != null) state.activeMcpItems.remove(itemId);
cancelMcpStallTimer(state);
if (state.handledMcpCompletions.contains(itemId)) {
// duplicate: ignore
} else {
state.handledMcpCompletions.add(itemId);
boolean isStale = itemId != null && state.staleMcpItems.remove(itemId);
System.out.println("✅ MCP tool call completed (stale=" + isStale + ")");
writeLog("MCP call completed: " + itemId + " (stale=" + isStale + ")");
state.mcpItemToServer.remove(itemId);
// Reset approval counter if no more approvals pending
if (state.pendingApproval == null && state.approvalQueue.isEmpty()) {
state.approvalCallCount.clear();
}
// If the user moved on during this call, tell the model it's a late result.
// Chain any late-result context message with the response.create below
// to ensure the system message arrives first.
Mono<Void> preResponseMono = Mono.empty();
if (isStale) {
preResponseMono = sendSystemMessage(session,
"This tool result is from an earlier request. The user has "
+ "since moved on. Briefly introduce it as a late result, e.g. "
+ "'By the way, those results from earlier just came in...' "
+ "then share the key findings concisely.");
}
// Batch response: only call response.create when ALL MCP calls for this
// turn have completed. This prevents partial results and repeated tool calls.
if (state.pendingApproval == null && state.approvalQueue.isEmpty()
&& state.mcpCallInProgress.get() <= 0) {
preResponseMono
.then(session.send(BinaryData.fromString("{\"type\":\"response.create\"}")))
.subscribeOn(Schedulers.boundedElastic())
.subscribe(v -> {}, err -> {
// Guard against a null message before inspecting it
if (err.getMessage() != null
&& err.getMessage().toLowerCase().contains("active response")) {
state.needsResponseCreate = true;
}
});
} else {
preResponseMono
.subscribeOn(Schedulers.boundedElastic())
.subscribe(v -> {}, err -> {});
state.mcpResultsPending = true;
System.out.println("[mcp] MCP calls still in progress (" + state.mcpCallInProgress.get() + ") or approval pending; deferring response");
}
}
} else if (eventType == ServerEventType.RESPONSE_MCP_CALL_FAILED) {
System.out.println("❌ MCP tool call failed");
writeLog("ERROR: MCP tool call failed");
String failedJson = BinaryData.fromObject(event).toString();
String failedItemId = extractJsonField(failedJson, "item_id");
state.mcpCallInProgress.updateAndGet(v -> Math.max(0, v - 1));
if (failedItemId != null) {
state.activeMcpItems.remove(failedItemId);
state.staleMcpItems.remove(failedItemId);
}
cancelMcpStallTimer(state);
try {
session.send(BinaryData.fromString("{\"type\":\"response.create\"}"))
.subscribeOn(Schedulers.boundedElastic())
.subscribe(v -> {}, err -> {});
} catch (Exception e) {
// best effort
}
} else if (eventType == ServerEventType.CONVERSATION_ITEM_CREATED) {
handleMCPConversationItem(event, state, session);
}
} catch (Exception e) {
System.err.println("❌ Error handling event: " + e.getMessage());
}
}
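The batching rule in the handler above (request a spoken response only once every MCP call for the turn has finished, and ignore duplicate completion events) can be isolated from the SDK. The following is a minimal sketch under stated assumptions: `McpCallTracker`, its `Action` enum, and its method names are hypothetical and not part of the Voice Live SDK, and item IDs are assumed to be non-null.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper (not an SDK type): tracks in-flight MCP calls for one session. */
final class McpCallTracker {
    /** What the caller should do after a completion event. */
    enum Action { IGNORE_DUPLICATE, DEFER, CREATE_RESPONSE }

    private final AtomicInteger inProgress = new AtomicInteger();
    private final Set<String> active = ConcurrentHashMap.newKeySet();
    private final Set<String> stale = ConcurrentHashMap.newKeySet();
    private final Set<String> handled = ConcurrentHashMap.newKeySet();

    /** Record that a tool call started. itemId must be non-null. */
    void onInProgress(String itemId) {
        inProgress.incrementAndGet();
        active.add(itemId);
    }

    /** Barge-in: every call still running now belongs to an earlier request. */
    void onUserSpoke() {
        stale.addAll(active);
    }

    /** Returns true once if the item was marked stale, then forgets it. */
    boolean takeStale(String itemId) {
        return stale.remove(itemId);
    }

    /** Decide whether to send response.create now, later, or not at all. */
    Action onCompleted(String itemId, boolean approvalPending) {
        if (!handled.add(itemId)) {
            return Action.IGNORE_DUPLICATE; // completion already processed
        }
        active.remove(itemId);
        int remaining = inProgress.updateAndGet(v -> Math.max(0, v - 1));
        return (remaining == 0 && !approvalPending) ? Action.CREATE_RESPONSE : Action.DEFER;
    }
}
```

Unlike the handler above, this sketch checks for duplicates before decrementing the counter, so a repeated completion event can't lower the in-flight count twice. Whether that matters depends on whether the service ever repeats completion events, which this sketch does not assume either way.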
Handle approval requests
When a server is configured with requireApproval("always"), client code must handle the approval flow. Instead of blocking on Scanner.nextLine(), inject a system message so the model asks the user verbally and parse the spoken response.
/**
* Handle MCP conversation items: approval requests, tool call announcements,
* and item-to-server tracking.
*/
private static void handleMCPConversationItem(SessionUpdate event, SessionState state,
VoiceLiveSessionAsyncClient session) {
String eventJson = BinaryData.fromObject(event).toString();
if (eventJson.contains("mcp_approval_request")) {
// Extract approval details
String approvalId = extractJsonField(eventJson, "id");
String serverLabel = extractJsonField(eventJson, "server_label");
String functionName = extractJsonField(eventJson, "name");
if ("unknown".equals(approvalId)) {
return;
}
final int MAX_APPROVAL_CALLS_PER_TASK = 3;
int currentCount = state.approvalCallCount.getOrDefault(serverLabel, 0);
if (currentCount >= MAX_APPROVAL_CALLS_PER_TASK) {
System.out.println(" Auto-denied: " + serverLabel + "/" + functionName
+ " (max " + MAX_APPROVAL_CALLS_PER_TASK + " calls reached)");
try {
String denyJson = String.format(
"{\"type\":\"conversation.item.create\",\"item\":"
+ "{\"type\":\"mcp_approval_response\","
+ "\"approval_request_id\":\"%s\","
+ "\"approve\":false}}",
approvalId);
session.send(BinaryData.fromString(denyJson))
.subscribeOn(Schedulers.boundedElastic())
.subscribe(v -> {}, err ->
System.err.println("Failed to send auto-deny: " + err.getMessage()));
} catch (Exception e) {
System.err.println("Failed to send auto-deny: " + e.getMessage());
}
return;
}
// Auto-approve if user already approved this server earlier in the same turn
if (state.approvedServersThisTurn.contains(serverLabel)) {
System.out.println(" Auto-approved: " + serverLabel + "/" + functionName
+ " (already approved this turn)");
try {
String approveJson = String.format(
"{\"type\":\"conversation.item.create\",\"item\":"
+ "{\"type\":\"mcp_approval_response\","
+ "\"approval_request_id\":\"%s\","
+ "\"approve\":true}}",
approvalId);
session.send(BinaryData.fromString(approveJson))
.subscribeOn(Schedulers.boundedElastic())
.subscribe(v -> {}, err ->
System.err.println("Failed to send auto-approve: " + err.getMessage()));
} catch (Exception e) {
System.err.println("Failed to send auto-approve: " + e.getMessage());
}
return;
}
// If another approval is already pending, queue this one
if (state.pendingApproval != null) {
state.approvalQueue.add(
new SessionState.ApprovalInfo(approvalId, serverLabel, functionName));
return;
}
System.out.println();
System.out.println("🔐 MCP Approval Request (voice-based):");
System.out.println(" Server: " + serverLabel + " Tool: " + functionName);
writeLog("Approval request: server=" + serverLabel + " tool=" + functionName);
state.pendingApproval =
new SessionState.ApprovalInfo(approvalId, serverLabel, functionName);
if (!state.responseActive) {
sendApprovalVoicePrompt(state, session);
} else {
state.approvalPromptNeeded = true;
}
} else if (eventJson.contains("\"type\":\"mcp_call\"")) {
// Track MCP call items and announce non-approval tool calls
String itemId = extractJsonField(eventJson, "id");
String serverLabel = extractJsonField(eventJson, "server_label");
String functionName = extractJsonField(eventJson, "name");
System.out.println("š§ MCP tool call: " + serverLabel + "/" + functionName);
state.mcpItemToServer.put(itemId, serverLabel + "/" + functionName);
// Announce to the user if this server doesn't require approval
if (state.pendingApproval == null && !state.approvalServers.contains(serverLabel)) {
sendSystemMessage(session,
"Briefly tell the user you're looking something up. One short sentence only.")
.then(session.send(BinaryData.fromString("{\"type\":\"response.create\"}")))
.subscribeOn(Schedulers.boundedElastic())
.subscribe(v -> {}, err -> {});
}
}
}
/**
* Inject a system message asking the model to verbally request permission.
*/
private static void sendApprovalVoicePrompt(SessionState state,
VoiceLiveSessionAsyncClient session) {
SessionState.ApprovalInfo pending = state.pendingApproval;
if (pending == null) return;
int callCount = state.approvalCallCount.getOrDefault(pending.serverLabel(), 0);
state.approvalCallCount.put(pending.serverLabel(), callCount + 1);
String prompt;
if (callCount == 0) {
prompt = "You MUST ask the user for explicit permission before proceeding. "
+ "Say exactly: \"I'd like to search the " + pending.serverLabel()
+ " service for information. Do you approve? Please say yes or no.\"";
} else {
prompt = "You MUST ask the user for permission again. "
+ "Say exactly: \"I need to do one more search to get complete information. "
+ "Should I continue? Please say yes or no.\"";
}
sendSystemMessage(session, prompt)
.then(session.send(BinaryData.fromString("{\"type\":\"response.create\"}")))
.subscribeOn(Schedulers.boundedElastic())
.subscribe(v -> {}, err ->
System.err.println("❌ Failed to send approval voice prompt: " + err.getMessage()));
}
/**
* Interpret the user's spoken response as approval or denial.
*/
private static void resolveVoiceApproval(String transcript, SessionState state,
VoiceLiveSessionAsyncClient session) {
SessionState.ApprovalInfo pending = state.pendingApproval;
if (pending == null) return;
String text = transcript.trim().toLowerCase();
boolean approved = YES_PATTERN.matcher(text).find();
boolean denied = NO_PATTERN.matcher(text).find();
if (!approved && !denied) {
// Ambiguous: ask again at next RESPONSE_DONE
state.approvalPromptNeeded = true;
return;
}
if (approved && denied) {
approved = false; // conflicting signals: deny for safety
}
state.pendingApproval = null;
if (approved) {
state.approvedServersThisTurn.add(pending.serverLabel());
} else {
state.approvalCallCount.clear();
state.approvedServersThisTurn.remove(pending.serverLabel());
}
System.out.println(" Voice approval: " + (approved ? "Approved ✅" : "Denied ❌"));
writeLog("Approval resolved: " + (approved ? "APPROVED" : "DENIED") + " for " + pending.serverLabel() + "/" + pending.functionName());
// Send approval/denial response via raw JSON.
// Chain processNextApproval after the send completes to avoid racing.
String approvalJson = String.format(
"{\"type\":\"conversation.item.create\",\"item\":"
+ "{\"type\":\"mcp_approval_response\","
+ "\"approval_request_id\":\"%s\","
+ "\"approve\":%s}}",
pending.approvalId(), approved);
session.send(BinaryData.fromString(approvalJson))
.subscribeOn(Schedulers.boundedElastic())
.subscribe(
v -> processNextApproval(state, session),
error -> {
System.err.println("❌ Failed to send approval response: " + error.getMessage());
processNextApproval(state, session);
}
);
}
/**
* Pop the next queued approval and ask via voice.
*/
private static void processNextApproval(SessionState state,
VoiceLiveSessionAsyncClient session) {
SessionState.ApprovalInfo next = state.approvalQueue.poll();
if (next == null) return;
// Auto-approve if user already approved this server earlier in the same turn
if (state.approvedServersThisTurn.contains(next.serverLabel())) {
System.out.println(" Auto-approved (queued): " + next.serverLabel() + "/" + next.functionName());
String approveJson = String.format(
"{\"type\":\"conversation.item.create\",\"item\":"
+ "{\"type\":\"mcp_approval_response\","
+ "\"approval_request_id\":\"%s\","
+ "\"approve\":true}}",
next.approvalId());
session.send(BinaryData.fromString(approveJson))
.subscribeOn(Schedulers.boundedElastic())
.subscribe(
v -> processNextApproval(state, session),
err -> {
System.err.println("Failed to send queued auto-approve: " + err.getMessage());
processNextApproval(state, session);
});
return;
}
state.pendingApproval = next;
if (!state.responseActive) {
sendApprovalVoicePrompt(state, session);
} else {
state.approvalPromptNeeded = true;
}
}
In this sample:
- A system message instructs the model to verbally ask for permission.
- The decision goes back to Voice Live as an mcp_approval_response conversation item (MCPApprovalResponseRequestItem), which this sample sends as raw JSON.
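The sample builds the same approval payload with String.format in three places. One way to keep those call sites consistent is a small builder that also escapes the request ID. This is a sketch only: `ApprovalJson` and its methods are hypothetical names, and the payload shape is copied from the raw JSON used in the sample rather than from a schema.

```java
/** Hypothetical helper (not an SDK type): builds the raw approval payload used in the sample. */
final class ApprovalJson {
    private ApprovalJson() { }

    /** Escapes characters that would otherwise break a JSON string literal. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"') {
                sb.append("\\\"");
            } else if (c == '\\') {
                sb.append("\\\\");
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c)); // control characters
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Returns a conversation.item.create event carrying an mcp_approval_response item. */
    static String approvalResponse(String approvalRequestId, boolean approve) {
        return "{\"type\":\"conversation.item.create\",\"item\":"
            + "{\"type\":\"mcp_approval_response\","
            + "\"approval_request_id\":\"" + escape(approvalRequestId) + "\","
            + "\"approve\":" + approve + "}}";
    }
}
```

With a helper like this, each call site would reduce to `session.send(BinaryData.fromString(ApprovalJson.approvalResponse(id, approved)))`, following the send pattern already used in the sample.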
Resolve voice-based approval
Parse the user's spoken transcript to determine approval. Use word-boundary regex to avoid false positives from words like "yesterday" or "nobody".
} else if (eventType == ServerEventType.CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_COMPLETED) {
String eventJson = BinaryData.fromObject(event).toString();
String transcript = extractJsonField(eventJson, "transcript");
System.out.println("👤 You said:\t" + transcript);
writeLog("User Input:\t" + transcript);
// Interpret as an approval answer if we have a pending approval
if (state.pendingApproval != null) {
resolveVoiceApproval(transcript, state, session);
}
In this sample:
- The transcript from CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_COMPLETED is matched against the \byes\b and \b(no|stop|cancel)\b patterns.
- Subsequent calls to the same server within the same turn are auto-approved to avoid repeated prompts.
- After a configurable maximum (for example, 3 approvals), further calls are auto-denied and the model responds with what it has.
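The word-boundary matching described above can be tried in isolation. The sketch below reuses the two patterns from the sample. The `VoiceApprovalParser` class and its `Decision` enum are hypothetical names introduced for illustration. As in the sample, a transcript that contains both a yes and a no signal is treated as a denial.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not an SDK type): classifies a spoken reply to an approval prompt. */
final class VoiceApprovalParser {
    enum Decision { APPROVED, DENIED, AMBIGUOUS }

    // Same patterns as the sample; \b keeps "yesterday" and "nobody" from matching.
    private static final Pattern YES =
        Pattern.compile("\\byes\\b", Pattern.CASE_INSENSITIVE);
    private static final Pattern NO =
        Pattern.compile("\\b(no|stop|cancel)\\b", Pattern.CASE_INSENSITIVE);

    private VoiceApprovalParser() { }

    static Decision classify(String transcript) {
        if (transcript == null) {
            return Decision.AMBIGUOUS;
        }
        boolean yes = YES.matcher(transcript).find();
        boolean no = NO.matcher(transcript).find();
        if (no) {
            return Decision.DENIED; // also covers conflicting "yes ... no" replies
        }
        return yes ? Decision.APPROVED : Decision.AMBIGUOUS;
    }
}
```

An AMBIGUOUS result corresponds to the branch in the sample that sets approvalPromptNeeded so the question is asked again.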
Detect stalls during MCP tool calls
MCP tool calls can take several seconds. Use a repeating timer to proactively inform the user that the assistant is still waiting for results.
/**
* Start a timer that verbally updates the user if an MCP call takes too long.
*/
private static void startMcpStallTimer(SessionState state,
VoiceLiveSessionAsyncClient session) {
cancelMcpStallTimer(state);
final AtomicInteger stallCount = new AtomicInteger(0);
state.mcpStallTimer = SCHEDULER.scheduleAtFixedRate(() -> {
if (state.mcpCallInProgress.get() <= 0) {
cancelMcpStallTimer(state);
return;
}
int count = stallCount.incrementAndGet();
if (count > 3) {
cancelMcpStallTimer(state);
return;
}
// MCP calls cannot be cancelled; only honest status updates are possible.
String msg = "The tool call is still running. "
+ "Briefly reassure the user that you're still waiting for results. "
+ "One short sentence only.";
sendSystemMessage(session, msg)
.then(session.send(BinaryData.fromString("{\"type\":\"response.create\"}")))
.subscribeOn(Schedulers.boundedElastic())
.subscribe(v -> {}, err -> {
if (err.getMessage() != null
&& err.getMessage().toLowerCase().contains("active response")) {
state.needsResponseCreate = true;
}
});
}, 10, 10, TimeUnit.SECONDS);
}
/**
* Cancel the MCP stall timer if running.
*/
private static void cancelMcpStallTimer(SessionState state) {
ScheduledFuture<?> timer = state.mcpStallTimer;
if (timer != null && !timer.isDone()) {
timer.cancel(false);
}
state.mcpStallTimer = null;
}
In this sample:
- A ScheduledExecutorService fires at a 10-second interval, injecting system messages up to 3 times.
- The timer is cancelled when the MCP call completes or fails, or on the first tick after the third reminder.
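The capped repeating timer can also be written as a small reusable class whose tick logic is testable without waiting for real time to pass. This is a sketch under stated assumptions: `BoundedReminder` is a hypothetical name, and the "still waiting" check and the reminder action are supplied by the caller, so nothing here depends on the Voice Live SDK.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/** Hypothetical helper (not an SDK type): a repeating reminder that stops after a fixed count. */
final class BoundedReminder {
    private final int maxReminders;
    private final BooleanSupplier stillWaiting;
    private final Runnable remind;
    private final AtomicInteger fired = new AtomicInteger();
    private volatile ScheduledFuture<?> future;

    BoundedReminder(int maxReminders, BooleanSupplier stillWaiting, Runnable remind) {
        this.maxReminders = maxReminders;
        this.stillWaiting = stillWaiting;
        this.remind = remind;
    }

    /** One timer tick. Returns true if a reminder was sent. */
    boolean tick() {
        if (!stillWaiting.getAsBoolean() || fired.get() >= maxReminders) {
            cancel(); // nothing left to wait for, or the cap was reached
            return false;
        }
        fired.incrementAndGet();
        remind.run();
        return true;
    }

    /** Schedule ticks at a fixed rate, replacing any earlier schedule. */
    void start(ScheduledExecutorService scheduler, long period, TimeUnit unit) {
        cancel();
        future = scheduler.scheduleAtFixedRate(this::tick, period, period, unit);
    }

    /** Stop future ticks without interrupting one that is already running. */
    void cancel() {
        ScheduledFuture<?> f = future;
        if (f != null) {
            f.cancel(false);
        }
        future = null;
    }
}
```

In the sample's terms, stillWaiting would be `() -> state.mcpCallInProgress.get() > 0`, remind would send the system message followed by response.create, and start would be called with a period of 10 seconds.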
Run the sample
Create the src/main/java/MCPQuickstart.java file with the following code:
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

import com.azure.ai.voicelive.VoiceLiveAsyncClient;
import com.azure.ai.voicelive.VoiceLiveClientBuilder;
import com.azure.ai.voicelive.VoiceLiveServiceVersion;
import com.azure.ai.voicelive.VoiceLiveSessionAsyncClient;
import com.azure.ai.voicelive.models.AudioEchoCancellation;
import com.azure.ai.voicelive.models.AudioInputTranscriptionOptions;
import com.azure.ai.voicelive.models.AudioInputTranscriptionOptionsModel;
import com.azure.ai.voicelive.models.AudioNoiseReduction;
import com.azure.ai.voicelive.models.AudioNoiseReductionType;
import com.azure.ai.voicelive.models.AzureStandardVoice;
import com.azure.ai.voicelive.models.ClientEventSessionUpdate;
import com.azure.ai.voicelive.models.InputAudioFormat;
import com.azure.ai.voicelive.models.InteractionModality;
import com.azure.ai.voicelive.models.MCPServer;
import com.azure.ai.voicelive.models.OutputAudioFormat;
import com.azure.ai.voicelive.models.ServerEventType;
import com.azure.ai.voicelive.models.ServerVadTurnDetection;
import com.azure.ai.voicelive.models.SessionUpdate;
import com.azure.ai.voicelive.models.SessionUpdateError;
import com.azure.ai.voicelive.models.SessionUpdateResponseAudioDelta;
import com.azure.ai.voicelive.models.VoiceLiveSessionOptions;
import com.azure.ai.voicelive.models.VoiceLiveToolDefinition;
import com.azure.core.credential.KeyCredential;
import com.azure.core.credential.TokenCredential;
import com.azure.core.util.BinaryData;
import com.azure.identity.AzureCliCredentialBuilder;
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;

import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.DataLine;
import javax.sound.sampled.LineUnavailableException;
import javax.sound.sampled.SourceDataLine;
import javax.sound.sampled.TargetDataLine;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;
import java.util.regex.Pattern;

/**
 * MCP Quickstart - demonstrates MCP server integration with the VoiceLive SDK.
 * Shows how to define MCP servers, handle MCP tool calls, and implement
 * an approval flow for tool calls that require user consent.
 *
 * <p><strong>Environment Variables Required:</strong></p>
 * <ul>
 * <li>AZURE_VOICELIVE_ENDPOINT - The VoiceLive service endpoint URL</li>
 * <li>AZURE_VOICELIVE_API_KEY - The API key (required if not using --use-token-credential)</li>
 * </ul>
 *
 * <p><strong>How to Run:</strong></p>
 * <pre>{@code
 * mvn compile exec:java -Dexec.mainClass="MCPQuickstart" -q
 * }</pre>
 */
public final class MCPQuickstart {
    private static final String DEFAULT_MODEL = "gpt-realtime";
    private static final String DEFAULT_VOICE = "en-US-Ava:DragonHDLatestNeural";
    private static final String DEFAULT_INSTRUCTIONS =
        "You are a helpful AI assistant with access to MCP tools. "
        + "Use the tools to help answer user questions. "
        + "Respond naturally and conversationally. "
        + "Some tools require user approval before they can be used. When you receive a "
        + "system message asking you to request permission, you MUST clearly ask the user "
        + "for their explicit approval before proceeding. Always wait for the user to say "
        + "yes or no. Never skip the approval question or assume permission is granted. "
        + "If a tool result arrives after the conversation has moved to a different topic, "
        + "briefly introduce it as a late result before sharing the findings.";
    private static final String ENV_ENDPOINT = "AZURE_VOICELIVE_ENDPOINT";
    private static final String ENV_API_KEY = "AZURE_VOICELIVE_API_KEY";
    private static final int SAMPLE_RATE = 24000;
    private static final int CHANNELS = 1;
    private static final int SAMPLE_SIZE_BITS = 16;
    private static final int CHUNK_SIZE = 1200;
    private static final int AUDIO_BUFFER_SIZE_MULTIPLIER = 4;

    private MCPQuickstart() {
        throw new UnsupportedOperationException("Utility class");
    }

    private static final ScheduledExecutorService SCHEDULER =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "MCP-StallTimer");
            t.setDaemon(true);
            return t;
        });
    private static final Pattern YES_PATTERN =
        Pattern.compile("\\byes\\b", Pattern.CASE_INSENSITIVE);
    private static final Pattern NO_PATTERN =
        Pattern.compile("\\b(no|stop|cancel)\\b", Pattern.CASE_INSENSITIVE);
    private static final String LOG_FILENAME = "conversation_"
        + LocalDateTime.now().format(DateTimeFormatter.ofPattern("yyyyMMdd_HHmmss")) + ".log";

    /**
     * Mutable session state shared across event handlers.
     * All fields are thread-safe (volatile or concurrent collections).
     */
    private static class SessionState {
        volatile ApprovalInfo pendingApproval;
        final Queue<ApprovalInfo> approvalQueue = new ConcurrentLinkedQueue<>();
        volatile boolean approvalPromptNeeded;
        final AtomicInteger mcpCallInProgress = new AtomicInteger(0);
        final Set<String> handledMcpCompletions = ConcurrentHashMap.newKeySet();
        volatile boolean needsResponseCreate;
        final Map<String, Integer> approvalCallCount = new ConcurrentHashMap<>();
        final Map<String, String> mcpItemToServer = new ConcurrentHashMap<>();
        Set<String> approvalServers = Set.of();
        volatile ScheduledFuture<?> mcpStallTimer;
        volatile boolean responseActive;
        final Set<String> activeMcpItems = ConcurrentHashMap.newKeySet();
        final Set<String> staleMcpItems = ConcurrentHashMap.newKeySet();
        volatile boolean mcpResultsPending;
        final Set<String> approvedServersThisTurn = ConcurrentHashMap.newKeySet();

        static class ApprovalInfo {
            final String approvalId;
            final String serverLabel;
            final String functionName;

            ApprovalInfo(String approvalId, String serverLabel, String functionName) {
                this.approvalId = approvalId;
                this.serverLabel = serverLabel;
                this.functionName = functionName;
            }

            String approvalId() {
                return approvalId;
            }

            String serverLabel() {
                return serverLabel;
            }

            String functionName() {
                return functionName;
            }
        }
    }

    private static class AudioPlaybackPacket {
        final int sequenceNumber;
        final byte[] audioData;

        AudioPlaybackPacket(int sequenceNumber, byte[] audioData) {
            this.sequenceNumber = sequenceNumber;
            this.audioData = audioData;
        }
    }

    /**
     * Audio processor for real-time capture and playback.
     */
    private static class AudioProcessor {
        private final VoiceLiveSessionAsyncClient session;
        private final AudioFormat audioFormat;
        private TargetDataLine microphone;
        private SourceDataLine speaker;
        private final AtomicBoolean isCapturing = new AtomicBoolean(false);
        private final AtomicBoolean isPlaying = new AtomicBoolean(false);
        private final BlockingQueue<AudioPlaybackPacket> playbackQueue = new LinkedBlockingQueue<>();
        private final AtomicInteger nextSequenceNumber = new AtomicInteger(0);
        private final AtomicInteger playbackBase = new AtomicInteger(0);

        AudioProcessor(VoiceLiveSessionAsyncClient session) {
            this.session = session;
            this.audioFormat = new AudioFormat(
                AudioFormat.Encoding.PCM_SIGNED,
                SAMPLE_RATE,
                SAMPLE_SIZE_BITS,
                CHANNELS,
                CHANNELS * SAMPLE_SIZE_BITS / 8,
                SAMPLE_RATE,
                false
            );
        }

        void startCapture() {
            if (isCapturing.get()) return;
            try {
                DataLine.Info micInfo = new DataLine.Info(TargetDataLine.class, audioFormat);
                microphone = (TargetDataLine) AudioSystem.getLine(micInfo);
                microphone.open(audioFormat, CHUNK_SIZE * AUDIO_BUFFER_SIZE_MULTIPLIER);
                microphone.start();
                isCapturing.set(true);
                Thread captureThread = new Thread(this::captureAudioLoop, "VoiceLive-AudioCapture");
                captureThread.setDaemon(true);
                captureThread.start();
                System.out.println("🎤 Microphone capture started");
            } catch (LineUnavailableException e) {
                throw new RuntimeException("Failed to initialize microphone", e);
            }
        }

        void startPlayback() {
            if (isPlaying.get()) return;
            try {
                DataLine.Info speakerInfo = new DataLine.Info(SourceDataLine.class, audioFormat);
                speaker = (SourceDataLine) AudioSystem.getLine(speakerInfo);
                speaker.open(audioFormat, CHUNK_SIZE * AUDIO_BUFFER_SIZE_MULTIPLIER);
                speaker.start();
                isPlaying.set(true);
                Thread playbackThread = new Thread(this::playbackAudioLoop, "VoiceLive-AudioPlayback");
                playbackThread.setDaemon(true);
                playbackThread.start();
                System.out.println("🔊 Audio playback started");
            } catch (LineUnavailableException e) {
                throw new RuntimeException("Failed to initialize speaker", e);
            }
        }

        private void captureAudioLoop() {
            byte[] buffer = new byte[CHUNK_SIZE * 2];
            while (isCapturing.get() && microphone != null) {
                try {
                    int bytesRead = microphone.read(buffer, 0, buffer.length);
                    if (bytesRead > 0) {
                        byte[] audioChunk = Arrays.copyOf(buffer, bytesRead);
                        session.sendInputAudio(BinaryData.fromBytes(audioChunk))
                            .subscribeOn(Schedulers.boundedElastic())
                            .subscribe(v -> {}, error -> {
                                if (!error.getMessage().contains("cancelled")) {
                                    System.err.println("❌ Error sending audio: " + error.getMessage());
                                }
                            });
                    }
                } catch (Exception e) {
                    if (isCapturing.get()) {
                        System.err.println("❌ Error in audio capture: " + e.getMessage());
                    }
                    break;
                }
            }
        }

        private void playbackAudioLoop() {
            while (isPlaying.get()) {
                try {
                    AudioPlaybackPacket packet = playbackQueue.take();
                    if (packet.audioData == null) break;
                    if (packet.sequenceNumber < playbackBase.get()) continue;
                    if (speaker != null && speaker.isOpen()) {
                        speaker.write(packet.audioData, 0, packet.audioData.length);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }

        void queueAudio(byte[] audioData) {
            if (audioData != null && audioData.length > 0) {
                int seqNum = nextSequenceNumber.getAndIncrement();
                playbackQueue.offer(new AudioPlaybackPacket(seqNum, audioData));
            }
        }

        void skipPendingAudio() {
            playbackBase.set(nextSequenceNumber.get());
            playbackQueue.clear();
            if (speaker != null && speaker.isOpen()) speaker.flush();
        }

        void shutdown() {
            isCapturing.set(false);
            if (microphone != null) {
                microphone.stop();
                microphone.close();
                microphone = null;
            }
            isPlaying.set(false);
            playbackQueue.offer(new AudioPlaybackPacket(-1, null));
            if (speaker != null) {
                speaker.stop();
                speaker.close();
                speaker = null;
            }
            System.out.println("🔇 Audio processor shut down");
        }
    }

    private static class Config {
        String endpoint;
        String apiKey;
        String model = DEFAULT_MODEL;
        String voice = DEFAULT_VOICE;
        String instructions = DEFAULT_INSTRUCTIONS;
        boolean useTokenCredential = false;

        static Config load(String[] args) {
            Config config = new Config();
            Properties props = loadProperties();
            if (props != null) {
                config.endpoint = props.getProperty("azure.voicelive.endpoint");
                config.apiKey = props.getProperty("azure.voicelive.api-key");
                config.model = props.getProperty("azure.voicelive.model", DEFAULT_MODEL);
                config.voice = props.getProperty("azure.voicelive.voice", DEFAULT_VOICE);
            }
            if (System.getenv(ENV_ENDPOINT) != null) config.endpoint = System.getenv(ENV_ENDPOINT);
            if (System.getenv(ENV_API_KEY) != null) config.apiKey = System.getenv(ENV_API_KEY);
            for (int i = 0; i < args.length; i++) {
                switch (args[i]) {
                    case "--endpoint":
                        if (i + 1 < args.length) config.endpoint = args[++i];
                        break;
                    case "--api-key":
                        if (i + 1 < args.length) config.apiKey = args[++i];
                        break;
                    case "--model":
                        if (i + 1 < args.length) config.model = args[++i];
                        break;
                    case "--voice":
                        if (i + 1 < args.length) config.voice = args[++i];
                        break;
                    case "--use-token-credential":
                        config.useTokenCredential = true;
                        break;
                }
            }
            return config;
        }
    }

    private static Properties loadProperties() {
        Properties props = new Properties();
        try (InputStream input = new FileInputStream("application.properties")) {
            props.load(input);
            return props;
        } catch (IOException e) {
            return null;
        }
    }

    // <define_mcp_servers>
    /**
     * Define MCP servers that Voice Live can use during the session.
     * Each server is an MCPServer instance added to the session options tools list.
     */
    private static List<VoiceLiveToolDefinition> defineMCPServers() {
        List<VoiceLiveToolDefinition> mcpTools = new ArrayList<>();
        mcpTools.add(new MCPServer("deepwiki", "https://mcp.deepwiki.com/mcp")
            .setAllowedTools(Arrays.asList("read_wiki_structure", "ask_question"))
            .setRequireApproval(BinaryData.fromString("never")));
        mcpTools.add(new MCPServer("azure_doc", "https://learn.microsoft.com/api/mcp")
            .setRequireApproval(BinaryData.fromString("always")));
        return mcpTools;
    }
    // </define_mcp_servers>

    // <configure_session>
    /**
     * Create session configuration with MCP servers in the tools list.
     */
    private static VoiceLiveSessionOptions createSessionOptions(Config config) {
        ServerVadTurnDetection turnDetection = new ServerVadTurnDetection()
            .setThreshold(0.5)
            .setPrefixPaddingMs(300)
            .setSilenceDurationMs(500)
            .setInterruptResponse(true)
            .setAutoTruncate(true)
            .setCreateResponse(true);
        // Enable input audio transcription so we receive user speech as text
        AudioInputTranscriptionOptionsModel transcriptionModel =
            config.model.toLowerCase().contains("realtime")
                ? AudioInputTranscriptionOptionsModel.WHISPER_1
                : AudioInputTranscriptionOptionsModel.fromString("azure-speech");
        AudioInputTranscriptionOptions transcriptionOptions =
            new AudioInputTranscriptionOptions(transcriptionModel);
        VoiceLiveSessionOptions options = new VoiceLiveSessionOptions()
            .setInstructions(config.instructions)
            .setVoice(BinaryData.fromObject(new AzureStandardVoice(config.voice)))
            .setModalities(Arrays.asList(InteractionModality.TEXT, InteractionModality.AUDIO))
            .setInputAudioFormat(InputAudioFormat.PCM16)
            .setOutputAudioFormat(OutputAudioFormat.PCM16)
            .setInputAudioSamplingRate(SAMPLE_RATE)
            .setInputAudioNoiseReduction(new AudioNoiseReduction(AudioNoiseReductionType.NEAR_FIELD))
            .setInputAudioEchoCancellation(new AudioEchoCancellation())
            .setInputAudioTranscription(transcriptionOptions)
            .setTurnDetection(turnDetection);
        // Add MCP servers to the tools list
        List<VoiceLiveToolDefinition> mcpServers = defineMCPServers();
        options.setTools(mcpServers);
        return options;
    }
    // </configure_session>

    // <handle_mcp_events>
    /**
     * Handle incoming server events, including MCP-specific events
     * and voice-based approval flow.
*/ private static void handleServerEvent(SessionUpdate event, AudioProcessor audioProcessor, SessionState state, VoiceLiveSessionAsyncClient session) { ServerEventType eventType = event.getType(); try { if (eventType == ServerEventType.SESSION_UPDATED) { System.out.println("ā Session updated - starting microphone"); writeLog("Session updated"); audioProcessor.startCapture(); } else if (eventType == ServerEventType.INPUT_AUDIO_BUFFER_SPEECH_STARTED) { System.out.println("š¤ Listening..."); audioProcessor.skipPendingAudio(); // Cancel any active response ā prevents duplicate result playback // when the user interrupts during MCP result speech (matches C#/Python/JS) if (state.responseActive) { session.send(BinaryData.fromString("{\"type\":\"response.cancel\"}")) .subscribeOn(Schedulers.boundedElastic()) .subscribe(v -> {}, err -> {}); } // Clear deferred response flags if no MCP calls are in progress. // Without this, a stale needsResponseCreate from a collision during // the approval flow causes the model to re-speak results after the // user interrupts. if (state.mcpCallInProgress.get() <= 0) { state.needsResponseCreate = false; state.mcpResultsPending = false; } // Reset approved-servers-this-turn when user starts a new topic if (state.pendingApproval == null && state.mcpCallInProgress.get() <= 0) { state.approvedServersThisTurn.clear(); } // If an MCP call is running and no approval is pending, mark as stale if (state.mcpCallInProgress.get() > 0 && state.pendingApproval == null) { state.staleMcpItems.addAll(state.activeMcpItems); System.out.println("[barge-in] Marking " + state.activeMcpItems.size() + " MCP calls as stale"); sendSystemMessage(session, "A tool call is still running in the background. The user just spoke. " + "Respond to what the user said. 
If a tool result arrives later, " + "briefly introduce it as a late result from an earlier request.") .subscribeOn(Schedulers.boundedElastic()) .subscribe(v -> {}, err -> {}); } } else if (eventType == ServerEventType.INPUT_AUDIO_BUFFER_SPEECH_STOPPED) { System.out.println("š¤ Processing..."); } else if (eventType == ServerEventType.RESPONSE_CREATED) { state.responseActive = true; } else if (eventType == ServerEventType.RESPONSE_AUDIO_DELTA) { if (event instanceof SessionUpdateResponseAudioDelta) { SessionUpdateResponseAudioDelta audioEvent = (SessionUpdateResponseAudioDelta) event; byte[] audioData = audioEvent.getDelta(); if (audioData != null && audioData.length > 0) { audioProcessor.queueAudio(audioData); } } } else if (eventType == ServerEventType.RESPONSE_AUDIO_DONE) { System.out.println("š¤ Ready for next input..."); } else if (eventType == ServerEventType.RESPONSE_DONE) { state.responseActive = false; System.out.println("ā Response complete"); writeLog("--- Response complete ---"); // If an approval prompt needs to be injected, do it now if (state.approvalPromptNeeded && state.pendingApproval != null) { state.approvalPromptNeeded = false; sendApprovalVoicePrompt(state, session); // If MCP results are pending and all calls are now done, create response } else if (state.mcpResultsPending && state.mcpCallInProgress.get() <= 0 && state.pendingApproval == null) { state.mcpResultsPending = false; try { session.send(BinaryData.fromString("{\"type\":\"response.create\"}")) .subscribeOn(Schedulers.boundedElastic()) .subscribe(v -> {}, err -> {}); } catch (Exception e) { // best-effort } } else if (state.needsResponseCreate) { // Deferred response.create ā retry now that no response is active state.needsResponseCreate = false; try { session.send(BinaryData.fromString("{\"type\":\"response.create\"}")) .subscribeOn(Schedulers.boundedElastic()) .subscribe(v -> {}, err -> {}); } catch (Exception e) { // best-effort retry } } // <voice_approval_transcription> } else if 
(eventType == ServerEventType.CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_COMPLETED) { String eventJson = BinaryData.fromObject(event).toString(); String transcript = extractJsonField(eventJson, "transcript"); System.out.println("š¤ You said:\t" + transcript); writeLog("User Input:\t" + transcript); // Interpret as an approval answer if we have a pending approval if (state.pendingApproval != null) { resolveVoiceApproval(transcript, state, session); } // </voice_approval_transcription> } else if (eventType == ServerEventType.ERROR) { // Reset response state ā errors can terminate a response without RESPONSE_DONE state.responseActive = false; if (event instanceof SessionUpdateError) { String msg = ((SessionUpdateError) event).getError().getMessage(); if (msg.contains("no active response")) { // suppress } else if (msg.toLowerCase().contains("interim response")) { // non-fatal } else if (msg.toLowerCase().contains("active response")) { // expected during MCP flow } else { System.out.println("ā Error: " + msg); writeLog("ERROR: " + msg); } } // MCP-specific events } else if (eventType == ServerEventType.MCP_LIST_TOOLS_COMPLETED) { System.out.println("š§ MCP tools discovered successfully"); writeLog("MCP tools discovered successfully"); } else if (eventType == ServerEventType.MCP_LIST_TOOLS_FAILED) { System.out.println("ā MCP tool discovery failed"); writeLog("ERROR: MCP tool discovery failed"); } else if (eventType == ServerEventType.RESPONSE_MCP_CALL_IN_PROGRESS) { System.out.println("ā³ MCP tool call in progress..."); writeLog("MCP call in progress"); state.mcpCallInProgress.incrementAndGet(); String inProgressJson = BinaryData.fromObject(event).toString(); String inProgressItemId = extractJsonField(inProgressJson, "item_id"); if (inProgressItemId != null) state.activeMcpItems.add(inProgressItemId); startMcpStallTimer(state, session); } else if (eventType == ServerEventType.RESPONSE_MCP_CALL_COMPLETED) { String eventJson = BinaryData.fromObject(event).toString(); 
String itemId = extractJsonField(eventJson, "item_id"); state.mcpCallInProgress.updateAndGet(v -> Math.max(0, v - 1)); if (itemId != null) state.activeMcpItems.remove(itemId); cancelMcpStallTimer(state); if (state.handledMcpCompletions.contains(itemId)) { // duplicate ā ignore } else { state.handledMcpCompletions.add(itemId); boolean isStale = itemId != null && state.staleMcpItems.remove(itemId); System.out.println("ā MCP tool call completed (stale=" + isStale + ")"); writeLog("MCP call completed: " + itemId + " (stale=" + isStale + ")"); state.mcpItemToServer.remove(itemId); // Reset approval counter if no more approvals pending if (state.pendingApproval == null && state.approvalQueue.isEmpty()) { state.approvalCallCount.clear(); } // If the user moved on during this call, tell the model it's a late result. // Chain any late-result context message with the response.create below // to ensure the system message arrives first. Mono<Void> preResponseMono = Mono.empty(); if (isStale) { preResponseMono = sendSystemMessage(session, "This tool result is from an earlier request. The user has " + "since moved on. Briefly introduce it as a late result, e.g. " + "'By the way, those results from earlier just came in...' " + "then share the key findings concisely."); } // Batch response: only call response.create when ALL MCP calls for this // turn have completed. This prevents partial results and repeated tool calls. 
if (state.pendingApproval == null && state.approvalQueue.isEmpty() && state.mcpCallInProgress.get() <= 0) { preResponseMono .then(session.send(BinaryData.fromString("{\"type\":\"response.create\"}"))) .subscribeOn(Schedulers.boundedElastic()) .subscribe(v -> {}, err -> { if (err.getMessage().toLowerCase().contains("active response")) { state.needsResponseCreate = true; } }); } else { preResponseMono .subscribeOn(Schedulers.boundedElastic()) .subscribe(v -> {}, err -> {}); state.mcpResultsPending = true; System.out.println("[mcp] MCP calls still in progress (" + state.mcpCallInProgress.get() + ") or approval pending ā deferring response"); } } } else if (eventType == ServerEventType.RESPONSE_MCP_CALL_FAILED) { System.out.println("ā MCP tool call failed"); writeLog("ERROR: MCP tool call failed"); String failedJson = BinaryData.fromObject(event).toString(); String failedItemId = extractJsonField(failedJson, "item_id"); state.mcpCallInProgress.updateAndGet(v -> Math.max(0, v - 1)); if (failedItemId != null) { state.activeMcpItems.remove(failedItemId); state.staleMcpItems.remove(failedItemId); } cancelMcpStallTimer(state); try { session.send(BinaryData.fromString("{\"type\":\"response.create\"}")) .subscribeOn(Schedulers.boundedElastic()) .subscribe(v -> {}, err -> {}); } catch (Exception e) { // best effort } } else if (eventType == ServerEventType.CONVERSATION_ITEM_CREATED) { handleMCPConversationItem(event, state, session); } } catch (Exception e) { System.err.println("ā Error handling event: " + e.getMessage()); } } // </handle_mcp_events> // <handle_approval> /** * Handle MCP conversation items: approval requests, tool call announcements, * and item-to-server tracking. 
*/ private static void handleMCPConversationItem(SessionUpdate event, SessionState state, VoiceLiveSessionAsyncClient session) { String eventJson = BinaryData.fromObject(event).toString(); if (eventJson.contains("mcp_approval_request")) { // Extract approval details String approvalId = extractJsonField(eventJson, "id"); String serverLabel = extractJsonField(eventJson, "server_label"); String functionName = extractJsonField(eventJson, "name"); if ("unknown".equals(approvalId)) { return; } final int MAX_APPROVAL_CALLS_PER_TASK = 3; int currentCount = state.approvalCallCount.getOrDefault(serverLabel, 0); if (currentCount >= MAX_APPROVAL_CALLS_PER_TASK) { System.out.println(" Auto-denied: " + serverLabel + "/" + functionName + " (max " + MAX_APPROVAL_CALLS_PER_TASK + " calls reached)"); try { String denyJson = String.format( "{\"type\":\"conversation.item.create\",\"item\":" + "{\"type\":\"mcp_approval_response\"," + "\"approval_request_id\":\"%s\"," + "\"approve\":false}}", approvalId); session.send(BinaryData.fromString(denyJson)) .subscribeOn(Schedulers.boundedElastic()) .subscribe(v -> {}, err -> System.err.println("Failed to send auto-deny: " + err.getMessage())); } catch (Exception e) { System.err.println("Failed to send auto-deny: " + e.getMessage()); } return; } // Auto-approve if user already approved this server earlier in the same turn if (state.approvedServersThisTurn.contains(serverLabel)) { System.out.println(" Auto-approved: " + serverLabel + "/" + functionName + " (already approved this turn)"); try { String approveJson = String.format( "{\"type\":\"conversation.item.create\",\"item\":" + "{\"type\":\"mcp_approval_response\"," + "\"approval_request_id\":\"%s\"," + "\"approve\":true}}", approvalId); session.send(BinaryData.fromString(approveJson)) .subscribeOn(Schedulers.boundedElastic()) .subscribe(v -> {}, err -> System.err.println("Failed to send auto-approve: " + err.getMessage())); } catch (Exception e) { System.err.println("Failed to send 
auto-approve: " + e.getMessage()); } return; } // If another approval is already pending, queue this one if (state.pendingApproval != null) { state.approvalQueue.add( new SessionState.ApprovalInfo(approvalId, serverLabel, functionName)); return; } System.out.println(); System.out.println("š MCP Approval Request (voice-based):"); System.out.println(" Server: " + serverLabel + " Tool: " + functionName); writeLog("Approval request: server=" + serverLabel + " tool=" + functionName); state.pendingApproval = new SessionState.ApprovalInfo(approvalId, serverLabel, functionName); if (!state.responseActive) { sendApprovalVoicePrompt(state, session); } else { state.approvalPromptNeeded = true; } } else if (eventJson.contains("\"type\":\"mcp_call\"")) { // Track MCP call items and announce non-approval tool calls String itemId = extractJsonField(eventJson, "id"); String serverLabel = extractJsonField(eventJson, "server_label"); String functionName = extractJsonField(eventJson, "name"); System.out.println("š§ MCP tool call: " + serverLabel + "/" + functionName); state.mcpItemToServer.put(itemId, serverLabel + "/" + functionName); // Announce to the user if this server doesn't require approval if (state.pendingApproval == null && !state.approvalServers.contains(serverLabel)) { sendSystemMessage(session, "Briefly tell the user you're looking something up. One short sentence only.") .then(session.send(BinaryData.fromString("{\"type\":\"response.create\"}"))) .subscribeOn(Schedulers.boundedElastic()) .subscribe(v -> {}, err -> {}); } } } /** * Inject a system message asking the model to verbally request permission. 
*/ private static void sendApprovalVoicePrompt(SessionState state, VoiceLiveSessionAsyncClient session) { SessionState.ApprovalInfo pending = state.pendingApproval; if (pending == null) return; int callCount = state.approvalCallCount.getOrDefault(pending.serverLabel(), 0); state.approvalCallCount.put(pending.serverLabel(), callCount + 1); String prompt; if (callCount == 0) { prompt = "You MUST ask the user for explicit permission before proceeding. " + "Say exactly: \"I'd like to search the " + pending.serverLabel() + " service for information. Do you approve? Please say yes or no.\""; } else { prompt = "You MUST ask the user for permission again. " + "Say exactly: \"I need to do one more search to get complete information. " + "Should I continue? Please say yes or no.\""; } sendSystemMessage(session, prompt) .then(session.send(BinaryData.fromString("{\"type\":\"response.create\"}"))) .subscribeOn(Schedulers.boundedElastic()) .subscribe(v -> {}, err -> System.err.println("ā Failed to send approval voice prompt: " + err.getMessage())); } /** * Interpret the user's spoken response as approval or denial. */ private static void resolveVoiceApproval(String transcript, SessionState state, VoiceLiveSessionAsyncClient session) { SessionState.ApprovalInfo pending = state.pendingApproval; if (pending == null) return; String text = transcript.trim().toLowerCase(); boolean approved = YES_PATTERN.matcher(text).find(); boolean denied = NO_PATTERN.matcher(text).find(); if (!approved && !denied) { // Ambiguous ā ask again at next RESPONSE_DONE state.approvalPromptNeeded = true; return; } if (approved && denied) { approved = false; // conflicting signals ā deny for safety } state.pendingApproval = null; if (approved) { state.approvedServersThisTurn.add(pending.serverLabel()); } else { state.approvalCallCount.clear(); state.approvedServersThisTurn.remove(pending.serverLabel()); } System.out.println(" Voice approval: " + (approved ? 
"Approved ā " : "Denied ā")); writeLog("Approval resolved: " + (approved ? "APPROVED" : "DENIED") + " for " + pending.serverLabel() + "/" + pending.functionName()); // Send approval/denial response via raw JSON. // Chain processNextApproval after the send completes to avoid racing. String approvalJson = String.format( "{\"type\":\"conversation.item.create\",\"item\":" + "{\"type\":\"mcp_approval_response\"," + "\"approval_request_id\":\"%s\"," + "\"approve\":%s}}", pending.approvalId(), approved); session.send(BinaryData.fromString(approvalJson)) .subscribeOn(Schedulers.boundedElastic()) .subscribe( v -> processNextApproval(state, session), error -> { System.err.println("ā Failed to send approval response: " + error.getMessage()); processNextApproval(state, session); } ); } /** * Pop the next queued approval and ask via voice. */ private static void processNextApproval(SessionState state, VoiceLiveSessionAsyncClient session) { SessionState.ApprovalInfo next = state.approvalQueue.poll(); if (next == null) return; // Auto-approve if user already approved this server earlier in the same turn if (state.approvedServersThisTurn.contains(next.serverLabel())) { System.out.println(" Auto-approved (queued): " + next.serverLabel() + "/" + next.functionName()); String approveJson = String.format( "{\"type\":\"conversation.item.create\",\"item\":" + "{\"type\":\"mcp_approval_response\"," + "\"approval_request_id\":\"%s\"," + "\"approve\":true}}", next.approvalId()); session.send(BinaryData.fromString(approveJson)) .subscribeOn(Schedulers.boundedElastic()) .subscribe( v -> processNextApproval(state, session), err -> { System.err.println("Failed to send queued auto-approve: " + err.getMessage()); processNextApproval(state, session); }); return; } state.pendingApproval = next; if (!state.responseActive) { sendApprovalVoicePrompt(state, session); } else { state.approvalPromptNeeded = true; } } // </handle_approval> // <mcp_stall_detection> /** * Start a timer that verbally updates 
the user if an MCP call takes too long. */ private static void startMcpStallTimer(SessionState state, VoiceLiveSessionAsyncClient session) { cancelMcpStallTimer(state); final AtomicInteger stallCount = new AtomicInteger(0); state.mcpStallTimer = SCHEDULER.scheduleAtFixedRate(() -> { if (state.mcpCallInProgress.get() <= 0) { cancelMcpStallTimer(state); return; } int count = stallCount.incrementAndGet(); if (count > 3) { cancelMcpStallTimer(state); return; } // MCP calls cannot be cancelled ā only honest status updates are possible. String msg = "The tool call is still running. " + "Briefly reassure the user that you're still waiting for results. " + "One short sentence only."; sendSystemMessage(session, msg) .then(session.send(BinaryData.fromString("{\"type\":\"response.create\"}"))) .subscribeOn(Schedulers.boundedElastic()) .subscribe(v -> {}, err -> { if (err.getMessage() != null && err.getMessage().toLowerCase().contains("active response")) { state.needsResponseCreate = true; } }); }, 10, 10, TimeUnit.SECONDS); } /** * Cancel the MCP stall timer if running. */ private static void cancelMcpStallTimer(SessionState state) { ScheduledFuture<?> timer = state.mcpStallTimer; if (timer != null && !timer.isDone()) { timer.cancel(false); } state.mcpStallTimer = null; } // </mcp_stall_detection> /** * Send a system message to the model via raw JSON. * Returns a Mono so callers can chain subsequent sends sequentially, * avoiding FAIL_NON_SERIALIZED errors from concurrent sends. */ private static Mono<Void> sendSystemMessage(VoiceLiveSessionAsyncClient session, String text) { String escaped = text.replace("\\", "\\\\").replace("\"", "\\\""); String json = "{\"type\":\"conversation.item.create\",\"item\":" + "{\"type\":\"message\",\"role\":\"system\",\"content\":" + "[{\"type\":\"input_text\",\"text\":\"" + escaped + "\"}]}}"; return session.send(BinaryData.fromString(json)); } /** * Write a line to the conversation log file. 
*/ private static void writeLog(String message) { try { Path logDir = Paths.get("logs"); Files.createDirectories(logDir); try (PrintWriter writer = new PrintWriter( new FileWriter(logDir.resolve(LOG_FILENAME).toString(), true))) { writer.println(message); } } catch (IOException e) { System.err.println("Failed to write conversation log: " + e.getMessage()); } } /** * Extract a simple string field value from a JSON string. */ private static String extractJsonField(String json, String fieldName) { String pattern = "\"" + fieldName + "\":\""; int start = json.indexOf(pattern); if (start < 0) return "unknown"; start += pattern.length(); int end = json.indexOf("\"", start); if (end < 0) return "unknown"; return json.substring(start, end); } private static boolean checkAudioSystem() { try { AudioFormat format = new AudioFormat(SAMPLE_RATE, SAMPLE_SIZE_BITS, CHANNELS, true, false); if (!AudioSystem.isLineSupported(new DataLine.Info(TargetDataLine.class, format))) { System.err.println("ā No compatible microphone found"); return false; } if (!AudioSystem.isLineSupported(new DataLine.Info(SourceDataLine.class, format))) { System.err.println("ā No compatible speaker found"); return false; } System.out.println("ā Audio system check passed"); return true; } catch (Exception e) { System.err.println("ā Audio system check failed: " + e.getMessage()); return false; } } public static void main(String[] args) { Config config = Config.load(args); if (config.endpoint == null) { System.err.println("ā Missing endpoint. Set AZURE_VOICELIVE_ENDPOINT or pass --endpoint."); return; } if (!config.useTokenCredential && config.apiKey == null) { System.err.println("ā No authentication. 
Set AZURE_VOICELIVE_API_KEY or use --use-token-credential."); return; } if (!checkAudioSystem()) return; System.out.println("šļø Starting Voice Assistant with MCP..."); // Session state for voice-based MCP approval flow SessionState state = new SessionState(); state.approvalServers = Set.of("azure_doc"); try { VoiceLiveAsyncClient client; if (config.useTokenCredential) { TokenCredential credential = new AzureCliCredentialBuilder().build(); client = new VoiceLiveClientBuilder() .endpoint(config.endpoint) .credential(credential) .serviceVersion(VoiceLiveServiceVersion.V2026_01_01_PREVIEW) .buildAsyncClient(); System.out.println("š Using Token Credential authentication"); } else { client = new VoiceLiveClientBuilder() .endpoint(config.endpoint) .credential(new KeyCredential(config.apiKey)) .serviceVersion(VoiceLiveServiceVersion.V2026_01_01_PREVIEW) .buildAsyncClient(); System.out.println("š Using API Key authentication"); } VoiceLiveSessionOptions sessionOptions = createSessionOptions(config); AtomicReference<AudioProcessor> audioProcessorRef = new AtomicReference<>(); client.startSession(config.model) .flatMap(session -> { System.out.println("ā Session started"); AudioProcessor audioProcessor = new AudioProcessor(session); audioProcessorRef.set(audioProcessor); session.receiveEvents() .subscribe( event -> handleServerEvent(event, audioProcessor, state, session), error -> System.err.println("ā Event error: " + error.getMessage()) ); ClientEventSessionUpdate updateEvent = new ClientEventSessionUpdate(sessionOptions); session.sendEvent(updateEvent).subscribe(); audioProcessor.startPlayback(); System.out.println(); System.out.println("=".repeat(70)); System.out.println("š¤ VOICE ASSISTANT WITH MCP READY"); System.out.println("Try saying:"); System.out.println(" ⢠'Can you summarize the GitHub repo azure-sdk-for-java?'"); System.out.println(" ⢠'Search the Azure documentation for Voice Live API.'"); System.out.println("Approve MCP tool calls by voice ā say 'yes' or 'no' 
when asked."); System.out.println("Press Ctrl+C to exit"); System.out.println("=".repeat(70)); System.out.println(); Runtime.getRuntime().addShutdownHook(new Thread(() -> { System.out.println("\nShutting down..."); audioProcessor.shutdown(); SCHEDULER.shutdownNow(); })); return Mono.never(); }) .doFinally(signalType -> { AudioProcessor ap = audioProcessorRef.get(); if (ap != null) ap.shutdown(); SCHEDULER.shutdownNow(); }) .block(); } catch (Exception e) { System.err.println("Fatal error: " + e.getMessage()); } } }
Sign in to Azure with the following command:
az login
Build and run the application:
mvn compile exec:java -Dexec.mainClass="MCPQuickstart" -q
Speak into your microphone. Try asking questions like "What tools do you have?" or "Search the Azure documentation for Voice Live API."
- For the deepwiki server (requireApproval="never"), tool calls execute automatically.
- For the azure_doc server (requireApproval="always"), the assistant asks you by voice to approve each tool call. Say "yes" or "no" when asked.
Press Ctrl+C to stop the session.
MCP server configuration reference
| Parameter | Required | Description |
|---|---|---|
| serverLabel | Yes | Display name for the MCP server. |
| serverUrl | Yes | URL of the remote MCP endpoint. |
| allowedTools | No | List of tool names the model can call. If omitted, all tools are allowed. |
| requireApproval | No | "never", "always" (default), or a per-tool dictionary. |
| headers | No | Extra HTTP headers to include in MCP requests. |
| authorization | No | Authorization token for MCP requests. |
For the complete REST API type definition, see MCPTool in the Voice Live API reference.
Learn how to connect remote MCP servers to a Voice Live session using the VoiceLive SDK for JavaScript. This article builds on the Quickstart: Create a Voice Live real-time voice agent with MCP server integration.
Reference documentation | Package (npm) | Additional samples on GitHub
Follow the how-to below or get the full sample code:
Note
The JavaScript Voice Live SDK is designed for browser-based applications with built-in WebSocket and Web Audio support. This how-to guide uses Node.js with node-record-lpcm16 and speaker for a console experience.
Prerequisites
- An Azure subscription. Create one for free.
- Node.js version 18 or later.
- SoX installed on your system (required by node-record-lpcm16 for microphone capture).
- A Microsoft Foundry resource created in one of the supported regions. For more information about region availability, see the Voice Live overview documentation.
- @azure/ai-voicelive package version 1.0.0 or later (MCP support requires API version 2026-04-10).
- Assign the Cognitive Services User role to your user account. You can assign roles in the Azure portal under Access control (IAM) > Add role assignment.
Tip
To use Voice Live with MCP, you don't need to deploy an audio model with your Foundry resource. Voice Live is fully managed, and the model is automatically deployed for you. For more information about model availability, see the Voice Live overview documentation.
Prepare the environment
Complete the Voice Live quickstart to set up your environment, configure authentication, and test your first Voice Live conversation.
MCP integration concepts
MCP server definition
Use an MCP server object with type: "mcp" to declare each remote MCP endpoint. At minimum, provide server_label (a display name) and server_url (the MCP endpoint URL). Optionally restrict available tools with allowed_tools and configure the approval mode.
Approval modes
Control whether MCP tool calls require user approval before execution:
- require_approval: "never": The tool executes automatically when the model invokes it.
- require_approval: "always" (default): The client receives an approval request and must respond before the tool runs.
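As a rough illustration of the "always" mode, the client answers an approval request by sending a conversation item of type mcp_approval_response. The sketch below builds that event as a plain object, using the same wire shape that the Java sample in this article sends as raw JSON. The helper name buildApprovalResponse and the placeholder request ID are hypothetical and not part of the SDK.

```javascript
// Hypothetical helper (not an SDK API): builds the client event that answers
// an MCP approval request. The field names mirror the raw JSON used by the
// Java sample in this article.
function buildApprovalResponse(approvalRequestId, approve) {
  if (typeof approvalRequestId !== "string" || approvalRequestId.length === 0) {
    throw new TypeError("approvalRequestId must be a non-empty string");
  }
  return {
    type: "conversation.item.create",
    item: {
      type: "mcp_approval_response",
      approval_request_id: approvalRequestId,
      approve: Boolean(approve),
    },
  };
}

// Example: deny a request. "approval-123" is a made-up placeholder ID.
const denial = buildApprovalResponse("approval-123", false);
console.log(JSON.stringify(denial));
```

An object like this could presumably be passed to session.sendEvent(...), which the sample in this article uses for other raw events; treat that as an assumption and check the SDK reference for a dedicated method.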
API version requirement
MCP support requires API version 2026-04-10 or later.
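A minimal sketch of a version guard, assuming API versions are date strings of the form YYYY-MM-DD with an optional suffix such as -preview. The function name supportsMcp and the way the suffix is ignored are assumptions for illustration, not SDK behavior.

```javascript
// Minimum API version for MCP support, as stated in this article.
const MIN_MCP_API_VERSION = "2026-04-10";

// Hypothetical helper: returns true when the date part of apiVersion is on or
// after the minimum. ISO-style dates compare correctly as plain strings.
function supportsMcp(apiVersion) {
  const match = /^(\d{4}-\d{2}-\d{2})/.exec(String(apiVersion));
  if (!match) {
    throw new Error(`Unrecognized API version format: ${apiVersion}`);
  }
  return match[1] >= MIN_MCP_API_VERSION;
}

console.log(supportsMcp("2026-04-10")); // true
console.log(supportsMcp("2025-10-01")); // false
```

A check like this can fail fast at startup instead of surfacing as a missing-tool problem in the middle of a voice session.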
Define MCP servers
Define the MCP servers that Voice Live can use during the session. Each server is an MCP server object added to the tools list in the session configuration.
The following code defines two MCP servers: one with automatic tool execution and one that requires user approval before running.
/**
 * Define MCP servers that Voice Live can use during the session.
 * Each server is an MCPTool object added to the session tools array.
 */
function defineMCPServers() {
  return [
    {
      type: "mcp",
      serverLabel: "deepwiki",
      serverUrl: "https://mcp.deepwiki.com/mcp",
      allowedTools: ["read_wiki_structure", "ask_question"],
      requireApproval: "never",
    },
    {
      type: "mcp",
      serverLabel: "azure_doc",
      serverUrl: "https://learn.microsoft.com/api/mcp",
      requireApproval: "always",
    },
  ];
}
In this sample:
- The deepwiki server allows only the read_wiki_structure and ask_question tools, with require_approval set to "never" for automatic execution.
- The azure_doc server allows all tools on the endpoint, with require_approval set to "always" so users can review each tool call before execution.
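To make the effect of allowedTools concrete, here is a small sketch of the filtering rule: when allowedTools is omitted, every discovered tool is available; otherwise only the listed names are. Voice Live handles tool discovery and execution for you, so this function only illustrates the rule. The name effectiveTools and the tool name read_wiki_contents are hypothetical.

```javascript
// Hypothetical helper: which discovered tools may the model call on a server?
function effectiveTools(server, discoveredTools) {
  if (!Array.isArray(server.allowedTools)) {
    return [...discoveredTools]; // allowedTools omitted: all tools are allowed
  }
  const allowed = new Set(server.allowedTools);
  return discoveredTools.filter((name) => allowed.has(name));
}

const deepwiki = {
  type: "mcp",
  serverLabel: "deepwiki",
  serverUrl: "https://mcp.deepwiki.com/mcp",
  allowedTools: ["read_wiki_structure", "ask_question"],
  requireApproval: "never",
};

// "read_wiki_contents" is a made-up tool name used only for this example.
const discovered = ["read_wiki_structure", "read_wiki_contents", "ask_question"];

// Keeps read_wiki_structure and ask_question, drops read_wiki_contents.
console.log(effectiveTools(deepwiki, discovered));
```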
Configure the session with MCP tools
Pass the MCP server definitions to the session configuration alongside your voice, modality, and turn-detection settings.
/**
 * Configure the session with MCP servers in the tools list.
 */
async _setupSession() {
  console.log("[session] Configuring session with MCP tools...");
  const mcpServers = defineMCPServers();
  this._approvalServers = new Set(
    mcpServers.filter(s => s.requireApproval === "always").map(s => s.serverLabel)
  );
  await this._session.updateSession({
    model: this.model,
    modalities: ["text", "audio"],
    instructions: this.instructions,
    voice: resolveVoiceConfig(this.voice),
    inputAudioFormat: "pcm16",
    outputAudioFormat: "pcm16",
    turnDetection: {
      type: "server_vad",
      threshold: 0.5,
      prefixPaddingInMs: 300,
      silenceDurationInMs: 500,
    },
    inputAudioEchoCancellation: { type: "server_echo_cancellation" },
    inputAudioNoiseReduction: { type: "azure_deep_noise_suppression" },
    inputAudioTranscription: { model: this.model.toLowerCase().includes("realtime") ? "whisper-1" : "azure-speech" },
    tools: mcpServers,
  });
  console.log("[session] Session configuration with MCP tools sent");
}
In this sample:
- The session configuration bundles MCP tools with audio format, voice, and turn detection settings.
- session.updateSession(...) sends the full configuration to Voice Live.
- Voice Live automatically discovers available tools from each MCP server after the session starts.
Handle MCP events
Process MCP-specific events in the event loop. The key events include MCP tool call creation, completion, failure, and approval requests.
/**
* Subscribe to session events, including MCP-specific events.
*/
_subscribeToEvents(session) {
this._subscription = session.subscribe({
onSessionUpdated: async (event, context) => {
const s = event.session;
const model = s?.model;
const voice = s?.voice;
console.log(`[session] Session ready: ${context.sessionId}`);
console.log(
` Model: ${typeof model === "string" ? model : model?.toString?.() ?? ""}`,
);
console.log(` Voice: ${voice?.name ?? ""}`);
writeConversationLog(
[
`SessionID: ${context.sessionId}`,
`Model: ${typeof model === "string" ? model : model?.toString?.() ?? ""}`,
`Voice Name: ${voice?.name ?? ""}`,
`Voice Type: ${voice?.type ?? ""}`,
`Log File: ${conversationLogFile}`,
"",
].join("\n"),
);
},
onConversationItemInputAudioTranscriptionCompleted: async (event) => {
const transcript = event.transcript ?? "";
console.log(`š¤ You said:\t${transcript}`);
writeConversationLog(`User Input:\t${transcript}`);
if (this._pendingApproval !== null) {
await this._resolveVoiceApproval(transcript, session);
}
},
onResponseTextDone: async (event) => {
const text = event.text ?? "";
console.log(`š¤ Assistant text:\t${text}`);
writeConversationLog(`Assistant Text Response:\t${text}`);
},
onResponseAudioTranscriptDone: async (event) => {
const transcript = event.transcript ?? "";
console.log(`🤖 Assistant audio transcript:\t${transcript}`);
writeConversationLog(`Assistant Audio Response:\t${transcript}`);
},
onInputAudioBufferSpeechStarted: async () => {
console.log("🎤 Listening...");
this._audio.skipPendingAudio();
// Do NOT reset _approvalCallCount here. The counter should only
// reset on task completion (in onResponseMcpCallCompleted when no
// pending/queued approvals remain) or on denial (in _resolveVoiceApproval).
// Resetting on every speech-start would let the model retry denied calls.
// Clear ALL deferred response flags on barge-in.
// This prevents onResponseDone (fired by the cancelled response)
// from immediately creating a new response that overlaps the user.
this._needsResponseCreate = false;
this._mcpResultsPending = false;
// Reset approved-servers-this-turn when user starts a new topic
if (this._pendingApproval === null && this._mcpCallInProgress <= 0) {
this._approvedServersThisTurn.clear();
}
if (this._activeResponse && !this._responseApiDone) {
// Mark barge-in so onResponseDone skips deferred actions
this._bargeInActive = true;
try {
await session.sendEvent({ type: "response.cancel" });
} catch (err) {
const msg = err?.message ?? "";
if (!msg.toLowerCase().includes("no active response")) {
console.warn("[barge-in] Cancel failed:", msg);
}
}
try {
await session.sendEvent({ type: "input_audio_buffer.clear" });
} catch { /* best-effort */ }
}
if (this._mcpCallInProgress > 0 && this._pendingApproval === null) {
this._staleMcpItems = new Set([...this._staleMcpItems, ...this._activeMcpItems]);
console.log(`[barge-in] Marking ${this._activeMcpItems.size} MCP calls as stale`);
try {
await session.addConversationItem({ type: "message", role: "system", content: [{ type: "input_text", text: "A tool call is still running in the background. The user just spoke. Respond to what the user said. If a tool result arrives later, briefly introduce it as a late result from an earlier request." }] });
} catch {}
}
},
onInputAudioBufferSpeechStopped: async () => {
console.log("🤔 Processing...");
},
onResponseCreated: async () => {
this._activeResponse = true;
this._responseApiDone = false;
},
onResponseAudioDelta: async (event) => {
if (event.delta) {
this._audio.queueAudio(event.delta);
}
},
onResponseAudioDone: async () => {
console.log("🎤 Ready for next input...");
},
onResponseDone: async () => {
console.log("✅ Response complete");
writeConversationLog("--- Response complete ---");
this._activeResponse = false;
this._responseApiDone = true;
// If this response.done is the result of a barge-in cancel,
// skip all deferred actions; the user's new turn will handle things.
if (this._bargeInActive) {
this._bargeInActive = false;
return;
}
if (this._approvalPromptNeeded && this._pendingApproval !== null) {
this._approvalPromptNeeded = false;
await this._sendApprovalVoicePrompt(session);
} else if (this._mcpResultsPending && this._mcpCallInProgress <= 0 && this._pendingApproval === null) {
this._mcpResultsPending = false;
try { await session.sendEvent({ type: "response.create" }); } catch {}
} else if (this._needsResponseCreate) {
this._needsResponseCreate = false;
try { await session.sendEvent({ type: "response.create" }); } catch {}
}
},
onServerError: async (event) => {
const msg = event.error?.message ?? "";
// Reset response state: errors can terminate a response without onResponseDone
this._activeResponse = false;
this._responseApiDone = true;
if (msg.includes("Cancellation failed: no active response")) return;
if (msg.toLowerCase().includes("interim response")) {
console.log("[session] Interim response not supported (non-fatal)");
return;
}
if (msg.toLowerCase().includes("active response")) return;
console.error(`❌ VoiceLive error: ${msg}`);
writeConversationLog(`ERROR: ${msg}`);
},
// MCP-specific event handlers
onMcpListToolsCompleted: async (event) => {
console.log(`🔧 MCP tools discovered successfully`);
writeConversationLog("MCP tools discovered successfully");
},
onMcpListToolsFailed: async (event) => {
console.error(`❌ MCP tool discovery failed`);
writeConversationLog("ERROR: MCP tool discovery failed");
},
onResponseMcpCallInProgress: async (event) => {
console.log("⏳ MCP tool call in progress...");
writeConversationLog(`MCP call in progress: ${event.item_id ?? ""}`);
this._mcpCallInProgress++;
this._activeMcpItems.add(event.item_id);
this._startMcpStallTimer(session);
},
onResponseMcpCallArgumentsDone: async (event) => {
const name = event.name ?? "";
console.log(`MCP tool call arguments ready: ${name}`);
},
onResponseMcpCallCompleted: async (event) => {
const itemId = event.item_id ?? "";
this._mcpCallInProgress = Math.max(0, this._mcpCallInProgress - 1);
this._activeMcpItems.delete(itemId);
this._cancelMcpStallTimer();
if (this._handledMcpCompletions.has(itemId)) return;
this._handledMcpCompletions.add(itemId);
const isStale = this._staleMcpItems.has(itemId);
this._staleMcpItems.delete(itemId);
console.log(`✅ MCP tool call completed (stale=${isStale})`);
writeConversationLog(`MCP call completed: ${itemId} (stale=${isStale})`);
delete this._mcpItemToServer[itemId];
if (this._pendingApproval === null && this._approvalQueue.length === 0) {
this._approvalCallCount = {};
}
if (isStale) {
try {
await session.addConversationItem({ type: "message", role: "system", content: [{ type: "input_text", text: "This tool result is from an earlier request. The user has since moved on. Briefly introduce it as a late result, e.g. 'By the way, those results from earlier just came in...' then share the key findings concisely." }] });
} catch {}
}
// Batch response: only call response.create when ALL MCP calls for this
// turn have completed. This prevents partial results and repeated tool calls.
if (this._mcpCallInProgress <= 0 && this._pendingApproval === null && this._approvalQueue.length === 0) {
try {
await session.sendEvent({ type: "response.create" });
} catch (e) {
if (e?.message?.toLowerCase().includes("active response")) {
this._needsResponseCreate = true;
}
}
} else {
this._mcpResultsPending = true;
console.log(`[mcp] MCP calls still in progress (${this._mcpCallInProgress}); deferring response`);
}
},
onResponseMcpCallFailed: async (event) => {
const itemId = event.item_id ?? "";
console.error("❌ MCP tool call failed");
writeConversationLog(`ERROR: MCP call failed: ${itemId}`);
this._mcpCallInProgress = Math.max(0, this._mcpCallInProgress - 1);
this._activeMcpItems.delete(itemId);
this._staleMcpItems.delete(itemId);
this._cancelMcpStallTimer();
try { await session.sendEvent({ type: "response.create" }); } catch {}
},
onConversationItemCreated: async (event) => {
const item = event.item;
if (item?.type === "mcp_call") {
const sl = item.serverLabel ?? item.server_label ?? "";
const fn = item.name ?? "";
this._mcpItemToServer[item.id] = `${sl}/${fn}`;
console.log(`🔧 MCP tool call: ${sl}/${fn}`);
writeConversationLog(`MCP tool call: ${sl}/${fn} (id=${item.id})`);
if (!this._pendingApproval && !this._approvalServers.has(sl)) {
try {
await session.addConversationItem({ type: "message", role: "system", content: [{ type: "input_text", text: "Briefly tell the user you're looking something up. One short sentence only." }] });
await session.sendEvent({ type: "response.create" });
} catch {}
}
}
if (item?.type === "mcp_approval_request") {
writeConversationLog(`MCP approval request: ${item.serverLabel ?? item.server_label ?? ""} / ${item.name ?? ""} (id=${item.id ?? ""})`);
await this._handleApprovalRequest(item, session);
}
},
});
}
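The handlers above track in-flight MCP calls with a counter and two sets, which can be hard to follow inline. The following is a simplified sketch of the same bookkeeping in isolation, assuming calls are identified by their conversation item ID. `McpCallTracker` and its method names are hypothetical and are not part of the Voice Live SDK.

```javascript
// Hypothetical tracker for in-flight MCP calls, keyed by conversation item ID.
class McpCallTracker {
  constructor() {
    this.active = new Set(); // calls that started and have not finished
    this.stale = new Set();  // calls the user talked over (barge-in)
  }

  // Record a call when its "in progress" event arrives.
  start(itemId) {
    this.active.add(itemId);
  }

  // On barge-in, every call still running belongs to an earlier request.
  markActiveAsStale() {
    for (const id of this.active) this.stale.add(id);
  }

  // Record completion or failure and report how the caller should react.
  finish(itemId) {
    this.active.delete(itemId);
    const wasStale = this.stale.delete(itemId);
    return { wasStale, allDone: this.active.size === 0 };
  }
}

const tracker = new McpCallTracker();
tracker.start("item_1");
tracker.start("item_2");
tracker.markActiveAsStale(); // the user interrupts while both calls run
tracker.start("item_3");     // a new call for the new request
const first = tracker.finish("item_1");
const second = tracker.finish("item_2");
const third = tracker.finish("item_3");
console.log(first, second, third);
```

In this sketch, a caller would introduce a result as late when `wasStale` is true and would send `response.create` only when `allDone` is true, which mirrors the batching rule in `onResponseMcpCallCompleted`.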
Handle approval requests
When a server is configured with require_approval: "always", client code must handle the approval flow. Instead of blocking on readline, the sample injects a system message so the model asks the user verbally. The user's spoken transcript is then parsed for intent using word-boundary regex (\byes\b, \b(no|stop|cancel)\b).
/**
* Handle MCP approval requests via voice-based approval flow.
*/
async _handleApprovalRequest(item, session) {
const approvalId = item.id ?? "unknown";
const serverLabel = item.serverLabel ?? item.server_label ?? "unknown";
const functionName = item.name ?? "unknown";
console.log();
console.log("MCP Approval Request");
console.log(` Server: ${serverLabel}`);
console.log(` Tool: ${functionName}`);
console.log(` Approval ID: ${approvalId}`);
const MAX_APPROVAL_CALLS_PER_TASK = 3;
const currentCount = this._approvalCallCount[serverLabel] ?? 0;
if (currentCount >= MAX_APPROVAL_CALLS_PER_TASK) {
console.log(` Auto-denied: ${serverLabel}/${functionName} (max ${MAX_APPROVAL_CALLS_PER_TASK} calls reached)`);
try {
await session.addConversationItem({
type: "mcp_approval_response",
approvalRequestId: approvalId,
approve: false,
});
} catch (err) {
console.warn("Failed to send auto-deny:", err?.message ?? err);
}
return;
}
// Auto-approve if user already approved this server earlier in the same turn
if (this._approvedServersThisTurn.has(serverLabel)) {
console.log(` Auto-approved: ${serverLabel}/${functionName} (already approved this turn)`);
try {
await session.addConversationItem({
type: "mcp_approval_response",
approvalRequestId: approvalId,
approve: true,
});
} catch (err) {
console.warn("Failed to send auto-approve:", err?.message ?? err);
}
return;
}
if (this._pendingApproval !== null) {
this._approvalQueue.push({ approvalId, serverLabel, functionName });
console.log(" (queued: another approval is pending)");
return;
}
this._pendingApproval = { approvalId, serverLabel, functionName };
if (!this._activeResponse) {
await this._sendApprovalVoicePrompt(session);
} else {
this._approvalPromptNeeded = true;
}
}
async _sendApprovalVoicePrompt(session) {
const pending = this._pendingApproval;
if (!pending) return;
const server = pending.serverLabel;
const count = this._approvalCallCount[server] ?? 0;
this._approvalCallCount[server] = count + 1;
let prompt;
if (count === 0) {
prompt = `You MUST ask the user for explicit permission before proceeding. Say exactly: "I'd like to search the ${server} service for information. Do you approve? Please say yes or no."`;
} else {
prompt = `You MUST ask the user for permission again. Say exactly: "I need to do one more search to get complete information. Should I continue? Please say yes or no."`;
}
try {
await session.addConversationItem({
type: "message",
role: "system",
content: [{ type: "input_text", text: prompt }],
});
await session.sendEvent({ type: "response.create" });
} catch (err) {
console.error("❌ Failed to send approval voice prompt:", err?.message ?? err);
}
}
async _resolveVoiceApproval(transcript, session) {
if (this._pendingApproval === null) return;
const lower = transcript.toLowerCase();
let approved = /\byes\b/.test(lower);
const denied = /\b(no|stop|cancel)\b/.test(lower);
if (!approved && !denied) {
// Ambiguous: will re-prompt at next response.done
this._approvalPromptNeeded = true;
return;
}
if (approved && denied) {
approved = false; // Conflicting signals: deny for safety
}
const { approvalId, serverLabel } = this._pendingApproval;
console.log(` Voice response: ${approved ? "Approved ✅" : "Denied ❌"}`);
writeConversationLog(`Voice approval: ${approved ? "Approved" : "Denied"} for ${serverLabel}`);
this._pendingApproval = null;
if (approved) {
this._approvedServersThisTurn.add(serverLabel);
} else {
this._approvalCallCount = {};
this._approvedServersThisTurn.delete(serverLabel);
}
try {
await session.addConversationItem({
type: "mcp_approval_response",
approvalRequestId: approvalId,
approve: approved,
});
} catch (err) {
console.error("❌ Failed to send approval response:", err?.message ?? err);
}
await this._processNextApproval(session);
}
async _processNextApproval(session) {
if (this._approvalQueue.length === 0) return;
const next = this._approvalQueue.shift();
// Auto-approve if user already approved this server earlier in the same turn
if (this._approvedServersThisTurn.has(next.serverLabel)) {
console.log(` Auto-approved (queued): ${next.serverLabel}/${next.functionName}`);
try {
await session.addConversationItem({
type: "mcp_approval_response",
approvalRequestId: next.approvalId,
approve: true,
});
} catch (err) {
console.warn("Failed to send queued auto-approve:", err?.message ?? err);
}
await this._processNextApproval(session);
return;
}
this._pendingApproval = next;
if (!this._activeResponse) {
await this._sendApprovalVoicePrompt(session);
} else {
this._approvalPromptNeeded = true;
}
}
In this sample:
- A system message instructs the model to verbally ask for permission.
- mcp_approval_response sends the decision back to Voice Live with approve: true or approve: false.
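The order of checks in `_handleApprovalRequest` matters: the call limit is tested first, then same-turn approval, then the queue. The following is a minimal sketch of that order as a pure function. `decideApproval` and the shape of its `state` argument are hypothetical and exist only for illustration.

```javascript
const MAX_APPROVAL_CALLS_PER_TASK = 3; // same limit the sample uses

// Hypothetical pure function: decide what to do with an incoming approval request.
// state = { promptCounts: { [serverLabel]: number }, approvedThisTurn: Set, pending: object | null }
function decideApproval(state, request) {
  const count = state.promptCounts[request.serverLabel] ?? 0;
  if (count >= MAX_APPROVAL_CALLS_PER_TASK) return "auto-deny";
  if (state.approvedThisTurn.has(request.serverLabel)) return "auto-approve";
  if (state.pending !== null) return "queue";
  return "prompt";
}

const state = { promptCounts: { azure_doc: 1 }, approvedThisTurn: new Set(), pending: null };
const request = { serverLabel: "azure_doc" };

console.log(decideApproval(state, request)); // prompt

state.pending = { serverLabel: "other_server" };
console.log(decideApproval(state, request)); // queue

state.approvedThisTurn.add("azure_doc");
console.log(decideApproval(state, request)); // auto-approve

state.promptCounts.azure_doc = 3;
console.log(decideApproval(state, request)); // auto-deny
```

Because the limit is checked before same-turn approval, a server that has reached the limit is denied even if the user approved it earlier in the turn.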
Resolve voice-based approval
Parse the user's spoken transcript to determine approval. Use word-boundary regex to avoid false positives from words like "yesterday" or "nobody".
async _resolveVoiceApproval(transcript, session) {
if (this._pendingApproval === null) return;
const lower = transcript.toLowerCase();
let approved = /\byes\b/.test(lower);
const denied = /\b(no|stop|cancel)\b/.test(lower);
if (!approved && !denied) {
// Ambiguous: will re-prompt at next response.done
this._approvalPromptNeeded = true;
return;
}
if (approved && denied) {
approved = false; // Conflicting signals: deny for safety
}
const { approvalId, serverLabel } = this._pendingApproval;
console.log(` Voice response: ${approved ? "Approved ✅" : "Denied ❌"}`);
writeConversationLog(`Voice approval: ${approved ? "Approved" : "Denied"} for ${serverLabel}`);
this._pendingApproval = null;
if (approved) {
this._approvedServersThisTurn.add(serverLabel);
} else {
this._approvalCallCount = {};
this._approvedServersThisTurn.delete(serverLabel);
}
try {
await session.addConversationItem({
type: "mcp_approval_response",
approvalRequestId: approvalId,
approve: approved,
});
} catch (err) {
console.error("❌ Failed to send approval response:", err?.message ?? err);
}
await this._processNextApproval(session);
}
async _processNextApproval(session) {
if (this._approvalQueue.length === 0) return;
const next = this._approvalQueue.shift();
// Auto-approve if user already approved this server earlier in the same turn
if (this._approvedServersThisTurn.has(next.serverLabel)) {
console.log(` Auto-approved (queued): ${next.serverLabel}/${next.functionName}`);
try {
await session.addConversationItem({
type: "mcp_approval_response",
approvalRequestId: next.approvalId,
approve: true,
});
} catch (err) {
console.warn("Failed to send queued auto-approve:", err?.message ?? err);
}
await this._processNextApproval(session);
return;
}
this._pendingApproval = next;
if (!this._activeResponse) {
await this._sendApprovalVoicePrompt(session);
} else {
this._approvalPromptNeeded = true;
}
}
In this sample:
- The transcript from conversation.item.input_audio_transcription.completed is matched against \byes\b and \b(no|stop|cancel)\b patterns.
- Subsequent calls to the same server within the same turn are auto-approved to avoid repeated prompts.
- After a configurable maximum (for example, 3 approvals), further calls are auto-denied and the model responds with what it has.
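To see how the word-boundary patterns behave on their own, the following sketch isolates the intent check from the session logic. It uses the same two regular expressions as the sample. `classifyApproval` is a hypothetical helper name.

```javascript
// Hypothetical helper: classify a spoken reply as approved, denied, or ambiguous.
function classifyApproval(transcript) {
  const lower = transcript.toLowerCase();
  const approved = /\byes\b/.test(lower);
  const denied = /\b(no|stop|cancel)\b/.test(lower);
  if (!approved && !denied) return "ambiguous"; // caller should ask again
  if (approved && denied) return "denied";      // conflicting signals: deny for safety
  return approved ? "approved" : "denied";
}

console.log(classifyApproval("Yes, go ahead"));           // approved
console.log(classifyApproval("No, stop."));               // denied
console.log(classifyApproval("Yes... actually no"));      // denied
console.log(classifyApproval("Nobody called yesterday")); // ambiguous
```

These patterns match only the literal words "yes", "no", "stop", and "cancel". Replies such as "sure" or "okay" are classified as ambiguous, so the user is asked again.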
Detect stalls during MCP tool calls
MCP tool calls can take several seconds. Use a repeating timer to proactively inform the user that the assistant is still waiting for results.
_startMcpStallTimer(session) {
this._cancelMcpStallTimer();
let stallCount = 0;
const MCP_STALL_MAX_NOTIFICATIONS = 3;
this._mcpStallTimer = setInterval(async () => {
if (this._mcpCallInProgress <= 0) {
this._cancelMcpStallTimer();
return;
}
stallCount++;
if (stallCount > MCP_STALL_MAX_NOTIFICATIONS) {
this._cancelMcpStallTimer();
return;
}
// MCP calls cannot be cancelled ā only honest status updates are possible.
const msg = "The tool call is still running. Briefly reassure the user that you're still waiting for results. One short sentence only.";
try {
await session.addConversationItem({ type: "message", role: "system", content: [{ type: "input_text", text: msg }] });
await session.sendEvent({ type: "response.create" });
} catch (e) {
if (e?.message?.toLowerCase().includes("active response")) {
this._needsResponseCreate = true;
}
}
}, 10000);
}
_cancelMcpStallTimer() {
if (this._mcpStallTimer) {
clearInterval(this._mcpStallTimer);
this._mcpStallTimer = null;
}
}
In this sample:
- A setInterval timer fires at a 10-second interval, injecting system messages up to 3 times.
- The timer is cancelled when the MCP call completes or the user interrupts with barge-in.
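The stop conditions of the stall timer are easier to verify when they are separated from the timer itself. The following sketch, under that assumption, moves them into a function that a `setInterval` callback could call. `createStallNotifier` and its option names are hypothetical.

```javascript
// Hypothetical helper: returns a tick() function intended to be called from setInterval.
// tick() returns false once the timer should be cleared.
function createStallNotifier({ maxNotifications, isCallActive, notify }) {
  let sent = 0;
  return function tick() {
    if (!isCallActive()) return false;          // the call finished: stop
    if (sent >= maxNotifications) return false; // notification budget spent: stop
    sent++;
    notify(sent);
    return true;
  };
}

// Drive the notifier by hand instead of waiting for real timer ticks.
const messages = [];
let active = true;
const tick = createStallNotifier({
  maxNotifications: 3,
  isCallActive: () => active,
  notify: (n) => messages.push(`still waiting (${n})`),
});

const results = [tick(), tick(), tick(), tick()];
console.log(results, messages.length); // [ true, true, true, false ] 3
```

In an application, the function could be wired up as `const timer = setInterval(() => { if (!tick()) clearInterval(timer); }, 10000);`, with `notify` injecting the system message and requesting a response.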
Run the sample
Create the mcp-quickstart.js file with the following code:
// Copyright (c) Microsoft Corporation. All rights reserved. // Licensed under the MIT License. import "dotenv/config"; import { VoiceLiveClient } from "@azure/ai-voicelive"; import { AzureKeyCredential } from "@azure/core-auth"; import { DefaultAzureCredential } from "@azure/identity"; import { spawn } from "node:child_process"; import { existsSync, mkdirSync, appendFileSync } from "node:fs"; import { join, dirname } from "node:path"; import { fileURLToPath } from "node:url"; const __dirname = dirname(fileURLToPath(import.meta.url)); const logsDir = join(__dirname, "logs"); if (!existsSync(logsDir)) mkdirSync(logsDir, { recursive: true }); const timestamp = new Date() .toISOString() .replace(/[:.]/g, "-") .replace("T", "_") .slice(0, 19); const conversationLogFile = join(logsDir, `conversation_${timestamp}.log`); function writeConversationLog(message) { appendFileSync(conversationLogFile, message + "\n", "utf-8"); } function printUsage() { console.log("Usage: node mcp-quickstart.js [options]"); console.log(""); console.log("Options:"); console.log(" --api-key <key> VoiceLive API key"); console.log(" --endpoint <url> VoiceLive endpoint URL"); console.log(" --model <name> Model to use (default: gpt-realtime)"); console.log( " --voice <name> Voice (default: en-US-Ava:DragonHDLatestNeural)", ); console.log(" --instructions <text> System instructions for the assistant"); console.log(" --audio-input-device <name> Explicit SoX input device name (Windows)"); console.log(" --list-audio-devices List available audio input devices and exit"); console.log(" --use-token-credential Use Azure credential instead of API key"); console.log(" --no-audio Connect and configure session without mic/speaker"); console.log(" -h, --help Show this help text"); } function parseArguments(argv) { const parsed = { apiKey: process.env.AZURE_VOICELIVE_API_KEY, endpoint: process.env.AZURE_VOICELIVE_ENDPOINT, model: process.env.AZURE_VOICELIVE_MODEL ?? 
"gpt-realtime", voice: process.env.AZURE_VOICELIVE_VOICE ?? "en-US-Ava:DragonHDLatestNeural", instructions: process.env.AZURE_VOICELIVE_INSTRUCTIONS ?? "You are a helpful AI assistant with access to MCP tools. Always respond in English. When a user asks a question, use the appropriate tool once to find information, then summarize the results conversationally. IMPORTANT: Never call the same tool more than once per user question. After receiving a tool result, always respond to the user with what you found ā do not search again. Some tools require user approval before they can be used. When you receive a system message asking you to request permission, you MUST clearly ask the user for their explicit approval before proceeding. Always wait for the user to say yes or no. Never skip the approval question or assume permission is granted. If a tool result arrives after the conversation has moved to a different topic, briefly introduce it as a late result before sharing the findings.", audioInputDevice: process.env.AUDIO_INPUT_DEVICE, listAudioDevices: false, useTokenCredential: false, noAudio: false, help: false, }; for (let i = 0; i < argv.length; i++) { const arg = argv[i]; switch (arg) { case "--api-key": parsed.apiKey = argv[++i]; break; case "--endpoint": parsed.endpoint = argv[++i]; break; case "--model": parsed.model = argv[++i]; break; case "--voice": parsed.voice = argv[++i]; break; case "--instructions": parsed.instructions = argv[++i]; break; case "--audio-input-device": parsed.audioInputDevice = argv[++i]; break; case "--list-audio-devices": parsed.listAudioDevices = true; break; case "--use-token-credential": parsed.useTokenCredential = true; break; case "--no-audio": parsed.noAudio = true; break; case "--help": case "-h": parsed.help = true; break; default: if (arg?.startsWith("-")) { throw new Error(`Unknown option: ${arg}`); } break; } } return parsed; } /** * List available audio input devices on Windows (AudioEndpoint via WMI). 
*/ async function listAudioDevices() { if (process.platform !== "win32") { console.log("Device listing is currently supported on Windows only."); console.log("On macOS/Linux, run: sox -V6 -n -t coreaudio -n trim 0 0 (or similar)"); return; } const { execSync } = await import("node:child_process"); try { const output = execSync( 'powershell -NoProfile -Command "Get-CimInstance Win32_PnPEntity | Where-Object { $_.PNPClass -eq \'AudioEndpoint\' } | Select-Object -ExpandProperty Name"', { encoding: "utf-8", timeout: 10000 }, ).trim(); if (!output) { console.log("No audio endpoint devices found."); return; } console.log("Available audio endpoint devices:"); console.log(""); for (const line of output.split(/\r?\n/)) { const name = line.trim(); if (name) console.log(` ${name}`); } console.log(""); console.log("Use the device name (or a unique substring) with --audio-input-device."); console.log('Example: node mcp-quickstart.js --audio-input-device "Microphone"'); } catch (err) { console.error("Failed to query audio devices:", err.message); } } function resolveVoiceConfig(voiceName) { const looksLikeAzureVoice = voiceName.includes("-") || voiceName.includes(":"); if (looksLikeAzureVoice) { return { type: "azure-standard", name: voiceName }; } return { type: "openai", name: voiceName }; } class AudioProcessor { constructor(enableAudio = true, inputDevice = undefined) { this._enableAudio = enableAudio; this._inputDevice = inputDevice; this._recorder = null; this._soxProcess = null; this._speaker = null; this._skipSeq = 0; this._nextSeq = 0; this._recordModule = null; this._speakerCtor = null; } async _ensureAudioModulesLoaded() { if (!this._enableAudio) return; if (this._recordModule && this._speakerCtor) return; try { const recordModule = await import("node-record-lpcm16"); const speakerModule = await import("speaker"); this._recordModule = recordModule.default; this._speakerCtor = speakerModule.default; } catch { throw new Error( "Audio dependencies are unavailable. 
Install optional packages (node-record-lpcm16, speaker) " + "and required native build tools, or run with --no-audio for connectivity-only validation.", ); } } async startCapture(session) { if (!this._enableAudio) { console.log("[audio] --no-audio enabled: microphone capture skipped"); return; } if (this._recorder || this._soxProcess) return; if (this._inputDevice) { console.log(`[audio] Using explicit input device: ${this._inputDevice}`); const soxArgs = [ "-q", "-t", "waveaudio", this._inputDevice, "-r", "24000", "-c", "1", "-e", "signed-integer", "-b", "16", "-t", "raw", "-", ]; this._soxProcess = spawn("sox", soxArgs, { stdio: ["ignore", "pipe", "pipe"], }); this._soxProcess.stdout.on("data", (chunk) => { if (session.isConnected) { session.sendAudio(new Uint8Array(chunk)).catch(() => {}); } }); this._soxProcess.stderr.on("data", (data) => { const msg = data.toString().trim(); if (msg) console.error(`[audio] sox stderr: ${msg}`); }); this._soxProcess.on("error", (error) => { console.error(`[audio] SoX process error: ${error?.message ?? error}`); }); this._soxProcess.on("close", (code) => { if (code !== 0) console.error(`[audio] SoX exited with code ${code}`); this._soxProcess = null; }); console.log("[audio] Microphone capture started"); return; } await this._ensureAudioModulesLoaded(); const recorderOptions = { sampleRate: 24000, channels: 1, audioType: "raw", recorder: "sox", encoding: "signed-integer", bitwidth: 16, }; this._recorder = this._recordModule.record(recorderOptions); const recorderStream = this._recorder.stream(); recorderStream.on("data", (chunk) => { if (session.isConnected) { session.sendAudio(new Uint8Array(chunk)).catch(() => {}); } }); recorderStream.on("error", (error) => { console.error(`[audio] Recorder stream error: ${error?.message ?? 
error}`); }); console.log("[audio] Microphone capture started"); } async startPlayback() { if (!this._enableAudio) { console.log("[audio] --no-audio enabled: speaker playback skipped"); return; } if (this._speaker) return; await this._resetSpeaker(); console.log("[audio] Playback ready"); } queueAudio(base64Delta) { const seq = this._nextSeq++; if (seq < this._skipSeq) return; const chunk = Buffer.from(base64Delta, "base64"); if (this._speaker && !this._speaker.destroyed) { this._speaker.write(chunk); } } skipPendingAudio() { if (!this._enableAudio) return; this._skipSeq = this._nextSeq++; this._resetSpeaker().catch(() => {}); } shutdown() { if (this._soxProcess) { try { this._soxProcess.kill(); } catch { /* no-op */ } this._soxProcess = null; } if (this._recorder) { this._recorder.stop(); this._recorder = null; } if (this._speaker) { this._speaker.end(); this._speaker = null; } console.log("[audio] Audio processor shut down"); } async _resetSpeaker() { await this._ensureAudioModulesLoaded(); if (this._speaker && !this._speaker.destroyed) { // Use destroy() instead of end() to immediately discard buffered audio. // end() drains the buffer (plays it out), which causes old MCP response // audio to keep playing after barge-in. try { this._speaker.destroy(); } catch { /* no-op */ } } this._speaker = new this._speakerCtor({ channels: 1, bitDepth: 16, sampleRate: 24000, signed: true, }); this._speaker.on("error", () => {}); } } // <define_mcp_servers> /** * Define MCP servers that Voice Live can use during the session. * Each server is an MCPTool object added to the session tools array. 
*/ function defineMCPServers() { return [ { type: "mcp", serverLabel: "deepwiki", serverUrl: "https://mcp.deepwiki.com/mcp", allowedTools: ["read_wiki_structure", "ask_question"], requireApproval: "never", }, { type: "mcp", serverLabel: "azure_doc", serverUrl: "https://learn.microsoft.com/api/mcp", requireApproval: "always", }, ]; } // </define_mcp_servers> class MCPVoiceAssistant { constructor(options) { this.endpoint = options.endpoint; this.credential = options.credential; this.model = options.model; this.voice = options.voice; this.instructions = options.instructions; this.audioInputDevice = options.audioInputDevice; this.noAudio = options.noAudio; this._session = null; this._subscription = null; this._audio = new AudioProcessor(!options.noAudio, options.audioInputDevice); this._activeResponse = false; this._responseApiDone = false; this._pendingApproval = null; this._approvalQueue = []; this._approvalPromptNeeded = false; this._mcpCallInProgress = 0; this._handledMcpCompletions = new Set(); this._needsResponseCreate = false; this._approvalCallCount = {}; this._mcpItemToServer = {}; this._approvalServers = new Set(); this._mcpStallTimer = null; this._activeMcpItems = new Set(); this._staleMcpItems = new Set(); this._mcpResultsPending = false; this._approvedServersThisTurn = new Set(); this._bargeInActive = false; } // <configure_session> /** * Configure the session with MCP servers in the tools list. 
*/ async _setupSession() { console.log("[session] Configuring session with MCP tools..."); const mcpServers = defineMCPServers(); this._approvalServers = new Set( mcpServers.filter(s => s.requireApproval === "always").map(s => s.serverLabel) ); await this._session.updateSession({ model: this.model, modalities: ["text", "audio"], instructions: this.instructions, voice: resolveVoiceConfig(this.voice), inputAudioFormat: "pcm16", outputAudioFormat: "pcm16", turnDetection: { type: "server_vad", threshold: 0.5, prefixPaddingInMs: 300, silenceDurationInMs: 500, }, inputAudioEchoCancellation: { type: "server_echo_cancellation" }, inputAudioNoiseReduction: { type: "azure_deep_noise_suppression" }, inputAudioTranscription: { model: this.model.toLowerCase().includes("realtime") ? "whisper-1" : "azure-speech" }, tools: mcpServers, }); console.log("[session] Session configuration with MCP tools sent"); } // </configure_session> // <handle_mcp_events> /** * Subscribe to session events, including MCP-specific events. */ _subscribeToEvents(session) { this._subscription = session.subscribe({ onSessionUpdated: async (event, context) => { const s = event.session; const model = s?.model; const voice = s?.voice; console.log(`[session] Session ready: ${context.sessionId}`); console.log( ` Model: ${typeof model === "string" ? model : model?.toString?.() ?? ""}`, ); console.log(` Voice: ${voice?.name ?? ""}`); writeConversationLog( [ `SessionID: ${context.sessionId}`, `Model: ${typeof model === "string" ? model : model?.toString?.() ?? ""}`, `Voice Name: ${voice?.name ?? ""}`, `Voice Type: ${voice?.type ?? ""}`, `Log File: ${conversationLogFile}`, "", ].join("\n"), ); }, onConversationItemInputAudioTranscriptionCompleted: async (event) => { const transcript = event.transcript ?? 
""; console.log(`š¤ You said:\t${transcript}`); writeConversationLog(`User Input:\t${transcript}`); if (this._pendingApproval !== null) { await this._resolveVoiceApproval(transcript, session); } }, onResponseTextDone: async (event) => { const text = event.text ?? ""; console.log(`š¤ Assistant text:\t${text}`); writeConversationLog(`Assistant Text Response:\t${text}`); }, onResponseAudioTranscriptDone: async (event) => { const transcript = event.transcript ?? ""; console.log(`š¤ Assistant audio transcript:\t${transcript}`); writeConversationLog(`Assistant Audio Response:\t${transcript}`); }, onInputAudioBufferSpeechStarted: async () => { console.log("š¤ Listening..."); this._audio.skipPendingAudio(); // Do NOT reset _approvalCallCount here ā the counter should only // reset on task completion (in onResponseMcpCallCompleted when no // pending/queued approvals remain) or on denial (in _resolveVoiceApproval). // Resetting on every speech-start would let the model retry denied calls. // Clear ALL deferred response flags on barge-in. // This prevents onResponseDone (fired by the cancelled response) // from immediately creating a new response that overlaps the user. this._needsResponseCreate = false; this._mcpResultsPending = false; // Reset approved-servers-this-turn when user starts a new topic if (this._pendingApproval === null && this._mcpCallInProgress <= 0) { this._approvedServersThisTurn.clear(); } if (this._activeResponse && !this._responseApiDone) { // Mark barge-in so onResponseDone skips deferred actions this._bargeInActive = true; try { await session.sendEvent({ type: "response.cancel" }); } catch (err) { const msg = err?.message ?? 
""; if (!msg.toLowerCase().includes("no active response")) { console.warn("[barge-in] Cancel failed:", msg); } } try { await session.sendEvent({ type: "input_audio_buffer.clear" }); } catch { /* best-effort */ } } if (this._mcpCallInProgress > 0 && this._pendingApproval === null) { this._staleMcpItems = new Set([...this._staleMcpItems, ...this._activeMcpItems]); console.log(`[barge-in] Marking ${this._activeMcpItems.size} MCP calls as stale`); try { await session.addConversationItem({ type: "message", role: "system", content: [{ type: "input_text", text: "A tool call is still running in the background. The user just spoke. Respond to what the user said. If a tool result arrives later, briefly introduce it as a late result from an earlier request." }] }); } catch {} } }, onInputAudioBufferSpeechStopped: async () => { console.log("š¤ Processing..."); }, onResponseCreated: async () => { this._activeResponse = true; this._responseApiDone = false; }, onResponseAudioDelta: async (event) => { if (event.delta) { this._audio.queueAudio(event.delta); } }, onResponseAudioDone: async () => { console.log("š¤ Ready for next input..."); }, onResponseDone: async () => { console.log("ā Response complete"); writeConversationLog("--- Response complete ---"); this._activeResponse = false; this._responseApiDone = true; // If this response.done is the result of a barge-in cancel, // skip all deferred actions ā the user's new turn will handle things. 
        if (this._bargeInActive) {
          this._bargeInActive = false;
          return;
        }

        if (this._approvalPromptNeeded && this._pendingApproval !== null) {
          this._approvalPromptNeeded = false;
          await this._sendApprovalVoicePrompt(session);
        } else if (
          this._mcpResultsPending &&
          this._mcpCallInProgress <= 0 &&
          this._pendingApproval === null
        ) {
          this._mcpResultsPending = false;
          try {
            await session.sendEvent({ type: "response.create" });
          } catch {}
        } else if (this._needsResponseCreate) {
          this._needsResponseCreate = false;
          try {
            await session.sendEvent({ type: "response.create" });
          } catch {}
        }
      },
      onServerError: async (event) => {
        const msg = event.error?.message ?? "";
        // Reset response state: errors can terminate a response without onResponseDone
        this._activeResponse = false;
        this._responseApiDone = true;
        if (msg.includes("Cancellation failed: no active response")) return;
        if (msg.toLowerCase().includes("interim response")) {
          console.log("[session] Interim response not supported (non-fatal)");
          return;
        }
        if (msg.toLowerCase().includes("active response")) return;
        console.error(`❌ VoiceLive error: ${msg}`);
        writeConversationLog(`ERROR: ${msg}`);
      },

      // MCP-specific event handlers
      onMcpListToolsCompleted: async (event) => {
        console.log(`🔧 MCP tools discovered successfully`);
        writeConversationLog("MCP tools discovered successfully");
      },
      onMcpListToolsFailed: async (event) => {
        console.error(`❌ MCP tool discovery failed`);
        writeConversationLog("ERROR: MCP tool discovery failed");
      },
      onResponseMcpCallInProgress: async (event) => {
        console.log("⏳ MCP tool call in progress...");
        writeConversationLog(`MCP call in progress: ${event.item_id ?? ""}`);
        this._mcpCallInProgress++;
        this._activeMcpItems.add(event.item_id);
        this._startMcpStallTimer(session);
      },
      onResponseMcpCallArgumentsDone: async (event) => {
        const name = event.name ?? "";
        console.log(`MCP tool call arguments ready: ${name}`);
      },
      onResponseMcpCallCompleted: async (event) => {
        const itemId = event.item_id ??
          "";
        this._mcpCallInProgress = Math.max(0, this._mcpCallInProgress - 1);
        this._activeMcpItems.delete(itemId);
        this._cancelMcpStallTimer();
        if (this._handledMcpCompletions.has(itemId)) return;
        this._handledMcpCompletions.add(itemId);

        const isStale = this._staleMcpItems.has(itemId);
        this._staleMcpItems.delete(itemId);
        console.log(`✅ MCP tool call completed (stale=${isStale})`);
        writeConversationLog(`MCP call completed: ${itemId} (stale=${isStale})`);
        delete this._mcpItemToServer[itemId];

        if (this._pendingApproval === null && this._approvalQueue.length === 0) {
          this._approvalCallCount = {};
        }

        if (isStale) {
          try {
            await session.addConversationItem({
              type: "message",
              role: "system",
              content: [{ type: "input_text", text: "This tool result is from an earlier request. The user has since moved on. Briefly introduce it as a late result, e.g. 'By the way, those results from earlier just came in...' then share the key findings concisely." }],
            });
          } catch {}
        }

        // Batch response: only call response.create when ALL MCP calls for this
        // turn have completed. This prevents partial results and repeated tool calls.
        if (
          this._mcpCallInProgress <= 0 &&
          this._pendingApproval === null &&
          this._approvalQueue.length === 0
        ) {
          try {
            await session.sendEvent({ type: "response.create" });
          } catch (e) {
            if (e?.message?.toLowerCase().includes("active response")) {
              this._needsResponseCreate = true;
            }
          }
        } else {
          this._mcpResultsPending = true;
          console.log(`[mcp] MCP calls still in progress (${this._mcpCallInProgress}); deferring response`);
        }
      },
      onResponseMcpCallFailed: async (event) => {
        const itemId = event.item_id ??
          "";
        console.error("❌ MCP tool call failed");
        writeConversationLog(`ERROR: MCP call failed: ${itemId}`);
        this._mcpCallInProgress = Math.max(0, this._mcpCallInProgress - 1);
        this._activeMcpItems.delete(itemId);
        this._staleMcpItems.delete(itemId);
        this._cancelMcpStallTimer();
        try {
          await session.sendEvent({ type: "response.create" });
        } catch {}
      },
      onConversationItemCreated: async (event) => {
        const item = event.item;
        if (item?.type === "mcp_call") {
          const sl = item.serverLabel ?? item.server_label ?? "";
          const fn = item.name ?? "";
          this._mcpItemToServer[item.id] = `${sl}/${fn}`;
          console.log(`🔧 MCP tool call: ${sl}/${fn}`);
          writeConversationLog(`MCP tool call: ${sl}/${fn} (id=${item.id})`);
          if (!this._pendingApproval && !this._approvalServers.has(sl)) {
            try {
              await session.addConversationItem({
                type: "message",
                role: "system",
                content: [{ type: "input_text", text: "Briefly tell the user you're looking something up. One short sentence only." }],
              });
              await session.sendEvent({ type: "response.create" });
            } catch {}
          }
        }
        if (item?.type === "mcp_approval_request") {
          writeConversationLog(`MCP approval request: ${item.serverLabel ?? item.server_label ?? ""} / ${item.name ?? ""} (id=${item.id ?? ""})`);
          await this._handleApprovalRequest(item, session);
        }
      },
    });
  }
  // </handle_mcp_events>

  // <handle_approval>
  /**
   * Handle MCP approval requests via voice-based approval flow.
   */
  async _handleApprovalRequest(item, session) {
    const approvalId = item.id ?? "unknown";
    const serverLabel = item.serverLabel ?? item.server_label ?? "unknown";
    const functionName = item.name ?? "unknown";

    console.log();
    console.log("MCP Approval Request");
    console.log(` Server: ${serverLabel}`);
    console.log(` Tool: ${functionName}`);
    console.log(` Approval ID: ${approvalId}`);

    const MAX_APPROVAL_CALLS_PER_TASK = 3;
    const currentCount = this._approvalCallCount[serverLabel] ??
      0;
    if (currentCount >= MAX_APPROVAL_CALLS_PER_TASK) {
      console.log(` Auto-denied: ${serverLabel}/${functionName} (max ${MAX_APPROVAL_CALLS_PER_TASK} calls reached)`);
      try {
        await session.addConversationItem({
          type: "mcp_approval_response",
          approvalRequestId: approvalId,
          approve: false,
        });
      } catch (err) {
        console.warn("Failed to send auto-deny:", err?.message ?? err);
      }
      return;
    }

    // Auto-approve if user already approved this server earlier in the same turn
    if (this._approvedServersThisTurn.has(serverLabel)) {
      console.log(` Auto-approved: ${serverLabel}/${functionName} (already approved this turn)`);
      try {
        await session.addConversationItem({
          type: "mcp_approval_response",
          approvalRequestId: approvalId,
          approve: true,
        });
      } catch (err) {
        console.warn("Failed to send auto-approve:", err?.message ?? err);
      }
      return;
    }

    if (this._pendingApproval !== null) {
      this._approvalQueue.push({ approvalId, serverLabel, functionName });
      console.log(" (queued: another approval is pending)");
      return;
    }

    this._pendingApproval = { approvalId, serverLabel, functionName };
    if (!this._activeResponse) {
      await this._sendApprovalVoicePrompt(session);
    } else {
      this._approvalPromptNeeded = true;
    }
  }

  async _sendApprovalVoicePrompt(session) {
    const pending = this._pendingApproval;
    if (!pending) return;
    const server = pending.serverLabel;
    const count = this._approvalCallCount[server] ?? 0;
    this._approvalCallCount[server] = count + 1;

    let prompt;
    if (count === 0) {
      prompt = `You MUST ask the user for explicit permission before proceeding. Say exactly: "I'd like to search the ${server} service for information. Do you approve? Please say yes or no."`;
    } else {
      prompt = `You MUST ask the user for permission again. Say exactly: "I need to do one more search to get complete information. Should I continue? Please say yes or no."`;
    }

    try {
      await session.addConversationItem({
        type: "message",
        role: "system",
        content: [{ type: "input_text", text: prompt }],
      });
      await session.sendEvent({ type: "response.create" });
    } catch (err) {
      console.error("❌ Failed to send approval voice prompt:", err?.message ?? err);
    }
  }

  // <voice_approval_transcription>
  async _resolveVoiceApproval(transcript, session) {
    if (this._pendingApproval === null) return;
    const lower = transcript.toLowerCase();
    let approved = /\byes\b/.test(lower);
    const denied = /\b(no|stop|cancel)\b/.test(lower);

    if (!approved && !denied) {
      // Ambiguous: will re-prompt at next response.done
      this._approvalPromptNeeded = true;
      return;
    }
    if (approved && denied) {
      approved = false; // Conflicting signals: deny for safety
    }

    const { approvalId, serverLabel } = this._pendingApproval;
    console.log(` Voice response: ${approved ? "Approved ✅" : "Denied ❌"}`);
    writeConversationLog(`Voice approval: ${approved ? "Approved" : "Denied"} for ${serverLabel}`);
    this._pendingApproval = null;

    if (approved) {
      this._approvedServersThisTurn.add(serverLabel);
    } else {
      this._approvalCallCount = {};
      this._approvedServersThisTurn.delete(serverLabel);
    }

    try {
      await session.addConversationItem({
        type: "mcp_approval_response",
        approvalRequestId: approvalId,
        approve: approved,
      });
    } catch (err) {
      console.error("❌ Failed to send approval response:", err?.message ??
        err);
    }

    await this._processNextApproval(session);
  }

  async _processNextApproval(session) {
    if (this._approvalQueue.length === 0) return;
    const next = this._approvalQueue.shift();

    // Auto-approve if user already approved this server earlier in the same turn
    if (this._approvedServersThisTurn.has(next.serverLabel)) {
      console.log(` Auto-approved (queued): ${next.serverLabel}/${next.functionName}`);
      try {
        await session.addConversationItem({
          type: "mcp_approval_response",
          approvalRequestId: next.approvalId,
          approve: true,
        });
      } catch (err) {
        console.warn("Failed to send queued auto-approve:", err?.message ?? err);
      }
      await this._processNextApproval(session);
      return;
    }

    this._pendingApproval = next;
    if (!this._activeResponse) {
      await this._sendApprovalVoicePrompt(session);
    } else {
      this._approvalPromptNeeded = true;
    }
  }
  // </voice_approval_transcription>
  // </handle_approval>

  // <mcp_stall_detection>
  _startMcpStallTimer(session) {
    this._cancelMcpStallTimer();
    let stallCount = 0;
    const MCP_STALL_MAX_NOTIFICATIONS = 3;
    this._mcpStallTimer = setInterval(async () => {
      if (this._mcpCallInProgress <= 0) {
        this._cancelMcpStallTimer();
        return;
      }
      stallCount++;
      if (stallCount > MCP_STALL_MAX_NOTIFICATIONS) {
        this._cancelMcpStallTimer();
        return;
      }
      // MCP calls cannot be cancelled; only honest status updates are possible.
      const msg = "The tool call is still running. Briefly reassure the user that you're still waiting for results. One short sentence only.";
      try {
        await session.addConversationItem({
          type: "message",
          role: "system",
          content: [{ type: "input_text", text: msg }],
        });
        await session.sendEvent({ type: "response.create" });
      } catch (e) {
        if (e?.message?.toLowerCase().includes("active response")) {
          this._needsResponseCreate = true;
        }
      }
    }, 10000);
  }

  _cancelMcpStallTimer() {
    if (this._mcpStallTimer) {
      clearInterval(this._mcpStallTimer);
      this._mcpStallTimer = null;
    }
  }
  // </mcp_stall_detection>

  async start() {
    const client = new VoiceLiveClient(this.endpoint, this.credential, {
      apiVersion: "2026-01-01-preview",
    });
    const session = client.createSession({ model: this.model });
    this._session = session;

    console.log(
      `[init] Connecting to VoiceLive with model "${this.model}" at "${this.endpoint}" ...`,
    );

    this._subscribeToEvents(session);
    await session.connect();
    console.log("[init] Connected to VoiceLive session websocket");

    await this._setupSession();
    await this._audio.startPlayback();
    await this._audio.startCapture(session);

    console.log("\n" + "=".repeat(70));
    console.log("🎤 VOICE ASSISTANT WITH MCP READY");
    console.log("Try saying:");
    console.log(' • "Can you summarize the GitHub repo azure-sdk-for-java?"');
    console.log(' • "Search the Azure documentation for Voice Live API."');
    console.log("You may need to approve some MCP tool calls by voice.");
    console.log("Press Ctrl+C to exit");
    console.log("=".repeat(70) + "\n");

    if (this.noAudio) {
      setTimeout(() => {
        process.emit("SIGINT");
      }, 6000);
    }

    await new Promise((resolve) => {
      const onSignal = () => resolve();
      process.once("SIGINT", onSignal);
      process.once("SIGTERM", onSignal);
      const poll = setInterval(() => {
        if (!session.isConnected) {
          clearInterval(poll);
          resolve();
        }
      }, 500);
    });

    await this.shutdown();
  }

  async shutdown() {
    this._cancelMcpStallTimer();
    if (this._subscription) {
      await this._subscription.close();
      this._subscription = null;
    }
    if (this._session) {
      try {
        await this._session.disconnect();
      } catch {
        // ignore disconnect errors during shutdown
      }
      this._audio.shutdown();
      try {
        await this._session.dispose();
      } catch {
        // ignore dispose errors during shutdown
      }
      this._session = null;
    }
  }
}

async function main() {
  let args;
  try {
    args = parseArguments(process.argv.slice(2));
  } catch (err) {
    console.error(`❌ ${err.message}`);
    printUsage();
    process.exit(1);
  }

  if (args.help) {
    printUsage();
    return;
  }

  if (args.listAudioDevices) {
    await listAudioDevices();
    return;
  }

  if (!args.endpoint) {
    console.error(
      "❌ Missing endpoint. Set AZURE_VOICELIVE_ENDPOINT or pass --endpoint.",
    );
    process.exit(1);
  }

  if (!args.apiKey && !args.useTokenCredential) {
    console.error("❌ No authentication provided.");
    console.error(
      "Provide --api-key / AZURE_VOICELIVE_API_KEY or use --use-token-credential.",
    );
    process.exit(1);
  }

  const credential = args.useTokenCredential
    ? new DefaultAzureCredential()
    : new AzureKeyCredential(args.apiKey);

  console.log("Configuration:");
  console.log(` AZURE_VOICELIVE_ENDPOINT: ${args.endpoint}`);
  console.log(` AZURE_VOICELIVE_MODEL: ${args.model}`);
  console.log(` AZURE_VOICELIVE_VOICE: ${args.voice}`);
  console.log(` AUDIO_INPUT_DEVICE: ${args.audioInputDevice ?? "(not set)"}`);
  console.log(` No audio mode: ${args.noAudio ? "enabled" : "disabled"}`);
  console.log(
    ` Authentication: ${args.useTokenCredential ? "DefaultAzureCredential" : "API Key"}`,
  );
  console.log(` Log file: ${conversationLogFile}`);

  const assistant = new MCPVoiceAssistant({
    endpoint: args.endpoint,
    credential,
    model: args.model,
    voice: args.voice,
    instructions: args.instructions,
    audioInputDevice: args.audioInputDevice,
    noAudio: args.noAudio,
  });

  try {
    await assistant.start();
  } catch (err) {
    if (err?.code === "ERR_USE_AFTER_CLOSE") return;
    console.error("Fatal error:", err);
    process.exit(1);
  }
}

console.log("🎙️ Voice Assistant with MCP - Azure VoiceLive SDK");
console.log("=".repeat(70));
main().then(
  () => console.log("\n👋 Voice assistant shut down. Goodbye!"),
  (err) => {
    console.error("Unhandled error:", err);
    process.exit(1);
  },
);
Sign in to Azure with the following command:
az login
Run the application:
node mcp-quickstart.js
Speak into your microphone. Try asking questions like "What tools do you have?" or "Search the Azure documentation for Voice Live API."
- For the deepwiki server (require_approval: "never"), tool calls execute automatically.
- For the azure_doc server (require_approval: "always"), you're prompted by voice to approve each tool call.
Press Ctrl+C to stop the session.
MCP server configuration reference
| Parameter | Required | Description |
|---|---|---|
| server_label | Yes | Display name for the MCP server. |
| server_url | Yes | URL of the remote MCP endpoint. |
| allowed_tools | No | List of tool names the model can call. If omitted, all tools are allowed. |
| require_approval | No | "never", "always" (default), or a per-tool dictionary. |
| headers | No | Extra HTTP headers to include in MCP requests. |
| authorization | No | Authorization token for MCP requests. |
For the complete REST API type definition, see MCPTool in the Voice Live API reference.
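As a rough illustration of the table above, the following sketch builds two server entries and fills in the documented default for require_approval. The helper normalizeMcpServer, the azure_doc URL, and the search_docs tool name are hypothetical placeholders, not part of the Voice Live SDK, and the exact object shape your SDK version expects might differ.

```javascript
// Hypothetical helper (not an SDK function): checks the two required
// fields and applies the documented default for require_approval.
function normalizeMcpServer(config) {
  for (const key of ["server_label", "server_url"]) {
    if (typeof config[key] !== "string" || config[key].length === 0) {
      throw new Error(`MCP server config is missing required field: ${key}`);
    }
  }
  return {
    require_approval: "always", // documented default when omitted
    ...config,
  };
}

const servers = [
  normalizeMcpServer({
    server_label: "deepwiki",
    server_url: "https://mcp.deepwiki.com/mcp",
    require_approval: "never", // tool calls run without a prompt
  }),
  normalizeMcpServer({
    server_label: "azure_doc",
    server_url: "https://example.com/mcp", // placeholder URL (assumption)
    allowed_tools: ["search_docs"], // placeholder tool name (assumption)
  }),
];

for (const s of servers) {
  console.log(`${s.server_label}: require_approval=${s.require_approval}`);
}
```

Validating on the client before the session starts surfaces a missing server_url as a local error instead of a failed discovery event later.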
Best practices
Integrating MCP servers into a voice assistant introduces UX challenges that don't exist in text-based or console-based MCP clients. MCP tool calls can take from 3 to more than 60 seconds, approval prompts must happen conversationally, and users expect continuous spoken feedback. Plan for these patterns when building a voice-enabled MCP integration.
Voice-native approval
Console-based MCP samples typically use blocking input (such as input() or readline) for approval. In a voice assistant, blocking the audio pipeline freezes the conversation. Instead, handle approvals conversationally:
- Inject a system message that instructs the model to verbally ask for permission.
- Parse the user's spoken response for clear intent (yes, no, stop, cancel).
- Allow barge-in so the user can say "yes" without waiting for the full approval prompt to finish.
- Use word-boundary matching (such as \byes\b) to avoid false positives from words like "yesterday" or "nobody".
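The parsing step can be sketched as a small pure function. It mirrors the regular expressions used in the sample's _resolveVoiceApproval method; the function name classifyApproval and the three return labels are illustrative choices, not SDK values.

```javascript
// Classify a spoken reply as "approved", "denied", or "ambiguous".
// Word boundaries keep "yesterday" and "nobody" from matching.
function classifyApproval(transcript) {
  const lower = transcript.toLowerCase();
  const approved = /\byes\b/.test(lower);
  const denied = /\b(no|stop|cancel)\b/.test(lower);

  if (approved && denied) return "denied"; // conflicting signals: deny for safety
  if (approved) return "approved";
  if (denied) return "denied";
  return "ambiguous"; // caller should re-prompt
}

console.log(classifyApproval("Yes, go ahead."));
console.log(classifyApproval("Yesterday nobody answered."));
console.log(classifyApproval("Yes... actually no, cancel that."));
```

Treating a reply that contains both "yes" and a denial word as a denial is the conservative choice: a wrongly denied call costs one extra prompt, while a wrongly approved call runs a tool the user didn't want.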
System instructions for the approval flow
The model needs explicit instructions about the approval flow in its system prompt. Without them, it might paraphrase the permission request into a generic "Let me look that up," skipping the actual question. Include language like:
"Some tools require user approval. When you receive a system message asking you to request permission, you MUST clearly ask the user for their explicit approval. Never skip the approval question or assume permission is granted."
Use "Say exactly:" phrasing in per-request system messages to prevent the model from rewording the question.
Handle repeated tool calls
MCP servers might require multiple searches to gather complete information. Each search triggers a separate approval if require_approval="always". Rather than asking the identical question each time:
- Track the call count per server.
- Change the prompt wording for subsequent calls (for example, "I need one more search. Should I continue?").
- Consider auto-denying after a maximum number of approved calls (for example, 3) to prevent infinite loops. After an auto-denial, the model answers with the information it has already gathered.
- Reset the counter when results are delivered or the user denies a request.
For approval-required servers, consider auto-approving subsequent calls to the same server within the same turn to avoid repeated voice prompts for what is logically a single task.
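The counting and auto-approval rules above can be kept in one small state object. The following is a minimal sketch, assuming a hypothetical ApprovalTracker class and decision labels ("ask-first", "ask-again", "auto-approve", "auto-deny") that are not part of the SDK; it follows the same order of checks as the sample's _handleApprovalRequest method.

```javascript
// Hypothetical bookkeeping for repeated approval requests.
class ApprovalTracker {
  constructor(maxCallsPerTask = 3) {
    this.maxCallsPerTask = maxCallsPerTask;
    this.promptCount = new Map(); // server label -> prompts issued this task
    this.approvedThisTurn = new Set(); // servers the user approved this turn
  }

  // Decide what to do with a new approval request for a server.
  decide(serverLabel) {
    const count = this.promptCount.get(serverLabel) ?? 0;
    if (count >= this.maxCallsPerTask) return "auto-deny";
    if (this.approvedThisTurn.has(serverLabel)) return "auto-approve";
    this.promptCount.set(serverLabel, count + 1);
    return count === 0 ? "ask-first" : "ask-again";
  }

  // Record the user's spoken answer.
  recordAnswer(serverLabel, approved) {
    if (approved) {
      this.approvedThisTurn.add(serverLabel);
    } else {
      this.promptCount.clear(); // a denial ends the task
      this.approvedThisTurn.delete(serverLabel);
    }
  }

  // Call when results have been delivered to the user.
  taskCompleted() {
    this.promptCount.clear();
  }

  // Call when the user starts a new topic.
  newTurn() {
    this.approvedThisTurn.clear();
  }
}

const tracker = new ApprovalTracker(2);
const first = tracker.decide("azure_doc"); // first prompt for this server
tracker.recordAnswer("azure_doc", true);
const second = tracker.decide("azure_doc"); // same turn, already approved
console.log(first, second);
```

Use "ask-first" and "ask-again" to pick between the initial and the follow-up prompt wording.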
Fill silence during tool calls
MCP tool calls can take several seconds to complete. Without feedback, the user assumes the assistant is unresponsive. Use these complementary layers:
- Tool announcements (immediate, client-side): For auto-approved servers, have the assistant say something like "Let me look that up" when the call starts. Skip this for approval-required servers since the approval prompt already communicates that a tool call is happening.
- Stall detection (client-side, repeating timer): If a tool call runs longer than expected, proactively tell the user the assistant is still waiting. A 10-second interval with a maximum of 3 notifications works well for medium-latency servers (5–15 seconds). Adjust the interval based on your expected MCP server latency.
Note
MCP calls can't be cancelled. Stall notifications are status updates, not actionable options. Once a call starts, it runs until the server responds or times out.
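The stall timer logic is easier to test when the counting is separated from the timer itself. The sketch below assumes two hypothetical helpers, createStallNotifier and startStallTimer; in a real client, the notify callback would add the system message and send response.create as the sample does.

```javascript
// Returns a tick() function. Each tick sends at most one notification and
// reports whether the timer should keep running.
function createStallNotifier({ maxNotifications = 3, isCallInProgress, notify }) {
  let sent = 0;
  return function tick() {
    if (!isCallInProgress() || sent >= maxNotifications) return false;
    sent += 1;
    notify(sent);
    return true;
  };
}

// Drives tick() on an interval and stops as soon as tick() returns false.
function startStallTimer(tick, intervalMs = 10000) {
  const timer = setInterval(() => {
    if (!tick()) clearInterval(timer);
  }, intervalMs);
  return timer;
}

// Simulate four timer firings while a call is still running.
const notifications = [];
const tick = createStallNotifier({
  isCallInProgress: () => true,
  notify: (n) => notifications.push(`still waiting (${n})`),
});
const results = [tick(), tick(), tick(), tick()];
console.log(results, notifications.length);
```

Capping the notifications avoids an assistant that repeats "still waiting" indefinitely when a server never responds.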
Handle barge-in during MCP calls
Users naturally try to interrupt or ask "Are you still there?" during long tool calls. Rather than ignoring this:
- Inject a system message so the model can acknowledge the user.
- If the original MCP call completes later, introduce its result as a late result (for example, "By the way, those results from earlier just came in...").
- Protect against response collisions: when a cancelled response's completion handler runs, skip any deferred processing (pending approval prompts, queued MCP results) so it doesn't overlap with the user's new turn.
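One way to decide whether a completed call should be introduced as a late result is to track which calls were running when the user barged in. This is a minimal sketch with a hypothetical McpCallTracker class; the item IDs are made-up values, and the sample keeps the same information in its _activeMcpItems and _staleMcpItems sets.

```javascript
// Tracks running MCP calls and which of them the user has talked over.
class McpCallTracker {
  constructor() {
    this.active = new Set(); // calls currently running
    this.stale = new Set(); // calls that were running during a barge-in
  }

  onCallStarted(itemId) {
    this.active.add(itemId);
  }

  // The user spoke: every call still running is now out of context.
  onBargeIn() {
    for (const id of this.active) this.stale.add(id);
    return this.stale.size;
  }

  // Returns whether the result is late and whether all calls are finished.
  onCallCompleted(itemId) {
    this.active.delete(itemId);
    const late = this.stale.delete(itemId); // true only if it was marked stale
    return { late, allDone: this.active.size === 0 };
  }
}

const calls = new McpCallTracker();
calls.onCallStarted("item_a");
calls.onCallStarted("item_b");
const a = calls.onCallCompleted("item_a"); // finished before any barge-in
calls.onBargeIn(); // user interrupts while item_b is running
const b = calls.onCallCompleted("item_b"); // arrives after the user moved on
console.log(a, b);
```

When late is true, add the "late result" system message before requesting a response; when allDone is false, wait for the remaining calls so the model answers once with complete results.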
Choose MCP servers for voice latency
Not all MCP servers are well-suited for voice UX. When selecting MCP servers for a voice assistant:
- Prefer low-latency servers: search APIs, simple lookups, and cached data sources that respond within 5 seconds work best.
- Avoid servers that perform heavy computation: large repository analysis, complex document retrieval, or multi-step workflows can take 30–60+ seconds, degrading the voice experience.
- Plan for non-cancellable calls: MCP calls can't be cancelled from the client. If the user moves on during a slow call, the result arrives out of context and must be introduced as a late result, which can feel disjointed.
- Consider your use case: if users expect real-time answers, long-running MCP servers frustrate them. If the interaction style is more like a research assistant, asynchronous results might be acceptable.
Troubleshooting
MCP tool discovery fails (mcp_list_tools.failed)
Voice Live contacts each MCP server's tool listing endpoint at session start. If discovery fails, no tools from that server are available during the session.
| Cause | Resolution |
|---|---|
| Incorrect server_url | Verify the MCP server URL is reachable and includes the correct path (for example, https://mcp.deepwiki.com/mcp). |
| Server is unreachable | Confirm the MCP server is running and accessible from Azure's network. Check firewall rules and DNS resolution. |
| Authentication failure | If the server requires authentication, verify the authorization or headers values are correct and not expired. |
| Server returns invalid tool schema | Check the MCP server's tool listing response conforms to the MCP specification. |
MCP tool call fails (response.mcp_call.failed)
A tool call failure means Voice Live successfully discovered the tool but the call didn't complete.
| Cause | Resolution |
|---|---|
| Server timeout | The MCP server took too long to respond. Optimize the server-side handler or choose a lower-latency server. |
| Server returned an error | Check your MCP server logs. Common issues include missing parameters, invalid input, or downstream service failures. |
| Network interruption | Transient network errors between Voice Live and the MCP server. Retry by prompting the model again. |
Tip
When an MCP call fails, trigger response.create so the model can inform the user and continue the conversation. The sample code does this automatically.
No MCP events received
| Cause | Resolution |
|---|---|
| Wrong API version | MCP requires api_version="2026-04-10" or later. Earlier API versions silently ignore MCP server configuration. |
| MCP servers not in session config | Verify that MCPServer objects are included in the tools list passed to configure_session or updateSession. |
| allowed_tools mismatch | If allowed_tools is set, only the listed tool names are exposed. Verify the names match exactly what the MCP server advertises. |
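To check for an allowed_tools mismatch, compare the configured names with the names the server advertises. The helper below is a hypothetical sketch; the tool names are placeholders, and how you obtain the advertised list (for example, from the tool discovery events or the server's own documentation) depends on your setup.

```javascript
// Returns the allowed_tools entries that the server doesn't advertise.
// Matching is exact and case-sensitive, like the service-side filter
// described in the table above.
function findUnmatchedAllowedTools(allowedTools, advertisedTools) {
  const advertised = new Set(advertisedTools);
  return allowedTools.filter((name) => !advertised.has(name));
}

const advertisedTools = ["search_docs", "fetch_page"]; // placeholder names
const allowedTools = ["search_docs", "Fetch_Page"]; // wrong casing on purpose
const unmatched = findUnmatchedAllowedTools(allowedTools, advertisedTools);
console.log(unmatched);
```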
Approval requests not received
| Cause | Resolution |
|---|---|
| require_approval set to "never" | Tool calls auto-execute without approval. Change to "always" or use a per-tool dictionary if you need approval for specific tools. |
| Event handler not subscribed | Ensure your code listens for mcp_approval_request conversation items in the event loop. |
| Duplicate handling | The approval request arrives as a conversation item creation event, not a standalone event type. Check that your conversation.item.created handler inspects the item type. |
Response collision errors during MCP flow
Voice Live doesn't allow overlapping responses. During MCP flows, response.create calls can collide with an in-progress response.
| Cause | Resolution |
|---|---|
| "Cancellation failed: no active response" | Non-fatal. This occurs when a cancel is issued but the response already completed. Log and ignore. |
| "active response" errors | A new response.create was attempted while another response is still generating. Track response state (response.created / response.done events) and defer actions until the active response completes. |
| Interim response errors | Some model pipelines don't support interimResponse. If you receive interim response errors, remove the interim response configuration or verify your model supports it. |
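The state tracking described in this table can be sketched as a small gate around response.create. The ResponseGate class and its method names are hypothetical; the send callback stands in for a call such as session.sendEvent({ type: "response.create" }) from the sample, and is synchronous here only to keep the sketch short.

```javascript
// Defers response.create while another response is active, and drops the
// deferred request if the active response ended because of a barge-in.
class ResponseGate {
  constructor(send) {
    this.send = send; // callback that issues response.create
    this.active = false;
    this.deferred = false;
    this.bargeIn = false;
  }

  onResponseCreated() {
    this.active = true;
  }

  // Returns true if the request was sent now, false if it was deferred.
  request() {
    if (this.active) {
      this.deferred = true;
      return false;
    }
    this.active = true; // optimistic; response.created confirms it
    this.send();
    return true;
  }

  // The user interrupted: cancel any deferred work for the old turn.
  markBargeIn() {
    this.bargeIn = true;
    this.deferred = false;
  }

  onResponseDone() {
    this.active = false;
    if (this.bargeIn) {
      this.bargeIn = false;
      return; // the user's new turn takes over
    }
    if (this.deferred) {
      this.deferred = false;
      this.request();
    }
  }

  // Errors can end a response without a response.done event.
  onError() {
    this.active = false;
  }
}

let sent = 0;
const gate = new ResponseGate(() => { sent += 1; });
gate.request(); // sent immediately
gate.onResponseCreated();
gate.request(); // deferred: a response is active
gate.onResponseDone(); // flushes the deferred request
console.log(sent);
```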