Managed Inference and Agents API /v1/chat/completions
Last updated May 16, 2025
The `/v1/chat/completions` endpoint generates conversational completions for a provided set of input messages. You can specify the model, adjust generation settings such as `temperature`, and opt to stream the responses in real time. You can also specify `tools` the model can choose to call.

Selecting a chat model: for the best intelligence, we recommend a Claude Sonnet model; for cost savings and fast inference, we recommend Claude Haiku.
Request Body Parameters
Use parameters to manage how conversational completions are generated.
Required Parameters
Field | Type | Description | Example |
---|---|---|---|
`model` | string | model used for completion, typically the value of your `INFERENCE_MODEL_ID` config var | `"claude-3-7-sonnet"` |
`messages` | array | array of message objects (user-assistant conversational turns) used by the model to generate the next response | `[{"role": "user", "content": "Why is Heroku so awesome?"}]` |
Optional Parameters
Field | Type | Description | Default | Example |
---|---|---|---|---|
`extended_thinking` | object | (Claude 3.7 Sonnet only) enable extended thinking to perform internal reasoning steps (see Anthropic’s extended thinking docs) | `null` | `{"enabled": true, "budget_tokens": 1024, "include_reasoning": true}` |
`max_tokens` | integer | maximum tokens the model may generate before stopping (each token typically represents around 4 characters of text); max value: 4096 for Haiku models, 8192 for Sonnet models | varies | `1024` |
`stop` | array | list of strings that stop the model from generating further tokens if any of the strings appear in the response | `null` | `["foo"]` |
`stream` | boolean | option to stream responses incrementally via server-sent events (useful for chat interfaces and for avoiding timeout errors) | `false` | `true` |
`temperature` | float | controls the randomness of the response; values closer to 0 make the response more focused by favoring high-probability tokens, while values closer to 1.0 encourage more diverse responses by sampling from a broader range of possibilities for each generated token; range: 0.0 to 1.0 | `1.0` | `0.2` |
`tool_choice` | enum or object | option to specify how the model uses the tools listed in `tools` (see tool_choice) | `"auto"` | `"required"` |
`tools` | array | list of tools the model may call (see tools) | `[]` | refer to the JSON example in the tools section |
`top_p` | float | proportion of tokens to consider when generating the next token, in terms of cumulative probability; range: 0.0 to 1.0 | `0.999` | `0.95` |
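
For example, setting `stream` to `true` switches the endpoint to server-sent events. A minimal sketch, assuming the `INFERENCE_*` config vars are exported as shown in the Example Request section below:

```bash
# Stream the response as server-sent events (-N disables curl output buffering)
curl -N $INFERENCE_URL/v1/chat/completions \
  -H "Authorization: Bearer $INFERENCE_KEY" \
  -d @- <<EOF
{
  "model": "$INFERENCE_MODEL_ID",
  "stream": true,
  "messages": [{"role": "user", "content": "Tell me a short story."}]
}
EOF
```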
extended_thinking Object

Extended thinking is only supported for Claude 3.7 Sonnet. Requests that include `extended_thinking` for unsupported models fail.

The `extended_thinking` object lets you request that the model use additional internal tokens for reasoning steps before producing its final output. Enabling extended thinking typically improves reasoning ability on complex tasks.
Field | Type | Description | Default |
---|---|---|---|
`enabled` | boolean | indicates if extended thinking is enabled | `false` |
`budget_tokens` | integer | maximum number of internal “thinking” tokens to use during internal reasoning; must be >= 1024 and < `max_tokens` | `null` |
`include_reasoning` | boolean | indicates if the model’s internal reasoning trace is included in the response | `false` |
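
For instance, a sketch of a request body that enables extended thinking (the token values are illustrative and must satisfy the constraints above):

```json
{
  "model": "claude-3-7-sonnet",
  "messages": [{"role": "user", "content": "Plan a three-city rail itinerary."}],
  "max_tokens": 4096,
  "extended_thinking": {
    "enabled": true,
    "budget_tokens": 2048,
    "include_reasoning": true
  }
}
```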
tools Array of Objects

`tools` lets you provide your model with an array of tools it can choose to call. Use `tool_choice` to specify how the model calls tools.

When provided, your model may send back `tool_calls` in the `role="assistant"` generated message, asking your system to run the specified tool and send back the result in a `role="tool"` message.

Note that these tools are given to the model in the form of an extended prompt, and no further validation is done. Models may make up tool names that don’t exist in the tools array you gave them. To guard against this, we recommend validating tool names on your end when a model sends back a `tool_calls` assistant message.
Field | Type | Description | Example |
---|---|---|---|
`type` | enum<string> | type of tool; always `"function"` | `"function"` |
`function` | object | details about the function to call | see the function object below |
function Object

Field | Type | Description | Example |
---|---|---|---|
`description` | string | description of what the function does, used by the model to choose when and how to call the function | `"This function calculates X"` |
`name` | string | name of the function to be called | `"example_function"` |
`parameters` | object | parameters the function accepts, as a JSON Schema object | `{"type": "object", "properties": {}}` |
Example tools Array

```json
[
  {
    "type": "function",
    "function": {
      "name": "get_current_weather",
      "description": "Get the current weather in a given location",
      "parameters": {
        "type": "object",
        "properties": {
          "location": {
            "type": "string",
            "description": "The city and state, e.g. Portland, OR"
          }
        },
        "required": ["location"]
      }
    }
  }
]
```
tool_choice Object

The `tool_choice` option specifies how the model should use the provided `tools`. It can either be a string (`none`, `auto`, or `required`) or a `tool_choice` object.

`none` means the model calls no tools. `auto` allows the model to call zero or more of the provided tools, and `required` forces the model to call at least one tool before responding to the user.

To force the model to call a specific tool, you can specify a single tool in the `tools` array and pass `"tool_choice": "required"`, or you can force the tool selection by passing a `tool_choice` object that specifies the required function.
Field | Type | Description | Example |
---|---|---|---|
`type` | enum<string> | type of tool; always `"function"` | `"function"` |
`function` | object | JSON object containing the function’s name | `{"name": "example_function"}` |
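
For example, a sketch of a request fragment that forces a call to one specific function (the function name is illustrative):

```json
"tool_choice": {
  "type": "function",
  "function": {"name": "get_current_weather"}
}
```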
messages Array of Objects

The `messages` parameter is an array of message objects. Each message must specify a `role` field that determines the message’s schema (see below). Currently, the supported roles are `user`, `assistant`, `system`, and `tool`.

If the most recent message uses the `assistant` role, the model continues its answer starting from the content in that most recent message.
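
For example, a sketch of a messages array that pre-fills the start of the model’s reply (the content is illustrative):

```json
"messages": [
  {"role": "user", "content": "Name a noble gas."},
  {"role": "assistant", "content": "A well-known noble gas is"}
]
```

The model picks up where the pre-filled `assistant` message leaves off.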
role=user message

`user` messages are the primary way to send queries to your model and prompt it to respond.

Field | Type | Description | Required | Example |
---|---|---|---|---|
`role` | string | role of the message (`user`) | yes | `"user"` |
`content` | string | contents of the user message | yes | `"What is the weather?"` |
role=assistant message

Typically, `assistant` messages are only generated by the model; however, you can create your own or pre-fill a partially completed `assistant` response to influence the content the model generates on its next turn.

Field | Type | Description | Required | Example |
---|---|---|---|---|
`role` | string | role of the message (`assistant`) | yes | `"assistant"` |
`content` | string | contents of the assistant message | yes, unless `tool_calls` is specified | `"Here is the information:"` |
`refusal` | string or null | refusal message by the assistant | no | `"I can't answer that."` |
`tool_calls` | array | tool calls generated by the model | no | `[{"id": "tool_call_12345", "type": "function", "function": {"name": "example_tool", "arguments": "{\"example_input\": 123}"}}]` |
Tool Call Object

Field | Type | Description | Example |
---|---|---|---|
`id` | string | unique ID for the tool call | `"tooluse_abc123"` |
`type` | string | type of call; currently always `"function"` | `"function"` |
`function` | object | function call details | see tool call example |
Tool Call Example

Here’s an example of what a `tool_calls` object might look like when your model decides to call a tool you’ve provided via `tools`:

```json
"tool_calls": [
  {
    "id": "toolu_02F9GXvY5MZAq8Lw3PTNQyJK",
    "type": "function",
    "function": {
      "name": "get_weather",
      "arguments": "{\"location\":\"Portland, OR\"}"
    }
  }
]
```
Function Object

Field | Type | Description | Example |
---|---|---|---|
`name` | string | name of the tool to invoke | `"your_cool_tool"` |
`arguments` | string | JSON-encoded string of tool arguments | `"{}"` |
role=system message

A `system` message is a prompt prefix given to the model to help influence its responses.

Field | Type | Description | Required | Example |
---|---|---|---|---|
`role` | string | role of the message (`system`) | yes | `"system"` |
`content` | string or array | contents of the system message | yes | `"You are a helpful assistant. You favor brevity and avoid hedging. You readily admit when you don't know an answer."` |
role=tool message

A `tool` message object lets you communicate a specified tool’s result (output) to the model.

Field | Type | Description | Required | Example |
---|---|---|---|---|
`role` | string | role of the message (`tool`) | yes | `"tool"` |
`content` | string or array | tool call result (output) | yes | `"Rainy and 84º"` |
`tool_call_id` | string | ID of the tool call this message is responding to | yes | `"toolu_02F9GXvY5MZAq8Lw3PTNQyJK"` |
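
Putting the roles together, here’s a sketch of a full tool round trip: the assistant’s tool call followed by a `tool` message carrying your system’s result (IDs and values are illustrative):

```json
"messages": [
  {"role": "user", "content": "What's the weather like in Portland?"},
  {
    "role": "assistant",
    "content": "I'll check the weather in Portland, OR.",
    "tool_calls": [
      {
        "id": "toolu_02F9GXvY5MZAq8Lw3PTNQyJK",
        "type": "function",
        "function": {
          "name": "get_current_weather",
          "arguments": "{\"location\":\"Portland, OR\"}"
        }
      }
    ]
  },
  {
    "role": "tool",
    "content": "Rainy and 84º",
    "tool_call_id": "toolu_02F9GXvY5MZAq8Lw3PTNQyJK"
  }
]
```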
Request Headers

In the following example, we assume your model resource has an alias of `INFERENCE` (the default).

Header | Type | Description |
---|---|---|
`Authorization` | string | your AI add-on’s `INFERENCE_KEY` value (API bearer token) |

All inference `curl` requests must include an `Authorization` header containing your Heroku Inference key.
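
For instance, you can print the bearer token your add-on provisioned with the Heroku CLI (assuming the default `INFERENCE` alias):

```bash
# Print the bearer token attached to your app's AI add-on
heroku config:get INFERENCE_KEY -a $APP_NAME
```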
Response Format
When a request is successful, the API returns a JSON object with the following structure:
Field | Type | Description | Example |
---|---|---|---|
`id` | string | unique identifier for the chat completion | `"chatcmpl-12345"` |
`object` | string | the response object type; always `"chat.completion"` | `"chat.completion"` |
`created` | integer | Unix timestamp when the completion was created | `1745623456` |
`model` | string | model ID used to generate the response | `"claude-3-7-sonnet"` |
`system_fingerprint` | string | (optional) fingerprint of the system version that generated the output | `"heroku-inf-abc123"` |
`choices` | array of objects | list of generated message choices (always length 1) | see example response |
`usage` | object | token usage statistics | `{"prompt_tokens": 15, "completion_tokens": 13, "total_tokens": 28}` |
Choice Object

The object inside the `choices` array (length 1) has the following structure:

Field | Type | Description | Example |
---|---|---|---|
`index` | integer | index of the choice; always `0` | `0` |
`message` | object | generated message content | see example response |
`finish_reason` | enum<string> | reason the model stopped; one of `"stop"`, `"length"`, `"tool_calls"` | `"stop"` |
Message Object

Field | Type | Description | Example |
---|---|---|---|
`role` | enum<string> | role of the message sender; one of `assistant`, `user`, `system`, `tool` | `assistant` |
`content` | string | text content of the message | `"hello! how can I help you today?"` |
`reasoning` | object | internal reasoning trace, generated if `extended_thinking.include_reasoning` is `true` | see reasoning object below |
`refusal` | string | (optional) refusal message if the model declines to answer | `"I can't answer that."` |
`tool_calls` | array of objects | (optional) tool call requests generated by the model | see example response |
Reasoning Object

If `extended_thinking.include_reasoning` is set to `true`, the model returns a `reasoning` object inside the `message`.

Field | Type | Description | Example |
---|---|---|---|
`thinking` | string | internal chain-of-thought reasoning used to form the model’s response | `"The user is asking about the weather. I should call the get_weather function with Portland, Oregon."` |
`signature` | string | cryptographic signature verifying the reasoning contents | `"ErcBCkgIAxABGAIi..."` |
`redacted_thinking` | string | (optional, typically omitted in the response) redacted version of `thinking` if any parts were removed for safety or privacy | `null` |
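
A sketch of a `message` carrying a `reasoning` object (values are illustrative, mirroring the table above):

```json
"message": {
  "role": "assistant",
  "content": "It's currently rainy in Portland, OR.",
  "reasoning": {
    "thinking": "The user is asking about the weather. I should call the get_weather function with Portland, Oregon.",
    "signature": "ErcBCkgIAxABGAIi..."
  },
  "refusal": null
}
```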
Usage Object
Information about token consumption.
Field | Type | Description | Example |
---|---|---|---|
`prompt_tokens` | integer | number of tokens used in the input prompt | `407` |
`completion_tokens` | integer | number of tokens generated in the response | `107` |
`total_tokens` | integer | total number of tokens used (prompt + completion) | `514` |
Example Request
Let’s walk through an example `/v1/chat/completions` curl request.

First, use this command to export your Heroku config variables as local shell variables:

```bash
eval $(heroku config -a $APP_NAME --shell | grep '^INFERENCE_' | sed 's/^/export /' | tee >(cat >&2))
```
Next, send the `curl` request:

```bash
curl $INFERENCE_URL/v1/chat/completions \
  -H "Authorization: Bearer $INFERENCE_KEY" \
  -d @- <<EOF | jq
{
  "model": "$INFERENCE_MODEL_ID",
  "messages": [{"role": "user", "content": "Hello"}]
}
EOF
```
Example Response
```json
{
  "id": "chatcmpl-1839afa8133ceda215788",
  "object": "chat.completion",
  "created": 1745619466,
  "model": "claude-3-7-sonnet",
  "system_fingerprint": "heroku-inf-1y38gdr",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hi! How can I help you today?",
        "refusal": null
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 8,
    "completion_tokens": 12,
    "total_tokens": 20
  }
}
```
Example Request with Tools
```bash
curl $INFERENCE_URL/v1/chat/completions \
  -H "Authorization: Bearer $INFERENCE_KEY" \
  -d @- <<EOF | jq
{
  "model": "$INFERENCE_MODEL_ID",
  "messages": [
    {
      "role": "user",
      "content": "What's the weather like in Portland?"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g. Portland, OR"
            }
          },
          "required": ["location"]
        }
      }
    }
  ]
}
EOF
```
Example Response with Tools
```json
{
  "id": "chatcmpl-1839adcc2079997417288",
  "object": "chat.completion",
  "created": 1745617422,
  "model": "claude-3-7-sonnet",
  "system_fingerprint": "heroku-inf-1y38gdr",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I'll help you check the current weather in Portland. Since Portland could refer to either Portland, Oregon or Portland, Maine, I should specify the state.\nI'll check Portland, OR as it's the larger and more commonly referenced Portland.",
        "refusal": null,
        "tool_calls": [
          {
            "id": "tooluse_aFByQsacQ_2BmYMGHvkBmg",
            "type": "function",
            "function": {
              "name": "get_current_weather",
              "arguments": "{\"location\":\"Portland, OR\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ],
  "usage": {
    "prompt_tokens": 407,
    "completion_tokens": 107,
    "total_tokens": 514
  }
}
```