By 2025, every company and their grandmother has an "AI-powered chatbot." Most of them are trash. They hallucinate facts, forget what you said 3 messages ago, and can't actually DO anything except regurgitate text.
This guide is about building the OTHER kind - production-ready chatbots that stream responses, remember context, call functions to take real actions, and don't bankrupt you with API costs.
No fluff. Just code and strategy.
The Foundation (Get This Right First)
First, install the essentials:
pip install openai python-dotenv tiktoken tenacity
Create a .env file (NEVER hardcode API keys, seriously). Learn best practices for securing API keys before working with AI APIs in production.
OPENAI_API_KEY=sk-your-actual-key-here
MODEL_NAME=gpt-4o-mini
MAX_TOKENS=2000
Basic setup:
import os
from dotenv import load_dotenv
from openai import OpenAI
load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
# Simple chat
response = client.chat.completions.create(
    model="gpt-4o-mini",  # Cheaper than gpt-4, good enough for most tasks
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain async/await in JavaScript"}
    ]
)
print(response.choices[0].message.content)
That's your "Hello World." Now let's make it actually useful.
Streaming: Stop Making Users Wait
Nobody wants to stare at a loading spinner for 8 seconds. Streaming delivers text token-by-token, like a typewriter. It feels MUCH faster even if the total time is similar.
def chat_with_streaming(prompt):
    """
    Stream response token by token
    """
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # The magic flag
    )
    print("Assistant: ", end="", flush=True)
    full_response = ""
    for chunk in stream:
        # Guard: some chunks (the first and last) arrive without content
        if chunk.choices and chunk.choices[0].delta.content:
            content = chunk.choices[0].delta.content
            print(content, end="", flush=True)
            full_response += content
    print()  # Newline
    return full_response
# Try it
chat_with_streaming("Explain Docker in 2 sentences")
Frontend integration: In your web app, send these chunks via Server-Sent Events (SSE) or WebSocket. Users see text appear in real-time. Feels magical.
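Here's a minimal sketch of the server side, assuming FastAPI (an assumption, not the only option; any framework with streaming responses works the same way):

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/chat")
def chat_endpoint(prompt: str):
    # Re-uses the OpenAI client from the setup above
    def event_stream():
        stream = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                # SSE format: "data: <payload>" followed by a blank line
                yield f"data: {chunk.choices[0].delta.content}\n\n"
    return StreamingResponse(event_stream(), media_type="text/event-stream")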
Context Management: Making the Bot Remember
Here's the dirty secret: GPT-4 has zero memory. Every API call is completely isolated. It's like talking to someone with amnesia who only remembers the current conversation you show them.
The solution? Send the entire conversation history with every request.
class ChatBot:
    def __init__(self, system_prompt):
        self.messages = [
            {"role": "system", "content": system_prompt}
        ]

    def chat(self, user_message):
        # Add user message to history
        self.messages.append({
            "role": "user",
            "content": user_message
        })
        # Get response
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=self.messages
        )
        assistant_message = response.choices[0].message.content
        # Add assistant response to history
        self.messages.append({
            "role": "assistant",
            "content": assistant_message
        })
        return assistant_message
# Usage
bot = ChatBot("You are a Python tutor who explains concepts with food analogies.")
print(bot.chat("My name is Jiji"))
# "Nice to meet you, Jiji! How can I help you learn Python today?"
print(bot.chat("What's my name?"))
# "Your name is Jiji!"
The Context Window Problem
GPT-4o models have a 128k-token context window (roughly 300 pages of text), but:
- It's expensive (every token costs money)
- Performance degrades with huge contexts
Solutions:
Option 1: Sliding Window (Keep last N messages)
MAX_HISTORY = 10

def trim_history(messages):
    # Always keep the system message
    system_msg = messages[0]
    # Slice past the system message so it never gets duplicated
    recent = messages[1:][-MAX_HISTORY:]
    return [system_msg] + recent
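To wire this into the ChatBot class from earlier, trim right before each API call (one extra line in chat(), sketched here):

# Inside ChatBot.chat, just before calling the API:
self.messages = trim_history(self.messages)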
Option 2: Summarization (Compress old context)
def summarize_conversation(messages):
    if len(messages) < 20:
        return messages
    # Ask GPT to summarize everything except recent messages
    old_messages = messages[1:-5]  # Skip system and last 5
    recent_messages = messages[-5:]
    summary_prompt = "Summarize this conversation concisely:\n\n"
    summary_prompt += "\n".join([f"{m['role']}: {m['content']}" for m in old_messages])
    summary = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": summary_prompt}]
    ).choices[0].message.content
    return [
        messages[0],  # System
        {"role": "system", "content": f"Previous conversation summary: {summary}"},
        *recent_messages
    ]
Note: For bots requiring access to large knowledge bases beyond conversation history, implement RAG (Retrieval-Augmented Generation) instead of stuffing everything into context.
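The RAG pattern in one hedged sketch: retrieve a handful of relevant snippets per question and inject only those into the prompt. search_docs here is a hypothetical retriever (vector database, keyword search, whatever you have):

def rag_chat(user_message):
    snippets = search_docs(user_message)  # hypothetical retrieval function
    context = "\n".join(snippets)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": user_message}
        ]
    )
    return response.choices[0].message.content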
System Prompts: The Hidden Instruction Manual
The system message is your control panel. It's what makes a bot a "customer support agent" vs a "sarcastic comedian."
Bad system prompt:
"You are helpful."
Good system prompt:
"""You are SupportBot, a customer service agent for TechCorp.
PERSONALITY:
- Professional but warm
- Patient with technical questions
- Never make promises about refunds (escalate to humans)
CONSTRAINTS:
- Responses must be under 100 words
- Always ask for order ID before looking up orders
- If you don't know, say "Let me connect you with a specialist"
KNOWLEDGE:
- Return window: 30 days
- Shipping: 3-5 business days for Kenya
- Support hours: Mon-Fri 9am-6pm EAT
"""
See the difference? Specificity is power.
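Drop a prompt like that into the ChatBot class from earlier and the persona sticks for the whole conversation (SUPPORT_PROMPT here just stands in for the triple-quoted string above):

support_bot = ChatBot(SUPPORT_PROMPT)
print(support_bot.chat("I want a refund for order 12345"))
# Expected behavior: it asks for details and escalates, never promises the refund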
These system prompts follow proven prompt engineering patterns like CO-STAR and role-based prompting for consistent bot behavior.
Function Calling: When Chatbots Take Action
This is where it gets SPICY. Function calling lets your bot actually DO things: look up data, send emails, book appointments, charge credit cards.
How it works:
1. You describe available functions to GPT
2. User asks something requiring that function
3. GPT returns a structured JSON response saying "call this function with these parameters"
4. You execute the function
5. You send the result back to GPT
6. GPT gives a user-friendly response
Example: Weather Bot
import json
# 1. Define available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name, e.g. Nairobi, London"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]
# 2. Mock function (replace with real API call)
def get_weather(location, unit="celsius"):
    # In production, call OpenWeatherMap or similar
    fake_data = {
        "Nairobi": {"temp": 22, "condition": "Partly cloudy"},
        "London": {"temp": 8, "condition": "Rainy"},
        "Dubai": {"temp": 35, "condition": "Hot AF"}
    }
    weather = fake_data.get(location, {"temp": 20, "condition": "Unknown"})
    return json.dumps(weather)
# 3. Chat function with tool handling
def chat_with_tools(user_message):
    messages = [{"role": "user", "content": user_message}]
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        tools=tools,
        tool_choice="auto"  # Let GPT decide if it needs to call a function
    )
    response_message = response.choices[0].message
    tool_calls = response_message.tool_calls
    # Check if GPT wants to call a function
    if tool_calls:
        messages.append(response_message)
        for tool_call in tool_calls:
            function_name = tool_call.function.name
            function_args = json.loads(tool_call.function.arguments)
            # Execute the function
            if function_name == "get_weather":
                function_response = get_weather(**function_args)
            # Send function result back to GPT
            messages.append({
                "tool_call_id": tool_call.id,
                "role": "tool",
                "name": function_name,
                "content": function_response
            })
        # Get final response
        second_response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages
        )
        return second_response.choices[0].message.content
    return response_message.content
# Try it
print(chat_with_tools("What's the weather in Nairobi?"))
# "It's currently 22°C and partly cloudy in Nairobi."
Real-world functions you can add:
- lookup_order(order_id) - Check order status
- book_appointment(date, time) - Schedule meetings
- search_knowledge_base(query) - Query your docs
- send_email(to, subject, body) - Send notifications
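Each of these follows the same JSON schema as get_weather. For example, a hypothetical lookup_order tool definition might look like:

tools.append({
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": "Check the status of a customer order",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "The order ID, e.g. ORD-1042"
                }
            },
            "required": ["order_id"]
        }
    }
})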
For more sophisticated tool usage and autonomous behavior, upgrade to building full AI agents with LangChain.
Error Handling: When APIs Misbehave
OpenAI's API will fail. Rate limits (429), server errors (500), timeouts. Handle it gracefully.
from tenacity import retry, stop_after_attempt, wait_exponential
from openai import OpenAIError, RateLimitError

@retry(
    wait=wait_exponential(multiplier=1, min=4, max=60),
    stop=stop_after_attempt(3),
    reraise=True
)
def robust_chat(messages):
    try:
        return client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            timeout=30  # Don't wait forever
        )
    except RateLimitError:
        print("Rate limit hit, retrying...")
        raise  # Retry will handle this
    except OpenAIError as e:
        print(f"API error: {e}")
        raise
User-facing error messages:
try:
    response = robust_chat(messages)
except Exception as e:
    return "Sorry, I'm having trouble right now. Please try again in a moment."
Never show raw API errors to users. Nobody cares about "HTTP 429 Too Many Requests." They care about "the bot is broken."
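One way to map specific failures to friendlier copy, assuming the robust_chat helper above (RateLimitError and APITimeoutError are real exception classes in the openai package):

from openai import RateLimitError, APITimeoutError

def friendly_chat(messages):
    try:
        return robust_chat(messages).choices[0].message.content
    except RateLimitError:
        return "I'm handling a lot of requests right now. Try again in a minute."
    except APITimeoutError:
        return "That took longer than expected. Please resend your message."
    except Exception:
        return "Sorry, I'm having trouble right now. Please try again in a moment."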
Cost Optimization: Don't Go Broke
GPT-4 is expensive. Here's how to keep costs sane:
1. Use the Right Model
# Expensive ($10/1M input tokens, $30/1M output)
model="gpt-4"
# Cheaper ($0.15/1M input, $0.60/1M output) - handles most everyday tasks nearly as well
model="gpt-4o-mini"
Strategy: Use gpt-4o-mini for 90% of requests. Use gpt-4 only for complex reasoning.
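A naive router sketch to implement that split (the heuristics here are placeholders, not a proven recipe; tune them for your own traffic):

def pick_model(user_message):
    # Send long or reasoning-heavy prompts to the bigger model
    hard_signals = ("step by step", "prove", "analyze", "compare")
    if len(user_message) > 2000 or any(s in user_message.lower() for s in hard_signals):
        return "gpt-4"
    return "gpt-4o-mini"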
2. Count Tokens Before Sending
import tiktoken

def count_tokens(text, model="gpt-4o-mini"):
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# Check before sending
user_message = "Very long message..."
if count_tokens(user_message) > 1000:
    print("Warning: This will cost $$$")
3. Set Max Tokens
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    max_tokens=500  # Prevent 10-page essays
)
4. Cache Identical Requests
from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_response(prompt):
    # Only call the API if we haven't seen this exact prompt before
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
The Production Checklist
Before you deploy:
- ✅ Streaming enabled for better UX
- ✅ Context management implemented
- ✅ System prompt is specific and tested
- ✅ Error handling with retries
- ✅ Rate limiting on your end (don't let one user spam the API; see the sketch after this list)
- ✅ Logging for debugging (save failed interactions)
- ✅ Cost monitoring dashboard
- ✅ Human escalation path (some questions need humans)
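For the rate-limiting item, a minimal in-memory sliding-window limiter looks like this (a sketch; in a real deployment you'd back it with Redis or similar):

import time
from collections import defaultdict

_request_log = defaultdict(list)

def allow_request(user_id, limit=10, window=60):
    # Keep only the timestamps inside the current window
    now = time.time()
    _request_log[user_id] = [t for t in _request_log[user_id] if now - t < window]
    if len(_request_log[user_id]) >= limit:
        return False  # Over the limit: reject before spending API tokens
    _request_log[user_id].append(now)
    return True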
The Bottom Line
Building a chatbot in 2025 is less about prompt engineering and more about systems engineering. It's about:
- Managing state (conversation history)
- Orchestrating tools (function calling)
- Handling failures gracefully
- Optimizing costs
Start simple. Get the basic chat loop working. Add streaming. Then add memory. Then add one function. Test it thoroughly. Add another function.
The bots that actually ship aren't the ones with the fanciest prompts. They're the ones with solid error handling and proper cost controls.
Remember: a chatbot is not just for chatting, it should actually help users get things done.