Generative AI (GenAI) is a transformative force in software development, yet developers still encounter numerous challenges when building AI-powered applications. In this guide, we will build a fully functional GenAI chatbot using Docker Model Runner, paired with observability tools like Prometheus, Grafana, and Jaeger. Along the way, we'll look at the hurdles developers commonly face, show how Docker Model Runner streamlines local model execution, and walk through building a production-ready chatbot with comprehensive monitoring.
The Current Challenges in GenAI Development
Generative AI is reshaping the landscape of software development, but constructing AI-driven applications is fraught with challenges. One significant issue is the fragmented AI landscape, where developers must integrate various libraries and frameworks that are not inherently compatible. Additionally, executing large language models efficiently necessitates specialized hardware configurations, which differ across platforms. This often results in teams maintaining separate environments for their application code and AI models, complicating the development process.
Another challenge is the lack of standardized methods for storing, versioning, and deploying models, leading to inconsistent practices. Moreover, relying on cloud-based AI services can lead to unpredictable costs that scale with usage. Sending data to external AI services also poses privacy and security risks, particularly for applications handling sensitive information. These challenges collectively create a frustrating developer experience, hampering experimentation and slowing down innovation at a time when businesses are eager to accelerate their AI adoption.
How Docker is Addressing These Challenges
Docker Model Runner offers a revolutionary approach to GenAI development by integrating AI model execution directly into familiar container workflows. This innovation simplifies the process for developers by making it easier to run AI models locally, within the existing Docker framework. Key benefits of using Docker Model Runner include:
- Simplified Model Execution: Execute AI models locally with a simple Docker CLI command, eliminating the need for complex setup.
- Hardware Acceleration: Gain direct access to GPU resources without the overhead associated with containerization.
- Integrated Workflow: Seamlessly integrate with existing Docker tools and container development practices.
- Standardized Packaging: Distribute models as Open Container Initiative (OCI) artifacts through the same registries you already use.
- Cost Control: Eliminate unpredictable API costs by running models locally.
- Data Privacy: Keep sensitive data within your infrastructure, with no external API calls.
This approach fundamentally transforms the way developers can build and test AI-powered applications, making local development faster, more secure, and significantly more efficient.
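To illustrate how lightweight this workflow is, the Docker Model Runner CLI lets you pull and run a model in a couple of commands. The exact subcommands and flags can vary with your Docker Desktop version, so treat this as a sketch rather than a definitive reference:

```bash
# Pull a model (distributed as an OCI artifact) from Docker Hub
docker model pull ai/llama3.2:1B-Q8_0

# List the models available locally
docker model list

# Run a one-off prompt against the model locally
docker model run ai/llama3.2:1B-Q8_0 "Summarize what Docker Model Runner does."
```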
Building an AI Chatbot with Docker
This guide will walk you through the process of building a comprehensive GenAI application, showcasing how to create a fully-featured chat interface powered by Docker Model Runner, complete with advanced observability tools to monitor and optimize your AI models.
Project Overview
The project is a complete Generative AI interface that demonstrates how to:
- Create a responsive React/TypeScript chat UI with streaming responses.
- Build a Go backend server that integrates with Docker Model Runner.
- Implement comprehensive observability with metrics, logging, and tracing.
- Monitor AI model performance with real-time metrics.
Architecture
The application consists of several key components:
- Frontend: Sends chat messages to the backend API.
- Backend: Formats the messages and sends them to the Model Runner.
- LLM: Processes the input and generates a response.
- Backend: Streams the tokens back to the frontend as they’re generated.
- Frontend: Displays the incoming tokens in real-time.
- Observability Components: Collect metrics, logs, and traces throughout the process.
This architecture enables a seamless flow of data between the frontend, backend, Model Runner, and observability tools like Prometheus, Grafana, and Jaeger.
Project Structure
The project is structured as follows:
```
.
├── Dockerfile
├── README-model-runner.md
├── README.md
├── backend.env
├── compose.yaml
├── frontend
..
├── go.mod
├── go.sum
├── grafana
│   └── provisioning
├── main.go
├── main_branch_update.md
├── observability
│   └── README.md
├── pkg
│   ├── health
│   ├── logger
│   ├── metrics
│   ├── middleware
│   └── tracing
├── prometheus
│   └── prometheus.yml
├── refs
│   └── heads
..
```

We'll explore the key files and understand how they work together throughout this guide.
Prerequisites
Before starting, ensure you have:
- Docker Desktop (version 4.40 or newer).
- Docker Model Runner enabled.
- At least 16GB of RAM for efficient AI model execution.
- Familiarity with Go (for backend development).
- Familiarity with React and TypeScript (for frontend development).
Getting Started
To run the application, follow these steps:
- Clone the Repository:
```bash
git clone https://github.com/dockersamples/genai-model-runner-metrics
cd genai-model-runner-metrics
```
- Enable Docker Model Runner in Docker Desktop:
- Go to Settings > Features in Development > Beta tab.
- Enable “Docker Model Runner”.
- Select “Apply and restart”.
- Download the Model:
For this demonstration, we’ll use Llama 3.2, but you can substitute any model of your choice:
```bash
docker model pull ai/llama3.2:1B-Q8_0
```
- Start the Application:
```bash
docker compose up -d --build
```
- Access the Chat Interface:
Open your browser and navigate to http://localhost:3000. You’ll be greeted with a modern chat interface featuring a clean, responsive design with a dark/light mode toggle, a message input area ready for your first prompt, and model information displayed in the footer.
- View Metrics:
Click on "Expand" to view metrics like input tokens, output tokens, total requests, average response time, and error rate.
Implementation Details
Let’s delve into the workings of the key components:
- Frontend Implementation: The React frontend provides a clean, responsive chat interface built with TypeScript and modern React patterns. The core `App.tsx` component manages state for dark mode preferences and model metadata fetched from the backend's health endpoint. When the component mounts, the `useEffect` hook automatically retrieves information about the currently running AI model, displaying details like the model name directly in the footer.
- Backend Implementation: The Go backend communicates with Docker Model Runner through its OpenAI-compatible API. Model Runner exposes endpoints that match OpenAI's API structure, so standard OpenAI-style clients and request payloads work unchanged (a minimal sketch of this streaming call appears after this list).
- Metrics Flow: The backend acts as a metrics bridge, connecting to llama.cpp via Model Runner API, collecting performance data from each API call, calculating metrics like tokens per second and memory usage, and exposing all metrics in Prometheus format.
- LLama.cpp Metrics Integration: The project provides detailed real-time metrics for llama.cpp models, including tokens per second (generation speed), context window size (maximum context length in tokens), prompt evaluation time (time spent processing input prompt), memory per token (memory efficiency), thread utilization (CPU threads used), and batch size (token processing batch size).
- Chat Implementation with Streaming: The chat endpoint implements streaming for real-time token generation, ensuring tokens appear in real-time in the user interface, providing a smooth and responsive chat experience.
- Performance Measurement: The system measures various performance aspects of the model, including first token time, tokens per second, and others, helping to optimize the user experience.
- Metrics Collection: The `metrics.go` file defines a comprehensive set of Prometheus metrics that allow monitoring of both application performance and llama.cpp model behavior.
- Core Metrics Architecture: The file establishes a collection of Prometheus metric types, including counters for cumulative values, gauges for values that can increase and decrease, and histograms for measuring distributions of values (a sketch of such definitions follows this list).
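To make the backend, streaming, and performance items above concrete, here is a minimal, self-contained Go sketch of calling Model Runner's OpenAI-compatible chat completions endpoint with streaming enabled and measuring time-to-first-token. This is not the project's actual code: the base URL, model name, and timing math are assumptions you would adapt to your setup (from inside a container, Model Runner is typically reachable at a Docker-provided internal hostname; from the host, you may need to enable TCP access in Docker Desktop).

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"strings"
	"time"
)

// Assumed endpoint and model; adjust for your environment. From the host you
// might instead use an address like http://localhost:12434/engines/v1 if TCP
// access to Model Runner is enabled.
const (
	baseURL = "http://model-runner.docker.internal/engines/v1"
	model   = "ai/llama3.2:1B-Q8_0"
)

type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
	Stream   bool          `json:"stream"`
}

func main() {
	payload, err := json.Marshal(chatRequest{
		Model:    model,
		Messages: []chatMessage{{Role: "user", Content: "Hello!"}},
		Stream:   true,
	})
	if err != nil {
		log.Fatal(err)
	}

	start := time.Now()
	resp, err := http.Post(baseURL+"/chat/completions", "application/json", bytes.NewReader(payload))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var firstToken time.Duration
	chunks := 0

	// The streaming response arrives as server-sent events: lines prefixed with "data: ".
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.HasPrefix(line, "data: ") {
			continue
		}
		data := strings.TrimPrefix(line, "data: ")
		if data == "[DONE]" {
			break
		}
		if firstToken == 0 {
			firstToken = time.Since(start) // time-to-first-token
		}
		chunks++ // each chunk roughly corresponds to one generated token
		// In the real backend, each chunk would be decoded and forwarded to the frontend here.
	}

	elapsed := time.Since(start).Seconds()
	fmt.Printf("first token after %s, ~%.1f chunks/sec over %d chunks\n",
		firstToken, float64(chunks)/elapsed, chunks)
}
```

In the project itself, this logic lives behind the backend's HTTP handlers, and the measured timings feed the Prometheus metrics sketched next.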
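And here is a hedged sketch of what the definitions in a file like `metrics.go` might look like, using the Prometheus Go client. The metric names, labels, and buckets below are illustrative assumptions, not the project's actual identifiers:

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// Counter: cumulative number of chat requests served, labeled by model and outcome.
	RequestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "genai_app_chat_requests_total",
		Help: "Total number of chat requests processed.",
	}, []string{"model", "status"})

	// Gauge: a value that can go up and down, e.g. memory used per generated token.
	MemoryPerToken = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "genai_app_llamacpp_memory_per_token_bytes",
		Help: "Approximate memory used per generated token.",
	})

	// Histogram: distribution of generation speed across requests.
	TokensPerSecond = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "genai_app_llamacpp_tokens_per_second",
		Help:    "Observed generation speed in tokens per second.",
		Buckets: prometheus.LinearBuckets(5, 5, 10),
	})

	// Histogram: time to first token, a key responsiveness metric for streaming chat.
	FirstTokenSeconds = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "genai_app_first_token_seconds",
		Help:    "Time between request start and first streamed token.",
		Buckets: prometheus.DefBuckets,
	})
)
```

Because these collectors are created with promauto, they register themselves with the default registry, so exposing them is just a matter of serving promhttp.Handler() on a /metrics route for Prometheus to scrape.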
Docker Compose: LLM as a First-Class Service
With Docker Model Runner integration, Docker Compose turns AI model deployment into a standard service definition, akin to any other infrastructure component. The `compose.yaml` file defines the entire AI application: the AI model, the application backend and frontend, the observability stack, and all networking and dependencies, with the llm service using Docker's model provider. A hedged sketch of this pattern appears below.
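As a rough illustration of that idea (service names, build contexts, and the backend port here are assumptions, and the project's actual compose.yaml also wires in Prometheus, Grafana, and Jaeger), a model can be declared as a provider-backed service that the rest of the stack simply depends on:

```yaml
services:
  llm:
    provider:
      type: model            # Docker's model provider runs this via Model Runner
      options:
        model: ai/llama3.2:1B-Q8_0

  backend:
    build: .
    depends_on:
      - llm
    ports:
      - "8080:8080"          # assumed backend port

  frontend:
    build: ./frontend
    depends_on:
      - backend
    ports:
      - "3000:3000"          # chat UI at http://localhost:3000
```

Because the model is just another service, `docker compose up` brings up the model, the application, and the observability stack together, and dependency ordering works the same way it does for databases or queues.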
Conclusion
The genai-model-runner-metrics project exemplifies a powerful approach to building AI-powered applications with Docker Model Runner while maintaining comprehensive visibility into performance metrics. By combining local model execution with extensive metrics, developers gain both the privacy and cost benefits of local execution alongside the observability essential for production applications.
Whether you’re developing a customer support bot, a content generation tool, or a specialized AI assistant, this architecture provides a solid foundation for reliable, observable, and efficient AI applications. The metrics-driven approach ensures continuous monitoring and optimization, leading to better user experiences and more efficient resource utilization.
Learn More
- Clone the Repository: https://github.com/dockersamples/genai-model-runner-metrics
- Read our quickstart guide to Docker Model Runner.
- Find documentation for Model Runner.
- Subscribe to the Docker Navigator Newsletter.
- New to Docker? Create an account.
- Have questions? The Docker community is here to help.
By following this guide, you can embark on your journey to build a robust, locally-executed, metrics-driven Generative AI application with Docker Model Runner.