In the ever-evolving world of software development, Docker Labs has embarked on an exciting journey through their GenAI series, which delves into the intriguing realm of AI developer tools. At Docker, the ethos is to explore this vast space openly and without succumbing to the hype. With this series, Docker aims to share their findings and collaborate with the developer community in real-time.
While developers have already adopted tools like GitHub Copilot for autocompletion and chat functionalities, there is still enormous potential for AI tools to assist with more specific tasks throughout the entire software lifecycle. To this end, Docker’s exploration is broad and inclusive. They are releasing software as open source, inviting developers to play, explore, and innovate alongside them.
Streamlining Command Line Tool Usage with AI
Using new tools on the command line can be challenging. Even when you are confident that you’ve found the right tool, figuring out how to use it effectively can be a steep learning curve.
Telling an Agent to RT(F)M
A typical workflow for using a new tool might look something like this:
- Install the tool.
- Read the documentation.
- Run the command.
- Repeat.
The question arises: Can we improve this flow using Large Language Models (LLMs)?
Install Tool
Docker provides isolated environments to run tools. Instead of installing commands directly on your host system, Docker creates minimal Docker images for each tool, ensuring that using the tool does not impact the host system. This means you can leave no trace, so to speak.
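As a rough illustration of what that isolation looks like in practice, the sketch below shells out to `docker run` for one of these tool images. The image name `namespace/qrencode:latest`, the mounted working directory, and the assumption that the image's entrypoint is the tool itself are placeholders, not details of Docker's actual pipeline.

```python
import os
import subprocess

# Invoke a containerized tool instead of installing it on the host.
# "namespace/qrencode:latest" is a placeholder image name; we assume its
# entrypoint is the tool itself, so arguments are passed straight through.
result = subprocess.run(
    [
        "docker", "run", "--rm",
        "-v", f"{os.getcwd()}:/work",  # share a scratch directory for output files
        "-w", "/work",
        "namespace/qrencode:latest",
        "-o", "qrcode.png", "https://example.com",
    ],
    capture_output=True,
    text=True,
)
print(result.returncode, result.stderr)
```

Because the container is started with `--rm`, nothing is left behind on the host once the command finishes.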
Read the Documentation
Man pages (manual pages) are one of the conventional ways tool authors provide usage instructions. These pages come with standard retrieval mechanisms, like the `man` command. Some tools also support command-line options like `--help`. Ideally, you should be able to retrieve usage information directly from the tool itself.

In this experiment, Docker has created two entry points for each tool. The first entry point involves passing a set of arguments directly to a command-line program. The OpenAI-compatible description generated for this entry point is standardized for every tool.
```json
{
  "name": "run_my_tool",
  "description": "Run the my_tool command.",
  "parameters": {
    "type": "object",
    "properties": {
      "args": {
        "type": "string",
        "description": "The arguments to pass to my_tool"
      }
    }
  },
  "container": {
    "image": "namespace/my_tool:latest"
  }
}
```
The second entry point allows the agent to read the man page, ideally improving its ability to run the first entry point. This entry point is simpler as it only asks the tool how to use it.
```json
{
  "name": "my_tool_manual",
  "description": "Read the man page for my_tool",
  "container": {
    "image": "namespace/my_tool:latest",
    "command": ["man"]
  }
}
```
Run the Command
Let’s explore a simple example. We want to use a tool called qrencode to generate a QR code for a link. Docker has packaged qrencode into a minimal image. We will now pass this prompt to a few different LLMs trained for tool calling, such as GPT-4, Llama 3.1, and Mistral.
Prompt:
Generate a QR code for the content https://github.com/docker/labs-ai-tools-for-devs/blob/main/prompts/qrencode/README.md. Save the generated image to qrcode.png. If the command fails, read the man page and try again.
The output of the prompt is shown below. You can scan it using your phone.
Figure 1: Content QR code generated by AI assistant.

Repeat
When an LLM generates a description of how to run something, it usually formats the output to be user-friendly for terminal usage:
```bash
qrencode -o qrcode.png 'my content'
```
However, if the LLM is generating tool calls, the output is formatted to be easier to run programmatically:
```json
{
  "function": {
    "arguments": {
      "args": "-o qrcode.png 'my content'"
    },
    "name": "qrencode",
    "id": "call_Vdw2gDFMAwaEUMgxLYBTg8MB"
  }
}
```
We respond to this by spinning up a Docker container.
Running the tool as part of the conversation loop is useful even when the command fails. In Unix, there are standard ways to communicate failures, such as exit codes and stderr streams. These mechanisms create feedback loops to correct our behavior while iterating at the terminal. This same mechanism can be used in a conversation loop involving an LLM.
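A minimal sketch of such a conversation loop is shown below. It assumes the tool-call shape from the example above, uses a placeholder image name, and relies on a hypothetical `call_llm` function standing in for whatever chat-completion client is in use.

```python
import json
import shlex
import subprocess

def run_tool_call(tool_call):
    """Execute a generated tool call by spinning up a throwaway container."""
    args = shlex.split(tool_call["function"]["arguments"]["args"])
    proc = subprocess.run(
        ["docker", "run", "--rm", "namespace/my_tool:latest"] + args,  # placeholder image
        capture_output=True,
        text=True,
    )
    # Exit code and stderr are the Unix feedback the model uses to self-correct.
    return {
        "role": "tool",
        "tool_call_id": tool_call["function"]["id"],
        "content": json.dumps(
            {"exit_code": proc.returncode, "stdout": proc.stdout, "stderr": proc.stderr}
        ),
    }

def converse(messages, call_llm):
    """Keep looping while the model keeps asking for tool calls."""
    while True:
        reply = call_llm(messages)  # hypothetical chat-completion client
        messages.append(reply)
        if not reply.get("tool_calls"):
            return messages
        for tool_call in reply["tool_calls"]:
            messages.append(run_tool_call(tool_call))
```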
For instance, let’s try running a tool that writes ASCII art with the following prompt:
Write ASCII art with the message "Docker" using the font "helvetica". Read the man page for figlet if you detect any errors and then try again.
In our test, this command failed but provided useful feedback through the stderr output stream.
```plaintext
Error: error: could not load font helvetica
```
By including this message in the conversation loop, the assistant can suggest different courses of action. Different LLMs produced varying results. For example, Llama 3.1 provided instructions for installing the missing font, while GPT-4 re-ran the command with a different font.
I’m sorry, it seems the font Helvetica is not available. Let’s try it with the default font.
Here is the output of the figlet command with the default font (term):
```plaintext
 ____             _
|  _ \  ___   ___| | _____ _ __
| | | |/ _ \ / __| |/ / _ \ '__|
| |_| | (_) | (__|   <  __/ |
|____/ \___/ \___|_|\_\___|_|
```
We are still early in understanding how to leverage this capability effectively. However, quarantining these tools in Docker containers is useful as it limits their impact while encouraging experimentation.
Results
Docker started by creating a pipeline to produce minimal Docker images for each tool. The tools were selected based on their usefulness in developer-facing workflows. They continue to add new tools as new use cases emerge. The initial set includes:
- gh, pylint, commitlint, scalafix, gitlint, yamllint, checkmake, gqlint, sqlint, golint, golangci-lint, hadolint, markdownlint-cli2, cargo-toml-lint, ruff, dockle, clj-kondo, selene, tflint, rslint, yapf, puppet-lint, oxlint, kube-linter, csslint, cpplint, ansible-lint, actionlint, black, checkov, jfmt, datefmt, rustfmt, cbfmt, yamlfmt, whatstyle, rufo, fnlfmt, shfmt, zprint, jet, typos, docker-ls, nerdctl, diffoci, dive, kompose, git-test, kubectl, fastly, infracost, sops, curl, fzf, ffmpeg, babl, unzip, jq, graphviz, pstree, figlet, toilet, tldr, qrencode, clippy, go-tools, ripgrep, awscli2, azure-cli, luaformatter, nixpkgs-lint, hclfmt, fop, dnstracer, undocker, dockfmt, fixup_yarn_lock, github-runner, swiftformat, swiftlint, nix-linter, go-critic, regal, textlint, formatjson5, commitmsgfmt.
Missing Manual Pages
Only about 60% of the selected tools have man pages. However, even when a man page is missing, there are usually other ways to obtain help content. The following steps outline the final procedure:
- Try to run the man page.
- Try to run the tool with the argument `--help`.
- Try to run the tool with the argument `-h`.
- Try to run the tool with `--broken args` and then read stderr.

Using this procedure, every tool in the list above eventually produced documentation. A sketch of this fallback chain is shown below.
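The following sketch of that fallback chain assumes the tool images have no fixed entrypoint (so the tool name is passed explicitly) and uses placeholder image and tool names throughout.

```python
import subprocess

def run_in_container(image, args):
    """Run a command inside a throwaway container built from the tool image."""
    return subprocess.run(
        ["docker", "run", "--rm", image] + args,
        capture_output=True,
        text=True,
    )

def fetch_docs(image, tool):
    """Try man, then --help, then -h, then deliberately broken args (read stderr)."""
    attempts = [
        (["man", tool], "stdout"),               # only ~60% of tools ship a man page
        ([tool, "--help"], "stdout"),
        ([tool, "-h"], "stdout"),
        ([tool, "--broken", "args"], "stderr"),  # many tools print usage on unknown flags
    ]
    for args, stream in attempts:
        proc = run_in_container(image, args)
        text = getattr(proc, stream)
        if text.strip():
            return text
    return None

# Example with placeholder names:
# docs = fetch_docs("namespace/qrencode:latest", "qrencode")
```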
Long Manual Pages
Limited context lengths affected some longer manual pages. It was necessary to use standard Retrieval-Augmented Generation (RAG) techniques to summarize verbose man pages. The tactic was to focus on command-line argument descriptions and sections with sample usage, as these had the most significant impact on the agent’s output. The structure of Unix man pages helped with content chunking, as standard sections could be relied upon.
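One way to approach that chunking, leaning on the flush-left, all-caps section headers that man output conventionally uses, is sketched below. The specific set of sections to keep is an assumption based on the description above, not the exact heuristic used in the project.

```python
import re

# Sections assumed to matter most: argument descriptions and sample usage.
KEEP_SECTIONS = {"SYNOPSIS", "OPTIONS", "EXAMPLES"}

def chunk_man_page(text):
    """Split rendered man output on its flush-left, all-caps section headers."""
    sections, current, body = {}, None, []
    for line in text.splitlines():
        stripped = line.strip()
        is_header = (
            not line.startswith((" ", "\t"))
            and re.fullmatch(r"[A-Z][A-Z ]+", stripped) is not None
        )
        if is_header:
            if current is not None:
                sections[current] = "\n".join(body)
            current, body = stripped, []
        else:
            body.append(line)
    if current is not None:
        sections[current] = "\n".join(body)
    return sections

def summarize_for_prompt(text):
    """Keep only the sections worth spending context window on."""
    chunks = chunk_man_page(text)
    return "\n\n".join(f"{name}\n{chunks[name]}" for name in KEEP_SECTIONS if name in chunks)
```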
Subcommands
For a small set of tools, it was necessary to traverse a tree of help menus. These were relatively popular tools, and the LLMs deployed already knew about their command structures. You can easily test this by asking an LLM, for example: “What are the subcommands of Git?” or “What are the subcommands of Docker?” It appears that only popular tools get big enough to be broken into subcommands.
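For those cases, a best-effort walk over nested `--help` output can assemble the tree. The regex used below to spot subcommand names is only a rough heuristic, and real CLIs vary widely in how they format their help menus.

```python
import re
import subprocess

def help_text(cmd):
    """Capture --help output for a command such as ["git"] or ["git", "remote"]."""
    proc = subprocess.run(cmd + ["--help"], capture_output=True, text=True)
    return proc.stdout or proc.stderr

def collect_help(cmd, depth=1):
    """Best-effort walk over a tool's subcommand help tree."""
    text = help_text(cmd)
    node = {"command": " ".join(cmd), "help": text, "subcommands": []}
    if depth > 0:
        # Heuristic: indented lowercase words laid out in columns are often subcommands.
        names = sorted(set(re.findall(r"^\s{2,}([a-z][a-z-]{2,})\s{2,}", text, re.M)))
        for name in names:
            node["subcommands"].append(collect_help(cmd + [name], depth - 1))
    return node
```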
Summary
We should consider the active role that agents can play when determining how to use a tool. The Unix model has given us standards like man pages, stderr streams, and exit codes, which we can leverage when asking an assistant to learn a tool. Beyond distribution, Docker provides process isolation, which is useful for creating safe exploration environments.
Whether or not an AI can successfully generate tool calls may also become a metric for how well-documented a tool is.
To follow along with this effort, check out the GitHub repository for this project.
Learn More
For further details, you can explore the Docker Labs GenAI series.