For early-stage open-source projects, the "Getting started" guide is often the first real interaction a developer has with the project. If a command fails, an output doesn't match, or a step is unclear, most users won't file a bug report; they will just move on.
Drasi, a CNCF sandbox project that detects changes in your data and triggers immediate reactions, is supported by our small team of four engineers in Microsoft Azure's Office of the Chief Technology Officer. We have comprehensive tutorials, but we're shipping code faster than we can manually test them.
The team didn't realize how large this gap was until late 2025, when GitHub updated its Dev Container infrastructure, bumping the minimum Docker version. The update broke the Docker daemon connection, and every single tutorial stopped working. Because we relied on manual testing, we didn't immediately know the extent of the damage. Any developer trying Drasi during that window would have hit a wall.
This incident forced a realization: with advanced AI coding assistants, documentation testing can be turned into a monitoring problem.
The problem: Why does documentation break?
Documentation usually breaks for two reasons:
1. The curse of knowledge
Experienced developers write documentation with implicit context. When we write "wait for the query to bootstrap," we know to run `drasi list query` and watch for the `Running` status, or better yet to run the `drasi wait` command. A new user has no such context. Neither does an AI agent. They read the instructions literally and don't know what to do. They get stuck on the "how," while we only document the "what."
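To make the gap concrete, here is that implicit knowledge spelled out as commands (both appear in our docs; the query file name is whatever the tutorial had you create):

```bash
# The implicit knowledge, made explicit:
drasi list query           # poll until the STATUS column shows "Running"
drasi wait -f query.yaml   # better: block until the resources in query.yaml are ready
```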
2. Silent drift
Documentation doesn't fail loudly like code does. When you rename a configuration file in your codebase, the build fails immediately. But when your documentation still references the old filename, nothing happens. The drift accumulates silently until a user reports confusion.
This is compounded for tutorials like ours, which spin up sandbox environments with Docker, k3d, and sample databases. When any upstream dependency changes (a deprecated flag, a bumped version, or a new default), our tutorials can break silently.
The solution: Agents as synthetic users
To solve this, we treated tutorial testing as a simulation problem. We built an AI agent that acts as a "synthetic new user."
This agent has three critical traits:
- It's naïve: It has no prior knowledge of Drasi; it knows only what's explicitly written in the tutorial.
- It's literal: It executes every command exactly as written. If a step is missing, it fails.
- It's unforgiving: It verifies every expected output. If the doc says, "You should see 'Success'," and the command-line interface (CLI) just returns silently, the agent flags it and fails fast.
The stack: GitHub Copilot CLI and Dev Containers
We built a solution using GitHub Actions, Dev Containers, Playwright, and the GitHub Copilot CLI.
Our tutorials require heavy infrastructure:
- A full Kubernetes cluster (k3d)
- Docker-in-Docker
- Real databases (such as PostgreSQL and MySQL)
We needed an environment that exactly matches what our human users experience. If users run a tutorial in a specific Dev Container on GitHub Codespaces, our test must run in that same Dev Container.
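As a rough sketch, the sandbox each run gets looks something like this (cluster and container names are illustrative, not our exact setup):

```bash
# Stand up the tutorial sandbox inside the Dev Container:
k3d cluster create drasi-tutorial          # local Kubernetes cluster in Docker
docker run -d --name tutorial-postgres \
  -e POSTGRES_PASSWORD=example -p 5432:5432 postgres   # sample database
drasi init                                 # install Drasi into the cluster
```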
The architecture
Inside the container, we invoke the Copilot CLI with a specialized system prompt (view the full prompt here):
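A minimal sketch of that invocation, assuming the prompt lives in a file (the path and the `--allow-all-tools` flag are our reading of the CLI, not the exact script we run):

```bash
# Run the Copilot CLI in prompt mode with the synthetic-user instructions:
copilot -p "$(cat synthetic-user-prompt.md)" --allow-all-tools
```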

Using the CLI's prompt mode (-p) with this prompt gives us an agent that can execute terminal commands, write files, and run browser scripts, just like a human developer sitting at their terminal. For the agent to simulate a real user, it needs these capabilities.
To enable the agent to open webpages and interact with them as any human following the tutorial steps would, we also install Playwright in the Dev Container. The agent also takes screenshots, which it then compares against those provided in the documentation.
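For example, the agent can capture a page with Playwright's CLI, along these lines (URL and output path are placeholders):

```bash
# One-time browser setup in the container, then capture the tutorial's web UI:
npx playwright install --with-deps chromium
npx playwright screenshot http://localhost:8080 artifacts/ui.png
```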
Security model
Our security model is built around one principle: the container is the boundary.
Rather than trying to restrict individual commands (a losing game when the agent needs to run arbitrary Node scripts for Playwright), we treat the entire Dev Container as an isolated sandbox and control what crosses its boundaries: no outbound network access beyond localhost, a Personal Access Token (PAT) with only the "Copilot Requests" permission, ephemeral containers destroyed after each run, and a maintainer-approval gate for triggering workflows.
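An illustrative way to enforce that egress rule (not our exact configuration; in practice the Copilot API endpoint also needs an allowlist entry):

```bash
# Lock down outbound traffic from inside the container:
iptables -A OUTPUT -o lo -j ACCEPT                                 # keep localhost
iptables -A OUTPUT -m state --state ESTABLISHED,RELATED -j ACCEPT  # keep replies
iptables -A OUTPUT -j DROP                                         # drop other egress
```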
Dealing with non-determinism
One of the biggest challenges with AI-based testing is non-determinism. Large language models (LLMs) are probabilistic: sometimes the agent retries a command; other times it gives up.
We handled this with a three-stage retry with model escalation (start with Gemini-Pro, on failure try Claude Opus), semantic comparison for screenshots instead of pixel-matching, and verification of core data fields rather than volatile values.
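The escalation loop is roughly this shape (model identifiers, the `--model` flag, and the success check are simplifications; the sentinel string is explained below):

```bash
# Three attempts, escalating to a stronger model on failure:
for model in gemini-pro claude-opus claude-opus; do
  copilot -p "$(cat synthetic-user-prompt.md)" --model "$model" --allow-all-tools
  grep -q "FINAL RESULT: PASS" artifacts/final-report.md && exit 0
done
exit 1
```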
We also have a list of tight constraints in our prompts that prevent the agent from going on a debugging adventure, directives that control the structure of the final report, and skip directives that tell the agent to bypass optional tutorial sections, like setting up external services.
Artifacts for debugging
When a run fails, we need to know why. Since the agent is running in a transient container, we can't just Secure Shell (SSH) in and look around.
So our agent preserves evidence of every run: screenshots of web UIs, terminal output of critical commands, and a final markdown report detailing its reasoning, as shown here:

These artifacts are uploaded to the GitHub Actions run summary, allowing us to "time travel" back to the exact moment of failure and see what the agent saw.

Parsing the agent’s report
With LLMs, getting a definitive "Pass/Fail" signal that a machine can understand can be tricky. An agent might write a long, nuanced conclusion like this (a hypothetical example):
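```text
All twelve steps completed. The query reached Running, although the dashboard
shows an extra column that is not in the docs' screenshot. Overall the tutorial
appears functional, with minor cosmetic drift.
```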

To make this actionable in a CI/CD pipeline, we had to do some prompt engineering. We explicitly instructed the agent (paraphrasing; the exact sentinel string shown here is illustrative):
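```text
End your report with exactly one of the following lines, and nothing after it:
FINAL RESULT: PASS
FINAL RESULT: FAIL
```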

In our GitHub Action, we then simply grep for this specific string to set the exit code of the workflow.
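A sketch of that step (the sentinel must match the prompt directive above; paths are placeholders):

```bash
# Convert the agent's sentinel line into a CI exit code:
if grep -q "FINAL RESULT: PASS" artifacts/final-report.md; then
  echo "Tutorial verified."
else
  echo "Tutorial failed; see uploaded artifacts." >&2
  exit 1
fi
```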

Simple techniques like this bridge the gap between AI's fuzzy, probabilistic outputs and CI's binary pass/fail expectations.
Automation
We now have an automated version of the workflow that runs weekly. It evaluates all our tutorials in parallel: each tutorial gets its own sandbox container and a fresh perspective from the agent acting as a synthetic user. If any tutorial evaluation fails, the workflow is configured to file an issue on our GitHub repo.
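That failure handler can be as simple as a `gh` call (title and label are illustrative):

```bash
# File an issue containing the agent's report when a weekly run fails:
gh issue create \
  --title "Weekly tutorial validation failed: ${TUTORIAL_NAME}" \
  --body-file artifacts/final-report.md \
  --label documentation
```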
This workflow can optionally also be run on pull requests, but to prevent attacks we have added a maintainer-approval requirement and a `pull_request_target` trigger, which means that even on pull requests from external contributors, the workflow that executes will be the one in our main branch.
Running the Copilot CLI requires a PAT, which is stored in the environment secrets for our repo. To make sure it doesn't leak, every run requires maintainer approval, except the automated weekly run, which only runs on the `main` branch of our repo.
What we found: Bugs that matter
Since implementing this system, we have run over 200 "synthetic user" sessions. The agent identified 18 distinct issues, including some serious environment problems and documentation issues like the ones below. Fixing them improved the docs for everyone, not just the bot.
- Implicit dependencies: In one tutorial, we told users to create a tunnel to a service. The agent ran the command, and then, following the next instruction, killed the process to run the next command.
  The fix: We realized we hadn't told the user to keep that terminal open. We added a warning: "This command blocks. Open a new terminal for subsequent steps."
- Missing verification steps: We wrote: "Verify the query is running." The agent got stuck: "How, exactly?"
  The fix: We replaced the vague instruction with an explicit command: `drasi wait -f query.yaml`.
- Format drift: Our CLI output had evolved. New columns had been added; older fields had been deprecated. The documentation screenshots still showed the 2024 version of the interface. A human tester might gloss over this ("it looks mostly right"). The agent flagged every mismatch, forcing us to keep our examples up to date.
AI as a force multiplier
We often hear about AI replacing humans, but in this case, the AI is providing us with a workforce we never had.
To duplicate what our system does (running six tutorials across fresh environments every week), we would need a dedicated QA resource or a significant budget for manual testing. For a four-person team, that's unattainable. By deploying these Synthetic Users, we have effectively hired a tireless QA engineer who works nights, weekends, and holidays.
Our tutorials are now validated weekly by synthetic users. Try the Getting Started guide yourself and see the results firsthand. And if you're facing the same documentation drift in your own project, consider GitHub Copilot CLI not just as a coding assistant but as an agent: give it a prompt, a container, and a goal, and let it do the work a human doesn't have time for.
