For early-stage open-source projects, the "Getting started" guide is often the first real interaction a developer has with the project. If a command fails, an output doesn't match, or a step is unclear, most users won't file a bug report; they will just move on.
Drasi, a CNCF sandbox project that detects changes in your data and triggers immediate reactions, is supported by our small team of four engineers in Microsoft Azure's Office of the Chief Technology Officer. We have comprehensive tutorials, but we're shipping code faster than we can manually test them.
The team didn't realize how large this gap was until late 2025, when GitHub updated its Dev Container infrastructure, bumping the minimum Docker version. The update broke the Docker daemon connection, and every single tutorial stopped working. Because we relied on manual testing, we didn't immediately know the extent of the damage. Any developer trying Drasi during that window would have hit a wall.
This incident forced a realization: with advanced AI coding assistants, documentation testing can be turned into a monitoring problem.
The problem: Why does documentation break?
Documentation usually breaks for two reasons:
1. The curse of knowledge
Experienced developers write documentation with implicit context. When we write "wait for the query to bootstrap," we know to run `drasi list query` and watch for the `Running` status, or better yet to run the `drasi wait` command. A new user has no such context. Neither does an AI agent. They read the instructions literally and don't know what to do. They get stuck on the "how," while we only document the "what."
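To make the gap concrete, here is that implicit knowledge spelled out as commands (both appear in our docs; the query file name is whatever the tutorial had you create):

```bash
# The implicit knowledge, made explicit:
drasi list query           # poll until the STATUS column shows "Running"
drasi wait -f query.yaml   # better: block until the resources in query.yaml are ready
```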
2. Silent drift
Documentation doesn't fail loudly like code does. When you rename a configuration file in your codebase, the build fails immediately. But when your documentation still references the old filename, nothing happens. The drift accumulates silently until a user reports confusion.
This is compounded for tutorials like ours, which spin up sandbox environments with Docker, k3d, and sample databases. When any upstream dependency changes (a deprecated flag, a bumped version, or a new default), our tutorials can break silently.
The solution: Agents as synthetic users
To solve this, we treated tutorial testing as a simulation problem. We built an AI agent that acts as a "synthetic new user."
This agent has three critical traits:
- It's naïve: It has no prior knowledge of Drasi; it knows only what's explicitly written in the tutorial.
- It's literal: It executes every command exactly as written. If a step is missing, it fails.
- It's unforgiving: It verifies every expected output. If the doc says, "You should see 'Success'," and the command-line interface (CLI) just returns silently, the agent flags it and fails fast.
The stack: GitHub Copilot CLI and Dev Containers
We built a solution using GitHub Actions, Dev Containers, Playwright, and the GitHub Copilot CLI.
Our tutorials require heavy infrastructure:
- A full Kubernetes cluster (k3d)
- Docker-in-Docker
- Real databases (such as PostgreSQL and MySQL)
We needed an environment that exactly matches what our human users experience. If users run a tutorial in a specific Dev Container on GitHub Codespaces, our test must run in that same Dev Container.
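As a rough sketch, the sandbox each run gets looks something like this (cluster and container names are illustrative, not our exact setup):

```bash
# Stand up the tutorial sandbox inside the Dev Container:
k3d cluster create drasi-tutorial          # local Kubernetes cluster in Docker
docker run -d --name tutorial-postgres \
  -e POSTGRES_PASSWORD=example -p 5432:5432 postgres   # sample database
drasi init                                 # install Drasi into the cluster
```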
The architecture
Inside the container, we invoke the Copilot CLI with a specialized system prompt (view the full prompt here):
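A minimal sketch of that invocation, assuming the prompt lives in a file (the path and the `--allow-all-tools` flag are our reading of the CLI, not the exact script we run):

```bash
# Run the Copilot CLI in prompt mode with the synthetic-user instructions:
copilot -p "$(cat synthetic-user-prompt.md)" --allow-all-tools
```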

Using the CLI's prompt mode (-p) with this prompt gives us an agent that can execute terminal commands, write files, and run browser scripts, just like a human developer sitting at their terminal. For the agent to simulate a real user, it needs these capabilities.
To enable the agent to open webpages and interact with them as any human following the tutorial steps would, we also install Playwright in the Dev Container. The agent also takes screenshots, which it then compares against those provided in the documentation.
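For example, the agent can capture a page with Playwright's CLI, along these lines (URL and output path are placeholders):

```bash
# One-time browser setup in the container, then capture the tutorial's web UI:
npx playwright install --with-deps chromium
npx playwright screenshot http://localhost:8080 artifacts/ui.png
```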
Security model
Our security model is built around one principle: the container is the boundary.
Rather than trying to restrict individual commands (a losing game when the agent needs to run arbitrary Node scripts for Playwright), we treat the entire Dev Container as an isolated sandbox and control what crosses its boundaries: no outbound network access beyond localhost, a Personal Access Token (PAT) with only the "Copilot Requests" permission, ephemeral containers destroyed after each run, and a maintainer-approval gate for triggering workflows.
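An illustrative way to enforce that egress rule (not our exact configuration; in practice the Copilot API endpoint also needs an allowlist entry):

```bash
# Lock down outbound traffic from inside the container:
iptables -A OUTPUT -o lo -j ACCEPT                                 # keep localhost
iptables -A OUTPUT -m state --state ESTABLISHED,RELATED -j ACCEPT  # keep replies
iptables -A OUTPUT -j DROP                                         # drop other egress
```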
Dealing with non-determinism
One of the biggest challenges with AI-based testing is non-determinism. Large language models (LLMs) are probabilistic: sometimes the agent retries a command; other times it gives up.
We handled this with a three-stage retry with model escalation (start with Gemini-Pro, on failure try Claude Opus), semantic comparison for screenshots instead of pixel-matching, and verification of core data fields rather than volatile values.
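The escalation loop is roughly this shape (model identifiers, the `--model` flag, and the success check are simplifications; the sentinel string is explained below):

```bash
# Three attempts, escalating to a stronger model on failure:
for model in gemini-pro claude-opus claude-opus; do
  copilot -p "$(cat synthetic-user-prompt.md)" --model "$model" --allow-all-tools
  grep -q "FINAL RESULT: PASS" artifacts/final-report.md && exit 0
done
exit 1
```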
We also have a list of tight constraints in our prompts that prevent the agent from going on a debugging adventure, directives that control the structure of the final report, and skip directives that tell the agent to bypass optional tutorial sections, like setting up external services.
Artifacts for debugging
When a run fails, we need to know why. Since the agent is running in a transient container, we can't just Secure Shell (SSH) in and look around.
So our agent preserves evidence of every run: screenshots of web UIs, terminal output of critical commands, and a final markdown report detailing its reasoning, as shown here:

These artifacts are uploaded to the GitHub Actions run summary, allowing us to "time travel" back to the exact moment of failure and see what the agent saw.

Parsing the agent’s report
With LLMs, getting a definitive "Pass/Fail" signal that a machine can understand can be tricky. An agent might write a long, nuanced conclusion like this (a hypothetical example):
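```text
All twelve steps completed. The query reached Running, although the dashboard
shows an extra column that is not in the docs' screenshot. Overall the tutorial
appears functional, with minor cosmetic drift.
```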

To make this actionable in a CI/CD pipeline, we had to do some prompt engineering. We explicitly instructed the agent (paraphrasing; the exact sentinel string shown here is illustrative):
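```text
End your report with exactly one of the following lines, and nothing after it:
FINAL RESULT: PASS
FINAL RESULT: FAIL
```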

In our GitHub Action, we then simply grep for this specific string to set the exit code of the workflow.
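A sketch of that step (the sentinel must match the prompt directive above; paths are placeholders):

```bash
# Convert the agent's sentinel line into a CI exit code:
if grep -q "FINAL RESULT: PASS" artifacts/final-report.md; then
  echo "Tutorial verified."
else
  echo "Tutorial failed; see uploaded artifacts." >&2
  exit 1
fi
```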

Simple techniques like this bridge the gap between AI's fuzzy, probabilistic outputs and CI's binary pass/fail expectations.
Automation
We now have an automated version of the workflow that runs weekly. It evaluates all our tutorials in parallel: each tutorial gets its own sandbox container and a fresh perspective from the agent acting as a synthetic user. If any tutorial evaluation fails, the workflow is configured to file an issue on our GitHub repo.
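That failure handler can be as simple as a `gh` call (title and label are illustrative):

```bash
# File an issue containing the agent's report when a weekly run fails:
gh issue create \
  --title "Weekly tutorial validation failed: ${TUTORIAL_NAME}" \
  --body-file artifacts/final-report.md \
  --label documentation
```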
This workflow can optionally also be run on pull requests, but to prevent attacks we have added a maintainer-approval requirement and a `pull_request_target` trigger, which means that even on pull requests from external contributors, the workflow that executes will be the one in our main branch.
Running the Copilot CLI requires a PAT, which is stored in the environment secrets for our repo. To make sure it doesn't leak, every run requires maintainer approval, except the automated weekly run, which only runs on the `main` branch of our repo.
What we found: Bugs that matter
Since implementing this system, we have run over 200 "synthetic user" sessions. The agent identified 18 distinct issues, including some serious environment problems and documentation issues like the ones below. Fixing them improved the docs for everyone, not just the bot.
- Implicit dependencies: In one tutorial, we told users to create a tunnel to a service. The agent ran the command, and then, following the next instruction, killed the process to run the next command.
  The fix: We realized we hadn't told the user to keep that terminal open. We added a warning: "This command blocks. Open a new terminal for subsequent steps."
- Missing verification steps: We wrote: "Verify the query is running." The agent got stuck: "How, exactly?"
  The fix: We replaced the vague instruction with an explicit command: `drasi wait -f query.yaml`.
- Format drift: Our CLI output had evolved. New columns had been added; older fields had been deprecated. The documentation screenshots still showed the 2024 version of the interface. A human tester might gloss over this ("it looks mostly right"). The agent flagged every mismatch, forcing us to keep our examples up to date.
AI as a force multiplier
We often hear about AI replacing humans, but in this case, the AI is providing us with a workforce we never had.
To duplicate what our system does (running six tutorials across fresh environments every week), we would need a dedicated QA resource or a significant budget for manual testing. For a four-person team, that's unattainable. By deploying these Synthetic Users, we have effectively hired a tireless QA engineer who works nights, weekends, and holidays.
Our tutorials are now validated weekly by synthetic users. Try the Getting Started guide yourself and see the results firsthand. And if you're facing the same documentation drift in your own project, consider GitHub Copilot CLI not just as a coding assistant but as an agent: give it a prompt, a container, and a goal, and let it do the work a human doesn't have time for.
