
AI agents are reshaping software development, from writing code to carrying out complex instructions. But LLM-based agents are prone to errors and often perform poorly on challenging, multi-step tasks. Reinforcement learning (RL) is an approach where AI systems learn to make optimal decisions by receiving rewards or penalties for their actions, improving through trial and error. RL can help agents improve, but it typically requires developers to extensively rewrite their code. This discourages adoption, even though the data these agents generate could significantly boost performance through RL training.
To address this, a research team from Microsoft Research Asia – Shanghai has introduced Agent Lightning. This open-source framework makes AI agents trainable through RL by separating how agents execute tasks from model training, allowing developers to add RL capabilities with almost no code modification.
Capturing agent behavior for training
Agent Lightning converts an agent's experience into a format that RL can use by treating the agent's execution as a sequence of states and actions, where each state captures the agent's status and each LLM call is an action that moves the agent to a new state.
This approach works for any workflow, no matter how complex. Whether it involves multiple collaborating agents or dynamic tool use, Agent Lightning breaks it down into a sequence of transitions. Each transition captures the LLM's input, output, and reward (Figure 1). This standardized format means the data can be used for training without any additional steps.
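One such transition can be pictured as a simple record of one LLM call. The sketch below is illustrative only; the names are not Agent Lightning's actual API:

```python
from dataclasses import dataclass

@dataclass
class Transition:
    """One LLM call recorded as an RL transition (illustrative)."""
    prompt: str          # input to the LLM (the observed state)
    completion: str      # the LLM's output (the action taken)
    reward: float = 0.0  # credit assigned to this call during training

# A multi-step agent run becomes a flat list of transitions,
# regardless of how many agents or tools were involved.
trajectory = [
    Transition("Write SQL for: total sales by region", "SELECT region, SUM(sales) ..."),
    Transition("Check this SQL: SELECT region ...", "Looks valid."),
]
```

Because every workflow reduces to this same flat record, a single-agent tool-use run and a multi-agent pipeline produce training data in the same shape.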

Hierarchical reinforcement learning
Traditional RL training for agents that make multiple LLM requests involves stitching all the content together into one long sequence and then determining which parts should be learned and which ignored during training. This approach is difficult to implement and can create excessively long sequences that degrade model performance.
Instead, Agent Lightning's LightningRL algorithm takes a hierarchical approach. After a task completes, a credit assignment module determines how much each LLM request contributed to the outcome and assigns it a corresponding reward. These independent steps, now paired with their own reward scores, can be used with any existing single-step RL algorithm, such as Proximal Policy Optimization (PPO) or Group Relative Policy Optimization (GRPO) (Figure 2).
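A minimal sketch of this decomposition, assuming uniform credit assignment for simplicity (LightningRL's actual credit assignment module may weight calls differently):

```python
def assign_credit(calls, final_reward):
    """Split an episode-level reward across the LLM calls in it.

    `calls` is a list of (prompt, output) pairs from one task run.
    Uniform splitting is the simplest possible policy; it stands in
    here for whatever credit assignment the real module performs.
    """
    per_step = final_reward / len(calls)
    # Each (prompt, output, reward) triple is now a standalone
    # single-step sample, usable with PPO, GRPO, and similar methods.
    return [(prompt, output, per_step) for prompt, output in calls]

episode = [("plan the query", "SELECT ..."), ("fix the query", "SELECT ...;")]
samples = assign_credit(episode, final_reward=1.0)
```

The key property is that each sample is short and self-contained, so no single training sequence has to hold the entire multi-call episode.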

This design offers several benefits. It remains fully compatible with widely used single-step RL algorithms, allowing existing training methods to be applied without modification. Organizing data as a sequence of independent transitions lets developers flexibly assemble the LLM input as needed, supporting complex behaviors like agents that use multiple tools or work with other agents. Moreover, by keeping sequences short, the approach scales cleanly and keeps training efficient.
Agent Lightning as middleware
Agent Lightning serves as middleware between RL algorithms and agent environments, providing modular components that enable scalable RL through standardized protocols and well-defined interfaces.
An agent runner manages the agents as they complete tasks. It distributes work and collects and stores the results and progress data. It operates separately from the LLMs, enabling them to run on different resources and scale to support multiple agents running concurrently.
An algorithm trains the models and hosts the LLMs used for inference and training. It orchestrates the overall RL cycle, managing which tasks are assigned, how agents complete them, and how models are updated based on what the agents learn. It typically runs on GPU resources and communicates with the agent runner through shared protocols.
The LightningStore serves as the central repository for all data exchanges within the system. It provides standardized interfaces and a shared format, ensuring that the different components can work together and enabling the algorithm and agent runner to communicate effectively.
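The two roles the store plays, queueing tasks toward runners and collecting execution data for the trainer, can be sketched with a toy in-memory stand-in (the real LightningStore defines its own protocols; this class is purely illustrative):

```python
import queue

class InMemoryStore:
    """Toy stand-in for a central data store such as LightningStore."""

    def __init__(self):
        self.tasks = queue.Queue()  # tasks flowing to agent runners
        self.spans = []             # execution data flowing to the trainer

    def enqueue_task(self, task):
        """Called by the algorithm side to delegate work."""
        self.tasks.put(task)

    def record_span(self, span):
        """Called by the agent runner as agents execute."""
        self.spans.append(span)

    def drain_spans(self):
        """Called by the algorithm side to fetch data for training."""
        collected, self.spans = self.spans, []
        return collected
```

Because both sides talk only to the store, neither needs to know where, or on what hardware, the other is running.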

All RL cycles follow two steps: (1) Agent Lightning collects agent execution data (known as "spans") and stores it in the data store; (2) it then retrieves the required data and sends it to the algorithm for training. Through this design, the algorithm can delegate tasks asynchronously to the agent runner, which completes them and reports the results back (Figure 4).
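The collect-then-train cycle can be sketched as a plain loop; `run_agent` and `train_on` below are placeholders for components the framework actually provides, not its real function names:

```python
def rl_cycle(tasks, run_agent, train_on, num_iterations=3):
    """Illustrative collect-then-train loop.

    run_agent(task) -> list of span dicts (execution data)
    train_on(spans) -> updates the model from collected spans
    """
    for _ in range(num_iterations):
        # Step 1: execute tasks and accumulate spans in a store.
        store = []
        for task in tasks:
            store.extend(run_agent(task))
        # Step 2: hand the collected spans to the training algorithm.
        train_on(store)
```

In the real system the two steps run asynchronously through the LightningStore rather than in a single synchronous loop, but the data flow is the same.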

One key advantage of this approach is its algorithmic flexibility. The system makes it easy for developers to customize how agents learn, whether they're defining different rewards, capturing intermediate data, or experimenting with different training approaches.
Another advantage is resource efficiency. Agentic RL systems are complex, integrating agentic frameworks, LLM inference engines, and training frameworks. By separating these components, Agent Lightning makes this complexity manageable and allows each part to be optimized independently.
A decoupled design allows each component to use the hardware that suits it best. The agent runner can use CPUs while model training uses GPUs. Each component can also scale independently, improving efficiency and making the system easier to maintain. In practice, developers can keep their existing agent frameworks and swap model calls to the Agent Lightning API without changing their agent code (Figure 5).
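For many frameworks, such a swap amounts to redirecting the agent's existing LLM client at an endpoint the training side controls. The sketch below shows the general idea only; the environment variable and endpoint URL are assumptions, not Agent Lightning's documented interface:

```python
import os

# Hypothetical: the training side serves a model endpoint, so the
# agent's LLM calls can be redirected via configuration alone.
DEFAULT_ENDPOINT = os.environ.get("MODEL_ENDPOINT", "http://localhost:9999/v1")

def make_client_config(base_url=DEFAULT_ENDPOINT):
    """The agent's own logic stays unchanged; only the client's
    base_url differs between plain inference and RL training."""
    return {"base_url": base_url, "model": "policy-under-training"}

config = make_client_config()
```

The point is that the agent code never imports anything training-specific: switching between serving and training is a configuration change, not a rewrite.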

Evaluation across three real-world scenarios
Agent Lightning was tested on three distinct tasks, achieving consistent performance improvements across all scenarios (Figure 6):
Text-to-SQL (LangChain): In a system with three agents handling SQL generation, checking, and rewriting, Agent Lightning simultaneously optimized two of them, significantly improving the accuracy of generating executable SQL from natural language queries.
Retrieval-augmented generation (OpenAI Agents SDK implementation): On the multi-hop question-answering dataset MuSiQue, which requires querying a large Wikipedia database, Agent Lightning helped the agent generate more effective search queries and reason better over retrieved content.
Mathematical QA and tool use (AutoGen implementation): For complex math problems, Agent Lightning trained LLMs to more accurately determine when and how to call a tool and to integrate the results into their reasoning, increasing accuracy.

Enabling continuous agent improvement
By simplifying RL integration, Agent Lightning can make it easier for developers to build, iterate, and deploy high-performance agents. We plan to expand Agent Lightning's capabilities to include automatic prompt optimization and additional RL algorithms.
The framework is designed to serve as an open platform where any AI agent can improve through real-world practice. By bridging existing agentic systems with reinforcement learning, Agent Lightning aims to help create AI systems that learn from experience and improve over time.
