In-depth evaluation of DS-STAR
Next, we performed ablation studies to verify the effectiveness of DS-STAR’s individual components and to analyze the impact of the number of refinement rounds, specifically by measuring the iterations required to generate a sufficient plan.
Data File Analyzer: This agent is essential for high performance. Without the descriptions it generates (Variant 1), DS-STAR’s accuracy on hard tasks in the DABStep benchmark dropped sharply to 26.98%, underscoring the importance of rich data context for effective planning and implementation.
Router: The Router agent’s ability to decide whether a new step is needed or an incorrect step should be fixed is vital. When we removed it (Variant 2), DS-STAR could only add new steps sequentially, leading to worse performance on both easy and hard tasks. This demonstrates that it is more effective to correct errors in a plan than to keep adding potentially flawed steps.
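To make the contrast between the full system and Variant 2 concrete, here is a minimal sketch of a refinement loop with and without a router. It is not DS-STAR’s implementation: the names `refine_plan`, `route`, `propose_step`, and `is_sufficient` are hypothetical, the LLM agents are replaced by toy functions, and the choice to discard the plan from the faulty step onward before regenerating is an assumption made for illustration.

```python
from typing import Callable, List, Tuple, Union

ADD_STEP = "add_step"  # hypothetical router signal: the plan so far is fine, extend it

def refine_plan(
    initial_plan: List[str],
    is_sufficient: Callable[[List[str]], bool],
    route: Callable[[List[str]], Union[str, int]],
    propose_step: Callable[[List[str]], str],
    max_rounds: int = 10,
    use_router: bool = True,
) -> Tuple[List[str], int]:
    """Refine a plan until it is judged sufficient; return (plan, rounds used)."""
    plan = list(initial_plan)
    rounds = 0
    while rounds < max_rounds and not is_sufficient(plan):
        # Without a router (the Variant 2 analogue) the loop can only append.
        decision = route(plan) if use_router else ADD_STEP
        if isinstance(decision, int):
            # Assumption: the router returns the index of the first faulty step,
            # and everything from that step onward is dropped before regenerating.
            plan = plan[:decision]
        plan.append(propose_step(plan))
        rounds += 1
    return plan, rounds

# Toy stand-ins for the LLM-backed agents (illustrative only).
TARGET = ["load", "filter", "aggregate"]

def is_sufficient(plan: List[str]) -> bool:
    return plan == TARGET

def route(plan: List[str]) -> Union[str, int]:
    for i, step in enumerate(plan):
        if i >= len(TARGET) or step != TARGET[i]:
            return i  # first step that needs fixing
    return ADD_STEP

def propose_step(plan: List[str]) -> str:
    return TARGET[len(plan)] if len(plan) < len(TARGET) else "noop"

flawed_start = ["load", "fliter"]  # the second step is deliberately wrong
with_router = refine_plan(flawed_start, is_sufficient, route, propose_step)
without_router = refine_plan(
    flawed_start, is_sufficient, route, propose_step, max_rounds=5, use_router=False
)
print("with router:   ", with_router)
print("without router:", without_router)
```

In this toy setting the routed loop repairs the bad step and reaches a sufficient plan in a few rounds, while the append-only loop keeps the flawed step and runs out its round budget. The returned round count is the same kind of quantity the ablation measures when it counts the iterations needed to reach a sufficient plan.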
Generalizability across LLMs: We also tested DS-STAR’s adaptability by using GPT-5 as the base model. This yielded promising results on the DABStep benchmark, indicating the framework’s generalizability. Interestingly, DS-STAR with GPT-5 performed better on easy tasks, while the Gemini-2.5-Pro version performed better on hard tasks.
