Evaluating progress of LLMs on scientific problem-solving

April 5, 2025

Programmatic and model-based evaluations

Tasks in CURIE are varied and have ground-truth annotations in mixed and heterogeneous forms, e.g., as JSONs, LaTeX equations, YAML files, or free-form text. Evaluating free-form generation is challenging because answers are often descriptive, and even when a format is specified, as in most of our cases, the response to each field can take differing forms. For example, materials grid points may sometimes be specified as “[p, q, r]” and at other times as “p × q × r”. Hence, in addition to the programmatic evaluation metrics, such as ROUGE-L, intersection-over-union (used for BIOGR), and identity ratio (used in PDB), we propose two model-based evaluation metrics.
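To make the formatting problem concrete, here is a minimal sketch of a hand-written rule that maps both grid-point notations to the same value. The function name `parse_grid` and the sample values are hypothetical and not part of CURIE. Rules like this have to be written field by field, which is part of why model-based metrics are attractive.

```python
import re


def parse_grid(text):
    """Return grid points as a tuple of ints.

    Accepts either notation, e.g. "[4, 4, 2]" or "4 × 4 × 2".
    Hypothetical helper: it simply collects every integer in the string.
    """
    return tuple(int(n) for n in re.findall(r"\d+", text))


# Both notations normalize to the same tuple, so they compare as equal.
same = parse_grid("[4, 4, 2]") == parse_grid("4 × 4 × 2")
```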

(1) LMScore: Prompts an LLM asking how closely the predictions match the ground truth on a 3-point scale: “good” if the prediction has few minor errors, “okay” if there are many minor errors, and “bad” if there are major errors. We consider the weighted average of the log-likelihood scores of the tokens to produce a final confidence.
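One way to read the weighting step is sketched below. It assumes the judging LLM exposes a log-likelihood for each of the three rating tokens, and it assumes weights of 1.0, 0.5, and 0.0 for “good”, “okay”, and “bad”. Neither the weights nor the function name `lm_score` come from the source, so treat this as an illustration and not as the actual metric.

```python
import math

# Assumed weights per rating; the source does not state the actual values.
RATING_WEIGHTS = {"good": 1.0, "okay": 0.5, "bad": 0.0}


def lm_score(rating_logprobs):
    """Combine per-rating log-likelihoods into one confidence in [0, 1].

    rating_logprobs maps each rating token to its log-likelihood.
    """
    # Subtract the max before exponentiating for numerical stability.
    top = max(rating_logprobs.values())
    unnormalized = {r: math.exp(lp - top) for r, lp in rating_logprobs.items()}
    total = sum(unnormalized.values())
    # Probability-weighted average of the rating weights.
    return sum(RATING_WEIGHTS[r] * p / total for r, p in unnormalized.items())


example = lm_score({
    "good": math.log(0.7),
    "okay": math.log(0.2),
    "bad": math.log(0.1),
})
```

With these assumed weights, a judge that puts most of its probability on “good” yields a score near 1, and one that favors “bad” yields a score near 0.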

(2) LLMSim: Is used for retrieval tasks where we ask the model to exhaustively extract many details, e.g., descriptors, properties and values of materials from a research document, and to provide as output an unordered list of dictionaries or records. We use a chain-of-thought (CoT) prompt that asks the LLM to look at each ground-truth record and identify the predicted records that correctly match each field (key) and value of the ground truth. Once we match the ground-truth records with predicted records, we can then measure precision and recall for the retrieval task, and compute the mean average precision, recall and F1 scores across all documents.
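The scoring that follows the matching step can be sketched as below. The LLM matching itself is not shown: the sketch assumes it has already produced, for each document, a mapping from ground-truth record index to predicted record index. The function names and the input format are hypothetical.

```python
def retrieval_scores(num_ground_truth, num_predicted, matches):
    """Precision, recall, and F1 for one document.

    matches maps a ground-truth record index to the index of the
    predicted record judged to match it (assumed output of the LLM step).
    """
    matched_truth = len(matches)
    matched_predicted = len(set(matches.values()))
    precision = matched_predicted / num_predicted if num_predicted else 0.0
    recall = matched_truth / num_ground_truth if num_ground_truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def mean_scores(per_document):
    """Average (precision, recall, F1) tuples across documents."""
    count = len(per_document)
    return tuple(sum(s[i] for s in per_document) / count for i in range(3))


# Document A: 4 ground-truth records, 5 predictions, 3 matched.
doc_a = retrieval_scores(4, 5, {0: 1, 1: 3, 2: 0})
# Document B: 2 ground-truth records, 1 prediction, 1 matched.
doc_b = retrieval_scores(2, 1, {0: 0})
overall = mean_scores([doc_a, doc_b])
```

Counting distinct predicted indices for precision keeps a single prediction from being credited twice if the matcher assigns it to more than one ground-truth record.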
