As AI technologies advance, helpful agents will become capable of better anticipating user needs. For experiences on mobile devices to be truly useful, the underlying models need to understand what the user is doing (or trying to do) when users interact with them. Once current and previous tasks are understood, the model has more context to predict likely next actions. For example, if a user previously searched for music festivals across Europe and is now looking for a flight to London, the agent could offer to find festivals in London on those specific dates.
Large multimodal LLMs are already quite good at understanding user intent from a user interface (UI) trajectory. But using LLMs for this task would typically require sending information to a server, which can be slow, costly, and carries the risk of exposing sensitive information.
Our recent paper "Small Models, Big Results: Achieving Superior Intent Extraction Through Decomposition", presented at EMNLP 2025, addresses the question of how to use small multimodal LLMs (MLLMs) to understand sequences of user interactions on the web and on mobile devices, all on device. By separating user intent understanding into two stages, first summarizing each screen individually and then extracting an intent from the sequence of generated summaries, we make the task more tractable for small models. We also formalize metrics for evaluating model performance and show that our approach yields results comparable to those of much larger models, illustrating its potential for on-device applications. This work builds on earlier work from our team on user intent understanding.
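The two-stage decomposition described above can be sketched as a simple pipeline. This is a minimal illustration, not the paper's implementation: the `MLLM` callable, the prompt wording, and the `Screen` structure are all assumptions standing in for whatever small on-device model and trajectory format is actually used.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for an on-device multimodal LLM call: it takes a
# text prompt and (optionally) an image and returns generated text.
MLLM = Callable[[str, bytes], str]

@dataclass
class Screen:
    """One step of a UI trajectory: a screenshot plus the action taken on it."""
    screenshot: bytes
    action: str

def summarize_screen(model: MLLM, screen: Screen) -> str:
    """Stage 1: summarize a single screen and the user's action on it."""
    prompt = f"Summarize this screen. The user then performed: {screen.action}"
    return model(prompt, screen.screenshot)

def extract_intent(model: MLLM, summaries: List[str]) -> str:
    """Stage 2: infer the overall user intent from the per-screen summaries."""
    numbered = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(summaries))
    prompt = f"Given these step summaries, state the user's goal:\n{numbered}"
    return model(prompt, b"")  # text-only call in this stage

def intent_from_trajectory(model: MLLM, trajectory: List[Screen]) -> str:
    """Decomposed pipeline: summarize each screen, then extract one intent."""
    summaries = [summarize_screen(model, s) for s in trajectory]
    return extract_intent(model, summaries)

# Trivial stub model (echoes the first line of its prompt) so the
# pipeline structure can be exercised without a real MLLM.
def stub_model(prompt: str, image: bytes) -> str:
    return prompt.splitlines()[0]

trajectory = [
    Screen(b"<png bytes>", "typed 'flights to London' in the search box"),
    Screen(b"<png bytes>", "tapped the cheapest result"),
]
print(intent_from_trajectory(stub_model, trajectory))
```

The design point is that each stage is a small, focused task: stage 1 sees only one screen at a time, and stage 2 sees only short text summaries, so neither call requires the long multimodal context that makes the end-to-end task hard for small models.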
