Posted by Shai Barack, Android Platform Performance Lead and Charles Munger, Principal Software Engineer

In Android 17, apps targeting SDK 37 or higher will receive a new implementation of MessageQueue where the implementation is lock-free. The new implementation improves performance and reduces missed frames, but may break clients that reflect on MessageQueue private fields and methods. To learn more about the behavior change and how to mitigate impact, check out the MessageQueue behavior change documentation. This technical blog post provides an overview of the MessageQueue rearchitecture and how you can analyze lock contention issues using Perfetto.
The Looper drives the UI thread of every Android application. It pulls work from a MessageQueue, dispatches it to a Handler, and repeats. For twenty years, MessageQueue used a single monitor lock (i.e. a synchronized code block) to protect its state.
Android 17 introduces a significant update to this component: a lock-free implementation named DeliQueue.
This post explains how locks affect UI performance, how to analyze these issues with Perfetto, and the specific algorithms and optimizations used to improve the Android main thread.
The problem: Lock Contention and Priority Inversion
The legacy MessageQueue functioned as a priority queue protected by a single lock. If a background thread posts a Message while the main thread performs queue maintenance, the background thread blocks the main thread.
When two or more threads are competing for exclusive use of the same lock, this is called lock contention. This contention can cause priority inversion, leading to UI jank and other performance problems.
Priority inversion can happen when a high-priority thread (like the UI thread) is made to wait for a low-priority thread. Consider this sequence:
A low priority background thread acquires the MessageQueue lock to post the result of work that it did.
A medium priority thread becomes runnable and the kernel's scheduler allocates it CPU time, preempting the low priority thread.
The high priority UI thread finishes its current task and attempts to read from the queue, but is blocked because the low priority thread holds the lock.
The low-priority thread blocks the UI thread, and the medium-priority work delays it further.
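The sequence above can be reproduced in miniature on a plain JVM. The sketch below is an illustrative demo, not Android code, and the class and method names are invented: a low-priority thread grabs a shared lock and holds it, while a second thread standing in for the UI thread measures how long it blocks waiting for the same lock.

```java
import java.util.concurrent.CountDownLatch;

public class ContentionDemo {
    private static final Object queueLock = new Object();

    // Returns how many milliseconds the "UI" thread spent blocked on the lock.
    public static long blockedMillis() {
        CountDownLatch lockHeld = new CountDownLatch(1);
        Thread lowPrio = new Thread(() -> {
            synchronized (queueLock) {   // low-priority producer holds the lock...
                lockHeld.countDown();
                try {
                    Thread.sleep(100);   // ...while it does unrelated slow work
                } catch (InterruptedException ignored) {
                }
            }
        });
        lowPrio.setPriority(Thread.MIN_PRIORITY);
        lowPrio.start();
        try {
            lockHeld.await();            // make sure the lock is already taken
            long start = System.nanoTime();
            synchronized (queueLock) {   // the "UI" thread blocks here
            }
            long waitedMs = (System.nanoTime() - start) / 1_000_000;
            lowPrio.join();
            return waitedMs;
        } catch (InterruptedException e) {
            return -1;
        }
    }
}
```

On a CPU-saturated system, medium-priority threads can preempt lowPrio inside the synchronized block, stretching that blocked interval arbitrarily – which is exactly the inversion described above.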
Analyzing contention with Perfetto
You can diagnose these issues using Perfetto. In a standard trace, a thread blocked on a monitor lock enters the sleeping state, and Perfetto shows a slice indicating the lock owner.
When you query trace data, look for slices named "monitor contention with …" followed by the name of the thread that owns the lock and the code site where the lock was acquired.
Case study: Launcher jank
As an example, let's analyze a trace where a user experienced jank while navigating home on a Pixel phone immediately after taking a photo in the camera app. Below we see a screenshot of Perfetto showing the events leading up to the missed frame:

Symptom: The Launcher main thread missed its frame deadline. It blocked for 18ms, which exceeds the 16ms deadline required for 60Hz rendering.
Diagnosis: Perfetto showed the main thread blocked on the MessageQueue lock. A "BackgroundExecutor" thread owned the lock.
Root Cause: The BackgroundExecutor runs at Process.THREAD_PRIORITY_BACKGROUND (very low priority). It performed a non-urgent task (checking app usage limits). Simultaneously, medium priority threads were using CPU time to process data from the camera. The OS scheduler preempted the BackgroundExecutor thread to run the camera threads.
This sequence caused the Launcher's UI thread (high priority) to become indirectly blocked by the camera worker thread (medium priority), which was keeping the Launcher's background thread (low priority) from releasing the lock.
Querying traces with PerfettoSQL
You can use PerfettoSQL to query trace data for specific patterns. This is useful when you have a large bank of traces from user devices or tests, and you're searching for specific traces that demonstrate a problem.
For example, this query finds MessageQueue contention coincident with dropped frames (jank):
INCLUDE PERFETTO MODULE android.monitor_contention;
INCLUDE PERFETTO MODULE android.frames.jank_type;

SELECT
  process_name,
  -- Convert duration from nanoseconds to milliseconds
  SUM(dur) / 1000000 AS sum_dur_ms,
  COUNT(*) AS count_contention
FROM android_monitor_contention
WHERE
  is_blocked_thread_main
  AND short_blocked_method LIKE "%MessageQueue%"
  -- Only look at app processes that had jank
  AND upid IN (
    SELECT DISTINCT(upid)
    FROM actual_frame_timeline_slice
    WHERE android_is_app_jank_type(jank_type) = TRUE
  )
GROUP BY process_name
ORDER BY SUM(dur) DESC;
In this more complex example, join trace data that spans multiple tables to identify MessageQueue contention during app startup:
INCLUDE PERFETTO MODULE android.monitor_contention;
INCLUDE PERFETTO MODULE android.startup.startups;

-- Join package and process information for startups
DROP VIEW IF EXISTS startups;
CREATE VIEW startups AS
SELECT startup_id, ts, dur, upid
FROM android_startups
JOIN android_startup_processes USING(startup_id);

-- Intersect monitor contention with startups in the same process.
DROP TABLE IF EXISTS monitor_contention_during_startup;
CREATE VIRTUAL TABLE monitor_contention_during_startup
USING SPAN_JOIN(android_monitor_contention PARTITIONED upid,
                startups PARTITIONED upid);

SELECT
  process_name,
  SUM(dur) / 1000000 AS sum_dur_ms,
  COUNT(*) AS count_contention
FROM monitor_contention_during_startup
WHERE
  is_blocked_thread_main
  AND short_blocked_method LIKE "%MessageQueue%"
GROUP BY process_name
ORDER BY SUM(dur) DESC;
You can use your favorite LLM to write PerfettoSQL queries to find other patterns.
At Google, we use BigTrace to run PerfettoSQL queries across millions of traces. In doing so, we confirmed that what we saw anecdotally was, in fact, a systemic issue. The data revealed that MessageQueue lock contention affects users across the entire ecosystem, substantiating the need for a fundamental architectural change.
The solution: lock-free concurrency
We addressed the MessageQueue contention problem by implementing a lock-free data structure, using atomic memory operations rather than exclusive locks to synchronize access to shared state. A data structure or algorithm is lock-free if at least one thread can always make progress regardless of the scheduling behavior of the other threads. This property is generally hard to achieve, and is usually not worth pursuing for most code.
The atomic primitives
Lock-free software often relies on atomic Read-Modify-Write primitives that the hardware provides.
On older generation ARM64 CPUs, atomics used a Load-Link/Store-Conditional (LL/SC) loop. The CPU loads a value and marks the address. If another thread writes to that address, the store fails, and the loop retries. Because the threads can keep trying and succeed without waiting for another thread, this operation is lock-free.
ARM64 LL/SC loop example
retry:
ldxr x0, [x1]     // Load exclusive from address x1 to x0
add  x0, x0, #1   // Increment value by 1
stxr w2, x0, [x1] // Store exclusive.
                  // w2 gets 0 on success, 1 on failure
cbnz w2, retry    // If w2 is non-zero (failed), branch to retry
Newer ARM architectures (ARMv8.1) support Large System Extensions (LSE) which include instructions in the form of Compare-And-Swap (CAS) or Load-And-Add (demonstrated below). In Android 17 we added support to the Android Runtime (ART) compiler to detect when LSE is supported and emit optimized instructions:
// ARMv8.1 LSE atomic example
ldadd x0, x1, [x2] // Atomic load-add.
                   // Faster, no loop required.
In our benchmarks, high-contention code that uses CAS achieves a ~3x speedup over the LL/SC variant.
The Java programming language offers atomic primitives via java.util.concurrent.atomic that rely on these and other specialized CPU instructions.
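As a rough illustration (a standalone sketch, not platform code), the same increment can be written two ways with java.util.concurrent.atomic: as an explicit CAS retry loop – the shape the LL/SC path takes – or as a single fetch-and-add call, which the runtime can compile to one ldadd instruction on LSE-capable hardware.

```java
import java.util.concurrent.atomic.AtomicLong;

public class AtomicCounter {
    private final AtomicLong value = new AtomicLong();

    // Mirrors the LL/SC retry loop: reread and retry until our CAS wins.
    public long incrementWithCasLoop() {
        long current;
        do {
            current = value.get();
        } while (!value.compareAndSet(current, current + 1));
        return current + 1;
    }

    // A single read-modify-write; on ARMv8.1 LSE this can become one ldadd.
    public long incrementWithFetchAdd() {
        return value.incrementAndGet();
    }

    public long get() {
        return value.get();
    }
}
```

Both methods are lock-free: a thread that loses a race simply retries (or the hardware resolves the race for it) instead of blocking on another thread.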
The data structure: DeliQueue
To remove lock contention from MessageQueue, our engineers designed a novel data structure called DeliQueue. DeliQueue separates Message insertion from Message processing:
The list of Messages (Treiber stack): A lock-free stack. Any thread can push new Messages here without contention.
The priority queue (min-heap): A heap of Messages to handle, solely owned by the Looper thread (hence no synchronization or locks are needed for access).
Enqueue: pushing to a Treiber stack
The list of Messages is kept in a Treiber stack [1], a lock-free stack that uses a CAS loop to update the head pointer.
public class TreiberStack<E> {
    AtomicReference<Node<E>> top = new AtomicReference<Node<E>>();

    public void push(E item) {
        Node<E> newHead = new Node<E>(item);
        Node<E> oldHead;
        do {
            oldHead = top.get();
            newHead.next = oldHead;
        } while (!top.compareAndSet(oldHead, newHead));
    }

    public E pop() {
        Node<E> oldHead;
        Node<E> newHead;
        do {
            oldHead = top.get();
            if (oldHead == null) {
                return null;
            }
            newHead = oldHead.next;
        } while (!top.compareAndSet(oldHead, newHead));
        return oldHead.item;
    }

    private static class Node<E> {
        public final E item;
        public Node<E> next;

        public Node(E item) {
            this.item = item;
        }
    }
}
Source code based on Java Concurrency in Practice [2], available online and released to the public domain.
Any producer can push new Messages to the stack at any time. This is like pulling a ticket at a deli counter – your number is determined by when you showed up, but the order you get your food in doesn't have to match. Because it's a linked stack, every Message is a sub-stack – you can see what the Message queue was like at any point in time by tracking the head and iterating forwards – you won't see any new Messages pushed on top, even if they're being added during your traversal.
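The snapshot property is easy to demonstrate: because nodes are immutable once pushed, holding an old head pointer gives a stable view of the stack even as producers keep pushing. A minimal sketch with illustrative names (not the DeliQueue source):

```java
import java.util.concurrent.atomic.AtomicReference;

public class SnapshotStack<E> {
    static final class Node<E> {
        final E item;
        final Node<E> next;   // immutable once pushed

        Node(E item, Node<E> next) {
            this.item = item;
            this.next = next;
        }
    }

    private final AtomicReference<Node<E>> top = new AtomicReference<>();

    public void push(E item) {
        Node<E> oldTop;
        Node<E> newTop;
        do {
            oldTop = top.get();
            newTop = new Node<>(item, oldTop);
        } while (!top.compareAndSet(oldTop, newTop));
    }

    // A "snapshot" is just the current head; nothing below it can change.
    public Node<E> snapshot() {
        return top.get();
    }

    public static <E> int count(Node<E> snap) {
        int n = 0;
        for (Node<E> cur = snap; cur != null; cur = cur.next) {
            n++;
        }
        return n;
    }
}
```

A traversal that starts from a saved head sees exactly the stack as it existed at that moment, regardless of concurrent pushes.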
Dequeue: bulk transfer to a min-heap
To find the next Message to handle, the Looper processes new Messages from the Treiber stack by walking the stack starting from the top and iterating until it finds the last Message that it previously processed. As the Looper traverses down the stack, it inserts Messages into the deadline-ordered min-heap. Because the Looper solely owns the heap, it orders and processes Messages without locks or atomics.
In walking down the stack, the Looper also creates links from stacked Messages back to their predecessors, thus forming a doubly-linked list. Creating the linked list is safe because links pointing down the stack are added via the Treiber stack algorithm with CAS, and links up the stack are only ever read and modified by the Looper thread. These back links are then used to remove Messages from arbitrary points in the stack in O(1) time.
This design provides O(1) insertion for producers (threads posting work to the queue) and amortized O(log N) processing for the consumer (the Looper).
Using a min-heap to order Messages also addresses a fundamental flaw in the legacy MessageQueue, where Messages were kept in a singly-linked list (rooted at the head). In the legacy implementation, removal from the head was O(1), but insertion had a worst case of O(N) – scaling poorly for overloaded queues! Conversely, insertion to and removal from the min-heap scale logarithmically, delivering competitive average performance but really excelling in tail latencies.
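The consumer-side ordering can be sketched with a plain java.util.PriorityQueue. This illustrates the heap's ordering rule – deadline first, insertion sequence as tie-breaker – not the actual DeliQueue heap; the Msg fields are modeled on the comparator shown later in this post.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class MessageHeap {
    public static final class Msg {
        final long when;       // deadline, primary sort key
        final long insertSeq;  // insertion order, tie-breaker
        public final String what;

        public Msg(long when, long insertSeq, String what) {
            this.when = when;
            this.insertSeq = insertSeq;
            this.what = what;
        }
    }

    private final PriorityQueue<Msg> heap = new PriorityQueue<>(
            Comparator.<Msg>comparingLong(m -> m.when)
                      .thenComparingLong(m -> m.insertSeq));

    public void insert(Msg m) {   // O(log N), even in the worst case
        heap.add(m);
    }

    public Msg poll() {           // O(log N): earliest deadline first
        return heap.poll();
    }
}
```
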
In the legacy queue implementation, producers and the consumer used a lock to coordinate exclusive access to the underlying singly-linked list. In DeliQueue, the Treiber stack handles concurrent access, and the single consumer handles ordering its work queue.
Removal: consistency via tombstones
DeliQueue is a hybrid data structure, joining a lock-free Treiber stack with a single-threaded min-heap. Keeping these two structures in sync with no global lock presents a unique challenge: a message might be physically present in the stack but logically removed from the queue.
To solve this, DeliQueue uses a technique called "tombstoning." Each Message tracks its position in the stack via the forward and backward pointers, its index in the heap's array, and a boolean flag indicating whether it has been removed. When a Message is ready to run, the Looper thread will CAS its removed flag, then remove it from the heap and stack.
When another thread needs to remove a Message, it doesn't immediately extract it from the data structure. Instead, it performs the following steps:
Logical removal: the thread uses a CAS to atomically set the Message's removal flag from false to true. The Message remains in the data structure as evidence of its pending removal, a so-called "tombstone". Once a Message is flagged for removal, DeliQueue treats it as if it no longer exists in the queue whenever it is encountered.
Deferred cleanup: The actual removal from the data structure is the responsibility of the Looper thread, and is deferred until later. Rather than modifying the stack or heap, the remover thread adds the Message to a separate lock-free freelist stack.
Structural removal: Only the Looper can interact with the heap or remove elements from the stack. When it wakes up, it drains the freelist and processes the Messages it contained. Each Message is then unlinked from the stack and removed from the heap.
This approach keeps all management of the heap single-threaded. It minimizes the number of concurrent operations and memory barriers required, making the critical path faster and simpler.
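The heart of the tombstone protocol is that the removed flag is flipped exactly once, by exactly one winner, via CAS. A stripped-down sketch (field and method names are illustrative, not the platform's):

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class TombstoneDemo {
    public static final class Msg {
        final AtomicBoolean removed = new AtomicBoolean(false);
        public final String what;

        public Msg(String what) {
            this.what = what;
        }
    }

    // Logical removal: only one thread can win the CAS, so a Message
    // can never be both dispatched and removed.
    public static boolean logicallyRemove(Msg m) {
        return m.removed.compareAndSet(false, true);
    }

    // Everyone else just checks the flag; tombstoned Messages are
    // skipped as if they were no longer in the queue.
    public static boolean isLive(Msg m) {
        return !m.removed.get();
    }
}
```

Because the CAS has a single winner, the Looper dispatching a Message and another thread cancelling it can race safely: whichever CAS succeeds decides the Message's fate, and the loser simply observes the flag.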
Traversal: benign Java memory model data races
Most concurrency APIs, such as Future in the Java standard library, or Kotlin's Job and Deferred, include a mechanism to cancel work before it completes. An instance of one of these classes matches 1:1 with a unit of underlying work, and calling cancel on an object cancels the specific operations associated with it.
Today's Android devices have multi-core CPUs and concurrent, generational garbage collection. But when Android was first developed, it was too expensive to allocate one object for each unit of work. Consequently, Android's Handler supports cancellation via numerous overloads of removeMessages – rather than removing a specific Message, it removes all Messages that match the specified criteria. In practice, this requires iterating through all Messages inserted before removeMessages was called and removing the ones that match.
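Conceptually, removeMessages is a predicate-based sweep over the pending Messages. A simplified single-threaded sketch (names invented for illustration; the real queue performs this sweep with the lock-free traversal described next):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

public class CriteriaRemoval {
    public static final class Msg {
        public final int what;

        public Msg(int what) {
            this.what = what;
        }
    }

    private final List<Msg> pending = new ArrayList<>();

    public void post(Msg m) {
        pending.add(m);
    }

    // Like Handler.removeMessages: removes every Message that matches
    // the criteria, not one specific Message object.
    public int removeMatching(Predicate<Msg> criteria) {
        int removed = 0;
        for (Iterator<Msg> it = pending.iterator(); it.hasNext(); ) {
            if (criteria.test(it.next())) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    public int size() {
        return pending.size();
    }
}
```
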
When iterating forward, a thread only requires one ordered atomic operation, to read the current head of the stack. After that, ordinary field reads are used to find the next Message. If the Looper thread modifies the next fields while removing Messages, the Looper's write and another thread's read are unsynchronized – this is a data race. Normally, a data race is a serious bug that can cause major problems in your app – leaks, infinite loops, crashes, freezes, and more. However, under certain narrow conditions, data races can be benign within the Java Memory Model. Suppose we start with a stack of:

We perform an atomic read of the head, and see A. A's next pointer points to B. At the same time as we process B, the looper might remove B and C, by updating A to point to C and then D.
Even though B and C are logically removed, B retains its next pointer to C, and C to D. The reading thread continues traversing through the detached removed nodes and eventually rejoins the live stack at D.
By designing DeliQueue to tolerate races between traversal and removal, we allow for safe, lock-free iteration.
Quitting: native refcount
Looper is backed by a native allocation that must be manually freed once the Looper has quit. If another thread is adding Messages while the Looper is quitting, it could use the native allocation after it is freed – a memory safety violation. We prevent this using a tagged refcount, where one bit of the atomic is used to indicate whether the Looper is quitting.
Before using the native allocation, a thread reads the refcount atomic. If the quitting bit is set, it returns that the Looper is quitting and the native allocation must not be used. If not, it attempts a CAS to increment the number of active threads using the native allocation. After doing what it needs to, it decrements the count. If the quitting bit was set after its increment but before the decrement, and the count is now zero, then it wakes up the Looper thread.
When the Looper thread is ready to quit, it uses CAS to set the quitting bit in the atomic. If the refcount was 0, it can proceed to free its native allocation. Otherwise, it parks itself, knowing that it will be woken up when the last user of the native allocation decrements the refcount. This approach does mean that the Looper thread waits for the progress of other threads, but only when it is quitting. That only happens once and is not performance sensitive, and it keeps the other code for using the native allocation fully lock-free.
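The tagged refcount can be modeled with a single AtomicLong whose top bit is the quitting flag and whose low bits count active users. This is a schematic reimplementation of the scheme described above, not the platform's native code:

```java
import java.util.concurrent.atomic.AtomicLong;

public class TaggedRefcount {
    private static final long QUITTING_BIT = 1L << 63;
    private final AtomicLong state = new AtomicLong();

    // Returns false if the Looper is quitting; the native side
    // must not be touched in that case.
    public boolean tryAcquire() {
        long s;
        do {
            s = state.get();
            if ((s & QUITTING_BIT) != 0) {
                return false;
            }
        } while (!state.compareAndSet(s, s + 1));
        return true;
    }

    // Returns true if the caller was the last user while quitting,
    // meaning the parked quitter thread should be woken.
    public boolean release() {
        long s = state.decrementAndGet();
        return s == QUITTING_BIT; // quitting bit set, count now zero
    }

    // Returns true if the allocation can be freed immediately
    // (no active users remained when the bit was set).
    public boolean markQuitting() {
        long s;
        do {
            s = state.get();
        } while (!state.compareAndSet(s, s | QUITTING_BIT));
        return (s & ~QUITTING_BIT) == 0;
    }
}
```

Packing the flag and the count into one word is what makes the check-then-increment race-free: a single CAS observes both at once.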
There are lots of other tricks and complexity in the implementation. You can learn more about DeliQueue by reviewing the source code.
Optimization: branchless programming
While developing and testing DeliQueue, the team ran many benchmarks and carefully profiled the new code. One issue identified using the simpleperf tool was pipeline flushes caused by the Message comparator code.
A typical comparator uses conditional jumps, with the logic for deciding which Message comes first simplified below:
static int compareMessages(@NonNull Message m1, @NonNull Message m2) {
    if (m1 == m2) {
        return 0;
    }
    // Primary queue order is by when.
    // Messages with an earlier when should come first in the queue.
    final long whenDiff = m1.when - m2.when;
    if (whenDiff > 0) return 1;
    if (whenDiff < 0) return -1;
    // Secondary queue order is by insert sequence.
    // If two messages were inserted with the same `when`, the one inserted
    // first should come first in the queue.
    final long insertSeqDiff = m1.insertSeq - m2.insertSeq;
    if (insertSeqDiff > 0) return 1;
    if (insertSeqDiff < 0) return -1;
    return 0;
}
This code compiles to conditional jumps (b.le and cbnz instructions). When the CPU encounters a conditional branch, it can't know whether the branch is taken until the condition is computed, so it doesn't know which instruction to read next, and has to guess, using a technique called branch prediction. In a case like binary search, the branch direction will be unpredictably different at each step, so it's likely that half the predictions will be wrong. Branch prediction is often ineffective in searching and sorting algorithms (such as the one used in a min-heap), because the cost of guessing wrong is greater than the improvement from guessing correctly. When the branch predictor guesses wrong, it must throw away the work it did after assuming the predicted value, and start again from the path that was actually taken – this is called a pipeline flush.
To find this issue, we profiled our benchmarks using the branch-misses performance counter, which records stack traces where the branch predictor guesses wrong. We then visualized the results with Google pprof, as shown below:
Recall that the original MessageQueue code used a singly-linked list for the ordered queue. Insertion would traverse the list in sorted order as a linear search, stopping at the first element past the point of insertion and linking the new Message ahead of it. Removal from the head simply required unlinking the head. DeliQueue, by contrast, uses a min-heap, where mutations require reordering some elements (sifting up or down) with logarithmic complexity in a balanced data structure, where any comparison has an even chance of directing the traversal to a left child or to a right child. The new algorithm is asymptotically faster, but exposes a new bottleneck as the search code stalls on branch misses half the time.
Knowing that branch misses were slowing down our heap code, we optimized the code using branch-free programming:
// Branchless Logic
static int compareMessages(@NonNull Message m1, @NonNull Message m2) {
    final long when1 = m1.when;
    final long when2 = m2.when;
    final long insertSeq1 = m1.insertSeq;
    final long insertSeq2 = m2.insertSeq;
    // signum returns the sign (-1, 0, 1) of the argument,
    // and is implemented as pure arithmetic:
    // ((num >> 63) | (-num >>> 63))
    final int whenSign = Long.signum(when1 - when2);
    final int insertSeqSign = Long.signum(insertSeq1 - insertSeq2);
    // whenSign takes precedence over insertSeqSign,
    // so the formula below is such that insertSeqSign only matters
    // as a tie-breaker if whenSign is 0.
    return whenSign * 2 + insertSeqSign;
}
To understand the optimization, disassemble the two examples in Compiler Explorer and use LLVM-MCA, a CPU simulator that can generate an estimated timeline of CPU cycles.
The original code:
Index     01234567890123
[0,0]     DeER .    .  .   sub x0, x2, x3
[0,1]     D=eER.    .  .   cmp x0, #0
[0,2]     D==eER    .  .   cset w0, ne
[0,3]     .D==eER   .  .   cneg w0, w0, lt
[0,4]     .D===eER  .  .   cmp w0, #0
[0,5]     .D====eER .  .   b.le #12
[0,6]     . DeE---R .  .   mov w1, #1
[0,7]     . DeE---R .  .   b #48
[0,8]     . D==eE-R .  .   tbz w0, #31, #12
[0,9]     .  DeE--R .  .   mov w1, #-1
[0,10]    .  DeE--R .  .   b #36
[0,11]    .  D=eE-R .  .   sub x0, x4, x5
[0,12]    .  D=eER  .  .   cmp x0, #0
[0,13]    .  D==eER .  .   cset w0, ne
[0,14]    .  D===eER   .   cneg w0, w0, lt
[0,15]    .  D===eER   .   cmp w0, #0
[0,16]    .  D====eER  .   csetm w1, lt
[0,17]    .  D===eE-R  .   cmp w0, #0
[0,18]    .  .D===eER  .   csinc w1, w1, wzr, le
[0,19]    .  .D====eER     mov x0, x1
[0,20]    .  .DeE----R     ret
Note the first conditional branch, b.le, which avoids comparing the insertSeq fields if the result is already known from comparing the when fields.
The branchless code:
Index     012345678
[0,0]     DeER .  .   sub x0, x2, x3
[0,1]     DeER .  .   sub x1, x4, x5
[0,2]     D=eER.  .   cmp x0, #0
[0,3]     .D=eER  .   cset w0, ne
[0,4]     .D==eER .   cneg w0, w0, lt
[0,5]     .DeE--R .   cmp x1, #0
[0,6]     . DeE-R .   cset w1, ne
[0,7]     . D=eER .   cneg w1, w1, lt
[0,8]     . D==eeER   add w0, w1, w0, lsl #1
[0,9]     .  DeE--R   ret
Here, the branchless implementation takes fewer cycles and instructions than even the shortest path through the branchy code – it's better in all cases. The faster implementation plus the elimination of mispredicted branches resulted in a 5x improvement in some of our benchmarks!
However, this technique isn't always applicable. Branchless approaches usually require doing work that will be thrown away, and if the branch is predictable most of the time, that wasted work can slow your code down. In addition, removing a branch often introduces a data dependency. Modern CPUs execute multiple operations per cycle, but they can't execute an instruction until its inputs from a previous instruction are ready. In contrast, a CPU can speculate about data in branches, and work ahead if a branch is predicted correctly.
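The pure-arithmetic identity quoted in the comparator's comment – ((num >> 63) | (-num >>> 63)) – is how Long.signum computes its result, and it is easy to verify directly (a quick check sketch, not part of the queue):

```java
public class SignumIdentity {
    // Arithmetic shift extracts the sign bits (all-ones for negative,
    // zero otherwise); unsigned shift of the negation contributes the
    // low "positive" bit. ORed together they yield -1, 0, or 1.
    public static int branchlessSignum(long num) {
        return (int) ((num >> 63) | (-num >>> 63));
    }
}
```
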
Testing and Validation
Validating the correctness of lock-free algorithms is notoriously difficult!
In addition to standard unit tests for continuous validation during development, we also wrote rigorous stress tests to verify queue invariants and to attempt to induce data races if they existed. In our test labs we could run millions of test scenarios on emulated devices and on real hardware.
With Java ThreadSanitizer (JTSan) instrumentation, we could use the same tests to also detect some data races in our code. JTSan didn't find any problematic data races in DeliQueue, but – surprisingly – actually detected two concurrency bugs in the Robolectric framework, which we promptly fixed.
To improve our debugging capabilities, we built new analysis tools. Below is an example showing an issue in Android platform code where one thread is overloading another thread with Messages, causing a large backlog, visible in Perfetto thanks to the MessageQueue instrumentation feature that we added.
To enable MessageQueue tracing in the system_server process, include the following in your Perfetto configuration:
data_sources {
config {
name: "track_event"
target_buffer: 0 # Change this per your buffers configuration
track_event_config {
enabled_categories: "mq"
}
}
}

Impact
DeliQueue improves system and app performance by eliminating locks from MessageQueue.
Synthetic benchmarks: multi-threaded insertion into busy queues is up to 5,000x faster than the legacy MessageQueue, thanks to improved concurrency (the Treiber stack) and faster insertions (the min-heap).
In Perfetto traces acquired from internal beta testers, we see a reduction of 15% in app main thread time spent in lock contention.
On the same test devices, the reduced lock contention leads to significant improvements to the user experience, such as:
-4% missed frames in apps.
-7.7% missed frames in System UI and Launcher interactions.
-9.1% in time from app startup to the first frame drawn, at the 95th percentile.
Next steps
DeliQueue is rolling out to apps in Android 17. App developers should review Preparing your app for the new lock-free MessageQueue on the Android Developers blog to learn how to test their apps.
References
[1] Treiber, R.K., 1986. Systems programming: Coping with parallelism. International Business Machines Incorporated, Thomas J. Watson Research Center.
[2] Goetz, B., Peierls, T., Bloch, J., Bowbeer, J., Holmes, D., & Lea, D. (2006). Java Concurrency in Practice. Addison-Wesley Professional.
