Although concurrent data structures are commonly used in practice on shared-memory machines, even the most efficient concurrent structures often lack performance theorems guaranteeing linear speedup for the enclosing parallel program. Moreover, efficient concurrent data structures are difficult to design. In contrast, parallel batched data structures do provide provable performance guarantees, since processing a batch in parallel is easier than dealing with the arbitrary asynchrony of concurrent accesses. They can limit programmability, however, since restructuring a parallel program to use a batched data structure instead of a concurrent data structure can often be difficult or even infeasible.
This paper presents BATCHER, a scheduler that achieves the best of both worlds through the idea of implicit batching, and a corresponding general performance theorem. BATCHER takes as input (1) a dynamically multithreaded program that makes arbitrary parallel accesses to an abstract data type, and (2) an implementation of the abstract data type as a batched data structure that need not cope with concurrent accesses. BATCHER extends a randomized work-stealing scheduler and guarantees provably good performance for parallel algorithms that use these data structures. In particular, suppose a parallel algorithm has $T_1$ work, $T_\infty$ span, and $n$ data-structure operations. Let $W(n)$ be the total work of the data-structure operations and let $s(n)$ be the span of a size-$P$ batch. Then BATCHER executes the program in $O((T_1 + W(n) + n\,s(n))/P + s(n)\,T_\infty)$ expected time on $P$ processors. For higher-cost data structures such as search trees and large enough $n$, this bound becomes $O((T_1 + n\lg n)/P + T_\infty\lg n)$, provably matching the work of a sequential search tree but with nearly linear speedup, even though the data structure is accessed concurrently. The BATCHER runtime bound also readily extends to data structures with amortized bounds.
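To illustrate how the search-tree bound follows from the general theorem, consider a minimal sketch that assumes a batched search tree processes a batch of $P$ operations with $O(P\lg n)$ work and $O(\lg n)$ span, so that $W(n) = O(n\lg n)$ and $s(n) = O(\lg n)$; these asymptotics are stated here as assumptions for the substitution rather than quoted results. Plugging them into the general bound gives
\[
  O\!\left(\frac{T_1 + W(n) + n\,s(n)}{P} + s(n)\,T_\infty\right)
  = O\!\left(\frac{T_1 + n\lg n + n\lg n}{P} + T_\infty\lg n\right)
  = O\!\left(\frac{T_1 + n\lg n}{P} + T_\infty\lg n\right).
\]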