In some micro-benchmarks, more than 50% of CPU time is spent in GenericTaskQueueSet::steal_best_of_2 because evacuation threads repeatedly, and too eagerly, try to steal when there is very little to steal.
Significant improvements can be shown by:
- reducing the number of steal attempts per steal round
- not trying to steal from a victim whose queue is almost empty as well (otherwise the victim runs dry and immediately has to start stealing itself)
- if a steal attempt fails (and at least some elements remain in the queues), giving up the CPU immediately instead of doing a long active wait (after checking whether we can terminate)
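
The three ideas above can be illustrated with a minimal, single-threaded C++ sketch. This is not the HotSpot code: the names (try_steal, steal_round, kMaxStealAttempts, kMinVictimSize), the threshold values, the round-robin victim selection, and the "all queues empty" termination check are hypothetical stand-ins chosen only to show the shape of the heuristics.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <optional>
#include <thread>
#include <vector>

// Hypothetical tuning constants; assumptions for illustration, not HotSpot values.
constexpr std::size_t kMaxStealAttempts = 2;  // cap on victims probed per steal round
constexpr std::size_t kMinVictimSize = 2;     // victims with fewer tasks are left alone

using Task = int;
using Queue = std::deque<Task>;  // single-threaded stand-in for a work-stealing queue

enum class StealResult { Stolen, Yielded, Terminate };

// Probe at most kMaxStealAttempts victims, skipping nearly empty ones so that
// their owners are not forced to start stealing right away.
std::optional<Task> try_steal(std::vector<Queue>& queues, std::size_t self) {
  assert(self < queues.size());
  std::size_t attempts = 0;
  for (std::size_t i = 1; i < queues.size() && attempts < kMaxStealAttempts; ++i) {
    Queue& victim = queues[(self + i) % queues.size()];  // simplified victim choice
    ++attempts;
    if (victim.size() < kMinVictimSize) {
      continue;  // almost empty: not worth taking the owner's last task
    }
    Task t = victim.front();
    victim.pop_front();
    return t;
  }
  return std::nullopt;
}

// Simplified termination check (assumption): no work left in any queue.
bool all_empty(const std::vector<Queue>& queues) {
  for (const Queue& q : queues) {
    if (!q.empty()) return false;
  }
  return true;
}

// One steal round: steal if possible; otherwise check for termination first,
// and if work remains elsewhere, give up the CPU instead of spinning.
StealResult steal_round(std::vector<Queue>& queues, std::size_t self, Task& out) {
  if (std::optional<Task> t = try_steal(queues, self)) {
    out = *t;
    return StealResult::Stolen;
  }
  if (all_empty(queues)) {
    return StealResult::Terminate;
  }
  std::this_thread::yield();
  return StealResult::Yielded;
}
```

A caller would invoke steal_round in a loop, processing the task on Stolen, retrying on Yielded, and leaving the loop on Terminate. A real implementation needs a concurrent queue and a proper termination protocol; both are deliberately omitted here.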