False sharing induced by card table marking
Garbage-collected runtime environments frequently use card tables in conjunction with write barriers to accelerate root-scanning during garage collection. (See A Fast write barrier for generational garbage collectors by Urs Holzle, ECOOP-OOPSLA 1993 for details). Briefly, and skipping a few details, this design partitions the heap into an array of power-of-two sized card pages. When a mutator stores into a reference field of an object the runtime write barrier code will mark the card table entry for the card page containing (covering) that field as dirty. In the HotSpot JVM the card page size is 512 bytes and the card table is implemented as a simple array of bytes. That is, a given card table entry, which represents the state of a card page, is just one byte. The write barrier is emitted directly by the JIT and is usually just a shift and store instruction. In a subsequent non-moving minor GC, the collector can avoid scanning reference fields in a card that is not dirty.
This design is well-proven widely employed but unfortunately it can result in performance issues in highly concurrent environments. Lets say our cache line size is 64 bytes, which is fairly common in modern processors. This means that 64 cards (32KB = 64*512) will share a cache line in the card table. So reference stores by different threads that just happen to fall within the same 32KB region cause writes to the same cache line underlying the card table. This can result in excessive write invalidation and cache coherence traffic, which can reduce performance and impede scaling. Interestingly, the impact can worsen after a full/moving GC as threads tend to allocate into different address ranges by virtue of thread-local allocation buffers (TLABs), but after a full collection the remaining objects tend to be more tightly packed and thus more prone to the problem. Furthermore, most card table stores are redundant, as often the card is already marked dirty. This suggests a simple solution: instead of using an unconditional store in the barrier, we first check the card table entry and only store if it is clean. This slightly increases the barrier path-length and adds a conditional branch -- unless we were to be somewhat clever with conditional moves by annulling a redundant store by changing the destination address to be a thread-local dummy variable. On the other hand it avoids the problem. (For historical background, some years ago Doug Lea noticed an odd slow-down after a full GC in some concurrent Java code. He contacted me and I speculated that false sharing in the card table could be the issue. We conjured up a JVM with an experimental -XX:+UseCondCardMark flag that let us emit write barriers as either the usual unconditional store, or a conditional form that avoids redundant stores. The conditional form provided relief).
I ran into the problem recently when experimenting with some concurrent queue implementations, which are reference-heavy, on a Sun 256-way Sun/Oracle T5440. This is 4-socket system where each socket contains a Niagara T2+ UltraSPARC processor having 64 logical processors. My benchmark has 50 producer threads and 50 consumer threads and measures and reports the message throughput over a timing interval. In the default configuration we can pass about 8.8K messages per msec. Using the same JVM, when I add the -XX:+UseCondCardMark flag we can achieve 52.5K messages per msec, clearly demonstrating the magnitude of the effect.
I should also note that we ran into this same issue when experimenting with Java-level lock elision using hardware transactional memory. If two unrelated concurrent transactions happened to store into reference fields in the same 32KB region we'd have false aborts because of write-sharing on the card table cache line. Again, -XX:+UseCondCardMark provided relief.