Currently, G1's thread-local data contains only two PtrQueues: the marking queue and the dirty card queue.
It may be worth putting the card table base pointer into the thread-local data as well. Currently, the card table base is encoded directly into the code stream, which takes a lot of space on x86 and may be a problem: one 10-byte instruction loads the constant into a register, and the comparison using that register then takes another 3 bytes.
Putting the card table base into thread-local storage would bring the comparison down to 4 bytes in total.
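A minimal C++ sketch of the proposed layout follows. It is not HotSpot code: `PtrQueueSketch`, `G1ThreadLocalDataSketch`, `card_for` and the 512-byte card size (`kCardShift = 9`) are hypothetical names and assumptions for illustration. It shows the card table base stored next to the two existing queues, and the card address derived from that thread-local field instead of from a 64-bit constant embedded in the instruction stream.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for HotSpot's PtrQueue (buffer + index); not the real class.
struct PtrQueueSketch {
  void** buffer = nullptr;
  std::size_t index = 0;
};

// Sketch of G1's per-thread data with the proposed extra field appended.
struct G1ThreadLocalDataSketch {
  PtrQueueSketch satb_mark_queue;      // existing: marking queue
  PtrQueueSketch dirty_card_queue;     // existing: dirty card queue
  std::uintptr_t card_table_base = 0;  // proposed: biased card table base
};

constexpr int kCardShift = 9;  // assumption: 512-byte cards

// Computes the card byte for a field address from the thread-local base,
// i.e. base + (addr >> shift), with no 64-bit immediate involved.
inline std::uint8_t* card_for(const G1ThreadLocalDataSketch& tld,
                              std::uintptr_t field_addr) {
  assert(tld.card_table_base != 0 && "card table base not initialized");
  return reinterpret_cast<std::uint8_t*>(tld.card_table_base +
                                         (field_addr >> kCardShift));
}
```

The base is stored "biased" (card array address minus `heap_start >> kCardShift`), so a single shift-and-add yields the card for any heap address; whether the real field would sit at an offset small enough for a short x86 encoding is not something this sketch establishes.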
Code size matters here because at least the first check (whether the card is young) is laid out in the fast path.
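The ordering of the card checks can be sketched as below, again with hypothetical names (`ThreadSketch`, `post_barrier`, `BarrierExit`) and illustrative card values (`kYoungCard`, `kDirtyCard`). The cross-region and null checks that a real G1 post barrier also performs are omitted. Only the young-card comparison is meant to stand for the fast path; the fence, the dirty check and the enqueue stand in for the slower remainder of the barrier.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint8_t kYoungCard = 2;  // illustrative value only
constexpr std::uint8_t kDirtyCard = 0;  // illustrative value only
constexpr int kCardShift = 9;           // assumption: 512-byte cards

struct ThreadSketch {
  std::uintptr_t card_table_base;                // proposed thread-local field
  std::vector<std::uint8_t*> dirty_card_queue;   // stand-in for the PtrQueue
};

enum class BarrierExit { YoungCard, AlreadyDirty, Enqueued };

inline BarrierExit post_barrier(ThreadSketch& t, std::uintptr_t field_addr) {
  assert(t.card_table_base != 0 && "card table base not initialized");
  std::uint8_t* card = reinterpret_cast<std::uint8_t*>(
      t.card_table_base + (field_addr >> kCardShift));
  // Fast path: the first check, comparing the card against the young value.
  if (*card == kYoungCard) return BarrierExit::YoungCard;
  // Remainder: order the store before re-reading the card, then dirty it.
  std::atomic_thread_fence(std::memory_order_seq_cst);
  if (*card == kDirtyCard) return BarrierExit::AlreadyDirty;
  *card = kDirtyCard;
  t.dirty_card_queue.push_back(card);
  return BarrierExit::Enqueued;
}
```

With the base in thread-local data, the fast-path comparison no longer depends on first materializing a 64-bit constant, which is the code-size saving argued for above.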