In the constructor for the ConcurrentMark class in G1 we set up one bitmap per worker thread:
for (uint i = 0; i < _max_worker_id; ++i) {
  ...
  _count_card_bitmaps[i] = BitMap(card_bm_size, false);
  ...
}
Each of these bitmaps is malloced, which means that the amount of C-heap we require grows with the number of GC worker threads. On a machine with many CPUs we get many worker threads by default; for example, on scaaa007 I get 79 GC worker threads. The size of each bitmap also scales with the heap size, and since this large machine has a lot of memory we get a large default heap as well.
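To get a feel for how this multiplies, here is a back-of-the-envelope sketch. One bit per 512-byte card matches the HotSpot default card granularity, but the 32 GB heap below is just an illustrative assumption, not the actual heap size from the run on scaaa007; the point is the multiplicative growth, not the exact figures.

#include <cstdint>
#include <cstdio>

int main() {
  // Illustrative numbers: a hypothetical 32 GB heap and the 79 worker
  // threads seen on scaaa007. One bit per 512-byte card is the HotSpot
  // default card size.
  const uint64_t heap_bytes = 32ULL * 1024 * 1024 * 1024;
  const uint64_t card_size  = 512;
  const uint64_t workers    = 79;

  uint64_t bytes_per_bitmap = heap_bytes / card_size / 8; // one bit per card
  uint64_t total_bytes      = bytes_per_bitmap * workers;

  // For these numbers: 8 MB per worker, 632 MB in total.
  printf("per worker: %llu MB, total: %llu MB\n",
         (unsigned long long)(bytes_per_bitmap >> 20),
         (unsigned long long)(total_bytes >> 20));
  return 0;
}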
Here is the output from just running java -version with G1 on scaaa007:
$ java -d64 -XX:+UseG1GC -XX:+PrintMallocStatistics -version
java version "1.8.0-ea-fastdebug"
Java(TM) SE Runtime Environment (build 1.8.0-ea-fastdebug-b92)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b34-fastdebug, mixed mode)
allocation stats: 35199 mallocs (221MB), 13989 frees (0MB), 35MB resrc
We malloc 221MB just by starting the VM, and most of the large allocations are BitMap allocations. I have a patch that changes the BitMaps to be allocated through the ArrayAllocator class instead, which uses mmap on Solaris when the size is larger than 64K.
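The sketch below shows the general malloc-or-mmap idea. It is not the HotSpot ArrayAllocator itself: the function names and the explicit used_mmap flag are mine, and the real class does more bookkeeping.

#include <cstdlib>
#include <sys/mman.h>

// Sketch of the idea: small blocks come from the C-heap via malloc,
// anything at or above the threshold is backed by anonymous mmap.
static const size_t MmapThreshold = 64 * 1024; // 64K, as described above

void* allocate_array(size_t bytes, bool* used_mmap) {
  if (bytes >= MmapThreshold) {
    void* p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANON, -1, 0);
    if (p != MAP_FAILED) {
      *used_mmap = true;
      return p;
    }
  }
  *used_mmap = false; // small (or failed mmap): fall back to the C-heap
  return malloc(bytes);
}

void free_array(void* p, size_t bytes, bool used_mmap) {
  if (used_mmap) {
    munmap(p, bytes); // mmap'ed memory must be unmapped, not free'd
  } else {
    free(p);
  }
}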
With this patch the output looks like this:
$ java -d64 -XX:+UseG1GC -XX:+PrintMallocStatistics -version
java version "1.8.0-ea-fastdebug"
Java(TM) SE Runtime Environment (build 1.8.0-ea-fastdebug-b93)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b34-internal-201306130943.brutisso.hs-gc-g1-mmap-fastdebug, mixed mode)
allocation stats: 35217 mallocs (31MB), 14031 frees (0MB), 35MB resrc
We are down to 31MB.
Note that the ArrayAllocator only uses mmap on Solaris, so this change has no effect on other platforms. Also note that the total amount of memory is not reduced; it has just moved from the C-heap to mapped memory.
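One way to see that move on Solaris, if you want to check for yourself, is to look at the process mappings with pmap; after the patch the bitmap memory should show up as anonymous mappings rather than as part of the process heap:

$ pmap -x <java-pid>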