I've been looking at the forums on forum.java.sun.com and became aware of
a VM crash that is caused by JNI code modifying the FPU control word
and not restoring it. The VM could be more resilient against this, and the
particular piece of code that failed for the customer (the safepoint blob) could
be changed with virtually no performance impact. In discussing this internally,
the following suggestion was put forth:
This has been a long-standing problem. I think C2 is generally not affected by this because it tends to use SSE registers and has somewhat different register save logic. C1 often saves more registers than it needs to because it doesn't know enough at the call site to save only the right ones. The new C1 hardly ever saves the FPU registers using fstp, except in the deopt blob, so the incidence of these kinds of crashes should be much lower. Instead we'll just compute slightly wrong answers because the FPU control word isn't correctly set. 6292965 is a recent instance of this problem.
I had a conversation with Ken about this and it would certainly be possible to make the safepoint blob and/or other blobs ensure that the FPU control word is set correctly. You could also modify the interpreter to check for it and maybe identify problematic native methods. You could also modify the trap handler to detect that a trap occurred in generated code because the FPU control word wasn't properly set, restore it, and resume. It's a bit unsatisfying though, since you run for some arbitrary amount of time with an invalid control word.
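As a rough illustration of the "make sure the FPU control word is set correctly" check, here is a minimal C sketch. It is not HotSpot code. The names read_cw, write_cw, alternate_cw and verify_and_restore_cw are hypothetical. On x86 with GCC or Clang it reads the real x87 control word via fnstcw/fldcw; on other platforms it assumes the C99 rounding mode from <fenv.h> is an acceptable stand-in.

```c
#include <assert.h>
#include <stdbool.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/* x86 with GCC/Clang: access the real x87 control word. */
typedef unsigned short fpu_cw_t;

static fpu_cw_t read_cw(void) {
    fpu_cw_t cw;
    __asm__ __volatile__("fnstcw %0" : "=m"(cw));
    return cw;
}

static void write_cw(fpu_cw_t cw) {
    __asm__ __volatile__("fldcw %0" : : "m"(cw));
}

/* Return a valid control word that differs from cw: 0x027F (53-bit
 * precision) or 0x037F (64-bit precision), both with all exceptions masked. */
static fpu_cw_t alternate_cw(fpu_cw_t cw) {
    return (cw == 0x027F) ? 0x037F : 0x027F;
}
#else
/* Assumption: elsewhere, the C99 rounding mode stands in for the control word. */
#include <fenv.h>
typedef int fpu_cw_t;

static fpu_cw_t read_cw(void) { return fegetround(); }
static void write_cw(fpu_cw_t cw) { fesetround(cw); }

/* Return a rounding mode that differs from cw. */
static fpu_cw_t alternate_cw(fpu_cw_t cw) {
    return (cw == FE_TOWARDZERO) ? FE_TONEAREST : FE_TOWARDZERO;
}
#endif

/* Hypothetical checkpoint hook (for example, what a blob could do on entry).
 * Returns true if the control word had drifted and was repaired. */
static bool verify_and_restore_cw(fpu_cw_t expected) {
    if (read_cw() == expected) {
        return false;               /* nothing to do */
    }
    write_cw(expected);             /* repair before any FPU work happens */
    assert(read_cw() == expected);  /* debug-build sanity check */
    return true;
}
```

A caller would capture the expected value once with read_cw() at startup and pass it to verify_and_restore_cw() at each checkpoint. alternate_cw() exists only to simulate a native that leaves a different control word behind.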
One thought I had was to modify the native wrappers to issue some FPU instruction on return from native that would trap if issued with an invalid control word. Then we detect it immediately and can take corrective measures. If a particular native always corrupts the control word then we could throw its wrapper out and recompile it with code to always restore the control word. You might also have to do this on JNI upcalls to Java.
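The wrapper policy described above could look roughly like the following C sketch. It makes several assumptions. It substitutes an explicit compare on return for the trapping FPU instruction. The always_restore flag is a hypothetical stand-in for "throw the wrapper out and recompile it". None of the names (native_stub, call_native and so on) come from the VM. On non-x86 platforms the C99 rounding mode is assumed to stand in for the control word.

```c
#include <assert.h>
#include <stdbool.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/* x86 with GCC/Clang: access the real x87 control word. */
typedef unsigned short fpu_cw_t;
static fpu_cw_t read_cw(void) {
    fpu_cw_t cw;
    __asm__ __volatile__("fnstcw %0" : "=m"(cw));
    return cw;
}
static void write_cw(fpu_cw_t cw) {
    __asm__ __volatile__("fldcw %0" : : "m"(cw));
}
/* 0x027F and 0x037F are both valid settings with all exceptions masked. */
static fpu_cw_t alternate_cw(fpu_cw_t cw) {
    return (cw == 0x027F) ? 0x037F : 0x027F;
}
#else
/* Assumption: elsewhere, the C99 rounding mode stands in for the control word. */
#include <fenv.h>
typedef int fpu_cw_t;
static fpu_cw_t read_cw(void) { return fegetround(); }
static void write_cw(fpu_cw_t cw) { fesetround(cw); }
static fpu_cw_t alternate_cw(fpu_cw_t cw) {
    return (cw == FE_TOWARDZERO) ? FE_TONEAREST : FE_TOWARDZERO;
}
#endif

typedef void (*native_fn)(void);

/* Hypothetical per-native wrapper state. */
typedef struct {
    native_fn fn;
    bool      always_restore;    /* models the "recompiled" wrapper */
    unsigned  corruptions_seen;  /* how often the check caught a bad value */
} native_stub;

/* Call the native, then make sure the control word is what the VM expects. */
static void call_native(native_stub *stub, fpu_cw_t expected) {
    stub->fn();
    if (stub->always_restore) {
        write_cw(expected);      /* known offender: restore without checking */
        return;
    }
    if (read_cw() != expected) { /* detected immediately on return */
        stub->corruptions_seen++;
        write_cw(expected);
        assert(read_cw() == expected);
        stub->always_restore = true;
    }
}

/* Two toy natives for demonstration only. */
static void well_behaved_native(void) { }
static void misbehaving_native(void) { write_cw(alternate_cw(read_cw())); }
```

With this shape, a well-behaved native pays only for one read and compare per call, while a native that has been caught once pays for an unconditional restore on every later call. The same check would presumably be needed on the JNI upcall path as well.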
I'm not sure what the right solution is, but I think we should figure out something and backport it to a tiger update. I did a Google search for the hs_err output that corresponds to this kind of crash and it showed up quite a lot. I think it's something that customers see occasionally but don't report much. We have generally treated it as a bug in whatever code changed the control word instead of making ourselves robust in the face of it. I think we should figure out a way to be robust without having silently incorrect FPU results.