JDK-8029952 : Parser doesn't scale well for large literals
  • Type: Bug
  • Component: core-libs
  • Sub-Component: jdk.nashorn
  • Priority: P4
  • Status: Closed
  • Resolution: Future Project
  • OS: generic
  • CPU: generic
  • Submitted: 2013-12-11
  • Updated: 2017-07-05
  • Resolved: 2017-07-05
Other: tbd_major (Resolved)
Related Reports
Relates :  
Relates :  
Description
See attached program. Significant time is spent in lexing the array and also accepting the values inside it.

Comments
Set priority to P4 and removed regression label as JDK-8143304 fixes the SyntaxError in JDK9.
20-11-2015

Linking to JDK-8059934 as an alternative way to fix this issue.
20-11-2015

Linking to JDK-8134292 as caching of eager pre-pass AST would avoid duplicate parsing of the script.
19-11-2015

The SyntaxError regression only happens in optimistic mode, and is caused by using the wrong boundaries when reparsing the main program function. We use the wrong boundaries because the source code size of the example (38888903 bytes) exceeds the limit of the token length in our token encoding. We currently encode tokens into a long where 8 bits are used for the token type, 24 bits for the token length, and the remaining 32 bits for the token position. The reasoning behind using fewer bits for the token length than for the token position was probably that a token would never be as large as the whole source, but for the top-level (program) function it is.
The fix I suggest for this is to use 28 bits for both token position and length, given that they can both get roughly the size of the source code. 2^28 (roughly 260 MB) should still be enough for code size. Jim suggested not storing the token length in the token itself but instead referring to the position of the next token. I do see the benefits of this approach, but it would require significant changes to the parser and all parts dealing with tokens, so I'm a bit hesitant to go that way.
There are a few other things I found that could be improved, however. The first is the fact that the program AST is not cached. Currently we only cache split functions (in codegen.CacheAst) because apply-to-call and symbol count would disallow caching. However, this is a case where a function is not split and does not use apply-to-call or symbol count. We probably should do the extra work to make sure functions/scripts like this can be cached.
The other thing I found is number parsing. The script mostly consists of number literals, and it takes a long time to parse (about 4 seconds on my computer). We might want to port the StrToD conversion from V8 double-conversion, which I omitted when porting double-conversion in JDK-8010803.
18-11-2015
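
The token-length overflow described in the comment above can be illustrated with a minimal Java sketch. This is not Nashorn's actual token class: the class name TokenPacking, the method names, and the exact bit order of the fields are assumptions made for illustration. Only the field widths (8/24/32 currently, 8/28/28 proposed) and the source size of 38888903 bytes come from the comment.

```java
// Hypothetical sketch of packing a token descriptor into a long.
// Names and bit order are illustrative assumptions, not Nashorn's real code.
public final class TokenPacking {
    private TokenPacking() {}

    // Current layout per the comment: 8-bit type, 24-bit length, 32-bit position.
    // Assumed bit order: type in bits 0-7, length in bits 8-31, position in bits 32-63.
    public static long pack24(int type, int position, int length) {
        return ((long) position << 32)
             | ((long) (length & 0xFFFFFF) << 8)   // length silently truncated to 24 bits
             | (type & 0xFF);
    }

    public static int length24(long token) {
        return (int) ((token >>> 8) & 0xFFFFFF);
    }

    public static int position24(long token) {
        return (int) (token >>> 32);
    }

    // Proposed layout per the comment: 8-bit type, 28-bit length, 28-bit position.
    // Assumed bit order: type in bits 0-7, length in bits 8-35, position in bits 36-63.
    public static long pack28(int type, int position, int length) {
        return ((long) (position & 0xFFFFFFF) << 36)
             | ((long) (length & 0xFFFFFFF) << 8)
             | (type & 0xFF);
    }

    public static int length28(long token) {
        return (int) ((token >>> 8) & 0xFFFFFFF);
    }

    public static int position28(long token) {
        return (int) ((token >>> 36) & 0xFFFFFFF);
    }

    public static void main(String[] args) {
        int sourceLength = 38_888_903;   // size of the example script in bytes
        int programType = 1;             // arbitrary type tag for the demo

        long current = pack24(programType, 0, sourceLength);
        long proposed = pack28(programType, 0, sourceLength);

        // 38888903 does not fit in 24 bits (max 16777215), so it wraps around.
        System.out.println(length24(current));   // prints 5334471
        // 28 bits (max 268435455) hold the full length.
        System.out.println(length28(proposed));  // prints 38888903
    }
}
```

With a 24-bit length field, the program-level token reports a length of 5334471 instead of 38888903, which matches the kind of wrong reparse boundary the comment describes. The 28/28 split trades maximum position range for length range while keeping the descriptor in a single long.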