Bug ID: JDK-8046095 JEP 105: DocTree API

Summary
-------

Extend the Compiler Tree API to provide structured access to the content
of javadoc comments.


Goals
-----

Provide access to the syntactic elements of a javadoc comment.


Non-Goals
---------

It is a non-goal to check the HTML tags in a javadoc comment for semantic
well-formedness, that is, to check the HTML against a DTD or similar;
however, it should be possible to use the API to build such tools.


Motivation
----------

This API will enable a new generation of doc comment tools to be provided. Such
tools could either be written using the Compiler and Tree API, or could be
written as annotation processors. One tool that is long overdue is an
updated equivalent of the old DocCheck doclet, which checks simple rules and
guidelines for the contents of doc comments and has never been updated
for the language changes in Java 5 and later.

javadoc could also be rewritten to take advantage of the new structured doc
comment objects, and to use the additional information, such as source
positions, in its error messages. The HTML parsing would also help javadoc
generate valid XHTML. Although the work in JDK 7 javadoc makes it easy to
generate XHTML for the sections generated by javadoc itself, javadoc does not
currently have the means to check or verify the use of XHTML within the doc
comments of the source files it is processing.


Description
-----------

The problem(s) ...

In JDK 5, there was a single scanner, capable of reading doc comments as needed
by javadoc. In JDK 6, the code was refactored into two scanners, one that was
capable of reading doc comments, suitable for use by javadoc, and one that was
not, suitable for use by javac. That split held until we added the various
public APIs to javac, through which clients and annotation processors can get
access to doc comments if they so desire.

This means there are 3 types of clients for doc comments:

1. comments not required -- javac, when no annotation processors need to be run
2. comments definitely required -- javadoc
3. comments maybe required -- clients of javac public API, including annotation
   processors run by javac

The problem is that reading and maintaining the doc comment table is expensive,
and in category 3, we have to support the doc comment table on the off chance
that clients might need it, even though most do not.

Another problem with the doc comment table is that it is very low-tech. It is
simply a map of tree node to string, where the string is the doc comment as
needed by javadoc, meaning that the beginning of each line (the white space and
typical '*') has been stripped away. This makes it very difficult indeed to
relate positions within the doc comment back to positions in the original
source file, which is why you don't see any traditional "emacs-style" error
messages coming from javadoc about "parameter name not found" or "exception not
declared to be thrown".

The same low-tech doc comment table is exposed to clients of the Tree API. For
any tree node, you can get the doc comment string. Beyond that, you're on your
own.

The idea(s) ...

First up is to upgrade the doc comment table stored in each compilation
unit. Replace

    Map<JCTree, String> docComments; 

by

    Map<JCTree, JCDocComment> docComments;

JCDocComment is a new object internal to javac, and provides lazy access to the
doc comment string. At a minimum it contains the starting position of the doc
comment in the source file: the position of the "/" character.

    interface JCDocComment {
        int getPosition();
        String getComment();
    }

This allows us to have possibly three different doc comment scanners, for the
three different kinds of client. For javac with no annotation processors, we
continue to use the standard Scanner and leave the docComments table empty, as
now. For javadoc, we continue to read the doc comments as now, except that now
we store them in JCDocComment objects. For javac when we don't know whether doc
comments are required or not, we simply store the starting position of the doc
comment. This saves us storing the text of all the doc comments when they are
not required. The price is that when we do need the comments, we have to go
back and recover the text of the comment from the source file. Different
strategies are possible. If any doc comment in a source file is required, we
could scan all of them. Note we don't have to scan the source text between the
comments because we have the starting position of the comment available, so we
can just skip the text between the comments. Or we could just read the comments
as needed, and rely on the content cache to save us having to read the source
file contents for each individual comment.
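
The "store only the starting position" strategy can be sketched roughly as follows. This is a minimal illustration, not javac's implementation: `LazyDocComment` and `LazyDocCommentDemo` are hypothetical names, a plain `String` stands in for the content cache, and the returned text is the raw comment including its delimiters rather than the stripped form that javadoc uses.

```java
/** Mirrors the JCDocComment interface shown above. */
interface JCDocComment {
    int getPosition();
    String getComment();
}

/**
 * Hypothetical implementation for the "comments maybe required" case:
 * only the start offset is stored up front.
 */
final class LazyDocComment implements JCDocComment {
    private final String source;  // assumption: stands in for the cached file content
    private final int position;   // offset of the '/' that opens the comment
    private String comment;       // recovered on first request, then reused

    LazyDocComment(String source, int position) {
        this.source = source;
        this.position = position;
    }

    @Override
    public int getPosition() {
        return position;
    }

    @Override
    public String getComment() {
        if (comment == null) {
            // Start after the opening "/*" so the search cannot match inside it.
            int close = source.indexOf("*/", position + 2);
            // An unterminated comment runs to the end of the source.
            int end = (close < 0) ? source.length() : close + 2;
            comment = source.substring(position, end);
        }
        return comment;
    }
}

final class LazyDocCommentDemo {
    public static void main(String[] args) {
        String source = "package p;\n/** Says hi. */\nclass A {}\n";
        JCDocComment dc = new LazyDocComment(source, source.indexOf("/**"));
        System.out.println(dc.getPosition() + ": " + dc.getComment());
    }
}
```

Because the text is recovered directly from the stored offset, nothing between comments needs to be rescanned, and clients that never call `getComment()` pay only for one `int` per comment.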

The next idea is a better, parsed, representation of doc comments.

A doc comment consists of

* an initial sentence
* the rest of the main description
* a list of tags each followed by a description

Each of these contains a sequence of fragments where each fragment can be one
of

* plain text, including characters from malformed fragments like '<',
  '>', '&', '{', etc.
* an HTML start-entity, which contains a name and a list of name-value pairs,
  such as '\<a href="Object.html">'
* an HTML end-entity, which contains a name, such as '\</a>'
* an HTML character entity, such as '\&amp;'
* a taglet, such as {@link Object}

Obviously, these can be modeled with a simple hierarchy of tree nodes, so I
suggest a new package, com.sun.source.doccomments to contain the interfaces for
these tree nodes. It would best be a separate hierarchy from the existing
com.sun.source.tree.Tree, so I suggest a new common super-interface
com.sun.source.doccomments.DTree. We can then extend the utility methods in
com.sun.source.util, to provide access to the parsed doc comment for any tree
node, and to provide source position info for any DTree node.
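
To make the fragment kinds concrete, here is a rough sketch of what such a node hierarchy and a scanner producing it might look like. Only the name `DTree` comes from the text; `Kind`, `Fragment`, `FragmentScanner`, and the regular expressions are assumptions made for illustration, and a real API would likely use one interface per kind and a hand-written scanner.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Common super-interface; the members shown here are assumptions. */
interface DTree {
    enum Kind { TEXT, START_ELEMENT, END_ELEMENT, ENTITY, INLINE_TAG }

    Kind getKind();
    int getStartPosition();   // offset within the comment body
    String getText();
}

/** Hypothetical single node type covering every fragment kind. */
final class Fragment implements DTree {
    private final Kind kind;
    private final int start;
    private final String text;

    Fragment(Kind kind, int start, String text) {
        this.kind = kind;
        this.start = start;
        this.text = text;
    }

    @Override public Kind getKind() { return kind; }
    @Override public int getStartPosition() { return start; }
    @Override public String getText() { return text; }
}

final class FragmentScanner {
    // One alternative per structured fragment kind; anything else is plain text.
    private static final Pattern STRUCTURED = Pattern.compile(
          "(?<tag>\\{@[^}]*\\})"                   // {@link Object}
        + "|(?<end></[A-Za-z][A-Za-z0-9]*\\s*>)"   // </a>
        + "|(?<start><[A-Za-z][^<>]*>)"            // <a href="Object.html">
        + "|(?<entity>&#?[A-Za-z0-9]+;)");         // &amp;

    /** Splits a comment body into fragments, keeping each fragment's offset. */
    static List<DTree> scan(String body) {
        List<DTree> out = new ArrayList<>();
        Matcher m = STRUCTURED.matcher(body);
        int last = 0;
        while (m.find()) {
            if (m.start() > last) {
                out.add(new Fragment(DTree.Kind.TEXT, last, body.substring(last, m.start())));
            }
            out.add(new Fragment(kindOf(m), m.start(), m.group()));
            last = m.end();
        }
        if (last < body.length()) {
            out.add(new Fragment(DTree.Kind.TEXT, last, body.substring(last)));
        }
        return out;
    }

    private static DTree.Kind kindOf(Matcher m) {
        if (m.group("tag") != null) return DTree.Kind.INLINE_TAG;
        if (m.group("end") != null) return DTree.Kind.END_ELEMENT;
        if (m.group("start") != null) return DTree.Kind.START_ELEMENT;
        return DTree.Kind.ENTITY;
    }
}
```

Characters from malformed fragments, such as a '<' that does not begin a tag, match none of the alternatives and so fall into the surrounding plain text, as described in the list above.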

Note, the HTML start-entity and end-entity are handled as separate items to
avoid getting into issues of parsing HTML and knowing which tags require a
closing tag and which not. That layer can be built on top of this abstraction
by those applications that need it.
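
As an illustration of such a layer, the sketch below pairs start and end items with a stack. Everything in it is hypothetical: the `HtmlNestingCheck` class, the encoding of events as element names (with a leading '/' for end items), and the small set of elements assumed never to take a closing tag. It is also stricter than HTML, which allows some closing tags to be omitted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class HtmlNestingCheck {
    // Assumption: a tiny sample of elements that never take a closing tag.
    private static final Set<String> VOID =
            new HashSet<>(Arrays.asList("br", "hr", "img"));

    /**
     * Checks a sequence of start/end items for proper nesting.
     * Each event is an element name; end items carry a leading '/'.
     */
    static List<String> check(List<String> events) {
        Deque<String> open = new ArrayDeque<>();
        List<String> problems = new ArrayList<>();
        for (String event : events) {
            if (event.startsWith("/")) {
                String name = event.substring(1);
                if (name.equals(open.peek())) {
                    open.pop();
                } else {
                    problems.add("unexpected </" + name + ">");
                }
            } else if (!VOID.contains(event)) {
                open.push(event);
            }
        }
        // Anything still on the stack was never closed; innermost first.
        for (String name : open) {
            problems.add("unclosed <" + name + ">");
        }
        return problems;
    }
}
```

Keeping this knowledge out of the doc comment nodes themselves means each application can choose its own rules, for example a strict XHTML checker versus a lenient HTML one.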

Parsing these comments will not be cheap; nothing involving lexing and parsing
ever is. And so this is another reason to provide and use the lazy access to
doc comments via the JCDocComment table described earlier. Except now, there is
a more interesting method on it as well, to get the DTree for a comment --
which is another reason not to bother to keep the simple string that is
currently provided.


Testing
-------

langtools regression tests will be written to exercise the new API.
One specific test will be to read and process all the JDK API comments.

There are no special platform or hardware requirements.


Dependences
-----------

This work has no dependences on other JEPs. 

It is expected that other JEPs will depend on this one.


Impact
------

  - Other JDK components: javadoc
  - Compatibility: minimal
  - Internationalization: minimal
  - Localization: minimal

Blocks :	JDK-8046162 - JEP 172: DocLint
Relates :	JDK-7021614 - extend com.sun.source API to support parsing javadoc comments
Relates :	JDK-7070810 - DocTree API: extend javac Tree API to include javadoc comments