Bug ID: JDK-8221327 Arrays.sort places the first entry in the array last when it should be first

Type: Bug
Component: core-libs
Sub-Component: java.util:collections
Affected Version: 8u201

Priority: P4
Status: Closed
Resolution: Duplicate
OS: windows_7
CPU: x86_64

Submitted: 2019-03-06
Updated: 2019-03-26
Resolved: 2019-03-26

ADDITIONAL SYSTEM INFORMATION :
SE runtime build is 1.8.0.201-b09. Hotspot 64-bit build 25.201.b09 mixed mode.

A DESCRIPTION OF THE PROBLEM :
Using a comparator with arrays sort the value "aah " comes after "aalii " and after "zythum ". All other entries are sorted correctly. So the sort is correct from aalii to zythem, then aah is the final element in the sorted array. The hex values in the input file for aah are: 61 61 68 20 and directly follow the magic number EF BB BF. Using compareToIgnoreCase the comparison returns abacist to aah  -65182; aah to zythum 65157; abacist to aalii  1. The array size is 306,962 elements.

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Create a 307,000 list of words beginning with aah and ending with zythum and sort it using a class comparator.


EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
All elements sorted alphabetically in ascending order.
ACTUAL -
Sort is perfect except that what should be the first entry ends up last.

---------- BEGIN SOURCE ----------
The input file is probably more useful than code.
The invocation is after converting an ArrayList to an array:
	DictEntry dea[] = new DictEntry[dictionary.size()];
	dea = dictionary.toArray(dea);
	Arrays.sort(dea, new DictSortEntries());
My DictEntry class has three elements: word, part-of-speech and definition.
public final class DictEntry {
	String word, pos, definition;
}
And the comparator:
public class DictSortEntries implements Comparator<DictEntry> {
	public int compare(DictEntry a, DictEntry b) {
		return a.word.compareToIgnoreCase(b.word);
	} 
}


---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
No good one. Implement my own sort.

FREQUENCY : always

Response from submitter on Stuart's comment: Thank you for the very complete response. I agree, there is no good solution. I can imagine some situations when the programmer might want to read in the magic number... Also, there are many magic numbers, so it is probably best to leave it as is and let the programmer handle it. Now that I am aware of this, it won't be a problem. Again, thank you!
26-03-2019
Additional comments from submitter: After working on this further, I discovered that the issue was that the magic number was included in the string "aah" even though the length function reported the "aah" string as being four characters (not 6). After removing the magic number from the file, everything worked as expected. Some of the entries in the source file contained a trailing space, so the length of four did not raise any concern initially. I still do not understand why the length function only returned four if the string was preceded by the three character magic number? I have attached a small text file (created with Microsoft Notepad) that has three words and starts with the magic number EF BB BF. Each word is on a separate line. Output below was printed using this statement: System.out.println("word: " + word + " length word: " + word.length()); word: aah length word: 4 word: zyt length word: 3 word: bbb length word: 3
26-03-2019
It looks to me like the file is encoded in UTF-8, since the "magic numbers" EF BB BF are the UTF-8 encoding of the character U+FEFF "zero width no-break space" otherwise known as a byte-order mark or BOM. Some windows applications are known to emit a BOM at the front of text files. If this is true, then the reason for sorting and the comparators' apparent misbehavior is that the first string being sorted doesn't actually have the value "aah" but instead has the value "\ufeffaah". That is, the first char value in the string is 0xfeff. Since this is a very high char value, it's not surprising that it causes that string to sort at the very end of the list. Probably the best course of action is to try to ensure that the BOM character doesn't make its way into any strings. It's invisible, but as noted it affects sort order. It's also a part of the data, so "\ufeffaah" has length 4 which is surprising even though it's invisible. As far as I know there is no mechanism in the JDK that automatically removes BOM characters. There was an attempt at this in the past; see JDK-4508058. However, this caused regressions and was backed out: JDK-6378911. There is currently a bug open on this, but there's no obviously good resolution: JDK-6959785. There are various techniques suggested on the internet for dealing with BOM, either at the level of an InputStream (though it seems to me that a Reader would be preferable), or by stripping it from the front of a string after it's read in.
26-03-2019
To submitter: I tried to reproduce the issue stated in the bug report with the following test case, however I am getting the expected output. Can you please let me know if you have any inputs that may help reproduce the issue. If possible, can you please provide the complete test case along with the test data to reproduce the issue ? package JI9059674; import java.util.ArrayList; import java.util.Arrays; import java.util.List; public class DictTest { public static void main(String[] args) { DictEntry d1 = new DictEntry(); d1.setWord("zythum "); DictEntry d2 = new DictEntry(); d2.setWord("aalii "); DictEntry d3 = new DictEntry(); d3.setWord("aah "); //List<DictEntry> dictionary = List.of(d1, d2, d3); List<DictEntry> dictionary = new ArrayList<>(); dictionary.add(d1); dictionary.add(d2); dictionary.add(d3); DictEntry dea[] = new DictEntry[dictionary.size()]; dea = dictionary.toArray(dea); Arrays.sort(dea, new DictSortEntries()); for(DictEntry d : dea){ System.out.println(d.getWord()); } } } package JI9059674; public class DictEntry { String word, pos, definition; public String getWord() { return word; } public void setWord(String word) { this.word = word; } } package JI9059674; import java.util.Comparator; public class DictSortEntries implements Comparator<DictEntry> { public int compare(DictEntry a, DictEntry b) { return a.word.compareToIgnoreCase(b.word); } }
22-03-2019

Duplicate :	JDK-6959785 - UTF-8 encoding does not recognize initial BOM
Relates :	JDK-6959785 - UTF-8 encoding does not recognize initial BOM