Java – Unexplained Java HashMap behavior


In the code below, I create a HashMap to store objects of class Datum, each of which holds a string (the location) and a count. Unfortunately, the code exhibits very strange behavior.

            FileSystem fs = FileSystem.get(new Configuration());
            Random r = new Random();
            FSDataOutputStream fsdos = fs.create(new Path("error/" + r.nextInt(1000000)));

            HashMap<String, Datum> datums = new HashMap<String, Datum>();
            // itrtr is the reducer's Iterator<Datum> over the values
            while (itrtr.hasNext()) {
                Datum next = itrtr.next();
                synchronized (datums) {
                    if (!datums.containsKey(next.location)) {
                        fsdos.writeUTF("INSERTING: " + next + "\n");
                        datums.put(next.location, next);
                    } // skip those that are already indexed
                }
            }
            for (Datum d : datums.values()) {
                fsdos.writeUTF("PRINT DATUM VALUES: " + d.toString() + "\n");
            }

The HashMap uses the location string as the key.

Here is the output I get in the error file (example):

    INSERTING: (test.txt,3)
    INSERTING: (test2.txt,1)
    PRINT DATUM VALUES: (test.txt,3)
    PRINT DATUM VALUES: (test.txt,3)

The correct output for the print should be:
    INSERTING: (test.txt,3)
    INSERTING: (test2.txt,1)
    PRINT DATUM VALUES: (test.txt,3)
    PRINT DATUM VALUES: (test2.txt,1)

What happened to Datum with test2.txt as the location? Why is it superseded by test.txt?

Basically, I should never see the same location twice; that's what the !datums.containsKey check is for. Unfortunately, the behavior was still strange.

By the way, this is in a reducer on Hadoop.

I tried putting synchronized there in case it's running in multiple threads, but as far as I know, that's not the case. Still, the same thing happened.

Solution

According to this answer, Hadoop's reducer iterators always return the same object, mutating it in place rather than creating a new object on each call to next().

Therefore, keeping a reference to the object returned by the iterator is a mistake and produces surprising results. You need to copy the data into a new object:

        while (itrtr.hasNext()) {
            Datum next = itrtr.next();
            // copy the values from the reused Datum into a fresh instance
            Datum insert = new Datum(next.location, next.value);
            if (!datums.containsKey(insert.location)) {
                datums.put(insert.location, insert);
            }
        }
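
The fix assumes Datum exposes a location and a value and has a constructor taking both. The original class isn't shown in the question, so the following is only a hypothetical sketch consistent with the code and the output above:

    // Hypothetical sketch of the Datum class assumed by the fix above;
    // the real class is not shown, so these field names are guesses.
    public class Datum {
        public String location; // e.g. "test.txt"
        public int value;       // e.g. 3

        public Datum(String location, int value) {
            this.location = location;
            this.value = value;
        }

        @Override
        public String toString() {
            return "(" + location + "," + value + ")"; // matches "(test.txt,3)"
        }
    }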

This is confirmed by the Hadoop Reducer documentation:

The framework will reuse the key and value objects that are passed
into the reduce, therefore the application should clone the objects
they want to keep a copy of.
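
To see the mechanism outside Hadoop, here is a minimal, self-contained Java sketch. The class and method names are invented for illustration, and the hand-rolled iterator stands in for Hadoop's reducer iterator; it reproduces the same pitfall:

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;

    public class ReusePitfall {

        static class Datum {
            String location;
            int value;

            @Override
            public String toString() {
                return "(" + location + "," + value + ")";
            }
        }

        // An iterator that mutates and returns the SAME Datum instance on
        // every call to next(), mimicking Hadoop's object reuse.
        static Iterator<Datum> reusingIterator(List<String[]> rows) {
            Iterator<String[]> it = rows.iterator();
            Datum shared = new Datum();
            return new Iterator<Datum>() {
                public boolean hasNext() { return it.hasNext(); }
                public Datum next() {
                    String[] row = it.next();
                    shared.location = row[0];
                    shared.value = Integer.parseInt(row[1]);
                    return shared; // same object every time
                }
            };
        }

        public static void main(String[] args) {
            Iterator<Datum> itrtr = reusingIterator(List.of(
                    new String[] {"test.txt", "3"},
                    new String[] {"test2.txt", "1"}));

            Map<String, Datum> datums = new HashMap<>();
            while (itrtr.hasNext()) {
                Datum next = itrtr.next();
                datums.put(next.location, next); // stores the shared reference
            }

            // Both map entries point at the one shared object, which holds
            // whatever was read last, so both lines print (test2.txt,1).
            for (Datum d : datums.values()) {
                System.out.println("PRINT DATUM VALUES: " + d);
            }
        }
    }

Both values come out identical because the map stores two references to one shared object; copying into a fresh Datum before the put, as in the fix above, breaks that aliasing.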
