Wednesday, February 16, 2011

Implementing Hashcode

In Java, every object has a method hashCode, which is inherited from Object super class. In this Java tutorial, we will discuss about hashCode, equals methods and what role they play in an object. Through this article we will find answer for the following questions,
  • What is the purpose of hashCode and equals methods?
  • How hashCode and equals are implemented?
  • Why hashCode should also be overridden when equals is overridden? 
Understanding hashcode for object
 An object’s hash code allows algorithms and data structures to put objects into compartments, just like letter types in a printer’s type case. The printer puts all “A” types into the compartment for “A”, and he looks for an “A” only in this one compartment. This simple system lets him find types much faster than searching in an unsorted drawer. That’s also the idea of hash-based collections, such as HashMap and HashSet.

How does Java computes hashcode for existing classes?
A hashcode is an integer value that represents the state of the object upon which it was called. That is why an Integer that is set to 1 will return a hashcode of "1" because an Integer's hashcode and its value are the same thing. A character's hashcode is equal to it's ASCII character code. If you write a custom type you are responsible for creating a good hashCode implementation that will best represent the state of the current instance.
But what does it means for us, if we write our own class, then also it returns some hashcode. How

Implementing hashCode :
  • if a class overrides equals, it must override hashCode
  • when they are both overridden, equals and hashCode must use the same set of fields
  • if two objects are equal, then their hashCode values must be equal as well
  • if the object is immutable, then hashCode is a candidate for caching and lazy initialization
Misconception:
It is a popular misconception that hashCode provides a unique identifier for an object. It does not.

Cases of Inconsistency when not implementing equals() or hashCode() OR Why we need to override equals and hashCode() method

When implementing hashCode() but not equals()
public class MyString {

 

      private String inputString;

     

      public MyString(String string) {

            inputString=string;

      }

 

      @Override

      public int hashCode() {

            return inputString.length();

      }

}




Now lets see the result with tester class:

public static void main(String[] args) {

     

      MyString obj1 = new MyString (“my-string”);

      MyString Obj2 = new MyString (“my-string”);

      if(obj1.hashCode() == obj2.hashCode()){

            System.out.println(“HashCode are equal”);

      }

      if(obj1.equals(obj2)){

            System.out.println(“Objects are equal”);

      }else{

            System.out.println(“Objects are not equal”);

      }

 

}



Case when not implementing hashCode()

If the object does not implement hashcode() method and used as key then we will not get the object back as shown in below code.

class Person{
    private String name;
    private boolean gender;//false for female, true for male
//some getters and setters
//no equals and hashcode implemented in this class
}



Now using this class:

public class HashMapDemo {

    public static void main(String[] args) {
        Person person=new Person("ramesh",true);
        Person person2 = new Person("ramesh",true);
        Person person3 = new Person("suresh",true);
        Person person4 = new Person("kavya",false);

        HashMap<Person, String> map = new HashMap<Person, String>();
//Note that 2 duplicate object added to hashmap and no exception thrown...
        map.put(person,"New person");
        map.put(person2,"Duplicate person");
        map.put(person3,"Another new person");
        map.put(person4,"new Female person");

//
//Now retrieving the hashmap
        Person rameshAgain=new Person("ramesh",true);
        if(map.get(rameshAgain) == null)
            System.out.println("Object not found");
        else
            System.out.println("Our object found in map");
}

        


Output:

Object not found



As you can see in above program :

  1. Duplicate objects are added in Hashmap as a key (Because we have not overided the hashcode and equals method)
  2. We are not able to get back object from map (Because hashcode is not implemented)

Some guidelines for implementing hashCode()


 


Implementing the hashCode() and equals()


Now implementing hashCode() and equals:

public class Person {
 
    int id;
 
    String name;
 
    public boolean equals(Object obj){
     
        if(this == obj) return true; // if both are referring to the same object
        if ((obj == null) || (obj.getClass() != this.getClass())) {
        return false;
    }
    Person rhs = (Person) obj;
    return id == rhs.id &amp;&amp; (name == rhs.name ||
    (name != null &amp;&amp; name.equals(rhs.name)) );
 
}
 
//both fields id &amp; name are used in equals(), so both fields must be used in
 
//hashCode() as well.
 
    public int hashCode() {
 
        int hash = 9;
         
        hash = (31 * hash) + id;
         
        hash = (31 * hash) + (null == name ? 0 : name.hashCode());
         
        return hash;
         
    }
 
}

Never misuse hashCode as a key
You may object that, unlike the printer’s type case, in Java there are 4,294,967,296 compartments (232 possible int values). With 4 billion slots, collisions seem to be extremely unlikely, right?
Turns out that it’s not so unlikely. Here’s the surprising math of collisions: Please imagine 23 random people in a room. How would you estimate the odds of finding two fellows with the same birthday among them? Pretty low, because there are 365 days in a year? In fact, the odds are about 50%! And with 50 people it’s a save bet. This phenomenon is called the Birthday paradox. Transferred to hash codes, this means that with 77,163 different objects, you have a 50/50 chance for a collision – given that you have an ideal hashCode function, that evenly distributes objects over all available buckets.

Example:

The Enron email dataset contains 520,924 emails. Computing the String hash codes of the email contents, I found 50 pairs (and even 2 triples) of different emails with the same hash code. For half a million strings, this is a pretty good result. But the message here is: if you have many data items, collisions will occur. If you were using the hashCode as a key here, you would not immediately notice your mistake. But a few people would get the wrong mail.

HashCodes can change

Finally, there’s one important detail in the hashCode contract that can be quite surprising: hashCode does not guarantee the same result in different executions. Let’s have a look at the JavaDoc:
Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.
This is uncommon, in fact, some classes in the class library even specify the exact formula they use to calculate hash codes (e.g. String). For these classes, the hash code will always be the same. But while most of the hashCode implementations provide stable values, you must not rely on it. As this article points out, there are Java libraries that actually return different hashCode values in different processes and this tends to confuse people. Google’s Protocol Buffers is an example.
Therefore, you should not use the hash code in distributed applications. A remote object may have a different hash code than a local one, even if the two are equal.
3. Do not use hashCode in distributed applications
Moreover, you should be aware that the implementation of a hashCode function may change from one version to another. Therefore your code should not depend on any particular hash code values. For example, your should not use the hash code to persist state. Next time you run the application, the hash codes of the “same” objects may be different.
The best advice is probably: don’t use hashCode at all, except when you create hash-based algorithms.

An alternative: SHA1

You may know that cryptographic hash codes such as SHA1 are sometimes used to identify objects (Git does this, for example). Is this also unsafe? No. SHA1 uses 160-bit keys, which makes collisions virtually impossible. Even with a gigantic number of objects, the odds of a collision in this space are far below the odds of a meteor crashing the computer that runs your program. This article has a great overview of collision probabilities.
There’s probably more to say about hash codes, but these seem to be the most important things. If there’s anything I’ve missed, I’m happy to hear about it!

No comments:

Post a Comment

Chitika