Preventing Hash Collisions
Articles —> Preventing Hash Collisions
Hashing is an irreversible digestion of data into a data type if uniform length. In Java, hashing of objects occurs via the hashCode method, and is important for storing and accessing objects in data structures (such as a Map or Set).
Because the hashCode method in java returns an int data type, it is limited to only the size of the int: 32-bits of information. Therefore with a large number of objects hash collisions are likely. This being said, even with a small number of objects, if the hashCode method does not return a number that is uniformly distributed across all plausible int values, hash collisions can be inevitable.
Take the following use case: a Line class defined by two end Point's:
Point
/** * Point class based upon an x and y coordinate * @author gcope * */ public class Point { private final int x; private final int y; public Point(int east, int south) { this.x = east; this.y = south; } public int getX(){ return x; } public int getY(){ return y; } @Override public int hashCode() { final int prime = 31; int result = 17; result = prime * result + y; result = prime * result + x; return result; } }
The Line class:
/** * Line class defined by two end Points * @author gcope * */ public class Line { private Point p1; private Point p2; public Line(Point startCell, Point endCell) { this.p1 = startCell; this.p2 = endCell; } @Override public int hashCode() { final int prime = 59; int result = 43; result = prime * result + ((p1 == null) ? 0 : p1.hashCode()); result = prime * result + ((p2 == null) ? 0 : p2.hashCode()); return result; } }
For brevity, accessor and equals methods are omitted, as are comments. Each class defines a simple hashCode method, returning an int value based upon its fields. The danger here of course, comes from hash collisions. A simple example:
Line line1 = new Line(new Point(154, 156), new Point(154, 156)); Line line2 = new Line(new Point(153, 156), new Point(151, 158)); System.out.println(line1.hashCode() + ":" + line2.hashCode());
Both line1 and line2 have the same hashCode: 1429303. In fact, in this particular case the level of collision is extremely high. Consider the test case below, in which 6,250,000 Lines with different endpoints get generated:
public static void main(String[] args) { List<Point> points = new ArrayList<Point>(); for ( int i = 0; i < 50; i++ ){ for ( int j = 0; j < 50; j++ ){ Point c = new Point(i,j); points.add(c); } } System.out.println(points.size() + " points generated"); System.out.println("Testing " + (points.size()*points.size()) + " number of Lines"); Set<Integer> set = new HashSet<Integer>(); int collisions = 0; for ( int i = 0; i < points.size(); i++ ){ for ( int j = 0; j < points.size(); j++ ){ Line r = new Line(points.get(i), points.get(j)); if ( set.contains(r.hashCode() ) ){ collisions++; } set.add(r.hashCode()); } } System.out.println(collisions); } }
The above results in an astounding 6,155,919 collisions!
How might one lower the probability of collisions? A hash can be defined by the fields of a class, but also inter-dependent properties of those fields. In simpler terms, a line has a length, and a line has a slope. What happens if we include these calculations within the hashCode method of the Line class?
slope = (startCell.getY() - endCell.getY()); if ( slope == 0 ){//prevent division by 0 slope = startCell.getX() - endCell.getX(); }else{ slope = slope / (startCell.getX()- endCell.getX()) / slope; } length = Math.sqrt(Math.pow(startCell.getX() - endCell.getX(), 2) + Math.pow(startCell.getY() - endCell.getY(), 2)); .... @Override public int hashCode() { final int prime = 59; int result = 43; result = prime * result + ((p1 == null) ? 0 : p1.hashCode()); long temp; temp = Double.doubleToLongBits(length); result = prime * result + ((int) (temp ^ (temp >> 32))); temp = Double.doubleToLongBits(slope); result = prime * result + ((int) (temp ^ (temp >> 32))); result = prime * result + ((p2 == null) ? 0 : p2.hashCode()); return result; }
With the above changes, there are 870116 collisions: still a lot, but an 85% reduction in hashCode collisions. Something to consider when hashing is an integral part of your application.
There are no comments on this article.