Task #2545
closedBlob should support hashCode()
Added by Anonymous almost 10 years ago. Updated over 9 years ago.
0%
Description
The Blob class should support hashCode() with an efficient hash function of the content. This way, equals can use it.
Updated by Anonymous almost 10 years ago
We should use an efficient hash function like City/Farm Hash, but one that the Java library natively supports. What to use?
Updated by Andrew Brown over 9 years ago
From what I can tell, the simplest and quickest hashCode() is the Java String version:
for (int i = 0; i < value.length; i++) {
h = 31 * h + val[i];
}
I implemented two Name hash tables using MurmurHash (precursor to City/FarmHash) and Java String.hashCode() and the Java version was approx. 5x faster when creating the hashes and adding to the table. As long as hashCode() isn't used externally, it doesn't have to be a cryptographic hash, right? (The current use case for me is being able to use HashMaps for the FIB and PIT in my InternalForwarder).
Updated by Anonymous over 9 years ago
I agree that it doesn't have to be a cryptographic hash. From what I've seen, hashCode() is only meant for hash maps, etc. and if two hashCodes match it still requires an explicit comparison to be sure they are equal.
Updated by Andrew Brown over 9 years ago
Are we also considering adding hashCode() to Name? That would be helpful.
Updated by Anonymous over 9 years ago
Sure we can. But I'm curious where do you use Name.equals in your application that a hash code can provide a significant speed up?
Updated by Andrew Brown over 9 years ago
Well, I'm not using it for equals() and it's a minor optimization, but I want to use a HashMap to avoid looping through Name lists and that relies on Object.hashCode() being overriden. So I made a HashedName class extending Name and added a hashCode() that does something like 'this.toUri().hashCode();' (using the String.hashCode() I describe above); I did this for convenience and if we add it to the CCL I would recommend not doing this--instead adding a hashCode() to Component and looping over these. I also cache the resulting hash and use ChangeCountable to know when it needs to be recalculated.
I don't think equals() needs to change because you don't necessarily want to calculate a hash to compare bytes (more scans, slower), just compare the bytes from the back to the front like we discussed. Also, this is a Java-specific feature and the "Java way" is to override one (e.g. hashCode()) if the other is overriden (e.g. equals()). For reference, the relationship between hashCode() and equals() is described at http://docs.oracle.com/javase/7/docs/api/java/lang/Object.html#hashCode().
Updated by Anonymous over 9 years ago
I made a branch issue/2545-Blob-hashCode. Please see the hashCode() method here:
https://github.com/named-data/jndn/blob/56e46d48d6cdabd842a752a917ce20598cf756ba/src/net/named_data/jndn/util/Blob.java#L269
In Blob.equals(), I first check if the sizes of the two buffers are the same, because if they aren't we can avoid computing the hash if not necessary. Does that sound right?
https://github.com/named-data/jndn/blob/56e46d48d6cdabd842a752a917ce20598cf756ba/src/net/named_data/jndn/util/Blob.java#L244
A Name Component already just holds a Blob, so its equals calls the Blob equals. But I added hashCode anyway because it is the "Java way" to have both equals and hashCode.
https://github.com/named-data/jndn/blob/d2e8254c42c8e9e4220f46919b3d967fd8808217/src/net/named_data/jndn/Name.java#L294
However, I agree that its not clear what a hashCode() for Name would be so I didn't add it. I did make Name.equals() check from last to first like match() does.
https://github.com/named-data/jndn/commit/7ed64177e35fb0b5f39ead0e98fe6628c4f93b0b#diff-7455c130cf44ba498cd3a7c8c50529d1
Updated by Andrew Brown over 9 years ago
This is good stuff. A few comments:
When we add tests, can we add a test for equals(null); I "think" the implementation would return false but I'm not completely confident.
I don't know if we want to use the hashCode() inside the equals(); if the Blobs are equal (and thus the sizes match) and equals() is called, we will have to scan the buffer to calculate the hashCode and then scan the buffer again to compare each byte (because hash codes being equal doesn't guarantee the Blobs being equal). If the Blobs are unequal, we will have to scan the entire buffer to calculate the hash code instead of scanning up to the point they differ. In both cases hashCode() causes extra scans; I would recommend leaving the hashCode() calculation out of equals() altogether.
Blob never changes, right? So we don't have to worry about changing the hash?
For Name, I recommend (along with ChangeCountable checking, etc.):
for(Component component : components_){
hashCode_ = 37 * hashCode_ + component.hashCode(); // or some other prime number
}
Updated by Anonymous over 9 years ago
- When we add tests, can we add a test for equals(null); I "think" the implementation would return false but I'm not completely confident.
Do you mean equals(Object) which is used by HashMap, etc.? It will return false for !(other instanceof Blob). If you mean equals(Blob) then I suppose we could check for null but this isn't the same as any place where you are passed an object, like expressInterest(Interest) where you normally don't check for null. But if the type-specific equals(Blob) would be expected to check for null, sure we can add it (and it all other equivalent places). What do you think?
I don't know if we want to use the hashCode() inside the equals(); if the Blobs are equal (and thus the sizes match) and equals() is called, we will have to scan the buffer to calculate the hashCode and then scan the buffer again to compare each byte (because hash codes being equal doesn't guarantee the Blobs being equal). If the Blobs are unequal, we will have to scan the entire buffer to calculate the hash code instead of scanning up to the point they differ. In both cases hashCode() causes extra scans; I would recommend leaving the hashCode() calculation out of equals() altogether.
Interesting point. OK, I'll take hashCode out of equals.
Blob never changes, right? So we don't have to worry about changing the hash?
That's right. Blob never changes.
For Name, I recommend (along with ChangeCountable checking, etc.):
When you say "along with ChangeCountable checking" I assume you mean for checking if the cached hashCode_ is still valid?
Updated by Andrew Brown over 9 years ago
Jeff Thompson wrote:
Do you mean equals(Object) which is used by HashMap, etc.? It will return false for !(other instanceof Blob). If you mean equals(Blob) then I suppose we could check for null but this isn't the same as any place where you are passed an object, like expressInterest(Interest) where you normally don't check for null. But if the type-specific equals(Blob) would be expected to check for null, sure we can add it (and it all other equivalent places). What do you think?
Now that I think of it, why do we have a type-specific equals? Can't you just do all of it with a cast in the equals(object) or is there something I'm missing? Re: the null check, I'm coming from just reading http://docs.oracle.com/javase/7/docs/api/java/lang/Object.html#equals(java.lang.Object) which says For any non-null reference value x, x.equals(null) should return false.
So I looked at instanceof
and null WILL result in a false condition in that first instanceof
we do; so everything is fine, no need for an additional check.
When you say "along with ChangeCountable checking" I assume you mean for checking if the cached hashCode_ is still valid?
Yes, exactly.
Updated by Anonymous over 9 years ago
Why do we have a type-specific equals?
To avoid an unnecessary call to instanceof if possible. Also, the CCL API has this method which mirrors the other libraries: http://named-data.net/doc/ndn-ccl-api/blob.html#blob-equals-method
Updated by Anonymous over 9 years ago
In your argument to remove hashCode() from equals(), you say "If the Blobs are unequal, we will have to scan the entire buffer to calculate the hash code instead of scanning up to the point they differ." For completeness sake, I should say that you would only have to scan the entire buffer to compute the hash the first time that equals is called. For further calls, the hash is cached and it might be a lot faster if the two blobs are not equal. Do you think equals() will typically only be called once for each Blob object?
Updated by Andrew Brown over 9 years ago
#11, got it. #12, yes, that is more complete; I would still say leave it out because we can't guess how Blob will be used.
Updated by Anonymous over 9 years ago
OK, I reverted Blob.equals to not use hashCode().
Let me know what you think of Name.hashCode() here:
https://github.com/named-data/jndn/blob/9b5e352acd3d88bbf551bec78e018b6153d4b4f1/src/net/named_data/jndn/Name.java#L723
Updated by Anonymous over 9 years ago
Hi Andrew, I added basic unit tests.
https://github.com/named-data/jndn/commit/1ae2cb96380295c93dd91f8e375ae79177d077f6
Shall I merge and close?
Updated by Alex Afanasyev over 9 years ago
Why "Blob" abstract needs to be hashed? I can understand that name and name component should be hashable as they could be in hash tables, I'm not convinced that this should apply to everything else.
One can always use some static interface to do hashing of blobs, this doesn't require anything to be embedded inside the blob class.
Updated by Anonymous over 9 years ago
- Status changed from Closed to In Progress
Re-open to address Alex's questions.
Updated by Andrew Brown over 9 years ago
It's more a Java-ism than anything else; see discussion above about why but basically the equals() contract assumes that you will override both equals() and hashCode(). hashCode() does not get used except for HashMaps and the like. Since it is a Java-centric change, I do not recommend it gets copied to any other CCLs.
Updated by Alex Afanasyev over 9 years ago
I read the definitions and still think that it should not be generally defined for Blob (the use of Blob in hash tables should not be encouraged), but can definitely defined for some of the classes such name and name component (and may be others).
I would also suggest using some existing implementation for doing the hash. Though, the only thing that I was able to find is HashCodeBuilder http://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/builder/HashCodeBuilder.html
Updated by Andrew Brown over 9 years ago
I still think hashCode() should be implemented in Blob (especially since it doesn't hurt to have it there) but... I don't actually care that much as long as Name and Name.Component have it; Jeff, thoughts?
But using the Commons helper I think is overkill, it's just doing this (http://en.wikipedia.org/wiki/Java_hashCode%28%29) with a simplified API. For speed and simplicity we can implement it ourselves like Jeff did.
Updated by Alex Afanasyev over 9 years ago
Not sure about java-way, I'm kind of against reimplementing the wheel. Implementing hashing of the data buffer ourselves (however simple it is) is error-prone and I consider to be a wheel.
I'm not insisting on use of HashCodeBuilder specifically, but my desire (if possible) is to use something that has been already implemented and well tested.
Updated by Andrew Brown over 9 years ago
I understand and that makes sense, just note from previous comments that I was advocating using the same hash as String.hashCode(), which is very well tested. If you want to reuse an implementation, take a look at the hashCode() in java.nio.ByteBuffer; when I inspect it, it looks like (note the warning not to use it as a hash key, which we could add to Blob as well):
/**
* Returns the current hash code of this buffer.
*
* <p> The hash code of a byte buffer depends only upon its remaining
* elements; that is, upon the elements from <tt>position()</tt> up to, and
* including, the element at <tt>limit()</tt> - <tt>1</tt>.
*
* <p> Because buffer hash codes are content-dependent, it is inadvisable
* to use buffers as keys in hash maps or similar data structures unless it
* is known that their contents will not change. </p>
*
* @return The current hash code of this buffer
*/
public int hashCode() {
int h = 1;
int p = position();
for (int i = limit() - 1; i >= p; i--)
h = 31 * h + (int)get(i);
return h;
}
I am not sure if position() and limit() are modifiable in Blob, but if they aren't then Blob.hashCode() could be return buffer_.hashCode();
Updated by Anonymous over 9 years ago
The ByteBuffer position and limit will not change, nor will its contents. I hadn't noticed that ByteBuffer implements hashCode(). Yes, I think we should use it. I made a branch issue/2545-Blob-hashCode-update. Is the following change the right idea?
https://github.com/named-data/jndn/commit/c9ee522f6be9373c34319ae37676273cbe03c1de#diff-fb77dcaf8350c4d345d996f251ea0e15
Updated by Alex Afanasyev over 9 years ago
ByteBuffer.hashCode() doesn't do any real hashing (value depends on the number of remaining bytes). I don't mind using it for Blob, as it would match what ByteBuffer is doing.
Updated by Alex Afanasyev over 9 years ago
Just forgot to add that we would still need to implement real (as it note-25) hashCode() for name and name components. (Not directly related: Blob is really a ByteBuffer and we shouldn't really introduce a new abstraction. It could be simply a class derived from ByteBuffer to introduce name alias.)
Updated by Andrew Brown over 9 years ago
I don't understand: in #28 you say ByteBuffer.hashCode() is not a real hash but it actually is (see #25, just not a crypto hash) and in #29 you say that we should use the ByteBuffer hash in Name and Name.Component.
Here's what makes sense to me: Blob uses ByteBuffer's hash (return buffer_.hashCode();
), Name.Component uses Blob's hash (return value_.hashCode();
), and Name hashes the components together as in #8. And that is what Jeff wrote.
Updated by Alex Afanasyev over 9 years ago
Andrew Brown wrote:
I don't understand: in #28 you say ByteBuffer.hashCode() is not a real hash but it actually is (see #25, just not a crypto hash) and in #29 you say that we should use the ByteBuffer hash in Name and Name.Component.
I could be wrong. It's just on some of the forums I read that ByteBuffer.hashCode doesn't capture content of the buffer, but something just about length of the buffer (see http://stackoverflow.com/questions/11089890/why-are-bytebuffers-hashcodes-the-same).
Updated by Andrew Brown over 9 years ago
Look down to the answer to that question: the OP was not filling the buffer, therefore the hash was always the same.
Updated by Alex Afanasyev over 9 years ago
Oh. ok, I'm got mislead by the question and like this approach. Please ignore my notes 28, 29.
Updated by Andrew Brown over 9 years ago
No worries, I understand the concerns.
Updated by Anonymous over 9 years ago
- Status changed from In Progress to Closed
The change for Blob.hashCode to use ByteBuffer.hashCode is merged into the master branch.