HashMap内部存储

jiezi

5 年前

翻译自 http://coding-geek.com/how-do…
并在原文基础上做了修改和补充

Java HashMap 类实现了 Map<K,V> 接口，
这个接口的几个主要的方法如下：
• V put(K key, V value)
• V get(Object key)
• V remove(Object key)
• Boolean containsKey(Object key)
HashMap 用一个内部类去存储数据 Entry<K, V>，它就是一个简单的键值对，同时有两个额外的数据：
• 指向另一个 entry 的指针，有点类似于单向无序链表。
• hash 值，这个是对应于 K 的 hash 值，以避免每次 HashMap 使用时重新计算。
在 JDK 7 中，Entry 的部分实现如下：

static class Entry<K,V> implements Map.Entry<K,V> {
        final K key;
        V value;
        Entry<K,V> next;
        int hash;
}

HashMap 将数据存储到多个 Entry 链表，然后每个链表都会注册到 Entry 数组中去，对应于一个 bucket。这初始数组的大小是 16(DEFAULT_INITIAL_CAPACITY)

HashMap 中有一个桶 (bucket) 的概念, 其实这个桶就是 entry 数组 (HashMap 初始化的时候会创建一个固定大小的数组) 中的一个元素，如下所示：

 /**
 * The table, resized as necessary. Length MUST Always be a power of two.
 */
transient Entry[] table;

有相同 hash value 的 key 都会被放到同一个桶中，然后 entry 之间通过指针相连，就组成了链表。当调用 HashMap 的 put 或者是 get 方法时，方法首先计算 key 的 hash 值，然后计算出到底是该属于哪个 bucket。找到桶之后，遍历 bucket 对应的 entry 链表，再找到对应的 key 值，链表里面的查是通过 key 的 equals 方法完成的。该步骤的代码如下所示（JDK 7）：
public V get(Object key) {

    if (key == null)
        return getForNullKey();
    int hash = hash(key.hashCode());
    for (Entry<K,V> e = table[indexFor(hash, table.length)];
         e != null;
         e = e.next) {
        Object k;
        if (e.hash == hash && ((k = e.key) == key || key.equals(k)))
            return e.value;
    }
    return null;
}

map 生成桶需要经过下面的 3 步：
• 首先获取 key 的 hash value。
• 然后会再次计算一次 hash, 这样做的目的是为了避免 hash 函数将数据都放在同一个桶中（以下是 JDK 7 的实现，JDK 8 做了简化，原理一样）。
/**

 * Applies a supplemental hash function to a given hashCode, which
 * defends against poor quality hash functions.  This is critical
 * because HashMap uses power-of-two length hash tables, that
 * otherwise encounter collisions for hashCodes that do not differ
 * in lower bits. Note: Null keys always map to hash 0, thus index 0.
 */
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

根据注释, HashMap 会提供这样的一个补充函数，它作用于 HashCode，就是为了防止不稳定的 hash 函数造成的冲突。该函数称为‘扰动函数‘。
• 拿到重新计算的 hash 值之后, 再计算它对应的 index，也就是该放到哪个桶里面。

 /**
 * Returns index for hash code h.
 */
static int indexFor(int h, int length) {return h & (length-1);
}

为什么需要重新计算 hash？
举个栗子：
shinadata 字符串的 HashCode 的 binary 是 1000001111000010101000110010001，上述代码中的 length 就是数组（Entry[] table）的长度。（length – 1)=15，15 的二进制是 1111，如果不做 rehash 操作，此时 1111 高位补 0，和 1000001111000010101000110010001 做位与 & 运算，那么 shinadata 的 HashCode 在计算过程中只有最后几位能起作用，无论高位是多少，只要低位是相同的话，那么最后的结果就相同了，这样 indexFor 方法计算出的 index 值就容易产生冲突。扰动函数的作用就是为了消除这方面的影响。

为什么内部数组的大小是 2 的幂？
假设数组大小是 17，那么掩码值是 16（size -1），二进制是 0…010000，现在对于任何哈希值 h , h & 16 得到索引不是 16 就是 0。那么这就意味着，数组大小 17，使用的桶只能是 2 个。但是要是用 2 的幂的值的话，那就不同了，比如 16，size – 1 = 15, 二进制是 0…001111, h & 15 得到的索引值可以使 0~15。
这部分可以查看 JDK 7 的源码。http://hg.openjdk.java.net/jd…
参考：
https://blog.csdn.net/wenyiqi…
https://www.zhihu.com/questio…
翻译自 http://coding-geek.com/how-do…
并在原文基础上做了修改和补充