关于java:JDK源码分析HashMap2

本文首发于微信公众号【WriteOnRead】，欢送关注。

前文「JDK 源码剖析 -HashMap(1)」剖析了 HashMap 的内部结构和次要办法的实现原理。然而，面试中通常还会问到很多其余的问题，本文简要剖析下常见的一些问题。

这里再贴一下 HashMap 外部的结构图（JDK 1.8）：

首先 HashMap 是线程不平安的。这一点很多人应该都理解，HashMap 源码中也有阐明。然而为什么说不平安？体现在哪里呢？上面通过两个例子简要进行剖析（可能不够全面，仅做参考）。

case 1

线程 T1 执行 put / remove 等结构性批改（structural modification）的操作；线程 T2 执行遍历操作，这种状况下会抛出 ConcurrentModificationException.

示例代码（以 put 为例）：

private static void test() {Map<Integer, Integer> map = new HashMap<>();
    
  Thread t1 = new Thread(() -> {for (int i = 0; i < 5000; i++) {map.put(i, i);
    }
  });

  Thread t2 = new Thread(() -> {for (Map.Entry<Integer, Integer> entry : map.entrySet()) {System.out.println(entry);
    }
  });

  t1.start();
  t2.start();}
// 执行后果：// 抛出 java.util.ConcurrentModificationException

起因在这里：

if (modCount != expectedModCount)
    throw new ConcurrentModificationException();

HashMap 的迭代器和汇合视图中，都会对该值进行比拟。目标是判断是否有其余线程正在对该 HashMap 进行结构性批改，若有则抛会出异样。

PS: 仔细阅读 HashMap 源码的话能够发现，结构性批改的办法中都会有如下一行代码：

++modCount;

该值就是用来记录结构性批改的次数。

case 2

线程 T1 和 T2 同时执行 put / remove 等结构性批改（structural modification）的操作。以 put 办法为例剖析，会产生元素笼罩。

示例代码：

private static void test() throws InterruptedException {Map<Integer, Integer> map = new HashMap<>();

  Thread t1 = new Thread(() -> {for (int i = 0; i < 5000; i++) {map.put(i, i);
    }
  });

  Thread t2 = new Thread(() -> {for (int i = 5000; i < 10000; i++) {map.put(i, i);
    }
  });

  t1.start();
  t2.start();
  
  TimeUnit.SECONDS.sleep(20);
  System.out.println(map);
  System.out.println("size:" + map.size());
}
// 输入后果：// {8192=8192, 8193=8193, 8194=8194, 8195=8195, ...
// size: 9666
// PS: 这是某一次的后果，屡次测试的具体后果可能不同，但根本都没有 size=10000 的状况。

这里问题出在 put 办法上，通过前文剖析 HashMap 中 put 办法的外部实现原理能够找出起因，这里不再赘述。

这里再说一句，HashMap 的设计就是为了单线程下的高效率，理解线程不平安是为了加深对它的了解，晓得在哪些状况不适宜应用，若有线程平安的需要，能够思考应用 ConcurrentHashMap。

首先剖析下为什么会有链表和红黑树。现实状况下，HashMap 中每个 bin 所在位置只有一个节点，这样查问效率最高，为 O(1)。而拉出一个链表、或者把链表再转为红黑树，是在散列抵触比较严重时的一种应答措施，目标是为了让 HashMap 在极其状况下依然可能放弃较高的效率。

至于为什么是 8，HashMap 的局部 Implementation notes 如下:

/* Because TreeNodes are about twice the size of regular nodes, we
 * use them only when bins contain enough nodes to warrant use
 * (see TREEIFY_THRESHOLD). And when they become too small (due to
 * removal or resizing) they are converted back to plain bins.  In
 * usages with well-distributed user hashCodes, tree bins are
 * rarely used.  Ideally, under random hashCodes, the frequency of
 * nodes in bins follows a Poisson distribution
 * (http://en.wikipedia.org/wiki/Poisson_distribution) with a
 * parameter of about 0.5 on average for the default resizing
 * threshold of 0.75, although with a large variance because of
 * resizing granularity. Ignoring variance, the expected
 * occurrences of list size k are (exp(-0.5) * pow(0.5, k) /
 * factorial(k)). The first values are:
 *
 * 0:    0.60653066
 * 1:    0.30326533
 * 2:    0.07581633
 * 3:    0.01263606
 * 4:    0.00157952
 * 5:    0.00015795
 * 6:    0.00001316
 * 7:    0.00000094
 * 8:    0.00000006
 * more: less than 1 in ten million
 */

因为 TreeNode 的大小是一般节点（Node）的两倍，因而只有当 bin 中蕴含足够多（即树化的阈值 TREEIFY_THRESHOLD）的节点时才会转为 TreeNode；而当 bin 中节点缩小时（删除节点或扩容），又会把红黑树再转为链表。

hashCode 均匀分布时，TreeNode 用到的机会很小。现实状况下，在随机散布的 hashCode 下，bin 中节点的散布遵循泊松散布，并列出了几个数据，能够看到一个 bin 中链表长度达到 8 的概率（0.00000006）有余千万分之一，因而将转换的阈值设为 8.

两个转换阈值及其阐明如下：

/**
 * The bin count threshold for using a tree rather than list for a
 * bin.  Bins are converted to trees when adding an element to a
 * bin with at least this many nodes. The value must be greater
 * than 2 and should be at least 8 to mesh with assumptions in
 * tree removal about conversion back to plain bins upon
 * shrinkage.
 */
static final int TREEIFY_THRESHOLD = 8;

/**
 * The bin count threshold for untreeifying a (split) bin during a
 * resize operation. Should be less than TREEIFY_THRESHOLD, and at
 * most 6 to mesh with shrinkage detection under removal.
 */
static final int UNTREEIFY_THRESHOLD = 6;

至于将红黑树转为链表的阈值为 6，网上有说法是为了防止频繁转换。比方，若红黑树转为链表的阈值也是 8，如果一个 HashMap 不停地进行插入和删除元素，链表的个数始终在 8 左右，这种状况会频繁地进行树和链表的互相转换，效率很低。

这样解释仿佛也有些情理，各位能够去摸索。

JDK 1.7 中的相干形容：

/* As a general rule, the default load factor (.75) offers a good tradeoff
 * between time and space costs.  Higher values decrease the space overhead
 * but increase the lookup cost (reflected in most of the operations of the
 * <tt>HashMap</tt> class, including <tt>get</tt> and <tt>put</tt>). 
 */

上面文章也对此做了剖析：

https://www.jianshu.com/p/64f…

兴许这个问题没必要那么深究，简略来讲，其实就是工夫和空间的 tradeoff.

位运算 n & (length – 1) 和取余运算 n % length 成果是一样的。然而在计算机中，位运算的效率却远高于取余运算。因而，这样做是为了进步运算效率。

个别用 String、Integer 等，它们是不可变的（Immutable），作为不可变类天生是线程平安的。而且重写了 hashCode 办法和 equals 办法。

hash 办法的设计能够有很多。尽管散列值越平均越好，但 HashMap 首要目标是谋求快，因而 hash 算法的设计要尽可能地快。String 类的 hashCode 办法如下：

public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {char val[] = value;
        for (int i = 0; i < value.length; i++) {h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}

PS: 上述问题是自己从网上搜寻后整顿和思考的后果，仅做参考，并不一定齐全精确（要持有狐疑态度）。无关 HashMap 的问题可能还有很多，这里不再一一列举。

参考链接：

https://www.jianshu.com/p/7af…

相干浏览：

JDK 源码剖析 -HashMap(1)

关于java:JDK源码分析HashMap2

Q1: HashMap 是否线程平安？为什么？

Q2: 链表和红黑树的转换阈值为什么是 8 和 6？

Q3: 为什么负载因子是 0.75？

Q4: 为什么容量是 2 的次幂？

Q5: 个别用什么类型的元素作为 Key？为什么？

Q6: 掂量 hash 算法的好坏？String 类的 hashCode 实现？