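This article was first published on the vivo Internet Technology WeChat official account.
Link: https://mp.weixin.qq.com/s/8f34CaTp--Wz5pTHKA0Xeg
Author: vivo Official Website Mall Development Team
As is widely known, Oracle JDK is the de facto authority on the Java language, and "the JDK" and "the Java language" are often treated as nearly the same thing. Even so, we should not take it on faith, and we should dare to question it. This article records a production troubleshooting case, covering the core process of analyzing the problem, resolving it, and submitting a bug report against Oracle JDK.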
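I. Background and Symptoms
In short: after a certain system went live, the number of connections in the CLOSE_WAIT state kept growing over time and repeatedly triggered multiple alerts.
II. Analysis and Diagnosis
We deployed a single node to reproduce the problem seen earlier.
Step1 Narrowing down the problem
First, find out which IPs the CLOSE_WAIT connections are between. The system also calls third-party services, so we need to confirm both endpoints of each connection.
Run the command: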
netstat -np | grep tcp|grep "CLOSE_WAIT"
Result:
(Note: xxx, yyy, and zzz carry no meaning; the IPs are masked for information-security reasons.)
tcp 3547 0 10.107.17.xxx:34602 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:59088 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:58028 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:51962 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 3563 0 10.107.17.xxx:46962 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:34608 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:46496 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:50774 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:59904 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:40208 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:41064 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:36994 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 3547 0 10.107.17.xxx:45080 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 6235 0 10.107.17.xxx:60966 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:56178 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 3547 0 10.107.17.xxx:39922 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:43270 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:40926 zzz.202.32.242:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:44472 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 2891 0 10.107.17.xxx:43036 zzz.202.32.241:443 CLOSE_WAIT 19819/java
................
tcp 38 0 10.107.17.xxx:33472 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:51976 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:57788 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:35638 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:43778 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:46418 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:49914 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:49258 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:48718 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:51480 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:59816 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:49266 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:50246 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:39324 yyy.12.230.115:443 CLOSE_WAIT 19819/java
In summary:
yyy.12.230.115
zzz.202.32.241
zzz.202.32.242
These three IPs are the trigger.
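When the output is long, tallying CLOSE_WAIT rows per remote IP makes the dominant peers obvious. The following is a minimal sketch, not part of the original investigation: the class name CloseWaitTally and the method countByRemoteIp are hypothetical, and it assumes each row has the column layout shown in the netstat output above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CloseWaitTally {

    /** Counts CLOSE_WAIT rows per remote IP for lines shaped like the netstat output above. */
    static Map<String, Integer> countByRemoteIp(List<String> netstatLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : netstatLines) {
            String[] cols = line.trim().split("\\s+");
            // Assumed columns: proto, recv-q, send-q, local, remote, state, pid/program
            if (cols.length < 6 || !cols[0].startsWith("tcp") || !"CLOSE_WAIT".equals(cols[5])) {
                continue;
            }
            String remote = cols[4];
            int colon = remote.lastIndexOf(':');
            String ip = colon >= 0 ? remote.substring(0, colon) : remote;
            counts.merge(ip, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample rows copied from the output above.
        List<String> sample = Arrays.asList(
            "tcp 3547 0 10.107.17.xxx:34602 yyy.12.230.115:443 CLOSE_WAIT 19819/java",
            "tcp 38 0 10.107.17.xxx:59088 yyy.12.230.115:443 CLOSE_WAIT 19819/java",
            "tcp 38 0 10.107.17.xxx:40926 zzz.202.32.242:443 CLOSE_WAIT 19819/java",
            "tcp 2891 0 10.107.17.xxx:43036 zzz.202.32.241:443 CLOSE_WAIT 19819/java");
        countByRemoteIp(sample).forEach((ip, n) -> System.out.println(ip + " " + n));
    }
}
```

A TreeMap is used only so the printed order is stable; any map would do.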
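Step2 Problem analysis
Who exactly are these three IPs? Which interface is being requested on them?
We could not tell directly for now. The most direct lead had gone cold for the moment, so we went looking for more information from other angles:
- JVM information
We checked external resources, threads, and so on, and found no obvious anomalies.
- Packet capture
It was time to capture packets for more clues. Not having touched the TCP layer for a long time, this took some effort.
Clue obtained: a large number of RST packets.
So what kind of operation leads to CLOSE_WAIT? And what kind of connection produces so many RSTs (see the common causes of RST)?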
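As background for the first question: in the TCP state machine, a socket enters CLOSE_WAIT when the peer has closed its side (sent FIN) and the local application has not yet called close(). The sketch below illustrates that on the loopback interface. It is only an illustration under stated assumptions: the class name CloseWaitDemo, the method readAfterPeerClose, and the 30-second hold are hypothetical choices, and the state has to be observed from another shell with netstat while the program sleeps.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class CloseWaitDemo {

    /**
     * Opens a loopback connection whose server side closes immediately, then keeps
     * the client socket open for holdMillis before closing it.
     * Returns what the client's read() reported.
     */
    static int readAfterPeerClose(long holdMillis) {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
            server.accept().close();                     // peer closes first and sends FIN
            int result = client.getInputStream().read(); // -1 means end of stream
            // The peer's FIN has arrived but this side has not called close() yet,
            // so the client socket is expected to show as CLOSE_WAIT during the sleep.
            Thread.sleep(holdMillis);
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hold for 30 s (an arbitrary choice) so the state can be inspected from another shell.
        System.out.println("read() returned " + readAfterPeerClose(30_000));
    }
}
```

The point of the sketch is that CLOSE_WAIT is cleared only by the local side closing its socket, which is why a growing CLOSE_WAIT count usually points at application code that never closes a connection.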
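Step3 Code analysis and pinpointing
With help from our operations colleagues, we learned that these three IPs belong to an image CDN service.
At this point we could locate the specific code path and review the code that issues requests to the image CDN.
After carefully analyzing this part of the source code, we inferred the following: the server issues a URL request, the requested resource does not exist, and an exception is thrown, yet nothing in the JDK closes the Socket.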
javax.imageio.ImageIO.read(URL)
/**
 * Returns a <code>BufferedImage</code> as the result of decoding
 * a supplied <code>URL</code> with an <code>ImageReader</code>
 * chosen automatically from among those currently registered. An
 * <code>InputStream</code> is obtained from the <code>URL</code>,
 * which is wrapped in an <code>ImageInputStream</code>. If no
 * registered <code>ImageReader</code> claims to be able to read
 * the resulting stream, <code>null</code> is returned.
 *
 * <p> The current cache settings from <code>getUseCache</code>and
 * <code>getCacheDirectory</code> will be used to control caching in the
 * <code>ImageInputStream</code> that is created.
 *
 * <p> This method does not attempt to locate
 * <code>ImageReader</code>s that can read directly from a
 * <code>URL</code>; that may be accomplished using
 * <code>IIORegistry</code> and <code>ImageReaderSpi</code>.
 *
 * @param input a <code>URL</code> to read from.
 *
 * @return a <code>BufferedImage</code> containing the decoded
 * contents of the input, or <code>null</code>.
 *
 * @exception IllegalArgumentException if <code>input</code> is
 * <code>null</code>.
 * @exception IOException if an error occurs during reading.
 */
public static BufferedImage read(URL input) throws IOException {
    if (input == null) {
        throw new IllegalArgumentException("input == null!");
    }

    InputStream istream = null;
    try {
        // A TCP connection is established here and the stream is fetched directly.
        // Because the requested resource does not exist, control enters the catch
        // block and the exception is rethrown.
        istream = input.openStream();
    } catch (IOException e) {
        throw new IIOException("Can't get input stream from URL!", e);
    }
    ImageInputStream stream = createImageInputStream(istream);
    BufferedImage bi;
    try {
        bi = read(stream);
        if (bi == null) {
            stream.close();
        }
    } finally {
        istream.close();
    }
    return bi;
}
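As you can see, the JDK does not close the Socket connection wrapped inside ImageIO.read(url) on this path. Does the CDN then time out the request and close its side, leaving our server in CLOSE_WAIT? With limited networking experience, I could not be 100% sure of this idea, so let's simulate it.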
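If the suspicion holds, one way for application code to sidestep it is to open the connection itself, so that it holds a handle it can always release, and pass only the stream to ImageIO. The sketch below is an assumption-laden illustration, not the fix adopted in the original system and not how the JDK behaves internally: the class name SafeImageFetch and the timeout values are hypothetical, and it assumes the URL uses http or https so the connection can be cast to HttpURLConnection.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import javax.imageio.ImageIO;

public class SafeImageFetch {

    /** Returns the decoded image, or null if the status is not 200 or no reader matches. */
    static BufferedImage read(URL url) throws IOException {
        // Assumes an http(s) URL; other schemes would fail this cast.
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(3000); // illustrative values, not from the article
        conn.setReadTimeout(5000);
        try {
            if (conn.getResponseCode() != HttpURLConnection.HTTP_OK) {
                return null; // e.g. 404 for a missing image
            }
            try (InputStream in = conn.getInputStream()) {
                // ImageIO.read(InputStream) does not close the stream; try-with-resources does.
                return ImageIO.read(in);
            }
        } finally {
            conn.disconnect(); // release the underlying socket on every path
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            BufferedImage img = read(new URL(args[0]));
            System.out.println(img == null ? "no image" : img.getWidth() + "x" + img.getHeight());
        }
    }
}
```

Calling disconnect() gives up connection reuse for that request, which is a deliberate trade-off here: the goal is to make sure no socket is left waiting for a close that never comes.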
Step4 Reproduction and simulation
Based on the system's business code, a quick simulation:
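import java.awt.image.BufferedImage;
import java.io.File;
import java.net.URL;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javax.imageio.ImageIO;

// The class name is illustrative; the original snippet showed only the two methods.
public class CloseWaitRepro {

    public static void main(String[] args) {
        // 100 worker threads issue 5000 image requests in total.
        // The pool is never shut down, so the JVM stays alive after the tasks finish.
        ExecutorService ex = Executors.newFixedThreadPool(100);
        for (int i = 0; i < 5000; i++) {
            ex.execute(task());
        }
    }

    /** Builds one task that asks ImageIO to read an image that does not exist. */
    private static Runnable task() {
        return new Runnable() {
            @Override
            public void run() {
                // The domain must exist, but the file must not.
                String vivofsUrl = "https://vivobbs.xx.yy.zz/wiwNWYCFW9ieGbWq/20181129/3a2adfde12cd328d81f965088890eeffff.jpg";
                File file = null;
                BufferedImage image = null;
                try {
                    file = File.createTempFile("abc", "jpg");
                    URL url1 = new URL(vivofsUrl);
                    image = ImageIO.read(url1);
                } catch (Throwable e) {
                    e.printStackTrace();
                } finally {
                    if (null != file) {
                        file.delete();
                    }
                    if (null != image) {
                        image.flush();
                        image = null;
                    }
                }
            }
        };
    }
}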
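Packet capture
TCP connection check
Problem reproduced!
Step5 Reporting the bug after discussion
Reported to Oracle.
III. Communication with Oracle
After the report was filed, Oracle contacted me to follow up. Some excerpts from the emails are included here, for reference only.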
The report was rejected.
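IV. Open Questions and Shortcomings
My grasp of the TCP state machine's transitions was not thorough enough, so some of the questions could not be reasoned through from the state machine itself. Mastering these fundamentals takes continual work.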
For more content, follow the vivo Internet Technology WeChat official account.
Note: to repost this article, please first contact WeChat ID: Labs2020.