Java: Troubleshooting Intermittent Query Timeouts with an Old Version of the ES Java Client

For context: the project runs ES 2.4 with ES Java client 2.4.4. The server has 16 GB of memory and 8 cores.

Symptom

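One of our regional projects has an endpoint that queries ES, and paging through its results would time out for no apparent reason: sometimes on the first page, sometimes on the third.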

Troubleshooting approach

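Because the project had no monitoring system at all, the only option was to pin down exactly where the timeout happened. I added logging at every place in the code that could time out and shipped a temporary release. It turned out that the call that queries ES (`searchRequestBuilder.execute().actionGet();`) was hanging, with no exception logged at all; execution simply went quiet at that point.

At this stage there were still no leads.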

Suspects

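The source of ActionRequestBuilder#execute shows that it uses a thread pool, so my first suspicion was that the machine was under-provisioned and had run out of threads.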

    public ListenableActionFuture<Response> execute() {
        PlainListenableActionFuture<Response> future = new PlainListenableActionFuture<>(threadPool);
        execute(future);
        return future;
    }

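So I went into the container and ran top (pressing 1 for the per-CPU view). Usage on every CPU was low, which ruled out the machine's configuration.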

Could it be running out of memory, with the JVM constantly doing GC and stuck in stop-the-world (STW) pauses?

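Running `jstat -gcutil -t <pid> 1000 1000` showed that GC was normal as well: Eden, S0, S1, and Old occupancy were all within a normal range, there had been only 3 full GCs, and young GC and old GC were not frequent either. That ruled out a memory shortage.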

Suspecting threads again

I dumped the thread information with jstack -l and, going through it, found a large number of ES threads in the WAITING (parking) state.

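The threads were blocked in BaseFuture#get, reached from a call to AdapterActionFuture#actionGet. Following the source, I found that BaseFuture's inner Sync class is built on AQS (AbstractQueuedSynchronizer). For background on AQS, see my earlier blog posts on the topic.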
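When running jstack against a container is inconvenient, roughly the same check can be done from inside the JVM. The following is only a sketch using the JDK's `Thread.getAllStackTraces()`; the class name, the helper methods, and the `demo-es-` thread name prefix are hypothetical and chosen for illustration, not the names the ES client uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

final class WaitingThreadScan {

    /** Names of live threads in WAITING state whose name starts with namePrefix. */
    static List<String> waitingThreads(String namePrefix) {
        List<String> names = new ArrayList<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.getState() == Thread.State.WAITING && t.getName().startsWith(namePrefix)) {
                names.add(t.getName());
            }
        }
        return names;
    }

    /** Demo helper: starts a daemon thread that blocks on the latch and returns once it is WAITING. */
    static Thread startParkedThread(String name, CountDownLatch release) {
        Thread t = new Thread(() -> {
            try {
                release.await(); // parks through AQS, much like a blocked actionGet() caller
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, name);
        t.setDaemon(true);
        t.start();
        while (t.getState() != Thread.State.WAITING) {
            Thread.yield(); // spin until the thread has actually parked
        }
        return t;
    }

    public static void main(String[] args) {
        CountDownLatch release = new CountDownLatch(1);
        startParkedThread("demo-es-waiter", release); // hypothetical thread name
        System.out.println(waitingThreads("demo-es-"));
        release.countDown(); // let the demo thread finish
    }
}
```

A count of such threads that keeps growing between two scans is the same signal the jstack dump gave here.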

Tracing the source

BaseFuture#get actually delegates to Sync#get. The source is as follows:

BaseFuture.Sync#get

    V get() throws CancellationException, ExecutionException,
            InterruptedException {
        // Acquire the shared lock allowing interruption.
        // AQS template method: try to acquire the shared lock
        acquireSharedInterruptibly(-1);
        return getValue();
    }

AQS#acquireSharedInterruptibly

    public final void acquireSharedInterruptibly(int arg)
            throws InterruptedException {
        if (Thread.interrupted())
            throw new InterruptedException();
        if (tryAcquireShared(arg) < 0)
            doAcquireSharedInterruptibly(arg);
    }

AQS#acquireSharedInterruptibly calls tryAcquireShared(arg) to decide whether the shared lock has been obtained. Sync implements tryAcquireShared(arg) as follows:

    /** Acquisition succeeds if the future is done, otherwise it fails. */
    @Override
    protected int tryAcquireShared(int ignored) {
        if (isDone()) {
            return 1;
        }
        return -1;
    }

    boolean isDone() {
        return (getState() & (COMPLETED | CANCELLED)) != 0;
    }

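getState is an AQS method, and state defaults to 0. Since 0 ANDed with any value is 0, isDone() can return true only after state has become nonzero (more precisely, once the COMPLETED or CANCELLED bit is set). So when does state change? By now you can probably guess: it must be updated after the data comes back from ES, or after an exception. Looking at Sync, it overrides tryReleaseShared: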

    @Override
    protected boolean tryReleaseShared(int finalState) {
        setState(finalState);
        return true;
    }
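The effect of that state write on isDone() can be checked in isolation. The sketch below copies the bit test from isDone(); the numeric constant values are an assumption (they follow the Guava-style future that BaseFuture resembles) and are not quoted from the ES source above.

```java
final class FutureStateMask {
    // Assumed values; only the names appear in the source shown above.
    static final int RUNNING = 0;
    static final int COMPLETING = 1;
    static final int COMPLETED = 2;
    static final int CANCELLED = 4;

    /** Same bit test as isDone(), with the state passed in explicitly. */
    static boolean isDone(int state) {
        return (state & (COMPLETED | CANCELLED)) != 0;
    }

    public static void main(String[] args) {
        System.out.println("RUNNING    -> " + isDone(RUNNING));    // false
        System.out.println("COMPLETING -> " + isDone(COMPLETING)); // false
        System.out.println("COMPLETED  -> " + isDone(COMPLETED));  // true
        System.out.println("CANCELLED  -> " + isDone(CANCELLED));  // true
    }
}
```

Under these assumed values, a future whose state never leaves RUNNING keeps failing tryAcquireShared, so every caller of get() stays parked.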

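So this is where state is modified. Looking at what calls this method, it is ultimately reached from Sync#complete (through AQS's releaseShared), which, as the name suggests, runs when the operation completes.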

    private boolean complete(@Nullable V v, @Nullable Throwable t,
                             int finalState) {
        boolean doCompletion = compareAndSetState(RUNNING, COMPLETING);
        if (doCompletion) {
            // If this thread successfully transitioned to COMPLETING, set the value
            // and exception and then release to the final state.
            this.value = v;
            this.exception = t;
            releaseShared(finalState);
        } else if (getState() == COMPLETING) {
            // If some other thread is currently completing the future, block until
            // they are done so we can guarantee completion.
            acquireShared(-1);
        }
        return doCompletion;
    }
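To see the whole mechanism in one place, here is a stripped-down future built on AQS in the same way. It is a simplified sketch, not the ES implementation: `MiniFuture`, its constant values, and the `timesOut` helper are made up for illustration, and cancellation and exception handling are left out.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.AbstractQueuedSynchronizer;

final class MiniFuture<V> {
    private static final int RUNNING = 0;    // assumed values, for illustration
    private static final int COMPLETING = 1;
    private static final int COMPLETED = 2;

    private static final class Sync<V> extends AbstractQueuedSynchronizer {
        private V value;

        @Override
        protected int tryAcquireShared(int ignored) {
            return getState() == COMPLETED ? 1 : -1; // waiters pass only once completed
        }

        @Override
        protected boolean tryReleaseShared(int finalState) {
            setState(finalState);
            return true;
        }

        boolean complete(V v) {
            if (!compareAndSetState(RUNNING, COMPLETING)) {
                return false; // somebody else already completed this future
            }
            value = v;
            releaseShared(COMPLETED); // publishes the value and wakes all waiters
            return true;
        }

        V get(long timeout, TimeUnit unit) throws InterruptedException, TimeoutException {
            if (!tryAcquireSharedNanos(-1, unit.toNanos(timeout))) {
                throw new TimeoutException("timed out waiting for result");
            }
            return value;
        }
    }

    private final Sync<V> sync = new Sync<>();

    boolean complete(V v) {
        return sync.complete(v);
    }

    V get(long timeout, TimeUnit unit) throws InterruptedException, TimeoutException {
        return sync.get(timeout, unit);
    }

    /** Demo helper: true if a timed get() gives up before the future completes. */
    static boolean timesOut(MiniFuture<?> future, long millis) {
        try {
            future.get(millis, TimeUnit.MILLISECONDS);
            return false;
        } catch (TimeoutException e) {
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        MiniFuture<String> never = new MiniFuture<>();
        System.out.println(timesOut(never, 100)); // nobody calls complete(), so the wait gives up

        MiniFuture<String> ok = new MiniFuture<>();
        new Thread(() -> ok.complete("hits")).start();
        System.out.println(ok.get(1, TimeUnit.SECONDS));
    }
}
```

The first case mirrors the incident: without a timeout, a waiter on a future that is never completed would stay parked indefinitely.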

Pinpointing the problem

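At this point the problem was basically identified. When we query through the ES client, there are effectively two threads involved: a query thread and a waiting thread. Because the query thread never called back with a result, the waiting thread stayed blocked and the query failed.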

Solution

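Because users urgently needed the feature, I went with a temporary fix. Since the failure was intermittent, I tried adding a retry on failure. It turns out #actionGet has an overload that accepts a timeout, so I switched to that API, caught ElasticsearchTimeoutException, and retried whenever it was thrown. That resolved the problem.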

    public T actionGet(long timeout, TimeUnit unit) {
        try {
            return get(timeout, unit);
        } catch (TimeoutException e) {
            throw new ElasticsearchTimeoutException(e.getMessage());
        } catch (InterruptedException e) {
            throw new IllegalStateException("Future got interrupted", e);
        } catch (ExecutionException e) {
            throw rethrowExecutionException(e);
        }
    }
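The retry itself can be sketched roughly as below. To keep the example runnable without the ES dependency, it uses the JDK's `Future#get(timeout, unit)` and `TimeoutException` as stand-ins; in the real code the call would be `searchRequestBuilder.execute().actionGet(timeout, unit)` and the caught exception would be ElasticsearchTimeoutException. The class name, method name, timeout, and attempt count are all hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class TimeoutRetry {

    /** Issues the call up to maxAttempts times, waiting at most timeoutMillis per attempt. */
    static <T> T getWithRetry(Supplier<Future<T>> call, long timeoutMillis, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        TimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Future<T> future = call.get(); // start a fresh request for each attempt
            try {
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                last = e;
                future.cancel(true); // give up on the stuck attempt before retrying
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting", e);
            } catch (ExecutionException e) {
                throw new IllegalStateException("Query failed", e.getCause());
            }
        }
        throw new IllegalStateException("Timed out after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        Supplier<Future<String>> call = () -> {
            if (calls.getAndIncrement() == 0) {
                return new CompletableFuture<>(); // first attempt never completes
            }
            return CompletableFuture.completedFuture("page-1");
        };
        String result = getWithRetry(call, 100, 3);
        System.out.println(result + " after " + calls.get() + " attempts");
    }
}
```

Retrying like this is only reasonable because the operation is a read-only query; it works around the symptom and does not explain why the callback is missing.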

As for why the query thread never calls back, there is no firm conclusion yet; the investigation is still ongoing.
