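1. Background

One day an upstream caller reported that one of the Dubbo interfaces we provide was being circuit-broken for a short period at fixed times every day. The exception message indicated that the provider's Dubbo thread pool was exhausted. At the time the interface served 1.8 billion requests per day, of which about 940,000 per day failed. That was the start of this optimization journey.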

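2. Rapid Response

2.1 Rapid diagnosis

First we checked the routine system metrics (machine, JVM memory, GC, threads). There were minor spikes, but all were within a reasonable range and did not line up with the error times, so we set them aside for the moment.

Next we analyzed traffic and found a surge at fixed times every day. The surge times matched the error times, so our preliminary conclusion was that short bursts of heavy traffic were the cause.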

Figure: Traffic trend

Figure: Degraded request count

Figure: Interface P99 latency

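3. Finding the Performance Bottleneck

3.1 Interface flow analysis

3.1.1 Flowchart

3.1.2 Flow analysis

After receiving a request, the interface calls the upstream interface it depends on through a Hystrix circuit breaker, with a timeout of 500 ms.
Based on the data the upstream interface returns, it assembles the detail data. It first looks in the local cache; on a local miss it falls back to Redis; on a Redis miss it returns immediately while an asynchronous thread reloads the data from the database.
If the upstream call in the first step fails, it serves fallback data. The fallback path is the same: local cache first, then Redis on a local miss, and on a Redis miss it returns immediately while an asynchronous thread reloads from the database.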
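The first and third steps above (a time-bounded upstream call, with a fallback when it fails) can be sketched with the JDK alone. This is only an illustration of the pattern, not the Hystrix-based code used in the service: the class name `TimedCallWithFallback`, the method `call`, and its parameters are all hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical sketch: a bounded-time upstream call with a fallback path. */
final class TimedCallWithFallback {

    private TimedCallWithFallback() {
    }

    /**
     * Runs the upstream call on the given pool and waits at most timeoutMillis.
     * On timeout or failure the result of fallback.get() is returned instead.
     */
    static <T> T call(Supplier<T> upstream, Supplier<T> fallback,
                      long timeoutMillis, ExecutorService pool) {
        CompletableFuture<T> future = CompletableFuture.supplyAsync(upstream, pool);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // Preserve the interrupt for the caller, then degrade.
            Thread.currentThread().interrupt();
            return fallback.get();
        } catch (ExecutionException | TimeoutException e) {
            // Marks the future as cancelled; the worker thread itself is not interrupted.
            future.cancel(true);
            return fallback.get();
        }
    }
}
```

A caller would pass the upstream request as the first supplier and the local-cache/Redis fallback lookup as the second, for example `call(() -> upstream.query(id), () -> fallbackLookup(id), 500, pool)`, where `upstream.query` and `fallbackLookup` stand in for the real calls.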

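3.2 Bottleneck investigation

3.2.1 Upstream interface latency is relatively high

The call traces show that the upstream interface's P99 latency spiked above 1 s during peak traffic. However, given the circuit-breaker settings (timeout 500 ms, coreSize and maxSize both 50, upstream average latency under 10 ms), we judged that the upstream interface was not the key problem. To rule out interference and fail fast when the upstream service spikes, we lowered the circuit-breaker timeout to 100 ms and the Dubbo timeout to 100 ms.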
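The reasoning behind these settings can be made concrete with a back-of-the-envelope calculation: a pool of N threads that each hold a call for t milliseconds can complete at most N * 1000 / t calls per second. The sketch below applies that to the figures quoted above. It assumes the numbers apply to a single service instance and that every thread is continuously busy, neither of which the original states, and the class and method names are hypothetical.

```java
/** Hypothetical helper: throughput ceiling of a fixed-size thread pool. */
final class PoolCapacity {

    private PoolCapacity() {
    }

    /** Upper bound on completed calls per second when each call holds a thread for holdMillis. */
    static double maxCallsPerSecond(int threads, double holdMillis) {
        if (threads <= 0 || holdMillis <= 0) {
            throw new IllegalArgumentException("threads and holdMillis must be positive");
        }
        return threads * 1000.0 / holdMillis;
    }

    public static void main(String[] args) {
        // Typical case: 50 threads, calls finishing in about 10 ms.
        System.out.println(maxCallsPerSecond(50, 10));   // prints 5000.0
        // Worst case during a spike: every call runs into the 500 ms timeout.
        System.out.println(maxCallsPerSecond(50, 500));  // prints 100.0
        // Same worst case after lowering the timeout to 100 ms.
        System.out.println(maxCallsPerSecond(50, 100));  // prints 500.0
    }
}
```

Under these assumptions, shortening the timeout from 500 ms to 100 ms raises the worst-case ceiling fivefold, which is the "fail fast" effect the adjustment aims for.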

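3.2.2 Detail lookup misses the local cache and falls back to Redis

Using the tracing platform, the first step was to analyze Redis request traffic in order to gauge the local cache hit rate. Redis traffic turned out to be twice the interface traffic, which by design should not happen. A code review then found a logic error in one place.

That code path did not read from the local cache and instead fetched the data directly from Redis. The Redis max response time did show unreasonable spikes, and further analysis showed that the Redis response time and the Dubbo P99 spikes were largely aligned. It felt as though the cause had been found, and we were quietly pleased.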

Figure: Redis request traffic

Figure: Service interface request traffic

Figure: Dubbo P99 latency

Figure: Redis max response time
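The effect of the bug described in 3.2.2 on Redis traffic can be illustrated with a small two-tier lookup that counts remote reads. This is a sketch under stated assumptions: a plain `Map` stands in for Redis, the asynchronous database reload is left out, and `TieredDetailCache` and its methods are hypothetical names rather than the service's code.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical two-tier lookup; a Map stands in for Redis. */
final class TieredDetailCache {
    private final Map<String, String> local = new ConcurrentHashMap<>();
    private final Map<String, String> remote;            // stand-in for Redis
    private final AtomicLong remoteReads = new AtomicLong();

    TieredDetailCache(Map<String, String> remote) {
        this.remote = remote;
    }

    /** Intended path: local cache first, the remote store only on a local miss. */
    Optional<String> get(String key) {
        String value = local.get(key);
        if (value != null) {
            return Optional.of(value);
        }
        value = readRemote(key);
        if (value != null) {
            local.put(key, value);                        // later lookups stay local
        }
        return Optional.ofNullable(value);
    }

    /** Shape of the bug: the local cache is skipped, so every lookup costs a remote read. */
    Optional<String> getBypassingLocal(String key) {
        return Optional.ofNullable(readRemote(key));
    }

    long remoteReadCount() {
        return remoteReads.get();
    }

    private String readRemote(String key) {
        remoteReads.incrementAndGet();
        return remote.get(key);
    }
}
```

With the intended path, repeated lookups of a hot key cost one remote read in total; with the bypassing path, the remote read count grows with every lookup, which is how remote traffic can end up exceeding interface traffic.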

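3.2.3 Fallback lookup misses the local cache and falls back to Redis

Normal.

3.2.4 Writing the request result to Redis

Because Redis was resource-isolated and the DB admin console showed no slow-log entries, there were many possible reasons for Redis slowing down at this point. The others were subjectively ignored, however; all attention was on the doubled Redis request traffic, so the issue found in 3.2.2 was fixed first.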

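4. Solutions

4.1 Releasing the fix for the issue identified in 3.2.2

Figure: Redis request volume before the release

Figure: Redis request volume after the release

After the release, the doubled Redis traffic was resolved and the spikes in Redis max response time eased, but they were not eliminated. This showed that the heavy query traffic was not the root cause.

Figure: Redis max response time (before the release)

Figure: Redis max response time (after the release)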

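4.2 Scaling out Redis

With the abnormal Redis traffic fixed, the problem was still not fully solved. All we could do was calm down and carefully work through the reasons Redis might be slow, along three lines:

A slow query occurred
The Redis service hit a performance bottleneck
The client was configured unreasonably

Based on this, we checked each in turn. The Redis slow query log showed no slow queries.

We then used the tracing platform to analyze the slow Redis commands in detail. Without the noise of slow queries caused by heavy traffic, locating the problem went quickly: most of the slow requests were in the setex method, and the occasional slow read always came right after a setex. Given Redis's single-threaded model, we judged setex to be the culprit behind the Redis P99 spikes. After finding the specific statement and the business logic behind it, we first requested a Redis scale-out, from 6 masters to 8.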

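Figure: Before the Redis scale-out

Figure: After the Redis scale-out

Judging by the results, scaling out had essentially no effect, which showed that the Redis service itself was not the bottleneck. The only remaining candidate was the client configuration.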

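4.3 Client parameter tuning

4.3.1 Connection pool tuning

Since scaling out Redis had no effect, we turned to the client, where we suspected two things.

The first was a bug in how the client manages connections in Redis Cluster mode. The second was unreasonable connection pool parameters. We analyzed the source code and adjusted the pool parameters in parallel.

4.3.1.1 Checking whether the client's connection management has a bug

After analyzing the client's connection pool handling, we found no problem. It behaved as expected, caching connection pools per slot, so the first hypothesis was ruled out. The source is as follows.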

1. setex

  public String setex(final byte[] key, final int seconds, final byte[] value) {
    return new JedisClusterCommand<String>(connectionHandler, maxAttempts) {
      @Override
      public String execute(Jedis connection) {
        return connection.setex(key, seconds, value);
      }
    }.runBinary(key);
  }

2. runBinary

  public T runBinary(byte[] key) {
    if (key == null) {
      throw new JedisClusterException("No way to dispatch this command to Redis Cluster.");
    }

    return runWithRetries(key, this.maxAttempts, false, false);
  }

3. runWithRetries

  private T runWithRetries(byte[] key, int attempts, boolean tryRandomNode, boolean asking) {
    if (attempts <= 0) {
      throw new JedisClusterMaxRedirectionsException("Too many Cluster redirections?");
    }

    Jedis connection = null;
    try {

      if (asking) {
        // TODO: Pipeline asking with the original command to make it
        // faster....
        connection = askConnection.get();
        connection.asking();

        // if asking success, reset asking flag
        asking = false;
      } else {
        if (tryRandomNode) {
          connection = connectionHandler.getConnection();
        } else {
          connection = connectionHandler.getConnectionFromSlot(JedisClusterCRC16.getSlot(key));
        }
      }

      return execute(connection);

    }
    // (the excerpt ends here; the rest of the method is not shown)

4. getConnectionFromSlot

  public Jedis getConnectionFromSlot(int slot) {
    JedisPool connectionPool = cache.getSlotPool(slot);
    if (connectionPool != null) {
      // It can't guaranteed to get valid connection because of node
      // assignment
      return connectionPool.getResource();
    } else {
      renewSlotCache(); //It's abnormal situation for cluster mode, that we have just nothing for slot, try to rediscover state
      connectionPool = cache.getSlotPool(slot);
      if (connectionPool != null) {
        return connectionPool.getResource();
      } else {
        //no choice, fallback to new connection to random node
        return getConnection();
      }
    }
  }
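To make the "pool per slot" routing concrete, the sketch below reimplements the key-to-slot rule from the Redis Cluster specification: CRC16 (XMODEM variant) of the key modulo 16384, hashing only the hash-tag content when the key contains a non-empty `{...}` section. It is an independent illustration rather than the Jedis `JedisClusterCRC16` source, and the class name `ClusterSlot` is hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical re-implementation of the Redis Cluster key-to-slot rule. */
final class ClusterSlot {
    private static final int SLOT_COUNT = 16384;

    private ClusterSlot() {
    }

    /** CRC16, XMODEM variant: polynomial 0x1021, initial value 0, no bit reflection. */
    static int crc16(byte[] data, int from, int to) {
        int crc = 0;
        for (int i = from; i < to; i++) {
            crc ^= (data[i] & 0xFF) << 8;
            for (int bit = 0; bit < 8; bit++) {
                crc = ((crc & 0x8000) != 0 ? (crc << 1) ^ 0x1021 : crc << 1) & 0xFFFF;
            }
        }
        return crc;
    }

    /** Slot for a key; only the hash tag is hashed when a non-empty {...} section exists. */
    static int slot(String key) {
        byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
        int from = 0;
        int to = bytes.length;
        int open = indexOf(bytes, (byte) '{', 0);
        if (open >= 0) {
            int close = indexOf(bytes, (byte) '}', open + 1);
            if (close > open + 1) {           // at least one byte between the braces
                from = open + 1;
                to = close;
            }
        }
        return crc16(bytes, from, to) % SLOT_COUNT;
    }

    private static int indexOf(byte[] bytes, byte target, int start) {
        for (int i = start; i < bytes.length; i++) {
            if (bytes[i] == target) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // CRC16 of "123456789" is 0x31C3, the reference value in the Redis Cluster spec.
        System.out.println(slot("123456789"));   // prints 12739
    }
}
```

Because the slot depends only on the key, every setex for the same key goes to the same node and therefore borrows from the same per-node pool, which is why pool sizing per node matters under bursty writes.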

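4.3.1.2 Analyzing connection pool parameters

After discussing with the middleware team and consulting the official commons-pool2 documentation, we adjusted the connection pool parameters.

After the adjustment, the number of requests taking more than 1 s dropped but did not disappear. The degraded-request count reported by the upstream caller fell from about 900,000 per day to about 60,000 per day. (Why requests above 200 ms still occurred after maxWaitMillis was set to 200 ms is explained below.)

Figure: Redis max response time after parameter tuning

Figure: Interface error count after parameter tuning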

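4.3.2 Further optimization

The optimization could not stop there. The question was how to bring every Redis write request under 200 ms. The approach was still to tune client configuration, so we analyzed the source code Jedis uses to obtain a connection.

Source for obtaining a connection (GenericObjectPool.borrowObject in commons-pool2, which Jedis uses)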

final AbandonedConfig ac = this.abandonedConfig;
if (ac != null && ac.getRemoveAbandonedOnBorrow() &&
        (getNumIdle() < 2) &&
        (getNumActive() > getMaxTotal() - 3) ) {
    removeAbandoned(ac);
}

PooledObject<T> p = null;

// Get local copy of current config so it is consistent for entire
// method execution
final boolean blockWhenExhausted = getBlockWhenExhausted();

boolean create;
final long waitTime = System.currentTimeMillis();

while (p == null) {
    create = false;
    p = idleObjects.pollFirst();
    if (p == null) {
        p = create();
        if (p != null) {
            create = true;
        }
    }
    if (blockWhenExhausted) {
        if (p == null) {
            if (borrowMaxWaitMillis < 0) {
                p = idleObjects.takeFirst();
            } else {
                p = idleObjects.pollFirst(borrowMaxWaitMillis,
                        TimeUnit.MILLISECONDS);
            }
        }
        if (p == null) {
            throw new NoSuchElementException(
                    "Timeout waiting for idle object");
        }
    } else {
        if (p == null) {
            throw new NoSuchElementException("Pool exhausted");
        }
    }
    if (!p.allocate()) {
        p = null;
    }

    if (p != null) {
        try {
            factory.activateObject(p);
        } catch (final Exception e) {
            try {
                destroy(p);
            } catch (final Exception e1) {
                // Ignore - activation failure is more important
            }
            p = null;
            if (create) {
                final NoSuchElementException nsee = new NoSuchElementException(
                        "Unable to activate object");
                nsee.initCause(e);
                throw nsee;
            }
        }
        if (p != null && (getTestOnBorrow() || create && getTestOnCreate())) {
            boolean validate = false;
            Throwable validationThrowable = null;
            try {
                validate = factory.validateObject(p);
            } catch (final Throwable t) {
                PoolUtils.checkRethrow(t);
                validationThrowable = t;
            }
            if (!validate) {
                try {
                    destroy(p);
                    destroyedByBorrowValidationCount.incrementAndGet();
                } catch (final Exception e) {
                    // Ignore - validation failure is more important
                }
                p = null;
                if (create) {
                    final NoSuchElementException nsee = new NoSuchElementException(
                            "Unable to validate object");
                    nsee.initCause(validationThrowable);
                    throw nsee;
                }
            }
        }
    }
}

updateStatsBorrow(p, System.currentTimeMillis() - waitTime);

return p.getObject();

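The flow for obtaining a connection is roughly as follows:

Check whether there is an idle connection. If there is, return it directly; if not, create one.
When creating, if the maximum number of connections has been reached, check whether another thread is currently creating a connection. If none is, return immediately; if one is, wait up to maxWaitMillis (the other thread's creation may fail). If the maximum has not been reached, create the connection (in this case the time spent obtaining a connection may exceed maxWaitMillis).
If creation does not succeed, check whether the pool is configured to block when exhausted. If not, throw an exception immediately because the pool is exhausted. If so, check maxWaitMillis: if it is less than 0, block indefinitely; otherwise block for up to maxWaitMillis.
After that, the configured parameters decide whether the connection needs to be validated, and so on.

From this flow, with maxWaitMillis currently set to 200, the maximum blocking time across these steps is 400 ms (200 ms waiting for a concurrent creation plus 200 ms waiting for an idle connection), and 200 ms in most cases. Spikes beyond 400 ms should not occur.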
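The borrow flow can be condensed into a minimal JDK-only pool to show where the time goes. This is a simplified sketch, not the commons-pool2 implementation: it omits validation, eviction, and the wait for a concurrent creation, and `BoundedPool` with its methods is a hypothetical name. What it keeps is the key asymmetry: waiting for an idle object is bounded by maxWaitMillis, while creating a new object is not.

```java
import java.util.NoSuchElementException;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical, simplified pool illustrating the borrow flow. */
final class BoundedPool<T> {
    private final LinkedBlockingDeque<T> idle = new LinkedBlockingDeque<>();
    private final AtomicInteger created = new AtomicInteger();
    private final Supplier<T> factory;
    private final int maxTotal;
    private final long maxWaitMillis;

    BoundedPool(Supplier<T> factory, int maxTotal, long maxWaitMillis) {
        this.factory = factory;
        this.maxTotal = maxTotal;
        this.maxWaitMillis = maxWaitMillis;
    }

    T borrow() {
        T obj = idle.pollFirst();                 // 1. reuse an idle object if there is one
        if (obj != null) {
            return obj;
        }
        if (tryReserveSlot()) {                   // 2. below the cap: create a new object
            try {
                return factory.get();             //    creation time is NOT bounded by maxWaitMillis
            } catch (RuntimeException e) {
                created.decrementAndGet();        //    release the reserved slot on failure
                throw e;
            }
        }
        try {                                     // 3. at the cap: bounded wait for a returned object
            obj = idle.pollFirst(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for an idle object", e);
        }
        if (obj == null) {
            throw new NoSuchElementException("Timeout waiting for idle object");
        }
        return obj;
    }

    void giveBack(T obj) {
        idle.addFirst(obj);
    }

    int createdCount() {
        return created.get();
    }

    /** Atomically claims one of the maxTotal creation slots, if any is left. */
    private boolean tryReserveSlot() {
        while (true) {
            int n = created.get();
            if (n >= maxTotal) {
                return false;
            }
            if (created.compareAndSet(n, n + 1)) {
                return true;
            }
        }
    }
}
```

In a pool that starts empty, a traffic burst drives most callers into step 2, so their latency is dominated by connection setup rather than by the configured wait limit.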

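So the problem was probably in connection creation, since creating a connection is relatively expensive and takes an unpredictable amount of time. We focused on whether this scenario was actually occurring and checked the Redis connection counts in the DB admin console.

Figure: Redis service connections in the DB admin console

The chart shows that at several points in time (9:00, 12:00, 19:00...) the number of Redis connections did rise, and these largely matched the times of the Redis spikes. It felt (after all the earlier attempts we no longer dared to say "certain") that the problem was now clearly located: when a traffic surge arrives, the available connections in the pool cannot meet demand, so new connections are created and requests have to wait.

The idea at this point was to build the connection pool when the service starts, so that as few new connections as possible need to be created later. We changed the pool parameter vivo.cache.depend.common.poolConfig.minIdle, and surprisingly it had no effect.

Nothing left but to read the source. Jedis uses commons-pool2 underneath to manage connections, so we looked at part of the source of commons-pool2-2.6.2.jar, the version used in the project.

commons-pool2 source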

public GenericObjectPool(final PooledObjectFactory<T> factory,
        final GenericObjectPoolConfig<T> config) {

    super(config, ONAME_BASE, config.getJmxNamePrefix());

    if (factory == null) {
        jmxUnregister(); // tidy up
        throw new IllegalArgumentException("factory may not be null");
    }
    this.factory = factory;

    idleObjects = new LinkedBlockingDeque<>(config.getFairness());

    setConfig(config);
}

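Surprisingly, nothing here initializes any connections. We asked the middleware team, and the source they showed us (commons-pool2-2.4.2.jar) is below. Its constructor makes one additional call at the end, to startEvictor.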

1. Initializing the connection pool

    public GenericObjectPool(PooledObjectFactory<T> factory,
            GenericObjectPoolConfig config) {

        super(config, ONAME_BASE, config.getJmxNamePrefix());

        if (factory == null) {
            jmxUnregister(); // tidy up
            throw new IllegalArgumentException("factory may not be null");
        }
        this.factory = factory;

        idleObjects = new LinkedBlockingDeque<PooledObject<T>>(config.getFairness());

        setConfig(config);

        startEvictor(getTimeBetweenEvictionRunsMillis());
    }

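Why the difference? Checking the jars showed that the versions differ: the middleware team's source was from 2.4.2, while the project actually used 2.6.2. Reading startEvictor, one step in it is exactly the connection pool warm-up logic.

Jedis connection pool warm-up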

1. startEvictor

    final void startEvictor(long delay) {
        synchronized (evictionLock) {
            if (null != evictor) {
                EvictionTimer.cancel(evictor);
                evictor = null;
                evictionIterator = null;
            }
            if (delay > 0) {
                evictor = new Evictor();
                EvictionTimer.schedule(evictor, delay, delay);
            }
        }
    }

2. Evictor

    class Evictor extends TimerTask {
        /**
         * Run pool maintenance.  Evict objects qualifying for eviction and then
         * ensure that the minimum number of idle instances are available.
         * Since the Timer that invokes Evictors is shared for all Pools but
         * pools may exist in different class loaders, the Evictor ensures that
         * any actions taken are under the class loader of the factory
         * associated with the pool.
         */
        @Override
        public void run() {
            ClassLoader savedClassLoader =
                    Thread.currentThread().getContextClassLoader();
            try {
                if (factoryClassLoader != null) {
                    // Set the class loader for the factory
                    ClassLoader cl = factoryClassLoader.get();
                    if (cl == null) {
                        // The pool has been dereferenced and the class loader
                        // GC'd. Cancel this timer so the pool can be GC'd as
                        // well.
                        cancel();
                        return;
                    }
                    Thread.currentThread().setContextClassLoader(cl);
                }

                // Evict from the pool
                try {
                    evict();
                } catch(Exception e) {
                    swallowException(e);
                } catch(OutOfMemoryError oome) {
                    // Log problem but give evictor thread a chance to continue
                    // in case error is recoverable
                    oome.printStackTrace(System.err);
                }
                // Re-create idle instances.
                try {
                    ensureMinIdle();
                } catch (Exception e) {
                    swallowException(e);
                }
            } finally {
                // Restore the previous CCL
                Thread.currentThread().setContextClassLoader(savedClassLoader);
            }
        }
    }

3. ensureMinIdle

    void ensureMinIdle() throws Exception {
        ensureIdle(getMinIdle(), true);
    }

4. ensureIdle

    private void ensureIdle(int idleCount, boolean always) throws Exception {
        if (idleCount < 1 || isClosed() || (!always && !idleObjects.hasTakeWaiters())) {
            return;
        }

        while (idleObjects.size() < idleCount) {
            PooledObject<T> p = create();
            if (p == null) {
                // Can't create objects, no reason to think another call to
                // create will work. Give up.
                break;
            }
            if (getLifo()) {
                idleObjects.addFirst(p);
            } else {
                idleObjects.addLast(p);
            }
        }
        if (isClosed()) {
            // Pool closed while object was being added to idle objects.
            // Make sure the returned object is destroyed rather than left
            // in the idle object pool (which would effectively be a leak)
            clear();
        }
    }
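The warm-up mechanism in the excerpt boils down to a periodic task that tops the idle queue up to minIdle, and that task only exists when the configured interval is positive. The JDK-only sketch below reproduces just that behavior; it performs no eviction, and `WarmPool` and its members are hypothetical names, not commons-pool2 API.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical sketch of evictor-driven warm-up: no timer means no pre-created objects. */
final class WarmPool<T> implements AutoCloseable {
    private final LinkedBlockingDeque<T> idle = new LinkedBlockingDeque<>();
    private final Supplier<T> factory;
    private final int minIdle;
    private final ScheduledExecutorService evictor;   // null when the interval is not positive

    WarmPool(Supplier<T> factory, int minIdle, long timeBetweenRunsMillis) {
        this.factory = factory;
        this.minIdle = minIdle;
        ScheduledExecutorService timer = null;
        if (timeBetweenRunsMillis > 0) {               // same guard as "if (delay > 0)" in startEvictor
            timer = Executors.newSingleThreadScheduledExecutor(runnable -> {
                Thread thread = new Thread(runnable, "pool-evictor");
                thread.setDaemon(true);                // do not keep the JVM alive for maintenance
                return thread;
            });
            timer.scheduleWithFixedDelay(this::ensureMinIdle,
                    timeBetweenRunsMillis, timeBetweenRunsMillis, TimeUnit.MILLISECONDS);
        }
        this.evictor = timer;
    }

    /** Creates objects until at least minIdle are idle. */
    synchronized void ensureMinIdle() {
        while (idle.size() < minIdle) {
            idle.addLast(factory.get());
        }
    }

    int idleCount() {
        return idle.size();
    }

    @Override
    public void close() {
        if (evictor != null) {
            evictor.shutdownNow();
        }
    }
}
```

Constructed with an interval of 0, this pool never pre-creates anything no matter how large minIdle is, which mirrors the observation that changing minIdle alone made no difference.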

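We changed the jar version and added two parameters in the configuration center: vivo.cache.depend.common.poolConfig.timeBetweenEvictionRunsMillis (the interval at which idle connections in the pool are checked; connections that have been idle for longer than minEvictableIdleTimeMillis milliseconds are closed until the pool is down to minIdle connections) and vivo.cache.depend.common.poolConfig.minEvictableIdleTimeMillis (how long a connection may stay idle in the pool, in milliseconds). After the service was restarted, the connection pool warmed up normally, which finally resolved the problem at the Redis layer.

The results are shown below. The performance problem was largely resolved.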

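Figure: Redis response time (before optimization)

Figure: Redis response time (after optimization)

Figure: Interface P99 latency (before optimization)

Figure: Interface P99 latency (after optimization)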

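5. Summary

When a production issue occurs, the first priority is still to restore service quickly and keep the business impact to a minimum. Production services should therefore have rate limiting, circuit breaking, and degradation strategies in place ahead of time, so that a recovery plan is at hand when something goes wrong. How well you know the company's monitoring platforms determines how fast you can locate a problem, so every developer should treat proficiency with them (machine, service, interface, DB, and so on) as a basic skill.

When Redis responds slowly, start with three areas: the Redis cluster servers (machine load, whether there are slow queries), the business code (whether it has a bug), and the client (whether the connection pool is configured reasonably). This covers most causes of slow Redis responses.

On a cold start, the Redis connection pool needs to be warmed up. Different commons-pool2 versions handle cold start differently, but in our case warm-up only took effect once the eviction parameters (timeBetweenEvictionRunsMillis and minEvictableIdleTimeMillis) were configured. It is worth reading the official commons-pool2 documentation and knowing the common parameters well, so that problems can be located quickly when they arise.

The default connection pool parameters fall somewhat short for high-traffic services and need tuning for that scenario. If traffic is modest, the defaults are fine.

Analyze each problem on its own terms. When an approach does not solve the problem, change your line of thinking and try other ways of getting at it.

Author: vivo Internet Server Team - Wang Shaodong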