Fault Tolerance in Microservice Architecture: Hystrix

jiezi

5 years ago

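Source: http://www.liangsonghua.me
About the author: Liang Songhua (梁松华), senior engineer at JD.com, with a long-standing focus on stability assurance, agile development, advanced Java, and microservice architecture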

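I. Why fault tolerance is necessary
Suppose a monolithic application has 99.99% availability. Even if every microservice still has 99.99% availability after the split, overall availability goes down, because a request that passes through several services succeeds only if all of them do. Every dependency can fail, every resource is limited, and the network is unreliable. High latency in one seemingly insignificant microservice can end up making the whole service unavailable.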
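The drop in availability can be estimated with a little arithmetic. The sketch below assumes, purely for illustration, that a request must pass through every service in a serial chain and that failures are independent; the class and method names are hypothetical.

```java
class AvailabilityEstimate {

    /** Availability of a request that must succeed in every one of serviceCount services. */
    static double serialAvailability(double perServiceAvailability, int serviceCount) {
        return Math.pow(perServiceAvailability, serviceCount);
    }

    public static void main(String[] args) {
        // Assumed chain lengths, chosen only to show the trend
        for (int n : new int[] {1, 10, 30, 100}) {
            double availability = serialAvailability(0.9999, n);
            System.out.printf("%3d services -> %.4f%% available%n", n, availability * 100);
        }
    }
}
```

Under these assumptions, a chain of 30 services at 99.99% each comes out at roughly 99.7% overall.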

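II. Basic building blocks of fault tolerance
1. Active timeouts, typically set to 2 or 5 seconds.
2. Service degradation: typically fall back to a static page served from a CDN, or show a message such as "this event is too popular right now", so that the user never sees a blank page.
3. Rate limiting, typically using a token mechanism to cap the maximum concurrency.
4. Isolation: isolate each dependency from the others. Pinning a container to specific CPU cores is one form of isolation.
5. Elastic circuit breaking: once the number of errors reaches a threshold, start rejecting requests, and accept them again after a health check finds that the dependency has recovered.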
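The first two building blocks, an active timeout and degradation to a fallback, can be sketched with the JDK alone. This is a minimal illustration rather than how Hystrix implements them; the class name, the helper slowCall, and the timeout values are assumptions.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

class TimeoutWithFallback {

    /**
     * Runs the call on another thread and waits at most timeoutMillis for it.
     * On timeout or failure, returns the fallback value instead of propagating the error.
     */
    static <T> T callWithTimeout(Supplier<T> call, long timeoutMillis, T fallback) {
        CompletableFuture<T> future = CompletableFuture.supplyAsync(call);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } catch (Exception e) {
            // TimeoutException (too slow) or ExecutionException (the call threw):
            // stop waiting and degrade. Note that cancel() does not interrupt the worker thread.
            future.cancel(true);
            return fallback;
        }
    }

    /** Hypothetical dependency that takes the given time to answer. */
    static String slowCall(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "real page";
    }

    public static void main(String[] args) {
        System.out.println(callWithTimeout(() -> slowCall(10), 2_000, "static fallback page"));
        System.out.println(callWithTimeout(() -> slowCall(3_000), 100, "static fallback page"));
    }
}
```

The caller always gets an answer within the timeout, which is the property that keeps one slow dependency from stalling the whole request.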
III. Key Hystrix concepts
[Figure: Hystrix flow]

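To use Hystrix, extend HystrixCommand or HystrixObservableCommand and override the method that holds the business logic. The difference is that HystrixCommand.run() returns a single result or throws an exception, while HystrixObservableCommand.construct() returns an Observable.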

Editor's note: for more on reactive programming, see the article "Task orchestration with Flux reactive programming and multithreading".

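Hystrix actually runs the command logic through one of execute(), queue(), observe(), or toObservable(). execute() is synchronous and blocks the caller. queue() is asynchronous and returns a Future, implemented via myObservable.toList().toBlocking().toFuture(). observe() starts executing before any subscriber is registered, whereas toObservable() starts executing only once a subscriber is registered. The last two are reactive, publish-subscribe style calls.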
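The relationship between the blocking and the Future-based call can be shown with a small stand-in class. MiniCommand and GreetingCommand below are hypothetical and use only the JDK; they are not Hystrix classes and leave out observe() and toObservable() entirely. The sketch only shows the shape in which the synchronous call is the asynchronous call followed by a blocking get().

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

abstract class MiniCommand<T> {
    // Daemon threads so that the demo JVM can exit without an explicit shutdown
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "mini-command");
        t.setDaemon(true);
        return t;
    });

    /** Business logic, analogous in role to HystrixCommand.run(). */
    protected abstract T run() throws Exception;

    /** Asynchronous: returns immediately with a Future. */
    Future<T> queue() {
        Callable<T> task = this::run;
        return POOL.submit(task);
    }

    /** Synchronous: submit the task, then block until the result is available. */
    T execute() {
        try {
            return queue().get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }
}

class GreetingCommand extends MiniCommand<String> {
    private final String name;

    GreetingCommand(String name) {
        this.name = name;
    }

    @Override
    protected String run() {
        return "Hello " + name + "!";
    }

    public static void main(String[] args) throws Exception {
        System.out.println(new GreetingCommand("World").execute()); // blocks
        Future<String> pending = new GreetingCommand("Bob").queue(); // returns at once
        System.out.println(pending.get());
    }
}
```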

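By default each circuit breaker maintains 10 buckets, one per second, and each bucket records the number of successes, failures, timeouts, and rejections. By default the circuit trips only when the error rate reaches 50% and at least 20 requests have arrived within the 10-second window. While the circuit is open it maintains a sleep window; each time the window elapses, one request is let through to probe the health of the backend service, and if the service has recovered the circuit breaker returns to the closed state.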
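The rolling buckets and the open, probe, close cycle can be sketched as follows. This is a simplified illustration, not Hystrix's implementation: SimpleCircuitBreaker is a hypothetical class, it tracks only success and failure per bucket, the clock is injected so the behavior can be checked without real waiting, and the 5-second sleep window is an assumed value that the text above does not state.

```java
import java.util.Arrays;
import java.util.function.LongSupplier;

class SimpleCircuitBreaker {
    private static final int BUCKETS = 10;                 // 10 buckets of one second each
    private static final long BUCKET_MILLIS = 1_000;
    private static final int MIN_REQUESTS = 20;            // volume needed before tripping
    private static final int ERROR_PERCENT = 50;           // error rate that trips the circuit
    private static final long SLEEP_WINDOW_MILLIS = 5_000; // assumed probe interval

    private final LongSupplier clock;                      // current time in milliseconds
    private final long[] bucketSecond = new long[BUCKETS]; // which second each slot holds
    private final int[] total = new int[BUCKETS];
    private final int[] failed = new int[BUCKETS];
    private boolean open;
    private long lastProbeAt;

    SimpleCircuitBreaker(LongSupplier clock) {
        this.clock = clock;
    }

    /** Closed: always allow. Open: allow one probe request per sleep window. */
    synchronized boolean allowRequest() {
        if (!open) {
            return true;
        }
        long now = clock.getAsLong();
        if (now - lastProbeAt >= SLEEP_WINDOW_MILLIS) {
            lastProbeAt = now;
            return true;
        }
        return false;
    }

    /** Reports the outcome of a request that was allowed through. */
    synchronized void record(boolean success) {
        long now = clock.getAsLong();
        if (open) {
            if (success) {          // the probe succeeded: close and start counting afresh
                open = false;
                Arrays.fill(total, 0);
                Arrays.fill(failed, 0);
            }
            return;                 // a failed probe leaves the circuit open
        }
        long second = now / BUCKET_MILLIS;
        int slot = (int) (second % BUCKETS);
        if (bucketSecond[slot] != second) { // slot holds an expired second: reuse it
            bucketSecond[slot] = second;
            total[slot] = 0;
            failed[slot] = 0;
        }
        total[slot]++;
        if (!success) {
            failed[slot]++;
        }
        int sumTotal = 0;
        int sumFailed = 0;
        for (int i = 0; i < BUCKETS; i++) {
            if (second - bucketSecond[i] < BUCKETS) { // only buckets inside the window
                sumTotal += total[i];
                sumFailed += failed[i];
            }
        }
        if (sumTotal >= MIN_REQUESTS && sumFailed * 100 >= ERROR_PERCENT * sumTotal) {
            open = true;
            lastProbeAt = now;
        }
    }
}
```

A caller would check allowRequest() before each call, run the fallback when it returns false, and report the outcome with record() otherwise.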

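A fallback is triggered whenever any of the following occurs: the circuit breaker is open, the thread pool rejects the task, a semaphore permit cannot be acquired, execution throws an exception, or execution times out. Hystrix provides two ways to supply a fallback: HystrixCommand.getFallback() and HystrixObservableCommand.resumeWithFallback().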

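IV. Thread and semaphore isolation

1. Thread isolation: create a separate thread pool for each service dependency.
2. Semaphore isolation: in essence a shared lock. With a plain semaphore, a thread acquires a permit when one is available (semaphore.acquire()) and otherwise must wait until a permit becomes available. A thread must release the permit when it is done (semaphore.release()), or other threads will wait forever. In Hystrix, a failed acquisition is rejected immediately and triggers the fallback instead of waiting.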
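Semaphore isolation with immediate rejection can be sketched with java.util.concurrent.Semaphore. SemaphoreBulkhead is a hypothetical class, not part of Hystrix; it uses tryAcquire() so that a caller who finds no free permit gets the fallback at once instead of blocking.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Runs action on the calling thread if a permit is free, otherwise returns the fallback. */
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get();   // rejected: do not wait for a permit
        }
        try {
            return action.get();
        } finally {
            permits.release();       // always give the permit back, even on exceptions
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }

    public static void main(String[] args) {
        SemaphoreBulkhead bulkhead = new SemaphoreBulkhead(1);
        // The inner call runs while the only permit is held, so it is rejected
        String result = bulkhead.call(
                () -> "outer + " + bulkhead.call(() -> "inner", () -> "rejected"),
                () -> "rejected");
        System.out.println(result);
    }
}
```

Unlike thread isolation, the work still runs on the caller's thread, so this approach limits concurrency but cannot by itself enforce a timeout.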

V. Main Hystrix configuration options

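VI. Usage
1. Request context. The request cache and request collapsing described below both depend on the request context, which we can manage in an interceptor such as a servlet filter: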

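import java.io.IOException;
import javax.servlet.*;
import com.netflix.hystrix.strategy.concurrency.HystrixRequestContext;

public class HystrixRequestContextServletFilter implements Filter {
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        // Open a Hystrix request context for the lifetime of this HTTP request
        HystrixRequestContext context = HystrixRequestContext.initializeContext();
        try {
            chain.doFilter(request, response);
        } finally {
            context.shutdown(); // always release the context, even if the request fails
        }
    }
}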

2. Request cache. This reduces the cost of hitting the backend repeatedly with identical parameters. Override getCacheKey() to return the cache key.

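import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class CommandUsingRequestCache extends HystrixCommand<Boolean> {

    private final int value;

    public CommandUsingRequestCache(int value) {
        super(HystrixCommandGroupKey.Factory.asKey("ExampleGroup"));
        this.value = value;
    }

    @Override
    protected Boolean run() {
        // Stand-in for a backend call: true for zero and even values
        return value == 0 || value % 2 == 0;
    }

    @Override
    protected String getCacheKey() {
        // Commands with the same key in the same request context share one result
        return String.valueOf(value);
    }
}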
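The effect of a request-scoped cache can be illustrated without Hystrix. RequestScopedCache below is a hypothetical class that memoizes a backend call per key; the assumption is that one instance is created per request and discarded when the request ends, which mirrors the role of the request context.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

class RequestScopedCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> backend;
    private final AtomicInteger backendCalls = new AtomicInteger();

    RequestScopedCache(Function<K, V> backend) {
        this.backend = backend;
    }

    /** Calls the backend only the first time a key is seen within this request. */
    V get(K key) {
        return cache.computeIfAbsent(key, k -> {
            backendCalls.incrementAndGet();
            return backend.apply(k);
        });
    }

    int backendCalls() {
        return backendCalls.get();
    }

    public static void main(String[] args) {
        RequestScopedCache<Integer, Boolean> cache =
                new RequestScopedCache<>(value -> value == 0 || value % 2 == 0);
        cache.get(2);
        cache.get(2); // served from the cache
        cache.get(3);
        System.out.println("backend calls: " + cache.backendCalls());
    }
}
```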

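3. Request collapsing. Merging requests is also common when Nginx serves static resources, where it is provided by the nginx-http-concat extension module. In Hystrix, however, collapsing adds latency, so the requests must start within a sufficiently short interval of each other to keep the wait for a batch small. By default, requests more than 10 ms apart are not collapsed together.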

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import com.netflix.hystrix.HystrixCollapser;
import com.netflix.hystrix.HystrixCollapser.CollapsedRequest;
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandKey;

public class CommandCollapserGetValueForKey extends HystrixCollapser<List<String>, String, Integer> {

    private final Integer key;

    public CommandCollapserGetValueForKey(Integer key) {
        this.key = key;
    }

    @Override
    public Integer getRequestArgument() {
        return key;
    }

    @Override
    protected HystrixCommand<List<String>> createCommand(final Collection<CollapsedRequest<String, Integer>> requests) {
        // One batch command serves every request collected in the window
        return new BatchCommand(requests);
    }

    @Override
    protected void mapResponseToRequests(List<String> batchResponse, Collection<CollapsedRequest<String, Integer>> requests) {
        // Responses are matched to requests by position
        int count = 0;
        for (CollapsedRequest<String, Integer> request : requests) {
            request.setResponse(batchResponse.get(count++));
        }
    }

    private static final class BatchCommand extends HystrixCommand<List<String>> {
        private final Collection<CollapsedRequest<String, Integer>> requests;

        private BatchCommand(Collection<CollapsedRequest<String, Integer>> requests) {
            super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("ExampleGroup"))
                    .andCommandKey(HystrixCommandKey.Factory.asKey("GetValueForKey")));
            this.requests = requests;
        }

        @Override
        protected List<String> run() {
            ArrayList<String> response = new ArrayList<String>();
            for (CollapsedRequest<String, Integer> request : requests) {
                // artificial response for each argument received in the batch
                response.add("ValueForKey:" + request.getArgument());
            }
            return response;
        }
    }
}
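The trade-off behind collapsing, fewer backend calls in exchange for a short wait, can be seen in a JDK-only sketch. SimpleCollapser is a hypothetical class, not the Hystrix collapser: the first request in an empty window starts a timer, and every request that arrives before the timer fires is sent in the same batch. The batch function is assumed to return results in the same order as its arguments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

class SimpleCollapser<A, R> {
    private final long windowMillis;
    private final Function<List<A>, List<R>> batchCall;
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "collapser-timer");
        t.setDaemon(true);
        return t;
    });
    private final AtomicInteger batchesSent = new AtomicInteger();
    private List<A> pendingArgs = new ArrayList<>();
    private List<CompletableFuture<R>> pendingFutures = new ArrayList<>();

    SimpleCollapser(long windowMillis, Function<List<A>, List<R>> batchCall) {
        this.windowMillis = windowMillis;
        this.batchCall = batchCall;
    }

    /** Queues one request; its future completes when the batch it joined has run. */
    synchronized CompletableFuture<R> submit(A arg) {
        if (pendingArgs.isEmpty()) {
            // First request of a new window: schedule the flush that closes the window
            timer.schedule(this::flush, windowMillis, TimeUnit.MILLISECONDS);
        }
        CompletableFuture<R> future = new CompletableFuture<>();
        pendingArgs.add(arg);
        pendingFutures.add(future);
        return future;
    }

    private void flush() {
        List<A> args;
        List<CompletableFuture<R>> futures;
        synchronized (this) {        // swap out the pending lists so new requests start a new window
            args = pendingArgs;
            futures = pendingFutures;
            pendingArgs = new ArrayList<>();
            pendingFutures = new ArrayList<>();
        }
        try {
            batchesSent.incrementAndGet();
            List<R> results = batchCall.apply(args);
            for (int i = 0; i < futures.size(); i++) {
                futures.get(i).complete(results.get(i)); // map responses back by position
            }
        } catch (RuntimeException e) {
            for (CompletableFuture<R> future : futures) {
                future.completeExceptionally(e);
            }
        }
    }

    int batchesSent() {
        return batchesSent.get();
    }
}
```

Every caller waits up to one window before the batch is even sent, which is why the window has to be kept short.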

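4. Fail fast. No fallback logic runs and the exception is thrown directly; this is typically used for non-idempotent write operations. Idempotence means that requesting a resource once or many times has the same side effects. Take the withdrawal operation bool take(ticket_id, account_id, amount): whenever a request fails or times out, the caller can safely retry. Remove the ticket_id parameter, however, and the operation is no longer idempotent. Note: retrying is easy to implement in Hystrix. In the fallback, check isCircuitBreakerOpen and retry only while the circuit breaker is still closed, so that retries do not make matters worse.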

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class CommandThatFailsFast extends HystrixCommand<String> {

    private final boolean throwException;

    public CommandThatFailsFast(boolean throwException) {
        super(HystrixCommandGroupKey.Factory.asKey("ExampleGroup"));
        this.throwException = throwException;
    }

    // No getFallback() override, so a failure in run() propagates to the caller
    @Override
    protected String run() {
        if (throwException) {
            throw new RuntimeException("failure from CommandThatFailsFast");
        } else {
            return "success";
        }
    }
}
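The role of ticket_id in making the withdrawal safe to retry can be sketched as follows. IdempotentAccount is a hypothetical in-memory class that models a single account, so the account_id parameter is dropped; it remembers the outcome of each ticket and replays it on a retry instead of withdrawing again.

```java
import java.util.HashMap;
import java.util.Map;

class IdempotentAccount {
    private long balance;
    private final Map<String, Boolean> processedTickets = new HashMap<>();

    IdempotentAccount(long initialBalance) {
        this.balance = initialBalance;
    }

    /** Idempotent: a retry with the same ticketId replays the recorded outcome. */
    synchronized boolean take(String ticketId, long amount) {
        Boolean previous = processedTickets.get(ticketId);
        if (previous != null) {
            return previous;
        }
        boolean ok = withdraw(amount);
        processedTickets.put(ticketId, ok);
        return ok;
    }

    /** Not idempotent: every call withdraws again, so a blind retry double-charges. */
    synchronized boolean takeWithoutTicket(long amount) {
        return withdraw(amount);
    }

    private boolean withdraw(long amount) {
        if (amount <= 0 || amount > balance) {
            return false;
        }
        balance -= amount;
        return true;
    }

    synchronized long balance() {
        return balance;
    }

    public static void main(String[] args) {
        IdempotentAccount account = new IdempotentAccount(100);
        account.take("ticket-1", 30);
        account.take("ticket-1", 30); // retry of the same ticket: no second withdrawal
        System.out.println("after retried take: " + account.balance());
        account.takeWithoutTicket(30);
        account.takeWithoutTicket(30); // retry without a ticket: withdrawn twice
        System.out.println("after retried takeWithoutTicket: " + account.balance());
    }
}
```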
