Among the many customer cases that adopt the Istio service mesh, circuit breaking is one of the most common traffic-management scenarios. Before turning to a mesh, some customers had already implemented circuit breaking in their Java services with Resilience4j. By comparison, Istio can enable circuit breaking at the network level, without having to integrate it into each service's application code.

Understanding in depth how the circuit breaker behaves in different scenarios is a key prerequisite before applying it to a production environment.

Introduction

To enable circuit breaking, you create a destination rule that configures circuit breaking for the target service. The parameters related to circuit breaking are defined under connectionPool; the relevant settings are listed below (a combined example follows this list):

  • tcp.maxConnections: the maximum number of HTTP/1 or TCP connections to a destination host. The default is 2³²-1.
  • http.http1MaxPendingRequests: the maximum number of requests that can be queued while waiting for a ready connection from the connection pool. The default is 1024.
  • http.http2MaxRequests: the maximum number of active requests to a backend service. The default is 1024.
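For reference, a minimal DestinationRule that sets all three parameters could look like the sketch below. The values are purely illustrative; the rules actually used in the experiments are introduced step by step later in this article:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: circuit-breaker-sample-server
spec:
  host: circuit-breaker-sample-server
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 5          # max HTTP/1.1 or TCP connections to the destination
      http:
        http1MaxPendingRequests: 1 # max requests queued while waiting for a connection
        http2MaxRequests: 10       # max active requests to the destination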

These parameters are straightforward in a simple scenario, such as one client and one target service instance (in a Kubernetes environment, an instance corresponds to a pod). In a production environment, however, the scenarios you are more likely to encounter are:

  • one client instance and multiple target service instances
  • multiple client instances and a single target service instance
  • multiple instances of both the client and the target service

We created two Python scripts: one represents the target service, the other is the client that calls it. The server script is a simple Flask application that exposes an API endpoint which sleeps for 5 seconds and then returns a "hello world!" string. The sample code is shown below:

#! /usr/bin/env python3
from flask import Flask
import time

app = Flask(__name__)

@app.route('/hello')
def get():
    # Simulate a slow backend: every request takes about 5 seconds
    time.sleep(5)
    return 'hello world!'

if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0', port='9080', threaded=True)

The client script calls the server endpoint in batches of 10, i.e. 10 parallel requests, and then sleeps for a while before sending the next batch of 10. It does this in an infinite loop. To make sure that when we run multiple client pods they all send their batches at the same time, the batches are sent based on the system clock (at the 0th, 20th and 40th second of every minute).

#! /usr/bin/env python3
import requests
import time
import sys
from datetime import datetime
import _thread

def timedisplay(t):
  return t.strftime("%H:%M:%S")

def get(url):
  try:
    stime = datetime.now()
    start = time.time()
    response = requests.get(url)
    etime = datetime.now()
    end = time.time()
    elapsed = end - start
    sys.stderr.write("Status: " + str(response.status_code) + ", Start: " + timedisplay(stime) + ", End: " + timedisplay(etime) + ", Elapsed Time: " + str(elapsed) + "\n")
    sys.stdout.flush()
  except Exception as myexception:
    sys.stderr.write("Exception: " + str(myexception) + "\n")
    sys.stdout.flush()

time.sleep(30)

while True:
  # Only fire a batch at the 0th, 20th and 40th second of each minute,
  # so that multiple client pods stay in sync
  sc = int(datetime.now().strftime('%S'))
  time_range = [0, 20, 40]
  if sc not in time_range:
    time.sleep(1)
    continue

  sys.stderr.write("\n----------Info----------\n")
  sys.stdout.flush()

  # Send 10 requests in parallel
  for i in range(10):
    _thread.start_new_thread(get, ("http://circuit-breaker-sample-server:9080/hello", ))

  time.sleep(2)

Deploying the sample application

Deploy the sample application with the following YAML:

##################################################################################################
#  circuit-breaker-sample-server services
##################################################################################################
apiVersion: v1
kind: Service
metadata:
  name: circuit-breaker-sample-server
  labels:
    app: circuit-breaker-sample-server
    service: circuit-breaker-sample-server
spec:
  ports:
  - port: 9080
    name: http
  selector:
    app: circuit-breaker-sample-server
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: circuit-breaker-sample-server
  labels:
    app: circuit-breaker-sample-server
    version: v1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: circuit-breaker-sample-server
      version: v1
  template:
    metadata:
      labels:
        app: circuit-breaker-sample-server
        version: v1
    spec:
      containers:
      - name: circuit-breaker-sample-server
        image: registry.cn-hangzhou.aliyuncs.com/acs/istio-samples:circuit-breaker-sample-server.v1
        imagePullPolicy: Always
        ports:
        - containerPort: 9080
---
##################################################################################################
#  circuit-breaker-sample-client services
##################################################################################################
apiVersion: apps/v1
kind: Deployment
metadata:
  name: circuit-breaker-sample-client
  labels:
    app: circuit-breaker-sample-client
    version: v1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: circuit-breaker-sample-client
      version: v1
  template:
    metadata:
      labels:
        app: circuit-breaker-sample-client
        version: v1
    spec:
      containers:
      - name: circuit-breaker-sample-client
        image: registry.cn-hangzhou.aliyuncs.com/acs/istio-samples:circuit-breaker-sample-client.v1
        imagePullPolicy: Always
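Assuming the manifest above is saved to a local file (the name circuit-breaker-sample.yaml below is just an example) and the target namespace has sidecar injection enabled, it can be applied with:

kubectl apply -f circuit-breaker-sample.yaml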

Once they start, you can see that a pod has been launched for the client and for the server respectively, similar to the following:

> kubectl get po |grep circuit
circuit-breaker-sample-client-d4f64d66d-fwrh4   2/2     Running   0             1m22s
circuit-breaker-sample-server-6d6ddb4b-gcthv    2/2     Running   0             1m22s

With no destination rule limits defined, the server can handle the 10 concurrent client requests, so every response from the server is 200. The client-side log should look similar to the following:

----------Info----------
Status: 200, Start: 02:39:20, End: 02:39:25, Elapsed Time: 5.016539812088013
Status: 200, Start: 02:39:20, End: 02:39:25, Elapsed Time: 5.012614488601685
Status: 200, Start: 02:39:20, End: 02:39:25, Elapsed Time: 5.015984535217285
Status: 200, Start: 02:39:20, End: 02:39:25, Elapsed Time: 5.015599012374878
Status: 200, Start: 02:39:20, End: 02:39:25, Elapsed Time: 5.012874364852905
Status: 200, Start: 02:39:20, End: 02:39:25, Elapsed Time: 5.018714904785156
Status: 200, Start: 02:39:20, End: 02:39:25, Elapsed Time: 5.010422468185425
Status: 200, Start: 02:39:20, End: 02:39:25, Elapsed Time: 5.012431621551514
Status: 200, Start: 02:39:20, End: 02:39:25, Elapsed Time: 5.011001348495483
Status: 200, Start: 02:39:20, End: 02:39:25, Elapsed Time: 5.01432466506958

Enabling the circuit breaking rule

To enable a circuit breaking rule with the service mesh, all you need to do is define a corresponding DestinationRule for the target service. Create and apply the following destination rule:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: circuit-breaker-sample-server
spec:
  host: circuit-breaker-sample-server
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 5

It limits the number of TCP connections to the target service to 5. Let's see how it behaves in the different scenarios.

Scenario #1: one client instance and one target service instance

In this case, both the client and the target service have a single pod. When we start the client pod and monitor its log (restarting the client is recommended so that the statistics are easier to read), we see something like the following:

----------Info----------
Status: 200, Start: 02:49:40, End: 02:49:45, Elapsed Time: 5.0167787075042725
Status: 200, Start: 02:49:40, End: 02:49:45, Elapsed Time: 5.011920690536499
Status: 200, Start: 02:49:40, End: 02:49:45, Elapsed Time: 5.017078161239624
Status: 200, Start: 02:49:40, End: 02:49:45, Elapsed Time: 5.018405437469482
Status: 200, Start: 02:49:40, End: 02:49:45, Elapsed Time: 5.018689393997192
Status: 200, Start: 02:49:40, End: 02:49:50, Elapsed Time: 10.018936395645142
Status: 200, Start: 02:49:40, End: 02:49:50, Elapsed Time: 10.016417503356934
Status: 200, Start: 02:49:40, End: 02:49:50, Elapsed Time: 10.019930601119995
Status: 200, Start: 02:49:40, End: 02:49:50, Elapsed Time: 10.022735834121704
Status: 200, Start: 02:49:40, End: 02:49:55, Elapsed Time: 15.02303147315979

You can see that all requests succeeded. However, only 5 requests in each batch had a response time of about 5 seconds; the rest were much slower (mostly 10 seconds or more). This means that relying on tcp.maxConnections alone causes the excess requests to queue up while they wait for a connection to be freed, and by default a very large number of requests can be queued. To get true circuit-breaking (i.e. fail-fast) behavior, we also need to set http.http1MaxPendingRequests to limit the number of requests that can be queued. Its default value is 1024. Interestingly, if we set it to 0, it falls back to the default value, so we must set it to at least 1. Let's update the destination rule to allow only 1 pending request:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: circuit-breaker-sample-server
spec:
  host: circuit-breaker-sample-server
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 5
      http:
        http1MaxPendingRequests: 1

Then restart the client pod (be sure to restart the client, otherwise the statistics will be skewed) and keep watching the log. It looks similar to the following:

----------Info----------
Status: 503, Start: 02:56:40, End: 02:56:40, Elapsed Time: 0.005339622497558594
Status: 503, Start: 02:56:40, End: 02:56:40, Elapsed Time: 0.007254838943481445
Status: 503, Start: 02:56:40, End: 02:56:40, Elapsed Time: 0.0044133663177490234
Status: 503, Start: 02:56:40, End: 02:56:40, Elapsed Time: 0.008964776992797852
Status: 200, Start: 02:56:40, End: 02:56:45, Elapsed Time: 5.018309116363525
Status: 200, Start: 02:56:40, End: 02:56:45, Elapsed Time: 5.017424821853638
Status: 200, Start: 02:56:40, End: 02:56:45, Elapsed Time: 5.019804954528809
Status: 200, Start: 02:56:40, End: 02:56:45, Elapsed Time: 5.01643180847168
Status: 200, Start: 02:56:40, End: 02:56:45, Elapsed Time: 5.025975227355957
Status: 200, Start: 02:56:40, End: 02:56:50, Elapsed Time: 10.01716136932373

As you can see, 4 requests were throttled immediately, 5 requests were sent to the target service, and 1 request was queued. This is the expected behavior. You can also see that the client's Istio proxy has 5 active connections to the target service's pod:

kubectl exec $(kubectl get pod --selector app=circuit-breaker-sample-client --output jsonpath='{.items[0].metadata.name}') -c istio-proxy -- curl -X POST http://localhost:15000/clusters | grep circuit-breaker-sample-server | grep cx_active

outbound|9080||circuit-breaker-sample-server.default.svc.cluster.local::172.20.192.124:9080::cx_active::5

Scenario #2: one client instance and multiple target service instances

Now let's run the test in a scenario with one client instance and multiple target service pods.

First, we need to scale the target service deployment to multiple replicas (say, 3):

kubectl scale deployment/circuit-breaker-sample-server  --replicas=3

The two possibilities to verify here are:

  • the connection limit is applied at the pod level: at most 5 connections to each pod of the target service; or
  • it is applied at the service level: at most 5 connections in total, regardless of the number of pods in the target service.

In case (1) we should see no throttling or queuing, because the maximum number of allowed connections would be 15 (3 pods, 5 connections each). Since we only send 10 requests at a time, all of them should succeed and return in about 5 seconds.

In case (2) we should see roughly the same behavior as in scenario #1.

Let's start the client pod again and monitor the log:

----------Info----------
Status: 503, Start: 03:06:20, End: 03:06:20, Elapsed Time: 0.011791706085205078
Status: 503, Start: 03:06:20, End: 03:06:20, Elapsed Time: 0.0032286643981933594
Status: 503, Start: 03:06:20, End: 03:06:20, Elapsed Time: 0.012153387069702148
Status: 503, Start: 03:06:20, End: 03:06:20, Elapsed Time: 0.011871814727783203
Status: 200, Start: 03:06:20, End: 03:06:25, Elapsed Time: 5.012892484664917
Status: 200, Start: 03:06:20, End: 03:06:25, Elapsed Time: 5.013102769851685
Status: 200, Start: 03:06:20, End: 03:06:25, Elapsed Time: 5.016939163208008
Status: 200, Start: 03:06:20, End: 03:06:25, Elapsed Time: 5.014261484146118
Status: 200, Start: 03:06:20, End: 03:06:25, Elapsed Time: 5.01246190071106
Status: 200, Start: 03:06:20, End: 03:06:30, Elapsed Time: 10.021712064743042

We still see similar throttling and queuing, which means that increasing the number of target service instances does not raise the limit on the client side. We can therefore conclude that the limit applies at the service level.

After running for a while, we can look at the number of connections the client's Istio proxy has established to each pod of the target service:

outbound|9080||circuit-breaker-sample-server.default.svc.cluster.local::172.20.192.124:9080::cx_active::2
outbound|9080||circuit-breaker-sample-server.default.svc.cluster.local::172.20.192.158:9080::cx_active::2
outbound|9080||circuit-breaker-sample-server.default.svc.cluster.local::172.20.192.26:9080::cx_active::2

The client proxy has 2 active connections to each pod of the target service, that is, 6 in total rather than 5. As mentioned in both the Envoy and Istio documentation, the proxy allows some leeway in the number of connections.
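To double-check what limit the proxy is actually enforcing (as opposed to the momentary number of active connections), the same Envoy admin endpoint used above also reports the cluster's circuit-breaker thresholds. The following is only a sketch along the lines of the earlier command; the exact output lines may vary with the Envoy version:

kubectl exec $(kubectl get pod --selector app=circuit-breaker-sample-client --output jsonpath='{.items[0].metadata.name}') -c istio-proxy -- \
  curl -s http://localhost:15000/clusters | grep circuit-breaker-sample-server | grep max_connections
# Expected to include a line similar to:
# outbound|9080||circuit-breaker-sample-server.default.svc.cluster.local::default_priority::max_connections::5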

Scenario #3: multiple client instances and one target service instance

In this case, we have multiple client pods and only one pod for the target service.

Scale the replicas accordingly:

kubectl scale deployment/circuit-breaker-sample-server --replicas=1
kubectl scale deployment/circuit-breaker-sample-client --replicas=3

Since all Istio proxies operate independently on local information, without coordinating with each other, the expectation for this test is that each client pod will show the behavior of scenario #1: from each pod, 5 requests are sent to the target service immediately, 1 request is queued, and the remaining requests are throttled.

Let's look at the logs to see what actually happened:

Client 1
----------Info----------
Status: 503, Start: 03:10:40, End: 03:10:40, Elapsed Time: 0.008828878402709961
Status: 503, Start: 03:10:40, End: 03:10:40, Elapsed Time: 0.010806798934936523
Status: 503, Start: 03:10:40, End: 03:10:40, Elapsed Time: 0.012855291366577148
Status: 503, Start: 03:10:40, End: 03:10:40, Elapsed Time: 0.004465818405151367
Status: 503, Start: 03:10:40, End: 03:10:40, Elapsed Time: 0.007823944091796875
Status: 503, Start: 03:10:40, End: 03:10:40, Elapsed Time: 0.06221342086791992
Status: 503, Start: 03:10:40, End: 03:10:40, Elapsed Time: 0.06922149658203125
Status: 503, Start: 03:10:40, End: 03:10:40, Elapsed Time: 0.06859922409057617
Status: 200, Start: 03:10:40, End: 03:10:45, Elapsed Time: 5.015282392501831
Status: 200, Start: 03:10:40, End: 03:10:50, Elapsed Time: 9.378434181213379

Client 2
----------Info----------
Status: 503, Start: 03:11:00, End: 03:11:00, Elapsed Time: 0.007795810699462891
Status: 503, Start: 03:11:00, End: 03:11:00, Elapsed Time: 0.00595545768737793
Status: 503, Start: 03:11:00, End: 03:11:00, Elapsed Time: 0.013380765914916992
Status: 503, Start: 03:11:00, End: 03:11:00, Elapsed Time: 0.004278898239135742
Status: 503, Start: 03:11:00, End: 03:11:00, Elapsed Time: 0.010999202728271484
Status: 200, Start: 03:11:00, End: 03:11:05, Elapsed Time: 5.015426874160767
Status: 200, Start: 03:11:00, End: 03:11:05, Elapsed Time: 5.0184690952301025
Status: 200, Start: 03:11:00, End: 03:11:05, Elapsed Time: 5.019806146621704
Status: 200, Start: 03:11:00, End: 03:11:05, Elapsed Time: 5.0175628662109375
Status: 200, Start: 03:11:00, End: 03:11:05, Elapsed Time: 5.031521558761597

Client 3
----------Info----------
Status: 503, Start: 03:13:20, End: 03:13:20, Elapsed Time: 0.012019157409667969
Status: 503, Start: 03:13:20, End: 03:13:20, Elapsed Time: 0.012546539306640625
Status: 503, Start: 03:13:20, End: 03:13:20, Elapsed Time: 0.013760805130004883
Status: 503, Start: 03:13:20, End: 03:13:20, Elapsed Time: 0.014089822769165039
Status: 503, Start: 03:13:20, End: 03:13:20, Elapsed Time: 0.014792442321777344
Status: 503, Start: 03:13:20, End: 03:13:20, Elapsed Time: 0.015463829040527344
Status: 503, Start: 03:13:20, End: 03:13:20, Elapsed Time: 0.01661539077758789
Status: 200, Start: 03:13:20, End: 03:13:20, Elapsed Time: 0.02904224395751953
Status: 200, Start: 03:13:20, End: 03:13:20, Elapsed Time: 0.03912043571472168
Status: 200, Start: 03:13:20, End: 03:13:20, Elapsed Time: 0.06436014175415039

The results show that the number of 503s on each client has increased: only 5 concurrent requests in total are allowed through across the three client pods. Looking at the client proxy access logs for clues, we can observe two different kinds of log entries for the throttled (503) requests. Note that the RESPONSE_FLAGS field contains two different values, UO and URX:

  • UO: upstream overflow (circuit breaking)
  • URX: the request was rejected because the upstream retry limit (HTTP) or the maximum connection attempts (TCP) was reached.
{"authority":"circuit-breaker-sample-server:9080","bytes_received":"0","bytes_sent":"81","downstream_local_address":"192.168.142.207:9080","downstream_remote_address":"172.20.192.31:44610","duration":"0","istio_policy_status":"-","method":"GET","path":"/hello","protocol":"HTTP/1.1","request_id":"d9d87600-cd01-421f-8a6f-dc0ee0ac8ccd","requested_server_name":"-","response_code":"503","response_flags":"UO","route_name":"default","start_time":"2023-02-28T03:14:00.095Z","trace_id":"-","upstream_cluster":"outbound|9080||circuit-breaker-sample-server.default.svc.cluster.local","upstream_host":"-","upstream_local_address":"-","upstream_service_time":"-","upstream_transport_failure_reason":"-","user_agent":"python-requests/2.21.0","x_forwarded_for":"-"}
{"authority":"circuit-breaker-sample-server:9080","bytes_received":"0","bytes_sent":"81","downstream_local_address":"192.168.142.207:9080","downstream_remote_address":"172.20.192.31:43294","duration":"58","istio_policy_status":"-","method":"GET","path":"/hello","protocol":"HTTP/1.1","request_id":"931d080a-3413-4e35-91f4-0c906e7ee565","requested_server_name":"-","response_code":"503","response_flags":"URX","route_name":"default","start_time":"2023-02-28T03:12:20.995Z","trace_id":"-","upstream_cluster":"outbound|9080||circuit-breaker-sample-server.default.svc.cluster.local","upstream_host":"172.20.192.84:9080","upstream_local_address":"172.20.192.31:58742","upstream_service_time":"57","upstream_transport_failure_reason":"-","user_agent":"python-requests/2.21.0","x_forwarded_for":"-"}

Requests with the UO flag were throttled locally by the client proxy. Requests with the URX flag were rejected by the target service's proxy. The values of other fields in the logs, such as DURATION, UPSTREAM_HOST and UPSTREAM_CLUSTER, also confirm this.
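Another way to tell locally throttled requests apart from server-side rejections is to look at the circuit-breaker statistics of the client sidecar, similar to what the official Istio circuit-breaking task does. A minimal sketch, assuming the standard Envoy cluster stats are exposed (the counter values you see will of course depend on how long the test has been running):

kubectl exec $(kubectl get pod --selector app=circuit-breaker-sample-client --output jsonpath='{.items[0].metadata.name}') -c istio-proxy -- \
  pilot-agent request GET stats | grep circuit-breaker-sample-server | grep -E 'overflow|pending'
# upstream_rq_pending_overflow counts requests that overflowed the pending queue
# (throttled locally by the client proxy, i.e. the UO cases), while
# upstream_cx_overflow counts connection attempts stopped by the connection circuit breaker.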

To verify this further, let's also look at the proxy access log on the target service side:

{"authority":"circuit-breaker-sample-server:9080","bytes_received":"0","bytes_sent":"81","downstream_local_address":"172.20.192.84:9080","downstream_remote_address":"172.20.192.31:59510","duration":"0","istio_policy_status":"-","method":"GET","path":"/hello","protocol":"HTTP/1.1","request_id":"7684cbb0-8f1c-44bf-b591-40c3deff6b0b","requested_server_name":"outbound_.9080_._.circuit-breaker-sample-server.default.svc.cluster.local","response_code":"503","response_flags":"UO","route_name":"default","start_time":"2023-02-28T03:14:00.095Z","trace_id":"-","upstream_cluster":"inbound|9080||","upstream_host":"-","upstream_local_address":"-","upstream_service_time":"-","upstream_transport_failure_reason":"-","user_agent":"python-requests/2.21.0","x_forwarded_for":"-"}
{"authority":"circuit-breaker-sample-server:9080","bytes_received":"0","bytes_sent":"81","downstream_local_address":"172.20.192.84:9080","downstream_remote_address":"172.20.192.31:58218","duration":"0","istio_policy_status":"-","method":"GET","path":"/hello","protocol":"HTTP/1.1","request_id":"2aa351fa-349d-4283-a5ea-dc74ecbdff8c","requested_server_name":"outbound_.9080_._.circuit-breaker-sample-server.default.svc.cluster.local","response_code":"503","response_flags":"UO","route_name":"default","start_time":"2023-02-28T03:12:20.996Z","trace_id":"-","upstream_cluster":"inbound|9080||","upstream_host":"-","upstream_local_address":"-","upstream_service_time":"-","upstream_transport_failure_reason":"-","user_agent":"python-requests/2.21.0","x_forwarded_for":"-"}

As expected, there are 503 response codes here, and these are what cause the client proxy to log "response_code":"503" together with "response_flags":"URX".

To sum up, the client proxies send requests according to their connection limit (at most 5 connections per pod) and queue or throttle (with the UO response flag) the excess requests. All three client proxies can send up to 15 concurrent requests at the start of a batch, but only 5 of them succeed, because the target service's proxy is applying the same configuration (at most 5 connections). The target service's proxy accepts only 5 requests and throttles the rest; those requests carry the URX response flag in the client proxy logs.
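The same kind of verification can be done on the target side as well, by inspecting the circuit-breaker counters of the server sidecar. This is only a sketch, assuming the inbound cluster exposes the standard Envoy overflow counters; the exact cluster naming can differ between Istio versions:

kubectl exec $(kubectl get pod --selector app=circuit-breaker-sample-server --output jsonpath='{.items[0].metadata.name}') -c istio-proxy -- \
  pilot-agent request GET stats | grep 'inbound|9080' | grep overflow
# Counters such as upstream_cx_overflow or upstream_rq_pending_overflow increasing here
# would indicate requests rejected by the server-side proxy (the URX cases seen by the client).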


Scenario #4: multiple client instances and multiple target service instances

In the last and probably most common scenario, we have multiple client pods and multiple target service pods. As we increase the number of target service replicas, we should see the overall success rate of the requests go up, since each target proxy can allow 5 concurrent requests (the expected numbers are worked through in the sketch after this list):

  • If we increase the replicas to 2, we should see 10 of the 30 requests generated by all 3 client proxies in one batch succeed. We will still observe throttling on both the client and the target service proxies.
  • If we increase the replicas to 3, we should see 15 successful requests.
  • If we increase the number to 4, we should still see only 15 successful requests. Why? Because the limit on the client proxy applies to the whole target service, regardless of how many replicas it has. So no matter how many replicas there are, each client proxy can make at most 5 concurrent requests to the target service.
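A small back-of-the-envelope calculation captures this reasoning. The sketch below only illustrates the arithmetic described above: it assumes the batch size and per-proxy limit used in this article and ignores the single queued request each proxy may hold.

# Rough model of how many requests in one batch can succeed:
# each client proxy allows at most `limit` concurrent requests to the whole target
# service, and each target service proxy also allows at most `limit` concurrent requests.
def expected_successes(clients, servers, batch_size=10, limit=5):
    sent_by_clients = min(clients * batch_size, clients * limit)  # client-side circuit breaker
    accepted_by_servers = servers * limit                         # server-side circuit breaker
    return min(sent_by_clients, accepted_by_servers)

for servers in (1, 2, 3, 4):
    print(servers, "server replica(s):", expected_successes(clients=3, servers=servers))
# Output: 5, 10, 15 and 15 successful requests respectively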

Summary

On the client side:

  • Each client proxy applies the limit independently. If the limit is 100, each client proxy can have 100 outstanding requests before local throttling kicks in. If N clients call the target service, there can be up to 100*N outstanding requests in total.
  • The client proxy's limit is for the whole target service, not for a single replica of the target service. Even if the target service has 200 active pods, the limit is still 100.

On the target service side:

Each target service proxy applies the limit as well. If the service has 50 active pods, each pod's proxy can have at most 100 outstanding requests from the client proxies before it starts throttling and returning 503.


This article is original content from Alibaba Cloud and may not be reproduced without permission.