Prometheus: Flagger on ASM, Progressive Canary Release Based on Mixerless Telemetry (Part 3): Progressive Canary Release

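As a CNCF member, Weave Flagger provides capabilities for continuous integration and continuous delivery. Flagger groups progressive delivery into three categories:

Canary release (Canary): progressively shifts traffic to the canary version (progressive traffic shifting)
A/B testing (A/B Testing): routes user requests to version A or B based on request information (HTTP headers and cookies traffic routing)
Blue/green release (Blue/Green): switches and mirrors traffic (traffic switching and mirroring)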

This article walks through progressive canary release in practice with Flagger on ASM.

Run the following commands to deploy Flagger (for the complete script, see demo_canary.sh).

alias k="kubectl --kubeconfig $USER_CONFIG"
alias h="helm --kubeconfig $USER_CONFIG"

cp $MESH_CONFIG kubeconfig
k -n istio-system create secret generic istio-kubeconfig --from-file kubeconfig
k -n istio-system label secret istio-kubeconfig istio/multiCluster=true

h repo add flagger https://flagger.app
h repo update
k apply -f $FLAAGER_SRC/artifacts/flagger/crd.yaml
h upgrade -i flagger flagger/flagger --namespace=istio-system \
    --set crd.create=false \
    --set meshProvider=istio \
    --set metricsServer=http://prometheus:9090 \
    --set istio.kubeconfig.secretName=istio-kubeconfig \
    --set istio.kubeconfig.key=kubeconfig
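
The commands above assume that USER_CONFIG, MESH_CONFIG, and FLAAGER_SRC are already set (the experiment script later in this article loads its settings from a file named config). A minimal preflight sketch, assuming a POSIX shell, could check for them before anything touches the cluster. The helper name require_vars is hypothetical and is not part of Flagger or the original scripts:

```shell
#!/usr/bin/env sh
# Hypothetical helper: verify that every named variable is set and non-empty.
# Prints the missing names to stderr and returns 1 if any are missing.
require_vars() {
    missing=""
    for name in "$@"; do
        # Indirect lookup via eval; the names are fixed identifiers, not user input.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            missing="$missing $name"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing required variables:$missing" >&2
        return 1
    fi
    return 0
}
```

For example, calling `require_vars USER_CONFIG MESH_CONFIG FLAAGER_SRC || exit 1` at the top of a script stops early when the configuration has not been loaded.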

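During a canary release, Flagger asks ASM to update the VirtualService that carries the canary traffic configuration. This VirtualService is bound to the Gateway named public-gateway. We therefore create the Gateway configuration file public-gateway.yaml as follows: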

apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: public-gateway
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "*"

Run the following command to deploy the Gateway:

kubectl --kubeconfig "$MESH_CONFIG" apply -f resources_canary/public-gateway.yaml

flagger-loadtester is the application used to probe the canary pods during the canary release phase.

Run the following command to deploy flagger-loadtester:

kubectl --kubeconfig "$USER_CONFIG" apply -k "https://github.com/fluxcd/flagger//kustomize/tester?ref=main"

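We first use the HPA configuration that ships with Flagger (an operations-level HPA). After completing the full flow, we switch to an application-level HPA.

Run the following command to deploy PodInfo and its HPA: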

kubectl --kubeconfig "$USER_CONFIG" apply -k "https://github.com/fluxcd/flagger//kustomize/podinfo?ref=main"

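Canary is the core CRD for canary releases with Flagger; see How it works for details. We first deploy the Canary configuration file podinfo-canary.yaml below to complete a full progressive canary flow, and then build on it by introducing application-level metrics to achieve an application-aware progressive canary release.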

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  # deployment reference
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  # the maximum time in seconds for the canary deployment
  # to make progress before it is rolled back (default 600s)
  progressDeadlineSeconds: 60
  # HPA reference (optional)
  autoscalerRef:
    apiVersion: autoscaling/v2beta2
    kind: HorizontalPodAutoscaler
    name: podinfo
  service:
    # service port number
    port: 9898
    # container port number or name (optional)
    targetPort: 9898
    # Istio gateways (optional)
    gateways:
    - public-gateway.istio-system.svc.cluster.local
    # Istio virtual service host names (optional)
    hosts:
    - '*'
    # Istio traffic policy (optional)
    trafficPolicy:
      tls:
        # use ISTIO_MUTUAL when mTLS is enabled
        mode: DISABLE
    # Istio retry policy (optional)
    retries:
      attempts: 3
      perTryTimeout: 1s
      retryOn: "gateway-error,connect-failure,refused-stream"
  analysis:
    # schedule interval (default 60s)
    interval: 1m
    # max number of failed metric checks before rollback
    threshold: 5
    # max traffic percentage routed to canary
    # percentage (0-100)
    maxWeight: 50
    # canary increment step
    # percentage (0-100)
    stepWeight: 10
    metrics:
    - name: request-success-rate
      # minimum req success rate (non 5xx responses)
      # percentage (0-100)
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration
      # maximum req duration P99
      # milliseconds
      thresholdRange:
        max: 500
      interval: 30s
    # testing (optional)
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -sd 'test' http://podinfo-canary:9898/token | grep token"
      - name: load-test
        url: http://flagger-loadtester.test/
        timeout: 5s
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 http://podinfo-canary.test:9898/"

Run the following command to deploy the Canary:

kubectl --kubeconfig "$USER_CONFIG" apply -f resources_canary/podinfo-canary.yaml

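After the Canary is deployed, Flagger copies the Deployment named podinfo to podinfo-primary and scales podinfo-primary up to the minimum number of pods defined by the HPA. It then gradually scales the Deployment named podinfo down to 0 pods. In other words, podinfo serves as the canary Deployment and podinfo-primary serves as the production Deployment.

At the same time, Flagger creates three Services: podinfo, podinfo-primary, and podinfo-canary. The first two point to the podinfo-primary Deployment, and the last one points to the podinfo Deployment.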

2 Upgrade `podinfo`

Run the following command to upgrade the canary Deployment from version 3.1.0 to 3.1.1:

kubectl --kubeconfig "$USER_CONFIG" -n test set image deployment/podinfo podinfod=stefanprodan/podinfo:3.1.1

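Flagger now starts the progressive canary release flow described in the first article of this series. The main steps are summarized again below:

Gradually scale up the canary pods, then validate
Shift traffic progressively, then validate
Roll the upgrade out to the production Deployment, then validate
Switch 100% of traffic back to production
Scale the canary pods down to 0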
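
The traffic-shifting step can be sketched numerically. Assuming Flagger raises the canary weight by stepWeight once per analysis interval until it reaches maxWeight (the values 10 and 50 come from podinfo-canary.yaml above), the hypothetical helper below prints the weights the canary passes through. It illustrates the arithmetic only and is not Flagger's implementation:

```shell
#!/usr/bin/env sh
# Hypothetical helper: list the canary weights for a given stepWeight and maxWeight.
# Usage: canary_weights STEP_WEIGHT MAX_WEIGHT
canary_weights() {
    step=$1
    max=$2
    if [ "$step" -le 0 ] || [ "$max" -le 0 ]; then
        echo "stepWeight and maxWeight must be positive" >&2
        return 1
    fi
    weight=$step
    out=""
    while [ "$weight" -le "$max" ]; do
        out="$out $weight"
        weight=$((weight + step))
    done
    # Trim the leading space before printing.
    echo "${out# }"
}

canary_weights 10 50   # prints: 10 20 30 40 50
```

With interval: 1m, these five steps suggest that the traffic shift alone takes roughly five minutes when every check passes, which is consistent with the one-minute spacing of the Advance events in the sample output below.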

We can watch the progressive traffic shift with the following command:

while true; do kubectl --kubeconfig "$USER_CONFIG" -n test describe canary/podinfo; sleep 10s;done

The output looks like this:

Events:
  Type     Reason  Age                From     Message
  ----     ------  ----               ----     -------
  Warning  Synced  39m                flagger  podinfo-primary.test not ready: waiting for rollout to finish: observed deployment generation less then desired generation
  Normal   Synced  38m (x2 over 39m)  flagger  all the metrics providers are available!
  Normal   Synced  38m                flagger  Initialization done! podinfo.test
  Normal   Synced  37m                flagger  New revision detected! Scaling up podinfo.test
  Normal   Synced  36m                flagger  Starting canary analysis for podinfo.test
  Normal   Synced  36m                flagger  Pre-rollout check acceptance-test passed
  Normal   Synced  36m                flagger  Advance podinfo.test canary weight 10
  Normal   Synced  35m                flagger  Advance podinfo.test canary weight 20
  Normal   Synced  34m                flagger  Advance podinfo.test canary weight 30
  Normal   Synced  33m                flagger  Advance podinfo.test canary weight 40
  Normal   Synced  29m (x4 over 32m)  flagger  (combined from similar events): Promotion completed! Scaling down podinfo.test
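
The watch loop above runs until it is interrupted. A bounded variant can stop once the rollout ends. The sketch below assumes that the Canary resource reports its state in `.status.phase` and that Succeeded and Failed are the terminal values. The helper names and the default of 180 polls (about 30 minutes) are hypothetical, and the script is not part of the original demo:

```shell
#!/usr/bin/env sh
# Hypothetical helper: succeed when the phase is a terminal one.
is_terminal_phase() {
    case "$1" in
        Succeeded|Failed) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: poll the canary phase every 10 seconds, up to MAX_TRIES times.
# Usage: wait_for_canary NAMESPACE NAME [MAX_TRIES]
wait_for_canary() {
    ns=$1
    name=$2
    tries=${3:-180}
    while [ "$tries" -gt 0 ]; do
        phase=$(kubectl --kubeconfig "$USER_CONFIG" -n "$ns" \
            get "canary/$name" -o jsonpath='{.status.phase}')
        echo "canary/$name phase: ${phase:-unknown}"
        if is_terminal_phase "$phase"; then
            # The exit status reflects the rollout result.
            if [ "$phase" = "Succeeded" ]; then return 0; else return 1; fi
        fi
        tries=$((tries - 1))
        sleep 10
    done
    echo "timed out waiting for canary/$name" >&2
    return 1
}
```

For the example in this article, the call would be `wait_for_canary test podinfo`.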

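[Figure: the corresponding Kiali view (optional)]

This completes a full progressive canary release. What follows is extended reading.

Building on the progressive canary release flow above, let us look at the HPA settings in the Canary configuration: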

  autoscalerRef:
    apiVersion: autoscaling/v2beta2
    kind: HorizontalPodAutoscaler
    name: podinfo

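This HPA, named podinfo, is the configuration that ships with Flagger. It scales out when the CPU utilization of the canary Deployment reaches 99%. The complete configuration is as follows: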

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: podinfo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  minReplicas: 2
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          # scale up if usage is above
          # 99% of the requested CPU (100m)
          averageUtilization: 99

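The previous article covered application-level scaling in practice. Here we apply it to the canary release process.

Run the following command to deploy an HPA that is aware of the application's request count, so that scaling out starts when QPS reaches 10 (for the complete script, see advanced_canary.sh):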

kubectl --kubeconfig "$USER_CONFIG" apply -f resources_hpa/requests_total_hpa.yaml

Accordingly, update the Canary configuration to:

  autoscalerRef:
    apiVersion: autoscaling/v2beta2
    kind: HorizontalPodAutoscaler
    name: podinfo-total
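
To see how a QPS target of 10 turns into a replica count, recall the scaling rule documented for the Kubernetes HPA: desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), clamped to the HPA's minimum and maximum. The sketch below applies that rule with awk. The helper name is hypothetical, and the bounds 1 and 5 are taken from the podinfo-total HPA shown in the experiment output later in this article (requests_total_hpa.yaml itself is not listed here):

```shell
#!/usr/bin/env sh
# Hypothetical helper: apply the HPA scaling rule for an average-value metric.
# Usage: hpa_desired_replicas CURRENT_REPLICAS CURRENT_AVG TARGET_AVG MIN MAX
hpa_desired_replicas() {
    awk -v cur="$1" -v avg="$2" -v target="$3" -v min="$4" -v max="$5" 'BEGIN {
        raw = cur * avg / target
        desired = int(raw)
        if (desired < raw) desired++      # ceil() for positive values
        if (desired < min) desired = min
        if (desired > max) desired = max
        print desired
    }'
}

# One replica serving 35 requests per second against a target of 10 per replica.
hpa_desired_replicas 1 35 10 1 5   # prints: 4
```

The real controller also applies a tolerance and stabilization behavior that this sketch ignores, so observed replica counts can lag behind or differ from the raw formula.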

2 Upgrade `podinfo`

Run the following command to upgrade the canary Deployment from version 3.1.0 to 3.1.1:

kubectl --kubeconfig "$USER_CONFIG" -n test set image deployment/podinfo podinfod=stefanprodan/podinfo:3.1.1

Watch the progressive traffic shift with the following command:

while true; do k -n test describe canary/podinfo; sleep 10s;done

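During the progressive canary release (once the message Advance podinfo.test canary weight 10 appears, see the figure below), we use the following commands to send requests through the ingress gateway and raise the QPS: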

INGRESS_GATEWAY=$(kubectl --kubeconfig $USER_CONFIG -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
hey -z 20m -c 2 -q 10 http://$INGRESS_GATEWAY
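
The jsonpath above reads the load balancer's `ip` field. Some load balancers publish a `hostname` instead, in which case INGRESS_GATEWAY would be empty. A minimal sketch of a fallback follows; both helper names are hypothetical:

```shell
#!/usr/bin/env sh
# Hypothetical helper: prefer the IP, fall back to the hostname, fail if both are empty.
# Usage: pick_gateway_address IP HOSTNAME
pick_gateway_address() {
    if [ -n "$1" ]; then
        echo "$1"
    elif [ -n "$2" ]; then
        echo "$2"
    else
        echo "ingress gateway has no external address yet" >&2
        return 1
    fi
}

# Hypothetical wrapper: query both fields from the istio-ingressgateway Service.
ingress_gateway_address() {
    svc="service/istio-ingressgateway"
    ip=$(kubectl --kubeconfig "$USER_CONFIG" -n istio-system get "$svc" \
        -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    host=$(kubectl --kubeconfig "$USER_CONFIG" -n istio-system get "$svc" \
        -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
    pick_gateway_address "$ip" "$host"
}
```

A guarded invocation could then read `INGRESS_GATEWAY=$(ingress_gateway_address) && hey -z 20m -c 2 -q 10 "http://$INGRESS_GATEWAY"`, so that the load test does not start against an empty address.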

Use the following command to check the progress of the progressive canary release:

watch kubectl --kubeconfig $USER_CONFIG get canaries --all-namespaces

Use the following command to watch the HPA replica count change:

watch kubectl --kubeconfig $USER_CONFIG -n test get hpa/podinfo-total

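The result is shown in the figure below. During the progressive canary release, at one point when 30% of traffic had been shifted, the canary Deployment had 4 replicas:

Building on application-level scaling during the canary release, we finally look at the metrics settings in the Canary configuration: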

  analysis:
    metrics:
    - name: request-success-rate
      # minimum req success rate (non 5xx responses)
      # percentage (0-100)
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration
      # maximum req duration P99
      # milliseconds
      thresholdRange:
        max: 500
      interval: 30s
    # testing (optional)

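So far, the metrics used in the Canary have been Flagger's two built-in metrics: request success rate (request-success-rate) and request duration (request-duration). As shown in the figure below, Flagger defines the built-in metrics differently for each platform; for istio, it uses the Mixerless Telemetry data introduced in the first article of this series.

To show the extra flexibility that telemetry data brings to validating the canary environment during a release, we again take istio_requests_total as an example and create a MetricTemplate named not-found-percentage, which computes the share of requests that return a 404 status code out of all requests.

The configuration file metrics-404.yaml is as follows (for the complete script, see advanced_canary.sh):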

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: not-found-percentage
  namespace: istio-system
spec:
  provider:
    type: prometheus
    address: http://prometheus.istio-system:9090
  query: |
    100 - sum(
        rate(
            istio_requests_total{
              reporter="destination",
              destination_workload_namespace="{{namespace}}",
              destination_workload="{{target}}",
              response_code!="404"
            }[{{interval}}]
        )
    )
    /
    sum(
        rate(
            istio_requests_total{
              reporter="destination",
              destination_workload_namespace="{{namespace}}",
              destination_workload="{{target}}"
            }[{{interval}}]
        )
    ) * 100

Run the following command to create the MetricTemplate above:

k apply -f resources_canary2/metrics-404.yaml

Accordingly, update the metrics settings in the Canary to:

  analysis:
    metrics:
      - name: "404s percentage"
        templateRef:
          name: not-found-percentage
          namespace: istio-system
        thresholdRange:
          max: 5
        interval: 1m
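
The not-found-percentage query computes 100 minus the percentage of non-404 requests, which is the percentage of 404 responses. A worked example with made-up request rates shows how the result compares with the `max: 5` threshold. The numbers and both helper names are hypothetical:

```shell
#!/usr/bin/env sh
# Hypothetical helper: mirror the arithmetic of the not-found-percentage query.
# Usage: not_found_percentage NON_404_RATE TOTAL_RATE
not_found_percentage() {
    awk -v ok="$1" -v total="$2" 'BEGIN {
        if (total <= 0) { print "total rate must be positive" > "/dev/stderr"; exit 1 }
        printf "%.2f\n", 100 - ok / total * 100
    }'
}

# Hypothetical helper: succeed when the value does not exceed the threshold.
within_threshold() {
    awk -v value="$1" -v max="$2" 'BEGIN { exit !(value <= max) }'
}

not_found_percentage 97 100   # prints: 3.00
```

With a threshold of 5, a canary that returns 404 for 3% of requests passes this check, while one that returns 404 for 10% of requests would count as a failed check.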

Finally, we run the complete experiment script in one go. The script advanced_canary.sh looks like this:

#!/usr/bin/env sh
SCRIPT_PATH="$(
    cd "$(dirname "$0")" >/dev/null 2>&1
    pwd -P
)/"
cd "$SCRIPT_PATH" || exit

# Load settings such as USER_CONFIG and MESH_CONFIG; '.' is the POSIX form of 'source'
. ./config
alias k="kubectl --kubeconfig $USER_CONFIG"
alias m="kubectl --kubeconfig $MESH_CONFIG"
alias h="helm --kubeconfig $USER_CONFIG"

echo "#### I Bootstrap ####"
echo "1 Create a test namespace with Istio sidecar injection enabled:"
k delete ns test
m delete ns test
k create ns test
m create ns test
m label namespace test istio-injection=enabled

echo "2 Create a deployment and a horizontal pod autoscaler:"
k apply -f $FLAAGER_SRC/kustomize/podinfo/deployment.yaml -n test
k apply -f resources_hpa/requests_total_hpa.yaml
k get hpa -n test

echo "3 Deploy the load testing service to generate traffic during the canary analysis:"
k apply -k "https://github.com/fluxcd/flagger//kustomize/tester?ref=main"

k get pod,svc -n test
echo "......"
sleep 40s

echo "4 Create a canary custom resource:"
k apply -f resources_canary2/metrics-404.yaml
k apply -f resources_canary2/podinfo-canary.yaml

k get pod,svc -n test
echo "......"
sleep 120s

echo "#### III Automated canary promotion ####"

echo "1 Trigger a canary deployment by updating the container image:"
k -n test set image deployment/podinfo podinfod=stefanprodan/podinfo:3.1.1

echo "2 Flagger detects that the deployment revision changed and starts a new rollout:"

while true; do k -n test describe canary/podinfo; sleep 10s;done

Run the complete experiment script with the following command:

sh progressive_delivery/advanced_canary.sh

The results look like this:


#### I Bootstrap ####
1 Create a test namespace with Istio sidecar injection enabled:
namespace "test" deleted
namespace "test" deleted
namespace/test created
namespace/test created
namespace/test labeled
2 Create a deployment and a horizontal pod autoscaler:
deployment.apps/podinfo created
horizontalpodautoscaler.autoscaling/podinfo-total created
NAME            REFERENCE            TARGETS              MINPODS   MAXPODS   REPLICAS   AGE
podinfo-total   Deployment/podinfo   <unknown>/10 (avg)   1         5         0          0s
3 Deploy the load testing service to generate traffic during the canary analysis:
service/flagger-loadtester created
deployment.apps/flagger-loadtester created
NAME                                      READY   STATUS     RESTARTS   AGE
pod/flagger-loadtester-76798b5f4c-ftlbn   0/2     Init:0/1   0          1s
pod/podinfo-689f645b78-65n9d              1/1     Running    0          28s

NAME                         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
service/flagger-loadtester   ClusterIP   172.21.15.223   <none>        80/TCP    1s
......
4 Create a canary custom resource:
metrictemplate.flagger.app/not-found-percentage created
canary.flagger.app/podinfo created
NAME                                      READY   STATUS    RESTARTS   AGE
pod/flagger-loadtester-76798b5f4c-ftlbn   2/2     Running   0          41s
pod/podinfo-689f645b78-65n9d              1/1     Running   0          68s

NAME                         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
service/flagger-loadtester   ClusterIP   172.21.15.223   <none>        80/TCP    41s
......
#### III Automated canary promotion ####
1 Trigger a canary deployment by updating the container image:
deployment.apps/podinfo image updated
2 Flagger detects that the deployment revision changed and starts a new rollout:

Events:
  Type     Reason  Age                  From     Message
  ----     ------  ----                 ----     -------
  Warning  Synced  10m                  flagger  podinfo-primary.test not ready: waiting for rollout to finish: observed deployment generation less then desired generation
  Normal   Synced  9m23s (x2 over 10m)  flagger  all the metrics providers are available!
  Normal   Synced  9m23s                flagger  Initialization done! podinfo.test
  Normal   Synced  8m23s                flagger  New revision detected! Scaling up podinfo.test
  Normal   Synced  7m23s                flagger  Starting canary analysis for podinfo.test
  Normal   Synced  7m23s                flagger  Pre-rollout check acceptance-test passed
  Normal   Synced  7m23s                flagger  Advance podinfo.test canary weight 10
  Normal   Synced  6m23s                flagger  Advance podinfo.test canary weight 20
  Normal   Synced  5m23s                flagger  Advance podinfo.test canary weight 30
  Normal   Synced  4m23s                flagger  Advance podinfo.test canary weight 40
  Normal   Synced  23s (x4 over 3m23s)  flagger  (combined from similar events): Promotion completed! Scaling down podinfo.test

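Setup Flagger

1 Deploy Flagger

2 Deploy the Gateway

3 Deploy flagger-loadtester

4 Deploy PodInfo and its HPA

Progressive canary release

1 Deploy the Canary

2 Upgrade `podinfo`

3 Progressive canary release

Application-level scaling during the canary release

1 QPS-aware HPA

2 Upgrade `podinfo`

3 Verify the progressive canary release and the HPA

Application-level metrics during the canary release

1 Flagger built-in metrics

2 Custom metrics

3 Final verification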
