On Monitoring: Using Guance (观测云) to Drive Elastic Scaling from Business Data

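Background

When you use Guance to observe a business system, you get more than full visibility into that system. With DataFlux Func, Guance's data processing and development platform, you can combine the observations with a fault model to actively manage the monitored system, for example by scaling out elastically or self-healing from failures. This closes the loop automatically, from observation to recovery.

This article uses a ruoyi system deployed on K8s to demonstrate a simple automatic scale-out of a business system. The self-healing scenario is: "when a fault occurs, try to scale out the container that is misbehaving."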

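Implementation steps

The example scenario requires two steps on the DataFlux Func platform:

First, access and operate the K8s cluster: read the current state of the target application's containers and perform the scale-out.
Second, publish that handler as an API so that the Guance alert center can trigger it.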

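Configure K8s cluster access

A script can authenticate to the K8s api-server in three main ways:

HTTPS certificate authentication: digital certificates signed by a CA
HTTP Token authentication: a token identifies the user
HTTP Basic authentication: a username and password

This example uses HTTP Token authentication, configured as follows: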
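
For reference, the two header-based methods differ only in the value of the HTTP Authorization header. The sketch below builds both values with the standard library. The function names and sample credentials are illustrative assumptions and do not come from the original script.

```python
import base64


def bearer_auth_header(token: str) -> dict:
    """Build the header for token authentication.

    Note the single space between the scheme name and the token.
    """
    return {"Authorization": "Bearer " + token}


def basic_auth_header(username: str, password: str) -> dict:
    """Build the header for HTTP Basic authentication (base64 of 'user:password')."""
    raw = f"{username}:{password}".encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}


if __name__ == "__main__":
    # Placeholder credentials, for illustration only.
    print(bearer_auth_header("my-token"))
    print(basic_auth_header("demo", "secret"))
```

A missing space after "Bearer" is an easy mistake to make when concatenating the token by hand, and it causes the server to reject the request as unauthenticated.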

Create a script on the Func platform

pip3 install kubernetes

Create an account and obtain a token

Create the user

kubectl create serviceaccount demo-admin -n kube-system

Grant permissions

kubectl create clusterrolebinding demo-admin --clusterrole=cluster-admin --serviceaccount=kube-system:demo-admin

Retrieve the user token

kubectl describe secrets -n kube-system $(kubectl -n kube-system get secret | awk '/demo-admin/{print $1}')
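
If the token needs to be picked out of the command output programmatically, a small parser is enough. The sketch below assumes the output contains a line of the form `token:      <value>`; the sample text is made up for illustration and the helper name `extract_token` is hypothetical.

```python
import re
from typing import Optional


def extract_token(describe_output: str) -> Optional[str]:
    """Return the value of the 'token:' field, or None if it is absent."""
    match = re.search(r"^token:\s+(\S+)\s*$", describe_output, flags=re.MULTILINE)
    return match.group(1) if match else None


# Made-up sample shaped like `kubectl describe secrets` output (not real data).
SAMPLE = """\
Name:         demo-admin-token-abcde
Namespace:    kube-system
Type:  kubernetes.io/service-account-token

Data
====
ca.crt:     1066 bytes
namespace:  11 bytes
token:      eyJhbGciOiJSUzI1NiJ9.placeholder.signature
"""

if __name__ == "__main__":
    print(extract_token(SAMPLE))
```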

Configure the client connection

# Build the client configuration for token authentication
configuration = client.Configuration()
configuration.host = KUBE_API_HOSTS                    # api-server address
configuration.verify_ssl = False                       # skips TLS verification; acceptable for a demo only
configuration.debug = True
# Note the space after "Bearer"; without it the api-server rejects the token
configuration.api_key = {"authorization": "Bearer " + API_TOKEN}
client.Configuration.set_default(configuration)

The cluster is now ready to be accessed.

(For the official client library, see https://github.com/kubernetes-client/python)

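Write the scale-out handler script

When a fault occurs, we need some way to obtain information about the object to operate on and pass it to the api-server as parameters. That information is available from the parameters of the Guance alert notification: when the alert center calls the API published by DataFlux Func, it sends the details of the current alert event as parameters. If the fields we need are configured as grouping (aggregation) conditions of the alert, they arrive on the function platform side as well.

In this example we perform the equivalent of kubectl scale, so the required parameters are the deployment name and its current replica count. The alert should therefore use the deployment name as a grouping condition.

Note: if the metric does not carry the label we need (here, the selected metric carries only the pod name and not the owning deployment name), an extra lookup is required: query with the parameters delivered by the alert to determine the object to operate on (the deployment).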

Process the alert parameters

Below is a sample of the alert parameters that the Guance console sends to DataFlux Func:

{
  "Result": 3.8138525999999997,
  "date": 1685630030,
  "df_channels": [],
  "df_check_range_end": 1685629970,
  "df_check_range_start": 1685629910,
  "df_date_range": 60,
  "df_dimension_tags": "{\"namespace\":\"gc-dev\",\"pod_name\":\"ruoyi-system-7f49bc9cc9-dms9r\"}",
  "df_event_id": "event-2cfcdf99f7a4473f84967e88d14cecba",
  "df_event_link": "http://gc-studio.huawei.com/keyevents/monitor?time=1685629130000%2C1685630030000&tags=%7B%22df_event_id%22%3A%22event-2cfcdf99f7a4473f84967e88d14cecba%22%7D&w=wksp_9541fb20bb8a4115b596af0c689613d4",
  "df_event_reason": "\u6ee1\u8db3\u76d1\u63a7\u5668\u4e2d\u6545\u969c\u7684\u8ba4\u5b9a\u6761\u4ef6\uff0c\u4ea7\u751f\u6545\u969c\u4e8b\u4ef6",
  "df_exec_mode": "async",
  "df_issue_duration": 0,
  "df_issue_start_time": 1685630030,
  "df_language": "zh",
  "df_message": "CPU\u5229\u7528\u7387\uff1a3.8138525999999997  \n\u53d1\u751f\u65f6\u95f4\uff1a1685630030",
  "df_monitor_checker": "custom_metric",
  "df_monitor_checker_event_ref": "c240d31146e4e198990734486c27331e",
  "df_monitor_checker_id": "rul_dc585aee34c3466fa0981dd69aecf56a",
  "df_monitor_checker_name": "\u4f4d\u4e8e{{namespace}}\u4e2d\u7684Pod:{{pod_name}}\u4e1a\u52a1\u538b\u529b\u8fc7\u9ad8\uff0c\u8bf7\u5c3d\u5feb\u5904\u7406",
  "df_monitor_checker_ref": "acde60927d12cb3117f43009110dbb96",
  "df_monitor_checker_sub": "check",
  "df_monitor_checker_value": "3.8138525999999997",
  "df_monitor_id": "monitor_e4496b567ce04acb9565dc318b27412d",
  "df_monitor_name": "\u5f39\u6027\u6269\u5bb9demo\u64cd\u4f5c",
  "df_monitor_type": "custom",
  "df_source": "monitor",
  "df_status": "warning",
  "df_sub_status": "warning",
  "df_title": "\u4f4d\u4e8egc-dev\u4e2d\u7684Pod:ruoyi-system-7f49bc9cc9-dms9r\u4e1a\u52a1\u538b\u529b\u8fc7\u9ad8\uff0c\u8bf7\u5c3d\u5feb\u5904\u7406",
  "df_workspace_name": "gc\u6f14\u793a\u7a7a\u95f4",
  "df_workspace_uuid": "wksp_9541fb20bb8a4115b596af0c689613d4",
  "namespace": "gc-dev",
  "pod_name": "ruoyi-system-7f49bc9cc9-dms9r",
  "timestamp": 1685630030,
  "workspace_name": "gc\u6f14\u793a\u7a7a\u95f4",
  "workspace_uuid": "wksp_9541fb20bb8a4115b596af0c689613d4"
}
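
Note that `df_dimension_tags` in this payload is itself a JSON-encoded string, while `namespace` and `pod_name` also appear as top-level keys. The following is a minimal sketch of reading the target from either place. The helper name `get_target` is hypothetical, and the sample event is a trimmed copy of the payload above.

```python
import json


def get_target(event: dict) -> tuple:
    """Return (namespace, pod_name) from an alert event.

    Top-level keys are preferred; the JSON string in `df_dimension_tags`
    is decoded and used as a fallback.
    """
    tags = json.loads(event.get("df_dimension_tags") or "{}")
    namespace = event.get("namespace") or tags.get("namespace")
    pod_name = event.get("pod_name") or tags.get("pod_name")
    if not namespace or not pod_name:
        raise ValueError("alert event carries no namespace/pod_name")
    return namespace, pod_name


# Trimmed copy of the sample payload, keeping only the dimension tags.
SAMPLE_EVENT = {
    "df_status": "warning",
    "df_dimension_tags": "{\"namespace\":\"gc-dev\",\"pod_name\":\"ruoyi-system-7f49bc9cc9-dms9r\"}",
}

if __name__ == "__main__":
    print(get_target(SAMPLE_EVENT))
```

Failing early with a clear error when the fields are missing makes a misconfigured grouping condition easier to spot than a later API error would.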

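Obtain the scale-out parameters

Get the name of the abnormal pod

From the alert parameters above, we can extract the name of the pod that triggered the alert:

# Cache the alert IDs
dfEventId = kwargs.get('df_event_id')
dfMonitorId = kwargs.get('df_monitor_id')

# Extract the pod name and the namespace it belongs to
eventPodName = kwargs.get('pod_name')
eventNameSpace = kwargs.get('namespace')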

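Get the deployment parameters

Next, query the K8s API for the deployment that owns the pod and for its current replica count. This yields the parameters of the deployment to operate on. The code that fetches the pod details is as follows:

# Fetch the specified pod
k8s_api_coreV1 = client.CoreV1Api()

try:
    target_pod = k8s_api_coreV1.read_namespaced_pod(name=eventPodName, namespace=eventNameSpace)
except ApiException as e:
    print("Exception when calling CoreV1Api->read_namespaced_pod: %s\n" % e)
    return  # nothing to scale if the pod cannot be read

# The first owner reference identifies the object that manages the pod
meta_OwnerRef = target_pod.metadata.owner_references[0]

# Name and kind of the owner, used for the follow-up query
k8sObjName = meta_OwnerRef.name
k8sObjKind = meta_OwnerRef.kind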

The result needs to be handled case by case:

If the OwnerReferences.kind of the object is Deployment, the deployment name is available directly.
In most cases, however (for example after the object has gone through several upgrades or rollbacks), the owner is the corresponding ReplicaSet. A second query on the ReplicaSet is then needed to obtain the deployment name. For example:

    deployName = ""

    # API entry point for apps/v1 resources
    query_api = client.AppsV1Api()

    if k8sObjKind == "ReplicaSet":
        try:
            replicaset = query_api.read_namespaced_replica_set(k8sObjName, eventNameSpace)
        except ApiException as e:
            print("Exception when calling AppsV1Api->read_namespaced_replica_set: %s\n" % e)
            return
        # The owner of the ReplicaSet is the Deployment
        deployName = replicaset.metadata.owner_references[0].name

    elif k8sObjKind == "Deployment":
        print("Owner kind is Deployment; using the queried object name directly.")
        deployName = k8sObjName

    else:
        print("Not an object handled by this script; exiting.")
        return
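
The same case analysis can be written as a small loop that follows owner links until it reaches a Deployment. The sketch below is independent of the cluster: `resolve_deployment` and the `fetch_owner` callback are hypothetical names, and the lookup is backed by an in-memory dict rather than real API calls.

```python
from typing import Callable, Optional, Tuple

# (kind, name) pair identifying an owner, or None when there is no owner.
Owner = Optional[Tuple[str, str]]


def resolve_deployment(
    first_owner: Owner,
    fetch_owner: Callable[[str, str], Owner],
    max_depth: int = 3,
) -> Optional[str]:
    """Follow owner links until a Deployment is found.

    fetch_owner(kind, name) must return the (kind, name) of that object's
    owner, or None. Returns the Deployment name, or None if the chain ends
    or contains a kind this sketch does not handle.
    """
    owner = first_owner
    for _ in range(max_depth):
        if owner is None:
            return None
        kind, name = owner
        if kind == "Deployment":
            return name
        if kind != "ReplicaSet":
            return None
        owner = fetch_owner(kind, name)
    return None


if __name__ == "__main__":
    # In-memory stand-in for the cluster: ReplicaSet -> owning Deployment.
    owners = {("ReplicaSet", "ruoyi-system-7f49bc9cc9"): ("Deployment", "ruoyi-system")}
    first = ("ReplicaSet", "ruoyi-system-7f49bc9cc9")
    print(resolve_deployment(first, lambda kind, name: owners.get((kind, name))))
```

In the real handler, `fetch_owner` would wrap the `read_namespaced_replica_set` call shown above. Keeping the traversal separate makes the branching easy to test without a cluster.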

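With the deployment name in hand, query the current replica count and check whether the scale-out may proceed:

# Get the current replica count of the Deployment

target_deployment = query_api.read_namespaced_deployment(deployName, eventNameSpace)
cur_replicas = target_deployment.spec.replicas
# Check whether the current replica count still allows a scale-out
if cur_replicas >= SCALEOUT_SOFT_UPLIMITS:
    print("Current replicas:", cur_replicas, "\n Configured soft upper limit:", SCALEOUT_SOFT_UPLIMITS, "\n Cannot scale out further; please handle manually!")
    # sendCustomEventScaleFailed is a helper that is not included in this article
    sendCustomEventScaleFailed("warning", cur_replicas)
    return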
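
The limit check can also be isolated as a pure function, which makes the boundary cases explicit. This is a sketch under two assumptions that the original code does not make: an unset replica count is treated as 1 (the Kubernetes default for a Deployment), and the result is capped at the soft limit. The name `next_replica_count` is hypothetical.

```python
from typing import Optional


def next_replica_count(current: Optional[int], soft_limit: int, step: int = 1) -> Optional[int]:
    """Return the replica count to scale to, or None if the soft limit is reached.

    `current` may be None when spec.replicas is unset; it is then treated as 1.
    """
    current = 1 if current is None else current
    if current >= soft_limit:
        return None
    return min(current + step, soft_limit)


if __name__ == "__main__":
    for replicas in (1, 4, 5):
        print(replicas, "->", next_replica_count(replicas, soft_limit=5))
```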

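If the check passes, build the K8s patch body and perform the scale-out:

# Check passed: build the patch body and perform the scale-out
body_patch = {
                'apiVersion': 'apps/v1',   # a plain dict is sent as-is, so the key must use the API's camelCase name
                'kind': 'Deployment',
                'metadata':{
                    'name': deployName,
                    'namespace': eventNameSpace
                },
                'spec':{'replicas': cur_replicas+1}
}
# Apply the configuration change
update_deployment(query_api, deployName, eventNameSpace, body_patch)

return "Resource scale out Good!\n"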
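
Because the deployment name and namespace are already passed as arguments to the patch call, the body only has to carry the fields that change. A minimal sketch of such a body follows; `build_scale_patch` is a hypothetical helper, not part of the original script.

```python
import json


def build_scale_patch(replicas: int) -> dict:
    """Return a minimal patch body that changes only spec.replicas."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    return {"spec": {"replicas": replicas}}


if __name__ == "__main__":
    # Show the JSON that would be sent as the patch body.
    print(json.dumps(build_scale_patch(2)))
```

A smaller body reduces the chance of unintentionally overwriting other fields of the deployment.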

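Configure the alert notification target

Once the handler logic is written, go to "DataFlux Func" - "Management" - "Authorization Links", create a new authorization link, and copy its API address:

In "Console" - "Monitoring" - "Notification Targets", create a notification target of type Webhook and paste the API address copied above into the Webhook field:

Configure the Guance alert

Before simulating the fault, set a metric threshold for the target application container so that the fault injection triggers the corresponding alert notification. Open "Console" - "Monitoring" - "New Monitor", select the metric, and configure the alert threshold:

In the alert notification section, select the notification target created in the previous step, then click "OK" and "Save":

Demonstration

Inject the fault

Because no load-testing environment is available, this example uses a script that consumes compute resources on the front-end load balancer to simulate a high-concurrency situation that calls for scaling out the system.

The sample script drives up the CPU usage of the target application container: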

[root@dns01-dev gc]$ vi cpu.sh
[root@dns01-dev gc]$ cat cpu.sh
#! /bin/sh
# filename killcpu.sh
if [ $# -ne 1 ] ; then
  echo "USAGE: $0 <CPUs>|stop"
  exit 1;
fi

stop()
{
while read LINE
  do
    kill -9 $LINE
    echo "kill $LINE sucessfull"
  done < pid.txt
cat /dev/null > pid.txt
}

start()
{
  echo "u want to cpus is:"$1
  for i in `seq $1`
do
  echo -ne "
i=0;
while true
do
i=i+1;
done" | /bin/sh &
  pid_array[$i]=$! ;
done

for i in "${pid_array[@]}"; do
  echo 'pid is:' $i ';';
  echo $i >> pid.txt
done
}

case $1 in
  stop)
    stop
  ;;
  *)
  start $1
;;
esac
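
For comparison, the same kind of CPU load can be generated with a time limit, so that no PID file or manual stop is needed. This sketch assumes a Python 3 interpreter is available in the target container, which the original does not state. The function name and the command-line arguments (worker count, duration in seconds) are illustrative.

```python
import time


def busy_loop(seconds: float) -> int:
    """Spin on the CPU for roughly `seconds` and return the iteration count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


if __name__ == "__main__":
    import multiprocessing
    import sys

    # Usage (hypothetical): python3 burn.py <workers> <seconds>
    workers = int(sys.argv[1]) if len(sys.argv) > 1 else 1
    duration = float(sys.argv[2]) if len(sys.argv) > 2 else 60.0
    procs = [
        multiprocessing.Process(target=busy_loop, args=(duration,))
        for _ in range(workers)
    ]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
```

Each worker process occupies about one CPU core until the deadline passes and then exits on its own.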

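First, check the current replica count of each deployment and the pod list. Every deployment has one replica:

Next, check the metrics of each pod. All of them are normal:

Then copy the cpu.sh script listed above into the ruoyi-nginx container, make it executable, and run it: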

[root@dns01-dev gc]$ kubectl get deploy -n gc-dev
NAME            READY   UP-TO-DATE   AVAILABLE   AGE
ruoyi-auth      1/1     1            1           113d
ruoyi-gateway   1/1     1            1           113d
ruoyi-mysql     1/1     1            1           113d
ruoyi-nacos     1/1     1            1           113d
ruoyi-nginx     1/1     1            1           113d
ruoyi-redis     1/1     1            1           113d
ruoyi-system    1/1     1            1           2d15h
[root@dns01-dev gc]$ ls
1  cpu.sh  gc-func  gc_func_svcacc.yaml  gc-launcher  mock_cpu.sh  note.txt  stp1_nfssc  stp2_openesb  stp5_td  stp6_redis
[root@dns01-dev gc]$ kubectl get po -n gc-dev
NAME                             READY   STATUS    RESTARTS   AGE
ruoyi-auth-6475544879-pvx4r      2/2     Running   0          109d
ruoyi-gateway-7f46976bb5-g2qtf   2/2     Running   0          61d
ruoyi-mysql-6c48f4f47b-sbqpv     1/1     Running   0          113d
ruoyi-nacos-667ff88589-769rk     1/1     Running   0          113d
ruoyi-nginx-d44f6c5ff-8jphj      1/1     Running   0          2d22h
ruoyi-redis-594b4d99dd-vjl6v     1/1     Running   0          113d
ruoyi-system-7f49bc9cc9-dms9r    2/2     Running   0          2d14h
[root@dns01-dev gc]$ kubectl cp -n gc-dev cpu.sh ruoyi-nginx-d44f6c5ff-8jphj:/home
[root@dns01-dev gc]$ kubectl exec -it -n gc-dev ruoyi-nginx-d44f6c5ff-8jphj -- /bin/bash
root@ruoyi-nginx-d44f6c5ff-8jphj:/home/ruoyi/projects/ruoyi-ui# cd /home
root@ruoyi-nginx-d44f6c5ff-8jphj:/home# chmod +x cpu.sh
root@ruoyi-nginx-d44f6c5ff-8jphj:/home# ls -l
total 8
-rwxr-xr-x 1 root root  521 Jun  1 16:14 cpu.sh
drwxr-xr-x 1 root root 4096 May  4 07:48 ruoyi
root@ruoyi-nginx-d44f6c5ff-8jphj:/home# ./cpu.sh 1
u want to cpus is: 1
./cpu.sh: 29: pid_array[1]=54: not found
./cpu.sh: 32: Bad substitution
root@ruoyi-nginx-d44f6c5ff-8jphj:/home# /bin/sh: 1: -ne: not found

root@ruoyi-nginx-d44f6c5ff-8jphj:/home#

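Check the injection effect

Use kubectl and the Guance dashboard to check the effect of the script:

Check alert triggering

Open "Console" - "Events" to see whether the alert fired. Click an alert entry to open its details, then click "Alert Notifications" to check whether the Webhook notification target was called successfully. If the alert was not sent, click "Notification Targets" to see the specific reason:

Check the script execution result

Either of the following two methods confirms that the scale-out has completed. If the alert notification reached DataFlux Func, the current replica count is visible under "Console" - "Infrastructure" - "Containers" - "Pod".

Alternatively, check the current replica count with kubectl on the cluster.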
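
When this check is scripted, the READY column of `kubectl get deploy` can be parsed directly. The sketch below assumes the default table layout shown earlier in this article (name in the first column, `ready/desired` in the second); `parse_ready_column` is a hypothetical helper, and the sample rows are copied from that earlier output.

```python
def parse_ready_column(table: str) -> dict:
    """Map deployment name -> (ready, desired) from `kubectl get deploy` text."""
    result = {}
    lines = [line for line in table.strip().splitlines() if line.strip()]
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        ready, desired = fields[1].split("/")
        result[fields[0]] = (int(ready), int(desired))
    return result


# Rows copied from the pre-injection output shown earlier.
SAMPLE_TABLE = """\
NAME            READY   UP-TO-DATE   AVAILABLE   AGE
ruoyi-nginx     1/1     1            1           113d
ruoyi-system    1/1     1            1           2d15h
"""

if __name__ == "__main__":
    print(parse_ready_column(SAMPLE_TABLE))
```

Comparing the desired count before and after the alert shows whether the handler actually raised the replica count.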

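This concludes the elastic scale-out demonstration.

Summary

With Guance's function development platform, users can build a wide range of query and management operations for the monitored system. This brings convenience and flexibility to system administration and provides a technical basis for extending the scenarios in which Guance can be used.

Appendix: complete sample code: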

from kubernetes import client
from kubernetes.client.rest import ApiException

import json

# Fill in your own API_TOKEN here
API_TOKEN = "xxxxxx"

# Fill in the api-server address of the target cluster here
KUBE_API_HOSTS = "https://x.x.x.x:5443"

'''
Demo:
Scale out resources based on monitoring data
'''

def update_deployment(api, deploy_name, ns, patch_detail):
    # Performs the equivalent of kubectl patch on the deployment
    try:
        resp = api.patch_namespaced_deployment(name=deploy_name, namespace=ns, body=patch_detail)
    except ApiException as e:
        print("Exception when calling AppsV1Api->patch_namespaced_deployment: %s\n" % e)
        return None

    print("\n[INFO] deployment's replicas count updated.\n")
    print("%s\t%s\t\t\t%s\t%s" % ("NAMESPACE", "NAME", "REVISION", "REPLICAS"))
    print(
        "%s\t\t%s\t%s\t\t%s\n"
        % (
            resp.metadata.namespace,
            resp.metadata.name,
            resp.metadata.generation,
            resp.spec.replicas,
        )
    )
    return resp

# Soft upper limit for scale-out: at most 5 replicas
SCALEOUT_SOFT_UPLIMITS = 5

# DFF is provided by the DataFlux Func runtime
@DFF.API('Demo: resource scale-out')
def cceDeployScaleOps(**kwargs):

    # Pretty-print the alert parameters to help identify the required fields
    rawEventMsg = json.dumps(kwargs, indent=2, ensure_ascii=False)
    print("Alert payload:\n", rawEventMsg)

    # Cache the alert IDs
    dfEventId = kwargs.get('df_event_id')
    dfMonitorId = kwargs.get('df_monitor_id')

    # Extract the pod name and the namespace it belongs to
    eventPodName = kwargs.get('pod_name')
    eventNameSpace = kwargs.get('namespace')

    # Configure the client connection
    configuration = client.Configuration()
    configuration.host = KUBE_API_HOSTS                    # api-server address
    configuration.verify_ssl = False                       # skips TLS verification; demo only
    configuration.debug = True
    # Note the space after "Bearer"
    configuration.api_key = {"authorization": "Bearer " + API_TOKEN}
    client.Configuration.set_default(configuration)

    # Fetch the specified pod
    k8s_api_coreV1 = client.CoreV1Api()

    try:
        target_pod = k8s_api_coreV1.read_namespaced_pod(name=eventPodName, namespace=eventNameSpace)
    except ApiException as e:
        print("Exception when calling CoreV1Api->read_namespaced_pod: %s\n" % e)
        return

    # The first owner reference identifies the object that manages the pod
    meta_OwnerRef = target_pod.metadata.owner_references[0]

    # Name and kind of the owner, used for the follow-up query
    k8sObjName = meta_OwnerRef.name
    k8sObjKind = meta_OwnerRef.kind

    deployName = ""

    # API entry point for apps/v1 resources
    query_api = client.AppsV1Api()

    if k8sObjKind == "ReplicaSet":
        try:
            replicaset = query_api.read_namespaced_replica_set(k8sObjName, eventNameSpace)
        except ApiException as e:
            print("Exception when calling AppsV1Api->read_namespaced_replica_set: %s\n" % e)
            return
        # The owner of the ReplicaSet is the Deployment
        deployName = replicaset.metadata.owner_references[0].name

    elif k8sObjKind == "Deployment":
        print("Owner kind is Deployment; using the queried object name directly.")
        deployName = k8sObjName

    else:
        print("Not an object handled by this script; exiting.")
        return

    # Get the current replica count of the Deployment
    target_deployment = query_api.read_namespaced_deployment(deployName, eventNameSpace)
    cur_replicas = target_deployment.spec.replicas

    # Check whether the current replica count still allows a scale-out
    if cur_replicas >= SCALEOUT_SOFT_UPLIMITS:
        print("Current replicas:", cur_replicas, "\n Configured soft upper limit:", SCALEOUT_SOFT_UPLIMITS, "\n Cannot scale out further; please handle manually!")
        # sendCustomEventScaleFailed is a helper that is not included in this listing
        sendCustomEventScaleFailed("warning", cur_replicas)
        return

    # Check passed: build the patch body and perform the scale-out
    body_patch = {
        'apiVersion': 'apps/v1',   # a plain dict is sent as-is, so use the API's camelCase key
        'kind': 'Deployment',
        'metadata': {
            'name': deployName,
            'namespace': eventNameSpace
        },
        'spec': {'replicas': cur_replicas + 1}
    }
    # Apply the configuration change
    if update_deployment(query_api, deployName, eventNameSpace, body_patch) is None:
        return "Resource scale out failed!\n"

    return "Resource scale out Good!\n"
