1. Overview of features
1. Query summary statistics (action, affected rows, avg. frequency, avg. latency, bytes in, bytes out, count, CPU time, errors, failed rate)
2. Trends, period-over-period comparison
3. Faults: requests continue to arrive but do not get serviced by the system (mysqld, disk). {Why care? So you can prevent a fault from escalating into an outage. Are some of my database problems caused by small, hidden faults?}
4. Resources
2. How to architect and build highly observable systems
External quality of service and internal sufficiency of resources
- Customer's viewpoint (external; if these four are fine, customers have no problem):
concurrency (requests in process, backlog),
error rate,
latency (wait + process; the 99th percentile is better than an average),
throughput (completed requests per second over a time interval)
- Internal (Brendan Gregg's USE):
utilization (CPU, memory, storage, network)
saturation
errors
- formal performance theory: queueing, Little's Law
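The four external signals above, plus Little's Law, are easy to derive from a request log. A minimal sketch with made-up records (the log format and the 2-second window are assumptions, not from the article):

```python
import math

# Hypothetical request log: (arrival_time, latency_seconds, ok)
requests = [
    (0.0, 0.120, True), (0.2, 0.300, True), (0.5, 0.080, False),
    (0.9, 0.450, True), (1.1, 0.090, True), (1.4, 0.700, True),
]

window = 2.0  # observation interval in seconds (assumed)

throughput = len(requests) / window                      # completed requests/sec
error_rate = sum(1 for _, _, ok in requests if not ok) / len(requests)

latencies = sorted(lat for _, lat, _ in requests)
p99 = latencies[min(len(latencies) - 1, math.ceil(0.99 * len(latencies)) - 1)]

# Little's Law gives average concurrency from throughput and mean latency.
mean_latency = sum(latencies) / len(latencies)
concurrency = throughput * mean_latency

print(throughput, error_rate, p99, round(concurrency, 3))
```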
文中说了些经验,比如 log level waring 没有用,fatal 起始是 panic 不如 retyrn error。error 不知道是否经过处理。所以建议只有两个 log info/debug,其中 debug 充足可关
还有应该有 online profiling 的能力,what request,what states,canceling
在 DB 领域分析时注意 reduce diversity of workload 分类
adaptive fault detection versus anomaly detection
Causes: resource overload/saturation, storms of queries, bad application behavior, SELECT FOR UPDATE or other locking queries, internal scalability problems such as mutex contention around a long-running operation, or intensive periodic tasks. For example, the query cache mutex in MySQL has caused server stalls, and InnoDB checkpoint stalls or dirty-page flushing can cause similar effects. In the application layer, bad behavior such as stampedes to regenerate an expired cache entry are common culprits.
work isn’t getting done
Why not anomaly detection?
1. This is because systems are continually anomalous in a variety of ways. We humans tend to greatly underestimate how crazily our systems behave all the time.
2.false-alarm rate vs missed-alarm rate
They will miss most true faults and alarm you on things that are just "normal abnormality."
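The false-alarm problem is a base-rate effect, easy to see with made-up numbers (all three rates below are illustrative assumptions):

```python
# Even a seemingly good anomaly detector drowns you in false alarms
# when true faults are rare, because "normal abnormality" dominates.
true_fault_rate = 0.001          # 1 in 1000 windows has a real fault (assumed)
sensitivity = 0.90               # P(alarm | fault) (assumed)
false_positive_rate = 0.05       # P(alarm | normal abnormality) (assumed)

p_alarm = sensitivity * true_fault_rate + false_positive_rate * (1 - true_fault_rate)
p_fault_given_alarm = sensitivity * true_fault_rate / p_alarm

print(f"{p_fault_given_alarm:.3%}")  # only ~1.8% of alarms are real faults
```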
3. Practical scalability analysis with the Universal Scalability Law
Scalability with throughput
- Theoretical formulas
Reasons scaling is nonlinear: contention (e.g., data distribution and aggregation) and crosstalk
Neil Gunther's USL: effects of linear speedup, contention delay, and coherency delay due to crosstalk
Coefficients: performance λ, scale (concurrency) N, throughput X
Amdahl's Law: X(N) = λN / (1 + σ(N−1))
USL (adds a coherency term κ): X(N) = λN / (1 + σ(N−1) + κN(N−1))
- Parameter definitions
Denoising: use a scatterplot and time series to make sure you're working with a relatively consistent data set; drop outliers or take averaged data from specific time windows.
USL package from CRAN
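Besides the CRAN package, the fit is easy to do by hand because the USL linearizes: N/X(N) = (1 + σ(N−1) + κN(N−1)) / λ is linear in the features 1, N−1, and N²−N. A sketch with synthetic data (the coefficient values are made up):

```python
import numpy as np

def usl(N, lam, sigma, kappa):
    # X(N) = lambda*N / (1 + sigma*(N-1) + kappa*N*(N-1))
    return lam * N / (1 + sigma * (N - 1) + kappa * N * (N - 1))

# Synthetic (N, throughput) measurements standing in for real load tests.
N = np.array([1, 2, 4, 8, 16, 32], dtype=float)
X = usl(N, 1000.0, 0.05, 0.001)

# Linearized least-squares fit: N/X = a + b*(N-1) + c*(N^2-N),
# where a = 1/lambda, b = sigma/lambda, c = kappa/lambda.
A = np.column_stack([np.ones_like(N), N - 1, N * N - N])
a, b, c = np.linalg.lstsq(A, N / X, rcond=None)[0]
lam, sigma, kappa = 1 / a, b / a, c / a
print(round(lam), round(sigma, 3), round(kappa, 4))  # recovers 1000, 0.05, 0.001
```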
Scalability with response time
Little's Law: N = XR (N is concurrency, X throughput, R response time)
response time vs throughput
linearly scalable system: R(X) = 1/λ
add contention (σ): R(N) = (1 + σ(N−1)) / λ
add coherency (κ): R(N) = (1 + σ(N−1) + κN(N−1)) / λ, which follows from R = N/X(N)
Conversely, R depends on more than X alone, so trying to fit R(X) directly is pointless. Once retrograde scaling occurs, it's a lost cause with no practical purpose.
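Why it's a lost cause: with assumed coefficients, two concurrencies on opposite sides of the throughput peak give nearly the same X but very different R, so R is not a function of X alone. A numeric illustration (the σ and κ values are made up):

```python
def usl_x(N, lam=1000.0, sigma=0.05, kappa=0.02):
    # USL throughput model with assumed, illustrative coefficients.
    return lam * N / (1 + sigma * (N - 1) + kappa * N * (N - 1))

n_max = ((1 - 0.05) / 0.02) ** 0.5     # throughput peaks near N ~ 6.9
x_lo, x_hi = usl_x(4), usl_x(12)       # one point on each side of the peak
r_lo, r_hi = 4 / x_lo, 12 / x_hi       # Little's Law: R = N / X
print(round(x_lo), round(x_hi), round(r_lo, 5), round(r_hi, 5))
```

The two throughputs differ by under 1%, but the response times differ by about 3x.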
Limitations
The USL only applies while you increase scale up to the onset of queueing. For example, you can grow the thread count to one per core, but the model cannot forecast how many servers should serve the queues, because that is already a different model.
Evaluation
Fit the parameters and solve for Nmax
From the acceptable response time, solve for N
From that, derive the throughput
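The three steps above can be sketched with assumed fitted coefficients (λ, σ, κ and the target response time are illustrative, not from real data):

```python
import math

# Assumed coefficients from a USL fit.
lam, sigma, kappa = 1000.0, 0.05, 0.001

# 1. Peak scalability: throughput is maximal at N_max = sqrt((1-sigma)/kappa).
n_max = math.sqrt((1 - sigma) / kappa)

# 2. Largest N keeping R(N) = (1 + sigma*(N-1) + kappa*N*(N-1)) / lam within
#    a target response time: solve the quadratic kappa*N^2 + (sigma-kappa)*N
#    + (1 - sigma - lam*R) = 0 for its positive root.
r_target = 0.002
a, b, c = kappa, sigma - kappa, 1 - sigma - lam * r_target
n_at_target = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)

# 3. Throughput at that point, via the USL itself.
x = lam * n_at_target / (1 + sigma * (n_at_target - 1)
                         + kappa * n_at_target * (n_at_target - 1))
print(round(n_max, 1), round(n_at_target, 1), round(x))
```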
Optimizing scalability
The fitted coefficients show the degree of contention and crosstalk
Caveats
Keep every other variable constant at each scale point; e.g., each node should receive the same input and hold the same data as before.
4. Estimating CPU per query with weighted linear regression
CPU execution time
Some requests are aggregated and deferred to be done later, often with a single I/O op; because of this disconnect, accurate cost accounting is impossible.
Doing a linear fit on per-query features is not a good approach. Take, e.g., query1 (infrequent, so it contributes relatively little to total CPU): a query may burn CPU heavily at the start on allocation and parsing, then use little CPU during its I/O phase, so a naive fit can produce nonsense like CPU = −K × Rquery. Instead, analyze CPU and I/O frame by frame along the time series, regressing them against the query features within each frame.
Sigh: the paper's "simple" assumptions are anything but simply stated; many obvious things are left unwritten. The most basic assumption is that within each frame, CPU is apportioned across queries purely by query count, regardless of query type or what each query is doing internally.
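A minimal sketch of the frame-by-frame idea, with made-up per-frame counts and an assumed weighting by total query count (the paper's actual weighting scheme may differ):

```python
import numpy as np

# Hypothetical per-frame observations: columns are execution counts of two
# query classes in each 1-second frame; cpu is total CPU seconds per frame.
counts = np.array([
    [10, 2], [8, 5], [12, 1], [6, 8], [9, 4], [11, 3],
], dtype=float)
true_cost = np.array([0.010, 0.050])   # assumed per-query CPU cost per class
cpu = counts @ true_cost               # synthetic frame CPU totals

# Weighted least squares: scale each frame's row and target by sqrt(weight)
# (here weight = total query count, one plausible choice), then solve.
w = np.sqrt(counts.sum(axis=1))
coef, *_ = np.linalg.lstsq(counts * w[:, None], cpu * w, rcond=None)
print(coef)   # recovers the per-query CPU cost for each class
```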
Evaluation:
Numerically: 1. compare against existing tools; 2. compare against known results.
1. goodness of fit: correlation coefficient, R²
2. standard error: error terms; T-statistics for the slope and intercept
3. statistical significance: mean absolute percentage error (MAPE)
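For reference, R² and MAPE can be computed as follows (the actual/predicted numbers are illustrative, not from the paper):

```python
import numpy as np

actual = np.array([10.0, 12.0, 9.0, 15.0, 11.0])
predicted = np.array([9.5, 12.5, 9.2, 14.0, 11.3])

# R^2: fraction of variance explained by the model.
ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# MAPE: mean absolute percentage error.
mape = np.mean(np.abs((actual - predicted) / actual))
print(round(r2, 3), round(100 * mape, 2))
```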
Interpretability
Visibility
Guess: should I/O be estimated from affected rows?
5. Practical query optimization
response time,consistency
1. You can't improve what you don't measure
performance schema: aggregates
slow query log:
TCP traffic capture: tcpdump, libpcap
2. Classification (using performance schema or pt-query-digest; how do you aggregate your own instrumentation?)
Don't create too many classes; around the top 20.
Examining queries one by one hides problems; e.g., many short, frequent queries can be merged into one class.
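Aggregating queries into classes can be done with a rough fingerprinting pass in the spirit of pt-query-digest; the regexes below are a simplistic sketch, not its real algorithm:

```python
import re
from collections import Counter

def fingerprint(sql: str) -> str:
    # Collapse literals so short, frequent query variants fall into one class.
    s = sql.strip().lower()
    s = re.sub(r"'[^']*'", "?", s)   # string literals -> ?
    s = re.sub(r"\b\d+\b", "?", s)   # numeric literals -> ?
    s = re.sub(r"\s+", " ", s)       # normalize whitespace
    return s

log = [
    "SELECT * FROM users WHERE id = 1",
    "SELECT * FROM users WHERE id = 42",
    "SELECT * FROM orders WHERE status = 'new'",
]
top = Counter(fingerprint(q) for q in log).most_common(20)
print(top)
```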