VividCortex2017

1.Feature overview

1.Query summary (action, affected rows, avg. frequency, avg. latency, bytes in, bytes out, count, CPU time, errors, failed rate)
2.Trends and period-over-period comparison
3.Faults: requests continue to arrive but do not get serviced by the system (mysqld, disk). {Why care? So you can prevent a fault from escalating into an outage. Are some of my database problems caused by small, hidden faults?}
4.Resources

2.how to architect and build highly observable systems

external quality of service and internal sufficiency of resources

customer's viewpoint (external; if these four are healthy, customers see no problem):

concurrency (requests in process, backlog),
error rate,
latency (wait + process; the 99th percentile is a better measure),
throughput (completed requests per second over a time interval)
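
As a rough illustration of the four customer-facing metrics above, the sketch below derives them from a list of request records. The record layout `(start, end, ok)`, the function name `external_metrics`, and the nearest-rank percentile are assumptions made for this example, not something the source prescribes.

```python
import math

def external_metrics(requests, window_start, window_end):
    """Summarize requests given as (start, end, ok) tuples, times in seconds."""
    window = window_end - window_start
    # Requests that completed inside the window count toward throughput.
    done = [r for r in requests if window_start <= r[1] < window_end]
    latencies = sorted(end - start for start, end, _ in done)
    errors = sum(1 for _, _, ok in done if not ok)
    # Average concurrency: total in-flight time inside the window / window length.
    busy = sum(
        max(0.0, min(end, window_end) - max(start, window_start))
        for start, end, _ in requests
    )
    # Nearest-rank 99th percentile (an assumption; other definitions exist).
    p99 = latencies[math.ceil(0.99 * len(latencies)) - 1] if latencies else 0.0
    return {
        "concurrency": busy / window,
        "error_rate": errors / len(done) if done else 0.0,
        "p99_latency": p99,
        "throughput": len(done) / window,
    }

# Hypothetical sample: four requests observed over a 4-second window.
requests = [(0.0, 1.0, True), (0.5, 1.5, True), (1.0, 3.0, False), (2.0, 2.5, True)]
metrics = external_metrics(requests, 0.0, 4.0)
print(metrics)
```

When every request falls entirely inside the window, the average concurrency computed this way equals throughput times mean latency, which is Little's Law.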

internal (Brendan Gregg's USE method):

utilization(cpu,mem,storage,net)
saturation
errors

formal performance theory: queueing theory, Little’s Law

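The text shares some lessons learned. For example, the warning log level is useless, and fatal is really just a panic, so returning an error is better. With the error level you cannot tell whether the error has already been handled. So the advice is to keep only two log levels, info and debug, where debug is verbose and can be switched off.
There should also be an online profiling capability: which requests are running, what state they are in, and a way to cancel them.
When analyzing in the database domain, take care to reduce the diversity of the workload by classifying it.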
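
The two-level logging advice above can be sketched with Python's standard `logging` module. The helper names `make_logger` and `read_config` and the file path are hypothetical; the point is only that DEBUG can be switched off and that failures go back to the caller instead of ending the process.

```python
import logging

def make_logger(name, debug_enabled=False):
    """Return a logger that emits INFO always and DEBUG only when enabled."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG if debug_enabled else logging.INFO)
    return logger

def read_config(path, logger):
    """Read a file; failures propagate to the caller instead of killing the process."""
    logger.debug("reading config from %s", path)
    with open(path) as handle:  # raises OSError on failure
        return handle.read()

logging.basicConfig(format="%(levelname)s %(message)s")
log = make_logger("demo.app")
try:
    read_config("/nonexistent/demo.conf", log)  # hypothetical path
except OSError as exc:
    # The caller decides how to handle the error, then records it once at INFO.
    log.info("config unavailable, using defaults: %s", exc)
```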

adaptive fault detection versus anomaly detection

Causes: resource overload/saturation, storms of queries, bad application behavior, SELECT FOR UPDATE or other locking queries, internal scalability problems such as mutex contention around a long-running operation, or intensive periodic tasks. For example, the query cache mutex in MySQL has caused server stalls, and InnoDB checkpoint stalls or dirty-page flushing can cause similar effects. In the application layer, bad behavior such as stampedes to regenerate an expired cache entry is a common culprit.
A fault means work isn’t getting done.
Why not anomaly detection?
1.This is because systems are continually anomalous in a variety of ways. We humans tend to greatly underestimate how crazily our systems behave all the time.
2.false-alarm rate vs missed-alarm rate
Anomaly detectors will miss most true faults and alarm you on things that are just “normal abnormality.”
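
To make "requests keep arriving but work isn't getting done" concrete, here is a deliberately simple, fixed-threshold sketch. It is not the adaptive method the source refers to; the function name `find_stalls`, the per-interval `(arrivals, completions)` input, and both thresholds are assumptions chosen for illustration.

```python
def find_stalls(samples, min_growth=5):
    """samples: one (arrivals, completions) pair per interval.

    Returns (interval index, backlog) for intervals where requests keep
    arriving but fewer than half of them are being completed.
    """
    stalls = []
    backlog = 0
    for index, (arrivals, completions) in enumerate(samples):
        growth = arrivals - completions
        backlog = max(0, backlog + growth)
        if arrivals > 0 and completions < arrivals / 2 and growth >= min_growth:
            stalls.append((index, backlog))
    return stalls

# Hypothetical counters: intervals 2 and 3 stall, interval 4 drains the backlog.
samples = [(100, 100), (100, 98), (100, 20), (100, 10), (100, 180)]
stalls = find_stalls(samples)
print(stalls)
```

The check looks at whether work completes rather than whether a metric looks unusual, which is the distinction the notes draw between fault detection and anomaly detection.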

3.practical scalability analysis with the universal scalability law

scalability with throughput

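Theoretical formulas

Why scaling is nonlinear: contention (for example, distributing and aggregating data) and crosstalk
Neil Gunther’s USL: effects of linear speedup, contention delay, and coherency delay due to crosstalk
coefficient of performance: lmd; scale: N; throughput: X
Amdahl’s Law: X(N)=lmd * N / (1+sgm * (N-1))
USL (adds coherency): X(N)=lmd * N / (1+sgm * (N-1)+kpa * N * (N-1))

Parameter definitions

lmd: throughput at N=1 (coefficient of performance)
sgm: contention coefficient, the serialized fraction of the work
kpa: coherency (crosstalk) coefficient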

Remove noise: use scatterplots and time-series plots to ensure you’re working with a relatively consistent set of data. Remove outliers, or use averages taken over selected time windows.
USL package from CRAN
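
For readers without R at hand, the fit can be sketched in plain Python. This is not what the CRAN package does internally; it is a simplified approach that assumes the standard USL form X(N)=lmd * N / (1+sgm * (N-1)+kpa * N * (N-1)), rewrites it as N/X = a + b * (N-1) + c * N * (N-1) with a=1/lmd, b=sgm/lmd, c=kpa/lmd, and solves ordinary least squares on that transformed form. The function names and the synthetic data points are hypothetical.

```python
def solve_linear(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def usl(n, lmd, sgm, kpa):
    return lmd * n / (1 + sgm * (n - 1) + kpa * n * (n - 1))

def fit_usl(points):
    """points: (N, X) measurements. Returns (lmd, sgm, kpa)."""
    rows = [(1.0, n - 1.0, n * (n - 1.0)) for n, _ in points]
    ys = [n / x for n, x in points]
    # Normal equations of the linearized model: (A^T A) beta = A^T y.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    a, b, c = solve_linear(ata, aty)
    lmd = 1.0 / a
    return lmd, b * lmd, c * lmd

# Synthetic, noise-free points generated from known parameters (not real data).
points = [(n, usl(n, 1000.0, 0.05, 0.001)) for n in (1, 2, 4, 8, 16, 32)]
lmd, sgm, kpa = fit_usl(points)
print(lmd, sgm, kpa)
```

Because the least-squares error is minimized on N/X rather than on X, a nonlinear fit on noisy real data may give somewhat different coefficients.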

scalability with response time

Little’s Law: N=XR (N is concurrency.)

response time vs throughput

linearly scalable system: R(X)=1/lmd
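add contention (Amdahl’s Law combined with Little’s Law, R=N/X): R(X)=(1-sgm)/(lmd-sgm * X)

add coherency (coefficient kpa; the USL solved for N, then R=N/X): R(X)=(lmd+(kpa-sgm) * X-sqrt((lmd+(kpa-sgm) * X)^2-4 * kpa * (1-sgm) * X^2))/(2 * kpa * X^2)

Conversely, throughput as a function of response time: X(R)=(kpa-sgm+sqrt((kpa-sgm)^2-4 * kpa * (1-sgm-lmd * R)))/(2 * kpa * R)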

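R is not determined by X alone: once throughput goes retrograde, the same X is reached at two different concurrency levels with two different response times.
So this form is pointless; once retrograde scaling occurs, it’s a lost cause with no practical purpose.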
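
A small numeric sketch shows why response time cannot be read off throughput alone. It assumes the standard USL form X(N)=lmd * N / (1+sgm * (N-1)+kpa * N * (N-1)) with kpa > 0, rearranges it into a quadratic in N, and applies Little's Law (R=N/X) to each root. The parameter values are hypothetical.

```python
import math

def usl(n, lmd, sgm, kpa):
    return lmd * n / (1 + sgm * (n - 1) + kpa * n * (n - 1))

def concurrency_for_throughput(x, lmd, sgm, kpa):
    """Return the concurrency levels N at which USL throughput equals x.

    Rearranged USL: kpa*x*N^2 + ((sgm - kpa)*x - lmd)*N + (1 - sgm)*x = 0.
    """
    a = kpa * x
    b = (sgm - kpa) * x - lmd
    c = (1 - sgm) * x
    disc = b * b - 4 * a * c
    if disc < 0:
        return []  # x is above the peak throughput, so no N reaches it
    root = math.sqrt(disc)
    return [(-b - root) / (2 * a), (-b + root) / (2 * a)]

lmd, sgm, kpa = 1000.0, 0.05, 0.001  # hypothetical coefficients
x = usl(10, lmd, sgm, kpa)           # throughput observed at N=10
levels = concurrency_for_throughput(x, lmd, sgm, kpa)
for n in levels:
    # Little's Law: R = N / X, so one throughput maps to two response times.
    print(f"N={n:.2f}  R={n / x * 1000:.2f} ms")
```

With these coefficients the same throughput appears at N=10 and again at N=95, with response times almost ten times apart.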

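Limitations

The USL can only model increasing scale up to the point where queueing begins. For example, you can model raising the thread count to one per core, but you cannot forecast how many servers are serving queues,
because by then it is a different model.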

Evaluation

Fit the parameters and compute Nmax
From the acceptable response time, solve for N
Compute the throughput
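
The three evaluation steps above can be sketched as follows, assuming the coefficients have already been fitted. The formulas follow from the standard USL form together with Little's Law: R(N)=(1+sgm * (N-1)+kpa * N * (N-1))/lmd and Nmax=sqrt((1-sgm)/kpa). All numeric values and function names are hypothetical.

```python
import math

def response_time(n, lmd, sgm, kpa):
    """R(N) = N / X(N), from Little's Law and the USL."""
    return (1 + sgm * (n - 1) + kpa * n * (n - 1)) / lmd

def throughput(n, lmd, sgm, kpa):
    return lmd * n / (1 + sgm * (n - 1) + kpa * n * (n - 1))

def n_max(sgm, kpa):
    """Concurrency at which USL throughput peaks (requires kpa > 0)."""
    return math.sqrt((1 - sgm) / kpa)

def n_for_response_time(r_target, lmd, sgm, kpa):
    """Largest N with R(N) <= r_target (requires r_target >= 1 / lmd).

    Solves kpa*N^2 + (sgm - kpa)*N + (1 - sgm - lmd*r_target) = 0.
    """
    a, b, c = kpa, sgm - kpa, 1 - sgm - lmd * r_target
    return (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)

lmd, sgm, kpa = 1000.0, 0.05, 0.001  # hypothetical fitted parameters
r_target = 0.00154                   # hypothetical acceptable response time, seconds

# Never plan beyond the throughput peak, even if the latency budget allows it.
n_limit = min(n_for_response_time(r_target, lmd, sgm, kpa), n_max(sgm, kpa))
print(n_limit, throughput(n_limit, lmd, sgm, kpa))
```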

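Improving scalability

Read the degree of contention and crosstalk from the fitted parameters

Caveats

Keep every other variable constant at each scale point. For example, each node should receive the same input as before, hold the same data, and so on.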

4.estimating cpu per query with weighted linear regression

CPU-execution-time
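Some requests’ work is aggregated and deferred to be done later, often with a single I/O operation; because of this disconnect, accurate cost accounting is impossible.
Regressing on the features of each individual query is not a good approach.
Take the execution time, rows, and so on of query1 (an infrequent query, so it contributes little to total CPU): a query may use a lot of CPU at the start for allocation and parsing, and much less later during I/O, so you can end up with a wrong result such as CPU=-K * Rquery. Instead, analyze the time series frame by frame, regressing CPU and I/O against the query characteristics observed in each frame.

Sigh, the paper’s simple assumptions are not stated simply or clearly at all; many things that are obviously simple are just never written down. The most basic assumption here is that CPU is allocated identically to every query in every frame (regardless of the query’s type or what it is doing internally, CPU is attributed by query count).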
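
Under the per-frame assumption just described, a minimal sketch of the regression looks like this: each frame contributes one point (query count, total CPU seconds), and the fitted slope is the estimated CPU per query. The weighting scheme is not spelled out in these notes, so weighting each frame by its query count is an assumption, as are the function name and the synthetic frames.

```python
def weighted_linear_fit(xs, ys, weights):
    """Weighted least squares for y = intercept + slope * x."""
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic frames (not real measurements): executions of one query class
# per frame, and total CPU seconds consumed in the same frame.
counts = [10, 20, 30, 40]
cpu_seconds = [0.7, 0.9, 1.1, 1.3]
weights = counts  # assumption: busier frames carry more weight

cpu_per_query, background_cpu = weighted_linear_fit(counts, cpu_seconds, weights)
print(cpu_per_query, background_cpu)
```

The intercept absorbs CPU that does not vary with this query class, such as background work.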
evaluate:
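Numerically: 1.compare with existing tools 2.compare with known results
1.goodness of fit: correlation coefficient; R-squared
2.standard error: error terms; t-statistics for the slope; intercept
3.statistical significance: mean absolute percentage error (MAPE)
Interpretability
Visibility
My guess: I/O should probably be estimated from the number of affected rows?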
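
The numerical checks listed above can be computed directly for a simple one-variable fit. The sketch below uses ordinary (unweighted) least squares to keep the formulas short, which is a simplification; the function name and sample values are hypothetical.

```python
import math

def regression_report(xs, ys):
    """Fit y = intercept + slope * x and report common quality measures."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    predicted = [intercept + slope * x for x in xs]
    sse = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    # Standard error of the slope uses n - 2 degrees of freedom.
    slope_se = math.sqrt(sse / (n - 2) / sxx)
    return {
        "slope": slope,
        "intercept": intercept,
        "correlation": sxy / math.sqrt(sxx * syy),
        "r_squared": 1 - sse / syy,
        "slope_t_statistic": slope / slope_se,
        # MAPE is undefined when any observed y is zero.
        "mape_percent": 100 / n * sum(abs((y - p) / y) for y, p in zip(ys, predicted)),
    }

# Hypothetical sample values.
report = regression_report([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(report)
```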

practical query optimization

response time,consistency
1.you can’t improve what you don’t measure
performance schema: aggregates
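slow query log:
TCP traffic capture: tcpdump, libpcap
2.Classify (using performance schema or pt-query-digest; how should logs I record myself be aggregated?)
Don’t keep too many classes; roughly the top 20.
Looking at queries one by one hides problems; for example, many short, frequent, fast queries can be merged into one class.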
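
For self-recorded logs, one possible way to aggregate is to reduce each statement to a fingerprint and rank the classes by total time. The sketch below is a simplified imitation of that idea, not the normalization that pt-query-digest or performance schema actually perform; the regular expressions, function names, and sample entries are assumptions.

```python
import re
from collections import defaultdict

def fingerprint(sql):
    """Collapse literals so that queries differing only in values share a class."""
    s = sql.strip().lower()
    s = re.sub(r"'(?:[^'\\]|\\.)*'", "?", s)             # quoted strings
    s = re.sub(r"\b\d+(?:\.\d+)?\b", "?", s)              # numeric literals
    s = re.sub(r"\(\s*\?(?:\s*,\s*\?)*\s*\)", "(?)", s)   # value lists such as IN (...)
    return re.sub(r"\s+", " ", s)

def top_classes(entries, limit=20):
    """entries: (sql, seconds) pairs. Returns (fingerprint, count, total seconds)."""
    stats = defaultdict(lambda: [0, 0.0])
    for sql, seconds in entries:
        record = stats[fingerprint(sql)]
        record[0] += 1
        record[1] += seconds
    ranked = sorted(stats.items(), key=lambda item: item[1][1], reverse=True)
    return [(fp, count, total) for fp, (count, total) in ranked[:limit]]

# Hypothetical log entries.
entries = [
    ("SELECT * FROM users WHERE id = 1", 0.002),
    ("SELECT * FROM users WHERE id = 2", 0.003),
    ("SELECT * FROM orders WHERE total > 100.5", 0.5),
    ("select * from users where id = 3", 0.001),
]
classes = top_classes(entries)
for row in classes:
    print(row)
```

Grouping this way lets many short, frequent queries show up as one class with a meaningful total instead of thousands of negligible rows.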
