The von Neumann Computer
A von Neumann computer consists of five parts: memory, an arithmetic/logic unit, input devices, output devices, and a control unit.
Harvard Architecture
The Harvard architecture is a memory organization that separates program-instruction storage from data storage: program memory and data memory are two independent memories, each with its own address space and its own access path. The goal is to relieve the memory-access bottleneck during program execution. Typical Harvard-architecture CPUs include the ARM9/ARM10 families and later ARMv8 processors, for example the Huawei Kunpeng 920.
All of the basic hardware components of a computer connect to the motherboard.
Basic Computer Hardware (2)
Opening the Box (Apple iPad 2)
Inside a smartphone – Huawei Mate 30 Pro
Main board (from Tech Insights)
Main board, back side
RF board
Inside the Processor (CPU)
- Datapath: performs operations on data
- Control: sequences the datapath, memory, …
- Registers
Cache memory
- Small, fast SRAM (static random-access memory) for immediate access to data
Intel Core i7-5960X
The Perseverance rover's CPU: 250 nm process, a 23-year-old architecture, clock rate of only 233 MHz
The processor aboard Perseverance is built on technology more than 20 years old: a PowerPC 750, the same processor used in Apple's 1998 iMac G3. Its top clock rate is only 233 MHz and it contains just 6 million transistors, yet a single unit costs about US$200,000 (roughly 1.3 million RMB), because it is radiation-hardened and rated for temperatures from -55 °C to 125 °C.
By comparison, Apple's recently released M1 (an Arm-architecture processor) reaches a top clock rate of 3.2 GHz and has about 16 billion transistors.
Processor Development Trends
Development paths of mainstream CPUs
Through the Looking Glass
LCD screen: made of picture elements (pixels)
- Mirrors the content of the frame buffer memory
Touchscreen
PostPC device
- Supersedes keyboard and mouse
- Resistive and capacitive types
- Most tablets and smartphones use capacitive
- Capacitive allows multiple simultaneous touches (multi-touch)
A Safe Place for Data
Volatile main memory
- Loses instructions and data when powered off
Non-volatile secondary memory
- Magnetic disk
- Flash memory
- Optical disk (CD-ROM, DVD)
Networks: communicating with other computers
- Communication, resource sharing, non-local (remote) access
- Local area network (LAN): Ethernet
- Wide area network (WAN): the Internet
- Wireless network: WiFi, Bluetooth
Basic Computer Hardware (3)
Abstractions
The BIG Picture
Abstraction helps us deal with complexity
- Hide lower-level detail
Instruction set architecture (ISA)
- The hardware/software (abstraction) interface
Application binary interface (ABI)
- The ISA plus the system software interface
Implementation (as distinct from architecture)
- The details underlying the interface
Semiconductors and Integrated Circuits
Technology Trends: processor and memory manufacturing technology
Electronics technology continues to evolve
- Increased capacity and performance
- Reduced cost
Semiconductor Technology
- Silicon: a semiconductor
- Add materials to transform its properties:
- Conductors
- Insulators
- Switch
Equipment list
In chip manufacturing, from the front-end steps through wafer fabrication and on to packaging and test, the main equipment used includes, in order: single-crystal growth furnaces, vapor-phase epitaxy furnaces, oxidation furnaces, low-pressure chemical vapor deposition (LPCVD) systems, magnetron sputtering systems, photolithography machines, etchers, ion implanters, wafer thinning machines, wafer dicing machines, bonding and packaging equipment, testers, sorters/handlers, and probe stations.
The Invention of the Integrated Circuit
- In 1952, Geoffrey Dummer, a scientist at Britain's radar research establishment, proposed at a conference that the discrete components of an electronic circuit could be fabricated together on a single semiconductor wafer, so that one small piece of wafer would form a complete circuit. This would greatly shrink the size of electronic circuits and substantially improve their reliability. This was the early conception of the integrated circuit.
- In 1956, the American materials scientists Fuller (富勒) and Reiss (赖斯) invented the diffusion process for semiconductor manufacturing, providing the process foundation needed to build integrated circuits.
- In September 1958, Jack Kilby, a young engineer at Texas Instruments, successfully integrated five components, including germanium transistors, onto a single piece of germanium, building a simple integrated circuit called a phase-shift oscillator. In February 1959 he filed the "Miniaturized Electronic Circuit" patent (No. 3,138,743, granted June 26, 1964). This was the world's first germanium integrated circuit.
- In 2000, 42 years after the integrated circuit appeared, the value of Kilby and his invention was finally recognized, and he was awarded the Nobel Prize in Physics. The Nobel committee's assessment of Kilby was that he "laid the foundation of modern information technology."
- In July 1959, Robert Noyce of Fairchild Semiconductor developed a diffusion technique using silicon-dioxide masking together with p-n junction isolation, and used the silicon planar process to create the world's first silicon integrated circuit. He filed a patent for the planar-process integrated circuit (No. 2,981,877, granted April 26, 1961; although Noyce filed after Kilby, his patent was granted first).
- Kilby and Noyce invented the integrated circuit almost simultaneously and independently, and both are regarded as its inventors; Noyce's silicon integrated circuit, however, was better suited to commercial manufacturing, and it moved integrated circuits into commercial mass production.
Intel Core i7 Wafer
- 300mm wafer, 280 chips, 32nm technology
- Each chip is 20.7 x 10.5 mm
Integrated Circuit Cost
$\text{Cost per die} = \dfrac{\text{Cost per wafer}}{\text{Dies per wafer} \times \text{Yield}}$
$\text{Dies per wafer} \approx \dfrac{\text{Wafer area}}{\text{Die area}}$
$\text{Yield} = \dfrac{1}{\left(1 + \text{Defects per area} \times \text{Die area}/2\right)^{2}}$
Yield: the fraction of manufactured dies that work
Defects per area: number of defects per unit area
Die area: the area of a single die (chip)
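To make the cost model concrete, here is a minimal Python sketch of these three formulas; the wafer cost, defect density, and die size used in the example call are made-up illustrative numbers, not data from this section:

```python
def dies_per_wafer(wafer_area, die_area):
    # Simplified model from the slide: ignores dies lost at the wafer edge.
    return wafer_area / die_area

def die_yield(defects_per_area, die_area):
    # Yield = 1 / (1 + Defects per area * Die area / 2)^2
    return 1.0 / (1.0 + defects_per_area * die_area / 2.0) ** 2

def cost_per_die(wafer_cost, wafer_area, die_area, defects_per_area):
    # Cost per die = Cost per wafer / (Dies per wafer * Yield)
    return wafer_cost / (dies_per_wafer(wafer_area, die_area) *
                         die_yield(defects_per_area, die_area))

# Hypothetical inputs: $5000 wafer, 300 mm wafer (~706 cm^2),
# 2.17 cm^2 die (about 20.7 mm x 10.5 mm), 0.1 defects per cm^2.
print(round(cost_per_die(5000.0, 706.0, 2.17, 0.1), 2))
```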
Defining Performance
- Which airplane has the best performance? It depends on which aspect you examine.
Response Time and Throughput
Response time
- How long it takes to do a task (the time between the start and completion of the task)
Throughput
- Total work done per unit time
- e.g., tasks/transactions/… per hour
How are response time and throughput affected by
- Replacing the processor with a faster version?
- Adding more processors to do separate tasks?
- Adding a queueing mechanism to improve throughput?
- We'll focus on response time for now…
Relative Performance
Define Performance = 1/Execution Time
- "X is n times faster than Y"
- $\text{Performance}_X / \text{Performance}_Y = \text{Execution time}_Y / \text{Execution time}_X = n$
Example: time taken to run a program
- 10s on A, 15s on B
- Execution TimeB / Execution TimeA = 15s / 10s = 1.5
- So A is 1.5 times faster than B
Measuring Execution Time
Elapsed time
- Total response time, including all aspects
- Processing, I/O, OS overhead, idle time
- Determines system performance
CPU time (when the machine is shared, only the time this job actually occupies the CPU)
- Time spent processing a given job
- Discounts I/O time and other jobs' shares
- Comprises user CPU time and system CPU time
- Different programs are affected differently by CPU performance and system performance
CPU Clocking
Operation of digital hardware is governed by a constant-rate clock (synchronous digital circuits)
Clock period: duration of a clock cycle
- e.g., 250 ps = 0.25 ns = 250×10^-12 s
Clock frequency (rate): cycles per second
- e.g., 4.0 GHz = 4000 MHz = 4.0×10^9 Hz
CPU Time
$\text{CPU Time} = \text{CPU Clock Cycles} \times \text{Clock Cycle Time} = \dfrac{\text{CPU Clock Cycles}}{\text{Clock Rate}}$
Performance improved by
- Reducing number of clock cycles
- Increasing clock rate
- Hardware designers must often trade off clock rate against cycle count
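As a small sanity check, here is the same relationship as a Python sketch, using the Computer A numbers from the example that follows:

```python
def cpu_time(clock_cycles, clock_rate_hz):
    # CPU Time = CPU Clock Cycles / Clock Rate
    return clock_cycles / clock_rate_hz

# Computer A below: 20e9 cycles at 2 GHz -> 10 seconds.
print(cpu_time(20e9, 2e9))
```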
CPU Time Example
- Computer A: 2GHz clock, 10s CPU time
Designing Computer B
- Aim for 6 s CPU time
- Can use a faster clock, but that causes 1.2 × as many clock cycles
- How fast must Computer B's clock be?
$\text{Clock Cycles}_A = \text{CPU Time}_A \times \text{Clock Rate}_A = 10\,\text{s} \times 2\,\text{GHz} = 20 \times 10^{9}$
$\text{Clock Cycles}_B = 1.2 \times \text{Clock Cycles}_A = 24 \times 10^{9}$
$\text{Clock Rate}_B = \dfrac{\text{Clock Cycles}_B}{\text{CPU Time}_B} = \dfrac{24 \times 10^{9}}{6\,\text{s}} = 4\,\text{GHz}$
Instruction Count and CPI
$\text{Clock Cycles} = \text{Instruction Count} \times \text{Cycles per Instruction}$
$\text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} = \dfrac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}$
Instruction Count for a program
- Determined by program, ISA and compiler
Average cycles per instruction (CPI)
- Determined by CPU hardware
If different instructions have different CPIs
- Average CPI affected by instruction mix
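A minimal Python sketch of this form of the performance equation; the instruction count, CPI, and clock rate in the example call are illustrative values, not taken from the slides:

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    # CPU Time = Instruction Count * CPI / Clock Rate
    return instruction_count * cpi / clock_rate_hz

# Hypothetical program: 10 billion instructions, average CPI 2.0, 4 GHz clock.
print(cpu_time(10e9, 2.0, 4e9))  # 5.0 seconds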
CPI Example
- Computer A: Cycle Time = 250ps, CPI = 2.0
- Computer B: Cycle Time = 500ps, CPI = 1.2
- Same ISA
- Which is faster, and by how much?
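Since both computers run the same program on the same ISA, write $I$ for the common instruction count; the comparison then follows directly from the data above:
$\text{CPU Time}_A = I \times 2.0 \times 250\,\text{ps} = 500 \times I\,\text{ps}$
$\text{CPU Time}_B = I \times 1.2 \times 500\,\text{ps} = 600 \times I\,\text{ps}$
$\text{CPU Time}_B / \text{CPU Time}_A = 600/500 = 1.2$, so Computer A is 1.2 times faster than Computer B.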
CPI in More Detail
If different instruction classes take different numbers of cycles (each class has its own CPI, and the classes occur with different frequencies):
$\text{Clock Cycles} = \sum_{i=1}^{n} \left(\text{CPI}_i \times \text{Instruction Count}_i\right)$
Weighted average CPI
$\text{CPI} = \dfrac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left(\text{CPI}_i \times \dfrac{\text{Instruction Count}_i}{\text{Instruction Count}}\right)$
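A short Python sketch of the weighted-average CPI; the per-class CPIs and the instruction mix in the example call are hypothetical, not figures from the slides:

```python
def weighted_cpi(cpi_by_class, count_by_class):
    # CPI = sum_i(CPI_i * IC_i) / total instruction count
    total_cycles = sum(cpi_by_class[c] * count_by_class[c] for c in count_by_class)
    return total_cycles / sum(count_by_class.values())

# Hypothetical mix: 45% class A (CPI 1), 30% class B (CPI 2), 25% class C (CPI 3).
print(weighted_cpi({"A": 1, "B": 2, "C": 3}, {"A": 45, "B": 30, "C": 25}))  # 1.8
```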
CPI Example
Alternative compiled code sequences using instructions in classes A, B, C
Which code sequence executes the most instructions? (Sequence 2)
Which will be faster?
What is the CPI for each sequence?
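The table of per-class CPIs and instruction counts did not survive extraction. Assuming the values used in the textbook version of this example (CPIs of 1, 2, 3 for classes A, B, C; sequence 1 uses 2, 1, 2 instructions of classes A, B, C and sequence 2 uses 4, 1, 1), the answers work out as follows:
Sequence 1: $IC = 5$, cycles $= 2{\cdot}1 + 1{\cdot}2 + 2{\cdot}3 = 10$, so $\text{CPI} = 10/5 = 2.0$
Sequence 2: $IC = 6$, cycles $= 4{\cdot}1 + 1{\cdot}2 + 1{\cdot}3 = 9$, so $\text{CPI} = 9/6 = 1.5$
Under these assumptions sequence 2 executes more instructions, yet it takes fewer cycles and is therefore faster.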
Performance Summary
$\text {CPU Time}=\frac{\text { Instructions}}{\text { Program}} \times \frac{\text { Clock cycles}}{\text { Instruction}} \times \frac{\text { Seconds}}{\text { Clock cycle}}$
Performance depends on
- Algorithm: affects IC (instruction count), possibly CPI
- Programming language: affects IC, CPI
- Compiler: affects IC, CPI
- Instruction set architecture: affects IC, CPI, Tc
Power Trends
In CMOS IC technology
$\text {Power}=\frac{1}{2} \text {Capacitive load} \times \text {Voltage}^{2} \times \text {Frequency}$
Capacitive load: the total capacitance being switched (charged and discharged) each cycle
Reducing Power
Suppose a new CPU has
- 85% of capacitive load of old CPU
- 15% voltage and 15% frequency reduction
$\frac{P_{\text {new}}}{P_{\text {old}}}=\frac{C_{\text {old}} \times 0.85 \times\left(V_{\text {old}} \times 0.85\right)^{2} \times F_{\text {old}} \times 0.85}{C_{\text {old}} \times V_{\text {old}}^{2} \times F_{\text {old}}}=0.85^{4}=0.52$
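A quick numeric check of the ratio above in Python:

```python
# 0.85 from capacitance, 0.85^2 from voltage squared, 0.85 from frequency.
ratio = 0.85 * 0.85**2 * 0.85
print(round(ratio, 2))  # 0.52
```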
The power wall
- We can't reduce voltage further (lower voltages increase leakage)
- We can't remove more heat
- How else can we improve performance?
Constrained by power, instruction-level parallelism, and memory latency
Multiprocessors (multicore)
Multicore microprocessors
- More than one processor per chip
Requires explicitly parallel programming
Compare with instruction-level parallelism (e.g., pipelining)
- Hardware executes multiple instructions at once
- Hidden from the programmer
Hard to do
- Programming for performance
- Load balancing
- Optimizing communication and synchronization
ARM provides more compute cores
A multicore architecture delivers more compute power per unit of chip area, which better fits the needs of distributed workloads
ARM's multicore, high-concurrency advantages match Internet-style distributed architectures
As the performance of multicore ARM CPUs keeps improving, their application domains keep expanding
An overview of ARM server-class processors
SPEC CPU Benchmark
Programs used to measure performance
- Supposedly typical of actual workload
Standard Performance Evaluation Corp (SPEC)
- Develops benchmarks for CPU, I/O, Web, …
SPEC CPU2006
Elapsed time to execute a selection of programs
- Negligible I/O, so focuses on CPU performance
- Normalize relative to a reference machine
Summarize as the geometric mean of performance ratios
- CINT2006 (integer) and CFP2006 (floating-point)
$\sqrt[n]{\prod_{\mathrm{i}=1}^{n} \text {Execution time ratio}_{i}}$
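A minimal Python sketch of this summary statistic; the four ratios in the example call are hypothetical, not SPEC results:

```python
import math

def geometric_mean(ratios):
    # n-th root of the product of the execution-time ratios.
    return math.prod(ratios) ** (1.0 / len(ratios))

# Hypothetical ratios (reference time / measured time) for four benchmarks.
print(geometric_mean([10.0, 12.5, 8.0, 20.0]))
```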
CINT2006 for Intel Core i7 920
SPEC Power Benchmark
Power consumption of server at different workload levels
- Performance: ssj_ops/sec
- Power: Watts (Joules/sec)
SPECpower_ssj2008 for Xeon X5650
Pitfall: Amdahl's Law
Improving an aspect of a computer and expecting a proportional improvement in overall performance
$T_{\text{improved}} = \dfrac{T_{\text{affected}}}{\text{improvement factor}} + T_{\text{unaffected}}$
Example: multiply accounts for 80 s out of 100 s
$\text{Speedup}(E) = \dfrac{1}{(1-P) + P/S}$
The main use of Amdahl's Law is to point out that, in computer architecture design, optimizing a single component can improve the overall system only up to a limit: as $S \to \infty$, $\text{Speedup}(E) = 1/(1-P)$. From another angle, it says that when optimizing an architecture one should pick the components that have the greatest impact on the whole, so as to get the best overall result.
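The slide's example can be completed the way the textbook usually poses it (the follow-up question below is an assumption): if multiply operations take 80 s of a 100 s program, how much must multiplication be sped up to make the whole program run 5 times faster?
$\dfrac{100\,\text{s}}{5} = \dfrac{80\,\text{s}}{n} + 20\,\text{s} \;\Rightarrow\; 0 = \dfrac{80\,\text{s}}{n}$
No value of $n$ satisfies this, so no speedup of the multiply hardware alone can deliver an overall 5× improvement; this is exactly the limit $\text{Speedup}(E) \le 1/(1-P)$ noted above.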
Fallacy: Low Power at Idle
Look back at i7 power benchmark
- At 100% load: 258W
- At 50% load: 170W (66%)
- At 10% load: 121W (47%)
Google data center
- Mostly operates at 10% – 50% load
- At 100% load less than 1% of the time
Consider designing processors to make power proportional to load
Pitfall: MIPS as a Performance Metric
MIPS: Millions of Instructions Per Second
Doesn't account for
- Differences in ISAs between computers
- Differences in complexity between instructions
$\text{MIPS} = \dfrac{\text{Instruction count}}{\text{Execution time} \times 10^{6}} = \dfrac{\text{Instruction count}}{\dfrac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}} \times 10^{6}} = \dfrac{\text{Clock rate}}{\text{CPI} \times 10^{6}}$
- CPI varies between programs on a given CPU
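To see why MIPS can mislead, here is a small hypothetical Python comparison; the clock rates, CPIs, and instruction counts are made-up numbers, not measurements:

```python
def mips(clock_rate_hz, cpi):
    # MIPS = Clock rate / (CPI * 10^6); says nothing about how much work each instruction does.
    return clock_rate_hz / (cpi * 1e6)

def exec_time(instruction_count, cpi, clock_rate_hz):
    return instruction_count * cpi / clock_rate_hz

# Machine B has the higher MIPS rating, but its ISA needs twice as many
# instructions for this program, so Machine A actually finishes sooner.
print(mips(3e9, 1.5), exec_time(1e9, 1.5, 3e9))  # A: 2000 MIPS, 0.50 s
print(mips(4e9, 1.2), exec_time(2e9, 1.2, 4e9))  # B: ~3333 MIPS, 0.60 s
```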
Concluding Remarks
Cost/performance is improving
- Due to underlying technology development
Hierarchical layers of abstraction
- In both hardware and software
Instruction set architecture
- The hardware/software interface
Execution time: the best performance measure
Power is a limiting factor
- Use parallelism to improve performance