The von Neumann Computer
A von Neumann computer consists of five parts: memory, an arithmetic/logic unit, input devices, output devices, and a control unit.
Harvard Architecture
The Harvard architecture is a memory organization that separates program-instruction storage from data storage: program memory and data memory are two independent memories, each with its own address space and its own access path. The goal is to relieve the memory-access bottleneck during program execution. Typical Harvard-architecture CPUs include the ARM9/ARM10 families and later ARMv8 processors, for example the Huawei Kunpeng 920.
All of the basic hardware components of a computer connect to the motherboard.
Basic Computer Hardware (2)
Opening the Box (Apple iPad 2)
Inside a smartphone – Huawei Mate 30 Pro
Main board (from Tech Insights)
Main board, back side
RF board
Inside the Processor (CPU)
- Datapath: performs operations on data
- Control: sequences the datapath, memory, …
- Registers
Cache memory
- Small, fast SRAM (static random-access memory) for immediate access to data
Intel Core i7-5960X
The Perseverance rover's CPU: 250 nm process, a 23-year-old architecture, clock rate of only 233 MHz
The processor aboard Perseverance is built on technology more than 20 years old: a PowerPC 750, the same processor used in Apple's 1998 iMac G3. Its top clock rate is only 233 MHz and it contains just 6 million transistors, yet a single unit costs about US$200,000 (roughly 1.3 million RMB), because it is radiation-hardened and rated for temperatures from -55 °C to 125 °C.
By comparison, Apple's recently released M1 (an Arm-architecture processor) reaches a top clock rate of 3.2 GHz and has about 16 billion transistors.
Processor Development Trends
Development paths of mainstream CPUs
Through the Looking Glass
LCD screen: made of picture elements (pixels)
- Mirrors the content of the frame buffer memory
Touchscreen
PostPC device
- Supersedes keyboard and mouse
- Resistive and capacitive types
- Most tablets and smartphones use capacitive
- Capacitive allows multiple simultaneous touches (multi-touch)
A Safe Place for Data
Volatile main memory
- Loses instructions and data when powered off
Non-volatile secondary memory
- Magnetic disk
- Flash memory
- Optical disk (CD-ROM, DVD)
Networks: communicating with other computers
- Communication, resource sharing, non-local (remote) access
- Local area network (LAN): Ethernet
- Wide area network (WAN): the Internet
- Wireless network: WiFi, Bluetooth
Basic Computer Hardware (3)
Abstractions
The BIG Picture
Abstraction helps us deal with complexity
- Hide lower-level detail
Instruction set architecture (ISA)
- The hardware/software (abstraction) interface
Application binary interface (ABI)
- The ISA plus the system software interface
Implementation (as distinct from architecture)
- The details underlying the interface
Semiconductors and Integrated Circuits
Technology Trends: processor and memory manufacturing technology
Electronics technology continues to evolve
- Increased capacity and performance
- Reduced cost
Semiconductor Technology
- Silicon: a semiconductor
- Add materials to transform its properties:
- Conductors
- Insulators
- Switch
Equipment list
In chip manufacturing, from the front-end steps through wafer fabrication and on to packaging and test, the main equipment used includes, in order: single-crystal growth furnaces, vapor-phase epitaxy furnaces, oxidation furnaces, low-pressure chemical vapor deposition (LPCVD) systems, magnetron sputtering systems, photolithography machines, etchers, ion implanters, wafer thinning machines, wafer dicing machines, bonding and packaging equipment, testers, sorters/handlers, and probe stations.
The Invention of the Integrated Circuit
- In 1952, Geoffrey Dummer, a scientist at Britain's radar research establishment, proposed at a conference that the discrete components of an electronic circuit could be fabricated together on a single semiconductor wafer, so that one small piece of wafer would form a complete circuit. This would greatly shrink the size of electronic circuits and substantially improve their reliability. This was the early conception of the integrated circuit.
- In 1956, the American materials scientists Fuller (富勒) and Reiss (赖斯) invented the diffusion process for semiconductor manufacturing, providing the process foundation needed to build integrated circuits.
- In September 1958, Jack Kilby, a young engineer at Texas Instruments, successfully integrated five components, including germanium transistors, onto a single piece of germanium, building a simple integrated circuit called a phase-shift oscillator. In February 1959 he filed the "Miniaturized Electronic Circuit" patent (No. 3,138,743, granted June 26, 1964). This was the world's first germanium integrated circuit.
- In 2000, 42 years after the integrated circuit appeared, the value of Kilby and his invention was finally recognized, and he was awarded the Nobel Prize in Physics. The Nobel committee's assessment of Kilby was that he "laid the foundation of modern information technology."
- In July 1959, Robert Noyce of Fairchild Semiconductor developed a diffusion technique using silicon-dioxide masking together with p-n junction isolation, and used the silicon planar process to create the world's first silicon integrated circuit. He filed a patent for the planar-process integrated circuit (No. 2,981,877, granted April 26, 1961; although Noyce filed after Kilby, his patent was granted first).
- Kilby and Noyce invented the integrated circuit almost simultaneously and independently, and both are regarded as its inventors; Noyce's silicon integrated circuit, however, was better suited to commercial manufacturing, and it moved integrated circuits into commercial mass production.
Intel Core i7 Wafer
- 300mm wafer, 280 chips, 32nm technology
- Each chip is 20.7 x 10.5 mm
Integrated Circuit Cost
$\text{Cost per die} = \dfrac{\text{Cost per wafer}}{\text{Dies per wafer} \times \text{Yield}}$
$\text{Dies per wafer} \approx \dfrac{\text{Wafer area}}{\text{Die area}}$
$\text{Yield} = \dfrac{1}{\left(1 + \text{Defects per area} \times \text{Die area}/2\right)^{2}}$
Yield: the fraction of manufactured dies that work
Defects per area: number of defects per unit area
Die area: the area of a single die (chip)
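To make the cost model concrete, here is a minimal Python sketch of these three formulas; the wafer cost, defect density, and die size used in the example call are made-up illustrative numbers, not data from this section:

```python
def dies_per_wafer(wafer_area, die_area):
    # Simplified model from the slide: ignores dies lost at the wafer edge.
    return wafer_area / die_area

def die_yield(defects_per_area, die_area):
    # Yield = 1 / (1 + Defects per area * Die area / 2)^2
    return 1.0 / (1.0 + defects_per_area * die_area / 2.0) ** 2

def cost_per_die(wafer_cost, wafer_area, die_area, defects_per_area):
    # Cost per die = Cost per wafer / (Dies per wafer * Yield)
    return wafer_cost / (dies_per_wafer(wafer_area, die_area) *
                         die_yield(defects_per_area, die_area))

# Hypothetical inputs: $5000 wafer, 300 mm wafer (~706 cm^2),
# 2.17 cm^2 die (about 20.7 mm x 10.5 mm), 0.1 defects per cm^2.
print(round(cost_per_die(5000.0, 706.0, 2.17, 0.1), 2))
```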
Defining Performance
- Which airplane has the best performance? It depends on which aspect you examine.
Response Time and Throughput
Response time
- How long it takes to do a task (the time between the start and completion of the task)
Throughput
- Total work done per unit time
- e.g., tasks/transactions/… per hour
How are response time and throughput affected by
- Replacing the processor with a faster version?
- Adding more processors to do separate tasks?
- Adding a queueing mechanism to improve throughput?
- We'll focus on response time for now…
Relative Performance
Define Performance = 1/Execution Time
- "X is n times faster than Y"
- $\text{Performance}_X / \text{Performance}_Y = \text{Execution time}_Y / \text{Execution time}_X = n$
Example: time taken to run a program
- 10s on A, 15s on B
- Execution TimeB / Execution TimeA = 15s / 10s = 1.5
- So A is 1.5 times faster than B
Measuring Execution Time
Elapsed time
- Total response time, including all aspects
- Processing, I/O, OS overhead, idle time
- Determines system performance
CPU time (when the machine is shared, only the time this job actually occupies the CPU)
- Time spent processing a given job
- Discounts I/O time and other jobs' shares
- Comprises user CPU time and system CPU time
- Different programs are affected differently by CPU performance and system performance
CPU Clocking
Operation of digital hardware is governed by a constant-rate clock (synchronous digital circuits)
Clock period: duration of a clock cycle
- e.g., 250 ps = 0.25 ns = 250×10^-12 s
Clock frequency (rate): cycles per second
- e.g., 4.0 GHz = 4000 MHz = 4.0×10^9 Hz
CPU Time
$\text{CPU Time} = \text{CPU Clock Cycles} \times \text{Clock Cycle Time} = \dfrac{\text{CPU Clock Cycles}}{\text{Clock Rate}}$
Performance improved by
- Reducing number of clock cycles
- Increasing clock rate
- Hardware designers must often trade off clock rate against cycle count
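As a small sanity check, here is the same relationship as a Python sketch, using the Computer A numbers from the example that follows:

```python
def cpu_time(clock_cycles, clock_rate_hz):
    # CPU Time = CPU Clock Cycles / Clock Rate
    return clock_cycles / clock_rate_hz

# Computer A below: 20e9 cycles at 2 GHz -> 10 seconds.
print(cpu_time(20e9, 2e9))
```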
CPU Time Example
- Computer A: 2GHz clock, 10s CPU time
Designing Computer B
- Aim for 6 s CPU time
- Can use a faster clock, but that causes 1.2 × as many clock cycles
- How fast must Computer B's clock be?
$\text{Clock Cycles}_A = \text{CPU Time}_A \times \text{Clock Rate}_A = 10\,\text{s} \times 2\,\text{GHz} = 20 \times 10^{9}$
$\text{Clock Cycles}_B = 1.2 \times \text{Clock Cycles}_A = 24 \times 10^{9}$
$\text{Clock Rate}_B = \dfrac{\text{Clock Cycles}_B}{\text{CPU Time}_B} = \dfrac{24 \times 10^{9}}{6\,\text{s}} = 4\,\text{GHz}$
Instruction Count and CPI
$\text{Clock Cycles} = \text{Instruction Count} \times \text{Cycles per Instruction}$
$\text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} = \dfrac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}$
Instruction Count for a program
- Determined by program, ISA and compiler
Average cycles per instruction (CPI)
- Determined by CPU hardware
If different instructions have different CPIs
- Average CPI affected by instruction mix
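A minimal Python sketch of this form of the performance equation; the instruction count, CPI, and clock rate in the example call are illustrative values, not taken from the slides:

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    # CPU Time = Instruction Count * CPI / Clock Rate
    return instruction_count * cpi / clock_rate_hz

# Hypothetical program: 10 billion instructions, average CPI 2.0, 4 GHz clock.
print(cpu_time(10e9, 2.0, 4e9))  # 5.0 seconds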
CPI Example
- Computer A: Cycle Time = 250ps, CPI = 2.0
- Computer B: Cycle Time = 500ps, CPI = 1.2
- Same ISA
- Which is faster, and by how much?
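Since both computers run the same program on the same ISA, write $I$ for the common instruction count; the comparison then follows directly from the data above:
$\text{CPU Time}_A = I \times 2.0 \times 250\,\text{ps} = 500 \times I\,\text{ps}$
$\text{CPU Time}_B = I \times 1.2 \times 500\,\text{ps} = 600 \times I\,\text{ps}$
$\text{CPU Time}_B / \text{CPU Time}_A = 600/500 = 1.2$, so Computer A is 1.2 times faster than Computer B.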
CPI in More Detail
If different instruction classes take different numbers of cycles (each class has its own CPI, and the classes occur with different frequencies):
$\text{Clock Cycles} = \sum_{i=1}^{n} \left(\text{CPI}_i \times \text{Instruction Count}_i\right)$
Weighted average CPI
$\text{CPI} = \dfrac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left(\text{CPI}_i \times \dfrac{\text{Instruction Count}_i}{\text{Instruction Count}}\right)$
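A short Python sketch of the weighted-average CPI; the per-class CPIs and the instruction mix in the example call are hypothetical, not figures from the slides:

```python
def weighted_cpi(cpi_by_class, count_by_class):
    # CPI = sum_i(CPI_i * IC_i) / total instruction count
    total_cycles = sum(cpi_by_class[c] * count_by_class[c] for c in count_by_class)
    return total_cycles / sum(count_by_class.values())

# Hypothetical mix: 45% class A (CPI 1), 30% class B (CPI 2), 25% class C (CPI 3).
print(weighted_cpi({"A": 1, "B": 2, "C": 3}, {"A": 45, "B": 30, "C": 25}))  # 1.8
```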
CPI Example
Alternative compiled code sequences using instructions in classes A, B, C
Which code sequence executes the most instructions? (Sequence 2)
Which will be faster?
What is the CPI for each sequence?
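The table of per-class CPIs and instruction counts did not survive extraction. Assuming the values used in the textbook version of this example (CPIs of 1, 2, 3 for classes A, B, C; sequence 1 uses 2, 1, 2 instructions of classes A, B, C and sequence 2 uses 4, 1, 1), the answers work out as follows:
Sequence 1: $IC = 5$, cycles $= 2{\cdot}1 + 1{\cdot}2 + 2{\cdot}3 = 10$, so $\text{CPI} = 10/5 = 2.0$
Sequence 2: $IC = 6$, cycles $= 4{\cdot}1 + 1{\cdot}2 + 1{\cdot}3 = 9$, so $\text{CPI} = 9/6 = 1.5$
Under these assumptions sequence 2 executes more instructions, yet it takes fewer cycles and is therefore faster.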
Performance Summary
$\text {CPU Time}=\frac{\text { Instructions}}{\text { Program}} \times \frac{\text { Clock cycles}}{\text { Instruction}} \times \frac{\text { Seconds}}{\text { Clock cycle}}$
Performance depends on
- Algorithm: affects IC (instruction count), possibly CPI
- Programming language: affects IC, CPI
- Compiler: affects IC, CPI
- Instruction set architecture: affects IC, CPI, Tc
Power Trends
In CMOS IC technology
$\text {Power}=\frac{1}{2} \text {Capacitive load} \times \text {Voltage}^{2} \times \text {Frequency}$
Capacitive load: the total capacitance being switched (charged and discharged) each cycle
Reducing Power
Suppose a new CPU has
- 85% of capacitive load of old CPU
- 15% voltage and 15% frequency reduction
$\frac{P_{\text {new}}}{P_{\text {old}}}=\frac{C_{\text {old}} \times 0.85 \times\left(V_{\text {old}} \times 0.85\right)^{2} \times F_{\text {old}} \times 0.85}{C_{\text {old}} \times V_{\text {old}}^{2} \times F_{\text {old}}}=0.85^{4}=0.52$
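A quick numeric check of the ratio above in Python:

```python
# 0.85 from capacitance, 0.85^2 from voltage squared, 0.85 from frequency.
ratio = 0.85 * 0.85**2 * 0.85
print(round(ratio, 2))  # 0.52
```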
The power wall
- We can't reduce voltage further (lower voltages increase leakage)
- We can't remove more heat
- How else can we improve performance?
Constrained by power, instruction-level parallelism, and memory latency
Multiprocessors (multicore)
Multicore microprocessors
- More than one processor per chip
Requires explicitly parallel programming
Compare with instruction-level parallelism (e.g., pipelining)
- Hardware executes multiple instructions at once
- Hidden from the programmer
Hard to do
- Programming for performance
- Load balancing
- Optimizing communication and synchronization
ARM provides more compute cores
A multicore architecture delivers more compute power per unit of chip area, which better fits the needs of distributed workloads
ARM's multicore, high-concurrency advantages match Internet-style distributed architectures
As the performance of multicore ARM CPUs keeps improving, their application domains keep expanding
An overview of ARM server-class processors
SPEC CPU Benchmark
Programs used to measure performance
- Supposedly typical of actual workload
Standard Performance Evaluation Corp (SPEC)
- Develops benchmarks for CPU, I/O, Web, …
SPEC CPU2006
Elapsed time to execute a selection of programs
- Negligible I/O, so focuses on CPU performance
- Normalize relative to a reference machine
Summarize as the geometric mean of performance ratios
- CINT2006 (integer) and CFP2006 (floating-point)
$\sqrt[n]{\prod_{\mathrm{i}=1}^{n} \text {Execution time ratio}_{i}}$
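A minimal Python sketch of this summary statistic; the four ratios in the example call are hypothetical, not SPEC results:

```python
import math

def geometric_mean(ratios):
    # n-th root of the product of the execution-time ratios.
    return math.prod(ratios) ** (1.0 / len(ratios))

# Hypothetical ratios (reference time / measured time) for four benchmarks.
print(geometric_mean([10.0, 12.5, 8.0, 20.0]))
```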
CINT2006 for Intel Core i7 920
SPEC Power Benchmark
Power consumption of server at different workload levels
- Performance: ssj_ops/sec
- Power: Watts (Joules/sec)
SPECpower_ssj2008 for Xeon X5650
Pitfall: Amdahl's Law
Improving an aspect of a computer and expecting a proportional improvement in overall performance
$T_{\text{improved}} = \dfrac{T_{\text{affected}}}{\text{improvement factor}} + T_{\text{unaffected}}$
Example: multiply accounts for 80 s out of 100 s
$\text{Speedup}(E) = \dfrac{1}{(1-P) + P/S}$
The main use of Amdahl's Law is to point out that, in computer architecture design, optimizing a single component can improve the overall system only up to a limit: as $S \to \infty$, $\text{Speedup}(E) = 1/(1-P)$. From another angle, it says that when optimizing an architecture one should pick the components that have the greatest impact on the whole, so as to get the best overall result.
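The slide's example can be completed the way the textbook usually poses it (the follow-up question below is an assumption): if multiply operations take 80 s of a 100 s program, how much must multiplication be sped up to make the whole program run 5 times faster?
$\dfrac{100\,\text{s}}{5} = \dfrac{80\,\text{s}}{n} + 20\,\text{s} \;\Rightarrow\; 0 = \dfrac{80\,\text{s}}{n}$
No value of $n$ satisfies this, so no speedup of the multiply hardware alone can deliver an overall 5× improvement; this is exactly the limit $\text{Speedup}(E) \le 1/(1-P)$ noted above.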
Fallacy: Low Power at Idle
Look back at i7 power benchmark
- At 100% load: 258W
- At 50% load: 170W (66%)
- At 10% load: 121W (47%)
Google data center
- Mostly operates at 10% – 50% load
- At 100% load less than 1% of the time
Consider designing processors to make power proportional to load
Pitfall: MIPS as a Performance Metric
MIPS: Millions of Instructions Per Second
Doesn't account for
- Differences in ISAs between computers
- Differences in complexity between instructions
$\text{MIPS} = \dfrac{\text{Instruction count}}{\text{Execution time} \times 10^{6}} = \dfrac{\text{Instruction count}}{\dfrac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}} \times 10^{6}} = \dfrac{\text{Clock rate}}{\text{CPI} \times 10^{6}}$
- CPI varies between programs on a given CPU
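To see why MIPS can mislead, here is a small hypothetical Python comparison; the clock rates, CPIs, and instruction counts are made-up numbers, not measurements:

```python
def mips(clock_rate_hz, cpi):
    # MIPS = Clock rate / (CPI * 10^6); says nothing about how much work each instruction does.
    return clock_rate_hz / (cpi * 1e6)

def exec_time(instruction_count, cpi, clock_rate_hz):
    return instruction_count * cpi / clock_rate_hz

# Machine B has the higher MIPS rating, but its ISA needs twice as many
# instructions for this program, so Machine A actually finishes sooner.
print(mips(3e9, 1.5), exec_time(1e9, 1.5, 3e9))  # A: 2000 MIPS, 0.50 s
print(mips(4e9, 1.2), exec_time(2e9, 1.2, 4e9))  # B: ~3333 MIPS, 0.60 s
```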
Concluding Remarks
Cost/performance is improving
- Due to underlying technology development
Hierarchical layers of abstraction
- In both hardware and software
Instruction set architecture
- The hardware/software interface
Execution time: the best performance measure
Power is a limiting factor
- Use parallelism to improve performance