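Introduction to Hadoop
Hadoop: the Adam and Eve of the open-source big data world.
Its core is the HDFS data storage system and the MapReduce distributed computing framework.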
HDFS
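The idea is to cut large data into blocks and replicate each block three times, storing the copies on three separate cheap machines so that three usable copies always back each other up. A read only needs to fetch one of the copies to obtain that block's data.
The nodes that store data are called datanodes (the cubicles), and the node that manages the datanodes is called the namenode (the umbrella holder).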
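The block-and-replica idea above can be sketched in a few lines of Python. This is only a toy in-memory model, not HDFS code: the block size, the node names, and the helpers split_into_blocks, place_replicas and read_file are all made up for illustration, and real HDFS replica placement is considerably more involved.

```python
import itertools

REPLICATION = 3  # copies kept of each block, as described above


def split_into_blocks(data, block_size):
    """Cut the data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(blocks, nodes):
    """Toy 'namenode': record which nodes hold each block.

    Replicas are handed out round-robin, so the three copies of a block
    always land on three different nodes.
    """
    if len(nodes) < REPLICATION:
        raise ValueError("need at least as many nodes as replicas")
    storage = {node: {} for node in nodes}  # toy 'datanodes'
    block_map = {}                          # block id -> nodes holding it
    ring = itertools.cycle(nodes)
    for block_id, block in enumerate(blocks):
        holders = [next(ring) for _ in range(REPLICATION)]
        for node in holders:
            storage[node][block_id] = block
        block_map[block_id] = holders
    return block_map, storage


def read_file(block_map, storage, dead_nodes=()):
    """Rebuild the data, reading each block from any one live replica."""
    out = []
    for block_id in sorted(block_map):
        live = [n for n in block_map[block_id] if n not in dead_nodes]
        if not live:
            raise IOError(f"block {block_id} has no live replica")
        out.append(storage[live[0]][block_id])
    return b"".join(out)


data = b"hello hadoop distributed file system"
blocks = split_into_blocks(data, block_size=8)
block_map, storage = place_replicas(blocks, ["node1", "node2", "node3", "node4"])
# Losing one machine does not lose data: another replica serves each block.
restored = read_file(block_map, storage, dead_nodes={"node2"})
print(restored == data)
```

The block_map dictionary plays the namenode's role (it only knows where blocks live), while storage plays the datanodes' role (it holds the bytes).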
MapReduce
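The idea is to first split a large job into piles and process each pile (Map), then aggregate the partial results (Reduce). Both the splitting and the aggregating run in parallel on multiple servers, which is what brings out the power of the cluster. The hard part is working out how to decompose a job into map and reduce steps that fit the MapReduce model, and what the <k,v> inputs and outputs of each of the two phases should be.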
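To make the <k,v> question concrete, here is a minimal single-process sketch of a word count in Python. It only imitates the shape of the model; the function names map_phase, shuffle and reduce_phase and the sample lines are my own choices for illustration, not Hadoop APIs.

```python
from collections import defaultdict


def map_phase(line):
    """Map: turn one input line into a list of <word, 1> pairs."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as the framework does between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single <word, count>."""
    return key, sum(values)


lines = ["big data big cluster", "data node name node"]

# On a cluster, each line (split) could be mapped on a different server.
mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle(mapped)
# Each key could likewise be reduced on a different server.
counts = dict(reduce_phase(key, values) for key, values in grouped.items())
print(counts)
```

Here the map input is a line of text, the map output and reduce input are <word, 1> pairs, and the reduce output is <word, total>.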
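Introduction to single-node Hadoop
For anyone learning how Hadoop works or doing Hadoop development, setting up a Hadoop system is a must. But
- Configuring the system is a real headache, and many people give up partway through the configuration.
- You may not have a server to use.
This article introduces a configuration-free way to install and use single-node Hadoop, so that you can quickly and easily run Hadoop examples to support learning, development, and testing.
It requires a Linux virtual machine on your laptop, with docker installed on the virtual machine.
Installation
Use docker to download the sequenceiq/hadoop-docker:2.7.0 image and run it.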
[root@bogon ~]# docker pull sequenceiq/hadoop-docker:2.7.0
2.7.0: Pulling from sequenceiq/hadoop-docker
860d0823bcab: Pulling fs layer
e592c61b2522: Pulling fs layer
When the download succeeds, the output ends with
Digest: sha256:a40761746eca036fee6aafdf9fdbd6878ac3dd9a7cd83c0f3f5d8a0e6350c76a
Status: Downloaded newer image for sequenceiq/hadoop-docker:2.7.0
Starting
[root@bogon ~]# docker run -it --privileged=true sequenceiq/hadoop-docker:2.7.0 /etc/bootstrap.sh -bash
Starting sshd: [OK]
Starting namenodes on [b7a42f79339c]
b7a42f79339c: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-b7a42f79339c.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-b7a42f79339c.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-b7a42f79339c.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn--resourcemanager-b7a42f79339c.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-b7a42f79339c.out
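After a successful start, the command-line shell automatically lands inside the Hadoop container environment, so there is no need to run docker exec. Inside the container, go to /usr/local/hadoop/sbin and run ./start-all.sh and ./mr-jobhistory-daemon.sh start historyserver, as follows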
bash-4.1# cd /usr/local/hadoop/sbin
bash-4.1# ./start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [b7a42f79339c]
b7a42f79339c: namenode running as process 128. Stop it first.
localhost: datanode running as process 219. Stop it first.
Starting secondary namenodes [0.0.0.0]
0.0.0.0: secondarynamenode running as process 402. Stop it first.
starting yarn daemons
resourcemanager running as process 547. Stop it first.
localhost: nodemanager running as process 641. Stop it first.
bash-4.1# ./mr-jobhistory-daemon.sh start historyserver
chown: missing operand after `/usr/local/hadoop/logs'
Try `chown --help' for more information.
starting historyserver, logging to /usr/local/hadoop/logs/mapred--historyserver-b7a42f79339c.out
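Hadoop is now up and running. It is that simple.
If you want to know how painful a distributed deployment is, just count how many configuration files there are! I once watched a Hadoop veteran spend a whole morning on configuration because the hostname of a newly replaced server contained a hyphen "-", and the environment still would not come up.
Running the bundled example
Go back to the Hadoop home directory and run the example program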
bash-4.1# cd /usr/local/hadoop
bash-4.1# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output 'dfs[a-z.]+'
20/07/05 22:34:41 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
20/07/05 22:34:43 INFO input.FileInputFormat: Total input paths to process : 31
20/07/05 22:34:43 INFO mapreduce.JobSubmitter: number of splits:31
20/07/05 22:34:44 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1594002714328_0001
20/07/05 22:34:44 INFO impl.YarnClientImpl: Submitted application application_1594002714328_0001
20/07/05 22:34:45 INFO mapreduce.Job: The url to track the job: http://b7a42f79339c:8088/proxy/application_1594002714328_0001/
20/07/05 22:34:45 INFO mapreduce.Job: Running job: job_1594002714328_0001
20/07/05 22:35:04 INFO mapreduce.Job: Job job_1594002714328_0001 running in uber mode : false
20/07/05 22:35:04 INFO mapreduce.Job: map 0% reduce 0%
20/07/05 22:37:59 INFO mapreduce.Job: map 11% reduce 0%
20/07/05 22:38:05 INFO mapreduce.Job: map 12% reduce 0%
When the mapreduce computation completes, the following output appears
20/07/05 22:55:26 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=291
FILE: Number of bytes written=230541
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=569
HDFS: Number of bytes written=197
HDFS: Number of read operations=7
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=5929
Total time spent by all reduces in occupied slots (ms)=8545
Total time spent by all map tasks (ms)=5929
Total time spent by all reduce tasks (ms)=8545
Total vcore-seconds taken by all map tasks=5929
Total vcore-seconds taken by all reduce tasks=8545
Total megabyte-seconds taken by all map tasks=6071296
Total megabyte-seconds taken by all reduce tasks=8750080
Map-Reduce Framework
Map input records=11
Map output records=11
Map output bytes=263
Map output materialized bytes=291
Input split bytes=132
Combine input records=0
Combine output records=0
Reduce input groups=5
Reduce shuffle bytes=291
Reduce input records=11
Reduce output records=11
Spilled Records=22
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=159
CPU time spent (ms)=1280
Physical memory (bytes) snapshot=303452160
Virtual memory (bytes) snapshot=1291390976
Total committed heap usage (bytes)=136450048
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=437
File Output Format Counters
Bytes Written=197
Use the hdfs command to view the output
bash-4.1# bin/hdfs dfs -cat output/*
6 dfs.audit.logger
4 dfs.class
3 dfs.server.namenode.
2 dfs.period
2 dfs.audit.log.maxfilesize
2 dfs.audit.log.maxbackupindex
1 dfsmetrics.log
1 dfsadmin
1 dfs.servers
1 dfs.replication
1 dfs.file
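Explanation of the example
grep is a mapreduce program that counts regular-expression matches in the input: it extracts the strings that match the regular expression together with the number of times each one occurs.
The shell's grep prints the whole matching line, whereas this command prints only the matched string within the line.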
grep input output 'dfs[a-z.]+'
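The regular expression dfs[a-z.]+ matches the literal text dfs followed by one or more characters, each of which is either a lowercase letter or a literal dot. Inside the square brackets the dot stands for itself, not for "any character".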
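A quick way to check how this pattern behaves is to try it in Python, whose re module treats this particular expression the same way as the Java regex engine the Hadoop example runs on. The sample strings below are invented for the demonstration.

```python
import re

PATTERN = re.compile(r"dfs[a-z.]+")

line = "log4j.logger.dfs.audit.logger=INFO, see dfsadmin and DFS_HOME"

# Shell grep would print the whole line; the example keeps only the matches.
matches = PATTERN.findall(line)
print(matches)  # ['dfs.audit.logger', 'dfsadmin']

# Inside [...] the dot is a literal dot, so '_' and digits end the match.
edge_cases = PATTERN.findall("dfs_x dfs9 dfs.a9")
print(edge_cases)  # ['dfs.a']
```

Note that the match does not have to start at the beginning of a line, and that the uppercase DFS_HOME is not matched at all.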
The input is all the files in input,
bash-4.1# ls -lrt
total 48
-rw-r--r--. 1 root root 690 May 16 2015 yarn-site.xml
-rw-r--r--. 1 root root 5511 May 16 2015 kms-site.xml
-rw-r--r--. 1 root root 3518 May 16 2015 kms-acls.xml
-rw-r--r--. 1 root root 620 May 16 2015 httpfs-site.xml
-rw-r--r--. 1 root root 775 May 16 2015 hdfs-site.xml
-rw-r--r--. 1 root root 9683 May 16 2015 hadoop-policy.xml
-rw-r--r--. 1 root root 774 May 16 2015 core-site.xml
-rw-r--r--. 1 root root 4436 May 16 2015 capacity-scheduler.xml
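The results are written to output.
The computation flow is as follows
What is slightly different here is that there are two reduce passes; the second one simply sorts the results by occurrence count. Developers can combine map and reduce stages however they like, as long as each stage's output lines up with the next stage's input.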
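The two-stage flow described above can be imitated locally in Python. This is a rough single-process sketch rather than the example's actual Java code; the helper names count_matches and sort_by_count and the sample lines are assumptions for illustration.

```python
import re
from collections import Counter

PATTERN = re.compile(r"dfs[a-z.]+")


def count_matches(lines):
    """Stage 1: emit <match, 1> for every regex match and sum per match."""
    counts = Counter()
    for line in lines:
        for match in PATTERN.findall(line):
            counts[match] += 1
    return counts


def sort_by_count(counts):
    """Stage 2: swap to <count, match> and order by count, largest first."""
    swapped = [(count, match) for match, count in counts.items()]
    return sorted(swapped, key=lambda pair: pair[0], reverse=True)


lines = [
    "dfs.replication=1",
    "dfs.audit.logger=INFO dfs.audit.logger=WARN",
    "run dfsadmin, then check dfs.audit.logger",
]
result = sort_by_count(count_matches(lines))
for count, match in result:
    print(count, match)
```

The output of stage 1 (<match, count>) is exactly what stage 2 takes as input, which is the "outputs must line up with inputs" requirement in miniature.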
Introduction to the management UI
Hadoop provides a web-based management UI,
Port | Purpose |
---|---|
50070 | Hadoop Namenode UI port |
50075 | Hadoop Datanode UI port |
50090 | Hadoop SecondaryNamenode port |
50030 | JobTracker monitoring port |
50060 | TaskTrackers port |
8088 | Yarn job monitoring port |
60010 | Hbase HMaster monitoring UI port |
60030 | Hbase HRegionServer port |
8080 | Spark monitoring UI port |
4040 | Spark job UI port |
Adding command parameters
The docker run command needs port-mapping parameters before the UI management pages can be reached
docker run -it --privileged=true -p 50070:50070 -p 8088:8088 -p 50075:50075 sequenceiq/hadoop-docker:2.7.0 /etc/bootstrap.sh -bash
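After running this command, the system can be viewed from a browser on the host machine; if the Linux VM has a browser, it can of course be viewed there too. My Linux has no graphical interface, so I view it from the host.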
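Before opening a browser, it can be handy to check that the mapped ports are actually reachable. The following is a small sketch using only Python's standard socket module; the function names and the default host "localhost" are my assumptions, and from the laptop you would pass the address of the Linux virtual machine instead.

```python
import socket

# Ports published by the docker run command above.
UI_PORTS = {50070: "Namenode UI", 8088: "Yarn UI", 50075: "Datanode UI"}


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_ui_ports(host="localhost"):
    """Map each UI port to whether it is currently reachable on host."""
    return {port: is_port_open(host, port) for port in UI_PORTS}


if __name__ == "__main__":
    for port, reachable in check_ui_ports().items():
        state = "reachable" if reachable else "not reachable"
        print(f"{port} ({UI_PORTS[port]}): {state}")
```

A port that shows as not reachable usually means either the container is not running or the -p mapping for that port was left out.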
50070: Hadoop Namenode UI port
50075: Hadoop Datanode UI port
8088: Yarn job monitoring port
Both completed and running mapreduce jobs can be viewed on 8088; the screenshot above shows two jobs, grep and wordcount.
Some problems
1. ./sbin/mr-jobhistory-daemon.sh start historyserver must be run; otherwise the following is reported while a job is running
20/06/29 21:18:49 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
java.io.IOException: java.net.ConnectException: Call From 87a4217b9f8a/172.17.0.1 to 0.0.0.0:10020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
2. ./start-all.sh must be run; otherwise an error of the following form is reported
Unknown Job job_1592960164748_0001
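3. The docker run command must include --privileged=true among its options, before the image name (Docker passes anything after the image name to the container's command); otherwise java.io.IOException: Job status not available is reported while a job is running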
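4. Note that Hadoop does not overwrite result files by default, so running the example above a second time reports an error; ./output has to be deleted first. Or try a different name such as output01.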
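Summary
The method in this article gets Hadoop installed and configured at low cost, which helps with learning, understanding, development, and testing. To develop your own Hadoop program, package it as a jar, upload it to the share/hadoop/mapreduce/ directory, and run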
bin/hadoop jar share/hadoop/mapreduce/yourtest.jar
to run the program and observe the result.