Background
As big data applications have continued to advance, data warehouse and data lake practices keep emerging one after another; from telecom and finance to government, the big data wave is booming across industries. Over the past four to five years, however, we have watched enterprise users' data-bloat problem grow steadily worse: with ongoing big data innovation, data storage costs rise linearly, which has made enterprises cautious about big data adoption and has, in effect, slowed the pace of their internal data transformation.
The core challenge: how to build the data lake storage layer more economically.
Since the release of the first big data storage engine in 2006, the ecosystem has flourished. On the compute side, MapReduce, Spark, Hive, Impala, Presto, Storm, and Flink have kept pushing into new application domains, while storage has evolved far more conservatively: over the past decade and more, the storage systems most widely discussed in the Apache Hadoop ecosystem have remained HDFS and Ozone.
HDFS
Hadoop HDFS is a distributed file system designed to run on commodity hardware for broad applicability. It has much in common with existing distributed file systems, but its distinguishing traits are clear: it is highly fault tolerant, designed for deployment on low-cost hardware, and supports horizontal scale-out. HDFS provides high-throughput access to application data and suits services that must process massive data sets.
Ozone
Apache Ozone is a highly scalable distributed store for analytics, big data, and cloud-native applications. Ozone supports an S3-compatible object API as well as a Hadoop-compatible file system protocol, and it is optimized for efficient object store and file system operations.
An economical data storage strategy comes down to two key capabilities; once they are in place, every further enhancement is icing on the cake:
- store each piece of data on the storage system best suited to it;
- keep the storage strategy as non-intrusive to upper-layer applications as possible.
Take the typical HDFS 3-replica policy: it ensures high availability of data blocks, and the extra replicas also improve data locality and therefore access throughput. To serve data well, such clusters are usually built on comparatively good disks. In the early days of big data practice, standardized hardware and software choices helped drive adoption of the new stack. But as data accumulates, access frequency for much of it drops exponentially; cold data kept only for compliance audits occupies a large share of the production cluster yet may not be read even once a year. That is an enormous waste of resources.
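To put rough numbers on the replica cost, here is a back-of-the-envelope sketch (the path and figures are hypothetical, not from this article's experiments): under the default 3-replica policy, 1 PB of logical data occupies about 3 PB of raw disk, so shedding even one replica of rarely-read data reclaims a third of its raw footprint.

hdfs dfs -stat %r /archive/events/2019      # print the current replication factor of a (hypothetical) archive path
hdfs dfs -setrep -w 2 /archive/events/2019  # lower rarely-read data to 2 replicas and wait for completion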
At the current stage of big data development, fine-grained data storage is on the agenda. What is needed is a tiered storage architecture that preserves existing compute performance while migrating warm and cold data automatically and transparently to the upper-layer applications, keeping storage and maintenance costs under control.
Validating the Key Capabilities
In this article we take a first pass at an economical data storage strategy: we first make the two key capabilities mentioned above concrete, and then examine technical feasibility through a few experiments.
Key capability 1: use one storage system as the hot-data tier and another storage system as the cold-data tier;
Key capability 2: a unified namespace that spans multiple storage systems and serves data access to upper-layer applications through that single namespace.
Technology choices:
- Compute engine: Hive (most enterprise users rely on a SQL engine as their data development tool)
- Storage engines: HDFS/Ozone (storage systems commonly used in the Apache ecosystem)
- Data orchestration engine: Alluxio (a third-party open-source component compatible with most Apache-ecosystem components)
Hive
The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command-line tool and a JDBC driver are provided to connect users to Hive.
About Alluxio
The Alluxio data orchestration system is the world's first distributed data orchestration system for very large scale, incubated at the AMPLab at UC Berkeley. Since the project was open sourced, more than 1,200 contributors from over 300 organizations have taken part in its development. Alluxio orchestrates data across clusters, regions, and countries in any cloud, bringing it closer to the clusters running data analytics and AI/ML applications and giving upper-layer applications memory-speed data access.
As a de facto standard for compute-storage separation in the big data ecosystem, it has been proven in production by top cloud vendors such as Alibaba Cloud, Tencent Cloud, Huawei Cloud, and Kingsoft Cloud, and serves as a cornerstone technology for building enterprise private clouds. Since the company's founding in 2021, it has appeared on lists including the Zhongguancun International Frontier Technology Innovation Competition TOP 10 for big data and cloud computing, the 2021 PEdaily Digital Technology VENTURE50, and the "Innovation China" open source innovation list.
We conduct the feasibility study in two phases:
Phase 1: use the same type of storage system (HDFS) and implement hot/cold tiering across two HDFS clusters [simulated scenario: a dedicated cold-data HDFS cluster built with HDFS 3.x erasure coding (EC) or with disk-dense machines; see the EC sketch after this list]
Phase 2: use different types of storage systems, with HDFS as the hot-data store and Ozone as the cold-data store [simulated scenario: HDFS serves hot data, Ozone serves cold data]
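For reference, the "HDFS 3.x EC" cold cluster mentioned in Phase 1 could be set up roughly as follows. This is a minimal sketch assuming a Hadoop 3 cluster with the built-in RS-6-3-1024k policy available; the /cold path is hypothetical. RS(6,3) stores data at a 1.5x raw-space overhead instead of the 3x of triple replication:

hdfs ec -enablePolicy -policy RS-6-3-1024k     # enable the built-in Reed-Solomon 6+3 policy
hdfs dfs -mkdir -p /cold                       # hypothetical cold-data root
hdfs ec -setPolicy -path /cold -policy RS-6-3-1024k
hdfs ec -getPolicy -path /cold                 # verify: should print RS-6-3-1024k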
Validation Steps

Deployment architecture

Software versions:
- Compute engine: Hive 2.3.9
- Storage engines: Hadoop 2.10.1, Ozone 1.2.1, Alluxio 2.8
- All components deployed in single-node mode

Cluster layout:
Experiment 1: transparent hot/cold data tiering across HDFS clusters with Alluxio
Step 1: Create the database and a partitioned table in Hive; by default the data is stored on HDFS_1
create database test location "/user/hive/test.db";
create external table test.test_part(value string) partitioned by (dt string);
Create the database
hive> create database test location '/user/hive/test.db';
OK
Time taken: 1.697 seconds
hive>
Create the table
hive> create external table test.test_part(value string) partitioned by (dt string);
OK
Time taken: 0.607 seconds
hive>
Step 2: Integrate the two HDFS clusters into a unified namespace with an Alluxio Union URI
alluxio fs mount \
  --option alluxio-union.hdfs1.uri=hdfs://namenode_1:8020/user/hive/test.db/test_part \
  --option alluxio-union.hdfs2.uri=hdfs://namenode_2:8020/user/hive/test.db/test_part \
  --option alluxio-union.priority.read=hdfs1,hdfs2 \
  --option alluxio-union.collection.create=hdfs1 \
  /user/hive/test.db/test_part union://test_part/
Mount the test directory as an Alluxio Union URI
[root@ip-172-31-17-3 ~]# alluxio fs mkdir /user/hive/test.db
Successfully created directory /user/hive/test.db
[root@ip-172-31-17-3 conf]# alluxio fs mount \
> --option alluxio-union.hdfs1.uri=hdfs://ip-172-31-30-130.us-west-2.compute.internal:8020/user/hive/test.db/test_part \
> --option alluxio-union.hdfs2.uri=hdfs://ip-172-31-19-127.us-west-2.compute.internal:8020/user/hive/test.db/test_part \
> --option alluxio-union.priority.read=hdfs1,hdfs2 \
> --option alluxio-union.collection.create=hdfs1 \
> /user/hive/test.db/test_part union://test_part/
Mounted union://test_part/ at /user/hive/test.db/test_part
[root@ip-172-31-17-3 ~]#
Step 3: Change the Hive table location to the Union URI path, hiding the cross-storage details from Hive
alter table test.test_part set location "alluxio://alluxio:19998/user/hive/test.db/test_part";
Change the Hive table location
hive> alter table test.test_part set location "alluxio://ip-172-31-17-3.us-west-2.compute.internal:19998/user/hive/test.db/test_part";
OK
Time taken: 0.143 seconds
hive>
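As a quick sanity check (not part of the original walkthrough), the new location can be confirmed from Hive itself; the LOCATION clause in the output should now show the alluxio:// URI set above:

show create table test.test_part;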
Step 4: Generate mock data
mkdir dt\=2022-06-0{1..6}
echo 1abc > dt\=2022-06-01/000000_0
echo 2def > dt\=2022-06-02/000000_0
echo 3ghi > dt\=2022-06-03/000000_0
echo 4jkl > dt\=2022-06-04/000000_0
echo 5mno > dt\=2022-06-05/000000_0
echo 6pqr > dt\=2022-06-06/000000_0
hdfs dfs -put dt\=2022-06-0{1..3} hdfs://namenode_1:8020/user/hive/test.db/test_part
hdfs dfs -put dt\=2022-06-0{4..6} hdfs://namenode_2:8020/user/hive/test.db/test_part
[root@ip-172-31-17-3 ~]# mkdir dt\=2022-06-0{1..6}
[root@ip-172-31-17-3 ~]# echo 1abc > dt\=2022-06-01/000000_0
[root@ip-172-31-17-3 ~]# echo 2def > dt\=2022-06-02/000000_0
[root@ip-172-31-17-3 ~]# echo 3ghi > dt\=2022-06-03/000000_0
[root@ip-172-31-17-3 ~]# echo 4jkl > dt\=2022-06-04/000000_0
[root@ip-172-31-17-3 ~]# echo 5mno > dt\=2022-06-05/000000_0
[root@ip-172-31-17-3 ~]# echo 6pqr > dt\=2022-06-06/000000_0
Store the mock data into hdfs1 and hdfs2 respectively
[root@ip-172-31-17-3 ~]# hdfs dfs -put dt\=2022-06-0{1..3} hdfs://ip-172-31-30-130.us-west-2.compute.internal:8020/user/hive/test.db/test_part
[root@ip-172-31-17-3 ~]# hdfs dfs -mkdir -p hdfs://ip-172-31-19-127.us-west-2.compute.internal:8020/user/hive/test.db/test_part
[root@ip-172-31-17-3 ~]# hdfs dfs -put dt\=2022-06-0{4..6} hdfs://ip-172-31-19-127.us-west-2.compute.internal:8020/user/hive/test.db/test_part
Query hdfs1 and hdfs2 to confirm the data landed
[root@ip-172-31-17-3 ~]# hdfs dfs -ls hdfs://ip-172-31-30-130.us-west-2.compute.internal:8020/user/hive/test.db/test_part
Found 3 items
drwxr-xr-x - root hdfsadmingroup 0 2022-07-13 08:09 hdfs://ip-172-31-30-130.us-west-2.compute.internal:8020/user/hive/test.db/test_part/dt=2022-06-01
drwxr-xr-x - root hdfsadmingroup 0 2022-07-13 08:09 hdfs://ip-172-31-30-130.us-west-2.compute.internal:8020/user/hive/test.db/test_part/dt=2022-06-02
drwxr-xr-x - root hdfsadmingroup 0 2022-07-13 08:09 hdfs://ip-172-31-30-130.us-west-2.compute.internal:8020/user/hive/test.db/test_part/dt=2022-06-03
[root@ip-172-31-17-3 ~]# hdfs dfs -ls hdfs://ip-172-31-19-127.us-west-2.compute.internal:8020/user/hive/test.db/test_part
Found 3 items
drwxr-xr-x - root hdfsadmingroup 0 2022-07-13 08:10 hdfs://ip-172-31-19-127.us-west-2.compute.internal:8020/user/hive/test.db/test_part/dt=2022-06-04
drwxr-xr-x - root hdfsadmingroup 0 2022-07-13 08:10 hdfs://ip-172-31-19-127.us-west-2.compute.internal:8020/user/hive/test.db/test_part/dt=2022-06-05
drwxr-xr-x - root hdfsadmingroup 0 2022-07-13 08:10 hdfs://ip-172-31-19-127.us-west-2.compute.internal:8020/user/hive/test.db/test_part/dt=2022-06-06
Query the Alluxio Union URI to double-check the data in hdfs1 and hdfs2 and confirm the cross-storage Union URI mapping is in effect
[root@ip-172-31-17-3 ~]# alluxio fs ls /user/hive/test.db/test_part
drwxr-xr-x root hdfsadmingroup 1 PERSISTED 07-13-2022 08:09:19:243 DIR /user/hive/test.db/test_part/dt=2022-06-02
drwxr-xr-x root hdfsadmingroup 1 PERSISTED 07-13-2022 08:09:19:219 DIR /user/hive/test.db/test_part/dt=2022-06-01
drwxr-xr-x root hdfsadmingroup 1 PERSISTED 07-13-2022 08:10:49:740 DIR /user/hive/test.db/test_part/dt=2022-06-06
drwxr-xr-x root hdfsadmingroup 1 PERSISTED 07-13-2022 08:10:49:721 DIR /user/hive/test.db/test_part/dt=2022-06-05
drwxr-xr-x root hdfsadmingroup 1 PERSISTED 07-13-2022 08:10:49:698 DIR /user/hive/test.db/test_part/dt=2022-06-04
drwxr-xr-x root hdfsadmingroup 1 PERSISTED 07-13-2022 08:09:19:263 DIR /user/hive/test.db/test_part/dt=2022-06-03
[root@ip-172-31-17-3 ~]#
Step 5: Refresh the Hive table metadata
MSCK REPAIR TABLE test.test_part;

hive> MSCK REPAIR TABLE test.test_part;
OK
Partitions not in metastore: test_part:dt=2022-06-01 test_part:dt=2022-06-02 test_part:dt=2022-06-03 test_part:dt=2022-06-04 test_part:dt=2022-06-05 test_part:dt=2022-06-06
Repair: Added partition to metastore test.test_part:dt=2022-06-01
Repair: Added partition to metastore test.test_part:dt=2022-06-02
Repair: Added partition to metastore test.test_part:dt=2022-06-03
Repair: Added partition to metastore test.test_part:dt=2022-06-04
Repair: Added partition to metastore test.test_part:dt=2022-06-05
Repair: Added partition to metastore test.test_part:dt=2022-06-06
Time taken: 1.677 seconds, Fetched: 7 row(s)
After the metadata refresh, a select shows the Union URI mapping surfacing in the Hive table
hive> select * from test.test_part;
OK
1abc  2022-06-01
2def  2022-06-02
3ghi  2022-06-03
4jkl  2022-06-04
5mno  2022-06-05
6pqr  2022-06-06
Time taken: 1.624 seconds, Fetched: 6 row(s)
hive>
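MSCK REPAIR TABLE scans the table's location (here the Alluxio Union URI) and registers any partitions missing from the metastore. The manual equivalent, had we preferred explicit registration, would be one statement per partition, for example:

alter table test.test_part add partition (dt='2022-06-01');
alter table test.test_part add partition (dt='2022-06-02');
-- ...and so on for the remaining dt values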
Step 6: Configure the automatic hot/cold tiering policy
alluxio fs policy add /user/hive/test.db/test_part "ufsMigrate(olderThan(2m), UFS[hdfs1]:REMOVE, UFS[hdfs2]:STORE)"
Set the policy: cold data (here, data older than 2 minutes) migrates automatically from hot storage (hdfs1) to cold storage (hdfs2)
[root@ip-172-31-17-3 ~]# alluxio fs policy add /user/hive/test.db/test_part "ufsMigrate(olderThan(2m), UFS[hdfs1]:REMOVE, UFS[hdfs2]:STORE)"
Policy ufsMigrate-/user/hive/test.db/test_part is added to /user/hive/test.db/test_part.
Check via the Alluxio CLI that the policy was set successfully
[root@ip-172-31-17-3 ~]# alluxio fs policy list
id: 1657700423909
name: "ufsMigrate-/user/hive/test.db/test_part"
path: "/user/hive/test.db/test_part"
created_at: 1657700423914
scope: "RECURSIVE"
condition: "olderThan(2m)"
action: "DATA(UFS[hdfs1]:REMOVE, UFS[hdfs2]:STORE)"
[root@ip-172-31-17-3 ~]#
Once the policy takes effect, checking hdfs1 and hdfs2 shows that all data older than 2 minutes has been migrated from hdfs1 to hdfs2
[root@ip-172-31-17-3 logs]# hdfs dfs -ls hdfs://ip-172-31-30-130.us-west-2.compute.internal:8020/user/hive/test.db/test_part
[root@ip-172-31-17-3 logs]# hdfs dfs -ls hdfs://ip-172-31-19-127.us-west-2.compute.internal:8020/user/hive/test.db/test_part
Found 6 items
drwxr-xr-x - root hdfsadmingroup 0 2022-07-13 08:26 hdfs://ip-172-31-19-127.us-west-2.compute.internal:8020/user/hive/test.db/test_part/dt=2022-06-01
drwxr-xr-x - root hdfsadmingroup 0 2022-07-13 08:26 hdfs://ip-172-31-19-127.us-west-2.compute.internal:8020/user/hive/test.db/test_part/dt=2022-06-02
drwxr-xr-x - root hdfsadmingroup 0 2022-07-13 08:26 hdfs://ip-172-31-19-127.us-west-2.compute.internal:8020/user/hive/test.db/test_part/dt=2022-06-03
drwxr-xr-x - root hdfsadmingroup 0 2022-07-13 08:10 hdfs://ip-172-31-19-127.us-west-2.compute.internal:8020/user/hive/test.db/test_part/dt=2022-06-04
drwxr-xr-x - root hdfsadmingroup 0 2022-07-13 08:10 hdfs://ip-172-31-19-127.us-west-2.compute.internal:8020/user/hive/test.db/test_part/dt=2022-06-05
drwxr-xr-x - root hdfsadmingroup 0 2022-07-13 08:10 hdfs://ip-172-31-19-127.us-west-2.compute.internal:8020/user/hive/test.db/test_part/dt=2022-06-06
[root@ip-172-31-17-3 logs]#
With the policy in effect, Hive queries return the expected result both during and after the automatic cold-data migration:
hive> select * from test.test_part;
OK
1abc  2022-06-01
2def  2022-06-02
3ghi  2022-06-03
4jkl  2022-06-04
5mno  2022-06-05
6pqr  2022-06-06
Time taken: 0.172 seconds, Fetched: 6 row(s)
hive>
Finally, Figures 1 and 2 present simplified diagrams of the two parts of Experiment 1: (1) federating a Hive table across two HDFS storage systems via Alluxio's Union URI, and (2) transparent hot/cold tiering across the two HDFS storage systems via Alluxio, to make the experiment's goal, process, and results easier to follow.
The next experiment simply swaps the two HDFS storage systems of the previous setup for two heterogeneous systems, HDFS (hot storage) and Ozone (cold storage); in terms of transparent tiering, the effect is identical.
Experiment 2: transparent hot/cold data tiering across heterogeneous storage (HDFS and Ozone) with Alluxio
Step 1: Create the database and table in Hive
create database hdfsToOzone location '/user/hive/hdfsToOzone.db';
create external table hdfsToOzone.test(value string) partitioned by (dt string);
Create the database
hive> create database hdfsToOzone location '/user/hive/hdfsToOzone.db';
OK
Time taken: 0.055 seconds
hive>
Create the table
hive> create external table hdfsToOzone.test(value string) partitioned by (dt string);
OK
Time taken: 0.1 seconds
hive>
Step 2: Integrate the HDFS and Ozone clusters into a unified namespace with an Alluxio Union URI
alluxio fs mount \
  --option alluxio-union.hdfs.uri=hdfs://HDFS1:8020/user/hive/hdfsToOzone.db/test \
  --option alluxio-union.ozone.uri=o3fs://bucket.volume/hdfsToOzone.db/test \
  --option alluxio-union.priority.read=hdfs,ozone \
  --option alluxio-union.collection.create=hdfs \
  --option alluxio.underfs.hdfs.configuration=/mnt1/ozone-1.2.1/etc/hadoop/ozone-site.xml \
  /user/hive/hdfsToOzone.db/test union://HDFS_TO_OZONE/
Create the volume and bucket in Ozone with the command-line tool
[root@ip-172-31-19-127 ~]# ozone sh volume create /v-alluxio
[root@ip-172-31-19-127 ~]# ozone sh bucket create /v-alluxio/b-alluxio
[root@ip-172-31-19-127 ~]# ozone fs -mkdir -p o3fs://b-alluxio.v-alluxio/hdfsToOzone.db/test
[root@ip-172-31-19-127 ~]#
First create the experiment directory in Alluxio, then mount it as a Union URI
[root@ip-172-31-17-3 ~]# alluxio fs mkdir /user/hive/hdfsToOzone.db
Successfully created directory /user/hive/hdfsToOzone.db
[root@ip-172-31-17-3 ~]# alluxio fs mount \
> --option alluxio-union.hdfs.uri=hdfs://ip-172-31-30-130.us-west-2.compute.internal:8020/user/hive/hdfsToOzone.db/test \
> --option alluxio-union.ozone.uri=o3fs://b-alluxio.v-alluxio/hdfsToOzone.db/test \
> --option alluxio-union.priority.read=hdfs,ozone \
> --option alluxio-union.collection.create=hdfs \
> --option alluxio.underfs.hdfs.configuration=/mnt1/ozone-1.2.1/etc/hadoop/ozone-site.xml \
> /user/hive/hdfsToOzone.db/test union://HDFS_TO_OZONE/
Mounted union://HDFS_TO_OZONE/ at /user/hive/hdfsToOzone.db/test
[root@ip-172-31-17-3 ~]#
Step 3: Change the Hive table location to the Union URI path, hiding the cross-heterogeneous-storage details from Hive
alter table hdfsToOzone.test set location "alluxio://alluxio:19998/user/hive/hdfsToOzone.db/test";
Change the Hive table location
hive> alter table hdfsToOzone.test set location "alluxio://ip-172-31-17-3.us-west-2.compute.internal:19998/user/hive/hdfsToOzone.db/test";
OK
Time taken: 1.651 seconds
hive>
Step 4: Generate mock data
ozone fs -put dt\=2022-06-0{1..3} o3fs://b-alluxio.v-alluxio.ozone:9862/hdfsToOzone.db/test
hdfs dfs -put dt\=2022-06-0{4..6} hdfs://HDFS1:8020/user/hive/hdfsToOzone.db/test
Store the data into Ozone
[root@ip-172-31-19-127 ~]# ozone fs -put dt\=2022-06-0{1..3} o3fs://b-alluxio.v-alluxio/hdfsToOzone.db/test
2022-07-13 10:00:38,920 [main] INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2022-07-13 10:00:38,981 [main] INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2022-07-13 10:00:38,981 [main] INFO impl.MetricsSystemImpl: XceiverClientMetrics metrics system started
2022-07-13 10:00:39,198 [main] INFO metrics.MetricRegistries: Loaded MetricRegistries class org.apache.ratis.metrics.impl.MetricRegistriesImpl
Query Ozone via the CLI to confirm the data landed
[root@ip-172-31-19-127 ~]# ozone fs -ls o3fs://b-alluxio.v-alluxio/hdfsToOzone.db/test
Found 3 items
drwxrwxrwx - root root 0 2022-07-13 10:00 o3fs://b-alluxio.v-alluxio/hdfsToOzone.db/test/dt=2022-06-01
drwxrwxrwx - root root 0 2022-07-13 10:00 o3fs://b-alluxio.v-alluxio/hdfsToOzone.db/test/dt=2022-06-02
drwxrwxrwx - root root 0 2022-07-13 10:00 o3fs://b-alluxio.v-alluxio/hdfsToOzone.db/test/dt=2022-06-03
[root@ip-172-31-19-127 ~]#
Store the data into hdfs1, and confirm via the CLI that it landed
[root@ip-172-31-17-3 ~]# hdfs dfs -put dt\=2022-06-0{4..6} hdfs://ip-172-31-30-130.us-west-2.compute.internal:8020/user/hive/hdfsToOzone.db/test
[root@ip-172-31-17-3 ~]# hdfs dfs -ls hdfs://ip-172-31-30-130.us-west-2.compute.internal:8020/user/hive/hdfsToOzone.db/test
Found 3 items
drwxr-xr-x - root hdfsadmingroup 0 2022-07-13 10:06 hdfs://ip-172-31-30-130.us-west-2.compute.internal:8020/user/hive/hdfsToOzone.db/test/dt=2022-06-04
drwxr-xr-x - root hdfsadmingroup 0 2022-07-13 10:06 hdfs://ip-172-31-30-130.us-west-2.compute.internal:8020/user/hive/hdfsToOzone.db/test/dt=2022-06-05
drwxr-xr-x - root hdfsadmingroup 0 2022-07-13 10:06 hdfs://ip-172-31-30-130.us-west-2.compute.internal:8020/user/hive/hdfsToOzone.db/test/dt=2022-06-06
[root@ip-172-31-17-3 ~]#
Query via the Alluxio CLI to double-check the data in hdfs1 and Ozone and confirm the cross-storage Union URI mapping is in effect
[root@ip-172-31-17-3 ~]# alluxio fs ls /user/hive/hdfsToOzone.db/test
drwxrwxrwx root root 0 PERSISTED 07-13-2022 10:00:40:670 DIR /user/hive/hdfsToOzone.db/test/dt=2022-06-02
drwxrwxrwx root root 0 PERSISTED 07-13-2022 10:00:38:691 DIR /user/hive/hdfsToOzone.db/test/dt=2022-06-01
drwxr-xr-x root hdfsadmingroup 0 PERSISTED 07-13-2022 10:06:29:206 DIR /user/hive/hdfsToOzone.db/test/dt=2022-06-06
drwxr-xr-x root hdfsadmingroup 0 PERSISTED 07-13-2022 10:06:29:186 DIR /user/hive/hdfsToOzone.db/test/dt=2022-06-05
drwxr-xr-x root hdfsadmingroup 0 PERSISTED 07-13-2022 10:06:29:161 DIR /user/hive/hdfsToOzone.db/test/dt=2022-06-04
drwxrwxrwx root root 0 PERSISTED 07-13-2022 10:00:40:762 DIR /user/hive/hdfsToOzone.db/test/dt=2022-06-03
[root@ip-172-31-17-3 ~]#
Step 5: Refresh the Hive table metadata
MSCK REPAIR TABLE hdfsToOzone.test;

hive> MSCK REPAIR TABLE hdfsToOzone.test;
OK
Partitions not in metastore: test:dt=2022-06-01 test:dt=2022-06-02 test:dt=2022-06-03 test:dt=2022-06-04 test:dt=2022-06-05 test:dt=2022-06-06
Repair: Added partition to metastore hdfsToOzone.test:dt=2022-06-01
Repair: Added partition to metastore hdfsToOzone.test:dt=2022-06-02
Repair: Added partition to metastore hdfsToOzone.test:dt=2022-06-03
Repair: Added partition to metastore hdfsToOzone.test:dt=2022-06-04
Repair: Added partition to metastore hdfsToOzone.test:dt=2022-06-05
Repair: Added partition to metastore hdfsToOzone.test:dt=2022-06-06
Time taken: 0.641 seconds, Fetched: 7 row(s)
hive>
After the metadata refresh, a select shows the Union URI mapping surfacing in the Hive table
hive> select * from hdfsToOzone.test;
OK
1abc  2022-06-01
2def  2022-06-02
3ghi  2022-06-03
4jkl  2022-06-04
5mno  2022-06-05
6pqr  2022-06-06
Time taken: 0.156 seconds, Fetched: 6 row(s)
hive>
Step 6: Configure the policy
alluxio fs policy add /user/hive/hdfsToOzone.db/test "ufsMigrate(olderThan(2m), UFS[hdfs]:REMOVE, UFS[ozone]:STORE)"
Set the policy: cold data (here, data older than 2 minutes) migrates automatically from hot storage (hdfs) to cold storage (ozone)
[root@ip-172-31-17-3 ~]# alluxio fs policy add /user/hive/hdfsToOzone.db/test/ "ufsMigrate(olderThan(2m), UFS[hdfs]:REMOVE, UFS[ozone]:STORE)"
Policy ufsMigrate-/user/hive/hdfsToOzone.db/test is added to /user/hive/hdfsToOzone.db/test.
Check via the Alluxio CLI that the policy was set successfully
[root@ip-172-31-17-3 ~]# alluxio fs policy list
id: 1657707130843
name: "ufsMigrate-/user/hive/hdfsToOzone.db/test"
path: "/user/hive/hdfsToOzone.db/test"
created_at: 1657707130843
scope: "RECURSIVE"
condition: "olderThan(2m)"
action: "DATA(UFS[hdfs]:REMOVE, UFS[ozone]:STORE)"
[root@ip-172-31-17-3 ~]#
Once the policy takes effect, checking hdfs1 and Ozone shows that all data older than 2 minutes has been migrated from hdfs1 to Ozone
[root@ip-172-31-17-3 ~]# ozone fs -ls o3fs://b-alluxio.v-alluxio/hdfsToOzone.db/test
Found 6 items
drwxrwxrwx - root root 0 2022-07-13 10:00 o3fs://b-alluxio.v-alluxio/hdfsToOzone.db/test/dt=2022-06-01
drwxrwxrwx - root root 0 2022-07-13 10:00 o3fs://b-alluxio.v-alluxio/hdfsToOzone.db/test/dt=2022-06-02
drwxrwxrwx - root root 0 2022-07-13 10:00 o3fs://b-alluxio.v-alluxio/hdfsToOzone.db/test/dt=2022-06-03
drwxrwxrwx - root root 0 2022-07-13 10:21 o3fs://b-alluxio.v-alluxio/hdfsToOzone.db/test/dt=2022-06-04
drwxrwxrwx - root root 0 2022-07-13 10:21 o3fs://b-alluxio.v-alluxio/hdfsToOzone.db/test/dt=2022-06-05
drwxrwxrwx - root root 0 2022-07-13 10:21 o3fs://b-alluxio.v-alluxio/hdfsToOzone.db/test/dt=2022-06-06
[root@ip-172-31-17-3 ~]# hdfs dfs -ls hdfs://ip-172-31-30-130.us-west-2.compute.internal:8020/user/hive/hdfsToOzone.db/test
[root@ip-172-31-17-3 ~]#
With the policy in effect, Hive queries return the expected result both during and after the migration:
hive> select * from hdfsToOzone.test;
OK
1abc  2022-06-01
2def  2022-06-02
3ghi  2022-06-03
4jkl  2022-06-04
5mno  2022-06-05
6pqr  2022-06-06
Time taken: 0.144 seconds, Fetched: 6 row(s)
hive>
Experiment Summary
As the walkthrough shows, Experiment 2 proceeds and behaves almost exactly like Experiment 1; the only change is that the cold-data store switched from hdfs2 to a heterogeneous system, Ozone.
These experiments fully verify how Alluxio data orchestration decouples upper-layer applications (for example, a Hive-based data warehouse) from the underlying persistence strategy (HDFS or Ozone, tiered or not). They also demonstrate Alluxio's generality and ease of use across heterogeneous storage systems.
We hope this article offers some inspiration for applying Alluxio to an economical data storage strategy.
Appendix

Integrating Alluxio with Hive and HDFS

Alluxio configuration
echo 'export ALLX_HOME=/mnt1/alluxio' >> ~/.bashrc
echo 'export PATH=$PATH:$ALLX_HOME/bin' >> ~/.bashrc
alluxio.master.hostname=ip-172-31-17-3.us-west-2.compute.internal
alluxio.underfs.address=hdfs://ip-172-31-30-130.us-west-2.compute.internal:8020/alluxio
alluxio.worker.tieredstore.level0.dirs.path=/alluxio/ramdisk
alluxio.worker.memory.size=4G
alluxio.worker.tieredstore.levels=1
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.user.file.readtype.default=CACHE
alluxio.user.file.writetype.default=ASYNC_THROUGH
alluxio.security.login.impersonation.username=_HDFS_USER_
alluxio.master.security.impersonation.yarn.groups=*
alluxio.master.security.impersonation.hive.groups=*
alluxio.user.metrics.collection.enabled=true
alluxio.user.block.size.bytes.default=64MB
######## Explore ########
alluxio.user.block.write.location.policy.class=alluxio.client.block.policy.DeterministicHashPolicy
alluxio.user.ufs.block.read.location.policy=alluxio.client.block.policy.DeterministicHashPolicy
alluxio.user.ufs.block.read.location.policy.deterministic.hash.shards=1
alluxio.user.file.persist.on.rename=true
alluxio.master.persistence.blacklist=.staging,_temporary,.tmp
alluxio.user.file.passive.cache.enabled=false
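The walkthrough does not show the Alluxio startup itself. With the alluxio-site.properties above in place, a single-node deployment is typically formatted and started with the standard Alluxio 2.x scripts (a sketch; SudoMount lets the startup script mount the ramdisk):

alluxio format                      # format the master journal (first run only)
alluxio-start.sh local SudoMount    # start a local master and worker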
Hive client core-site.xml
cp /hadoop_home/etc/hadoop/core-site.xml /hive_home/conf
Copy the client jar into the lib directories under the Hadoop and Hive homes
cp /<PATH_TO_ALLUXIO>/client/alluxio-enterprise-2.8.0-1.0-client.jar /hadoop_home/share/lib
cp /<PATH_TO_ALLUXIO>/client/alluxio-enterprise-2.8.0-1.0-client.jar /hive_home/lib
Configure the Alluxio file system
vim /hive_home/conf/core-site.xml

<property>
  <name>fs.alluxio.impl</name>
  <value>alluxio.hadoop.FileSystem</value>
</property>
<property>
  <name>alluxio.master.rpc.addresses</name>
  <value>ip-172-31-17-3.us-west-2.compute.internal:19998</value>
</property>
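With the client jar copied and the core-site.xml entries above in place, a quick way to verify the integration before running any Hive query (a sanity check, not in the original) is to list the Alluxio namespace through the Hadoop-compatible client:

hadoop fs -ls alluxio://ip-172-31-17-3.us-west-2.compute.internal:19998/
# If the root listing comes back, Hive, which uses the same Hadoop FileSystem
# plumbing, will be able to resolve alluxio:// table locations as well.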
HDFS authorization

Check the HDFS superuser group
vim /hadoop_home/etc/hadoop/hdfs-site.xml

<property>
  <name>dfs.permissions.superusergroup</name>
  <value>hdfsadmingroup</value>
</property>
Add the user running Alluxio (root here) to the supergroup
groupadd hdfsadmingroup
usermod -a -G hdfsadmingroup root
Sync the OS group information to HDFS
su - hdfs -s /bin/bash -c "hdfs dfsadmin -refreshUserToGroupsMappings"
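To confirm the NameNode now resolves the new membership (an extra check, not in the original):

hdfs groups root
# expected output along the lines of: root : root hdfsadmingroup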
Enable HDFS ACLs
vim /hadoop_home/etc/hadoop/hdfs-site.xml

<property>
  <name>dfs.permissions.enabled</name>
  <value>true</value>
</property>
<property>
  <name>dfs.namenode.acls.enabled</name>
  <value>true</value>
</property>

su - hdfs -s /bin/bash -c "hdfs dfs -setfacl -R -m user:root:rwx /"
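The grant can be double-checked afterwards (again an extra check, not shown in the original):

hdfs dfs -getfacl /
# the listing should include an entry like: user:root:rwx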
Ozone deployment

Configuration
wget https://dlcdn.apache.org/ozone/1.2.1/ozone-1.2.1.tar.gz
echo 'export OZONE_HOME=/mnt1/ozone-1.2.1' >> ~/.bashrc
echo 'export PATH=$PATH:$OZONE_HOME/bin:$OZONE_HOME/sbin' >> ~/.bashrc
Add the required configuration to ozone-site.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<configuration>
  <property>
    <name>ozone.om.address</name>
    <value>ip-172-31-19-127.us-west-2.compute.internal:9862</value>
  </property>
  <property>
    <name>ozone.metadata.dirs</name>
    <value>/mnt/ozone-1.2.1/metadata/ozone</value>
  </property>
  <property>
    <name>ozone.scm.client.address</name>
    <value>ip-172-31-19-127.us-west-2.compute.internal:9860</value>
  </property>
  <property>
    <name>ozone.scm.names</name>
    <value>ip-172-31-19-127.us-west-2.compute.internal</value>
  </property>
  <property>
    <name>ozone.scm.datanode.id.dir</name>
    <value>/mnt/ozone-1.2.1/metadata/ozone/node</value>
  </property>
  <property>
    <name>ozone.om.db.dirs</name>
    <value>/mnt/ozone-1.2.1/metadata/ozone/omdb</value>
  </property>
  <property>
    <name>ozone.scm.db.dirs</name>
    <value>/mnt/ozone-1.2.1/metadata/ozone/scmdb</value>
  </property>
  <property>
    <name>hdds.datanode.dir</name>
    <value>/mnt/ozone-1.2.1/datanode/data</value>
  </property>
  <property>
    <name>ozone.om.ratis.enable</name>
    <value>false</value>
  </property>
  <property>
    <name>ozone.om.http-address</name>
    <value>ip-172-31-19-127.us-west-2.compute.internal:9874</value>
  </property>
  <property>
    <name>ozone.s3g.domain.name</name>
    <value>s3g.internal</value>
  </property>
  <property>
    <name>ozone.replication</name>
    <value>1</value>
  </property>
</configuration>
Initialize and start the services (in this order)
ozone scm --init
ozone --daemon start scm
ozone om --init
ozone --daemon start om
ozone --daemon start datanode
ozone --daemon start s3g
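After starting the daemons, a simple liveness check (a sketch; the exact JVM process names vary across Ozone versions) is:

jps
# expect to see the SCM, OM, datanode, and S3 gateway processes in the list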
Basic Ozone operations

Create a volume named v-alluxio
[root@ip-172-31-19-127 ~]# ozone sh volume create /v-alluxio
[root@ip-172-31-19-127 ~]#
Create a bucket named b-alluxio under v-alluxio
[root@ip-172-31-19-127 ~]# ozone sh bucket create /v-alluxio/b-alluxio
[root@ip-172-31-19-127 ~]#
View the bucket's details
[root@ip-172-31-19-127 ~]# ozone sh bucket info /v-alluxio/b-alluxio
{
  "metadata" : { },
  "volumeName" : "v-alluxio",
  "name" : "b-alluxio",
  "storageType" : "DISK",
  "versioning" : false,
  "usedBytes" : 30,
  "usedNamespace" : 6,
  "creationTime" : "2022-07-13T09:11:37.403Z",
  "modificationTime" : "2022-07-13T09:11:37.403Z",
  "quotaInBytes" : -1,
  "quotaInNamespace" : -1,
  "bucketLayout" : "LEGACY"
}
[root@ip-172-31-19-127 ~]#
Create a key and put content into it
[root@ip-172-31-19-127 ~]# touch Dockerfile
[root@ip-172-31-19-127 ~]# ozone sh key put /v-alluxio/b-alluxio/Dockerfile Dockerfile
[root@ip-172-31-19-127 ~]#
List all keys under the bucket
[root@ip-172-31-19-127 ~]# ozone sh key list /v-alluxio/b-alluxio/
{
  "volumeName" : "v-alluxio",
  "bucketName" : "b-alluxio",
  "name" : "Dockerfile",
  "dataSize" : 0,
  "creationTime" : "2022-07-13T14:37:09.761Z",
  "modificationTime" : "2022-07-13T14:37:09.801Z",
  "replicationConfig" : {
    "replicationFactor" : "ONE",
    "requiredNodes" : 1,
    "replicationType" : "RATIS"
  },
  "replicationFactor" : 1,
  "replicationType" : "RATIS"
}
[root@ip-172-31-19-127 ~]#
View the key's details
[root@ip-172-31-19-127 ~]# ozone sh key info /v-alluxio/b-alluxio/Dockerfile
{
  "volumeName" : "v-alluxio",
  "bucketName" : "b-alluxio",
  "name" : "Dockerfile",
  "dataSize" : 0,
  "creationTime" : "2022-07-13T14:37:09.761Z",
  "modificationTime" : "2022-07-13T14:37:09.801Z",
  "replicationConfig" : {
    "replicationFactor" : "ONE",
    "requiredNodes" : 1,
    "replicationType" : "RATIS"
  },
  "ozoneKeyLocations" : [ ],
  "metadata" : { },
  "replicationFactor" : 1,
  "replicationType" : "RATIS"
}
[root@ip-172-31-19-127 ~]#
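To round out the key operations, a key's content can be downloaded back with ozone sh key get (not shown in the original transcript; the destination file name is arbitrary):

ozone sh key get /v-alluxio/b-alluxio/Dockerfile ./Dockerfile.copy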
Mounting Ozone in Alluxio

Option 1
[root@ip-172-31-17-3 ~]# alluxio fs mount /ozone o3fs://b-alluxio.v-alluxio.ip-172-31-19-127.us-west-2.compute.internal:9862/
Mounted o3fs://b-alluxio.v-alluxio.ip-172-31-19-127.us-west-2.compute.internal:9862/ at /ozone
[root@ip-172-31-17-3 ~]#
Option 2 (mount with options)
[root@ip-172-31-17-3 ~]# alluxio fs mount \
> --option alluxio.underfs.hdfs.configuration=/mnt1/ozone-1.2.1/etc/hadoop/ozone-site.xml \
> /ozone1 o3fs://b-alluxio.v-alluxio/
Mounted o3fs://b-alluxio.v-alluxio/ at /ozone1
[root@ip-172-31-17-3 ~]#
Verify the Ozone mounts succeeded
[root@ip-172-31-17-3 ~]# alluxio fs ls /
drwxrwxrwx root root 0 PERSISTED 01-01-1970 00:00:00:000 DIR /ozone1
drwxrwxrwx root root 0 PERSISTED 01-01-1970 00:00:00:000 DIR /ozone
[root@ip-172-31-17-3 ~]#