前言
- dbt 是 Data Build Tool 的简称,目前由 dbt labs 公司负责开发和保护,公司的前身是Fishtown Analytics。
- dbt 次要解决
ETL
中的T
(transform) dbt 次要分为两局部
dbt Core: cli 命令行工具dbt Cloud: 云服务
- dbt Core 的 github 地址是:https://github.com/dbt-labs/d...
- qbit 集体感觉 dbt Core 的官网入门教程不敌对,一开始就要求谷歌的 BigQuery 服务、GitHub 账号等。
- 本文次要参照:【大数据架构之旅】2 从零起步学 dbt
技术栈
Windows 10Python 3.8.10poetry 1.3.1git 2.35.1.windows.2PostgreSQL 15.1
poetry 我的项目配置
.../test_dbt/pyproject.toml
[tool.poetry]name = "test-dbt"version = "0.1.0"description = ""authors = ["qbit <q@bit.cn>"]readme = "README.md"packages = [{include = "test_dbt"}][[tool.poetry.source]]name = "aliyun"url = "https://mirrors.aliyun.com/pypi/simple/"default = true[tool.poetry.dependencies]python = "^3.8"dbt-core = "~1.3.1"dbt-postgres = "~1.3.1"[build-system]requires = ["poetry-core"]build-backend = "poetry.core.masonry.api"
- 本文中的
poetry 我的项目文件夹
指.../test_dbt/
- 本文中的
dbt 我的项目文件夹
指.../test_dbt/dbt_demo/
创立数据库与初始化我的项目
- 在 PostgreSQL 外面新建数据库
dbt-demo
初始化 dbt 我的项目,在
.../test_dbt/
目录下执行以下命令poetry run dbt init dbt_demo
按提醒执行完后,
.../test_dbt/
的目录构造如下$ tree.├── dbt_demo│ ├── analyses│ ├── dbt_project.yml│ ├── macros│ ├── models│ │ └── example│ │ ├── my_first_dbt_model.sql│ │ ├── my_second_dbt_model.sql│ │ └── schema.yml│ ├── README.md│ ├── seeds│ ├── snapshots│ └── tests├── logs│ └── dbt.log├── poetry.lock└── pyproject.toml
在个人用户目录下找到文件
~/.dbt/profiles.yml
,qbit
的绝对路径是C:\Users\qbit\.dbt\profiles.yml
,将以下内容复制粘贴到文件,中括号外面的内容按本人的理论状况填写dbt_demo:outputs: dev: type: postgres threads: 1 host: [host] port: [port] user: [dev_username] pass: [dev_password] dbname: dbt_demo schema: dev_schema prod: type: postgres threads: 1 host: [host] port: [port] user: [prod_username] pass: [prod_password] dbname: dbt_demo schema: prod_schematarget: dev
- 下文命令若无非凡阐明,都在
.../test_dbt/dbt_demo/
目录下运行
自带模型
进入
.../test_dbt/dbt_demo/
目录后运行以下命令查看环境信息$ poetry run dbt debug06:41:09 Running with dbt=1.3.1dbt version: 1.3.1python version: 3.8.10python path: D:\Python3Project\test_dbt\.venv\Scripts\python.exeos info: Windows-10-10.0.19045-SP0Using profiles.yml file at C:\Users\qbit\.dbt\profiles.ymlUsing dbt_project.yml file at D:\Python3Project\test_dbt\dbt_demo\dbt_project.ymlConfiguration:profiles.yml file [OK found and valid]dbt_project.yml file [OK found and valid]Required dependencies: - git [OK found]Connection:host: 192.168.1.52port: 5432user: postgresdatabase: dbt_demoschema: dev_schemasearch_path: Nonekeepalives_idle: 0sslmode: NoneConnection test: [OK connection ok]All checks passed!
生成自带模型
$ poetry run dbt run06:47:04 Running with dbt=1.3.106:47:04 Partial parse save file not found. Starting full parse.06:47:04 Found 2 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics06:47:0406:47:05 Concurrency: 1 threads (target='dev')06:47:0506:47:05 1 of 2 START sql table model dev_schema.my_first_dbt_model ..................... [RUN]06:47:05 1 of 2 OK created sql table model dev_schema.my_first_dbt_model ................ [SELECT 2 in 0.15s]06:47:05 2 of 2 START sql view model dev_schema.my_second_dbt_model ..................... [RUN]06:47:05 2 of 2 OK created sql view model dev_schema.my_second_dbt_model ................ [CREATE VIEW in 0.09s]06:47:0506:47:05 Finished running 1 table model, 1 view model in 0 hours 0 minutes and 0.53 seconds (0.53s).06:47:0506:47:05 Completed successfully06:47:0506:47:05 Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
- 查看 PostgreSQL 数据库,应该能够看到一张名为
my_first_dbt_model
的表,和一个名为my_second_dbt_model
的视图
内部数据
创立底表
- 从加州大学欧文分校下载共享单车的数据集,将
hour.csv
和day.csv
放到.../test_dbt/dbt_demo/seeds
文件夹下。 在
.../test_dbt/dbt_demo/seeds
目录下创立文件bike_share.yml
,内容如下:version: 2seeds:- name: hour config: column_types: dteday: date- name: day config: column_types: dteday: date
- 这里指定
date
类型是为了防止dbt
主动推断类型出错 - 对于数据列的类型,可参考官网文档 column_types
执行以下命令导入根底数据
$ poetry run dbt seed07:19:26 Running with dbt=1.3.107:19:26 Found 2 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 2 seed files, 0 sources, 0 exposures, 0 metrics07:19:2607:19:26 Concurrency: 1 threads (target='dev')07:19:2607:19:26 1 of 2 START seed file dev_schema.day .......................................... [RUN]07:19:31 1 of 2 OK loaded seed file dev_schema.day ...................................... [INSERT 731 in 4.54s]07:19:31 2 of 2 START seed file dev_schema.hour ......................................... [RUN]07:21:24 2 of 2 OK loaded seed file dev_schema.hour ..................................... [INSERT 17379 in 112.83s]07:21:2407:21:24 Finished running 2 seeds in 0 hours 1 minutes and 57.65 seconds (117.65s).07:21:2407:21:24 Completed successfully07:21:2407:21:24 Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
- 查看 PostgreSQL 数据库,应该能够看到新建了 2 张表:
hour
和day
自定义模型
- 创立模型目录
.../test_dbt/dbt_demo/models/bike_share
,上面 sql 文件中ref('day')
是援用的下面导入的day
数据库表 在
.../models/bike_share
目录内创立weather_trend.sql
,内容如下/* 天气趋势表 */{{ config(materialized='table') }}with weather_trend as ( select weathersit, min(temp*47+8) as min_temp, max(temp*47+8) as max_temp, avg(temp*47+8) as mean_temp, sum(casual) as total_casual, sum(registered) as total_registered, sum(cnt) as total_cnt from {{ ref('day') }} group by weathersit)
在
.../models/bike_share
目录内创立season_trend.sql
,内容如下/* 节令趋势表 */{{ config(materialized='table') }}with season_trend as ( select season, min(temp*47+8) as min_temp, max(temp*47+8) as max_temp, avg(temp*47+8) as mean_temp, sum(casual) as total_casual, sum(registered) as total_registered, sum(cnt) as total_cnt from {{ ref('day') }} group by season)
在
.../models/bike_share
目录内创立month_trend.sql
,内容如下/* 月度趋势表 */{{ config(materialized='table') }}with month_trend as ( select mnth, min(temp*47+8) as min_temp, max(temp*47+8) as max_temp, avg(temp*47+8) as mean_temp, sum(casual) as total_casual, sum(registered) as total_registered, sum(cnt) as total_cnt from {{ ref('day') }} group by mnth)
在
.../models/bike_share
目录内创立weekday_trend.sql
,内容如下/* 星期几趋势表 */{{ config(materialized='table') }}with weekday_trend as ( select weekday, min(temp*47+8) as min_temp, max(temp*47+8) as max_temp, avg(temp*47+8) as mean_temp, sum(casual) as total_casual, sum(registered) as total_registered, sum(cnt) as total_cnt from {{ ref('day') }} group by weekday)
执行以下命令生成模型
$ poetry run dbt run07:43:59 Running with dbt=1.3.107:43:59 Found 6 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 2 seed files, 0 sources, 0 exposures, 0 metrics07:43:5907:43:59 Concurrency: 1 threads (target='dev')07:43:5907:43:59 1 of 6 START sql table model dev_schema.month_trend ............................ [RUN]07:44:00 1 of 6 OK created sql table model dev_schema.month_trend ....................... [SELECT 12 in 0.15s]07:44:00 2 of 6 START sql table model dev_schema.my_first_dbt_model ..................... [RUN]07:44:00 2 of 6 OK created sql table model dev_schema.my_first_dbt_model ................ [SELECT 2 in 0.09s]07:44:00 3 of 6 START sql table model dev_schema.season_trend ........................... [RUN]07:44:00 3 of 6 OK created sql table model dev_schema.season_trend ...................... [SELECT 4 in 0.07s]07:44:00 4 of 6 START sql table model dev_schema.weather_trend .......................... [RUN]07:44:00 4 of 6 OK created sql table model dev_schema.weather_trend ..................... [SELECT 3 in 0.08s]07:44:00 5 of 6 START sql table model dev_schema.weekday_trend .......................... [RUN]07:44:00 5 of 6 OK created sql table model dev_schema.weekday_trend ..................... [SELECT 7 in 0.07s]07:44:00 6 of 6 START sql view model dev_schema.my_second_dbt_model ..................... [RUN]07:44:00 6 of 6 OK created sql view model dev_schema.my_second_dbt_model ................ [CREATE VIEW in 0.07s]07:44:0007:44:00 Finished running 5 table models, 1 view model in 0 hours 0 minutes and 0.78 seconds (0.78s).07:44:0007:44:00 Completed successfully07:44:0007:44:00 Done. PASS=6 WARN=0 ERROR=0 SKIP=0 TOTAL=6
文档生成和查看
运行以下命令生成文档
$ poetry run dbt docs generate07:49:09 Running with dbt=1.3.107:49:09 Found 6 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 2 seed files, 0 sources, 0 exposures, 0 metrics07:49:0907:49:09 Concurrency: 1 threads (target='dev')07:49:0907:49:09 Done.07:49:10 Building catalog07:49:10 Catalog written to D:\Python3Project\test_dbt\dbt_demo\target\catalog.json
运行以下命令,会启动 http 服务,并关上默认浏览器查看文档
$ poetry run dbt docs serve07:49:53 Running with dbt=1.3.107:49:53 Serving docs at 0.0.0.0:808007:49:53 To access from your browser, navigate to: http://localhost:808007:49:5307:49:5307:49:53 Press Ctrl+C to exit.127.0.0.1 - - [28/Dec/2022 15:49:54] "GET / HTTP/1.1" 200 -127.0.0.1 - - [28/Dec/2022 15:49:54] "GET /manifest.json?cb=1672213794801 HTTP/1.1" 200 -127.0.0.1 - - [28/Dec/2022 15:49:54] "GET /catalog.json?cb=1672213794801 HTTP/1.1" 200 -
- 网页界面
- 点击右下角按钮查看血缘关系图(Lineage Graph)
本文出自 qbit snap