前言
- dbt 是 Data Build Tool 的简称,目前由 dbt labs 公司负责开发和保护,公司的前身是 Fishtown Analytics。
- dbt 次要解决
ETL
中的T
(transform) -
dbt 次要分为两局部
dbt Core: cli 命令行工具 dbt Cloud: 云服务
- dbt Core 的 github 地址是:https://github.com/dbt-labs/d…
- qbit 集体感觉 dbt Core 的官网入门教程不敌对,一开始就要求谷歌的 BigQuery 服务、GitHub 账号等。
- 本文次要参照:【大数据架构之旅】2 从零起步学 dbt
技术栈
Windows 10
Python 3.8.10
poetry 1.3.1
git 2.35.1.windows.2
PostgreSQL 15.1
poetry 我的项目配置
-
.../test_dbt/pyproject.toml
[tool.poetry] name = "test-dbt" version = "0.1.0" description = ""authors = ["qbit <q@bit.cn>"] readme = "README.md" packages = [{include = "test_dbt"}] [[tool.poetry.source]] name = "aliyun" url = "https://mirrors.aliyun.com/pypi/simple/" default = true [tool.poetry.dependencies] python = "^3.8" dbt-core = "~1.3.1" dbt-postgres = "~1.3.1" [build-system] requires = ["poetry-core"] build-backend = "poetry.core.masonry.api"
- 本文中的
poetry 我的项目文件夹
指.../test_dbt/
- 本文中的
dbt 我的项目文件夹
指.../test_dbt/dbt_demo/
创立数据库与初始化我的项目
- 在 PostgreSQL 外面新建数据库
dbt-demo
-
初始化 dbt 我的项目,在
.../test_dbt/
目录下执行以下命令poetry run dbt init dbt_demo
-
按提醒执行完后,
.../test_dbt/
的目录构造如下$ tree . ├── dbt_demo │ ├── analyses │ ├── dbt_project.yml │ ├── macros │ ├── models │ │ └── example │ │ ├── my_first_dbt_model.sql │ │ ├── my_second_dbt_model.sql │ │ └── schema.yml │ ├── README.md │ ├── seeds │ ├── snapshots │ └── tests ├── logs │ └── dbt.log ├── poetry.lock └── pyproject.toml
-
在个人用户目录下找到文件
~/.dbt/profiles.yml
,qbit
的绝对路径是C:\Users\qbit\.dbt\profiles.yml
,将以下内容复制粘贴到文件,中括号外面的内容按本人的理论状况填写dbt_demo: outputs: dev: type: postgres threads: 1 host: [host] port: [port] user: [dev_username] pass: [dev_password] dbname: dbt_demo schema: dev_schema prod: type: postgres threads: 1 host: [host] port: [port] user: [prod_username] pass: [prod_password] dbname: dbt_demo schema: prod_schema target: dev
- 下文命令若无非凡阐明,都在
.../test_dbt/dbt_demo/
目录下运行
自带模型
-
进入
.../test_dbt/dbt_demo/
目录后运行以下命令查看环境信息$ poetry run dbt debug 06:41:09 Running with dbt=1.3.1 dbt version: 1.3.1 python version: 3.8.10 python path: D:\Python3Project\test_dbt\.venv\Scripts\python.exe os info: Windows-10-10.0.19045-SP0 Using profiles.yml file at C:\Users\qbit\.dbt\profiles.yml Using dbt_project.yml file at D:\Python3Project\test_dbt\dbt_demo\dbt_project.yml Configuration: profiles.yml file [OK found and valid] dbt_project.yml file [OK found and valid] Required dependencies: - git [OK found] Connection: host: 192.168.1.52 port: 5432 user: postgres database: dbt_demo schema: dev_schema search_path: None keepalives_idle: 0 sslmode: None Connection test: [OK connection ok] All checks passed!
-
生成自带模型
$ poetry run dbt run 06:47:04 Running with dbt=1.3.1 06:47:04 Partial parse save file not found. Starting full parse. 06:47:04 Found 2 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics 06:47:04 06:47:05 Concurrency: 1 threads (target='dev') 06:47:05 06:47:05 1 of 2 START sql table model dev_schema.my_first_dbt_model ..................... [RUN] 06:47:05 1 of 2 OK created sql table model dev_schema.my_first_dbt_model ................ [SELECT 2 in 0.15s] 06:47:05 2 of 2 START sql view model dev_schema.my_second_dbt_model ..................... [RUN] 06:47:05 2 of 2 OK created sql view model dev_schema.my_second_dbt_model ................ [CREATE VIEW in 0.09s] 06:47:05 06:47:05 Finished running 1 table model, 1 view model in 0 hours 0 minutes and 0.53 seconds (0.53s). 06:47:05 06:47:05 Completed successfully 06:47:05 06:47:05 Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
- 查看 PostgreSQL 数据库,应该能够看到一张名为
my_first_dbt_model
的表,和一个名为my_second_dbt_model
的视图
内部数据
创立底表
- 从加州大学欧文分校下载共享单车的数据集,将
hour.csv
和day.csv
放到.../test_dbt/dbt_demo/seeds
文件夹下。 -
在
.../test_dbt/dbt_demo/seeds
目录下创立文件bike_share.yml
,内容如下:version: 2 seeds: - name: hour config: column_types: dteday: date - name: day config: column_types: dteday: date
- 这里指定
date
类型是为了防止dbt
主动推断类型出错 - 对于数据列的类型,可参考官网文档 column_types
-
执行以下命令导入根底数据
$ poetry run dbt seed 07:19:26 Running with dbt=1.3.1 07:19:26 Found 2 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 2 seed files, 0 sources, 0 exposures, 0 metrics 07:19:26 07:19:26 Concurrency: 1 threads (target='dev') 07:19:26 07:19:26 1 of 2 START seed file dev_schema.day .......................................... [RUN] 07:19:31 1 of 2 OK loaded seed file dev_schema.day ...................................... [INSERT 731 in 4.54s] 07:19:31 2 of 2 START seed file dev_schema.hour ......................................... [RUN] 07:21:24 2 of 2 OK loaded seed file dev_schema.hour ..................................... [INSERT 17379 in 112.83s] 07:21:24 07:21:24 Finished running 2 seeds in 0 hours 1 minutes and 57.65 seconds (117.65s). 07:21:24 07:21:24 Completed successfully 07:21:24 07:21:24 Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
- 查看 PostgreSQL 数据库,应该能够看到新建了 2 张表:
hour
和day
自定义模型
- 创立模型目录
.../test_dbt/dbt_demo/models/bike_share
,上面 sql 文件中ref('day')
是援用的下面导入的day
数据库表 -
在
.../models/bike_share
目录内创立weather_trend.sql
,内容如下/* 天气趋势表 */ {{config(materialized='table') }} with weather_trend as (select weathersit, min(temp*47+8) as min_temp, max(temp*47+8) as max_temp, avg(temp*47+8) as mean_temp, sum(casual) as total_casual, sum(registered) as total_registered, sum(cnt) as total_cnt from {{ref('day') }} group by weathersit )
-
在
.../models/bike_share
目录内创立season_trend.sql
,内容如下/* 节令趋势表 */ {{config(materialized='table') }} with season_trend as (select season, min(temp*47+8) as min_temp, max(temp*47+8) as max_temp, avg(temp*47+8) as mean_temp, sum(casual) as total_casual, sum(registered) as total_registered, sum(cnt) as total_cnt from {{ref('day') }} group by season )
-
在
.../models/bike_share
目录内创立month_trend.sql
,内容如下/* 月度趋势表 */ {{config(materialized='table') }} with month_trend as (select mnth, min(temp*47+8) as min_temp, max(temp*47+8) as max_temp, avg(temp*47+8) as mean_temp, sum(casual) as total_casual, sum(registered) as total_registered, sum(cnt) as total_cnt from {{ref('day') }} group by mnth )
-
在
.../models/bike_share
目录内创立weekday_trend.sql
,内容如下/* 星期几趋势表 */ {{config(materialized='table') }} with weekday_trend as (select weekday, min(temp*47+8) as min_temp, max(temp*47+8) as max_temp, avg(temp*47+8) as mean_temp, sum(casual) as total_casual, sum(registered) as total_registered, sum(cnt) as total_cnt from {{ref('day') }} group by weekday )
-
执行以下命令生成模型
$ poetry run dbt run 07:43:59 Running with dbt=1.3.1 07:43:59 Found 6 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 2 seed files, 0 sources, 0 exposures, 0 metrics 07:43:59 07:43:59 Concurrency: 1 threads (target='dev') 07:43:59 07:43:59 1 of 6 START sql table model dev_schema.month_trend ............................ [RUN] 07:44:00 1 of 6 OK created sql table model dev_schema.month_trend ....................... [SELECT 12 in 0.15s] 07:44:00 2 of 6 START sql table model dev_schema.my_first_dbt_model ..................... [RUN] 07:44:00 2 of 6 OK created sql table model dev_schema.my_first_dbt_model ................ [SELECT 2 in 0.09s] 07:44:00 3 of 6 START sql table model dev_schema.season_trend ........................... [RUN] 07:44:00 3 of 6 OK created sql table model dev_schema.season_trend ...................... [SELECT 4 in 0.07s] 07:44:00 4 of 6 START sql table model dev_schema.weather_trend .......................... [RUN] 07:44:00 4 of 6 OK created sql table model dev_schema.weather_trend ..................... [SELECT 3 in 0.08s] 07:44:00 5 of 6 START sql table model dev_schema.weekday_trend .......................... [RUN] 07:44:00 5 of 6 OK created sql table model dev_schema.weekday_trend ..................... [SELECT 7 in 0.07s] 07:44:00 6 of 6 START sql view model dev_schema.my_second_dbt_model ..................... [RUN] 07:44:00 6 of 6 OK created sql view model dev_schema.my_second_dbt_model ................ [CREATE VIEW in 0.07s] 07:44:00 07:44:00 Finished running 5 table models, 1 view model in 0 hours 0 minutes and 0.78 seconds (0.78s). 07:44:00 07:44:00 Completed successfully 07:44:00 07:44:00 Done. PASS=6 WARN=0 ERROR=0 SKIP=0 TOTAL=6
文档生成和查看
-
运行以下命令生成文档
$ poetry run dbt docs generate 07:49:09 Running with dbt=1.3.1 07:49:09 Found 6 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 2 seed files, 0 sources, 0 exposures, 0 metrics 07:49:09 07:49:09 Concurrency: 1 threads (target='dev') 07:49:09 07:49:09 Done. 07:49:10 Building catalog 07:49:10 Catalog written to D:\Python3Project\test_dbt\dbt_demo\target\catalog.json
-
运行以下命令,会启动 http 服务,并关上默认浏览器查看文档
$ poetry run dbt docs serve 07:49:53 Running with dbt=1.3.1 07:49:53 Serving docs at 0.0.0.0:8080 07:49:53 To access from your browser, navigate to: http://localhost:8080 07:49:53 07:49:53 07:49:53 Press Ctrl+C to exit. 127.0.0.1 - - [28/Dec/2022 15:49:54] "GET / HTTP/1.1" 200 - 127.0.0.1 - - [28/Dec/2022 15:49:54] "GET /manifest.json?cb=1672213794801 HTTP/1.1" 200 - 127.0.0.1 - - [28/Dec/2022 15:49:54] "GET /catalog.json?cb=1672213794801 HTTP/1.1" 200 -
- 网页界面
- 点击右下角按钮查看血缘关系图(Lineage Graph)
本文出自 qbit snap