前言

  • dbt 是 Data Build Tool 的简称,目前由 dbt labs 公司负责开发和保护,公司的前身是Fishtown Analytics。
  • dbt 次要解决 ETL 中的 T(transform)
  • dbt 次要分为两局部

    dbt Core: cli 命令行工具dbt Cloud: 云服务
  • dbt Core 的 github 地址是:https://github.com/dbt-labs/d...
  • qbit 集体感觉 dbt Core 的官网入门教程不敌对,一开始就要求谷歌的 BigQuery 服务、GitHub 账号等。
  • 本文次要参照:【大数据架构之旅】2 从零起步学 dbt

技术栈

Windows    10Python     3.8.10poetry     1.3.1git        2.35.1.windows.2PostgreSQL 15.1

poetry 我的项目配置

  • .../test_dbt/pyproject.toml

    [tool.poetry]name = "test-dbt"version = "0.1.0"description = ""authors = ["qbit <q@bit.cn>"]readme = "README.md"packages = [{include = "test_dbt"}][[tool.poetry.source]]name = "aliyun"url = "https://mirrors.aliyun.com/pypi/simple/"default = true[tool.poetry.dependencies]python = "^3.8"dbt-core = "~1.3.1"dbt-postgres = "~1.3.1"[build-system]requires = ["poetry-core"]build-backend = "poetry.core.masonry.api"
  • 本文中的 poetry 我的项目文件夹.../test_dbt/
  • 本文中的 dbt 我的项目文件夹.../test_dbt/dbt_demo/

创立数据库与初始化我的项目

  • 在 PostgreSQL 外面新建数据库 dbt-demo
  • 初始化 dbt 我的项目,在 .../test_dbt/ 目录下执行以下命令

    poetry run dbt init dbt_demo
  • 按提醒执行完后,.../test_dbt/ 的目录构造如下

    $ tree.├── dbt_demo│   ├── analyses│   ├── dbt_project.yml│   ├── macros│   ├── models│   │   └── example│   │       ├── my_first_dbt_model.sql│   │       ├── my_second_dbt_model.sql│   │       └── schema.yml│   ├── README.md│   ├── seeds│   ├── snapshots│   └── tests├── logs│   └── dbt.log├── poetry.lock└── pyproject.toml
  • 在个人用户目录下找到文件 ~/.dbt/profiles.ymlqbit 的绝对路径是 C:\Users\qbit\.dbt\profiles.yml,将以下内容复制粘贴到文件,中括号外面的内容按本人的理论状况填写

    dbt_demo:outputs:  dev:    type: postgres    threads: 1    host: [host]    port: [port]    user: [dev_username]    pass: [dev_password]    dbname: dbt_demo    schema: dev_schema  prod:    type: postgres    threads: 1    host: [host]    port: [port]    user: [prod_username]    pass: [prod_password]    dbname: dbt_demo    schema: prod_schematarget: dev
  • 下文命令若无非凡阐明,都在 .../test_dbt/dbt_demo/ 目录下运行

自带模型

  • 进入 .../test_dbt/dbt_demo/ 目录后运行以下命令查看环境信息

    $ poetry run dbt debug06:41:09  Running with dbt=1.3.1dbt version: 1.3.1python version: 3.8.10python path: D:\Python3Project\test_dbt\.venv\Scripts\python.exeos info: Windows-10-10.0.19045-SP0Using profiles.yml file at C:\Users\qbit\.dbt\profiles.ymlUsing dbt_project.yml file at D:\Python3Project\test_dbt\dbt_demo\dbt_project.ymlConfiguration:profiles.yml file [OK found and valid]dbt_project.yml file [OK found and valid]Required dependencies: - git [OK found]Connection:host: 192.168.1.52port: 5432user: postgresdatabase: dbt_demoschema: dev_schemasearch_path: Nonekeepalives_idle: 0sslmode: NoneConnection test: [OK connection ok]All checks passed!
  • 生成自带模型

    $ poetry run dbt run06:47:04  Running with dbt=1.3.106:47:04  Partial parse save file not found. Starting full parse.06:47:04  Found 2 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics06:47:0406:47:05  Concurrency: 1 threads (target='dev')06:47:0506:47:05  1 of 2 START sql table model dev_schema.my_first_dbt_model ..................... [RUN]06:47:05  1 of 2 OK created sql table model dev_schema.my_first_dbt_model ................ [SELECT 2 in 0.15s]06:47:05  2 of 2 START sql view model dev_schema.my_second_dbt_model ..................... [RUN]06:47:05  2 of 2 OK created sql view model dev_schema.my_second_dbt_model ................ [CREATE VIEW in 0.09s]06:47:0506:47:05  Finished running 1 table model, 1 view model in 0 hours 0 minutes and 0.53 seconds (0.53s).06:47:0506:47:05  Completed successfully06:47:0506:47:05  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
  • 查看 PostgreSQL 数据库,应该能够看到一张名为 my_first_dbt_model 的表,和一个名为 my_second_dbt_model 的视图

内部数据

创立底表

  • 从加州大学欧文分校下载共享单车的数据集,将 hour.csvday.csv 放到 .../test_dbt/dbt_demo/seeds 文件夹下。
  • .../test_dbt/dbt_demo/seeds 目录下创立文件 bike_share.yml,内容如下:

    version: 2seeds:- name: hour  config:    column_types:       dteday: date- name: day  config:    column_types:      dteday: date
  • 这里指定 date 类型是为了防止 dbt 主动推断类型出错
  • 对于数据列的类型,可参考官网文档 column_types
  • 执行以下命令导入根底数据

    $ poetry run dbt seed07:19:26  Running with dbt=1.3.107:19:26  Found 2 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 2 seed files, 0 sources, 0 exposures, 0 metrics07:19:2607:19:26  Concurrency: 1 threads (target='dev')07:19:2607:19:26  1 of 2 START seed file dev_schema.day .......................................... [RUN]07:19:31  1 of 2 OK loaded seed file dev_schema.day ...................................... [INSERT 731 in 4.54s]07:19:31  2 of 2 START seed file dev_schema.hour ......................................... [RUN]07:21:24  2 of 2 OK loaded seed file dev_schema.hour ..................................... [INSERT 17379 in 112.83s]07:21:2407:21:24  Finished running 2 seeds in 0 hours 1 minutes and 57.65 seconds (117.65s).07:21:2407:21:24  Completed successfully07:21:2407:21:24  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
  • 查看 PostgreSQL 数据库,应该能够看到新建了 2 张表:hourday

自定义模型

  • 创立模型目录 .../test_dbt/dbt_demo/models/bike_share,上面 sql 文件中 ref('day') 是援用的下面导入的 day数据库表
  • .../models/bike_share 目录内创立 weather_trend.sql,内容如下

    /* 天气趋势表 */{{ config(materialized='table') }}with weather_trend as (  select weathersit, min(temp*47+8) as min_temp, max(temp*47+8) as max_temp, avg(temp*47+8) as mean_temp, sum(casual) as total_casual, sum(registered) as total_registered, sum(cnt) as total_cnt  from {{ ref('day') }}  group by weathersit)
  • .../models/bike_share 目录内创立 season_trend.sql,内容如下

    /* 节令趋势表 */{{ config(materialized='table') }}with season_trend as (  select season, min(temp*47+8) as min_temp, max(temp*47+8) as max_temp, avg(temp*47+8) as mean_temp, sum(casual) as total_casual, sum(registered) as total_registered, sum(cnt) as total_cnt  from {{ ref('day') }}  group by season)
  • .../models/bike_share 目录内创立 month_trend.sql,内容如下

    /* 月度趋势表 */{{ config(materialized='table') }}with month_trend as (  select mnth, min(temp*47+8) as min_temp, max(temp*47+8) as max_temp, avg(temp*47+8) as mean_temp, sum(casual) as total_casual, sum(registered) as total_registered, sum(cnt) as total_cnt  from {{ ref('day') }}  group by mnth)
  • .../models/bike_share 目录内创立 weekday_trend.sql,内容如下

    /* 星期几趋势表 */{{ config(materialized='table') }}with weekday_trend as (  select weekday, min(temp*47+8) as min_temp, max(temp*47+8) as max_temp, avg(temp*47+8) as mean_temp, sum(casual) as total_casual, sum(registered) as total_registered, sum(cnt) as total_cnt  from {{ ref('day') }}  group by weekday)
  • 执行以下命令生成模型

    $ poetry run dbt run07:43:59  Running with dbt=1.3.107:43:59  Found 6 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 2 seed files, 0 sources, 0 exposures, 0 metrics07:43:5907:43:59  Concurrency: 1 threads (target='dev')07:43:5907:43:59  1 of 6 START sql table model dev_schema.month_trend ............................ [RUN]07:44:00  1 of 6 OK created sql table model dev_schema.month_trend ....................... [SELECT 12 in 0.15s]07:44:00  2 of 6 START sql table model dev_schema.my_first_dbt_model ..................... [RUN]07:44:00  2 of 6 OK created sql table model dev_schema.my_first_dbt_model ................ [SELECT 2 in 0.09s]07:44:00  3 of 6 START sql table model dev_schema.season_trend ........................... [RUN]07:44:00  3 of 6 OK created sql table model dev_schema.season_trend ...................... [SELECT 4 in 0.07s]07:44:00  4 of 6 START sql table model dev_schema.weather_trend .......................... [RUN]07:44:00  4 of 6 OK created sql table model dev_schema.weather_trend ..................... [SELECT 3 in 0.08s]07:44:00  5 of 6 START sql table model dev_schema.weekday_trend .......................... [RUN]07:44:00  5 of 6 OK created sql table model dev_schema.weekday_trend ..................... [SELECT 7 in 0.07s]07:44:00  6 of 6 START sql view model dev_schema.my_second_dbt_model ..................... [RUN]07:44:00  6 of 6 OK created sql view model dev_schema.my_second_dbt_model ................ [CREATE VIEW in 0.07s]07:44:0007:44:00  Finished running 5 table models, 1 view model in 0 hours 0 minutes and 0.78 seconds (0.78s).07:44:0007:44:00  Completed successfully07:44:0007:44:00  Done. PASS=6 WARN=0 ERROR=0 SKIP=0 TOTAL=6

文档生成和查看

  • 运行以下命令生成文档

    $ poetry run dbt docs generate07:49:09  Running with dbt=1.3.107:49:09  Found 6 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 2 seed files, 0 sources, 0 exposures, 0 metrics07:49:0907:49:09  Concurrency: 1 threads (target='dev')07:49:0907:49:09  Done.07:49:10  Building catalog07:49:10  Catalog written to D:\Python3Project\test_dbt\dbt_demo\target\catalog.json
  • 运行以下命令,会启动 http 服务,并关上默认浏览器查看文档

    $ poetry run dbt docs serve07:49:53  Running with dbt=1.3.107:49:53  Serving docs at 0.0.0.0:808007:49:53  To access from your browser, navigate to:  http://localhost:808007:49:5307:49:5307:49:53  Press Ctrl+C to exit.127.0.0.1 - - [28/Dec/2022 15:49:54] "GET / HTTP/1.1" 200 -127.0.0.1 - - [28/Dec/2022 15:49:54] "GET /manifest.json?cb=1672213794801 HTTP/1.1" 200 -127.0.0.1 - - [28/Dec/2022 15:49:54] "GET /catalog.json?cb=1672213794801 HTTP/1.1" 200 -
  • 网页界面
  • 点击右下角按钮查看血缘关系图(Lineage Graph)
本文出自 qbit snap