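This article was first published on 2020-07-26 21:55:10.
The "ClickHouse and His Enemies" series is reposted from the blog of BohuTANG, a friend in the community. Original link: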
https://bohutang.me/2020/07/2…
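The main text follows.
In real life, once an item is labeled "purely handmade", the first impression is that it is a premium product, in a word: expensive. Beijing's traditional cloth shoes are one example.
In the computer world, however, would you not be surprised to hear that ClickHouse's SQL parser is purely handwritten?
This has drawn the attention of quite a few readers, so this post looks at ClickHouse's handwritten parser: how it works under the hood, and its strengths and weaknesses.
Let's start with a SQL statement: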
EXPLAIN SELECT a,b FROM t1
token
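First, the characters of the SQL text are examined one by one and then split into tokens according to how they relate to each other:
For example, a run of consecutive WordChar characters becomes a BareWord. The function that does this is Lexer::nextTokenImpl(), reached through the following call stack: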
DB::Lexer::nextTokenImpl() Lexer.cpp:63
DB::Lexer::nextToken() Lexer.cpp:52
DB::Tokens::operator[](unsigned long) TokenIterator.h:36
DB::TokenIterator::get() TokenIterator.h:62
DB::TokenIterator::operator->() TokenIterator.h:64
DB::tryParseQuery(DB::IParser&, char const*&, char const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&, bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool, unsigned long, unsigned long) parseQuery.cpp:224
DB::parseQueryAndMovePosition(DB::IParser&, char const*&, char const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool, unsigned long, unsigned long) parseQuery.cpp:314
DB::parseQuery(DB::IParser&, char const*, char const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, unsigned long, unsigned long) parseQuery.cpp:332
DB::executeQueryImpl(const char *, const char *, DB::Context &, bool, DB::QueryProcessingStage::Enum, bool, DB::ReadBuffer *) executeQuery.cpp:272
DB::executeQuery(DB::ReadBuffer&, DB::WriteBuffer&, bool, DB::Context&, std::__1::function<void (std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)>) executeQuery.cpp:731
DB::MySQLHandler::comQuery(DB::ReadBuffer&) MySQLHandler.cpp:313
DB::MySQLHandler::run() MySQLHandler.cpp:150
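The tokenization step described above can be illustrated with a minimal sketch. This is not ClickHouse's Lexer: the `Token` struct, the reduced `TokenType` enum, `isWordChar` and `tokenize` are hypothetical names made up for illustration, and a word character is assumed to be a letter, digit or underscore.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical, reduced set of token kinds (the real lexer has many more).
enum class TokenType { BareWord, Comma, Error, EndOfStream };

struct Token
{
    TokenType type;
    std::string text;
};

// Assumption for this sketch: a word character is a letter, digit or underscore.
inline bool isWordChar(char c)
{
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// Walk the input character by character: skip whitespace, merge a run of
// consecutive word characters into one BareWord, emit commas as single tokens.
inline std::vector<Token> tokenize(const std::string & sql)
{
    std::vector<Token> tokens;
    std::size_t pos = 0;
    while (pos < sql.size())
    {
        char c = sql[pos];
        if (std::isspace(static_cast<unsigned char>(c)))
        {
            ++pos;
        }
        else if (isWordChar(c))
        {
            std::size_t begin = pos;
            while (pos < sql.size() && isWordChar(sql[pos]))
                ++pos;
            tokens.push_back({TokenType::BareWord, sql.substr(begin, pos - begin)});
        }
        else if (c == ',')
        {
            tokens.push_back({TokenType::Comma, ","});
            ++pos;
        }
        else
        {
            // Anything this sketch does not understand becomes an Error token.
            tokens.push_back({TokenType::Error, std::string(1, c)});
            ++pos;
        }
    }
    tokens.push_back({TokenType::EndOfStream, ""});
    return tokens;
}

// Usage: the query from this article yields 7 tokens plus EndOfStream.
inline void tokenizeExample()
{
    std::vector<Token> tokens = tokenize("EXPLAIN SELECT a,b FROM t1");
    assert(tokens.size() == 8);
    assert(tokens[0].type == TokenType::BareWord && tokens[0].text == "EXPLAIN");
    assert(tokens[3].type == TokenType::Comma);
}
```

At this stage EXPLAIN, SELECT and FROM are plain BareWord tokens, no different from a, b or t1. Deciding which of them are keywords is left to the parser.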
ast
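Tokens are the most basic units. They have no relationship to one another and are just a pile of cold words and symbols, so they still need syntactic analysis to establish relationships between them and give them a form that can be described.
When ClickHouse parses each token, it uses the current token to probe the state space (if parse returns true, it enters that sub-state space and continues) and then decides which state to move to. For example: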
EXPLAIN -- TokenType::BareWord
Control first enters the ParserQuery::parseImpl method in Parsers/ParserQuery.cpp:
bool res = query_with_output_p.parse(pos, node, expected)
|| insert_p.parse(pos, node, expected)
|| use_p.parse(pos, node, expected)
|| set_role_p.parse(pos, node, expected)
|| set_p.parse(pos, node, expected)
|| system_p.parse(pos, node, expected)
|| create_user_p.parse(pos, node, expected)
|| create_role_p.parse(pos, node, expected)
|| create_quota_p.parse(pos, node, expected)
|| create_row_policy_p.parse(pos, node, expected)
|| create_settings_profile_p.parse(pos, node, expected)
|| drop_access_entity_p.parse(pos, node, expected)
|| grant_p.parse(pos, node, expected);
Here the parse method of each query type is called in turn until one branch returns true.
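The chain of `||` calls is an ordered choice: alternatives are tried left to right and the first success wins. A minimal sketch of that idea follows, assuming that a failed alternative rewinds the token position so the next one starts from the same token. The `Parser` struct, `keywords` and `parseAnyQuery` are hypothetical names, not ClickHouse classes.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using Tokens = std::vector<std::string>;

// Hypothetical stand-in for a parser object.
struct Parser
{
    std::string name;
    std::function<bool(const Tokens &, std::size_t &)> impl;

    // Assumption of this sketch: a failed parse restores the position,
    // so the next alternative sees the same starting token.
    bool parse(const Tokens & tokens, std::size_t & pos, std::string & matched) const
    {
        std::size_t begin = pos;
        if (impl(tokens, pos))
        {
            matched = name;
            return true;
        }
        pos = begin;
        return false;
    }
};

// Build a parser that accepts a fixed sequence of words.
inline Parser keywords(std::string name, std::vector<std::string> words)
{
    return Parser{std::move(name), [words = std::move(words)](const Tokens & tokens, std::size_t & pos)
    {
        for (const auto & word : words)
        {
            if (pos >= tokens.size() || tokens[pos] != word)
                return false; // pos may already have advanced; Parser::parse rewinds it
            ++pos;
        }
        return true;
    }};
}

inline bool parseAnyQuery(const Tokens & tokens, std::size_t & pos, std::string & matched)
{
    static const Parser show_tables = keywords("show_tables", {"SHOW", "TABLES"});
    static const Parser show_processlist = keywords("show_processlist", {"SHOW", "PROCESSLIST"});
    static const Parser use_db = keywords("use", {"USE"});

    // Short-circuit evaluation stops at the first alternative that returns true.
    return show_tables.parse(tokens, pos, matched)
        || show_processlist.parse(tokens, pos, matched)
        || use_db.parse(tokens, pos, matched);
}

// Usage: the first alternative consumes SHOW, fails on PROCESSLIST and is rewound.
inline void orderedChoiceExample()
{
    Tokens tokens = {"SHOW", "PROCESSLIST"};
    std::size_t pos = 0;
    std::string matched;
    bool ok = parseAnyQuery(tokens, pos, matched);
    assert(ok);
    assert(matched == "show_processlist" && pos == 2);
}
```

The order of the alternatives matters in this scheme: when two branches share a prefix, the earlier one gets the first chance to claim the input.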
Let's look at the first level, query_with_output_p.parse, in Parsers/ParserQueryWithOutput.cpp:
bool parsed =
explain_p.parse(pos, query, expected)
|| select_p.parse(pos, query, expected)
|| show_create_access_entity_p.parse(pos, query, expected)
|| show_tables_p.parse(pos, query, expected)
|| table_p.parse(pos, query, expected)
|| describe_table_p.parse(pos, query, expected)
|| show_processlist_p.parse(pos, query, expected)
|| create_p.parse(pos, query, expected)
|| alter_p.parse(pos, query, expected)
|| rename_p.parse(pos, query, expected)
|| drop_p.parse(pos, query, expected)
|| check_p.parse(pos, query, expected)
|| kill_query_p.parse(pos, query, expected)
|| optimize_p.parse(pos, query, expected)
|| watch_p.parse(pos, query, expected)
|| show_access_p.parse(pos, query, expected)
|| show_access_entities_p.parse(pos, query, expected)
|| show_grants_p.parse(pos, query, expected)
|| show_privileges_p.parse(pos, query, expected);
Next comes the second level, explain_p.parse, which is the ParserExplainQuery::parseImpl state space:
bool ParserExplainQuery::parseImpl(Pos & pos, ASTPtr & node, Expected & expected)
{
ASTExplainQuery::ExplainKind kind;
bool old_syntax = false;
ParserKeyword s_ast("AST");
ParserKeyword s_analyze("ANALYZE");
ParserKeyword s_explain("EXPLAIN");
ParserKeyword s_syntax("SYNTAX");
ParserKeyword s_pipeline("PIPELINE");
ParserKeyword s_plan("PLAN");
... ...
else if (s_explain.ignore(pos, expected))
{... ...}
... ...
ParserSelectWithUnionQuery select_p;
ASTPtr query;
if (!select_p.parse(pos, query, expected))
return false;
... ...
The s_explain.ignore method parses a keyword, producing the ast node:
EXPLAIN -- keyword
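Keywords such as "UNION ALL" or "GROUP BY" consist of several words, so matching one keyword can consume more than one token. The sketch below shows one way this could work. `KeywordParser` and `equalsIgnoreCase` are hypothetical names, and case-insensitive matching is an assumption of this sketch rather than something shown in the excerpts above.

```cpp
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;

// Compare two words ignoring ASCII case (an assumption of this sketch).
inline bool equalsIgnoreCase(const std::string & a, const std::string & b)
{
    if (a.size() != b.size())
        return false;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (std::toupper(static_cast<unsigned char>(a[i])) != std::toupper(static_cast<unsigned char>(b[i])))
            return false;
    return true;
}

// Hypothetical keyword matcher: "GROUP BY" is stored as the words GROUP and BY.
class KeywordParser
{
public:
    explicit KeywordParser(const std::string & keyword)
    {
        std::istringstream in(keyword);
        for (std::string word; in >> word;)
            words.push_back(word);
    }

    // Consume the keyword if every word matches the next tokens in order.
    // On failure the position is left untouched.
    bool ignore(const Tokens & tokens, std::size_t & pos) const
    {
        if (words.empty())
            return false;
        std::size_t cur = pos;
        for (const auto & word : words)
        {
            if (cur >= tokens.size() || !equalsIgnoreCase(tokens[cur], word))
                return false;
            ++cur;
        }
        pos = cur;
        return true;
    }

private:
    std::vector<std::string> words;
};

// Usage: a two-word keyword consumes two tokens.
inline void keywordExample()
{
    KeywordParser s_group_by("GROUP BY");
    Tokens tokens = {"group", "by", "a"};
    std::size_t pos = 0;
    bool ok = s_group_by.ignore(tokens, pos);
    assert(ok);
    assert(pos == 2);
}
```

Working on a local copy of the position and committing it only on success means a partial match such as GROUP without BY leaves the caller's position where it was.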
Moving on to the third level, select_p.parse, which is the ParserSelectWithUnionQuery::parseImpl state space:
bool ParserSelectWithUnionQuery::parseImpl(Pos & pos, ASTPtr & node, Expected & expected)
{
ASTPtr list_node;
ParserList parser(std::make_unique<ParserUnionQueryElement>(), std::make_unique<ParserKeyword>("UNION ALL"), false);
if (!parser.parse(pos, list_node, expected))
return false;
...
parser.parse in turn calls into the fourth level, the ParserSelectQuery::parseImpl state space:
bool ParserSelectQuery::parseImpl(Pos & pos, ASTPtr & node, Expected & expected)
{
auto select_query = std::make_shared<ASTSelectQuery>();
node = select_query;
ParserKeyword s_select("SELECT");
ParserKeyword s_distinct("DISTINCT");
ParserKeyword s_from("FROM");
ParserKeyword s_prewhere("PREWHERE");
ParserKeyword s_where("WHERE");
ParserKeyword s_group_by("GROUP BY");
ParserKeyword s_with("WITH");
ParserKeyword s_totals("TOTALS");
ParserKeyword s_having("HAVING");
ParserKeyword s_order_by("ORDER BY");
ParserKeyword s_limit("LIMIT");
ParserKeyword s_settings("SETTINGS");
ParserKeyword s_by("BY");
ParserKeyword s_rollup("ROLLUP");
ParserKeyword s_cube("CUBE");
ParserKeyword s_top("TOP");
ParserKeyword s_with_ties("WITH TIES");
ParserKeyword s_offset("OFFSET");
ParserNotEmptyExpressionList exp_list(false);
ParserNotEmptyExpressionList exp_list_for_with_clause(false);
ParserNotEmptyExpressionList exp_list_for_select_clause(true);
...
if (!exp_list_for_select_clause.parse(pos, select_expression_list, expected))
return false;
The fifth level, exp_list_for_select_clause.parse, continues in the ParserExpressionList::parseImpl state space:
bool ParserExpressionList::parseImpl(Pos & pos, ASTPtr & node, Expected & expected)
{
return ParserList(std::make_unique<ParserExpressionWithOptionalAlias>(allow_alias_without_as_keyword),
std::make_unique<ParserToken>(TokenType::Comma))
.parse(pos, node, expected);
}
… … and I really can't keep writing these out!
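The fifth level builds a list parser from an element parser and a separator parser. A minimal sketch of such a combinator is shown here. `SubParser`, `parseList`, `parseColumn` and `parseComma` are hypothetical names; rejecting empty lists and handing back a dangling separator are choices made for this sketch.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;

// A sub-parser consumes tokens from `pos` and appends what it recognized to `out`.
// It must leave `pos` unchanged when it fails.
using SubParser = std::function<bool(const Tokens &, std::size_t &, std::vector<std::string> &)>;

// Parse: element (separator element)*
inline bool parseList(const SubParser & element, const SubParser & separator,
                      const Tokens & tokens, std::size_t & pos, std::vector<std::string> & items)
{
    if (!element(tokens, pos, items))
        return false; // this sketch rejects empty lists

    while (true)
    {
        std::size_t before_separator = pos;
        std::vector<std::string> ignored;
        if (!separator(tokens, pos, ignored))
            break;
        if (!element(tokens, pos, items))
        {
            pos = before_separator; // dangling separator: hand it back to the caller
            break;
        }
    }
    return true;
}

// Hypothetical element parser: any token that is not a comma or FROM.
inline bool parseColumn(const Tokens & tokens, std::size_t & pos, std::vector<std::string> & out)
{
    if (pos >= tokens.size() || tokens[pos] == "," || tokens[pos] == "FROM")
        return false;
    out.push_back(tokens[pos++]);
    return true;
}

// Hypothetical separator parser: exactly one comma.
inline bool parseComma(const Tokens & tokens, std::size_t & pos, std::vector<std::string> &)
{
    if (pos >= tokens.size() || tokens[pos] != ",")
        return false;
    ++pos;
    return true;
}

// Usage: the column list of the article's query stops in front of FROM.
inline void listExample()
{
    Tokens tokens = {"a", ",", "b", "FROM", "t1"};
    std::size_t pos = 0;
    std::vector<std::string> columns;
    bool ok = parseList(parseColumn, parseComma, tokens, pos, columns);
    assert(ok);
    assert(columns.size() == 2 && pos == 3);
}
```

Because the element and separator are passed in, the same loop can parse a comma-separated column list or, with a different pair of sub-parsers, a sequence of SELECT queries joined by UNION ALL.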
As you can see, when the ast is parsed, the state space is laid out in advance. For example, the state space of select:
- expression list
- from tables
- where
- group by
- with …
- order by
- limit
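Within a state space, the bool returned by parse decides whether to continue into a sub-state space, and the recursion goes on until the whole ast has been parsed.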
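To make the recursion concrete, here is a small recursive-descent sketch that turns the token sequence of `EXPLAIN SELECT a,b FROM t1` into a tree. It is a toy grammar, not ClickHouse's: `Node`, `makeNode`, `parseStatement` and the node kind strings are hypothetical, and only EXPLAIN, SELECT, a column list and an optional FROM are supported.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

using Tokens = std::vector<std::string>;

// Hypothetical AST node: a kind, an optional value, and child nodes.
struct Node
{
    std::string kind;
    std::string value;
    std::vector<std::shared_ptr<Node>> children;
};
using NodePtr = std::shared_ptr<Node>;

inline NodePtr makeNode(std::string kind, std::string value = "")
{
    auto node = std::make_shared<Node>();
    node->kind = std::move(kind);
    node->value = std::move(value);
    return node;
}

inline bool keyword(const Tokens & t, std::size_t & pos, const std::string & word)
{
    if (pos < t.size() && t[pos] == word) { ++pos; return true; }
    return false;
}

inline bool isReserved(const std::string & s)
{
    return s == "EXPLAIN" || s == "SELECT" || s == "FROM" || s == ",";
}

inline bool identifier(const Tokens & t, std::size_t & pos, NodePtr & node)
{
    if (pos >= t.size() || isReserved(t[pos]))
        return false;
    node = makeNode("Identifier", t[pos++]);
    return true;
}

// expression list: identifier (',' identifier)*
inline bool expressionList(const Tokens & t, std::size_t & pos, NodePtr & node)
{
    NodePtr elem;
    if (!identifier(t, pos, elem))
        return false;
    auto list = makeNode("ExpressionList");
    list->children.push_back(elem);
    while (true)
    {
        std::size_t before_separator = pos;
        if (!keyword(t, pos, ","))
            break;
        if (!identifier(t, pos, elem)) { pos = before_separator; break; }
        list->children.push_back(elem);
    }
    node = list;
    return true;
}

// select: SELECT expression-list [FROM identifier]
inline bool selectQuery(const Tokens & t, std::size_t & pos, NodePtr & node)
{
    std::size_t begin = pos;
    NodePtr columns;
    if (!keyword(t, pos, "SELECT") || !expressionList(t, pos, columns)) { pos = begin; return false; }
    auto select = makeNode("SelectQuery");
    select->children.push_back(columns);
    if (keyword(t, pos, "FROM"))
    {
        NodePtr table;
        if (!identifier(t, pos, table)) { pos = begin; return false; }
        auto from = makeNode("Tables");
        from->children.push_back(table);
        select->children.push_back(from);
    }
    node = select;
    return true;
}

// explain: EXPLAIN select
inline bool explainQuery(const Tokens & t, std::size_t & pos, NodePtr & node)
{
    std::size_t begin = pos;
    NodePtr inner;
    if (!keyword(t, pos, "EXPLAIN") || !selectQuery(t, pos, inner)) { pos = begin; return false; }
    node = makeNode("ExplainQuery");
    node->children.push_back(inner);
    return true;
}

// Top level: try each statement kind in order and require all tokens to be consumed.
inline bool parseStatement(const Tokens & t, NodePtr & node)
{
    std::size_t pos = 0;
    return (explainQuery(t, pos, node) || selectQuery(t, pos, node)) && pos == t.size();
}
```

For the article's query, this sketch produces an ExplainQuery node whose single child is a SelectQuery with two children: an ExpressionList holding a and b, and a Tables node holding t1. Each function corresponds to one state space, and the bool it returns tells its caller whether to keep going or try something else.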
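Summary
The benefits of a handwritten parser are clear, concise code, control over every detail, friendly error handling, and changes that stay local instead of rippling through the whole codebase.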
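As an illustration of what friendly error handling can look like in a handwritten parser, the sketch below records the furthest position at which parsing failed, together with what would have been accepted there. The `expected` argument in the excerpts above suggests a similar mechanism, but the `Expected` struct here is a hypothetical design written for this example, not ClickHouse's definition, and the grammar is a toy one.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;

// Hypothetical error tracker: keeps the furthest failure position and the
// alternatives that would have been accepted at that position.
struct Expected
{
    std::size_t max_parsed_pos = 0;
    std::vector<std::string> variants;

    void add(std::size_t pos, const std::string & description)
    {
        if (pos > max_parsed_pos)
        {
            variants.clear(); // a failure further into the input supersedes earlier ones
            max_parsed_pos = pos;
        }
        if (pos == max_parsed_pos
            && std::find(variants.begin(), variants.end(), description) == variants.end())
            variants.push_back(description);
    }
};

inline bool keyword(const Tokens & tokens, std::size_t & pos, const std::string & word, Expected & expected)
{
    if (pos < tokens.size() && tokens[pos] == word)
    {
        ++pos;
        return true;
    }
    expected.add(pos, word);
    return false;
}

// Toy grammar: SHOW (TABLES | PROCESSLIST)
inline bool parseShow(const Tokens & tokens, std::size_t & pos, Expected & expected)
{
    std::size_t begin = pos;
    if (keyword(tokens, pos, "SHOW", expected)
        && (keyword(tokens, pos, "TABLES", expected) || keyword(tokens, pos, "PROCESSLIST", expected)))
        return true;
    pos = begin;
    return false;
}

inline std::string errorMessage(const Expected & expected)
{
    std::string msg = "Syntax error at token " + std::to_string(expected.max_parsed_pos) + ". Expected one of: ";
    for (std::size_t i = 0; i < expected.variants.size(); ++i)
        msg += (i ? ", " : "") + expected.variants[i];
    return msg;
}

// Usage: SHOW is accepted, so the error points at the second token.
inline void expectedExample()
{
    Tokens tokens = {"SHOW", "DATABASES"};
    std::size_t pos = 0;
    Expected expected;
    bool ok = parseShow(tokens, pos, expected);
    assert(!ok);
    assert(errorMessage(expected) == "Syntax error at token 1. Expected one of: TABLES, PROCESSLIST");
}
```

Because every failing branch reports into the same tracker, the message points at the token where the input stopped making sense and lists all alternatives that were tried at that spot.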
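The drawback is the high cost of manual work: it takes a large number of tests to guarantee correctness, plus some fuzzing to guarantee reliability.
Fortunately, ClickHouse's implementation is already fairly complete, so even when new requirements come up, patching the existing foundation is enough.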