老叶茶馆|ClickHouse和他的朋友们(8)纯手工打造的SQL解析器
_本文原题:ClickHouse和他的朋友们(8)纯手工打造的SQL解析器
原文出处:https://bohutang.me/2020/07/25/clickhouse-and-friends-parser/
现实生活中的物品一旦被标记为“纯手工打造” , 给人的第一感觉就是“上乘之品” , 一个字“贵” , 比如北京老布鞋 。
但是在计算机世界里 , 如果有人告诉你 ClickHouse 的 SQL 解析器是纯手工打造的 , 是不是很惊讶!这个问题引起了不少网友的关注 , 所以本篇聊聊 ClickHouse 的纯手工解析器 , 看看它们的底层工作机制及优缺点 。
枯燥先从一个 SQL 开始:
EXPLAIN SELECT a,b FROM t1
token
首先对 SQL 里的字符逐个做判断 , 然后根据其关联性做 token 分割:
本文插图
比如连续的 WordChar , 那它就是 BareWord , 解析函数在 Lexer::nextTokenImpl , 解析调用栈:
DB::Lexer::nextTokenImpl Lexer.cpp:63
DB::Lexer::nextToken Lexer.cpp:52
DB::Tokens::operator[](unsigned long) TokenIterator.h:36
DB::TokenIterator::get TokenIterator.h:62
DB::TokenIterator::operator-> TokenIterator.h:64
DB::tryParseQuery(DB::IParser&, char const*&, char const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&, bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool, unsigned long, unsigned long) parseQuery.cpp:224
DB::parseQueryAndMovePosition(DB::IParser&, char const*&, char const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool, unsigned long, unsigned long) parseQuery.cpp:314
DB::parseQuery(DB::IParser&, char const*, char const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, unsigned long, unsigned long) parseQuery.cpp:332
DB::executeQueryImpl(const char *, const char *, DB::Context &, bool, DB::QueryProcessingStage::Enum, bool, DB::ReadBuffer *) executeQuery.cpp:272
DB::executeQuery(DB::ReadBuffer&, DB::WriteBuffer&, bool, DB::Context&, std::__1:: function<void (std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)>) executeQuery.cpp:731
DB::MySQLHandler::comQuery(DB::ReadBuffer&) MySQLHandler.cpp:313
DB::MySQLHandler::run MySQLHandler.cpp:150
ast
token 是最基础的元组 , 他们之间没有任何关联 , 只是一堆生冷的词组与符号 , 所以我们还需对其进行 语法解析 , 让这些 token 之间建立一定的关系 , 达到一个可描述的活力 。
ClickHouse 在解每一个 token 的时候 , 会根据当前的 token 进行状态空间进行预判(parse 返回 true 则进入子状态空间继续) , 然后决定状态跳转 , 比如: 分页标题
EXPLAIN -- TokenType::BareWord
逻辑首先会进入Parsers/ParserQuery.cpp 的 ParserQuery::parseImpl 方法:
bool res = query_with_output_p.parse(pos, node, expected)
|| insert_p.parse(pos, node, expected)
|| use_p.parse(pos, node, expected)
|| set_role_p.parse(pos, node, expected)
|| set_p.parse(pos, node, expected)
|| system_p.parse(pos, node, expected)
|| create_user_p.parse(pos, node, expected)
|| create_role_p.parse(pos, node, expected)
|| create_quota_p.parse(pos, node, expected)
|| create_row_policy_p.parse(pos, node, expected)
|| create_settings_profile_p.parse(pos, node, expected)
|| drop_access_entity_p.parse(pos, node, expected)
|| grant_p.parse(pos, node, expected);
这里会对所有 query 类型进行 parse 方法的调用 , 直到有分支返回 true 。
我们来看 第一层query_with_output_p.parse Parsers/ParserQueryWithOutput.cpp:
bool parsed =
explain_p.parse(pos, query, expected)
|| select_p.parse(pos, query, expected)
|| show_create_access_entity_p.parse(pos, query, expected)
|| show_tables_p.parse(pos, query, expected)
|| table_p.parse(pos, query, expected)
|| describe_table_p.parse(pos, query, expected)
|| show_processlist_p.parse(pos, query, expected)
|| create_p.parse(pos, query, expected)
|| alter_p.parse(pos, query, expected)
|| rename_p.parse(pos, query, expected)
|| drop_p.parse(pos, query, expected)
|| check_p.parse(pos, query, expected)
|| kill_query_p.parse(pos, query, expected)
|| optimize_p.parse(pos, query, expected)
|| watch_p.parse(pos, query, expected)
|| show_access_p.parse(pos, query, expected)
|| show_access_entities_p.parse(pos, query, expected)
|| show_grants_p.parse(pos, query, expected)
|| show_privileges_p.parse(pos, query, expected
跳进 第二层explain_p.parse ParserExplainQuery::parseImpl状态空间:
bool ParserExplainQuery::parseImpl(Pos & pos, ASTPtr & node, Expected & expected)
{
ASTExplainQuery::ExplainKind kind;
bool old_syntax = false;
ParserKeyword s_ast( "AST");
ParserKeyword s_analyze( "ANALYZE");
ParserKeyword s_explain( "EXPLAIN");
ParserKeyword s_syntax( "SYNTAX");
ParserKeyword s_pipeline( "PIPELINE");
ParserKeyword s_plan( "PLAN");
... ...
elseif(s_explain.ignore(pos, expected))
{
... ...
}
... ...
ParserSelectWithUnionQuery select_p;
ASTPtr query;
if(!select_p.parse(pos, query, expected))
returnfalse;
... ...
s_explain.ignore 方法会进行一个 keyword 解析 , 解析出 ast node:
EXPLAIN -- keyword
跃进 第三层select_p.parse ParserSelectWithUnionQuery::parseImpl状态空间:
bool ParserSelectWithUnionQuery::parseImpl(Pos & pos, ASTPtr & node, Expected & expected)
{
ASTPtr list_node;
ParserList parser(std::make_unique<ParserUnionQueryElement>, std::make_unique<ParserKeyword>( "UNION ALL"), false);分页标题
if(!parser.parse(pos, list_node, expected))
returnfalse;
...
parser.parse 里又调用 第四层ParserSelectQuery::parseImpl 状态空间:
bool ParserSelectQuery::parseImpl(Pos & pos, ASTPtr & node, Expected & expected)
{
auto select_query = std::make_shared<ASTSelectQuery>;
node = select_query;
ParserKeyword s_select( "SELECT");
ParserKeyword s_distinct( "DISTINCT");
ParserKeyword s_from( "FROM");
ParserKeyword s_prewhere( "PREWHERE");
ParserKeyword s_where( "WHERE");
ParserKeyword s_group_by( "GROUP BY");
ParserKeyword s_with( "WITH");
ParserKeyword s_totals( "TOTALS");
ParserKeyword s_having( "HAVING");
ParserKeyword s_order_by( "ORDER BY");
ParserKeyword s_limit( "LIMIT");
ParserKeyword s_settings( "SETTINGS");
ParserKeyword s_by( "BY");
ParserKeyword s_rollup( "ROLLUP");
ParserKeyword s_cube( "CUBE");
ParserKeyword s_top( "TOP");
ParserKeyword s_with_ties( "WITH TIES");
ParserKeyword s_offset( "OFFSET");
ParserNotEmptyExpressionList exp_list( false);
ParserNotEmptyExpressionList exp_list_for_with_clause( false);
ParserNotEmptyExpressionList exp_list_for_select_clause( true);
...
if(!exp_list_for_select_clause.parse(pos, select_expression_list, expected))
returnfalse;
第五层exp_list_for_select_clause.parse ParserExpressionList::parseImpl状态空间继续:
bool ParserExpressionList::parseImpl(Pos & pos, ASTPtr & node, Expected & expected)
{
returnParserList(
std::make_unique<ParserExpressionWithOptionalAlias>(allow_alias_without_as_keyword),
std::make_unique<ParserToken>(TokenType::Comma))
.parse(pos, node, expected);
}
... ... 写不下去个鸟!
可以发现 , ast parser 的时候 , 预先构造好状态空间 , 比如 select 的状态空间:
- expression list
- from tables
- where
- group by
- with ...
- order by
- limit
总结
手工 parser 的好处是代码清晰简洁 , 每个细节可防可控 , 以及友好的错误处理 , 改动起来不会一发动全身 。 缺点是手工成本太高 , 需要大量的测试来保证其正确性 , 还需要一些fuzz来保证可靠性 。 好在ClickHouse 已经实现的比较全面 , 即使有新的需求 , 在现有基础上修修补补即可 。
文内链接
- https://github.com/ClickHouse/ClickHouse/blob/558f9c76306ffc4e6add8fd34c2071b64e914103/src/Parsers/ParserQuery.cpp#L26
- https://github.com/ClickHouse/ClickHouse/blob/558f9c76306ffc4e6add8fd34c2071b64e914103/src/Parsers/ParserQueryWithOutput.cpp#L31
- https://github.com/ClickHouse/ClickHouse/blob/558f9c76306ffc4e6add8fd34c2071b64e914103/src/Parsers/ParserExplainQuery.cpp#L10
- https://github.com/ClickHouse/ClickHouse/blob/558f9c76306ffc4e6add8fd34c2071b64e914103/src/Parsers/ParserSelectWithUnionQuery.cpp#L26分页标题
- https://github.com/ClickHouse/ClickHouse/blob/558f9c76306ffc4e6add8fd34c2071b64e914103/src/Parsers/ParserSelectQuery.cpp#L24
- https://github.com/ClickHouse/ClickHouse/blob/558f9c76306ffc4e6add8fd34c2071b64e914103/src/Parsers/ExpressionListParsers.cpp#L520
- 戏说健康|江南·人文 | 小桥流水, 书场茶馆, 串起江南的人文生态
- 旅行红茶馆|周末随心第十二飞|温州到底好不好玩?大老板们的生活究竟是怎样的?
- 超美时尚屋|大隐隐于市, 成都这处茶馆悠闲巴适, 我私藏了很久, 少有游客知道
- 旅行红茶馆|南平|武夷山的人情味,都在这一杯清香四溢的茶里了
- 躺着看电影|【书香潜川】黄泥河老街上的茶馆
- |由器及道,明心见性,来七宝茶馆感受东方当代茶器之美吧!
- 穿搭日记|这家面积不到50平米的茶馆, 成泰顺最火的旅游点, 游客纷纷点赞