Impala 的词法解析使用语法分析生成器 JFlex 和 Java 语法解析器自动生成工具 CUP,具体原理和使用见官方文档,本文只分析一个简单的 SELECT 语句。

# JFlex

首先看官网介绍:

JFlex is a lexical analyzer generator (also known as scanner generator) for Java, written in Java.

A lexical analyzer generator takes as input a specification with a set of regular expressions and corresponding actions. It generates a program (a lexer) that reads input, matches the input against the regular expressions in the spec file, and runs the corresponding action if a regular expression matched. Lexers usually are the first front-end step in compilers, matching keywords, comments, operators, etc, and generating an input token stream for parsers. Lexers can also be used for many other purposes.

JFlex lexers are based on deterministic finite automata (DFAs). They are fast, without expensive backtracking.

JFlex is designed to work together with the LALR parser generator CUP by Scott Hudson, and the Java modification of Berkeley Yacc BYacc/J by Bob Jamison. It can also be used together with other parser generators like ANTLR or as a standalone tool.

总的来说,是一个 Java 语法分析生成器,它会识别到关键字,操作符等,与 Cup 一起配合使用。

# Impala JFlex

impala 中定义了一个 sql-scanner.flex,通过 maven-jflex-plugin 插件生成 SqlScanner.java 文件。
flex 分为三部分,每一部分通过 %% 符号分隔。
第一部分 usercode,定义了包名和要引入的类。
第二部分 options and declarations,定义了一个类名、类修饰符、编码等选项,通过正则表达式声明了一些变量。
第三部分 lexical rules,定义了匹配的解析器符号。

下面是 JFlex 文件中的内容:

fe/src/main/jflex/sql-scanner.flex
// usercode
// package statement
package org.apache.impala.analysis;
// import statements
import java_cup.runtime.Symbol;
// ...
// divided by %%
%%
// options and declarations
// give the generated class the name SqlScanner and to write the code to a file SqlScanner.java
%class SqlScanner
// switches to CUP compatibility mode to interface with a CUP generated parser
%cup
// makes the generated class public
%public
// makes the generated class final
%final
// be copied verbatim into the scanning method and will be executed each time the end of file is reached
%eofval{
  return newToken(SqlParserSymbols.EOF, null);
%eofval}
// defines the set of characters the scanner will work on
%unicode
// switches line counting on(the current line number can be accessed via the variable yyline)
%line
// switches column counting on (the current column is accessed via yycolumn)
%column
// is copied verbatim into the generated lexer class source
// 定义了关键字、保留字和字符定义
%{
  // ...
%}
// macros declarations
LineTerminator = \r|\n|\r\n
Whitespace = {LineTerminator} | [ \t\f]
// ...
// lexical state declaration
%state EOLHINT
// divided by %%
%%
// lexical rules
"..." { return newToken(SqlParserSymbols.DOTDOTDOT, null); }
// ...

下面是 pom 插件配置:

fe/pom.xml
<plugin>
    <groupId>de.jflex</groupId>
    <artifactId>maven-jflex-plugin</artifactId>
    <version>1.4.3</version>
    <executions>
        <execution>
        <id>jflex</id>
        <phase>generate-sources</phase>
        <goals>
            <goal>generate</goal>
        </goals>
        <configuration>
            <backup>false</backup>
        </configuration>
        </execution>
    </executions>
</plugin>

# CUP

官方说明:

CUP stands for Construction of Useful Parsers and is an LALR parser generator for Java. It was developed by C. Scott Ananian, Frank Flannery, Dan Wang, Andrew W. Appel and Michael Petter. It implements standard
LALR(1) parser generation. As a parser author, you specify the symbols of Your grammar (terminal T1,T2; non terminal N1, N2;), as well as the productions (LHS :== RHS1 | RHS2 😉. If you provide each production
alternative with action code ({: RESULT = myfunction(); :}), the parser will call this action code after performing a reduction with the particular production. You can use these callbacks to assemble an AST
(Abstract Syntax Tree) or for arbitrary purposes. You should also have a look at the scanner generator JFlex, which is suited particularly well for collaboration with CUP.

CUP 适合与 JFlex 配合,将 JFlex 识别到的字符组装成特定的语句。

# Impala CUP

impala 在 sql-parser.cup 中定义了所有的语句结构,然后通过 cup-maven-plugin 生成 Java 类 SqlParser.java 和 SqlParserSymbols.java。

CUP 包含四部分,但是它并没有像 JFlex 一样用分隔符。
第一部分 Package and Import Specifications,定义了包名,引入了 Java 类。
第二部分 User Code Components,定义了一些变量和方法,直接替换到生成的类中。
第三部分 Symbol Lists,定义了终结符和非终结符。简单地理解,终结符不能再下推别的表达式,非终结符可以下推到终结符。如 a = 1,b = a,这其中 a 就是终结符,b 就是非终结符。
第四部分 Precedence and Associativity declarations,定义了优先级和关联性声明。例如一些符号的左关联。
第五部分 The Grammar,定义了所有 SQL 语句的语法。

下面是 CUP 文件内容:

fe/src/main/cup/sql-parser.cup
// Package and Import Specifications
package org.apache.impala.analysis;
import com.google.common.collect.Lists;
// User Code Components, methods and variable to be placed directly within the generated parser class
parser code {:
  // ...
:};
// Symbol Lists, terminal and nonterminal
terminal
  ..., KW_SELECT, ...;
// ...
nonterminal SelectStmt select_stmt;
// ...
// Precedence and Associativity declarations
precedence left KW_AND;
// ...
// The Grammar
start with stmt;
stmt ::=
  query_stmt:query
  {: RESULT = query; :}
  | ...
  ;
// ...

CUP 类名在 pom 中定义:

fe/pom.xml
<plugin>
    <groupId>net.sourceforge.czt.dev</groupId>
    <artifactId>cup-maven-plugin</artifactId>
    <version>1.6-cdh</version>
    <executions>
        <execution>
        <id>cup</id>
        <phase>generate-sources</phase>
        <goals>
            <goal>generate</goal>
        </goals>
        </execution>
    </executions>
    <configuration>
        <cupDefinition>sql-parser.cup</cupDefinition>
        <className>SqlParser</className>
        <symbolsName>SqlParserSymbols</symbolsName>
        <outputDirectory>${project.build.directory}/generated-sources/cup</outputDirectory>
        <expectedConflicts>0</expectedConflicts>
    </configuration>
</plugin>

# 例子

对于一个简单 SQL:select a, b, c from tbl where d = 1,我们来看下是如何解析的。

首先,我们来看 impala 的 parse 方法。先是 SqlScanner 扫描,然后再经过 SqlParser 解析。

fe/src/main/java/org/apache/impala/analysis/Parser.java
public static StatementBase parse(String query, TQueryOptions options)
    throws AnalysisException {
  SqlScanner input = new SqlScanner(new StringReader(query));
  SqlParser parser = new SqlParser(input);
  parser.setQueryOptions(options);
  try {
    return (StatementBase) parser.parse().value;
  } catch (AnalysisException e) {
    throw e;
  } catch (Exception e) {
    throw new ParseException(parser.getErrorMsg(query), e);
  }
}

扫描时,select、from、where 分别对应关键字 SqlParserSymbols.KW_SELECTSqlParserSymbols.KW_FROMSqlParserSymbols.KW_WHERE ,a、b、c、d、tbl 对应正则匹配到 Identifier,Identifier 又关联到 SqlParserSymbols.IDENT, 对应到 SqlParserSymbols.COMMA= 对应到 SqlParserSymbols.EQUA ,1 匹配到 IntegerLiteral,对应 SqlParserSymbols.INTEGER_LITERAL

解析时, KW_SELECT IDENT, IDENT, IDENT KW_FROM IDENT KW_WHERE IDENT EQUA INTEGER_LITERAL 匹配到 select_stmt,其匹配路径如下: