Well… not exactly writing a parser but generating one.
For the task I’ve been working on recently I needed to have (inside my java code) a data model representing database tables, so I could perform data normalization according to it.
“CREATE TABLE” sql statement is a natural fit for representing the data model. The only thing I had to decide on is parser. How do I parse SQL? Can I use something off-the-shelf? I asked myself and it turns out yes I can!
I took a look at Javacc and then at Antlr. Antlr seemed more, I would say, mature, more documentation, tutorials, sample grammars, etc.
Antlr (ANother Tool for Language Recognition)
is a language tool that provides a framework for constructing recognizers, compilers, and translators from grammatical descriptions containing actions in a variety of target languages.
BTW, the data I was going to parse looked like this:
CREATE TABLE table_one (
customer_number integer NOT NULL,
address character varying(30)
);
CREATE TABLE table_two (
id integer NOT NULL,
city character varying(50)
);
I wrote following grammar for it:
grammar CreateTable;
/*------------------------------------------------------------------
* PARSER RULES
*------------------------------------------------------------------*/
table_list :
(table_def)*;
table_def :
'create' 'table' table_name table_element_list SEMICOLON;
table_name :
(schema DOT)? table;
table_element_list :
LEFT_PAREN table_element (COMMA table_element)* RIGHT_PAREN;
table_element :
column_name data_type_def (column_constraint)?;
data_type_def :
data_type lenght_constraint?;
schema : ID;
table : ID;
column_name :ID;
data_type : ID:
length_constraint :
LEFT_PAREN NUMBER RIGHT_PAREN;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
LEFT_PAREN : '(';
RIGHT_PAREN : ')';
COMMA : ',';
SEMICOLON : ';';
DOT : '.';
NUMBER : (DIGIT)+;
ID : (('a'..'z'|'A'..'Z' | '_') ((DIGIT)*))+ ;
NEWLINE:'\r'? '\n' ;
WS : ( '\t' | ' ' | '\r' | '\n' | '\u000C' )+ { $channel = HIDDEN; } ;
fragment DIGIT : '0'..'9' ;
This is all fine but I want my parser to return data right away like:
List<Table> tables = parser.table_list();
List<Column> cols = table.getColumns();
Column col = table.getColumn("id");
col.getName();
col.getType();
col.getConstriant();
col.getLength();
So we have to made a few changes to our parser rules in the grammar.
grammar CreateTable;
@header {
import java.util.ArrayList;
import java.util.List;
}
table_list returns [List tables]
: {$tables = new ArrayList();} (table_def { $tables.add($table_def.tbl); } )* ;
table_def returns [Table tbl]
: 'create' 'table' table_name
table_element_list {
$tbl = new Table(String.valueOf($table_name.text), $table_element_list.cols);
}
SEMICOLON ;
table_name
: (schema DOT)? table ;
table_element_list returns [List cols]
: {$cols = new ArrayList();} LEFT_PAREN te=table_element {$cols.add($te.col);} (COMMA te=table_element {$cols.add($te.col);})* RIGHT_PAREN ;
table_element returns [Column col]
: column_name data_type_def (column_constraint)? {$col = new Column($column_name.cn, $data_type_def.tpe, $column_constraint.text, $data_type_def.len);};
column_name returns [String cn]
: ID {$cn = String.valueOf($ID.text); };
column_constraint
: 'not' 'null';
data_type_def returns [String tpe, Integer len]
: data_type {$tpe = $data_type.tp;} length_constraint? {$len = $length_constraint.len;}
;
data_type returns [String tp]
:
((('character' 'varying') | 'varchar') {$tp = "varchar";}
| ('bit' 'varying' | 'varbit') {$tp = "varbit";}
| ('double' 'precision' | 'float8') {$tp = "float8"; }
| ('character' | 'char') {$tp = "char";}
| ('integer' | 'int' | 'int4') {$tp = "integer";}
| ID {$tp = $ID.text;}
);
length_constraint returns [Integer len]
: LEFT_PAREN NUMBER RIGHT_PAREN {$len = Integer.valueOf($NUMBER.text);} ;
schema : ID;
table : ID;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
LEFT_PAREN : '(';
RIGHT_PAREN : ')';
COMMA : ',';
SEMICOLON : ';';
DOT : '.';
NUMBER : (DIGIT)+;
ID : (('a'..'z'|'A'..'Z' | '_') ((DIGIT)*))+ ;
NEWLINE:'\r'? '\n' ;
WS : ( '\t' | ' ' | '\r' | '\n' | '\u000C' )+ { $channel = HIDDEN; } ;
fragment DIGIT : '0'..'9' ;
That’s basically it. I have a parser that I wanted and I think it took me less than I would have spent writing my own.
There are many sample grammars. Have a look maybe there is existing one for the language you’re about to parse. :)
P.S. The exact “CREATE TABLE” SQL statement is slightly more complex that I assume in the grammar.




10 Comments
I’m puzzled with your ‘ID’ specification. Shouldn’t it be like the one below?
‘a’ .. ‘z’ ( ‘a’ .. ‘z’ | ‘0′ .. ‘9′ | ‘_’ )*
Um… ID can start with “_” symbol so it probably should look like
ID : (‘a’.. ‘z’ | ‘_’) (‘a’..’z’ | ‘0′..’9′ | ‘_’ )*;
“The exact “CREATE TABLE” SQL statement is slightly more complex that I assume in the grammar”– slightly? ;-)
I think much more complex. I was looking on ANTLR grammars list for some SQL’92 grammar but unforunately there is nothing like this. I wonder why?
I’ve started to manually create it translating from BNF definitions but it looks like a tedious task. Do you now any complete ANTLR grammar for SQL?
Best,
Maciej
The “create table” only statement is not too complex, but the entire ANSI standard SQL grammar is. Anyway, have a look at the following links:
http://www.antlr.org/grammar/1057936474293/index.html
http://www.antlr.org/grammar/1057863397080/index.html
Hope this helps.
Danil,
Thanks for grammars links. However, they are in ANTLR v2 which is quite different from the current V3, supported by ANTLRWorks IDE. Conversion it is also not so straightforward for me (this is compositional grammar combining parser and lexer together and need to be separater first). Do you have v3 version of it?
I don’t have it. :( So you’ll probably have to translate it by yourself. Also I’m sure you’ve seen it but just in case http://savage.net.au/SQL/sql-92.bnf.html
When I generate your grammer I get code that fails because Table and Column are not defined classes.
Is ANTLR supposed to be generating those, or am I supposed to have coded them myself? What should they look like?
Thanks!
Yes, you’re supposed to create them. They are very simple though. Just java beans.
public class Column implements Comparable { private String name; private String dataType; private String constraint; private Integer length; public Column(final String name, final String dataType, final String constraint, final Integer len) { this.name = name; this.dataType = dataType; this.constraint = constraint; this.length = len; } // setters, getters go here public class Table { private String name; private List<Column> columns; public Table(final String name, List<Column> cols) { this.name = name; Collections.sort(cols); this.columns = cols; } // getters, setters go here.Using the Derby Parser could be an alternative to folks who want SQL parsing abilities in their applications.
http://issues.apache.org/jira/browse/DERBY-3946
Thanks for the link. I didn’t consider such alternatives.