Well… not exactly writing a parser but generating one.
For the task I’ve been working on recently I needed to have (inside my java code) a data model representing database tables, so I could perform data normalization according to it.

“CREATE TABLE” sql statement is a natural fit for representing the data model. The only thing I had to decide on is parser. How do I parse SQL? Can I use something off-the-shelf? I asked myself and it turns out yes I can!
I took a look at Javacc and then at Antlr. Antlr seemed more, I would say, mature, more documentation, tutorials, sample grammars, etc.

Antlr (ANother Tool for Language Recognition)

is a language tool that provides a framework for constructing recognizers, compilers, and translators from grammatical descriptions containing actions in a variety of target languages.

BTW, the data I was going to parse looked like this:

CREATE TABLE table_one (
    customer_number integer NOT NULL,
    address character varying(30)
);

CREATE TABLE table_two (
    id integer NOT NULL,
    city character varying(50)
);

I wrote following grammar for it:

grammar CreateTable;

/*------------------------------------------------------------------
 * PARSER RULES
 *------------------------------------------------------------------*/

table_list :
    (table_def)*;

table_def :
    'create' 'table' table_name table_element_list SEMICOLON;

table_name :
    (schema DOT)? table;

table_element_list :
    LEFT_PAREN table_element (COMMA table_element)* RIGHT_PAREN;

table_element :
    column_name data_type_def (column_constraint)?;

data_type_def :
    data_type lenght_constraint?;

schema : ID;

table : ID;

column_name :ID;

data_type : ID:

length_constraint :
    LEFT_PAREN NUMBER RIGHT_PAREN;

/*------------------------------------------------------------------
 * LEXER RULES
 *------------------------------------------------------------------*/

LEFT_PAREN : '(';
RIGHT_PAREN : ')';
COMMA : ',';
SEMICOLON : ';';
DOT	:	 '.';
NUMBER 	:	(DIGIT)+;	
ID  : (('a'..'z'|'A'..'Z' | '_') ((DIGIT)*))+ ;
NEWLINE:'\r'? '\n' ;
WS : ( '\t' | ' ' | '\r' | '\n' | '\u000C' )+ 	{ $channel = HIDDEN; } ;
fragment DIGIT :   '0'..'9' ;

This is all fine but I want my parser to return data right away like:

List<Table> tables = parser.table_list();
List<Column> cols = table.getColumns();
Column col = table.getColumn("id");
col.getName();
col.getType();
col.getConstriant();
col.getLength();

So we have to made a few changes to our parser rules in the grammar.

grammar CreateTable;

@header {
import java.util.ArrayList;
import java.util.List;

}

table_list returns [List tables] 
	: {$tables = new ArrayList();}	(table_def { $tables.add($table_def.tbl); } )*  ;
	

table_def returns [Table tbl]
	: 'create' 'table' table_name 
		table_element_list {
			$tbl = new Table(String.valueOf($table_name.text), $table_element_list.cols);
		}
		SEMICOLON ;

table_name 
	:	(schema DOT)? table  ;

table_element_list returns [List cols]
	: {$cols = new ArrayList();} LEFT_PAREN te=table_element {$cols.add($te.col);} (COMMA te=table_element {$cols.add($te.col);})*   RIGHT_PAREN ;

table_element returns [Column col]
	:  column_name data_type_def (column_constraint)? {$col = new Column($column_name.cn, $data_type_def.tpe, $column_constraint.text, $data_type_def.len);};

column_name returns [String cn]
	:	ID {$cn = String.valueOf($ID.text); };

column_constraint
	:	'not' 'null';

data_type_def returns [String tpe, Integer len]
	: data_type {$tpe = $data_type.tp;} length_constraint? {$len = $length_constraint.len;} 	 
	;
	
data_type returns [String tp]
	: 
	((('character' 'varying')  | 'varchar') {$tp = "varchar";}
		| ('bit' 'varying' | 'varbit') {$tp = "varbit";}
		| ('double' 'precision' | 'float8') {$tp = "float8"; }
		| ('character' | 'char') {$tp = "char";}
		| ('integer' | 'int' | 'int4') {$tp = "integer";}
		| ID {$tp = $ID.text;}
		);
	

length_constraint returns [Integer len] 
	: LEFT_PAREN NUMBER RIGHT_PAREN {$len = Integer.valueOf($NUMBER.text);} 	;
	
schema : ID;
table :	ID;

/*------------------------------------------------------------------
 * LEXER RULES
 *------------------------------------------------------------------*/

LEFT_PAREN : '(';
RIGHT_PAREN : ')';
COMMA : ',';
SEMICOLON : ';';
DOT	:	 '.';
NUMBER 	:	(DIGIT)+;	
ID  : (('a'..'z'|'A'..'Z' | '_') ((DIGIT)*))+ ;
NEWLINE:'\r'? '\n' ;
WS : ( '\t' | ' ' | '\r' | '\n' | '\u000C' )+ 	{ $channel = HIDDEN; } ;
fragment DIGIT :   '0'..'9' ;

That’s basically it. I have a parser that I wanted and I think it took me less than I would have spent writing my own.
There are many sample grammars. Have a look maybe there is existing one for the language you’re about to parse. :)

P.S. The exact “CREATE TABLE” SQL statement is slightly more complex that I assume in the grammar.

Advertisements