
yacc&lex-Chapter1

2019-11-11 02:18:44
Source: reposted
Contributed by: a site user

References

- "lex & yacc", 2nd edition. For download information see http://blog.csdn.net/a_flying_bird/article/details/52486815

This article is my set of study notes on that book.

Lex

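Key points

File extensions

Lex files usually use the extension .l, .ll, or .lex, although in practice any name works.

File structure

A lex file has three sections, separated by %% lines: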

%{
/* part 1: Definition section, e.g. global C declarations. */
%}
%%
 /* part 2: Rules section. Rule = Pattern + Action. */
%%
/* part 3: C code. */

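Note: do not mistype %} as }%, or lex reports a "premature EOF" error.

Everything between %{ and %} is copied verbatim into the generated C file, so any valid C code can go there. It usually holds declarations that the C code later in the lex file depends on.

Generating a C file from a lex file

The lex command converts a lex file into a C file (lex.yy.c). When building the executable, link against the lex library with -ll. For example: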

lex simplest.l
gcc lex.yy.c -ll -o test

Examples

The simplest example

This corresponds to page 2 of the English edition.

Code (simplest.l):

%%
.|\n    ECHO;
%%

Compile (convert the lex file to a C file), link, and run:

$ ls
simplest.l
$ lex simplest.l
$ ls
lex.yy.c simplest.l
$ gcc lex.yy.c -ll -o test
$ ./test
The simplest lex program.    ------ typed at the keyboard
The simplest lex program.    ------ echoed by the program
^C
$

Recognizing Words

This example recognizes a fixed set of words and simply echoes anything it does not know. It corresponds to ch1-02.l in the book.

Code:

%{
/*
 * this sample demonstrates (very) simple recognition:
 * a verb, or not a verb.
 */
%}
%%
[\t ]+      /* ignore whitespace */ ;
is |
am |
are |
were |
was |
be |
being |
been |
do |
does |
did |
will |
would |
should |
can |
could |
has |
have |
had |
go          {printf("%s: is a verb\n", yytext);}
[a-zA-Z]+   {printf("%s: is not verb\n", yytext);}
.|\n        {ECHO; /* normal default anyway */ }
%%
int main()
{
    yylex();
    return 0;
}

Run:

$ lex recoginzing_word.l
$ gcc lex.yy.c -ll -o test
$ ./test
I am a student. You are a teacher.    ------ typed at the keyboard
I: is not verb
am: is a verb
a: is not verb
student: is not verb
.You: is not verb
are: is a verb
a: is not verb
teacher: is not verb
.
^C
$

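Key points

A lex file has three sections: the definition section, the rules section, and the user subroutines section.

The definition section may contain a block delimited by %{ and %}, which holds C code such as #include directives, function prototypes, and global variables. When lex generates lex.yy.c, it copies this block into the C file verbatim.

Rules section: each rule consists of a pattern and an action, separated by whitespace. The pattern is a regular expression. When the lexer recognizes a pattern, it executes the corresponding action, which takes the form { C code }.

User subroutines section: copied to the end of the .c file.

Special actions:

- ";": the C empty statement, which does nothing. The matched input is simply discarded.
- "ECHO;": the default behavior, which prints the matched string to the output (stdout), i.e. echoes it.
- "|": use the action of the next pattern. Note the syntax: a | action is separated from its pattern by whitespace, whereas the regular-expression | operator has no whitespace around it.

Note 1: the difference between ; and ECHO; is that the former discards the input while the latter prints it. Change ECHO; to ; in the example and watch how the output changes.

Note 2: the | action cannot be written on a single line like this:

had | go {printf("%s: is a verb\n", yytext);}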

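Variables:

yytext: holds the matched string. Its type is visible in the generated .c file: extern char *yytext;

Disambiguation rules: each piece of input is matched only once, and the longest match wins. In the book's words:

- Lex patterns only match a given input character or string once.
- Lex executes the action for the longest possible match for the current input.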
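The longest-match rule can be illustrated without lex at all. The following is a minimal sketch in plain C, not what lex generates: it assumes just two rules, the literal "is" and [a-zA-Z]+, and the names pick_rule, match_is, match_word, and the RULE_* constants are hypothetical helpers invented for this illustration. It also shows the tie-break that lex applies when two rules match the same length: the rule listed first wins.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical rule ids for this sketch; they are not lex names. */
enum { RULE_NONE = -1, RULE_KEYWORD_IS = 0, RULE_WORD = 1 };

/* Length matched by the literal pattern "is" at the start of s, or 0. */
static size_t match_is(const char *s)
{
    return strncmp(s, "is", 2) == 0 ? 2 : 0;
}

/* Length matched by [a-zA-Z]+ at the start of s, or 0. */
static size_t match_word(const char *s)
{
    size_t n = 0;
    while (isalpha((unsigned char)s[n])) {
        n++;
    }
    return n;
}

/*
 * Pick a rule the way lex resolves ambiguity: the longest match wins,
 * and on a tie the rule listed first wins. The matched length is
 * stored in *len.
 */
int pick_rule(const char *s, size_t *len)
{
    size_t lengths[2];
    int best = RULE_NONE;
    int i;

    lengths[RULE_KEYWORD_IS] = match_is(s);
    lengths[RULE_WORD] = match_word(s);

    *len = 0;
    for (i = 0; i < 2; i++) {
        /* Strictly greater, so an earlier rule keeps a tie. */
        if (lengths[i] > *len) {
            *len = lengths[i];
            best = i;
        }
    }
    return best;
}
```

With this sketch, "is a" selects the keyword rule (both rules match two characters, and the keyword is listed first), while "island" selects the word rule because it matches six characters instead of two.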

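Default main:

The examples here define a main() that calls yylex(). yylex() is the function that lex generates. If the lex file does not define main(), the lex library provides a default main that also calls yylex().

Program exit:

By default, yylex() returns only after it has processed all of the input. With console input, that means it keeps running until end of input (Ctrl+D on most Unix terminals) or until the process is interrupted with Ctrl+C. An action can also return explicitly with a return statement. To try this, add the following rule:

quit {printf("Program will exit normally.\n"); return 0;}

Note: this rule must come before the [a-zA-Z]+ rule; otherwise lex warns "rule cannot be matched".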

Going further

Change the following two rules and compare the results:

[\t ]+  {printf("%s: white space\n", yytext);}
.|\n    {printf("%s: Invalid word\n", yytext);}

Example (note that the last thing matched is a newline character):

I am a student. You are a teacher. !@#$%^&*
I: is not verb
 : white space
am: is a verb
 : white space
a: is not verb
 : white space
student: is not verb
.: Invalid word
 : white space
You: is not verb
 : white space
are: is a verb
 : white space
a: is not verb
 : white space
teacher: is not verb
.: Invalid word
 : white space
!: Invalid word
@: Invalid word
#: Invalid word
$: Invalid word
%: Invalid word
^: Invalid word
&: Invalid word
*: Invalid word

: Invalid word

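Recognizing more words

Corresponds to ch1-03.l

This version recognizes verbs, adverbs, prepositions, and so on. All it takes is adding the corresponding rules.

Code: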

%{
/*
 * this sample demonstrates (very) simple recognition:
 * a verb, or not a verb.
 */
%}
%%
[\t ]+      {printf("%s: white space\n", yytext);}
is |
am |
are |
were |
was |
be |
being |
been |
do |
does |
did |
will |
would |
should |
can |
could |
has |
have |
had |
go          {printf("%s: is a verb\n", yytext);}
very |
simple |
gently |
quietly |
calmly |
angrily     {printf("%s: is an adverb\n", yytext);}
to |
from |
behind |
above |
below |
between     {printf("%s: is a preposition\n", yytext);}
if |
then |
and |
but |
or          {printf("%s: is a conjunction\n", yytext);}
their |
my |
your |
his |
her |
its         {printf("%s: is a adjective\n", yytext);}
I |
you |
he |
she |
we |
they        {printf("%s: is a pronoun\n", yytext);}
QUIT        { printf("Program will exit normally.\n"); return 0; }
[a-zA-Z]+   {printf("%s: don't recognize\n", yytext);}
.|\n        {printf("%s: Invalid word\n", yytext);}
%%
int main()
{
    yylex();
    return 0;
}

Run:

he is a student. and he is a teacher. QUIT (ENTER)
he: is a pronoun
 : white space
is: is a verb
 : white space
a: don't recognize
 : white space
student: don't recognize
.: Invalid word
 : white space
and: is a conjunction
 : white space
he: is a pronoun
 : white space
is: is a verb
 : white space
a: don't recognize
 : white space
teacher: don't recognize
.: Invalid word
 : white space
Program will exit normally.

A lexer with a symbol table (defining the word list dynamically)

Corresponds to ch1-04.l. This example shows how to write more complex C code in a lex file.

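The previous examples hard-coded every word in the lex file. The next step improves on that.

For example, the part of speech of each word could be declared in a file with a simple syntax:

noun dog cat horse cow
verb chew eat lick

That is, the first word on each line names a part of speech, and every following word belongs to it. Definitions of this kind could be kept in a file. The sample code here does not handle file input yet, though, and still reads from the console. There are therefore two kinds of input:

- Definition: the first word names a part of speech and is followed by a series of words that belong to it.
- Recognition: as in the previous example, identify the part of speech of each word.

Code: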

%{
#include <stdbool.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Word recognizer with a symbol table.
 */
enum {
    LOOKUP = 0, /* default - looking rather than defining. */
    VERB,
    ADJ,
    ADV,
    NOUN,
    PREP,
    PRON,
    CONJ
};

int state; // global variable, default to 0(LOOKUP).

bool add_word(int type, char *word);
int lookup_word(char *word);
%}
%%
[\t ]+  ; /* ignore whitespace */
\n      {state = LOOKUP;} // end of line, return to default state.

 /* Whenever a line starts with a reserved part of speech name */
 /* start defining words of that type */
^verb   {state = VERB;}
^adj    {state = ADJ;}
^adv    {state = ADV;}
^noun   {state = NOUN;}
^prep   {state = PREP;}
^pron   {state = PRON;}
^conj   {state = CONJ;}

 /* a normal word, define it or look it up */
[a-zA-Z]+ {
    if (state != LOOKUP) {
        /* define the current word */
        add_word(state, yytext);
    } else {
        switch(lookup_word(yytext)) {
        case VERB: printf("%s: verb\n", yytext); break;
        case ADJ: printf("%s: adjective\n", yytext); break;
        case ADV: printf("%s: adverb\n", yytext); break;
        case NOUN: printf("%s: noun\n", yytext); break;
        case PREP: printf("%s: preposition\n", yytext); break;
        case PRON: printf("%s: pronoun\n", yytext); break;
        case CONJ: printf("%s: conjunction\n", yytext); break;
        default:
            printf("%s: don't recognize\n", yytext);
            break;
        }
    }
  }
[,:.]   {printf("%s: punctuation, ignored.\n", yytext);}
.       {printf("%s: invalid char\n", yytext);}
%%

int main()
{
    yylex();
    return 0;
}

/* define a linked list of words and types */
struct word {
    char *word_name;
    int word_type;
    struct word *next;
};

struct word *word_list; /* first element in word list */

bool add_word(int type, char *word)
{
    struct word *wp; // wp: word pointer

    if (lookup_word(word) != LOOKUP) {
        printf("!!! warning: word %s already defined.\n", word);
        return false;
    }

    /* word not there, allocate a new entry and link it on the list */
    wp = (struct word*)malloc(sizeof(struct word));
    wp->next = word_list;
    wp->word_name = (char*)malloc(strlen(word) + 1);
    strcpy(wp->word_name, word);
    wp->word_type = type;
    word_list = wp;
    return true;
}

int lookup_word(char *word)
{
    struct word *wp = word_list;

    for (; wp; wp = wp->next) {
        if (strcmp(wp->word_name, word) == 0) {
            return wp->word_type;
        }
    }
    return LOOKUP;
}

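The enum values have two meanings:

- State. The default state is LOOKUP: every word on the current input line is looked up in the word list (a linked list) with lookup_word, and its part of speech is printed. If the first word of a line is a reserved word such as noun or verb, the lexer enters a defining state (VERB and so on), and the words that follow on that line are added to the word list with add_word. Before adding a word, add_word checks whether it is already in the list.
- Type. In the word list, the part of speech of each word is recorded as VERB and so on.

Run:

noun pet dog cat cats [ENTER]
verb is are [ENTER]
adj my his their [ENTER]
my pet is dog. their pets are cats. that's ok. [ENTER]
my: adjective
pet: noun
is: verb
dog: noun
.: punctuation, ignored.
their: adjective
pets: don't recognize
are: verb
cats: noun
.: punctuation, ignored.
that: don't recognize
': invalid char
s: don't recognize
ok: don't recognize
.: punctuation, ignored.
^C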

yacc

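The earlier examples split a stream of characters into individual words. The next step is to recognize sentences.

- Lexical analysis: recognizing individual words in the input character stream. Its output is tokens, and the key is to define the lexical rules (regular expressions).
- Syntax analysis: once the words (and their parts of speech) are available, a higher-level analysis checks, for example, whether certain words together form a valid sentence. In other words, it defines how tokens may be combined and runs a different action for each combination.

Sentences

Now consider how to build sentences from the noun, pronoun, verb, and other tokens obtained earlier (an example):

- Subject: (assumed to be) a noun or a pronoun only, i.e. subject -> noun | pronoun
- Object: (assumed to be) a noun only, i.e. object -> noun
- Sentence (subject, predicate, object): the predicate must be a verb, i.e. sentence -> subject verb object

Here subject, object, and sentence are new symbols built on top of the noun, pronoun, and verb tokens produced by lexical analysis.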
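To make the three productions concrete, here is a minimal hand-written recognizer in plain C. It is only a sketch of what the grammar accepts, not how a yacc-generated parser works internally; the enum token type, its TOK_* values, and the function names are assumptions made for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Token kinds for this sketch only; the values are arbitrary. */
enum token { TOK_NOUN, TOK_PRONOUN, TOK_VERB };

/* subject -> noun | pronoun */
static bool is_subject(enum token t)
{
    return t == TOK_NOUN || t == TOK_PRONOUN;
}

/* object -> noun */
static bool is_object(enum token t)
{
    return t == TOK_NOUN;
}

/* sentence -> subject verb object */
bool is_sentence(const enum token *toks, size_t n)
{
    return n == 3
        && is_subject(toks[0])
        && toks[1] == TOK_VERB
        && is_object(toks[2]);
}
```

Under these rules the sequence pronoun, verb, noun is a sentence, while pronoun, verb, pronoun is not, because an object may only be a noun.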

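Communication between the parser and the lexer

In a yacc and lex system, lexical analysis (lex, the lexer) and syntax analysis (yacc, the parser) are two fairly independent subsystems. The lexer's function is yylex(), which scans the input character stream and produces tokens. The parser's function is yyparse(), whose input is the tokens produced by yylex(). So the output of yylex() has to be fed to yyparse() as its input.

The prototype of yylex():

int yylex (void);

The key is the return value of yylex(), which identifies the kind of token just recognized. Whenever yyparse() needs a token, it calls yylex() and uses the return value to decide how to proceed.

Note that the lexer does not have to return every token to the parser. Comments and whitespace, for example, are of no interest to the parser, so the lexer can simply discard them.

Because yacc and lex communicate through tokens, they need to agree on a convention. That convention is the token codes: each kind of token is assigned a token code. In a yacc and lex system, yacc defines the token codes and the lex code includes them. Specifically:

- In the yacc file, define the token codes with the %token NOUN VERB syntax.
- yacc -d test.y generates two files, y.tab.c and y.tab.h; the latter contains the macro definitions of the token codes.
- In the lex file, include y.tab.h.

Note: a token code of 0 means a logical end of input.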
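The contract described above (one call returns one token code, uninteresting input is dropped inside the lexer, and 0 marks the end of input) can be sketched with a hand-written stand-in for yylex(). Everything here is an assumption made for illustration: the function name next_token, the tiny fixed vocabulary, and the token values, which simply mirror the kind of numbers found in a generated y.tab.h. Unlike the lex file in this article, the sketch skips punctuation and only returns 0 at the end of the string.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Token codes for this sketch; in a real build they come from y.tab.h. */
enum { END_OF_INPUT = 0, NOUN = 258, PRONOUN = 259, VERB = 260 };

/* A tiny fixed vocabulary, assumed here for illustration. */
static const struct { const char *text; int code; } vocab[] = {
    { "dog", NOUN }, { "it", PRONOUN }, { "is", VERB },
};

/*
 * Hand-written stand-in for yylex(): advance *cursor past the next known
 * word and return its token code. Whitespace, punctuation, and unknown
 * words are consumed but never returned; 0 signals the end of input.
 */
int next_token(const char **cursor)
{
    const char *p = *cursor;

    for (;;) {
        size_t len = 0;
        size_t i;

        while (*p != '\0' && !isalpha((unsigned char)*p)) {
            p++;                      /* skip blanks and punctuation */
        }
        if (*p == '\0') {
            *cursor = p;
            return END_OF_INPUT;
        }
        while (isalpha((unsigned char)p[len])) {
            len++;
        }
        for (i = 0; i < sizeof vocab / sizeof vocab[0]; i++) {
            if (strlen(vocab[i].text) == len
                && strncmp(vocab[i].text, p, len) == 0) {
                *cursor = p + len;
                return vocab[i].code;
            }
        }
        p += len;                     /* unknown word: drop it, keep scanning */
    }
}
```

A caller playing the parser's role would call next_token repeatedly on the same cursor until it returns 0. For the input "it  is a dog", the caller sees PRONOUN, VERB, NOUN, and then 0; the extra space and the unknown word "a" never reach it.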

Example

test.l

%{
/*
 * We now build a lexical analyzer to be used by a higher-level parser.
 */
#include <stdbool.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include "y.tab.h"

#define LOOKUP 0 /* default - looking rather than defining. */

int state; // global variable, default to 0(LOOKUP).

bool add_word(int type, char *word);
int lookup_word(char *word);
const char* get_word_type(int type);
%}
%%
[\t ]+  ; /* ignore whitespace */
\n      {state = LOOKUP;} // end of line, return to default state.
\.\n    {
            state = LOOKUP;
            return 0; // end of sentence.
        }

 /* Whenever a line starts with a reserved part of speech name */
 /* start defining words of that type */
^verb   {state = VERB;}
^adj    {state = ADJECTIVE;}
^adv    {state = ADVERB;}
^noun   {state = NOUN;}
^prep   {state = PREPOSITION;}
^pron   {state = PRONOUN;}
^conj   {state = CONJUNCTION;}

 /* a normal word, define it or look it up */
[a-zA-Z]+ {
    if (state != LOOKUP) {
        /* define the current word */
        add_word(state, yytext);
    } else {
        int type = lookup_word(yytext);
        printf("%s: %s\n", yytext, get_word_type(type));
        switch(type) {
        case VERB:
        case ADJECTIVE:
        case ADVERB:
        case NOUN:
        case PRONOUN:
        case PREPOSITION:
        case CONJUNCTION:
            return type;
        default:
            //printf("%s: don't recognize\n", yytext);
            break; // don't return, just ignore it.
        }
    }
  }
.       {printf("%s: ----\n", yytext);} // ignore it
%%

/* define a linked list of words and types */
struct word {
    char *word_name;
    int word_type;
    struct word *next;
};

struct word *word_list; /* first element in word list */

bool add_word(int type, char *word)
{
    struct word *wp; // wp: word pointer

    if (lookup_word(word) != LOOKUP) {
        printf("!!! warning: word %s already defined.\n", word);
        return false;
    }

    /* word not there, allocate a new entry and link it on the list */
    wp = (struct word*)malloc(sizeof(struct word));
    wp->next = word_list;
    wp->word_name = (char*)malloc(strlen(word) + 1);
    strcpy(wp->word_name, word);
    wp->word_type = type;
    word_list = wp;
    return true;
}

int lookup_word(char *word)
{
    struct word *wp = word_list;

    for (; wp; wp = wp->next) {
        if (strcmp(wp->word_name, word) == 0) {
            return wp->word_type;
        }
    }
    return LOOKUP;
}

const char* get_word_type(int type)
{
    switch(type) {
    case VERB: return "verb";
    case ADJECTIVE: return "adjective";
    case ADVERB: return "adverb";
    case NOUN: return "noun";
    case PREPOSITION: return "preposition";
    case PRONOUN: return "pronoun";
    case CONJUNCTION: return "conjunction";
    default: return "unknown";
    }
}

test.y

%{
/*
 * A parser for the basic grammar used to recognize English sentences.
 */
#include <stdio.h>

extern int yylex (void);
void yyerror(const char *s, ...);
%}

%token NOUN PRONOUN VERB ADVERB ADJECTIVE PREPOSITION CONJUNCTION

%%
sentence: subject VERB object {printf("Sentence is valid.\n");}
        ;
subject: NOUN
       | PRONOUN
       ;
object: NOUN
      ;
%%

extern FILE *yyin;

int main()
{
    //while(!feof(yyin)) {
        yyparse();
    //}
}

void yyerror(const char *s, ...)
{
    fprintf(stderr, "%s\n", s);
}

y.tab.h

This file is generated automatically. Its contents:

/* A Bison parser, made by GNU Bison 2.3.  */

/* Skeleton interface for Bison's Yacc-like parsers in C
   ...
   This special exception was added by the Free Software Foundation in
   version 2.2 of Bison.  */

/* Tokens.  */
#ifndef YYTOKENTYPE
# define YYTOKENTYPE
   /* Put the tokens into the symbol table, so that GDB and other debuggers
      know about them.  */
   enum yytokentype {
     NOUN = 258,
     PRONOUN = 259,
     VERB = 260,
     ADVERB = 261,
     ADJECTIVE = 262,
     PREPOSITION = 263,
     CONJUNCTION = 264
   };
#endif
/* Tokens.  */
#define NOUN 258
#define PRONOUN 259
#define VERB 260
#define ADVERB 261
#define ADJECTIVE 262
#define PREPOSITION 263
#define CONJUNCTION 264

#if ! defined YYSTYPE && ! defined YYSTYPE_IS_DECLARED
typedef int YYSTYPE;
# define yystype YYSTYPE /* obsolescent; will be withdrawn */
# define YYSTYPE_IS_DECLARED 1
# define YYSTYPE_IS_TRIVIAL 1
#endif

extern YYSTYPE yylval;

Run

noun dogs
noun dog
verb is are
pron they it
it is dog.
it: pronoun
is: verb
dog: noun
Sentence is valid.
it is dog.
it: pronoun
syntax error

Other experiments

Adding some print statements

sentence: subject verb object {printf("Sentence is valid.\n");}
        ;
subject: NOUN    {printf("subject of a noun.\n");}
       | PRONOUN {printf("subject of a pronoun.\n");}
       ;
verb: VERB {printf("verb.\n");}
    ;
object: NOUN {printf("object of a noun.\n");}
      ;

Run:

noun dog
verb is
pron it
it is dog
it: pronoun
subject of a pronoun.
is: verb
verb.
dog: noun
object of a noun.
Sentence is valid.

Or:

noun dog dogs
verb is are
pron it they
it is dog they are dogs.
it: pronoun
subject of a pronoun.
is: verb
verb.
dog: noun
object of a noun.
Sentence is valid.
they: pronoun
syntax error

Recognizing multiple sentences

extern FILE *yyin;

int main()
{
    while(!feof(stdin/*yyin*/)) {
        yyparse();
    }
}

Run:

$ ./test
noun dog
verb is
pron it
it is dog.
it: pronoun
is: verb
dog: noun
Sentence is valid.
it is dog.
it: pronoun
is: verb
dog: noun
Sentence is valid.
noun dogs
verb are
pron they
they are dogs.
they: pronoun
are: verb
dogs: noun
Sentence is valid.

Changing the code as follows produces errors at run time:

int main()
{
    //while(!feof(stdin/*yyin*/)) {
    for (;;) {
        yyparse();
    }
}

Run:

$ ./test
noun dog
verb is
pron it
it is dog
it: pronoun
is: verb
dog: noun
Sentence is valid.
it is dog
it: pronoun
syntax error
is: verb
syntax error
dog: noun

Reading input from a file

To read from a file, the global variable yyin must be used. The following approach does not work:

//extern FILE *yyin;

int main()
{
    FILE* f = NULL;

    f = fopen("test.txt", "rb");
    if (NULL == f) {
        printf("Open file failed.\n");
        return 1;
    }
    printf("Open file successfully.\n");

    while(!feof(f)) {
        yyparse();
    }
}

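Note: lex.yy.c reads from the global variable yyin, which is initialized to 0 (NULL). If the user does not set yyin, the scanner sets it to stdin automatically once it starts running.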
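The fallback described in this note can be pictured with a small stand-alone C sketch. It does not reproduce the code inside lex.yy.c; scanner_in, effective_input, and use_file are hypothetical names that play the roles of yyin and of the code that assigns it. The point is that the scanner only ever consults its own global stream, so opening a file into some unrelated local variable has no effect on where it reads from.

```c
#include <stdio.h>

/* Stand-in for the scanner's global input stream; NULL until set. */
static FILE *scanner_in = NULL;

/*
 * Return the stream the scanner should read from. If the caller never
 * assigned scanner_in, fall back to stdin on first use, mirroring the
 * behavior described for yyin.
 */
FILE *effective_input(void)
{
    if (scanner_in == NULL) {
        scanner_in = stdin;
    }
    return scanner_in;
}

/* Point the scanner at a file; returns 0 on success, -1 on failure. */
int use_file(const char *path)
{
    FILE *f = fopen(path, "r");
    if (f == NULL) {
        return -1;
    }
    scanner_in = f;
    return 0;
}
```

In this sketch, effective_input() returns stdin until use_file succeeds, and a failed use_file leaves the current stream untouched.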

Correct code:

extern FILE *yyin;

int main()
{
    yyin = fopen("test.txt", "rb");
    if (NULL == yyin) {
        printf("Open file failed.\n");
        return 1;
    }
    printf("Open file successfully.\n");

    while(!feof(yyin)) {
        yyparse();
    }
}

Contents of the test file test.txt:

noun dog dogs
verb is are
pron it they
it is dog.
they are dogs.

Run:

$ ./test
Open file successfully.
it: pronoun
is: verb
dog: noun
Sentence is valid.
they: pronoun
syntax error
$

Simple and compound sentences

Corresponds to ch1-06.y

Code

%{
/*
 * A parser for the basic grammar used to recognize English sentences.
 */
#include <stdio.h>

extern int yylex (void);
void yyerror(const char *s, ...);
%}

%token NOUN PRONOUN VERB ADVERB ADJECTIVE PREPOSITION CONJUNCTION

%%
sentence: simple_sentence { printf("Parsed a simple sentence.\n"); }
        | compound_sentence { printf("Parsed a compound sentence.\n"); }
        ;

simple_sentence: subject verb object {printf("simple sentence of type 1.\n");}
        | subject verb object prep_phrase {printf("simple sentence of type 2.\n");}
        ;

compound_sentence: simple_sentence CONJUNCTION simple_sentence {printf("compound sentence of type 1.\n");}
        | compound_sentence CONJUNCTION simple_sentence {printf("compound sentence of type 2.\n");}
        ;

subject: NOUN
       | PRONOUN
       | ADJECTIVE subject
       ;

verb: VERB
    | ADVERB VERB
    | verb VERB
    ;

object: NOUN
      | ADJECTIVE object
      ;

prep_phrase: PREPOSITION NOUN
           ;
%%

extern FILE *yyin;

int main()
{
    yyin = fopen("test.txt", "rb");
    if (NULL == yyin) {
        printf("Open file failed.\n");
        return 1;
    }
    printf("Open file successfully.\n");

    while(!feof(yyin)) {
        yyparse();
    }

    fclose(yyin);
    yyin = NULL;
    return 0;
}

void yyerror(const char *s, ...)
{
    fprintf(stderr, "%s\n", s);
}

Test file

noun dog dogs China
verb is are
pron it they
adj pretty
conj and
prep in
it is a pretty dog and they are dogs in China and it is dog.

Output

Open file successfully.
it: pronoun
is: verb
a: unknown
pretty: adjective
dog: noun
and: conjunction
simple sentence of type 1.
they: pronoun
are: verb
dogs: noun
in: preposition
China: noun
simple sentence of type 2.
compound sentence of type 1.
and: conjunction
it: pronoun
is: verb
dog: noun
.: ----
simple sentence of type 1.
compound sentence of type 2.
Parsed a compound sentence.