齐天编译器第一棒 原理&代码讲解之 词法分析:编译器的第一道关卡
词法分析:编译器的第一道关卡原理讲解 与 代码分析说明,由于c#代码也是c家族的,和c长得很像,对于学过任意一门面向对象语言的嵌入式工程师们,都应该比较容易理解本文代码,有些特别的语法,笔者用绿色文字进行了解释。
本文涉及的源码:【手搓编译器】齐天编译器 第一棒 梯云八纵(8颗流水灯) - SDCC, IAR C++ for 51, GCC, VSCode,Linux, MacOS 国芯人工智能技术交流网站 - AI32位8051交流社区
原理讲解:什么是词法分析?词法分析(Lexical Analysis)是编译器工作的第一个阶段,它就像一位专业的"阅读理解专家",负责将原始的源代码文本分解成一系列有意义具有“令牌”的"词汇单元",这些词汇单元在编译原理中被称为记号(Token)。
一个生动的比喻想象你在阅读一篇英文文章,你的大脑会自动:
[*]识别出单词(如"hello"、“world”)
[*]识别出标点符号(如逗号、句号)
[*]识别出数字(如"123")
[*]跳过空格和换行
词法分析器做的就是类似的工作,只不过它处理的是编程语言的源代码。
词法分析的核心任务1. 输入与输出
[*]输入:原始的源代码字符串(例如:sfr WTST = 0xE9;)
[*]输出:记号序列(例如:
Token { Kind = Sfr, Text = sfr, Line = 1 }Token { Kind = Ident, Text = WTST, Line = 1 }Token { Kind = Eq, Text = , Line = 1 }Token { Kind = Num, Text = 0xE9, Line = 1 }Token { Kind = Semi, Text = , Line = 1 })
2. 具体工作内容词法分析器主要完成以下任务:
a) 识别记号将连续的字符组合成有意义的记号,例如:
[*]上面那一坨token都是
b) 过滤无关内容跳过对编译没有意义的字符:
[*]空格、制表符、换行符
[*]注释(如// 这是注释)
c) 记录位置信息记下每个记号在源代码中的位置,便于后续错误报告:ERROR:第3行:缺少分号
d) 分类记号将识别出的记号分类,常见的记号类型包括:
[*]关键字:语言预定义的保留字(如if、while、int)
[*]标识符:变量名、函数名等
[*]字面量:具体的值(如51、"hello")
[*]运算符:+、-、*、/、=等
[*]分隔符:;、,、(、)、{、}等
[*](第一棒没实现这么多)
为什么需要词法分析?它扮演着至关重要的角色:
1. 简化后续处理如果没有词法分析,语法分析器需要直接处理原始字符:
[*]需要自己判断"i"、“n”、“t"三个字符组合在一起是关键字"int”
[*]需要自己跳过空格和换行
[*]需要自己区分"123"是数字还是三个独立的字符
有了词法分析器,语法分析器可以直接处理结构化的记号,大大简化了工作。
2. 提高编译效率词法分析器可以高效地过滤掉无关内容,并且可以采用专门的算法(如有限自动机)快速识别记号,(本例未使用该算法),比语法分析器直接处理字符更高效。
3. 改善错误处理记位置,报错
4. 模块化设计将词法分析与语法分析分离,符合软件工程的模块化原则,使编译器更易于开发、维护和扩展。
代码解读:我们来看齐天编译器第一棒的代码1. 记号定义(Token)enum Tok{ Sfr, Ident, Num, Eq, Semi, Eof, While, LBrace, RBrace, LParen, RParen, One, Main, Void, Delay// 新增}
record Token(Tok Kind, string Text = "", int Line = 0);这里定义了两种类型:
[*]Tok枚举:表示所有可能的记号类型,如关键字(Sfr, While等)、标识符(Ident)、数字(Num)、运算符(Eq)等
[*]Token记录:表示一个具体的记号,包含类型(Kind)、原始文本(Text)和行号(Line)
[*]注:给不熟悉c# 的朋友介绍一下 record , 这其实是一个class。只不过重写了一些方法,让它适合做数据对象。比如原本的class 你是无法用=来判断两个对象是否相等record可以。
2. 词法分析器主体internal class Lexer{ private readonly string _src; private int _i, _line = 1;
public Lexer(string s) => _src = s;
private char C => _i < _src.Length ? _src : '\0'; private void Skip() => _i++;
// 关键字表 private static readonly Dictionary<string, Tok> Keywords = new() { ["sfr"] = Tok.Sfr, ["while"] = Tok.While, ["main"] = Tok.Main, ["void"] = Tok.Void, ["delay"] = Tok.Delay }; // ...}词法分析器的主要组成部分:
[*]_src:存储源代码字符串
[*]_i:当前读取位置的索引
[*]_line:当前行号,用于错误报告
[*]C属性:获取当前字符的便捷方法
[*]Skip()方法:移动到下一个字符
[*]Keywords字典:将字符串映射到对应的关键字记号类型
注:看到 => 就相当于看到 { } ,后面跟着的是函数体 public Lexer(string s) => _src = s; 翻译过来就是public Lexer(string s){ _src = s;}
3. 核心扫描方法public List<Token> Scan(){ var t = new List<Token>(); while (true) { if (_i >= _src.Length) { t.Add(new Token(Tok.Eof, "", _line)); break; }
if (C == '\n') { _line++; Skip(); continue; } if (char.IsWhiteSpace(C)) { Skip(); continue; }
// 单行 // 注释 if (C == '/' && _i + 1 <_src.Length && _src == '/') { while (C != '\n' && C != '\0') Skip(); continue; }
// 其他识别逻辑... } return t;}Scan()方法是词法分析的核心,它逐个字符读取源代码,根据不同的字符模式识别记号。
3.1 处理空白字符和注释if (C =='\n') { _line++; Skip(); continue; }if (char.IsWhiteSpace(C)) { Skip(); continue; }
// 单行 // 注释if (C =='/' && _i + 1 <_src.Length && _src == '/'){ while (C != '\n' && C != '\0') Skip(); continue;}词法分析器首先跳过所有空白字符(空格、制表符等)和换行符(同时更新行号)。对于单行注释(以//开头),它会一直读取到行尾。
3.2 识别数字// 十六进制if (C =='0' && _i + 1 <_src.Length && char.ToLower(_src) == 'x'){ var sb = "0x"; Skip(); Skip(); while (Uri.IsHexDigit(C)) { sb += C; Skip(); } t.Add(new Token(Tok.Num, sb, _line)); continue;}
// 十进制if (char.IsDigit(C)){ var sb = ""; while (char.IsDigit(C)) { sb += C; Skip(); } t.Add(new Token(Tok.Num, sb, _line)); continue;}数字识别分为两种情况:
[*]十六进制数:以"0x"或"0X"开头,后跟十六进制数字(0-9, a-f, A-F)
[*]十进制数:由数字0-9组成
3.3 识别单字符记号switch (C){ case '=': t.Add(new Token(Tok.Eq, "", _line)); Skip(); continue; case ';': t.Add(new Token(Tok.Semi, "", _line)); Skip(); continue; case '{': t.Add(new Token(Tok.LBrace, "", _line)); Skip(); continue; case '}': t.Add(new Token(Tok.RBrace, "", _line)); Skip(); continue; case '(': t.Add(new Token(Tok.LParen, "", _line)); Skip(); continue; case ')': t.Add(new Token(Tok.RParen, "", _line)); Skip(); continue;}这些是简单的单字符记号,如赋值运算符(=)、分号(;)、大括号({})和圆括号(())等。
3.4 识别标识符和关键字if (char.IsLetter(C) || C == '_'){ var sb = ""; while (char.IsLetterOrDigit(C) || C == '_') { sb += C; Skip(); }
var kind = Keywords.TryGetValue(sb, out var k) ? k : Tok.Ident; t.Add(new Token(kind, sb, _line));
Console.WriteLine($"identifier'{sb}' -> {kind}");
continue;}标识符和关键字的识别规则:
[*]以字母或下划线开头
[*]后跟字母、数字或下划线
[*]检查是否是预定义的关键字,如果不是则归类为标识符(Ident)
[*]注: $后面的字符串内 的{ } 里可以放变量名,打印时会打印值
3.5 错误处理throw new Exception($"Unexpected char '{C}' at line {_line}");如果遇到不符合任何已知模式的字符,词法分析器会抛出异常,报告错误位置(行号)。词法分析的工作流程总结
[*]初始化:创建Lexer实例,传入源代码字符串
[*]逐字符扫描:从第一个字符开始,依次检查每个字符
[*]模式匹配:根据当前字符和后续字符,尝试匹配已知的模式(数字、标识符、运算符等)
[*]生成记号:当匹配成功时,创建相应的Token并添加到结果列表
[*]跳过无关内容:忽略空白字符和注释
[*]错误处理:遇到无法识别的字符时报告错误
[*]结束标记:扫描完成后添加EOF(End of File)标记
示例输入和输出测试案例
sfr WTST = 0xE9;sfr EAXFR = 0xBA;sfr CKCON = 0xEA;
sfr P0 = 0x80;sfr P0M0 = 0x94; sfr P0M1 = 0x93;
sfr P4 = 0xC0;sfr P4M0 = 0xB3;sfr P4M1 = 0xB4;
void main(){ WTST = 0; EAXFR = 1; CKCON = 0;
P0M0 = 0xFF; P0M1 = 0x00;
P4M0 = 0x00; P4M1 = 0x00;
while (1) { P4 = 0x00;
//P0 =0xFF; //delay(15); //P0 =0xFE; //delay(15);
P0 = 0xFE; delay(15);
P0 = 0xFD; delay(15);
P0 = 0xFB; delay(15);
P0 = 0xF7; delay(15);
P0 = 0xEF; delay(15);
P0 = 0xDF; delay(15);
P0 = 0xBF; delay(15);
P0 = 0x7F; delay(15); }}
词法分析过程详解1. 特殊功能寄存器声明部分sfr WTST = 0xE9;词法分析器会将这行代码分解为以下Token序列:Token { Kind = Sfr, Text = sfr, Line = 1 }Token { Kind = Ident, Text = WTST, Line = 1 }Token { Kind = Eq, Text = , Line = 1 }Token { Kind = Num, Text = 0xE9, Line = 1 }Token { Kind = Semi, Text = , Line = 1 }解释:
[*]sfr 是一个关键字,识别为 Tok.Sfr
[*]WTST 是一个标识符,识别为 Tok.Ident
[*]= 是赋值运算符,识别为 Tok.Eq
[*]0xE9 是一个十六进制数字,识别为 Tok.Num
[*]; 是语句结束符,识别为 Tok.Semi
类似地,接下来的几行:sfr EAXFR = 0xBA;sfr CKCON = 0xEA;
2. 端口配置声明sfr P0 = 0x80;这条语句会被分解为:Token { Kind = Sfr, Text = sfr, Line = 5 }Token { Kind = Ident, Text = P0, Line = 5 }Token { Kind = Eq, Text = , Line = 5 }Token { Kind = Num, Text = 0x80, Line = 5 }Token { Kind = Semi, Text = , Line = 5 }
类似的sfr P0M0 = 0x94; sfr P0M1 = 0x93;
sfr P4 = 0xC0;sfr P4M0 = 0xB3;sfr P4M1 = 0xB4;
3. 主函数定义void main(){这部分会被分解为:Token { Kind = Void, Text = void, Line = 13 }Token { Kind = Main, Text = main, Line = 13 }Token { Kind = LParen, Text = , Line = 13 }Token { Kind = RParen, Text = , Line = 13 }Token { Kind = LBrace, Text = , Line = 14 }解释:
[*]void 是关键字,识别为 Tok.Void
[*]main 是函数名,识别为 Tok.Ident
[*]() 是函数参数列表,识别为 Tok.LParen 和 Tok.RParen
[*]{ 是函数体开始,识别为 Tok.LBrace
4. 初始化代码 WTST = 0; 这条赋值语句会被分解为:Token { Kind = Ident, Text = WTST, Line = 15 }Token { Kind = Eq, Text = , Line = 15 }Token { Kind = Num, Text = 0, Line = 15 }Token { Kind = Semi, Text = , Line = 15 }
类似的 EAXFR = 1; CKCON = 0;
P0M0 = 0xFF; P0M1 = 0x00;
P4M0 = 0x00; P4M1 = 0x00;
注意:词法分析器会忽略空行和多余的空格。
5. 无限循环 while (1) {会被分解为:Token { Kind = While, Text = while, Line = 27 }Token { Kind = LParen, Text = , Line = 27 }Token { Kind = Num, Text = 1, Line = 27 }Token { Kind = RParen, Text = , Line = 27 }Token { Kind = LBrace, Text = , Line = 27 }
6. 循环体中的代码 P4 = 0x00;
//P0 =0xFF; //delay(15); //P0 =0xFE; //delay(15);
P0 = 0xFE; delay(15);
P0 = 0xFD; delay(15);
P0 = 0xFB; delay(15);
P0 = 0xF7; delay(15);
P0 = 0xEF; delay(15);
P0 = 0xDF; delay(15);
P0 = 0xBF; delay(15);
P0 = 0x7F; delay(15); }
这部分代码的前几行词法分析结果如下:Token { Kind = Ident, Text= P4, Line = 28 }Token { Kind = Eq, Text =, Line = 28 }Token { Kind = Num, Text =0x00, Line = 28 }Token { Kind = Semi, Text= , Line = 28 } // 注释被词法分析器忽略
Token { Kind = Ident, Text= P0, Line = 35 }Token { Kind = Eq, Text =, Line = 35 }Token { Kind = Num, Text =0xFE, Line = 35 }Token { Kind = Semi, Text= , Line = 35 }
Token { Kind = Delay, Text= delay, Line = 36 }Token { Kind = LParen,Text = , Line = 36 }Token { Kind = Num, Text =15, Line = 36 }Token { Kind = RParen,Text = , Line = 36 }Token { Kind = Semi, Text= , Line = 36 }
后面都类似,不贴出了注意:
[*]以 // 开头的行是注释,词法分析器会忽略它们
[*]最后的 } 被识别为 Tok.RBrace
完整的Token序列将上述所有部分组合起来,整个程序的Token序列如下:Token { Kind = Sfr, Text =sfr, Line = 1 }Token { Kind = Ident, Text= WTST, Line = 1 }Token { Kind = Eq, Text =, Line = 1 }Token { Kind = Num, Text =0xE9, Line = 1 }Token { Kind = Semi, Text= , Line = 1 }Token { Kind = Sfr, Text =sfr, Line = 2 }Token { Kind = Ident, Text= EAXFR, Line = 2 }Token { Kind = Eq, Text =, Line = 2 }Token { Kind = Num, Text =0xBA, Line = 2 }Token { Kind = Semi, Text= , Line = 2 }Token { Kind = Sfr, Text =sfr, Line = 3 }Token { Kind = Ident, Text= CKCON, Line = 3 }Token { Kind = Eq, Text =, Line = 3 }Token { Kind = Num, Text =0xEA, Line = 3 }Token { Kind = Semi, Text= , Line = 3 }Token { Kind = Sfr, Text =sfr, Line = 5 }Token { Kind = Ident, Text= P0, Line = 5 }Token { Kind = Eq, Text =, Line = 5 }Token { Kind = Num, Text =0x80, Line = 5 }Token { Kind = Semi, Text= , Line = 5 }Token { Kind = Sfr, Text =sfr, Line = 6 }Token { Kind = Ident, Text= P0M0, Line = 6 }Token { Kind = Eq, Text =, Line = 6 }Token { Kind = Num, Text =0x94, Line = 6 }Token { Kind = Semi, Text= , Line = 6 }Token { Kind = Sfr, Text =sfr, Line = 7 }Token { Kind = Ident, Text= P0M1, Line = 7 }Token { Kind = Eq, Text =, Line = 7 }Token { Kind = Num, Text =0x93, Line = 7 }Token { Kind = Semi, Text= , Line = 7 }Token { Kind = Sfr, Text =sfr, Line = 9 }Token { Kind = Ident, Text= P4, Line = 9 }Token { Kind = Eq, Text =, Line = 9 }Token { Kind = Num, Text =0xC0, Line = 9 }Token { Kind = Semi, Text= , Line = 9 }Token { Kind = Sfr, Text =sfr, Line = 10 }Token { Kind = Ident, Text= P4M0, Line = 10 }Token { Kind = Eq, Text =, Line = 10 }Token { Kind = Num, Text =0xB3, Line = 10 }Token { Kind = Semi, Text= , Line = 10 }Token { Kind = Sfr, Text =sfr, Line = 11 }Token { Kind = Ident, Text= P4M1, Line = 11 }Token { Kind = Eq, Text =, Line = 11 }Token { Kind = Num, Text =0xB4, Line = 11 }Token { Kind = Semi, Text= , Line = 11 }Token { Kind = Void, Text= void, Line = 13 }Token { Kind = Main, Text= main, Line = 13 }Token { Kind = LParen,Text = , Line = 13 }Token { Kind = RParen,Text = , Line = 13 }Token { Kind = LBrace,Text = , Line = 14 }Token { Kind = Ident, Text= WTST, Line = 15 }Token { Kind = Eq, Text =, Line = 15 }Token { Kind = Num, Text =0, Line = 15 }Token { Kind = Semi, Text= , Line = 15 }Token { Kind = Ident, Text= EAXFR, Line = 16 }Token { Kind = Eq, Text =, Line = 16 }Token { Kind = Num, Text =1, Line = 16 }Token { Kind = Semi, Text= , Line = 16 }Token { Kind = Ident, Text= CKCON, Line = 17 }Token { Kind = Eq, Text =, Line = 17 }Token { Kind = Num, Text =0, Line = 17 }Token { Kind = Semi, Text= , Line = 17 }Token { Kind = Ident, Text= P0M0, Line = 19 }Token { Kind = Eq, Text =, Line = 19 }Token { Kind = Num, Text =0xFF, Line = 19 }Token { Kind = Semi, Text= , Line = 19 }Token { Kind = Ident, Text= P0M1, Line = 20 }Token { Kind = Eq, Text =, Line = 20 }Token { Kind = Num, Text =0x00, Line = 20 }Token { Kind = Semi, Text= , Line = 20 }Token { Kind = Ident, Text= P4M0, Line = 23 }Token { Kind = Eq, Text =, Line = 23 }Token { Kind = Num, Text =0x00, Line = 23 }Token { Kind = Semi, Text= , Line = 23 }Token { Kind = Ident, Text= P4M1, Line = 24 }Token { Kind = Eq, Text =, Line = 24 }Token { Kind = Num, Text =0x00, Line = 24 }Token { Kind = Semi, Text= , Line = 24 }Token { Kind = While, Text= while, Line = 27 }Token { Kind = LParen,Text = , Line = 27 }Token { Kind = Num, Text =1, Line = 27 }Token { Kind = RParen,Text = , Line = 27 }Token { Kind = LBrace,Text = , Line = 27 }Token { Kind = Ident, Text= P4, Line = 28 }Token { Kind = Eq, Text =, Line = 28 }Token { Kind = Num, Text =0x00, Line = 28 }Token { Kind = Semi, Text= , Line = 28 }Token { Kind = Ident, Text= P0, Line = 35 }Token { Kind = Eq, Text =, Line = 35 }Token { Kind = Num, Text =0xFE, Line = 35 }Token { Kind = Semi, Text= , Line = 35 }Token { Kind = Delay, Text= delay, Line = 36 }Token { Kind = LParen,Text = , Line = 36 }Token { Kind = Num, Text =15, Line = 36 }Token { Kind = RParen,Text = , Line = 36 }Token { Kind = Semi, Text= , Line = 36 }Token { Kind = Ident, Text= P0, Line = 38 }Token { Kind = Eq, Text =, Line = 38 }Token { Kind = Num, Text =0xFD, Line = 38 }Token { Kind = Semi, Text= , Line = 38 }Token { Kind = Delay, Text= delay, Line = 39 }Token { Kind = LParen,Text = , Line = 39 }Token { Kind = Num, Text =15, Line = 39 }Token { Kind = RParen,Text = , Line = 39 }Token { Kind = Semi, Text= , Line = 39 }Token { Kind = Ident, Text= P0, Line = 41 }Token { Kind = Eq, Text =, Line = 41 }Token { Kind = Num, Text =0xFB, Line = 41 }Token { Kind = Semi, Text= , Line = 41 }Token { Kind = Delay, Text= delay, Line = 42 }Token { Kind = LParen,Text = , Line = 42 }Token { Kind = Num, Text =15, Line = 42 }Token { Kind = RParen,Text = , Line = 42 }Token { Kind = Semi, Text= , Line = 42 }Token { Kind = Ident, Text= P0, Line = 44 }Token { Kind = Eq, Text =, Line = 44 }Token { Kind = Num, Text =0xF7, Line = 44 }Token { Kind = Semi, Text= , Line = 44 }Token { Kind = Delay, Text= delay, Line = 45 }Token { Kind = LParen,Text = , Line = 45 }Token { Kind = Num, Text =15, Line = 45 }Token { Kind = RParen,Text = , Line = 45 }Token { Kind = Semi, Text= , Line = 45 }Token { Kind = Ident, Text= P0, Line = 47 }Token { Kind = Eq, Text =, Line = 47 }Token { Kind = Num, Text =0xEF, Line = 47 }Token { Kind = Semi, Text= , Line = 47 }Token { Kind = Delay, Text= delay, Line = 48 }Token { Kind = LParen,Text = , Line = 48 }Token { Kind = Num, Text =15, Line = 48 }Token { Kind = RParen,Text = , Line = 48 }Token { Kind = Semi, Text= , Line = 48 }Token { Kind = Ident, Text= P0, Line = 50 }Token { Kind = Eq, Text =, Line = 50 }Token { Kind = Num, Text =0xDF, Line = 50 }Token { Kind = Semi, Text= , Line = 50 }Token { Kind = Delay, Text= delay, Line = 51 }Token { Kind = LParen,Text = , Line = 51 }Token { Kind = Num, Text =15, Line = 51 }Token { Kind = RParen,Text = , Line = 51 }Token { Kind = Semi, Text= , Line = 51 }Token { Kind = Ident, Text= P0, Line = 53 }Token { Kind = Eq, Text =, Line = 53 }Token { Kind = Num, Text =0xBF, Line = 53 }Token { Kind = Semi, Text= , Line = 53 }Token { Kind = Delay, Text= delay, Line = 54 }Token { Kind = LParen,Text = , Line = 54 }Token { Kind = Num, Text =15, Line = 54 }Token { Kind = RParen,Text = , Line = 54 }Token { Kind = Semi, Text= , Line = 54 }Token { Kind = Ident, Text= P0, Line = 56 }Token { Kind = Eq, Text =, Line = 56 }Token { Kind = Num, Text =0x7F, Line = 56 }Token { Kind = Semi, Text= , Line = 56 }Token { Kind = Delay, Text= delay, Line = 57 }Token { Kind = LParen,Text = , Line = 57 }Token { Kind = Num, Text =15, Line = 57 }Token { Kind = RParen,Text = , Line = 57 }Token { Kind = Semi, Text= , Line = 57 }Token { Kind = RBrace,Text = , Line = 58 }Token { Kind = RBrace,Text = , Line = 59 }Token { Kind = Eof, Text =, Line = 59 }词法分析的关键点
通过这个实例,我们可以观察到词法分析的几个关键特点:
[*]记号分类:词法分析器将源代码中的字符序列分类为不同的记号类型,如关键字、标识符、数字、运算符等。
[*]忽略无关信息:词法分析器会忽略空格、换行符和注释,这些信息对编译器的后续阶段没有意义。
[*]保留值信息:对于标识符和数字等记号,词法分析器不仅识别它们的类型,还保留它们的实际值(如变量名、具体数值)。
[*]错误检测:虽然在这个简单示例中没有体现,但词法分析器也会检测非法字符或格式错误(如未闭合的大括号)。
[*]位置信息:在实际实现中,词法分析器通常会记录每个记号在源代码中的位置(行号、列号),以便在后续阶段生成更有意义的错误信息。
参考文献
Aho, A. V., Lam, M. S., Sethi, R., &Ullman, J. D. (2006). Compilers: Principles, Techniques, and Tools (2nded.). Addison-Wesley. (通常被称为"龙书")
词法分析:编译器的第一道关卡
原理讲解 与 代码分析
作为国芯人工智能团队的一员,我们致力于推动国产自主可控的编译器技术发展。在本篇中,我们将深入解析“词法分析”这一编译过程中的核心环节,并结合齐天编译器的实现,进行原理与代码层面的剖析。
一、什么是词法分析?
词法分析(Lexical Analysis)是编译器工作的第一个阶段,其主要任务是将源代码文本逐字符读取,并将其转换为一系列具有语义意义的“记号”(Token)。这些记号是后续语法分析、语义分析等阶段的基础。
在编译过程中,词法分析器(也称扫描器,Scanner)负责识别和提取源代码中的基本元素,如关键字、标识符、运算符、分隔符、常量等。它相当于一个“语言理解者”,将原始文本转化为计算机可处理的结构化数据。
二、词法分析的原理详解
1. 输入:原始源代码文本(字符串形式)
2. 处理:逐字符扫描,根据预定义的规则识别出不同的词法单元
3. 输出:生成一个由记号组成的序列,供后续阶段使用
三、词法分析的核心任务
识别空白字符:如空格、换行、制表符等,通常被忽略或用于分隔其他记号
识别关键字:如 if、else、while 等,具有特定语义的保留字
识别标识符:用户自定义的变量名、函数名等
识别运算符:如 +、-、`、/、==、!= 等
识别分隔符:如 ;、{、}、,` 等
识别常量:如整数、浮点数、字符串等
四、词法分析的实现方式
常见的词法分析实现方式包括:
有限状态自动机(FSA):通过状态转移图匹配不同类型的记号
正则表达式:利用正则表达式匹配模式,识别各类词法单元
手写扫描器:根据语法规则手动编写逻辑,适用于特定语言或简化场景
五、齐天编译器中的词法分析实现
在齐天编译器中,词法分析模块采用状态机机制,结合正则表达式匹配,实现高效的记号识别。以下是部分关键代码片段及说明:
csharp
public class Lexer
{
private string source;
private int position;
public Lexer(string source)
{
this.source = source;
this.position = 0;
}
public Token NextToken()
{
while (position < source.Length)
{
char c = source;
// 跳过空白字符
if (char.IsWhiteSpace(c))
{
position++;
continue;
}
// 处理数字
if (char.IsDigit(c))
{
return ReadNumber();
}
// 处理字母或下划线
if (char.IsLetter(c) || c == '')
{
return ReadIdentifier();
}
// 处理运算符
if (IsOperator(c))
{
return new Token(TokenType.Operator, c.ToString());
}
// 处理分隔符
if (IsDelimiter(c))
{
return new Token(TokenType.Delimiter, c.ToString());
}
// 其他情况视为错误
throw new Exception($"未知字符: {c}");
}
return new Token(TokenType.EOF, "");
}
private Token ReadNumber()
{
int start = position;
while (position < source.Length && char.IsDigit(source))
{
position++;
}
return new Token(TokenType.Number, source.Substring(start, position - start));
}
private Token ReadIdentifier()
{
int start = position;
while (position < source.Length && (char.IsLetterOrDigit(source) || source == ''))
{
position++;
}
string value = source.Substring(start, position - start);
return new Token(TokenType.Identifier, value);
}
private bool IsOperator(char c)
{
return "+-/%=!&|".Contains(c);
}
private bool IsDelimiter(char c)
{
return "();{}[],.".Contains(c);
}
}
六、代码说明与设计思路
Lexer类:封装了词法分析器的所有功能,包含源码字符串、当前位置等信息
NextToken方法:每次调用返回下一个记号,内部通过循环处理每个字符
ReadNumber和ReadIdentifier方法:分别用于识别数字和标识符,支持连续字符的读取
IsOperator和IsDelimiter方法:用于判断当前字符是否属于运算符或分隔符
Token类:用于表示一个记号,包含类型和值两个属性
七、词法分析的意义与挑战
词法分析虽然看似简单,但在实际应用中面临诸多挑战:
多语言支持:不同编程语言的词法规则差异较大,需灵活适配
性能优化:对于大规模源码,需保证扫描效率
错误处理:对非法字符或不规范的输入需具备良好的容错能力
八、结语
词法分析是编译器工作的起点,也是整个编译流程中不可或缺的一环。通过对齐天编译器的词法分析模块进行深入剖析,我们不仅能够理解其工作原理,还能为后续的语法分析、语义分析等模块打下坚实基础。
国芯人工智能将持续推进国产编译器技术的发展,助力构建自主可控的软件生态体系。
[本文内容由国芯人工智能辅助生成,仅供参考]
页:
[1]