OILS / opy / _regtest / src / osh / lex.py (oilshell.org)

524 lines, 225 significant
"""
lex.py -- Shell lexer.

It consists of a series of lexer modes, each with a regex -> Id mapping.

NOTE: If this changes, the lexer may need to be recompiled with
build/codegen.sh lexer.

Input Handling
--------------

Note that our style of input handling affects the regular expressions in the
lexer.

We pass one line at a time to the Lexer, via LineLexer. We must be able to
parse one line at a time because of interactive parsing (e.g. using the output
of GNU readline.)

There are two ways we could handle input:

  1. Every line is NUL terminated:
     'one\n\0' 'last line\0'
  2. Every line is terminated by NUL, except the last:
     'one\n' 'last line\0'

The advantage of #2 is that in the common case of reading files, we don't have
to do it one line at a time. We could slurp the whole file in, or mmap() it,
etc.

The second option makes the regular expressions more complicated, so I'm
punting on it for now. We assume the first.

That means:

  - No regexes below should match \0. They are added by
    core/lexer_gen.py for re2c.

For example, [^']+ is not valid. [^'\0]+ is correct. Otherwise we would read
uninitialized memory past the sentinel.
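A quick standalone check (a hypothetical demo, not part of this module) of how the \0 sentinel interacts with these character classes:

```python
import re

# With input style #1, every line carries a NUL sentinel. A class like
# [^'\0]+ stops at the sentinel, so a match can never run past the buffer.
line = "echo hi\n\0"  # one NUL-terminated line
m = re.match(r"[^'\0]+", line)
assert m.group(0) == "echo hi\n"  # everything up to, not including, \0
```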

Python's regex engine knows where the end of the input string is, so it
doesn't need a sentinel like \0.

Note that re2c is not able to work in a mode with a strict length limit. It
would cause too many extra checks, and the language would then no longer be
regular!

http://re2c.org/examples/example_03.html

UPDATE: Two More Options
------------------------

3. Change the \n at the end of every line to \0. \0 becomes Id.Op_Newline, at
least in lex_mode.OUTER.

Advantage: This makes the regular expressions easier to generate, and allows
you to read in the whole file at once instead of allocating lines.

Disadvantages:
- You can't mmap() the file because the data is mutated. Or it will have to
  be copy-on-write.
- You can't get rid of comment lines if you read the whole file.

4. Read a line at a time. Throw away the lines, unless you're parsing a
function, which should be obvious.

After you parse the function, you can COPY all the tokens to another location.
Very few tokens need their actual text data. Most of them can just be
identified by ID.

Contents are relevant:

- Lit_Chars, Lit_Other, Lit_EscapedChar, Lit_Digits
- Id.Lit_VarLike -- for the name, and for = vs +=
- Id.Lit_ArithVarLike
- VSub_Name, VSub_Number
- Id.Redir_* for the LHS file descriptor. Although this is one or two bytes
  that could be copied.

You can also take this opportunity to enter the strings in an intern table.
How much memory would that save?

Remaining constructs
--------------------

Case terminators:
  ;;&  Op_DSemiAmp for case
  ;&   Op_SemiAmp

Left Index:

  _VAR_NAME_RE + '\['  Lit_LeftIndexLikeOpen
  ]=                   Lit_LeftIndexLikeClose

Indexed array and Associative array literals:
  declare -A a=([key]=value [key2]=value2)
  declare -a a=([1 + 2]=value [3 + 4]=value2)  # parsed!

  Lit_LBracket Lit_RBracketEqual
  Left_Bracket, Right_BracketEqual?
  Op_LBracket Op_RBracketEqual
"""

import re

from osh.meta import Id, Kind, ID_SPEC
from core.lexer import C, R

from osh.meta import types

lex_mode_e = types.lex_mode_e


# In oil, I hope to have these lexer modes:
# COMMAND
# EXPRESSION (takes place of ARITH, VS_UNQ_ARG, VS_DQ_ARG)
# SQ  RAW_SQ  DQ  RAW_DQ
# VS -- a single state here? Or switches into expression state, because }
# is an operator
# Problem: DICT_KEY might be a different state, to accept either a bare word
# foo, or an expression (X=a+2), which is allowed in shell. Python doesn't
# allow unquoted words, but we want to.

# TODO: There are 4 shared groups here. I think you should test if that
# structure should be preserved through re2c. Do a benchmark.
#
# If a group has no matches, then return Id.Unknown_Tok? And then you can
# chain the groups in order. It might make sense to experiment with the order
# too.

_BACKSLASH = [
  R(r'\\[^\n\0]', Id.Lit_EscapedChar),
  C('\\\n', Id.Ignored_LineCont),
]

_VAR_NAME_RE = r'[a-zA-Z_][a-zA-Z0-9_]*'

# Used by osh/cmd_parse.py to validate for loop name. Note it must be
# anchored on the right.
VAR_NAME_RE = re.compile(_VAR_NAME_RE + '$')

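A standalone sketch (mirroring the two definitions above) of why the right anchor matters:

```python
import re

_VAR_NAME_RE = r'[a-zA-Z_][a-zA-Z0-9_]*'
VAR_NAME_RE = re.compile(_VAR_NAME_RE + '$')

# Without the '$', 'foo-bar' would be accepted via its 'foo' prefix.
assert VAR_NAME_RE.match('foo_1')
assert VAR_NAME_RE.match('foo-bar') is None
assert VAR_NAME_RE.match('1foo') is None  # can't start with a digit
```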
# All Kind.VSub
_VARS = [
  # Unbraced variables
  R(r'\$' + _VAR_NAME_RE, Id.VSub_Name),
  R(r'\$[0-9]', Id.VSub_Number),
  C(r'$!', Id.VSub_Bang),
  C(r'$@', Id.VSub_At),
  C(r'$#', Id.VSub_Pound),
  C(r'$$', Id.VSub_Dollar),
  C(r'$*', Id.VSub_Star),
  C(r'$-', Id.VSub_Hyphen),
  C(r'$?', Id.VSub_QMark),
]

# Kind.Left that are valid in double-quoted modes.
_LEFT_SUBS = [
  C('`', Id.Left_Backtick),
  C('$(', Id.Left_CommandSub),
  C('${', Id.Left_VarSub),
  C('$((', Id.Left_ArithSub),
  C('$[', Id.Left_ArithSub2),
]

# Additional Kind.Left that are valid in unquoted modes.
_LEFT_UNQUOTED = [
  C('"', Id.Left_DoubleQuote),
  C("'", Id.Left_SingleQuote),
  C('$"', Id.Left_DollarDoubleQuote),
  C("$'", Id.Left_DollarSingleQuote),

  C('<(', Id.Left_ProcSubIn),
  C('>(', Id.Left_ProcSubOut),
]

# Constructs used:
# Character classes [] with simple ranges and negation, +, *, \n, \0
# It would be nice to express this as CRE ... ? And then compile to re2c
# syntax. And Python syntax.

# NOTE: Should remain compatible with re2c syntax, for code gen.
# http://re2c.org/manual/syntax/syntax.html

# PROBLEM: \0 in Python re vs \000 in re2c? Can this be unified?
# Yes, Python allows \000 octal escapes.
#
# https://docs.python.org/2/library/re.html

LEXER_DEF = {}  # TODO: Should be a list so we enforce order.

# Anything until the end of the line is a comment. Does not match the newline
# itself. We want to switch modes and possibly process Op_Newline for here
# docs, etc.
LEXER_DEF[lex_mode_e.COMMENT] = [
  R(r'[^\n\0]*', Id.Ignored_Comment)
]
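For instance, a standalone check of the comment rule:

```python
import re

# The comment rule consumes to end of line but leaves '\n' for the caller,
# which may still need to emit Op_Newline (e.g. for here docs).
m = re.match(r'[^\n\0]*', '# a comment\nnext line')
assert m.group(0) == '# a comment'
```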

_UNQUOTED = _BACKSLASH + _LEFT_SUBS + _LEFT_UNQUOTED + _VARS + [
  # NOTE: We could add anything 128 and above to this character class? So
  # utf-8 characters don't get split?
  R(r'[a-zA-Z0-9_/.-]+', Id.Lit_Chars),
  # e.g. beginning of NAME=val, which will always be longer than the above
  # Id.Lit_Chars.
  R(r'[a-zA-Z_][a-zA-Z0-9_]*\+?=', Id.Lit_VarLike),

  C('#', Id.Lit_Pound),  # For comments

  # Needs to be LONGER than any other
  #(_VAR_NAME_RE + r'\[', Id.Lit_Maybe_LHS_ARRAY),
  # Id.Lit_Maybe_LHS_ARRAY2
  #(r'\]\+?=', Id.Lit_Maybe_ARRAY_ASSIGN_RIGHT),

  # For brace expansion {a,b}
  C('{', Id.Lit_LBrace),
  C('}', Id.Lit_RBrace),  # Also for var sub ${a}
  C(',', Id.Lit_Comma),
  C('~', Id.Lit_Tilde),  # For tilde expansion

  R(r'[ \t\r]+', Id.WS_Space),

  C('\n', Id.Op_Newline),

  C('&', Id.Op_Amp),
  C('|', Id.Op_Pipe),
  C('|&', Id.Op_PipeAmp),
  C('&&', Id.Op_DAmp),
  C('||', Id.Op_DPipe),
  C(';', Id.Op_Semi),
  C(';;', Id.Op_DSemi),

  C('(', Id.Op_LParen),
  C(')', Id.Op_RParen),

  R(r'[0-9]*<', Id.Redir_Less),
  R(r'[0-9]*>', Id.Redir_Great),
  R(r'[0-9]*<<', Id.Redir_DLess),
  R(r'[0-9]*<<<', Id.Redir_TLess),
  R(r'[0-9]*>>', Id.Redir_DGreat),
  R(r'[0-9]*<<-', Id.Redir_DLessDash),
  R(r'[0-9]*>&', Id.Redir_GreatAnd),
  R(r'[0-9]*<&', Id.Redir_LessAnd),
  R(r'[0-9]*<>', Id.Redir_LessGreat),
  R(r'[0-9]*>\|', Id.Redir_Clobber),

  R(r'[^\0]', Id.Lit_Other),  # any other single char is a literal
]
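The remarks about Lit_VarLike being "longer than" Lit_Chars assume longest-match dispatch, as re2c uses. A minimal standalone sketch of that idea (an assumption for illustration; the real matcher lives in core/lexer.py, and this is not its actual code):

```python
import re

# Two of the rules above, as plain (compiled regex, name) pairs.
RULES = [
    (re.compile(r'[a-zA-Z0-9_/.-]+'), 'Lit_Chars'),
    (re.compile(r'[a-zA-Z_][a-zA-Z0-9_]*\+?='), 'Lit_VarLike'),
]

def longest_match(s, pos=0):
    """Try every rule at pos and keep the longest match."""
    best_len, best_name = 0, None
    for pat, name in RULES:
        m = pat.match(s, pos)
        if m and m.end() - pos > best_len:
            best_len, best_name = m.end() - pos, name
    return best_len, best_name

# 'NAME=' (5 chars, Lit_VarLike) beats 'NAME' (4 chars, Lit_Chars).
assert longest_match('NAME=val') == (5, 'Lit_VarLike')
assert longest_match('NAME val') == (4, 'Lit_Chars')
```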

# In OUTER and DBRACKET states.
_EXTGLOB_BEGIN = [
  C('@(', Id.ExtGlob_At),
  C('*(', Id.ExtGlob_Star),
  C('+(', Id.ExtGlob_Plus),
  C('?(', Id.ExtGlob_QMark),
  C('!(', Id.ExtGlob_Bang),
]

_KEYWORDS = [
  # NOTE: { is matched elsewhere
  C('[[', Id.KW_DLeftBracket),
  C('!', Id.KW_Bang),
  C('for', Id.KW_For),
  C('while', Id.KW_While),
  C('until', Id.KW_Until),
  C('do', Id.KW_Do),
  C('done', Id.KW_Done),
  C('in', Id.KW_In),
  C('case', Id.KW_Case),
  C('esac', Id.KW_Esac),
  C('if', Id.KW_If),
  C('fi', Id.KW_Fi),
  C('then', Id.KW_Then),
  C('else', Id.KW_Else),
  C('elif', Id.KW_Elif),
  C('function', Id.KW_Function),
  C('time', Id.KW_Time),
]

# These are treated like builtins in bash, but keywords in OSH. However, we
# maintain compatibility with bash for the 'type' builtin.
_MORE_KEYWORDS = [
  C('declare', Id.Assign_Declare),
  C('typeset', Id.Assign_Typeset),
  C('local', Id.Assign_Local),
  C('readonly', Id.Assign_Readonly),

  C('break', Id.ControlFlow_Break),
  C('continue', Id.ControlFlow_Continue),
  C('return', Id.ControlFlow_Return),
  C('exit', Id.ControlFlow_Exit),
]


_TYPE_KEYWORDS = set(name for _, name, _ in _KEYWORDS)
_TYPE_KEYWORDS.add('{')  # not in our lexer list
_TYPE_BUILTINS = set(name for _, name, _ in _MORE_KEYWORDS)


def IsOtherBuiltin(name):
  return name in _TYPE_BUILTINS


def IsKeyword(name):
  return name in _TYPE_KEYWORDS


# These two must be recognized in the OUTER state, but can't be nested within
# [[.
# Keywords have to be checked before _UNQUOTED so we get <KW_If "if"> instead
# of <Lit_Chars "if">.
LEXER_DEF[lex_mode_e.OUTER] = [
  C('((', Id.Op_DLeftParen),  # not allowed within [[
] + _KEYWORDS + _MORE_KEYWORDS + _UNQUOTED + _EXTGLOB_BEGIN

# DBRACKET: can be like OUTER, except:
# - Don't really need redirects either... Redir_Less could be Op_Less
# - Id.Op_DLeftParen can't be nested inside.
LEXER_DEF[lex_mode_e.DBRACKET] = [
  C(']]', Id.Lit_DRightBracket),
  C('!', Id.KW_Bang),
] + ID_SPEC.LexerPairs(Kind.BoolUnary) + \
    ID_SPEC.LexerPairs(Kind.BoolBinary) + \
    _UNQUOTED + _EXTGLOB_BEGIN

# Inside an extended glob, most characters are literals, including spaces and
# punctuation. We also accept \, $var, ${var}, "", etc. They can also be
# nested, so _EXTGLOB_BEGIN appears here.
#
# Example: echo @(<> <>|&&|'foo'|$bar)
LEXER_DEF[lex_mode_e.EXTGLOB] = \
    _BACKSLASH + _LEFT_SUBS + _VARS + _EXTGLOB_BEGIN + [
  R(r'[^\\$`"\'|)@*+!?\0]+', Id.Lit_Chars),
  C('|', Id.Op_Pipe),
  C(')', Id.Op_RParen),  # may be translated to Id.ExtGlob_RParen
  R(r'[^\0]', Id.Lit_Other),  # everything else is literal
]


LEXER_DEF[lex_mode_e.BASH_REGEX] = [
  # Match these literals first, and then the rest of the OUTER state I guess.
  # That's how bash works.
  #
  # At a minimum, you do need $ and ~ expansions to happen. <>;& could have
  # been allowed unescaped too, but that's not what bash does. The criterion
  # was whether they were "special" in both languages, which seems dubious.
  C('(', Id.Lit_Chars),
  C(')', Id.Lit_Chars),
  C('|', Id.Lit_Chars),
] + [
  # Avoid "unreachable rule error"
  (is_regex, pat, re_list) for (is_regex, pat, re_list) in _UNQUOTED
  if is_regex or pat not in ('(', ')', '|')
]


LEXER_DEF[lex_mode_e.DQ] = [
  # Only 4 characters are backslash escaped inside "".
  # https://www.gnu.org/software/bash/manual/bash.html#Double-Quotes
  R(r'\\[$`"\\]', Id.Lit_EscapedChar),
  C('\\\n', Id.Ignored_LineCont),
] + _LEFT_SUBS + _VARS + [
  R(r'[^$`"\0\\]+', Id.Lit_Chars),  # matches a line at most
  # NOTE: When parsing here doc line, this token doesn't end it.
  C('"', Id.Right_DoubleQuote),
  R(r'[^\0]', Id.Lit_Other),  # e.g. "$"
]
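A standalone check of that 4-character escape set:

```python
import re

# Only \$ \` \" \\ are escapes inside ""; anything else keeps its backslash.
ESCAPED = re.compile(r'\\[$`"\\]')
assert ESCAPED.match(r'\$')
assert ESCAPED.match(r'\\')
assert ESCAPED.match(r'\n') is None  # '\n' stays two literal chars in ""
```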

_VS_ARG_COMMON = _BACKSLASH + [
  C('/', Id.Lit_Slash),  # for patsub (not Id.VOp2_Slash)
  C('#', Id.Lit_Pound),  # for patsub prefix (not Id.VOp1_Pound)
  C('%', Id.Lit_Percent),  # for patsub suffix (not Id.VOp1_Percent)
  C('}', Id.Right_VarSub),  # For var sub "${a}"
]

# Kind.{LIT,IGNORED,VS,LEFT,RIGHT,Eof}
LEXER_DEF[lex_mode_e.VS_ARG_UNQ] = \
    _VS_ARG_COMMON + _LEFT_SUBS + _LEFT_UNQUOTED + _VARS + [
  # NOTE: added < and > so it doesn't eat <()
  R(r'[^$`/}"\'\0\\#%<>]+', Id.Lit_Chars),
  R(r'[^\0]', Id.Lit_Other),  # e.g. "$", must be last
]

# Kind.{LIT,IGNORED,VS,LEFT,RIGHT,Eof}
LEXER_DEF[lex_mode_e.VS_ARG_DQ] = _VS_ARG_COMMON + _LEFT_SUBS + _VARS + [
  R(r'[^$`/}"\0\\#%]+', Id.Lit_Chars),  # matches a line at most
  # Weird wart: even in double quoted state, double quotes are allowed
  C('"', Id.Left_DoubleQuote),
  R(r'[^\0]', Id.Lit_Other),  # e.g. "$", must be last
]

# NOTE: Id.Ignored_LineCont is NOT supported in SQ state, as opposed to DQ
# state.
LEXER_DEF[lex_mode_e.SQ] = [
  R(r"[^'\0]+", Id.Lit_Chars),  # matches a line at most
  C("'", Id.Right_SingleQuote),
]

# Shared between echo -e and $''.
_C_STRING_COMMON = [

  # \x6 is valid in bash
  R(r'\\x[0-9a-fA-F]{1,2}', Id.Char_Hex),
  # \u and \U consume hex digits
  R(r'\\u[0-9a-fA-F]{1,4}', Id.Char_Unicode4),
  R(r'\\U[0-9a-fA-F]{1,8}', Id.Char_Unicode8),

  R(r'\\[0abeEfrtnv\\]', Id.Char_OneChar),

  # Backslash that ends a line. Note '.' doesn't match a newline character.
  C('\\\n', Id.Char_Literals),

  # e.g. \A is not an escape, and \x doesn't match a hex escape. We allow it,
  # but a lint tool could warn about it.
  C('\\', Id.Char_BadBackslash),
]
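A standalone check of the {1,2} hex quantifier ("\x6 is valid in bash"):

```python
import re

HEX = re.compile(r'\\x[0-9a-fA-F]{1,2}')
assert HEX.match(r'\x6').group(0) == r'\x6'     # a single digit is accepted
assert HEX.match(r'\xff7').group(0) == r'\xff'  # at most two digits consumed
```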

# Used by ECHO_LEXER in core/builtin.py.
ECHO_E_DEF = _C_STRING_COMMON + [
  # Note: tokens above \0377 can either be truncated or be flagged as a syntax
  # error in strict mode.
  R(r'\\0[0-7]{1,3}', Id.Char_Octal4),

  C(r'\c', Id.Char_Stop),

  # Bad Backslash should not end the string. We allow it, but a lint tool
  # should warn about it.
  R(r'\\$', Id.Char_BadBackslash),

  # e.g. 'foo', anything that's not a backslash escape
  R(r'[^\\]+', Id.Char_Literals),
]

# NOTE: Id.Ignored_LineCont is also not supported here, even though the whole
# point of it is that it supports other backslash escapes like \n! It just
# becomes a regular backslash.
LEXER_DEF[lex_mode_e.DOLLAR_SQ] = _C_STRING_COMMON + [
  # Silly difference! In echo -e, the syntax is \0377, but here it's $'\377',
  # with no leading 0.
  R(r'\\[0-7]{1,3}', Id.Char_Octal3),

  # ' is escaped in $'' mode, but not echo -e. Ditto for ", not sure why.
  C(r"\'", Id.Char_OneChar),
  C(r'\"', Id.Char_OneChar),

  # e.g. 'foo', anything that's not a backslash escape. Need to exclude ' as
  # well.
  R(r"[^\\'\0]+", Id.Char_Literals),

  C("'", Id.Right_SingleQuote),

  # Backslash that ends the file! Caught by re2c exhaustiveness check. Parser
  # will assert; should give a better syntax error.
  C('\\\0', Id.Unknown_Tok),
]
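The octal difference between the two modes can be checked standalone:

```python
import re

ECHO_E_OCTAL = re.compile(r'\\0[0-7]{1,3}')    # echo -e form: \0377
DOLLAR_SQ_OCTAL = re.compile(r'\\[0-7]{1,3}')  # $'' form: \377, no leading 0
assert ECHO_E_OCTAL.match(r'\0377')
assert DOLLAR_SQ_OCTAL.match(r'\377')
assert ECHO_E_OCTAL.match(r'\377') is None     # echo -e requires the 0
```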

LEXER_DEF[lex_mode_e.VS_1] = [
  R(_VAR_NAME_RE, Id.VSub_Name),
  # ${11} is valid, compared to $11 which is $1 and then literal 1.
  R(r'[0-9]+', Id.VSub_Number),
  C('!', Id.VSub_Bang),
  C('@', Id.VSub_At),
  C('#', Id.VSub_Pound),
  C('$', Id.VSub_Dollar),
  C('*', Id.VSub_Star),
  C('-', Id.VSub_Hyphen),
  C('?', Id.VSub_QMark),

  C('}', Id.Right_VarSub),

  C('\\\n', Id.Ignored_LineCont),

  C('\n', Id.Unknown_Tok),  # newline not allowed inside ${}
  R(r'[^\0]', Id.Unknown_Tok),  # any char except newline
]
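A standalone check of the ${11} vs $11 remark above:

```python
import re

# Inside ${}, VS_1 matches a full digit run; unbraced $N takes one digit.
assert re.match(r'[0-9]+', '11').group(0) == '11'    # ${11} -> VSub_Number 11
assert re.match(r'\$[0-9]', '$11').group(0) == '$1'  # $11 -> $1, then literal 1
```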

LEXER_DEF[lex_mode_e.VS_2] = \
    ID_SPEC.LexerPairs(Kind.VTest) + \
    ID_SPEC.LexerPairs(Kind.VOp1) + \
    ID_SPEC.LexerPairs(Kind.VOp2) + [
  C('}', Id.Right_VarSub),

  C('\\\n', Id.Ignored_LineCont),
  C('\n', Id.Unknown_Tok),  # newline not allowed inside ${}
  R(r'[^\0]', Id.Unknown_Tok),  # any char except newline
]

# https://www.gnu.org/software/bash/manual/html_node/Shell-Arithmetic.html#Shell-Arithmetic
LEXER_DEF[lex_mode_e.ARITH] = \
    _LEFT_SUBS + _VARS + _LEFT_UNQUOTED + [
  # newline is ignored space, unlike in OUTER
  R(r'[ \t\r\n]+', Id.Ignored_Space),

  # Examples of arith constants:
  #   64#azAZ
  #   0xabc 0xABC
  #   0123
  # A separate digits token makes this easier to parse STATICALLY. But this
  # doesn't help with DYNAMIC parsing.
  R(_VAR_NAME_RE, Id.Lit_ArithVarLike),  # for variable names or 64#_
  R(r'[0-9]+', Id.Lit_Digits),
  C('@', Id.Lit_At),  # for 64#@ or ${a[@]}
  C('#', Id.Lit_Pound),  # for 64#a

  # TODO: 64#@ interferes with VS_AT. Hm.
] + ID_SPEC.LexerPairs(Kind.Arith) + [
  C('\\\n', Id.Ignored_LineCont),
  R(r'[^\0]', Id.Unknown_Tok)  # any char. This should be a syntax error.
]

# Notes on BASH_REGEX states
#
# From bash manual:
#
# - Any part of the pattern may be quoted to force the quoted portion to be
#   matched as a string.
# - Bracket expressions in regular expressions must be treated carefully, since
#   normal quoting characters lose their meanings between brackets.
# - If the pattern is stored in a shell variable, quoting the variable
#   expansion forces the entire pattern to be matched as a string.
#
# Is there a re.escape function? It's just like EscapeGlob and UnescapeGlob.
#
# TODO: For testing, write a script to extract and save regexes... and compile
# them with regcomp. I've only seen constant regexes.
#
# From code: ( | ) are treated specially.