doc/eggex.md

1	---
2	default_highlighter: oils-sh
3	---
4
5	Egg Expressions (YSH Regexes)
6	=============================
7
8	YSH has a new syntax for patterns, which appears between the `/ /` delimiters:
9
10	if (mystr ~ /d+ '.' d+/) {
11	echo 'mystr looks like a number N.M'
12	}
13
14	These patterns are intended to be familiar, but they differ from POSIX or Perl
15	expressions in important ways. So we call them eggexes rather than
16	regexes!
17
18	<!-- cmark.py expands this -->
19	<div id="toc">
20	</div>
21
22	## Why Invent a New Language?
23
24	- Eggexes let you name subpatterns and compose them, which makes them more
25	readable and testable.
26	- Their syntax is vastly simpler because literal characters are quoted,
27	and operators are not. For example, `^` no longer means three totally
28	different things. See the critique at the end of this doc.
29	- bash and awk use the limited and verbose POSIX ERE syntax, while eggexes are
30	more expressive and (in some cases) Perl-like.
31	- They're designed to be translated to any regex dialect. Right now, the
32	  YSH shell translates them to ERE so you can use them with common Unix tools:
33	  - `egrep` (`grep -E`)
34	  - `awk`
35	  - GNU `sed --regexp-extended`
36	  - PCRE syntax is the second most important target.
37	- They're statically parsed in YSH, so:
38	  - You can get syntax errors at parse time. In contrast, if you embed a
39	    regex in a string, you don't get syntax errors until runtime.
40	  - The eggex is part of the [lossless syntax tree][], which means you can do
41	    linting, formatting, and refactoring on eggexes, just like any other type
42	    of code.
43	- Eggexes support regular languages in the mathematical sense, whereas
44	regexes are confused about the issue. All nonregular eggex extensions
45	are prefixed with `!!`, so you can visually audit them for [catastrophic
46	backtracking][backtracking]. (Russ Cox, author of the RE2 engine, [has
47	written extensively](https://swtch.com/~rsc/regexp/) on this issue.)
48	- Eggexes are more fun than regexes!
49
50	[backtracking]: https://blog.codinghorror.com/regex-performance/
51
52	[lossless syntax tree]: http://www.oilshell.org/blog/2017/02/11.html
53
54	### Example of Pattern Reuse
55
56	Here's a longer example:
57
58	# Define a subpattern. 'digit' and 'd' are the same.
59	$ var D = / digit{1,3} /
60
61	# Use the subpattern
62	$ var ip_pat = / D '.' D '.' D '.' D /
63
64	# This eggex compiles to an ERE
65	$ echo $ip_pat
66	[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}
67
68	This means you can use it in a very simple way:
69
70	$ egrep $ip_pat foo.txt
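For readers who think in Python, the composition above can be approximated as follows. This is only an analogy, not how YSH works: it concatenates strings (eggex splices syntax trees), and it uses `[0-9]` because Python's `re` module has no POSIX classes such as `[[:digit:]]`. The names `D`, `IP_PAT`, and `looks_like_ip` are made up for this sketch.

```python
import re

# Subpattern, analogous to: var D = / digit{1,3} /
D = r"[0-9]{1,3}"

# Four copies joined by a literal dot.
# re.escape(".") plays the role of the quoted literal '.' in the eggex.
IP_PAT = re.escape(".").join([D] * 4)


def looks_like_ip(s: str) -> bool:
    """Return True if the whole string has the shape N.N.N.N."""
    return re.fullmatch(IP_PAT, s) is not None
```

The composed string has the same shape as the ERE printed by `echo $ip_pat`, with `[0-9]` standing in for `[[:digit:]]`.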
71
72	TODO: You should also be able to inline patterns like this:
73
74	egrep $/d+/ foo.txt
75
76	### Design Philosophy
77
78	- Eggexes can express a superset of POSIX and Perl syntax.
79	- The language is designed for "dumb", one-to-one, syntactic translations.
80	That is, translation doesn't rely on understanding the semantics of
81	regexes. This is because regex implementations have many corner cases and
82	incompatibilities, with regard to Unicode, `NUL` bytes, etc.
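To make the idea of a "dumb" syntactic translation concrete, here is a minimal sketch of a node-by-node translator. The tuple-based tree, the node names, and `to_ere` are all assumptions invented for illustration; they do not reflect the real Oils data structures.

```python
# Hypothetical mini syntax tree (not the Oils representation):
#   ("lit", "text")        quoted literal
#   ("class", "digit")     named character class
#   ("repeat", node, "+")  repetition with a postfix operator
#   ("seq", [nodes])       sequence

ERE_META = set(".[]{}()*+?^$|\\")

ERE_CLASSES = {
    "digit": "[[:digit:]]",
    "space": "[[:space:]]",
    "alpha": "[[:alpha:]]",
}


def to_ere(node):
    """Translate one node to ERE text without reasoning about semantics."""
    kind = node[0]
    if kind == "lit":
        # Escape ERE metacharacters so the literal matches itself.
        return "".join("\\" + c if c in ERE_META else c for c in node[1])
    if kind == "class":
        return ERE_CLASSES[node[1]]
    if kind == "repeat":
        return to_ere(node[1]) + node[2]
    if kind == "seq":
        return "".join(to_ere(child) for child in node[1])
    raise ValueError("unknown node kind: %r" % (kind,))


# Analogous to / digit+ '.' digit+ /
number = ("seq", [
    ("repeat", ("class", "digit"), "+"),
    ("lit", "."),
    ("repeat", ("class", "digit"), "+"),
])
```

Each node maps to one fragment of output, and the translator never asks what the resulting pattern will match. In particular, this sketch makes no attempt to add grouping around a repeated multi-character literal.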
83
84	### The Expression Language Is Consistent
85
86	Eggexes have a consistent syntax:
87
88	- Single characters are unadorned, in lowercase: `dot`, `space`, or `s`
89	- A sequence of multiple characters looks like `'lit'`, `$var`, etc.
90	- Constructs that match zero characters look like `%start`, `%word_end`, etc.
91	- Entire subpatterns (which may contain alternation, repetition, etc.) are in
92	uppercase like `HexDigit`. Important: these are spliced as syntax trees,
93	not strings, so you don't need to think about quoting.
94
95	For example, it's easy to see that these patterns all match three characters:
96
97	/ d d d /
98	/ digit digit digit /
99	/ dot dot dot /
100	/ word space word /
101	/ 'ab' space /
102	/ 'abc' /
103
104	And that these patterns match two:
105
106	/ %start w w /
107	/ %start 'if' /
108	/ d d %end /
109
110	And that you have to look up the definition of `HexDigit` to know how many
111	characters this matches:
112
113	/ %start HexDigit %end /
114
115	Constructs like `. ^ $ \< \>` are deprecated because they break these rules.
116
117	## Expression Primitives
118
119	### `.` Is Now `dot`
120
121	But `.` is still accepted. It usually matches any character except a newline,
122	although this changes based on flags (e.g. `dotall`, `unicode`).
123
124	### Classes Are Unadorned: `word`, `w`, `alnum`
125
126	We accept both Perl and POSIX classes.
127
128	- Perl:
129	  - `d` or `digit`
130	  - `s` or `space`
131	  - `w` or `word`
132	- POSIX
133	  - `alpha`, `alnum`, ...
134
135	### Zero-width Assertions Look Like `%this`
136
137	- POSIX
138	  - `%start` is `^`
139	  - `%end` is `$`
140	- PCRE:
141	  - `%input_start` is `\A`
142	  - `%input_end` is `\z`
143	  - `%last_line_end` is `\Z`
144	- GNU ERE extensions:
145	  - `%word_start` is `\<`
146	  - `%word_end` is `\>`
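The difference between `%start` (`^`) and `%input_start` (`\A`) only shows up once a multiline flag is involved. The following sketch uses Python's `re` module as a stand-in for a Perl-like engine; it is an illustration of the target syntax, not of YSH itself.

```python
import re

text = "ab\ncd"

# With MULTILINE, ^ matches at the start of every line ...
line_starts = re.findall(r"^\w+", text, re.MULTILINE)

# ... while \A still matches only at the very start of the input.
input_starts = re.findall(r"\A\w+", text, re.MULTILINE)
```

Here `line_starts` contains both words and `input_starts` contains only the first one.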
147
148	### Single-Quoted Strings
149
150	- `'hello world'` becomes a regex-escaped string
151
152	Note: instead of using double-quoted strings like `"xyz $var"`, you can splice
153	a string into an eggex:
154
155	/ 'xyz ' @var /
156
157	## Compound Expressions
158
159	### Sequence and Alternation Are Unchanged
160
161	- `x y` matches `x` and `y` in sequence
162	- `x | y` matches `x` or `y`
163
164	You can also write a more Pythonic alternative: `x or y`.
165
166	### Repetition Is Unchanged In Common Cases, and Better in Rare Cases
167
168	Repetition is just like POSIX ERE or Perl:
169
170	- `x?`, `x+`, `x*`
171	- `x{3}`, `x{1,3}`
172
173	We've reserved syntactic space for PCRE and Python variants:
174
175	- lazy/non-greedy: `x{L +}`, `x{L 3,4}`
176	- possessive: `x{P +}`, `x{P 3,4}`
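For reference, this is what greedy and lazy repetition do in an engine that already supports both. The sketch uses Python's `re` module, where the lazy form is written `+?`; that `x{L +}` would translate to this form is an assumption, since the eggex syntax is only reserved.

```python
import re

s = "<a><b>"

# Greedy: .+ takes as much as possible, so the match runs to the last '>'.
greedy = re.match(r"<.+>", s).group()

# Lazy: .+? takes as little as possible, so the match stops at the first '>'.
lazy = re.match(r"<.+?>", s).group()
```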
177
178	### Negation Consistently Uses !
179
180	You can negate named char classes:
181
182	/ !digit /
183
184	and char class literals:
185
186	/ ![ a-z A-Z ] /
187
188	Sometimes you can do both:
189
190	/ ![ !digit ] / # translates to /[^\D]/ in PCRE
191	# error in ERE because it can't be expressed
192
193
194	You can also negate "regex modifiers" / compilation flags:
195
196	/ word ; ignorecase / # flag on
197	/ word ; !ignorecase / # flag off
198	/ word ; !i / # abbreviated
199
200	In contrast, regexes have many confusing syntaxes for negation:
201
202	[^abc] vs. [abc]
203	[^[:digit:]] vs. [[:digit:]]
204
205	\D vs. \d
206
207	/\w/-i vs /\w/i
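The equivalences behind these spellings can be checked in any Perl-like engine. This sketch uses Python's `re` module with ASCII-only input; it illustrates the regex side of the comparison, not eggex translation.

```python
import re

s = "a1-b2"

# Two spellings of "not a digit" agree on ASCII input.
non_digits_escape = re.findall(r"\D", s)
non_digits_bracket = re.findall(r"[^0-9]", s)

# Double negation, the regex form given above for / ![ !digit ] /.
double_negation = re.findall(r"[^\D]", s)
```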
208
209	### Splice Other Patterns `@var_name` or `UpperCaseVarName`
210
211	This allows you to reuse patterns. Using uppercase variables:
212
213	var D = / digit{3} /
214
215	var ip_addr = / D '.' D '.' D '.' D /
216
217	Using normal variables:
218
219	var part = / digit{3} /
220
221	var ip_addr = / @part '.' @part '.' @part '.' @part /
222
223	This is similar to how `lex` and `re2c` work.
224
225	### Group With `()`
226
227	Parentheses are used for precedence:
228
229	('foo' | 'bar')+
230
231	See note below: When translating to POSIX ERE, grouping becomes a capturing
232	group. POSIX ERE has no non-capturing groups.
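The practical consequence is that group numbers shift when a grouping becomes a capture. The sketch below shows the effect in Python's `re` module, which supports both kinds of group; it only illustrates the numbering issue and does not run POSIX ERE.

```python
import re

s = "foobar7"

# Non-capturing group: the digit is group 1.
m_noncapturing = re.match(r"(?:foo|bar)+(\d)", s)

# Capturing group (all that ERE offers): the digit moves to group 2,
# and group 1 holds the last repetition of the alternation.
m_capturing = re.match(r"(foo|bar)+(\d)", s)
```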
233
234
235	### Capture with `<capture ...>`
236
237	Here's a positional capture:
238
239	<capture d+> # Becomes _group(1)
240
241	Add a variable after `as` for named capture:
242
243	<capture d+ as month> # Becomes _group('month')
244
245	You can also add type conversion functions:
246
247	<capture d+ : int> # _group(1) returns an Int, not Str
248	<capture d+ as month: int> # _group('month') returns an Int, not Str
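As a rough analogue of named captures with conversion functions, the sketch below applies a converter to each named group after a match. `CONVERTERS` and `typed_groups` are hypothetical names for illustration; YSH attaches the conversion inside the pattern itself, which Python's `re` cannot do.

```python
import re

PAT = re.compile(r"(?P<month>\d+)-(?P<day>\d+)")

# Hypothetical table standing in for the ': int' annotations.
CONVERTERS = {"month": int, "day": int}


def typed_groups(m):
    """Return named groups, converted where a converter is registered."""
    return {
        name: CONVERTERS.get(name, str)(value)
        for name, value in m.groupdict().items()
    }


m = PAT.search("on 04-01, 10-31")
result = typed_groups(m)
```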
249
250	### Character Class Literals Use `[]`
251
252	Example:
253
254	[ a-f 'A'-'F' \xFF \u{03bc} \n \\ \' \" \0 ]
255
256	Terms:
257
258	- Ranges: `a-f` or `'A' - 'F'`
259	- Literals: `\n`, `\x01`, `\u{3bc}`, etc.
260	- Sets specified as strings: `'abc'`
261
262	Only letters, numbers, and the underscore may be unquoted:
263
264	/['a'-'f' 'A'-'F' '0'-'9']/
265	/[a-f A-F 0-9]/ # Equivalent to the above
266
267	/['!' - ')']/ # Correct range
268	/[!-)]/ # Syntax Error
269
270	Ranges must be separated by spaces:
271
272	No:
273
274	/[a-fA-F0-9]/
275
276	Yes:
277
278	/[a-f A-F 0-9]/
279
280	### Backtracking Constructs Use `!!` (Discouraged)
281
282	If you want to translate to PCRE, you can use these.
283
284	!!REF 1
285	!!REF name
286
287	!!AHEAD( d+ )
288	!!NOT_AHEAD( d+ )
289	!!BEHIND( d+ )
290	!!NOT_BEHIND( d+ )
291
292	!!ATOMIC( d+ )
293
294	Since they all begin with `!!`, you can visually audit your code for potential
295	performance problems.
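For orientation, here is what two of these constructs look like in a backtracking engine. The sketch uses Python's `re` module; the correspondence between `!!AHEAD( ... )` and `(?= ... )`, and between `!!REF 1` and `\1`, is an assumption based on the names.

```python
import re

# Lookahead: match digits only when they are followed by 'px'.
ahead = re.findall(r"\d+(?=px)", "10px 20em 30px")

# Backreference: match a word character followed by the same character.
doubled = re.search(r"(\w)\1", "hello").group()
```

Neither construct describes a regular language, which is why engines without backtracking (such as RE2) reject them.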
296
297	## Outside the Expression Language
298
299	### Flags and Translation Preferences (`;`)
300
301	Flags or "regex modifiers" appear after a semicolon:
302
303	/ digit+ ; i / # ignore case
304
305	A translation preference is specified after a second semicolon:
306
307	/ digit+ ; ; ERE / # translates to [[:digit:]]+
308	/ digit+ ; ; python / # could translate to \d+
309
310	Flags and translation preferences together:
311
312	/ digit+ ; ignorecase ; python / # could translate to (?i)\d+
313
314	In Oils, the following flags are currently supported:
315
316	#### `reg_icase` / `i` (Ignore Case)
317
318	Use this flag to ignore case when matching. For example, `/'foo'; i/` matches
319	'FOO', but `/'foo'/` doesn't.
320
321	#### `reg_newline` (Multiline)
322
323	With this flag, `%end` will match before a newline and `%start` will match
324	after a newline.
325
326	= u'abc123\n' ~ / digit %end ; reg_newline / # true
327	= u'abc\n123' ~ / %start digit ; reg_newline / # true
328
329	Without the flag, `%start` and `%end` only match from the start or end of the
330	string, respectively.
331
332	= u'abc123\n' ~ / digit %end / # false
333	= u'abc\n123' ~ / %start digit / # false
334
335	Without the flag, `dot` and negated classes like `![abc]` also match a newline.
336
337	= u'\n' ~ / . / # true
338	= u'\n' ~ / !digit / # true
339
340	With the flag, the newline `\n` is excluded from `dot` and negated classes.
341
342	= u'\n' ~ / . ; reg_newline / # false
343	= u'\n' ~ / !digit ; reg_newline / # false
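Other engines split this behavior across separate flags, which is one reason flag names do not carry over between dialects. As a rough comparison only (this is Python's `re` module, not POSIX ERE or YSH): `re.MULTILINE` controls whether `^` matches after a newline, `re.DOTALL` controls whether `.` matches a newline, and Python's default for `.` is the opposite of the POSIX default shown above.

```python
import re

# Anchors: ^ matches after a newline only with MULTILINE.
start_default = re.search(r"^\d", "abc\n123")
start_multiline = re.search(r"^\d", "abc\n123", re.MULTILINE)

# Dot: does not match "\n" unless DOTALL is set.
dot_default = re.search(r".", "\n")
dot_dotall = re.search(r".", "\n", re.DOTALL)
```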
344
345	### Multiline Syntax
346
347	You can spread regexes over multiple lines and add comments:
348
349	var x = ///
350	digit{4} # year e.g. 2001
351	'-'
352	digit{2} # month e.g. 06
353	'-'
354	digit{2} # day e.g. 31
355	///
356
357	(Not yet implemented in YSH.)
358
359	### The YSH API
360
361	See the [YSH regex API](ysh-regex-api.html) for details.
362
363	In summary, YSH has Perl-like conveniences with a `~` operator:
364
365	var s = 'on 04-01, 10-31'
366	var pat = /<capture d+ as month> '-' <capture d+ as day>/
367
368	if (s ~ pat) { # search for the pattern
369	echo $[_group('month')] # => 04
370	}
371
372	It also has an explicit and powerful Python-like API with the `search()` and
373	`leftMatch()` methods on strings.
374
375	var m = s => search(pat, pos=8) # start searching at a position
376	if (m) {
377	echo $[m => group('month')] # => 10
378	}
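For comparison, the Python API that this resembles looks like the sketch below. It is Python's `re` module, not YSH, and the pattern is a hand-written Perl-style equivalent of `pat`.

```python
import re

s = "on 04-01, 10-31"
pat = re.compile(r"(?P<month>\d+)-(?P<day>\d+)")

# Search from the beginning of the string.
first = pat.search(s)

# Search starting at index 8, which skips past the first date.
later = pat.search(s, 8)
```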
379
380	### Language Reference
381
382	- See bottom of the [YSH Expression Grammar]($oils-src:ysh/grammar.pgen2) for
383	the concrete syntax.
384	- See the bottom of [frontend/syntax.asdl]($oils-src:frontend/syntax.asdl) for
385	the abstract syntax.
386
387	## Usage Notes
388
389	### Use character literals rather than C-Escaped strings
390
391	No:
392
393	/ $'foo\tbar' / # Match 7 characters including a tab, but it's hard to read
394	/ r'foo\tbar' / # The string must contain 8 chars including '\' and 't'
395
396	Yes:
397
398	# Instead, take advantage of char literals and implicit regex concatenation
399	/ 'foo' \t 'bar' /
400	/ 'foo' \\ 'tbar' /
401
402
403	## POSIX ERE Limitations
404
405	### Repetition of Strings Requires Grouping
406
407	Repetitions like `* + ?` apply only to the last character, so literal strings
408	need extra grouping:
409
410
411	No:
412
413	'foo'+
414
415	Yes:
416
417	<capture 'foo'>+
418
419	Also OK:
420
421	('foo')+ # this is a CAPTURING group in ERE
422
423	This is necessary because ERE doesn't have non-capturing groups like Perl's
424	`(?:...)`, and Eggex only does "dumb" translations. It doesn't silently insert
425	constructs that change the meaning of the pattern.
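The binding of `+` to the last character can be seen in any engine with ERE-like repetition. This sketch uses Python's `re` module, which behaves the same way for these three patterns.

```python
import re

# 'foo+' means f, o, then one or more o: the + binds to the last character.
ungrouped_repeats_o = re.fullmatch(r"foo+", "foooo") is not None
ungrouped_repeats_word = re.fullmatch(r"foo+", "foofoo") is not None

# Grouping makes the + apply to the whole string 'foo'.
grouped_repeats_word = re.fullmatch(r"(foo)+", "foofoo") is not None
```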
426
427	### Unicode char literals are limited in range
428
429	ERE can't represent this set of 1 character reliably:
430
431	/ [ \u{0100} ] / # This char is 2 bytes encoded in UTF-8
432
433	These sets are accepted:
434
435	/ [ \u{1} \u{2} ] / # set of 2 chars
436	/ [ \x01 \x02 ] / # set of 2 bytes
437
438	They happen to be identical when translated to ERE, but may not be when
439	translated to PCRE.
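The byte counts mentioned above can be checked directly. The helper name `utf8_len` is made up for this sketch, and reading "limited in range" as "code points that fit in one UTF-8 byte" is an inference from the comments above rather than a documented rule.

```python
def utf8_len(code_point: int) -> int:
    """Number of bytes needed to encode one code point in UTF-8."""
    return len(chr(code_point).encode("utf-8"))


one_byte = [utf8_len(cp) for cp in (0x01, 0x02, 0x7F)]
two_bytes = [utf8_len(cp) for cp in (0x80, 0x0100, 0x03BC)]
```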
440
441	### Don't put non-ASCII bytes in string sets in char classes
442
443	This is a sequence of characters:
444
445	/ $'\xfe\xff' /
446
447	This is a set of characters that is illegal:
448
449	/ [ $'\xfe\xff' ] / # set or sequence? It's confusing
450
451	This is a better way to write it:
452
453	/ [ \xfe \xff ] / # set of 2 chars
454
455	### Char class literals: `^ - ] \`
456
457	The literal characters `^ - ] \` are problematic because they can be confused
458	with operators.
459
460	- `^` means negation
461	- `-` means range
462	- `]` closes the character class
463	- `\` is usually literal, but GNU gawk has an extension to make it an escaping
464	operator
465
466	The Eggex-to-ERE translator is smart enough to handle cases like this:
467
468	var pat = / ['^' 'x'] /
469	# translated to [x^], not [^x] for correctness
470
471	However, cases like this are a fatal runtime error:
472
473	var pat1 = / ['a'-'^'] /
474	var pat2 = / ['a'-'-'] /
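One common way to emit such sets safely relies on the POSIX positional rules: `]` is literal when first, `^` is literal when not first, and `-` is literal when first or last. The sketch below applies those rules to a set of single characters. `ere_bracket` is a hypothetical helper, not the Oils translator, and it does not handle ranges.

```python
def ere_bracket(chars: str) -> str:
    """Build a POSIX bracket expression matching exactly the given chars."""
    unique = list(dict.fromkeys(chars))  # drop duplicates, keep order
    if not unique:
        raise ValueError("empty character set")

    ordinary = [c for c in unique if c not in "]^-"]
    body = []
    if "]" in unique:
        body.append("]")  # literal only when it comes first
    body.extend(ordinary)
    if "^" in unique:
        body.append("^")  # literal only when it does not come first
    if "-" in unique:
        body.append("-")  # literal when it comes last

    if body[0] == "^":
        if len(body) == 1:
            raise ValueError("a lone '^' cannot be a bracket expression")
        # Only '-' can follow '^' here, so this produces '-^'.
        body = body[1:] + ["^"]
    return "[" + "".join(body) + "]"
```

For the set in `pat` above, this produces `[x^]`, matching the translation described in the comment.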
475
476	## Critiques
477
478	### Regexes Are Hard To Read
479
480	... because the same symbol can mean many things.
481
482	`^` could mean:
483
484	- Start of the string/line
485	- Negated character class like `[^abc]`
486	- Literal character `^` like `[abc^]`
487
488	`\` is used in:
489
490	- Character classes like `\w` or `\d`
491	- Zero-width assertions like `\b`
492	- Escaped characters like `\n`
493	- Quoted characters like `\+`
494
495	`?` could mean:
496
497	- optional: `a?`
498	- lazy match: `a+?`
499	- some other kind of grouping:
500	  - `(?P<named>\d+)`
501	  - `(?:noncapturing)`
502
503	With egg expressions, each construct has a distinct syntax.
504
505	### YSH is Shorter Than Bash
506
507	Bash:
508
509	if [[ $x =~ [[:digit:]]+ ]]; then
510	echo 'x looks like a number'
511	fi
512
513	Compare with YSH:
514
515	if (x ~ /digit+/) {
516	echo 'x looks like a number'
517	}
518
519	### ... and Perl
520
521	Perl:
522
523	$x =~ /\d+/
524
525	YSH:
526
527	x ~ /d+/
528
529
530	The Perl expression has three more punctuation characters:
531
532	- YSH doesn't require sigils in expression mode
533	- The match operator is `~`, not `=~`
534	- Named character classes are unadorned like `d`. If that's too short, you can
535	also write `digit`.
536
537	## Design Notes
538
539	### Eggexes In Other Languages
540
541	The eggex syntax can be incorporated into other tools and shells. It's
542	designed to be separate from YSH -- hence the separate name.
543
544	Notes:
545
546	- Single-quoted string literals should disallow internal backslashes, and
547	treat all other characters literally. Instead, users can write `/ 'foo' \t
548	'sq' \' bar \n /`, i.e. implicit concatenation of strings and
549	characters, described above.
550	- To make eggexes portable between languages, don't use the host language's
551	syntax for string literals (at least for single-quoted strings).
552
553	### Backward Compatibility
554
555	Eggexes aren't backward compatible in general, but they retain some legacy
556	operators like `^ . $` to ease the transition. These expressions are valid
557	eggexes and valid POSIX EREs:
558
559	.*
560	^[0-9]+$
561	^.{1,3}\|[0-9][0-9]?$
562
563	## FAQ
564
565	### The Name Sounds Funny.
566
567	If "eggex" sounds too much like "regex" to you, simply say "egg expression".
568	It won't be confused with "regular expression" or "regex".
569
570	### How Do Eggexes Compare with [Raku Regexes][raku-regex] and the [Rosie Pattern Language][rosie]?
571
572	All three languages support pattern composition and have quoted literals. And
573	they have the goal of improving upon Perl 5 regex syntax, which has made its
574	way into every major programming language (Python, Java, C++, etc.)
575
576	The main difference is that Eggexes are meant to be used with existing
577	regex engines. For example, you translate them to a POSIX ERE, which is
578	executed by `egrep` or `awk`. Or you translate them to a Perl-like syntax and
579	use them in Python, JavaScript, Java, or C++ programs.
580
581	Raku (formerly Perl 6) and Rosie have their own engines that are more powerful
582	than PCRE, Python, etc. That means they cannot be used this way.
583
584	[rosie]: https://rosie-lang.org/
585
586	[raku-regex]: https://docs.raku.org/language/regexes
587
588	### What About Eggex versus Parsing Expression Grammars? (PEGs)
589
590	The short answer is that they can be complementary: PEGs are closer to
591	parsing, while eggex and [regular languages]($xref:regular-language) are
592	closer to lexing. Related:
593
594	- [When Are Lexer Modes Useful?](https://www.oilshell.org/blog/2017/12/17.html)
595	- [Why Lexing and Parsing Should Be
596	Separate](https://github.com/oilshell/oil/wiki/Why-Lexing-and-Parsing-Should-Be-Separate) (wiki)
597
598	The PEG model is more resource intensive, but it can recognize more languages,
599	and it can recognize recursive structure (trees).
600
601	### Why Don't `dot`, `%start`, and `%end` Have More Precise Names?
602
603	Because the meanings of `.` `^` and `$` are usually affected by regex engine
604	flags, like `dotall`, `multiline`, and `unicode`.
605
606	As a result, the names mean nothing more than "however your regex engine
607	interprets `.` `^` and `$`".
608
609	As mentioned in the "Design Philosophy" section above, eggex only does a superficial,
610	one-to-one translation. It doesn't understand the details of which characters
611	will be matched under which engine.
612
613	### Where Do I Send Feedback?
614
615	Eggexes are implemented in YSH, but not yet set in stone.
616
617	Please try them, as described in [this
618	post](http://www.oilshell.org/blog/2019/08/22.html) and the
619	[README]($oils-src:README.md), and send us feedback!
620
621	You can create a new post on [/r/oilshell](https://www.reddit.com/r/oilshell/)
622	or a new message on `#oil-discuss` on <https://oilshell.zulipchat.com/> (log in
623	with GitHub, etc.).