---
default_highlighter: oils-sh
---

Egg Expressions (YSH Regexes)
=============================

YSH has a new syntax for patterns, which appears between the `/ /` delimiters:

    if (mystr ~ /d+ '.' d+/) {
      echo 'mystr looks like a number N.M'
    }

These patterns are intended to be familiar, but they differ from POSIX or Perl
expressions in important ways. So we call them *eggexes* rather than
*regexes*!

<!-- cmark.py expands this -->
<div id="toc">
</div>

## Why Invent a New Language?

- Eggexes let you name **subpatterns** and compose them, which makes them more
  readable and testable.
- Their **syntax** is vastly simpler because literal characters are **quoted**,
  and operators are not. For example, `^` no longer means three totally
  different things. See the critique at the end of this doc.
- bash and awk use the limited and verbose POSIX ERE syntax, while eggexes are
  more expressive and (in some cases) Perl-like.
- They're designed to be **translated to any regex dialect**. Right now, the
  YSH shell translates them to ERE so you can use them with common Unix tools:
  - `egrep` (`grep -E`)
  - `awk`
  - GNU `sed --regexp-extended`
  - PCRE syntax is the second most important target.
- They're **statically parsed** in YSH, so:
  - You can get **syntax errors** at parse time. In contrast, if you embed a
    regex in a string, you don't get syntax errors until runtime.
  - The eggex is part of the [lossless syntax tree][], which means you can do
    linting, formatting, and refactoring on eggexes, just like any other type
    of code.
- Eggexes support **regular languages** in the mathematical sense, whereas
  regexes are **confused** about the issue. All nonregular eggex extensions
  are prefixed with `!!`, so you can visually audit them for [catastrophic
  backtracking][backtracking]. (Russ Cox, author of the RE2 engine, [has
  written extensively](https://swtch.com/~rsc/regexp/) on this issue.)
- Eggexes are more fun than regexes!

[backtracking]: https://blog.codinghorror.com/regex-performance/

[lossless syntax tree]: http://www.oilshell.org/blog/2017/02/11.html

### Example of Pattern Reuse

Here's a longer example:

    # Define a subpattern. 'digit' and 'd' are the same.
    $ var D = / digit{1,3} /

    # Use the subpattern
    $ var ip_pat = / D '.' D '.' D '.' D /

    # This eggex compiles to an ERE
    $ echo $ip_pat
    [[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}

This means you can use it in a very simple way:

    $ egrep $ip_pat foo.txt

TODO: You should also be able to inline patterns like this:

    egrep $/d+/ foo.txt

### Design Philosophy

- Eggexes can express a **superset** of POSIX and Perl syntax.
- The language is designed for "dumb", one-to-one, **syntactic** translations.
  That is, translation doesn't rely on understanding the **semantics** of
  regexes. This is because regex implementations have many corner cases and
  incompatibilities, with regard to Unicode, `NUL` bytes, etc.

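To make "dumb, syntactic translation" concrete, here is a minimal Python
sketch that walks a tiny pattern tree and emits an ERE string. It is an
illustration only, not how YSH is implemented: the node types `Lit`,
`CharClass`, `Repeat`, and `Seq`, and the function `to_ere`, are hypothetical
names invented for this example.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Lit:
    """A quoted literal such as '.' (hypothetical node type)."""
    text: str


@dataclass(frozen=True)
class CharClass:
    """A named character class such as digit."""
    name: str


@dataclass(frozen=True)
class Repeat:
    """A bounded repetition such as digit{1,3}."""
    child: "Node"
    lo: int
    hi: int


@dataclass(frozen=True)
class Seq:
    """Several patterns matched one after another."""
    children: Tuple["Node", ...]


Node = Union[Lit, CharClass, Repeat, Seq]

# Characters that are operators in ERE and must be escaped in literals.
ERE_SPECIAL = set(r'\.[]{}()*+?^$|')


def to_ere(node: Node) -> str:
    """Translate one node at a time; no knowledge of regex semantics needed."""
    if isinstance(node, Lit):
        return ''.join('\\' + c if c in ERE_SPECIAL else c for c in node.text)
    if isinstance(node, CharClass):
        return '[[:%s:]]' % node.name
    if isinstance(node, Repeat):
        # Note: no group is inserted around the child (a "dumb" translation).
        return '%s{%d,%d}' % (to_ere(node.child), node.lo, node.hi)
    if isinstance(node, Seq):
        return ''.join(to_ere(child) for child in node.children)
    raise TypeError('unknown node: %r' % (node,))


# The subpattern is reused as a tree, not pasted in as a string.
D = Repeat(CharClass('digit'), 1, 3)
ip_pat = Seq((D, Lit('.'), D, Lit('.'), D, Lit('.'), D))
print(to_ere(ip_pat))
```

Each node maps to one piece of output text, so the translator never has to
reason about what the pattern matches. The flip side is that it won't add
grouping for you, which is why the ERE limitations later in this doc exist.
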
### The Expression Language Is Consistent

Eggexes have a consistent syntax:

- Single characters are unadorned, in lowercase: `dot`, `space`, or `s`
- A sequence of multiple characters looks like `'lit'`, `$var`, etc.
- Constructs that match **zero** characters look like `%start`, `%word_end`, etc.
- Entire subpatterns (which may contain alternation, repetition, etc.) are in
  uppercase like `HexDigit`. Important: these are **spliced** as syntax trees,
  not strings, so you **don't** need to think about quoting.

For example, it's easy to see that these patterns all match **three** characters:

    / d d d /
    / digit digit digit /
    / dot dot dot /
    / word space word /
    / 'ab' space /
    / 'abc' /

And that these patterns match **two**:

    / %start w w /
    / %start 'if' /
    / d d %end /

And that you have to look up the definition of `HexDigit` to know how many
characters this matches:

    / %start HexDigit %end /

Constructs like `. ^ $ \< \>` are deprecated because they break these rules.

## Expression Primitives

### `.` Is Now `dot`

But `.` is still accepted. It usually matches any character except a newline,
although this changes based on flags (e.g. `dotall`, `unicode`).

### Classes Are Unadorned: `word`, `w`, `alnum`

We accept both Perl and POSIX classes.

- Perl:
  - `d` or `digit`
  - `s` or `space`
  - `w` or `word`
- POSIX
  - `alpha`, `alnum`, ...

### Zero-width Assertions Look Like `%this`

- POSIX
  - `%start` is `^`
  - `%end` is `$`
- PCRE:
  - `%input_start` is `\A`
  - `%input_end` is `\z`
  - `%last_line_end` is `\Z`
- GNU ERE extensions:
  - `%word_start` is `\<`
  - `%word_end` is `\>`

### Single-Quoted Strings

- `'hello *world*'` becomes a regex-escaped string

Note: instead of using double-quoted strings like `"xyz $var"`, you can splice
a string into an eggex:

    / 'xyz ' @var /

## Compound Expressions

### Sequence and Alternation Are Unchanged

- `x y` matches `x` and `y` in sequence
- `x | y` matches `x` or `y`

You can also write a more Pythonic alternative: `x or y`.

### Repetition Is Unchanged In Common Cases, and Better in Rare Cases

Repetition is just like POSIX ERE or Perl:

- `x?`, `x+`, `x*`
- `x{3}`, `x{1,3}`

We've reserved syntactic space for PCRE and Python variants:

- lazy/non-greedy: `x{L +}`, `x{L 3,4}`
- possessive: `x{P +}`, `x{P 3,4}`

### Negation Consistently Uses !

You can negate named char classes:

    / !digit /

and char class literals:

    / ![ a-z A-Z ] /

Sometimes you can do both:

    / ![ !digit ] /  # translates to /[^\D]/ in PCRE
                     # error in ERE because it can't be expressed


You can also negate "regex modifiers" / compilation flags:

    / word ; ignorecase /   # flag on
    / word ; !ignorecase /  # flag off
    / word ; !i /           # abbreviated

In contrast, regexes have many confusing syntaxes for negation:

    [^abc] vs. [abc]
    [[^:digit:]] vs. [[:digit:]]

    \D vs. \d

    /\w/-i vs /\w/i

### Splice Other Patterns `@var_name` or `UpperCaseVarName`

This allows you to reuse patterns. Using uppercase variables:

    var D = / digit{3} /

    var ip_addr = / D '.' D '.' D '.' D /

Using normal variables:

    var part = / digit{3} /

    var ip_addr = / @part '.' @part '.' @part '.' @part /

This is similar to how `lex` and `re2c` work.

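For contrast, here is the pitfall you hit when subpatterns are reused by
pasting regex **strings** together, sketched in Python with the standard `re`
module. The variable names are made up for this example, and it says nothing
about how YSH translates the equivalent eggex. It only shows why splicing
trees rather than strings is attractive.

```python
import re

# A subpattern with alternation, stored as a plain regex string.
alt = 'ab|cd'

# Naive string concatenation: '+' binds only to the final 'd'.
naive = alt + '+'               # the regex is now 'ab|cd+'

# To repeat the whole subpattern you must remember to add a group yourself.
grouped = '(?:' + alt + ')+'    # the regex is now '(?:ab|cd)+'

print(re.fullmatch(naive, 'abab') is not None)    # False
print(re.fullmatch(naive, 'cddd') is not None)    # True: only 'd' is repeated
print(re.fullmatch(grouped, 'abab') is not None)  # True
```

With strings, the meaning of the reused pattern depends on what it's pasted
next to. With a tree, the subpattern stays a single unit wherever it's used.
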
### Group With `()`

Parentheses are used for precedence:

    ('foo' | 'bar')+

See note below: When translating to POSIX ERE, grouping becomes a capturing
group. POSIX ERE has no non-capturing groups.


### Capture with `<capture ...>`

Here's a positional capture:

    <capture d+>           # Becomes _group(1)

Add a variable after `as` for named capture:

    <capture d+ as month>  # Becomes _group('month')

You can also add type conversion functions:

    <capture d+ : int>          # _group(1) returns an Int, not Str
    <capture d+ as month: int>  # _group('month') returns an Int, not Str

### Character Class Literals Use `[]`

Example:

    [ a-f 'A'-'F' \xFF \u0100 \n \\ \' \" \0 ]

Terms:

- Ranges: `a-f` or `'A' - 'F'`
- Literals: `\n`, `\x01`, `\u0100`, etc.
- Sets specified as strings: `'abc'`

Only letters, numbers, and the underscore may be unquoted:

    /['a'-'f' 'A'-'F' '0'-'9']/
    /[a-f A-F 0-9]/              # Equivalent to the above

    /['!' - ')']/                # Correct range
    /[!-)]/                      # Syntax Error

Ranges must be separated by spaces:

No:

    /[a-fA-F0-9]/

Yes:

    /[a-f A-F 0-9]/

### Backtracking Constructs Use `!!` (Discouraged)

If you want to translate to PCRE, you can use these.

    !!REF 1
    !!REF name

    !!AHEAD( d+ )
    !!NOT_AHEAD( d+ )
    !!BEHIND( d+ )
    !!NOT_BEHIND( d+ )

    !!ATOMIC( d+ )

Since they all begin with `!!`, you can visually audit your code for potential
performance problems.

## Outside the Expression language

### Flags and Translation Preferences (`;`)

Flags or "regex modifiers" appear after a semicolon:

    / digit+ ; i /  # ignore case

A translation preference is specified after a second semicolon:

    / digit+ ; ; ERE /     # translates to [[:digit:]]+
    / digit+ ; ; python /  # could translate to \d+

Flags and translation preferences together:

    / digit+ ; ignorecase ; python /  # could translate to (?i)\d+

In Oils, the following flags are currently supported:

#### `reg_icase` / `i` (Ignore Case)

Use this flag to ignore case when matching. For example, `/'foo'; i/` matches
'FOO', but `/'foo'/` doesn't.

#### `reg_newline` (Multiline)

With this flag, `%end` will match before a newline and `%start` will match
after a newline.

    = u'abc123\n' ~ / digit %end ; reg_newline /    # true
    = u'abc\n123' ~ / %start digit ; reg_newline /  # true

Without the flag, `%start` and `%end` only match at the start or end of the
string, respectively.

    = u'abc123\n' ~ / digit %end /                  # false
    = u'abc\n123' ~ / %start digit /                # false

With the flag, `dot` and negated classes like `![abc]` also don't match a
newline.

    = u'\n' ~ / . ; reg_newline /                   # false
    = u'\n' ~ / !digit ; reg_newline /              # false

Without this flag, the newline `\n` is treated as an ordinary character.

    = u'\n' ~ / . /                                 # true
    = u'\n' ~ / !digit /                            # true

### Multiline Syntax

You can spread regexes over multiple lines and add comments:

    var x = ///
      digit{4}  # year e.g. 2001
      '-'
      digit{2}  # month e.g. 06
      '-'
      digit{2}  # day e.g. 31
    ///

(Not yet implemented in YSH.)

### The YSH API

See the [YSH regex API](ysh-regex-api.html) for details.

In summary, YSH has Perl-like conveniences with a `~` operator:

    var s = 'on 04-01, 10-31'
    var pat = /<capture d+ as month> '-' <capture d+ as day>/

    if (s ~ pat) {             # search for the pattern
      echo $[_group('month')]  # => 04
    }

It also has an explicit and powerful Python-like API with the `search()` and
`leftMatch()` methods on strings.

    var m = s => search(pat, pos=8)  # start searching at a position
    if (m) {
      echo $[m => group('month')]    # => 10
    }

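If you know Python, the analogy may help. The sketch below does the same two
searches with Python's standard `re` module, using a hand-written Python regex
in place of the eggex. It illustrates the shape of the API only; it is not
output generated by YSH.

```python
import re

s = 'on 04-01, 10-31'

# Hand-written Python equivalent of the eggex with two named captures.
pat = re.compile(r'(?P<month>\d+)-(?P<day>\d+)')

m = pat.search(s)          # search from the start of the string
print(m.group('month'))    # 04

m = pat.search(s, 8)       # start searching at index 8
print(m.group('month'))    # 10
print(m.group('day'))      # 31
```
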
### Language Reference

- See bottom of the [YSH Expression Grammar]($oils-src:ysh/grammar.pgen2) for
  the concrete syntax.
- See the bottom of [frontend/syntax.asdl]($oils-src:frontend/syntax.asdl) for
  the abstract syntax.

## Usage Notes

### Use character literals rather than C-Escaped strings

No:

    / $'foo\tbar' /  # Match 7 characters including a tab, but it's hard to read
    / r'foo\tbar' /  # The string must contain 8 chars including '\' and 't'

Yes:

    # Instead, take advantage of char literals and implicit regex concatenation
    / 'foo' \t 'bar' /
    / 'foo' \\ 'tbar' /


## POSIX ERE Limitations

### Repetition of Strings Requires Grouping

Repetitions like `* + ?` apply only to the last character, so literal strings
need extra grouping:


No:

    'foo'+

Yes:

    <capture 'foo'>+

Also OK:

    ('foo')+  # this is a CAPTURING group in ERE

This is necessary because ERE doesn't have non-capturing groups like Perl's
`(?:...)`, and Eggex only does "dumb" translations. It doesn't silently insert
constructs that change the meaning of the pattern.

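The underlying regex behavior is easy to check. This Python sketch uses the
standard `re` module, whose `+` and `()` behave like ERE's for these simple
patterns, to show what the ungrouped and grouped forms match. The regex
strings are written by hand; they are not generated by YSH.

```python
import re

# Ungrouped: '+' repeats only the last character, 'o'.
print(re.fullmatch(r'foo+', 'fooo') is not None)      # True
print(re.fullmatch(r'foo+', 'foofoo') is not None)    # False

# Grouped: the whole string 'foo' is repeated ...
m = re.fullmatch(r'(foo)+', 'foofoo')
print(m is not None)                                  # True

# ... but the group captures, which is the side effect that
# a translator would otherwise be introducing silently.
print(m.group(1))                                     # foo
```
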
### Unicode char literals are limited in range

ERE can't represent this set of 1 character reliably:

    / [ \u{0100} ] /     # This char is 2 bytes encoded in UTF-8

These sets are accepted:

    / [ \u{1} \u{2} ] /  # set of 2 chars
    / [ \x01 \x02 ] /    # set of 2 bytes

They happen to be identical when translated to ERE, but may not be when
translated to PCRE.

### Don't put non-ASCII bytes in string sets in char classes

This is a sequence of characters:

    / $'\xfe\xff' /

This is a **set** of characters that is illegal:

    / [ $'\xfe\xff' ] /  # set or sequence? It's confusing

This is a better way to write it:

    / [ \xfe \xff ] /    # set of 2 chars

### Char class literals: `^ - ] \`

The literal characters `^ - ] \` are problematic because they can be confused
with operators.

- `^` means negation
- `-` means range
- `]` closes the character class
- `\` is usually literal, but GNU gawk has an extension to make it an escaping
  operator

The Eggex-to-ERE translator is smart enough to handle cases like this:

    var pat = / ['^' 'x'] /
    # translated to [x^], not [^x] for correctness

However, cases like this are a fatal runtime error:

    var pat1 = / ['a'-'^'] /
    var pat2 = / ['a'-'-'] /

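To show the kind of reordering involved, here is a small Python sketch that
builds a POSIX bracket expression from a set of single characters, using the
standard POSIX placement rules: `]` is literal only in first position, `^` is
literal anywhere but first, and `-` is literal in first or last position. The
function `ere_bracket` is a hypothetical name, and this is one possible
strategy, not necessarily the one YSH uses. It handles sets of single
characters only (no ranges) and ignores the gawk backslash extension.

```python
def ere_bracket(chars):
    """Return a POSIX bracket expression matching exactly the given chars."""
    members = set(chars)
    if not members:
        raise ValueError('empty character set')

    out = []
    if ']' in members:
        out.append(']')                  # literal only in first position
    out.extend(sorted(members - set('^]-')))

    if '^' in members:
        if not out and '-' in members:
            return '[-^]'                # put '-' first so that '^' isn't
        if not out:
            raise ValueError("a lone '^' can't be written in brackets")
        out.append('^')                  # literal anywhere but first
    if '-' in members:
        out.append('-')                  # literal in last position
    return '[' + ''.join(out) + ']'


print(ere_bracket('^x'))      # [x^]
print(ere_bracket(']-^a'))    # []a^-]
print(ere_bracket('-^'))      # [-^]
```

Ranges whose endpoints are `^` or `-` have no safe placement like this, which
is consistent with the translator rejecting them.
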
## Critiques

### Regexes Are Hard To Read

... because the **same symbol can mean many things**.

`^` could mean:

- Start of the string/line
- Negated character class like `[^abc]`
- Literal character `^` like `[abc^]`

`\` is used in:

- Character classes like `\w` or `\d`
- Zero-width assertions like `\b`
- Escaped characters like `\n`
- Quoted characters like `\+`

`?` could mean:

- optional: `a?`
- lazy match: `a+?`
- some other kind of grouping:
  - `(?P<named>\d+)`
  - `(?:noncapturing)`

With egg expressions, each construct has a **distinct syntax**.

### YSH is Shorter Than Bash

Bash:

    if [[ $x =~ [[:digit:]]+ ]]; then
      echo 'x looks like a number'
    fi

Compare with YSH:

    if (x ~ /digit+/) {
      echo 'x looks like a number'
    }

### ... and Perl

Perl:

    $x =~ /\d+/

YSH:

    x ~ /d+/


The Perl expression has three more punctuation characters:

- YSH doesn't require sigils in expression mode
- The match operator is `~`, not `=~`
- Named character classes are unadorned like `d`. If that's too short, you can
  also write `digit`.

## Design Notes

### Eggexes In Other Languages

The eggex syntax can be incorporated into other tools and shells. It's
designed to be separate from YSH -- hence the separate name.

Notes:

- Single quoted string literals should **disallow** internal backslashes, and
  treat all other characters literally. Instead, users can write `/ 'foo' \t
  'sq' \' 'bar' \n /`, i.e. implicit concatenation of strings and
  characters, described above.
- To make eggexes portable between languages, don't use the host language's
  syntax for string literals (at least for single-quoted strings).

### Backward Compatibility

Eggexes aren't backward compatible in general, but they retain some legacy
operators like `^ . $` to ease the transition. These expressions are valid
eggexes **and** valid POSIX EREs:

    .*
    ^[0-9]+$
    ^.{1,3}|[0-9][0-9]?$

## FAQ

### The Name Sounds Funny.

If "eggex" sounds too much like "regex" to you, simply say "egg expression".
It won't be confused with "regular expression" or "regex".

### How Do Eggexes Compare with [Raku Regexes][raku-regex] and the [Rosie Pattern Language][rosie]?

All three languages support pattern composition and have quoted literals. And
they have the goal of improving upon Perl 5 regex syntax, which has made its
way into every major programming language (Python, Java, C++, etc.).

The main difference is that Eggexes are meant to be used with **existing**
regex engines. For example, you translate them to a POSIX ERE, which is
executed by `egrep` or `awk`. Or you translate them to a Perl-like syntax and
use them in Python, JavaScript, Java, or C++ programs.

Raku (formerly Perl 6) and Rosie have their **own engines** that are more
powerful than PCRE, Python, etc. That means they **cannot** be used this way.

[rosie]: https://rosie-lang.org/

[raku-regex]: https://docs.raku.org/language/regexes

### What About Eggex versus Parsing Expression Grammars? (PEGs)

The short answer is that they can be complementary: PEGs are closer to
**parsing**, while eggex and [regular languages]($xref:regular-language) are
closer to **lexing**. Related:

- [When Are Lexer Modes Useful?](https://www.oilshell.org/blog/2017/12/17.html)
- [Why Lexing and Parsing Should Be
  Separate](https://github.com/oilshell/oil/wiki/Why-Lexing-and-Parsing-Should-Be-Separate) (wiki)

The PEG model is more resource intensive, but it can recognize more languages,
and it can recognize recursive structure (trees).

### Why Don't `dot`, `%start`, and `%end` Have More Precise Names?

Because the meanings of `.` `^` and `$` are usually affected by regex engine
flags, like `dotall`, `multiline`, and `unicode`.

As a result, the names mean nothing more than "however your regex engine
interprets `.` `^` and `$`".

As mentioned in the "Design Philosophy" section above, eggex only does a
superficial, one-to-one translation. It doesn't understand the details of
which characters will be matched under which engine.

### Where Do I Send Feedback?

Eggexes are implemented in YSH, but not yet set in stone.

Please try them, as described in [this
post](http://www.oilshell.org/blog/2019/08/22.html) and the
[README]($oils-src:README.md), and send us feedback!

You can create a new post on [/r/oilshell](https://www.reddit.com/r/oilshell/)
or a new message on `#oil-discuss` on <https://oilshell.zulipchat.com/> (log in
with GitHub, etc.).