1 | ---
|
2 | default_highlighter: oils-sh
|
3 | ---
|
4 |
|
5 | Egg Expressions (YSH Regexes)
|
6 | =============================
|
7 |
|
8 | YSH has a new syntax for patterns, which appears between the `/ /` delimiters:
|
9 |
|
10 | if (mystr ~ /d+ '.' d+/) {
|
11 | echo 'mystr looks like a number N.M'
|
12 | }
|
13 |
|
14 | These patterns are intended to be familiar, but they differ from POSIX or Perl
|
15 | expressions in important ways. So we call them *eggexes* rather than
|
16 | *regexes*!
|
17 |
|
18 | <!-- cmark.py expands this -->
|
19 | <div id="toc">
|
20 | </div>
|
21 |
|
22 | ## Why Invent a New Language?
|
23 |
|
24 | - Eggexes let you name **subpatterns** and compose them, which makes them more
|
25 | readable and testable.
|
26 | - Their **syntax** is vastly simpler because literal characters are **quoted**,
|
27 | and operators are not. For example, `^` no longer means three totally
|
28 | different things. See the critique at the end of this doc.
|
29 | - bash and awk use the limited and verbose POSIX ERE syntax, while eggexes are
|
30 | more expressive and (in some cases) Perl-like.
|
31 | - They're designed to be **translated to any regex dialect**. Right now, the
|
32 | YSH shell translates them to ERE so you can use them with common Unix tools:
|
33 | - `egrep` (`grep -E`)
|
34 | - `awk`
|
35 | - GNU `sed --regexp-extended`
|
36 | - PCRE syntax is the second most important target.
|
37 | - They're **statically parsed** in YSH, so:
|
38 | - You can get **syntax errors** at parse time. In contrast, if you embed a
|
39 | regex in a string, you don't get syntax errors until runtime.
|
40 | - The eggex is part of the [lossless syntax tree][], which means you can do
|
41 | linting, formatting, and refactoring on eggexes, just like any other type
|
42 | of code.
|
43 | - Eggexes support **regular languages** in the mathematical sense, whereas
|
44 | regexes are **confused** about the issue. All nonregular eggex extensions
|
45 | are prefixed with `!!`, so you can visually audit them for [catastrophic
|
46 | backtracking][backtracking]. (Russ Cox, author of the RE2 engine, [has
|
47 | written extensively](https://swtch.com/~rsc/regexp/) on this issue.)
|
48 | - Eggexes are more fun than regexes!
|
49 |
|
50 | [backtracking]: https://blog.codinghorror.com/regex-performance/
|
51 |
|
52 | [lossless syntax tree]: http://www.oilshell.org/blog/2017/02/11.html
|
53 |
|
54 | ### Example of Pattern Reuse
|
55 |
|
56 | Here's a longer example:
|
57 |
|
58 | # Define a subpattern. 'digit' and 'd' are the same.
|
59 | $ var D = / digit{1,3} /
|
60 |
|
61 | # Use the subpattern
|
62 | $ var ip_pat = / D '.' D '.' D '.' D /
|
63 |
|
64 | # This eggex compiles to an ERE
|
65 | $ echo $ip_pat
|
66 | [[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}
|
67 |
|
68 | This means you can use it in a very simple way:
|
69 |
|
70 | $ egrep $ip_pat foo.txt
|
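As a rough cross-check of the translated pattern, here is a minimal Python sketch. Python's `re` module does not support POSIX classes such as `[[:digit:]]`, so the sketch assumes `[0-9]` as a stand-in. The names `D`, `IP_PAT`, and `looks_like_ip` are illustrative only and are not part of YSH.

```python
import re

# Assumption: [0-9] stands in for [[:digit:]], which Python's re lacks.
D = r"[0-9]{1,3}"

# Compose the subpattern as a string. This is only safe here because D
# contains no alternation; eggex splices syntax trees instead of strings.
IP_PAT = re.compile(r"\.".join([D] * 4))


def looks_like_ip(text: str) -> bool:
    """Return True if the whole string is four dot-separated digit groups."""
    return IP_PAT.fullmatch(text) is not None


print(looks_like_ip("192.168.0.1"))      # four groups of 1-3 digits
print(looks_like_ip("1234.1.1.1"))       # first group has too many digits
print(looks_like_ip("999.999.999.999"))  # only digit counts are checked
```

Note that `egrep` searches within each line, which corresponds to `IP_PAT.search` rather than `fullmatch`; the sketch uses `fullmatch` only to make the digit-count limits visible.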
71 |
|
72 | TODO: You should also be able to inline patterns like this:
|
73 |
|
74 | egrep $/d+/ foo.txt
|
75 |
|
76 | ### Design Philosophy
|
77 |
|
78 | - Eggexes can express a **superset** of POSIX and Perl syntax.
|
79 | - The language is designed for "dumb", one-to-one, **syntactic** translations.
|
80 | That is, translation doesn't rely on understanding the **semantics** of
|
81 | regexes. This is because regex implementations have many corner cases and
|
82 | incompatibilities, with regard to Unicode, `NUL` bytes, etc.
|
83 |
|
84 | ### The Expression Language Is Consistent
|
85 |
|
86 | Eggexes have a consistent syntax:
|
87 |
|
88 | - Single characters are unadorned, in lowercase: `dot`, `space`, or `s`
|
89 | - A sequence of multiple characters looks like `'lit'`, `$var`, etc.
|
90 | - Constructs that match **zero** characters look like `%start`, `%word_end`, etc.
|
91 | - Entire subpatterns (which may contain alternation, repetition, etc.) are in
|
92 | uppercase like `HexDigit`. Important: these are **spliced** as syntax trees,
|
93 | not strings, so you **don't** need to think about quoting.
|
94 |
|
95 | For example, it's easy to see that these patterns all match **three** characters:
|
96 |
|
97 | / d d d /
|
98 | / digit digit digit /
|
99 | / dot dot dot /
|
100 | / word space word /
|
101 | / 'ab' space /
|
102 | / 'abc' /
|
103 |
|
104 | And that these patterns match **two**:
|
105 |
|
106 | / %start w w /
|
107 | / %start 'if' /
|
108 | / d d %end /
|
109 |
|
110 | And that you have to look up the definition of `HexDigit` to know how many
|
111 | characters this matches:
|
112 |
|
113 | / %start HexDigit %end /
|
114 |
|
115 | Constructs like `. ^ $ \< \>` are deprecated because they break these rules.
|
116 |
|
117 | ## Expression Primitives
|
118 |
|
119 | ### `.` Is Now `dot`
|
120 |
|
121 | But `.` is still accepted. It usually matches any character except a newline,
|
122 | although this changes based on flags (e.g. `dotall`, `unicode`).
|
123 |
|
124 | ### Classes Are Unadorned: `word`, `w`, `alnum`
|
125 |
|
126 | We accept both Perl and POSIX classes.
|
127 |
|
128 | - Perl:
|
129 | - `d` or `digit`
|
130 | - `s` or `space`
|
131 | - `w` or `word`
|
132 | - POSIX
|
133 | - `alpha`, `alnum`, ...
|
134 |
|
135 | ### Zero-width Assertions Look Like `%this`
|
136 |
|
137 | - POSIX
|
138 | - `%start` is `^`
|
139 | - `%end` is `$`
|
140 | - PCRE:
|
141 | - `%input_start` is `\A`
|
142 | - `%input_end` is `\z`
|
143 | - `%last_line_end` is `\Z`
|
144 | - GNU ERE extensions:
|
145 | - `%word_start` is `\<`
|
146 | - `%word_end` is `\>`
|
147 |
|
148 | ### Single-Quoted Strings
|
149 |
|
150 | - `'hello *world*'` becomes a regex-escaped string
|
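To illustrate what "regex-escaped" means, the following Python sketch uses `re.escape()` as an analogue. It is not the YSH implementation, and the variable names are made up for the example.

```python
import re

literal = "hello *world*"

# re.escape() backslash-escapes regex operators so the text matches itself.
escaped = re.escape(literal)

# Used unescaped, '*' acts as a repetition operator instead of a literal.
unescaped_match = re.fullmatch(literal, literal)
escaped_match = re.fullmatch(escaped, literal)

print(unescaped_match is None)    # True: the raw string does not match itself
print(escaped_match is not None)  # True: the escaped form does
```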
151 |
|
152 | Note: instead of using double-quoted strings like `"xyz $var"`, you can splice
|
153 | a string into an eggex:
|
154 |
|
155 | / 'xyz ' @var /
|
156 |
|
157 | ## Compound Expressions
|
158 |
|
159 | ### Sequence and Alternation Are Unchanged
|
160 |
|
161 | - `x y` matches `x` and `y` in sequence
|
162 | - `x | y` matches `x` or `y`
|
163 |
|
164 | You can also write a more Pythonic alternative: `x or y`.
|
165 |
|
166 | ### Repetition Is Unchanged In Common Cases, and Better in Rare Cases
|
167 |
|
168 | Repetition is just like POSIX ERE or Perl:
|
169 |
|
170 | - `x?`, `x+`, `x*`
|
171 | - `x{3}`, `x{1,3}`
|
172 |
|
173 | We've reserved syntactic space for PCRE and Python variants:
|
174 |
|
175 | - lazy/non-greedy: `x{L +}`, `x{L 3,4}`
|
176 | - possessive: `x{P +}`, `x{P 3,4}`
|
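For readers unfamiliar with the distinction, here is a small Python sketch of greedy versus lazy repetition. Python spells the lazy form `+?`; the assumption here is that eggex's reserved `x{L +}` would correspond to it. Possessive repetition is left out because it needs a recent Python version.

```python
import re

text = "<a><b>"

# Greedy: '.+' consumes as much as possible, then backs off to the last '>'.
greedy = re.match(r"<.+>", text).group()

# Lazy: '.+?' consumes as little as possible, stopping at the first '>'.
lazy = re.match(r"<.+?>", text).group()

print(greedy)  # <a><b>
print(lazy)    # <a>
```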
177 |
|
178 | ### Negation Consistently Uses !
|
179 |
|
180 | You can negate named char classes:
|
181 |
|
182 | / !digit /
|
183 |
|
184 | and char class literals:
|
185 |
|
186 | / ![ a-z A-Z ] /
|
187 |
|
188 | Sometimes you can do both:
|
189 |
|
190 | / ![ !digit ] / # translates to /[^\D]/ in PCRE
|
191 | # error in ERE because it can't be expressed
|
192 |
|
193 |
|
194 | You can also negate "regex modifiers" / compilation flags:
|
195 |
|
196 | / word ; ignorecase / # flag on
|
197 | / word ; !ignorecase / # flag off
|
198 | / word ; !i / # abbreviated
|
199 |
|
200 | In contrast, regexes have many confusing syntaxes for negation:
|
201 |
|
202 | [^abc] vs. [abc]
|
203 | [[^:digit:]] vs. [[:digit:]]
|
204 |
|
205 | \D vs. \d
|
206 |
|
207 | /\w/-i vs /\w/i
|
208 |
|
209 | ### Splice Other Patterns `@var_name` or `UpperCaseVarName`
|
210 |
|
211 | This allows you to reuse patterns. Using uppercase variables:
|
212 |
|
213 | var D = / digit{3} /
|
214 |
|
215 | var ip_addr = / D '.' D '.' D '.' D /
|
216 |
|
217 | Using normal variables:
|
218 |
|
219 | var part = / digit{3} /
|
220 |
|
221 | var ip_addr = / @part '.' @part '.' @part '.' @part /
|
222 |
|
223 | This is similar to how `lex` and `re2c` work.
|
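The reason splicing syntax trees matters can be sketched in Python, where patterns are usually composed as strings. This shows the pitfall that tree splicing avoids; it is not how YSH is implemented, and the names are illustrative.

```python
import re

alt = "foo|bar"  # a subpattern that contains alternation

# Naive string splicing: '+' binds only to the final 'r' of 'bar'.
naive = re.compile(alt + "+")

# Wrapping the fragment in a non-capturing group restores the intent.
grouped = re.compile("(?:" + alt + ")+")

print(naive.fullmatch("foofoo") is None)            # True: not what was meant
print(naive.fullmatch("barrr") is not None)         # True: 'r' is repeated
print(grouped.fullmatch("foobarfoo") is not None)   # True
```

With string composition, the author has to remember the extra group every time; with tree splicing, the subpattern stays a single unit.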
224 |
|
225 | ### Group With `()`
|
226 |
|
227 | Parentheses are used for precedence:
|
228 |
|
229 | ('foo' | 'bar')+
|
230 |
|
231 | See note below: When translating to POSIX ERE, grouping becomes a capturing
|
232 | group. POSIX ERE has no non-capturing groups.
|
233 |
|
234 |
|
235 | ### Capture with `<capture ...>`
|
236 |
|
237 | Here's a positional capture:
|
238 |
|
239 | <capture d+> # Becomes _group(1)
|
240 |
|
241 | Add a variable after `as` for named capture:
|
242 |
|
243 | <capture d+ as month> # Becomes _group('month')
|
244 |
|
245 | You can also add type conversion functions:
|
246 |
|
247 | <capture d+ : int> # _group(1) returns an Int, not Str
|
248 | <capture d+ as month: int> # _group('month') returns an Int, not Str
|
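For comparison, this is roughly how positional capture, named capture, and type conversion look in Python's `re` module. Python has no built-in conversion syntax, so the `int()` call is done by hand. The sample string and variable names are assumptions for illustration.

```python
import re

# (?P<name>...) is Python's named-capture syntax.
pat = re.compile(r"(?P<month>\d+)-(?P<day>\d+)")
m = pat.search("due 04-01")

by_position = m.group(1)           # positional access
by_name = m.group("month")         # named access to the same group
month_int = int(m.group("month"))  # manual conversion from str to int

print(by_position, by_name, month_int)  # 04 04 4
```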
249 |
|
250 | ### Character Class Literals Use `[]`
|
251 |
|
252 | Example:
|
253 |
|
254 | [ a-f 'A'-'F' \xFF \u{03bc} \n \\ \' \" \0 ]
|
255 |
|
256 | Terms:
|
257 |
|
258 | - Ranges: `a-f` or `'A' - 'F'`
|
259 | - Literals: `\n`, `\x01`, `\u{3bc}`, etc.
|
260 | - Sets specified as strings: `'abc'`
|
261 |
|
262 | Only letters, numbers, and the underscore may be unquoted:
|
263 |
|
264 | /['a'-'f' 'A'-'F' '0'-'9']/
|
265 | /[a-f A-F 0-9]/ # Equivalent to the above
|
266 |
|
267 | /['!' - ')']/ # Correct range
|
268 | /[!-)]/ # Syntax Error
|
269 |
|
270 | Ranges must be separated by spaces:
|
271 |
|
272 | No:
|
273 |
|
274 | /[a-fA-F0-9]/
|
275 |
|
276 | Yes:
|
277 |
|
278 | /[a-f A-F 0-9]/
|
279 |
|
280 | ### Backtracking Constructs Use `!!` (Discouraged)
|
281 |
|
282 | If you want to translate to PCRE, you can use these.
|
283 |
|
284 | !!REF 1
|
285 | !!REF name
|
286 |
|
287 | !!AHEAD( d+ )
|
288 | !!NOT_AHEAD( d+ )
|
289 | !!BEHIND( d+ )
|
290 | !!NOT_BEHIND( d+ )
|
291 |
|
292 | !!ATOMIC( d+ )
|
293 |
|
294 | Since they all begin with `!!`, you can visually audit your code for potential
|
295 | performance problems.
|
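These constructs exist in Perl-like engines under different spellings. The Python sketch below shows a backreference, a lookahead, a negative lookahead, and a lookbehind. The correspondence with `!!REF`, `!!AHEAD`, `!!NOT_AHEAD`, and `!!BEHIND` is assumed from their names; the sample strings are made up.

```python
import re

# Backreference: \1 must repeat whatever group 1 captured.
doubled = re.search(r"\b(\w+) \1\b", "it is is done")

# Lookahead: match 'foo' only when digits follow; the digits are not consumed.
ahead = re.search(r"foo(?=\d+)", "foo42")

# Negative lookahead: match 'foo' only when digits do NOT follow.
not_ahead = re.search(r"foo(?!\d+)", "foo42")

# Lookbehind: Python requires a fixed-width pattern inside (?<=...).
behind = re.search(r"(?<=\$)\d+", "cost: $15")

print(doubled.group(1))   # is
print(ahead.group())      # foo
print(not_ahead is None)  # True
print(behind.group())     # 15
```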
296 |
|
297 | ## Outside the Expression Language
|
298 |
|
299 | ### Flags and Translation Preferences (`;`)
|
300 |
|
301 | Flags or "regex modifiers" appear after a semicolon:
|
302 |
|
303 | / digit+ ; i / # ignore case
|
304 |
|
305 | A translation preference is specified after a second semi-colon:
|
306 |
|
307 | / digit+ ; ; ERE / # translates to [[:digit:]]+
|
308 | / digit+ ; ; python / # could translate to \d+
|
309 |
|
310 | Flags and translation preferences together:
|
311 |
|
312 | / digit+ ; ignorecase ; python / # could translate to (?i)\d+
|
313 |
|
314 | In Oils, the following flags are currently supported:
|
315 |
|
316 | #### `reg_icase` / `i` (Ignore Case)
|
317 |
|
318 | Use this flag to ignore case when matching. For example, `/'foo'; i/` matches
|
319 | 'FOO', but `/'foo'/` doesn't.
|
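The same behavior can be sketched in Python, where the flag is written either inline as `(?i)` (the form mentioned above as a possible translation) or as `re.IGNORECASE`. This is an analogue in another engine, not the ERE translation that YSH performs.

```python
import re

with_inline_flag = re.search(r"(?i)foo", "FOO")         # inline flag
with_flag_arg = re.search(r"foo", "FOO", re.IGNORECASE)  # flag as argument
without_flag = re.search(r"foo", "FOO")                 # case-sensitive

print(with_inline_flag is not None)  # True
print(with_flag_arg is not None)     # True
print(without_flag is None)          # True
```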
320 |
|
321 | #### `reg_newline` (Multiline)
|
322 |
|
323 | With this flag, `%end` will match before a newline and `%start` will match
|
324 | after a newline.
|
325 |
|
326 | = u'abc123\n' ~ / digit %end ; reg_newline / # true
|
327 | = u'abc\n123' ~ / %start digit ; reg_newline / # true
|
328 |
|
329 | Without the flag, `%start` and `%end` match only at the start or end of the
|
330 | string, respectively.
|
331 |
|
332 | = u'abc123\n' ~ / digit %end / # false
|
333 | = u'abc\n123' ~ / %start digit / # false
|
334 |
|
335 | With this flag, `dot` and negated classes like `![abc]` no longer match a newline. Without it, the newline `\n` is treated as an ordinary character:
|
336 |
|
337 | = u'\n' ~ / . / # true
|
338 | = u'\n' ~ / !digit / # true
|
339 |
|
340 | With the flag, the same patterns do not match a newline:
|
341 |
|
342 | = u'\n' ~ / . ; reg_newline / # false
|
343 | = u'\n' ~ / !digit ; reg_newline / # false
|
344 |
|
345 | ### Multiline Syntax
|
346 |
|
347 | You can spread regexes over multiple lines and add comments:
|
348 |
|
349 | var x = ///
|
350 | digit{4} # year e.g. 2001
|
351 | '-'
|
352 | digit{2} # month e.g. 06
|
353 | '-'
|
354 | digit{2} # day e.g. 31
|
355 | ///
|
356 |
|
357 | (Not yet implemented in YSH.)
|
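Until that is available, the idea can be compared with Python's `re.VERBOSE` mode, which also allows whitespace and comments inside a pattern. The sketch below is an analogue only; `date_pat` is a name chosen for the example.

```python
import re

date_pat = re.compile(
    r"""
    \d{4}   # year e.g. 2001
    -
    \d{2}   # month e.g. 06
    -
    \d{2}   # day e.g. 31
    """,
    re.VERBOSE,  # ignore pattern whitespace and allow '#' comments
)

print(date_pat.fullmatch("2001-06-31") is not None)  # True
print(date_pat.fullmatch("2001-6-31") is None)       # True: month needs 2 digits
```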
358 |
|
359 | ### The YSH API
|
360 |
|
361 | See the [YSH regex API](ysh-regex-api.html) for details.
|
362 |
|
363 | In summary, YSH has Perl-like conveniences with an `~` operator:
|
364 |
|
365 | var s = 'on 04-01, 10-31'
|
366 | var pat = /<capture d+ as month> '-' <capture d+ as day>/
|
367 |
|
368 | if (s ~ pat) { # search for the pattern
|
369 | echo $[_group('month')] # => 04
|
370 | }
|
371 |
|
372 | It also has an explicit and powerful Python-like API with the `search()` and
|
373 | `leftMatch()` methods on strings.
|
374 |
|
375 | var m = s => search(pat, pos=8) # start searching at a position
|
376 | if (m) {
|
377 | echo $[m => group('month')] # => 10
|
378 | }
|
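For readers coming from Python, the closest counterpart is a compiled pattern's `search()` method with a start position. The sketch below reuses the string from the example above; the Python named-group syntax is a hand translation, not output from YSH.

```python
import re

s = "on 04-01, 10-31"
pat = re.compile(r"(?P<month>\d+)-(?P<day>\d+)")

first = pat.search(s)     # leftmost match in the whole string
later = pat.search(s, 8)  # start searching at index 8

print(first.group("month"))  # 04
print(later.group("month"))  # 10
```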
379 |
|
380 | ### Language Reference
|
381 |
|
382 | - See the bottom of the [YSH Expression Grammar]($oils-src:ysh/grammar.pgen2) for
|
383 | the concrete syntax.
|
384 | - See the bottom of [frontend/syntax.asdl]($oils-src:frontend/syntax.asdl) for
|
385 | the abstract syntax.
|
386 |
|
387 | ## Usage Notes
|
388 |
|
389 | ### Use character literals rather than C-Escaped strings
|
390 |
|
391 | No:
|
392 |
|
393 | / $'foo\tbar' / # Match 7 characters including a tab, but it's hard to read
|
394 | / r'foo\tbar' / # The string must contain 8 chars including '\' and 't'
|
395 |
|
396 | Yes:
|
397 |
|
398 | # Instead, take advantage of char literals and implicit regex concatenation
|
399 | / 'foo' \t 'bar' /
|
400 | / 'foo' \\ 'tbar' /
|
401 |
|
402 |
|
403 | ## POSIX ERE Limitations
|
404 |
|
405 | ### Repetition of Strings Requires Grouping
|
406 |
|
407 | Repetitions like `* + ?` apply only to the last character, so literal strings
|
408 | need extra grouping:
|
409 |
|
410 |
|
411 | No:
|
412 |
|
413 | 'foo'+
|
414 |
|
415 | Yes:
|
416 |
|
417 | <capture 'foo'>+
|
418 |
|
419 | Also OK:
|
420 |
|
421 | ('foo')+ # this is a CAPTURING group in ERE
|
422 |
|
423 | This is necessary because ERE doesn't have non-capturing groups like Perl's
|
424 | `(?:...)`, and Eggex only does "dumb" translations. It doesn't silently insert
|
425 | constructs that change the meaning of the pattern.
|
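The underlying regex behavior is easy to confirm in any engine. This Python sketch shows what the ungrouped and grouped forms mean once translated; the pattern strings are hand-written equivalents, not YSH output.

```python
import re

ungrouped = re.compile(r"foo+")  # '+' applies to the final 'o' only
grouped = re.compile(r"(foo)+")  # '+' applies to the whole string 'foo'

print(ungrouped.fullmatch("foooo") is not None)  # True
print(ungrouped.fullmatch("foofoo") is None)     # True
print(grouped.fullmatch("foofoo") is not None)   # True
print(grouped.fullmatch("foooo") is None)        # True
```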
426 |
|
427 | ### Unicode char literals are limited in range
|
428 |
|
429 | ERE can't represent this set of 1 character reliably:
|
430 |
|
431 | / [ \u{0100} ] / # This char is 2 bytes encoded in UTF-8
|
432 |
|
433 | These sets are accepted:
|
434 |
|
435 | / [ \u{1} \u{2} ] / # set of 2 chars
|
436 | / [ \x01 \x02 ] / # set of 2 bytes
|
437 |
|
438 | They happen to be identical when translated to ERE, but may not be when
|
439 | translated to PCRE.
|
440 |
|
441 | ### Don't put non-ASCII bytes in string sets in char classes
|
442 |
|
443 | This is a sequence of characters:
|
444 |
|
445 | / $'\xfe\xff' /
|
446 |
|
447 | This is a **set** of characters that is illegal:
|
448 |
|
449 | / [ $'\xfe\xff' ] / # set or sequence? It's confusing
|
450 |
|
451 | This is a better way to write it:
|
452 |
|
453 | / [ \xfe \xff ] / # set of 2 chars
|
454 |
|
455 | ### Char class literals: `^ - ] \`
|
456 |
|
457 | The literal characters `^ - ] \` are problematic because they can be confused
|
458 | with operators.
|
459 |
|
460 | - `^` means negation
|
461 | - `-` means range
|
462 | - `]` closes the character class
|
463 | - `\` is usually literal, but GNU gawk has an extension to make it an escaping
|
464 | operator
|
465 |
|
466 | The Eggex-to-ERE translator is smart enough to handle cases like this:
|
467 |
|
468 | var pat = / ['^' 'x'] /
|
469 | # translated to [x^], not [^x] for correctness
|
470 |
|
471 | However, cases like this are a fatal runtime error:
|
472 |
|
473 | var pat1 = / ['a'-'^'] /
|
474 | var pat2 = / ['a'-'-'] /
|
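The difference between `[x^]` and `[^x]` mentioned above can be checked in any regex engine. This Python sketch shows why the position of `^` inside the brackets matters; the variable names are illustrative.

```python
import re

literal_caret = re.compile(r"[x^]")  # '^' not first: the set {'x', '^'}
negated = re.compile(r"[^x]")        # '^' first: any character except 'x'

print(literal_caret.fullmatch("^") is not None)  # True
print(literal_caret.fullmatch("a") is None)      # True
print(negated.fullmatch("a") is not None)        # True
print(negated.fullmatch("x") is None)            # True
```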
475 |
|
476 | ## Critiques
|
477 |
|
478 | ### Regexes Are Hard To Read
|
479 |
|
480 | ... because the **same symbol can mean many things**.
|
481 |
|
482 | `^` could mean:
|
483 |
|
484 | - Start of the string/line
|
485 | - Negated character class like `[^abc]`
|
486 | - Literal character `^` like `[abc^]`
|
487 |
|
488 | `\` is used in:
|
489 |
|
490 | - Character classes like `\w` or `\d`
|
491 | - Zero-width assertions like `\b`
|
492 | - Escaped characters like `\n`
|
493 | - Quoted characters like `\+`
|
494 |
|
495 | `?` could mean:
|
496 |
|
497 | - optional: `a?`
|
498 | - lazy match: `a+?`
|
499 | - some other kind of grouping:
|
500 | - `(?P<named>\d+)`
|
501 | - `(?:noncapturing)`
|
502 |
|
503 | With egg expressions, each construct has a **distinct syntax**.
|
504 |
|
505 | ### YSH is Shorter Than Bash
|
506 |
|
507 | Bash:
|
508 |
|
509 | if [[ $x =~ [[:digit:]]+ ]]; then
|
510 | echo 'x looks like a number'
|
511 | fi
|
512 |
|
513 | Compare with YSH:
|
514 |
|
515 | if (x ~ /digit+/) {
|
516 | echo 'x looks like a number'
|
517 | }
|
518 |
|
519 | ### ... and Perl
|
520 |
|
521 | Perl:
|
522 |
|
523 | $x =~ /\d+/
|
524 |
|
525 | YSH:
|
526 |
|
527 | x ~ /d+/
|
528 |
|
529 |
|
530 | The Perl expression has three more punctuation characters:
|
531 |
|
532 | - YSH doesn't require sigils in expression mode
|
533 | - The match operator is `~`, not `=~`
|
534 | - Named character classes are unadorned like `d`. If that's too short, you can
|
535 | also write `digit`.
|
536 |
|
537 | ## Design Notes
|
538 |
|
539 | ### Eggexes In Other Languages
|
540 |
|
541 | The eggex syntax can be incorporated into other tools and shells. It's
|
542 | designed to be separate from YSH -- hence the separate name.
|
543 |
|
544 | Notes:
|
545 |
|
546 | - Single quoted string literals should **disallow** internal backslashes, and
|
547 | treat all other characters literally. Instead, users can write `/ 'foo' \t
|
548 | 'sq' \' bar \n /`, i.e. implicit concatenation of strings and
|
549 | characters, described above.
|
550 | - To make eggexes portable between languages, don't use the host language's
|
551 | syntax for string literals (at least for single-quoted strings).
|
552 |
|
553 | ### Backward Compatibility
|
554 |
|
555 | Eggexes aren't backward compatible in general, but they retain some legacy
|
556 | operators like `^ . $` to ease the transition. These expressions are valid
|
557 | eggexes **and** valid POSIX EREs:
|
558 |
|
559 | .*
|
560 | ^[0-9]+$
|
561 | ^.{1,3}|[0-9][0-9]?$
|
562 |
|
563 | ## FAQ
|
564 |
|
565 | ### The Name Sounds Funny.
|
566 |
|
567 | If "eggex" sounds too much like "regex" to you, simply say "egg expression".
|
568 | It won't be confused with "regular expression" or "regex".
|
569 |
|
570 | ### How Do Eggexes Compare with [Raku Regexes][raku-regex] and the [Rosie Pattern Language][rosie]?
|
571 |
|
572 | All three languages support pattern composition and have quoted literals. And
|
573 | they have the goal of improving upon Perl 5 regex syntax, which has made its
|
574 | way into every major programming language (Python, Java, C++, etc.)
|
575 |
|
576 | The main difference is that Eggexes are meant to be used with **existing**
|
577 | regex engines. For example, you translate them to a POSIX ERE, which is
|
578 | executed by `egrep` or `awk`. Or you translate them to a Perl-like syntax and
|
579 | use them in Python, JavaScript, Java, or C++ programs.
|
580 |
|
581 | Raku (formerly Perl 6) and Rosie have their **own engines** that are more powerful than PCRE,
|
582 | Python, etc. That means they **cannot** be used this way.
|
583 |
|
584 | [rosie]: https://rosie-lang.org/
|
585 |
|
586 | [raku-regex]: https://docs.raku.org/language/regexes
|
587 |
|
588 | ### What About Eggex versus Parsing Expression Grammars (PEGs)?
|
589 |
|
590 | The short answer is that they can be complementary: PEGs are closer to
|
591 | **parsing**, while eggex and [regular languages]($xref:regular-language) are
|
592 | closer to **lexing**. Related:
|
593 |
|
594 | - [When Are Lexer Modes Useful?](https://www.oilshell.org/blog/2017/12/17.html)
|
595 | - [Why Lexing and Parsing Should Be
|
596 | Separate](https://github.com/oilshell/oil/wiki/Why-Lexing-and-Parsing-Should-Be-Separate) (wiki)
|
597 |
|
598 | The PEG model is more resource intensive, but it can recognize more languages,
|
599 | and it can recognize recursive structure (trees).
|
600 |
|
601 | ### Why Don't `dot`, `%start`, and `%end` Have More Precise Names?
|
602 |
|
603 | Because the meanings of `.` `^` and `$` are usually affected by regex engine
|
604 | flags, like `dotall`, `multiline`, and `unicode`.
|
605 |
|
606 | As a result, the names mean nothing more than "however your regex engine
|
607 | interprets `.` `^` and `$`".
|
608 |
|
609 | As mentioned in the "Design Philosophy" section above, eggex only does a superficial,
|
610 | one-to-one translation. It doesn't understand the details of which characters
|
611 | will be matched under which engine.
|
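To make the engine dependence concrete, here is a Python sketch showing how the meanings of `.`, `^`, and `$` shift with flags in Python's `re` module. These results describe Python only; they differ from the ERE-based `reg_newline` examples earlier in this doc, which is exactly why the eggex names stay generic.

```python
import re

# '.' does not match a newline by default in Python, unless DOTALL is set.
dot_default = re.fullmatch(r".", "\n")
dot_all = re.fullmatch(r".", "\n", re.DOTALL)

# '^' matches only at the start of the string, unless MULTILINE is set.
start_default = re.search(r"^\d", "abc\n123")
start_multi = re.search(r"^\d", "abc\n123", re.MULTILINE)

# '$' also matches just before a trailing newline, even without MULTILINE.
end_default = re.search(r"\d$", "abc123\n")

print(dot_default is None, dot_all is not None)          # True True
print(start_default is None, start_multi is not None)    # True True
print(end_default is not None)                           # True
```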
612 |
|
613 | ### Where Do I Send Feedback?
|
614 |
|
615 | Eggexes are implemented in YSH, but not yet set in stone.
|
616 |
|
617 | Please try them, as described in [this
|
618 | post](http://www.oilshell.org/blog/2019/08/22.html) and the
|
619 | [README]($oils-src:README.md), and send us feedback!
|
620 |
|
621 | You can create a new post on [/r/oilshell](https://www.reddit.com/r/oilshell/)
|
622 | or a new message on `#oil-discuss` on <https://oilshell.zulipchat.com/> (log in
|
623 | with GitHub, etc.).
|