| 1 | ---
 | 
| 2 | default_highlighter: oils-sh
 | 
| 3 | ---
 | 
| 4 | 
 | 
| 5 | Egg Expressions (YSH Regexes)
 | 
| 6 | =============================
 | 
| 7 | 
 | 
| 8 | YSH has a new syntax for patterns, which appears between the `/ /` delimiters:
 | 
| 9 | 
 | 
| 10 |     if (mystr ~ /d+ '.' d+/) {   
 | 
| 11 |       echo 'mystr looks like a number N.M'
 | 
| 12 |     }
 | 
| 13 | 
 | 
| 14 | These patterns are intended to be familiar, but they differ from POSIX or Perl
 | 
| 15 | expressions in important ways.  So we call them *eggexes* rather than
 | 
| 16 | *regexes*!
 | 
| 17 | 
 | 
| 18 | <!-- cmark.py expands this -->
 | 
| 19 | <div id="toc">
 | 
| 20 | </div>
 | 
| 21 | 
 | 
| 22 | ## Why Invent a New Language?
 | 
| 23 | 
 | 
| 24 | - Eggexes let you name **subpatterns** and compose them, which makes them more
 | 
| 25 |   readable and testable.
 | 
| 26 | - Their **syntax** is vastly simpler because literal characters are **quoted**,
 | 
| 27 |   and operators are not.  For example, `^` no longer means three totally
 | 
| 28 |   different things.  See the critique at the end of this doc.
 | 
| 29 | - bash and awk use the limited and verbose POSIX ERE syntax, while eggexes are
 | 
| 30 |   more expressive and (in some cases) Perl-like.
 | 
| 31 | - They're designed to be **translated to any regex dialect**.  Right now, the
 | 
| 32 |   YSH shell translates them to ERE so you can use them with common Unix tools:
 | 
| 33 |   - `egrep` (`grep -E`)
 | 
| 34 |   - `awk`
 | 
| 35 |   - GNU `sed --regexp-extended`
 | 
| 36 |   - PCRE syntax is the second most important target.
 | 
| 37 | - They're **statically parsed** in YSH, so:
 | 
| 38 |   - You can get **syntax errors** at parse time.  In contrast, if you embed a
 | 
| 39 |     regex in a string, you don't get syntax errors until runtime.
 | 
| 40 |   - The eggex is part of the [lossless syntax tree][], which means you can do
 | 
| 41 |     linting, formatting, and refactoring on eggexes, just like any other type
 | 
| 42 |     of code.
 | 
| 43 | - Eggexes support **regular languages** in the mathematical sense, whereas
 | 
| 44 |   regexes are **confused** about the issue.  All nonregular eggex extensions
 | 
| 45 |   are prefixed with `!!`, so you can visually audit them for [catastrophic
 | 
| 46 |   backtracking][backtracking].  (Russ Cox, author of the RE2 engine, [has
 | 
| 47 |   written extensively](https://swtch.com/~rsc/regexp/) on this issue.)
 | 
| 48 | - Eggexes are more fun than regexes!
 | 
| 49 | 
 | 
| 50 | [backtracking]: https://blog.codinghorror.com/regex-performance/
 | 
| 51 | 
 | 
| 52 | [lossless syntax tree]: http://www.oilshell.org/blog/2017/02/11.html
 | 
| 53 | 
 | 
| 54 | ### Example of Pattern Reuse
 | 
| 55 | 
 | 
| 56 | Here's a longer example:
 | 
| 57 | 
 | 
| 58 |     # Define a subpattern.  'digit' and 'd' are the same.
 | 
| 59 |     $ var D = / digit{1,3} /
 | 
| 60 | 
 | 
| 61 |     # Use the subpattern
 | 
| 62 |     $ var ip_pat = / D '.' D '.' D '.' D /
 | 
| 63 | 
 | 
| 64 |     # This eggex compiles to an ERE
 | 
| 65 |     $ echo $ip_pat
 | 
| 66 |     [[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}
 | 
| 67 | 
 | 
| 68 | This means you can use it in a very simple way:
 | 
| 69 | 
 | 
| 70 |     $ egrep $ip_pat foo.txt
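
The same kind of composition can be sketched outside YSH. This is only an illustration in Python, not the YSH translator: the names `D`, `ip_pat`, and `looks_like_ip` are made up for the example, and `[0-9]` stands in for `[[:digit:]]` because Python's `re` module doesn't support POSIX bracket classes.

```python
import re

# Hypothetical stand-in for the subpattern / digit{1,3} /.
D = r"[0-9]{1,3}"

# Stand-in for / D '.' D '.' D '.' D /: the quoted '.' becomes an escaped dot.
ip_pat = r"\.".join([D] * 4)


def looks_like_ip(s):
    """Return True if the whole string has the shape N.N.N.N."""
    return re.fullmatch(ip_pat, s) is not None
```

Unlike an eggex, the Python version composes plain strings, so the author has to remember to escape the dot by hand.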
 | 
| 71 | 
 | 
| 72 | TODO: You should also be able to inline patterns like this:
 | 
| 73 | 
 | 
| 74 |     egrep $/d+/ foo.txt
 | 
| 75 | 
 | 
| 76 | ### Design Philosophy
 | 
| 77 | 
 | 
| 78 | - Eggexes can express a **superset** of POSIX and Perl syntax.
 | 
| 79 | - The language is designed for "dumb", one-to-one, **syntactic** translations.
 | 
| 80 |   That is, translation doesn't rely on understanding the **semantics** of
 | 
| 81 |   regexes.  This is because regex implementations have many corner cases and
 | 
| 82 |   incompatibilities, with regard to Unicode, `NUL` bytes, etc.
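
To make "dumb, one-to-one translation" concrete, here is a toy sketch in Python. It is not the real Oils translator: the table `ERE_TABLE`, the function `translate`, and the token-list input are all hypothetical, and Python's `re.escape` is used as a rough stand-in for ERE escaping.

```python
import re

# Hypothetical table: each token maps to exactly one ERE fragment.
ERE_TABLE = {
    "d": "[[:digit:]]",
    "digit": "[[:digit:]]",
    "space": "[[:space:]]",
    "dot": ".",
    "%start": "^",
    "%end": "$",
}


def translate(tokens):
    """Translate a list of eggex-like tokens to an ERE-like string."""
    out = []
    for tok in tokens:
        if len(tok) >= 2 and tok[0] == "'" and tok[-1] == "'":
            # Quoted literal: escape it, never interpret it.
            out.append(re.escape(tok[1:-1]))
        else:
            # Everything else is a table lookup; no semantic analysis.
            out.append(ERE_TABLE[tok])
    return "".join(out)
```

The point of the design is visible even in the toy: the translator never needs to know which characters a fragment will match under a given engine.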
 | 
| 83 | 
 | 
| 84 | ### The Expression Language Is Consistent
 | 
| 85 | 
 | 
| 86 | Eggexes have a consistent syntax:
 | 
| 87 | 
 | 
| 88 | - Single characters are unadorned, in lowercase: `dot`, `space`, or `s`
 | 
| 89 | - A sequence of multiple characters looks like `'lit'`, `$var`, etc.
 | 
| 90 | - Constructs that match **zero** characters look like `%start`, `%word_end`, etc. 
 | 
| 91 | - Entire subpatterns (which may contain alternation, repetition, etc.) are in
 | 
| 92 |   uppercase like `HexDigit`.  Important: these are **spliced** as syntax trees,
 | 
| 93 |   not strings, so you **don't** need to think about quoting.
 | 
| 94 | 
 | 
| 95 | For example, it's easy to see that these patterns all match **three** characters:
 | 
| 96 | 
 | 
| 97 |     / d d d /
 | 
| 98 |     / digit digit digit /
 | 
| 99 |     / dot dot dot /
 | 
| 100 |     / word space word /
 | 
| 101 |     / 'ab' space /
 | 
| 102 |     / 'abc' /
 | 
| 103 | 
 | 
| 104 | And that these patterns match **two**:
 | 
| 105 | 
 | 
| 106 |     / %start w w /
 | 
| 107 |     / %start 'if' /
 | 
| 108 |     / d d %end /
 | 
| 109 | 
 | 
| 110 | And that you have to look up the definition of `HexDigit` to know how many
 | 
| 111 | characters this matches:
 | 
| 112 | 
 | 
| 113 |     / %start HexDigit %end /
 | 
| 114 | 
 | 
| 115 | Constructs like `. ^ $ \< \>` are deprecated because they break these rules.
 | 
| 116 | 
 | 
| 117 | ## Expression Primitives
 | 
| 118 | 
 | 
| 119 | ### `.` Is Now `dot`
 | 
| 120 | 
 | 
| 121 | But `.` is still accepted.  It usually matches any character except a newline,
 | 
| 122 | although this changes based on flags (e.g. `dotall`, `unicode`).
 | 
| 123 | 
 | 
| 124 | ### Classes Are Unadorned: `word`, `w`, `alnum`
 | 
| 125 | 
 | 
| 126 | We accept both Perl and POSIX classes.
 | 
| 127 | 
 | 
| 128 | - Perl:
 | 
| 129 |   - `d` or `digit`
 | 
| 130 |   - `s` or `space`
 | 
| 131 |   - `w` or `word`
 | 
| 132 | - POSIX
 | 
| 133 |   - `alpha`, `alnum`, ...
 | 
| 134 | 
 | 
| 135 | ### Zero-width Assertions Look Like `%this`
 | 
| 136 | 
 | 
| 137 | - POSIX
 | 
| 138 |   - `%start` is `^`
 | 
| 139 |   - `%end` is `$`
 | 
| 140 | - PCRE:
 | 
| 141 |   - `%input_start` is `\A`
 | 
| 142 |   - `%input_end` is `\z`
 | 
| 143 |   - `%last_line_end` is `\Z`
 | 
| 144 | - GNU ERE extensions:
 | 
| 145 |   - `%word_start` is `\<`
 | 
| 146 |   - `%word_end` is `\>`
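
Assertion spellings are easy to mix up across engines, which is part of the motivation for named forms. As a comparison only, this sketch uses Python's `re` module (not YSH). Note that Python's `\Z` matches only at the very end of the input, like PCRE's `\z`, while Python's `$` also matches before a trailing newline.

```python
import re

s = "abc\n"

# '$' without MULTILINE also matches just before a trailing newline.
dollar_matches = re.search(r"c$", s) is not None

# Python's \Z matches only at the very end of the input.
strict_end_matches = re.search(r"c\Z", s) is not None

# \A matches only at the start of the input, even with MULTILINE set,
# while ^ with MULTILINE matches at the start of every line.
input_start_matches = re.search(r"\Ab", "a\nb", re.MULTILINE) is not None
line_start_matches = re.search(r"^b", "a\nb", re.MULTILINE) is not None
```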
 | 
| 147 | 
 | 
| 148 | ### Single-Quoted Strings
 | 
| 149 | 
 | 
| 150 | - `'hello *world*'`  becomes a regex-escaped string
 | 
| 151 | 
 | 
| 152 | Note: instead of using double-quoted strings like `"xyz $var"`, you can splice
 | 
| 153 | a string into an eggex:
 | 
| 154 | 
 | 
| 155 |     / 'xyz ' @var /
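
For readers used to building regexes from strings, the closest analogy to a quoted eggex literal is explicit escaping. The sketch below uses Python's `re.escape` as that analogy; it shows Python behavior, not the escaping rules of the YSH translator.

```python
import re

literal = "hello *world*"

# Escaping makes every character match itself.
escaped = re.escape(literal)
matches_itself = re.fullmatch(escaped, literal) is not None

# Without escaping, '*' is a repetition operator, so the text
# no longer matches its own "pattern".
unescaped_matches = re.fullmatch(literal, literal) is not None
```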
 | 
| 156 | 
 | 
| 157 | ## Compound Expressions
 | 
| 158 | 
 | 
| 159 | ### Sequence and Alternation Are Unchanged
 | 
| 160 | 
 | 
| 161 | - `x y` matches `x` and `y` in sequence
 | 
| 162 | - `x | y` matches `x` or `y`
 | 
| 163 | 
 | 
| 164 | You can also write a more Pythonic alternative: `x or y`.
 | 
| 165 | 
 | 
| 166 | ### Repetition Is Unchanged In Common Cases, and Better in Rare Cases
 | 
| 167 | 
 | 
| 168 | Repetition is just like POSIX ERE or Perl:
 | 
| 169 | 
 | 
| 170 | - `x?`, `x+`, `x*` 
 | 
| 171 | - `x{3}`, `x{1,3}`
 | 
| 172 | 
 | 
| 173 | We've reserved syntactic space for PCRE and Python variants:
 | 
| 174 | 
 | 
| 175 | - lazy/non-greedy: `x{L +}`, `x{L 3,4}`
 | 
| 176 | - possessive: `x{P +}`, `x{P 3,4}`
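
For reference, this is the greedy versus lazy distinction in Python's `re` syntax, where laziness is spelled with a trailing `?`. It is a comparison sketch only; the `{L +}` form above is reserved eggex syntax and is not exercised here.

```python
import re

text = "<a><b>"

# Greedy: '+' takes as much as possible, so the match runs to the last '>'.
greedy = re.match(r"<.+>", text).group()

# Lazy: '+?' takes as little as possible, so the match stops at the first '>'.
lazy = re.match(r"<.+?>", text).group()
```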
 | 
| 177 | 
 | 
| 178 | ### Negation Consistently Uses !
 | 
| 179 | 
 | 
| 180 | You can negate named char classes:
 | 
| 181 | 
 | 
| 182 |     / !digit /
 | 
| 183 | 
 | 
| 184 | and char class literals:
 | 
| 185 | 
 | 
| 186 |     / ![ a-z A-Z ] /
 | 
| 187 | 
 | 
| 188 | Sometimes you can do both:
 | 
| 189 | 
 | 
| 190 |     / ![ !digit ] /  # translates to /[^\D]/ in PCRE
 | 
| 191 |                      # error in ERE because it can't be expressed
 | 
| 192 | 
 | 
| 193 | 
 | 
| 194 | You can also negate "regex modifiers" / compilation flags:
 | 
| 195 | 
 | 
| 196 |     / word ; ignorecase /   # flag on
 | 
| 197 |     / word ; !ignorecase /  # flag off
 | 
| 198 |     / word ; !i /           # abbreviated
 | 
| 199 | 
 | 
| 200 | In contrast, regexes have many confusing syntaxes for negation:
 | 
| 201 | 
 | 
| 202 |     [^abc] vs. [abc]
 | 
| 203 |     [[^:digit:]] vs. [[:digit:]]
 | 
| 204 | 
 | 
| 205 |     \D vs. \d
 | 
| 206 | 
 | 
| 207 |     /\w/-i vs /\w/i
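
The regex spellings above can be checked side by side. This sketch uses Python's `re` module, so it covers the Perl-style forms only (Python has no POSIX bracket classes); the helper `full` is a name made up for the example.

```python
import re


def full(pattern, text, flags=0):
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text, flags) is not None


negated_literal = full(r"[^a-zA-Z]", "7")      # like / ![ a-z A-Z ] /
negated_named = full(r"\D", "x")               # like / !digit /
double_negative = full(r"[^\D]", "7")          # like / ![ !digit ] /
flag_on = full(r"abc", "ABC", re.IGNORECASE)   # like / 'abc' ; ignorecase /
flag_off = full(r"abc", "ABC")                 # flag not set
```

Three different notations (`[^...]`, an uppercase letter, and a flag argument) express what eggex writes with a single `!`.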
 | 
| 208 | 
 | 
| 209 | ### Splice Other Patterns `@var_name` or `UpperCaseVarName`
 | 
| 210 | 
 | 
| 211 | This allows you to reuse patterns.  Using uppercase variables:
 | 
| 212 | 
 | 
| 213 |     var D = / digit{3} /
 | 
| 214 | 
 | 
| 215 |     var ip_addr = / D '.' D '.' D '.' D /
 | 
| 216 | 
 | 
| 217 | Using normal variables:
 | 
| 218 | 
 | 
| 219 |     var part = / digit{3} /
 | 
| 220 | 
 | 
| 221 |     var ip_addr = / @part '.' @part '.' @part '.' @part /
 | 
| 222 | 
 | 
| 223 | This is similar to how `lex` and `re2c` work.
 | 
| 224 | 
 | 
| 225 | ### Group With `()` 
 | 
| 226 | 
 | 
| 227 | Parentheses are used for precedence:
 | 
| 228 | 
 | 
| 229 |     ('foo' | 'bar')+
 | 
| 230 | 
 | 
| 231 | See note below: When translating to POSIX ERE, grouping becomes a capturing
 | 
| 232 | group.  POSIX ERE has no non-capturing groups.
 | 
| 233 | 
 | 
| 234 | 
 | 
| 235 | ### Capture with `<capture ...>`
 | 
| 236 | 
 | 
| 237 | Here's a positional capture:
 | 
| 238 | 
 | 
| 239 |     <capture d+>           # Becomes _group(1)
 | 
| 240 | 
 | 
| 241 | Add a variable after `as` for named capture:
 | 
| 242 | 
 | 
| 243 |     <capture d+ as month>  # Becomes _group('month')
 | 
| 244 | 
 | 
| 245 | You can also add type conversion functions:
 | 
| 246 | 
 | 
| 247 |     <capture d+ : int>           # _group(1) returns an Int, not Str
 | 
| 248 |     <capture d+ as month: int>   # _group('month') returns an Int, not Str
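
For comparison, here is roughly what named capture plus conversion looks like in Python's `re` syntax, where the conversion is a separate step after matching. This is an analogy, not the YSH implementation; the variable names are made up.

```python
import re

pat = re.compile(r"(?P<month>\d+)-(?P<day>\d+)")
m = pat.search("on 04-01")

month_str = m.group("month")        # captured text is always a string
month_int = int(m.group("month"))   # explicit conversion, like ': int'
first_group = m.group(1)            # named groups are also numbered
```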
 | 
| 249 | 
 | 
| 250 | ### Character Class Literals Use `[]`
 | 
| 251 | 
 | 
| 252 | Example:
 | 
| 253 | 
 | 
| 254 |     [ a-f 'A'-'F' \xFF \u{03bc} \n \\ \' \" \0 ]
 | 
| 255 | 
 | 
| 256 | Terms:
 | 
| 257 | 
 | 
| 258 | - Ranges: `a-f` or `'A' - 'F'`
 | 
| 259 | - Literals: `\n`, `\x01`, `\u{3bc}`, etc.
 | 
| 260 | - Sets specified as strings: `'abc'`
 | 
| 261 | 
 | 
| 262 | Only letters, numbers, and the underscore may be unquoted:
 | 
| 263 | 
 | 
| 264 |     /['a'-'f' 'A'-'F' '0'-'9']/
 | 
| 265 |     /[a-f A-F 0-9]/              # Equivalent to the above
 | 
| 266 | 
 | 
| 267 |     /['!' - ')']/                # Correct range
 | 
| 268 |     /[!-)]/                      # Syntax Error
 | 
| 269 | 
 | 
| 270 | Ranges must be separated by spaces:
 | 
| 271 | 
 | 
| 272 | No:
 | 
| 273 | 
 | 
| 274 |     /[a-fA-F0-9]/
 | 
| 275 | 
 | 
| 276 | Yes:
 | 
| 277 | 
 | 
| 278 |     /[a-f A-F 0-9]/
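
In traditional regex syntax the same classes are written without spaces or quotes, which is what makes ranges over punctuation hard to read. A small Python sketch of the equivalents (Python syntax, not eggex):

```python
import re

# Equivalent of / [a-f A-F 0-9] /
hex_digit = re.compile(r"[a-fA-F0-9]")

# Equivalent of / ['!' - ')'] /: the range from '!' (0x21) through ')' (0x29).
# Here '-' is a range operator and ')' is a literal, with no visual cue.
punct_range = re.compile(r"[!-)]")
```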
 | 
| 279 | 
 | 
| 280 | ### Backtracking Constructs Use `!!` (Discouraged)
 | 
| 281 | 
 | 
| 282 | If you want to translate to PCRE, you can use these.
 | 
| 283 | 
 | 
| 284 |     !!REF 1
 | 
| 285 |     !!REF name
 | 
| 286 | 
 | 
| 287 |     !!AHEAD( d+ )
 | 
| 288 |     !!NOT_AHEAD( d+ )
 | 
| 289 |     !!BEHIND( d+ )
 | 
| 290 |     !!NOT_BEHIND( d+ )
 | 
| 291 | 
 | 
| 292 |     !!ATOMIC( d+ )
 | 
| 293 | 
 | 
| 294 | Since they all begin with `!!`, you can visually audit your code for potential
 | 
| 295 | performance problems.
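
For orientation, these are the Perl-style constructs that the `!!` forms correspond to, shown in Python's `re` syntax. The sketch only demonstrates what a backreference and a lookahead do; it says nothing about how YSH translates them.

```python
import re

# Backreference: the second word must repeat the first one exactly.
# No regular language can express "the same text twice".
doubled = re.compile(r"(\w+) \1")

# Lookahead: digits must be followed by ' dollars', which is not consumed.
ahead = re.compile(r"\d+(?= dollars)")
```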
 | 
| 296 | 
 | 
| 297 | ## Outside the Expression Language
 | 
| 298 | 
 | 
| 299 | ### Flags and Translation Preferences (`;`)
 | 
| 300 | 
 | 
| 301 | Flags or "regex modifiers" appear after a semicolon:
 | 
| 302 | 
 | 
| 303 |     / digit+ ; i /  # ignore case
 | 
| 304 | 
 | 
| 305 | A translation preference is specified after a second semicolon:
 | 
| 306 | 
 | 
| 307 |     / digit+ ; ; ERE /                # translates to [[:digit:]]+
 | 
| 308 |     / digit+ ; ; python /             # could translate to \d+
 | 
| 309 | 
 | 
| 310 | Flags and translation preferences together:
 | 
| 311 | 
 | 
| 312 |     / digit+ ; ignorecase ; python /  # could translate to (?i)\d+
 | 
| 313 | 
 | 
| 314 | In Oils, the following flags are currently supported:
 | 
| 315 | 
 | 
| 316 | #### `reg_icase` / `i` (Ignore Case)
 | 
| 317 | 
 | 
| 318 | Use this flag to ignore case when matching. For example, `/'foo'; i/` matches
 | 
| 319 | 'FOO', but `/'foo'/` doesn't.
 | 
| 320 | 
 | 
| 321 | #### `reg_newline` (Multiline)
 | 
| 322 | 
 | 
| 323 | With this flag, `%end` will match before a newline and `%start` will match
 | 
| 324 | after a newline.
 | 
| 325 | 
 | 
| 326 |     = u'abc123\n' ~ / digit %end ; reg_newline /    # true
 | 
| 327 |     = u'abc\n123' ~ / %start digit ; reg_newline /  # true
 | 
| 328 | 
 | 
| 329 | Without the flag, `%start` and `%end` only match from the start or end of the
 | 
| 330 | string, respectively.
 | 
| 331 | 
 | 
| 332 |     = u'abc123\n' ~ / digit %end /                  # false
 | 
| 333 |     = u'abc\n123' ~ / %start digit /                # false
 | 
| 334 | 
 | 
| 335 | With this flag, `dot` and negated classes like `![abc]` no longer match a newline.
 | 
| 336 | 
 | 
| 337 |     = u'\n' ~ / . ; reg_newline /                   # false
 | 
| 338 |     = u'\n' ~ / !digit ; reg_newline /              # false
 | 
| 339 | 
 | 
| 340 | Without this flag, the newline `\n` is treated as an ordinary character.
 | 
| 341 | 
 | 
| 342 |     = u'\n' ~ / . /                                 # true
 | 
| 343 |     = u'\n' ~ / !digit /                            # true
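
Other engines divide these behaviors differently, which matters when a pattern is moved between dialects. For comparison only, Python's `re` module has separate `MULTILINE` and `DOTALL` flags, and its `.` excludes the newline by default. The sketch below shows Python behavior, not YSH or POSIX behavior.

```python
import re

s = "abc\n123"

# '^' matches after a newline only when MULTILINE is set.
start_default = re.search(r"^\d", s) is not None
start_multiline = re.search(r"^\d", s, re.MULTILINE) is not None

# '.' matches a newline only when DOTALL is set.
dot_default = re.search(r".", "\n") is not None
dot_dotall = re.search(r".", "\n", re.DOTALL) is not None
```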
 | 
| 344 | 
 | 
| 345 | ### Multiline Syntax
 | 
| 346 | 
 | 
| 347 | You can spread regexes over multiple lines and add comments:
 | 
| 348 | 
 | 
| 349 |     var x = ///
 | 
| 350 |       digit{4}   # year e.g. 2001
 | 
| 351 |       '-'
 | 
| 352 |       digit{2}   # month e.g. 06
 | 
| 353 |       '-'
 | 
| 354 |       digit{2}   # day e.g. 31
 | 
| 355 |     ///
 | 
| 356 | 
 | 
| 357 | (Not yet implemented in YSH.)
 | 
| 358 | 
 | 
| 359 | ### The YSH API
 | 
| 360 | 
 | 
| 361 | See the [YSH regex API](ysh-regex-api.html) for details.
 | 
| 362 | 
 | 
| 363 | In summary, YSH has Perl-like conveniences with a `~` operator:
 | 
| 364 | 
 | 
| 365 |     var s = 'on 04-01, 10-31'
 | 
| 366 |     var pat = /<capture d+ as month> '-' <capture d+ as day>/
 | 
| 367 | 
 | 
| 368 |     if (s ~ pat) {       # search for the pattern
 | 
| 369 |       echo $[_group('month')]  # => 04
 | 
| 370 |     }
 | 
| 371 | 
 | 
| 372 | It also has an explicit and powerful Python-like API with the `search()` and
 | 
| 373 | `leftMatch()` methods on strings.
 | 
| 374 | 
 | 
| 375 |     var m = s => search(pat, pos=8)  # start searching at a position
 | 
| 376 |     if (m) {
 | 
| 377 |       echo $[m => group('month')]  # => 10
 | 
| 378 |     }
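
Since the API is described as Python-like, here is the corresponding Python code for the same string, using the `pos` argument of a compiled pattern's `search()` method. This is Python's `re` module, shown for comparison; it is not the YSH API.

```python
import re

s = "on 04-01, 10-31"
pat = re.compile(r"(?P<month>\d+)-(?P<day>\d+)")

# Search from the beginning of the string.
first = pat.search(s)

# Start searching at index 8, which skips past the first date.
second = pat.search(s, 8)
```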
 | 
| 379 | 
 | 
| 380 | ### Language Reference
 | 
| 381 | 
 | 
| 382 | - See bottom of the [YSH Expression Grammar]($oils-src:ysh/grammar.pgen2) for
 | 
| 383 |   the concrete syntax.
 | 
| 384 | - See the bottom of [frontend/syntax.asdl]($oils-src:frontend/syntax.asdl) for
 | 
| 385 |   the abstract syntax.
 | 
| 386 | 
 | 
| 387 | ## Usage Notes
 | 
| 388 | 
 | 
| 389 | ### Use character literals rather than C-Escaped strings
 | 
| 390 | 
 | 
| 391 | No:
 | 
| 392 | 
 | 
| 393 |     / $'foo\tbar' /   # Match 7 characters including a tab, but it's hard to read
 | 
| 394 |     / r'foo\tbar' /   # The string must contain 8 chars including '\' and 't'
 | 
| 395 | 
 | 
| 396 | Yes:
 | 
| 397 | 
 | 
| 398 |     # Instead, take advantage of char literals and implicit regex concatenation
 | 
| 399 |     / 'foo' \t 'bar' /
 | 
| 400 |     / 'foo' \\ 'tbar' /
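
The 7 versus 8 character difference is the same one Python has between ordinary and raw strings, so it can be checked there. This is an analogy in Python, not YSH: the pattern is built by concatenating escaped literals with a tab escape between them.

```python
import re

cooked = "foo\tbar"    # the escape is interpreted: 7 characters, one is a tab
raw = r"foo\tbar"      # the escape is kept: 8 characters, '\' and 't'

# Concatenate literals and a character escape instead of embedding the
# escape inside a string.
pattern = re.escape("foo") + r"\t" + re.escape("bar")
matches_tab_version = re.fullmatch(pattern, cooked) is not None
matches_raw_version = re.fullmatch(pattern, raw) is not None
```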
 | 
| 401 | 
 | 
| 402 | 
 | 
| 403 | ## POSIX ERE Limitations
 | 
| 404 | 
 | 
| 405 | ### Repetition of Strings Requires Grouping
 | 
| 406 | 
 | 
| 407 | Repetitions like `* + ?` apply only to the last character, so literal strings
 | 
| 408 | need extra grouping:
 | 
| 409 | 
 | 
| 410 | 
 | 
| 411 | No:
 | 
| 412 | 
 | 
| 413 |     'foo'+ 
 | 
| 414 | 
 | 
| 415 | Yes:
 | 
| 416 | 
 | 
| 417 |     <capture 'foo'>+
 | 
| 418 | 
 | 
| 419 | Also OK:
 | 
| 420 | 
 | 
| 421 |     ('foo')+  # this is a CAPTURING group in ERE
 | 
| 422 | 
 | 
| 423 | This is necessary because ERE doesn't have non-capturing groups like Perl's
 | 
| 424 | `(?:...)`, and Eggex only does "dumb" translations.  It doesn't silently insert
 | 
| 425 | constructs that change the meaning of the pattern.
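
The underlying regex behavior is easy to verify. In this sketch Python's `re` stands in for an ERE engine, since the two agree on this point: without a group, `+` applies only to the final `o`.

```python
import re

# 'foo+' means 'fo' followed by one or more 'o'.
ungrouped = re.compile(r"foo+")

# '(foo)+' repeats the whole string, at the cost of a capturing group.
grouped = re.compile(r"(foo)+")
```

The group count goes from 0 to 1, which is the change in meaning that eggex declines to make silently.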
 | 
| 426 | 
 | 
| 427 | ### Unicode char literals are limited in range
 | 
| 428 | 
 | 
| 429 | ERE can't represent this set of 1 character reliably:
 | 
| 430 | 
 | 
| 431 |     / [ \u{0100} ] /      # This char is 2 bytes encoded in UTF-8
 | 
| 432 | 
 | 
| 433 | These sets are accepted:
 | 
| 434 | 
 | 
| 435 |     / [ \u{1} \u{2} ] /   # set of 2 chars
 | 
| 436 |     / [ \x01 \x02 ] /     # set of 2 bytes
 | 
| 437 | 
 | 
| 438 | They happen to be identical when translated to ERE, but may not be when
 | 
| 439 | translated to PCRE.
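
The byte counts behind this limitation can be checked directly. The sketch below uses Python only to show the UTF-8 encodings; it makes no claim about how any ERE engine handles them.

```python
# U+0100 needs two bytes in UTF-8, so a byte-oriented engine would see
# a set of two bytes rather than a set of one character.
two_bytes = "\u0100".encode("utf-8")

# U+0001 encodes to the single byte 0x01, so the code point and the
# byte are indistinguishable.
one_byte = "\u0001".encode("utf-8")
```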
 | 
| 440 | 
 | 
| 441 | ### Don't put non-ASCII bytes in string sets in char classes
 | 
| 442 | 
 | 
| 443 | This is a sequence of characters:
 | 
| 444 | 
 | 
| 445 |     / $'\xfe\xff' /
 | 
| 446 | 
 | 
| 447 | This is a **set** of characters that is illegal:
 | 
| 448 | 
 | 
| 449 |     / [ $'\xfe\xff' ] /  # set or sequence?  It's confusing
 | 
| 450 | 
 | 
| 451 | This is a better way to write it:
 | 
| 452 | 
 | 
| 453 |     / [ \xfe \xff ] /  # set of 2 chars
 | 
| 454 | 
 | 
| 455 | ### Char class literals: `^ - ] \`
 | 
| 456 | 
 | 
| 457 | The literal characters `^ - ] \` are problematic because they can be confused
 | 
| 458 | with operators.
 | 
| 459 | 
 | 
| 460 | - `^` means negation
 | 
| 461 | - `-` means range
 | 
| 462 | - `]` closes the character class
 | 
| 463 | - `\` is usually literal, but GNU gawk has an extension to make it an escaping
 | 
| 464 |   operator
 | 
| 465 | 
 | 
| 466 | The Eggex-to-ERE translator is smart enough to handle cases like this:
 | 
| 467 | 
 | 
| 468 |     var pat = / ['^' 'x'] / 
 | 
| 469 |     # translated to [x^], not [^x] for correctness
 | 
| 470 | 
 | 
| 471 | However, cases like this are a fatal runtime error:
 | 
| 472 | 
 | 
| 473 |     var pat1 = / ['a'-'^'] /
 | 
| 474 |     var pat2 = / ['a'-'-'] /
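
The reordering of `^` matters because its position changes its meaning. A quick check in Python's `re`, which follows the same bracket convention as ERE on this point:

```python
import re

# '^' anywhere but first is a literal: the set {x, ^}.
literal_caret = re.compile(r"[x^]")

# '^' first negates the class: any character except x.
negated = re.compile(r"[^x]")
```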
 | 
| 475 | 
 | 
| 476 | ## Critiques
 | 
| 477 | 
 | 
| 478 | ### Regexes Are Hard To Read
 | 
| 479 | 
 | 
| 480 | ... because the **same symbol can mean many things**.
 | 
| 481 | 
 | 
| 482 | `^` could mean:
 | 
| 483 | 
 | 
| 484 | - Start of the string/line
 | 
| 485 | - Negated character class like `[^abc]`
 | 
| 486 | - Literal character `^` like `[abc^]`
 | 
| 487 | 
 | 
| 488 | `\` is used in:
 | 
| 489 | 
 | 
| 490 | - Character classes like `\w` or `\d`
 | 
| 491 | - Zero-width assertions like `\b`
 | 
| 492 | - Escaped characters like `\n`
 | 
| 493 | - Quoted characters like `\+`
 | 
| 494 | 
 | 
| 495 | `?` could mean:
 | 
| 496 | 
 | 
| 497 | - optional: `a?`
 | 
| 498 | - lazy match: `a+?`
 | 
| 499 | - some other kind of grouping:
 | 
| 500 |   - `(?P<named>\d+)`
 | 
| 501 |   - `(?:noncapturing)`
 | 
| 502 | 
 | 
| 503 | With egg expressions, each construct has a **distinct syntax**.
 | 
| 504 | 
 | 
| 505 | ### YSH is Shorter Than Bash
 | 
| 506 | 
 | 
| 507 | Bash:
 | 
| 508 | 
 | 
| 509 |     if [[ $x =~ [[:digit:]]+ ]]; then
 | 
| 510 |       echo 'x looks like a number'
 | 
| 511 |     fi
 | 
| 512 | 
 | 
| 513 | Compare with YSH:
 | 
| 514 | 
 | 
| 515 |     if (x ~ /digit+/) {
 | 
| 516 |       echo 'x looks like a number'
 | 
| 517 |     }
 | 
| 518 | 
 | 
| 519 | ### ... and Perl
 | 
| 520 | 
 | 
| 521 | Perl:
 | 
| 522 | 
 | 
| 523 |     $x =~ /\d+/
 | 
| 524 | 
 | 
| 525 | YSH:
 | 
| 526 | 
 | 
| 527 |     x ~ /d+/
 | 
| 528 | 
 | 
| 529 | 
 | 
| 530 | The Perl expression has three more punctuation characters:
 | 
| 531 | 
 | 
| 532 | - YSH doesn't require sigils in expression mode
 | 
| 533 | - The match operator is `~`, not `=~`
 | 
| 534 | - Named character classes are unadorned like `d`.  If that's too short, you can
 | 
| 535 |   also write `digit`.
 | 
| 536 | 
 | 
| 537 | ## Design Notes
 | 
| 538 | 
 | 
| 539 | ### Eggexes In Other Languages
 | 
| 540 | 
 | 
| 541 | The eggex syntax can be incorporated into other tools and shells.  It's
 | 
| 542 | designed to be separate from YSH -- hence the separate name.
 | 
| 543 | 
 | 
| 544 | Notes:
 | 
| 545 | 
 | 
| 546 | - Single quoted string literals should **disallow** internal backslashes, and
 | 
| 547 |   treat all other characters literally.  Instead, users can write `/ 'foo' \t
 | 
| 548 |   'sq' \' bar \n /`, i.e. implicit concatenation of strings and
 | 
| 549 |   characters, described above.
 | 
| 550 | - To make eggexes portable between languages, don't use the host language's
 | 
| 551 |   syntax for string literals (at least for single-quoted strings).
 | 
| 552 | 
 | 
| 553 | ### Backward Compatibility
 | 
| 554 | 
 | 
| 555 | Eggexes aren't backward compatible in general, but they retain some legacy
 | 
| 556 | operators like `^ . $` to ease the transition.  These expressions are valid
 | 
| 557 | eggexes **and** valid POSIX EREs:
 | 
| 558 | 
 | 
| 559 |     .*
 | 
| 560 |     ^[0-9]+$
 | 
| 561 |     ^.{1,3}|[0-9][0-9]?$
 | 
| 562 | 
 | 
| 563 | ## FAQ
 | 
| 564 | 
 | 
| 565 | ### The Name Sounds Funny.
 | 
| 566 | 
 | 
| 567 | If "eggex" sounds too much like "regex" to you, simply say "egg expression".
 | 
| 568 | It won't be confused with "regular expression" or "regex".
 | 
| 569 | 
 | 
| 570 | ### How Do Eggexes Compare with [Raku Regexes][raku-regex] and the [Rosie Pattern Language][rosie]?
 | 
| 571 | 
 | 
| 572 | All three languages support pattern composition and have quoted literals.  And
 | 
| 573 | they have the goal of improving upon Perl 5 regex syntax, which has made its
 | 
| 574 | way into every major programming language (Python, Java, C++, etc.).
 | 
| 575 | 
 | 
| 576 | The main difference is that Eggexes are meant to be used with **existing**
 | 
| 577 | regex engines.  For example, you translate them to a POSIX ERE, which is
 | 
| 578 | executed by `egrep` or `awk`.  Or you translate them to a Perl-like syntax and
 | 
| 579 | use them in Python, JavaScript, Java, or C++ programs.
 | 
| 580 | 
 | 
| 581 | Raku and Rosie have their **own engines** that are more powerful than PCRE,
 | 
| 582 | Python, etc.  That means they **cannot** be used this way.
 | 
| 583 | 
 | 
| 584 | [rosie]: https://rosie-lang.org/
 | 
| 585 | 
 | 
| 586 | [raku-regex]: https://docs.raku.org/language/regexes
 | 
| 587 | 
 | 
| 588 | ### What About Eggex versus Parsing Expression Grammars?  (PEGs)
 | 
| 589 | 
 | 
| 590 | The short answer is that they can be complementary: PEGs are closer to
 | 
| 591 | **parsing**, while eggex and [regular languages]($xref:regular-language) are
 | 
| 592 | closer to **lexing**.  Related:
 | 
| 593 | 
 | 
| 594 | - [When Are Lexer Modes Useful?](https://www.oilshell.org/blog/2017/12/17.html)
 | 
| 595 | - [Why Lexing and Parsing Should Be
 | 
| 596 |   Separate](https://github.com/oilshell/oil/wiki/Why-Lexing-and-Parsing-Should-Be-Separate) (wiki)
 | 
| 597 | 
 | 
| 598 | The PEG model is more resource intensive, but it can recognize more languages,
 | 
| 599 | and it can recognize recursive structure (trees).
 | 
| 600 | 
 | 
| 601 | ### Why Don't `dot`, `%start`, and `%end` Have More Precise Names?
 | 
| 602 | 
 | 
| 603 | Because the meanings of `.` `^` and `$` are usually affected by regex engine
 | 
| 604 | flags, like `dotall`, `multiline`, and `unicode`.
 | 
| 605 | 
 | 
| 606 | As a result, the names mean nothing more than "however your regex engine
 | 
| 607 | interprets `.` `^` and `$`".
 | 
| 608 | 
 | 
| 609 | As mentioned in the "Design Philosophy" section above, eggex only does a superficial,
 | 
| 610 | one-to-one translation.  It doesn't understand the details of which characters
 | 
| 611 | will be matched under which engine.
 | 
| 612 | 
 | 
| 613 | ### Where Do I Send Feedback?
 | 
| 614 | 
 | 
| 615 | Eggexes are implemented in YSH, but not yet set in stone.
 | 
| 616 | 
 | 
| 617 | Please try them, as described in [this
 | 
| 618 | post](http://www.oilshell.org/blog/2019/08/22.html) and the
 | 
| 619 | [README]($oils-src:README.md), and send us feedback!
 | 
| 620 | 
 | 
| 621 | You can create a new post on [/r/oilshell](https://www.reddit.com/r/oilshell/)
 | 
| 622 | or a new message on `#oil-discuss` on <https://oilshell.zulipchat.com/> (log in
 | 
| 623 | with GitHub, etc.).
 |