---
default_highlighter: oils-sh
in_progress: yes
---

Notes on Unicode in Shell
=========================

<div id="toc">
</div>

## Philosophy

Oils is UTF-8 centric, unlike `bash` and other shells.

That is, its Unicode support is like Go, Rust, Julia, and Swift, and not like
Python or JavaScript.  The former languages internally represent strings as
UTF-8, while the latter use arrays of code points or UTF-16 code units.

## A Mental Model

### Program Encoding

Shell **programs** should be encoded in UTF-8 (or its ASCII subset).  Unicode
characters can be encoded directly in the source:

<pre>
echo 'μ'
</pre>

or denoted in ASCII with C-escaped strings:

    echo $'\u03bc'   # bash style

    echo u'\u{3bc}'  # YSH style

(Such strings are preferred over `echo -e` because they're statically parsed.)

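To check which bytes a string actually contains, one option is to dump them in
hex.  The sketch below is plain POSIX shell rather than anything
Oils-specific.  `hex_bytes` is a hypothetical helper name, and the sketch
assumes `od` and `tr` are available.

```sh
# hex_bytes is a hypothetical helper, not part of Oils.
# Print the bytes of the first argument as lowercase hex, with no separators.
hex_bytes() {
  printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n'
}

# Build U+03BC from its two UTF-8 bytes, written as octal escapes (0xCE 0xBC).
mu=$(printf '\316\274')

hex_bytes "$mu"   # prints: cebc
echo
```

Octal escapes are used here instead of `$'\u03bc'` so that the result doesn't
depend on the shell or on the current locale.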
### Data Encoding

Strings in OSH are arbitrary sequences of **bytes**, which may be valid UTF-8.
Details:

- When passed to external programs, strings are truncated at the first `NUL`
  (`'\0'`) byte.  This is a consequence of how Unix and C work.
- Some operations like length `${#s}` and slicing `${s:1:3}` require the string
  to be **valid UTF-8**.  Decoding errors are fatal if `shopt -s
  strict_word_eval` is on.

## List of Features That Respect Unicode

### OSH / bash

These operations are currently implemented in Python, in `osh/string_ops.py`:

- `${#s}` -- length in code points (buggy in bash)
  - Note: YSH `len(s)` returns a number of bytes, not code points.
- `${s:1:2}` -- index and length are a number of code points
- `${x#glob?}` and `${x##glob?}` (see below)

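The distinction between a length in bytes and a length in code points can be
sketched in portable shell, without relying on the current locale.  The helper
names below are hypothetical, and the code point count assumes the input is
valid UTF-8: it counts every byte that is not a continuation byte.

```sh
# Hypothetical helpers, not part of Oils.

# Number of bytes in the first argument.
count_bytes() {
  printf '%s' "$1" | wc -c | tr -d ' '
}

# Number of code points, assuming the argument is valid UTF-8: drop the
# continuation bytes (0x80-0xBF, octal 200-277) and count what is left.
count_code_points() {
  printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | wc -c | tr -d ' '
}

s=$(printf 'a\316\274b')   # 'a', then U+03BC as two bytes, then 'b'
count_bytes "$s"           # prints 4
count_code_points "$s"     # prints 3
```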
More:

- `${foo,}` and `${foo^}` for lowercase / uppercase
- `[[ a < b ]]` and `[ a '<' b ]` for sorting
  - these can use libc `strcoll()`?
- `printf '%d' \'c` where `c` is an arbitrary character.  This is an obscure
  syntax for `ord()`, i.e. getting an integer from an encoded character.

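As a small illustration of the `printf` syntax above, here is a sketch that
wraps it in a function.  The name `ord` is hypothetical, and the example
sticks to ASCII input, since the result for non-ASCII characters may depend on
the shell and the locale.

```sh
# ord is a hypothetical helper name, not a shell builtin.
# Print the integer value of a single character, using the leading-quote
# syntax of printf.
ord() {
  printf '%d\n' "'$1"
}

ord A   # prints 65
```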
#### Globs

Globs have character classes like `[^a]` and the single-character wildcard
`?`.

This pattern results in a `glob()` call:

    echo my?glob

These patterns result in `fnmatch()` calls:

    case $x in ?) echo 'one char' ;; esac

    [[ $x == ? ]]

    ${s#?}  # remove one character prefix; quadratic loop for globs

This uses our glob-to-ERE translator to get *position* info:

    echo ${s/?/x}

#### Regexes (ERE)

Regexes have character classes like `[^a]` and the single-character wildcard
`.`:

    pat='.'  # single "character"
    [[ $x =~ $pat ]]

#### Locale-aware operations

- The prompt string can include the time, whose format is locale-specific.
- In bash, `printf` can also format the time.

Other:

- The prompt width is calculated with `wcswidth()`, which doesn't just count
  code points.  It calculates the **display width** of characters, which is
  different in general.

### YSH

- Eggex matching depends on ERE semantics.
  - `mystr ~ / [ \xff ] /`
  - `case (x) { / dot / }`
- `for offset, rune in (runes(mystr))` decodes UTF-8, like Go
- `Str.{trim,trimLeft,trimRight}` respect Unicode space, like JavaScript does
- `Str.{upper,lower}` also need Unicode case folding
- `split()` respects Unicode space?

Not Unicode-aware:

- `strcmp()` compares byte-wise, which for valid UTF-8 is the same as
  comparing by code point?

### Data Languages

- Decoding JSON/J8 validates UTF-8
- Encoding JSON/J8 decodes and validates UTF-8
  - So we can distinguish valid UTF-8 and invalid bytes like `\yff`

## Implementation Notes

Unlike bash and CPython, Oils doesn't call `setlocale()`.  (Although GNU
readline may call it.)

It's expected that your locale will respect UTF-8.  This is true on most
distros.  If not, then some string operations will support UTF-8 and some
won't.

For example:

- String length like `${#s}` is implemented in Oils code, not libc, so it will
  always respect UTF-8.
- `[[ s =~ $pat ]]` is implemented with libc, so it is affected by the locale
  settings.  Same with Oils `(x ~ pat)`.

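The effect of the locale on pattern matching can be observed in an ordinary
POSIX shell such as bash or dash.  The sketch below says nothing about Oils
itself.  `matches_in_c_locale` is a hypothetical helper name, and the expected
results in the comments are what bash and dash are assumed to produce: under
the C locale, `?` matches one byte rather than one code point.

```sh
# matches_in_c_locale is a hypothetical helper, not part of Oils.
# Return 0 if string $1 matches glob pattern $2 under the C locale.
# The subshell keeps the LC_ALL change from leaking into the caller.
matches_in_c_locale() {
  (
    LC_ALL=C
    export LC_ALL
    case $1 in
      $2) exit 0 ;;   # $2 is unquoted on purpose, so it acts as a pattern
      *)  exit 1 ;;
    esac
  )
}

mu=$(printf '\316\274')   # U+03BC encoded as two UTF-8 bytes

matches_in_c_locale "$mu" '?'  || echo 'one ? does not match'
matches_in_c_locale "$mu" '??' && echo 'two ?? match, one per byte'
```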
TODO: Oils should support `LANG=C` for some operations, but not `LANG=X` for
other `X`.

### List of Low-Level UTF-8 Operations

libc:

- `glob()` and `fnmatch()`
- `regexec()`
- `strcoll()` respects `LC_COLLATE`, as bash probably does

Our own:

- Decode next rune from a position, or previous rune
  - `trimLeft()` and `${s#prefix}` need this
- Decode UTF-8
  - J8 encoding and decoding need this
  - `for r in (runes(x))` needs this
  - respecting surrogate halves
    - JSON needs this
- Encode integer rune to UTF-8 sequence
  - J8 needs this, for `\u{3bc}` (currently in `data_lang/j8.py Utf8Encode()`)

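To make the "encode integer rune" operation concrete, here is a minimal sketch
in POSIX shell.  It is not the code in `data_lang/j8.py`; the function names
`emit_byte` and `utf8_encode` are hypothetical.  It takes a code point as a
decimal integer, rejects surrogates and values above U+10FFFF, and writes the
UTF-8 bytes to stdout.

```sh
# Hypothetical helpers, not the Oils implementation.

# Write one byte, given its value 0-255, using an octal escape.
emit_byte() {
  printf "\\$(printf '%03o' "$1")"
}

# Write the UTF-8 encoding of the code point $1 (a decimal integer).
# Returns 1 for surrogates (U+D800-U+DFFF) and values above U+10FFFF.
utf8_encode() {
  code=$1
  if [ "$code" -gt 1114111 ]; then
    return 1
  fi
  if [ "$code" -ge 55296 ] && [ "$code" -le 57343 ]; then
    return 1
  fi

  if [ "$code" -lt 128 ]; then            # 0xxxxxxx
    emit_byte "$code"
  elif [ "$code" -lt 2048 ]; then         # 110xxxxx 10xxxxxx
    emit_byte $(( 192 | ($code >> 6) ))
    emit_byte $(( 128 | ($code & 63) ))
  elif [ "$code" -lt 65536 ]; then        # 1110xxxx 10xxxxxx 10xxxxxx
    emit_byte $(( 224 | ($code >> 12) ))
    emit_byte $(( 128 | (($code >> 6) & 63) ))
    emit_byte $(( 128 | ($code & 63) ))
  else                                    # 11110xxx and three 10xxxxxx bytes
    emit_byte $(( 240 | ($code >> 18) ))
    emit_byte $(( 128 | (($code >> 12) & 63) ))
    emit_byte $(( 128 | (($code >> 6) & 63) ))
    emit_byte $(( 128 | ($code & 63) ))
  fi
}

utf8_encode 956 | od -An -v -tx1   # 956 is 0x3bc; the bytes are ce bc
```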
Not sure:

- Case folding
  - both OSH and YSH have uppercase and lowercase

## Tips

- The GNU `iconv` program converts text from one encoding to another.

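Two common uses of `iconv` can be sketched as follows.  The function names are
hypothetical, and the sketch assumes an `iconv` binary on `PATH` that accepts
the encoding names `ISO-8859-1` and `UTF-8`.

```sh
# Hypothetical helpers; both assume `iconv` is installed.

# Convert ISO-8859-1 (Latin-1) bytes on stdin to UTF-8 on stdout.
latin1_to_utf8() {
  iconv -f ISO-8859-1 -t UTF-8
}

# Return 0 if stdin is valid UTF-8.  A UTF-8 to UTF-8 round trip is expected
# to fail on malformed input.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# 0xB5 (octal 265) is the micro sign in Latin-1; in UTF-8 it is c2 b5.
printf '\265' | latin1_to_utf8 | od -An -v -tx1
```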
<!--
## Spec Tests

June 2024 notes:

- `spec/var-op-patsub` has failing cases, e.g. `LC_ALL=C`
  - ${s//?/a}
- glob() and fnmatch() seem to be OK?   As long as locale is UTF-8.

-->

<!--

What libraries are we using?

TODO: Make sure these are UTF-8 mode, regardless of LANG global variables?

Or maybe we punt on that, and say Oils is only valid in UTF-8 mode?  Need to
investigate the API more.

- fnmatch()
- glob()
- regcomp/regexec()

- Are we using any re2c unicode?  For JSON?
- upper() and lower()?  isupper() is lower()
  - Need to sort these out

-->