---
default_highlighter: oils-sh
in_progress: yes
---

Notes on Unicode in Shell
=========================

<div id="toc">
</div>

## Philosophy

Oils is UTF-8 centric, unlike `bash` and other shells.

That is, its Unicode support is like Go, Rust, Julia, and Swift, and not like
Python or JavaScript. The former languages internally represent strings as
UTF-8, while the latter use arrays of code points or UTF-16 code units.

## A Mental Model

### Program Encoding

Shell **programs** should be encoded in UTF-8 (or its ASCII subset). Unicode
characters can be encoded directly in the source:

<pre>
echo 'μ'
</pre>

or denoted in ASCII with C-escaped strings:

    echo $'\u03bc'  # bash style

    echo u'\u{3bc}'  # YSH style

(Such strings are preferred over `echo -e` because they're statically parsed.)

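
As a rough illustration of the two spellings, the sketch below dumps the bytes a shell emits for the literal character and for an ASCII-only escape. It assumes the script file itself is saved as UTF-8, and it uses POSIX `printf` octal escapes as a stand-in for `$'\u03bc'` so the result doesn't depend on the shell or locale. The helper name `hex_bytes` is made up for this example.

```sh
# Hex-dump stdin as one compact lowercase string, e.g. "cebc".
# (hex_bytes is a hypothetical helper, not part of Oils.)
hex_bytes() {
  od -An -v -tx1 | tr -d ' \n'
}

# The literal character, assuming this file is saved as UTF-8.
literal=$(printf '%s' 'μ' | hex_bytes)

# The same character spelled with ASCII-only octal escapes (0xCE 0xBC).
escaped=$(printf '\316\274' | hex_bytes)

echo "literal: $literal"
echo "escaped: $escaped"
```

Both lines should show the same two bytes, `cebc`, which is the UTF-8 encoding of U+03BC.
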
### Data Encoding

Strings in OSH are arbitrary sequences of **bytes**, which may be valid UTF-8.
Details:

- When passed to external programs, strings are truncated at the first `NUL`
  (`'\0'`) byte. This is a consequence of how Unix and C work.
- Some operations like length `${#s}` and slicing `${s:1:3}` require the string
  to be **valid UTF-8**. Decoding errors are fatal if `shopt -s
  strict_word_eval` is on.

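
To make the byte / code point distinction concrete, here is a minimal sketch that computes both lengths with standard tools instead of `${#s}`, so the answer doesn't depend on the shell or locale. The function names are hypothetical, and the code point count assumes the input is valid UTF-8; it does not detect decoding errors the way OSH does.

```sh
# Number of bytes in a string, independent of locale.
byte_length() {
  printf '%s' "$1" | wc -c | tr -d ' '
}

# Number of code points, assuming valid UTF-8 input: count every byte that
# is NOT a continuation byte (continuation bytes are 0x80..0xBF, 128..191).
code_point_length() {
  printf '%s' "$1" | od -An -v -tu1 | tr -s ' ' '\n' |
    awk 'NF && ($1 < 128 || $1 >= 192) { n++ } END { print n + 0 }'
}

s='aμb'
echo "bytes: $(byte_length "$s")"               # a=1, μ=2, b=1, so 4
echo "code points: $(code_point_length "$s")"   # 3
```
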
## List of Features That Respect Unicode

### OSH / bash

These operations are currently implemented in Python, in `osh/string_ops.py`:

- `${#s}` -- length in code points (buggy in bash)
  - Note: YSH `len(s)` returns a number of bytes, not code points.
- `${s:1:2}` -- offset and length are both counted in code points
- `${x#glob?}` and `${x##glob?}` (see below)

More:

- `${foo,}` and `${foo^}` for lowercase / uppercase
- `[[ a < b ]]` and `[ a '<' b ]` for string comparison (sorting)
  - these can use libc `strcoll()`?
- `printf '%d' \'c` where `c` is an arbitrary character. This is an obscure
  syntax for `ord()`, i.e. getting an integer from an encoded character.

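
The `printf` quote syntax is easy to miss, so here is a small sketch of it wrapped in a hypothetical `ord` function. Only ASCII input is shown, because for non-ASCII characters the result depends on the shell and the locale.

```sh
# A leading quote in a numeric printf argument means
# "the numeric value of the character that follows".
ord() {
  printf '%d\n' "'$1"
}

ord A   # prints 65
ord a   # prints 97
```
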
#### Globs

Globs have character classes like `[^a]` and the wildcard `?`, each of which
matches a single character.

This pattern results in a `glob()` call:

    echo my?glob

These patterns result in `fnmatch()` calls:

    case $x in ?) echo 'one char' ;; esac

    [[ $x == ? ]]

    ${s#?}  # remove one character prefix, quadratic loop for globs

This uses our glob to ERE translator for *position* info:

    echo ${s/?/x}

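
What counts as "one character" for `?` depends on how the matcher decodes the string. The sketch below shows the byte-oriented failure mode by forcing the `C` locale, as observed in bash; it is not a statement about how OSH behaves. The function name is made up for the example.

```sh
# Report how many glob '?' wildcards are needed to match the whole string.
# Only distinguishes 1 and 2; anything else prints "other".
wildcards_needed() {
  case $1 in
    ?)  echo 1 ;;
    ??) echo 2 ;;
    *)  echo other ;;
  esac
}

LC_ALL=C         # force the byte-oriented "C" locale for this demonstration
export LC_ALL

wildcards_needed 'a'    # 1: one byte, one character
wildcards_needed 'μ'    # 2: the bytes 0xCE and 0xBC each match a '?'
```

In a UTF-8 locale, bash reports `1` for `'μ'` instead, which is the behavior a UTF-8 centric shell wants unconditionally.
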
#### Regexes (ERE)

Regexes have character classes `[^a]` and `.`:

    pat='.'  # single "character"
    [[ $x =~ $pat ]]

#### Locale-aware operations

- The prompt string can include the time, which is formatted according to the
  locale.
- In bash, `printf` can also format times, with `%(...)T`.

Other:

- The prompt width is calculated with `wcswidth()`, which doesn't just count
  code points. It calculates the **display width** of characters, which is
  different in general. For example, many East Asian characters occupy two
  columns.

### YSH

- Eggex matching depends on ERE semantics.
  - `mystr ~ / [ \xff ] /`
  - `case (x) { / dot / }`
- `for offset, rune in (runes(mystr))` decodes UTF-8, like Go
- `Str.{trim,trimLeft,trimRight}` respect Unicode space, like JavaScript does
- `Str.{upper,lower}` also need Unicode case folding
- `split()` respects Unicode space?

Not Unicode aware:

- `strcmp()` compares byte-wise, which for valid UTF-8 gives the same order as
  comparing code points.

### Data Languages

- Decoding JSON/J8 validates UTF-8
- Encoding JSON/J8 decodes and validates UTF-8
  - So we can distinguish valid UTF-8 and invalid bytes like `\yff`

## Implementation Notes

Unlike bash and CPython, Oils doesn't call `setlocale()`. (Although GNU
readline may call it.)

Oils expects your locale to use UTF-8. This is true on most distros. If not,
then some string operations will support UTF-8 and some won't.

For example:

- String length like `${#s}` is implemented in Oils code, not libc, so it will
  always respect UTF-8.
- `[[ $s =~ $pat ]]` is implemented with libc, so it is affected by the locale
  settings. The same goes for Oils `(x ~ pat)`.

TODO: Oils should support `LANG=C` for some operations, but not `LANG=X` for
other `X`.

### List of Low-Level UTF-8 Operations

libc:

- `glob()` and `fnmatch()`
- `regexec()`
- `strcoll()` respects `LC_COLLATE`, which bash probably uses

Our own:

- Decode next rune from a position, or previous rune
  - `trimLeft()` and `${s#prefix}` need this
- Decode UTF-8
  - J8 encoding and decoding need this
  - `for r in (runes(x))` needs this
  - Respecting surrogate halves
    - JSON needs this
- Encode integer rune to UTF-8 sequence
  - J8 needs this, for `\u{3bc}` (currently in `data_lang/j8.py Utf8Encode()`)

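
For readers unfamiliar with the "encode integer rune" step, the following is a minimal sketch of the standard UTF-8 bit layout in POSIX shell arithmetic. It is not the code in `data_lang/j8.py`; the function name `utf8_encode_hex` and its choice to print hex bytes are assumptions made for illustration. It rejects surrogate halves and out-of-range values, which ties in with the surrogate handling mentioned above.

```sh
# Print the UTF-8 encoding of one code point as space-separated hex bytes.
# Returns 1 for surrogate halves and for values outside 0..0x10FFFF.
# (Hypothetical helper for illustration; uses the global variable cp.)
utf8_encode_hex() {
  cp=$(( $1 ))   # accepts decimal or 0x-prefixed hex
  if [ "$cp" -lt 0 ] || [ "$cp" -gt 1114111 ]; then
    return 1     # outside the Unicode range (max is 0x10FFFF)
  elif [ "$cp" -ge 55296 ] && [ "$cp" -le 57343 ]; then
    return 1     # surrogate half, U+D800..U+DFFF
  elif [ "$cp" -lt 128 ]; then
    # 1 byte: 0xxxxxxx
    printf '%02x\n' "$cp"
  elif [ "$cp" -lt 2048 ]; then
    # 2 bytes: 110xxxxx 10xxxxxx
    printf '%02x %02x\n' \
      $(( 0xC0 | (cp >> 6) )) $(( 0x80 | (cp & 0x3F) ))
  elif [ "$cp" -lt 65536 ]; then
    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    printf '%02x %02x %02x\n' \
      $(( 0xE0 | (cp >> 12) )) $(( 0x80 | ((cp >> 6) & 0x3F) )) \
      $(( 0x80 | (cp & 0x3F) ))
  else
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    printf '%02x %02x %02x %02x\n' \
      $(( 0xF0 | (cp >> 18) )) $(( 0x80 | ((cp >> 12) & 0x3F) )) \
      $(( 0x80 | ((cp >> 6) & 0x3F) )) $(( 0x80 | (cp & 0x3F) ))
  fi
}

utf8_encode_hex 0x3bc     # ce bc
utf8_encode_hex 0x20ac    # e2 82 ac
```
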
Not sure:

- Case folding
  - both OSH and YSH have uppercase and lowercase operations

## Tips

- The GNU `iconv` program converts text from one encoding to another.

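
Two common uses of `iconv` are sketched below: converting between encodings, and checking whether data is well-formed UTF-8 by "converting" it to the same encoding. The `is_valid_utf8` function name is hypothetical, and the sketch assumes an `iconv` that exits non-zero on malformed input, as GNU `iconv` does when run without `-c`.

```sh
# Convert UTF-8 to big-endian UTF-16 and show the resulting bytes.
# U+03BC becomes the two bytes 03 bc.
printf '\316\274' | iconv -f UTF-8 -t UTF-16BE | od -An -v -tx1

# Converting UTF-8 to UTF-8 is a cheap validity check: iconv exits
# non-zero when the input isn't well-formed.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

printf '\316\274' | is_valid_utf8 && echo 'valid'
printf '\377' | is_valid_utf8 || echo 'invalid'   # 0xFF never occurs in UTF-8
```
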
<!--
## Spec Tests

June 2024 notes:

- `spec/var-op-patsub` has failing cases, e.g. `LC_ALL=C`
  - ${s//?/a}
- glob() and fnmatch() seem to be OK? As long as locale is UTF-8.

-->



<!--

What libraries are we using?

TODO: Make sure these are UTF-8 mode, regardless of LANG global variables?

Or maybe we punt on that, and say Oils is only valid in UTF-8 mode? Need to
investigate the API more.

- fnmatch()
- glob()
- regcomp/regexec()

- Are we using any re2c unicode? For JSON?
- upper() and lower()? isupper() and islower()?
  - Need to sort these out

-->