---
default_highlighter: oils-sh
in_progress: yes
---

Notes on Unicode in Shell
=========================

<div id="toc">
</div>

## Philosophy

Oils is UTF-8 centric, unlike `bash` and other shells.

That is, its Unicode support is like Go, Rust, Julia, and Swift, and not like
Python or JavaScript. The former languages internally represent strings as
UTF-8, while the latter use arrays of code points or UTF-16 code units.

## A Mental Model

### Program Encoding

Shell **programs** should be encoded in UTF-8 (or its ASCII subset). Unicode
characters can be encoded directly in the source:

<pre>
echo '&#x03bc;'
</pre>

or denoted in ASCII with C-escaped strings:

    echo $'\u03bc'  # bash style

    echo u'\u{3bc}'  # YSH style

(Such strings are preferred over `echo -e` because they're statically parsed.)
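One way to check which bytes a string contains is to dump it in hex. The following is a minimal sketch, not part of Oils: `hex_bytes` is a hypothetical helper. It spells the character with byte escapes (`\xce\xbc`), on the assumption that bash may interpret `\u03bc` according to the current locale.

```shell
# Hypothetical helper: print the bytes of its argument as lowercase hex.
hex_bytes() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
  echo
}

mu=$'\xce\xbc'   # the two UTF-8 bytes of U+03BC, spelled as byte escapes
hex_bytes "$mu"  # prints cebc
hex_bytes 'a'    # prints 61; ASCII is a one-byte subset of UTF-8
```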

### Data Encoding

Strings in OSH are arbitrary sequences of **bytes**, which may or may not be
valid UTF-8. Details:

- When passed to external programs, strings are truncated at the first `NUL`
  (`'\0'`) byte. This is a consequence of how Unix and C work.
- Some operations like length `${#s}` and slicing `${s:1:3}` require the string
  to be **valid UTF-8**. Decoding errors are fatal if `shopt -s
  strict_word_eval` is on.
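The difference between bytes and code points can be seen without relying on any particular shell's `${#s}`. This is a minimal sketch that assumes valid UTF-8 input; `byte_len` and `code_point_len` are hypothetical helpers, not Oils functions. Counting code points works by dropping continuation bytes (`0x80` to `0xBF`), because each code point contributes exactly one non-continuation byte.

```shell
# Hypothetical helper: number of bytes in $1.
byte_len() {
  printf '%s' "$1" | wc -c | tr -d ' '
}

# Hypothetical helper: number of code points in $1.
# Drops UTF-8 continuation bytes (octal 200-277) and counts what remains.
# Only meaningful when $1 is valid UTF-8.
code_point_len() {
  printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | wc -c | tr -d ' '
}

s=$'\xce\xbc'        # U+03BC, encoded as two bytes
byte_len "$s"        # prints 2
code_point_len "$s"  # prints 1
```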

## List of Features That Respect Unicode

### OSH / bash

These operations are currently implemented in Python, in `osh/string_ops.py`:

- `${#s}` -- length in code points (buggy in bash)
  - Note: YSH `len(s)` returns a number of bytes, not code points.
- `${s:1:2}` -- index and length are a number of code points
- `${x#glob?}` and `${x##glob?}` (see below)

More:

- `${foo,}` and `${foo^}` for lowercase / uppercase
- `[[ a < b ]]` and `[ a '<' b ]` for sorting
  - these can use libc `strcoll()`?
- `printf '%d' \'c` where `c` is an arbitrary character. This is an obscure
  syntax for `ord()`, i.e. getting an integer from an encoded character.
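The `printf` syntax is easier to read inside a small function. This is a minimal sketch; `ord` is a hypothetical name. Only ASCII input is shown, because the result for non-ASCII characters is expected to depend on the shell and the locale.

```shell
# Hypothetical helper: print the integer value of the first character of $1.
# A leading quote in a numeric printf argument asks for the character's value.
ord() {
  printf '%d\n' "'$1"
}

ord 'A'   # prints 65
ord 'a'   # prints 97
```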

#### Globs

Globs have character classes like `[^a]` and the single-character wildcard `?`.

This pattern results in a `glob()` call:

    echo my?glob

These patterns result in `fnmatch()` calls:

    case $x in ?) echo 'one char' ;; esac

    [[ $x == ? ]]

    ${s#?}  # remove one character prefix, quadratic loop for globs

This uses our glob-to-ERE translator for *position* info:

    echo ${s/?/x}
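The glob operations above can be tried on a short ASCII string, where the result should not depend on the locale. This is a minimal sketch; `is_one_char` is a hypothetical helper. Whether `?` consumes a whole multi-byte character depends on the shell and, for libc-backed matching, on the locale.

```shell
# Hypothetical helper: succeed if $1 is exactly one character.
is_one_char() {
  case $1 in
    ?) return 0 ;;
    *) return 1 ;;
  esac
}

s='abc'
if is_one_char 'a'; then echo 'one char'; fi    # prints: one char
if ! is_one_char "$s"; then echo 'not one'; fi  # prints: not one
echo "${s#?}"    # prints bc; the shortest prefix matching ? is removed
echo "${s/?/x}"  # prints xbc; the first match of ? is replaced
```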

#### Regexes (ERE)

Regexes have character classes like `[^a]` and the single-character wildcard
`.`:

    pat='.'  # single "character"
    [[ $x =~ $pat ]]
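Anchoring the pattern makes the "single character" meaning explicit. This is a minimal sketch with ASCII input; `matches` is a hypothetical helper. For non-ASCII input, what counts as one character is decided by the regex engine.

```shell
# Hypothetical helper: succeed if $1 matches the ERE in $2.
# The pattern is expanded unquoted so that =~ treats it as a regex.
matches() {
  [[ $1 =~ $2 ]]
}

pat='^.$'   # exactly one "character", as the regex engine defines it
if matches 'a' "$pat"; then echo 'match'; fi        # prints: match
if ! matches 'ab' "$pat"; then echo 'no match'; fi  # prints: no match
```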

#### Locale-aware operations

- The prompt string can include the time, whose format is locale-specific.
- In bash, `printf` can also format the time (with `%(fmt)T`).

Other:

- The prompt width is calculated with `wcswidth()`, which doesn't just count
  code points. It calculates the **display width** of characters, which is
  different in general.

### YSH

- Eggex matching depends on ERE semantics.
  - `mystr ~ / [ \xff ] /`
  - `case (x) { / dot / }`
- `for offset, rune in (runes(mystr))` decodes UTF-8, like Go
- `Str.{trim,trimLeft,trimRight}` respect Unicode space, like JavaScript does
- `Str.{upper,lower}` also need Unicode case folding
- `split()` respects Unicode space?

Not Unicode-aware:

- `strcmp()` compares bytes, which for valid UTF-8 gives the same order as
  comparing code points.

### Data Languages

- Decoding JSON/J8 validates UTF-8
- Encoding JSON/J8 decodes and validates UTF-8
  - So we can distinguish valid UTF-8 and invalid bytes like `\yff`

## Implementation Notes

Unlike bash and CPython, Oils doesn't call `setlocale()`. (Although GNU
readline may call it.)

It's expected that your locale will respect UTF-8. This is true on most
distros. If not, then some string operations will support UTF-8 and some
won't.

For example:

- String length like `${#s}` is implemented in Oils code, not libc, so it will
  always respect UTF-8.
- `[[ s =~ $pat ]]` is implemented with libc, so it is affected by the locale
  settings. Same with Oils `(x ~ pat)`.

TODO: Oils should support `LANG=C` for some operations, but not `LANG=X` for
other `X`.
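To see which locale the libc-backed operations are likely to use, a script can apply the usual POSIX precedence: `LC_ALL` overrides `LC_CTYPE`, which overrides `LANG`. This is a minimal sketch, not something Oils does itself. `effective_ctype` and `looks_like_utf8` are hypothetical helpers, and the name check is a heuristic that ignores locale modifiers.

```shell
# Hypothetical helper: given the values of LC_ALL, LC_CTYPE and LANG,
# print the locale name that governs character handling.
effective_ctype() {
  local lc_all=$1 lc_ctype=$2 lang=$3
  if [ -n "$lc_all" ]; then
    echo "$lc_all"
  elif [ -n "$lc_ctype" ]; then
    echo "$lc_ctype"
  elif [ -n "$lang" ]; then
    echo "$lang"
  else
    echo 'C'   # nothing set: the default is the C (POSIX) locale
  fi
}

# Hypothetical helper: succeed if a locale name looks like a UTF-8 locale.
looks_like_utf8() {
  case $1 in
    *.UTF-8|*.utf8|*.UTF8|*.utf-8) return 0 ;;
    *) return 1 ;;
  esac
}

name=$(effective_ctype "${LC_ALL:-}" "${LC_CTYPE:-}" "${LANG:-}")
if looks_like_utf8 "$name"; then
  echo "$name: libc-backed operations should see UTF-8"
else
  echo "$name: libc-backed operations may treat text as bytes"
fi
```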

### List of Low-Level UTF-8 Operations

libc:

- `glob()` and `fnmatch()`
- `regexec()`
- `strcoll()` respects `LC_COLLATE`, which bash probably does too

Our own:

- Decode next rune from a position, or previous rune
  - `trimLeft()` and `${s#prefix}` need this
- Decode UTF-8
  - J8 encoding and decoding need this
  - `for r in (runes(x))` needs this
  - respecting surrogate halves
  - JSON needs this
- Encode integer rune to UTF-8 sequence
  - J8 needs this, for `\u{3bc}` (currently in `data_lang/j8.py Utf8Encode()`)

Not sure:

- Case folding
  - both OSH and YSH have uppercase and lowercase
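The "encode integer rune to UTF-8 sequence" operation listed above follows the standard UTF-8 bit layout. This is a minimal sketch in shell arithmetic, not the Oils `Utf8Encode()` implementation; `utf8_encode_hex` is a hypothetical helper that prints the encoded bytes as hex and rejects surrogate halves and out-of-range values.

```shell
# Hypothetical helper: print the UTF-8 encoding of code point $1 as hex bytes.
# Returns 1 for values that are not Unicode scalar values.
utf8_encode_hex() {
  local cp=$(( $1 ))
  if (( cp < 0 || cp > 0x10FFFF )); then
    return 1   # outside the Unicode range
  elif (( cp >= 0xD800 && cp <= 0xDFFF )); then
    return 1   # surrogate halves are not encodable
  elif (( cp < 0x80 )); then
    printf '%02x\n' "$cp"
  elif (( cp < 0x800 )); then
    printf '%02x %02x\n' \
      $(( 0xC0 | (cp >> 6) )) \
      $(( 0x80 | (cp & 0x3F) ))
  elif (( cp < 0x10000 )); then
    printf '%02x %02x %02x\n' \
      $(( 0xE0 | (cp >> 12) )) \
      $(( 0x80 | ((cp >> 6) & 0x3F) )) \
      $(( 0x80 | (cp & 0x3F) ))
  else
    printf '%02x %02x %02x %02x\n' \
      $(( 0xF0 | (cp >> 18) )) \
      $(( 0x80 | ((cp >> 12) & 0x3F) )) \
      $(( 0x80 | ((cp >> 6) & 0x3F) )) \
      $(( 0x80 | (cp & 0x3F) ))
  fi
}

utf8_encode_hex 0x3bc    # prints: ce bc
utf8_encode_hex 0x20ac   # prints: e2 82 ac
utf8_encode_hex 0x1f600  # prints: f0 9f 98 80
```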

## Tips

- The GNU `iconv` program converts text from one encoding to another.
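For example, a Latin-1 byte can be converted to UTF-8 and inspected in hex. This is a minimal sketch that assumes `iconv` is installed and knows the `ISO-8859-1` and `UTF-8` encoding names, as GNU `iconv` does; `latin1_to_utf8` is a hypothetical wrapper.

```shell
# Hypothetical wrapper: convert stdin from Latin-1 to UTF-8.
latin1_to_utf8() {
  iconv -f ISO-8859-1 -t UTF-8
}

# 0xb5 (octal 265) is MICRO SIGN in Latin-1; in UTF-8 it becomes c2 b5.
printf '\265' | latin1_to_utf8 | od -An -tx1
```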

<!--
## Spec Tests

June 2024 notes:

- `spec/var-op-patsub` has failing cases, e.g. `LC_ALL=C`
  - ${s//?/a}
- glob() and fnmatch() seem to be OK? As long as locale is UTF-8.

-->


<!--

What libraries are we using?

TODO: Make sure these are UTF-8 mode, regardless of LANG global variables?

Or maybe we punt on that, and say Oils is only valid in UTF-8 mode? Need to
investigate the API more.

- fnmatch()
- glob()
- regcomp/regexec()

- Are we using any re2c unicode? For JSON?
- upper() and lower()? isupper() and islower()
  - Need to sort these out

-->