---
default_highlighter: oils-sh
---

YSH Design: Streams, Tables and Processes (and Trees?)
======================================================

*(July 2024)*

This is a long, "unified/orthogonal" design for:

- Streams: [awk]($xref) delimited lines, regexes
- Tables: like data frames with R's dplyr or Pandas, but with the "exterior"
  TSV8 format
- Processes: xargs -P in parallel

There's also a relation to:

- Trees: `jq`, which will be covered elsewhere.

It's a layered design.  That means we need some underlying mechanisms:

- `eval` and positional args `$1 $2 $3`
- `ctx` builtin
- Data languages: TSV8
- Process pool / event loop primitive

It will link to:

- Oils blog posts (old)
- Zulip threads (recent)
- Other related projects (many of them)

<div id="toc">
</div>

## Background / References

- Shell, Awk, and Make Should be Combined (2016)
  - this is the Awk part!

- What is a Data Frame? (2018)

- Sketches of YSH Features (June 2023) - can we express things in YSH?
  - Zulip: Oils Layering / Self-hosting

- Language Compositionality Test: J8 Lines
  - This whole thing is a compositionality test

- read --split
  - more feedback from Aidan and Samuel

- jq in jq thread

Old wiki pages:

- <https://github.com/oilshell/oil/wiki/Structured-Data-in-Oil>
  - uxy - closest I think - <https://github.com/sustrik/uxy>
    - relies on to-json and jq for querying
  - miller - I don't like their language - <https://github.com/johnkerl/miller>
  - jc - <https://github.com/kellyjonbrazil/jc>
- nushell
- extremely old thing -

We're doing **all of these**.

## Concrete Use Cases

- benchmarks/* with dplyr
- wedge report
- oilshell.org analytics job uses dplyr and ggplot2

## Intro

### How much code is it?

- I think this is ~1000 lines of Python and ~1000 lines of YSH (not including tests)
  - It should be small

### Thanks

- Samuel - two big hints
  - do it in YSH
  - `table` with the `ctx` builtin
- Aidan
  - `read --split` feedback


## Tools

- awk
  - streams of records - row-wise
- R
  - column-wise operations on tables
- `find . -printf '%s %P\n'` - size and path
  - generate text that looks like a table
- xargs
  - operate on tabular text -- it has a bespoke splitting algorithm
  - Opinionated guide to xargs
  - table in, table out
- jq - "awk for JSON"


## Concepts

- TSV8
  - aligned format SSV8
  - columns have types, and attributes
- Lines
  - raw lines like shell
  - J8 lines (which can represent any filename, any Unicode or byte string)
- Tables - can be thought of as:
  - Streams of Rows - shape `[{bytes: 123, path: "foo"}, {}, ...]`
    - this is actually <https://jsonlines.org> , and it fits well with `jq`
  - Columns - shape `{bytes: [], path: []}`

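To make the two table shapes concrete, here is a minimal Python sketch (not YSH) that converts between them. Plain dicts and lists stand in for the interior Dict and List types; the function names are made up for illustration, and the sketch assumes every row has the same keys.

```python
# Sketch only: Python dicts and lists stand in for YSH Dict and List.
# rows_to_columns and columns_to_rows are hypothetical names.

def rows_to_columns(rows):
    """Turn a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def columns_to_rows(columns):
    """Turn a dict of equal-length column lists back into row dicts."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


rows = [
    {"bytes": 123, "path": "foo"},
    {"bytes": 456, "path": "bar"},
]
columns = rows_to_columns(rows)
print(columns)  # {'bytes': [123, 456], 'path': ['foo', 'bar']}
assert columns_to_rows(columns) == rows
```

The row shape suits streaming, since each row can be handled as it arrives. The column shape suits whole-column operations like a sum.
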
## Underlying Mechanisms in Oils / Primitives

- blocks `value.Block` - `^()` and `{ }`
- expressions `value.Expr` - `^[]` and 'compute [] where []'

- eval (b, vars={}, positional=[])

- Buffered for loop
  - YSH is now roughly as fast as Awk!
  - `for x in <>`

- "magic awk loop"

      with chop {
        for <README.md *.py> {
          echo _line_num _line _filename $1 $2
        }
      }

- positional args $1 $2 $3
  - currently mean "argv stack"
  - or "the captures"
  - this can probably be generalized

- `ctx` builtin
- `value.Place`

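As a rough model of what the "magic awk loop" exposes on each iteration, here is a Python sketch (not YSH). The names `awk_loop` and `Record` are hypothetical. It assumes the line number restarts for each file and that chopping means splitting on runs of whitespace; the design above does not settle either point.

```python
# Sketch only: mimics the loop variables of the design in plain Python.
# awk_loop and Record are hypothetical names, not part of YSH.
from typing import Iterable, Iterator, NamedTuple


class Record(NamedTuple):
    filename: str
    line_num: int   # assumption: counts from 1 within each file
    line: str       # the line without its trailing newline
    fields: list    # whitespace-separated fields; fields[0] plays the role of $1


def awk_loop(files: Iterable[tuple]) -> Iterator[Record]:
    """Yield one Record per line for each (filename, lines) pair."""
    for filename, lines in files:
        for line_num, raw in enumerate(lines, start=1):
            line = raw.rstrip("\n")
            yield Record(filename, line_num, line, line.split())


files = [
    ("README.md", ["Oils is a shell\n", "See the docs\n"]),
    ("main.py", ["import sys\n"]),
]
for rec in awk_loop(files):
    print(rec.line_num, rec.filename, rec.fields[0], rec.fields[1])
```
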
TODO:

- split() like Python, not like shell IFS algorithm

- string formatting ${bytes %.2f}
  - ${bytes %.2f M} Megabytes
  - ${bytes %.2f Mi} Mebibytes

  - ${timestamp +'%Y-%m-%d'} and strftime

  - this is for

  - floating point %e %f %g and printf and strftime

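For reference, this is the behavior the TODO items point at, written as a Python sketch. `format_size` is a hypothetical helper. It assumes `M` means dividing by 10^6 and `Mi` means dividing by 2^20, and it returns only the number, since the design does not say whether a unit suffix is printed.

```python
# Sketch only: Python equivalents of the proposed formatting features.
from datetime import datetime, timezone


def format_size(num_bytes, unit=""):
    """Format a byte count with two decimals, optionally scaled.

    Assumption: "M" divides by 10**6 (megabytes), "Mi" by 2**20 (mebibytes).
    """
    divisors = {"": 1, "M": 10**6, "Mi": 2**20}
    return "%.2f" % (num_bytes / divisors[unit])


n = 5_000_000
print(format_size(n, "M"))   # 5.00
print(format_size(n, "Mi"))  # 4.77

# Python's split(): runs of whitespace collapse and no empty fields appear.
print("  a   b c ".split())  # ['a', 'b', 'c']

ts = datetime(2024, 7, 1, tzinfo=timezone.utc)
print(ts.strftime("%Y-%m-%d"))  # 2024-07-01
```
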
### Process Pool or Event Loop Primitive?

- if you want to display progress, then you might need an event loop
- test framework might display progress

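One way to see the difference: a pool that only returns results at the end cannot report progress, while one that reacts to each completion can. The Python sketch below shows the second kind. `run_pool` and `on_progress` are hypothetical names, and threads stand in for the processes that `xargs -P` would start.

```python
# Sketch only: a bounded pool that reports progress as tasks finish.
# Threads stand in for processes; run_pool is a hypothetical name.
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_pool(func, items, max_jobs=4, on_progress=None):
    """Apply func to each item with at most max_jobs workers.

    Returns results in input order. on_progress(done, total) is called
    in the calling thread each time a task finishes.
    """
    items = list(items)
    results = [None] * len(items)
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        futures = {pool.submit(func, item): i for i, item in enumerate(items)}
        for done, future in enumerate(as_completed(futures), start=1):
            results[futures[future]] = future.result()
            if on_progress is not None:
                on_progress(done, len(items))
    return results


progress = []
lengths = run_pool(
    len,
    ["a", "bb", "ccc"],
    max_jobs=2,
    on_progress=lambda done, total: progress.append((done, total)),
)
print(lengths)   # [1, 2, 3]
print(progress)  # [(1, 3), (2, 3), (3, 3)]
```

The loop over `as_completed` is a small event loop in disguise: it wakes up once per finished task, which is where a progress display would hook in.
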
## Matrices - Orthogonal design in these dimensions

- input: lines vs. rows
- output: string (Str, Template) vs. row vs. block execution (also a row)
- execution: serial vs. parallel
- representation: interior vs. exterior !!!
  - Dicts and Lists are interior, but TSV8 is exterior
  - and we have row-wise format, and column-wise format -- this always bugged me
- exterior: human vs. machine readable
  - TSV8 is both human and machine-readable
  - "aligned" #.ssv8 format is also
  - they are one format named TSV8, with different file extensions.  This is
    because it doesn't make too much sense to implement SSV8 without TSV8.  The
    latter becomes trivial.  So we call the whole thing TSV8.

This means we consider all these conversions:

- Line -> Line
- Line -> Row
- Row -> Line
- Row -> Row

## Concrete Decisions - Matrix cut off

The design might seem very general, but we did make some hard choices.

- push vs. pull
  - everything is "push" style I think
- buffered vs. unbuffered, everything

- List vs. iterators
  - everything is either iterable pipelines, or a List


[OSH]: $xref
[YSH]: $xref


## String World

**THESE ARE ALL THE SAME ALGORITHM**.  They just have different names.

- each-line
- each-row
- split-by (/d+/, cols=:|a b c|)
  - chop
- if-match
- must-match
  - todo

Should we also have: if-split-by ?  In case there aren't enough columns?

They all take:

- string arg ' '
- template arg (^"") - `value.Expr`
- block arg

For the block arg, this applies:

    -j 4
    --max-jobs 4

    --max-jobs $(cached-nproc)
    --max-jobs $[_nproc - 1]

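The "same algorithm" claim can be sketched in Python (not YSH): one loop, where the only thing that varies is how a line becomes fields. All names here are hypothetical. A Python callable stands in for the string, template, or block argument, and the sketch assumes the pattern given to split-by is the delimiter.

```python
# Sketch only: one driver loop shared by each-line, split-by, if-match
# and must-match. Every name here is hypothetical.
import re


def each(lines, parse, action):
    """Parse each line into fields, then call action on the fields.

    parse returns a list of fields, or None to skip the line.
    """
    out = []
    for line in lines:
        fields = parse(line)
        if fields is not None:
            out.append(action(*fields))
    return out


def whole_line(line):
    """each-line: a single field, the line itself."""
    return [line]


def split_by(pattern):
    """split-by: fields are the pieces between delimiter matches."""
    regex = re.compile(pattern)
    return lambda line: regex.split(line)


def if_match(pattern):
    """if-match: fields are the captures; non-matching lines are skipped."""
    regex = re.compile(pattern)

    def parse(line):
        m = regex.search(line)
        return list(m.groups()) if m else None

    return parse


def must_match(pattern):
    """must-match: like if-match, but a non-matching line is an error."""
    inner = if_match(pattern)

    def parse(line):
        fields = inner(line)
        if fields is None:
            raise ValueError("no match: %r" % line)
        return fields

    return parse


lines = ["10 foo.py", "20 bar.py", "oops"]
print(each(lines, whole_line, lambda line: line.upper()))
print(each(lines[:2], split_by(r"\s+"), lambda size, path: path + ":" + size))
print(each(lines, if_match(r"(\d+) (\S+)"), lambda size, path: int(size)))
```

The split-by call is given only the first two lines because `"oops"` yields a single field, and the two-argument action would fail on it. That is the "not enough columns" case the if-split-by question is about.
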
### Awk Issues

So we have this

    echo begin
    var d = {}
    cat -- @files | split-by (ifs=IFS) {
      echo $2 $1
      call d->accum($1, $2)
    }
    echo end

But then how do we have conditionals:

    Filter foo {  # does this define a proc?  Or a data structure

      split-by (ifs=IFS)  # is this possible?  We register the proc itself?

      config split-by (ifs=IFS)  # register it

      BEGIN {
        var d = {}
      }
      END {
        echo d.sum
      }

      when [$1 ~ /d+/] {
        setvar d.sum += $1
      }

    }

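One possible answer to "proc or data structure" is the data-structure reading: the filter is an object that collects clauses, and running it is a separate step. Here is a Python sketch of that reading (not YSH). The `Filter` class and its methods are hypothetical. The sketch initializes `sum` to 0 in the BEGIN step and requires the whole first field to be digits, which are both assumptions.

```python
# Sketch only: the Filter idea as a data structure that registers clauses.
# The class and method names are hypothetical.
import re


class Filter:
    """BEGIN and END hooks plus a list of (condition, action) clauses."""

    def __init__(self, begin, end):
        self.begin = begin    # called once with the state dict, before input
        self.end = end        # called once with the state dict, after input
        self.clauses = []     # tried in order for every line

    def when(self, condition, action):
        """Register a clause; nothing runs yet."""
        self.clauses.append((condition, action))

    def run(self, lines):
        state = {}
        self.begin(state)
        for line in lines:
            fields = line.split()
            for condition, action in self.clauses:
                if condition(fields):
                    action(state, fields)
        return self.end(state)


def first_field_is_number(fields):
    return bool(fields) and re.fullmatch(r"\d+", fields[0]) is not None


def add_first_field(state, fields):
    state["sum"] += int(fields[0])


f = Filter(begin=lambda d: d.update(sum=0), end=lambda d: d["sum"])
f.when(first_field_is_number, add_first_field)
print(f.run(["10 a", "x b", "32 c", ""]))  # 42
```
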
## Table World

### `table` to construct

Actions:

    table cat
    table align / table tabify
    table header (cols)
    table slice (1, -1)   or (-1, -2) etc.

Subcommands

    cols
    types
    attr units

Partial Parsing / Lazy Parsing - TSV8 is designed for this

    # we only decode the columns that are necessary
    cat myfile.tsv8 | table --by-col (&out, cols = :|bytes path|)

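The idea of decoding only the necessary columns can be sketched in Python (not YSH). The input here is plain tab-separated text with a header line, not the real TSV8 format, and `read_columns` is a hypothetical name. Splitting on tabs is cheap; the per-cell decoding is the part that gets skipped.

```python
# Sketch only: decode just the requested columns of tab-separated text.
# This is plain TSV with a header line, not a TSV8 parser.

def read_columns(lines, cols, decoders):
    """Return {name: [values]} for the requested columns only.

    lines: iterable of tab-separated text; the first line is the header.
    decoders: {name: callable} used to decode a cell; the default is str.
    """
    it = iter(lines)
    header = next(it).rstrip("\n").split("\t")
    wanted = [(header.index(name), name) for name in cols]
    out = {name: [] for name in cols}
    for line in it:
        cells = line.rstrip("\n").split("\t")
        for index, name in wanted:
            # Only requested cells are decoded; the rest stay untouched text.
            out[name].append(decoders.get(name, str)(cells[index]))
    return out


text = [
    "bytes\tpath\tmtime\n",
    "123\tfoo.py\tnot-a-number\n",
    "456\tbar.py\talso-bad\n",
]
out = read_columns(text, ["bytes", "path"], {"bytes": int})
print(out)  # {'bytes': [123, 456], 'path': ['foo.py', 'bar.py']}
```

The `mtime` cells are malformed on purpose. Because that column is never requested, it is never decoded, so nothing fails.
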
## Will writing it in YSH be slow?

- We concentrate on semantics first
- We can rewrite in Python
- Better: users can use **exterior** tools with the same interface
  - in some cases
  - they can write an efficient `sort-tsv8` or `join-tsv8` with novel algorithms
- Most data will be small at first


## Applications

- Shell is shared nothing
- Scaling to infinity on the biggest clouds


## Extra: Tree World?

This is sort of "expanding the scope" of the project, when we want to reduce scope.

But YSH has both tree-shaped JSON, and table-shaped TSV8, and jq is a nice **bridge** between them.

Streams of Trees (jq)

    empty
    this
    this[]
    =>
    select()
    a & b  # more than one


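In jq, every filter maps one input value to a stream of zero or more outputs. A Python sketch (not YSH) of the operators listed above, using generators, might look like this. The mapping is a guess: `=>` is read as jq's pipe and `a & b` as jq's comma. The names `items`, `pipe`, `both` and `key` are hypothetical, and `key` is an extra helper that is not in the list.

```python
# Sketch only: jq-style filters as functions from one value to a stream.
# The reading of "=>" as pipe and "a & b" as jq's comma is an assumption.

def empty(value):
    """Produce no outputs."""
    return iter(())


def this(value):
    """Produce the input unchanged (jq's dot)."""
    yield value


def items(value):
    """this[]: produce each element of a list, or each value of a dict."""
    if isinstance(value, dict):
        yield from value.values()
    else:
        yield from value


def pipe(*filters):
    """=>: feed every output of one filter into the next."""
    def run(value):
        stream = [value]
        for f in filters:
            stream = [out for v in stream for out in f(v)]
        yield from stream
    return run


def select(pred):
    """Produce the input only if pred(input) is true."""
    def run(value):
        if pred(value):
            yield value
    return run


def both(a, b):
    """a & b: produce the outputs of a, then the outputs of b."""
    def run(value):
        yield from a(value)
        yield from b(value)
    return run


def key(name):
    """Hypothetical helper: produce one field of a dict."""
    def run(value):
        yield value[name]
    return run


doc = [{"path": "foo", "bytes": 123}, {"path": "bar", "bytes": 5}]
big = pipe(items, select(lambda row: row["bytes"] > 100), both(key("path"), key("bytes")))
print(list(big(doc)))  # ['foo', 123]
```
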
## Pie in the Sky

Types of Data Languages:

- flat strings
- JSON8 - tree
- TSV8 - table
- NIL8 - Lisp Tree
- HTML/XML - doc tree -- attributed text (similar to Emacs data model)
  - 8ml

Types of query languages:

- regex
- jq / jshape
- tsv8


## Appendix

### Notes on Naming

Considering columns and then rows:

- SQL is "select ... where"
- dplyr is "select ... filter"
- YSH is "pick ... where"
  - select is a legacy shell keyword, and pick is shorter
  - or it could be elect in OSH, elect/select in YSH
  - OSH wouldn't support mutate [average = bytes/total] anyway

dplyr:

- summarise vs. summarize vs. summary

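To show what the "pick ... where" naming would attach to, here is a Python sketch (not YSH) of the three operations on a list of row dicts. The function signatures are made up for illustration, and a Python lambda stands in for an unevaluated expression like `[average = bytes/total]`.

```python
# Sketch only: pick (columns), where (rows) and mutate (new columns)
# over a list of row dicts. Signatures are hypothetical.

def pick(rows, *cols):
    """Keep only the named columns of each row (SQL and dplyr call this select)."""
    return [{c: row[c] for c in cols} for row in rows]


def where(rows, pred):
    """Keep only the rows for which pred(row) is true (dplyr calls this filter)."""
    return [row for row in rows if pred(row)]


def mutate(rows, **new_cols):
    """Add columns computed from each row, like dplyr's mutate."""
    return [{**row, **{name: f(row) for name, f in new_cols.items()}} for row in rows]


rows = [
    {"path": "foo", "bytes": 30, "mtime": 1},
    {"path": "bar", "bytes": 70, "mtime": 2},
]
total = sum(row["bytes"] for row in rows)
with_average = mutate(rows, average=lambda r: r["bytes"] / total)
result = pick(where(with_average, lambda r: r["bytes"] > 50), "path", "average")
print(result)  # [{'path': 'bar', 'average': 0.7}]
```
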