---
in_progress: yes
default_highlighter: oils-sh
---

Streams, Tables and Processes - awk, R, xargs
=============================================

*(July 2024)*

This is a long, "unified/orthogonal" design for:

- Streams: [awk]($xref) delimited lines, regexes
- Tables: like data frames with R's dplyr or Pandas, but with the "exterior"
  TSV8 format
- Processes: xargs -P in parallel

There's also a relation to:

- Trees: `jq`, which will be covered elsewhere.

It's a layered design.  That means we need some underlying mechanisms:

- `eval` and positional args `$1 $2 $3`
- `ctx` builtin
- Data languages: TSV8
- Process pool / event loop primitive

It will link to:

- Oils blog posts (old)
- Zulip threads (recent)
- Other related projects (many of them)

<div id="toc">
</div>

## Background / References

- Shell, Awk, and Make Should be Combined (2016)
  - this is the Awk part!

- What is a Data Frame? (2018)

- Sketches of YSH Features (June 2023) - can we express things in YSH?
  - Zulip: Oils Layering / Self-hosting

- Language Compositionality Test: J8 Lines
  - This whole thing is a compositionality test

- read --split
  - more feedback from Aidan and Samuel

- jq in jq thread

Old wiki pages:

- <https://github.com/oilshell/oil/wiki/Structured-Data-in-Oil>
  - uxy - closest I think - <https://github.com/sustrik/uxy>
    - relies on to-json and jq for querying
  - miller - I don't like their language - <https://github.com/johnkerl/miller>
  - jc - <https://github.com/kellyjonbrazil/jc>
- nushell
- extremely old thing -

We're doing **all of these**.

## Concrete Use Cases

- benchmarks/* with dplyr
- wedge report
- oilshell.org analytics job uses dplyr and ggplot2

## Intro

### How much code is it?

- I think this is ~1000 lines of Python and ~1000 lines of YSH (not including tests)
  - It should be small

### Thanks

- Samuel - two big hints
  - do it in YSH
  - `table` with the `ctx` builtin
- Aidan
  - `read --split` feedback

## Tools

- awk
  - streams of records - row-wise
- R
  - column-wise operations on tables
- `find . -printf '%s %P\n'` - size and path
  - generate text that looks like a table
- xargs
  - operate on tabular text -- it has a bespoke splitting algorithm
  - Opinionated guide to xargs
  - table in, table out
- jq - "awk for JSON"

## Concepts

- TSV8
  - aligned format SSV8
  - columns have types, and attributes
- Lines
  - raw lines like shell
  - J8 lines (which can represent any filename, any unicode or byte string)
- Tables - can be thought of as:
  - Streams of Rows - shape `[{bytes: 123, path: "foo"}, {}, ...]`
    - this is actually <https://jsonlines.org> , and it fits well with `jq`
  - Columns - shape `{bytes: [], path: []}`
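
The two table shapes above hold the same data, so converting between them is mechanical. Here is a minimal Python sketch of that conversion on interior data structures. The function names `rows_to_columns` and `columns_to_rows` are made up for illustration and are not part of Oils, and the sketch assumes every row has the same keys.

```python
from typing import Any

Row = dict[str, Any]
Columns = dict[str, list[Any]]


def rows_to_columns(rows: list[Row]) -> Columns:
    """Turn a list of row dicts into a dict of column lists.

    Assumes every row has the same keys as the first row.
    """
    if not rows:
        return {}
    names = list(rows[0])
    return {name: [row[name] for row in rows] for name in names}


def columns_to_rows(cols: Columns) -> list[Row]:
    """Inverse of rows_to_columns; assumes all columns have equal length."""
    names = list(cols)
    return [dict(zip(names, values)) for values in zip(*cols.values())]


rows = [
    {"bytes": 123, "path": "foo"},
    {"bytes": 456, "path": "bar"},
]
cols = rows_to_columns(rows)
print(cols)
```

The row shape suits streaming (one record at a time, as in awk), while the column shape suits whole-table operations (as in R).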

## Underlying Mechanisms in Oils / Primitives

- blocks `value.Block` - `^()` and `{ }`
- expressions `value.Expr` - `^[]` and 'compute [] where []'

- eval (b, vars={}, positional=[])

- Buffered for loop
  - YSH is now roughly as fast as Awk!
  - `for x in <>`

- "magic awk loop"

      with chop {
        for <README.md *.py> {
          echo _line_num _line _filename $1 $2
        }
      }

- positional args $1 $2 $3
  - currently mean "argv stack"
  - or "the captures"
  - this can probably be generalized

- `ctx` builtin
- `value.Place`
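
To make the "magic awk loop" concrete, here is a rough Python model of what it would provide to the loop body: a line number, the line, the filename, and the split fields. This is only a sketch of the semantics, not how Oils implements it. The name `awk_loop` is hypothetical, whitespace splitting stands in for `chop`, and restarting the line number for each file is an assumption. Files are passed as `(filename, text)` pairs so the example runs without touching the disk.

```python
from typing import Callable, Iterable

# Hypothetical callback signature: (line_num, line, filename, fields)
Action = Callable[[int, str, str, list[str]], None]


def awk_loop(files: Iterable[tuple[str, str]], action: Action) -> None:
    """Call action once per line of each (filename, text) pair.

    Fields come from str.split(), which splits on runs of whitespace;
    that stands in for `chop` and is an assumption of this sketch.
    """
    for filename, text in files:
        # Assumption: line numbers restart at 1 for each file
        for line_num, line in enumerate(text.splitlines(), start=1):
            fields = line.split()
            action(line_num, line, filename, fields)


seen: list[tuple[int, str, str]] = []


def record(line_num: int, line: str, filename: str, fields: list[str]) -> None:
    # Like `echo _line_num _filename $1`, but collected in a list
    first = fields[0] if fields else ""
    seen.append((line_num, filename, first))


awk_loop([("README.md", "hello world\n\nbye now\n"), ("a.py", "import os\n")], record)
print(seen)
```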

TODO:

- split() like Python, not like shell IFS algorithm

- string formatting ${bytes %.2f}
  - ${bytes %.2f M} Megabytes
  - ${bytes %.2f Mi} Mebibytes

  - ${timestamp +'%Y-%m-%d'} and strftime

  - this is for

  - floating point %e %f %g and printf and strftime
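
As a point of comparison for the formatting TODO, this is roughly what the proposed `M` and `Mi` suffixes would compute, written in Python. The helper `format_bytes` is hypothetical. It assumes `M` means decimal megabytes (10^6 bytes) and `Mi` means binary mebibytes (2^20 bytes), which matches the labels in the list above.

```python
from datetime import datetime, timezone


def format_bytes(n: int, unit: str = "") -> str:
    """Format a byte count with two decimals, scaled by an optional unit.

    "M" is decimal megabytes (10**6); "Mi" is binary mebibytes (2**20).
    """
    scale = {"": 1, "M": 10**6, "Mi": 2**20}[unit]
    return "%.2f" % (n / scale)


n = 5_242_880  # 5 * 2**20 bytes
print(format_bytes(n, "M"))   # decimal megabytes
print(format_bytes(n, "Mi"))  # binary mebibytes

# The strftime half of the TODO, using the standard library
ts = datetime(2024, 7, 1, tzinfo=timezone.utc)
print(ts.strftime("%Y-%m-%d"))
```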

### Process Pool or Event Loop Primitive?

- if you want to display progress, then you might need an event loop
- test framework might display progress
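
One way to see the tension between a pool and an event loop: a pool that only returns results at the end cannot show progress, so the caller needs a callback as each job finishes. The Python sketch below illustrates that with the standard `concurrent.futures` module. It uses threads rather than the child processes a shell would start, and `run_all` and `on_progress` are names invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all(tasks, max_jobs=4, on_progress=None):
    """Run zero-argument callables in a pool; report progress as each finishes.

    Returns results in the same order as tasks.
    """
    results = [None] * len(tasks)
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        # Map each future back to the index of its task
        futures = {pool.submit(task): i for i, task in enumerate(tasks)}
        done = 0
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
            done += 1
            if on_progress is not None:
                on_progress(done, len(tasks))
    return results


progress = []
tasks = [lambda n=n: n * n for n in range(5)]
squares = run_all(tasks, max_jobs=2, on_progress=lambda d, t: progress.append((d, t)))
print(squares, progress)
```

The `as_completed` loop is a small event loop in disguise, which is why a test framework that prints progress may need more than a plain "map over a pool" primitive.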

## Matrices - Orthogonal design in these dimensions

- input: lines vs. rows
- output: string (Str, Template) vs. row vs. block execution (also a row)
- execution: serial vs. parallel
- representation: interior vs. exterior !!!
  - Dicts and Lists are interior, but TSV8 is exterior
  - and we have row-wise format, and column-wise format -- this always bugged me
- exterior: human vs. machine readable
  - TSV8 is both human and machine-readable
  - "aligned" #.ssv8 format is also
  - they are one format named TSV8, with different file extensions.  This is
    because it doesn't make too much sense to implement SSV8 without TSV8.  The
    latter becomes trivial.  So we call the whole thing TSV8.

This means we consider all these conversions:

- Line -> Line
- Line -> Row
- Row -> Line
- Row -> Row

## Concrete Decisions - Matrix cut off

The design might seem very general, but we did make some hard choices.

- push vs. pull
  - everything is "push" style I think
- buffered vs. unbuffered, everything

- List vs iterators
  - everything is either iterable pipelines, or a List


[OSH]: $xref
[YSH]: $xref

## String World

**THESE ARE ALL THE SAME ALGORITHM**.  They just have different names.

- each-line
- each-row
- split-by (/d+/, cols=:|a b c|)
  - chop
- if-match
- must-match
  - todo

Should we also have `if-split-by`, in case there aren't enough columns?

They all take:

- string arg ' '
- template arg (^"") - `value.Expr`
- block arg

For the block arg, this applies:

    -j 4
    --max-jobs 4

    --max-jobs $(cached-nproc)
    --max-jobs $[_nproc - 1]
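
One possible reading of "the same algorithm" is a single loop parameterized by how a line becomes fields. The Python sketch below models `each-line`, `split-by` and `if-match` that way. All names here (`each`, `whole_line`, `split_by`, `if_match`) are illustrative, and the action is a plain function rather than a string, template or block.

```python
import re
from typing import Callable, Iterable, Iterator, Optional

# A "splitter" turns one line into fields, or None to skip the line.
Splitter = Callable[[str], Optional[list[str]]]


def each(lines: Iterable[str], splitter: Splitter,
         action: Callable[[list[str]], str]) -> Iterator[str]:
    """The shared loop: split each line, skip non-matches, apply the action."""
    for line in lines:
        fields = splitter(line)
        if fields is not None:
            yield action(fields)


def whole_line(line: str) -> list[str]:
    # each-line: a single field, the line itself
    return [line]


def split_by(pattern: str) -> Splitter:
    # split-by: fields are the pieces between delimiter matches
    regex = re.compile(pattern)
    return lambda line: regex.split(line)


def if_match(pattern: str) -> Splitter:
    # if-match: fields are the capture groups; non-matching lines are skipped
    regex = re.compile(pattern)

    def splitter(line: str) -> Optional[list[str]]:
        m = regex.search(line)
        return list(m.groups()) if m else None

    return splitter


lines = ["10 foo", "oops", "20 bar"]
print(list(each(lines, whole_line, lambda f: f[0].upper())))
print(list(each(lines, split_by(r"\s+"), lambda f: f[-1])))
print(list(each(lines, if_match(r"(\d+) (\w+)"), lambda f: f"{f[1]}={f[0]}")))
```

In this model, `must-match` would be the same splitter as `if-match`, except that it raises an error instead of returning `None`.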

### Awk Issues

So we have this:

    echo begin
    var d = {}
    cat -- @files | split-by (ifs=IFS) {
      echo $2 $1
      call d->accum($1, $2)
    }
    echo end

But then how do we express conditionals?

    Filter foo {  # does this define a proc?  Or a data structure

      split-by (ifs=IFS)  # is this possible?  We register the proc itself?

      config split-by (ifs=IFS)  # register it

      BEGIN {
        var d = {}
      }
      END {
        echo d.sum
      }

      when [$1 ~ /d+/] {
        setvar d.sum += $1
      }

    }
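
To explore the "data structure" answer to the question in the comment above, here is a Python sketch in which a filter is an object that registers `BEGIN` hooks, `END` hooks, and `when` rules, and then runs them over lines. This is one possible design under stated assumptions, not a decision. The class and method names are hypothetical, and fields are split on whitespace.

```python
import re
from typing import Callable

Fields = list[str]


class Filter:
    """Hypothetical data-structure reading of `Filter foo { ... }`."""

    def __init__(self) -> None:
        self.state: dict = {}
        self.begin: list[Callable[[dict], None]] = []
        self.end: list[Callable[[dict], None]] = []
        self.rules: list[tuple[Callable[[Fields], bool],
                               Callable[[dict, Fields], None]]] = []

    def when(self, cond, action) -> None:
        # Register a (condition, action) pair, like an Awk pattern-action rule
        self.rules.append((cond, action))

    def run(self, lines) -> dict:
        for hook in self.begin:
            hook(self.state)
        for line in lines:
            fields = line.split()
            for cond, action in self.rules:
                if cond(fields):
                    action(self.state, fields)
        for hook in self.end:
            hook(self.state)
        return self.state


def add_first(state: dict, fields: Fields) -> None:
    state["sum"] += int(fields[0])


f = Filter()
f.begin.append(lambda state: state.update(sum=0))
# Condition: field 1 exists and is all digits (anchored on purpose in this sketch)
f.when(lambda fields: bool(fields) and re.fullmatch(r"\d+", fields[0]) is not None,
       add_first)
f.end.append(lambda state: print(state["sum"]))

result = f.run(["3 apples", "x pears", "", "4 plums"])
```

Because the rules are registered first and run later, the same object could in principle be handed to a serial or a parallel runner.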

## Table World

### `table` to construct

Actions:

    table cat
    table align / table tabify
    table header (cols)
    table slice (1, -1)   or (-1, -2) etc.

Subcommands

    cols
    types
    attr units

Partial Parsing / Lazy Parsing - TSV8 is designed for this

    # we only decode the columns that are necessary
    cat myfile.tsv8 | table --by-col (&out, cols = :|bytes path|)
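
The idea of decoding only the necessary columns can be sketched in Python on plain tab-separated text. This is an assumption-heavy illustration: it handles ordinary TSV with a single header row, not TSV8's type row or quoted cells, and `read_columns` is a made-up name. The output is the column-wise shape, as `--by-col` suggests.

```python
def read_columns(text: str, wanted: list[str]) -> dict[str, list[str]]:
    """Return only the wanted columns from tab-separated text.

    The first line is the header. Cells in other columns are split but never
    decoded or stored. Plain TSV only: no TSV8 type row or quoted cells.
    """
    lines = text.splitlines()
    header = lines[0].split("\t")
    index = {name: header.index(name) for name in wanted}  # ValueError if absent
    out: dict[str, list[str]] = {name: [] for name in wanted}
    for line in lines[1:]:
        cells = line.split("\t")
        for name, i in index.items():
            out[name].append(cells[i])
    return out


data = "bytes\tpath\tmode\n123\tfoo\t644\n456\tbar\t755\n"
cols = read_columns(data, ["bytes", "path"])
print(cols)
```

A real decoder would save more work than this, for example by skipping number parsing and unescaping for the columns that were not requested.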

## Will writing it in YSH be slow?

- We concentrate on semantics first
- We can rewrite in Python
- Better: users can use **exterior** tools with the same interface
  - in some cases
  - they can write an efficient `sort-tsv8` or `join-tsv8` with novel algorithms
- Most data will be small at first


## Applications

- Shell is shared nothing
- Scaling to infinity on the biggest clouds


## Extra: Tree World?

This is sort of "expanding the scope" of the project, when we want to reduce scope.

But YSH has both tree-shaped JSON, and table-shaped TSV8, and jq is a nice **bridge** between them.

Streams of Trees (jq)

    empty
    this
    this[]
    =>
    select()
    a & b  # more than one


## Pie in the Sky

Four types of Data Languages:

- flat strings
- JSON8 - tree
- TSV8 - table
- NIL8 - Lisp Tree
- HTML/XML - doc tree -- attributed text (similar to Emacs data model)
  - 8ml

Four types of query languages:

- regex
- jq / jshape
- tsv8


## Appendix

### Notes on Naming

Considering columns and then rows:

- SQL is "select ... where"
- dplyr is "select ... filter"
- YSH is "pick ... where"
  - select is a legacy shell keyword, and pick is shorter
  - or it could be elect in OSH, elect/select in YSH
  - OSH wouldn't support mutate [average = bytes/total] anyway
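
For readers less familiar with dplyr, the three verbs named above can be modeled in a few lines of Python over a stream of row dicts. This is only a sketch of the semantics the names refer to. The function signatures are invented here and do not reflect a decided YSH interface.

```python
from typing import Any, Callable, Iterable, Iterator

Row = dict[str, Any]


def pick(rows: Iterable[Row], *names: str) -> Iterator[Row]:
    """Choose columns, like SQL/dplyr `select`."""
    for row in rows:
        yield {name: row[name] for name in names}


def where(rows: Iterable[Row], pred: Callable[[Row], bool]) -> Iterator[Row]:
    """Choose rows, like SQL `where` or dplyr `filter`."""
    return (row for row in rows if pred(row))


def mutate(rows: Iterable[Row], **new_cols: Callable[[Row], Any]) -> Iterator[Row]:
    """Add computed columns, like dplyr `mutate`."""
    for row in rows:
        yield {**row, **{name: fn(row) for name, fn in new_cols.items()}}


table = [
    {"path": "a.txt", "bytes": 100, "mode": "644"},
    {"path": "b.txt", "bytes": 300, "mode": "755"},
]
total = sum(row["bytes"] for row in table)

result = list(
    mutate(
        where(pick(table, "path", "bytes"), lambda r: r["bytes"] > 50),
        average=lambda r: r["bytes"] / total,
    )
)
print(result)
```

Each verb takes rows and yields rows, so they compose in any order, which matches the "Row -> Row" case in the conversion matrix.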

dplyr:

- summarise vs. summarize vs. summary