---
in_progress: yes
default_highlighter: oils-sh
---

Streams, Tables and Processes - awk, R, xargs
=============================================

*(July 2024)*

This is a long, "unified/orthogonal" design for:

- Streams: [awk]($xref) delimited lines, regexes
- Tables: like data frames with R's dplyr or Pandas, but with the "exterior"
  TSV8 format
- Processes: xargs -P in parallel

There's also a relation to:

- Trees: `jq`, which will be covered elsewhere.

It's a layered design. That means we need some underlying mechanisms:

- `eval` and positional args `$1 $2 $3`
- `ctx` builtin
- Data languages: TSV8
- Process pool / event loop primitive

It will link to:

- Oils blog posts (old)
- Zulip threads (recent)
- Other related projects (many of them)
|
<div id="toc">
</div>

## Intro With Code Snippets

Let's introduce this with a text file:

    $ seq 4 | xargs -n 2 | tee test.txt
    1 2
    3 4

xargs does splitting:

    $ echo 'alice bob' | xargs -n 1 -- echo hi | tee test2.txt
    hi alice
    hi bob

Oils:

    # should we use $_ for _word _line _row?  $[_.age] instead of $[_row.age]
    $ echo 'alice bob' | each-word { echo "hi $_" } | tee test2.txt
    hi alice
    hi bob

Normally this should be balanced
|
### Streams - awk

Now let's use awk:

    $ cat test.txt | awk '{ print $2 " " $1 }'
    2 1
    4 3

In YSH:

    $ cat test.txt | chop '$2 $1'
    2 1
    4 3

It's shorter!  `chop '$2 $1'` is an alias for
`split-by (space=true, template='$2 $1')`.

With an unevaluated expression (`value.Expr`), which allows static parsing:

    $ cat test.txt | chop (^"$2 $1")
    2 1
    4 3
|
With a block:

    $ cat test.txt | chop { mkdir -v -p $2/$1 }
    mkdir: created directory '2/1'
    mkdir: created directory '4/3'

With no argument, it prints a table:

    $ cat test.txt | chop
    #.tsv8 $1 $2
    1 2
    3 4

    $ cat test.txt | chop (names = :|a b|)
    #.tsv8 a b
    1 2
    3 4
|
Longer examples with split-by:

    $ cat test.txt | split-by (space=true, template='$2 $1')
    $ cat test.txt | split-by (space=true, template=^"$2 $1")
    $ cat test.txt | split-by (space=true) { mkdir -v -p $2/$1 }
    $ cat test.txt | split-by (space=true)
    $ cat test.txt | split-by (space=true, names = :|a b|)
    $ cat test.txt | split-by (space=true, names = :|a b|) {
        mkdir -v -p $a/$b
      }
|
With must-match:

    $ var p = /<capture d+> s+ <capture d+>/
    $ cat test.txt | must-match (p, template='$2 $1')
    $ cat test.txt | must-match (p, template=^"$2 $1")
    $ cat test.txt | must-match (p) { mkdir -v -p $2/$1 }
    $ cat test.txt | must-match (p)
|
With names:

    $ var p = /<capture d+ as a> s+ <capture d+ as b>/
    $ cat test.txt | must-match (p, template='$b $a')
    $ cat test.txt | must-match (p)
    #.tsv8 a b
    1 2
    3 4

    $ cat test.txt | must-match (p) {
        mkdir -v -p $a/$b
      }

Doing it in parallel:

    $ cat test.txt | must-match --max-jobs 4 (p) {
        mkdir -v -p $a/$b
      }
|
### Tables - Data frames with dplyr (R)

    $ cat table.txt
    size path
    3 foo.txt
    20 bar.jpg

    $ R
    > t=read.table('table.txt', header=T)
    > t
      size    path
    1    3 foo.txt
    2   20 bar.jpg
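
A file like `table.txt` can come straight from the shell.  As a sketch,
assuming GNU `find` (the same `-printf` trick listed under Tools below):

    $ { echo 'size path'; find . -maxdepth 1 -type f -printf '%s %P\n'; } > table.txt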
|
### Processes - xargs

We already saw this!  That's because we "compressed" awk and xargs together.
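
For comparison, the classic way to get the same parallelism with plain xargs
is `-P`.  A sketch of the `mkdir` example above (plain sh, GNU xargs):

    # $0 and $1 are the two fields that xargs passes to each sh invocation
    $ cat test.txt | xargs -n 2 -P 4 -- sh -c 'mkdir -v -p "$1/$0"'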
|
What's not in the streams / awk example above:

- `BEGIN END` - that can be separate
- `when [$1 ~ /d+/] { }`
|
## Background / References

- Shell, Awk, and Make Should be Combined (2016)
  - this is the Awk part!
- What is a Data Frame? (2018)
- Sketches of YSH Features (June 2023) - can we express things in YSH?
  - Zulip: Oils Layering / Self-hosting
- Language Compositionality Test: J8 Lines
  - This whole thing is a compositionality test
- read --split
  - more feedback from Aidan and Samuel
- jq in jq thread

Old wiki pages:

- <https://github.com/oilshell/oil/wiki/Structured-Data-in-Oil>
  - uxy - closest I think - <https://github.com/sustrik/uxy>
    - relies on to-json and jq for querying
  - miller - I don't like their language - <https://github.com/johnkerl/miller>
  - jc - <https://github.com/kellyjonbrazil/jc>
  - nushell
  - extremely old thing -

We're doing **all of these**.
|
## Concrete Use Cases

- benchmarks/* with dplyr
- wedge report
- oilshell.org analytics job uses dplyr and ggplot2
|
## Intro

### How much code is it?

- I think this is ~1000 lines of Python and ~1000 lines of YSH (not including tests)
- It should be small

### Thanks

- Samuel - two big hints
  - do it in YSH
  - `table` with the `ctx` builtin
- Aidan
  - `read --split` feedback
|
## Tools

- awk
  - streams of records - row-wise
- R
  - column-wise operations on tables
- `find . -printf '%s %P\n'` - size and path
  - generate text that looks like a table
- xargs
  - operate on tabular text -- it has a bespoke splitting algorithm
  - Opinionated guide to xargs
  - table in, table out
- jq - "awk for JSON"
|
## Concepts

- TSV8
  - aligned format SSV8
  - columns have types, and attributes
- Lines
  - raw lines like shell
  - J8 lines (which can represent any filename, any unicode or byte string)
- Tables - can be thought of as:
  - Streams of Rows - shape `[{bytes: 123, path: "foo"}, {}, ...]`
    - this is actually <https://jsonlines.org>, and it fits well with `jq`
  - Columns - shape `{bytes: [], path: []}` (see the `jq` sketch below)
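
To make the two shapes concrete, here's a sketch with `jq` (file name and
values are made up for illustration):

    $ cat rows.jsonl       # stream of rows: one JSON object per line
    {"bytes": 123, "path": "foo"}
    {"bytes": 456, "path": "bar"}

    $ jq -c -s '{bytes: map(.bytes), path: map(.path)}' rows.jsonl   # column shape
    {"bytes":[123,456],"path":["foo","bar"]}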
|
## Underlying Mechanisms in Oils / Primitives

- blocks `value.Block` - `^()` and `{ }`
- expressions `value.Expr` - `^[]` and `compute [] where []`

- `eval (b, vars={}, positional=[])`

- Buffered for loop
  - YSH is now roughly as fast as Awk!
  - `for x in (stdin)` (see the sketch after this list)

- "magic awk loop"

      with chop {
        for <README.md *.py> {
          echo _line_num _line _filename $1 $2
        }
      }

- positional args $1 $2 $3
  - currently mean "argv stack"
  - or "the captures"
  - this can probably be generalized

- `ctx` builtin
- `value.Place`
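
A minimal sketch of the buffered loop, using the `for x in (stdin)` form from
the list above (the exact spelling may vary between Oils releases):

    # count lines, like awk 'END { print NR }'
    var n = 0
    for line in (stdin) {
      setvar n += 1
    }
    echo $n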
|
TODO:

- split() like Python, not like shell IFS algorithm

- string formatting ${bytes %.2f}
  - ${bytes %.2f M} Megabytes
  - ${bytes %.2f Mi} Mebibytes

- ${timestamp +'%Y-%m-%d'} and strftime

- this is for

- floating point %e %f %g and printf and strftime (see the awk sketch after
  this list)
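
Until then, this kind of formatting is typically delegated to printf or awk.
A sketch of the Megabytes case with plain awk:

    $ echo 1234567 | awk '{ printf "%.2f M\n", $1 / 1000000 }'
    1.23 M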
|
### Process Pool or Event Loop Primitive?

- if you want to display progress, then you might need an event loop
- test framework might display progress
|
## Matrices - Orthogonal design in these dimensions

- input: lines vs. rows
- output: string (Str, Template) vs. row vs. block execution (also a row)
- execution: serial vs. parallel
- representation: interior vs. exterior !!!
  - Dicts and Lists are interior, but TSV8 is exterior
  - and we have a row-wise format and a column-wise format -- this always bugged me
- exterior: human vs. machine readable
  - TSV8 is both human- and machine-readable
  - the "aligned" `#.ssv8` format is also both
  - they are one format named TSV8, with different file extensions.  This is
    because it doesn't make much sense to implement SSV8 without TSV8 - the
    latter becomes trivial.  So we call the whole thing TSV8.

This means we consider all these conversions:

- Line -> Line
- Line -> Row
- Row -> Line
- Row -> Row
|
## Concrete Decisions - Matrix cut off

The design might seem very general, but we did make some hard choices.

- push vs. pull
  - everything is "push" style, I think
- buffered vs. unbuffered, everything

- List vs. iterators
  - everything is either iterable pipelines, or a List


[OSH]: $xref
[YSH]: $xref
|
## String World

**THESE ARE ALL THE SAME ALGORITHM**.  They just have different names.

- each-line
- each-row
- split-by (/d+/, cols=:|a b c|)
- chop
- if-match
- must-match
- todo

Should we also have `if-split-by`?  In case there aren't enough columns?

They all take:

- string arg ' '
- template arg (^"") - `value.Expr`
- block arg

For the block arg, these apply:

    -j 4
    --max-jobs 4

    --max-jobs $(cached-nproc)
    --max-jobs $[_nproc - 1]
|
### Awk Issues

So we have this:

    echo begin
    var d = {}
    cat -- @files | split-by (ifs=IFS) {
      echo $2 $1
      call d->accum($1, $2)
    }
    echo end

But then how do we have conditionals?

    Filter foo {   # does this define a proc?  Or a data structure?

      split-by (ifs=IFS)   # is this possible?  We register the proc itself?

      config split-by (ifs=IFS)   # register it

      BEGIN {
        var d = {}
      }
      END {
        echo d.sum
      }

      when [$1 ~ /d+/] {
        setvar d.sum += $1
      }
    }
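
For reference, the plain awk program that this sketch emulates, runnable
against `test.txt` from above:

    $ awk '
        BEGIN { sum = 0 }
        $1 ~ /[0-9]+/ { sum += $1 }
        END { print sum }
      ' test.txt
    4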
|
## Table World

### `table` to construct

Actions:

    table cat
    table align / table tabify
    table header (cols)
    table slice (1, -1) or (-1, -2) etc.

Subcommands:

    cols
    types
    attr units

Partial Parsing / Lazy Parsing - TSV8 is designed for this:

    # we only decode the columns that are necessary
    cat myfile.tsv8 | table --by-col (&out, cols = :|bytes path|)
|
## Will writing it in YSH be slow?

- We concentrate on semantics first
- We can rewrite in Python
- Better: users can use **exterior** tools with the same interface
  - in some cases
  - they can write an efficient `sort-tsv8` or `join-tsv8` with novel
    algorithms (see the `sort` sketch after this list)
- Most data will be small at first
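
Exterior tools already work on plain TSV today.  A header-less sketch with
standard `sort` (the `$'\t'` quoting is bash/OSH style):

    # sort rows by the first column, numerically, with tab as the field separator
    $ printf '20\tbar.jpg\n3\tfoo.txt\n' | sort -t $'\t' -k1,1n
    3       foo.txt
    20      bar.jpg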
|
## Applications

- Shell is shared nothing
- Scaling to infinity on the biggest clouds
|
## Extra: Tree World?

This is sort of "expanding the scope" of the project, at a time when we want
to reduce scope.

But YSH has both tree-shaped JSON and table-shaped TSV8, and `jq` is a nice
**bridge** between them.
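
As a sketch of that bridge with `jq` itself, turning a JSON array of records
into TSV (values made up for illustration):

    $ echo '[{"size": 3, "path": "foo.txt"}, {"size": 20, "path": "bar.jpg"}]' |
        jq -r '.[] | [.size, .path] | @tsv'
    3       foo.txt
    20      bar.jpg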
|
Streams of Trees (jq):

    empty
    this
    this[]
    =>
    select()
    a & b   # more than one
|
## Pie in the Sky

Data languages:

- flat strings
- JSON8 - tree
- TSV8 - table
- NIL8 - Lisp Tree
- HTML/XML - doc tree -- attributed text (similar to Emacs data model)
  - 8ml

Query languages:

- regex
- jq / jshape
- tsv8
|
## Appendix

### Notes on Naming

Considering columns and then rows:

- SQL is "select ... where"
- dplyr is "select ... filter"
- YSH is "pick ... where"
  - select is a legacy shell keyword, and pick is shorter
  - or it could be elect in OSH, and elect/select in YSH
  - OSH wouldn't support `mutate [average = bytes/total]` anyway

dplyr:

- summarise vs. summarize vs. summary
|