---
in_progress: yes
default_highlighter: oils-sh
---

Streams, Tables and Processes - awk, R, xargs
=============================================

*(July 2024)*

This is a long, "unified/orthogonal" design for:

- Streams: [awk]($xref) delimited lines, regexes
- Tables: like data frames with R's dplyr or Pandas, but with the "exterior"
  TSV8 format
- Processes: xargs -P in parallel

There's also a relation to:

- Trees: `jq`, which will be covered elsewhere.

It's a layered design. That means we need some underlying mechanisms:

- `eval` and positional args `$1 $2 $3`
- `ctx` builtin
- Data languages: TSV8
- Process pool / event loop primitive

It will link to:

- Oils blog posts (old)
- Zulip threads (recent)
- Other related projects (many of them)

<div id="toc">
</div>

## Background / References

- Shell, Awk, and Make Should be Combined (2016)
  - this is the Awk part!
- What is a Data Frame? (2018)
- Sketches of YSH Features (June 2023) - can we express things in YSH?
- Zulip: Oils Layering / Self-hosting
- Language Compositionality Test: J8 Lines
  - This whole thing is a compositionality test
- read --split
  - more feedback from Aidan and Samuel
- What is a Data Frame?
- jq in jq thread

Old wiki pages:

- <https://github.com/oilshell/oil/wiki/Structured-Data-in-Oil>
- uxy - closest I think - <https://github.com/sustrik/uxy>
  - relies on to-json and jq for querying
- miller - I don't like their language - <https://github.com/johnkerl/miller>
- jc - <https://github.com/kellyjonbrazil/jc>
- nushell
- extremely old thing -

We're doing **all of these**.
## Concrete Use Cases

- benchmarks/* with dplyr
- wedge report
- oilshell.org analytics job uses dplyr and ggplot2

## Intro

### How much code is it?

- I think this is ~1000 lines of Python and ~1000 lines of YSH (not including tests)
- It should be small

### Thanks

- Samuel - two big hints
  - do it in YSH
  - `table` with the `ctx` builtin
- Aidan
  - `read --split` feedback

## Tools

- awk
  - streams of records - row-wise
- R
  - column-wise operations on tables
- `find . -printf '%s %P\n'` - size and path
  - generate text that looks like a table
- xargs
  - operate on tabular text -- it has a bespoke splitting algorithm
  - Opinionated guide to xargs
  - table in, table out
- jq - "awk for JSON"
## Concepts

- TSV8
  - aligned format SSV8
  - columns have types, and attributes
- Lines
  - raw lines like shell
  - J8 lines (which can represent any filename, any unicode or byte string)
- Tables - can be thought of as:
  - Streams of Rows - shape `[{bytes: 123, path: "foo"}, {}, ...]`
    - this is actually <https://jsonlines.org>, and it fits well with `jq`
  - Columns - shape `{bytes: [], path: []}`

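
Here's a minimal Python sketch of those two shapes and the conversion between
them (the function names are illustrative, not part of the design):

    # Rows: a stream of dicts, like https://jsonlines.org
    # Columns: a dict of parallel lists, like an R data frame
    def rows_to_cols(rows):
        cols = {}
        for row in rows:
            for name, value in row.items():
                cols.setdefault(name, []).append(value)
        return cols

    def cols_to_rows(cols):
        # assumes every column has the same length
        names = list(cols)
        return [dict(zip(names, values)) for values in zip(*cols.values())]

    rows = [{'bytes': 123, 'path': 'foo'}, {'bytes': 456, 'path': 'bar'}]
    cols = rows_to_cols(rows)   # {'bytes': [123, 456], 'path': ['foo', 'bar']}
    assert cols_to_rows(cols) == rows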
## Underlying Mechanisms in Oils / Primitives

- blocks `value.Block` - `^()` and `{ }`
- expressions `value.Expr` - `^[]` and 'compute [] where []'

- eval (b, vars={}, positional=[])

- Buffered for loop
  - YSH is now roughly as fast as Awk!
  - `for x in <>`

- "magic awk loop" (see the Python sketch after this list)

      with chop {
        for <README.md *.py> {
          echo _line_num _line _filename $1 $2
        }
      }

- positional args $1 $2 $3
  - currently mean "argv stack"
  - or "the captures"
  - this can probably be generalized

- `ctx` builtin
  - `value.Place`

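
The "magic awk loop" above is proposed syntax, but its semantics are easy to
state. A rough Python model (not the real implementation; the names are made
up):

    def awk_loop(filenames):
        # For each line of each file, yield the line number, the raw line,
        # the filename, and the whitespace-split fields ($1 $2 ...).
        for filename in filenames:
            with open(filename) as f:
                for line_num, line in enumerate(f, start=1):
                    fields = line.split()   # "chop": split on runs of whitespace
                    yield line_num, line.rstrip('\n'), filename, fields

    for line_num, line, filename, fields in awk_loop(['README.md']):
        print(line_num, filename, fields[:2])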
TODO:

- split() like Python, not like the shell IFS algorithm

- string formatting ${bytes %.2f} (see the sketch after this list)
  - ${bytes %.2f M} Megabytes
  - ${bytes %.2f Mi} Mebibytes

- ${timestamp +'%Y-%m-%d'} and strftime

- this is for

- floating point %e %f %g and printf and strftime

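
For the byte-formatting items above, the difference between M and Mi is just
the divisor; a plain Python sketch of the arithmetic (the `${bytes %.2f M}`
syntax itself is still a proposal):

    def fmt_bytes(n, unit):
        # M = megabytes (10^6), Mi = mebibytes (2^20)
        divisor = {'M': 10**6, 'Mi': 2**20}[unit]
        return '%.2f %sB' % (n / divisor, unit)

    print(fmt_bytes(1500000, 'M'))    # 1.50 MB
    print(fmt_bytes(1500000, 'Mi'))   # 1.43 MiB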
### Process Pool or Event Loop Primitive?

- if you want to display progress, then you might need an event loop
- test framework might display progress

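
For comparison, here is one way to get `xargs -P`-style parallelism plus a
progress display in plain Python, using a thread pool to supervise
subprocesses. This is just an illustration of the desired behavior, not the
proposed primitive:

    import subprocess
    import sys
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def run_one(argv):
        return argv, subprocess.run(argv).returncode

    def run_pool(jobs, max_jobs=4):
        done = 0
        with ThreadPoolExecutor(max_workers=max_jobs) as pool:
            futures = [pool.submit(run_one, argv) for argv in jobs]
            for future in as_completed(futures):   # progress as each job finishes
                argv, status = future.result()
                done += 1
                print('[%d/%d] %s -> %d' % (done, len(jobs), ' '.join(argv), status),
                      file=sys.stderr)

    run_pool([['sleep', '1'], ['true'], ['false']], max_jobs=2)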
## Matrices - Orthogonal design in these dimensions

- input: lines vs. rows
- output: string (Str, Template) vs. row vs. block execution (also a row)
- execution: serial vs. parallel
- representation: interior vs. exterior !!!
  - Dicts and Lists are interior, but TSV8 is exterior
  - and we have a row-wise format and a column-wise format -- this always bugged me
- exterior: human vs. machine readable
  - TSV8 is both human- and machine-readable
  - the "aligned" #.ssv8 format is too
  - they are one format named TSV8, with different file extensions. This is
    because it doesn't make much sense to implement SSV8 without TSV8 -- the
    latter becomes trivial. So we call the whole thing TSV8.

This means we consider all these conversions:

- Line -> Line
- Line -> Row
- Row -> Line
- Row -> Row

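
A minimal Python sketch of the four conversions, treating a "line" as a string
and a "row" as a dict (the function names and the bytes/path columns are just
for illustration):

    def line_to_line(line):    # e.g. a sed/grep-style rewrite
        return line.upper()

    def line_to_row(line):     # e.g. split-by with named columns
        size, path = line.split(None, 1)
        return {'bytes': int(size), 'path': path}

    def row_to_line(row):      # e.g. render a row with a template
        return '%(bytes)d %(path)s' % row

    def row_to_row(row):       # e.g. mutate/pick on a row
        return dict(row, mebibytes=row['bytes'] / 2**20)

    row = line_to_row('1048576 foo.tar')
    print(row_to_line(row_to_row(row)))   # => 1048576 foo.tar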
## Concrete Decisions - Matrix cut off

The design might seem very general, but we did make some hard choices.

- push vs. pull
  - everything is "push" style, I think
- buffered vs. unbuffered, everything
- List vs. iterators
  - everything is either an iterable pipeline, or a List


[OSH]: $xref
[YSH]: $xref
## String World

**THESE ARE ALL THE SAME ALGORITHM**. They just have different names.

- each-line
- each-row
- split-by (/d+/, cols=:|a b c|)
- chop
- if-match
- must-match
- todo

Should we also have `if-split-by`, for the case where there aren't enough
columns?

They all take:

- string arg ' '
- template arg (^"") - `value.Expr`
- block arg

For the block arg, these flags apply:

    -j 4
    --max-jobs 4

    --max-jobs $(cached-nproc)
    --max-jobs $[_nproc - 1]
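
A Python sketch of that shared algorithm: read lines, derive named captures,
then either render a template or run a block per record. Every name below is
illustrative, not the YSH API:

    import re
    import sys

    def split_records(lines, pattern, cols):
        # split-by: one dict per line, with named columns; missing columns
        # become '' (this is the if-split-by question above)
        for line in lines:
            parts = re.split(pattern, line.strip())
            yield {c: (parts[i] if i < len(parts) else '') for i, c in enumerate(cols)}

    def each_record(records, template=None, block=None):
        for rec in records:
            if template is not None:    # template arg: produce a string
                print(template % rec)
            elif block is not None:     # block arg: run code per record
                block(rec)

    records = split_records(sys.stdin, r'\s+', ['a', 'b', 'c'])
    each_record(records, template='%(b)s %(a)s')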
### Awk Issues

So we have this:

    echo begin
    var d = {}
    cat -- @files | split-by (ifs=IFS) {
      echo $2 $1
      call d->accum($1, $2)
    }
    echo end

But then how do we express conditionals?

    Filter foo {   # does this define a proc? Or a data structure?

      split-by (ifs=IFS)   # is this possible? We register the proc itself?

      config split-by (ifs=IFS)   # register it

      BEGIN {
        var d = {}
      }
      END {
        echo d.sum
      }

      when [$1 ~ /d+/] {
        setvar d.sum += $1
      }

    }
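
The BEGIN / when / END structure is the awk-shaped part of the design. A rough
Python model of how such a filter could be driven (all names here are made up
for illustration):

    import re

    def run_filter(lines, begin, clauses, end):
        state = begin()                    # BEGIN block
        for line in lines:
            fields = line.split()
            for condition, action in clauses:
                if condition(fields):      # when [...] clause
                    action(state, fields)
        return end(state)                  # END block

    total = run_filter(
        ['3 foo', 'x bar', '4 baz'],
        begin=lambda: {'sum': 0},
        clauses=[(lambda f: re.match(r'\d+$', f[0]) is not None,
                  lambda d, f: d.update(sum=d['sum'] + int(f[0])))],
        end=lambda d: d['sum'],
    )
    print(total)  # => 7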
## Table World

### `table` to construct

Actions:

    table cat
    table align / table tabify
    table header (cols)
    table slice (1, -1) or (-1, -2) etc.

Subcommands:

    cols
    types
    attr units

Partial Parsing / Lazy Parsing - TSV8 is designed for this:

    # we only decode the columns that are necessary
    cat myfile.tsv8 | table --by-col (&out, cols = :|bytes path|)
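
A sketch of the lazy-parsing idea for a plain TSV stream in Python (real TSV8
also carries types and attributes, which this ignores; the column names come
from the example above):

    import sys

    def by_col(f, cols):
        header = f.readline().rstrip('\n').split('\t')
        indexes = [header.index(c) for c in cols]   # fail fast on a missing column
        out = {c: [] for c in cols}
        for line in f:
            cells = line.rstrip('\n').split('\t')
            for c, i in zip(cols, indexes):
                out[c].append(cells[i])             # only requested columns are collected
        return out

    table = by_col(sys.stdin, ['bytes', 'path'])
    print(len(table['path']), 'rows')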
## Will writing it in YSH be slow?

- We concentrate on semantics first
- We can rewrite in Python
- Better: users can use **exterior** tools with the same interface
  - in some cases
  - they can write an efficient `sort-tsv8` or `join-tsv8` with novel algorithms
- Most data will be small at first


## Applications

- Shell is shared nothing
  - Scaling to infinity on the biggest clouds


## Extra: Tree World?

This is sort of "expanding the scope" of the project, when we want to reduce scope.

But YSH has both tree-shaped JSON, and table-shaped TSV8, and jq is a nice **bridge** between them.

Streams of Trees (jq):

    empty
    this
    this[]
    =>
    select()
    a & b # more than one
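
The jq-ish idea is that every expression maps one value to a stream of zero or
more values. A tiny Python model of that composition, using generators (the
names are illustrative):

    def empty(value):
        return            # a generator that yields nothing
        yield

    def this(value):
        yield value       # identity, like jq's '.'

    def elements(value):  # like this[]: iterate over an array
        yield from value

    def select(pred):
        def f(value):
            if pred(value):
                yield value
        return f

    def pipe(stream, f):  # feed every value of a stream through the next expression
        for value in stream:
            yield from f(value)

    data = [{'bytes': 123}, {'bytes': 4}]
    out = pipe(pipe(this(data), elements), select(lambda d: d['bytes'] > 10))
    print(list(out))      # => [{'bytes': 123}]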
## Pie in the Sky

Four types of Data Languages:

- flat strings
- JSON8 - tree
- TSV8 - table
- NIL8 - Lisp Tree
- HTML/XML - doc tree -- attributed text (similar to the Emacs data model)
  - 8ml

Four types of query languages:

- regex
- jq / jshape
- tsv8


## Appendix

### Notes on Naming

Considering columns and then rows:

- SQL is "select ... where"
- dplyr is "select ... filter"
- YSH is "pick ... where"
  - select is a legacy shell keyword, and pick is shorter
  - or it could be elect in OSH, elect/select in YSH
  - OSH wouldn't support mutate [average = bytes/total] anyway

dplyr:

- summarise vs. summarize vs. summary
|