doc/stream-table-process.md

---
default_highlighter: oils-sh
---

YSH Design: Streams, Tables and Processes (and Trees?)
======================================================

(July 2024)

This is a long, "unified/orthogonal" design for:

- Streams: [awk]($xref) delimited lines, regexes
- Tables: like data frames with R's dplyr or Pandas, but with the "exterior"
  TSV8 format
- Processes: `xargs -P` in parallel

There's also a relation to:

- Trees: `jq`, which will be covered elsewhere.

It's a layered design.  That means we need some underlying mechanisms:

- `eval` and positional args `$1 $2 $3`
- `ctx` builtin
- Data languages: TSV8
- Process pool / event loop primitive

It will link to:

- Oils blog posts (old)
- Zulip threads (recent)
- Other related projects (many of them)

<div id="toc">
</div>

## Background / References

- Shell, Awk, and Make Should be Combined (2016)
  - this is the Awk part!

- What is a Data Frame? (2018)

- Sketches of YSH Features (June 2023) - can we express things in YSH?
  - Zulip: Oils Layering / Self-hosting

- Language Compositionality Test: J8 Lines
  - This whole thing is a compositionality test

- read --split
  - more feedback from Aidan and Samuel

- jq in jq thread

Old wiki pages:

- <https://github.com/oilshell/oil/wiki/Structured-Data-in-Oil>
- uxy - closest I think - <https://github.com/sustrik/uxy>
  - relies on to-json and jq for querying
- miller - I don't like their language - <https://github.com/johnkerl/miller>
- jc - <https://github.com/kellyjonbrazil/jc>
- nushell
- extremely old thing -

We're doing all of these.

## Concrete Use Cases

- benchmarks/* with dplyr
- wedge report
- oilshell.org analytics job uses dplyr and ggplot2

## Intro

### How much code is it?

- I think this is ~1000 lines of Python and ~1000 lines of YSH (not including tests)
- It should be small

### Thanks

- Samuel - two big hints
  - do it in YSH
  - `table` with the `ctx` builtin
- Aidan
  - `read --split` feedback


## Tools

- awk
  - streams of records - row-wise
- R
  - column-wise operations on tables
- `find . -printf '%s %P\n'` - size and path
  - generate text that looks like a table
- xargs
  - operate on tabular text -- it has a bespoke splitting algorithm
  - Opinionated guide to xargs
  - table in, table out
- jq - "awk for JSON"


## Concepts

- TSV8
  - aligned format SSV8
  - columns have types, and attributes
- Lines
  - raw lines like shell
  - J8 lines (which can represent any filename, any unicode or byte string)
- Tables - can be thought of as:
  - Streams of Rows - shape `[{bytes: 123, path: "foo"}, {}, ...]`
    - this is actually <https://jsonlines.org>, and it fits well with `jq`
  - Columns - shape `{bytes: [], path: []}`

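The two table shapes listed under Concepts hold the same data, so converting between them is mechanical. Here is a minimal sketch in Python, where plain dicts and lists stand in for YSH values. The function names `rows_to_columns` and `columns_to_rows` are made up for illustration, and the sketch assumes every row has the same keys.

```python
# Illustration only: plain Python dicts and lists stand in for YSH values.

def rows_to_columns(rows):
    """Turn [{col: value, ...}, ...] into {col: [values...]}."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def columns_to_rows(columns):
    """Turn {col: [values...]} back into a list of row dicts."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


rows = [
    {"bytes": 123, "path": "foo"},
    {"bytes": 456, "path": "bar"},
]
columns = rows_to_columns(rows)
print(columns)
print(columns_to_rows(columns) == rows)
```

The row shape suits streaming (one record at a time), while the column shape suits whole-column operations like those in R or Pandas.
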
## Underlying Mechanisms in Oils / Primitives

- blocks `value.Block` - `^()` and `{ }`
- expressions `value.Expr` - `^[]` and 'compute [] where []'

- eval (b, vars={}, positional=[])

- Buffered for loop
  - YSH is now roughly as fast as Awk!
  - `for x in <>`

- "magic awk loop"

      with chop {
        for <README.md *.py> {
          echo _line_num _line _filename $1 $2
        }
      }

- positional args $1 $2 $3
  - currently mean "argv stack"
  - or "the captures"
  - this can probably be generalized

- `ctx` builtin
  - `value.Place`

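To make the "magic awk loop" concrete, here is a rough Python equivalent of what it would provide on each iteration: a line number, the line, the filename, and the whitespace-split fields that `$1 $2` refer to. This is a sketch under assumptions, not the YSH implementation. The name `awk_loop` is hypothetical, and this doc doesn't say whether `_line_num` resets for each file, so the sketch counts across all files.

```python
import os
import tempfile


def awk_loop(filenames):
    """Yield (line_num, line, filename, fields) for each line of each file.

    line_num counts across all files here; whether YSH's _line_num would
    reset per file is not specified in this doc.
    """
    line_num = 0
    for filename in filenames:
        with open(filename) as f:
            for raw in f:
                line_num += 1
                line = raw.rstrip("\n")
                yield line_num, line, filename, line.split()


# Demo with a throwaway file (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "README.md")
    with open(path, "w") as f:
        f.write("hello world\nfoo bar baz\n")
    records = list(awk_loop([path]))

for line_num, line, filename, fields in records:
    first = fields[0] if fields else ""
    second = fields[1] if len(fields) > 1 else ""
    print(line_num, line, os.path.basename(filename), first, second)
```
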
TODO:

- split() like Python, not like shell IFS algorithm

- string formatting ${bytes %.2f}
  - ${bytes %.2f M} Megabytes
  - ${bytes %.2f Mi} Mebibytes

- ${timestamp +'%Y-%m-%d'} and strftime

- this is for

- floating point %e %f %g and printf and strftime

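For reference, this is how the behaviors named in the TODO list look in Python today. It shows only the target semantics (Python-style `split()`, decimal vs. binary size units, `strftime`, and printf-style floats), not a proposal for YSH syntax. The helper `format_size` is hypothetical.

```python
import re
from datetime import datetime, timezone

line = "  a  b   c "

# Python-style split: runs of whitespace are one separator, and leading and
# trailing whitespace produce no empty fields.
print(line.split())          # ['a', 'b', 'c']

# Splitting on an explicit separator keeps the empty fields.
print("a::b".split(":"))     # ['a', '', 'b']

# Regex-delimited split
print(re.split(r"\d+", "x1y22z"))  # ['x', 'y', 'z']


def format_size(num_bytes, unit):
    """Format a byte count in decimal (M) or binary (Mi) units."""
    divisors = {"M": 1000 ** 2, "Mi": 1024 ** 2}
    return "%.2f %s" % (num_bytes / divisors[unit], unit)


print(format_size(5_000_000, "M"))    # 5.00 M
print(format_size(5_000_000, "Mi"))   # 4.77 Mi

# strftime-style formatting of a Unix timestamp, in UTC for determinism.
ts = datetime.fromtimestamp(0, tz=timezone.utc)
print(ts.strftime("%Y-%m-%d"))        # 1970-01-01

# printf-style floating point conversions
print("%e %f %g" % (1234.5, 1234.5, 1234.5))
```
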
### Process Pool or Event Loop Primitive?

- if you want to display progress, then you might need an event loop
- test framework might display progress

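Progress display is what pushes a plain process pool toward an event loop: the caller has to react as each job finishes, not only when all of them are done. Here is a small Python sketch of that idea, using threads from `concurrent.futures` as a stand-in for child processes. The names `run_task` and `run_all` are hypothetical, and nothing here reflects how Oils would implement it.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_task(n):
    """Stand-in for running one external process; returns its 'result'."""
    return n * n


def run_all(tasks, max_jobs):
    """Run tasks in a pool, reporting progress as each one finishes."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        futures = {pool.submit(run_task, t): t for t in tasks}
        # as_completed() yields futures in completion order, not input order.
        for done, future in enumerate(as_completed(futures), start=1):
            task = futures[future]
            results[task] = future.result()
            print(f"progress: {done}/{len(futures)}")
    return results


# Use all but one CPU, but never fewer than one worker.
max_jobs = max(1, (os.cpu_count() or 2) - 1)
results = run_all(range(5), max_jobs)
print(sorted(results.items()))
```
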
## Matrices - Orthogonal design in these dimensions

- input: lines vs. rows
- output: string (Str, Template) vs. row vs. block execution (also a row)
- execution: serial vs. parallel
- representation: interior vs. exterior !!!
  - Dicts and Lists are interior, but TSV8 is exterior
  - and we have row-wise format, and column-wise format -- this always bugged me
- exterior: human vs. machine readable
  - TSV8 is both human and machine-readable
  - "aligned" #.ssv8 format is also
  - they are one format named TSV8, with different file extensions.  This is
    because it doesn't make too much sense to implement SSV8 without TSV8.  The
    latter becomes trivial.  So we call the whole thing TSV8.

This means we consider all these conversions:

- Line -> Line
- Line -> Row
- Row -> Line
- Row -> Row

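The four conversions can be sketched as four small Python functions. This is an illustration under assumptions: a "row" is a dict, a "line" is a string, lines are split on whitespace, and Python's `str.format` stands in for a YSH template. All function names are hypothetical.

```python
def line_to_row(line, cols):
    """Line -> Row: split on whitespace and name the fields."""
    return dict(zip(cols, line.split()))


def row_to_line(row, template):
    """Row -> Line: fill a template string from the row's columns."""
    return template.format(**row)


def line_to_line(line, cols, template):
    """Line -> Line: composition of the two conversions above."""
    return row_to_line(line_to_row(line, cols), template)


def row_to_row(row, **computed):
    """Row -> Row: add columns computed from the existing ones."""
    new_row = dict(row)
    for name, func in computed.items():
        new_row[name] = func(row)
    return new_row


row = line_to_row("123 foo.txt", ["bytes", "path"])
print(row)
print(row_to_line(row, "{path} is {bytes} bytes"))
print(line_to_line("123 foo.txt", ["bytes", "path"], "{path}\t{bytes}"))
print(row_to_row(row, kilobytes=lambda r: int(r["bytes"]) / 1000))
```

Note that `Line -> Line` falls out as a composition, which fits the idea that these are orthogonal dimensions of one design.
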
## Concrete Decisions - Matrix cut off

The design might seem very general, but we did make some hard choices.

- push vs. pull
  - everything is "push" style I think
- buffered vs. unbuffered, everything

- List vs iterators
  - everything is either iterable pipelines, or a List


[OSH]: $xref
[YSH]: $xref


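To clarify the push vs. pull distinction, here is a generic Python illustration. It is not taken from the Oils codebase, and the function names are hypothetical. In pull style the consumer drives by iterating; in push style the producer drives by calling a handler for each item, which is closer to a shell block that runs once per line.

```python
def pull_upper(lines):
    """Pull style: the consumer drives by iterating a generator."""
    for line in lines:
        yield line.upper()


def push_lines(lines, on_line):
    """Push style: the producer drives by calling a handler per item."""
    for line in lines:
        on_line(line)


data = ["a", "b", "c"]

pulled = list(pull_upper(data))          # consumer asks for each item

pushed = []
push_lines(data, lambda line: pushed.append(line.upper()))  # producer calls us

print(pulled)
print(pushed)
```
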
## String World

THESE ARE ALL THE SAME ALGORITHM.  They just have different names.

- each-line
- each-row
- split-by (/d+/, cols=:|a b c|)
- chop
- if-match
- must-match
- todo

Should we also have: if-split-by ?  In case there aren't enough columns?

They all take:

- string arg ' '
- template arg (^"") - `value.Expr`
- block arg

For the block arg, this applies:

    -j 4
    --max-jobs 4

    --max-jobs $(cached-nproc)
    --max-jobs $[_nproc - 1]

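One way to read "the same algorithm" is a single loop whose per-line action is either a template or a block, with parallelism applying only to the block. The Python sketch below illustrates that reading. The name `each_line`, the `$line` placeholder, and the use of threads are all assumptions, and the plain-string form is left out because its meaning isn't spelled out here.

```python
from concurrent.futures import ThreadPoolExecutor
from string import Template


def each_line(lines, action, max_jobs=1):
    """Apply one action to every line and return results in input order.

    action may be:
      - a string.Template: filled in with $line (serial only)
      - a callable: the "block"; only this form can run in parallel
    """
    if isinstance(action, Template):
        return [action.substitute(line=line) for line in lines]
    if max_jobs > 1:
        with ThreadPoolExecutor(max_workers=max_jobs) as pool:
            return list(pool.map(action, lines))  # map() preserves order
    return [action(line) for line in lines]


lines = ["foo", "bar"]
print(each_line(lines, Template("got $line")))
print(each_line(lines, str.upper, max_jobs=4))
```
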
### Awk Issues

So we have this

    echo begin
    var d = {}
    cat -- @files | split-by (ifs=IFS) {
      echo $2 $1
      call d->accum($1, $2)
    }
    echo end

But then how do we have conditionals:

    Filter foo {  # does this define a proc?  Or a data structure

      split-by (ifs=IFS)  # is this possible?  We register the proc itself?

      config split-by (ifs=IFS)  # register it

      BEGIN {
        var d = {}
      }
      END {
        echo d.sum
      }

      when [$1 ~ /d+/] {
        setvar d.sum += $1
      }

    }

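For comparison, the BEGIN / `when` / END structure of the hypothetical `Filter` maps onto an ordinary loop. The Python sketch below shows only that control flow and says nothing about how registration would work in YSH. The name `run_filter` is made up, and the sketch requires the whole first field to be digits so that the addition is well defined, which may be stricter than `~`.

```python
import re


def run_filter(lines):
    """BEGIN / when / END structure applied to whitespace-split lines."""
    d = {"sum": 0}                    # BEGIN: initialize state
    digits = re.compile(r"\d+")
    for line in lines:
        fields = line.split()
        # when: first field is all digits -> add it to the running sum
        if fields and digits.fullmatch(fields[0]):
            d["sum"] += int(fields[0])
    return d["sum"]                   # END: report the result


total = run_filter(["10 apples", "x pears", "", "32 plums"])
print(total)
```
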
## Table World

### `table` to construct

Actions:

    table cat
    table align / table tabify
    table header (cols)
    table slice (1, -1)   or (-1, -2) etc.

Subcommands

    cols
    types
    attr units

Partial Parsing / Lazy Parsing - TSV8 is designed for this

    # we only decode the columns that are necessary
    cat myfile.tsv8 | table --by-col (&out, cols = :|bytes path|)

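The point of partial parsing is that cells in unneeded columns are never decoded. Here is a minimal Python sketch of that idea over plain tab-separated lines with a header row. It assumes no quoting or type rows, so it is not a TSV8 parser; `decode_cell` and `read_columns` are hypothetical names.

```python
def decode_cell(cell):
    """Stand-in for real cell decoding: digit strings become int."""
    return int(cell) if cell.isdigit() else cell


def read_columns(lines, wanted):
    """Return {col: [values]} for the wanted columns of tab-separated lines.

    Only cells in the wanted columns are decoded; the rest are skipped.
    """
    lines = iter(lines)
    header = next(lines).rstrip("\n").split("\t")
    indices = {name: header.index(name) for name in wanted}
    out = {name: [] for name in wanted}
    for line in lines:
        cells = line.rstrip("\n").split("\t")
        for name, i in indices.items():
            out[name].append(decode_cell(cells[i]))
    return out


tsv = [
    "bytes\tpath\towner\n",
    "123\tfoo.txt\talice\n",
    "456\tbar.txt\tbob\n",
]
out = read_columns(tsv, ["bytes", "path"])
print(out)
```

The result is in the column-wise shape, matching what `table --by-col` suggests.
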
## Will writing it in YSH be slow?

- We concentrate on semantics first
- We can rewrite in Python
- Better: users can use exterior tools with the same interface
  - in some cases
  - they can write an efficient `sort-tsv8` or `join-tsv8` with novel algorithms
- Most data will be small at first


## Applications

- Shell is shared nothing
- Scaling to infinity on the biggest clouds


## Extra: Tree World?

This is sort of "expanding the scope" of the project, when we want to reduce scope.

But YSH has both tree-shaped JSON, and table-shaped TSV8, and jq is a nice bridge between them.

Streams of Trees (jq)

    empty
    this
    this[]
    =>
    select()
    a & b  # more than one


## Pie in the Sky

Four types of Data Languages:

- flat strings
- JSON8 - tree
- TSV8 - table
- NIL8 - Lisp Tree
  - HTML/XML - doc tree -- attributed text (similar to Emacs data model)
  - 8ml

Four types of query languages:

- regex
- jq / jshape
- tsv8


## Appendix

### Notes on Naming

Considering columns and then rows:

- SQL is "select ... where"
- dplyr is "select ... filter"
- YSH is "pick ... where"
  - select is a legacy shell keyword, and pick is shorter
  - or it could be elect in OSH, elect/select in YSH
  - OSH wouldn't support mutate [average = bytes/total] anyway

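To show what the three verbs mean, independent of what they end up being called, here is a small Python sketch over a list of row dicts. The functions `pick`, `where`, and `mutate` borrow the names under discussion but are hypothetical illustrations, not YSH builtins.

```python
def pick(rows, *cols):
    """Keep only the named columns (SQL/dplyr 'select')."""
    return [{c: row[c] for c in cols} for row in rows]


def where(rows, pred):
    """Keep only rows for which pred(row) is true (dplyr 'filter')."""
    return [row for row in rows if pred(row)]


def mutate(rows, **computed):
    """Add columns computed from each row, like mutate [average = bytes/total]."""
    return [{**row, **{k: f(row) for k, f in computed.items()}} for row in rows]


rows = [
    {"bytes": 100, "total": 400, "path": "foo"},
    {"bytes": 300, "total": 400, "path": "bar"},
]
result = pick(
    where(mutate(rows, average=lambda r: r["bytes"] / r["total"]),
          lambda r: r["average"] > 0.5),
    "path", "average",
)
print(result)
```
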
dplyr:

- summarise vs. summarize vs. summary