SteamPipes
Table of Contents generated with DocToc
Fast, simple data pipelines built from first principles. Basically, datomic
transducers.
SteamPipes is the successor to PipeStreams and
PipeDreams. PipeStreams was originally built on top of
NodeJS streams and through; from version X███████████████ on, I
switched to pull-streams.
Motivation
- Performance, X███████████████ insert benchmarks
- Simplicity of implementation, no recursion
- Observability, the data pipeline is an array of arrays that one may inspect
How to Construct Sources, Transforms, and Sinks
Sources
Valid SteamPipes sources include all JS values for which either
CS │ JS
──────────────────────────────┼─────────────────────────────────────
for d from source │ for ( d of source ) {
... │ ... }
──────────────────────────────┴─────────────────────────────────────
or
CS │ JS
──────────────────────────────┼─────────────────────────────────────
for await d from source │ for await ( d of source ) {
... │ ... }
──────────────────────────────┴─────────────────────────────────────
is valid.
In addition, synchronous and asynchronous functions that, when called without arguments, return a value for
which one of the iteration modes (sync or async) works correctly are allowed. Such a function will be called
as late as possible, that is, not at pipline definition time, but only when a pipeline with a source and a
drain has been constructed and is started with pull()
.
Transforms
-
Functions that take 2 arguments d
and send
(includes send.end()
);
-
must/should/may have a list (Array
) that acts as so-called 'local sink' (this is where data send with
send d
is stored before being passed to the next transform);
-
property to indicate whether transform is asynchronous.
-
transforms have a property sink
, which must be a list (at least have a shift()
method);
-
TF may add (ordinarily push()
) values to the sink at any time (but processing only guaranteed when this
happens, in TFs marked synchronous, before the main body of the function completed, and in TFs marked
asynchronous, before done()
has been called).
-
conceivable to use same TF, same sink
in two or more pipelines simultaneously; conceivable to accept
values from other sources than the TF which is directly upstream; hence possible to construct wyes (i.e.
data sources that appear in mid-stream).
-
Calling $ whatever..., ( d, send ) -> ...
is always equivalent to calling modify whatever..., $ ( d, send ) -> ...
; calling modify t
without any further arguments is equivalent to t
(the transform
itself).
Modifiers and $before_first()
, $after_last()
{ first, last, before, after, between, }
$before_first()
$after_last()
$async_before_first()
$async_after_last()
Sinks
Arbitrary objects can act as sinks provided they have a sink
property; this property must be either set to
true
for a generic sink or else be an object that has push()
method (such as a list). A sink may,
furthermore, also have an on_end()
method which, if set, must be a function that takes zero or one
argument.
If the sink
property is a list, then it will receive all data items that arrive through the pipeline (the
resultant data of the pipeline); if it is true
, then those data items will be discarded.
The on_end()
method will be called when streaming has terminated (since the source was exhausted or a
transform called send.end()
); if it takes one argument, then that will be the list of resultant data. If
both the sink
property has been set to a list and on_end()
takes an argument, then that value will be
the sink
property (you probably only want the one or the other in most cases).
{ sink: true, }
{ sink: true, on_end: ( -> do_something() ), }
{ sink: true, on_end: ( ( result ) -> do_something result ), }
{ sink: x, on_end: ( ( result ) -> do_something result ### NB result is x ### ), }
The only SteamPipes method that produces a sink is $drain()
(it should really be called sink()
but for
compatibility with PipeStreams the name has been kept as a holdover from pull-stream
). $drain()
takes
zero, one or two arguments:
$drain() is equiv. to { sink: true, }
$drain -> ... is equiv. to { sink: true, on_end: ( -> ... ), }
$drain ( x ) -> ... is equiv. to { sink: true, on_end: ( ( x ) -> ... ), }
$drain { sink: x, }, -> ... is equiv. to { sink: x, on_end: ( -> ... ), }
$drain { sink: x, }, ( x ) -> ... is equiv. to { sink: x, on_end: ( ( x ) -> ... ), }
Asynchronous Sources and Transforms
Asynchronous transforms can be constructed using the 'asynchronous remit' method, $async()
. The method
passed into $async()
must accept three arguments, namely d
(the data item coming down the pipeline),
send
(the method to send data down the pipeline), and, in addition to synchronous transforms, done
,
which is a callback function used to signal completion (it is analogous to the resulve
argument of
promises, new Promise ( resulve, reject ) ->
and indeed implemented as such). An example:
X███████████████
X███████████████
X███████████████
X███████████████
Ducts
Duct Configurations
I. Special Arities
There are two special duct arities, empty and single. An empty pipeline producers a duct marked with
is_empty: true
; it is always a no-op, hence discardable. The duct does not have a type
property.
A pipeline with a single element produces a duct with the property is_single: true
; it is always
equivalent to its sole transform, and its type
property is that of its sole element.
SHAPE OF PIPELINE SHAPE OF DUCT REMARKS
⋆ [] ⇨ { is_empty: true, } # equiv. to a no-op
⋆ [ x, ] ⇨ { is_single: true, } # equiv. to its single member
II. Open Ducts
Open ducts may always take the place of a non-composite element of the same type; this is what makes
pipelines composable. As one can always replace a sequence like ( x += a ); ( x += b );
by a
non-composed equivalent ( x += a + b )
, so can one replace a non-composite through (i.e. a single
function that transforms values) with a composite one (i.e. a list of throughs), and so on:
SHAPE OF PIPELINE SHAPE OF DUCT REMARKS
⋆ [ source, transforms..., ] ⇨ { type: 'source', } # equiv. to a non-composite source
⋆ [ transforms..., ] ⇨ { type: 'through', } # equiv. to a non-composite transform
⋆ [ transforms..., sink, ] ⇨ { type: 'sink', } # equiv. to a non-composite sink
III. Closed Ducts
Closed ducts are pipelines that have both a source and a sink (plus any number of throughs). They are like a
closed electric circuit and will start running when being passed to the pull()
method (but note that
actual data flow may be indefinitely postponed in case the source does not start delivering immediately).
SHAPE OF PIPELINE SHAPE OF DUCT REMARKS
⋆ [ source, transforms..., sink, ] ⇨ { type: 'circuit', } # ready to run
Behavior for Ending Streams
Two ways to end a stream from inside a transform: either
- call
send.end()
, or send SP.symbols.end
.
The two methods are 100% identical. In SteamPipes, 'ending a stream' means 'to break from the loop that
iterates over the data source'.
Note that when the pull
method receives an end
signal, it will not request any further data from the
source, but it will allow all data that is already in the pipeline to reach the sink just as in regular
operation, and it will also supply all transforms that have requested a last
value with such a terminal
value.
Any of these actions may cause any of the transforms to issue an unlimited number of further values, so
that, in the general case, end
ing a stream is not guaranteed to actually stop processing at any point in
time; this is only true for properly coöperating transforms.
Aborting Streams
There's no API to abort a stream—i.e. make the stream and all transforms stop processing immediately—but one
can always wrap the pull pipeline...
invocation into a try
/catch
clause and throw a custom symbolic
value:
pipeline = []
...
pipeline.push $ ( d, send ) ->
...
throw 'abort'
...
...
try
pull pipeline...
catch error
throw error if error isnt 'abort'
warn "the stream was aborted"
...
Updates
- If source has a method
start()
, it will be called when SP.pull pipeline...
is called; this enables
push sources to delay issuing data until the pipeline is ready to consume it
To Do
Future: JS Pipeline Operator
see Breaking Chains with Pipelines in Modern
JavaScript
const result3 = numbers
|> filter(#, v => v % 2 === 0)
|> map(#, v => v + 1)
|> slice(#, 0, 3)
|> Array.from;
- Lazy evaluation, no backpressure (?), built into the language.
- Already usable with Babel.
- Article discusses a number of alternatives with merits and demerits, must read.
To Do: Railway-Oriented Programming
Transform categorization: functions may
-
acc. to result arity
- give back exactly one value for each input that we do care about (->
$map()
) - give back exactly one value for each input that we do not care about (->
$watch()
) - give back any number of values (->
$
/remit()
) - never give back any value (->
$watch()
)
-
acc. to iterability
-
acc. to synchronicity
- be synchronous
- be asynchronous
-
acc. to happiness
- give back sad value on failure
- always give back happy failure, using
throw
for sad results - return a sentinel value / error code (like JS
[].indexOf()
)
-
pipeline definition may take on this form:
¶ = ( pipeline = [] ).push.bind pipeline
¶ tee other_pipeline, ( d ) -> 110 <= d <= 119 # optional filter, all `d`s stay in this pipeline, some also in other
¶ switch other_pipeline, ( d ) -> 110 <= d <= 119 # obligatory filter, each `d` in only one pipeline
¶ watch ( d ) -> ... # return value thrown away (does that respect async functions?)
¶ guard -1, $indexOf 'helo' # guard with filter value, saddens value when `true <- CND.equals(...)`
¶ guard ( ( d ) -> ... ), indexOf 'helo' # guard with filter function, saddens value when `true <- filter()`
¶ trycatch map ( d ) -> throw new Error "whaat" if d % 2 is 0; return d * 3 + 1
¶ trycatch $ ( d, send ) -> throw new Error "whaat" if d % 2 is 0; send d; send d * 3 + 1
¶ if_sad $show_warning()
¶ if_sad $ignore()
¶ drain()
pull ¶
-
pipe processing never calls any transform with sad value (except for those explicitly configured to accept
those)
-
but all sad values are still passed on, cause errors at pipeline end (near drain) when not being filtered
out
-
must not swallow exceptions implicitly as that would promote silent failures
-
benefit: simplify logic a great deal
-
benefit: may record errors and try to move on, then complain with summary of everything that went wrong