Server
The benefit of offloading work to a worker pool varies significantly with the workload. In some cases, the worker pool may actually hurt performance, so measure before adopting one.
In this example, we create a simple Fastify HTTP server that responds with a simple JSON object after introducing an artificial delay of 100 milliseconds.
There are four variants to the server:
- async-sleep-unpooled - Uses an async delay using a Promise-wrapped setTimeout. The event loop is allowed to keep turning during the delay, allowing the server to respond to additional requests. The delay occurs within the main thread.
- async-sleep-pooled - Moves the async delay to the worker threads for processing.
- sync-sleep-unpooled - Uses a sync delay using Atomics.wait to block the main thread for 100 milliseconds before responding. Because the main thread is blocked, the event loop is not turning while we are waiting. This simulates a particularly expensive event loop blocking scenario.
- sync-sleep-pooled - Moves the sync delay into the worker threads for processing.
For both async-sleep-pooled and sync-sleep-pooled, the main thread uses a Promise to wait on the worker to complete its task before responding to the request.
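The two kinds of delay described above can be sketched with Node.js built-ins only. This is a minimal illustration, not the code from the example files; the helper names sleepAsync and sleepSync are hypothetical.

```javascript
'use strict';

// Async delay: a Promise-wrapped setTimeout. The event loop keeps turning
// while the timer is pending, so other requests can be served meanwhile.
function sleepAsync(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Sync delay: Atomics.wait blocks the calling thread until the timeout
// elapses, because nothing ever notifies the waited-on location.
function sleepSync(ms) {
  const cell = new Int32Array(new SharedArrayBuffer(4));
  return Atomics.wait(cell, 0, 0, ms); // returns 'timed-out'
}
```

A request handler would either `await sleepAsync(100)` or call `sleepSync(100)` before returning its response; only the second keeps the thread from doing anything else in the meantime.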
Setting up
The code files for this example can be found on GitHub.
To get started, run npm i to install dependencies. We use autocannon for benchmarking the four variants:
$ npm i -g autocannon
Let's start with the two async sleep variants. First run:
$ node async-sleep-unpooled
And in a separate terminal, run autocannon:
$ autocannon localhost:3000
You should see results similar to:
Running 10s test @ http://localhost:3000
10 connections
┌─────────┬────────┬────────┬────────┬────────┬──────────┬─────────┬───────────┐
│ Stat │ 2.5% │ 50% │ 97.5% │ 99% │ Avg │ Stdev │ Max │
├─────────┼────────┼────────┼────────┼────────┼──────────┼─────────┼───────────┤
│ Latency │ 100 ms │ 101 ms │ 104 ms │ 125 ms │ 101.2 ms │ 3.09 ms │ 130.66 ms │
└─────────┴────────┴────────┴────────┴────────┴──────────┴─────────┴───────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬───────┬───────┬─────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼─────────┼─────────┼─────────┼─────────┼───────┼───────┼─────────┤
│ Req/Sec │ 90 │ 90 │ 99 │ 100 │ 97.8 │ 3.32 │ 90 │
├───────────┼─────────┼─────────┼─────────┼─────────┼───────┼───────┼─────────┤
│ Bytes/Sec │ 14.8 kB │ 14.8 kB │ 16.2 kB │ 16.4 kB │ 16 kB │ 545 B │ 14.8 kB │
└───────────┴─────────┴─────────┴─────────┴─────────┴───────┴───────┴─────────┘
Note that this server is not particularly fast due to the artificially imposed 100 millisecond delay.
It's possible to achieve significantly higher results by tuning the way autocannon sends requests. For instance, running autocannon with the -c 100 -p 2 options increases the throughput of the async-sleep-unpooled example significantly.
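A rough way to see why the connection settings matter: if each connection waits for a response before sending its next request, throughput cannot exceed the number of in-flight requests divided by the latency. The sketch below is an idealised estimate under that assumption, not a description of how autocannon computes its figures; the function name is hypothetical.

```javascript
'use strict';

// Idealised closed-loop bound: at most connections * pipelining requests
// are in flight at once, and each takes latencyMs to complete.
function maxRequestsPerSecond(connections, latencyMs, pipelining = 1) {
  return (connections * pipelining * 1000) / latencyMs;
}

// 10 connections at the measured 101.2 ms average latency
console.log(maxRequestsPerSecond(10, 101.2).toFixed(1)); // 98.8

// -c 100 -p 2 at an assumed 100 ms latency
console.log(maxRequestsPerSecond(100, 100, 2)); // 2000
```

The first figure is close to the 97.8 Req/Sec average reported above. The second is only a ceiling; real results depend on how the server copes with the extra load.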
Let's see how the Piscina version does in comparison:
Run:
$ node async-sleep-pooled
And autocannon again:
$ autocannon localhost:3000
Your result should be similar to:
Running 10s test @ http://localhost:3000
10 connections
┌─────────┬────────┬────────┬────────┬────────┬───────────┬──────────┬───────────┐
│ Stat │ 2.5% │ 50% │ 97.5% │ 99% │ Avg │ Stdev │ Max │
├─────────┼────────┼────────┼────────┼────────┼───────────┼──────────┼───────────┤
│ Latency │ 126 ms │ 169 ms │ 197 ms │ 339 ms │ 170.13 ms │ 31.41 ms │ 425.33 ms │
└─────────┴────────┴────────┴────────┴────────┴───────────┴──────────┴───────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬───────┬─────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼───────┼─────────┤
│ Req/Sec │ 43 │ 43 │ 60 │ 60 │ 58.2 │ 5.08 │ 43 │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼───────┼─────────┤
│ Bytes/Sec │ 7.05 kB │ 7.05 kB │ 9.85 kB │ 9.85 kB │ 9.55 kB │ 833 B │ 7.05 kB │
└───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴───────┴─────────┘
Notice that the pooled version is slower! This is because in both async examples, the event loop on the main thread is still active, allowing it to service more requests. The worker pool here, however, adds overhead from marshalling data back and forth between threads and allocating additional Promises. In this scenario, the worker pool does not provide a benefit.
Let's look at the sync cases and see what happens with those.
Run:
$ node sync-sleep-unpooled
And run autocannon again:
$ autocannon localhost:3000
The results should be fairly awful in comparison to the first two cases:
Running 10s test @ http://localhost:3000
10 connections
┌─────────┬────────┬─────────┬─────────┬─────────┬──────────┬───────────┬────────────┐
│ Stat │ 2.5% │ 50% │ 97.5% │ 99% │ Avg │ Stdev │ Max │
├─────────┼────────┼─────────┼─────────┼─────────┼──────────┼───────────┼────────────┤
│ Latency │ 338 ms │ 1007 ms │ 1811 ms │ 1919 ms │ 962.2 ms │ 255.54 ms │ 1919.77 ms │
└─────────┴────────┴─────────┴─────────┴─────────┴──────────┴───────────┴────────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬────────┬─────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼────────┼─────────┤
│ Req/Sec │ 9 │ 9 │ 10 │ 10 │ 9.81 │ 0.4 │ 9 │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼────────┼─────────┤
│ Bytes/Sec │ 1.48 kB │ 1.48 kB │ 1.64 kB │ 1.64 kB │ 1.61 kB │ 65.6 B │ 1.48 kB │
└───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴────────┴─────────┘
The reason for the performance drop should be apparent. Synchronously sleeping blocks the event loop from turning while a request is being processed, which means the server is unable to do anything else while it waits.
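The effect is easy to reproduce without a server. In the sketch below, a timer scheduled for 0 ms cannot fire until the synchronous sleep returns, just as a queued request cannot be handled while another one is blocking. The helper names are hypothetical and the code uses only Node.js built-ins.

```javascript
'use strict';

function sleepSync(ms) {
  Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, ms);
}

// Resolves with how long a 0 ms timer actually took to fire when the
// thread is blocked synchronously right after the timer is scheduled.
function timerDelayWhileBlocked(blockMs) {
  return new Promise((resolve) => {
    const scheduledAt = Date.now();
    setTimeout(() => resolve(Date.now() - scheduledAt), 0);
    sleepSync(blockMs); // the event loop cannot turn until this returns
  });
}

timerDelayWhileBlocked(100).then((delay) => {
  console.log(`0 ms timer fired after about ${delay} ms`);
});
```

With 10 connections all hitting a handler that blocks for 100 ms, requests are served one after another, which is consistent with the roughly 10 Req/Sec and 1 second median latency shown above.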
To see how the pooled version fares, run:
$ node sync-sleep-pooled
And run autocannon again:
$ autocannon localhost:3000
The results should be nearly identical to the async-sleep-pooled version!
Running 10s test @ http://localhost:3000
10 connections
┌─────────┬────────┬────────┬────────┬────────┬───────────┬──────────┬──────────┐
│ Stat │ 2.5% │ 50% │ 97.5% │ 99% │ Avg │ Stdev │ Max │
├─────────┼────────┼────────┼────────┼────────┼───────────┼──────────┼──────────┤
│ Latency │ 124 ms │ 174 ms │ 200 ms │ 335 ms │ 169.99 ms │ 36.22 ms │ 422.2 ms │
└─────────┴────────┴────────┴────────┴────────┴───────────┴──────────┴──────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬───────┬─────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼───────┼─────────┤
│ Req/Sec │ 42 │ 42 │ 60 │ 60 │ 58.2 │ 5.4 │ 42 │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼───────┼─────────┤
│ Bytes/Sec │ 6.89 kB │ 6.89 kB │ 9.85 kB │ 9.85 kB │ 9.55 kB │ 886 B │ 6.89 kB │
└───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴───────┴─────────┘
sync-sleep-pooled and async-sleep-pooled yield nearly identical results because, in both cases, the main thread is doing identical work: receiving a request that is dispatched to a worker, waiting about 100 milliseconds, then returning the response. The main thread here does not care whether the workers are sleeping synchronously or asynchronously.
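To make this concrete, here is a hand-rolled stand-in built on node:worker_threads. It is not Piscina and not the example's code; runTask and the message shape are assumptions made for illustration. From the main thread's side, both kinds of task are just a Promise that settles after about 100 milliseconds.

```javascript
'use strict';
const { Worker } = require('node:worker_threads');

// Worker source, inlined so the sketch is a single file.
const workerSource = `
  const { parentPort } = require('node:worker_threads');
  parentPort.on('message', ({ id, ms, sync }) => {
    if (sync) {
      // Block this worker thread; the main thread is unaffected.
      Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, ms);
      parentPort.postMessage(id);
    } else {
      setTimeout(() => parentPort.postMessage(id), ms);
    }
  });
`;

// Dispatch one task; the Promise settles when the worker replies.
function runTask(worker, task) {
  return new Promise((resolve) => {
    const onMessage = (id) => {
      if (id !== task.id) return;
      worker.off('message', onMessage);
      resolve();
    };
    worker.on('message', onMessage);
    worker.postMessage(task);
  });
}

async function main() {
  const worker = new Worker(workerSource, { eval: true });
  const timings = {};
  try {
    // Warm-up task so thread start-up time is not measured.
    await runTask(worker, { id: 0, ms: 0, sync: false });
    let id = 1;
    for (const sync of [false, true]) {
      const start = Date.now();
      await runTask(worker, { id: id++, ms: 100, sync });
      timings[sync ? 'sync' : 'async'] = Date.now() - start;
    }
  } finally {
    await worker.terminate();
  }
  return timings;
}

const done = main().then((timings) => {
  console.log(timings); // both values should be close to 100
  return timings;
});
```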
Tuning pool performance
It is possible to tune the performance of the Piscina worker pool using a variety of options:
- minThreads - The minimum number of threads always running
- maxThreads - The maximum number of threads allowed
- idleTimeout - The number of milliseconds a thread is permitted to remain idle
- maxQueue - The maximum number of pending work items
- concurrentTasksPerWorker - The number of work items to dispatch concurrently to a single thread
We'll use idleTimeout and concurrentTasksPerWorker in this example.
By default, a Piscina worker thread terminates as soon as it has nothing to do, that is, whenever the work queue is empty. If we're not keeping our queue filled, this incurs additional overhead as Node.js spins up new worker threads to handle incoming requests.
Also by default, Piscina will assume that jobs are synchronous in nature and will dispatch only a single job per thread at any time. If the workload is asynchronous, setting concurrentTasksPerWorker to a higher number will increase the number of jobs Piscina will send to a single worker, allowing it to process multiple tasks.
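One way a script could expose these two options is through positional command-line arguments. The sketch below is an assumption about how that might look, not the example's actual source; poolOptionsFromArgs is a hypothetical helper, and the defaults simply mirror the behaviour described above (one task per worker at a time, idle threads terminate immediately).

```javascript
'use strict';

// Map two optional positional arguments to pool options.
function poolOptionsFromArgs(args) {
  const [tasksArg, idleArg] = args;
  return {
    concurrentTasksPerWorker: Number.parseInt(tasksArg ?? '1', 10),
    idleTimeout: Number.parseInt(idleArg ?? '0', 10)
  };
}

// In a real script the arguments would come from process.argv.slice(2).
const options = poolOptionsFromArgs(['10', '1000']);
console.log(options);

// The object would then be passed to the pool alongside the worker file,
// for example: new Piscina({ filename, ...options })
```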
The async-sleep-pooled example has been written to accept two optional parameters. The first is the concurrentTasksPerWorker, and the second is the idleTimeout. To see the effect setting each has on the performance of the example, run:
$ node async-sleep-pooled 10 1000
Then run autocannon again:
$ autocannon localhost:3000
You should see that the performance of the pooled example improves significantly, but is still slightly less than the async-sleep-unpooled version:
Running 10s test @ http://localhost:3000
10 connections
┌─────────┬────────┬────────┬────────┬────────┬───────────┬──────────┬───────────┐
│ Stat │ 2.5% │ 50% │ 97.5% │ 99% │ Avg │ Stdev │ Max │
├─────────┼────────┼────────┼────────┼────────┼───────────┼──────────┼───────────┤
│ Latency │ 100 ms │ 101 ms │ 126 ms │ 135 ms │ 105.16 ms │ 23.23 ms │ 391.75 ms │
└─────────┴────────┴────────┴────────┴────────┴───────────┴──────────┴───────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Req/Sec │ 76 │ 76 │ 97 │ 99 │ 94.2 │ 6.42 │ 76 │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Bytes/Sec │ 12.5 kB │ 12.5 kB │ 15.9 kB │ 16.2 kB │ 15.5 kB │ 1.05 kB │ 12.5 kB │
└───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
Let's use the same options for the sync-sleep-pooled.js example:
$ node sync-sleep-pooled 10 1000
Then run autocannon again:
$ autocannon localhost:3000
With the results:
Running 10s test @ http://localhost:3000
10 connections
┌─────────┬────────┬────────┬────────┬────────┬───────────┬──────────┬──────────┐
│ Stat │ 2.5% │ 50% │ 97.5% │ 99% │ Avg │ Stdev │ Max │
├─────────┼────────┼────────┼────────┼────────┼───────────┼──────────┼──────────┤
│ Latency │ 100 ms │ 200 ms │ 201 ms │ 329 ms │ 169.05 ms │ 52.78 ms │ 456.5 ms │
└─────────┴────────┴────────┴────────┴────────┴───────────┴──────────┴──────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬───────┬─────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼───────┼─────────┤
│ Req/Sec │ 46 │ 46 │ 60 │ 60 │ 58.4 │ 4.16 │ 46 │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼───────┼─────────┤
│ Bytes/Sec │ 7.55 kB │ 7.55 kB │ 9.85 kB │ 9.85 kB │ 9.58 kB │ 681 B │ 7.54 kB │
└───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴───────┴─────────┘
The options have no noticeable impact on the sync sleep version. This is because the workload is fully synchronous and cannot be processed concurrently within a thread, no matter what the concurrency settings are. The performance, in other words, is bound entirely to the synchronous delay and can only be improved by reducing the event loop block.
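A simplified model of a single worker thread shows the difference. It is an illustration under stated assumptions (one thread, fixed delay, no messaging overhead), not a measurement; batchTimeMs is a hypothetical name.

```javascript
'use strict';

// Time for ONE worker thread to finish `tasks` sleeps of `delayMs` each.
// Async sleeps overlap up to `concurrency` at a time; sync sleeps block
// the thread, so they run one after another whatever the setting.
function batchTimeMs(tasks, delayMs, concurrency, sync) {
  const effective = sync ? 1 : concurrency;
  return Math.ceil(tasks / effective) * delayMs;
}

console.log(batchTimeMs(10, 100, 10, false)); // 100
console.log(batchTimeMs(10, 100, 10, true));  // 1000
console.log(batchTimeMs(10, 100, 1, true));   // 1000
```

Raising concurrentTasksPerWorker only helps when the tasks yield to the worker's event loop while they wait.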
You can also check out this example on GitHub.