Skip to main content

Server

The benefit of offloading work to a worker pool will vary significantly based on work load. In some cases, the worker pool may actually be a detriment to performance. Careful consideration must be given.

In this example, we create a simple Fastify http server that responds with a simple JSON object after introducing an artificial delay of 100 milliseconds.

There are four variants to the server:

  • async-sleep-unpooled - Uses an async delay using a Promise-wrapped setTimeout. The event loop is allowed to keep turning during the delay allowing the server to respond to additional requests. The delay occurs within the main thread.

  • async-sleep-pooled - Moves the async delay to the worker threads for processing.

  • sync-sleep-unpooled - Uses a sync delay using Atomics.wait to block the main thread for 100 milliseconds before responding. Because the main thread is blocked, the event loop is not turning while we are waiting. This simulates a particularly expensive event loop blocking scenario.

  • sync-sleep-pooled - Moves the sync delay into the worker threads for processing.

note

For both async-sleep-pooled and sync-sleep-pooled, the main thread uses a Promise to wait on the worker to complete it's task before responding to the request.

Setting up

The code files for this example can be found on github.

To get started, run npm i to install dependencies. We use autocannon for benchmarking the four variants:

$ npm i -g autocannon

Let's start with the two async sleep variants. First run:

$ node async-sleep-unpooled

And in a separate terminal, run autocannon:

$ autocannon localhost:3000

You should see results similar to:

Running 10s test @ http://localhost:3000
10 connections

┌─────────┬────────┬────────┬────────┬────────┬──────────┬─────────┬───────────┐
│ Stat │ 2.5% │ 50% │ 97.5% │ 99% │ Avg │ Stdev │ Max │
├─────────┼────────┼────────┼────────┼────────┼──────────┼─────────┼───────────┤
│ Latency │ 100 ms │ 101 ms │ 104 ms │ 125 ms │ 101.2 ms │ 3.09 ms │ 130.66 ms │
└─────────┴────────┴────────┴────────┴────────┴──────────┴─────────┴───────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬───────┬───────┬─────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼─────────┼─────────┼─────────┼─────────┼───────┼───────┼─────────┤
│ Req/Sec │ 90 │ 90 │ 99 │ 100 │ 97.8 │ 3.32 │ 90 │
├───────────┼─────────┼─────────┼─────────┼─────────┼───────┼───────┼─────────┤
│ Bytes/Sec │ 14.8 kB │ 14.8 kB │ 16.2 kB │ 16.4 kB │ 16 kB │ 545 B │ 14.8 kB │
└───────────┴─────────┴─────────┴─────────┴─────────┴───────┴───────┴─────────┘
info

Note that this server is not particularly fast at all due to the artificially imposed 100 millisecond delay. It's possible to achieve significantly higher results by tuning the way autocannon is sending requests. For instance, running autocannon with the -c 100 -p 2 options increases performance of the async-sleep-unpooled example significantly.

Let's see how the Piscina version does in comparison:

Run:

$ node async-sleep-pooled

And autocannon again:

$ autocannon localhost:3000

Your result should be similar to:

Running 10s test @ http://localhost:3000
10 connections

┌─────────┬────────┬────────┬────────┬────────┬───────────┬──────────┬───────────┐
│ Stat │ 2.5% │ 50% │ 97.5% │ 99% │ Avg │ Stdev │ Max │
├─────────┼────────┼────────┼────────┼────────┼───────────┼──────────┼───────────┤
│ Latency │ 126 ms │ 169 ms │ 197 ms │ 339 ms │ 170.13 ms │ 31.41 ms │ 425.33 ms │
└─────────┴────────┴────────┴────────┴────────┴───────────┴──────────┴───────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬───────┬─────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼───────┼─────────┤
│ Req/Sec │ 43 │ 43 │ 60 │ 60 │ 58.2 │ 5.08 │ 43 │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼───────┼─────────┤
│ Bytes/Sec │ 7.05 kB │ 7.05 kB │ 9.85 kB │ 9.85 kB │ 9.55 kB │ 833 B │ 7.05 kB │
└───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴───────┴─────────┘

Notice that the pooled version is slower! This is because in both async examples, the event loop on the main thread is still active, allowing it to service more requests. The worker pool here, however, adds additional performance overhead marshalling data back and forth between threads and allocating additional Promises. In this scenario, the worker pool does not add much benefit.

Let's look as the sync cases and see what happens with those.

Run:

$ node sync-sleep-unpooled

And run autocannon again:

$ autocannon localhost:3000

The results should be fairly awful in comparison to the first two cases:

Running 10s test @ http://localhost:3000
10 connections

┌─────────┬────────┬─────────┬─────────┬─────────┬──────────┬───────────┬────────────┐
│ Stat │ 2.5% │ 50% │ 97.5% │ 99% │ Avg │ Stdev │ Max │
├─────────┼────────┼─────────┼─────────┼─────────┼──────────┼───────────┼────────────┤
│ Latency │ 338 ms │ 1007 ms │ 1811 ms │ 1919 ms │ 962.2 ms │ 255.54 ms │ 1919.77 ms │
└─────────┴────────┴─────────┴─────────┴─────────┴──────────┴───────────┴────────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬────────┬─────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼────────┼─────────┤
│ Req/Sec │ 9 │ 9 │ 10 │ 10 │ 9.81 │ 0.4 │ 9 │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼────────┼─────────┤
│ Bytes/Sec │ 1.48 kB │ 1.48 kB │ 1.64 kB │ 1.64 kB │ 1.61 kB │ 65.6 B │ 1.48 kB │
└───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴────────┴─────────┘

The reason for the performance drop should be apparent. Synchronously sleeping blocks the event loop from turning while a request is being processed, which means the server is unable to do anything else while it waits.

To see how the pooled version fares, run:

$ node sync-sleep-pooled

And run autocannon again

$ autocannon localhost:3000

The results should be nearly identical to the async-sleep-pooled version!

Running 10s test @ http://localhost:3000
10 connections

┌─────────┬────────┬────────┬────────┬────────┬───────────┬──────────┬──────────┐
│ Stat │ 2.5% │ 50% │ 97.5% │ 99% │ Avg │ Stdev │ Max │
├─────────┼────────┼────────┼────────┼────────┼───────────┼──────────┼──────────┤
│ Latency │ 124 ms │ 174 ms │ 200 ms │ 335 ms │ 169.99 ms │ 36.22 ms │ 422.2 ms │
└─────────┴────────┴────────┴────────┴────────┴───────────┴──────────┴──────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬───────┬─────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼───────┼─────────┤
│ Req/Sec │ 42 │ 42 │ 60 │ 60 │ 58.2 │ 5.4 │ 42 │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼───────┼─────────┤
│ Bytes/Sec │ 6.89 kB │ 6.89 kB │ 9.85 kB │ 9.85 kB │ 9.55 kB │ 886 B │ 6.89 kB │
└───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴───────┴─────────┘

The reason sync-sleep-pooled and async-sleep-pooled yield identical results is because, in both cases, the main thread is doing identical work -- that is, receiving a request that is dispatched to a worker, waiting about 100 milliseconds, the returning the response. The main thread here does not care whether the workers are sleeping synchronously or asychronously.

Tuning pool performance

It is possible to tune the performance of the Piscina worker pool using a variety of options:

  • minThreads - The minimum number of threads always runnin
  • maxThreads - The maximum number of threads allowed
  • idleTimeout - The number of millisecondsa thread is permitted to remain idle
  • maxQueue -- The maximum number of pending work items
  • concurrentTasksPerWorker -- The number of work items to dispatch concurrently to a single thread.

We'll use idleTimeout and concurrentTasksPerWorker in this example.

By default, as soon as a Piscina worker thread has nothing to do, it will terminate. This means that if the work queue is empty, the thread will terminate. If we're not keeping our queue filled, this will incur additional overhead as Node.js spins up new worker threads to handle incoming requests.

Also by default, Piscina will assume that jobs are synchronous in nature and will dispatch only a single job per thread at any time. If the workload is asynchronous, setting concurrentTasksPerWorker to a higher number will increase the number of jobs Piscina will send to a single worker, allowing it to process multiple tasks.

The async-sleep-pooled example has been written to accept two optional parameters. The first is the concurrentTasksPerWorker, and the second is the idleTimeout. To see the effect setting each has on the performance of the example, run:

$ node async-sleep-pooled 10 1000

Then run autocannon again:

$ autocannon localhost:3000

You should see that the performance of the pooled example improves significantly, but is still slightly less than the async-sleep-unpooled version:

Running 10s test @ http://localhost:3000
10 connections

┌─────────┬────────┬────────┬────────┬────────┬───────────┬──────────┬───────────┐
│ Stat │ 2.5% │ 50% │ 97.5% │ 99% │ Avg │ Stdev │ Max │
├─────────┼────────┼────────┼────────┼────────┼───────────┼──────────┼───────────┤
│ Latency │ 100 ms │ 101 ms │ 126 ms │ 135 ms │ 105.16 ms │ 23.23 ms │ 391.75 ms │
└─────────┴────────┴────────┴────────┴────────┴───────────┴──────────┴───────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Req/Sec │ 76 │ 76 │ 97 │ 99 │ 94.2 │ 6.42 │ 76 │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Bytes/Sec │ 12.5 kB │ 12.5 kB │ 15.9 kB │ 16.2 kB │ 15.5 kB │ 1.05 kB │ 12.5 kB │
└───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

Let's use the same options for the sync-sleep-pooled.js example:

$ node sync-sleep-pooled 10 1000

Then run autocannon again:

$ autocannon localhost:3000

With the results:

Running 10s test @ http://localhost:3000
10 connections

┌─────────┬────────┬────────┬────────┬────────┬───────────┬──────────┬──────────┐
│ Stat │ 2.5% │ 50% │ 97.5% │ 99% │ Avg │ Stdev │ Max │
├─────────┼────────┼────────┼────────┼────────┼───────────┼──────────┼──────────┤
│ Latency │ 100 ms │ 200 ms │ 201 ms │ 329 ms │ 169.05 ms │ 52.78 ms │ 456.5 ms │
└─────────┴────────┴────────┴────────┴────────┴───────────┴──────────┴──────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬───────┬─────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼───────┼─────────┤
│ Req/Sec │ 46 │ 46 │ 60 │ 60 │ 58.4 │ 4.16 │ 46 │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼───────┼─────────┤
│ Bytes/Sec │ 7.55 kB │ 7.55 kB │ 9.85 kB │ 9.85 kB │ 9.58 kB │ 681 B │ 7.54 kB │
└───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴───────┴─────────┘

The options have no noticeable impact on the sync sleep version. The reason is because the workload is fully synchronous and cannot be processed concurrently no matter what the concurrency settings are. The performance, in other words, is bound entirely to the synchronous delay and can only be improved by reducing the event loop block.

You can also check out this example on github.