What’s the best way to safely increase parallelism in a production Node service? What if your application is a bank integration service? That’s a question the author’s team needed to answer a couple of months ago. By Evan Limanto.
We were running 4,000 Node containers (or “workers”) for our bank integration service. The service was originally designed such that each worker would process only a single request at a time. This design lessened the impact of integrations that accidentally blocked the event loop, and allowed us to ignore the variability in resource usage across different integrations. But since our total capacity was capped at 4,000 concurrent requests, the system did not scale gracefully. Most requests were network-bound, so we could improve our capacity and reduce our costs if we could just figure out how to increase parallelism safely.
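Since network-bound requests spend most of their time waiting, a single worker can interleave several of them on one event loop. A minimal sketch of the idea is to replace the hard one-request-at-a-time rule with a configurable in-flight cap per worker; the names here (`MAX_CONCURRENT`, `handle`) are illustrative, not from the article.

```javascript
// Cap the number of requests a worker processes concurrently,
// instead of hard-coding one request at a time.
const MAX_CONCURRENT = 4; // a value you would raise gradually during rollout

let inFlight = 0;
const queue = [];

// Wait for a free concurrency slot.
function acquire() {
  if (inFlight < MAX_CONCURRENT) {
    inFlight++;
    return Promise.resolve();
  }
  return new Promise((resolve) => queue.push(resolve));
}

// Release a slot: hand it to a waiter if one exists.
function release() {
  const next = queue.shift();
  if (next) next();
  else inFlight--;
}

// Run a request function under the concurrency cap.
async function handle(requestFn) {
  await acquire();
  try {
    return await requestFn();
  } finally {
    release();
  }
}
```

Because the cap is a single constant, it can be tuned per deploy while watching resource metrics, which is exactly the kind of incremental rollout the article describes.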
This is an in-depth article covering all of this and more:
- Why they invested in parallelism
- How they rolled out changes reliably
- Deploy, investigate, repeat
- Results and learnings
Never underestimate the importance of having low-level metrics for a system. Being able to monitor GC and memory statistics during the rollout was essential. It’s a lengthy article, but there’s plenty of value in it. Highly recommended!
[Read More]