Worker health in production is monitored using PM2's built-in monitoring dashboard, cluster lifecycle events (online, exit, disconnect), custom heartbeat mechanisms via IPC, and external APM tools like New Relic, Datadog, or Prometheus with Grafana.
Monitoring worker health is critical in production to detect crashes, memory leaks, and unresponsive processes. There are multiple layers of monitoring: built-in Node.js cluster events for basic health, PM2 for process management dashboards, and APM (Application Performance Monitoring) tools for deep observability.
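The first layer mentioned above, built-in cluster events, can be sketched roughly as follows. This is a minimal illustration, not a production setup: the helper names describeExit and watchLifecycle are hypothetical, and the policy of re-forking only after unexpected exits is an assumption rather than something the cluster module does for you.

```javascript
const cluster = require('node:cluster');

// Hypothetical helper: turn the arguments of an 'exit' event into a log line.
function describeExit(workerId, code, signal) {
  if (signal) return `worker ${workerId} killed by signal ${signal}`;
  if (code !== 0) return `worker ${workerId} crashed with exit code ${code}`;
  return `worker ${workerId} exited cleanly`;
}

// Hypothetical helper: attach lifecycle listeners in the primary process.
// It is only defined here; call it once before forking workers.
function watchLifecycle(log = console.log) {
  cluster.on('online', (worker) => {
    log(`worker ${worker.id} online (pid ${worker.process.pid})`);
  });

  cluster.on('disconnect', (worker) => {
    log(`worker ${worker.id} disconnected from the IPC channel`);
  });

  cluster.on('exit', (worker, code, signal) => {
    log(describeExit(worker.id, code, signal));
    // exitedAfterDisconnect is true when the exit was requested through
    // worker.disconnect() or worker.kill(), so only unexpected deaths
    // are replaced (assumed policy).
    if (!worker.exitedAfterDisconnect) cluster.fork();
  });
}
```

In the primary process this would be used as watchLifecycle() followed by one cluster.fork() per desired worker.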
PM2: pm2 monit shows real-time CPU and memory per worker; the max_memory_restart option restarts a worker that exceeds a memory threshold
Cluster lifecycle events: listen for online, exit, and disconnect events on the cluster module in the primary process
Custom heartbeat via IPC: workers send periodic pings; the primary (formerly called the master) kills any worker that stays silent past a timeout
process.memoryUsage(): sample heap usage per worker; steady growth over time points to a memory leak
Datadog / New Relic APM: instrument workers with agents for distributed tracing and alerting
Prometheus + Grafana: expose custom metrics per worker, scrape them with Prometheus, and chart them in Grafana
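The IPC heartbeat and process.memoryUsage() points can be combined in one small sketch. Everything named here is an assumption made for illustration: the interval and timeout values, the message shape { type: 'heartbeat', heapUsed }, and the helper names findStaleWorkers, startHeartbeat, and startWatchdog. Re-forking a killed worker is left to whatever handles the exit event.

```javascript
const cluster = require('node:cluster');

// Assumed values for illustration; tune them for the real workload.
const HEARTBEAT_INTERVAL_MS = 5000;
const HEARTBEAT_TIMEOUT_MS = 15000;

// Pure helper: given a Map of worker id -> last heartbeat time (ms),
// return the ids whose last heartbeat is older than timeoutMs.
function findStaleWorkers(lastSeen, now, timeoutMs) {
  const stale = [];
  for (const [id, timestamp] of lastSeen) {
    if (now - timestamp > timeoutMs) stale.push(id);
  }
  return stale;
}

// Worker side: send a periodic ping over IPC. heapUsed rides along so the
// primary could log or export it to spot growth over time.
function startHeartbeat() {
  return setInterval(() => {
    process.send({
      type: 'heartbeat',
      heapUsed: process.memoryUsage().heapUsed,
    });
  }, HEARTBEAT_INTERVAL_MS);
}

// Primary side: record heartbeats and kill workers that go silent.
function startWatchdog() {
  const lastSeen = new Map();

  cluster.on('online', (worker) => lastSeen.set(worker.id, Date.now()));
  cluster.on('message', (worker, msg) => {
    if (msg && msg.type === 'heartbeat') lastSeen.set(worker.id, Date.now());
  });
  cluster.on('exit', (worker) => lastSeen.delete(worker.id));

  return setInterval(() => {
    const stale = findStaleWorkers(lastSeen, Date.now(), HEARTBEAT_TIMEOUT_MS);
    for (const id of stale) {
      const worker = cluster.workers[id];
      if (!worker) {
        lastSeen.delete(id); // worker already gone; drop the stale entry
        continue;
      }
      // A blocked event loop cannot handle a graceful disconnect,
      // so signal the underlying child process directly.
      worker.process.kill('SIGKILL');
    }
  }, HEARTBEAT_INTERVAL_MS);
}
```

The primary would call startWatchdog() before forking, and each worker would call startHeartbeat() at startup. Keeping the staleness check in a pure function makes the timeout logic easy to unit test without spawning processes.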