What is the autoscaling logic? Do we have our own logic or we use Amazon’s? Can I add workers when autoscaling is on? if not, why?
I am not an expert, but as far as I know we look at the number of events in non-paused queues. If the backlog crosses a certain threshold (I believe it’s 500 by default), we spin up more capacity, up to
maxWorkerNodes configuration parameter.
You can manually add more workers when autoscaling is on (provided that you have permissions to do so). Those workers won’t auto-shutdown when the processing is complete, so there’s a risk of idling the workers.
The platform autoscaling will shutdown workers down to minWorkerNodes when the number of running and pending requests in the active queues gores down to 0.
Do you know the exact logic for determining the number of workers necessary to scale up? Once the elements in queues exceeds that threshold does it simple spin up workers until you’re at
maxWorkerNodes or is there some incremental/smart scaling going on?
WRT to the threshold, do you know if that is configurable? Not all elements in the queues process at the same speed…
No smart scaling, just an on/off switch.
Threshold is configurable I believe, it’s a cluster configuration option. Need cluster admin rights for that.
So you’re saying that the cluster will go from Min --> Max as soon as the “threshold” of work is surpassed? Theres no incremental scaling?
Correct. There’s a bit of delay between worker creation, but they are ramped up from min to max as soon as the threshold is met.
This is really good info. Thanks everyone.
How long does the threshold have to be exceeded before the next batch of workers gets automatically spun up?
I had over 20,000 entries waiting in the JmsDataLoadQueue and the system sat on 5 worker nodes for at least 30 - 45 minutes before it decided to spin up to 10. It then worked with 10 worker nodes for about 90 minutes before it went to maxnodes. However, I suspect it was not automatic, as the number of workers went from 10 to 50 almost immediately after a conversation with Ops and raising the ticket to P1. Would love to know the algorithms for determining when to spin up more and if this is configurable by someone like me (not Ops but part of Data Integration team).
On a side note,
ClusterConfig.getAutoScaleMaxWorkerNodes() is used as the autoscaling limit if the workers are below or equal to it value set.
Cluster.hosts() will return more machines then
ClusterConfig.getAutoScaleWorkersNodes because they may have been added manually.
@garnaiz can the backlog thresholds be changed? Right now, it only scales down when we have 0 pending messages. We have a customer who will always have 30-40 pending messages on the queues and the workers wont scale down. Can we change the scale down threshold to 100 instead of 0?
why will there “Always” be 30-40 pending messages?
we have streaming data coming in every 5-15 mins.
They can be changed, but as soon as streaming data exceeds high threshold then maxWorkers will be created again. And another thing to consider is that there is another parameter that affects how workers are shut down in autoscaling. Originally, it was too expensive to spin up workers, so we decided that once the low threshold was hit, the cluster manager would wait for 30 minutes before shutting down the workers. This is also controlled by an autoscale parameter called