We have several Cron Jobs set to start Map jobs every half hour in a production environment. These jobs worked as expected for several months, but lately they have begun to skip executions. For example, one job may run on the hour, skip the next two cycles, run again, miss one, and then run continuously. I have monitored the Cron Queue and can see that the triggers and invalidations are populated, but sometimes they are inexplicably removed, only to reappear later, often invalidated by a different node. Furthermore, all of these jobs are being invalidated by a master node, while every other existing job seems to be invalidated by a worker node. The history of the Cron Job also shows the missed executions. I am just trying to understand what factors may affect whether or not these jobs fire. As far as we can tell, the Map jobs that do get kicked off run as expected without fail. We are running on 220.127.116.112.
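For anyone trying to quantify the skips described above, here is a minimal sketch in plain Python (it does not use any C3 API). It assumes the execution timestamps have already been exported from the Cron Job history by some means; the function name `missed_slots`, the five-minute tolerance, and the sample history are all hypothetical and only illustrate the pattern from the question.

```python
from datetime import datetime, timedelta

def missed_slots(executions, start, end,
                 interval=timedelta(minutes=30),
                 tolerance=timedelta(minutes=5)):
    """Return scheduled slots in [start, end] that have no execution within `tolerance`."""
    runs = sorted(executions)
    missed = []
    slot = start
    while slot <= end:
        # A slot counts as fired if any recorded run lies close enough to it.
        if not any(abs(run - slot) <= tolerance for run in runs):
            missed.append(slot)
        slot += interval
    return missed

# Hypothetical history mirroring the question: run on the hour, skip two
# cycles, run again, miss one, then run continuously.
day = datetime(2020, 1, 1)  # arbitrary illustrative date
history = [day.replace(hour=h, minute=m)
           for h, m in [(9, 0), (10, 30), (11, 30), (12, 0), (12, 30)]]

gaps = missed_slots(history, day.replace(hour=9), day.replace(hour=12, minute=30))
for slot in gaps:
    print("missed:", slot.strftime("%H:%M"))
```

Comparing the missed slots against the times at which invalidations disappear from the queue may help show whether the two are correlated.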
I had a similar issue. You must have some errors in the
CronQueue that explain those silent failures; check them with
@bachr the only error I see in the queue is:
errorMsg: wrapped NullPointerException at c3.engine.database.async.CronQueueMethods.makeJobHistory (CronQueueMethods.java:265) null
This was due to the cron job cache being out of sync on one or more masters. As a result, some of the masters thought those jobs were inactive (e.g. the cache didn’t get cleared when you made them active). I emptied that specific cache on each of the masters and it appears to be working properly.
So far we suspect the following root cause: long-running jobs in other tenants of the same environment running at the same time.
We did the following: rebooted the masters and manually recovered the failed Cron Jobs.
But we don’t know what exactly fixed the issue.
It’s possible, but I can’t say with 100% certainty. Rebooting the masters would definitely fix it though.