We start jobs in our Data Services environment (14.0.3.273, Windows 2008 R2) through the RUN_BATCH_JOB web service interface.
For the most part this is working as expected.
However, occasionally a job or two hangs. No error is thrown, and the job just shows as running in the admin console. If I look at the monitor log I can see where records have been read in, and the trace log shows nothing unusual. There is nothing in the error log.
I end up having to Abort these jobs from within the admin console. At this point I have also noticed that there remains a single AL_ENGINE.EXE in the task manager that I must manually kill. Prior to the Abort from within the admin console there were 6 to 7 AL_ENGINE.EXE tasks for this job. Aborting killed all of them except 1, which I find very odd.
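To check for the stray process described above without opening Task Manager, something like the following sketch could list the PIDs of any AL_ENGINE.EXE instances still running. It relies on the standard Windows `tasklist` command with CSV output. The function names are my own, and it cannot tell which engine process belongs to which job.

```python
import csv
import io
import subprocess
from typing import List

IMAGE = "al_engine.exe"  # process image name as shown in Task Manager


def parse_tasklist_csv(output: str, image: str = IMAGE) -> List[int]:
    """Extract PIDs for `image` from `tasklist /FO CSV /NH` output."""
    pids = []
    for row in csv.reader(io.StringIO(output)):
        # Expected columns: Image Name, PID, Session Name, Session#, Mem Usage.
        # Informational lines (e.g. "INFO: No tasks ...") have one column and are skipped.
        if len(row) >= 2 and row[0].lower() == image.lower() and row[1].isdigit():
            pids.append(int(row[1]))
    return pids


def running_engine_pids(image: str = IMAGE) -> List[int]:
    """Run tasklist (Windows only) and return the PIDs of matching processes."""
    result = subprocess.run(
        ["tasklist", "/FI", f"IMAGENAME eq {image}", "/FO", "CSV", "/NH"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_tasklist_csv(result.stdout, image)
```

Running `running_engine_pids()` on the job server before and after an abort would show whether an engine process was left behind.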
Since nothing useful appears in the error, trace, or monitor logs, I'm at a loss as to where to look to diagnose what is going on.
After I abort the jobs and kill the stray AL_ENGINE.EXE task, I can resubmit the exact same job with the same source data file, and execution is successful.
Therefore this would seem to be a problem independent of the source data and/or job.
Any advice on where to start looking to figure out what is happening? This is affecting a production set of jobs that we depend on running in a timely, automated manner. I must be able to trust that jobs will either run or fail with an error. When a job hangs, my application is left thinking that the job is just taking a long time to execute, which would be fine if it eventually finished, but the hung jobs do not finish.
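As a stopgap on the application side, I could stop waiting indefinitely and treat a job as hung after a deadline. A minimal sketch of that idea follows. `get_status` is a hypothetical placeholder for whatever call my application uses to check job status, and the status strings are made up for illustration; neither is taken from the Data Services API.

```python
import time
from typing import Callable

# Hypothetical terminal states; the real values depend on the status call used.
TERMINAL_STATES = ("succeeded", "error")


def wait_for_job(
    get_status: Callable[[], str],
    timeout_s: float,
    poll_s: float = 30.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal state or the deadline passes.

    Returns the terminal status, or "timed_out" if the job is still
    running once timeout_s seconds have elapsed.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            return "timed_out"
        sleep(poll_s)
```

A "timed_out" result would at least let the application raise an alert instead of waiting forever, although it does nothing to explain why the job hung in the first place.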
Let me know what additional information I can provide to help.
Thanks,
Richard