Failed job check missing job failures

mnelson · 12 October 2020 20:51

First off, thank you for all of your work putting this tool together! quite handy!
As I’ve put it on a couple servers, I’ve started noticing that the failed job check -1 misses job failures.
The check looks as msdb…sysjobhistory based off the check’s {LAST_CHECK_DATE}.
the function msdb.dbo.agent_datetime(run_date, run_time) is the time that the job starts executing.
if the failed job check runs once a minute, it will only alert for a job that has started after the {LAST_CHECK_DATE}, or in another words, it will only alert if a job ran for less than 1 minute and failed. If a job runs for an hour and then fails, the check will not catch it.

My suggestion is to update the check and report to add the run_duration to the job start time and then compare that calculated column to {LAST_CHECK_DATE} instead.

for example:

select @output=count(*) from msdb.dbo.sysjobhistory
where DATEADD(second, DATEDIFF(second, GETDATE(), GETUTCDATE()), --convert to UTC
dateadd
(ss, --add seconds for run_duration
(CAST(SUBSTRING((right(‘0000000’ + convert(varchar(7), run_duration), 7)),1,3) AS INT) * 60 * 60
+ CAST(SUBSTRING((right(‘0000000’ + convert(varchar(7), run_duration), 7)),4,2) AS INT) * 60
+ CAST(SUBSTRING((right(‘0000000’ + convert(varchar(7), run_duration), 7)),6,2) AS INT)
)
,msdb.dbo.agent_datetime(run_date, run_time)) --add seconds to job start date/time
) >= dateadd(second,-10,’{LAST_CHECK_DATE}’) and run_status = 0 and step_id <> 0

Marcin · 13 October 2020 11:18

Hi, thanks for the kind words.

Your suggestion makes sense.

When I developed the “checks engine” I wrote few checks as examples for people to write their own checks for their own environments. It looks though that everyone ended up using my, not always correct, examples

Checks are changing a bit in the upcoming version 3.x so I will take that into account - or feel free to submit a PR to GitHub

mnelson · 13 October 2020 15:46

Thanks for the feedback Marcin!

Here is the updated query for report_id= -2 to also incorporate my modified logic for the the l last check date compared to the time the job failed.

`;WITH cte_failed_jobs AS (
SELECT [Job] = sj.NAME ,[Step] = sjs.step_name
,[Message] = sjh.message
,[Run Time] = msdb.dbo.agent_datetime(sjh.run_date, sjh.run_time)
,Duration=CAST(SUBSTRING((right(‘0000000’ + convert(varchar(7), run_duration), 7)),1,3) AS INT) * 60 * 60 + CAST(SUBSTRING((right(‘0000000’ + convert(varchar(7), run_duration), 7)),4,2) AS INT) * 60 + CAST(SUBSTRING((right(‘0000000’+convert(varchar(7), run_duration), 7)),6,2) AS INT)
FROM msdb.dbo.sysjobhistory sjh
INNER JOIN msdb.dbo.sysjobs sj ON sjh.job_id = sj.job_id
INNER JOIN msdb.dbo.sysjobsteps sjs ON sjs.job_id = sj.job_id AND sjh.step_id = sjs.step_id
WHERE sjh.step_id > 0
AND dateadd (ss, /*add seconds for run_duration */
(CAST(SUBSTRING((right(‘0000000’ + convert(varchar(7), run_duration), 7)),1,3) AS INT) * 60 * 60 +CAST(SUBSTRING((right(‘0000000’ + convert(varchar(7), run_duration), 7)),4,2) AS INT) * 60 + CAST(SUBSTRING((right(‘0000000’ + convert(varchar(7), run_duration), 7)),6,2) AS INT) )
,msdb.dbo.agent_datetime(run_date, run_time)) >=
isnull(( /SELECT dateadd(second, - 1, last_check_date),/
SELECT dateadd(second, - 1, DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), GETDATE()), last_check_date) ) /convert from UTC time to local time/
FROM [dbo].[sqlwatch_meta_check]
WHERE check_id = - 1 AND sql_instance = @@SERVERNAME ), getdate())

AND sjh.run_status = 0 )

SELECT @output = ( SELECT + '

JOB: ’ + c1.[Job] + ‘

’ + ( SELECT CHAR(10) + '

Step: ’ + c2.[Step]

CHAR(10)+‘executed on: ’ + convert(VARCHAR(23), c2.[Run Time], 121) + CHAR(10)+‘duration: ’ + cast(c2.duration as varchar) + ’ seconds’+ CHAR(10) + ‘
Message: ’ + c2.[Message] + ‘’
,c2.[Message]
FROM cte_failed_jobs c2
WHERE c1.[Job] = c2.[Job]
ORDER BY [Run Time]
FOR XML path(’’) ,type ).value(’.’, ‘nvarchar(MAX)’) t
FROM cte_failed_jobs c1 GROUP BY c1.[Job]
FOR XML path(’’) ,type ).value(’.’, ‘nvarchar(MAX)’)
`

mnelson · 13 October 2020 15:49

oh bugger. the html markup got boogered up inside of the editor here.
I also have a second improvement I’m working on for the report query to pull the error message out of ssisdb if an SSIS package fails.