Failed job check missing job failures

First off, thank you for all of your work putting this tool together! quite handy!
As I’ve put it on a couple servers, I’ve started noticing that the failed job check -1 misses job failures.
The check looks as msdb…sysjobhistory based off the check’s {LAST_CHECK_DATE}.
the function msdb.dbo.agent_datetime(run_date, run_time) is the time that the job starts executing.
if the failed job check runs once a minute, it will only alert for a job that has started after the {LAST_CHECK_DATE}, or in another words, it will only alert if a job ran for less than 1 minute and failed. If a job runs for an hour and then fails, the check will not catch it.

My suggestion is to update the check and report to add the run_duration to the job start time and then compare that calculated column to {LAST_CHECK_DATE} instead.

for example:

select @output=count(*) from msdb.dbo.sysjobhistory
where DATEADD(second, DATEDIFF(second, GETDATE(), GETUTCDATE()), --convert to UTC
dateadd
(ss, --add seconds for run_duration
(CAST(SUBSTRING((right(‘0000000’ + convert(varchar(7), run_duration), 7)),1,3) AS INT) * 60 * 60
+ CAST(SUBSTRING((right(‘0000000’ + convert(varchar(7), run_duration), 7)),4,2) AS INT) * 60
+ CAST(SUBSTRING((right(‘0000000’ + convert(varchar(7), run_duration), 7)),6,2) AS INT)
)
,msdb.dbo.agent_datetime(run_date, run_time)) --add seconds to job start date/time
) >= dateadd(second,-10,’{LAST_CHECK_DATE}’) and run_status = 0 and step_id <> 0

Hi, thanks for the kind words.

Your suggestion makes sense.

When I developed the “checks engine” I wrote few checks as examples for people to write their own checks for their own environments. It looks though that everyone ended up using my, not always correct, examples :slight_smile:

Checks are changing a bit in the upcoming version 3.x so I will take that into account - or feel free to submit a PR to GitHub

Thanks for the feedback Marcin!

Here is the updated query for report_id= -2 to also incorporate my modified logic for the the l last check date compared to the time the job failed.

`;WITH cte_failed_jobs AS (
SELECT [Job] = sj.NAME ,[Step] = sjs.step_name
,[Message] = sjh.message
,[Run Time] = msdb.dbo.agent_datetime(sjh.run_date, sjh.run_time)
,Duration=CAST(SUBSTRING((right(‘0000000’ + convert(varchar(7), run_duration), 7)),1,3) AS INT) * 60 * 60 + CAST(SUBSTRING((right(‘0000000’ + convert(varchar(7), run_duration), 7)),4,2) AS INT) * 60 + CAST(SUBSTRING((right(‘0000000’+convert(varchar(7), run_duration), 7)),6,2) AS INT)
FROM msdb.dbo.sysjobhistory sjh
INNER JOIN msdb.dbo.sysjobs sj ON sjh.job_id = sj.job_id
INNER JOIN msdb.dbo.sysjobsteps sjs ON sjs.job_id = sj.job_id AND sjh.step_id = sjs.step_id
WHERE sjh.step_id > 0
AND dateadd (ss, /*add seconds for run_duration */
(CAST(SUBSTRING((right(‘0000000’ + convert(varchar(7), run_duration), 7)),1,3) AS INT) * 60 * 60 +CAST(SUBSTRING((right(‘0000000’ + convert(varchar(7), run_duration), 7)),4,2) AS INT) * 60 + CAST(SUBSTRING((right(‘0000000’ + convert(varchar(7), run_duration), 7)),6,2) AS INT) )
,msdb.dbo.agent_datetime(run_date, run_time)) >=
isnull(( /SELECT dateadd(second, - 1, last_check_date),/
SELECT dateadd(second, - 1, DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), GETDATE()), last_check_date) ) /convert from UTC time to local time/
FROM [dbo].[sqlwatch_meta_check]
WHERE check_id = - 1 AND sql_instance = @@SERVERNAME ), getdate())

AND sjh.run_status = 0 )

SELECT @output = ( SELECT + '

JOB: ’ + c1.[Job] + ‘

’ + ( SELECT CHAR(10) + '

Step: ’ + c2.[Step]

  • CHAR(10)+‘executed on: ’ + convert(VARCHAR(23), c2.[Run Time], 121) + CHAR(10)+‘duration: ’ + cast(c2.duration as varchar) + ’ seconds’+ CHAR(10) + ‘
    Message: ’ + c2.[Message] + ‘
    ,c2.[Message]
    FROM cte_failed_jobs c2
    WHERE c1.[Job] = c2.[Job]
    ORDER BY [Run Time]
    FOR XML path(’’) ,type ).value(’.’, ‘nvarchar(MAX)’) t
    FROM cte_failed_jobs c1 GROUP BY c1.[Job]
    FOR XML path(’’) ,type ).value(’.’, ‘nvarchar(MAX)’)
    `

oh bugger. the html markup got boogered up inside of the editor here.
I also have a second improvement I’m working on for the report query to pull the error message out of ssisdb if an SSIS package fails.