Hello, I’m a relatively new DBA and I’m trying to setup sqlwatch in our environment.
I have the blocked process monitor setup to email me when blocking is detected and I thought I configured it to send me repeat alerts every 5 minutes with an hourly limit of 3, but this is not what I’m experiencing.
I created a row in sqlwatch_config_check_action as such:
I create blocking to test this
(I run two queries, one in a transaction with TABLOCKX, HOLDLOCK and the other a normal select against the same table)
and I’m receiving two emails as the initial alerts, the only difference is the alert time is 207ms apart. If I check sqlwatch_logger_check, I can see the check triggering as CRITICAL check_status only once and at every minute as I would expect.
These two, nearly identical emails, are the only email notification I’ve received so far about 30 minutes later. I appreciate any help understanding how sqlwatch_config_check_action works to configure repeat and timing of notifications. I’ve read your posts here and on stackexchange, but I’m not having luck figuring out what’s wrong or making work as I’d like it to.
For what it’s worth, action_recovery does seem to work fine, which I really like.
To help with debugging, you can enable verbose logging with the below:
set config_value = 1
where config_name = 'Logger Log Info messages'
This will start logging quite a lot of info in the
[dbo].[sqlwatch_app_log] table. You will have to dig out those relevant to this particular action/check and let me see it, please. Once done, disable the verbose logging:
set config_value = 0
where config_name = 'Logger Log Info messages'
Thanks. I enabled verbose logging.
The first three entries look like this:
||Errors when processing Action (Id: -8) for Check (Id: -2)
||Errors during action execution (usp_sqlwatch_internal_process_actions): Action (Id:-8) calling Report (Id: -3)
|2022-10-21 16:01:43.6850376|usp_sqlwatch_internal_process_actions|F7A4AA65-1BE9-4D0B-8B1F-054CA1E24A6E|Action (Id:-8) calling Report (Id: -3)|ERROR|92|NT SERVICE\SQLSERVERAGENT|dbo|50000|16|1|dbo.usp_sqlwatch_internal_process_reports|281|Errors during execution of (usp_sqlwatch_internal_process_reports). Please review action log.|
|2022-10-21 16:01:43.6730246|usp_sqlwatch_internal_process_reports|E3796F4B-3C89-450E-8FC7-09926979074F|Error when executing Template Report (usp_sqlwatch_internal_query_to_html_table), @report_batch_id: NULL, @report_id: -3|ERROR|92|NT SERVICE\SQLSERVERAGENT|dbo|207|16|1|NULL|2|Invalid column name ‘blocking_start_time’.|
Then the only other rows I see that seem related all look like this (with different spids recorded):
2022-10-24 15:53:43.0596663 usp_sqlwatch_internal_process_checks 55C51822-5204-42B0-97A6-039608B9ACB8 Determining @check_status for Blocked Process (id: -2). We will try to use baseline data. We have NOT got any baseline data. The @check_status is null. The check value is 1.00000, the warning threshold is (null), and critical threshold is >0. The variance is 1. The final result is CRITICAL. INFO 80 NT SERVICE\SQLSERVERAGENT dbo NULL NULL NULL NULL NULL NULL
I don’t see any other rows that seem like they might be related to the blocked processes report, check, or alert.
For what it’s worth, I did update sqlwatch_config_reportset report_definition to some new code you shared with someone else who was having trouble with blocked process reports (that’s something else I was ignoring for now), so I’m not sure if maybe that affected something? I have put that report_definition value back the way it was just now though and no change in what alerts I’m getting.
Is there anything with the status of INFO or WARNING that would explain why you’re not getting messages?
This may be why the report doesnt work though:
Invalid column name ‘blocking_start_time’.
Looking at the sqlwatch_meta_action_queue table, I’m able to match up the rows to the emails I received; so, it appears that the emails I am receiving are timed exactly as they’re being generated out of sqlwatch.
Since SQLWATCH_INTERNAL_ACTIONS job appears to be correctly processing whatever is in sqlwatch_meta_action_queue, next step is to verify what’s going on with SQLWATCH_INTERNAL_CHECKS, right?
For what it’s worth, the extended properties on the SQLWATCH DB show I’m on version
I forgot to stop my blocking test last night and as a result, sqlwatch sent emails all night. With the blocked process action configured like this:
This is what it sent me overnight and into today (and I also received in email) until I cleared the blocking at 10:53am:
Note the previous test was cleared yesterday at 15:42 and a new test started 4 minutes later.
I’m seeing nothing in the sqlwatch_app_log except tons of these that correlate to the failed blocking process checks:
Determining @check_status for Blocked Process (id: -2). We will try to use baseline data. We have NOT got any baseline data. The @check_status is null. The check value is 1.00000, the warning threshold is (null), and critical threshold is >0. The variance is 1. The final result is CRITICAL.
By the way, I did fix the issue with the report based on the code you provided here:
Can’t get SQL Watch 4.2 to Send Blocked Process Notifcations - General - SQLWATCH
Any thoughts on what I should look at next? I did spend the better part of yesterday and today reading and trying to understand the process in usp_sqlwatch_internal_process_alerts and usp_sqlwatch_internal_process_actions but nothing as of yet jumped out at me.
Is there any chance you could share a backup of your SQLWATCH database with me for investigation? It may be easier this way.
Unfortunately, I can’t. For now, I’m wiping everything out and reinstalling so I can confirm this is reproducible and restart troubleshooting with a clean environment. I’ll get back to you with further findings once I have something.
I think I might have found the issue as I could reproduce after reinstalling on a different server.
In usp_sqlwatch_internal_process_actions, getdate() is used in the 4th comparison when trying to set @action_type:
when @check_status <> ‘OK’ and last_check_status = @check_status and (@action_repeat_period_minutes is not null
and datediff(minute,isnull(@last_action_time,‘1970-01-01’),getdate()) > @action_repeat_period_minutes) then ‘REPEAT’
I created a table to dump the pertinent variable values from the procedure. My last_action_time is exposed as:
(time_added is a column I created to record when the record was added to my debugging table)
Testing using the last_action_time with getdate():
I think the difference in the UTC value of last_action_time vs EST value of getdate() is throwing off the case statement so that it almost always returns NONE instead of REPEAT when we want it to. This does kind of explain the erratic behavior of the notifications I was seeing overnight, I think.
I will try updating usp_sqlwatch_internal_process_actions tomorrow, so we’re comparing times in the same time zone and see if that resolves.
ah! good spot! it should all be UTC dates so sounds like a bug to me!
Changing getdate() to getutcdate() in usp_sqlwatch_internal_process_actions, line 179, seems to have resolved the major issue. Repeating alerts are usable now , though I still see behavior I don’t understand.
Does action_id -8 (Run Blocked Process Report) also generate its own email notification?
If -8 does, then I’d expect it roughly every minute, while the other should email roughly every 5 min.
What I got was 2 emails roughly every 8 minutes (1 from each check action, I believe).
In the below evidence, 8:01 is when I started my blocking. 8:50 is the recovery (only 1 as expected) after cancelling the blocking.
Am I misunderstanding how the timing of action_repeat_period_minutes is supposed to work?
Think of actions as triggers that run other things. (maybe a trigger would have been a better name, actually)
An action can either run an arbitrary command defined in the
[action_exec] field in
[dbo].[sqlwatch_config_action], or it can run a report defined in
[action_report_id] but never both.
When it runs a report, what happens next is down to the report.
By default, the
Run Blocked Process Report action runs report -3:
To add to the confusion, reports can also run actions (trigger other things), and in this case, once the report -3 runs, it will call action -1:
And action -1 is just a standard email notification.
So, action_id -8 should not send an email directly but should only call the report. However, from your screenshot, I gather the check_id -2 has two actions, one the standard report (via -8) and -1, which is a direct email so you will end up with two emails.
It’s a bit of a circle that used to be explained in the old docs:
There’s also a process flow explaining the decision-making:
I realise this may be confusing, but the idea behind it was that we could build all sorts of automation on top of it, things like saving reports to disk, calling APIs, interfacing with other tools etc.
Great, thanks! That’s what I was expecting and what I’m seeing, so it makes sense as far as the email count goes.
What doesn’t make sense to me yet is why both actions are being triggered at the same time interval of 8-ish minutes (give or take a few ms), apparently ignoring the action_repeat_period_minutes value in the sqlwatch_config_check_action table.
action_every_failure=1 would make that happen as it is going to trigger every time it sees the blocking.
How often they trigger depends on how often the Job runs.
action_repeat_period_minutes was designed for stale checks. For example, if you have low storage and ignore the first alert, it will remind you in x minutes (days ideally).
You may try setting the
action_every_failure=0. This should email you the first time the blocking happens, and then the reminder for as long as the blocking is happening.
Btw, I am open for suggestions how to improve this. I wrote the code for the actions “engine” in an evening or so with the idea I had in my head at that time. I think it works well but its not very intuitive.
If I think of anything, I will let you know.
I have the jobs at the “out of box” defaults currently:
SQLWATCH_INTERNAL_ACTIONS job runs 15 seconds.
SQLWATCH_INTERNAL_CHECKS job runs every minute.
I adjusted the sqlwatch_config_check_action table to this:
Now they’re queueing up about 10 minutes apart.
I’ll have to dig into it more next week.
Do you still have the INFO messages enabled? What does it spit out there?
Unfortunately, I turned those off. Sorry I didn’t get back sooner. This has taken a back burner to other work. I will get back soon as I have anything new to share though.