I recently hit a tricky problem at a client: I had configured a Service Level Objective, but the availability report was showing the application as healthy while the Distributed Application it was based on was in a failure condition. The process of diagnosing the problem touched a number of areas, so I have written this blog post to detail those steps.
Setting the scene
So, I had a System Center Operations Manager 2012 SP1 environment with a distributed application and a service level objective against that distributed application. The distributed application was in a critical state, but the service level report was showing the application as completely available.
After checking the obvious things, such as ensuring that the health state making the distributed application critical was the same state the SLO was monitoring, I decided the best bet was to look at the database and work out why the availability was being reported incorrectly. The first step was to work out what the report was executing.
Step 1 – What is the report doing?
SCOM reports are run from a SQL Server Reporting Services instance, and if you look in the console it will show you the link to that instance.
However, this is the web service interface designed for programmatic access – the human-readable Report Manager interface can normally be found by replacing ReportServer_SCOM with Reports_SCOM in the URL. Once you access this you will see a list of report folders.
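For example (the server name here is illustrative, not from my environment):

```
Web service:    http://myserver/ReportServer_SCOM
Report Manager: http://myserver/Reports_SCOM
```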
It is easiest at this point to switch to the details view and then navigate into the service level reports folder. You then need to look through the reports and find the one that contains the information you are after. A little trial and error shows that this is the ServiceLevelTrackingSummary.Detail report. Choose to edit this in Report Builder.
This opens SQL Server Report Builder. From here you can see the datasets used to calculate the availability, and it is fairly clear that the calculation happens in the SLASummaryData dataset. Looking at the properties shows that the stored procedure used to get the information is Microsoft_SystemCenter_DataWarehouse_ServiceLevel_Report_Library_ServiceLevelAgreementSummaryReportDataGet.
This is all very well, but the parameters page shows that this stored procedure requires some XML to be passed in identifying the SLO you want the data for. Being lazy, I did not want to work out the XML myself, so I decided to use SQL Server Profiler to trace the call the report makes and capture the full parameters.
Step 2 – What SQL procedure does the report execute?
SQL Server Profiler is used to watch what calls are being made to a SQL Server. Now, if you are a DBA then there are lots of things you can adjust and use to filter the trace, but I am not a DBA so I usually just use the defaults, which works for most of the things I do.
To open SQL Profiler, I started SQL Server Management Studio and made a connection to the relevant database – in this case, the Operations Manager data warehouse. I then went to the Tools menu and selected SQL Server Profiler. This asks you to connect to the SQL database and then gives you the trace screen.
On the events tab you can control what is traced, but I was happy with the defaults so I didn’t need to change them. Tracing can generate a lot of data, so I got the report ready to run, clicked start on the trace, ran the report, and then clicked stop again before too much data was captured. Once I had the trace, I opened the search box (Edit > Find) and searched for the name of the stored procedure. This gave me the call the report was using to generate the data.
I could then take this procedure call and run it in a query window in SQL Server Management Studio to look at exactly what data was being returned to the report. This should tell me where the calculation was going wrong.
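The traced call looked something like the sketch below. The parameter names, dates and XML shape here are illustrative placeholders – paste the exact text captured in your own Profiler trace rather than typing this in:

```sql
-- Illustrative only: parameter names, dates and the XML payload are
-- placeholders; use the exact call captured in your Profiler trace.
DECLARE @SlaXml xml =
    N'<Data><Values><ServiceLevelAgreement Guid="..." /></Values></Data>';

EXEC Microsoft_SystemCenter_DataWarehouse_ServiceLevel_Report_Library_ServiceLevelAgreementSummaryReportDataGet
    @StartDate = '2013-01-01',
    @EndDate   = '2013-01-31',
    @ServiceLevelAgreementXml = @SlaXml;
```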
Step 3 – What do the SQL Server stored procedures do?
Executing the stored procedure in SQL Server Management Studio was easy enough, and it showed that the downtime is summarised hourly and returned to the report.
The returned data clearly showed that the downtime calculation earlier in the chain was incorrect. This meant moving to the next level down and opening up the stored procedure to see what it was doing. Now, some of the SQL in here was getting beyond me, but I could see there was a chain:
This last stored procedure took its data from a view, vStateHourlyFull. I then needed to look at the data in this view to see what it contained, so I selected the first 100 rows, and it was clear that I needed to find the ManagedEntityRowId and the MonitorRowId.
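The quick look was nothing more sophisticated than something along these lines (column names as I recall them from the data warehouse):

```sql
-- Peek at the hourly state data to see what columns are available
SELECT TOP 100 *
FROM vStateHourlyFull
ORDER BY DateTime DESC;
```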
Now, I had the ManagedEntityRowId from the original stored procedure, and a quick query against vMonitor showed that the MonitorRowId of 1 is the availability rollup monitor (which is the one we are trying to monitor). Looking at the data showed that the time was all being allocated to the “InWhiteState” column rather than the “InRedState” column.
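The queries were along these lines – the row ID and the monitor system name below are illustrative, from memory, so substitute the values from your own environment:

```sql
-- Confirm which MonitorRowId is the availability rollup monitor
SELECT MonitorRowId, MonitorSystemName
FROM vMonitor
WHERE MonitorSystemName = 'System.Health.AvailabilityState';

-- See which state buckets the hours are being allocated to
SELECT DateTime,
       InRedStateMilliseconds,
       InWhiteStateMilliseconds
FROM vStateHourlyFull
WHERE ManagedEntityRowId = 123   -- illustrative: use the value from the report query
  AND MonitorRowId = 1
ORDER BY DateTime;
```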
It is clear, therefore, that when the hourly data was being calculated, the time was being allocated to the wrong state – but why?
Step 4 – Why is the hourly data wrong?
Unfortunately, I was at the end of the chain I could follow from the report data. However, I could still make progress with some educated guesswork. The hourly state table had to be calculated from the original data, and a quick look down the stored procedure list showed one called “StateAggregate”. Reading this stored procedure, it appears to calculate the availability using another stored procedure called “StateChangeEventList”. This looks at the raw state data, and using the ManagedEntityMonitorRowId from the vStateHourlyFull table I could see that the state changes were correct. Clearly something else was overriding the state.
Looking further down the stored procedure, there was a reference to factoring in an outage of the Health Service. This is needed because an object cannot change state while its health service is not reporting – so the state has to be overridden to indicate that monitoring was not possible. Once I spotted this, it seemed clear what was happening: something was indicating that the monitor needed to be overridden. Modifying the query listed in the stored procedure, I ran the following query, with the following results:
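The query was a variation on the theme below – the table and column names are as I remember the data warehouse schema, so check them against the query in the stored procedure in your own environment:

```sql
-- Look for Health Service outages that are still open (no end time)
SELECT HealthServiceOutageRowId, ManagedEntityRowId,
       StartDateTime, EndDateTime
FROM HealthServiceOutage
WHERE ManagedEntityRowId = 456   -- illustrative: the health service entity's row ID
  AND EndDateTime IS NULL;
```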
The entity row ID came from the initial report query – and it showed that the parent monitor still had an “active” outage, as the EndDateTime was still set to NULL. As you may be able to spot, it would appear that two state messages were written indicating the start of an outage, but only one of them had been updated to indicate that the outage had finished. A quick query of the managed entity view showed that this was the entity representing the “All Servers Management Pool”. As I knew this was healthy again, I could update the health service outage table to set the end date as follows.
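The update was along the lines of this sketch – illustrative only, so confirm the orphaned row ID yourself and take a backup of the data warehouse before editing SCOM tables by hand:

```sql
-- Close the orphaned outage record by stamping an end time.
-- Illustrative: verify the HealthServiceOutageRowId against your own
-- query results, and back up the database before updating it directly.
UPDATE HealthServiceOutage
SET EndDateTime = GETUTCDATE()
WHERE HealthServiceOutageRowId = 789   -- the orphaned outage row
  AND EndDateTime IS NULL;
```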
Once I did this, the state change event list no longer showed the override to unmonitored, the vStateHourly tables started to display the correct information, and the report started to show the correct availability statistics.
While this is an unusual problem, I hope that the steps I have listed here help other people investigate issues they might be experiencing with their SCOM environments, and show that familiarity with SQL and the data structures in the Operations Manager databases can help an administrator sort out problems with the environment.