Unsound and Incomplete: Locking, catalogue tables and index rebuilds

This isn't meant to be a blog devoted to SQL Server, but since I manage a relatively complex SQL Server application I come across a lot of interesting things relating to the product and building applications with it. This particular example came up in production but as usual I have created a simpler example based on AdventureWorks 2012.

To give some context, the real application from which this is example derived needs to load quite a lot of data as quickly as possible every night. The strategy for that is to partition switch in the data, and to speed up the load into the stage table we disable all the indexes prior to inserting the rows. (Then we re-enable the indexes before partition switching - when partition switching the source table must match the destination exactly). Not only does this ensure that the data is loaded very quickly, but it also leaves no fragmentation.

This approach was used in a number of place therefore a generic stored proc was written which queries sys.indexes to determine the set of indexes either to be rebuilt or disabled:

CREATE PROC DYNAMIC_REBUILD
(
@TABLE_NAME varchar(255),
@ACTION varchar(50)
)
AS
BEGIN
DECLARE @cmd varchar(8000)
SELECT @cmd = COALESCE(@cmd, '') +
' ALTER INDEX [' + i.name +
'] ON ' + @TABLE_NAME + ' ' +
@ACTION + ';' + CHAR(13) + Char(10)
FROM
sys.indexes i
INNER JOIN
sys.objects o
ON i.object_id = o.object_id
WHERE
OBJECT_SCHEMA_NAME(o.object_id) + '.' + OBJECT_NAME(o.object_id) =
@TABLE_NAME
AND o.type_desc = 'USER_TABLE'
AND i.index_id > 1
exec(@cmd)
END

The issue I am going to describe in detail is contained in that proc, but it wasn't obvious to me or the developer who had written the proc.

During the course of the overnight batch process, the proc was called by multiple times, often in parallel (but targeting different tables of course) because of parallel data loads. The first issue was that we experienced deadlocks nearly every day. The deadlock was on sysschobjs, which was initially surprising, although the proc was called within a transaction. That is easily resolved by adding a nolock hint, but in hindsight this should have given a big indication of the underlying problem with the proc.

The more interesting issue is that we found that executions blocked each other. To see this for yourself, create the proc above on AdventureWorks (making sure you add the nolock hint against both sys.indexes and sys.objects). Then in two separate sessions execute:

BEGIN TRAN
exec DYNAMIC_REBUILD @TABLE_NAME='Sales.SalesOrderDetail',
@ACTION='REBUILD'

and

BEGIN TRAN

exec DYNAMIC_REBUILD @TABLE_NAME='Sales.Customer',

@ACTION='REBUILD'

You should see the second session blocks. Let's look at what it is blocking on by executing this query:

SELECT blockee.wait_type, blockee.wait_resource, *

FROM sys.dm_exec_sessions blocker

INNER JOIN sys.dm_exec_requests blockee

ON blocker.session_id = blockee.blocking_session_id

This should return something like this:

wait_type wait_resource

LCK_M_SCH_S OBJECT: 7:1154103152:0

So, the second session is waiting for a schema stability lock (Sch-S) but on which table?

SELECT OBJECT_NAME(1154103152)

returns: SalesOrderDetail

How odd - the proc passed the argument to rebuild the Sales.Customer table is waiting for a Sch-S lock on the table whose indexes were rebuilt by the first proc. Just for completeness, let's check the locks held by the first session:

SELECT request_mode, request_status

FROM sys.dm_tran_locks

WHERE request_session_id=51

AND resource_associated_entity_id = 1154103152

returns:

request_mode request_status

Sch-M GRANT

No surprise really since you would expect a schema modification lock to be held by an index rebuild, and Sch-S is incompatible with Sch-M so that causes blocking. The question is why the query attempts to acquire a Sch-S lock? I did not know when I faced this, and reading the MSDN documentation on locks does not give any clues (at least not to me).

In order to figure out what was going I used extended events. (There are several good online resources to help if you aren't familiar with using them, but I would particularly recommend John Huang's site and Jonathan Keyahias's blog from which I learned a lot). My objective was to understand which locks and in what order were being acquired and released by the query in the proc. I created a session as follows:

CREATE EVENT SESSION understand_locking

ON SERVER

ADD EVENT sqlserver.lock_acquired

(

ACTION (sqlserver.session_id)

ADD EVENT sqlserver.lock_released

(

ACTION (sqlserver.session_id)

)

ADD TARGET package0.asynchronous_file_target

(

SET filename = 'c:\dev\xelog.xel', metadatafile = 'c:\dev\xelog.xem'

)

This just adds the SPID to the set of standard fields that are recorded. You can see the standard fields by executing this query:

SELECT *

FROM sys.dm_xe_object_columns

WHERE object_name = 'lock_acquired'

AND column_type = 'data'

After starting the session (you can do this interactively in SSMS), and executing the procedures you will have a file of event data that you need to process. To do that, you have to do some XML processing. You can use SSMS to examine the results in order to get a feel for the structure (I could not find an xsd) - the following query is a reasonable template for how to process the data:

DECLARE @Temp table (ID int identity(1,1), EventData xml);

INSERT INTO @Temp(EventData)

SELECT cast(event_data as xml) event_data

FROM

sys.fn_xe_file_target_read_file ('c:\dev\xelog*', 'c:\dev\xelog.xem', null, null) a;

SELECT OBJECT_NAME(ObjectID), * FROM

(

select ID,

cast(EventName as varchar(20)) as EventName,

cast(SessionID as varchar(20)) as SessionID,

cast(resource_type as varchar(10)) as ResourceType,

cast(object_id as varchar(16)) as ObjectID,

cast(mode as varchar(10)) Mode,

cast(owner_type as varchar(12)) OwnerType,

cast(database_id as int) DBID,

cast(resource_0 as varchar(16)) Res0,

cast(resource_1 as varchar(16)) Res1,

cast(resource_2 as varchar(16)) Res2,

cast(lockspace_workspace_id as varchar(20)) lockspace_workspace_id,

cast(lockspace_sub_id as varchar(10)) lockspace_sub_id,

cast(lockspace_nest_id as varchar(10)) lockspace_nest_id,

cast(transaction_id as int) TransID,

EventTime

FROM

(

SELECT

a.ID,

xml.value('../@name', 'varchar(128)') EventName,

xml.value('../@timestamp', 'datetime') EventTime,

xml.value('(../action/value)[1]', 'varchar(20)') SessionID,

xml.value('@name','varchar(128)') [Col],

CASE

WHEN isnull(rtrim(xml.value('(./text)[1]', 'varchar(128)')), '') = '' then xml.value('(.)[1]', 'varchar(128)')

ELSE xml.value('(./text)[1]', 'varchar(128)')

END [Value]

FROM

@Temp a

CROSS APPLY

a.EventData.nodes('//data') ed(xml)

) x

PIVOT

(

MAX([Value]) FOR Col IN

( resource_type, object_id, mode, owner_type, transaction_id, database_id,

lockspace_workspace_id, lockspace_sub_id, lockspace_nest_id, resource_0, resource_1,

resource_2)

) pvt

) y

WHERE

(Mode='SCH_M' OR Mode='SCH_S')

This might look a bit daunting but all it does it extract the values from the XML, then pivot it so that you get the data in a single row per event rather than multiple rows. The result looks like this (I have removed columns to make it readable on this blog):

ObjectName EventName SessionID

CountryRegionCurrency lock_acquired 55

CountryRegionCurrency lock_released 55

FK_Product_UnitMeasure_WeightUnitMeasureCode lock_acquired 55

Product lock_acquired 55

FK_Product_UnitMeasure_WeightUnitMeasureCode lock_released 55

Product lock_released 55

From the events, it is clear that it is acquiring and releasing Sch-S locks for a large number of tables in the database. At this point it is time to look at the query plan for the query in the proc that builds the SQL.

The highlighted section shows that there is an index scan on sysschobjs. I haven't read it in any official documentation, but I can only assume that when reading from the catalogue tables that, in addition to shared locks, schema stability locks are acquired too. Reviewing the query, the following bit provokes the scan:

WHERE OBJECT_SCHEMA_NAME(o.object_id) + '.' + OBJECT_NAME(o.object_id) = @TABLE_NAME

And it can easily be rewritten as:

WHERE o.object_id = OBJECT_ID(@TABLE_NAME)

Reviewing the modified query plan, we now see index seeks:

If we rerun the procs in two sessions they don't block each other, and analysis of the extended events shows that Sch-S locks are not acquired on all the tables.

In the end, this was a fairly sraightforward solution - optimising a poor query. However the thing I learned was that you need to pay attention to queries that access the catalogue tables because they can acquire Sch-S locks and lead to contention in some circumstances.

Unsound and Incomplete

Pages

Sunday, 30 June 2013

Locking, catalogue tables and index rebuilds

No comments:

Post a Comment