unfilled January 26, 2012 at 16:31

Help, my database is corrupt. Now what?

Transfer

A corrupted database is probably one of the worst nightmares of most database administrators. The result of damage is downtime, screaming managers and all sorts of other unpleasant things.
In this article I will explain what you cannot do with a damaged database and describe some of what needs to be done, some types of damage and how to fix them.

How to detect that a database is corrupted

Typically, damage is excellently detected when trying to access a damaged page. Requests, backups, or reindexing procedures result in errors with high severity levels.
Here are a couple of examples of system messages when a database corruption is detected:

SQL Server detected a logical consistency-based I / O error: incorrect checksum (expected: 0xfdff74c9; actual: 0xfdff74cb). It occurred during a read of page (1: 69965) in database ID 13 at offset 0x0000002229a000 in file 'D: \ Develop \ Databases \ Broken1.mdf'.

Attempt to fetch logical page 1: 69965 in database 13 failed. It belongs to allocation unit 72057594049069056 not to 281474980642816.

The main problem is that if the database integrity checks are not performed on an ongoing basis, then the damage can be detected after hours, days, and even months, after it has formed, at a time when it will be difficult to fix something .
I will not describe the situation when the database went into the “suspect” state ( “suspicious” in the Russian edition of SQL Server - approx. Translator ). A description of all sorts of reasons why the database can go into “suspect” and many options for fixing this is the topic of a separate article, if not a book.

What to do if the database is still damaged

Do not panic
Do not detach it
Do not restart SQL Server
Do not start recovery right away
Run Integrity Check
Find a reason

Do not panic

The most important thing when detecting database corruption is not to panic. Any decisions made must be carefully weighed, all possible factors must be taken into account. It’s damn easy to worsen the situation by making an undecided decision.

Do not detach database

In most cases, when SQL Server detects database corruption, it means that the database actually has corrupted pages. Attempting to convince SQL Server that this is not the case by detaching and attaching the database, backup and then restoring it, restarting the SQL Server service, or rebooting the server will not cause the error to disappear.
If the database is corrupted and SQL Server detects this when it joins, it will not be able to join it. There are several ways to make it see this database, but it’s much better not to disconnect it.

Do not restart SQL Server

In the same way as when disconnecting and joining, restarting the SQL Server service will not be able to fix the detected errors (if any).
Restarting the service can make the situation worse. If SQL Server detects errors during the database recovery phase after the restart, it will mark it as “suspect”, which will greatly complicate the database recovery process.

Do not start recovery right away

You may be tempted to just run DBCC CHECKDB with one of the “recovery” options (usually data loss) and hope that things get better ( in my experience, the first thing that is recommended in non-core SQL Server forums is to run DBCC CHECKDB REPAIR_ALLOW_DATA_LOSS - comment of a translator ). In many cases, starting such a recovery is not recommended. It does not guarantee the correction of all errors and may lead to unacceptable data loss.
This recovery is the last step in fixing errors. It should only be started if you already have no other choice, but not in the first place.

Run Integrity Check

In order to decide how to fix the database, we definitely need to know what exactly is damaged. The only way we can find out is to run DBCC CHECKDB with the All_ErrorMsgs parameter (in SQL Server 2005 SP3, SQL Server 2008 SP1 and in older versions, this parameter is enabled by default, it is not necessary to specify it). Remember that if you run DBCC CHECKDB without the No_InfoMsgs parameter, the output of this procedure will contain information about the number of rows and pages in each table, which is unlikely to interest you in analyzing errors.
DBCC CHECKDB can take a long time to run on large databases, but you must wait until this procedure is complete. A competent recovery strategy can only be built if there is information about all the problems in the database.

Find a reason

Once the errors are fixed, the work cannot be considered finished. If the cause of these errors is not identified, they may occur again. Usually, the main cause of errors is problems with the I / O subsystem, but they can also be caused by incorrect operation of "low-level software" (like antivirus), human actions, or SQL Server bugs itself.

What's next

Further steps to correct errors completely and completely depend on the results of CheckDB. A little further I will show some of the most frequently occurring errors (note that this article does not claim to be a full description of all kinds of errors).
The described errors are arranged in increasing order of severity - from the least serious to the most serious. In general, for the most serious errors found by CheckDB, there is a description of the available methods for resolving them.
If you suddenly find an error that is not described in the article, pay attention to the last section - “Help Search”.

Invalid page space information

Msg 2508, Level 16, State 3, Line 1
The In-row data RSVD page count for object "Broken1", index ID 0, partition ID 76911687695381, alloc unit ID 76911687695381 (type In-row data) is incorrect. Run DBCC UPDATEUSAGE.

In SQL Server 2000, the number of rows and pages in a table or index stored in metadata might not be true (and even be negative) and DBCC CHECKDB did not see anything wrong with that. In SQL Server 2005, this number must be correct and CheckDB will issue a warning if it suddenly finds a mismatch.
This is a minor problem and is very easy to resolve. As stated in the message, you just need to run DBCC UPDATEUSAGE in the context of the desired database and the warning disappears. This error is common in databases upgraded from SQL Server 2000 and should not appear in databases created in SQL Server 2005/2008.

Msg 8914, Level 16, State 1, Line 1
Incorrect PFS free space information for page (1: 26839) in object ID 181575685, index ID 1, partition ID 293374720802816, alloc unit ID 76911687695381 (type LOB data). Expected value 0_PCT_FULL, actual value 100_PCT_FULL.

This error appears when a PFS page (Page Free Space), which takes into account how full the pages are in the database, contains incorrect values. This error, as mentioned earlier, is not serious. The algorithm used to determine how full the pages were in SQL Server 2000 did not always work out correctly. To solve this problem, you need to run DBCC CHECKDB with the REPAIR_ALLOW_DATA_LOSS parameter and, if this is the only error in the database, no data will actually be affected.

Damage to non-clustered indexes only

If all the errors found by CheckDB relate to indexes with ID = 2 or more, this means that only nonclustered indexes were damaged. Since the information contained in non-clustered indexes is "redundant" ( the same data is stored on the heap, or in the cluster index - approx. Translator ), these damages can be repaired without losing any data.
If all errors found by CheckDB relate to non-clustered indexes, the recommended “recovery level” for DBCC CHECKDB is REPAIR_REBUILD. Examples of such errors (in fact, there are much more errors of this type):

Msg 8941, Level 16, State 1, Line 1
Table error: Object ID 181575685, index ID 4, page (3: 224866). Test (sorted [i] .offset> = PAGEHEADSIZE) failed. Slot 159, offset 0x1 is invalid.

Msg 8942, Level 16, State 1, Line 1
Table error: Object ID 181575685, index ID 4, page (3: 224866). Test (sorted [i] .offset> = max) failed. Slot 0, offset 0x9f overlaps with the prior row.

In this case, the damage can be completely repaired by deleting the damaged non-clustered indexes and re-creating them. Rebuilding an index (ALTER INDEX REBUILD) on-line (and sometimes off-line) reads the pages of the old index to create a new one and, therefore, will fail. Therefore, you must delete the old indexes and recreate them.
This is exactly what DBCC CHECKDB will do with the REPAIR_REBUILD parameter, but the database must be in single-user mode. This is why it is usually best to manually perform these operations so that the database can continue to work while the indexes are recreated.
If you don’t have enough time to recreate the necessary indexes and there is a “clean” (error-free) full backup and backups of the transaction log with an unbroken chain of logs, you can restore damaged pages from them.

LOB page corruption

Msg 8964, Level 16, State 1, Line 1
Table error: Object ID 181575685, index ID 1, partition ID 72057594145669120, alloc unit ID 72057594087800832 (type LOB data). The off-row data node at page (1: 2444050), slot 0, text ID 901891555328 is not referenced.

The error indicates that there are LOB pages (Large OBjects) that no data pages link to. This can happen if the cluster index (or heap) was previously damaged and its damaged pages were deleted.
If CheckDB only talks about such errors, then you can run DBCC CHECKDB with the REPAIR_ALLOW_DATA_LOSS parameter - these pages will be destroyed. Since you still do not have data pages that link to these pages, there will be no more data loss.

Out of Range Errors

Msg 2570, Sev 16, State 3, Line 17
Page (1: 1103587), slot 24 in object ID 34, index ID 1, partition ID 281474978938880, alloc unit ID 281474978938880 (type “In-row data”). Column "modified" value is out of range for data type "datetime". Update column to a legal value.

These errors indicate that the column contains values that are out of range. This can be a datetime value that assumes that more than 1440 minutes have passed since midnight, a Unicode string in which the number of bytes is not divisible by 2, or float / real with an incorrect precision value.
Checking for these errors is not performed by default, for databases upgraded from SQL Server 2000 or earlier, if the DBCC CHECKDB command with the DATA_PURITY parameter enabled has never been run before.
CheckDB will not be able to fix these errors, since it is not known what values to replace incorrect ones. Correction of such errors does not require much effort, but is done manually. Incorrect values should be replaced with something acceptable. The main problem is finding incorrect values.This knowledge base article has step-by-step instructions.

Cluster index or heap corruption

If it is discovered that the pages of the heap or leaf pages of the cluster index are damaged, this means that the data on them is lost. The pages of the leaf level of the clustered index contain directly the data pages and for them redundancy is not provided.
If CheckDB reports damage to the pages of the leaf level of the clustered index, the required “recovery level” for DBCC CHECKDB is REPAIR_ALLOW_DATA_LOSS.
Examples of such errors:

Server: Msg 8976, Level 16, State 1, Line 2
Table error: Object ID 181575685, index ID 1, partition ID 76911687695381, alloc unit ID 76911687695381 (type In-row data). Page (1: 22417) was not seen in the scan although its parent (1: 479) and previous (1: 715544) refer to it.

Server: Msg 8939, Level 16, State 1, Line 2
Table error: Object ID 181575685, index ID 0, page (1: 168576). Test (m_freeData> = PAGEHEADSIZE && m_freeData <= (UINT) PAGESIZE - m_slotCnt * sizeof (Slot)) failed. Values are 44 and 8028.

It should be remembered that if the errors returned by CheckDB relate to index id = 0 or 1, this means that the data is directly damaged.
This type of error is fixed, but the correction is to destroy lines or entire pages. When CheckDB deletes the error correction data, the restrictions imposed by foreign keys are not checked and no triggers fire. Lines or pages are simply deleted. As a result, the data may be inconsistent, or logical integrity may be violated (not a single line may link to LOB pages anymore, or non-clustered index lines may indicate "nowhere"). Due to such consequences, such recovery is not recommended.
If you have a “clean” backup, recovery from it is usually preferable to correct such errors. If the database is in the full recovery model and you have backups of the transaction log with an unbroken chain of logs (starting from the last “clean” full backup), you can backup the active part of the log and restore the entire database (or only damaged pages), in As a result, data will not be lost at all.
If there is no backup with undamaged data, you have only one option - launch DBCC CHECKDB with the REPAIR_ALLOW_DATA_LOSS parameter. This will require putting the database in single-user mode for the entire duration of this procedure.
And although you have no way to avoid data loss, you can see what data will be deleted from the clustered index. For this, check out this post by Paul Renadal.

Metadata corruption

Msg 3853, Level 16, State 1, Line 1
Attribute (object_id = 181575685) of row (object_id = 181575685, column_id = 1) in sys.columns does not have a matching row (object_id = 181575685) in sys.objects.

Similar errors usually occur in databases upgraded from SQL Server 2000 when someone tinkered directly into system tables.
In system tables of any version of SQL Server, foreign keys are not used, so in SQL Server 2000 it was possible to delete a row from sysobjects (for example, a table) and leave in the syscolumns and sysindexes tables rows that refer to the deleted row.
In SQL Server 2000, CheckDB did not check the integrity of the system catalog, and such problems often hung unnoticed. In SQL Server 2005, CheckDB checks the integrity of the system catalog and such errors may occur.
Correction of these errors is not the easiest thing. CheckDB cannot fix them, because the only thing that can be done is to delete entries from the system tables, which, in turn, can cause the loss of a large amount of data. If you have a backup of this database made before the upgrade to SQL Server 2005 and the update was very recent, you can deploy it to SQL Server 2000, manually correct the system tables and transfer the database to SQL Server 2005 again.
If you do not have a backup The database on SQL Server 2000 or the update has passed too long and data loss is unacceptable, there are two ways. The first is to edit the system tables in SQL Server 2005, but keep in mind that this is a rather complicated and risky process, since system tables are not documented and are much more complex than in earlier versions. INYou can find more information in this post .
The second way is to script all database objects and export all the data, after which create a new database, restore the objects and fill in the data. This option is more preferable.

Irreparable damage

CheckDB cannot fix everything. Any errors like the ones below are incorrigible and the only option is to restore the database from a backup in which there is no such damage. If you have a full backup and the log chain is not broken until the current time, you can backup the final fragment of the transaction log and the database can be restored without losing any data.
If there are no such backups, the only thing you can do is to script those objects and upload the data that is still available. It is likely that due to corruption not all data will be accessible, and most likely not all objects will be able to be scripted without errors.

Damage to system tables

Msg 7985, Level 16, State 2, Line 1
System table pre-checks: Object ID 4. Could not read and latch page (1: 358) with latch type SH.
Check statement terminated due to unrepairable error.

Msg 8921, Level 16, State 1, Line 1
Check terminated. A failure was detected while collecting facts. Possibly tempdb out of space or a system table is inconsistent.

CheckDB depends on several critical system tables in order to get an idea of what should be in the database. If these tables themselves are damaged, then CheckDB cannot even guess what should be in the database and with what to compare the current state of affairs, not to mention to fix something.

Damage to distribution cards

Msg 8946, Level 16, State 12, Line 1
Table error: Allocation page (1: 2264640) has invalid PFS_PAGE page header values. Type is 0. Check type, alloc unit ID and page ID on the page.

Msg 8998, Level 16, State 2, Line 1
Page errors on the GAM, SGAM, or PFS pages prevent allocation integrity checks in database ID 13 pages from (1: 2264640) to (1: 2272727)

In this case, one or several pages determining the placement of data in the database ( distribution cards - approx. Translator ) are damaged. These pages are used to determine which pages and extents in the database are used and which are free. CheckDB cannot fix such errors, since it is almost impossible to determine (without these pages) which extents are used to place data and which are not. A simple deletion of such a “distribution map” is not possible, since deletion of any of them will entail the deletion of 4 GB of data.

Help Search

If you are not sure what you need to do, ask for help. If you suddenly receive a message about database corruption that is incomprehensible to you and which is not described above, ask for help. If you are not sure that you have chosen the best recovery method, ask for help.
If you have a Senior DBA, contact him. If you have a “mentor” - ask him. Ask for advice in the forums, but remember that not all of the advice received in the forums is helpful. In fact, it is from there that absolutely wrong and even dangerous decisions are published from time to time.
Contact Microsoft Support, finally. It will not be free, but they really know what can be done with a damaged database and it is likely that if your database is critical for the enterprise, then the cost of downtime during an independent search for a solution will be much higher than the cost of contacting support.

Conclusion

In this article, I gave several examples of what can be done when a damaged database is detected and, more importantly, what should not be done. I hope that now you better understand what methods can be used to solve the problems described and how important it is to have good backups ( and choose the correct recovery model - approx. Translator ).

Note: this is my first translation, which, moreover, was done not at once, but in several approaches, in the evenings, when free time appeared, so the whole text, perhaps, will seem to some to be somewhat inconsistent. If somewhere I was too tongue-tied and some part of the text suddenly turned out to be difficult to understand, I will gladly listen to all the comments.
Regards, unfilled.
PS When I was about to click on the "Publish" button, the mailing list from SQL Server Central with such a comic fell out on my mail .

Tags: