Rajendra Gupta
Automatic Page Repair in SQL Server Always On Availability Groups

Automatic Page Repair in SQL Server Always On Availability Groups

August 5, 2020 by

This is the 9th article in the series of SQL Server Always On Availability Groups.

Introduction

SQL Server Always On is a disaster and high availability feature. In the previous articles, we explored the following features.

  • AG failover in case of primary replica goes down
  • Manual failover
  • Database level health detection that initiates a failover in case any AG database changes its state from online and SQL Server not able to write in the transaction log

In the above points, we safeguard SQL Server for the instance and DB failover. Suppose your AG replica is healthy, but one of your AG databases becomes corrupt. You get a logical consistency error if you execute the DBCC CHECKDB command. You might have an excellent copy of that database page in another participating replica. Let’s find the answer to a few questions in this article.

  • How does SQL Server Always On help us to recover the corrupt page in this scenario?
  • What happens if a page corrupts in the AG database on the primary replica?

Prerequisites

You can follow the series articles and prepare a two-node SQL Server Always On Availability Group, as shown below.

  • Current Primary replica: SQLNode1\INST1
  • Current Secondary replica: SQLNode2\INST1
  • AG Mode: Synchronous commit

My Demo AG setup

Automatic Page Repair in SQL Server Always On Availability Groups

SQL Server supports automatic page repair for the databases participating in AG configuration. If SQL detects specific corruption errors (Error id 823,824 & 829) for an AG database, SQL Server tries to restore the page from the corresponding replica(primary or secondary).

It works for the following consistency errors.

  • Error 823: We get this error if the operating system performs a cyclic redundancy check that failed on the data
  • Error 824: Logical consistency errors such as bad page checksum, torn page detection
  • Error 829: SQL Server raises error 829 if a page is marked as restore pending

Resolve a page corruption issues on the Primary replica in SQL Server Always On Availability Groups

Let us consider that my AG database is in the synchronized state and communicating with the secondary replica. Now, you get the IO error in an AG database on the primary replica. The below diagram shows the high-level steps for the automatic page repair if any page corrupts at the primary replica.

Automatic Page Repair in SQL Server Always On Availability Groups

  • It inserts a row in the suspect_pages table in the MSDB system database. This table logs the corruption information such as the database id, file id, page id, error id
  • It broadcasts a request to all secondary replica
  • It might get a response from the multiple secondary replicas. It gets the page from the replica that responds first. This request also specifies the page ID and LSN that is currently at the end of the log
  • SQL Server marks the specified page as restore pending. If any user tries to access the page status restore pending, he gets error 829
  • Once the secondary replica receives a page request, it waits until it has redone the log up to the LSN primary replica specified in the request
  • Secondary replica access the required page and sends it to the primary replica. If the secondary replica could not access the page, it returns an error to the primary replica. It causes the failure of automatic page repair
  • If the automatic page repair is successful, it marks the status of the page in the suspect_pages table as restored (event_type 5)
  • SQL Server tries to resolve the deferred transactions after the successful page repair

Let’s simulate the automatic page repair in the SQL Server Always On Availability Group.

  • Note: For this article, we corrupt a page in the primary replica intentionally. DO NOT EVER perform these steps in a production database. It is for educational purposes only.

Create a [Employee] table in the [MyNewDB] database and insert sample records in it.

Use the DBCC IND command to check the database pages in the [MyNewDB] database.

DBCC IND command

Verify if the AG dashboard is healthy.

Verify AG dashboard

Stop SQL Services on both Primary and Secondary AG replicas using SQL Server Configuration Manager. You should stop SQL Services on secondary replica before the primary replica.

Stop SQL Services

Stop SQL Services on secondary

We use a Hex Editor in this article to open the database file. You can download it from the URL.

Launch Hex Editor and open the MyNewDB.MDF file. You might get the permission error. Make sure you permit your account on the database file.

Hex editor

It opens the [MyNewDB.MDF] content, along with the decoded text.

Open MDF file with Hex Editor

Now, search for an employee name that we entered initially. Let’s search for it for the employee name Roy.

Modify MDF file content

Modify the numbers on the left-hand side and save the MDF file changes.

Save changes

Start the SQL Services for the primary replica and secondary replica.

If we try to select the records from the [Employee] table, it gives you logical consistency error, as shown below. It reports an incorrect checksum for the page 1:336 in the [MyNewDB] database:

logical consistency error

As highlighted earlier, SQL Server puts an entry in the suspect_pages table in the MSDB database. You can see an entry for the page_id 336. Here, we see event_type 5 that shows the page is restored from the secondary replica:

Suspect pages

We can also validate it using the DMV sys.dm_hadr_auto_page_repair:

DMV output

In the DMV output, we get the following useful columns.

  • Page_id: It is the page id that was restored from the secondary replica
  • Error_type:
    • 1 = 824 error
    • 2 = Bad checksum
    • 3 = Torn Page
    • -1 = All hardware 823 errors
  • Page_status: you might get the following page status in this column
    • 2 = It is queued for a request from the secondary replica
    • 3 = request is sent to the partner
    • 4 = Page is successfully repaired
    • 5 = Page could not be repaired. Automatic page repair will try to repair the page again

In our case, we get event_id 5 and page_status 4 that shows page corruption was due to torn page, but automatic database repair restored it successfully. You can query the [employee] table again. You can notice here that we still have employee Roy details because SQL Server replaced the modified (corrupted) page in the primary replica from the secondary replica.

Verify table records

Resolve a page corruption issues on the secondary replica in SQL Server Always On Availability Groups

Previously, we intentionally corrupted a page in the primary replica. Let’s perform the same demonstration in the secondary replica.

Page corruption simulation

Start the SQL Services and try to access the [Employee] table in the secondary replica database. You might get the following error message.

Corruption message

By default, the secondary replica is not allowed for a connection. You can open the availability group properties from the primary replica and modify the value for Readable Secondary to Yes.

Modify replica property

Now, go back to the secondary replica and access the employee table. It reports a logical consistency I/O error. It has an issue to read the page 1:336.

Logical errors

Wait for some time and check the entry in the suspect_pages table of the MSDB database in the secondary replica. It has an entry for page 1:336 with event_type 5.

You can query the DMV sys.dm_hadr_auto_page_repair and verify that the page is repaired successfully (page_status 4).

check repair status

You can read the page in the secondary replica successfully.

Table records

Your AG dashboard is healthy as well.

AG dashboard

The overall process of automatic page repair if page corrupts in a secondary replica are as below.

  • SQL Server Always On Availability group enters into the suspended state for the secondary replica
  • It enters the corrupt page information in the suspect_page table of the secondary replica
  • Secondary replica puts a request to get a page copy from the primary replica
  • If the primary replica can access the page, it sends the page to the secondary replica
  • Secondary replica repairs the page, and AG is in the synchronized state again

At a high-level, we can summarize the automatic page repair if page corrupts at the secondary replica as shown below:

Overall automatic page repair process

Restrictions of automatic page repair in SQL Server Always On Availability Groups

SQL Server cannot recover or restore the following pages as part of automatic page repair functionality.

  • A file header page
  • A database boot page
  • Global Allocation Map (GAM) Pages
  • Shared Allocation Map (SGAM) pages
  • Page Free Space (PFS) pages

Conclusion

In this article, we explored that SQL Server Always On Availability Groups give us additional flexibility of automatic page repair in case of any corruption issues. It is an asynchronous process and automatically fixes the corruption for you.

Table of contents

A comprehensive guide to SQL Server Always On Availability Groups on Windows Server 2016
Configure Domain Controller and Active Directory for SQL Server Always On Availability Groups
Configure failover clusters, storage controllers and quorum configurations for SQL Server Always On Availability Groups
Install SQL Server 2019 on Windows Server 2016 with SQL Server Always On Availability Groups
Add a new node into existing SQL Server Always On Availability Groups
Configure Managed Service Accounts for SQL Server Always On Availability Groups
Add or remove a node from SQL Server Always On Availability Groups using T-SQL scripts
Database-level health detection in SQL Server Always On Availability Groups
Automatic Page Repair in SQL Server Always On Availability Groups
Deploy a domain-independent Windows Failover Cluster for SQL Server Always On Availability Groups
Configure a SQL Server Always On Availability Group on the domain-independent Failover Cluster
An overview of distributed SQL Server Always On Availability Groups
Deploy a distributed SQL Server Always On Availability Group
Monitor and failover a Distributed SQL Server Always On Availability Group
Transparent Data Encryption for SQL Server Always On Availability Groups
Configure SQL Server replication for a database in SQL Server Always On Availability Groups
Configuring SQL Server replication for distribution databases in SQL Server Always On Availability Groups
Explore Cross-database MSDTC for distributed transactions in SQL Server Always On Availability Groups
Deploy MSDTC for distributed transactions in SQL Server Always On Availability Groups
Restore an existing availability group database participating in SQL Server Always On Availability Groups
Exploring AG dashboards for monitoring SQL Server Always On Availability Groups
Backup compression in TDE enabled databases in SQL Server Always On Availability Groups
Impact of dropping a login in the active directory tied to SQL Server Always On Availability Groups
Refresh SQL Server Always On Availability Group databases using DBATools PowerShell
Configure SQL Server Reporting Services databases in SQL Server Always On Availability Groups
Suspend and Resume Data Movement in SQL Server Always On Availability Groups
Explore failover types in SQL Server Always On Availability Groups
Explore SQL Server PolyBase External Tables in SQL Server Always On Availability Groups
SQL Server Always On Availability Groups for SQL Server Linux instances
Column-level SQL Server encryption with SQL Server Always On Availability Groups
Make the most of secondary replicas in SQL Server Always On Availability Groups
Apply SQL Server patches or cumulative updates in SQL Server Always On Availability Groups
Monitor SQL Server Always On Availability groups using extended events
The Hub and Spoke model of policy-based management for SQL Server Always On Availability Groups
Custom policies for AG dashboards of SQL Server Always On Availability Groups
Explore dynamic management views for monitoring SQL Server Always On Availability Groups
Dynamic management views for monitoring availability replicas and databases for SQL Server Always On Availability
Configure SQL Server Always On Availability Groups using Windows PowerShell scripts
Configure Integration Services Catalog Database SSISDB in SQL Server Always On Availability Groups
Synchronize logins between Availability replicas in SQL Server Always On Availability Group
Session timeouts in SQL Server Always On Availability Groups
Lease Timeouts and Health Checks in SQL Server Always On Availability Groups
Rajendra Gupta
Latest posts by Rajendra Gupta (see all)
168 Views