Understanding Oracle Instance Recovery

To understand instance recovery, you need to understand the functioning of certain background processes.


Checkpoint (CKPT) Process

Every three seconds (or more frequently), the CKPT process stores data in the control file to document the modified data blocks that DBWhas written from the SGA to disk. This is called a “checkpoint.” The purpose of a checkpoint is to identify that place in the online redo log file where instance recovery is to begin (which is called the “checkpoint position”).
In the event of a log switch, the CKPT process also writes this checkpoint information to the headers of data files.

Checkpoints exist for the following reasons:
•To ensure that modified data blocks in memory are written to the disk regularly so that data is not lost in case of a system or database failure
•To reduce the time required for instance recovery (Only the online redo log file entries following the last checkpoint need to be processed for recovery.)
•To ensure that all committed data has been written to data files during shutdown

The checkpoint information written by the CKPT process includes checkpoint position, system change number (SCN), location in the online redo log file to begin recovery, information about logs, and so on.
Note: The CKPT process does not write data blocks to the disk or redo blocks to the online redo log files.

Background Processes and Recovery: Redo Log Files and LogWriter
Redo log files record changes to the database as a result of transactions and internal Oracle server actions. (A transaction is a logical unit of work consisting of one or more SQL statements run by a user.) Redo log files protect the database from the loss of integrity because of system failures caused by power outages, disk failures, and so on. Redo log files must be multiplexed to ensure that the information stored in them is not lost in the event of a disk failure.

The redo log consists of groups of redo log files. A group consists of a redo log file and its multiplexed copies. Each identical copy is said to be a member of that group, and each group is identified by a number. The LogWriter (LGWR) process writes redo records from the redo log buffer to all members of a redo log group until the files are filled or a log switch 
operation is requested.

It then switches and writes to the files in the next group. Redo log groups are used in a circular fashion.
Best practice: If possible, multiplexed redo log files should reside on different disks.

The Archiver (ARCn) Process
ARCn is an optional background process. However, it is crucial to the recovery of a database after the loss of a disk. When an online redo log group gets filled, the Oracle instance begins writing to the next online redo log group. The process of switching from one online redo log group to another is called a log switch. The ARCn process initiates archiving of the filled log group at every log switch. It automatically archives the online redo log group before the log group can be reused so that all the changes made to the database are preserved. This enables recovery of the database to the point of failure even if a disk drive is damaged.

One of the important decisions that a DBA must make is whether to configure the database to operate in ARCHIVELOG mode or in NOARCHIVELOG mode.
          In NOARCHIVELOG mode, the online redo log files are overwritten each time a log switch occurs.
          In ARCHIVELOG mode, inactive groups of filled online redo log files must be archived before they can be used again.
Note: ARCHIVELOG mode is essential for most backup strategies (and is very easy to configure).

Instance Recovery
The Oracle database automatically recovers from instance failure. All that the DBA needs to do is start the instance normally. The instance mounts the control files and then attempts to open the data files. When it discovers that the data files have not been synchronized during shutdown, the instance uses information contained in the redo log groups to roll the data files forward to the time of shutdown. Then the database is opened and (because the undo tablespace is also rolled forward) any uncommitted transactions are rolled back.

Phases of Instance Recovery
For an instance to open a data file, the system change number (SCN) contained in the data file’s header must match the current SCN that is stored in the database’s control files.

If the numbers do not match, the instance applies redo data from the online redo logs, sequentially “redoing” transactions until the data files are up-to-date. After all data files have been synchronized with the control files, the database is opened and users can log in.

When redo logs are applied, all transactions are applied to bring the database up to the state as of the time of failure. This usually includes transactions that are in progress but have not yet been committed. After the database has been opened, those uncommitted transactions are rolled back.
At the end of the rollback phase of instance recovery, the data files contain only committed data.