I'm writing this post to explain a problem I found with a RAID controller. One morning I was notified that PostgreSQL on one of our development servers was down; I connected and found the service stopped. It was a physical server running Windows Server 2008 R2 and PostgreSQL 9.0, with its disks in a RAID 1 array. It's important to note that Windows Server Backup started at 22:00 and usually takes about two hours to complete, so at the time of the failure it was still running and there was a lot of I/O.
Analyzing the logs
PostgreSQL logs
Then I checked the latest pg_log file:
2012-05-17 23:46:12 PANIC: could not fsync log file 120, segment 174: Bad file descriptor
2012-05-17 23:46:12 STATEMENT: COMMIT
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application’s support team for more information.
2012-05-17 23:46:12 LOG: server process (PID 4488) exited with exit code 3
2012-05-17 23:46:12 LOG: terminating any other active server processes
These three lines were repeated several times:
2012-05-17 23:46:12 WARNING: terminating connection because of crash of another server process
2012-05-17 23:46:12 DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2012-05-17 23:46:12 HINT: In a moment you should be able to reconnect to the database and repeat your command.
2012-05-17 23:46:12 WARNING: terminating connection because of crash of another server process
2012-05-17 23:46:12 DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2012-05-17 23:46:12 HINT: In a moment you should be able to reconnect to the database and repeat your command.
2012-05-17 23:46:12 LOG: all server processes terminated; reinitializing
2012-05-17 23:46:22 FATAL: pre-existing shared memory block is still in use
2012-05-17 23:46:22 HINT: Check if there are any old server processes still running, and terminate them.
I had no idea why this crash had happened. I started the server successfully and it was in a consistent state. This is the log after the start:
2012-05-18 09:31:43 LOG: database system was interrupted; last known up at 2012-05-17 23:45:43 CEST
2012-05-18 09:31:43 LOG: database system was not properly shut down; automatic recovery in progress
2012-05-18 09:31:43 LOG: consistent recovery state reached at 78/AE000078
2012-05-18 09:31:43 LOG: redo starts at 78/AD007298
2012-05-18 09:31:43 LOG: record with zero length at 78/AE003C10
2012-05-18 09:31:43 LOG: redo done at 78/AE003BD0
2012-05-18 09:31:43 LOG: last completed transaction was at log time 2012-05-17 23:46:11.808+02
2012-05-18 09:31:44 LOG: database system is ready to accept connections
2012-05-18 09:31:44 FATAL: the database system is starting up
2012-05-18 09:31:44 LOG: autovacuum launcher started
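The FATAL line near the end is just a client that tried to connect while the server was still starting up; as the HINT in the crash log says, clients only need to retry a moment later. As an illustration (not part of the original incident), a small retry loop is enough. Here it is sketched in Python with psycopg2, with placeholder connection parameters:

import time
import psycopg2  # assumption: the client uses psycopg2; any driver behaves the same way

def connect_with_retry(retries=10, delay=2):
    """Retry while the server reports it is still starting up."""
    for attempt in range(retries):
        try:
            # dbname/user/host are placeholders, not the real server's settings
            return psycopg2.connect(dbname="mydb", user="postgres", host="localhost")
        except psycopg2.OperationalError as err:
            # e.g. "FATAL: the database system is starting up"
            print("attempt %d failed: %s" % (attempt + 1, err))
            time.sleep(delay)
    raise RuntimeError("could not connect after %d attempts" % retries)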
Event Viewer logs
Then I went directly to Event Viewer / Windows Logs / System
and found several errors. The first was:
17/05/2012 23:46:09 LSI_SAS (eventID 11): The driver detected a controller error on \Device\RaidPort0.
It was repeated 22 times within two seconds (23:46:09 – 23:46:10).
Then the backup aborted due to an I/O failure. The error appeared for every volume:
17/05/2012 23:46:11 volsnap (eventID 14): The shadow copies of volume E: were aborted because of an IO failure on volume E:.
And then from 23:46:12 to 23:47:42 there were 33 errors like this:
17/05/2012 23:46:12 ntfs (eventID 57): The system failed to flush data to the transaction log. Corruption may occur.
And among these errors there was the one I was expecting:
17/05/2012 23:46:23 Service Control Manager (eventID 7036): The postgresql-x64-9.0 – PostgreSQL Server 9.0 service entered the stopped state.
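Digging through the Event Viewer GUI works, but pulling out just the relevant sources can also be scripted. A rough sketch in Python with pywin32 (the source names are taken from the events above, and running it on the server itself is assumed):

import win32evtlog  # pywin32

SOURCES = ("LSI_SAS", "volsnap", "Ntfs", "Service Control Manager")

def dump_system_errors(server="localhost"):
    """Print System-log events from the sources involved in this incident."""
    hand = win32evtlog.OpenEventLog(server, "System")
    flags = win32evtlog.EVENTLOG_BACKWARDS_READ | win32evtlog.EVENTLOG_SEQUENTIAL_READ
    try:
        events = win32evtlog.ReadEventLog(hand, flags, 0)
        while events:
            for ev in events:
                if ev.SourceName in SOURCES:
                    # the low 16 bits are the event ID that Event Viewer displays
                    print(ev.TimeGenerated, ev.SourceName, ev.EventID & 0xFFFF)
            events = win32evtlog.ReadEventLog(hand, flags, 0)
    finally:
        win32evtlog.CloseEventLog(hand)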
RAID software
Finally I opened the RAID administration software and saw that one of the two disks of the array was dead.
What happened?
Just putting together the PostgreSQL error and the Event Viewer error, which happened at the very same time:
PostgreSQL error:
2012-05-17 23:46:12 CEST [unknown] 4fb350d7.1188 PANIC: could not fsync log file 120, segment 174: Bad file descriptor
Event Viewer error:
17/05/2012 23:46:12 ntfs (eventID 57): The system failed to flush data to the transaction log. Corruption may occur.
This gives us the explanation: the system couldn't flush PostgreSQL data to the transaction log, so PostgreSQL stopped immediately. It was the only way to maintain the integrity of the database. I suppose the Windows Backup I/O also helped to trigger the problem. For about 1 minute and 30 seconds the RAID controller was frozen and nothing could be flushed to disk. After that, everything ran fine because the RAID 1 still had one working disk. Only PostgreSQL and Windows Backup were affected by the problem; all the other software kept working without any issue. But we have to keep in mind that databases are very sensitive to this kind of failure.
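To illustrate the principle (a simplified sketch, not PostgreSQL's actual code): when a flush of the write-ahead log fails, the process can no longer know what really reached the disk, so the only safe reaction is to stop instead of acknowledging the commit. In Python the same idea looks like this:

import os
import sys

def append_and_flush(path, record):
    """Append a WAL-style record (bytes) and force it to disk."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT)
    try:
        os.write(fd, record)
        os.fsync(fd)  # raises OSError (e.g. EIO, EBADF) if the flush fails
    except OSError as err:
        # the spirit of "PANIC: could not fsync log file ...": better to crash
        # and recover from the WAL than to report a commit that may not be durable
        sys.exit("cannot fsync log file: %s" % err)
    finally:
        os.close(fd)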
I googled for information about this RAID controller and found one post where someone was complaining about it.
Maybe I had a wrong idea of what a RAID is, so I checked the Wikipedia article on RAID:
RAID (redundant array of independent disks, originally redundant array of inexpensive disks) is a storage technology that combines multiple disk drive components into a logical unit. Data is distributed across the drives in one of several ways called “RAID levels”, depending on what level of redundancy and performance (via parallel communication) is required.
[…]
In RAID 1 (mirroring without parity or striping), data is written identically to two drives, thereby producing a “mirrored set”; at least two drives are required to constitute such an array. While more constituent drives may be employed, many implementations deal with a maximum of only two; of course, it might be possible to use such a limited level 1 RAID itself as a constituent of a level 1 RAID, effectively masking the limitation. The array continues to operate as long as at least one drive is functioning. With appropriate operating system support, there can be increased read performance, and only a minimal write performance reduction; implementing RAID 1 with a separate controller for each drive in order to perform simultaneous reads (and writes) is sometimes called multiplexing (or duplexing when there are only two drives).
Conclusion
Reading the Wikipedia sentence "The array continues to operate as long as at least one drive is functioning" twice, I think it is a little ambiguous after what happened to me. I thought the whole idea of a RAID was that the system keeps working seamlessly when a disk fails. But it seems that some RAID implementations are not exactly like this, or at least there can be an interruption before the array continues to operate…
As a DBA I'm in charge of the databases, but I don't control the hardware they run on. And things can always get worse: there are plenty of stories of RAIDs that weren't monitored and disk failures that weren't detected until all the information was lost. That's one more reason why a good backup policy and a good recovery plan are the most important things.
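For example, even a minimal scheduled dump is better than relying on the RAID alone. A sketch using pg_dump from Python (host, user, database name and target directory are placeholders):

import datetime
import os
import subprocess

def dump_database(dbname="mydb", host="localhost", user="postgres",
                  target_dir=r"D:\backups"):
    """Run pg_dump in custom format with a timestamped file name."""
    stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M")
    outfile = os.path.join(target_dir, "%s_%s.dump" % (dbname, stamp))
    subprocess.check_call([
        "pg_dump",          # assumed to be on PATH; authentication via pgpass or trust
        "-h", host,
        "-U", user,
        "-F", "c",          # custom format, restorable with pg_restore
        "-f", outfile,
        dbname,
    ])
    return outfile

A real policy would also copy the dumps to another machine and test restores regularly.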