Posts Tagged ‘i/o’
A Sub-Prime I/O Primer
My fans on oracle-l already know that I’ve had a bit of a battle with Oracle I/O recently, most of the damage being self-inflicted. I’d like to give as forensic a review as possible so that those poor souls who come after me will have some hope and inspiration to carry on.
First of all, some definitions, as they relate to Oracle:
Direct I/O: This is I/O done without the use of the OS (or filesystem) buffer cache. Oracle already has its own buffer cache, so the filesystem cache is (usually) makes I/O take much longer than just bypassing it and relying solely on Oracle to handle the data buffer cache, which it should do much better (for it’s own needs) than the filesystem.
Asynchronous I/O: This type of I/O means that I/O commands are sent but other processing can continue before the I/O has finished.
NOTE: These two kinds of I/O, direct and asynchronous, are completely independent of each other. For some reason that I can’t explain, I had it stuck in my Coke-addled mind that direct was the opposite of asynch I/O. As many times as I read over guides and glossaries, it wouldn’t shake loose.
And now, for our feature presentation.
Act I
In which our hero becomes bewildered with the possibilities
I had been idling in #oracle when hali mentioned something to someone about direct I/O. Having not dealt with I/O very much in my own travels, I inquired more. It was then that I learned of filesystemio_options and the benefits of setting them to “directio.”
Needless to say, I was sold.
I set about with my trusty system/storage administator on mounting our vxfs partitions with the proper parameters (mincache=direct,convosync=direct) and setting filesystemio_options=directio, which I’m led to believe is redundant. Once the filesystem is mounted with the directio options, Oracle (and anything else) will always use direct I/O.
Act II
In which direct meets asynchronous
Fast forward three or four weeks and I’m grasping at straws trying to figure out a problem that causes the production instance to hang. (I’m not going to address these hangs in this post, as I’m not really sure that they have to do with direct I/O.) There seemed to be a perfect storm of changes in the weeks prior to these problems happening
- Migrated to a 64-bit server (via datapump exp/imp)
- Much larger SGA (from 1.5 Gb to 16 Gb)
- Dramatically different datafile layout
- Using direct I/O
And that’s just to name the ones I can think of at this moment. I decided (after weeks of tinkering elsewhere) to look at direct I/O. It was during this second look that I realized that one probably should also have asynchronous I/O enabled.
Act III
Paradise Lost
And so I set filesystemio_options=setall. There were no problems upon instance restart. However this showed up in that evenings RMAN backup:
Starting Control File and SPFILE Autobackup at 29-NOV-07 RMAN-00571: =========================================================== RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS =============== RMAN-00571: ===========================================================RMAN-03009: failure of Control File and SPFILE Autobackup command on ORA_DISK_1 channel at 11/29/2007 02:00:46 ORA-19502: write error on file "/rman/c-3171457975-20071129-00", blockno 321 (blocksize=16384) ORA-27061: waiting for async I/Os failed Linux-x86_64 Error: 14: Bad address Additional information: -1 Additional information: 1048576
oracle:~/sr $ strace -c -p 12605 Process 12605 attached - interrupt to quit Process 12605 detached % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 80.66 1.142811 298 3831 io_getevents 7.26 0.102820 51 2004 io_submit
Jumping to Conclusions
As I mentioned before, I’m not convinced at all that the hangings had anything to do with the I/O issues. However I must note that since we enabled asynchronous I/O on top of the direct I/O, we haven’t had any instance hanging. Where we used to get them once every other day, it has been three weeks since the last incident.
To their end, Oracle support had analyzed our system state dumps and noted that our session cached cursors was at 100% utilization, and a lot of the wait events during the hanging had to do with cursors (cursor: pin S wait on X). Per their recommendation, we have raised the session_cached_cursors parameter, but haven’t had a window yet in which to bounce the instance so the parameter can take effect.
If there ever is a point where we figure out exactly what was causing the problem and what the solution was, I’ll be sure to write about it in this space. For now I just wanted to show you what an impulsive, ignorant fool I am.
~ Fin ~
Christo on Aligning ASM Disks on Linux
As I sit here on a Sunday morning waiting for some disk issues to be resolved in my datacenter, I took the time to catch up on some pages I had marked for later reading. One was a post by Christo Kutrovsky from the Pythian Group, about aligning ASM disks on linux, to avoid misalignments due to odd default offsets in linux that might cause I/O for relatively small amounts of data to be inefficiently split across disks.
Also now I can’t help but look at “Christo Kutrovsky” and think of Krusty the Clown, a.k.a. Herschel Krustofski.
Monitor RMAN I/O
Yesterday in #oracle, hali pointed out the existance of the v$rman_backup_job_details view, and has since written a nice example post about using it to monitor the I/O done by every RMAN job you’ve done.



