Posts Tagged ‘tuning’
A Sub-Prime I/O Primer
My fans on oracle-l already know that I’ve had a bit of a battle with Oracle I/O recently, most of the damage being self-inflicted. I’d like to give as forensic a review as possible so that those poor souls who come after me will have some hope and inspiration to carry on.
First of all, some definitions, as they relate to Oracle:
Direct I/O: This is I/O done without the use of the OS (or filesystem) buffer cache. Oracle already has its own buffer cache, so the filesystem cache is (usually) makes I/O take much longer than just bypassing it and relying solely on Oracle to handle the data buffer cache, which it should do much better (for it’s own needs) than the filesystem.
Asynchronous I/O: This type of I/O means that I/O commands are sent but other processing can continue before the I/O has finished.
NOTE: These two kinds of I/O, direct and asynchronous, are completely independent of each other. For some reason that I can’t explain, I had it stuck in my Coke-addled mind that direct was the opposite of asynch I/O. As many times as I read over guides and glossaries, it wouldn’t shake loose.
And now, for our feature presentation.
Act I
In which our hero becomes bewildered with the possibilities
I had been idling in #oracle when hali mentioned something to someone about direct I/O. Having not dealt with I/O very much in my own travels, I inquired more. It was then that I learned of filesystemio_options and the benefits of setting them to “directio.”
Needless to say, I was sold.
I set about with my trusty system/storage administator on mounting our vxfs partitions with the proper parameters (mincache=direct,convosync=direct) and setting filesystemio_options=directio, which I’m led to believe is redundant. Once the filesystem is mounted with the directio options, Oracle (and anything else) will always use direct I/O.
Act II
In which direct meets asynchronous
Fast forward three or four weeks and I’m grasping at straws trying to figure out a problem that causes the production instance to hang. (I’m not going to address these hangs in this post, as I’m not really sure that they have to do with direct I/O.) There seemed to be a perfect storm of changes in the weeks prior to these problems happening
- Migrated to a 64-bit server (via datapump exp/imp)
- Much larger SGA (from 1.5 Gb to 16 Gb)
- Dramatically different datafile layout
- Using direct I/O
And that’s just to name the ones I can think of at this moment. I decided (after weeks of tinkering elsewhere) to look at direct I/O. It was during this second look that I realized that one probably should also have asynchronous I/O enabled.
Act III
Paradise Lost
And so I set filesystemio_options=setall. There were no problems upon instance restart. However this showed up in that evenings RMAN backup:
Starting Control File and SPFILE Autobackup at 29-NOV-07 RMAN-00571: =========================================================== RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS =============== RMAN-00571: ===========================================================RMAN-03009: failure of Control File and SPFILE Autobackup command on ORA_DISK_1 channel at 11/29/2007 02:00:46 ORA-19502: write error on file "/rman/c-3171457975-20071129-00", blockno 321 (blocksize=16384) ORA-27061: waiting for async I/Os failed Linux-x86_64 Error: 14: Bad address Additional information: -1 Additional information: 1048576
oracle:~/sr $ strace -c -p 12605 Process 12605 attached - interrupt to quit Process 12605 detached % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 80.66 1.142811 298 3831 io_getevents 7.26 0.102820 51 2004 io_submit
Jumping to Conclusions
As I mentioned before, I’m not convinced at all that the hangings had anything to do with the I/O issues. However I must note that since we enabled asynchronous I/O on top of the direct I/O, we haven’t had any instance hanging. Where we used to get them once every other day, it has been three weeks since the last incident.
To their end, Oracle support had analyzed our system state dumps and noted that our session cached cursors was at 100% utilization, and a lot of the wait events during the hanging had to do with cursors (cursor: pin S wait on X). Per their recommendation, we have raised the session_cached_cursors parameter, but haven’t had a window yet in which to bounce the instance so the parameter can take effect.
If there ever is a point where we figure out exactly what was causing the problem and what the solution was, I’ll be sure to write about it in this space. For now I just wanted to show you what an impulsive, ignorant fool I am.
~ Fin ~
DBWR and the Devil
I was recently re-reading parts I, II, III and IV of Kevin Closson’s series on the perils of overconfiguring your db_writer_processes (which I mentioned earlier this year). My instance appeared to be a perfect candidate. It was configured with db_writer_processes=4, using the one-DBWR-for-each-CPU approach that Kevin rails against for most configurations.
After some other reading and investigation, I decided to make the change to only use one DBWR processes. Changes to the db_writer_processes parameter require an instance bounce to take effect, and we had some downtime scheduled anyway.
I’m happy to say that there are no ill effects. I don’t the see DBWn process anywhere on the top(1) radar. I can’t say I haven’t seen any drastic performance improvements, but if I can do the same work at the same level with 1/4 of the resources, I’d call that a win. It’s probably a bigger win for our development/test server that had 3 production clones, each with 4 DBWR processes, for a total of 12 DBWRs. This is now 1 each for 3 total, and (again) no reported performance hit.
Greg Rahn Explains Plans
Greg over at Structured Data has a fantastic piece titled “Troubleshooting Bad Execution Plans” which is a must-read for all. Learning about the GATHER_PLAN_STATISTICS hint alone was worth the price of admission.
Also time for me to re-evaluate the changes I made to OPTIMIZER_INDEX_CACHING and OPTIMIZER_INDEX_COST_ADJ a few versions back:
SQL> show parameters optimizer_index NAME TYPE VALUE ------------------------------------ ----------- ----- optimizer_index_caching integer 90 optimizer_index_cost_adj integer 10
Tuning Backup & Recovery
A level 0 database backup of my production instance takes over 8 hours, and duplication is just as long, if not longer. I’m going to do some reading and see if we can shore things up. I’ll report back. The database is around 1.1 terabytes. Some have told me that their databases of similar size are backed up in only a couple of hours.
My main concern is the duplication, or restore & recovery. An 8 hour restore/recovery basically means the loss of a full business day (depending on when it is needed).
Debunking the Buffer Cache Hit Ratio
DBWR 101
Tuning the PGA: The Follow-Up
I know that all 3 of you have been shivering with anticiPATION at how my PGA tuning experiment went. Today is your lucky day.
I did indeed raise the value of pga_aggregate_target to 512MB. The cache hit ratio rose to near 80%, but the estimated over-allocations was quickly into the tens of thousands again. Looking at V$PGA_TARGET_ADVICE, giving the PGA 1024MB (or 1GB) seemed the next logical choice. That change was made last weekend, and after one week, I’m getting nothing but love:
SQL> SELECT round(PGA_TARGET_FOR_ESTIMATE/1024/1024) target_mb, ESTD_PGA_CACHE_HIT_PERCENTAGE cache_hit_perc, ESTD_OVERALLOC_COUNT
FROM V$PGA_TARGET_ADVICE;
TARGET_MB CACHE_HIT_PERC ESTD_OVERALLOC_COUNT
---------- -------------- --------------------
128 60 13700
256 64 12475
512 82 5614
768 91 143
1024 95 0
1229 96 0
1434 96 0
1638 96 0
1843 96 0
2048 96 0
3072 96 0
4096 96 0
6144 96 0
8192 96 0
I’ve put my current level in blue above. I’m satisfied with these numbers at 95% and 0.
PGA: Not just for Caddyshack 2 fans
Ahh … Jackie Mason. The poor man’s Rodney Dangerfield. The pride of Sheboygan, Wisconsin (other than meat products).
But, no, today I’m not talking about golf (nor will I ever, if I can help it). I’m talking about Oracle’s PGA, the Program Global Area. The PGA is a private chunk of memory dedicated to each server process (i.e. dedicated user processes). It is used by the server processes when doing sorting or hash-joins, among others, and it is definitely something you want to pay attention to and properly tune for your database needs.
After doing this reading, I calcuated that my PGA should probably be at least 256MB for sure, probably closer to 512MB. Note that I hadn’t touched or directly set it prior to this. As soon as this output appeared on the screen, I doubled over in self-disgust.
SQL> show parameter pga_aggregate_target NAME TYPE VALUE------------------------------------ ----------- ------------------------------pga_aggregate_target big integer 24M
I then went to see what the advisor table had to say. Bear in mind that you want CACHE_HIT_PERCENTAGE as near to 100 as possible, and you definitely want ESTD_OVERALLOC_COUNT to be 0.
SQL> SELECT round(PGA_TARGET_FOR_ESTIMATE/1024/1024) target_mb, ESTD_PGA_CACHE_HIT_PERCENTAGE cache_hit_perc, ESTD_OVERALLOC_COUNT FROM V$PGA_TARGET_ADVICE; TARGET_MB CACHE_HIT_PERC ESTD_OVERALLOC_COUNT---------- -------------- -------------------- 12 52 30444 18 52 30444 24 75 30444 29 75 30444 34 75 30444 38 75 30444 43 75 30443 48 75 30423 72 75 30377 96 75 30309 144 75 29990 192 76 29335
Yeah. Neither goal is even on the charts. So my plan now is to bump it up pga_aggregate_target to 512MB and see how those numbers settle out in a week. I might even want to go higher in the end. Google seems to imply that 24M is indeed the default pga_aggregate_target value, at least it was when I set this up as a new 9iR2 instance and imp’d the data over.
To be continued …
selecting count(1) vs count(*)
Count me among the many who naively assumed that using count(1) instead of count(*) was somehow faster or less resource-intensive when counting the number records in a select query. Herod over at Yet Another Oracle DBA provides some more evidence on top of Tom Kyte’s previous smackdown.
Bind Your Variables or Die
NOTE: I’m re-pasting this from an internal wiki page I wrote up for our new developers who may not be aware of such a thing as bind variables and how they can greatly improve performance and scalability.
Most know that using bind variables help greatly with performance when dealing with an often-used query, such as a customer lookup. Read Mark Rittman’s article, which does a great job of explaining all there is to be explained in that regard.
A quick summary that Oracle has to re-parse literal queries everytime, generating new query plans every time and storing that processed query in memory. With bind variables, only one such process, parse and memory storage takes place and that information gets re-used when that bind query is called again with different bound values. This saves on the CPU having to re-parse and re-plan the query, and saves on memory by not having to store yet another query plan.
To get an idea of how important DBAs consider the use of bind variables, read Tom Kyte’s thrashing of a user who was unlucky enough to incur his wrath.
Read about Using Bind Variables to Guard against SQL injection. SQL injection exploits are one of the most common attacks on database-driven websites, as attackers would exploit the fact that a SQL query such as user validation is built with literals, and they could turn this query:
SELECT * FROM usersWHERE username='dts'and password='foo'
into
SELECT * FROM usersWHERE username='dts'and password='foo' or 1=1
Using bind variables prevents such exploits from happening due to how the values are substituted.



