die Seilerwerks

Chronicling Life, Love, Linux and Oracle database administration.

Posts Tagged ‘tuning

A Sub-Prime I/O Primer

without comments

My fans on oracle-l already know that I’ve had a bit of a battle with Oracle I/O recently, most of the damage being self-inflicted. I’d like to give as forensic a review as possible so that those poor souls who come after me will have some hope and inspiration to carry on.

First of all, some definitions, as they relate to Oracle:

Direct I/O: This is I/O done without the use of the OS (or filesystem) buffer cache. Oracle already has its own buffer cache, so the filesystem cache is (usually) makes I/O take much longer than just bypassing it and relying solely on Oracle to handle the data buffer cache, which it should do much better (for it’s own needs) than the filesystem.

Asynchronous I/O: This type of I/O means that I/O commands are sent but other processing can continue before the I/O has finished.

NOTE: These two kinds of I/O, direct and asynchronous, are completely independent of each other. For some reason that I can’t explain, I had it stuck in my Coke-addled mind that direct was the opposite of asynch I/O. As many times as I read over guides and glossaries, it wouldn’t shake loose.

And now, for our feature presentation.

 

Act I

In which our hero becomes bewildered with the possibilities

I had been idling in #oracle when hali mentioned something to someone about direct I/O. Having not dealt with I/O very much in my own travels, I inquired more. It was then that I learned of filesystemio_options and the benefits of setting them to “directio.”

Needless to say, I was sold.

I set about with my trusty system/storage administator on mounting our vxfs partitions with the proper parameters (mincache=direct,convosync=direct) and setting filesystemio_options=directio, which I’m led to believe is redundant. Once the filesystem is mounted with the directio options, Oracle (and anything else) will always use direct I/O.

 

Act II

In which direct meets asynchronous

Fast forward three or four weeks and I’m grasping at straws trying to figure out a problem that causes the production instance to hang. (I’m not going to address these hangs in this post, as I’m not really sure that they have to do with direct I/O.) There seemed to be a perfect storm of changes in the weeks prior to these problems happening

  1. Migrated to a 64-bit server (via datapump exp/imp)
  2. Much larger SGA (from 1.5 Gb to 16 Gb)
  3. Dramatically different datafile layout
  4. Using direct I/O

And that’s just to name the ones I can think of at this moment. I decided (after weeks of tinkering elsewhere) to look at direct I/O. It was during this second look that I realized that one probably should also have asynchronous I/O enabled.

 

Act III

Paradise Lost

And so I set filesystemio_options=setall. There were no problems upon instance restart. However this showed up in that evenings RMAN backup:

Starting Control File and SPFILE Autobackup at 29-NOV-07
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================

RMAN-03009: failure of Control File and SPFILE Autobackup command on ORA_DISK_1 channel at 11/29/2007 02:00:46 ORA-19502: write error on file "/rman/c-3171457975-20071129-00", blockno 321 (blocksize=16384) ORA-27061: waiting for async I/Os failed Linux-x86_64 Error: 14: Bad address Additional information: -1 Additional information: 1048576
I was seeing similar ORA errors about async I/O in my alert log intermittently when redo logs were being archived. Obviously this wasn’t good. Google turned up one reference, which proved to be gold. The VxFS parameter discovered_direct_iosz was set to a default of 256k. We bumped this up to 1024k and the problems disappeared, AS IF BY MAGIC! Greg Rahn also suggested setting max_direct_iosz to 1024k, as well as setting vxio:vol_maxio=2048 to allow 1MB max I/Os.
Greg was also under the impression that VxFS without ODM would not perform asynchronous I/O. He suggested I run an “strace -c” on the LGWR pid, saying that “if you see io_submit and io_getevents in the syscall column, it is using async io on Linux. If you see pwrite64 it is not,” and I was happy to see that we were indeed.
oracle:~/sr $ strace -c -p 12605
Process 12605 attached - interrupt to quit
Process 12605 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
80.66    1.142811         298      3831           io_getevents
7.26    0.102820          51      2004           io_submit 
Act IV

Jumping to Conclusions

As I mentioned before, I’m not convinced at all that the hangings had anything to do with the I/O issues. However I must note that since we enabled asynchronous I/O on top of the direct I/O, we haven’t had any instance hanging. Where we used to get them once every other day, it has been three weeks since the last incident.

To their end, Oracle support had analyzed our system state dumps and noted that our session cached cursors was at 100% utilization, and a lot of the wait events during the hanging had to do with cursors (cursor: pin S wait on X). Per their recommendation, we have raised the session_cached_cursors parameter, but haven’t had a window yet in which to bounce the instance so the parameter can take effect.

If there ever is a point where we figure out exactly what was causing the problem and what the solution was, I’ll be sure to write about it in this space. For now I just wanted to show you what an impulsive, ignorant fool I am.

~ Fin ~

Written by Don Seiler

December 19, 2007 at 2:57 pm

Posted in linux, oracle

Tagged with , , , ,

DBWR and the Devil

without comments

I was recently re-reading parts I, II, III and IV of Kevin Closson’s series on the perils of overconfiguring your db_writer_processes (which I mentioned earlier this year). My instance appeared to be a perfect candidate. It was configured with db_writer_processes=4, using the one-DBWR-for-each-CPU approach that Kevin rails against for most configurations.

After some other reading and investigation, I decided to make the change to only use one DBWR processes. Changes to the db_writer_processes parameter require an instance bounce to take effect, and we had some downtime scheduled anyway.

I’m happy to say that there are no ill effects. I don’t the see DBWn process anywhere on the top(1) radar. I can’t say I haven’t seen any drastic performance improvements, but if I can do the same work at the same level with 1/4 of the resources, I’d call that a win. It’s probably a bigger win for our development/test server that had 3 production clones, each with 4 DBWR processes, for a total of 12 DBWRs. This is now 1 each for 3 total, and (again) no reported performance hit.

Written by Don Seiler

December 1, 2007 at 9:09 am

Posted in oracle

Tagged with , ,

Greg Rahn Explains Plans

without comments

Greg over at Structured Data has a fantastic piece titled “Troubleshooting Bad Execution Plans” which is a must-read for all. Learning about the GATHER_PLAN_STATISTICS hint alone was worth the price of admission.

Also time for me to re-evaluate the changes I made to OPTIMIZER_INDEX_CACHING and OPTIMIZER_INDEX_COST_ADJ a few versions back:

SQL> show parameters optimizer_index

NAME                                 TYPE        VALUE
------------------------------------ ----------- -----
optimizer_index_caching              integer     90
optimizer_index_cost_adj             integer     10

Written by Don Seiler

November 21, 2007 at 5:01 pm

Posted in oracle

Tagged with ,

Tuning Backup & Recovery

with 2 comments

A level 0 database backup of my production instance takes over 8 hours, and duplication is just as long, if not longer. I’m going to do some reading and see if we can shore things up. I’ll report back.  The database is around 1.1 terabytes.  Some have told me that their databases of similar size are backed up in only a couple of hours.

My main concern is the duplication, or restore & recovery.  An 8 hour restore/recovery  basically means the loss of a full business day (depending on when it is needed).

Written by Don Seiler

November 13, 2007 at 4:52 pm

Posted in oracle

Tagged with ,

Debunking the Buffer Cache Hit Ratio

without comments

Written by Don Seiler

September 6, 2007 at 1:14 pm

Posted in Uncategorized

Tagged with ,

DBWR 101

without comments

Kevin Closson has a great series started on the DBWR, and how people tend to misconfigure it (mostly by having too many DBWR processes). Catch Part I and Part II now.

Written by Don Seiler

August 17, 2007 at 12:05 pm

Posted in Uncategorized

Tagged with , ,

Tuning the PGA: The Follow-Up

with 2 comments

I know that all 3 of you have been shivering with anticiPATION at how my PGA tuning experiment went. Today is your lucky day.

I did indeed raise the value of pga_aggregate_target to 512MB. The cache hit ratio rose to near 80%, but the estimated over-allocations was quickly into the tens of thousands again. Looking at V$PGA_TARGET_ADVICE, giving the PGA 1024MB (or 1GB) seemed the next logical choice. That change was made last weekend, and after one week, I’m getting nothing but love:

SQL> SELECT round(PGA_TARGET_FOR_ESTIMATE/1024/1024) target_mb, ESTD_PGA_CACHE_HIT_PERCENTAGE cache_hit_perc, ESTD_OVERALLOC_COUNT
FROM V$PGA_TARGET_ADVICE;

TARGET_MB CACHE_HIT_PERC ESTD_OVERALLOC_COUNT
---------- -------------- --------------------
       128             60                13700
       256             64                12475
       512             82                 5614
       768             91                  143
      1024             95                    0
      1229             96                    0
      1434             96                    0
      1638             96                    0
      1843             96                    0
      2048             96                    0
      3072             96                    0
      4096             96                    0
      6144             96                    0
      8192             96                    0

I’ve put my current level in blue above. I’m satisfied with these numbers at 95% and 0.

Written by Don Seiler

May 21, 2007 at 1:21 pm

Posted in oracle

Tagged with , ,

PGA: Not just for Caddyshack 2 fans

without comments

Ahh … Jackie Mason. The poor man’s Rodney Dangerfield. The pride of Sheboygan, Wisconsin (other than meat products).

But, no, today I’m not talking about golf (nor will I ever, if I can help it). I’m talking about Oracle’s PGA, the Program Global Area. The PGA is a private chunk of memory dedicated to each server process (i.e. dedicated user processes). It is used by the server processes when doing sorting or hash-joins, among others, and it is definitely something you want to pay attention to and properly tune for your database needs.

After doing this reading, I calcuated that my PGA should probably be at least 256MB for sure, probably closer to 512MB. Note that I hadn’t touched or directly set it prior to this. As soon as this output appeared on the screen, I doubled over in self-disgust.

SQL> show parameter pga_aggregate_target

NAME                                 TYPE        VALUE------------------------------------ ----------- ------------------------------pga_aggregate_target                 big integer 24M

I then went to see what the advisor table had to say. Bear in mind that you want CACHE_HIT_PERCENTAGE as near to 100 as possible, and you definitely want ESTD_OVERALLOC_COUNT to be 0.

SQL> SELECT round(PGA_TARGET_FOR_ESTIMATE/1024/1024) target_mb,       ESTD_PGA_CACHE_HIT_PERCENTAGE cache_hit_perc,       ESTD_OVERALLOC_COUNT  FROM V$PGA_TARGET_ADVICE;

 TARGET_MB CACHE_HIT_PERC ESTD_OVERALLOC_COUNT---------- -------------- --------------------        12             52                30444        18             52                30444        24             75                30444        29             75                30444        34             75                30444        38             75                30444        43             75                30443        48             75                30423        72             75                30377        96             75                30309       144             75                29990       192             76                29335

Yeah. Neither goal is even on the charts. So my plan now is to bump it up pga_aggregate_target to 512MB and see how those numbers settle out in a week. I might even want to go higher in the end. Google seems to imply that 24M is indeed the default pga_aggregate_target value, at least it was when I set this up as a new 9iR2 instance and imp’d the data over.

To be continued …

Written by Don Seiler

April 3, 2007 at 11:17 am

Posted in Uncategorized

Tagged with , , , , ,

selecting count(1) vs count(*)

without comments

Count me among the many who naively assumed that using count(1) instead of count(*) was somehow faster or less resource-intensive when counting the number records in a select query. Herod over at Yet Another Oracle DBA provides some more evidence on top of Tom Kyte’s previous smackdown.

Written by Don Seiler

January 5, 2007 at 2:13 pm

Posted in Uncategorized

Tagged with ,

Bind Your Variables or Die

without comments

NOTE: I’m re-pasting this from an internal wiki page I wrote up for our new developers who may not be aware of such a thing as bind variables and how they can greatly improve performance and scalability.

Most know that using bind variables help greatly with performance when dealing with an often-used query, such as a customer lookup. Read Mark Rittman’s article, which does a great job of explaining all there is to be explained in that regard.

A quick summary that Oracle has to re-parse literal queries everytime, generating new query plans every time and storing that processed query in memory. With bind variables, only one such process, parse and memory storage takes place and that information gets re-used when that bind query is called again with different bound values. This saves on the CPU having to re-parse and re-plan the query, and saves on memory by not having to store yet another query plan.

To get an idea of how important DBAs consider the use of bind variables, read Tom Kyte’s thrashing of a user who was unlucky enough to incur his wrath.

Read about Using Bind Variables to Guard against SQL injection. SQL injection exploits are one of the most common attacks on database-driven websites, as attackers would exploit the fact that a SQL query such as user validation is built with literals, and they could turn this query:

SELECT * FROM usersWHERE username='dts'and password='foo'

into

SELECT * FROM usersWHERE username='dts'and password='foo' or 1=1

Using bind variables prevents such exploits from happening due to how the values are substituted.

Written by Don Seiler

November 28, 2006 at 10:14 am

Posted in Uncategorized

Tagged with , , ,