HOWTO: Oracle Cross-Platform Migration with Minimal Downtime
Originally posted at The Pythian Group blog.
I recently performed a migration from Oracle 10gR2 on Solaris to the same version on Linux, immediately followed by an upgrade to 11g. Both platforms were x86-64. Migrating to Linux also included migrating to ASM, whereas we had been using ZFS to hold the datafiles on Solaris. Restoring files into ASM meant we would have to use RMAN (which we would probably choose to use anyway).
As with many databases, the client wanted minimal downtime. It was obvious to us that the most time-consuming operation would be the restore and recovery into the new instance, since we were essentially doing a restore and recovery from production backups and archived redo logs. It quickly dawned on me that we could start this operation well before the scheduled cutover, chopping at least six hours from the downtime window. The client would only need to keep the new instance in mount mode after the initial restore/recovery finished, periodically re-catalog the source instance’s FRA (which was mounted via NFS), and then re-run the recover database command in RMAN. When the time comes to cut over, simply run ALTER SYSTEM ARCHIVE LOG CURRENT on the original instance and then SHUTDOWN IMMEDIATE. Catalog and recover one last time on the new instance to apply the final logs, then open it with the RESETLOGS option, and voila! Migration complete!
I’ll try to recreate a simple example here.
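For orientation, the sequence described above might look roughly like the outline below. This is only a sketch under assumptions, not the exact procedure from this migration: the RMAN steps are shown as comments because they run in RMAN rather than SQL*Plus, and /mnt/source_fra is a hypothetical NFS mount point for the source FRA.

```sql
-- On the NEW (Linux/ASM) instance, kept in MOUNT mode. Repeat until cutover.
-- These two steps run in RMAN, not SQL*Plus:
--   RMAN> CATALOG START WITH '/mnt/source_fra' NOPROMPT;   -- hypothetical path
--   RMAN> RECOVER DATABASE;
-- RECOVER DATABASE typically stops with an error once it runs out of
-- archived logs to apply; that is expected between passes.

-- At cutover, on the ORIGINAL (Solaris) instance, from SQL*Plus:
ALTER SYSTEM ARCHIVE LOG CURRENT;
SHUTDOWN IMMEDIATE

-- On the NEW instance, after one final CATALOG / RECOVER DATABASE pass:
ALTER DATABASE OPEN RESETLOGS;
```

The point of the loop is that each pass only has to apply the archived logs generated since the previous pass, so the final pass inside the downtime window is short.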
Moving Oracle Datafiles to a ZFS Filesystem with the Correct Recordsize
Originally posted on The Pythian Group blog.
Full credit for this tale should go to my colleague Christo Kutrovsky for the inspiration and basic plan involved.
We recently migrated a large database from Solaris SPARC to Solaris x86-64. All seemed to go well with the migration, but in the next few weeks, we noticed some I/O issues cropping up. Some research led us to find that the ZFS filesystem used to hold the datafiles was killing us on I/O. The default “recordsize” setting for ZFS was 128k.
$ /usr/sbin/zfs get recordsize zfs-data
NAME      PROPERTY    VALUE  SOURCE
zfs-data  recordsize  128K   default
An Oracle database typically uses 8k for the block size, but in this case it was 16k. We saw basically the same thing that Neelakanth Nadgir described in his blog post, Databases and ZFS:
With ZFS, not only was the throughput much lower, but we used more [than] twice the amount of CPU per transaction, and we are doing 2x times the IO. The disks are also more heavily utilized. We noticed that we were not only reading in more data, but we were also doing more IO operations [than] what is needed.
The fix is to set the ZFS recordsize for a datafile filesystem to match the Oracle instance’s db_block_size. We also read in the ZFS Best Practices Guide that redo logs should be in a separate filesystem with the default ZFS recordsize of 128k. We already had them separate, so we just needed to get our datafiles on a ZFS filesystem with a 16k recordsize.
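As a rough illustration of what such a move can involve, here is a minimal sketch. It is not necessarily the plan we followed: the dataset name, the paths and the USERS tablespace are all hypothetical, and the host-side steps appear as comments because they are shell commands rather than SQL. The copy matters because a changed recordsize only applies to files written after the property is set.

```sql
-- Confirm the block size that the datafile filesystem's recordsize should match.
SELECT value AS db_block_size
  FROM v$parameter
 WHERE name = 'db_block_size';

-- On the host (shell, not SQL), a filesystem with a matching recordsize
-- could be created like this; the dataset name is hypothetical:
--   zfs create -o recordsize=16k zfs-data/oradata16k

-- Move one tablespace's datafile onto the new filesystem.
ALTER TABLESPACE users OFFLINE NORMAL;

-- Host copy while the tablespace is offline (hypothetical paths):
--   cp /zfs-data/old/users01.dbf /zfs-data/oradata16k/users01.dbf

ALTER DATABASE RENAME FILE '/zfs-data/old/users01.dbf'
                        TO '/zfs-data/oradata16k/users01.dbf';
ALTER TABLESPACE users ONLINE;
```

This offline approach means a short outage per tablespace, and it cannot be used as-is for SYSTEM, which cannot be taken offline while the database is open.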
Setting up Network ACLs in Oracle 11g… For Dummies
Originally posted on The Pythian Group blog.
Having recently performed a test upgrade for a client from Oracle RDBMS 10g to 11g, I can tell you that one of the big changes that will likely require action on your part as DBA is the new fine-grained access control for the packages UTL_SMTP, UTL_TCP, UTL_MAIL, UTL_HTTP and UTL_INADDR. Part of the Oracle 11g pre-upgrade tool will notify you of users that will require new privileges.
Of course, Oracle’s post-upgrade network ACL setup documentation is much more confusing than it needs to be, at least for simple minds like me. A colleague stepped forward with a simple set of commands for a basic setup that even the tired and stressed can understand.
I’ll share that here, with some basic explanation:
BEGIN
  -- Create the new ACL, naming it "netacl.xml", with a description.
  -- Also, provide one starter privilege, granting user FOO
  -- the privilege to connect.
  DBMS_NETWORK_ACL_ADMIN.CREATE_ACL('netacl.xml',
    'Allow usage to the UTL network packages', 'FOO', TRUE, 'connect');

  -- Now grant privilege to resolve DNS names for FOO,
  -- and then grant connect and resolve to user BAR
  DBMS_NETWORK_ACL_ADMIN.ADD_PRIVILEGE('netacl.xml', 'FOO', TRUE, 'resolve');
  DBMS_NETWORK_ACL_ADMIN.ADD_PRIVILEGE('netacl.xml', 'BAR', TRUE, 'connect');
  DBMS_NETWORK_ACL_ADMIN.ADD_PRIVILEGE('netacl.xml', 'BAR', TRUE, 'resolve');

  -- Specify which hosts this ACL applies to,
  -- for simplicity, we're saying all (*)
  -- You might want to specify certain hosts to lock this down.
  DBMS_NETWORK_ACL_ADMIN.ASSIGN_ACL('netacl.xml', '*');
END;
/
As you can see, this example will let the FOO and BAR database users connect and resolve to any host. The ASSIGN_ACL section in the full package documentation (see link below) details how this can be used to lock down a user’s ability to make outside connections.
Of course, nothing beats reading the Oracle 11g DBMS_NETWORK_ACL_ADMIN documentation, where you can see some examples of stricter ACL setups.
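To give a flavor of a stricter setup, here is a minimal sketch that could be used instead of the wildcard assignment. The ACL name smtp_only.xml, the host mailrelay.example.com and the port are hypothetical; the procedure parameters and the two dictionary views are the ones documented for 11g.

```sql
BEGIN
  -- Hypothetical ACL that lets FOO connect to a single mail relay only.
  DBMS_NETWORK_ACL_ADMIN.CREATE_ACL(
    acl         => 'smtp_only.xml',
    description => 'Allow FOO to reach the mail relay only',
    principal   => 'FOO',
    is_grant    => TRUE,
    privilege   => 'connect');

  -- Assign it to one host and one port instead of '*'.
  DBMS_NETWORK_ACL_ADMIN.ASSIGN_ACL(
    acl        => 'smtp_only.xml',
    host       => 'mailrelay.example.com',
    lower_port => 25,
    upper_port => 25);
END;
/
COMMIT;

-- Review which ACLs are assigned to which hosts, and who holds what.
SELECT host, lower_port, upper_port, acl
  FROM dba_network_acls;

SELECT acl, principal, privilege, is_grant
  FROM dba_network_acl_privileges;
```

The two queries at the end are handy after any ACL change, including the basic example above, to confirm that what you meant to grant is what is actually in place.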
Turn Off db_cache_advice To Avoid Latch Contention Bugs
Originally posted on The Pythian Group blog.
A couple of weeks ago, we noticed some timeouts in some of our standard Oracle RDBMS health check scripts on a new instance. I had just migrated this instance to bigger, better, badder hardware and so it had been given more SGA to use, namely a bigger buffer cache. The software version was still Oracle 10.2.0.2, as we wanted to introduce as few variables as possible (we were already moving to a new platform with an endian change).
At first the timeouts were infrequent, but over the course of a week they grew in frequency to the point where none of the checks were finishing in the allowed timeframe. We ran an AWR report, and tucked far down in the “Latch Activity” section, a colleague noticed this:
                                           Pct    Avg   Wait                 Pct
                                    Get    Get   Slps   Time       NoWait NoWait
Latch Name                     Requests   Miss  /Miss    (s)     Requests   Miss
------------------------ -------------- ------ ------ ------ ------------ ------
...
simulator lru latch          10,032,617    3.3    0.7  44950      336,837    0.3
...
Latch Activity DB/Inst: FOO/foo Snaps: 156-157
-> "Get Requests", "Pct Get Miss" and "Avg Slps/Miss" are statistics for
   willing-to-wait latch get requests
-> "NoWait Requests", "Pct NoWait Miss" are for no-wait latch get requests
-> "Pct Misses" for both should be very close to 0.0
                                           Pct    Avg   Wait                 Pct
                                    Get    Get   Slps   Time       NoWait NoWait
Latch Name                     Requests   Miss  /Miss    (s)     Requests   Miss
------------------------ -------------- ------ ------ ------ ------------ ------
transaction branch alloc        112,412    0.0    0.0      0            0    N/A
undo global data                466,321    0.0    0.0      0            0    N/A
user lock                         7,440    0.8    0.4      1            0    N/A
-------------------------------------------------------------
The “simulator lru latch” entry brought us to MetaLink note 5918642.8 and bug 5918642, which affects 10g releases prior to 10.2.0.4 and 11g releases prior to 11.1.0.7. The bug is in the database buffer cache advisor, controlled by the parameter db_cache_advice, which defaults to ON when statistics_level is TYPICAL or ALL. The note simply states:
High simulator lru latch contention can occur when db_cache_advice is set to ON if there is a large buffer cache.
We simply set db_cache_advice to OFF (thankfully it is a dynamic parameter), and pretty quickly our checks were running just fine.
My suggestion is to simply turn this off unless you are actively using the cache advisor to tune an instance. Once you are done tuning, and are no longer using the advisor, turn it off.
NOTE: As Mladen Gogola pointed out in the comments, turning this off will cause problems if you are using automatic memory management (i.e. sga_target > 0). Re-pasting his post here:
The problem with that advice is that it will prevent automatic memory management from resizing the buffer cache and the instance will end up with a huge, mostly empty, shared pool and default buffer cache. Automatic memory management is biased toward shared pool even with the cache advice turned on, without it, buffer cache will be reduced to the minimum size, usually only 64MB. If you disable cache advice, I would also recommend disabling the automatic memory management and configuring SGA manually.
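For anyone who wants to check their own instance before touching anything, a minimal sketch of the relevant queries and the change itself follows. It uses only standard views and parameters; whether turning the advisor off is right for you depends on the sga_target caveat above.

```sql
-- Current settings. db_cache_advice defaults to ON when statistics_level
-- is TYPICAL or ALL, and a non-zero sga_target means automatic SGA
-- management is in play.
SELECT name, value
  FROM v$parameter
 WHERE name IN ('db_cache_advice', 'statistics_level', 'sga_target');

-- Cumulative figures for the latch in question since instance startup.
SELECT name, gets, misses, sleeps
  FROM v$latch
 WHERE name = 'simulator lru latch';

-- The parameter is dynamic, so no restart is needed.
ALTER SYSTEM SET db_cache_advice = OFF;
```

Running the v$latch query a few minutes apart gives a quick feel for whether misses and sleeps are still climbing after the change.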
Sending Timezone-Aware Email with UTL_SMTP
Originally posted at The Pythian Group blog.
I’m back again with another entry in what I hope will be a long-running “Quick Tips for Newbies” series.
At The Pythian Group, we have employees all over the globe, from our headquarters in Ottawa to regional offices in Boston, Prague, India and Sydney, and a few scattered remote workers in Seattle, Paris, Kiev, Brazil, South Africa and Wisconsin, among other places. In other words, we are spread across multiple timezones, and since it wasn’t too long ago that everyone was in Ottawa, this is something that still presents little quirks.
One such quirk was that email generated by one of our internal Oracle instances (via a stored procedure that used UTL_SMTP to send the messages) did not have timezone information in the “Date” email header. As a result, messages would be stamped with the hour in the Eastern timezone (Ottawa time), but the mail clients would think that hour was local. Depending on where you are relative to Ottawa, this could be many hours in the past or future. Of course, this wouldn’t be noticed if you were in Ottawa or even Boston, both in Eastern. For the rest, it was at the very least an annoyance, but one that is easily fixed.
Looking at the PL/SQL stored procedure that we used to generate email messages, I saw that the “Date” header was being built with this code:
date_hdr := 'Date: '||to_char(sysdate,'dd Mon yy hh24:mi:ss');
The fix is almost trivial—just use SYSTIMESTAMP instead of SYSDATE, and include the timezone in the TO_CHAR function:
date_hdr := 'Date: '||to_char(systimestamp,'dd Mon yy hh24:mi:ss tzhtzm');
Voila! Emails now had a full Date header. And there was much rejoicing from around the world.
Here’s a quick query to highlight the difference:
SQL> select to_char(sysdate,'dd Mon yy hh24:mi:ss') from dual;

TO_CHAR(SYSDATE,'D
------------------
01 Dec 08 18:50:02

SQL> select to_char(systimestamp,'dd Mon yy hh24:mi:ss tzhtzm') from dual;

TO_CHAR(SYSTIMESTAMP,'DD
------------------------
01 Dec 08 18:50:10 -0500
```
Even if you aren’t sending email to all the ends of the Earth, it won’t hurt to make your messages timezone-aware. I’m sure it will save some confusion and frustration down the line.
Note: I discovered the fix via this blog post, which seems to be invite-only at the time of this writing.
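If you want the header to look like the full form that the mail standards (RFC 5322) describe, with a day name and a four-digit year, a variant like the one below may be worth considering. This is a sketch rather than what our procedure actually used; the NLS setting is there on the assumption that you want English day and month abbreviations regardless of the session language.

```sql
-- Builds a header of the form: Date: Mon, 01 Dec 2008 18:50:10 -0500
SELECT 'Date: ' || TO_CHAR(SYSTIMESTAMP,
                           'Dy, DD Mon YYYY HH24:MI:SS TZHTZM',
                           'NLS_DATE_LANGUAGE=ENGLISH') AS date_hdr
  FROM dual;
```

The same TO_CHAR expression can be dropped into the date_hdr assignment shown earlier in place of the shorter format string.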
ORA-16069? You May Need A New Standby Controlfile
Originally posted on The Pythian Group blog.
On a recent Monday, I had to perform an emergency Oracle standby switchover for a client whose primary instance host had mysteriously rebooted itself over the previous day. Confidence in that host was, understandably, shaken.
The Oracle Data Guard configuration is a 3-instance setup using Data Guard Broker: one primary, we’ll call it OraA, feeding two standby instances, OraB and OraC. In this particular configuration, we perform switchovers between OraA and OraC. Caught in the middle is OraB, which is on a 60-minute standby delay.
After this particular switchover, OraB started complaining with this message in the alert log:
ORA-16069: Archive Log standby database activation identifier mismatch
We had seen this occasionally in prior switchovers, and the problem would fix itself once the standby delay passed and OraB processed the log notifying it of the switchover. This time, however, recovery was stopped and more than enough time had elapsed. OraA and OraC were performing perfectly fine.
Much of what I found while searching suggested that the standby instance would have to be completely rebuilt. Not an appetizing option. A search of MetaLink turned up Bug 4048687, which seemed to demonstrate a similar problem, although on a different OS/platform. That solution was to recreate the standby controlfile. Trust me, it sounds more drastic than it is!
Here’s how to do it in just 6 easy steps!
- Shut down the misbehaving standby.
- Copy one of the current standby controlfiles for safekeeping (just in case).
- On the primary instance, create a new standby controlfile:
alter database create standby controlfile as '/tmp/stdby.ctl';
- Transfer that new standby controlfile to the standby host.
- Copy the new controlfile to the controlfile location(s) used by the instance (you have more than one, right?).
- STARTUP MOUNT the standby instance. If you use the Data Guard Broker, it should automatically begin recovery for you; otherwise restart managed recovery with
alter database recover managed standby database disconnect;
Voila. Standby recovery should resume nicely, assuming logs are there to apply.
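To confirm that the standby really did pick up the new controlfile and that recovery is moving again, a couple of quick checks along these lines can help. This is a minimal sketch using standard views; run it on the standby after the STARTUP MOUNT.

```sql
-- The role should be PHYSICAL STANDBY and the controlfile type STANDBY.
SELECT database_role, controlfile_type, open_mode
  FROM v$database;

-- Is the managed recovery process running, and which log is it on?
SELECT process, status, thread#, sequence#
  FROM v$managed_standby
 WHERE process LIKE 'MRP%';
```

If the second query returns no rows, managed recovery is not running and needs to be restarted as described in the last step.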
RMAN Redundancy is not a Viable Retention Policy
Originally posted on The Pythian Group blog.
The story you are about to read is based on actual events. Names and paths have been changed to protect the innocent. I call this scenario “The Perfect Storm” because it took just the right combination of events and configurations. Sadly, this doesn’t make it an unlikely occurrence, so I’m posting it here in hopes that you’ll be able to save yourselves before it’s too late.
I have always had a preternatural dislike for using REDUNDANCY as a retention policy for Oracle RMAN, greatly preferring RECOVERY WINDOW instead, simply because REDUNDANCY doesn’t really guarantee anything valuable to me, whereas RECOVERY WINDOW guarantees that I’ll be able to do a point-in-time recovery to any time within the past x days. Plus, I had already been burned once by a different client using REDUNDANCY. With the story I’m about to tell, this dislike has turned into violent hatred. I’m going to be light on the technical details, but I hope you’ll still feel the full pain.
First some table setting:
- Standalone 10.2.0.2 instance (no RAC, no DataGuard/Standby)
- RMAN retention policy set to REDUNDANCY 2
- Backups stored in the Flash Recovery Area (FRA)
A few months ago, we had a datafile corruption on this relatively new instance (data had been migrated from an old server about a week prior). The on-call DBA followed up the page by checking for corruptions in the datafile with this command:
RMAN> backup check logical datafile '/path/to/foobar_data.dbf';
This, my friends, led to the major fall, though we did not know it for many hours. You see, the FRA was already almost full, and when the FRA runs low on space, Oracle automatically deletes obsolete files to free some up. That last backup command, while only intended to check for logical corruption, did actually perform a backup of the file, which rendered the earliest backup of the file obsolete since there were now two newer copies. That earliest backup happened to be part of the level 0 backup from which we would later want to restore.
Of course, at first we didn’t know why the file was missing. Logs showed that it was on disk no less than two hours before the problem started. Later, scanning the alert log for the missing backup filename yielded this:
Deleted Oracle managed file /path/to/flash_recovery_area/FOO_DB/backupset/2008_12_01/o1_xxxx.bkp
Oracle deleted the one backup file that we needed!
Even worse, it wasn’t until this time on a Monday night that we realized that the level 0 taken the previous weekend had failed to push the backup files to tape because of a failure on the NetBackup server. The problem was reported as part of Monday morning’s routine log checks, but the missing files had not yet been pushed to tape.
In the end, we were able to drop and restore the tablespace to a previous point in time on a test instance from another backup file and exp/imp data back over. It was ugly, but it got things back online. Many DBAs better than myself gave their all on this mission.
To summarize, the ingredients:
- Oracle RMAN
- CONFIGURE RETENTION POLICY TO REDUNDANCY 2;
- Flash Recovery Area near full, obediently deleting obsolete files.
- Tape backup failure
Add in an innocent backup command and . . . BOOM! Failure Surprise.
The two biggest points to take away are:
- Tape backup failures are still serious backup failures and should be treated as such, even if you back up to disk first.
- REDUNDANCY is not a viable retention policy. In my house, it is configuration non grata.
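For those wondering how close they are to the same storm, here is a minimal sketch of what to look at. The queries use standard views; the RMAN commands are shown as comments because they run in RMAN rather than SQL*Plus, and the seven-day window is an arbitrary example, not a recommendation.

```sql
-- How full is the FRA, and how much of it does Oracle consider reclaimable
-- (that is, eligible for automatic deletion under space pressure)?
SELECT name,
       ROUND(space_limit / 1024 / 1024)       AS limit_mb,
       ROUND(space_used / 1024 / 1024)        AS used_mb,
       ROUND(space_reclaimable / 1024 / 1024) AS reclaimable_mb
  FROM v$recovery_file_dest;

-- The persistent retention policy, if a non-default one is configured.
SELECT name, value
  FROM v$rman_configuration
 WHERE name = 'RETENTION POLICY';

-- In RMAN (not SQL):
--   CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 7 DAYS;
--
-- To check a datafile for corruption WITHOUT creating a new backup:
--   BACKUP VALIDATE CHECK LOGICAL DATAFILE '/path/to/foobar_data.dbf';
```

The VALIDATE keyword is the important bit in that last command: it reads and checks the blocks but produces no backup piece, so it does not push any existing backup over the REDUNDANCY threshold.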
GNU basename in PL/SQL
Reposted from The Pythian Group blog.
In the process of scripting a database migration, I was in need of something akin to the GNU basename utility that I know and love on Linux. basename is most famous for taking a full file path string and stripping away the leading path component, returning just the name of the file. This can be emulated in PL/SQL with calls to SUBSTR and INSTR, like this:
substr(dirname,instr(dirname,'/',-1)+1)
(Thanks to Ian Cary, who shared this logic on oracle-l)
As you can see, this simply finds the last occurrence of /, which is our directory separator on *nix and Solaris operating systems. On Windows, it would be \. It then returns a substring beginning one character after that last separator until the end of the string. Voila, a basic basename routine!
Upon reading the basename man page again, I found that basename also takes an optional parameter, a suffix string. If this suffix string is provided, basename will also truncate that string from the end. For example:
$ basename /home/seiler/bookmarks.html
bookmarks.html
$ basename /home/seiler/bookmarks.html .html
bookmarks
I decided that this would be handy to have, and set out to create a compatible basename function in PL/SQL. Here is what I came up with:
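CREATE OR REPLACE FUNCTION basename (v_full_path IN VARCHAR2,
                                     v_suffix    IN VARCHAR2 DEFAULT NULL,
                                     v_separator IN CHAR DEFAULT '/')
RETURN VARCHAR2
IS
  v_basename   VARCHAR2(256);
  v_suffix_pos PLS_INTEGER;
BEGIN
  -- Keep everything after the last separator.
  v_basename := SUBSTR(v_full_path, INSTR(v_full_path, v_separator, -1) + 1);
  IF v_suffix IS NOT NULL THEN
    v_suffix_pos := INSTR(v_basename, v_suffix, -1);
    -- Like GNU basename, strip the suffix only when it ends the name and is
    -- not the whole name; otherwise a missing suffix would return NULL.
    IF v_suffix_pos > 1
       AND v_suffix_pos = LENGTH(v_basename) - LENGTH(v_suffix) + 1 THEN
      v_basename := SUBSTR(v_basename, 1, v_suffix_pos - 1);
    END IF;
  END IF;
  RETURN v_basename;
END;
/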
I’ve also added an optional third parameter to specify a directory separator other than the default. It would probably be rarely useful, but not hard to remove if you don’t like it. As you can see, I’ve used similar SUBSTR/INSTR logic to identify the suffix index and prune it out.
Here it is in action:
SQL> COLUMN file_name FORMAT a45;
SQL> COLUMN basename FORMAT a15;
SQL> COLUMN no_suffix FORMAT a12;
SQL> SELECT file_name
  2       , basename(file_name) as basename
  3       , basename(file_name, '.dbf') as no_suffix
  4    FROM dba_data_files;

FILE_NAME                                     BASENAME        NO_SUFFIX
--------------------------------------------- --------------- ------------
/u01/app/oracle/oradata/orcl/users01.dbf      users01.dbf     users01
/u01/app/oracle/oradata/orcl/sysaux01.dbf     sysaux01.dbf    sysaux01
/u01/app/oracle/oradata/orcl/undotbs01.dbf    undotbs01.dbf   undotbs01
/u01/app/oracle/oradata/orcl/system01.dbf     system01.dbf    system01
/u01/app/oracle/oradata/orcl/example01.dbf    example01.dbf   example01
I hope this makes your work just a little bit easier, as it has mine.
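If you cannot create a function in the database you are working in, something similar can be approximated in plain SQL with Oracle's regular expression functions (available from 10g). To be clear, this is a different technique from the SUBSTR/INSTR function above, offered only as a sketch; the path is the same example used earlier.

```sql
-- '[^/]+$' matches the run of non-slash characters at the end of the path.
-- REGEXP_REPLACE with no replacement string removes whatever it matches,
-- and '\.html$' only matches the suffix when it is at the very end.
SELECT REGEXP_SUBSTR('/home/seiler/bookmarks.html', '[^/]+$') AS basename,
       REGEXP_REPLACE(REGEXP_SUBSTR('/home/seiler/bookmarks.html', '[^/]+$'),
                      '\.html$') AS no_suffix
  FROM dual;
```

This returns bookmarks.html and bookmarks. One difference to be aware of: a path that ends in a separator yields NULL here, whereas GNU basename would return the last directory name.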
Does Oracle’s Block Change Tracking File Shrink?
Reposted from The Pythian Group blog.
Just a quick post to get myself back into blogging mode. Recently in IRC (#oracle on freenode, to be precise), a fresh face asked if the Block Change Tracking file ever shrinks. She had been worrying about the file in her instance continuing to grow. A number of us speculated (non-BAAG!) that perhaps taking an RMAN backup would somehow purge the file of what it was keeping track of, and then the magical Oracle fairies would promptly resize it for us. Needless to say, I was hesitant to take this theory forward with Alex Gorbachev aware of my home address.
After setting up Oracle 10.2.0.1 on a nice VirtualBox image (more on that in another post) running CentOS 5, I began to do some reading. For some reason, actually reading the official tahiti docs was last on my list. A search of the 10gR2 docs quickly yielded this (from RMAN Incremental Backups):
4.4.4.4 Estimating Size of the Change Tracking File on Disk
The size of the change tracking file is proportional to the size of the database and the number of enabled threads of redo. The size is not related to the frequency of updates to the database. Typically, the space required for block change tracking is approximately 1/30,000 the size of the data blocks to be tracked. Note, however, the following two factors that may cause the file to be larger than this estimate suggests:
- To avoid overhead of allocating space as your database grows, the change tracking file size starts at 10MB, and new space is allocated in 10MB incremenents [sic]. Thus, for any database up to approximately 300GB the file size is no smaller than 10MB, for up to approximately 600GB the file size is no smaller than 20MB, and so on.
- For each datafile, a minimum of 320K of space is allocated in the change tracking file, regardless of the size of the file. Thus, if you have a large number of relatively small datafiles, the change tracking file is larger than for databases with a smaller number of larger datafiles containing the same data.
So (if the docs are to be trusted), it would seem that whether or not a backup is taken has no effect on the size of the file, or at least wouldn’t cause it to be shrunk. The size is tied to the amount of data in the database itself, not necessarily the changes in the database waiting to be included in the next incremental RMAN backup.
The documentation does suggest, however, that file size might be affected if (for example) a tablespace and its datafiles were dropped from the database. I’ll save this test for another day!
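In the meantime, it is easy enough to compare the actual file size against the documentation's rule of thumb. The sketch below uses standard views; the estimate ignores the number of redo threads and the 10MB allocation increments, so treat it as a ballpark figure only. As a sanity check on the arithmetic, 300GB divided by 30,000 is roughly 10MB, which matches the docs.

```sql
-- Actual size of the change tracking file (no rows or DISABLED status
-- means block change tracking is not in use).
SELECT status, filename, ROUND(bytes / 1024 / 1024) AS size_mb
  FROM v$block_change_tracking;

-- Rule-of-thumb figures from the documentation: about 1/30,000 of the
-- datafile bytes, and a minimum of 320K per datafile.
SELECT ROUND(SUM(bytes) / 30000 / 1024 / 1024, 1) AS estimated_mb,
       COUNT(*)                                   AS datafiles,
       ROUND(COUNT(*) * 320 / 1024, 1)            AS per_file_minimum_mb
  FROM dba_data_files;
```

Running the first query before and after a backup, or before and after dropping a tablespace, would be a simple way to test the theories above.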
FLASHBACK TABLE vs. DBA_OBJECTS.LAST_DDL_TIME
NOTE: This post originally appeared on The Pythian Group blog on 6 June 2008, and is reposted here with permission.
A little over a week ago, a teammate and I were trying to use Oracle’s FLASHBACK TABLE to undo an “oops” UPDATE statement that a client’s developers had run on one of their test databases, clearing data from two columns in all rows of the table. The statement was actually part of a script that also contained ALTER TABLE statements to add columns. This is important to note because FLASHBACK TABLE will only let you go back as far as the most recent DDL against that table. To quote the SQL reference, “Oracle Database cannot restore a table to an earlier state across any DDL operations that change the structure of the table.”
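To make the scenario concrete, here is a minimal sketch of the kind of check and flashback involved. The owner APP, the table ORDERS and the 15-minute interval are hypothetical, and this is not necessarily the exact sequence we ran.

```sql
-- When was the table last touched by DDL?
SELECT object_name, last_ddl_time
  FROM dba_objects
 WHERE owner = 'APP'
   AND object_name = 'ORDERS'
   AND object_type = 'TABLE';

-- FLASHBACK TABLE requires row movement to be enabled on the table.
ALTER TABLE app.orders ENABLE ROW MOVEMENT;

-- Attempt to rewind the table's data to 15 minutes ago. This fails if a
-- structure-changing DDL (such as adding a column) happened after that time.
FLASHBACK TABLE app.orders TO TIMESTAMP SYSTIMESTAMP - INTERVAL '15' MINUTE;
```

Note that LAST_DDL_TIME is best treated as a hint: it records the most recent DDL of any kind against the object, which may or may not be one that changed the table's structure.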



