Oracle Observations

April 26, 2007

Installing and supporting Oracle RMAN and Tivoli DPO

Filed under: rman,TDPO — bigdaveroberts @ 3:12 pm

Another document that I have found relates to one of the projects I was due to do: an RMAN/Tivoli backup.

Note that before the completion of the project I was moved onto another project, so there is the possibility that this checklist contains untested assertions.

It is also certainly somewhat out of date!

This document should be read in conjunction with the standard documentation and the documents listed at the end of this document. It is based largely on information gathered from Metalink forum articles, Oracle usenet groups and the adsm.org archives (search.adsm.org).

1. Installation

2. Configuring Oracle

Verify that the DB_DOMAIN parameter is set in the init<Database SID>.ora configuration file.

Verify that there is a user named oracle on the system!
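For example, a quick check of DB_DOMAIN (assuming the init file is in the default $ORACLE_HOME/dbs location):

$ grep -i db_domain $ORACLE_HOME/dbs/init<Database SID>.ora

or, from SQL*Plus:

SQL> show parameter db_domain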

3. Testing without Tivoli

It is possible to perform some testing even before Tivoli DPO has been linked to RMAN. This is achieved through the stub interface library (typically libdsbtsh8.a), which implements a ‘tape backup’ to disk.

The BACKUP_DIR environment variable is used by the stub interface library to determine the disk location for the simulation of a tape backup.

E.g.:

# export BACKUP_DIR=/tmp

# sbttest <test file name>

This method should not be used for live backups to disk, as it is incredibly inefficient; it exists solely for testing.

4. Version checking

Check the version of the library installed on the system:

$ cd /usr/tivoli/tsm/client/oracle/bin
$ what libobk.a

libobk.a:

(C) Copyright IBM Corporation 1990, 2000

build date: Mon Mar 26 15:54:21 PST 2001

Tivoli Data Protection for Oracle: version 2.2.0.0

The version of the library should be 2.1.7D or later!

5. Relink

After the relink, various platform-specific methods exist for verifying the executable's linkage to the library.

HP/UX:

chatr $ORACLE_HOME/bin/oracle

dynamic /ora01/app/oracle/product/8.1.6/lib//libobk.sl

AIX:

dump -X32 -Hv oracle | pg

…..

8                                    libobk.a            shr.o

Solaris:

ldd $ORACLE_HOME/bin/oracle | grep libobk

6. Configuration

Creating a password file

Creating the password file is version specific and changed with TDPO v2.2. The instructions in the current IBM Redbook are for v2.1. Check that you are using the correct utility.

Check the PASSWORD_ACCESS parameter in the DSM.SYS file and make sure it is NOT set to “generated”; it needs to be set to “permanent”.

Run the tdpoconf utility to generate a password file:

tdpoconf password -TDPO_OPTFILE=<>

The file will be of the form TDPO.<node name> where the node name is defined in the tdpo.opt file.

Change the owner:group of the file to be oracle:dba
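For reference, a minimal tdpo.opt might look something like the following. The paths and node name are illustrative, and parameter names other than TDPO_NODE and TDPO_OWNER (which appear later in this document) should be verified against the documentation for your TDPO release:

DSMI_ORC_CONFIG /usr/tivoli/tsm/client/oracle/bin/dsm.opt
DSMI_LOG        /home/oracle/tdpo
TDPO_NODE       mynode
TDPO_OWNER      oracle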

7. Testing

The backup process can be tested non-destructively by performing the backup validate command.

The restore process can be tested non-destructively by performing a restore validate command on a valid backup.
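E.g., a minimal sketch, assuming a channel of type ‘sbt_tape’ and that your RMAN release supports the validate options:

run {
allocate channel t1 type 'sbt_tape';
backup validate database;
}

run {
allocate channel t1 type 'sbt_tape';
restore database validate;
}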

It is possible for the backup to fail silently if the setsize is too large. It is therefore recommended that the setsize be set explicitly and that a restore validate be performed to verify the validity of the backup. If the setsize is too large, the restore validate will return the following errors:

RMAN-10035: exception raised in RPC: ORA-19660: some files in the backup set could not be verified

ORA-19661: datafile 3 could not be verified

RMAN-10031: ORA-19660 occurred during call to DBMS_BACKUP_RESTORE.RESTOREBACKUPPIECE

E.g.:

backup database setsize = 2024800;

8. Tracing and debugging

AIX

A list of system errors can be generated with the following command:

errpt -a

sbttest

The sbttest (and sbttest.exe) utility can be used to check the configuration of the TDPO library on most platforms.
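E.g., a minimal smoke test; the file name foo is just an arbitrary name for the test backup piece, and the -dbname argument is optional:

# sbttest foo -dbname <Database SID>

Run with no arguments, sbttest should print its usage.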

Interpretation of error codes

Additional information 7xxx – indicates operating system problems.

Error numbers are defined in skgfqsbt.h and Metalink note 149068.1, and are also listed towards the end of this document.

Additional information/return codes 2xxx – explanations can be found in dsmrc.h.

Likely error messages and their interpretation

ORA-19506 and ORA-27006

The media management software is linked correctly, but is not operating properly.

ORA-19511 and a 4110 media manager return code

The media management layer is not correctly linked.

BACKUP_DIR environment variable is not set

The oracle executable is still linked to the dummy test library libdsbtsh8.a or libdsbtsh8.so.

Log files to examine

RMAN log

sbtio.log file (located in user_dump_dest)

tdpo.out

tsmapi.out

DSIERROR.LOG (API error log file for TDPO)

See also tracefile parameter in dsm.opt
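To find where sbtio.log is being written, user_dump_dest can be queried from SQL*Plus:

SQL> show parameter user_dump_dest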

Generating a trace file

The trace option on the allocate channel command can be used to force the generation of a trace file.

E.g.:

run {
allocate channel t1 type 'sbt_tape' trace 1;
debug on;
restore tablespace PSAPBTABD;
recover tablespace PSAPBTABD;
release channel t1;
debug off;
}

The ‘trace 1’ option will create an Oracle trace file in udump for each channel allocated. Examine the last calls recorded in those files: the trace captures the conversation between the MML and the Oracle server process.

9. Problems

HP/UX

The oracle executable fails with an error and is unable to find the TDPO-supplied tape library.

Use chatr to verify that the oracle executable uses the LD_LIBRARY_PATH/SHLIB_PATH environment variables:

% cd $ORACLE_HOME/bin

% chatr oracle

oracle:
         shared executable
         shared library dynamic path search:
                         SHLIB_PATH enabled second
                         embedded path disabled first Not Defined

Etc.

The command chatr is used to enable or disable SHLIB_PATH and LD_LIBRARY_PATH:

% chatr +s enable <executable>

or

% chatr +s disable <executable>

On HP-UX 10.20 and 11 (32 bit), this enables/disables SHLIB_PATH only.

On HP-UX 11 (64 bit), both SHLIB_PATH and LD_LIBRARY_PATH are enabled/disabled.

Out of memory errors

Out of memory errors can be caused by setting the FILESPERSET parameter too high; try setting FILESPERSET to 4.

Timeouts

Timeouts can also be caused by setting the FILESPERSET parameter too high; try setting FILESPERSET to 1.
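In RMAN terms, FILESPERSET is an option of the backup command, e.g. (a sketch):

run {
allocate channel t1 type 'sbt_tape';
backup database filesperset 4;
}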

Oracle 8.1.6

There are known issues linking Oracle 8.1.6 to TDPO. Instructions for linking this release can be found in Metalink article 147862.1.

ORA-19506: failed to create sequential file, name="al_t468634250_s3617_p1", parms=""

Verify that the format parameter guarantees a unique backup piece name! This can most easily be achieved by adding the %u parameter to the file name mask.
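E.g., a sketch of a format mask using %u (%d is the database name, %s the backup set number, %p the piece number and %u an eight-character unique name):

run {
allocate channel t1 type 'sbt_tape';
backup database format 'df_%d_%s_%p_%u';
}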

Symbol putmsg in ksh is undefined or Symbol getmsg in ksh is undefined

As root, run /usr/sbin/strload

Parameter length problems

In releases prior to 2.2, parameters were defined in the RMAN script rather than the tdpo.opt file. This potentially led to issues with limits on the length of the parameter string. As a workaround, Oracle recommended producing short paths using symbolic links that can be substituted for long path names to reduce the length of the parameter string, as sketched below.
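E.g., a sketch of the symbolic link workaround (the paths are illustrative):

$ ln -s /u01/app/oracle/product/8.1.6/admin/tdpo /tdpo

The parameter string can then reference /tdpo/tdpo.opt rather than the full path.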

Problems connecting to the TSM server

The following error can be caused by setting the IDLETIMEOUT parameter to a value that is too high:

"Enter your user id: admin

ANS1017E Session rejected: TCP/IP connection failure

ANS8023E Unable to establish session with server.

ANS8002I Highest return code was -50."

The above error was encountered when the IDLETIMEOUT parameter was set to the value: “999999999”.

Sbtio.log grows very large very quickly

Tivoli misunderstood how tracing worked and released the Tivoli Data Protection Agent 2.2 with tracing turned on at the maximum level. Tivoli provided a patch that defaulted the tracing to level 0. The patch may be found at:

ftp://index.storsys.ibm.com/tivoli-storage-management/patches/tivoli-data-protection/oracle/platform

10. Hints, tips and notes

HP/UX

Oracle 8.0.5 64-bit on HP is different from other 64-bit versions. There is no lib64 directory, and libobk is not a shared library (.sl) but an archive library (.a).

11. Q&A

Why is my incremental backup not faster than my full backup?

Incremental backups increase speed when the network is the bottleneck. When the tape devices are local, the backup will take a similar amount of time: RMAN must still read every block of the database to find the changed blocks, so only the volume of data written is reduced.

12. Cloning

In a disaster recovery scenario, it is assumed that you are essentially recreating a system that will have all of the same logical characteristics as the original node. However, cloning a database to a different server, which in Tivoli terms constitutes a new node, requires a slightly more complex procedure than a disaster recovery.

To restore a database to a different node using Tivoli, you must configure the system to make RMAN believe that the media manager node is the same as the one used to make the original backup.

Set the parameter “TDPO_NODE” in the “tdpo.opt” to be the TSM node that performed the backup.

Set the TDPO_OWNER to be the database owner on the new machine if the owner is different to that on the original machine.

Set the environment variable DSMO_NODE to be the name of the original node and then generate the TDPO password file.

Restore the database.

Change the TDPO_NODE parameter in the tdpo.opt file back to be the node name for that machine.

Set the environment variable DSMO_NODE to be the name of the new node and then regenerate the TDPO password file.
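As a sketch, the switch before the restore might look like this; the paths and node names are illustrative:

# in tdpo.opt:
TDPO_NODE <original node name>

$ export DSMO_NODE=<original node name>
$ tdpoconf password -TDPO_OPTFILE=/usr/tivoli/tsm/client/oracle/bin/tdpo.opt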

See metalink notes: 183025.1, 232107.996, 1020252.102

13. Uninstalling

To upgrade Oracle you will need to unlink the TDPO library from the current release of Oracle, upgrade Oracle, and then relink.

You must recreate the symbolic link between the libobk.a library and $ORACLE_HOME/lib/libdsbtsh8.a, which is the standard Oracle I/O library.

Unlinking e.g.:

$ cd $ORACLE_HOME/lib

$ rm libobk.a

$ ln -s libdsbtsh8.a libobk.a
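After the Oracle upgrade, the relink itself is typically along these lines (a sketch; check the TDPO installation instructions for your platform and Oracle release):

$ cd $ORACLE_HOME/rdbms/lib
$ make -f ins_rdbms.mk ioracle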

14. Performance

If DSMO_AVG_SIZE is set too large for the defined disk storage pool, ADSM will skip the disk storage pool and use the secondary tape storage pool instead, without first writing to the disk storage pool.

(Metalink 1050834.6)

There is a bug in Oracle 8.1.7.4 that can lead to deteriorating backup performance; a workaround is to analyze all the tables in the RMAN recovery catalog. (Metalink 279968.995)

Your first step is probably to identify the bottleneck, or at least narrow down the list of suspects.

How fast is your network connection, and how busy? On a really good day, a *dedicated* 10Mbit Ethernet will give you about 1MB/sec throughput. That’s about 3.5 GB/hour and about 30 hours to back up 100 GB. (I’m assuming that you meant “100GB” rather than “100MB” as your database size.)

A single DLT 7000 has a native transfer rate of about 5MB/sec. Under ideal conditions, and without compression, this will give you a maximum of 18 GB/hour; if your backups are not buffering well, the tape drive may be stopping and starting a lot (shoe-shining), and your effective transfer rate will plummet. (Turning compression on could actually hurt sometimes; this will double — or more — the rate that the tape drive can accept data, increasing the possibility of start/stop problems.)

What is your database server doing during the backup? Is the CPU busy? Does it page a lot; are you getting large amounts of I/O wait? Setting filesperset=100 may be a mistake. If you’re seeing a high %WIO, try setting this to the number of disks on your system (or less). Unless your disks are more than 10 years old, serial reads on one or two disks should be able to easily exceed the bandwidth of most networks.

It is recommended that two channels per device be allocated, but test to verify the performance improvement.
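E.g., a sketch of a two-channel backup:

run {
allocate channel t1 type 'sbt_tape';
allocate channel t2 type 'sbt_tape';
backup database;
}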

See also

http://otn.oracle.com/deploy/availability/pdf/rman_performance_wp.pdf

Update (18-Oct-2013): the original document appears to have been updated and renamed:

http://www.oracle.com/technetwork/jp/content/rman-performance-wp-133899.pdf

An additional document is now also available:

http://www.oracle.com/technetwork/jp/content/rman-perfromance-scbc-wp-133854.pdf

15. SAP

SAP includes its own parameter file for RMAN:

$ORACLE_HOME/dbs/sap/init.sap

This file contains parameters that override the default RMAN parameters.

Example init.sap parameter:

RMAN_FILESPERSET = 10

16. Sources of additional Tivoli DPO information

Supported platforms:

http://www.tivoli.com/products/index/storage-mgr-db/platforms.html

Update (18-Dec-2013): supported platforms are now found via this landing page:

http://www-01.ibm.com/software/sysmgmt/products/support/Tivoli_Supported_Platforms.html

This will eventually redirect you to IBM’s Software Product Compatibility Reports page:

http://publib.boulder.ibm.com/infocenter/prodguid/v1r0/clarity/index.html

Certification matrix:

http://www.tivoli.com/support/storage_mgr/adoracjp.htm

Product information

http://www.tivoli.com/products/index/data_protect_oracle/

http://www.oracle.com/ip/deploy/database/features/recovery/index.html?recovery.html

Support information / hardware requirements

http://www.tivoli.com/support/storage_mgr/adagent.htm

Other useful Documentation

Oracle 9i RMAN Backup and Recovery.

IBM Redbook:

Backing Up Oracle Using Tivoli Storage Management

http://www.redbooks.ibm.com/abstracts/sg246249.html

Note that this book describes TDPO v2.1, not v2.2.

Documentation for tdposync:

http://publib.boulder.ibm.com/tividd/td/DPON/SH26-4112-02/en_US/HTML/anou0010.htm

RMAN performance tuning document:

http://otn.oracle.com/deploy/availability/pdf/rman_performance_wp.pdf

17. Glossary

TDP – Tivoli Data Protection

TDPO – Tivoli Data Protection for Oracle

18. TDPO error codes reported back to RMAN

Entries marked ‘*’ are internal and should never be seen during normal operation:

/* Error codes for sbtopen */

7000 /* file not found (only for read) */

7001 /* file exists (only for write) */

* 7002 /* bad mode specified */

* 7003 /* bad block size specified */ Bad tpblksiz, should be multiple of 512

7004 /* no tape device found */

7005 /* device is busy; try again later */

7006 /* tape volume not found */

7007 /* tape volume is in-use */

7008 /* I/O Error */

7009 /* cant connect with Media Manager */

7010 /* permission denied */

7011 /* OS error */ system error – e.g. malloc, fork errors

7012 /* invalid argument(s) */ invalid argument(s) to sbtopen

7013 Backup file exists

/* Error codes for sbtclose */

7020 /* bad th (no sbtopen done) */

7021 /* bad flags */

7022 /* I/O error */

7023 /* OS error */

7024 /* invalid argument(s) */

7025 /* cant connect with Media Manager */

/* Error codes for sbtwrite */

* 7040 /* bad th */ bad th to sbtwrite

7042 /* I/O error */

7043 /* OS error */

* 7044 /* invalid argument(s) */ invalid argument(s) to sbtwrite

/* Error codes for sbtread */

* 7060 /* bad th */ bad th to sbtread

7061 /* EOF encountered */

7063 /* I/O error */

7064 /* OS error */

* 7065 /* invalid argument(s) */ invalid argument(s) to sbtread

/* Error codes for sbtremove */

7080 /* backup file not found */

7081 /* backup file being used */

7082 /* I/O Error */

7083 /* cant connect with Media Manager */

7084 /* permission denied */

7085 /* OS error */

* 7086 /* invalid argument(s) */ invalid argument(s) to sbtremove

/* Error codes for sbtinfo */

7090 /* backup file not found */

7091 /* I/O Error */

7092 /* cant connect with Media Manager */

7093 /* permission denied */

7094 /* OS error */

* 7095 /* invalid argument(s) */ invalid argument(s) to sbtinfo

/* Error codes for sbtinit */

* 7110 /* invalid argument(s) */ invalid argument(s) to sbtinit

7111 /* OS error */ system error – e.g. malloc, fork errors

* 7120 th was not of a previously sbtopened one (sbtclose)

* 7121 Bad Flags to sbtclose

7122 I/O ERROR

7123 System Error

7124 Can’t connect with Media Manager

* 7125 Invalid Argument to sbtclose

Common error codes

These can be issued by any SBT v2 function:

7500 /* Operator intervention requested */

7501 /* Catch-all media manager error */

7502 /* File Not Found */

7503 /* File already exists */

7504 /* End of File */

7505 /* cant proxy copy the specified file */

7506 /* no proxy work in progress */

April 24, 2007

UNIX debuggers & archive.org

Filed under: Pro*C,Unix — bigdaveroberts @ 4:44 pm

It never ceases to surprise me, the number of different debuggers produced.

adb, sdb, gdb, dbx, mdb and edb.

edb?

I came across edb while trying to find a solution to a novel Oracle problem.

After compiling a newly amended Pro*C program, it failed to work, at all.

To investigate the problem, I ran the program under truss to see what was happening.

To my surprise, Oracle was making 50 attempts to open a network connection and, after failing 50 times, was exiting in an ungracious manner.

The next step I was planning to take was to run it through a debugger.

I started looking for all the usual culprits but, disappointingly, none seemed to be present.

As an alternative to attempting to get gdb compiled/installed, I searched for any three-letter programs ending in ‘db’, and discovered the improbably named edb debugger. (Improbable, until I found out that it was written by Edinburgh Portable Compilers.)

So I went off onto the internet to see if I could locate some documentation.

Unfortunately, while the website did still exist at that time (www.epc.co.uk), the documentation was incomplete and/or corrupt.

It was then that an ex-colleague suggested that I try archive.org.

And, while my employer had blacklisted the site, I popped over to the customer and located several iterations of the company’s website, replete with multiple iterations of the edb manual!

In the end, by the time that I had returned with my documentation, the problem had already been resolved by my original colleague.

And what was the cause of this odd behavior?

Well, the makefile had hard-coded directories for the include files, so the wrong version of the sqlca structure had been used; when linked with libraries from a different release, this produced the bizarre behavior originally observed.

April 12, 2007

Async I/O and all that.

Filed under: AIX,Unix — bigdaveroberts @ 10:10 am

While I do like to characterise myself as a DBA, I have only had one all-nighter, which sort of exposes the conceit of my claim.

The system involved was a large (for the time) data warehouse, and the database had crashed unexpectedly on the day before the main weekly data load. (If I recall correctly, it was due to a control file timeout.) While it was not unusual for the system to fail (mainly because the database files were sparse, for some unknown historical reason, which made space management somewhat problematic), as the weekly load was due shortly, the system was being monitored more closely on the evening that the main loads commenced.

After the data load started, the system crashed again. The system was recovered, looked fine, and the data load was initiated again; again the database crashed. The cycle repeated. I believe that, in all, the database crashed 11 times in close succession.

To be honest I handed the problem over to another DBA in the morning, and they initiated a Disaster Recovery, while continuing to investigate the problem.

The first problem found was one identified by Oracle. We had recently enabled Parallel DML, and this had implicitly enabled parallel recovery. Oracle 8.1.7.2 apparently had issues with parallel recovery which made it somewhat less efficient than normal recovery. This meant that Oracle continued to perform recovery operations long after the database had been restarted.

The issue was that having had one crash increased the workload and contention in the database, causing the multiple additional crashes.

What had caused the original crash?

Well, at least 3 issues.

1. A new contract developer had been recruited by the client who didn’t realise what a smart idea it was to remove the indexes before a data load and recreate them afterwards. Over time, as his code was promoted to live, this substantially increased the work done during the data loads.

2. There was contention between the control files and the redo logs due to poor file locations.

3. The system was running on AIX, and the Oracle manual for configuring AIX asynchronous I/O was inadequate. While the manual recommended increasing the maxioservers and minioservers, it made no recommendations regarding the queue length.

There were also hints of a conflict between IBM and Oracle: it was suggested that the underlying cause of the asynchronous I/O problem was a rewrite of IBM’s async I/O subsystem, and subsequently, when errors began to appear from the Oracle database, Oracle blamed IBM, as Oracle had worked without errors before the change.

In terms of problem resolution:

The first moves, with the assistance of Oracle support, were to disable parallel recovery and temporarily suspend the new data load, along with disabling async I/O in Oracle (at my initiative).

The result was that the crashes stopped.

I was instructed by my team leader to reinstate async I/O, as in his view, since the customer was paying for gold support, we should get a proper resolution from Oracle rather than use a workaround that would potentially reduce the performance of their system.

Further investigation revealed an increasing trend in [EAGAIN] warnings being logged by Oracle, and with the benefit of a large control file, a rather elegant graph of the deterioration in the time taken to archive redo log files (due to disk contention) was produced using the following SQL:

SELECT recid,
To_Char( next_time, 'YYYY-MON-DD HH24:MI:SS' ),
(completion_time - next_time) * ( 24 * 60 * 60 ) "Seconds to archive"
FROM v$archived_log
ORDER BY recid
/

We then spent a period of time finding ‘colder’ disks for the location of the control files, and with the production of a script to log the number of [EAGAIN] warnings per hour, we proceeded to tune the async I/O parameters sufficiently to minimise the number of warnings produced.

The end result was parameter settings far in excess of the recommendations in the installation guide, limited in fact only by the number of async I/O processes that Oracle would start with (this turned out to be about 350), plus a script that automatically counted the number of [EAGAIN] warnings and alerted us by email if the number went over a certain threshold.
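A rough sketch of that sort of count, assuming the warnings appear in the alert log (the path is illustrative):

$ grep -c EAGAIN $ORACLE_BASE/admin/$ORACLE_SID/bdump/alert_${ORACLE_SID}.log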

To an extent, that was the end of the problem, although after a couple of months one of the developers spotted that a query would return differing results depending on which execution plan was used. After investigation, an attempt was made to rebuild the index, which failed. Having passed the call to Oracle, we were advised to upgrade to 8.1.7.3. After this upgrade had been performed, we were able to rebuild the index. Unfortunately the disparity in the results persisted, and we eventually dropped and rebuilt the table.

While the corruption can’t be proven to be related to the large number of crashes encountered, it seems more than coincidence that such a bizarre problem occurred shortly after a substantial number of system crashes.
