Oracle Observations

July 20, 2007

Re-installing the Microsoft Java VM.

Filed under: Java VM — bigdaveroberts @ 2:42 pm

OK, I know that this isn’t actually Oracle related, but as I trashed my java installation while attempting to install Oracle 9i, I thought it might be of interest to some nonetheless.

For reference the target platform is Windows 2000 SP4.

And as you are probably aware, Microsoft Java VM support ended at the end of last year, so following these instructions probably breaks your licence agreement.

The problem started because, when I experienced difficulties installing 9i, my suspicion fell on the Microsoft JVM, so I disabled it, and when that failed to work, I uninsulated it, presuming that I would be able to re-install later.

Having located a copy of javavm.exe, I ran it only to be confronted by an error indicating that I needed a later release of the operating system, or a service pack.

However it became apparent that it wasn’t the underlying program that wouldn’t install, but rather it was the protective(?) wrapper that Microsoft had added over the top.

So this are the procedures that I went through to fix the problem:

I ran javavm.exe.

When you run this you will be presented with a dialog box, select setup, and you will then receive the error message:

The Microsoft VM you are attempting to install is a protected system component and can only be updated with a later release of the operating system or service pack.

Before you hit OK, do a search for the file MSJavaVM.exe and copy it to a safe place (the copy that you have found will be deleted when you hit OK.)

Hit OK.

Run the file that you saved.

You will again receive the same error message (after several other new messages).

Again, before you hit OK, search for the javabase.cab file, and save the while contents of the directory in which you find the file.

Hit OK.

In the saved directory find the file java.inf right click and select install.

At this point you will may have to change the Java settings in the Internet explorer/tools/Internet options/Advanced.

Then following a reboot, your JVM should be restored.

Well it worked for me, and I’m not going to trash my installation again to verify that the solution above is repeatable or reliable!

July 17, 2007

ORA-600 [15015] revisited.

Filed under: ORA-600 — bigdaveroberts @ 1:50 pm

Well Oracle has produced an analysis of the problem based on one of the hundreds of trace files produced.

With hindsight, one of the symptoms I should have mentioned in my original post was that when logging into sql*plus you received an error indicating that the set_application_info procedure was invalid.

Oracles conclusion was that due to a bug in Oracle(1867501). Sometimes if a process connects to Oracle as SYSDBA and issues commands while the database is starting up, the SGA can be corrupted.

From the point that this happens, then all of the following errors (including the ORA-600) are secondary.

I do like the response in the fact that it fits my favored scenario of being caused by an unforeseen side effect of a change. I am however suspicious, because the change that involved scheduling a script to regularly connect to the database as SYSDBA and run a script was implemented more than 12 months ago. So I am still concerned that one of the more recent changes may also be implicated as a secondary cause of the problem.

OTOH the information Oracle has given us allows us to make a change that will avoid the problem in future!

If anyone else encounters the same error, I would be interested in any information you have with regards to what you may have recently done to your system!

July 12, 2007

EAGAIN (again)

Filed under: AIX — bigdaveroberts @ 3:00 pm

One of the more interesting aspects of a blog, is the ability to see the search terms used by the user in the search engine that redirected the user to this blog.

Thus it is possible for me to know that almost every day someone searches for EAGAIN, and looks at my blog on performance problems using async I/O on AIX.

As that blog entry covers a number of issues, I think that it might be worthwhile to revisit this subject and dedicate a single post to the subject of EAGAIN warnings under AIX.

The history of the EAGAIN problem under AIX as I understand it.
(Based largely on supposition rather than hard fact!)

When IBM originally produced the Asynchronous I/O subsystem for buffered file systems on AIX 3, the solution implemented was sub-optimal, in that on occasions it would unnecessarily lock the inode, and not actually always be asynchronous.

Oracle then used IBMs asynchronous I/O API to implement async I/O on AIX.

There are then 2 possibilities as to what happened.

Oracle gave insufficient instructions in the setup guide concerning async I/O configuration in the AIX environment and when IBM re-wrote the async I/O subsystem Oracle began to generate EAGAIN errors indicating a poor configuration that had been hidden by the inefficient initial implementation.

or

When IBM re-wrote the async I/O subsystem they added an additional configuration parameter, which without an appropriate setting resulted in numerous EAGAIN warnings.

What certainly did happen, was that IBM introduced new bugs into the system which required several iterations of patches to resolve.

Whatever the cause, many people running Oracle on AIX encountered an increasing number of “Warning lio_listo returned EAGAIN” messages.

The response of Oracle was to blame AIX, as before the upgrade, the warnings were not occurring, and IBM blamed Oracle, as all they had done was improve the efficiency of their async I/O system.

What should you do if you encounter EAGAIN warnings under AIX.

Firstly you should ensure that the appropriate AIX operating system patches have been applied.

Test bos.rte.aio level with:

# lslpp -l bos.rte.aio
bos.rte.aio 5.1.0.25 COMMITTED Asynchronous I/O Extension

Secondly, you should accept that the eradication of EAGAIN warnings is not a guarantee that you have actually resolved the underlying problem nor that the existence of the occasional EAGIN warning indicates a problem.

As the basic explanation of the message indicates, the warning is an indication that the I/O system is not running optimally.

In AIX the async implementation consists of a single buffer to contain all disk writes, with multiple write processes executing the write instructions.

When an EAGAIN warning occurs, it is simply an indication that the async i/o write buffer is full, and Oracle will have to absorb the overhead of attempting to write the data to the buffer again.

If you increase the size of the buffer, you will reduce the number of warnings and slightly reduce the workload on oracle, however, this should not be your first consideration or goal.

The greatest way to increase the efficiency of the system is to increase the rate that disk writes are completed and thus removed from the queue by increasing I/O bandwidth (replacing RAID 5 with mirroring, using faster disks, reducing disk contention etc), then secondly you should look at reducing the number of disk writes added to the async I/O buffer by methods of redo reduction and deletion before and recreation of indexes after data loads.

It is only after using general methods of increasing i/o efficiency, that you should then turn to attempts to tune the async I/O subsystem itself.

You should consider that while using async I/O you can configure multiple processes to write to the hard disk simultaneously, a Hard disk can only physically write to one place on a hard disk at a time. Thus it is only through the combination of NCQ and disk buffers that implementing multiple write processes per disk will actually increase I/O. Thus even if you do reduce the number of EAGAIN warnings by increasing the number of write processes, that is not a guarantee that the speed of the system has been increased! Again by increasing then size of the async i/o buffer, you may well reduce the number of EAGAIN warnings, but if the memory utilised could have been better used to increase the size of the SGA, then the performance of the system may be reduced, even though the number of warnings has been reduced.

Obviously, if you haven’t changed the configuration of the system, and the number of EAGAIN warnings is on the increase, then that is an indication of a problem, but the solution may well not be in the realm of the DBA, it may be that new inefficient routines are being implemented by the developers.

In short, the EAGAIN warning itself should not be considered the problem itself, but rather it should be considered another symptom that if not eradicated, probably needs to be monitored and managed.

Further information:

IBM documents on tuning oracle on AIX:

http://www-941.ibm.com/collaboration/wiki/download/attachments/5570/Oracle_on_AIX_WebinarFeb2007.pdf

http://www.sioug.si/sioug2005/datoteka.jsp?filename=Tomaz%20Vincek%20-%20Oracle%2010g%20Performance%20Tuning%20v%20UNIX%20okolju.pdf

For the enthusiastic, AIX documentation about the lio_listo function:

https://www-rz.uni-hohenheim.de/betriebssysteme/unix/aix/aix_4.3.3_doc/ext_doc/usr/share/man/info/en_US/a_doc_lib/aixprggd/kernextc/async_io_subsys.htm

And an interesting metalink article:

34924.996

IBM response to query about aiostat (a tool that IBM supplied to analyse the volume of asynchronous i/o calls)

http://www-1.ibm.com/support/docview.wss?uid=std3295a41c3f8c73bda49256f66000ecf3d

July 11, 2007

ORA-00600 [15015] and all that.

Filed under: ORA-600 — bigdaveroberts @ 11:35 am

Well last Wednesday was the first fun day (at work) for a long while.

The application, database and OS were all struggling, and I suspect that the network was also experiencing problems.

After consideration, the apparent cause (based on being the earliest errors we could find evidence of) were repeated ORA-600 errors (predominantly 15015), starting within seconds of the database being restarted after the backup (which also failed).

The errors appeared to be related to the snapshot process that was dieing every 5 minutes and was then being automatically restarted by the database.

I looked up ORA-600 [15015] on Google, and got no hits and I looked up 15015 on the Metalink ORA-600 argument look up tool and received the unhelpful response:

A description for this ORA-600 error is not yet published.

I also searched in the knowledge base including the archived articles and bug database and received no hits.

So we have a stable system on a terminal release (8.1.7.4) that suddenly and for no apparent reason starts kicking out super obscure errors.

And it isn’t as if there have been any significant changes implemented.

There was one change to patch an oracle bug that reared its head when we started running the client under Citrix and one to increase the size of the SGA. Both changes were implemented more than a month ago.

Before you get excited, I have to say that I don’t know what the problem is. (Lets be frank, the only reason you are reading this is because you are experiencing the same error.)

So where do my theories lie?

There is a general tendency for Software people to blame hardware when a new problem appears in a stable system, but I wouldn’t initially blame hardware. (/var/adm/messages didn’t have anything novel in it until a disk partition filled from all the core dumps and the problem was resolved by a reboot.)

I also don’t tend towards the conspiracy theorists that assume that all problems start with an uncontrolled change made by some well meaning techie. Certainly pkginfo didn’t indicate that the system had been patched or had any new packages installed within the last 12 months.

Generally I find that unexpected problems are most often explained by the unforeseen outcomes of poorly understood changes, and while the 2 changes appear to be superficially innocuous, it is there that my suspicion starts.

The Oracle patch will probably have been installed on multiple systems for multiple customers, and while it may be possible for the interaction between multiple patches to produce unusual results, the simple fact is that the system is rarely patched and is as close to a vanilla install as possible. Thus I think that the patch is unlikely to be the cause.

Thus finally we are left with the SGA increase. This does worry me slightly, in that the size of the SGA is now close to the SHMMAX setting for the maximum shared memory segment size, to the extent that on some mornings we receive a warning:

WARNING: Not enough physical memory for SHM_SHARE_MMU segment of size 0xnnnnnnnn

in the alert log, which Metalink unhelpfully indicates may be serious on some versions and innocuous on others.

So with my suspicion that there is an issue with shared memory I have scheduled a cron job to record the results for ipcs -Am before and after each backup window, and leave the problem with a watching brief.

Obviously when Oracle comes up with a response I will post an update.

May 23, 2007

Where are those tnsnames entries defined?

Filed under: SQL*Net — bigdaveroberts @ 3:53 pm

One thing that surprised me when running oracle utilities through truss was the first location that Oracle tries to find the tnsnames data from.

Rather that looking in $ORACLE_HOME/network/admin or $TNS_ADMIN for tnsnames.ora, it appears that the first thing it does is look in the .tnsnames.ora file in the current working directory.

How do you change the alert icons in Forms 6i? (and regedit sysinternals)

Filed under: Oracle forms — bigdaveroberts @ 3:46 pm

Easy, in only 2 steps:

Firstly, you need to edit the appropriate UI_ICON registry entry to add a new path for forms icons ahead of the default path.

Then, in this new directory you can then create 3 icons: caution.ico, stop.ico and note.ico.

When your forms application runs and displays an alert box, instead of the 3 default icons for each of the alert types, the icons you have specified will be displayed!

Further, should you wish to, many of the buttons in Forms builder (and presumably other tools) can be re-defined.

All you need to do is get hold of the regmon utility from sysinternals (Now distributed by Microsoft at www.microsoft.com/technet/sysinternals/utilities/regmon.mspx ) and set it to tell you about all the icon files that ifbld60.exe is attempting to open, and by correctly identifying which icon corresponds to which button, it is possible to replace the buttons with your own creations!

April 26, 2007

Installing and supporting Oracle RMAN and Tivoli DPO

Filed under: TDPO, rman — bigdaveroberts @ 3:12 pm

Another document that I have found regarding one of the projects I was due to do, was an RMAN/Tivoli backup.

Note that before the completion of the project, I was moved onto another project so there is the possibility that this checklist contains untested assertions.

It is also certainly somewhat out of date!

This document should be read in conjunction with the standard documentation and the documents listed at the end of this document. It is based largely on information gathered from Metalink forum articles, Oracle usenet groups and the adsm.org archives (search.adsm.org).

1. Installation

2. Configuring Oracle

Verify that the DB_DOMAIN parameter is set in the init<Database SID>.ora configuration file.

Verify that there is an oracle user on the system named oracle!

3. Testing without Tivoli

It is possible to perform some testing even before Tivoli DPO has been linked to RMAN. This is achieved through the stub interface library (typically libdsbtsh8.a) which implements a ‘tape backup’ to disk.

The BACKUP_DIR environment variable is used by the stub interface library to determine the disk location for the simulation of a tape backup.

E.g.:
# export BACKUP_DIR=/tmp
# sbttest command

This method should not be used for live backups to disk, as it is incredibly inefficient, its purpose is solely for testing purposes.

4. Version checking

Check the version of the library installed on the system:

/usr/tivoli/tsm/client/oracle/bin what libobk.a

libobk.a:

        (C) Copyright IBM Corporation 1990, 2000
        build date: Mon Mar 26 15:54:21 PST 2001
         Tivoli Data Protection for Oracle: version 2.2.0.0

The version of the library should be 2.1.7D or later!

5. Relink

After the relink various platform specific methods exist for verifying the executables linkage to the library.

HP/UX:

chatr $ORACLE_HOME/bin/oracle

dynamic /ora01/app/oracle/product/8.1.6/lib//libobk.sl

AIX:
dump -X32 -Hv oracle | pg
…..
8                                    libobk.a            shr.o

Solaris:
ldd ?

6. Configuration

Creating a password file

Creating the password file is version specific, and changed with TDPO v2.2. The instructions in the Current IBM red book are for version v2.1. Check that you are using the correct utility.

PASSWORD_ACCESS parameter in the DSM.SYS file and make sure it is NOT set to “generated”. The parameter PASSWORD_ACCESS needs to be set to “permanent”

Run the tdpoconf utility to generate a password file

Tdpoconf password -TDPO_OPTFILE=<>

The file will be of the form TDPO.<node name> where the node name is defined in the tdpo.opt file.

Change the owner:group of the file to be oracle:dba

7. Testing

The backup process can be tested non destructively by performing the backup validate command.

The restore process can be tested non-destructively by performing a restore validate command on a valid backup.

It is possible for the backup to fail silently if the setsize is too large, it is recommended that the setsize be explicitly set and that a restore validate should be performed to verify the validity of the backup. If the setsize is too large, the restore validate will return the following errors:

RMAN-10035: exception raised in RPC: ORA-19660: some files in the backup set could not be verified
ORA-19661: datafile 3 could not be verified
RMAN-10031: ORA-19660 occurred during call to DBMS_BACKUP_RESTORE.RESTOREBACKUPPIECE

E.g.:

Backup database setsize = 2024800

8. Tracing and debugging

AIX

A list of system errors can be generated with the following command:

errpt -a

sbttest

The sbttest (and sbttest.exe) utility can be used to check the configuration of the TDPO library on most platforms.

Interpretation of error codes

Additional information 7xxx – indicates operating system probs.

Error numbers are defined in skgfqsbt.h and 149068.1 and are also listed towards the end of this document.

Additional information/return codes 2xxx – explanations can be found in: dsmrc.h

Likely error messages and their interpretation

ORA-19506 and ORA-27006

The media management software is linked correctly, but is not operating properly.

ORA-19511 and a 4110 media manager return code

The media management layer is not correctly linked.

BACKUP_DIR environment variable is not set

The oracle executable is still linked to the dummy test library libdsbtsh8.a or libdsbtsh8.so.

Log files to examine

RMAN log
sbtio.log file (located in user_dump_dest)
tdpo.out
tsmapi.out
DSIERROR.LOG (API error log file for TPDO)
See also tracefile parameter in dsm.opt

Generating a trace file
A parameter can be used to force the generation of a trace file.

E.g.:

run {
allocate channel t1 type ’sbt_tape’ trace 1;
debug on;
restore tablespace PSAPBTABD ;
recover tablespace PSAPBTABD ;
release channel t1 ;
debug off;
}

The ‘TRACE 1′ will create an Oracle trace file in udump for each channel allocated. We need to see what calls were made last in those files. The trace will capture the conversation between the MML and the Oracle server process.

9. Problems

HP/UX

The oracle executable errors and is unable to find the TDPO supplied tape library.

Use chatr to verify that the oracle executable uses the LD_LIBRARY_PATH/SHLIB_PATH environment variables:

% cd $ORACLE_HOME/bin
% chatr oracle
1 oracle:
2              shared executable
3              shared library dynamic path search:
4                              SHLIB_PATH enabled second
5                              embedded path disabled first Not Defined
Etc.

The command chatr is used to enable or disable SHLIB_PATH and LD_LIBRARY_PATH:

% chatr +s enable <executable>
or
% chatr +s disable <executable>

On HP-UX 10.20 and 11 (32 bit), this enables/disables SHLIB_PATH only.
On HP-UX 11 (64 bit), both SHLIB_PATH and LD_LIBRARY_PATH are enabled/disabled.

Out of memory errors

Out of memory errors can be caused by setting the FILESPERSET parameter too high, try setting FILESPERSET to 4.

Timeouts

Timeouts can be caused by setting the FILESPERSET parameter too high, try setting FILESPERSET to 1.

Oracle 8.1.6

There are known issues linking Oracle 8.1.6 to TDPO. Instructions for linking this release can be found in metalink article147862.1.

ORA-19506: failed to create sequential file, name=”al_t468634250_s3617_p1″, parms=”"

Verify that the format parameter guarantees a unique backup piece name! This can most easily be achieved by adding the %u parameter to the file name mask.

Symbol putmsg in ksh is undefined or Symbol getmsg in ksh is undefined

As root, run /usr/sbin/strload

Parameter length problems

In releases prior to 2.2 parameters were defined in the RMAN script rather than the tdpo.opt file. This potentially led to issues with limits on the length of the parameter string. As a work around, Oracle recommended producing short paths using symbolic links that can be substituted for long path names to reduce the length of the parameter string.

Problems connecting to the TSM server

The following error can be caused by setting the IDLETIMEOUT parameter to a value that is too high:

“Enter your user id: admin
ANS1017E Session rejected: TCP/IP connection failure
ANS8023E Unable to establish session with server.
ANS8002I Highest return code was -50.”

The above error was encountered when the IDLETIMEOUT parameter was set to the value: “999999999″.

Sbtio.log grows very large very quickly

Tivoli misunderstood how tracing worked and released the Tivoli Data Protection Agent 2.2 with tracing turned on to the maximum. Tivoli provided a patch that defaulted the tracing to level 0. The patch may be found on:

ftp://index.storsys.ibm.com/tivoli-storage-management/patches/tivoli-data-protection/oracle/platform

10. Hints, tips and notes

HP/UX

The 805 64bit on HP is different than other 64bit versions. There is no lib64 directory. Also the libobk is not a shared library (.sl) but an archive library (.a).

11. Q&A

Why is my incremental backup not faster than my full backup?

Incremental backups increase the speed when the network is the bottleneck, when the tape devices are local the backup will take a similar amount of time.

12. Cloning

In a disaster recovery scenario, it is assumed that you are essentially recreating a system that will have all of the same logical characteristics as the original node. However to clone a database to a different server that will in Tivoli terms constitute a new node requires a slightly more complex procedure than that used for a disaster recovery.

To restore a database to a different node using Tivoli requires that you configure the system to make RMAN believe that the media manager node is the same as the one that was made to make the original backup.

Set the parameter “TDPO_NODE” in the “tdpo.opt” to be the TSM node that performed the backup.

Set the TDPO_OWNER to be the database owner on the new machine if the owner is different to that on the original machine.

Set the environment variable DSMO_NODE to be the name of the original node and then generate the TDPO password file.

Restore the database.

Change the TDPO_NODE parameter in the tdpo.opt file back to be the node name for that machine.

Set the environment variable DSMO_NODE to be the name of the new node and then regenerate the TDPO password file.

See metalink notes: 183025.1, 232107.996, 1020252.102
13. Uninstalling

To upgrade Oracle you will need to unlink the TDPO library from the current release of oracle, upgrade oracle and then relink.

You must recreate the symbolic link between the libobk.a dynamic library and the $ORACLE_HOME/lib/libdsbtsh8.a which is the standard Oracle I/O library.

Unlinking e.g.:

$ cd $ORACLE_HOME/lib

$ rm libobk.a
$ ln -s libdsbtsh8.a libobk.a

14. Performance

If DSMO_AVG_SIZE is defined too large for the defined disk storage pool, ADSM will skip the disk storage pool and use the secondary tape storage pool instead, without first writing to the disk storage pool.
(Metalink 1050834.6)

There is a bug in Oracle 8.1.7.4 that can lead to deteriorating performance of backups; a work around for this bug is to analyze all the tables in the RMAN recovery catalog. (Metalink 279968.995)

Your first step is probably to identify (or narrow down the list of suspects for) the bottleneck.

How fast is your network connection, and how busy? On a really good day, a *dedicated* 10Mbit Ethernet will give you about 1MB/sec throughput. That’s about 3.5 GB/hour and about 30 hours to backup 100 GB. (I’m assuming that you meant “100GB” rather than “100MB” as your database size).

A single DLT 7000 has a native transfer rate of about 5MB/sec. Under ideal conditions, and without compression, this will give you a maximum of 18 GB/hour; if your backups are not buffering well, the tape drive may be stopping and starting a lot (shoe-shining), and your effective transfer rate will plummet. (Turning compression on could actually hurt sometimes; this will double — or more — the rate that the tape drive can accept data, increasing the possibility of start/stop problems.)

What is your database server doing during the backup? Is the CPU busy? Does it page a lot; are you getting large amounts of I/O wait? Setting filesperset=100 may be a mistake. If you’re seeing a high %WIO, try setting this to the number of disks on your system (or less). Unless your disks are more than 10 years old, serial reads on one or two disks should be able to easily exceed the bandwidth of most networks.

It is recommended that 2 channels per device are allocated, but test to verify the performance improvement.

See also

http://otn.oracle.com/deploy/availability/pdf/rman_performance_wp.pdf

SAP includes its own parameter file for RMAN:

$ORACLE_HOME/dbs/sap/init.sap

This file contains parameters that override the default RMAN parameters.

E.g. of init.sap parameter:

RMAN_FILESPERSET = 10

16. Sources of additional Tivoli DPO information

Supported platforms:

http://www.tivoli.com/products/index/storage-mgr-db/platforms.html

Certification matrix:

http://www.tivoli.com/support/storage_mgr/adoracjp.htm

Product information

http://www.tivoli.com/products/index/data_protect_oracle/

http://www.oracle.com/ip/deploy/database/features/recovery/index.html?recovery.html

Support information / hardware requirements

http://www.tivoli.com/support/storage_mgr/adagent.htm

Other useful Documentation

Oracle 9i RMAN Backup and Recovery.

IBM Redbook:

Backing Up Oracle Using Tivoli Storage Management

http://www.redbooks.ibm.com/abstracts/sg246249.html

Note that this book describes TDPO v2.1, Not v2.2

Documentation for tdposync:

http://publib.boulder.ibm.com/tividd/td/DPON/SH26-4112-02/en_US/HTML/anou0010.htm

RMAN performance tuning document:

http://otn.oracle.com/deploy/availability/pdf/rman_performance_wp.pdf

17. Glossary

TDP – Tivoli Data Protection

TDPO – Tivoli Data protection for Oracle

18. TDPO error codes reported back to RMAN

‘*’ Are internal and should never be seen during normal operation:

/* Error codes for sbtopen */
  7000 /* file not found (only for read) */
  7001 /* file exists (only for write) */
 * 7002 /* bad mode specified */
 * 7003 /* bad block size specified */ Bad tpblksiz, should be multiple of 512
  7004 /* no tape device found */
  7005 /* device is busy; try again later */
  7006 /* tape volume not found */
  7007 /* tape volume is in-use */
  7008 /* I/O Error */
  7009 /* cant connect with Media Manager */
  7010 /* permission denied */
  7011 /* OS error */ system error – e.g. malloc, fork errors
  7012 /* invalid argument(s) */ invalid argument(s) to sbtopen
  7013 Backup file exists

/* Error codes for sbtclose */
  7020 /* bad th (no sbtopen done) */
  7021 /* bad flags */
  7022 /* I/O error */
  7023 /* OS error */
  7024 /* invalid argument(s) */
  7025 /* cant connect with Media Manager */

/* Error codes for sbtwrite */
 * 7040 /* bad th */ bad th to sbtwrite
  7042 /* I/O error */
  7043 /* OS error */
 * 7044 /* invalid argument(s) */ invalid argument(s) to sbtwrite

/* Error codes for sbtread */
 * 7060 /* bad th */ bad th to sbtread
  7061 /* EOF encountered */
  7063 /* I/O error */
  7064 /* OS error */
 * 7065 /* invalid argument(s) */ invalid argument(s) to sbtread

/* Error codes for sbtremove */
  7080 /* backup file not found */
  7081 /* backup file being used */
  7082 /* I/O Error */
  7083 /* cant connect with Media Manager */
  7084 /* permission denied */
  7085 /* OS error */
 * 7086 /* invalid argument(s) */ invalid argument(s) to sbtremove

/* Error codes for sbtinfo */

7090 /* backup file not found */
  7091 /* I/O Error */
  7092 /* cant connect with Media Manager */
  7093 /* permission denied */
  7094 /* OS error */
 * 7095 /* invalid argument(s) */ invalid argument(s) to sbtinfo

/* Error codes for sbtinit */
 * 7110 /* invalid argument(s) */ invalid argument(s) to sbtinit
  7111 /* OS error */ system error – e.g. malloc, fork errors

* 7120 th was not of a previously sbtpopened one (sbtclose)
 * 7121 Bad Flags to sbtclose
  7122 I/O ERROR
  7123 System Error
  7124 Can’t connect with Media Manager
 * 7125 Invalid Argument to sbtclose

Common error codes.
  Can be issued by any SBT V2 function. **********************************************************
  7500 /* Operator intervention requested */
  7501 /* Catch-all media manager error */
  7502 /* File Not Found */
  7503 /* File already exists */
  7504 /* End of File */
  7505 /* cant proxy copy the specified file */
  7506 /* no proxy work in progress */

April 24, 2007

UNIX debuggers & archive.org

Filed under: Pro*C, Unix — bigdaveroberts @ 4:44 pm

It never ceases to surprise me, the number of different debuggers produced.

adb, sdb, gdb, dbx, mdb and edb.

edb?

I came across edb while trying to find a solution to a novel Oracle problem.

After compiling a newly amended Pro*C program, it failed to work, at all.

To investigate the problem, I ran the program under truss to see what was happening.

To my surprise, Oracle was making 50 attempts to open a network connection, and after failing 50 times, was exiting in an ungracious manor.

My next step I was planning to take was to run it through a debugger.

I started looking for all the usual culprits, but, disappointingly none seemed to be present.

As an alternative to attempting to get gdb compiled/installed, I attempted a search for any 3 letter programs ending db, and discovered the improbably named edb debugger. (Improbable, until I found out that it was written by Edinburgh Portable Compilers)

So I went off onto the internet to see if I could locate some documentation.

Unfortunately, while the website did still exist at that time (www.epc.co.uk), the documentation was incomplete and/or corrupt.

It was then that an ex-collogue suggested that I try archive.org.

And, while my employer had blacklisted the site, I popped over to the customer, and located several iterations of the companies’ website, replete with multiple iterations of the edb manual!

In the end, by the time that I had returned with my documentation, the problem had already been resolved by my original colleague.

And what was the cause of this odd behavior?

Well, the makefile had hard coded directories for the include files, and the wrong version of the sqlca structure had been used, and when linked with libraries from a different release, produced the original bizarre observed behavior.

April 12, 2007

Async I/O and all that.

Filed under: AIX, Unix — bigdaveroberts @ 10:10 am

While I do like to characterise myself as a DBA, I have only had one all nighter, which sort of exposes the conceit of my claim.

The system involved was a large (for the time) data warehouse, and the database had crashed unexpectedly on the day before the main weekly data load. (If I recall correctly, it was due to a control file timeout.) While it was not unusual for the system to fail, (mainly due to the fact that the database files were sparse, for some unknown historical reason making space management somewhat problematic), as the weekly load was due shortly the system was being monitored more closely on the evening that the main loads commenced.

After the data load stated the system crashed again. The system was recovered, looked fine, and the data load was initiated again, and again the database crashed. The cycle repeated. I believe that in all the database crashed 11 times in close succession.

To be honest I handed the problem over to another DBA in the morning, and they initiated a Disaster Recovery, while continuing to investigate the problem.

The first problem found was one identified by Oracle. We had recently enabled Parallel DML, and this had implicitly enabled parallel recovery. Oracle 8.1.7.2 apparently had issues with parallel recovery which made it somewhat less efficient than normal recovery. This meant that Oracle continued to perform recovery operations long after the database had been restarted.

The issue was that having had one crash increased the workload/contention in the database causing the multiple additional crashes.

What had caused the original crash?

Well, at least 3 issues.

1. A new contract developer had been recruited by the client who didn’t realise what a smart idea removing the indexes and recreating them after a data load was. This had over time, as his code had been promoted to live substantially increased the work done during the data loads.

2. There was contention between the control files and the redo logs due to poor file locations.

3. The system was running on AIX, and The oracle manual for configuring AIX asynchronous I/O was inadequate. While the manual had recommended increasing the maxioservers and the minioservers, it made no recommendations regarding the queue length.

There were also hints of a conflict between IBM and Oracle, in that it was suggested that the underlying cause of the Asynchronous I/O problem was from a rewrite of IBM’s Async I/O subsystem, and subsequently when errors began to appear from the Oracle database, Oracle blamed IBM, as Oracle had worked without errors before the change.

In terms of problem resolution.

The first moves were to disable parallel recovery with the assistance of Oracle support and a temporary suspension of the new data load, along with disabling Async I/O in oracle. (At my initiative.)

The result was that the crashes stopped.

I was instructed to re-instate async i/o by my team leader, as in his view, as the customer was paying for gold support we should get a proper resolution from Oracle rather than using a work around that would potentially reduce the performance of their system.

Further investigation revealed an increasing trend in [EAGAIN] warnings being logged by Oracle, and with the benefit of a large control file, a rather elegant graph of the deterioration in the time taken to archive redo log files (due to disk contention) was produced using the following SQL:


SELECT recid,
       To_Char( next_time, 'YYYY-MON-DD HH24:MI:SS' ),
       (completion_time - next_time) * ( 24 * 60 * 60 ) "Seconds to archive"
  FROM v$archived_log
 ORDER BY recid
/

We then spent a period of time, finding ‘colder’ disks for the location of the control files, and with the production of a script to log the number of [EAGAIN] warnings per hour we proceeded to tune the async I/O parameters sufficiently to minimise the number of warnings produced.

The end result were parameter settings far in excess of the recommendations in the installation guide, and only in fact limited by the number of async i/o processes that Oracle would start with. (This turned out to be about 350.) Also a script that automatically counted the number of [EAGAIN] warnings which alerted us by email if the number went over a certain threshold.

To an extent, that was the end of the problem. Although after a couple of months one of the developers spotted that a query would return differing results depending on which execution plan was used. After investigation, an attempt was made to rebuild the index, which failed. And having passed the call to Oracle we were recommended to upgrade to 8.1.7.3. After this upgrade had been performed, we were then able to rebuild the index. Unfortunately the disparity in the results persisted, and we eventually dropped and rebuilt the table.

While the corruption can’t be proven to be related to the large number of crashes encountered, it does seem to be more than co-incidence that a quite bizarre problem occurs shortly after a substantial number of system crashes.

March 21, 2007

Technorati favouries button

Filed under: Uncategorized — bigdaveroberts @ 2:51 pm

Add to Technorati Favorites

« Previous PageNext Page »

Blog at WordPress.com.