Oracle Observations

July 20, 2007

Re-installing the Microsoft Java VM.

Filed under: Java VM — bigdaveroberts @ 2:42 pm

OK, I know that this isn’t actually Oracle related, but as I trashed my Java installation while attempting to install Oracle 9i, I thought it might be of interest to some nonetheless.

For reference the target platform is Windows 2000 SP4.

And as you are probably aware, Microsoft Java VM support ended at the end of last year, so following these instructions probably breaks your licence agreement.

The problem started because, when I experienced difficulties installing 9i, my suspicion fell on the Microsoft JVM, so I disabled it, and when that failed to work, I uninstalled it, presuming that I would be able to re-install it later.

Having located a copy of javavm.exe, I ran it only to be confronted by an error indicating that I needed a later release of the operating system, or a service pack.

However it became apparent that it wasn’t the underlying program that wouldn’t install, but rather it was the protective(?) wrapper that Microsoft had added over the top.

So these are the steps that I went through to fix the problem:

I ran javavm.exe.

When you run this you will be presented with a dialog box. Select setup, and you will then receive the error message:

The Microsoft VM you are attempting to install is a protected system component and can only be updated with a later release of the operating system or service pack.

Before you hit OK, do a search for the file MSJavaVM.exe and copy it to a safe place (the copy that you have found will be deleted when you hit OK.)

Hit OK.

Run the file that you saved.

You will again receive the same error message (after several other new messages).

Again, before you hit OK, search for the javabase.cab file, and save the whole contents of the directory in which you find the file.

Hit OK.

In the saved directory, find the file java.inf, right click on it and select install.

At this point you may have to change the Java settings in Internet Explorer, under Tools / Internet Options / Advanced.

Then following a reboot, your JVM should be restored.
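The two "search and copy before you hit OK" steps are a race against the installer's own clean-up, so it may help to script them. What follows is only a minimal Python sketch and was not part of the procedure I actually ran: the function name rescue_file, the search root and the destination directory are all hypothetical, and it assumes a Python interpreter is available on the machine.

```python
import os
import shutil


def rescue_file(filename, search_root, dest_dir, whole_directory=False):
    """Find `filename` under `search_root` and copy it to `dest_dir`.

    With whole_directory=True the directory containing the file is copied
    instead. Returns the path of the copy, or None if nothing was found.
    """
    wanted = filename.lower()  # Windows file names are case-insensitive
    for dirpath, _dirnames, filenames in os.walk(search_root):
        for name in filenames:
            if name.lower() != wanted:
                continue
            if whole_directory:
                # Copy everything that sits alongside the file (e.g. java.inf).
                folder = os.path.basename(os.path.normpath(dirpath))
                target = os.path.join(dest_dir, folder)
                shutil.copytree(dirpath, target)
                return target
            os.makedirs(dest_dir, exist_ok=True)
            return shutil.copy2(os.path.join(dirpath, name), dest_dir)
    return None
```

While the first error dialog is on screen you would call something like rescue_file("MSJavaVM.exe", "C:\\", "C:\\jvm_rescue"), and while the second is on screen rescue_file("javabase.cab", "C:\\", "C:\\jvm_rescue", whole_directory=True). Both paths are placeholders. Walking a whole drive is slow, so narrowing the search root to wherever the installer unpacks its files would be sensible if you know it.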

Well it worked for me, and I’m not going to trash my installation again to verify that the solution above is repeatable or reliable!

July 17, 2007

ORA-600 [15015] revisited.

Filed under: ORA-600 — bigdaveroberts @ 1:50 pm

Well Oracle has produced an analysis of the problem based on one of the hundreds of trace files produced.

With hindsight, one of the symptoms I should have mentioned in my original post was that when logging into SQL*Plus you received an error indicating that the set_application_info procedure was invalid.

Oracle’s conclusion was that the cause is a bug in Oracle (1867501): if a process connects to Oracle as SYSDBA and issues commands while the database is starting up, the SGA can sometimes be corrupted.

From the point that this happens, all of the subsequent errors (including the ORA-600) are secondary.

I do like the response, in that it fits my favored scenario of the problem being caused by an unforeseen side effect of a change. I am however suspicious, because the change in question, which scheduled a script to regularly connect to the database as SYSDBA, was implemented more than 12 months ago. So I am still concerned that one of the more recent changes may also be implicated as a secondary cause of the problem.

OTOH the information Oracle has given us allows us to make a change that will avoid the problem in future!

If anyone else encounters the same error, I would be interested in any information you have with regards to what you may have recently done to your system!

July 12, 2007

EAGAIN (again)

Filed under: AIX — bigdaveroberts @ 3:00 pm

One of the more interesting aspects of a blog is the ability to see the search terms that visitors typed into the search engine that led them here.

Thus it is possible for me to know that almost every day someone searches for EAGAIN, and looks at my blog on performance problems using async I/O on AIX.

As that blog entry covers a number of issues, I think that it might be worthwhile to revisit this subject and dedicate a single post to the subject of EAGAIN warnings under AIX.

The history of the EAGAIN problem under AIX as I understand it.
(Based largely on supposition rather than hard fact!)

When IBM originally produced the Asynchronous I/O subsystem for buffered file systems on AIX 3, the solution implemented was sub-optimal, in that on occasions it would unnecessarily lock the inode, and not actually always be asynchronous.

Oracle then used IBM’s asynchronous I/O API to implement async I/O on AIX.

There are then 2 possibilities as to what happened.

Oracle gave insufficient instructions in the setup guide concerning async I/O configuration in the AIX environment, and when IBM re-wrote the async I/O subsystem, Oracle began to generate EAGAIN errors, indicating a poor configuration that had been hidden by the inefficient initial implementation.

or

When IBM re-wrote the async I/O subsystem they added an additional configuration parameter, which without an appropriate setting resulted in numerous EAGAIN warnings.

What certainly did happen was that IBM introduced new bugs into the system, which required several iterations of patches to resolve.

Whatever the cause, many people running Oracle on AIX encountered an increasing number of “Warning lio_listio returned EAGAIN” messages.

The response of Oracle was to blame AIX, as before the upgrade, the warnings were not occurring, and IBM blamed Oracle, as all they had done was improve the efficiency of their async I/O system.

What should you do if you encounter EAGAIN warnings under AIX?

Firstly you should ensure that the appropriate AIX operating system patches have been applied.

Check the bos.rte.aio level with:

# lslpp -l bos.rte.aio
bos.rte.aio 5.1.0.25 COMMITTED Asynchronous I/O Extension
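If you have a number of servers to check, the comparison can be scripted. The following is a rough Python sketch that pulls the level out of lslpp output shaped like the line above and compares it numerically. The function names are my own, and the minimum level used in the example is purely a placeholder: I am not stating which level actually contains the fixes, so check the relevant IBM APAR for that.

```python
def parse_fileset_level(lslpp_output, fileset="bos.rte.aio"):
    """Return the level of `fileset` as a tuple of ints, or None if absent."""
    for line in lslpp_output.splitlines():
        fields = line.split()
        # Data rows look like: <fileset> <level> <state> <description...>
        if len(fields) >= 2 and fields[0] == fileset:
            return tuple(int(part) for part in fields[1].split("."))
    return None


def is_at_least(level, minimum):
    """Compare level tuples numerically, so that 5.1.0.25 ranks above 5.1.0.9."""
    return level is not None and level >= minimum


# Example using the output shown above; (5, 1, 0, 9) is a made-up minimum.
sample = "bos.rte.aio 5.1.0.25 COMMITTED Asynchronous I/O Extension"
level = parse_fileset_level(sample)
patched = is_at_least(level, (5, 1, 0, 9))
```

The levels are compared as tuples of integers rather than as strings because a plain string comparison would rank 5.1.0.25 below 5.1.0.9.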

Secondly, you should accept that the eradication of EAGAIN warnings is not a guarantee that you have actually resolved the underlying problem, nor does the occasional EAGAIN warning necessarily indicate a problem.

As the basic explanation of the message indicates, the warning is an indication that the I/O system is not running optimally.

In AIX the async implementation consists of a single buffer to contain all disk writes, with multiple write processes executing the write instructions.

When an EAGAIN warning occurs, it is simply an indication that the async I/O write buffer is full, and Oracle will have to absorb the overhead of attempting to write the data to the buffer again.

If you increase the size of the buffer, you will reduce the number of warnings and slightly reduce the workload on Oracle. However, this should not be your first consideration or goal.

The most effective way to increase the efficiency of the system is to increase the rate at which disk writes are completed, and thus removed from the queue, by increasing I/O bandwidth (replacing RAID 5 with mirroring, using faster disks, reducing disk contention etc). Secondly, you should look at reducing the number of disk writes added to the async I/O buffer, by reducing redo generation and by dropping indexes before data loads and recreating them afterwards.
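To illustrate the argument, here is a toy Python model of a fixed-size request queue that refuses new requests with EAGAIN when it is full. It is emphatically not how AIX implements async I/O: the class, the function, and the capacity, arrival and completion numbers are all invented for the illustration.

```python
import errno
from collections import deque


class BoundedRequestQueue:
    """A fixed-capacity queue that rejects submissions when it is full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = deque()

    def submit(self, request):
        if len(self.pending) >= self.capacity:
            return errno.EAGAIN  # caller has to try again later
        self.pending.append(request)
        return 0

    def complete(self, count):
        """Simulate the disks finishing up to `count` queued writes."""
        for _ in range(min(count, len(self.pending))):
            self.pending.popleft()


def simulate(capacity, arrivals_per_tick, completions_per_tick, ticks):
    """Return (number of EAGAIN results, writes still waiting to be queued)."""
    queue = BoundedRequestQueue(capacity)
    eagain_count = 0
    backlog = 0
    for _ in range(ticks):
        backlog += arrivals_per_tick
        while backlog:
            if queue.submit(object()) == errno.EAGAIN:
                eagain_count += 1
                break  # give up for this tick and retry on the next one
            backlog -= 1
        queue.complete(completions_per_tick)
    return eagain_count, backlog
```

With 3 writes arriving and 2 completing per tick, simulate(4, 3, 2, 5) returns (3, 3). Doubling the buffer, simulate(8, 3, 2, 5), returns (0, 0), but only because the run is short: over 10 ticks the same settings return (4, 4). Raising the completion rate instead, simulate(4, 3, 3, 5), returns (0, 0) and stays there however long the run. In this model a bigger buffer only postpones the warnings, while faster writes remove them.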

It is only after using general methods of increasing I/O efficiency that you should turn to tuning the async I/O subsystem itself.

You should consider that while async I/O lets you configure multiple processes to write to a hard disk simultaneously, a hard disk can only physically write to one place at a time. Thus it is only through the combination of NCQ and disk buffers that implementing multiple write processes per disk will actually increase I/O. So even if you do reduce the number of EAGAIN warnings by increasing the number of write processes, that is not a guarantee that the speed of the system has been increased!

Likewise, by increasing the size of the async I/O buffer you may well reduce the number of EAGAIN warnings, but if the memory used could have been better spent on increasing the size of the SGA, then the performance of the system may be reduced, even though the number of warnings has fallen.

Obviously, if you haven’t changed the configuration of the system and the number of EAGAIN warnings is on the increase, then that is an indication of a problem. The solution may well not be in the realm of the DBA, though: it may be that new, inefficient routines are being implemented by the developers.

In short, the EAGAIN warning should not be considered the problem itself. Rather, it should be considered another symptom which, if not eradicated, probably needs to be monitored and managed.
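For the monitoring part, something as simple as counting the warnings per file and watching the trend may be enough. The Python sketch below assumes that the warnings end up in text files in a single directory and that those files match *.trc. Both are assumptions about your setup, as are the function names. It matches on the text "returned EAGAIN" rather than on the whole message.

```python
import glob
import os

EAGAIN_MARKER = "returned EAGAIN"


def count_eagain(lines):
    """Count the lines that contain an EAGAIN warning."""
    return sum(1 for line in lines if EAGAIN_MARKER in line)


def count_by_file(directory, pattern="*.trc"):
    """Return {file name: number of EAGAIN warnings} for matching files."""
    counts = {}
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        # errors="replace" keeps stray binary bytes from aborting the scan
        with open(path, errors="replace") as handle:
            counts[os.path.basename(path)] = count_eagain(handle)
    return counts
```

Run daily and logged, the totals show whether the number of warnings is stable or creeping upwards, which is the thing worth knowing.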

Further information:

IBM documents on tuning Oracle on AIX:

http://www-941.ibm.com/collaboration/wiki/download/attachments/5570/Oracle_on_AIX_WebinarFeb2007.pdf

http://www.sioug.si/sioug2005/datoteka.jsp?filename=Tomaz%20Vincek%20-%20Oracle%2010g%20Performance%20Tuning%20v%20UNIX%20okolju.pdf

For the enthusiastic, AIX documentation about the lio_listio function:

https://www-rz.uni-hohenheim.de/betriebssysteme/unix/aix/aix_4.3.3_doc/ext_doc/usr/share/man/info/en_US/a_doc_lib/aixprggd/kernextc/async_io_subsys.htm

And an interesting Metalink article:

34924.996

IBM response to query about aiostat (a tool that IBM supplied to analyse the volume of asynchronous i/o calls)

http://www-1.ibm.com/support/docview.wss?uid=std3295a41c3f8c73bda49256f66000ecf3d

July 11, 2007

ORA-00600 [15015] and all that.

Filed under: ORA-600 — bigdaveroberts @ 11:35 am

Well last Wednesday was the first fun day (at work) for a long while.

The application, database and OS were all struggling, and I suspect that the network was also experiencing problems.

After consideration, the apparent cause (based on their being the earliest errors we could find evidence of) was a series of repeated ORA-600 errors (predominantly 15015), starting within seconds of the database being restarted after the backup (which also failed).

The errors appeared to be related to the snapshot process, which was dying every 5 minutes and then being automatically restarted by the database.

I looked up ORA-600 [15015] on Google and got no hits. I also looked up 15015 in the Metalink ORA-600 argument lookup tool and received the unhelpful response:

A description for this ORA-600 error is not yet published.

I also searched in the knowledge base including the archived articles and bug database and received no hits.

So we have a stable system on a terminal release (8.1.7.4) that suddenly and for no apparent reason starts kicking out super obscure errors.

And it isn’t as if there have been any significant changes implemented.

There was one change to patch an Oracle bug that reared its head when we started running the client under Citrix, and one to increase the size of the SGA. Both changes were implemented more than a month ago.

Before you get excited, I have to say that I don’t know what the problem is. (Let’s be frank, the only reason you are reading this is because you are experiencing the same error.)

So where do my theories lie?

There is a general tendency for software people to blame hardware when a new problem appears in a stable system, but I wouldn’t initially blame hardware. (/var/adm/messages didn’t have anything novel in it until a disk partition filled from all the core dumps and the problem was resolved by a reboot.)

I also don’t side with the conspiracy theorists who assume that all problems start with an uncontrolled change made by some well-meaning techie. Certainly pkginfo didn’t indicate that the system had been patched or had any new packages installed within the last 12 months.

Generally I find that unexpected problems are most often explained by the unforeseen outcomes of poorly understood changes, and while the 2 changes appear to be superficially innocuous, it is there that my suspicion starts.

The Oracle patch will probably have been installed on multiple systems for multiple customers, and while it may be possible for the interaction between multiple patches to produce unusual results, the simple fact is that the system is rarely patched and is as close to a vanilla install as possible. Thus I think that the patch is unlikely to be the cause.

Thus finally we are left with the SGA increase. This does worry me slightly, in that the size of the SGA is now close to the SHMMAX setting for the maximum shared memory segment size, to the extent that on some mornings we receive a warning:

WARNING: Not enough physical memory for SHM_SHARE_MMU segment of size 0xnnnnnnnn

in the alert log, which Metalink unhelpfully indicates may be serious on some versions and innocuous on others.
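For anyone wanting to check how close their own SGA is to the limit, here is a small Python sketch. It assumes a Solaris-style /etc/system in which the limit is set with a "set shmsys:shminfo_shmmax=..." line, and it leaves you to supply the SGA size in bytes yourself (for example from the figures displayed at instance startup). The function names are mine, and the numbers in the usage note are invented.

```python
import re

# Matches lines such as: set shmsys:shminfo_shmmax=4294967295
SHMMAX_PATTERN = re.compile(
    r"^\s*set\s+shmsys:shminfo_shmmax\s*=\s*(\S+)", re.IGNORECASE
)


def parse_shmmax(etc_system_text):
    """Return the last shmmax value set in the text, or None if not set."""
    value = None
    for line in etc_system_text.splitlines():
        match = SHMMAX_PATTERN.match(line)  # comment lines start with '*'
        if match:
            value = int(match.group(1), 0)  # base 0 accepts decimal or 0x hex
    return value


def headroom_percent(sga_bytes, shmmax_bytes):
    """Percentage of SHMMAX left unused by an SGA of the given size."""
    return 100.0 * (shmmax_bytes - sga_bytes) / shmmax_bytes
```

As an illustration, headroom_percent(900, 1000) gives 10.0. A figure near zero would mean the SGA is as close to the limit as mine appears to be. Note that this only measures the gap against SHMMAX. It says nothing about whether enough physical memory is free, which is what the warning itself complains about.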

So, with my suspicion that there is an issue with shared memory, I have scheduled a cron job to record the output of ipcs -Am before and after each backup window, and will otherwise keep a watching brief on the problem.
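The job itself need not be anything elaborate. The following Python sketch shows the sort of thing I mean and is not the actual job I scheduled: it runs ipcs -Am, stamps the output with the time and a label, and appends it to a log file. The log path, the labels and the function names are hypothetical, and the times at which cron runs it depend on your own backup window.

```python
import subprocess
from datetime import datetime


def format_snapshot(timestamp, label, output):
    """Build one log entry: a header line, the command output, a blank line."""
    header = "=== {} {} ===".format(
        timestamp.isoformat(timespec="seconds"), label
    )
    return header + "\n" + output.rstrip("\n") + "\n\n"


def record_snapshot(log_path, label, command=("ipcs", "-Am")):
    """Run `command` and append its timestamped output to `log_path`."""
    result = subprocess.run(
        list(command),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # keep any error text alongside the output
        universal_newlines=True,
    )
    entry = format_snapshot(datetime.now(), label, result.stdout)
    with open(log_path, "a") as log:
        log.write(entry)
    return entry
```

Calling record_snapshot("/var/tmp/ipcs_history.log", "before-backup") from one cron entry and the same with "after-backup" from another gives a history that can be compared across days once the problem recurs.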

Obviously when Oracle comes up with a response I will post an update.
