Oracle Observations

April 24, 2007

UNIX debuggers & archive.org

Filed under: Pro*C,Unix — bigdaveroberts @ 4:44 pm

It never ceases to surprise me how many different debuggers have been produced.

adb, sdb, gdb, dbx, mdb and edb.

edb?

I came across edb while trying to find a solution to a novel Oracle problem.

After compiling a newly amended Pro*C program, I found that it failed to work at all.

To investigate the problem, I ran the program under truss to see what was happening.

To my surprise, Oracle was making 50 attempts to open a network connection and, after failing 50 times, was exiting in an ungracious manner.

The next step I was planning to take was to run it through a debugger.

I started looking for all the usual culprits, but, disappointingly, none seemed to be present.

Rather than attempt to get gdb compiled and installed, I searched for any three-letter programs ending in 'db' and discovered the improbably named edb debugger. (Improbable, that is, until I found out that it was written by Edinburgh Portable Compilers.)

So I went off onto the internet to see if I could locate some documentation.

Unfortunately, while the website did still exist at that time (www.epc.co.uk), the documentation was incomplete and/or corrupt.

It was then that an ex-colleague suggested that I try archive.org.

And, while my employer had blacklisted the site, I popped over to the customer's offices and located several archived iterations of the company's website, replete with multiple versions of the edb manual!

In the end, by the time that I had returned with my documentation, the problem had already been resolved by my original colleague.

And what was the cause of this odd behavior?

Well, the makefile had hard-coded directories for the include files, so the wrong version of the sqlca structure had been compiled in; linked against libraries from a different release, the program produced the bizarre behavior originally observed.
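
As a rough sketch of the failure mode (the file and path names below are hypothetical, and this is not the actual program involved), the sqlca definition is pulled in from whichever header the hard-coded include path points at, while the link step can quite happily use client libraries from another release:

/* sqlca_mismatch.pc - hypothetical Pro*C fragment.
 * Which sqlca definition gets compiled in is decided purely by the
 * include search path, e.g. a stale entry such as
 *   -I/opt/oracle/old_home/precomp/public
 * hard-coded in the makefile, even though the link line may pull in
 * libraries from the current release - the mismatch only shows up
 * at run time. */

#include <sqlca.h>               /* resolved via the -I search path */

int main(void)
{
    /* EXEC SQL CONNECT :username IDENTIFIED BY :password;
       -- the precompiler-generated code reads and writes the sqlca
       -- structure, so its layout must match the linked library */
    if (sqlca.sqlcode != 0)
        return 1;
    return 0;
}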

April 12, 2007

Async I/O and all that.

Filed under: AIX,Unix — bigdaveroberts @ 10:10 am

While I do like to characterise myself as a DBA, I have only had one all-nighter, which sort of exposes the conceit of my claim.

The system involved was a large (for the time) data warehouse, and the database had crashed unexpectedly on the day before the main weekly data load. (If I recall correctly, it was due to a control file timeout.) It was not unusual for the system to fail, mainly because the database files were sparse for some unknown historical reason, which made space management somewhat problematic; but with the weekly load due shortly, the system was being monitored more closely on the evening that the main loads commenced.

After the data load started, the system crashed again. The system was recovered and looked fine, the data load was initiated again, and again the database crashed. The cycle repeated; I believe that in all the database crashed 11 times in close succession.

To be honest, I handed the problem over to another DBA in the morning, and they initiated a disaster recovery while continuing to investigate the problem.

The first problem found was one identified by Oracle. We had recently enabled parallel DML, and this had implicitly enabled parallel recovery. Oracle 8.1.7.2 apparently had issues with parallel recovery that made it somewhat less efficient than normal recovery, which meant that Oracle continued to perform recovery operations long after the database had been restarted.

The issue was that having had one crash increased the workload and contention in the database, causing the multiple additional crashes.

What had caused the original crash?

Well, at least 3 issues.

1. A new contract developer had been recruited by the client who didn't realise what a smart idea it is to remove the indexes before a data load and recreate them afterwards. Over time, as his code was promoted to live, this substantially increased the work done during the data loads.

2. There was contention between the control files and the redo logs due to poor file locations.

3. The system was running on AIX, and the Oracle manual for configuring AIX asynchronous I/O was inadequate. While the manual had recommended increasing the maximum and minimum numbers of async I/O server processes (maxservers and minservers), it made no recommendations regarding the request queue length.

There were also hints of a conflict between IBM and Oracle: it was suggested that the underlying cause of the asynchronous I/O problem was a rewrite of IBM's async I/O subsystem, and when errors subsequently began to appear from the Oracle database, Oracle blamed IBM, as Oracle had worked without errors before the change.

In terms of problem resolution:

The first moves were to disable parallel recovery with the assistance of Oracle support, to temporarily suspend the new data load, and to disable async I/O in Oracle (the last at my initiative).

The result was that the crashes stopped.

I was instructed by my team leader to reinstate async I/O: in his view, as the customer was paying for gold support, we should get a proper resolution from Oracle rather than use a workaround that would potentially reduce the performance of their system.

Further investigation revealed an increasing trend in [EAGAIN] warnings being logged by Oracle, and, with the benefit of a large control file (which held a long history in v$archived_log), a rather elegant graph of the deterioration in the time taken to archive redo log files (due to disk contention) was produced using the following SQL:

SELECT recid,
       To_Char( next_time, 'YYYY-MON-DD HH24:MI:SS' ),
       ( completion_time - next_time ) * ( 24 * 60 * 60 ) "Seconds to archive"
  FROM v$archived_log
 ORDER BY recid
/

We then spent a period of time finding 'colder' disks on which to locate the control files, and, with the production of a script to log the number of [EAGAIN] warnings per hour, we proceeded to tune the async I/O parameters sufficiently to minimise the number of warnings produced.

The end result was parameter settings far in excess of the recommendations in the installation guide, limited in fact only by the number of async I/O processes that Oracle would start with (this turned out to be about 350), plus a script that automatically counted the number of [EAGAIN] warnings and alerted us by email if the number went over a certain threshold.
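
The counting itself need not be anything sophisticated; a rough sketch of the idea in C (the alert log path and the threshold are assumptions, and the original was a script that bucketed the counts per hour and sent email, whereas this just counts and sets an exit status) might look like:

/* eagain_count.c - rough sketch only: count [EAGAIN] warnings in the
 * alert log and signal via the exit status when a threshold is passed.
 * Path and threshold are assumptions; the real script counted per hour
 * and alerted by email instead. */
#include <stdio.h>
#include <string.h>

#define THRESHOLD 100L                  /* assumed alert threshold */

int main(void)
{
    const char *log = "/oracle/admin/DWH/bdump/alert_DWH.log";  /* assumed path */
    char line[1024];
    long count = 0;
    FILE *fp = fopen(log, "r");

    if (fp == NULL) {
        perror(log);
        return 1;
    }
    while (fgets(line, sizeof line, fp) != NULL)
        if (strstr(line, "EAGAIN") != NULL)
            count++;
    fclose(fp);

    printf("%ld EAGAIN warnings found\n", count);
    return count > THRESHOLD ? 2 : 0;   /* non-zero exit drives the alert */
}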

To an extent, that was the end of the problem, although after a couple of months one of the developers spotted that a query would return differing results depending on which execution plan was used. After investigation, an attempt was made to rebuild the index, which failed. Having passed the call to Oracle, we were advised to upgrade to 8.1.7.3. After this upgrade had been performed, we were able to rebuild the index. Unfortunately, the disparity in the results persisted, and we eventually dropped and rebuilt the table.

While the corruption can't be proven to be related to the large number of crashes encountered, it does seem to be more than coincidence that quite a bizarre problem occurred shortly after a substantial number of system crashes.

February 28, 2007

The first tip.

Filed under: Unix — bigdaveroberts @ 3:14 pm

During a project in '92 at my second job, where at certain times we would build a test machine with the latest version of the software we were developing and package it up for the customer to test at their offices, we encountered an unusual performance problem.

One of the peculiarities of the project was that the customer had been sold a solution based around Unix V4 on an Intel platform, whereas, at the development stage, the only machines available ran Unix V3 on Motorola-based hardware.

The initial releases based on the old hardware worked fine (apart from the bugs) and formed the basis of the initial pilots, which were successful.

However, after moving the development over to the new platform, when we produced a system and passed it to the customer for testing, it ran like a dog. The kit was returned to us and tested, and we confirmed the presence of a severe performance problem.

After initial investigation, the problem was put down to hardware (this was, after all, a new hardware platform), and a new machine was built, tested and dispatched to the customer, where, once again, the system performed more like a three-legged lapdog than a greased whippet!

A simple illustration of the scale of the problem was the ps command. On our development box it ran in half a second; on the similarly specced box returned by the customer, it took a full six seconds to run!

After some analysis with truss, it was found that there was a divergence after the xstat system call was executed with a parameter of ‘/etc/passwd’.

As the contents of the file were similar and permissions on the file were the same, the problem seemed to be caused by the timestamp on the file.

How had this happened?

Well, the process of preparing the system for the customer included a final step where the passwords were changed from the private passwords we used to ones that the customer knew.

The first thing that the customer did for their testing was to reset the system clock to a date from which they had ‘live’ data for the regression testing.

From that point on, the timestamp on the /etc/passwd file was, as far as the operating system was concerned, in the future, and all the cached data derived from it was considered invalid. The system would produce its results without using the cached data and regenerate the cache at the same time, but the regenerated cache would be timestamped with the current machine time, which was still before the timestamp on /etc/passwd, so the cache was invalidated all over again on the next use.

Why?

My guess would be for a combination of two reasons.

1) Unix V4 was the first release with a standard ABI (application binary interface) as well as a standard API on the Intel platform, which in theory would allow any program compiled on one Intel-based server to run on any other. However, at the time the Intel architecture was falling behind the newer RISC-based systems, and to make Intel a viable platform the amount of OS data that was cached was increased to improve performance.

2) I believe that Unix V4 was the first release to gain C2 security accreditation, meaning that UNIX had to rigorously enforce the validity of the cached data on which it relied before each use.

At the end of the day the problem was resolved by touching the /etc/passwd file each time the system date was put back.
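
The check and the fix amount to only a few lines. A small C sketch of the same idea (the wrapping into a standalone program is mine, not how we actually automated it): stat() the file, and if its modification time is ahead of the clock, reset it to 'now' with utime(), exactly as touch(1) does.

/* touch_passwd.c - sketch of the workaround: if /etc/passwd is stamped
 * in the future (e.g. after the system clock has been put back), reset
 * its modification time to the current time, as touch(1) would. */
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>
#include <utime.h>

int main(void)
{
    const char *path = "/etc/passwd";
    struct stat sb;

    if (stat(path, &sb) != 0) {
        perror(path);
        return 1;
    }
    if (sb.st_mtime > time(NULL)) {        /* timestamp in the future */
        if (utime(path, NULL) != 0) {      /* NULL: set atime/mtime to now */
            perror("utime");
            return 1;
        }
        puts("reset future timestamp on /etc/passwd");
    }
    return 0;
}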

On another occasion, the same issue was caused by the last-modified date on the /dev file system being in the future, so the problem isn't isolated to just the passwd file.
