The Oracle Database Flaw

The Oracle flaw: Clarifications and more information

In the wake of InfoWorld's exclusive story on a flaw in Oracle's flagship database product, Oracle weighs in and new developments emerge

Since InfoWorld published "Fundamental Oracle flaw revealed" on Jan. 17, we've received abundant feedback from Oracle users and consulted with Oracle representatives, who went through the story point by point, offering clarifications and additional details, including information about the patches that address the flaw.

Moreover, recognized Oracle expert Riyaj Shamsudeen, president of the database services company OraInternals, has zeroed in on another aspect of the flaw in a Jan. 20 post entitled "SCN -- What, why, and how?" Shamsudeen notes that he had held off posting the blog entry "for many months." In a reference to InfoWorld's article, he adds: "Since this issue is in the public knowledge domain, I can share the knowledge without any repercussions."


Before we address this new development, a quick recap of where things stand: Oracle has acknowledged that InfoWorld uncovered undocumented, manual methods to raise the Oracle SCN (System Change Number) -- a sort of "time stamp" for every Oracle transaction -- which could cause an Oracle database to hit the SCN limit and cease to function properly. Oracle also corroborates another key point in the story: that elevated SCN values can be passed among Oracle databases, so that in heavily linked environments, those values may spread quickly.

But with steadfast consistency, Oracle has characterized the risk posed by these problems as minimal. In a conversation several days after publication, Oracle's Mark Townsend, vice president of database product management, told InfoWorld that the hot-backup bug described in the story was, in all likelihood, the only way Oracle Database systems might reach the SCN limit. (That bug is confined to Oracle Database 11g releases 11.1.0.7 and 11.2.0.2 and is listed as bug 12371955: "High SCN growth rate from ALTER DATABASE BEGIN BACKUP in 11g.")

Since our last conversation with Oracle, however, we have discovered a new method by which the SCN might be manually elevated -- and have received corroboration for an additional scenario we could only speculate about until Shamsudeen confirmed it in his post.

SCN rising: Two new possibilities
From the outset, InfoWorld has made a clear distinction between two aspects of the Oracle flaw: manual methods by which a bad actor might raise the SCN value of a database and cause it to hit the limit, and organically elevated SCN values that arise through a bug such as the hot-backup bug. Both cases can result in extreme SCN value increases among interconnected databases.

The three manual methods InfoWorld discovered, which are blocked by patches in Oracle's January 2012 Critical Patch Update, could be used to facilitate an attack on an Oracle database requiring only minimal privileges. By contrast, the risk of elevated SCN numbers spreading and causing problems appears to be confined to large Oracle environments in which databases are heavily interlinked.

Now a fourth method of manually raising the value of the SCN has emerged. In the interest of protecting the security of Oracle customers, we will refrain from describing this method.

More important, we believe, is the organic elevation of the SCN value described by Shamsudeen that could occur in heavily linked environments.

As we explained in the original article, Oracle databases have a constantly moving "soft" SCN limit based on the number of seconds that have elapsed since Jan. 1, 1988, times the number 16,384. Even today, few systems could or would exceed that per-second transaction rate on an ongoing basis -- and certainly not every second over the course of 24 years.
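The soft-limit formula can be sketched in a few lines of Python. This is a minimal illustration, assuming the elapsed time is counted in whole calendar seconds from midnight UTC on Jan. 1, 1988; the helper name `scn_soft_limit` is hypothetical, and Oracle's internal calculation may differ in detail.

```python
from datetime import datetime, timedelta, timezone

# Multiplier and epoch as described in the article (UTC midnight is an assumption).
SCN_PER_SECOND = 16_384
SCN_EPOCH = datetime(1988, 1, 1, tzinfo=timezone.utc)

def scn_soft_limit(when: datetime) -> int:
    """Hypothetical helper: whole seconds since the epoch times 16,384."""
    elapsed_seconds = (when - SCN_EPOCH) // timedelta(seconds=1)  # exact integer
    return elapsed_seconds * SCN_PER_SECOND

if __name__ == "__main__":
    jan_2012 = datetime(2012, 1, 1, tzinfo=timezone.utc)
    print(f"{scn_soft_limit(jan_2012):,}")
```

For Jan. 1, 2012, this works out to roughly 12.4 trillion, and the limit then climbs by 16,384 every second.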

True, in high-end Oracle installations today, systems periodically process more than 16,384 transactions per second, but only in bursts, typically when running batch processes. Average throughput is much lower, so in an isolated system those bursts could never push the SCN forward faster than an average of 16,384 per second. Yet as Shamsudeen describes, a group of interlinked databases can indeed exceed that rate, because linked databases sync up by adopting whichever SCN value is highest. Shamsudeen outlines the phenomenon like this:

[The] problem comes if many interconnected databases [are] each generating at [a] higher rate in kind of round-robin fashion. DB1 generates 20K SCNs per second in the first 5 minutes, DB2 generates 20K SCNs per second in the next 5 minutes, DB3 generates 20K SCNs per second in the next 5 minutes, etc. In this case, all three Databases will have a sustained 20K SCNs per second rate. [The] database[s are] slowly catching up to soft limit (1 second per every 4 second exactly) and again, it will take many years for them to catch up to the soft limit assuming the databases are active, continuously. But, there is that infamous, hated by my client, hot backup bug.
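Shamsudeen's round-robin scenario can be illustrated with a toy simulation. This is a sketch under simplifying assumptions, not a model of Oracle internals: each database is idle except during its own five-minute turn, a burst means 20,480 SCNs per second (reading his "20K" as 1.25 times 16,384, which makes his "1 second per every 4 second" figure exact), and linked databases adopt the highest SCN in the group once per simulated second. The function name `average_scn_rates` is hypothetical.

```python
SOFT_LIMIT_RATE = 16_384  # SCNs per second by which the soft limit advances

def average_scn_rates(num_dbs, burst_rate, window_s, total_s, linked):
    """Return each database's average SCN growth per second.

    Hypothetical toy model: the databases take turns running a burst of
    `burst_rate` SCNs/s for `window_s` seconds. If `linked` is True, every
    database adopts the highest SCN in the group after each second.
    """
    scns = [0] * num_dbs
    for t in range(total_s):
        active = (t // window_s) % num_dbs  # whose turn it is
        scns[active] += burst_rate
        if linked:
            highest = max(scns)
            scns = [highest] * num_dbs
    return [scn / total_s for scn in scns]

if __name__ == "__main__":
    burst = 20_480
    isolated = average_scn_rates(3, burst, 300, 900, linked=False)
    linked = average_scn_rates(3, burst, 300, 900, linked=True)
    print("isolated:", [round(r) for r in isolated])
    print("linked:  ", [round(r) for r in linked])
    # Seconds of headroom lost per elapsed second once the group is linked
    print("headroom lost per second:", (linked[0] - SOFT_LIMIT_RATE) / SOFT_LIMIT_RATE)
```

In this model each isolated database averages about a third of the burst rate, comfortably under 16,384, while the linked group sustains the full 20,480 per second, so the margin to the soft limit shrinks by a quarter of a second every second.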

In other words, in interlinked Oracle environments where the backup bug has already raised SCN values close to the SCN soft limit, this "ping pong" type of escalation could eventually erode the margin between dangerously elevated SCN values and the limit itself.

In addition, in the wake of the story's publication, a source who asked to remain anonymous confirmed to InfoWorld that at least one Oracle environment was now experiencing this ping pong phenomenon in the real world.

Clearly, Oracle shops with databases in this condition need to monitor SCN levels closely. Oracle has released new monitoring scripts and even a color-coded system to tell DBAs how close they are to the SCN soft limit. New warnings are generated when SCN values hit a threshold that Oracle declined to specify to InfoWorld. The warning advises calling Oracle support immediately; presumably, the remediation support reps recommend depends on the circumstances behind the elevated SCN.
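Because Oracle did not disclose its warning threshold, the following is only a sketch of how SCN headroom might be measured and color-coded. It assumes headroom is the gap between the soft limit and the current SCN, expressed as days of soft-limit growth at 16,384 per second. The functions `headroom_days` and `classify_headroom` and the 60-day and 10-day cut-offs are hypothetical placeholders, not Oracle's scripts or values.

```python
SCN_PER_SECOND = 16_384
SECONDS_PER_DAY = 86_400

def headroom_days(current_scn: int, soft_limit: int) -> float:
    """Gap between the soft limit and the current SCN, in days of limit growth."""
    return (soft_limit - current_scn) / (SCN_PER_SECOND * SECONDS_PER_DAY)

def classify_headroom(current_scn: int, soft_limit: int,
                      green_days: float = 60.0, red_days: float = 10.0) -> str:
    """Hypothetical traffic-light rating; the day thresholds are placeholders."""
    days = headroom_days(current_scn, soft_limit)
    if days >= green_days:
        return "green"
    if days >= red_days:
        return "yellow"
    return "red"  # hypothetical stand-in for Oracle's "call support" warning

if __name__ == "__main__":
    limit = 100 * SECONDS_PER_DAY * SCN_PER_SECOND
    print(classify_headroom(limit - 5 * SECONDS_PER_DAY * SCN_PER_SECOND, limit))
```

Expressing headroom in days rather than raw SCNs makes the number independent of the current date, since the limit itself grows by the same amount every day.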

First details on the Oracle patches
Oracle has provided InfoWorld with additional information about the "inoculation" patch mentioned in the original story. When applied to an Oracle database instance, the patch causes the database to refuse connections with other databases whose SCN values Oracle considers too close to the SCN soft limit. In effect, the patch introduces a second soft limit.

In mathematical terms: without the patch, an Oracle database accepts a connection from another database as long as the transmitted SCN is x-1 or lower, but not x or x+1 (where x is the SCN soft limit at the second the connection is attempted). With the patch, the database refuses a connection if the connecting database's SCN is x-y or higher, where y is a time-based value that Oracle has declined to reveal to InfoWorld.
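The two admission rules can be written as a pair of checks. This is a sketch of the logic as described, assuming y is a number of seconds converted to SCNs at 16,384 per second (the article says only that y is a time value). Because Oracle has not disclosed y, the `margin_seconds` parameter and the value used below are hypothetical, as are both function names.

```python
SCN_PER_SECOND = 16_384

def accepts_unpatched(incoming_scn: int, soft_limit: int) -> bool:
    """Without the patch: accept x-1 or lower, refuse x and above."""
    return incoming_scn <= soft_limit - 1

def accepts_patched(incoming_scn: int, soft_limit: int, margin_seconds: int) -> bool:
    """With the patch: refuse anything at or above the second limit x - y.

    Assumption: y is margin_seconds worth of SCNs at 16,384 per second.
    """
    second_limit = soft_limit - margin_seconds * SCN_PER_SECOND
    return incoming_scn < second_limit

if __name__ == "__main__":
    x = 10**12                # illustrative soft limit
    incoming = x - 100        # just below the original limit
    print(accepts_unpatched(incoming, x), accepts_patched(incoming, x, 3600))
```

An SCN a hundred ticks below the soft limit passes the original check but fails the patched one, which is exactly the gap the second limit is meant to close.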

The patch will indeed prevent a database from accepting an elevated SCN that could push it to the soft limit during normal processing and cause problems ranging from lost transactions to a database shutdown. But it may also interfere with normal operations if the calling database has acquired an elevated SCN through a bug or other means. A database with a sufficiently elevated SCN may be unable to link with patched databases until enough time has elapsed for the new, second limit to rise above its SCN.
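How long such a database would be shut out can be estimated with simple arithmetic: the second limit rises by 16,384 every second, so the wait is the SCN's excess over that limit divided by 16,384. The sketch below assumes the calling database's SCN stays flat while it waits (any further activity lengthens the wait) and that the second limit lies `margin_seconds` behind the soft limit, with `margin_seconds` a hypothetical stand-in for Oracle's undisclosed y. The function name is also hypothetical.

```python
SCN_PER_SECOND = 16_384

def seconds_until_accepted(calling_scn: int, soft_limit_now: int,
                           margin_seconds: int) -> int:
    """Whole seconds until calling_scn falls strictly below the second limit.

    Assumes calling_scn stays constant while the limits keep advancing.
    """
    second_limit = soft_limit_now - margin_seconds * SCN_PER_SECOND
    excess = calling_scn - second_limit
    if excess < 0:
        return 0  # already acceptable
    # Need second_limit + t * 16,384 > calling_scn, i.e. t > excess / 16,384
    return excess // SCN_PER_SECOND + 1

if __name__ == "__main__":
    x = 10**12  # illustrative soft limit
    print(seconds_until_accepted(x, x, 3600))
```

Under these assumptions, a database sitting exactly at the soft limit would wait one second longer than the margin itself.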

Future throughput
In rolling out the patches in its January 2012 Critical Patch Update, Oracle revoked a patch described in "Fundamental Oracle flaw revealed" that increased the SCN soft limit multiplier to 32,768 and allowed admins to further increment that multiplier. According to Oracle, "this capability was withdrawn by Oracle as it potentially can incur more problems than it solves, as further outlined in the story," such as the potential for two databases working with different multipliers to refuse to connect to each other.

This means that Oracle is stuck with the 16,384 multiplier for the foreseeable future, unless it can come up with a way to completely segment databases that use the higher calculation from ever linking with databases that use the lower calculation.
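A short calculation shows why mixing multipliers causes trouble. The sketch assumes both databases count elapsed seconds from the same Jan. 1, 1988 epoch and differ only in the multiplier; the `soft_limit` helper is hypothetical. An SCN that is legitimate under a 32,768 multiplier can sit far above the 16,384-based limit, so a database still using the lower calculation would refuse it.

```python
SECONDS_SINCE_EPOCH = 757_382_400  # Jan. 1, 1988 to Jan. 1, 2012 (UTC)

def soft_limit(elapsed_seconds: int, multiplier: int) -> int:
    """Hypothetical helper: elapsed seconds times the SCN multiplier."""
    return elapsed_seconds * multiplier

low = soft_limit(SECONDS_SINCE_EPOCH, 16_384)
high = soft_limit(SECONDS_SINCE_EPOCH, 32_768)

# An SCN halfway between the two limits is valid for the high-multiplier
# database but exceeds what a 16,384-multiplier database will accept.
scn = (low + high) // 2
print(scn < high, scn < low)  # True False
```

Because the two limits diverge by another 16,384 every second, the gap between them only widens over time.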

Then there's Moore's Law: What happens when a single Oracle database instance becomes capable of averaging more than 16,384 transactions per second? As we now know, in highly interconnected environments, multiple instances working in tandem are already closing in on that limit. Servers that have two or three times the processing power of today's fastest servers could further shorten the distance between a high SCN value and the SCN soft limit, posing a special risk to environments where SCN levels are already running hot.
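The effect of faster hardware can be estimated with back-of-the-envelope arithmetic: headroom shrinks only while the sustained average SCN rate exceeds 16,384 per second, and it shrinks by the difference. The inputs below (30 days of headroom; sustained rates of 16,000, 20,000 and 40,000 per second) and the function name are hypothetical, chosen only to illustrate the calculation.

```python
SCN_PER_SECOND = 16_384
SECONDS_PER_DAY = 86_400

def days_until_soft_limit(headroom_scn: int, sustained_rate: float) -> float:
    """Days until the SCN catches the soft limit at a constant average rate.

    Returns infinity if the rate does not exceed the limit's own growth.
    """
    excess = sustained_rate - SCN_PER_SECOND
    if excess <= 0:
        return float("inf")
    return headroom_scn / excess / SECONDS_PER_DAY

headroom = 30 * SECONDS_PER_DAY * SCN_PER_SECOND  # 30 days of headroom
for rate in (16_000, 20_000, 40_000):
    print(rate, round(days_until_soft_limit(headroom, rate), 1))
```

In this hypothetical, doubling the sustained rate from 20,000 to 40,000 SCNs per second shrinks the time to the limit from roughly four and a half months to about three weeks.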

If SCN values were easy to reset, this discussion would be moot. But the fact remains that rebuilding a database is the only way we know of (or that Oracle has disclosed) to roll back the SCN value. Otherwise, customers can shut down systems for a period of time to allow the SCN limit to increase -- an impractical solution for most high-end database environments.

Given the number of undocumented features we have encountered in the course of this investigation, we invite other Oracle experts who may have more information to contact InfoWorld.

This article, "The Oracle flaw: Clarifications and more information," was originally published at InfoWorld.com.

Copyright © 2012 IDG Communications, Inc.