A flaw in the Oracle database has been disclosed, whereby the Oracle System Change Number (SCN) – a feature that helps synchronize database events – can outgrow its defined limits. The SCN is an ever-increasing sequence number used to determine the ‘age’ of data. It is incremented automatically by 16K (16,384) per second to provide a time reference, and again each time data is ‘committed’ (written to disk). This enables transactions to be referenced to the second, and ordered within each second. As you might imagine, this is a very large number, with a maximum value and a maximum increase per day. If the SCN passes its maximum value, the database stops completely. The newly disclosed flaw can push the SCN toward that maximum.
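To put those numbers in perspective, here is a minimal Python sketch of the limit arithmetic. It assumes, based on public reporting about this flaw rather than Oracle documentation, that the SCN is a 48-bit unsigned integer and that the highest SCN a database will accept (its "soft limit") grows by 16,384 for each second elapsed since 1 January 1988. The helper names are hypothetical.

```python
from datetime import datetime, timezone

# ASSUMPTIONS (from public reporting, not Oracle documentation):
#   * the SCN is stored as a 48-bit unsigned integer;
#   * the highest SCN a database will accept (its "soft limit") grows by
#     16,384 for every second elapsed since 1 January 1988.
SCN_HARD_MAX = 2**48 - 1
SCN_GROWTH_PER_SECOND = 16_384
SCN_EPOCH = datetime(1988, 1, 1, tzinfo=timezone.utc)


def scn_soft_limit(at: datetime) -> int:
    """Highest SCN accepted at time `at` under the assumed model (hypothetical helper)."""
    elapsed_seconds = int((at - SCN_EPOCH).total_seconds())
    return min(elapsed_seconds * SCN_GROWTH_PER_SECOND, SCN_HARD_MAX)


def max_increase_per_day() -> int:
    """How far the assumed soft limit advances in 24 hours."""
    return SCN_GROWTH_PER_SECOND * 86_400


if __name__ == "__main__":
    example_date = datetime(2012, 1, 16, tzinfo=timezone.utc)  # arbitrary example date
    print(f"hard maximum:       {SCN_HARD_MAX:,}")
    print(f"soft limit on date: {scn_soft_limit(example_date):,}")
    print(f"max growth per day: {max_increase_per_day():,}")
```

Under these assumptions the ceiling advances by roughly 1.4 billion per day, so an abnormal jump of a few billion amounts to a few days of normal growth, while a jump of a few trillion consumes years of headroom.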
I’ll get more into the scope of the problem in a second, but first some important background.
When I started learning about database internals – how they were architected and the design of core services – data integrity was the number one design goal. Period! Performance, efficiency, and query execution paths were important, but actually getting the right data back from your queries was the essential requirement. That concept seems antiquated today, but storing and then retrieving correct data from a relational system was not a certainty in the beginning. Power outages, improper thread handling, locking, and transactional sequencing issues all resulted in database corruption. We saw transactions processed in the wrong order, calculations performed on stale data, and transactions simply lost. This created nightmares for DBAs, who had to determine what went wrong and reconstruct the database. If this hits an accounting system, suddenly nothing adds up in the general ledger and the entire company is in a panic at the end of the quarter. Today we can normally take data consistency for granted, thanks to all the work that went into relational database design and solving those reliability problems in the early years.
One of the basic tools embedded into relational platforms to solve data consistency issues is the sequence generator. It’s an engine that generates a sequence of numbers used to order and arrange events. Sequence numbers provide a mechanism for synchronization, and help provide data consistency within a single database and across many databases. Oracle created the SCN many years ago for this purpose, and it’s literally a core capability upon which many critical database functions rely. As an example, every database read operation – looking at stored data – compares the current SCN with the SCN of the data stored on disk to ensure the data was not changed by another process during the query. This ensures that each operation in a multi-threaded database reads accurate data. The SCN plays a role in the consistency checks performed when databases are brought online, and is core to database recovery in the event of corruption. In a nutshell, every data block in a database is tied to the SCN!
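The read-time SCN comparison can be illustrated with a toy model. This is a hypothetical sketch, not Oracle's implementation (which relies on undo segments and considerably more machinery): each block records the SCN of its last change, and a query that began at an earlier SCN walks back through prior versions until it finds one old enough to see. `Block`, `undo`, and `consistent_read` are illustrative names.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical toy model: Block, undo, and consistent_read are illustrative
# names, not Oracle internals.
@dataclass
class Block:
    value: str
    scn: int                         # SCN of the last change to this block
    undo: Optional["Block"] = None   # previous version of the block, if kept


def consistent_read(block: Block, query_scn: int) -> str:
    """Return the block's contents as they were at query_scn.

    A block whose SCN is newer than the query's snapshot SCN was changed after
    the query began, so older versions are examined until one is old enough.
    """
    version: Optional[Block] = block
    while version is not None and version.scn > query_scn:
        version = version.undo
    if version is None:
        # No sufficiently old version survives; roughly analogous to
        # Oracle's "snapshot too old" condition.
        raise LookupError("no version of the block is old enough for this query")
    return version.value


if __name__ == "__main__":
    original = Block("balance=100", scn=1_000)
    updated = Block("balance=250", scn=1_050, undo=original)
    print(consistent_read(updated, query_scn=1_020))  # query began before the update
    print(consistent_read(updated, query_scn=1_100))  # query began after the update
```

A query that started at SCN 1,020 sees `balance=100` even though the block was later changed, which is the "data was not changed during the query" guarantee in miniature.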
Now back to the bug: this flaw was discovered as a result of a backup and recovery feature abnormally advancing the SCN by a few billion or even a few trillion. For most firms this will never be an issue, as the maximum is simply too large for a few extra billion to matter. But for large organizations that have designed their databases to synchronize using common SCNs, the possibility of failure is real – and the impact would be catastrophic. Oracle has now both patched the recovery flaw that erroneously advances the number and changed the database to double the SCN range. Just as importantly, they provided the patch quickly.
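To see why shared SCNs raise the stakes, here is a simplified Python model of linked databases. It assumes, as a simplification rather than a description of Oracle's exact protocol, that both sides of a database-link call adopt the higher of their two SCNs, and that the SCN ceiling grows by 16,384 per second. `Database`, `sync_on_link`, `days_of_headroom_consumed`, and all SCN values are hypothetical.

```python
from dataclasses import dataclass

# ASSUMPTIONS (simplified model, not Oracle internals):
#   * on each database-link call both sides move to the higher SCN;
#   * the allowed SCN ceiling grows by 16,384 per second.
SCN_GROWTH_PER_DAY = 16_384 * 86_400


@dataclass
class Database:
    name: str
    scn: int


def sync_on_link(a: Database, b: Database) -> None:
    """Both sides of a link call adopt the higher of their two SCNs."""
    highest = max(a.scn, b.scn)
    a.scn = highest
    b.scn = highest


def days_of_headroom_consumed(jump: int) -> float:
    """Express an abnormal SCN jump as days of normal ceiling growth."""
    return jump / SCN_GROWTH_PER_DAY


if __name__ == "__main__":
    hub = Database("hub", scn=10_000_000_000_000)
    branch = Database("branch", scn=9_000_000_000_000)
    reporting = Database("reporting", scn=9_500_000_000_000)

    hub.scn += 3_000_000_000_000  # hypothetical buggy jump of 3 trillion
    sync_on_link(hub, branch)      # branch inherits the inflated SCN
    sync_on_link(branch, reporting)  # ...and passes it on to reporting

    for db in (hub, branch, reporting):
        print(db.name, f"{db.scn:,}")
    print(f"jump = {days_of_headroom_consumed(3_000_000_000_000):,.0f} days of growth")
```

Because every link call carries the highest SCN forward, one inflated SCN spreads to every database it touches, so the lost headroom (years, in this example) is shared by the whole group rather than confined to the database that hit the bug.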
The patch appears to fix the bug, and with the increased SCN range we assume this problem will never occur in a normal setting; the odds are infinitesimally small. What has people worried is that attackers could leverage this into a denial-of-service attack and disable a database – or possibly every linked database in a cluster – for an extended period. There are a couple of known ways to exploit the vulnerability, so patch your systems as soon as possible. What worries me even more is that, with this focus on the SCN, researchers might discover new ways to attack inter-database SCN synchronization and corrupt data. It’s purely speculative on my part, but this capability was designed before developers worried much about security, so I would not be surprised if we see an exploit in the coming months.
A few closing comments. First, the InfoWorld article that broke news of this flaw is excellent. It’s lengthy but thorough, so I encourage you to read it. Second, if your environment relies on inter-database SCN synchronization, you need to do two things: level-set security across all participating databases, and start planning a migration that reduces or eliminates the inter-database dependency to mitigate risk. For most firms I know that rely on the SCN, the best bet will be to tighten security, as the cost of rewriting applications to use another synchronization method would be prohibitive. Finally, Oracle assigned a risk score of 5.5 to CVE-2012-0082. Does that sound accurate to you? Once again Oracle’s risk scores do a poor job of describing risk to your systems, so take a closer look at your exposure and decide for yourself.