Tuesday, May 26, 2009

Ensuring Data Integrity is a Tricky Business

The term "data integrity" can mean different things to different people and at different times. But at a high level, there really are two aspects of integrity with respect to databases: database structure integrity and semantic data integrity. Keeping track of database objects and ensuring that each object is created, formatted and maintained properly is the goal of database structure integrity. Each DBMS uses its own internal format and structure to support the databases, table spaces, tables, and indexes under its control. System and application errors at times can cause faults within these internal structures. The DBA must identify and correct such faults before insurmountable problems occur. Semantic data integrity refers to the meaning of data and relationships that need to be maintained between different types of data. The DBMS provides options, controls and procedures to define and assure the semantic integrity of the data stored within its databases.

Structural database integrity and consistency is critical in the ongoing administration of databases. If the structural integrity of the database is not sound, everything else will be suspect, too. There are multiple types of structural problems that can occur. Indexing problems are one. Certain types of database maintenance can cause such problems and DBAs need to be able to recognize the problem, and rebuild the indexes to correct their structural integrity. Indexes are not the only database objects that utilize pointers. Many DBMSs use pointers to store very large objects containing text and image data. These can become corrupted.In today's modern database systems, structural integrity is rare -- much rarer than it used to be.

The more difficult and more pervasive problem is assuring the semantic integrity of the data. Getting that right requires proper design, processes that match your business requirements, good communication skills, and constant vigilance.

Perhaps the number one cause of data integrity problems is improperly designed databases. Just getting the data type and length correct for each column can go a long way to making sure the right data is stored. Think about it. If you need to store dates but the column is defined as CHAR(8) how can you enforce that only valid dates are stored? You would need to code program logic to accomplish that. But if the column is defined as DATE then the DBMS would take care of it -- and more of the data would be likely to be correct.

The DBA must also set up data relationships properly in the database. This is done using referential integrity (RI), a method for ensuring the "correctness" of data within a DBMS. People tend to over-simplify RI stating that it is merely the identification of relationships between tables. It is actually much more than this. Of course, the identification of the primary and foreign keys that constitutes a relationship between tables is a component of defining referential integrity. Basically, RI guarantees that an acceptable value is always in the foreign key column. Acceptable is defined in terms of an appropriate value as housed in the corresponding primary key (or perhaps null).

The combination of the relationship and the rules attached to that relationship is referred to as a referential constraint. The rules that accompany the RI definition are just as important as the relationship. These rules define how data is to be properly added to the databases and what happens when it is removed.

There are other mechanisms in the DBMS that DBAs can use to enforce semantic data integrity. Check constraints and rules can be applied to columns that dictate valid values. The DBMS will reject invalid data that does not conform to the constraints. More complex data relationships can be set up using database triggers.

Every DBA should take advantage of the mechanisms provided by the DBMS to ensure data integrity. When DBMS-provided methods are used, fewer data integrity problems are likely to be found. Fewer data integrity problems mean higher quality databases and more proficient end users. You have to know what integrity rules are proper for the DBMS to enforce. But once defined, many of those rules can be enforced by the DBMS.

And that is very good, indeed!

Labels: , , , , , ,

Wednesday, September 03, 2008

When Not to Index

Answering a question I got via e-mail on indexing...

Every now and then I take the opportunity to blog about a question I get through e-mail. This time the question was:

When does it make more sense not to build an index for a DB2 table?

I'll attempt to answer this question for any SQL DBMS, not just for DB2:

First of all, this is a very open-ended question, so I will give a high-level answer. Let's start by saying that most of the time you will want to build at least one - and probably multiple - indexes on each database table that you create. Indexes are crucial for optimizing performance of SQL access. Without an index, queries must scan every row of the table to come up with a result. And that can be very slow.

OK, that being said, here are some times when it might makes sense to have no indexes defined on a table:

  • When all accesses retrieve every row of the table. Because every row will be retrieved every time you want to use the table an index (if used) would just add extra I/O and would decrease, not enhance performance. Though not extremely common, you may indeed come across such tables in your organization.

  • For a very small table with only a few pages of data and no primary key or uniqueness requirements. A very small table (perhaps 10 or 20 pages) might not need an index because simply reading all of the pages is very efficient already.

  • When performance doesn't matter and the table is only accessed very infrequently. But, hey, when do you ever have those type of requirements in the real world?

Other than under these circumstances, you will most likely want to build one or more indexes on each table, not only to optimize performance, but also to ensure uniqueness, to support referential integrity, and perhaps to drive data clustering.

Of course, indexes do not come without cost. Indexes take up disk space and adding a lot of indexes will consume disk space. For some DBMS products, adding many indexes can impact the working set size and perhaps raise memory problems. Additionally, although indexes speed up queries they degrade inserts and deletes, as well as any modification to indexed columns.

What do you think? Are there other situations where a table should have no indexes? Are there any pertinent high-level issues I missed? Feel free to add your thoughts and comments below!

Labels:

Tuesday, December 18, 2007

Indexing the DB2 Catalog [DB2 9 for z/OS]

First of all, I want to apologize for the lack of new posts here the last couple of weeks. I've been busy traveling and preparing for the holidays -- as have most of you I guess, at least the holiday preparation part! Anyway, this will likely be my last post this year and it will be a short one at that. But be sure to check back again in the new year (2008) as I will begin posting a bit more regularly again (hopefully).

In today's post I want to tout a small but helpful improvement in DB2 9 for z/OS that makes it easier to build all of the indexes you want on your DB2 Catalog tables. As you know, even though the DB2 Catalog ships with a set of pre-defined indexes from IBM, you can add your own for improving the performance of your catalog queries. Some DB2 tools vendors even ship recommendations of indexes for you to build to make their solutions work more efficiently.

Well, prior to V9 the limit for user-defined defined indexes on the DB2 Catalog was 100. Over time, some shops have bumped up against this limit and were unable to define any additional indexes.

The good news is that DB2 V9 ups the limit to 500 user-defined catalog indexes.

That's all for today... and I wish all of my readers a very happy holiday season!

Labels: ,

Tuesday, July 24, 2007

Index Compression [DB2 9 for z/OS]

Another useful new feature debuting in V9 is the ability to compress indexes. We’ve been able to compress DB2 data in table spaces for a long time now, either through an exit routine or with the COMPRESS table space parameter (added in DB2 V3). But before V9 we’ve never been able to compress index data.

Why would you want to compress index data? Well, some types of applications require very large indexes on very large tables - - data warehousing applications are one good example. Sometimes, the storage required for indexes to support your data warehouse applications can exceed the storage required for the base table. So it makes sense that you might want to reduce the storage consumed by such indexes.

DB2 V9 introduces the COMPRESS parameter for indexes. You can specify COMPRESS YES (or NO) on your CREATE INDEX and ALTER INDEX statements. Index partitioning is done at the index level and cannot be performed on a partition by partition basis.

Additionally, DB2 will only compress the data in leaf pages, not in the root page and any non-leaf pages in between. This makes sense because you don’t want to incur the expense of decompressing all of these type of pages in your indexes in order to find the right leaf page range.

Index compression does not require a compression dictionary. As such, DB2 can begin to immediately compress data in the leaf pages of your newly created indexes.

Now think about what we’ve already learned for a minute. The data on the leaf page is compressed, but we will want to access it uncompressed, right? So index pages are stored on disk in a compressed format but will be expanded when read. So those 4K index pages on disk will require more than 4K when expanded. This means that compressed indexes must be defined in a larger buffer pool (8K, 16K, or 32K). Nevertheless, when you compress an index DB2 will always compress the data down into a 4K page size on disk no matter what page size you choose.

Another consideration to keep in mind is that index data is decompressed for index image copies, so a copy of an index will require more storage space than the actual index requires.

So, when you move to DB2 9 in NFM you have an additional compression decision to make: which indexes should be compressed and which should not? But it is a good thing to have more options at our disposal, especially for applications with huge indexing requirements.

Labels: , , ,

Monday, May 14, 2007

Index on Expressions [DB2 9 for z/OS]

DB2 9 for z/OS offers, for the first time, the ability to create an index on data that is not technically in the table. At this point, you may well be asking “What does that mean?” and “Why would I do that?” I’ll attempt to answer both of those questions.

Basically, the neat new feature is an extension to the CREATE INDEX statement that lets you create an index on an expression. That begs the question of just what is an expression… Well, an expression can be as simple as a column reference, or it can be a built-in function invocation or even a more general expression, with certain restrictions.

The expression that is being indexed is referred to as the key-expression. The maximum length of the text string of each key-expression is 4000 bytes after conversion to UTF-8. Each key-expression must contain as least one reference to a column of the table named in the ON clause of the index. Referenced columns cannot be LOB, XML, or DECFLOAT data types nor can they be a distinct type that is based on one of those data types. Referenced columns cannot include any FIELDPROCs nor can they include a SECURITY LABEL. Furthermore, the key-expression cannot include any of the following:

  • subquery
  • aggregate function
  • not deterministic function
  • function that has an external action
  • user-defined function
  • sequence reference
  • host variable
  • parameter marker
  • special register
  • CASE expression
  • OLAP specification

Unlike a simple index, the index key of an index on expression is composed by concatenating the result (also known as a key-target) of the expression that is specified in the ON clause. An index that is created on an expression lets a query take advantage of index access (if the index is chosen by the optimizer) and avoid a table space scan.

Perhaps an example would help to clarify this topic. Consider this sample DDL:

CREATE INDEX XUPLN
ON EMP
(UPPER(LAST_NAME))
USING STOGROUP DSN8G910
PRIQTY 360 SECQTY 36
ERASE NO
COPY YES;

This example would create an index on the LAST_NAME column of the EMP table, but would index on the data after applying the UPPER function. This is useful if you store the data in mixed case but submit queries with parameter markers (or host variables) in upper case only. By indexing the data as upper case the index better matches your queries.

Query performance can be enhanced if the optimizer chooses that index. When you use an index on an expression, the results of the expressions are evaluated during insertion time or during an index rebuild and are kept in the index. If the optimizer chooses to use that index, the predicate is evaluated against the values that are stored in the index. As a result, run-time performance overhead is eliminated.

This is a nice new feature of DB2 9 for z/OS that you should consider if you have queries that search on expressions.

Labels: ,