Tuesday, May 26, 2009

Ensuring Data Integrity is a Tricky Business

The term "data integrity" can mean different things to different people and at different times. But at a high level, there really are two aspects of integrity with respect to databases: database structure integrity and semantic data integrity. Keeping track of database objects and ensuring that each object is created, formatted and maintained properly is the goal of database structure integrity. Each DBMS uses its own internal format and structure to support the databases, table spaces, tables, and indexes under its control. System and application errors at times can cause faults within these internal structures. The DBA must identify and correct such faults before insurmountable problems occur. Semantic data integrity refers to the meaning of data and relationships that need to be maintained between different types of data. The DBMS provides options, controls and procedures to define and assure the semantic integrity of the data stored within its databases.

Structural database integrity and consistency is critical in the ongoing administration of databases. If the structural integrity of the database is not sound, everything else will be suspect, too. There are multiple types of structural problems that can occur. Indexing problems are one. Certain types of database maintenance can cause such problems and DBAs need to be able to recognize the problem, and rebuild the indexes to correct their structural integrity. Indexes are not the only database objects that utilize pointers. Many DBMSs use pointers to store very large objects containing text and image data. These can become corrupted.In today's modern database systems, structural integrity is rare -- much rarer than it used to be.

The more difficult and more pervasive problem is assuring the semantic integrity of the data. Getting that right requires proper design, processes that match your business requirements, good communication skills, and constant vigilance.

Perhaps the number one cause of data integrity problems is improperly designed databases. Just getting the data type and length correct for each column can go a long way to making sure the right data is stored. Think about it. If you need to store dates but the column is defined as CHAR(8) how can you enforce that only valid dates are stored? You would need to code program logic to accomplish that. But if the column is defined as DATE then the DBMS would take care of it -- and more of the data would be likely to be correct.

The DBA must also set up data relationships properly in the database. This is done using referential integrity (RI), a method for ensuring the "correctness" of data within a DBMS. People tend to over-simplify RI stating that it is merely the identification of relationships between tables. It is actually much more than this. Of course, the identification of the primary and foreign keys that constitutes a relationship between tables is a component of defining referential integrity. Basically, RI guarantees that an acceptable value is always in the foreign key column. Acceptable is defined in terms of an appropriate value as housed in the corresponding primary key (or perhaps null).

The combination of the relationship and the rules attached to that relationship is referred to as a referential constraint. The rules that accompany the RI definition are just as important as the relationship. These rules define how data is to be properly added to the databases and what happens when it is removed.

There are other mechanisms in the DBMS that DBAs can use to enforce semantic data integrity. Check constraints and rules can be applied to columns that dictate valid values. The DBMS will reject invalid data that does not conform to the constraints. More complex data relationships can be set up using database triggers.

Every DBA should take advantage of the mechanisms provided by the DBMS to ensure data integrity. When DBMS-provided methods are used, fewer data integrity problems are likely to be found. Fewer data integrity problems mean higher quality databases and more proficient end users. You have to know what integrity rules are proper for the DBMS to enforce. But once defined, many of those rules can be enforced by the DBMS.

And that is very good, indeed!

Labels: , , , , , ,

Monday, November 03, 2008

On Date Formats, Part 2

Here is a follow-up question and answer based on my previous blog post:


Q: My format does not fit into any of the formats listed in the DB2 manuals. What if I have a DATE stored like YYYYMMDD (with no dashes or slashes) and I want to compare it to a DB2 date?


A: Okay, let's look at one potential solution to your problem (and then I want to briefly talk about the use of proper data types). First of all you indicate that your date column contains dates in the following format: yyyymmdd with no dashes or slashes. You do not indicate whether this field is a numeric or character field - I will assume that it is character. If it is not, you can use the CHAR function to convert it to a character string.


Then, you can use the SUBSTR function to break the character column apart into the separate components, for example SUBSTR(column,1,4) returns the year component, SUBSTR(column,5,2) returns the month, and SUBSTR(column,7,2) returns the day.


Then you can concatenate all of these together into a format that DB2 recognizes, for example, the USA format which is mm/DD/yyyy. This can be done as follows:


SUBSTR(column,5,2) || "/" || SUBSTR(column,7,2) || 
"/" || SUBSTR(column,1,4)


Then you can use the DATE function to convert this character string into a DATE that DB2 will recognize. This is done as follows:


DATE(SUBSTR(column,5,2) || "/" || SUBSTR(column,7,2) || 
"/" || SUBSTR(column,1,4))


The result of this can be used in date arithmetic with other dates or date durations. Of course, it may not perform extremely well, but it should return the results you desire.


Now, a quick word about using proper data types. I say this all of the time, but there are many applications and implementations "out there" that do not heed the advice: it is wise to use the DATE data type when you store dates in DB2 tables. It simplifies life later on when you want to do things like formatting dates and performing date arithmetic.


Using the appropriate data type also ensures that DB2 will perform the proper integrity checks on the columns when data is entered, instead of requiring application logic to ensure that valid dates are entered.

Labels: , ,

Thursday, September 04, 2008

Database Performance and Row Size

Recently I was reading through some posts in a database-related newsgroup or mailing list (actually, right now I can't remember which one it was). The conversation I was reading was in response to a question like "Does the number of columns or size of the row matter in terms of performance?"

Actually, the question asked what kind of a performance impact might be expected if a query was issued against two similar tables. The first table had (say) 20 columns, and the second table had the same 20 columns, as well as 35 additional columns.

Well, most of the basic responses were similar. The consensus was that as long as the query was going against the same columns then performance should be about the same. I disagree. Here is why.

You also need to factor in the I/O requests that are required to return the data. The DBMS will perform I/O at the block (or page) level - this is so whether you return one row or millions of rows. For multi-row results, accessing data from the table with the wider row (more columns) will usually be less efficient. This is so because fewer rows will exist on each page (the row with 100 columns is smaller than the row with 150 columns so more rows can reside in a single, pre-sized block/page). The bigger the result set, the more pronounced the performance degradation can be (because more physical I/Os are required to retrieve the data).

Think about it this way. Is it faster to pull smaller peaches out of a basket than bigger peaches? That is about the same type of question and anybody can envision the process. Say you want 100 peaches. Small peaches fit 25 per basket; big peaches fit ten per basket. To get 100 small peaches you'd need to pull 4 baskets from the shelf. To get 100 big peaches you'd need to pull 10 baskets from the shelf. The second task will clearly take more time.

Of course, the exact performance difference is difficult to calculate - especially over an online forum and without knowledge of the specific DBMS being used. But there will, more than likely, be a performance effect on queries when you add columns to a table.

Labels: ,