According to Wikipedia, a data warehouse is a central data store used for data analysis and management reporting. In practical terms, it can take many forms but serves a core purpose: to throw the harsh light of day onto corporate data, so that managers can make informed decisions.
The process of creating a data warehouse varies from project to project, depending on whether the data physically exists anywhere and what kind of outputs will be needed. Generally speaking, data has to move from somewhere (i.e., somebody’s brain, an Excel spreadsheet, the accounting system, or all of the above) to somewhere else where it’s better organized and much more useful. The challenge for all data warehouse projects, then, is to ensure that the processes for making this happen are solid and yield useful and accurate data.
I recently completed yet another data warehouse project (although, for the amount of data involved, it was more of a “data shed” than a warehouse.) With some post-project reflection, I’ve noted some steps you can take in your next data warehouse project to help ensure your success (and keep your data out of the crapper.)
Don’t combine steps
As tech monkeys, we tend to like elegant and efficient code. Indeed, early hackers were praised for achieving maximum functionality with the least amount of code possible. Beware – for these projects succumbing to this siren song can doom you to failure!
In a data warehouse, source data gets transferred to the warehouse but must typically undergo some type of modification, formatting, or propagation along the way. Tech monkey logic would have you perform all steps at once, but you must resist! Keep your steps separated with interim data accessible after the fact, so that in case there are problems in the process you can more easily track down issues or verify accuracy.
Trying to find a procedural bug in a seven-step process when you’re doing it all in one shot is nearly impossible. Simplify your (support) life and keep the steps separate.
The whole is only as good as the data
You may have a fancy application presenting the data to your users, or a bunch of cool reports, or a slick website interface. Regardless of the delivery system, the key to remember is that the project is only as good as the data that’s in it.
When I mentioned earlier about shining the harsh light of day on your data, I wasn’t kidding. One of my early data warehouse projects revealed a bunch of bad data born of bad data entry habits in our corporate mainframe that didn’t show up on the old reports. When we translated that bad data to the warehouse, the slick new reports showed those errors all over the place. Our managers assumed there were serious problems with our project, when in fact the reports were working just fine.
The important takeaway here is that you need to build in time for cleaning up the data. The bigger your data store, the bigger this problem will be, so plan accordingly.
“Should be” is not the same as “is”
Just as data correctness is key to success, so is data attribute correctness. In every project I’ve seen cases where we identify the potential values for data elements, only to later discover exceptions that throw off carefully constructed data translation processes or report formats. I’ve also seen cases where stakeholders agreed on business rules that governed data values, only to later find that those rules were different (or even ignored) in practice.
You should assume that these things will come up in your project, and don’t be afraid to verify and re-verify data attributes and assumptions along the way.
Pre-identify data ownership
Another step that trips up data warehouse projects is failing to properly identify data element ownership, at both the person and system levels. (Yes, that’s data element – record-level ownership isn’t fine tuned enough and, besides… which record? The source datastore(s)? The warehouse datastore?)
Sure, it’s easy to plan and build for the data as it stands today, but what happens if there are data conflicts? If two different people create data for the same physical thing, which takes precedent? If two different people edit a single data element, who’s change wins out? These and other similar questions must be answered up front, and those answers must be built into the data transformation steps ahead of time. Otherwise, you run the risk of data collisions and conflicts, and ultimately end up with bad data.
And you do remember what bad data means for your project, right?
Don’t get blinded by the science
Again, as tech monkeys we tend to get caught up in the technology steps. The thing to remember is that, while certainly important, the mechanics of moving data from source(s) to destination is but a small part of the overall data warehouse process. The people involved in the data warehouse and the processes they follow are more important to the ultimate success of the project. Don’t forget to include them in your process and contingency planning.
Don’t forget to build in security
It’s an unfortunate reality, but we must remember that data warehouses as a rule tend to involve a healthy amount of process re-engineering. In other words, it’s a recognition that the organization was unsatisfied with the way data was being collected and shared. As is the case whenever we talk business processes, there will be those who have a vested interest in how things WERE done, and they may not agree with the changes (or, frankly, the need for any changes.) Translation: Not everybody will want the project to succeed.
It’s vital for us, then, to ensure the integrity and viability of the data in order to protect the organization’s best interests. This requires steps such as enabling secure logins, controlling read/write access to data, logging changes, performing regular (frequent) backups, and other similar functions. We need to build these into our project timelines up front, and resist the urge to rollout the project without them in place. Remember, security (and success) is up to you!
Data Warehouse or Data Outhouse?
These steps are crucial to ensuring a successful experience. Follow them and your data warehouse project will be far more likely to succeed. It’s up to you… would you rather your reports be useful for making business decisions, or only useful as toilet paper?