Quote:
Originally Posted by seventhson I do this all the time. It's just a matter of mapping the old data format to the new data format in a consistent way. For me, it's one of the more straightforward types of software development jobs. |
I would add that in addition to the mapping, profiling the data would be useful so that in your mapping you can identify the valid ranges and values for each of the data elements in the mapping document / data dictionary. Then you can use this information to code your conversion to do the appropriate checks.
Profiling is useful so you can get some idea of how clean or how dirty is the data. Data quality will have a direct effect on the ease or difficulty of the conversion after the data element mapping.
When dealing with data quality I use the following guide:
1. Assume the system you are working has dirty data and plan accordingly,
2. When the customer (whether internal or external) says don't worry, the data is / should be clean, see number 1.
I've found that the common hangup in most interface programs I've written in the past were date format processing. Different systems and different databases tolerate (or don't) date formatting differently and this is usually where I have had issues on most every project. One example: '2011-1-1' is acceptable as Jan 1 2011 in some systems but others must have '2011-01-01' if using that format and need to make sure you have those leading zeros. There are many more issues and many ways to skin the cat too numerous to enumerate but make sure the issues are well understood and handled in your conversion. Experience is your best friend here.
Another issue is decimal precision. One example: Some database and software systems will go to 18 decimal places in precision and others 30 decimal places. Not a big deal for many systems, but I was working on a system once where the source went out to 30 decimal places and the target handled only 18 decimal places and had to deal with a solution for that.
Don't get me wrong, I've seen crap in other data domains that were tricky, it just seemed that every one at least had the date issues to worry about.
Referential Integrity is a super important data issue. An example, if your Sheriff system has a table that stores violations and another table that stores perpetrators, and the rule is that every violation has a perp, there needs to be a perp in the perp table for every perp that's in the violation table. For really active perps, you may have many different violations with the same perp, but you would only want one instance perp in the perpetrator table. This is called a one to many relationship. Where issues arise is that sometimes you may have a perp in the violation table but for some reason is not in the perp table. This is called an orphan record by some people. Need to be able to handle this issue as some systems are poor at enforcing referential integrity. Best to try to find these in the data profile stage so it can be presented to the customer for clarification and direction on what actions should be taken.
Another issue is the rank of data importance. In systems there are some data elements absolutely cannot have mistakes and then there is other less-important elements that can be a little more lenient. The ideal is that all data elements should be perfect. All customers want it perfect, but none seem to want pay for the time it takes to correct, especially if they have crappy data. Every project has a time and budget and it's best to find out up front what absolutely has to be dead on so that those tier 1 data elements gets done correctly for sure and use the rest of the time to make sure the tier 2 and tier 3 data elements have acceptable quality. Please note I am saying prioritize to the most important. I am not saying skip the less important. There is a big difference.
Bottom line is that there will be much up front work to do before you start programming:
1. Mapping source to target data elements
2. Profiling source data
3. Review results of profile with the customer so both parties can understand the level of data quality and how that affects the actual processing.
4. Extract exception rules from results of the data profiling and add to the mapping document for use later in coding.
5. Get an understanding from the customer that for future data that comes in and falls outside of what is acceptable, what to do with that data? Keep it and do nothing to it, reject and put into a different file for review, etc? Put that information into the mapping document for future coding.
This seems like a lot for a conversion, and if we are talking about a handful of data elements you will be processing then it may be. My background is in multi-terabyte database processing on millions to billions of records so profiling is very important part of interface processing to understand how to handle anomalies because there is no way we can hand correct that volume of data if/when mistakes are uncovered.
I'll post more as I have time.