ETL: How to handle bad data
Posted: May 31, 2011
During any ETL design, we implement various functionalities such as validation, auditing, notification, job recovery, job logging, data cleansing, and handling bad data. In this blog I am going to talk about handling bad data. At a top level, ETL design allows bad data to be rejected and sent to the appropriate users in the form of files. But, in my opinion, there is more to this than meets the eye; as ETL architects, our responsibility does not end there. So before I get into the details of how we should handle bad data, let me explain what makes the incoming data bad enough to need handling. The following are a few reasons that generate bad data:
- A set of business rules is laid out that defines whether the incoming data is good or bad. Consider a sales record where the cost of the product must be present. If the cost contains a null value or a negative number, the sales record is considered bad.
- Any data that does not satisfy referential integrity in the data warehouse database. This usually happens when inter-dependent data is missing. If the incoming data references some other data that could not be loaded for some reason, this incoming data becomes bad and should not be loaded into the data warehouse database. A typical example is a retail chain maintaining its product master in a centralized database while sales data is generated across different POS terminals. If the corporate database is down for whatever reason during the ETL run, the ETL cannot load new products, yet sales are still being generated for those new products. With no corresponding product in the product dimension (master), such sales records are considered bad at that moment and are not loaded into the data warehouse database.
- Missing business keys: If mandatory key data is missing from the incoming data, that data is considered bad. This usually does not happen when the data is sourced from other relational databases. When sourcing data from files, however, there is every possibility that data may go missing, even if the format of the incoming feed file has already been agreed upon.
- Missing data: There are many cases where values are simply missing from the incoming data, which logically makes the data bad. For example, an incoming product feed file may contain a record with no product code and no product description. In this case the record is considered invalid.
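To make the distinction concrete, the checks above can be sketched as a small classification step. This is a minimal sketch, not the post's actual implementation; the field names (product_code, cost) and the reason strings are my own illustrative assumptions:

```python
# Sketch of separating good records from bad, per the rules above.
# Field names (product_code, cost) are illustrative assumptions.

def classify(records):
    """Split incoming records into (good, bad); each bad record
    carries the reason it was rejected."""
    good, bad = [], []
    for rec in records:
        if not rec.get("product_code"):
            # Missing business key: mandatory identifier is absent
            bad.append({**rec, "reject_reason": "missing business key"})
        elif rec.get("cost") is None or rec["cost"] < 0:
            # Business rule: cost must be present and non-negative
            bad.append({**rec, "reject_reason": "invalid cost"})
        else:
            good.append(rec)
    return good, bad
```

In a real ETL tool this logic would live in the validation stage that runs before cleansing, with one rule per rejection reason so the reason can be stored alongside the rejected row.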
Now that we have seen what generates bad data, we must understand that data cleansing does not make any data bad. Data cleansing is applied only to good data. So before we start data cleansing, a mechanism needs to be put in place to separate good data from bad.
Once this bad data is identified, it is usually stored in a separate area called the “Rejection Area”. This rejection area can be a separate schema in the same database that contains the staging schema, or it can be a separate database altogether. The table structure of the rejection area is similar to that of the staging area, with a few additional columns. These additional columns are required to store metadata about the rejected data.
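As an illustration, a rejected row can be thought of as the staging row plus rejection metadata. A minimal sketch, where both the payload columns and the metadata column names are my own assumptions:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class RejectedRecord:
    # Same payload columns as the staging table (illustrative names)
    product_code: str
    cost: float
    # Additional metadata columns kept only in the rejection area
    reject_reason: str = ""                         # why validation failed
    rejected_at: datetime = field(default_factory=datetime.now)
    reject_job_id: int = 0                          # ETL run that rejected it
```

The same shape works as a relational table: the staging columns plus a handful of audit columns, so a rejected row can always be traced back to the run that rejected it.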
As ETL architects we need to design our ETL to provide the following functionality:
- Ability to reprocess this bad data whenever required. Data that could not be loaded due to missing references is usually re-processed once that missing data has been loaded into the data warehouse database. Consider the case where sales data was rejected due to a missing product master. Later, when the latest product master is loaded, this bad data (which wasn’t really bad) needs to be reprocessed; otherwise the sales summary report would not be accurate. Another trigger for reprocessing is a change in business rules: if there is a lot of rejection due to strict validation rules, the customer may decide to relax those rules so that large amounts of data are not rejected. We can automate this functionality by adding a few columns to the rejection area tables:
- Reprocess_flag (Y|N): When set, this flag indicates that the record needs to be re-processed. Moving these rejected records from the rejection area to the staging area should be automated. This helps customers in various ways, such as reduced dependency on IT staff and lower maintenance cost.
- Reprocess_Job_Id: Metadata about each job run is usually maintained. To be able to audit when a rejected record was reprocessed, the job id is stamped into this column.
- Active_Flag: Once a record marked for reprocessing is copied over to the staging area, it is made inactive, since that copy of the record is no longer valid. The record may fail validation again and end up in the rejection area, but it would then be treated as a new record. So in essence, there is only one active instance of a rejected record in the rejection area, and inactive records cannot be chosen for re-processing.
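The three columns above combine into a simple reprocessing pass. A minimal sketch of that automation, assuming rejected rows are dicts carrying the flag columns named above (the in-memory lists stand in for the actual tables):

```python
def reprocess(rejection_area, staging_area, job_id):
    """Copy flagged, active rejected records back to staging,
    stamp the job id for auditing, and deactivate the rejected copy."""
    for rec in rejection_area:
        if rec["reprocess_flag"] == "Y" and rec["active_flag"] == "Y":
            # Strip rejection metadata before the record re-enters staging
            payload = {k: v for k, v in rec.items()
                       if k not in ("reprocess_flag", "active_flag",
                                    "reprocess_job_id")}
            staging_area.append(payload)
            rec["reprocess_job_id"] = job_id   # audit: which run picked it up
            rec["active_flag"] = "N"           # only one active instance
```

In a database implementation this would be an INSERT...SELECT into staging followed by an UPDATE of the rejection table, ideally in one transaction.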
- Ability to reprocess incoming data: Many times, for various reasons, data that has already been processed is fed into the ETL again. This requires us to identify the corresponding records in the rejection area and mark them as inactive, since those records are no longer valid. The incoming records are then validated against the current business rules. The records in the staging area are compared with the records in the rejection area on their business keys, and for matching records the active_flag is set to ‘N’. This process of marking existing rejected records as inactive is usually automated.
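That matching step amounts to a join on the business key between staging and the rejection area. A minimal sketch, assuming product_code is the business key:

```python
def deactivate_superseded(rejection_area, staging_records, key="product_code"):
    """Mark rejected records inactive when the same business key arrives
    again in staging, since the new feed supersedes the old rejection."""
    incoming_keys = {rec[key] for rec in staging_records if rec.get(key)}
    for rec in rejection_area:
        if rec["active_flag"] == "Y" and rec.get(key) in incoming_keys:
            rec["active_flag"] = "N"
```

Note the guard on empty keys: a record with a null business key in staging should never deactivate anything, which is exactly why such records need the separate invalid-marking treatment described next.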
- Ability to mark invalid data: Sometimes the business keys in the incoming records are null. These records eventually end up in the rejection area and remain active. No matter how many times they are reprocessed, they will end up in the rejection area again; at the same time, incoming records can never be matched against them, since there is no key to match on. Thus these records should be marked as invalid. For this, add the following column to the rejection area table:
- Valid_flag (Y|N): This flag is set to ‘N’ for records with missing business keys. This implies that such a record can never be re-processed.
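Putting this together, invalid marking and the reprocess eligibility check look like the following. Again a minimal sketch under the same assumed column names:

```python
def mark_invalid(rejection_area, key="product_code"):
    """Flag records whose business key is missing as permanently invalid."""
    for rec in rejection_area:
        if not rec.get(key):
            rec["valid_flag"] = "N"

def eligible_for_reprocess(rec):
    """A record may be re-queued only if it is both active and valid."""
    return rec.get("active_flag") == "Y" and rec.get("valid_flag", "Y") == "Y"
```

The eligibility check is the single gate the reprocessing automation should use, so that invalid and inactive records can never loop through the ETL again.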
The important point to note is that the responsibility of the ETL architect does not end here. Designing the ETL to mark rejected data as invalid does not, by itself, solve any business problem; the incoming data must still get loaded into the data warehouse database. So it becomes very important for the BI architect to talk to the end users and explain the impact of invalid data. The source systems may need some tweaking, but if the end users want accurate reports they must send accurate data.
What I have explained is just one way of designing the rejection area (the tables containing bad data); what I have discussed is the concept of handling bad data, and there can be different ways of implementing the functionality described above. Once the ETL and the database are designed appropriately, an interface must be provided that allows end users to do the following:
- Select any rejected table and mark rejected records to be reprocessed.
- Select any already processed data for reprocessing. This is simple if the incoming data arrives as feed files; it gets a little more complex when the data is extracted from existing databases. Typically, in large ETL systems, the staging area is archived, so depending on the needs of the customer this goal can be achieved.
- Look at the invalid data and analyze it so that the source systems can be fixed accordingly.
- Execute the ETL job after selecting the rejected records or source data for reprocessing. When this runs would depend on various factors such as the ETL time window, the urgency of reflecting the correct data, the time of the ETL run, etc.
Last but not least, as ETL architects our goal should not be just to implement some logic to handle bad data. Our main responsibility is to make the whole process as automated as possible. Automation provides various benefits such as reduced development time, fewer errors and hence higher quality, and finally reduced cost for the customer.