Recently I was reviewing an SSIS package written by one of my team members. The package was fairly simple: it connected to a single SQL database and had two tasks:
- Execute SQL Task: contained a few SQL statements to set up some data used by the next task.
- Data Flow Task: selected data from one table and loaded it into another.
Interestingly, I found that the “Execute SQL Task” contained several SQL statements with no transaction control, i.e., no BEGIN TRAN or COMMIT/ROLLBACK TRAN. So in this blog, I want to highlight the impact of writing SQL in a task without transaction control. Although this applies to SQL code written anywhere (stored procedures, queries in SSMS, SSIS tasks), its impact can go unnoticed, especially in an SSIS task. For details, read on…
Let’s say we have a batch of multiple SQL statements, without transaction control, executed from the Execute SQL Task. Now let’s analyze the impact on data quality if this batch fails at some place. Here we face a challenge: it is difficult to find out what was committed and what was rolled back, because SQL Server treats compile-time errors and run-time errors differently. What I mean by this:
1. SQL Server terminates the batch as soon as a compile-time error occurs.
2. SQL Server may or may not terminate the batch when a run-time error occurs. Note the “may or may not”.
Let’s see this in action…..
Create a test table as follows:
CREATE TABLE Test_ExecSQLTask ( ID INT);
Let’s execute the following statements:
insert into Test_ExecSQLTask values ( 10 );
insert into Test_ExecSQLTask values ( 999999999999999 );
insert into Test_ExecSQLTask values ( 20 );
insert into Test_ExecSQLTask values ( 'Thirty' );
insert into Test_ExecSQLTask values ( 40 );
Let’s see what is in the table:
select * from Test_ExecSQLTask
Only the 1st and 3rd statements went through. When the 2nd failed, the 3rd still executed; but when the 4th failed, the 5th statement did not execute. Why did this happen?
This happened because of different things going on in the background:
The default transaction mode of the Database Engine is autocommit. In this mode, each SQL statement is executed in its own transaction. So, for example, if a batch has four insert statements, each statement is executed and committed individually. This means that if the 3rd statement fails, the first two are already committed by then, and the rows they added show up in the table.
- insert into Test_ExecSQLTask values ( 999999999999999 );
This failed because the value is too big to be stored in the INT data type (arithmetic overflow). This is a run-time error, but it did NOT stop the batch from executing.
- insert into Test_ExecSQLTask values ( 'Thirty' );
This failed for a different reason (a conversion error), and although this too is a run-time error, it DID stop the batch from executing.
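One way to make this behavior predictable, if we stay with a plain batch, is SET XACT_ABORT ON: it turns most run-time errors into batch-aborting errors and rolls back any open transaction. A minimal sketch against the same test table:

```sql
-- With XACT_ABORT ON, a run-time error such as the overflow below
-- aborts the batch and rolls back the open transaction.
SET XACT_ABORT ON;

BEGIN TRAN;
insert into Test_ExecSQLTask values ( 10 );
insert into Test_ExecSQLTask values ( 999999999999999 ); -- overflow: now aborts the batch
insert into Test_ExecSQLTask values ( 20 );
COMMIT TRAN;
```

Because the transaction is rolled back automatically, none of the rows survive; we get the all-or-nothing behavior that is usually what we want from a setup batch.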
Thus, as a good practice, when using the Execute SQL Task in SSIS, we should:
- Manage the transaction explicitly in the Execute SQL Task. If we don’t, we are implicitly assuming that a package failing in the middle of execution has no adverse impact on the data, or at least that we know what that impact is.
- Examine the impact of a failing Execute SQL Task, especially one with no transaction control, in the bigger scheme of the complete package. As packages grow in complexity and multiple developers work on the same package over time, we can lose sight of such problems.
If a stored procedure is called from the “Execute SQL Task”, make sure the stored procedure manages the transaction. If the transaction is not managed inside the stored procedure, we should wrap the call in a transaction in the SSIS task. This is very important because the same stored procedure may be called by other modules/apps.
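To illustrate explicit transaction management, here is a minimal sketch using TRY…CATCH (SQL Server 2012 or later for THROW) that could be pasted into the Execute SQL Task; the statements themselves are just the test inserts from above:

```sql
BEGIN TRY
    BEGIN TRAN;

    insert into Test_ExecSQLTask values ( 10 );
    insert into Test_ExecSQLTask values ( 20 );
    insert into Test_ExecSQLTask values ( 40 );

    COMMIT TRAN;
END TRY
BEGIN CATCH
    -- Roll back everything if any statement failed...
    IF @@TRANCOUNT > 0
        ROLLBACK TRAN;
    -- ...and re-throw so the SSIS task (and the package) fails visibly.
    THROW;
END CATCH;
```

With this pattern there is no ambiguity: either all statements commit, or none do, and the failure surfaces to the package instead of being silently half-applied.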
Please let me know your feedback …
During any ETL design, we implement various functionalities like validation, auditing, notification, job recovery, job logging, data cleansing, handling bad data, etc. In this blog I am going to talk about handling bad data. At a top level, the ETL design allows bad data to be rejected and sent to the appropriate users in the form of files. But, in my opinion, there is more to this than meets the eye; as ETL architects, our responsibility does not end there. So before I get into the details of how we should handle bad data, let me describe what makes the incoming data bad enough to need handling. The following are a few reasons that generate bad data:
- Business rules: A set of business rules is laid out that defines whether incoming data is good or bad. Consider a sales record where the cost of the product must be present; if the cost contains a null value or a negative number, the sales record is considered bad.
- Referential integrity: Any data that would not satisfy referential integrity in the data warehouse database. This usually happens when inter-dependent data is missing: if the incoming data references some other data that could not be loaded for some reason, the incoming data becomes bad and should not be loaded into the data warehouse database. A typical example is a retail chain maintaining the product master in a centralized database, while sales data is generated across different POS terminals. If, during ETL, the corporate database is down for whatever reason, the ETL cannot load new products; yet sales have already been generated for those new products. With no matching product in the product dimension (master), such a sales record is considered bad at that moment and is not loaded into the data warehouse database.
- Missing business keys: If mandatory key data is missing from the incoming data, that data is considered bad. This usually does not happen when data is sourced from other relational databases; when sourcing data from files, however, there is every possibility that data goes missing, even if the format of the incoming feed file has been agreed in advance.
- Missing data: There are many cases where data is missing from the incoming records, which logically makes them bad. For example, an incoming product feed file may contain a record with no product code and no product description; such data is considered invalid.
Now that we have seen what generates bad data, we must understand that data cleansing does not make any data bad; data cleansing is applied only to good data. So before we start data cleansing, a mechanism needs to be put in place to separate the good data from the bad.
Once this bad data is identified, it is usually stored in a separate area called the “Rejection Area”. This rejection area can be a separate schema in the same database that contains the staging schema, or a separate database altogether. The table structure of the rejection area is similar to that of the staging area, with a few additional columns. These additional columns are required to store metadata about the rejected data.
As ETL architects, we need to design our ETL to provide the following functionality:
- Ability to reprocess bad data whenever required. Data that could not be loaded due to missing references is usually re-processed when the missing data is loaded into the data warehouse database. Consider the case where sales data was rejected due to a missing product master. Later, when the latest product master is loaded, this bad data (which wasn’t really bad) needs to be reprocessed; otherwise the sales summary report would not be accurate. Another trigger for reprocessing is a change in business rules: if strict validation rules cause a lot of rejection, the customer may decide to relax them so that large amounts of data are not rejected. We can automate this functionality by adding a few columns to the rejection area tables:
- Reprocess_flag (Y|N): When set, this flag indicates that the record needs to be re-processed. Moving flagged records from the rejection area back to the staging area should be automated. This helps customers in various ways, such as reduced dependency on IT staff and lower maintenance cost.
- Reprocess_Job_Id: Metadata about each run of a job is usually maintained. To be able to audit when a rejected record was reprocessed, the job id is stamped into this column.
- Active_Flag: Once a record marked for reprocessing is copied over to the staging area, it is made inactive, as the record is no longer valid in the rejection area. The record may fail validation again and land in the rejection area again, but it is then treated as a new record. So, in essence, there is only one active instance of a rejected record in the rejection area, and inactive records cannot be chosen for re-processing.
- Ability to reprocess incoming data: Many times, for various reasons, data that has already been processed is fed into the ETL again. This requires us to identify the corresponding records in the rejection area and mark them inactive, as they are no longer valid; the incoming records then need to be validated against the current business rules. The records in the staging area are compared with the records in the rejection area on the business keys, and for matching records active_flag is set to ‘N’. This process of marking existing rejected records inactive is usually automated.
- Ability to mark invalid data: Sometimes the business keys in incoming records are null. These records eventually end up in the rejection area as active records, and no matter how many times they are reprocessed, they will end up in the rejection area again. At the same time, new incoming records could be matched against them. Such records should therefore be marked invalid. For this, add the following column to the rejection area table:
- Valid_flag (Y|N): This flag is set to ‘N’ for records with missing business keys. This implies that such a record can never be re-processed.
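As a concrete sketch, assuming a hypothetical sales feed (the table and column names Rej_Sales and Stg_Sales are illustrative, not from any specific product), a rejection area table with the metadata columns described above might look like this, together with the automated step that deactivates re-fed records:

```sql
-- Hypothetical rejection-area table mirroring a sales staging table,
-- plus the metadata columns described above.
CREATE TABLE Rej_Sales (
    Product_Code     VARCHAR(20)   NULL,  -- business key (may be missing)
    Sale_Date        DATE          NULL,
    Sale_Amount      DECIMAL(18,2) NULL,
    Reject_Reason    VARCHAR(200)  NOT NULL,
    Reprocess_flag   CHAR(1) NOT NULL DEFAULT 'N', -- Y = queued for reprocessing
    Reprocess_Job_Id INT     NULL,                 -- job run that reprocessed it
    Active_Flag      CHAR(1) NOT NULL DEFAULT 'Y', -- N = superseded / consumed
    Valid_flag       CHAR(1) NOT NULL DEFAULT 'Y'  -- N = can never be reprocessed
);

-- When already-processed data arrives again, deactivate the matching
-- rejected records so only the fresh copies are considered.
UPDATE r
SET    r.Active_Flag = 'N'
FROM   Rej_Sales r
JOIN   Stg_Sales s ON s.Product_Code = r.Product_Code
WHERE  r.Active_Flag = 'Y';
```

The exact layout will vary with the staging design; the point is that the flag columns ride alongside a copy of the staged row plus a rejection reason.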
The important point to notice here is that the responsibility of the ETL architect does not end there. Designing the ETL to mark rejected data as invalid does not by itself solve any business problem: the incoming data must ultimately get loaded into the data warehouse database. So it becomes very important for the BI architect to talk to the end users and explain the impact. The end users may need some tweaking in the source systems, but if they want accurate reports they must send accurate data.
What I have explained is just one way of designing the rejection area (the tables containing bad data); what I have discussed is the concept of handling bad data, and there can be different ways of implementing the functionality mentioned above. Once the ETL and database are designed appropriately, an interface must be provided that allows end users to do the following:
- Select any rejected table and mark rejected records to be reprocessed.
- Select any already processed data for reprocessing. This is simple if the incoming data arrives as feed files; it gets a little more complex when the data is extracted from existing databases. Typically, in large ETL systems, the staging area is archived, so depending on the needs of the customer this goal can be achieved.
- Look at the invalid data and analyze it to be able to fix the source systems accordingly.
- Execute the ETL job after selecting the rejected records or the selected source data for reprocessing. This would depend on various other factors such as the ETL time window, the need to reflect correct data, the time of the ETL run, etc.
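The first of these interface actions could translate into SQL as simple as the following sketch (assuming a rejection table, here hypothetically named Rej_Sales, carrying the flag columns described above; a real interface would scope the WHERE clause by the user’s actual selection):

```sql
-- Hypothetical: mark the user's selected rejected records for the next
-- reprocessing run. Only active, valid records are eligible, per the
-- flag semantics described above.
UPDATE Rej_Sales
SET    Reprocess_flag = 'Y'
WHERE  Active_Flag = 'Y'
  AND  Valid_flag  = 'Y'
  AND  Reject_Reason = 'Missing product master'; -- the user's selection criteria
```

Keeping this behind an interface, rather than handing out table access, is what reduces the dependency on IT staff.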
Last but not least, as ETL architects our goal should not be merely to implement some logic that handles bad data. Our main responsibility is to make the whole process as automated as possible. Automation brings various benefits: reduced development time, fewer errors and hence higher quality, and finally reduced cost for the customer.