SSIS – Execute SQL Task (Transaction Control)

Recently I was reviewing an SSIS package written by one of my team members. The package was pretty simple: it connected to one SQL database and had two tasks:

  1. Execute SQL Task: a few SQL statements that set up some data to be used in the next task.
  2. Data Flow Task: selects data from a table and loads it into another table.

Interestingly, I found that the “Execute SQL Task” contained a few SQL statements with no transaction control, i.e., no BEGIN TRAN or COMMIT/ROLLBACK TRAN. In this blog, I want to highlight the impact of writing SQL without transaction control. Although this applies to SQL code written anywhere – stored procedures, queries in SSMS, SSIS tasks – the impact may go unnoticed especially in an SSIS task. For details, read on….

Let’s say we have a batch of multiple SQL statements without transaction control, executed from the Execute SQL Task. Now we are going to analyze the impact on data quality if this batch fails at some place. Here we face the challenge: it is difficult to find out what is committed and what is rolled back, because SQL Server treats compile time errors and run time errors differently. What I mean by this:

1. SQL Server terminates the batch as soon as a compile time error occurs.

2. SQL Server may or may not terminate the batch when a run time error occurs. Note the “may or may not”.

Let’s see this in action…..

Create a test table as follows:

CREATE TABLE Test_ExecSQLTask ( ID INT);

 Let’s execute the following statements:

insert into Test_ExecSQLTask values ( 10 );
insert into Test_ExecSQLTask values ( 999999999999999 );
insert into Test_ExecSQLTask values ( 20 );
insert into Test_ExecSQLTask values ( 'Thirty' );
insert into Test_ExecSQLTask values ( 40 );

Result:

Let’s see what is in the table:

select * from Test_ExecSQLTask

Only the 1st and 3rd statements went through. When the 2nd failed, the 3rd was still executed; but when the 4th failed, the 5th statement did not execute. Why did this happen?

This happened because of a few things going on in the background:

  1. The default transaction mode for the Database Engine is autocommit. In this mode, each SQL statement is executed in its own transaction. So, for example, if a batch has four insert statements, each statement is executed and committed individually. This means that if the 3rd statement fails, the first two are already committed by then and their rows show up in the table.
  2. insert into Test_ExecSQLTask values ( 999999999999999 );
    This failed because the value is too big to be stored in the INT data type. This is a run time error, but it did NOT stop the batch from executing.
  3. insert into Test_ExecSQLTask values ( 'Thirty' );
    This failed for a different reason (a conversion error), and though this is also a run time error, it DID stop the batch from executing. A sketch showing how SET XACT_ABORT ON changes this behavior follows the list.
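One way to make the behavior consistent is SET XACT_ABORT ON, which makes most run time errors terminate the batch and roll back the current transaction. A minimal sketch against the same test table (this is an illustration of the setting, not part of the original package):

SET XACT_ABORT ON;
BEGIN TRAN;
insert into Test_ExecSQLTask values ( 10 );
insert into Test_ExecSQLTask values ( 999999999999999 ); -- with XACT_ABORT ON, this overflow aborts the batch
insert into Test_ExecSQLTask values ( 20 );
COMMIT TRAN;

Because the batch is aborted before the COMMIT is reached, the open transaction is rolled back and no partial rows remain.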

Thus, as a good practice, while using the Execute SQL Task in SSIS, we should:

  1. Manage the transaction explicitly in the Execute SQL Task. If we choose not to, it should be a conscious decision that a package failing in the middle of execution has no adverse impact on data, or at least that we know what the impact is. A sketch of an explicit transaction with TRY/CATCH follows this list.
  2. Examine the impact of a failing Execute SQL Task, especially one with no transaction control, in the bigger scheme of the complete package. As packages grow in complexity and multiple developers work on the same package over a period of time, we may lose sight of such problems.
  3. If a stored procedure is called from the Execute SQL Task, make sure the stored procedure manages the transaction. If the transaction is not managed in the stored procedure, we should wrap the call in a transaction in the SSIS task. This is very important because the same stored procedure may be called by other modules/apps.
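Here is a minimal sketch of what explicit transaction control inside the Execute SQL Task could look like, using TRY/CATCH around the same kind of statements (the insert statements are placeholders, not the original package code):

SET XACT_ABORT ON;
BEGIN TRY
    BEGIN TRAN;
    insert into Test_ExecSQLTask values ( 10 );
    insert into Test_ExecSQLTask values ( 20 );
    insert into Test_ExecSQLTask values ( 40 );
    COMMIT TRAN;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRAN;
    -- re-throw so the Execute SQL Task (and hence the package) reports the failure;
    -- THROW needs SQL Server 2012+, on older versions use RAISERROR
    THROW;
END CATCH;

Either everything in the batch commits or nothing does, and the error still bubbles up so the package fails visibly instead of silently leaving partial data behind.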

Please let me know your feedback …


SSAS Named Sets – Static Vs Dynamic

Named sets have been in SQL Server Analysis Services since the beginning. They were only static in SQL Server 2005, but with SQL Server 2008 we have dynamic sets as well.

Also, this is one of the popular questions usually asked in BI interviews. The intent of asking it is not to ascertain that the candidate knows the difference between static and dynamic named sets, but to find out whether the candidate knows the intricacies involved in using named sets: what business problems they solve for us, and what the guidelines are for creating them.

So before we delve into named sets, let’s see how the different parts of an MDX query are resolved.

If you notice, sets are evaluated after the FROM and WHERE clauses are evaluated. This is where the distinction between STATIC and DYNAMIC named sets shows up.

So to summarize:

STATIC sets: The MDX query execution context from the FROM and WHERE clauses is not passed to the set. They are evaluated when the CREATE SET statement executes, or at query run time if defined with the WITH SET clause. If static named sets are defined in the cube, they are evaluated at cube processing time.

DYNAMIC sets: The MDX query execution context from the FROM and WHERE clauses is passed to the set. They are evaluated every time the query is run; it really doesn’t matter where they are defined.

Let’s see the above in action before we delve into the remaining intricacies:

STATIC SET

CREATE SET [Adventure Works].[TOP 5 PRODUCTS BY SALES] AS
TOPCOUNT(
    [Product].[Product Model Categories].[Model Name].MEMBERS
    ,5
    ,[Measures].[Internet Sales Amount]);

SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
       [TOP 5 PRODUCTS BY SALES] ON ROWS
FROM [Adventure Works];

SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
       [TOP 5 PRODUCTS BY SALES] ON ROWS
FROM [Adventure Works]
WHERE ([Date].[Calendar].[Calendar Semester].&[2003]&[2]);

SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
       [TOP 5 PRODUCTS BY SALES] ON ROWS
FROM
(select [Date].[Calendar].[Calendar Semester].&[2003]&[2] on 0
 FROM [Adventure Works]);

The STATIC set did not consider the sub cube.

DYNAMIC SET

CREATE DYNAMIC SET [Adventure Works].[TOP 5 PRODUCTS BY SALES] AS
TOPCOUNT(
    [Product].[Product Model Categories].[Model Name].MEMBERS
    ,5
    ,[Measures].[Internet Sales Amount]);

SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
       [TOP 5 PRODUCTS BY SALES] ON ROWS
FROM [Adventure Works];

SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
       [TOP 5 PRODUCTS BY SALES] ON ROWS
FROM [Adventure Works]
WHERE ([Date].[Calendar].[Calendar Semester].&[2003]&[2]);

SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
       [TOP 5 PRODUCTS BY SALES] ON ROWS
FROM
(select [Date].[Calendar].[Calendar Semester].&[2003]&[2] on 0
 FROM [Adventure Works]);

The DYNAMIC set did consider the sub cube.

So we saw that the context set by the WHERE clause is considered in evaluating DYNAMIC sets. Also, when we specified a sub cube, the context was appropriately applied to the dynamic set. You can find out more about sub cubes versus the WHERE clause in the next blog, SSAS Sub Select Vs Where Clause.

So far so good. But the story doesn’t end here. Read on…

Before we get into more details, let me define a simple term: MULTI-SELECT. It means more than one member is specified, either in the sub cube or in the WHERE clause. We already saw that the sub cube context is not passed on to static sets; to pass the context, we need to use dynamic sets.

So where is the problem? The problem lies in the MDX that client applications send to the Analysis Services server. Let us see a couple of examples:

The cube browser (the OWC component) implicitly converts sub cube MDX into a WHERE clause, and to pass the context set in the WHERE clause to calculated members we need to use the EXISTING keyword. Refer to the blog SSAS Sub Select Vs Where Clause below.

Excel and Report Builder use sub cubes, so to pass that context we can use dynamic sets; even then, the numbers may not always be what users expect. This is very relevant when implementing drillthrough actions.

SSRS (if you use the query designer) generates MDX that has both sub cubes and WHERE clauses, so again we need to be careful when using dynamic sets.

So I would suggest that named sets be used with caution.

But all is not bad with named sets: since Analysis Services caches the tuples while evaluating the named set expression, they can provide better performance.

Will be back with more blogs, when I get time….

Please provide your feedback. I can be reached at jagdish_malani@hotmail.com


SSAS Sub Select Vs Where Clause

I have created various reports with MDX, and mostly I have used sub selects. Recently, in one of the reports, for some reason I swapped the sub select with a WHERE clause, and after testing I sent the report to production. I felt like blogging about it, so here it goes….

Please note that I will be using the terms sub select and sub cube interchangeably.

Let’s see a simple MDX for each:

MDX with Sub-Select:

SELECT {[Measures].[Order Count]} on 0
,[Product].[Category].Members on 1
FROM ( select
        {[Product].[Product Categories].[Subcategory].&[26]
        ,[Product].[Product Categories].[Subcategory].&[22]} on 0
       from [Adventure Works] );

If you notice the numbers, the total for Accessories and Clothing does not match the number shown for “All Products”. It is not rocket science and I leave it to the readers to figure out why. Another concept, attribute relationships, also comes into play here, about which I am going to write a separate blog.

How the sub cube works:

  • All members from all dimensions are eligible for selection.
  • The cube context is defined by the default members of each dimension. However, you can refer to members not within that data space.
  • With a sub select, the cube is sliced based on the dimensions specified in the sub select, which limits the cube space. Thus the members on an axis by default contain only data related to what the sub select specifies: only the categories that contain the sub categories in the sub select are shown, and the order count is likewise only for those sub categories.
  • Sub selects are intended to provide “visual totals” type functionality, where the value of the All member on a hierarchy changes to reflect the members actually selected in the query. What that means is that the totals are the totals of the numbers in the cube space restricted by the sub select.
  • Another point of interest is that calculated members have the ability to look outside the sub cube. For example:

with member measures.test as
[Product].[Category].currentmember.prevmember.name
select [Measures].test on 0, [Product].[Category].[Category].members on 1
from ( select {[Product].[Category].[Category].&[4],
               [Product].[Category].[Category].&[3]} on 0
       from [Adventure Works])

  • A sub cube cannot contain a WITH clause, a NON EMPTY clause, or a HAVING clause.
  • Sub cubes do not restrict levels, as seen below:

select
descendants([TIME].[yqmd].year.members, time.yqmd.month, self_and_before) on 0
from ( select [TIME].[YQMD].[Quarter].&[2001-10-01T00:00:00] on 0
       from [Sales])

  • Tuple specifications in the sub cube: when you specify tuples in the sub cube, the data cells that are not part of the sub cube do not contribute to aggregates. Contrast this with auto-exists.

SELECT
{[Date].[Calendar].[Calendar Quarter].members, [Date].[Calendar].[Calendar Year].members} on 0
,[Product].[Category].Members on 1
FROM ( select
        {([Product].[Product Categories].[Subcategory].&[26]
          ,[Date].[Calendar].[Calendar Quarter].&[2004]&[1])
        ,([Product].[Product Categories].[Subcategory].&[22]
          ,[Date].[Calendar].[Calendar Quarter].&[2004]&[2])} on 0
       from [Adventure Works] )
where [Measures].[Internet Sales Amount];

  • A WHERE clause specified in the sub cube is not passed on to the outer MDX query:

Let’s find out the total sales for Australia for year 2002:

select [Customer].[Customer Geography].[Country].&[Australia] on 0,

[Measures].[Internet Sales Amount] on 1

from [Adventure Works]

where ([Date].[Calendar].[Calendar Year].&[2002]);

Now let’s use this in the sub cube clause:

select [Customer].[Customer Geography].[Country].&[Australia] on 0,

[Measures].[Internet Sales Amount] on 1

from

( select [Customer].[Customer Geography].[Country].&[Australia] on 0,

[Measures].[Internet Sales Amount] on 1

from [Adventure Works]

where ([Date].[Calendar].[Calendar Year].&[2002]))

Notice that for Australia in 2002 the sales were 2.15 million. But when the same query was used as a sub cube, the slicer in the inner query was ignored by the outer query: it did not restrict the time dimension in any way. That implies that a WHERE clause specified in the sub cube is not passed on to the outer query context.

  • Aggregate function does not restrict its values to the sub cube:
select [Customer].[Customer Geography].[Country].&[Australia] on 0,

[Measures].[Internet Sales Amount] on 1

from [Adventure Works]

where ([Date].[Calendar].[Calendar Year].&[2002]);

The Australia sales is:

The total sales for all countries is:

select

[Customer].[Customer Geography].[All Customers] on 0,

[Measures].[Internet Sales Amount] on 1

from [Adventure Works]

where ([Date].[Calendar].[Calendar Year].&[2002]);

Now when we use aggregate function:

with
member measures.m as

aggregate([Customer].[Customer Geography].[Country].members, [Measures].[Internet Sales Amount])

select [Customer].[Customer Geography].[Country].&[Australia] on 0,

measures.m on 1

from [Adventure Works] where ([Date].[Calendar].[Calendar Year].&[2002]);

  • Performance: when a query contains a sub select, the results of any calculations can only be cached for the lifetime of the query. That said, multiple nested sub selects do have a slight performance advantage.

MDX with Where Clause:

SELECT {[Measures].[Order Count]} on 0,
       [Product].[Category].Members on 1
FROM [Adventure Works]
WHERE ({[Product].[Product Categories].[Subcategory].&[26],
        [Product].[Product Categories].[Subcategory].&[22]});

Before we go further, notice that the order count returned is for the product categories which contain the sub categories specified in the WHERE clause. In other words, the MDX says: give me the sum of the order count for the categories that contain the specified sub categories.

To validate it, see the following MDX:

SELECT {[Measures].[Order Count]} on 0
,{[Product].[Product Categories].[Subcategory].&[26]
 ,[Product].[Product Categories].[Subcategory].&[22]} on 1
FROM [Adventure Works]

Compare these numbers with the MDX using Sub-Select above.

How the WHERE clause works:

  • All members from all dimensions, except the dimensions whose members are specified in the WHERE clause, are eligible for selection.
  • You cannot refer to members outside the data space, hence you cannot use the <where member>’s dimension in the <columns set> or <rows set>.
  • No visual totals: it is important to note that the total (or All) member is the total sum of the whole dimension to which the <where member> belongs. Notice the effect of having a WHERE clause here: categories are pulled based on the members specified in the WHERE clause, but the measures are still rolled up by the members selected on the axis. In other words, the order count is for the whole product categories (including all of their sub categories).
  • To get the measures rolled up by only the members specified in the WHERE clause, we need to use the EXISTING keyword.

Before we look at the EXISTING keyword, let’s first look at the plain MDX:

WITH MEMBER Measures.OrderCount as
SUM( {[Product].[Product Categories].[Subcategory].&[26]
     ,[Product].[Product Categories].[Subcategory].&[22]}
   , [Measures].[Order Count])
SELECT {Measures.[OrderCount]} on 0
      ,[Product].[Category].Members on 1
FROM [Adventure Works]
WHERE ( {[Product].[Product Categories].[Subcategory].&[26]
        ,[Product].[Product Categories].[Subcategory].&[22]} );


Notice that 2573 is the sum for these two (Accessories and Clothing), as we saw in the previous MDX. We get the same number for “All Products” and for each of the two categories. This is because the context of the WHERE clause is not being used in evaluating the calculated member; while evaluating the row axis, the SUM returned the same value for each product category.

Now let’s add the EXISTING keyword, and notice that the correct context is passed, as seen in the MDX below:

WITH MEMBER Measures.OrderCount as
SUM( existing {[Product].[Product Categories].[Subcategory].&[26]
              ,[Product].[Product Categories].[Subcategory].&[22]}
   , [Measures].[Order Count])
SELECT {Measures.[OrderCount]} on 0
      ,[Product].[Category].Members on 1
FROM [Adventure Works]
WHERE ( {[Product].[Product Categories].[Subcategory].&[26]
        ,[Product].[Product Categories].[Subcategory].&[22]} );


  • If the WHERE clause does not contain a single member but a <where set>, “CurrentMember” will not work properly. You will need the help of the keyword “EXISTING” on the dimension/set used by the <where set> to ensure you stay within the context of the members defined by that <where set>.
  • It may produce better performance.
  • It is more readable.
  • You might want to use the WHERE clause because it sets the context, that is, it defines the “CurrentMember”, and some specific MDX functions rely on this capability (YTD, PP, …); a small sketch of this follows below.
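For example, here is a minimal sketch against Adventure Works of YTD relying on the CurrentMember set by the WHERE clause (the month key used here is only illustrative):

WITH MEMBER Measures.[YTD Sales] AS
    SUM( YTD([Date].[Calendar].CurrentMember), [Measures].[Internet Sales Amount])
SELECT {[Measures].[Internet Sales Amount], Measures.[YTD Sales]} ON 0
FROM [Adventure Works]
WHERE ([Date].[Calendar].[Month].&[2003]&[6]);

Because the WHERE clause sets the current member of [Date].[Calendar] to June 2003, YTD() can expand it to January–June 2003. With a sub select instead of the WHERE clause, the current member would still be the default (All) member and YTD would not give this year-to-date rollup.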

Note:
When multiple members are specified in the WHERE clause, it presents the popular MULTI-SELECT issue. I am going to blog about it very soon. Stay tuned…

Wrapping this up with a quick summary:

SUB SELECT

  • All members from all dimensions are eligible for selection.
  • It limits the cube space.
  • The context is defined by the default members of each dimension. However, with calculated members you can refer to members not within that data space.
  • Visual totals: the total (or All) member presents the total sum of the values for the members defined by the <subselect set>. This is a “visual total”.
  • The Aggregate function does not restrict its values to the sub cube.
  • Multiple nested sub selects have a slight performance advantage, which is why you will see SSRS use both sub selects and the WHERE clause: it gets both the performance gain and the context, though it makes the query a real mess aesthetically.
  • A sub cube cannot contain a WITH clause, a NON EMPTY clause, or a HAVING clause.
  • Data cells not part of the sub cube do not contribute to aggregates.
  • Sub cubes do not restrict levels.
  • A WHERE clause specified in the sub cube is not passed on to the outer query.

WHERE CLAUSE

  • The <where member> sets the context, hence that member plus all dimensions EXCEPT the one to which the <where member> belongs are eligible for selection.
  • It does not limit the cube space.
  • You cannot refer to members outside the data space, hence you cannot use the <where member>’s dimension in the <columns set> or <rows set>.
  • No visual totals: the total (or All) member presents the total sum of the whole dimension to which the <where member> belongs.
  • To get the measures rolled up by the members specified in the WHERE clause, we need to use the EXISTING keyword.
  • You might want to use the WHERE clause because it sets the context, that is, it defines the “CurrentMember”, and some specific MDX functions rely on this capability (YTD, PP, …).
  • It is more readable.

If you are dealing with “raw” measures, the two can often be used interchangeably. A performance comparison of the two is not considered here.

Please provide your feedback, if you have any. I can be reached at jagdish_malani@hotmail.com


Competency Building

IT organizations were doing well until some time back. The subprime crisis led to the fall of many banks in the US and, before we knew it, we were in recession. Nobody knows when this recession will end, but we all hope it ends soon. Until it does, IT organizations face the typical problems seen during recession, such as rising bench strength, increased costs and lower revenues. To make matters worse, there is uncertainty about when the economy will turn around. Hoping that the recession ends soon, organizations should use this slowdown to their advantage and prepare themselves for the good times. The first thing organizations are trying to do is increase employee utilization by effectively training employees on newer, in-demand technologies; they are reluctant to hire at any level. Thus organizations are focusing on building competency internally. In the light of cut-throat competition, competency building has to be aligned with the organization's strategy, derived from sales planning and operations management.

Let’s see how organizations should go about competency building . . . .

  1. IDENTIFY THE GOALS: The competency building exercise needs some retrospection before organizations take the first step, because this exercise would have been done even during the good times. So organizations must examine the earlier efforts meticulously: find out the success rate and its impact on various projects. Otherwise, following the same approach will yield the same results. If an organization has been struggling with competency building since the good times, that is an indicator that something wasn't done right earlier. During a slowdown it is critically important to take the right step, or they risk heading in the wrong direction altogether. Going forward, when the economy picks up, organizations that were strategic will have the edge over their competitors. Organizations should approach competency building as follows:
     

    1. Identify the areas in which the competency needs to be built: People who drive sales strategy (typically senior management) and people who are responsible for operations need to come together and align their goals. The mistake that a few organizations make is running competency building in a silo; that way, they are never able to build the competency required to drive sales growth.
    2. Define the extent of competency building: They must work out approximate target numbers at each level of the pyramid. These numbers must be tied to the sales targets, both short term and long term.
    3. Expectation management: The next most important factor that defines success in competency building is having the right perception. Training employees in a new technology does not make them experts in one go; instead, they get a quick and timely head start in the new area. The fact that every organization, while hiring, looks for a specified number of years of experience in a particular stream holds across all verticals and at all levels. If this weren't true, you would see advertisements like "We need 5 smart people with or without prior experience for all levels". Having said that, this does not mean competency cannot be built. It can be built if organizations define a good strategy and ensure that the strategy meets the overall organizational goals. For example, management must provide specific follow-up trainings and live project experience (even an internal project would suffice) and see the exercise through.
  2. ACHIEVE THE GOALS: Once organizations have figured out their goals clearly, as stated above, they must ACT to achieve them. Organizations can run different programs for this. One way is to run training programs in the areas in which they need to build up competency; another is to use employees available on the bench to develop internal tools that improve the organization's efficiency. This should be done while remaining focused and ensuring that these programs are aligned with the overall goals. Organizations must work to ensure the following:
     

    1. Focused trainings: The training programs should be very focused. Organizations must do their due diligence in gap analysis. They should consider the business lines they are in and also explore the opportunity of expanding their horizon into different areas. This gap analysis can be done by:
      1. Analyzing earlier projects: Organizations must spend time analyzing earlier projects from different perspectives. They must find out what went wrong and which specific areas they must improve upon.
      2. Analyzing earlier sales deals: Organizations must also analyze the pre-sales deals that did not materialize. A few reasons a sales deal may not have materialized are lack of fitting resources, no prior experience, poor estimates or high cost. Focused training helps organizations fill in these gaps.
    2. Role based trainings: Next, organizations must evaluate their employees and get buy-in from the employees being trained. Another thing organizations must ensure is that these trainings are role based. This means that if a lead-level employee is trained on a new technology, the organization must figure out whether the same employee would be able to play a similar lead role in that new technology. Playing a specific role in any technology needs some prior experience, and this becomes especially important for senior roles. This holds true not only during recession times but also during good times.
    3. Development of internal tools: Developing internal tools is good for every organization in various ways, such as:
      1. Building competency
      2. Manage operations effectively
      3. Evaluate a technology and
      4. Helps in sales pitch

    But before organizations jump into developing internal tools, they must take care of a few things to be effective:

    1. Selection of the right tools to be developed: Organizations must ensure that they are investing in developing the right set of internal tools, the ones that are really needed. When approached, every manager or head of department will have a long list of internal tools they would like developed for them; after all, they are not paying for it. This leads to a bigger mess later (multiple applications using different technologies with no consolidated data) if not managed properly. Organizations must take a holistic approach in deciding which tools to develop, which technology to use and their priorities, and must not use a technology just for the sake of competency building. This helps them coordinate training programs effectively, with greater success.
    2. Execution model: After identifying the tool to be developed and the team that is going to work on it, organizations have to make sure the project is executed in the right model, as if it were a live project for a customer. Organizations must not take shortcuts here or swap the roles and responsibilities within the development team executing the project. For instance, project managers must not be collecting requirements if they aren't supposed to play that role in live projects. The right set of people should be given the right set of roles.
  3. PERIODIC EVALUATION: Competency building efforts should not go on for a long time without a mechanism to measure success. Their success must be measured periodically. This is good for both the organization and the employees: the organization is able to deploy new people effectively, and if employees are not convinced themselves, they will not give their 100%.

IT organizations with an effective strategy, strong leadership and vision will be able to build the right competency.


ETL: How to handle bad data

During any ETL design, we implement various functionalities like validation, auditing, notification, job recovery, job logging, data cleansing, handling bad data, etc. I am going to talk about handling bad data in this blog. At a top level, an ETL design allows bad data to be rejected and sent over to the appropriate users in the form of files. But, in my opinion, there is more to this than meets the eye, and as ETL architects our responsibility does not end there. So before I get into the details of how we should handle bad data, let me tell you what makes the incoming data bad enough to need handling. The following are a few reasons that generate bad data:

  1. A set of business rules is laid out that defines whether the incoming data is good or bad. Consider a sales record where the cost of the product must be present: if the cost contains a null value or a negative number, the sales record is considered bad. (A small SQL sketch of such checks follows this list.)
  2. Any data that would not satisfy referential integrity in the data warehouse database. This usually happens in the case of missing inter-dependent data: if the incoming data contains references to some other data which could not be loaded for some reason, this incoming data becomes bad and should not be loaded into the data warehouse database. A typical example is a retail chain that maintains the product master in a centralized database while sales data is generated across different POS terminals. If the corporate database is down for whatever reason during ETL, the ETL cannot load new products, yet there are sales generated for these new products. With no product in the product dimension (master), such a sales record is considered bad at this moment and is not loaded into the data warehouse database.
  3. Missing business keys: if mandatory key data is missing from the incoming data, that data is considered bad. This usually does not happen when the data is sourced from other relational databases; while sourcing data from files, however, there is every possibility that data may go missing, even if the format of the incoming feed file has already been agreed.
  4. Missing data: there may be many cases where data is missing from the incoming records, which logically makes the data bad. For example, an incoming product feed file contains a record with no product code and no product description. In this case the data is considered invalid.
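As an illustration only, here is a minimal SQL sketch of how such checks might look while validating the staging data; the table and column names (stg_sales, dim_product, product_cost, product_code, bad_reason) are hypothetical, not from any specific implementation:

-- Rule-based check: cost must be present and non-negative
UPDATE s
SET    s.bad_reason = 'Invalid product cost'
FROM   stg_sales s
WHERE  s.product_cost IS NULL OR s.product_cost < 0;

-- Missing business key
UPDATE s
SET    s.bad_reason = 'Missing business key'
FROM   stg_sales s
WHERE  s.bad_reason IS NULL AND s.product_code IS NULL;

-- Referential check: the product must already exist in the product dimension
UPDATE s
SET    s.bad_reason = 'Missing product reference'
FROM   stg_sales s
LEFT JOIN dim_product p ON p.product_code = s.product_code
WHERE  s.bad_reason IS NULL AND p.product_code IS NULL;

Rows with a non-null bad_reason would then be moved to the rejection area described below, and only the remaining rows continue into cleansing and loading.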

Now that we have seen what generates bad data, we must understand that data cleansing does not make any data bad; data cleansing is applied only on good data. So before we start data cleansing, a mechanism needs to be put in place to separate the good data from the bad data.

Once this bad data is identified, it is usually stored in a separate area called the “rejection area”. This rejection area can be a separate schema in the same database that contains the staging schema, or it can be a separate database altogether. The structure of the rejection area (table structure) is similar to that of the staging area with a few additional columns, which are required to store metadata about the rejected data.

As an ETL architect we need to design our ETL to provide the following functionality:

  1. Ability to reprocess this bad data whenever required. Data that could not be loaded due to missing references is usually re-processed once that missing data is loaded into the data warehouse database. Consider the case where sales data was rejected due to a missing product master: later, when the latest product master is loaded, this bad data (which wasn't really bad) needs to be reprocessed, otherwise the sales summary report would not be accurate. Another reason to reprocess bad data is a change in business rules: if there is a lot of rejection due to strict validation rules, the customer may decide to relax those rules so that a large amount of data is not rejected. We can automate this functionality by adding a few columns to the rejection area tables (a sketch of such a table follows this list):
    1. Reprocess_Flag (Y|N): when set, this flag indicates that the record needs to be re-processed. Moving these rejected records from the rejection area to the staging area should be automated. This helps customers in various ways, such as reduced dependency on IT staff and lower maintenance cost.
    2. Reprocess_Job_Id: metadata about each job run is usually maintained, so to be able to audit when the rejected record was reprocessed, the job id is stamped into this column.
    3. Active_Flag: once a record marked for reprocessing is copied over to the staging area, the record is made inactive, as it is not valid anymore. The record may fail validation again and end up in the rejection area, but it is then considered a new record. So, in essence, there is only one active instance of a rejected record in the rejection area, and inactive records cannot be chosen for re-processing.
  2. Ability to reprocess incoming data: many times, for various reasons, the same data that has already been processed is fed into the ETL again. This requires us to identify the corresponding records in the rejection area and mark them as inactive, as those records are not valid anymore; the incoming records are then validated as per the current business rules. The records in the staging area are compared with the records in the rejection area on the business keys, and for matching records Active_Flag is set to 'N'. This process of marking existing rejected records as inactive is usually automated.
  3. Ability to mark invalid data: sometimes the business keys in the incoming records are null. These records eventually end up in the rejection area as active records and, no matter how many times they are reprocessed, they will end up in the rejection area again. At the same time, incoming records cannot be reliably matched against them, since the business keys are null. Thus these records should be marked as invalid. For this, add the following column to the rejection area table:
    1. Valid_Flag (Y|N): this flag is set to 'N' for records with missing business keys, which implies that such a record can never be re-processed.

    The important fact to notice here is that the responsibility of the ETL architect does not end here. Designing the ETL to mark rejected data as invalid does not solve any business problem: the incoming data must eventually get loaded into the data warehouse database. So it becomes very important for the BI architect to talk to the end users and explain the impact. The source systems may need some tweaking, but if the users need accurate reports, they must send accurate data.
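To make this concrete, here is a minimal sketch of what a rejection area table and the reprocessing/inactivation statements could look like; the names (rej_sales, stg_sales and the columns other than the flags described above) are hypothetical and only illustrate the idea:

CREATE TABLE rej_sales (
    product_code     VARCHAR(50) NULL,              -- business key from the staging record
    sales_amount     DECIMAL(18,2) NULL,
    bad_reason       VARCHAR(200) NOT NULL,         -- why the record was rejected
    reprocess_flag   CHAR(1) NOT NULL DEFAULT 'N',  -- Y = user wants it reprocessed
    reprocess_job_id INT NULL,                      -- job run that picked it up again
    active_flag      CHAR(1) NOT NULL DEFAULT 'Y',  -- N = superseded / already reprocessed
    valid_flag       CHAR(1) NOT NULL DEFAULT 'Y',  -- N = missing business key, never reprocess
    rejected_on      DATETIME NOT NULL DEFAULT GETDATE()
);

DECLARE @job_id INT = 1;  -- id of the current ETL job run (hypothetical)

-- Copy records the user marked for reprocessing back into staging and deactivate them
INSERT INTO stg_sales (product_code, sales_amount)
SELECT product_code, sales_amount
FROM   rej_sales
WHERE  reprocess_flag = 'Y' AND active_flag = 'Y' AND valid_flag = 'Y';

UPDATE rej_sales
SET    active_flag = 'N', reprocess_job_id = @job_id
WHERE  reprocess_flag = 'Y' AND active_flag = 'Y' AND valid_flag = 'Y';

-- When already-processed source data arrives again, deactivate the matching rejected records
UPDATE r
SET    r.active_flag = 'N'
FROM   rej_sales r
JOIN   stg_sales s ON s.product_code = r.product_code
WHERE  r.active_flag = 'Y';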

What I have explained is just one way of designing the rejection area (the tables containing bad data); what I have discussed is the concept of handling bad data, and there can be different ways of implementing the functionality mentioned above. Once the ETL and database are designed appropriately, an interface must be provided to the end users that allows them to do the following:

  1. Select any rejection table and mark rejected records to be reprocessed.
  2. Select already processed data for reprocessing. This is simple if the incoming data is in the form of feed files; it gets a little more complex when the data is extracted from existing databases. Typically, in large ETL systems, the staging area is archived, so depending on the needs of the customer this goal can be achieved.
  3. Look at the invalid data and analyze it, to be able to fix the source systems accordingly.
  4. Execute the ETL job after selecting the rejected records or the selected source data for reprocessing. This would depend on various other factors such as the ETL time window, the need to reflect the correct data, the time of the ETL run, etc.

Last but not the least, as ETL architects our goal should not be just to implement some logic to handle bad data; our main responsibility is to make this whole thing as automated as possible. Automation provides various benefits such as reduced development time and fewer errors, hence increased quality and, finally, reduced cost for the customer.


SSAS Server based Time Dimension

For any SSAS implementation, the time dimension plays an important role. Different architects take different approaches: some decide to create their own physical time dimension table and populate it accordingly, while others use the SSAS server based time dimension. SSAS, in turn, provides two options, namely:

  • A table for time dimension is created in the data source
  • A table for time dimension is created on the server

Let’s see the impact of using a server based time dimension, and then you will be able to make the call about which way to go. The following points need consideration:

  1. The relational database contains all fact tables and dimension tables except a time dimension table, which is implemented in the OLAP layer. So the data model is split between the data source and OLAP layers. If the relational data warehouse database needs to be exposed to other applications for some reporting, custom rollups by time periods would be difficult to implement.
  2. With a server based time dimension, the keys at different levels of a hierarchy are not unique, which presents problems while querying the AS database. Let’s quickly look at this scenario.

    This is the fact table, where reservation_date is the date the booking is made and check_in_date is the date when the guest arrived. The column sales is used as a measure.


    The requirement is to create a report which displays the sales for all months where the month of the reservation date and the month of the check-in date are the same.
    I have populated the table as follows:


    If we were to use SQL, the following query would give us the required results:
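A minimal sketch of such a query, assuming a hypothetical table name fact_reservation with the reservation_date, check_in_date and sales columns described above:

SELECT  YEAR(reservation_date)            AS reservation_year,
        DATENAME(month, reservation_date) AS reservation_month,
        SUM(sales)                        AS sales
FROM    fact_reservation
WHERE   YEAR(reservation_date)  = YEAR(check_in_date)
  AND   MONTH(reservation_date) = MONTH(check_in_date)
GROUP BY YEAR(reservation_date), MONTH(reservation_date), DATENAME(month, reservation_date)
ORDER BY YEAR(reservation_date), MONTH(reservation_date);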


    With MDX, this again looked like a simple query. In my cube I have used a server based time dimension. Let’s see the results:

    WITH
    MEMBER [MEASURES].[SALES FOR SAME MONTH RESERVATION] AS
        (LINKMEMBER([TIME].[YQMD].CURRENTMEMBER, [CHECK IN].[YQMD]), [MEASURES].[SALES])
    // Displaying the check in date member name to verify correct linking
    MEMBER [MEASURES].[CHECK IN DATE] AS
        LINKMEMBER([TIME].[YQMD].CURRENTMEMBER, [CHECK IN].[YQMD]).MEMBER_NAME
    SELECT
        {MEASURES.[CHECK IN DATE], [MEASURES].[SALES FOR SAME MONTH RESERVATION]} ON COLUMNS,
        YTD([TIME].[YQMD].[MONTH].&[2011-04-01T00:00:00]) ON ROWS
    FROM [TEST CUBE]


    Notice that for January 2011 the check in date member is returned at the Calendar Year 2011 level, and for April it returned a member at the Quarter level. Why is this so?

    The reason is that the server based time dimension uses the same key at different levels, as shown below; thus LINKMEMBER links to the wrong member.

Member               Key
Calendar 2011        1-Jan-2011
Quarter 1, 2011      1-Jan-2011
January 2011         1-Jan-2011
1-Jan-2011           1-Jan-2011

To fix it, we need to change the MDX a little as follows:

WITH
MEMBER [MEASURES].[SALES FOR SAME MONTH RESERVATION] AS
    (iif(linkmember([TIME].[YQMD].CURRENTMEMBER, [CHECK IN].[YQMD]).level.ordinal = 1,
         linkmember([TIME].[YQMD].CURRENTMEMBER, [CHECK IN].[YQMD]).firstchild.firstchild,
         iif(linkmember([TIME].[YQMD].CURRENTMEMBER, [CHECK IN].[YQMD]).level.ordinal = 2,
             linkmember([TIME].[YQMD].CURRENTMEMBER, [CHECK IN].[YQMD]).firstchild,
             linkmember([TIME].[YQMD].CURRENTMEMBER, [CHECK IN].[YQMD])))
    , [MEASURES].[SALES])
MEMBER [MEASURES].[CHECK IN DATE] AS
    iif(linkmember([TIME].[YQMD].CURRENTMEMBER, [CHECK IN].[YQMD]).level.ordinal = 1,
        linkmember([TIME].[YQMD].CURRENTMEMBER, [CHECK IN].[YQMD]).firstchild.firstchild,
        iif(linkmember([TIME].[YQMD].CURRENTMEMBER, [CHECK IN].[YQMD]).level.ordinal = 2,
            linkmember([TIME].[YQMD].CURRENTMEMBER, [CHECK IN].[YQMD]).firstchild,
            linkmember([TIME].[YQMD].CURRENTMEMBER, [CHECK IN].[YQMD]))
    ).MEMBER_NAME
SELECT
{MEASURES.[CHECK IN DATE], [MEASURES].[SALES FOR SAME MONTH RESERVATION]} ON COLUMNS,
YTD([TIME].[YQMD].[MONTH].&[2011-04-01T00:00:00]) ON ROWS
FROM [TEST CUBE]

3. If you notice, the month names are displayed as “January 2011”. Usually the month names are displayed as Jan-11 in reports. If that is required, you have to change the label expression in all those reports, probably as follows:

      =Left(Fields!Month.Value,3)+"-"+Right(Fields!Month.Value,2)

The bottom line is that we do not have much control over a server based time dimension and we cannot customize it.

With all these issues (if you want to call them that), I would think a server based time dimension in SSAS can be used in the following scenarios:

  • We have no control over the data sources at all.
  • We are already in production, and we cannot do much about it.
  • Or, by virtue of the UDM, your data source is some OLTP system, and for our data warehouse database we would not be allowed to change the OLTP system in any way.

Please feel free to provide feedback.


Slowly Changing Dimension Type 2 in SSAS Dimension

In the world of BI, everybody is familiar with slowly changing dimensions. There are three types of changing dimensions, namely Type 1 (where the attributes are overwritten), Type 2 (history is preserved in new rows) and Type 3 (limited history is preserved in additional columns).

And when it comes to creating an SSAS dimension on top of one, we need to take care of a few things like:

  • attribute relationships,
  • how the hierarchies would behave, if there are some, and
  • performance.

I will only discuss SCD Type 2 here in this blog.

Let’s look at the data in the customer dimension. We have an SCD Type 2 here.

We are tracking history on marital status and, for the sake of brevity, we will allow the value to change only once.

Thus any change in the value of the “Married” column requires a new record to be inserted with the new value, with ScdStartDate set and ScdState marked as “C” for current. The previous record is marked as “H” in the ScdState column and the end date is stamped in ScdEndDate (a small SQL sketch of this change follows). See the following screenshot and notice that customer 6 has two records.
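As an illustration of this Type 2 change, a minimal SQL sketch assuming a hypothetical table DimCustomer with a CustomerBusinessKey column alongside the columns shown above:

-- Close the current record of the customer whose marital status changed
UPDATE DimCustomer
SET    ScdState = 'H', ScdEndDate = GETDATE()
WHERE  CustomerBusinessKey = 6 AND ScdState = 'C';

-- Insert the new current record carrying the changed attribute value
INSERT INTO DimCustomer (CustomerBusinessKey, Married, ScdStartDate, ScdEndDate, ScdState)
VALUES (6, 'Y', GETDATE(), NULL, 'C');  -- 'Y' is an illustrative new marital status value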


Now, first we will see how to set up the attribute relationships in this case. The following screenshot shows how we set up the attribute relationships:


Here, the attributes which may change are made related attributes of Dim Customer (the key attribute). All the other attributes which would not change are made related attributes of the customer business key. We are assuming that the birth place of a customer would not change.

Let’s create a hierarchy State – City – Customer:


All is well so far. Let’s browse this dimension. We see following:


We see that customer 6 is repeating under the city MUMBAI. This is not what we want. As you may have guessed, the solution is pretty simple: change the hierarchy level from Dim Customers (which repeats for a record with a change in marital status) to COD Customer (the business key).

Here we go; we change the hierarchy level:


Let’s browse once more to see if everything is perfect:


Yes, everything is perfect. We are almost done with our dimension design. No, wait!

If you have noticed, in this design we have created an un-natural user hierarchy.


This would affect query performance. If you consider real world examples, some dimensions are very large; for instance, the customer dimension for a store like Walmart.

Now let’s see how to fix this.

As a first step, we need to create an intermediate attribute between city and COD Customer. This can be done in DSV with a named calculation as follows:


After that, bring this attribute into the dimension and create a level in the hierarchy with this attribute between City and COD Customer:


Let’s make this attribute invisible in the dimension. We also do not want this level to show up in the hierarchy while browsing, so we need to make the Customer – City level invisible. How do we do that, as there is no direct property available for it? Change the name column of this attribute to “City” and set the “HideMemberIf” property to “ParentName”, as follows:


Let’s see how this comes up while browsing:


Now we can say that our job is done.