Thursday 18 October 2012

ThePower of Exotic Joins and why having a Calendar table is useful


non-SQL developers look away now....

The Scenario:

On her way back from the polo club, the director of marketing was driving across town and noticed several different billboard campaigns been ran by his company. Concerned about the inconsistency of how her brand’s identity was been rolled out he posed a question to his PA to find out all campaigns currently in progress.

The Problem:
In our scenario each campaign has a start and end date along with a campaign id. Let’s say they reside in a table called campaign which is defined as follows:


CREATE TABLE campaign
      (
       campaign_id int
      ,date_start  datetime
      ,date_from   datetime
      )


For this example the following data can be inserted which shows that campaign id’s 1 and 3 will overlap:


insert into dbo.campaign values( 1,'01-jan-2011','10-jan-2011')
insert into dbo.campaign values( 2,'01-feb-2011','10-feb-2011')
insert into dbo.campaign values( 3,'05-jan-2011','09-jan-2011')
insert into dbo.campaign values( 4,'01-mar-2011','10-mar-2011')
insert into dbo.campaign values( 5,'01-jan-2011','10-jan-2011')
insert into dbo.campaign values( 6,'01-feb-2011','10-feb-2011')
insert into dbo.campaign values( 7,'05-jan-2011','09-jan-2011')
insert into dbo.campaign values( 8,'01-mar-2011','10-mar-2011')
insert into dbo.campaign values( 9,'01-jan-2011','10-jan-2011')
insert into dbo.campaign values(10,'01-feb-2011','10-feb-2011')
insert into dbo.campaign values(11,'05-jan-2011','09-jan-2011')
insert into dbo.campaign values(12,'01-mar-2011','10-mar-2011')
insert into dbo.campaign values(13,'01-jan-2011','10-jan-2011')
insert into dbo.campaign values(14,'01-feb-2011','10-feb-2011')
insert into dbo.campaign values(15,'05-jan-2011','09-jan-2011')
insert into dbo.campaign values(16,'01-mar-2011','10-mar-2011')
insert into dbo.campaign values(17,'01-jan-2011','10-jan-2011')
insert into dbo.campaign values(18,'01-feb-2011','10-feb-2011')
insert into dbo.campaign values(19,'05-jan-2011','09-jan-2011')
insert into dbo.campaign values(20,'01-mar-2011','10-mar-2011')


Each row of data holds a start and end date range which we need to compare against each of the other rows in the campaign table. From a Business Intelligence view point this data doesn't present much granularity.

Date ranges stored in this fashion may also have associated hidden business implications linked to types of days such as week days, weekend days or bank holidays which could distort the desired result. This would lead to assumptions and misconceptions leading to non-realistic reporting.


The Solution:
For a set based approach we need to transform each row of summarized data defined by the date range into individual rows. 

To achieve this we need another table called calendar which, for the purposes of this scenario, has only one column called date. The date column should be populated consecutive dates that will span all start and end dates in the campaign table .
CREATE
 TABLE calendar(date  datetime)




The resultant record set from an exotic join will form a collection of individual days of when each campaign is/was in force.

This derived record set provides a starting point to the solution.

Let’s take the full list of steps in order:
1. Transform the date range data set into a record set of individual days.
     Include the campaign id and create a derived table which we can then call dr1.

2. Make a copy of step 1 and create a derived table which we can then call dr2.
3. Create a select statement that joins both derived tables on the day column key and identify the common overlapping days.
4. Add a where clause to exclude equal campaign id’s across both derived tables so that joined rows (ie dates) are only on differing campaigns.


Step 1: Transform the date ranges and create the derived table dr1.
By using the between keyword we specify the date range to join onto the calendar table. This will spawn the rows we are interested in.


      --will be used as derived table dr1
select      c1.campaign_id
                  ,c2.date_day
      from dbo.campaign c1
join dbo.calendar c2 on c2.date_day
between c1.date_start and c1.date_from
      where c1.campaign_id is not null


Step 2: Same as Step 1 that creates the derived table dr2.


      --will be used as derived table dr2
select      c1.campaign_id
                  ,c2.date_day
      from dbo.campaign c1
join dbo.calendar c2 on c2.date_day
between c1.date_start and c1.date_from
      where c1.campaign_id is not null



Step 3 and 4: The SELECT statement joins both derived tables on the day column key and identifies the common overlapping days. The WHERE clause is then used to exclude campaign's that are common across both derived tables.


    select  distinct dr1.campaign_id
    from
      (
      select      c1.campaign_id
                  ,c2.date_day
      from dbo.campaign c1
join dbo.calendar c2 on c2.date_day
between c1.date_start and c1.date_from
      where c1.campaign_id is not null
      ) dr1
      join
      (
      select      c1.campaign_id
                  ,c2.date_day
      from dbo.campaign c1
join dbo.calendar c2 on c2.date_day 
between c1.date_start and c1.date_from
      where c1.campaign_id is not null
      ) dr2
      on dr1.date_day = dr2.date_day
      where dr1.campaign_id <> dr2.campaign_id

      Results
campaign_id
-----------
1
3

(2 row(s) affected)


This shows the power of joining on ranges and the power of using a calendar table. The calendar table can also be extended to include columns that depict days such as bank holidays, weekends which may affect any Business Intelligence reporting system.


Thus endith the lesson :)


No comments:

Post a Comment