
Monday, May 8, 2017

Power BI with Narrative Science: Look Who's Talking (Part 3: On or Off premises)

[2017-May-08] Listening to a car radio with its many talk shows, it occurs to me more and more often that certain segments of those broadcasts will eventually be automated by machines. I don't think it will happen overnight, but voice natural language generation software is already a reality, and it's very near! Weather and road traffic news could be the first to be replaced: the data is already available, and AI can present it with different levels of verbosity and variety, so it will sound very much like a human.


Natural language generation software already writes reports and creates communications based on data; several companies develop AI products that consume user data and automatically explain what that data means in natural human language. I've already written two blog posts about the Narrative Science extension for Power BI:
- Power BI with Narrative Science: Look Who's Talking? 
- Power BI with Narrative Science: Look Who's Talking (Part 2: Predicting the Future) 

This time I want to briefly write about the option of using the Narrative Science extension for Power BI on-premises. This will be very helpful for organizations that work with sensitive datasets and need to keep their data within their own network environment.

This is how the free Narratives for Power BI extension works:



After you place the Narrative Science extension on your Power BI report and assign a specific dataset, that data is transmitted to the Narrative Science platform over the internet, where it gets processed and narratives are generated. The narratives are then transferred back to the Power BI report in the form of natural human language. Please read here how Narrative Science protects your data in this case: How does Narrative Science protect client data?


There is a second option, where the Narrative Generation Application and the Licensing Agent are hosted on separate servers within your own network.
No client-specific data leaves your network, but internet connectivity is still required for licensing:




Here is an extract from the Narrative Science hardware/software requirements for both the Narrative Generation Application and the Licensing Agent (https://powerbi-support.narrativescience.com/hc/en-us/articles/215191443-On-Premises-Installation-Prerequisites-):

Hardware/Server Recommendations:
Linux server (examples: virtual machine, Azure instance) with the following minimum system specifications:
1. Number of cores: 2
2. Memory: 2 GB per core
3. Disk: 20 GB
If you are installing the Narrative Science Licensing Agent on a separate server from the Narrative Generation Application, the minimum system specifications for the agent server are:
1. Number of cores: 1
2. Memory: 2 GB
3. Disk: 10 GB
Software Requirements: Operating System - Linux CentOS 7 or RHEL 7 (with systemd)


After setting up both servers, you will need to activate your Narratives for Power BI license.



There is one drawback with the licensing, though: the number of Narrative Science licenses purchased must equal the number of Power BI Pro licenses in your organization, regardless of the actual number of users who plan to work with the narratives. Perhaps this will change in the future, especially given the recent changes to Microsoft Power BI licensing itself.

You can always check the Narrative Science web site (https://www.narrativescience.com/) and contact them directly for more information on specifications and pricing details.

Happy data adventures!

Sunday, April 30, 2017

DAX calculations with hierarchies: Set the order straight.

[2017-Apr-30] If you have ever tried to create calculated measures in SSAS Tabular or Power BI models where different levels of a hierarchy had to be taken into account, then you know that the DAX language has no straightforward way to identify parent and child levels when computing metric results. There is a good article, http://www.sqlbi.com/articles/clever-hierarchy-handling-in-dax/, by Alberto Ferrari, where he explains a workaround approach for handling hierarchies in DAX.

I went one step further, based on my recent client project experience, and created a new example with one extra level in the Product dimension, using the Adventure Works Internet Sales tabular database. Here is the code that calculates the sales ratio to the parent level for all levels (Category, Subcategory, Model and Product Name):

RatioToParent:=
IF (
    ISFILTERED ( Product[HCategory] ),
    SUM ( 'Internet Sales'[Sales Amount] )
        / CALCULATE ( SUM ( 'Internet Sales'[Sales Amount] ), ALL ( Product[HCategory] ) ),
    IF (
        ISFILTERED ( Product[HSubcategory] ),
        SUM ( 'Internet Sales'[Sales Amount] )
            / CALCULATE ( SUM ( 'Internet Sales'[Sales Amount] ), ALL ( Product[HSubcategory] ) ),
        IF (
            ISFILTERED ( Product[HModel] ),
            SUM ( 'Internet Sales'[Sales Amount] )
                / CALCULATE (
                    SUM ( 'Internet Sales'[Sales Amount] ),
                    ALL ( Product[HModel] )
                ),
            IF (
                ISFILTERED ( Product[HProduct] ),
                SUM ( 'Internet Sales'[Sales Amount] )
                    / CALCULATE ( SUM ( 'Internet Sales'[Sales Amount] ), ALL ( Product[HProduct] ) )
            )
        )
    )
)

However, when I looked at the results for this new RatioToParent measure in Excel, I saw some weird values: only the top Category level showed correct calculations, and all other Product hierarchy levels showed 100%.
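
The reason is that drilling down a hierarchy also filters every column above the current level: at the Subcategory level, Product[HCategory] is still filtered, so the first IF branch always wins, and ALL ( Product[HCategory] ) removes only the category filter while the subcategory filter stays in place, making the numerator and denominator identical (hence 100%). A quick way to see this is a throwaway diagnostic measure (my own sketch against the same model, not part of the original solution) that shows which hierarchy columns are filtered in each cell:

FilterTrace :=
-- Shows, for every cell of the pivot, which hierarchy columns are currently filtered
"Category: " & IF ( ISFILTERED ( Product[HCategory] ), "filtered", "not filtered" )
    & " | Subcategory: " & IF ( ISFILTERED ( Product[HSubcategory] ), "filtered", "not filtered" )
    & " | Model: " & IF ( ISFILTERED ( Product[HModel] ), "filtered", "not filtered" )
    & " | Product: " & IF ( ISFILTERED ( Product[HProduct] ), "filtered", "not filtered" )

At the lowest level of the hierarchy, all four columns report "filtered", which is exactly why the checks have to start from the most granular level.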




Then I changed the order of levels in my DAX calculated measure from (Category > Subcategory > Model > Product Name) to (Product Name > Model > Subcategory > Category):

RatioToParent_Ordered:=
IF (
    ISFILTERED ( Product[HProduct] ),
    SUM ( 'Internet Sales'[Sales Amount] )
        / CALCULATE ( SUM ( 'Internet Sales'[Sales Amount] ), ALL ( Product[HProduct] ) ),
    IF (
        ISFILTERED ( Product[HModel] ),
        SUM ( 'Internet Sales'[Sales Amount] )
            / CALCULATE ( SUM ( 'Internet Sales'[Sales Amount] ), ALL ( Product[HModel] ) ),
        IF (
            ISFILTERED ( Product[HSubcategory] ),
            SUM ( 'Internet Sales'[Sales Amount] )
                / CALCULATE (
                    SUM ( 'Internet Sales'[Sales Amount] ),
                    ALL ( Product[HSubcategory] )
                ),
            IF (
                ISFILTERED ( Product[HCategory] ),
                SUM ( 'Internet Sales'[Sales Amount] )
                    / CALCULATE ( SUM ( 'Internet Sales'[Sales Amount] ), ALL ( Product[HCategory] ) )
            )
        )
    )
)

and that made all the difference, resolving the issue of miscalculated results:




So my lesson from this exercise is that the order of hierarchy levels in DAX calculations is very important: the checks have to start from the lowest (most granular) level and work up to the top. Otherwise, the calculation is only correct for the top parent level of your tabular model hierarchy (see the first DAX calculation above). Lesson learned, mistake not to be repeated!

Happy data adventures!


Friday, March 31, 2017

SSIS Web Service Task: misleading role of WSDL file

[2017-Mar-31] You can call a web service method in an SSIS package using a .NET Script task, or you can just use the built-in Web Service task. The latter option is easier: it is UI based, doesn't take much time to set up, and its input/output parameters are very intuitive to navigate.

There is one gotcha with the use of the WSDL file, though, that is not fully documented but is worth mentioning. When you initially set up a Web Service task in SSIS, you need to provide a web server URL for your HTTP connection (let's take this one as an example: http://www.webservicex.net/Statistics.asmx?WSDL), and then you need to point the task to a local WSDL reference file that lists all the available methods. Then you test your ETL package, it works well, and all is merry and bright. However, if you think that by changing the HTTP connection for your web service (let's say from Dev to Prod) you will be calling the Production version of your web service method, that's actually not the case: it is still going to call your Dev environment. I learned this the hard way on a recent BI project, testing a web service call in the UAT environment but, for reasons unknown at the time, still getting results from the original Dev environment.

<wsdl:service name="Statistics">
     <wsdl:port name="StatisticsSoap" binding="tns:StatisticsSoap">
         <soap:address location="http://www.webservicex.net/Statistics.asmx"/>
     </wsdl:port>
</wsdl:service>

Apparently, the SSIS Web Service task calls the web server address stored in the WSDL file's soap:address tag, regardless of what server location your HTTP connection specifies.

My way to resolve this problem was to create a separate WSDL file for my UAT web service and make it part of the SSIS environment configuration, i.e. each environment's HTTP web service connection is coupled with a corresponding WSDL file, so that the control flow always calls the environment it is configured for. Don't forget to make both of those parameters configurable.
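
For example, the UAT copy of the WSDL file differs only in its soap:address element (the host name below is just a placeholder for wherever your own UAT service lives):

<wsdl:service name="Statistics">
     <wsdl:port name="StatisticsSoap" binding="tns:StatisticsSoap">
         <!-- hypothetical UAT endpoint: point this at your UAT web server -->
         <soap:address location="http://uat-server/Statistics.asmx"/>
     </wsdl:port>
</wsdl:service>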

Please let me know how you have resolved similar issues with the SSIS Web Service task after deploying your ETL packages to a different environment.




Tuesday, February 21, 2017

External SQL files vs. Stored procedures in SSIS development

[2017-Feb-21] I'm writing this particular blog post mostly for myself, to remember this and use it as a reference in related discussions in the future.

One of my client's existing SSIS solutions contains multiple Execute SQL tasks that are configured to read complicated SQL code from external files. Such a solution architecture is not very sustainable, and my recommendation was to replace those external SQL files with a set of stored procedures on the SQL Server side.

The client's follow-up questions, asking me to outline the benefits of stored procedures vs. external files, along with the preferable way of using hard-coded SQL vs. stored procedure calls in SSIS packages, led me to write this.

Stability and Security issues:
- Keeping data extract logic in external SQL files makes SSIS data flow tasks more vulnerable to the risk of undesirable changes to the SQL scripts or, even worse, to the risk of those files being deleted, which would fail the whole data flow process. By moving the same SQL extract logic into stored procedures, you rely instead on the very stable SQL Server engine to execute your set-based data logic.
- Security-wise, external SQL files require extra safekeeping to protect them from harmful changes: the folder holding those scripts has to be properly secured, and an additional security policy has to be defined and maintained. With the SQL logic in SQL Server stored procedures, you don't need to rely on anything external to protect your data flow task: the existing application service account that already has read/write access to the corresponding databases only needs to be granted execute permission on the new stored procedures (see the sketch after this list).
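
As a minimal illustration of that last point (the procedure, table, column and account names below are hypothetical), the extract logic is wrapped in a stored procedure, and the only extra permission the existing service account needs is EXECUTE:

-- Hypothetical example: the complex extract logic lives in a stored procedure
CREATE PROCEDURE dbo.usp_ExtractDailySales
    @LoadDate date
AS
BEGIN
    SET NOCOUNT ON;

    SELECT o.OrderID,
           o.OrderDate,
           o.OrderAmount
    FROM   dbo.SalesOrders AS o
    WHERE  o.OrderDate = @LoadDate;
END;
GO

-- Grant only what the SSIS service account needs
GRANT EXECUTE ON OBJECT::dbo.usp_ExtractDailySales TO [etl_service_account];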

Basically, porting the SQL code from external files into stored procedures on the SQL Server side closes off one avenue for injecting malicious SQL into your data-driven application: nobody can tamper with a script file on disk and have the package silently execute it. When both your data and your complex data load logic live inside the SQL Server environment, you can feel much safer.

Recommendations for using SQL code in SSIS:
- There is no absolute right or wrong about using hard-coded SQL within an SSIS package; in some cases it's a valid approach for running simple SQL commands against the database engine. However, when the logic of those SQL commands becomes complicated, it's recommended to wrap the code in SQL Server stored procedures and then simply execute them within your SSIS control/data flows.
- A SQL Server stored procedure is also a better option from a development perspective, with a rich set of tools to design and test your data extract code. Plus, for any further changes you won't need to redeploy your SSIS packages; only the code of the stored procedure has to be updated, while the call from SSIS stays unchanged (see the sketch after this list).
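
For instance, the SQLStatement of an Execute SQL Task can stay as a one-line call like the hypothetical one below (with an OLE DB connection, the '?' marker is mapped to an SSIS variable on the Parameter Mapping page); any later change to the extract logic happens inside the procedure, not in the package:

-- SQLStatement of the Execute SQL Task; the procedure name is hypothetical
EXEC dbo.usp_ExtractDailySales @LoadDate = ?;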

So, in most cases, using stored procedures in SSIS development is a win-win for the developer: it saves both development and deployment time for your SSIS solution!


Sunday, January 29, 2017

How to compare two SSAS databases with a custom MDX query

[2017-Jan-29] The request came from one of the customers I had worked with as a consultant: quickly compare two similar SSAS databases located in different environments. So I ran an MDX query with some key metrics against each of the SSAS databases and presented the results to the customer. Further along the way, they asked me if I could turn it into a configurable tool, with options to change SSAS servers and database names and possibly adjust the main filtering criteria.

I thought it would be possible and decided to create an Excel workbook with two OLEDB data connections to the SSAS databases, custom MDX queries, and additional configurable parameters to adjust those data connections.

There is a very good article by Gilbert Quevauvilliers that explains how to create a custom MDX query based on a data connection in Excel (Excel – Using a Custom MDX Query in Excel Table), a process which is not very intuitive on its own. With that covered, I also needed a full list of my cube measures to build the MDX query (manually retyping them was not an option). SSAS dynamic management views (DMVs) are very handy for this; of the many write-ups on SSAS DMVs available, I'd recommend this article (SSAS: Using DMV Queries to get Cube Metadata by @BennyAustin), which already has some basic metadata extract examples prepared.
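
For example, a DMV query along these lines (run from an MDX query window in SSMS while connected to the SSAS database; a generic sketch, not the exact query from that article) returns every visible measure of the model, ready to be pasted into the compare query:

-- List the visible measures of the deployed SSAS database
SELECT
    MEASURE_NAME,
    MEASURE_UNIQUE_NAME,
    MEASUREGROUP_NAME
FROM $SYSTEM.MDSCHEMA_MEASURES
WHERE MEASURE_IS_VISIBLE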

To show you this use case, I've used the [Adventure Works Internet Sales Tabular Sample Database] and deployed it to my test SQL Server SSAS 2016 environment. You can adjust the test case and use your own SSAS database in either multidimensional or tabular mode, because the MDX data extract logic works in both cases.



Then I created this Excel file (SSAS Compare.xlsm) with the dynamic SSAS data connections and MDX queries that support my data compare results; make sure to enable macros in your Excel settings.



Basically, it's all done through the two SSAS OLEDB connections within this Excel file, where I change the connection string to SSAS and the command text for the MDX query.




So then it was very easy to create a macro within this Excel workbook and let the configuration magic work:


Sub RunSSASCompare()  ' macro name is illustrative

    Dim strMDX_Query As String
    Dim strMDX_Filter As String
    Dim strMDX_Period_1 As String, strMDX_Period_2 As String
    Dim strSSAS_Server_1 As String, strSSAS_Database_1 As String
    Dim strSSAS_Server_2 As String, strSSAS_Database_2 As String

    ' Template MDX query stored on a separate worksheet
    ActiveWorkbook.Sheets("MDX Query").Select
    strMDX_Query = Range("A1").Value

    ActiveWorkbook.Sheets("Compare").Select

    ' Placeholder token inside the MDX query text that gets swapped for a period value;
    ' assign it to whatever token your query uses ("@PERIOD" here is just an assumption)
    strMDX_Filter = "@PERIOD"

    '' Period settings
    strMDX_Period_1 = Range("B7").Value
    strMDX_Period_2 = Range("E7").Value

    '' Server settings
    strSSAS_Server_1 = Range("B5").Value
    strSSAS_Database_1 = Range("B6").Value
    strSSAS_Server_2 = Range("E5").Value
    strSSAS_Database_2 = Range("E6").Value

    ' Point the first OLEDB connection at environment 1 and inject its period filter
    With ActiveWorkbook.Connections("SSAS_Environment_1").OLEDBConnection
        .Connection = "OLEDB;Provider=MSOLAP.5;Integrated Security=SSPI;Persist Security Info=True;Initial Catalog=" + strSSAS_Database_1 + ";Data Source=" + strSSAS_Server_1 + ";MDX Compatibility=1;Safety Options=2;MDX Missing Member Mode=Error"
        .CommandText = Replace(strMDX_Query, strMDX_Filter, strMDX_Period_1)
    End With

    ' Same for environment 2
    With ActiveWorkbook.Connections("SSAS_Environment_2").OLEDBConnection
        .Connection = "OLEDB;Provider=MSOLAP.5;Integrated Security=SSPI;Persist Security Info=True;Initial Catalog=" + strSSAS_Database_2 + ";Data Source=" + strSSAS_Server_2 + ";MDX Compatibility=1;Safety Options=2;MDX Missing Member Mode=Error"
        .CommandText = Replace(strMDX_Query, strMDX_Filter, strMDX_Period_2)
    End With

    ActiveWorkbook.Connections("SSAS_Environment_1").Refresh
    ActiveWorkbook.Connections("SSAS_Environment_2").Refresh

    MsgBox "Compare has been finished successfully.", vbInformation, "SSAS Measure Compare"

End Sub

As a result, I was able to dynamically configure SSAS servers and databases; I had one MDX query used for both SSAS data connections, and I also added a filtering criterion to enhance this scenario. Feel free to change this file (SSAS Compare.xlsm) in any way you need to accommodate your own compare test case.

And have a happy data adventure!

Friday, November 4, 2016

Open Data is like a soup

[2016-Nov-04] Open data is like a soup: some people like it hot (made with all the latest updates); some people like it spicy (experimenting with various data representation forms). And only on some special occasions would you prefer a c-old soup over a hot one (where the analysis of old data is driven by a search for historical patterns or insights).

I'm not new to working with data, and in the corporate world most datasets are highly structured and regulated. With open data... it's a bit different: fewer regulatory norms and sometimes more difficulty working with the data (it's open, so whom do you complain to about it?).

I've seen cases where a data custodian made datasets available to the public with only old data (like very old, 10 years and more), while at the same time presenting their own visualizations built on data that seemed only moments old. In other cases, open data was made available in the form of Excel files that you could open and read, but aggregating one geographic region over several years would take almost a hundred of those files; plus, their data format was not very structured, and numerous data massaging steps were necessary. Good examples of open data management also exist, where data owners responsibly control the availability and currency of their data.

Going back to the soup analogy, this whole idea of working with data, and especially with open data, reminded me of the ending scene of the classic movie "Coming to America" with Eddie Murphy.




In case this YouTube video gets removed, here is the text of the scene:
"Wait a minute, wait a minute, wait stop right there! Listen: Stop right there, man. A man goes into a restaurant. You listenin'? A man goes into a restaurant, and he sits down, he's having a bowl of soup and he says to the waiter, waiter come taste the soup. Waiter says: Is something wrong with the soup? He says: Taste the soup. He says: Is there something wrong with the soup? Is the soup too hot? He says: Will you taste the soup? What's wrong, is the soup too cold? Will you just taste the soup?! Allright, I'll taste the soup - where's the spoon?? Aha. Aha! ..."

Last week at the Open Data Toronto meetup (https://www.meetup.com/opentoronto/events/234322841/) there was an opportunity to review various open data sets and data visualizations built on them over the course of the last 1.5 years. Different vendor tools along with various visualization techniques were used to present data insights: from simple tables or charts to morphing maps and word clouds.

And then I realized that only when an open data inquiry intersects with the right data exploration tool do you get this Aha moment: no more explanations are needed, and the data visualization speaks for itself. This really motivates me to explore other open datasets, try different tools and visualization techniques, and read books about effective data analysis.

Otherwise my data soup may get cold without the right data spoon :-)

Happy data adventures!

Tuesday, October 18, 2016

Power BI with Narrative Science: Look Who's Talking (Part 2: Predicting the Future)

[2016-Oct-18] I once read a book about the importance of using visual aids in teaching. It used the well-known illustration that the human optic nerve is thicker than the auditory one, so more information is consumed by the brain through seeing things than through hearing them. I still believe that a picture is worth a thousand words.

However, now we tend to put more and more visual elements into our business reports, and then we spend more time trying to understand the story those elements may be telling us, rather than getting a direct explanation.


In one of my previous blog posts I talked about the Narrative Science company and their product for enhancing visual business reports with textual narratives (Power BI with Narrative Science: Look Who's Talking?). Basically, the Narrative Science component takes numeric report measures and analytically generates a story that business users can read.


Recently they released an update to the Narrative Science component for Power BI (Narratives for Power BI 1.2 Release Notes) and added a few more features that enable more complex narrative customization and data analytics. Here is a brief list of those new features:
  1. Story Inclusion Thresholds
  2. Custom Formatting
  3. Predictive Analytics
  4. Additional Measure Relationships

Predictive Analytics excited me the most; yes, I wanted to know the future! So I needed to test this new Narrative Science functionality, with the idea that it could enhance my customers' Power BI reports.







There are a few prerequisites for your dataset, though, before you can start working with the Trend Analytics functionality of Narrative Science:

- it has to be a time series of the metric or metrics being measured;

- it also requires at least 30 time intervals before textual predictions can be populated, where possible.



I took an open dataset that tracks watermain breaks within the City of Toronto (Watermain Breaks) over the last 26 years (1990-2015), with the following data columns:
- BREAK_DATE - Date of watermain break reported
- BREAK_YEAR - Year of watermain break reported
- XCOORD - Easting in MTM NAD 27 (3 degree) Projection & Lat and Long
- YCOORD - Northing in MTM NAD 27 (3 degree) Projection & Lat and Long

Then I quickly aggregated this data and placed it into a column chart:
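
For reference, here is a minimal sketch of the kind of measures behind such a chart, assuming the open data is loaded into a table named 'Watermain Breaks' with the columns listed above (the actual aggregation in my report may have been set up slightly differently). The second measure is a quick check against the 30-interval prerequisite mentioned earlier:

-- Number of watermain breaks in the current filter context (one bar per year on the chart axis)
Break Count := COUNTROWS ( 'Watermain Breaks' )

-- Number of distinct time intervals available at the yearly grain
Break Years := DISTINCTCOUNT ( 'Watermain Breaks'[BREAK_YEAR] )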


With the Narrative Science component and my (Break Date, Break Count) dataset, I received a very decent textual story of the 26-year history of watermain breaks:


However, along the way I received the following note about an existing limitation when working with my dataset, which I believe will be fixed in future releases:



The other thing I couldn't achieve with my dataset was an actual prediction of future results. No matter how hard I tried to change the data aggregation level or the filtering options, I kept getting this message alongside the otherwise helpful narratives: "A prediction could not be made because [metric name] did not have a good linear trend".

I assume that Trend Analytics uses a linear extrapolation algorithm and is very particular about the historical data. However, I do believe that this Predictive Analytics functionality of Narrative Science will be enhanced in the future too, because to me, seeing that the number of breaks is greater in the winter months due to low temperatures is a perfectly good trend :-).

Overall, I'm very happy to see all these new features introduced in the recent update of Narrative Science for Power BI, and I will be recommending that my customers start using it more in their reports.


Because you don't have to say a word; Narrative Science does it for you :-)



Happy data adventures!