Tuesday, May 31, 2016

Analyzing David and Goliath datasets on a same chart; Dual Axes in Excel, Power BI and SSRS

[2016-May-31] Two different metrics, two different datasets, one may represent a very high value and the other one is very low. What's interesting is that both metrics could be very close related; like Total Product Sales vs. Product Price or Average Monthly Temperature vs. Rainfall over a period of time. 

One way to analyse how such metrics may influence each other is to look at their numbers, the other way is to visually place them on a bar or a line chart and let those graphical elements communicate to you how correlated or not those metrics are. However there is one downside of using graphical tools to analyse both very low and high data values on the same Y-axis; it's just visually difficult. Therefore many analytical tool vendors allow to portray differently scaled data with the use of Primary and Secondary Y-axes (so called Dual Axes).

Let's take a simple data set and try to check 3 different visualization tools that can show data across different data axes. There is an open dataset of National Pollutant Release Inventory (NPRI) provided by Government of Canada (http://open.canada.ca/en/open-data) that shows different industry related pollutant releases to air, water and land for the past five years. 

I've chosen only 2 industries' air pollution data: (1) Oil and Gas Extraction and combined (2) Pipeline Transportation of Crude Oil and Natural Gas, with an idea to find if these 2 metrics' results are correlated over a period of time.

Here is an Excel version of this dataset:


Just looking at the data itself, it's quite obvious that Oil and Gas extraction generate far more air pollution than these natural resources transportation by pipelines. Let's see how Excel chart could show this data:

Excel:

Excel is a good analytical tool and it even gave me an option to see linear trend-lines for each of the metrics (however it's a bit difficult to understand why with the air pollution decrease for the oil and gas extraction, the air got more polluted when this oil and gas had been in a pipeline transition; we'll talk about this later).


Power BI:



The Power BI currently doesn't allow creating a bar chart for a secondary Y-axis; thus we can use a line to show the Pipeline transportation in our case. However all other settings including good data labeling provides us with a very interactive experience for big & small data analysis.

SSRS (SQL Server Reporting Services):




It's a bit different story with the SSRS: from one hand it has a very solid dataset extraction mechanism that is shared across all other SQL Server BI tools, however its visualization part to show both Primary and Secondary axes is not very good: you will have to additionally adjust each of axis top limits in order to sync horizontal lines and I also wish SSRS data label positioning to be more flexible.

I will let you to decide on your own what tool you would prefer to use in order to analyze 2 different metrics data on dual axes. I personally like the Power BI version, then I would go with the Excel way to analyze Primary and Secondary axes data and use of the SSRS dual axes chart would only be considered if the technology is specifically requested by a customer.

As far as the early question about the pollution decrease of oil and gas extractions vs. pollution increase to transport those natural resources using pipelines, one could suggest that during the period of time of 2010 and 2014 resource company have learned to apply modern technologies to produce less air pollution. However the existing pipeline system in Canada may have slightly deteriorated and need more maintenance and modernization. But whatever the reasons are for the air pollution decrease and increase, there have to be some some facts to prove such hypothesis, which may be a part of another blog post or discussion.

Happy data adventures!

Saturday, April 30, 2016

Power BI Box and Whisker visualization vs. Box Plot chart in SSRS

[2016-Apr-30] I don't work with box plot charts very often, however when I do then it becomes a very interesting experience. First, it starts from data extraction and then playing around with various visual settings to make those boxes and whiskers to look good.

I assume that sometimes we tend to work with more familiar chart types to analyze our data, like lines and vertical or horizontal charts; however box plot charts could be very well used to review how your data is distributed within your set of metric values; and besides, it looks different as well :-)

About 3 months ago Microsoft made this type of visualization available in Power BI (Visual Awesomeness Unlocked – Box-and-Whisker Plots), so I decided to compare it with the same box plot chart that I can create using SQL Reporting Services.

My data comes from Berkely Earth (http://berkeleyearth.org/data/) that I had already used for one of my previous PowerMap visualization (Earth Surface Temperature on YouTube). And I took 4 years data of May monthly average temperature from all Canadian Geo stations (1997 - 2000) and showed them both in Power BI and SSRS.

Power BI:
Currently there are two version of the Box & Whisker charts: (1) created by Brad Sarsfield and (2) another created by Jan Pieter Posthuma.




(1) Brad Sarsfield box and whisker chart
Pros:
- Customization for chart quantile value is available 
- Outliers visualization
- Simple data set required Values/Group (values and groups those values are divided to)
Cons:
- Margins for quantiles are rounded to zero decimal places

(2) Jan Pieter Posthuma box and whisker chart
Pros:
- Boxes are colored 
Margins for quantiles are not rounded to zero decimal places
Cons:
- No customization for quantile value (it's preset to Min, 0.25, Mean, Median, 0.75, Max)
- Along with Values/Group settings for dataset, Samples has to be specified (which is basically an attribute for lowest granularity you want to analyze your data)

SSRS: 


























Pros:
- The only SSRS chart for box plot visualization
- More settings can be adjusted compared with similar Power BI visualization
Cons:
- Chart dataset has to be prepared in advance for all 6 data elements (Min, Low, Mean, Median, High, Max)
The only SSRS chart for box plot visualization.

Conclusion:
I personally like Power BI Brad Sarsfield box & whiskers visualization with the way it looks and its customization level, however I also like that margin values for quantiles are not rounded to zero decimal places at Jan Pieter Posthuma visualization.

Thanks for reading this blog post, and happy data adventures!

Tuesday, February 2, 2016

Excel Power Map vs. Power BI Globe Map visualization

[2016-02-02] PowerMap in Excel is a great tool with good design and comprehensive user experience. I’ve started using it about a year ago when our team participated at first Canadian Microsoft Big Data hackathon (http://blogs.technet.com/b/cansql/archive/2015/04/13/big-data-hackathon-a-story-from-a-winning-team.aspx).

So I gradually developed a taste to this Geo analytical tool that helps to map data points and describe insightful and informative stories. Later on in November of 2015, visually similar tool has been added to the Power BI (http://blogs.msdn.com/b/powerbi/archive/2015/11/24/visual-awesomeness-unlocked-the-globe-map.aspx). And this gave me an opportunity to compare and analyse both tools’ functionality.

Basically they’re not competing with each other, but rather live in two different worlds: one is a client application in Excel, the other one is a custom Power BI visualizations. They’re friends and they’re siblings :-)


Things that Excel Power Map and Power BI Globe Map have in common:
  • Mapping data points to a 3D Globe visualization using geo coordinates (Latitude/Longitude and regular address coordinates)
  • Data facts explorations being attached to associated geo coordinates.



Things that Power BI Globe Map doesn’t have compared with the Excel Power Map:

o   Ability to change globe themes
o   Ability to create visual tours and save them as video files
o   Change the chart shapes
o   Change the height and thickness scale of visuals
o   Change color for a series
o   Find location feature
o   Add an annotation with image
o   Legends
o   Bubble charts, pie charts and region charts
o   Power Map graphics options

Things that Excel Power Map doesn’t have compared with Power BI Globe Map:
  • Geo filtering based on additional visualizations that are sourced from the same data model (this is one of my favorites).
  • Power Map has an option to add a 2D chart to a globe visualization too, but Power BI integration of multiple visualizations is on a different level (great way for a real data exploration!).
  • Power BI Globe visualization is free compared with a slightly cumbersome licensing aspect to enable a Power Map option in Excel. 

Few suggestions if you want to build a data visualization in Excel Power Map:
  • Power Map requires really powerful machine to process and create video (go with 64 bit Office; more memory, especially if you try to work with very complicated shape data files and attempting to add images as annotations).
  • When you add a new layer to existent scene, make it invisible first (do all the adjustments) and then make it visible, otherwise as soon you place a field name to Location then your scene will be turned.
  • No Save button and Undo/Redo buttons' behavior are difficult to understand.
  • A few times I've had a glitch with a text box when later you try to change its text or background color (no changes after pressing OK  button); only by recreating this text box all necessary adjustments were possible.
  • Scene effect (difficult to predict, without actually trying them), but when you master your skills changing different effects between different scenes, then you get all the enjoyment of creating nice Map visualizations :-)




[2016-02-06] Update:
Three days ago Microsoft announced Power BI "publish to web" functionality (https://powerbi.microsoft.com/en-us/blog/announcing-power-bi-publish-to-web/) and I've just tried to publish my latest Power BI report of with public Berkeley Earth data of over land temperatures (http://berkeleyearth.org/).

And now I can really rotate a Power BI Globe inside my blogger post, and you can do this as well; plus, you can interact with all other controls of the reports and check how the Globe rotate itself based on your data criteria selections (Country or Temperature range).

It works! Great feeling!

Happy data adventures!

Thursday, December 24, 2015

Specifying Paths to External Items (Report Builder and SSRS) in SharePoint with PowerShell

[2015-Dec-24] My story: two SSRS items (main report and its sub-report) need to be deployed to SharePoint. Trivial task, and even if I need to locate sub-report to a different SharePoint document library or folder I can easily do that during a development phase or using Report Builder after deployment by specifing relative or absolute path for sub-report within a definition of the main report. And here is a good MSDN resource about this - Specifying Paths to External Items (Report Builder and SSRS).







But with multiple environments to deploy the very same set of SSRS items to adjust will be a time consuming task. However we can prepare and adjust SSRS XML files during the time of deployment and PowerShell is one of the many ways to achieve that.

Here is a top section of the main SSRS RDL file:



And here how this very same section may look like with the absolute path for its sub-report in a targeted deployment SharePoint environment:



And portion of a PowerShell script that does it all is very simple:



So, at the end of the day, it's a win-win situation both for Development and Deployment teams:  less time for deployment and more time to develop other things.


[2015-Dec-30]  A brief update for the PowerShell script. I needed to make it more flexible to support sub-report paths update wherever they are located within their main SSRS report. So I had to change my XML way of dealing with the RDL report file to a text file way (XML file is still a text file :-).

And I changed this part of the code:
# update of the connection string to the found sourcing Excel files
[xml] $xdoc = get-content $rptReference

# Getting to the $rsr_environment XML block
$SubReports = $xdoc.Report.ReportSections.ReportSection.body.ReportItems.Subreport
foreach($SubReport in $SubReports)
{
$SubReport_New = $SubReport
$SubReport_New.ReportName = $targetDataSources+"/"+$SubReportFolder+"/"+$SubReport_New.ReportName+".rdl"
$xdoc.Report.ReportSections.ReportSection.body.ReportItems.ReplaceChild($SubReport_New, $SubReport)
}
                   

$xdoc.Save($rptReference

To this:
(Get-Content $rptReference).replace('<ReportName>', '<ReportName>'+$targetDataSources+"/"+$SubReportFolder+"/") | Set-Content $rptReference

(Get-Content $rptReference).replace('</ReportName>', '.rdl</ReportName>') | Set-Content $rptReference 

And it worked like a magic, regardless of how many sub SSRS reports a main report could have and their locations in there!



Wednesday, November 18, 2015

PowerMap with Custom Region shapefiles - Demystified

[2015-Nov-18]  Today is a global GIS Day (http://www.gisday.com/) where people across the globe celebrate and promote the use of various geo locating technologies. I used to work for one company (http://www.elections.on.ca/) about 5 years ago as a data analyst and I saw their GIS specialists worked very hard with different Geo data sets, so I respect this type of work.

Recently I had a chance to participate in the Toronto Public Library Open Data Hackathon (tpl.ca/hackathon) to brainstorm and come up with different ideas that would improve library services. Our team designed a prototype of web application that would enhance library users experience with book clubs. And I proposed to build a reporting layer for library staff who may analyse a potential impact of improved book club experience to a number of new visits and new registrations to the library.

I also wanted to explore Microsoft PowerMap technology in its way of portraying geo layers based on custom region KLM and SHM files which contain custom polygons (Power Map for Excel–May update for Office 365). Before I share how easy or not easy use of shape files for PowerMap was, let me show you a final version of our team PowerMap video that was created for this hackathon:




So after pulling some historical library datasets that they provided to us into my model; I created my initial PowerMap tour graphics. Which I further wanted to enhance with custom region geo files form the (City of Toronto Open Data Catalogue):
- Regional Municipal Boundary
- Neighbourhoods

Currently PowerMap custom regions support only County/Region, State/Province, Postal Code and Custom Region based on geo shape files:




However I was able to overcome this limitation by just renaming the City field (column) name to something else (e.g. CityName or CustomRegion, it’s your choice); and once it wasn’t longer as City, then I was able to map it to the Toronto Regional Municipal Boundary shape file attribute.




The other more difficult thing was to portray a set a polygons from another shapefile which contained information of 140 Toronto neighbourhood regions. I had a list of neighborhood regions in my data model and no matter how hard I tried, I couldn't make them visible in my PowerMap visualization.

So I started to research this a little bit more and found one hint in one of the PoweMap examples that Microsoft provided. For one of their models their provided this note that their shape files was slightly adjusted by a GIS software [QGIS tool http://qgis.org]. And I installed this software on my machine because I didn’t know other way of what was inside of downloaded shape files.

So using the QGIS tool I was able to see how that shape file was constructed and I also saw the reason I wasn’t able to map my data model data sets with attributes from the shape file. 





I didn’t dare to change geo file and thought it would be easier to adjust my model data set, and that change made all the difference. Now all 140 regions became visible in my PowerMap visualization! Good for Microsoft and thanks to GIS and Toronto Open Data Catlog for providing software and necessary data sets.




My key learning in this PowerMap with custom regions (shape files) exercise are:
- Don’t trust region shape file metadata (use GIS tool to check shape file content, especially attribute table);
- I can easily overcome existent PowerMap limitation of mapping only County/Region, State/Province, Postal Code to geo location points, it’s your choice how to name them;
- This leads to my 3rd finding that my model data sets can be mapped not only by names but ID or other fields as well. Because when you import a region shape file it will ask to pick a region name from a list; and I was able to map my polygon and model data sets by their reference IDs as well. So it's like a table to table relationship by reference fields that you choose.

I’m was very pleased with how Microsoft improved PowerMap with the new Custom Region feature based on external geo shape files. And I am even more happy to learn that they plan to enable this PowerMap visualization in the PoweBI! How soon, I don't know.

Please leave your comments or if you have some further questions, I'd be glad to share all that I have learned.

Happy data adventures!

[2015-Nov-24] Just a brief update: I’ve recently contacted both PowerMap and Power BI teams in terms of enabling the Excel PowerMap visualization in Power BI. The PowerMap team informed me that there was no plan to move their client application to Power BI; however they had also mentioned that 3D Map would be available in a form of custom visualization. And today Power BI team has announced availability of the Globe Map visualization (http://blogs.msdn.com/b/powerbi/archive/2015/11/24/visual-awesomeness-unlocked-the-globe-map.aspx).

So synergy is in the air! :-)

Tuesday, October 20, 2015

Too Big To Be Visible - SQL Server vs. Oracle data in SSRS reporting

[2015-Oct-20]  A sourcing system failed to return correct results (returned value was a very high figure with double digit number after "E"). An existent ETL process was capable to save this value into an Oracle staging database, however previously working SSRS report showed me a blank column with no data instead.

I knew where the problem was, so I decided to explore this SSRS limitation in a attempt of showing very big numbers.


First, I took a sample query from the AdventureWorksDW database and added an extra row  to query a very big number that SQL Server can handle:






Then I "migrated" this data set into Oracle environment with a small adjustment to support very big numbers that Oracle can handle too:





and then I created a SSRS report with both SQL Server and Oracle data sets. What happened next was a surprise for me; the Oracle "Fairy Land" value didn't show up in my report:





So after playing with allowed data limits for my Oracle data set, I came to a range that SSRS actually could show in the report: it only allowed me to show Oracle data values when a big number was within (-7E-28 .. +7E+28), I may be wrong in my assumptions, but somehow Oracle client was not returning data out of this range for my SSRS report:





and only after changing my Oracle "Fairy Land" data, SSRS finally showed me expected results. So this was my finding, that SQL Server data will be shown as it is in a SSRS report, however Oracle data connection has some limitations.




Happy data adventures!

Wednesday, August 12, 2015

Why I still like Excel PowerPivot models in SharePoint however Tabular solutions in SSAS are better

[2015-Aug-12] It’s quite obvious that in present time there is a growing audience for all the latest changes and updates within the Microsoft Power BI sphere. I think someone has already built a data model that showed number of tweets about Power BI which prevailed other more usual and heavily discussed software development topics on one day. However I would still like to share some thoughts that came as result of building PowerPivot data model in Excel/SharePoint vs. Tabular solutions.

Also this post was inspired by (Tabular Models vs PowerPivot Models) and (When to choose tabular models over PowerPivot models).

So here are a few risks or disadvantages that we may face with PowerPivot Excel data models SharePoint:

Performance/Availability to users

  • Excel PowerPivot in SharePoint is just a physical file that internally holds a backup of tabular model that automatically gets restored into a memory and is utilized whenever users try to interact either with linked PowerView report or manage Excel files itself. Once this “in-memory” file tabular database is not used for while it gets deleted by internal garbage collector. That means that after a short while, it will take another waiting time to bring that “hibernated” data model into memory before it will be available for users.
  • Server Tabular model is a different type in memory model, that still may be “cleansed” by internal garbage collector; and for sure certain performance tests that will compare Excel PowerPivot model vs. Tabular model needs to be built. However preliminary tests showed that Tabular model was more responsive during user interaction.


Stability

  • SharePoint Excel PowerPivot books live under control of Excel Services, PowerPivot Services and Analysis Services within SharePoint environment and may be affected by user activity with other SharePoint objects; SharePoint disk and memory space might bring some other dependency on life of Excel PowerPivot models.
  • So having a dedicated, set apart from SharePoint, SQL Server Tabular environment may bring a peace of mind to all users who will rely on their reporting data availability.


Scalability 

  • Even with a standalone Excel file 2 GB limit, we are faced with different file size limitations that SharePoint will handle (2 GB is still an option in SharePoint, but this has to be specifically defined).
  • With a tabular database we only have to deal with how its hardware infrastructure is organized (Hardware Sizing a Tabular Solution (SQL Server Analysis Services))


Development/Deployment

  • I like both Excel development environment as well as Visual Studio for tabular solution. In both places I can define all the metrics, build relationships between tables and define necessary calculations.  
  • Deployment could be tough with Excel PowerPivot work books: I’ve built a PowerShell scripts that prepares and updates metadata within each Excel PowerPivot data models prior their uploading to different environments in SharePoint (http://datanrg.blogspot.ca/2015/06/updating-powerpivot-model-and-powerview.html). However it’s still depended on how an Excel XML document is structured in Excel 2013 and this may get changed in later versions.


Manageability

  • You’re on your own with managing an Excel PowerPivot model: it’s you and Excel. Yes, SharePoint will let you to schedule data model refresh once a day; however this schedules may be jeopardized by frequent updates and interactions with Excel files themselves (https://support.microsoft.com/en-us/kb/2874957)
  • Tabular solutions can be backed up, restored, detached, and attached; PowerShell cmdlets can be applied. And most importantly we can trigger a partial or complete data refresh of our server data models multiple times a day (XMLA scripting could only be a great benefit for this).


Security

  • When an Excel PowerPivot model file is deployed to SharePoint, users will or will not have a complete access to the data model (depending on their access permission scope defined in SharePoint).
  • With tabular solution we can utilize all the benefits or role base security model and manage access to our server data models on a more granular lever.


Will all this being said, I still see that an initial PowerPivot data model could be designed and developed in Excel; however once it’s defined and tested then it’d be wiser to export this model into a tabular solution for a better manageability and data availability for end users.