Recently the focus on increased insights and data driven decisions has become paramount from the board room to marketing, and even to the production floor. Those insights and decisions were promised by the world of “Big Data”. The hype around big data is unmistakable. You can’t miss it. I think Big Data will perform at the super bowl halftime show. Well maybe not,but you can bet that the stadium, advertisers, teams and vendors are all usingbig data technologies to be their best for that big game.
I speak with customers that are driven to implement a big data solution, and yet I’ve never seen more people working to learn and implement a technology with so little definition on how they plan to leverage it for their business. They know a few things…
These sounds good in theory, so let’s take them one by one.
1. Big Data might allow them to increase the value and variety of decisions they make.
The foundation of Big Data is a distributed platform and file system such as Mapreduce/Yarn/Tez and HDFS. These will allow you to work with data that is less traditional in form and volume and through that, you maybe able to analyze a wider variety of things and possibly do this faster than if all the data had to go through the same normalization and ETL process. That means that we need a properly sized Hadoop platform with all the right tools and a methodology for how we are going to approach these types of data challenges. This takes time, trial and error to develop without guidance from a partner who has been there before. This is why many of our clients are reaching out to get help finding that leg up on accelerating their big data insights process.
2. These technologiesmight provide a way to get around the limitations of traditional BI systems
I’m not sure about this one. I do believe in many cases that we can deconstruct and prepare data for analysis in a more agile way using some of the tools that the Hadoop ecosystem provides. That said, we have only prepared the data for analysis, we have not actually done anything useful with the data. That will require a client tool like Power BI, or another analytics platform. Our new data will likely need to be aligned with corporate enterprise BI data to be really useful and now we’re combining that insight, often integrating it into our enterprise BI environment now that we’ve found the valuable pieces of that data. So often we’re not getting around limitation, but becoming more agile in our methods of identifying data that’s valuable in the context of our business and then being more efficient in the size and scope of the data that we need to eventually integrate into our enterprise BI solutions.This helps us control data warehouse sprawl and keeps the EDW valuable for the business.
3. They believe it to bea cost-effective and rapid time-to-value option
Some companies think they will deploy an amazing big data solution on hardware they have laying around. This is fine for a prototype, but the only real way to deploy big data in a cost efficient, rapid and scalable way is to look to the cloud. The cloud provides a scalable set of services,infrastructure and storage that allow companies to explore big data’s capabilities and value without building out more data center floor space. The nature of cloud storage and compute being separate allows for a multi-tenant approach to big data that is virtually impossible to deliver on-premise. This allows for scale on demand approaches, saving significant dollars and administrative overhead. Automation capabilities with cloud platforms provide this functionality and allow it to be controlled either from central IT or from the business analysts as need. Pragmatic Works deploys 90% of our customers’big data solutions in Microsoft’s Azure Cloud for rapid time-to-value and a much better 12 month ROI.
4. They believe theirteam can adapt and learn the tools quickly
The most significant point of confusion when investing in big data is around all the different tools, services and components that makeup the ecosystem. Check out here for more on that. There is Hadoop, Pig, Hive,Oozie, Flume, Storm, HDInsight, Cloudera, Impala, and so many others. Companies don’t know which way to go and are nervous about getting locked into a vendor or a direction. Sometimes they even go so far as to try the completely opensource route, thinking that will be better, with no vendor to get locked into.The problem is without an enterprise behind your enterprise solution, you are left without support, training and a system for upgrades and improvements to the product. That is putting a lot of pressure on your team. Many teams are already working full time in their current roles and don’t have time to integrate 20 new technologies into their daily life. The cloud comes to therescue again with products like Microsoft Azure’s HDinsight, a managed big dataservice in the cloud. This service can be scaled as needed, combines with azure’s other secure cloud services (They have added over 200 new services in the last year. .wow..) This service provides all the core Hadoop ecosystem components, no SQL options, virtual infrastructure and tons of integration options for your applications and enterprise systems. We are seeing customersexcited about realizing their big data vision by leveraging a managed platformlike this. Best of all, since it’s the cloud, you only pay for what you use.You’re still depreciating those old servers in the data center aren’t you? I thought so.
5. Their business iscounting on it to provide insight into business growth and efficiencyopportunities
Now that you’ve got this amazing distributed scaling on demand big data platform in the cloud, what are we doing with the data, and here in lies the big secret of big data? It doesn’t solve all your problems.You still need to do that. What a good big data implementation will allow you to do is iterate through data organization, duration and publication faster and provide it as a scalable source for more in depth analysis. Enter machinelearning and predictive analytics. These technologies work with the big data ecosystem (and other data) to help you make decisions about what you expect your business, customers or market to do, allowing your stakeholders to plan more appropriately and effectively. These are the technologies that consumethis data and provide real insight. Many companies are looking for a “Data Scientist” to drive these initiative, but Microsoft has done a nice job in working with their internal data scientist and their Azure teams to create a managed and scalable Azure Machine Learning service that really does a nice job of democratizing this technology and making it accessible to companies willing to explore it. Pragmatic Works is doing a lot in this space, helping clients adapt from traditional BI to a more forward looking approach using predictive analytics. Our customers are seeing significant value and ROI from their investments here. Remember Cloud = scalable, pay for what you use etc.. so this fits that model without having to hire a team of Phd’s to run it for you. Inmany cases we’re able to train the existing team to leverage this platform.That is very exciting for our clients.
As you can see Big Data is an exciting technology development, but as we watch companies adopt these technologies, we are paying careful attention to how they can be successful and not just initiate another overloaded data project. Our techniques and cloud adoption have resulted in our initiatives having a significantly higher adoption rate than the industry average.
If you want to find out how to take your business forward into a more insight driven world, contact us here at Pragmatic Works.
See you out there…
President, Pragmatic Works Consulting
Instructor: Mitchell Pearson
Click here to watch the entire webinar.
Click here to view Mitchell's blog.
Thank you everyone for attending my webinar on SSIS Performance Tuning, if
you missed that presentation you can watch it here:http://pragmaticworks.com/LearningCenter/FreeTrainingWebinars/PastWebinars.aspx?ResourceId=683
Below are some of the questions that I received from the Webinar:
Q: Can you give some performance tips in using script tasks too?
A: Yea, don’t use them! Script tasks can be really bad for performance and
many times I have been able to replace someone else’s code with native SSIS
components. For example, I had situation where the client was using some very
complicated .net code to parse out columns based on logic. The package was
taking over an hour to process 1 million records. I replaced this with some
conditional splits and derived columns and it now runs in 3 minutes.
Q: I am assuming that the file formats must be the same for all files when
using the MultiFlatFile transform, correct?
A: You are absolutely correct. The metadata in each file must match.
Q: PW delivers a 'TF Advanced Lookup Cache Transform" component. What are
the benefits of using this component over the Cache Transform covered earlier?
It seems that the TF components cannot make use of the same result set when the
data set is role based.
A: For basic caching of data I would use the native SSIS cache transform. The
major advantage you get from the Task Factory component is you can do very
difficult range lookups with ease and they will perform at a high level. Please
see my blog post on this.
Q: What version of SQL Server is being used?
A: I was using SQL 2012, but everything in the presentation is applicable to
2005 and 2008.
Q: With the multi flatfile connection manager can you specify specific
A: Yes, the wild card character can be anywhere in the connection string
property. So you could do test*.txt to only pull in text files where the file
name begins with test.
Q: Why would you ever not use table or view (fast load) option in the OLEDB
A: Well I personally would always use that option. However, with the fast
load option all records are committed for the entire batch. So if there is a
record that is bad and causes the failure you will not know which record caused
the error. With table or view option each record is committed individually so
you know exactly which record caused the failure.
By: Devin Knight
Click here to view Devin's blog.
Writing MDX inside of Reporting Services has been the nemesis for many report writers for far too long. If all you need is the Query Designer to create a drag and drop datasets then it’s a piece of cake, but have you ever wondered what’s going on with all the MDX that the Query Designer creates. Or how about how do the parameters work that it creates inside MDX. I’ve heard many report writers that use Analysis Services as a data source (including myself) say it is too difficult to create parameters and that’s why they use the Query Designer. In reality, I think what the real problem is probably that the MDX results that the Query Designer provides make it look a lot more difficult than it really is. If you know some rudimentary MDX you can probably be writing your datasets yourself instead of using the Query Designer, which there’s no problem with sticking with if it fits your needs. The goal of this article is to guide you through writing a basic MDX statement that can be used for a report query and more importantly show you how to parameterize it.
The two MDX functions that we are going to use for this are:
Returns the set specified in the string expression (MSDN definition). Essentially, Converts the text you write into an MDX set.
Returns the member specified in the string expression (MSDN definition). Converts the text you write into an MDX member.
First we will start with an example that uses StrToSet and then add to it using StrtoMember. To follow along with this example you’ll need the Adventure Works cube that can be found at www.codeplex.com. I’ll assume you know some basics of Reporting Services and not define each component of the tool that is not new for this example.
Create a new Report Server Project and Report that uses the Adventure Works cube as a Data Source. Next, create a Dataset and for this example you can call it CategorySales (this Dataset should use the Adventure Works cube data source). Select Query Designer and click the Design Mode button to begin writing your query. Use the following query to return back results without a parameter:
All this query did was returned back the total sales, but this is useless for a report by itself. Let’s say we want to add the product category that the sales belong to. If we were to do this normally in MDX it would look something like this:
That’s not good enough for our pretend end users though. They not only want to see the list of product category sales, but they also want it made into a parameter so they can select from a dropdown box the category or categories they wish to view. To do this, first create the parameter by selecting the Query Parameters button .
After you hit OK to confirm the Query Parameter screen you will need to modify your MDX to take advantage of the parameter you have created.
Here we use StrToSet to convert whatever is brought in from the parameter values selected by the end users to something MDX can understand. The Constrained flag here just ensures that parameter provides a member name in the set.
Hit OK twice to confirm the query so far and make a simple tabular report to view your results so far. You will notice that a dropdown box parameter was automatically created for the ProductCategory parameter we defined. You may already be familiar with this if you have ever used the drag and drop interface for making parameters. Reporting Services automatically creates a Dataset for this parameter dropdown box, which you can view if you right-click on the datasets folder and select Show Hidden Datasets.
Now let’s take this example one step further. We will add two more parameters to our query to create a date range. Hit the Query Parameters button again and create the parameters StartDate and EndDate that reference the Date Dimension and the Date attribute. These can be defaulted to any date for now.
Change the MDX to now add a where clause that restricts dates this time using StrToMember .
Hit OK twice to confirm the changes to the Dataset. Remember that these parameters will automatically create new Datasets and parameters so you should just need to preview the report to see the changes.
I hope this makes writing your own MDX with parameters inside Reporting Services a little easier.
Presenter: Rowland Gosling
Click here to view the entire webinar.
Click here to view Rowland's blog.
Thank you to all who attended my online workshop this morning! I had two questions I’d like to address:
Q: Can you use the ODBC Hive solution to insert data into Hadoop?
A: Not with our current technology. Hadoop data is basically read-only. There is no ODBC way to update a row for instance. It can be updated using things like Pig but basically what you’re doing is replacing entries.
Q: Can you use CSV/Storage Spaces?
A: I did some checking and the general answer is no. On the other hand why would you want to? There’s blob storage and plain old UNC file paths.
Presenter: Bradley Schacht
Click here to view Brad's blog.
Yesterday I presented about Change Data Capture for the Pragmatic Works Free Training on the T’s. I covered the basics of what Change Data Capture (CDC) is and showed how to use it. We had a good turn out, so thank you to everyone that attended. If you were unable to join or would like to see the recording you can find the link here. The Q&A can be found below.
Q: Does disabling CDC remove any objects that were added?
A: Yes, when you disable CDC on an individual table the associated change tracking table as well as the auto-generated functions are removed. If you disable CDC on the database all change tracking and auto-generated functions are removed. The wrapper functions are not removed automatically, you must remove those yourself.
Q: How does the log reader job impact the transactions being written in the application?
A: The log reader only does reads on the transaction log. It does not lock the log or anything like that. There should be no affect on the applications writing to the tables.
Q: Will CDC work with Simple recovery mode?
A: Yes, CDC works with both full and simple recovery modes.
Q: Can CDC be enabled on MDS (Master Data Services) databases?
A: CDC can be enabled on any database as it just reads the logs. That being said you do need a primary key or unique index on the tables. I’m not sure I see the value of putting CDC on an MDS database though. Those tables are setup such that you need to go through the views to get data back.
Q: Does CDC work on Hecaton databases?
A: I don’t see why not, although I will need to test this as I have not tried it. Ultimately there is still the transaction log involved as far as I know so I would venture to guess yes. Don’t at all hold me to that as I haven’t done any real world work with Hecaton so testing and a follow up blog will be required there. You’ve definitely peaked my interest there though.
Q: What do you recommend for getting the first pull of data (initial load), date or LSN based queries?
A: I prefer to go the date base route using the control table. The first time you pull data pass in NULL for the begin date and you will get back all of the data from the beginning of time. Quick and easy.
Q: Is CDC enabled on the source system or the destination system (Data Warehouse)?
A: CDC will be enabled on the source system. You want to leverage the deltas on the source side so you have less data to load into the warehouse and less complex logic to do on the warehouse load ETL.
Q: Can you enable CDC on a view?
A: No, CDC can only be enabled on tables.
Q: What version of SQL Server is required?
A: An Enterprise edition of SQL Server is required to use CDC. The feature is available in 2008 or later.
I also did a similar presentation last night for the Orlando SQL Server Usergroup (OPASS, website) on CDC with SSIS. That focused on using Change Data Capture with the new SSIS 2012 components. I have consolidated the slide decks and included both SSIS projects in the download link which can be found in the giant information box below so no one can miss it. Some notes on what is included:
Thank you again to everyone who joined the webinar or braved the weather to come out to the User Group in Orlando. If you have any follow up questions please hit me up!
Presenter: David Siska
Click here to watch the full webinar.
Click here to view David's blog.
By Bradley Schacht
Click here to view Brad's entire blog.
SharePoint uses a series of SQL Server databases to save configuration information, security setups and all the user data that has been created. The Central Administration site is no different and has its own content database that is created when running the SharePoint configuration wizard. Unfortunately, unlike when creating service applications, you do not get to choose the name of this database during the wizard. The database name will default to SharePoint_AdminContent_<GUID>.
This could be problematic for a number of reasons. Many companies have a set of standards for database names. Other times you will want to change it simply because you have a small OCD issue (that’s me). While the task is not necessarily as easy as going into SQL Server and renaming the database in the object explorer it isn’t necessarily difficult either. Just follow the steps below and you will clear up that OCD (or naming convention) issue in no time.
New-SPContentDatabase –Name SharePointCentralAdmin –WebApplication http://kerberos-sp:36000
Get-SPWebApplication –Identity http://kerberos-sp:36000 | Get-SPContentDatabase | SELECT ID, Name, WebApplication | Format-List
Get-SPSite –ContentDatabase c87506a9-b87d-40b8-9582-aac9ee89c8f8 | Move-SPSite –DestinationDatabase f79cb9d8-8e45-4405-82c9-081f58bce7a0
In a previous post I wrote about Using a Hyper-V VHD in an Azure VM. Today I’d like to show you what my next steps were. My goal with using Azure is to create an lab environment for student that fits the following needs:
The point of this post is to guide you through how to easily create virtual machines for multiple students or employees in a quick and easy script. By the way, I have previously posted how to solve the problem of powering up and powering down the virtual lab environments in my blog about Setting up Automation in Azure.
These steps will guide you through how to take what you learned in my blog aboutUsing a Hyper-V VHD in an Azure VM and create and image from your VHD you uploaded. You would also create an image from a Virtual Machine that you created from the Azure Gallery. Once an image is created you can then spin off as many virtual machines as you would like from that image.
The the image created you can now either manually create new virtual machines from the image or use PowerShell to scale your solution better. For me Powershell made most sense because I’m not trying to build just one virtual machine. In my case I actually need to build 15 identical virtual machines (for 15 students) based off the same image.
param([Int32]$vmcount = 3)# Create Azure VMs for Class
# run in Powershell window by typing .\CreateVMs.ps1 -vmcount 3
$startnumber = 1
$vmName = “VirtualMachineName”
$password = “pass@word01″
$adminUsername = “Student”
$cloudSvcName = “CloudServiceName”
$image = “ImageName”
$size = “Large”
$vms = @()
for($i = $startnumber ; $i -le $vmcount; $i++)
$vmn = $vmName + $i
New-AzureVMConfig -Name $vmn -InstanceSize $size -ImageName $image |
Add-AzureEndpoint -Protocol tcp -LocalPort 3389 -PublicPort 3389 -Name “RemoteDesktop” |
Add-AzureProvisioningConfig -Windows -AdminUsername $adminUsername -Password $password |
New-AzureVM -ServiceName $cloudSvcName
.\CreateVMs.ps1 -vmcount 3
Once your virtual machines are created your next step is to consider if you want to keep them running all the time or only during business hours. Keep in mind you only pay for the time your virtual machines are started. For my scenario I wanted to turn my machines off during non business hours and back on in the morning, which can be automated through Azure Automation. Read this post on Setting up Azure Automation.
Click here to view Devin's entire blog.
Instructor: Kathi Kellenberger
Click here to view Kathi's blog.
Q: Can we develop SSIS packages using SSMS or do we have to use SSDT?
A: You must use SSDT. That makes sense; it is a development tool!
Q: Any impact when upgrading from 2008 SSIS to SSIS 2014?
A: Except for just trying it out, I haven’t had the chance to do any real upgrades. It’s important to figure out if you are going to move to the new Project Deployment model and do the planning. Once I get some more experience with this, I’ll do another webinar just about it.
Q: Can SSIS 2012 sense the environment name and set the environment accordingly?
A: I’m not sure exactly what you are getting at here. In the new project deployment model, you assign the environment.
Q: Can I run a package developed in SSIS 2012 or 2014 in a 2008 R2 environment?
A: I haven’t tried this and a quick search didn’t turn up anything helpful. My gut is telling me that it should work as long as you are using the package deployment model.
Q. Do package configurations still exist in SSIS 2012?
A. Yes, you have the choice of using the package deployment model which gives you the traditional configurations or using the new project deployment model.
Q. Is there a way to make changes to one package without redeploying the entire project?
A. No, in the project deployment model, you must deploy the entire project even if there is just a change to one package.
Presenter: Anthony Martin
SQL Server Integration Services is a versatile tool that makes it easy to move and transform data. Sometimes unusual situations present themselves that leave developers scratching their heads. In this webinar, Anthony will demonstrate how to handle some of these oddball challenges.
Click here to view Anthony Martin's blog.
Q: Could the file cache be used between packages in the same projects?
Yes, the ‘Use file cache’ option of the Cache Transform transformation can be used between packages that are in the same project or even different projects.
In order to do this, you’ll need to configure the Cache Transform with the ‘Use file cache’ option and execute the package. This is because the Cache Transform file (.caw extension) is created until the package is executed. Next, in another package that you want to reuse the Cache Transform file, add a Lookup transformation to a Data Flow Task. While configuring the Lookup transformation, select the ‘Cache Connection Manager’ connection type on the ‘General’ page. On the ‘Connection’ page, click ‘New’ to create a new Cache Connection Manager, select the ‘Use File Cache’ option and browse to the location where the file was created in the previous package. Configure the rest of the Lookup transformation just like you normally would and you are all set!
Q: Are Lookup Transformations better than just using Stored Procedures in the Source component?
If the connection for the source and the lookup are on the same database and correct indexing exists then a stored procedure in the source component that performs the join for you is probably the appropriate choice. However, if the source and the lookup connections are on different database servers or different file types altogether then the lookup transformation is one of only a few options available and is a highly efficient transformation if configured using the default settings.
Q: Could you please share the SSIS project and SSMS solution?
I’ve posted both solutions to my OneDrive. You can access them here.