Sign-up now and get instant access
Leave a comment
Customized training to master new skills and grow your business.
Beginner to advanced classes taught by Microsoft MVPs and Authors.
In-depth boot camps take you from a novice to mastery in less than a week.
Season Learning Pass
Get access to our very best training offerings for successful up-skilling.
Stream Pro Plus
Combine On-Demand Learning platform with face-to-face Virtual Mentoring.
Quick references for when you need a little guidance.
Summaries developed in conjunction with our Learn with the Nerds sessions.
Digital goodies - code samples, student files, and other must have files.
Stay up-to-date on all things Power BI, Power Apps, Microsoft 365 and Azure.
Earn money by driving sales through the Pragmatic Works' Training Affiliate Program.
It's time to address your client's training needs.
Learn how to get into IT with free training and mentorship.
Discover the faces behind our success: Meet our dedicated team
How can we help? Connect with Our Team Today!
Find all the information you’re looking for. We’re happy to help.
Are you looking to gain speed on your Apache Spark jobs? How does 9X performance speed sound? Today I’m excited to tell you about how engineers at Microsoft were able to gain that speed on HDInsight with Apache Spark.
If you’re unfamiliar with HDInsight, it’s Microsoft’s premium managed offering for running open source workloads on Azure. You can run things like Spark, Hadoop, HIVE, and LLAP among others. You create clusters and spin them up and spin them down when you’re not using them.
The big news here is the recently released preview of HDInsight IO Cache, which is a new transparent data caching feature that provides customers with up to 9X performance improvement for Spark jobs, without an increase in costs.
There are many open source caching products that exist in the ecosystem: Alluxio, Ignite, and RubiX to name a few big ones. The IO Cache is also based on RubiX and what differentiates RubiX from other comparable caching products is its approach of using SSD and eliminating the need for explicit memory management. While other comparable caching products leverage the reservation of operating memory for caching the data.
Because the SSDs typically provide more than 1 gigabit/second of bandwidth, as well as leverage operating system in-memory file cache, this gives us enough bandwidth to load big data compute processing engines like Spark. This allows us to run Spark optimally and handle bigger memory workloads and overall better performance, by speeding up these jobs that read data from remote cloud storage, the dominant architecture pattern in the cloud.
In benchmark tests comparing a Spark cluster with and without the IO Cache running, they performed 99 SQL queries against a 1 terabyte dataset and got as much as 9X performance improvement with IO Cache turned on.
Let’s face it, data is growing all over and the requirement for processing that data is increasing more and more every day. And we want to get faster and closer to real time results. To do this, we need to think more creatively about how we can improve performance in other ways, without the age-old recipe of throwing hardware at it instead of tuning it or trying a new approach.
This is a great approach to leverage some existing hardware and help it run more efficiently. So, if you’re running HDInsight, try this out in a test environment. It’s as simple as a check box (that’s off by default); go in, spin up your cluster and hit the checkbox to include IO Cache and see what performance gains you can achieve with your HDInsight Spark clusters.
If you’re interested in big data, AI or Azure, we’re all about data and everything Azure, so you’re in the right place. We’re here to answer questions and see where we can help. Click the link below or contact us—we’d love to help, no matter where you are in your cloud journey.
Join other Azure, Power Platform and SQL Server pros by subscribing to our blog.