Working with PySpark in Microsoft Fabric Notebooks
Master PySpark in Microsoft Fabric Notebooks! Learn to process big data, build pipelines, and harness Fabric's tools for seamless workflows in a collaborative environment. Perfect for data engineers and analysts!
Dive into the world of big data processing with PySpark in Microsoft Fabric Notebooks! This course offers hands-on experience with PySpark, the powerful framework for large-scale data processing, integrated within Microsoft Fabric’s seamless environment.
Learn how to leverage PySpark's robust capabilities to handle and analyze massive datasets, write efficient queries, and build scalable pipelines. You'll explore how Fabric Notebooks empower you to collaborate, streamline workflows, and utilize Microsoft Fabric’s unique features like built-in security, connectors, and scalability for modern data challenges.
Course Outline (Free Preview)
Working with PySpark in Microsoft Fabric Notebooks - What You Need to Get Started
Download your class files below!
Module 00 - Introduction
In this module, Zane introduces students to the basics of PySpark within Microsoft Fabric Notebooks, covering essential concepts such as reading and writing data, working with data frames, and handling null values. Students will learn through practical labs and challenges, ensuring they gain hands-on experience with PySpark. By the end of the module, students will have a solid foundation to dive deeper into their own data projects using PySpark.
Module 01 - Provisioning a Fabric Notebook
In this module, Zane introduces students to the basics of Fabric notebooks and their integration with PySpark. The module covers the creation of workspaces and lakehouses, essential for organizing and managing data within Fabric. Students will learn how to leverage Fabric notebooks for data ingestion, preparation, and transformation, setting the stage for more advanced PySpark applications in subsequent modules.
Module 02 - What is PySpark? (21 min.)
In this module, Zane introduces students to the basics of PySpark, the Python API for Apache Spark, emphasizing its power in handling big data analytics and real-time data processing. Students will learn how to import data into a lakehouse, explore variables and data types, and understand the nuances of using PySpark within Fabric notebooks. The module also covers practical steps for setting up a workspace, uploading files, and running code cells efficiently.
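For a taste of the workflow this module covers, here is a minimal sketch of a first notebook cell: plain Python variables alongside reading a file from the attached lakehouse. The file name `sales.csv` and its contents are illustrative assumptions, and `spark` is the SparkSession that Fabric notebooks provide automatically.

```python
# Plain Python variables and data types in a notebook cell
course_name = "PySpark in Fabric"   # str
row_limit = 10                      # int

# Read a CSV uploaded to the attached lakehouse (file name is hypothetical)
df = spark.read.csv("Files/sales.csv", header=True, inferSchema=True)

df.printSchema()    # inspect the inferred column types
df.show(row_limit)  # preview the first rows
```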
Module 03 - Working with Strings and Numbers (39 min.)
In Module 03, Zane covers essential techniques for working with strings and numbers in PySpark. Students will learn to manipulate and transform data using various string functions, such as `find`, `replace`, `lower`, `upper`, and `length`, as well as numerical operations to handle and analyze numerical data effectively. This module provides a comprehensive foundation for data processing and analysis in PySpark, equipping students with the skills to manage both textual and numerical data efficiently.
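As an illustration of the operations named above, here is a short sketch using Python's built-in string methods and arithmetic; the values are made up, and the course may equally apply the equivalent PySpark column functions.

```python
greeting = "Hello, Fabric"

print(greeting.find("Fabric"))           # 7: index where the substring begins
print(greeting.replace("Hello", "Hi"))   # "Hi, Fabric"
print(greeting.lower(), greeting.upper())
print(len(greeting))                     # length of the string

# Simple numerical operations
price, quantity = 19.99, 3
total = round(price * quantity, 2)       # 59.97
print(total)
```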
Module 04 - Working with DataFrames (24 min.)
In this module, Zane introduces the concept of data frames in Fabric Notebooks, emphasizing their similarity to tables in relational databases and their optimization for distributed computing. Students will learn how to construct data frames from various sources, manipulate data efficiently, and create schemas manually to ensure accuracy and performance. By the end of the module, students will be adept at using data frames for large-scale data processing and analysis.
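For example, a manually defined schema might look like the following sketch; the column names and sample rows are hypothetical.

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Defining the schema by hand avoids the cost and guesswork of inference
schema = StructType([
    StructField("name", StringType(), nullable=True),
    StructField("age", IntegerType(), nullable=True),
])

data = [("Ada", 36), ("Grace", 45)]
df = spark.createDataFrame(data, schema)
df.show()
```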
Module 05 - Querying Data (30 min.)
In this module, Zane guides students through the essentials of querying data with PySpark, focusing on practical techniques such as selecting, aliasing, adding, and dropping columns within data frames. Students will learn to handle duplicates using the `distinct` and `dropDuplicates` functions, and explore methods to add new columns with the `withColumn` and `lit` functions. By the end of the module, students will be equipped with the skills to manipulate and prepare data efficiently for further analysis.
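A brief sketch of those querying techniques, using hypothetical column names, might look like this:

```python
from pyspark.sql.functions import col, lit

result = (
    df.select(col("name").alias("customer_name"), col("age"))  # select + alias
      .withColumn("source", lit("lakehouse"))  # add a constant column
      .drop("age")                             # drop a column
      .dropDuplicates(["customer_name"])       # or .distinct() for whole rows
)
result.show()
```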
Module 06 - Writing Data (22 min.)
In this module, Zane introduces students to writing data in PySpark using the `write` command, covering various formats and file paths. The module emphasizes the importance of different write modes such as append and overwrite, and demonstrates how to handle data frames to create and manage tables in a lakehouse environment. By the end of the module, students will have hands-on experience in writing data to files and tables, ensuring data integrity and efficient data management.
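As a sketch of the `write` command with the two modes mentioned, assuming a hypothetical `df`, output path, and table name:

```python
# Overwrite any previous output files under the lakehouse Files area
df.write.mode("overwrite").format("parquet").save("Files/output/customers")

# Register the data as a managed lakehouse table, appending on reruns
df.write.mode("append").saveAsTable("customers")
```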
Module 07 - Filtering Data (25 min.)
In this module, Zane guides students through the essential techniques for filtering data using PySpark in Fabric Notebooks. Key concepts covered include the use of the `filter` and `where` functions, handling case sensitivity, and leveraging functions like `startswith`, `endswith`, `contains`, and `like` for dynamic data queries. By the end of the module, students will be equipped to efficiently filter and manipulate data sets, ensuring accurate and relevant data retrieval for their projects.
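Here is a minimal sketch of those filters against a hypothetical `customer_name` column; lower-casing is one common way to sidestep case sensitivity.

```python
from pyspark.sql.functions import col, lower

df.filter(col("customer_name").startswith("A")).show()
df.where(col("customer_name").endswith("e")).show()
df.filter(col("customer_name").contains("ra")).show()

# SQL-style pattern matching; lower() makes the match case-insensitive
df.filter(lower(col("customer_name")).like("%ada%")).show()
```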
Module 08 - Aggregations (36 min.)
In this module, Zane guides students through the process of performing aggregations in PySpark within Fabric Notebooks. The focus is on using the `groupBy` function to organize data and applying various aggregation methods such as `min`, `max`, and `count`, with a preference for the `agg` function for its readability and ease of use. Additionally, students will learn to filter aggregated data, rename columns, and cast data types, enhancing their ability to manage and analyze large datasets effectively.
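A compact sketch of that `groupBy`/`agg` pattern, with hypothetical `region` and `amount` columns:

```python
from pyspark.sql import functions as F

summary = (
    df.groupBy("region")
      .agg(
          F.count("*").alias("orders"),        # rename via alias
          F.min("amount").alias("min_amount"),
          F.max("amount").alias("max_amount"),
      )
      .withColumn("orders", F.col("orders").cast("int"))  # cast a data type
      .filter(F.col("orders") > 10)            # filter the aggregated rows
)
summary.show()
```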
Module 09 - Working with Nulls (29 min.)
In this module, Zane explains the critical techniques for handling null values in data analysis using PySpark. Students will learn how to effectively use functions like `na.drop` and `na.fill` to manage and clean datasets, ensuring data integrity and accuracy. By the end of the module, students will be equipped with practical skills to either remove or replace null values, enhancing their data engineering capabilities.
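For instance, the two functions look like this in practice; the column names and fill values are hypothetical.

```python
# Drop rows where any of the listed columns are null
cleaned = df.na.drop(how="any", subset=["amount", "region"])

# Replace remaining nulls with per-column defaults
filled = cleaned.na.fill({"amount": 0.0, "region": "Unknown"})
filled.show()
```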
Zane Goodman is a Trainer for Pragmatic Works specializing in the Power Platform. He's worked in skilled labor roles as well as the construction industry. Zane spent a lot of time in attics, on boom lifts, and in ditches making sure all the lights turned on properly. Now his primary focus is turning on the light for learning, helping our customers learn the ins and outs of Power Apps.