This class covers the basics of PySpark, the Python API for Apache Spark, used for working with big data on Databricks. You will learn how to read and write data from different sources, how to work with numerical and string data, how to use DataFrames for data manipulation and analysis, and how to clean and transform data and handle null values. This class will help you get familiar with PySpark and its capabilities for data engineering and analysis.
Course Outline (Free Preview)
Module 00 - Class Introduction and Files (9 min.)
This video is an introduction to the PySpark for Databricks course, taught by Mitchell. He explains the agenda and the objectives of the course, which are to learn the basics of PySpark, the Python API for Apache Spark, used for massively parallel data processing and working with DataFrames. He also tells you how to download the class files, which include the source files, the notebooks, and the PDF of the slides.
Module 01 - Provisioning Databricks
In this video, Mitchell introduces Azure Databricks, a platform for data engineering, machine learning, and analytics. He explains some of the key features and benefits of Databricks, such as collaboration, scalability, and integration with Apache Spark and the Microsoft cloud. He also demonstrates how to create a Databricks workspace and a Spark cluster in Azure, and how to configure some of the settings and options.
Module 02 - Introduction to PySpark
In this video, Mitchell introduces PySpark, a Python API for Apache Spark, and shows how to create and run notebooks in Databricks. He demonstrates how to upload files to the Databricks file system, how to create variables and data types, and how to use the AI assistant to diagnose and fix errors. He also explains how to print and concatenate strings, and how to convert data types using simple commands.
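For a sense of what this looks like in a notebook, here is a minimal sketch of the kind of commands covered; the variable names are illustrative, not taken from the class files:

```python
# Create variables; Python infers the data types.
customer_name = "Adventure Works"
order_count = 42
unit_price = 9.99

# Print and concatenate strings; non-string values must be
# converted with str() before concatenation.
print(customer_name + " placed " + str(order_count) + " orders")

# Convert between data types with simple built-in commands.
price_as_string = str(unit_price)    # float -> string
count_as_float = float(order_count)  # int -> float
print(type(price_as_string), type(count_as_float))
```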
Module 03A - Working with Strings (25 min.)
In this video, Mitchell teaches you how to work with strings and some common string functions in PySpark. He shows you how to use the lower, upper, length, find, and replace functions to manipulate and transform string values. He also demonstrates how to return a specific character or a range of characters from a string using index positions and colons.
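A short sketch of these string operations in Python, with an illustrative value:

```python
product = "Mountain Bike"

# Common string functions.
print(product.lower())                    # 'mountain bike'
print(product.upper())                    # 'MOUNTAIN BIKE'
print(len(product))                       # 13
print(product.find("Bike"))               # 9, index where the substring starts
print(product.replace("Bike", "Frame"))   # 'Mountain Frame'

# Indexing and slicing with colons.
print(product[0])     # 'M', the first character
print(product[0:8])   # 'Mountain', characters 0 through 7
print(product[-4:])   # 'Bike', the last four characters
```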
Module 03B - Working with Numbers (13 min.)
In this video, Mitchell teaches you how to work with numbers in PySpark, using functions such as min, max, round, floor, ceiling, and square root. He also shows you how to import the modules that contain these functions, and how to use the AI assistant to diagnose errors and get suggestions. He gives you some examples and challenges to practice your skills and check your understanding.
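A minimal sketch of these numeric functions, including the module import they require; the sample values are illustrative:

```python
import math  # module that provides floor, ceil, and sqrt

sales = [12.75, 8.2, 19.99, 5.5]

# Built-in numeric functions.
print(min(sales))    # 5.5
print(max(sales))    # 19.99
print(round(12.75))  # 13, rounds to the nearest integer

# Functions imported from the math module.
print(math.floor(12.75))  # 12, always rounds down
print(math.ceil(12.75))   # 13, always rounds up
print(math.sqrt(16))      # 4.0
```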
Module 04 - Working with DataFrames (25 min.)
In this video, Mitchell explains what DataFrames are and why they are important for working with PySpark. He shows how to read data from different sources and formats into DataFrames, and how to use various commands to inspect and manipulate the data. He also demonstrates how to create and use schemas to define the data types and metadata of the DataFrames.
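A sketch of reading a CSV file into a DataFrame with an explicit schema; the file path and column names are illustrative, and `spark` is the session Databricks notebooks provide automatically:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# Define a schema to control the column data types and metadata.
schema = StructType([
    StructField("ProductKey", IntegerType(), True),
    StructField("ProductName", StringType(), True),
    StructField("ListPrice", DoubleType(), True),
])

# Read a CSV file into a DataFrame.
df = (spark.read
      .format("csv")
      .option("header", "true")
      .schema(schema)
      .load("/FileStore/tables/products.csv"))

# Inspect the data and its structure.
df.show(5)
df.printSchema()
```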
Module 05 - Querying in PySpark (30 min.)
In this video, Mitchell teaches you how to query data in PySpark using various methods and functions. He shows you how to select, rename, and drop columns, how to remove duplicates based on different criteria, and how to add new columns with transformations or literal values. He also demonstrates how to use the AI assistant to diagnose and fix errors in your code.
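A sketch of these querying operations on a DataFrame `df`; the column names are illustrative:

```python
from pyspark.sql.functions import col, lit

# Select columns, rename one, and drop another.
result = (df.select("ProductName", "ListPrice", "Color")
            .withColumnRenamed("ListPrice", "Price")
            .drop("Color"))

# Remove duplicates, either across all columns or based on specific ones.
unique_rows = df.dropDuplicates()
unique_products = df.dropDuplicates(["ProductName"])

# Add new columns: one from a transformation, one from a literal value.
df = df.withColumn("DiscountPrice", col("ListPrice") * 0.9)
df = df.withColumn("Source", lit("AdventureWorks"))
```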
Module 06 - Writing Data in PySpark (23 min.)
In this video, Mitchell explains how to write data in PySpark in Azure Databricks using different formats and modes. He shows how to use the write command with the Parquet, CSV, and JSON formats and their options, and how to specify the save mode as append, overwrite, error, or ignore. He also demonstrates how to remove duplicates, count rows, and validate the data using DataFrames and the DBFS browser.
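A sketch of the write command with different formats and save modes; the output paths are illustrative:

```python
# Write a DataFrame out in different formats and modes.
df.write.mode("overwrite").parquet("/FileStore/output/sales_parquet")
df.write.mode("append").option("header", "true").csv("/FileStore/output/sales_csv")
df.write.mode("ignore").json("/FileStore/output/sales_json")

# Validate the output by reading it back and counting the rows.
check = spark.read.parquet("/FileStore/output/sales_parquet")
print(check.count())
```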
Module 07 - Filtering Data (31 min.)
In this video, Mitchell teaches you how to filter data in PySpark notebooks using various methods and functions. He explains how to use the filter function with different operators and conditions, such as equals, not equals, starts with, ends with, contains, and in list. He also shows you how to deal with case sensitivity issues by using the lower function from the pyspark.sql.functions module.
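A sketch of these filter conditions on a DataFrame `df`; the column names and values are illustrative:

```python
from pyspark.sql.functions import col, lower

# Filter with different operators and conditions.
df.filter(col("Color") == "Red").show()                      # equals
df.filter(col("Color") != "Red").show()                      # not equals
df.filter(col("ProductName").startswith("Mountain")).show()  # starts with
df.filter(col("ProductName").endswith("Bike")).show()        # ends with
df.filter(col("ProductName").contains("Frame")).show()       # contains
df.filter(col("Color").isin("Red", "Black")).show()          # in list

# Handle case sensitivity by lowercasing the column before comparing.
df.filter(lower(col("Color")) == "red").show()
```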
Module 08 - Aggregations (35 min.)
In this video, Mitchell teaches you how to perform aggregations in PySpark, such as group by, sum, min, max, and count. He also shows you how to use the agg function to simplify your code and how to filter and order your results. He uses a data set of internet sales and demonstrates various examples of aggregating and renaming columns.
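A sketch of grouping and aggregating with agg(), including renaming with alias(); the DataFrame and column names are illustrative, not the exact internet sales data set used in class:

```python
from pyspark.sql.functions import sum, min, max, count, col

# Group by territory and compute several aggregations in one agg() call.
summary = (sales_df.groupBy("SalesTerritoryKey")
           .agg(sum("SalesAmount").alias("TotalSales"),
                min("SalesAmount").alias("MinSale"),
                max("SalesAmount").alias("MaxSale"),
                count("SalesOrderNumber").alias("OrderCount")))

# Filter and order the aggregated results.
summary.filter(col("TotalSales") > 100000) \
       .orderBy(col("TotalSales").desc()) \
       .show()
```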
Module 09 - Working with Null Values (32 min.)
In this video, Mitchell teaches you how to work with null values in PySpark DataFrames. He shows you how to use the na.drop and na.fill functions to either remove or replace rows that have null values in certain columns. He also demonstrates how to use conditional logic and different parameters to customize your null handling logic.
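A sketch of null handling with na.drop and na.fill; the column names are illustrative:

```python
# Remove rows that contain null values.
df.na.drop().show()                       # drop rows with a null in any column
df.na.drop(how="all").show()              # drop only rows where every column is null
df.na.drop(subset=["MiddleName"]).show()  # consider only specific columns

# Replace null values instead of dropping the rows.
df.na.fill("Unknown", subset=["MiddleName"]).show()  # fill string columns
df.na.fill(0, subset=["TotalChildren"]).show()       # fill numeric columns
```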
Mitchell Pearson has been with Pragmatic Works for 11 years as a Data Platform Consultant, Trainer, and Team Lead. Mitchell has authored books on SQL Server, Power BI, and the Power Platform. His data platform experience includes designing and implementing enterprise-level business intelligence solutions with the Microsoft SQL Server stack (T-SQL, SSIS, SSAS, SSRS), the Power Platform, Microsoft Azure, and Fabric.