Austin Libal walks viewers through the foundational steps of using PySpark within Microsoft Fabric. This session is ideal for beginners looking to explore data engineering and analytics using Spark notebooks in Fabric.
What You Need to Begin
- A Microsoft Fabric trial or full license
- A Fabric-enabled workspace
Creating Your First Lakehouse
- Use the Persona Switcher to switch to the Data Engineering persona.
- Create a new Lakehouse (e.g., “Lakehouse PySpark”).
- Upload a sample CSV file (e.g., holiday.csv) to the Lakehouse’s Files folder.
- Drag and drop the file into the Tables folder to auto-generate a table.
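If you prefer to do this step in code rather than by drag and drop, a minimal sketch is below. It assumes the file landed at Files/holiday.csv and that the notebook is attached to this Lakehouse (Fabric notebooks provide a ready-made spark session); the table name holiday is illustrative.

```python
# Sketch of loading the uploaded CSV from the Lakehouse Files area and saving
# it as a managed Delta table. Assumes the notebook is attached to the
# Lakehouse and the file was uploaded to Files/holiday.csv; the "spark"
# session is created for you in Fabric notebooks.
df = (
    spark.read
    .option("header", "true")       # first row contains column names
    .option("inferSchema", "true")  # let Spark guess column types
    .csv("Files/holiday.csv")
)

# Write the DataFrame to the Tables area as a Delta table named "holiday"
df.write.mode("overwrite").saveAsTable("holiday")
```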
Understanding Spark and PySpark
Austin explains that Spark is a distributed computing framework that allows for in-memory data processing using clusters. PySpark is the Python API for Spark, enabling users to write Spark applications using Python.
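As a frame of reference, here is a minimal PySpark sketch. Inside a Fabric notebook the spark session already exists, so the session-builder lines below only matter when running Spark somewhere else; the sample data is made up for illustration.

```python
# Minimal standalone PySpark example (outside Fabric you create the session
# yourself; inside a Fabric notebook a "spark" session is provided for you).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark-intro").getOrCreate()

# Build a tiny DataFrame in memory and run a distributed transformation on it
data = [("New Year's Day", 1), ("Independence Day", 7)]
df = spark.createDataFrame(data, ["holiday", "month"])
df.filter(df.month == 1).show()
```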
Working with Notebooks
- Open a new notebook from within the Lakehouse interface.
- Use the notebook to interact with your data using PySpark.
- Drag the holiday table into a code cell to auto-generate PySpark code.
- Run the cell to create a DataFrame and load data into memory.
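The auto-generated cell is typically a small Spark SQL query wrapped in PySpark. The sketch below shows the general shape; the Lakehouse and table names are placeholders for whatever your drag-and-drop produces.

```python
# Roughly what the auto-generated cell looks like when you drag the holiday
# table into a notebook (names are illustrative; yours will match your
# Lakehouse and table names).
df = spark.sql("SELECT * FROM Lakehouse_PySpark.holiday LIMIT 1000")
display(df)  # Fabric's rich table/chart preview of the DataFrame
```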
Notebook Features
- Switch between code and markdown cells.
- Use the Home ribbon to manage language settings and run options.
- Supported languages include PySpark, Scala, Spark SQL, and SparkR.
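Cell magics are the usual way to mix these languages within a single notebook. The sketch below assumes the notebook's default language is PySpark and that a holiday table exists; the %%sql magic is shown in comments because each magic must be the first line of its own cell.

```python
# Running Spark SQL from a PySpark notebook. In its own cell you could write:
#
#   %%sql
#   SELECT * FROM holiday LIMIT 10
#
# The same query expressed in PySpark:
df = spark.sql("SELECT * FROM holiday LIMIT 10")
df.show()
```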
Creating and Using DataFrames
To work with data in Spark, users create DataFrames. Austin demonstrates how to:
- Generate a DataFrame by dragging a table into a code cell.
- Run the cell to execute a Spark job and load data.
- Use the df.show() method to display data in a tabular format, as in the sketch below.
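Putting the pieces together, a minimal sketch of this workflow might look like the following, assuming the holiday table created earlier is available in the attached Lakehouse.

```python
# Sketch of the DataFrame workflow described above (table name assumed
# to be "holiday" in the attached Lakehouse).
df = spark.read.table("holiday")   # running the cell triggers a Spark job

# Display the first 20 rows in a plain tabular format
df.show()

# Optionally show more rows without truncating long column values
df.show(50, truncate=False)
```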
Don't forget to check out Pragmatic Works' on-demand learning platform for more insightful content and training sessions on Fabric and other Microsoft applications. Be sure to subscribe to the Pragmatic Works YouTube channel to stay up to date on the latest tips and tricks.