This week's Databricks post in our mini-series focuses on adding custom code libraries in Databricks. Databricks ships with many curated libraries built into the runtime, so you don't have to pull them in yourself. The preinstalled Python, R, Java, and Scala libraries are listed in the System Environment section of the Databricks runtime release notes.
Still, it's common to need to add custom code of some kind. In my video, I'll demo three ways to add custom libraries in Databricks, at the cluster, workspace, and notebook level, along with some use cases for each. Below is a summary of each option and how to get started, but be sure to watch my demo, which goes into much more detail and walks you through each option.
1. Cluster
- Databricks runs on clusters. In my example, I have two clusters in my workspace.
- In the cluster's Libraries tab, you'll see I added a library that is not part of the runtime, which I use to pull in Excel files. This is a very common use case: pulling files in from blob storage, parsing the Excel files, and loading them into a DataFrame (see the sketch after this list).
- To do this, go to the cluster's Libraries tab and click Install New. This brings up a screen that lets you install custom packages, which could be something you created internally in Java or Python.
- In my case, I wanted to pull in a Maven package, so I click on Maven, add the coordinates according to the documentation from the library's developers, and then click Install.
- Now any user in my Databricks environment who is attached to this cluster will have access to this library. See more in my demo.
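
As a concrete illustration, here is a minimal sketch of what that Excel use case can look like once the library is installed. I'm assuming the commonly used com.crealytics:spark-excel Maven library and a hypothetical blob storage path; the demo doesn't name the exact package, so treat both as placeholders.

```python
# Minimal sketch, assuming the com.crealytics:spark-excel Maven library
# is installed on the cluster. The storage path below is hypothetical,
# and option names can vary between spark-excel versions; check its docs.
df = (spark.read
      .format("com.crealytics.spark.excel")
      .option("header", "true")       # first row holds column names
      .option("inferSchema", "true")  # let Spark infer column types
      .load("wasbs://files@myaccount.blob.core.windows.net/reports/sales.xlsx"))

display(df)  # Databricks notebook helper for rendering a DataFrame
```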
2. Workspace
- For this second option, we go into Workspace, then click Create and Library.
- This brings up a screen very similar to the one we saw in the cluster option. Customers typically use this option for two things: to manage custom code they use throughout their environment, or to pin a Python or R package to a specific version.
- Watch my demo to see how I set up a use case for an ML group that wants to use PyTorch (which is not part of the regular install) and how a workspace library ensures they get the correct version. A quick sketch of the version check follows this list.
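
For that version-pinning scenario, here is a minimal sketch of a sanity check you might run in a notebook after attaching the workspace library, assuming it was created with a PyPI coordinate like torch==1.4.0 (the version number here is illustrative, not from the demo):

```python
# Confirm the cluster picked up the pinned PyTorch version from the
# workspace library rather than whatever pip would resolve by default.
import torch

print(torch.__version__)  # e.g. 1.4.0, matching the pinned coordinate
assert torch.__version__.startswith("1.4"), "unexpected PyTorch version"
```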
3. Notebook
- This final option, notebook-scoped libraries, is new and currently in public preview.
- A good use case for this is prototyping: for example, someone in machine learning wants to try a package to see if it gets better results but doesn't want to add it to the cluster or workspace yet.
- In my demo, I'll show how to do this by installing an ML package called Theano into my notebook; a minimal sketch follows this list.
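
Here is roughly what that looks like, assuming a Databricks runtime where the %pip magic command is available (the exact mechanism shown in the demo may differ). The install applies only to the current notebook session:

```python
# Cell 1: notebook-scoped install. %pip must sit at the top of its own
# cell and affects only this notebook's Python environment.
%pip install Theano
```

```python
# Cell 2: confirm the package is importable in this notebook.
import theano
print(theano.__version__)
```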
I hope this was helpful in showing you three ways to add custom code to Databricks at the cluster, workspace, and notebook level. If your organization needs help with Databricks, the Power Platform, or Azure in general, we're here to help. Contact us to learn more about how our customized solutions can help you gain valuable insights into your data for better business decisions.