Create a custom partition in Power BI and trigger a refresh with Python

Working with large datasets, I almost always get the question whether an incremental refresh can be configured within Power BI, and yes, this is possible as described here. Especially if you have a date or datetime column, it makes sense to set up an incremental refresh to reduce your dataset refresh time. Once set up, Power BI automatically creates some partitions behind the scenes, making sure the data is stored as configured. For example, if you wish to keep five years of archived data, each year will be represented by one partition. But what if you don't have a date/datetime column, or you wish to create your own partitions? Can you still create custom partitions and trigger some kind of incremental refresh? Yes, you can! Let me show you how.

Let’s quickly demystify the term “partition”

Asking ChatGPT what a partition in Power BI is, I got the following reply:

In Power BI, a partition is a way to break up a large dataset into smaller, more manageable chunks based on a set of defined criteria. This can be particularly useful when working with datasets that contain millions or billions of rows, as it can help to optimize performance and reduce the time it takes to load and process the data.

Partitions are essentially subsets of the data that are stored separately from the main dataset, based on specific conditions such as date ranges or other data attributes.

By dividing a large dataset into smaller partitions, Power BI can improve query and report performance, and also reduce the resources required to process and manage the data. This can lead to faster report rendering times, reduced memory usage, and improved overall user experience.

Overall, using partitions in Power BI can be a powerful way to optimize large datasets and improve the performance of your reports and visualizations.

ChatGPT on “Explain what partitions are for Power BI datasets”

I must admit I don’t have anything to add here.

Let’s get started

For demo purposes I'm going to create a dummy table in Azure SQL on which I can test custom partitions in Power BI. Further, I'll need Tabular Editor (either the open-source version 2 or version 3) in which I'll create my partitions. If you prefer, you can also use SQL Server Management Studio (SSMS) to create your partitions. Lastly, I'm going to use Synapse Notebooks to run my Python script at the end to trigger a partition refresh.

So let's start and create a dummy table. I created a SQL script which creates a table with an ID, Country, Region, and a random Revenue. To change which countries are included, you can add, delete, or modify the countries starting from line 28. If you wish to modify the schema and table names, just update them on lines 36 and 37. Lastly, I specify how many rows should be added per country. In my case I add 10 rows per country, but if you need more (or fewer), update the CounterMax variable on line 39.

Once executed, I got 70 rows in my newly created SQL table. Now, let's connect to it with Power BI Desktop and switch to Power Query (via the Transform Data button once connected to the SQL table). Why Power Query? Because I'll prepare my first partition there and reuse the code for my other partitions.

As you can see in the screenshot above, I don't have a filter applied yet and see the full list of all my countries. To make my solution as configurable as possible, I add a new parameter called “Europe” and one called “America”. I choose Text as the type and set Europe and America, respectively, as the Current Value.

The parameters screenshot shows two further parameters called “SQL Server Name” and “Database Name”. I parameterized my data source for reusability, which is considered best practice, but it's not mandatory.

Next, I select my table, choose the little arrow in the right corner of the column, select Text Filters – Equals… and choose my Europe parameter. Once done, I confirm by selecting OK.

Now, I select Advanced Editor in the ribbon and copy the whole M-Code behind the query. To make sure I don't lose the code, I paste it into Notepad.

Lastly, I close the Advanced Editor and confirm everything by hitting “Close & Apply” to load the data into Power BI. After the data has loaded, I publish my Power BI report to a Premium workspace (PPU or Embedded works as well).

Once published, I switch to Power BI Service, select the workspace in which the report has been published, head over to settings, and copy the workspace connection.

Next, I open Tabular Editor (from the External Tools ribbon or standalone), connect to my workspace with the copied connection, and select the custom partition dataset. In there, you will find one partition if you expand Tables – Your Table Name (Custom Partition in my case) – Partitions. If you select it, you'll find the M-Code in the Expression Editor on the right-hand side.

Let's now build our own partition with the same approach, but in Tabular Editor. All I need to do is right-click on Partitions, select New Partition (Power Query), then click my newly created partition and paste the M-Code into the Expression Editor. Lastly, I just need to update the “Europe” parameter to “America”. Make sure the name matches your parameter name exactly. If you wish to double-check it, just expand “Shared Expressions”, which represents your defined parameters in Power BI.

To make the partitions more user-friendly, let's rename them as well by selecting each one and hitting F2. I renamed them to “Europe” and “America”. Once done, save your data model.

If I check my report now, nothing has changed, as I haven't refreshed my dataset yet. So let's trigger a manual refresh in Power BI Service and check the report to see if something has changed. As we can see, all countries from the America region are now also included. Nice!

Let's now try to trigger a partition refresh instead of refreshing the whole table. In one of my previous blog posts I showed how to refresh a dataset with Python. I'm going to use the same code and just adjust it a little bit. Checking the documentation, we can add a body to our POST request and define which partitions should be refreshed. Therefore I adjust my Python script by adding the following body before calling the REST API.

body = {
    "type": "full",
    "commitMode": "transactional",
    "maxParallelism": 3,
    "retryCount": 1,
    "objects": [
        {
            "table": "Custom Partition",
            "partition": "Europe"
        }
    ]
}
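For context, here is a minimal sketch of how such a body could be passed to the refresh endpoint with the requests library; the IDs and the access token are placeholders, and the authentication itself follows the approach from the Python blog post mentioned above.

import requests

# Placeholders - in my script these come from the parameters and the MSAL authentication part
workspace_id = "<your workspace id>"
dataset_id = "<your dataset id>"
access_token = "<token acquired via MSAL>"

# Refresh endpoint of the Power BI REST API for this dataset
url = ("https://api.powerbi.com/v1.0/myorg/groups/" + workspace_id +
       "/datasets/" + dataset_id + "/refreshes")
header = {"Authorization": "Bearer " + access_token}

# Body limiting the refresh to the Europe partition of the Custom Partition table
body = {
    "type": "full",
    "commitMode": "transactional",
    "maxParallelism": 3,
    "retryCount": 1,
    "objects": [{"table": "Custom Partition", "partition": "Europe"}],
}

# Passing a JSON body turns this call into an enhanced refresh
response = requests.post(url, headers=header, json=body)
print(response.status_code)  # 202 means the refresh request was accepted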

After executing the script, I check the Refresh history of the dataset in Power BI Service and see Via Enhanced Api as the Type.

But how do I know that only one partition was refreshed and not the whole table? I can check that either with Tabular Editor or SSMS. In Tabular Editor I just connect again to my dataset, select my Europe partition, and check the “Last Processed” date and time. If I compare it with my America partition I see a difference – so it worked!

With SSMS I also have to connect to my dataset, right-click on my table, and select Partitions; then I can see the different partitions as well as the Last Processed date and time. By the way, you could also manually trigger a partition refresh from this view by selecting the Process button (it's the small one with the three green arrows) and confirming on the next screen with OK.

As a final step, let's test the performance a little bit, as well as what happens if we add a new region. Let's first add the Asia region with some countries. To do so, I open my SQL script again and add China, Japan, and India as Asian countries.

Once successfully executed, I have 100 rows in my Custom Partition table, as each new country added 10 rows. Let's refresh the Power BI dataset and check if I see the newly added countries in my report. As expected, nothing has changed in my report, meaning the Asia region is not included! This makes sense, as the two created partitions only include America and Europe, so everything else is filtered out. This means I either need to create another partition for each region or create an “everything else” partition. In my case, and for demo purposes, I choose the second option. To do so, I open my report in Power BI Desktop, switch to Power Query, click the gear icon on my Filtered Rows step under Applied Steps, and configure the filter to “does not equal” the Europe parameter and “does not equal” the America parameter.

Once applied with OK, I'll see a preview of my Asia countries coming from SQL. Following the steps described above, I just copy the whole M-Code from the Advanced Editor window, switch to Tabular Editor (or SSMS), and create a new partition with this code called “All other Regions”. After saving my data model I switch back to Power BI Service and trigger a full refresh manually. And now my three new countries are also included in my report.

Now that we have three different partitions and can be sure we'll get all data that is added to the SQL table, let's do some performance testing. I'll always follow the same approach:

  1. Add the same number of rows to each partition (I deleted France from my SQL table to make sure each region has 3 countries)
  2. Do a single partition refresh and check how long it takes
  3. Refresh all partitions at the same time and check how long it takes
  4. Do a full refresh and check how long it takes

As I'm interested in whether there is a difference between refreshing the whole data model and explicitly calling the REST API to trigger all partitions at once, I decided to differentiate steps 3 and 4.

You can also trigger multiple partition refreshes through the API by adjusting the body. In my case I used the below body.

body = {
    "type": "full",
    "commitMode": "transactional",
    "maxParallelism": 3,
    "retryCount": 1,
    "objects": [
        {
            "table": "Custom Partition",
            "partition": "Europe"
        },
        {
            "table": "Custom Partition",
            "partition": "America"
        },
        {
            "table": "Custom Partition",
            "partition": "All other Regions"
        }
    ]
}

To add new rows I’ll use my SQL script and adjust the CounterMax Parameter to different values.

Keep in mind, the bigger the number, the longer it will run. After reaching roughly 1 million rows, I used an INSERT INTO SQL statement to just duplicate the data instead of running the script, as it would have taken too long to add that many rows.

And here is my result. As you can see, triggering a single partition refresh is always faster than refreshing the whole dataset. Well, this is expected. But the time saved varies between 37% and 72%! The bigger the data, the more time you'll save. Interestingly enough, triggering all partitions via the REST API is also faster than a full refresh, but the bigger the data gets, the less time you'll save with it; still worth considering from my point of view.

Please let me know if this post was helpful and give me some feedback. Also feel free to contact me if you have any questions.

If you’re interested in the files used in this blog check out my GitHub repo https://github.com/PBI-Guy/blog

Refresh a Power BI Dataset with Python

As you probably know, Power BI datasets can be refreshed manually, on a schedule, or through the Power BI REST API. Depending on the data source, you will require a Power BI Gateway to be able to update your dataset. Manually refreshing a dataset can be handy if you need an instant refresh, but usually you will at least schedule it to automate the process. In some scenarios, however, you don't want to wait until a certain scheduled time, or you wish to trigger a refresh programmatically right after your ETL / ELT process has finished. In this blog post I'm going to show how to trigger a refresh with Python.

JFYI: You can also use Power Automate, PowerShell, or anything else that can call and authenticate against the Power BI REST API to trigger a dataset refresh, and there are plenty of good blog posts out there explaining how to do so.

Prerequisites

  • Power BI Dataset in Power BI Service
  • basic Python know-how
  • a registered Azure App
  • optional: Azure Synapse Analytics

How to

First of all, we need a Service Principal on whose behalf we can trigger the dataset refresh programmatically. You could also use your username and password to log in and call the Power BI REST API (so-called Master User authentication), but I wouldn't recommend this approach for some obvious reasons: your password can change, what happens to the program if you leave the company, probably too many rights, etc. Therefore we're going to register an app and grant just the needed permissions. A very good how-to guide can be found here: https://www.sqlshack.com/how-to-access-power-bi-rest-apis-programmatically/.

Checking the docs (https://docs.microsoft.com/en-us/rest/api/power-bi/datasets/get-refresh-history-in-group & https://docs.microsoft.com/en-us/rest/api/power-bi/datasets/refresh-dataset-in-group) we see that in our case we need at least Dataset.ReadWrite.All or Dataset.Read.All to be able to check the refresh history, and Dataset.ReadWrite.All to be able to trigger a refresh.

Before we’re going to refresh a Dataset we want to make sure that no refresh is happening right now. That’s the reason why we need the Refresh History API.

Once the app is registered, you'll get an App ID and Secret. Store this information somewhere secure, as we're going to need it later on.

To be able to access the dataset, our Service Principal needs sufficient permissions on the dataset itself. In my case, I'm just going to add it as an admin in my workspace.

As a next step, we're going to write our Python code, authenticate against Power BI, and call the needed REST APIs. You can choose whatever environment suits you best to write Python. In my case I'm going to use Azure Synapse and create a Jupyter notebook to run my Python code.

If you’re interested in how to start with Azure Synapse feel free to check out this link: https://docs.microsoft.com/en-us/azure/synapse-analytics/get-started

I created a folder called PBI Guy and a new notebook with the name Power BI Dataset Refresh.

As a first step we need some code, and afterwards we assign a Spark pool to be able to run it. I like to start by importing the needed libraries. Because we need to authenticate against Microsoft Azure Active Directory, we're going to use the MSAL library (https://docs.microsoft.com/en-us/azure/active-directory/develop/msal-overview). Further, we will call the Power BI REST API; for this purpose I'm going to use the requests library (https://docs.python-requests.org/en/latest/). Afterwards we need to convert the output into JSON format to make it easier to process the data, therefore we're importing the json library as well (https://docs.python.org/3/library/json.html). Last but not least, I'm going to use the pandas library (https://pandas.pydata.org/) just because it's so simple to transform and extract the necessary data. Luckily, importing libraries in Python is straightforward, and our first code snippet looks as follows.
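Based on the libraries listed above, that first cell could look roughly like this:

import msal          # authentication against Azure Active Directory
import requests      # calling the Power BI REST API
import json          # handling JSON data
import pandas as pd  # transforming and extracting the refresh history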

I'm sure there are multiple, probably easier, ways to achieve the same goal of triggering a Power BI dataset refresh, but thinking about future improvements of my code (perhaps storing the refresh history, comparing it to other data, etc.), I'll make it nice and clean from the beginning.

Our next goal is to set up all the needed parameters – and there are a few! We will need:

  • Client ID
  • Client Secret
  • Tenant Name
  • Workspace ID
  • Dataset ID

After we registered our app / Service Principal, we got the Client ID and Secret. If you're not sure what your tenant name is, just log in to your M365 Admin Center, go to Setup, click on Domains, and look at your domain name ending – this is your M365 tenant name. The easiest way to get your Workspace and Dataset IDs is to head over to Power BI Service, click on your dataset, and check the URL: the GUID after /groups/ is the Workspace ID, and the GUID after /datasets/ is the Dataset ID.

Besides that, we specify the authority URL, which is needed to authenticate with the MSAL library; basically, it points Python to the right Microsoft tenant. Further, we also need to specify the scope (Power BI in this case). Lastly, we also need our Power BI REST API URL. Because we're interested in just the last refresh history entry, we filter the query to the top 1 result. With all this information we create the following code snippet.
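A sketch of this parameter block could look like the following; all values in angle brackets are placeholders for your own app, tenant, workspace, and dataset:

# Placeholders - replace with the values of your registered app, tenant, workspace, and dataset
client_id = "<your client id>"
client_secret = "<your client secret>"
tenant_name = "<yourtenant.onmicrosoft.com>"
workspace_id = "<your workspace id>"
dataset_id = "<your dataset id>"

# The authority URL points MSAL to the right tenant, the scope targets the Power BI service
authority_url = "https://login.microsoftonline.com/" + tenant_name
scope = ["https://analysis.windows.net/powerbi/api/.default"]

# Refresh history endpoint of the dataset, filtered to the latest entry only
url = ("https://api.powerbi.com/v1.0/myorg/groups/" + workspace_id +
       "/datasets/" + dataset_id + "/refreshes?$top=1")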

Now it's time to authenticate and grab an access token which can be used to call the Power BI REST API and get the refresh history of the dataset. Once authenticated and an access token is available, we use it in the header of our request. After the header is set up correctly, we call the API through a GET request, providing the URL from above and the header we just created, including the access token. As a next step, we convert the result directly into JSON format, which is then used to build a pandas DataFrame with the columns requestId, id, refreshType, startTime, endTime, and status. All this information is provided through the Power BI REST API. Lastly, we set the DataFrame index to the id column. Our code should now look as follows.
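Using the parameters from the sketch above, this part could look roughly as follows (note there is no error handling, so a failed token request would raise an error):

# Authenticate against Azure AD with the Service Principal and acquire an access token
app = msal.ConfidentialClientApplication(
    client_id, authority=authority_url, client_credential=client_secret
)
token_response = app.acquire_token_for_client(scopes=scope)
access_token = token_response["access_token"]

# Provide the access token in the request header and call the refresh history API
header = {"Authorization": "Bearer " + access_token}
api_call = requests.get(url=url, headers=header)

# Convert the response to JSON and load the refresh entries into a pandas DataFrame
result = api_call.json()
df = pd.DataFrame(
    result["value"],
    columns=["requestId", "id", "refreshType", "startTime", "endTime", "status"],
)
df.set_index("id", inplace=True)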

Let us now check if the code runs successfully. We're going to execute the whole code to import all libraries, set the parameters, authenticate, and call the Power BI REST API to get the refresh history of our dataset. If everything works fine, we should see a result like the one below.

Perfect, this means we now have the Power BI dataset refresh history and can see that the last refresh was executed on demand (meaning manually) and has completed. The status is very important in our case because if a refresh is currently running, we can't trigger a new one and the request would fail. From the documentation we can see there are four different statuses:

  • Unknown: State is unknown or refresh is in progress
  • Completed: Refresh completed successfully
  • Failed: Refresh was unsuccessful
  • Disabled: Refresh is disabled by a selective refresh

So to trigger our refresh, we want to make sure the status is not Unknown. Further, we can also specify what should happen if the last refresh status is Failed or Disabled (e.g. enable the schedule through another API – see https://docs.microsoft.com/en-us/rest/api/power-bi/datasets/update-refresh-schedule-in-group). For demo purposes, I'm just going to print a different message based on the status. This means our last code snippet checks the status and, depending on it, executes different code.

I added an else section as well in case a new status value appears for some reason, so every situation is handled separately.
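A sketch of that last snippet, reusing the variables from above, could look like this; the exact messages are placeholders, and the POST without a body triggers a regular dataset refresh:

# Read the status of the latest refresh entry from the DataFrame
last_status = df["status"].iloc[0] if not df.empty else "Unknown"

if last_status == "Unknown":
    print("A refresh is currently in progress, not triggering a new one.")
elif last_status == "Disabled":
    print("Refresh is disabled by a selective refresh, please check the refresh schedule.")
elif last_status == "Failed":
    print("The last refresh failed, please check the refresh history for details.")
elif last_status == "Completed":
    # Trigger a new refresh by posting to the refreshes endpoint (no body needed for a standard refresh)
    refresh_url = ("https://api.powerbi.com/v1.0/myorg/groups/" + workspace_id +
                   "/datasets/" + dataset_id + "/refreshes")
    response = requests.post(url=refresh_url, headers=header)
    print("Refresh triggered, status code:", response.status_code)
else:
    print("Unexpected status:", last_status)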

If we now execute the last bit of our code we see the right message has been printed out – so far so good.

But let's also check the Power BI dataset refresh history. If we now switch to Power BI Service, select our dataset, move on to Settings, and check the refresh history, we see the following: our dataset has been refreshed and was triggered via API!

Imagine what you can do now: not only calling the dataset refresh REST API, but also any other API Power BI offers, via Python! Because I'm in Azure Synapse, I can even integrate my Python script into an Azure Data Factory pipeline and trigger a refresh at the end of my ETL / ELT process, and much more! I'm really excited about this opportunity and will definitely explore more possibilities in the near future.

Keep in mind that the limitations of your license still apply. This means if you're using the Power BI Free license you can programmatically refresh a dataset once a day. With a Pro license it's eight times a day, and with Premium or Premium per User it's 48 (see https://docs.microsoft.com/en-us/power-bi/connect-data/refresh-data#data-refresh).

Please let me know if this post was helpful and give me some feedback. Also feel free to contact me if you have any questions.

If you’re interested in the files used in this blog check out my GitHub repo https://github.com/PBI-Guy/blog