github – Anthony Fourie

End-to-End Analytics Platform – GitHub Actions

This is part of my series on learning to build an End-to-End Analytics Platform project.

TLDR; This post we got started with GitHub Actions. Test drove Excalidraw for diagraming. While building the workflow learnt Yaml in y minutes. Started with a simple starter workflow. Worked through deploying Bicep files by using GitHub Actions.

Automate construction 🚧

Now that we have the code set up and we can deploy it manually it’s time to ~~autobot~~ automate. We are going to use GitHub Actions for Azure to work on building our Continuous Integration/Continuous Deployment pipeline.

What we are looking to create is a workflow to automate tasks. Actions are event-driven. An example is “Run my testing workflow when some creates a pull request.”. What makes up a workflow? Queue Intro to GitHub Actions. To start off the thinking I get to test out a pretty fantastic open-source whiteboard tool, with tons of potential, called Excalidraw. *Inner voice screaming… “Excalidraw, I choose you!”* ⚔️

A workflow is an automated procedure that you add to your repository.
A job is a set of steps trigger by event/webhook or scheduled that execute on the same runner.
A step is an individual task that can run commands in a job.
Actions are standalone commands that are combined into steps to create a job.
A runner is a server that has the GitHub Actions runner application installed.

Now that we know some basics, let’s dive in and give it a go. Here is the plan for what we want to achieve with an accompanying artistic depiction:

Develop some code, commit to our branch, push the changes
Then complete a pull request and merge the changes
The GitHub action triggers the workflow
The workflow runs our jobs and steps to deploy the resources
Validate the resources got deployed to Azure

Let’s create our first GitHub Actions workflow. Navigate to the ‘Actions‘ tab in our repo. We get a few workflow templates we can use. Let’s use the ‘Simple workflow‘ by using the ‘Set up this workflow‘ button.

GitHub actions use YAML syntax for defining events, jobs, and steps. When we created the workflow, GitHub added a new .github/workflows directory to our repo. It created a new .yml file which I renamed to build.yml in that directory. For those of us that don’t speak YAML fluently, we can learn x(yaml) in y minutes 🧪.

A brief summary of what is going on here:

Our new YAML file and path was added to our repo
We can see our workflow is triggered on a push or pull request actions to our main branch
We have one job that runs on a Ubuntu runner server
We have three demo steps
1. Uses a packaged action from https://github.com/actions/ called checkout to checkout our repo
2. Runs a single line script in the runners shell
3. Runs a multi-line script in the runners shell

That’s a good enough starter template. From this point on, for brevity, I followed the documentation to deploy Bicep files by using GitHub Actions. That covers setting a deployment service principal (My choice, Windows Terminal Prettified), configuring GitHub secrets, and the sample workflow to deploy Bicep files to Azure. Once that is all set up, we have a workflow that looks like this:

Note: Though not the recommended practice. I adjusted the scope of my service principal to a subscription level. For my testing I would like GitHub Actions to create resource groups using my Bicep file definitions not the CLI. So my command was a little different:

az ad sp create-for-rbac --name {myApp} --role contributor --scopes /subscriptions/{subscription-id} --sdk-auth

After reading the Azure/arm-deploy documentation and the exampleGuide

To commit our changes, just use the ‘Start commit’ button in the top right corner. I created a new branch from here for the change, then just finished the pull request from there, merging our changes to the main branch.

Remember our triggers? On pull_request to main? That kicks off our pipeline 😎

Full disclosure: It failed. lol.

I got an error “missing ‘region’ parameter” for the Azure/arm-deploy action
I adjusted the file path for the ‘deployAnalyticsPlatform.bicep’ file from root to the src/bicep directory.
I also modified the trigger to only fire when a push is done to the main branch. That prevents us running the pipeline twice, once for the pull request, then again when the merge us run.

So we learn 😉

Quick update to the code. Run through the GitHub flow again and we are back in business. When we navigate into the Action we can see a bunch of information. Why the workflow was triggered. What’s the status. Which jobs are running.

When we click on the job, we can drill into the runner logs as well. This helps a bunch in debugging workflows. An example, is that we have a property for the storage account which is read-only:

The deployment succeeded though which I think is great progress!

That’s it! All done. Explored GitHub actions. Created service principals and GitHub secrets. Learnt some YAML. Broke and fixed our workflows. Then successfully deployed resources from Bicep code.

🐜

P.S. A really simple video that also helped me rapidly establish some key points quickly was: GitHub Actions Tutorial – Basic Concepts and CI/CD Pipeline with Docker

End-to-End Analytics Platform – Bicep What-If deployment

This is part of my series on learning to build an End-to-End Analytics Platform project.

TLDR; After I refactored my code to use modules I found that Bicep supports ‘What-If’ operations which explain what the code is going to do before deploying it. This post I do a short test on that. Found an issue not showing Azure Synapse resource creation. Then browsed the Bicep GitHub repo to search issues related to What-If operations. Didn’t find what I was hoping for, so logged my first public GitHub issue 😁.

Update: The issue we encountered seems to be related to another preflight improvement which is being worked on but is a “…bit of a gnarly, low level issue so please be patient 🙂“. I was amazed to see how quickly Bicep the team responded on this.

What happens when I push this button? 🤔

So after my previous post on factoring in some Bicep best practices for code reuse I noticed that Bicep supports ‘What-If’ operations.

az deployment sub create --name '<name of deployment>' --location '<location name>' --template-file '<path to bicep file>' --confirm-with-what-if

Side note: I had to change the VS Code theme to save us all from the agony of lime green on light grey background reading.

What’s nice is we get a breakdown of changes that we are about to apply to our environment. I think that is awesome.

Terminal output of a Bicep deployment what-if operation. — I have one question. Explosions? 🧨

Yes, for the eagle-eyed reader, I realised my storage account name is an Azure Region name hahaha 😂

Looking at the terminal output, reading top to bottom, I can see:

We are about to deploy at the subscription scope.
We are deploying a Azure Data Lake Gen2 Storage Account with blob container and all their configuration goodness.
We are deploying an Azure Synapse… wait a minute…

What was weird was that I didn’t see the Synapse Workspace. I checked the deployment details/output and it was there.

Azure portal deployments screen. — Deployed

I wondered if the reason it didn’t output the Azure Synapse Resource during the What-If was because I didn’t define any output variables for it which I did for the storage account.

I updated my variables, added output variables for my synapse.bicep module, then ran the What-If again. Aaaaand…. nothing changed. Considering Bicep is an Open Source project on GitHub we get to search for issues with ‘What-If’ operations. So, we get to create a issue 😁 Taking the things learnt over the past few posts on

That’s it. Done. Created our first public issue: what-if operation doesn’t seem to include all bicep defined or created resources · Issue #3682 · Azure/bicep (github.com).

The what-if behaviour doesn’t block us at this stage. The deployment works so at this point I think we are set for the next section to work on getting this into a GitHub Actions pipeline.

🐜

End-to-End Analytics Platform – Infrastructure as Code (IaC)

Photo by John Nail from Pexels

This is part of my series on learning to build an End-to-End Analytics Platform project.

TLDR; I set up a GitHub milestone with two issues. Started working with the Bicep language to build Azure resources. It’s basically a language that simplifies building Azure Resource Manager (ARM) templates. I installed the Bicep tooling. Defined resources using things like parameters, modules, and others using an ARM template guide. Used Bicep build to generate an ARM template from the Bicep file. Experimented with Bicep decompile to generating a Bicep file from an ARM template. Created my first gist to share some code. Lastly, used the Azure CLI to deploy the Bicep resource. Also… found a Bicep playground 🤸‍♂️ Just saying..

Prepping for Dev 👨‍💻

We are using the development flow from my previous post. Not enough time? Check out the GitHub Flow.

We need a starting point to build out our end-to-end analytics platform. We are going to attempt to deploy a Azure Synapse Analytics and required services with Bicep templates. This gives us two key capabilities:

Data Lake Storage to store our data
Pipelines to support orchestration and batch ingestion of data

Let’s get started. We created a new issue, updated the project board, set up our new branch in GitHub. Pulled the updates locally. Then checkout to that new branch.

In this post though I wanted to learn about GitHub Milestones. They make it easy track a bunch of related issues. They also have convenient progress tracking built in. So I added another issue. Then made my way to the milestone page from the issues tab:

Used the ‘New milestone’ button to create a new milestone. Gave it an name and filled in the details.

After that jump back to the issue and assign it to the milestone we just created. Notice that the milestone has a progress bar.

Nice. Issues, milestones, branches, in the flow. Time to get to building things.

Building Biceps 💪

To work with Bicep files we need to install the Bicep tools (Azure CLI + Bicep install, VS Code Bicep Extension) . Once that’s done, we add our first .bicep file 🦾 to the project. Remember to check which branch you are on locally.

Stepping back, according to the documentation, Bicep is a domain-specific language (DSL) that uses declarative syntax to deploy Azure resources. We not covering every area of Bicep here. The documentation does a good job of that. There are a few things that we am going to use in this post:

Parameters – to support re-usability
Resources/Child resources – to deploy things
Conditional deployment – to control what gets deployed
Modules – to group related code for reuse
Outputs – to reuse deployed resource output variables at some point

There are a bunch of other capabilities that you can explore from loops, functions, and more. So much goodness, so little time… someday maybe.

Now to start building out our resources. Let’s start with some parameters:

Intellisense integration for Bicep development.

Yes please! Intellisense for the win! I also added a comment which is being kind to my future self. Now let’s add a resource. The intellisense really helps a bunch to expedite development. We can get to the resource, the API version, and more using the Tab key. Another nice thing is using Ctrl+Space to expand more options, properties, and more:

We have some basic building blocks for figuring out how to create a resource. Next I expanded the declaration with more resources, parameters, comments, and properties using the template documentation and Azure-Samples.

Note: You could try create a Synapse Analytics Workspace using the portal and grab the template just before you create it as well.

Notice that you can reference the parameters in the resource declaration which helps with code reuse. I added a deployment condition to control what services get deployed. The code is in an intermediate state to showcase that I can use strings or parameters to assign values. I’ll gist show you what I did 😉:

	/Global parameters/
	param resLocation string = resourceGroup().location

	/This controls if we deploy the resource our not/
	param deployDataLake bool = true
	param deploySynapse bool = true

	/Resource specific parameters – Synapse Analytics/
	param synapsWorkspaceName string = 'fancy-name'
	param synapseSqlAdministratorLogin string = 'majestic-username'
	param synapseSqlAdministratorLoginPassword string = 'your-complex-password'

	/Create a data lake storage account which we use as the Synapse Analytics default data lake/
	resource datalake 'Microsoft.Storage/storageAccounts@2021-04-01' = if (deployDataLake == true) {
	name: 'fancy-name'
	location: resLocation
	sku: {
	name: 'Standard_LRS'
	tier: 'Standard'
	}
	kind: 'StorageV2'
	properties: {
	isHnsEnabled: true
	supportsHttpsTrafficOnly: true
	accessTier: 'Hot'
	networkAcls: {
	defaultAction: 'Allow'
	bypass: 'AzureServices'
	virtualNetworkRules: []
	ipRules: []
	}
	encryption: {
	services: {
	blob: {
	enabled: true
	}
	file: {
	enabled: true
	}
	}
	keySource: 'Microsoft.Storage'
	}
	}
	}

	/*
	I built this child resource by wroking my way back through these templates: https://github.com/Azure-Samples/Synapse/tree/main/Manage/DeployWorkspace/storage
	It get's a little tricky, but we are building a dependency chain of parent-child resources. e.g. Storage account -> Blob -> Container
	*/
	resource blobService 'Microsoft.Storage/storageAccounts/blobServices@2021-04-01' = if (deployDataLake == true) {
	parent: datalake
	name: 'default'
	properties: {
	cors: {
	corsRules: []
	}
	deleteRetentionPolicy: {
	enabled: false
	}
	}
	}

	resource container 'Microsoft.Storage/storageAccounts/blobServices/containers@2021-04-01' = if (deployDataLake == true) {
	parent: blobService
	name: 'workspace'
	properties: {
	publicAccess: 'None'
	}
	}

	/Create a Synapse Analytics workspace/
	resource synapseWorkspace 'Microsoft.Synapse/workspaces@2021-04-01-preview' = if (deploySynapse == true) {
	name: synapsWorkspaceName
	location: resLocation
	identity: {
	type: 'SystemAssigned'
	}
	properties: {
	defaultDataLakeStorage: {
	/I used the datalake resource and can use dot notation to reference information about it. This establishes a dependency./
	accountUrl: datalake.properties.primaryEndpoints.dfs
	filesystem: container.name
	}
	sqlAdministratorLogin: synapseSqlAdministratorLogin
	sqlAdministratorLoginPassword: synapseSqlAdministratorLoginPassword
	}

	resource workspaceFirewall 'firewallRules@2021-04-01-preview' = {
	name: 'allowAll'
	properties: {
	startIpAddress: '0.0.0.0'
	endIpAddress: '255.255.255.255'
	}
	}
	}

view raw deployAnalyticsPlatform.bicep hosted with ❤ by GitHub

It’s a basic Synapse deployment. The goal is to start deploying using Bicep. We can add things like RBAC assignment for storage access, network configurations on the storage firewalls, and others.

To ship it 🚢 we can use the Azure CLI in the VS Code integrated terminal. The deployment is pretty simple. Login into your Azure subscription with the Azure CLI. Set your subscription context.

az login
az account list
az account set --subscription 'your-subscription-name-or-id'

Create a resource group in which we want to deploy the resources defined in the .bicep file. Bicep can do this which we will get to another day.

az group create --resource-group 'your-resource-group' -location 'azure-region'

Deploy the resources at a resource group level specifying the Bicep file path as our template file. Once you submit the terminal will indicate that the deployment is running. We should see a JSON summary output when it’s done similar to our resource group deployment.

az deployment group create --resource-group 'your-resource-group' --template-file 'path-to-your-bicep-file'

Checking the deployment in the Azure Portal is simple. Navigate to the resource group. On the ‘Overview’ page, there is a ‘Deployments’

If you keep following the trail, you end up at the deployment detail screen:

We can validate the resources are deployed in the Azure Portal:

Let’s close off one of our issues and see what it does to the milestone:

Awesome! To clean up, just delete the resource group 👍

az group delete --resource-group 'your-resource-group'

Interesting finds

Bicep has nice capabilities for users coming from an ARM background is that you can use the Bicep build to have it build the ARM template 😉.

az bicep build --file 'path-to-your-bicep-file'

If you have ARM templates, you can try out the Bicep decompile functionality to TRY (it’s not perfect, so no guarantees) convert your ARM templates to Bicep files.

az bicep decompile --file 'path-to-your-bicep-file'

Wow! Another lengthy post. Thanks for sticking around. We covered some serious ground. We learnt a bunch and kept building a foundation for our future work. Future posts we can tackle things like modules, advanced resource deployments, and deploying using GitHub actions which should be fun.

🐜

End-to-End Analytics Platform – GitHub Setup

TLDR; I set up a GitHub repo for my learning project to build an End-to-End Analytics Platform. Used a GitHub project as reference (Atom). Cloned my repo with VS Code. Closed the loop with the GitHub Flow by creating an issue (GitHub Guides – Mastering Issues), then a branch, changed/commit stuff, pushed changes, created a pull request, merge the changes. Found an exciting learning tool (GitHub Lab | How it works), project (GitHub Training Kit), and VS Code GitHub integration along the way.

As part of my series on learning to build an End-to-End Analytics Platform project I want to start To get started here are a few things I want to get set up:

README.md – project on a page 📄.
.gitignore file – ignores files starting with VS Code using the gitignore extension.
docs folder – using this to store documents related to the project.
src folder – planning to use this as the root folder for code in the repo.
test folder – planning to use this as the root folder for unit tests.
actions folder – as far as I can tell at the moment for GitHub action-related files.
GitHub Project – using this to track my ~~work~~ issues over time.

How did I figure out that’s how I wanted to setup the repo you ask? I took my inspiration from the atom/github: Git and GitHub integration for Atom repository. Why? Well, I am starting my learning for GitHub through a LinkedIn Learning course Learning GitHub and that was the repo showcased in intro videos.

When I grow up I’ll work on using an approach outlined by Rob Sewell who has a fabulous beard, and some next-level skills: How to fork a GitHub repository and contribute to an open source project – Rob Sewell – SQLDBAWithABeard. For now, no forking the repo. The plan is to use VS Code in with my development environment setup to clone the repo locally, setup a new remote/local branch to get into the swing of feature branching, then make the changes are complete a very simple pull request.

Creating and cloning repo was easy enough using VS Code. Now I didn’t do everything in VS Code purely because I want to get a used to a few things first. So I manually grabbed the Clone URL from the GitHub repo.

Screen clip of GitHub clone URL. — Grabbing the repository URL for cloning.

All by the book so far. I opened up the Command Palette with Ctrl+Shift+P. Start typing ‘git clone’ and select the ‘Git: Clone’ operation.

Using VS Code command palette to start Git clone. — Use the VS Code command palette to start Git Clone.

Pasted in the repo URL.

Passing in the repository URL to clone in VS Code. — Paste the URL we copied earlier.

I was prompted to choose an repository location to clone the repo to. I chose to go with a folder that I can use for future repos, something like <your path>\github\<your repo>. Once that’s done, we get prompted to open up the repo.

Open repository pop-up in VS Code. — VS Code picks up I did something.

Great! Repo has been cloned locally and the files are there. Local repository has been initialised and we are on the local main branch (see bottom left).

Side note: I set a local repo user configs aligned with the user name and email to be used by git for committing to my remote GitHub repo:

git config --local user.name 'example'
git config --local user.email 'example@domain.com'

Before creating feature branches I want to track my work for updating the project structure and the documentation. To do that we create a ‘Project Board‘:

Create a new project from the Projects tab.

Give it a name, description, and if you want to use a template go for it. I opted for a ‘basic kanban’ project template. As I go a long learning how all the triggers and things work this will evolve to something more sophisticated.

First look at the new spiffy project board.

It’s beginning to take shape, nice! Once that is done I need to add a work item to track my work against, to do that we create an ‘Issue‘:

Adding a new issue from the project Issues tab.

Fill it with information. Assign it to myself. Then label it with a ‘documentation’ label. Then hit the ‘Submit new issue’ button. Awesome. We now have an new issue that we can discuss, subscribe to for notifications, assign to projects, milestones, and more.

Once the issue is logged, I jumped over to the Project Board and added the ‘card’ to the ‘To Do’ swim lane. Now it automatically links up to this project.

Switching over to the Code tab, we create a branch from the main branch in GitHub so that we can make our changes to that branch.

Creating a new branch from the main branch.

Once the branch is created we need to sync our local repo with a Git Pull to get the latest changes one of which is the addition of the new branch. The next point is to switch to or ‘check out’ that branch we just created. Opening up the Command Palette with Ctrl+Shift+P. Start typing ‘git checkout’ and select the ‘Git: Checkout to..’ operation. Then the feature branch you want to switch to.

Next up. Make changes.

I added all the files and folders that I mentioned earlier. A noteworthy mention though was adding a .gitignore file using the gitignore extension which pulls .gitignore templates from the https://github.com/github/gitignore repository. Awesome. Didn’t know there was a repo full of .gitignore templates.

Opening up the Command Palette with Ctrl+Shift+P. Start typing ‘add gitignore’ and select the ‘Add gitignore’ operation.

Using the Command Palette to add a gitignore file.

Then just choose a template. In my case that’s ‘VisualStudioCode’.

Adding a gitignore file for Visual Studio Code.

Nice! That was pretty easy.

Next up push it up remote. Using the menu in the left, I switched to the Source Control view (Ctrl+Shift+G). Added a commit message. Then hit the commit button. In my case I went for the ‘stage all changes and commit them directly‘ option.

Now notice at the bottom of the screen in the status bar we can see we have a change waiting to be pushed to the remote repository in GitHub.

Just click on that. Follow the prompts and VS Code will push your changes up to the remote repository. In future with branches we will follow this up with a pull request.

Once that’s done, we jump back to GitHub and we can see our branch is updated with the files we added. Next step is to finish of the flow by creating a Pull Request and merging our changes back into the main branch. To start that process we click on the ‘Compare & pull request’ button.

Starting the review and pull request process.

As part of the pull request we fill in as much information as we can and link any issues that we want to close automatically using keywords. Once we filled everything out and we are happy there are no conflicts, we can click on the ‘Create pull request’ button.

Filing out the pull request and closing issues.

We get a valuable amount of rich history related to the work we have done here. Looking through the details I see I have no conflicts either. Let the merge begin.

The merge is successful. In this case the work I have done on the ‘anthonyfourie-skeleton’ branch is complete. I don’t need that branch anymore so I am going to delete it.

Deleting the branch after a successful merge.

That’s it! Flipping back to the Issues tab we can see our issue has been closed. Whoop there it is!

All in all I think that was a pretty good start. Looking forward to the next post where I think I will tackle spinning up some basic infrastructure by using infrastructure as code.

🐜