Streaming with AWS Kinesis and Databricks

Streaming data is fresh data, and it plays a big role in the actionable decisions that can be made with that data. Let’s look at the chart below to see how quickly data loses its value over time.

Productionize Databricks Notebooks

This article shows how to promote Databricks notebooks through the stages of a code promotion process using Azure Databricks and Azure DevOps.


To follow along, please create the following:

I have already created all the above-mentioned services, as shown in the screenshot below.

Apache Spark optimization techniques for better performance

Source: Planning above and beyond

A shuffle operation is a natural side effect of a wide transformation. We see it with wide transformations such as join(), distinct(), groupBy(), orderBy(), and a handful of others.
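To make the idea concrete, here is a minimal pure-Python sketch of why a grouping operation forces a shuffle: records with the same key start out scattered across input partitions and must be moved so that each key lands in exactly one output partition. This is only a toy model, not Spark code; the helper names (`target_partition`, `shuffle`, `sum_by_key`) and the sample data are made up for illustration.

```python
import zlib
from collections import defaultdict


def target_partition(key, num_partitions):
    # Deterministic hash so the example behaves the same on every run.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(input_partitions, num_partitions):
    # Move every record to the output partition chosen by its key.
    output = [[] for _ in range(num_partitions)]
    for partition in input_partitions:
        for key, value in partition:
            output[target_partition(key, num_partitions)].append((key, value))
    return output


def sum_by_key(partition):
    # After the shuffle, each partition can aggregate its keys on its own.
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)


# Hypothetical input: the key "a" appears in all three input partitions.
input_partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("a", 6)],
]

shuffled = shuffle(input_partitions, num_partitions=2)
results = [sum_by_key(partition) for partition in shuffled]
```

A narrow transformation such as a per-record map would not need the `shuffle` step at all, because each input partition could be processed independently.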

Apache Spark optimization techniques for better performance

Photo by Thorn Yang from Pexels

An easy way to think about spill: every task has a corresponding partition, and if the task cannot process that partition within the memory allocated to it, the data that represents the partition is spilled (written) to disk and read back again.

Spill is the term used to refer to the act of moving an RDD from RAM to disk, and later back into RAM again.

This occurs when a given partition is simply too large to fit into RAM. …
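As a rough illustration of the write-to-disk-and-read-back idea, the sketch below sorts a list while keeping only a fixed number of records in an in-memory buffer. It is not how Spark implements spill; the function name `sort_with_spill` and the `max_in_memory` parameter are made up for this example, and for brevity the spilled runs are loaded back whole when merging.

```python
import heapq
import pickle
import tempfile


def sort_with_spill(records, max_in_memory):
    """Sort records, spilling the buffer to disk whenever it fills up.

    Returns the sorted records and the number of spills that happened.
    """
    spill_files = []
    buffer = []
    for record in records:
        buffer.append(record)
        if len(buffer) >= max_in_memory:
            # Buffer is full: sort it and write it out as one sorted run.
            buffer.sort()
            spill_file = tempfile.TemporaryFile()
            pickle.dump(buffer, spill_file)
            spill_file.seek(0)
            spill_files.append(spill_file)
            buffer = []

    # Whatever is left never had to leave memory.
    buffer.sort()
    runs = [buffer]

    # Read each spilled run back (simplification: each run is loaded whole).
    for spill_file in spill_files:
        runs.append(pickle.load(spill_file))
        spill_file.close()

    # Merge the sorted runs into one sorted result.
    return list(heapq.merge(*runs)), len(spill_files)


sorted_records, spill_count = sort_with_spill([5, 3, 8, 1, 9, 2, 7], max_in_memory=3)
```

With a larger `max_in_memory` the same input would sort without any spill, which mirrors the point above: spill is a consequence of a partition not fitting in the memory available to its task.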

Apache Spark optimization techniques for better performance

Copyright: © HanLockwood

In a Spark cluster, data is typically read in as 128 MB partitions, which ensures an even distribution of data. However, as the data is transformed (e.g. aggregated), it is possible to end up with significantly more records in one partition than in another. This significantly uneven distribution of records across partitions is called skew.

→ A small amount of skew, in the range of 10 percent, can be ignored.

→ But large skews can result in spill or, worse, out-of-memory errors.
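One way to put the 10 percent rule of thumb into practice is to compare each partition's record count against a typical partition. The sketch below is an assumption-laden illustration rather than a Spark API: it takes the median as the "typical" size, and the function name `skewed_partitions`, the `tolerance` parameter, and the sample counts are all hypothetical.

```python
from statistics import median


def skewed_partitions(record_counts, tolerance=0.10):
    """Return indexes of partitions that exceed the median size by more than tolerance."""
    typical = median(record_counts)
    if typical == 0:
        # Avoid dividing by zero; any non-empty partition counts as skewed.
        return [i for i, count in enumerate(record_counts) if count > 0]
    flagged = []
    for index, count in enumerate(record_counts):
        excess = (count - typical) / typical
        if excess > tolerance:
            flagged.append(index)
    return flagged


# Hypothetical record counts per partition after an aggregation.
counts = [1000, 1050, 980, 5200]
flagged = skewed_partitions(counts)
```

Here only the last partition is flagged: the first three are within a few percent of the median, while the fourth holds roughly five times as many records and is the one most likely to spill.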

You need to learn Kubernetes right now! But why? Well, Kubernetes can deploy hundreds of containers with just one command, and it is present in almost every DevOps pipeline. So it is important!

In this article, we are going to walk through the Kubernetes architecture and its various components, and how they work together.

Himansu Sekhar

Data Engineering | DevOps | DataOps | Distributed Computing
