Published in road to data engineering · Jan 5, 2021
Stream Data from Kinesis to Databricks with Pyspark
Streaming with AWS Kinesis and Databricks
Streaming data is fresh data, and it plays a big role in the actionable decisions that can be made with that data. Let’s look at the chart below, which shows how quickly data loses its value over time.
Databricks · 4 min read

Published in road to data engineering · Jan 3, 2021
Databricks Notebook Promotion using Azure DevOps
Productionize Databricks Notebooks
This article shows how to promote Databricks notebooks through the different stages of the code promotion process using Azure Databricks and Azure DevOps. Prerequisites: to follow along, please create the following: an Azure account (free), Azure Storage Account, Azure Key Vault, Azure Databricks, Azure DevOps. I have already created all the above…
Databricks · 6 min read

Published in road to data engineering · Dec 29, 2020
Spark Performance Optimization Series: #3. Shuffle
Apache Spark optimization techniques for better performance
A shuffle operation is a natural side effect of a wide transformation. We see it with wide transformations like join(), distinct(), groupBy(), orderBy(), and a handful of others.
Apache Spark · 3 min read

Published in road to data engineering · Dec 28, 2020
Spark Performance Optimization Series: #2. Spill
Apache Spark optimization techniques for better performance
An easy way to think about spill: every task has a corresponding partition, and if the task cannot process that partition within the memory allocated to it, the data that represents the partition is spilled (written) to disk and read back…
Apache Spark · 3 min read

Published in road to data engineering · Dec 27, 2020
Spark Performance Optimization Series: #1. Skew
Apache Spark optimization techniques for better performance
In a Spark cluster, data is typically read in as 128 MB partitions, which ensures an even distribution of data. However, as the data is transformed (e.g. aggregated), it is possible to have significantly more records in one partition than in another. …
Spark · 3 min read

Nov 25, 2020
Kubernetes Architecture, Hands On!
You need to learn Kubernetes right now! But why? Well, Kubernetes can deploy hundreds of containers with just one command, and it is present in almost every DevOps pipeline. So it is important! In this article we are going to walk through the Kubernetes architecture and its various components and…
Kubernetes · 7 min read