MyTechBlog

Posts

Showing posts from December, 2025

Switching from Spark DataFrame .toPandas()

December 07, 2025

Pandas vs. PySpark Performance: Understanding the Differences and Avoiding .toPandas()

1. Introduction

This document aims to delineate the fundamental performance characteristics of Pandas and PySpark DataFrames, particularly in the context of large-scale data processing. It will highlight why PySpark is the preferred choice for big data analytics and critically examine the implications and pitfalls of converting a distributed PySpark DataFrame to a single-node Pandas DataFrame using the .toPandas() operation.

2. Understanding Pandas DataFrames

Pandas is a powerful and widely used open-source data analysis and manipulation library for Python.

In-Memory Processing: Pandas DataFrames operate entirely in memory on a single machine. All data must fit within the RAM of the machine running the Pandas process.

Single-Threaded (mostly): While some Pandas operations can leverage multiple CPU cores, the core architecture is fundamentally single-node and often single-threaded for many common o...
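The in-memory constraint described above can be made concrete with some rough arithmetic. The sketch below is only an illustration, not part of the original post: the helper names (estimated_bytes, fits_in_memory), the 8-bytes-per-value figure (a float64 or int64 column) and the 50% safety factor are all assumptions. Real Pandas DataFrames with string or object columns typically need considerably more memory than this lower bound suggests.

```python
def estimated_bytes(rows: int, numeric_cols: int, bytes_per_value: int = 8) -> int:
    """Lower-bound size of a purely numeric table (hypothetical helper).

    Assumes every value is a fixed-width 8-byte number (float64/int64)
    and ignores index and per-object overhead.
    """
    return rows * numeric_cols * bytes_per_value


def fits_in_memory(rows: int, numeric_cols: int, available_bytes: int,
                   safety_factor: float = 0.5) -> bool:
    """Return True if the estimate stays under a fraction of available RAM.

    The safety factor is an assumed margin that leaves headroom for the
    temporary copies many DataFrame operations create.
    """
    return estimated_bytes(rows, numeric_cols) <= available_bytes * safety_factor


RAM_16_GIB = 16 * 1024**3  # a single machine with 16 GiB of RAM (assumed)

# 500 million rows x 20 numeric columns is about 80 GB: far beyond one machine.
big = estimated_bytes(500_000_000, 20)
big_fits = fits_in_memory(500_000_000, 20, RAM_16_GIB)

# 1 million rows x 20 numeric columns is about 160 MB: fine on a single node.
small = estimated_bytes(1_000_000, 20)
small_fits = fits_in_memory(1_000_000, 20, RAM_16_GIB)
```

A check along these lines is one way to decide whether collecting a distributed DataFrame onto a single machine is even plausible before calling .toPandas().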

December 07, 2025

🚀 Stop Building Data Swamps: My Blueprint for an AI-Ready Lakehouse on Databricks & Azure/AWS

The biggest bottleneck for high-impact AI adoption isn't the model. It's the data. Most Data Lakes are simply repositories; they are not engineered for the speed, quality, and governance that production ML demands. They are data swamps.

As a Senior Data Engineer, I've standardized on the Lakehouse Architecture, anchored by Databricks (using Delta Lake on Azure/AWS), as the only pattern that delivers the necessary ACID properties, governance, and real-time performance for reliable MLOps.

This is the three-step architectural pattern (The Medallion Architecture) I use to transform raw data into a reliable, high-performance Gold Layer specifically designed for model training and serving.

1. 🥉 The Bronze Layer: The Immutable Source

This layer is pure ingestion. It's the "raw" data stored as-is from sources (Event Hubs, Kinesis, transactional DBs).

Goal: Cap...