Amrit-Hub/How-to-become-Data-Engineering-Essentials

How to become a Data Engineer?

Watch the video below:

YouTube - What is Data Engineering

Must-go-through free resources:

Azure Data factory


Azure Synapse


Azure Databricks


Pyspark


SQL


Python


Git


Azure Fundamentals

Advanced Spark videos with slides (a must for interviews)

  1. Making Apache Spark Better with Delta Lake [Presentation slides here]
  2. Understanding Query Plans and Spark UIs - Xiao Li Databricks [Presentation slides here]
  3. Optimizing Delta Parquet Data Lakes for Apache Spark - Matthew Powers [Presentation slides here]
  4. Everyday I'm Shuffling - Tips for Writing Better Apache Spark Programs [Presentation slides here]
  5. Optimizing Apache Spark SQL Joins: Spark Summit East talk by Vida Ha [Presentation slides here]
  6. Apache Spark Core - Deep Dive - Proper Optimization - Daniel Tomes, Databricks [Presentation slides here]
  7. The Parquet Format and Performance Optimization Opportunities Boudewijn Braams [Presentation slides here]
  8. Easy, Scalable, Fault Tolerant Stream Processing with Structured Streaming in Apache Spark [Presentation slides here]
  9. Spark Architecture, Alexey Grishchenko [Presentation slides here]
  10. Deeper Understanding of Spark Internals - Aaron Davidson [Presentation slides here]
  11. Advanced Apache Spark Training - Sameer Farooqui [Presentation slides here]
  12. Top 5 Mistakes When Writing Spark Applications [Presentation slides here]
  13. Spark SQL: A Compiler from Queries to RDDs with Sameer Agarwal [Presentation slides here]
  14. Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenchen Fan [Presentation slides here]
  15. A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai [Presentation slides here]
  16. Spark + Parquet In Depth: Spark Summit East talk by: Emily Curtin and Robbie Strickland [Presentation slides here]
  17. Change Data Feed in Delta [Presentation slides here]
  18. Deep Dive into Delta Lake [Presentation slides here]
  19. Diving into Delta Lake: Unpacking the Transaction Log [Presentation slides here]
  20. Delta Lake 2.0 Overview [Presentation slides here]
  21. Accelerating Data Ingestion with Databricks Autoloader Simon [Presentation slides here]
  22. Tuning and Debugging in Apache Spark Patrick Wendell [Presentation slides here]
  23. Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia [Presentation slides here]
  24. Understanding the Performance of Spark Applications - Patrick Wendell [Presentation slides here]
  25. SQL, DataFrames, Datasets And Streaming - by Michael Armbrust [Presentation slides here]
  26. Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das [Presentation slides here]
  27. Designing ETL Pipelines with Structured Streaming and Delta Lake How to Architect Things Right [Presentation slides here]
  28. Deep Dive: Apache Spark Memory Management [Presentation slides here]

Fast-track Interview

WIP

Optional resources for Mechanical Engineers:

Certifications:

Advanced Reads:

Free Cloud Resources:

๐——๐—ฎ๐˜๐—ฎ ๐—˜๐—ป๐—ด๐—ถ๐—ป๐—ฒ๐—ฒ๐—ฟ๐—ถ๐—ป๐—ด Burger

image

Road Map to Data Engineer

[image]

Data Warehouse vs Lake vs Mesh

[image]

Data Warehouse vs Lake vs Lakehouse vs Mesh

[image]

Cloud Platform Models

[image]

ETL vs ELT vs reverse ETL

[image]

Star vs Snowflake Schema

[image]

Medallion Architecture

[image]

Database Types

[image]

Database Indexing

[image]
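As a rough illustration of what an index changes, the sketch below uses Python's built-in `sqlite3` module to compare the query plan for the same lookup before and after creating an index. The `orders` table, its columns, and the index name `idx_orders_customer` are hypothetical, and the exact plan wording varies between SQLite versions.

```python
import sqlite3

# Hypothetical table used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)


def query_plan(connection, sql):
    """Return SQLite's plan description for a statement as a single string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each row holds the human-readable plan detail.
    return " | ".join(row[-1] for row in rows)


lookup = "SELECT amount FROM orders WHERE customer_id = 42"

# Without an index, SQLite has to scan the whole table.
plan_before = query_plan(conn, lookup)

# With an index on the filter column, SQLite can search the index instead.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, lookup)

print("before:", plan_before)
print("after: ", plan_after)
```

The first plan should report a table scan and the second a search using `idx_orders_customer`. Other database engines expose the same idea through their own `EXPLAIN` variants.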

SQL Execution Order

[image]
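The logical order in which a `SELECT` statement is evaluated (FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, LIMIT) differs from the order in which it is written. The sketch below runs a query with `sqlite3` and then reproduces the same result step by step in plain Python, following that logical order. The `sales` table and its data are made up for the example, and a real engine is free to optimize the physical execution differently.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (customer, category, amount).
rows = [
    ("alice", "books", 30.0),
    ("alice", "games", 50.0),
    ("bob", "books", 20.0),
    ("bob", "games", 5.0),
    ("carol", "books", 70.0),
    ("carol", "music", 15.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, category TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

sql_result = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM sales
    WHERE amount >= 10
    GROUP BY customer
    HAVING SUM(amount) > 25
    ORDER BY total DESC
    LIMIT 2
    """
).fetchall()

# The same query, evaluated by hand in logical order.
source = rows                                          # 1. FROM
filtered = [r for r in source if r[2] >= 10]           # 2. WHERE
groups = defaultdict(float)
for customer, _category, amount in filtered:           # 3. GROUP BY
    groups[customer] += amount
kept = {c: t for c, t in groups.items() if t > 25}     # 4. HAVING
selected = [(c, t) for c, t in kept.items()]           # 5. SELECT
ordered = sorted(selected, key=lambda r: r[1], reverse=True)  # 6. ORDER BY
manual_result = ordered[:2]                            # 7. LIMIT

print(sql_result)
print(manual_result)
```

Both approaches return the same two rows, which is why a filter on an aggregate belongs in `HAVING` rather than `WHERE`: the groups do not exist yet when `WHERE` is applied.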

HTTP Status Codes

[image]
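HTTP status codes are grouped into five classes by their first digit. A minimal sketch of that grouping, using the standard library's `http.HTTPStatus` enum for the reason phrases; the helper names `status_class` and `describe` are hypothetical and exist only for this example.

```python
from http import HTTPStatus

# The first digit of a status code determines its class.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_class(code: int) -> str:
    """Map a numeric status code to its class name."""
    return STATUS_CLASSES.get(code // 100, "unknown")


def describe(code: int) -> str:
    """Return the code, its standard reason phrase, and its class."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Code is not registered in the standard library's enum.
        phrase = "Unknown"
    return f"{code} {phrase} ({status_class(code)})"


for code in (200, 301, 404, 503):
    print(describe(code))
```

In a pipeline that calls REST APIs, this kind of classification is often used to decide whether a failed request is worth retrying (typically server errors) or should be reported immediately (typically client errors).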

Containerization

[image]