Data Engineering and Scalable Analytics Pipelines
Synopsis
Data engineering underpins the design of scalable analytics pipelines, yet most books, courses, and tutorials on data engineering overlook essential foundational topics. An appreciation of these topics sharpens proficiency in data engineering, improves the quality of deployed pipelines, and helps engineers avoid introducing common data problems into their pipelines. This work provides a detailed overview of fundamental concepts in data engineering: not the internals of a cloud provider, a machine-learning pipeline, a specific tool, or a programming framework, but the topics every data engineer should understand, regardless of how they came by the title.
Solutions to these problems are also proposed, grounded in foundational concepts of data engineering. At a high level, data engineering is about building scalable analytics pipelines that meet business requirements. An analytics pipeline encompasses data acquisition, ETL, data modeling, and the consumption of data by data science, business intelligence, and operational teams for analysis, reporting, and operational decision-making. The pipeline is usually orchestrated with high-level tools such as Apache NiFi or AWS Glue. The emphasis on scalability arises because data pipelines and the associated analytics often process large volumes and varieties of data. The proposed solutions address the fundamental concepts enumerated next: data modeling and storage; ingestion, cleansing, and validation; architectural patterns for scalability; data processing frameworks and tools; data quality; and orchestration and scheduling.
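The pipeline stages described above can be sketched in miniature. The following is a hypothetical, minimal illustration (not a production design): acquisition as an extract step, cleansing and validation as a transform step, and loading for downstream consumption. The record fields (`user_id`, `amount`) and the in-memory sink are assumptions for illustration; a real pipeline would read from external sources and be orchestrated by a tool such as Apache NiFi or AWS Glue.

```python
def extract():
    # Acquisition: in practice this would read from an API, file, or queue.
    # These sample records are hypothetical; one is deliberately malformed.
    return [
        {"user_id": "1", "amount": "19.99"},
        {"user_id": "2", "amount": "not-a-number"},
        {"user_id": "3", "amount": "5.00"},
    ]

def transform(records):
    # Cleansing and validation: coerce types and drop records that fail.
    clean = []
    for r in records:
        try:
            clean.append({"user_id": int(r["user_id"]),
                          "amount": float(r["amount"])})
        except (KeyError, ValueError):
            # A real pipeline would route failures to a dead-letter store
            # rather than silently discarding them.
            continue
    return clean

def load(records, sink):
    # Loading: append validated records to a sink (here, an in-memory list
    # standing in for a warehouse table or object store).
    sink.extend(records)
    return len(records)

sink = []
loaded = load(transform(extract()), sink)
print(loaded)  # 2 valid records survive validation
```

Even at this scale, the separation of stages mirrors the concerns the synopsis enumerates: validation lives in the transform step, and the load step is isolated so the sink can be swapped without touching acquisition logic.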