(AI/ML): Introduction to Data Engineering

Data has often been dubbed the "new oil", and the analogy is a fitting one. Just like oil, data is useless, and even dangerous, in its raw form.
And just as oil must be refined, transported safely, and delivered to the right place at the right time to be useful, data must go through a similar process. This is where the Data Engineer comes in.

While Data Scientists are like chefs who create a masterpiece meal (the insights and AI models), Data Engineers are the architects and contractors who build the industrial kitchen. They ensure the water lines are pressurised, the electricity is stable, and the ingredients arrive fresh and sorted every morning.



In this article, we will take a quick look at the basics:

What is data engineering? What are data pipelines? What are ETL and ELT? And why does data quality matter?

1. The Core Mission: Building the Pipes

The primary responsibility of a data engineer is to build the reliable infrastructure a business needs in order to trust its data. It isn't just about cleaning data; it is about ensuring the "pipes" don't break and that the flow of information is automated.

The Data Pipeline

A data pipeline is the automated "conveyor belt" of the digital world. Instead of a human manually moving files, a pipeline automatically moves data from a Source (like a mobile app or a database) to a Destination (like a dashboard).
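At its simplest, that "conveyor belt" is just code that reads from a source and writes to a destination on a schedule. Here is a minimal sketch in Python (all names are illustrative, using a JSON-lines file as the source and an in-memory list as the destination):

```python
import json

def extract(source_path):
    """Read raw records from the source (here, a JSON-lines file)."""
    with open(source_path) as f:
        return [json.loads(line) for line in f]

def load(records, destination):
    """Append records to the destination (here, an in-memory list)."""
    destination.extend(records)
    return destination

def run_pipeline(source_path, destination):
    """One automated run: source -> destination, no human in the loop."""
    return load(extract(source_path), destination)
```

In a real system, the source might be an app's event stream, the destination a data warehouse, and `run_pipeline` would be triggered automatically (say, every hour) rather than by hand.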

2. Moving the Data: ETL vs. ELT

For decades, engineers used a sequence called ETL (Extract, Transform, Load). Today, a more modern approach called ELT, which swaps the last two steps, is often preferred:

  • Extract: Grabbing the raw data from the source.

  • Load: Putting that data into a storage system (a Data Warehouse).

  • Transform: Cleaning the data, fixing dates, and removing duplicates.

Why ELT? By loading the raw data first, companies maintain a "source of truth." If you ever realize you made a mistake in your cleaning process, you can go back to the raw data and re-run your logic without losing anything.
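The idea can be sketched in a few lines: load rows untouched into a raw zone, then transform in a separate, re-runnable step (field names here are illustrative):

```python
raw_zone = []  # the untouched "source of truth"

def load_raw(rows):
    """Load first: store rows exactly as extracted, mistakes and all."""
    raw_zone.extend(rows)

def transform(rows):
    """Transform later: this logic can be fixed and re-run any time,
    because the raw rows are still there."""
    return [
        {**row, "email": row["email"].strip().lower()}
        for row in rows
        if row.get("email")  # drop rows with no email at all
    ]

load_raw([{"email": "  Ada@Example.COM "}, {"email": None}])
clean = transform(raw_zone)  # re-run as often as needed
```

If you later discover the cleaning rule was wrong, you edit `transform` and run it again over `raw_zone`; nothing was thrown away.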

3. The Medallion Architecture: Organizing the Layers

Data engineers often organize data into three "zones" to keep things tidy:

  1. Bronze (Raw) Layer: The messy, "as-is" data. It includes every error and duplicate.

  2. Silver (Cleaned) Layer: The data is filtered and standardized. It’s "safe to drink," but not yet fully prepared.

  3. Gold (Curated) Layer: The final product. This data is joined together and summarized (e.g., "Monthly Sales Trends") so a business executive can read it immediately.
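A toy end-to-end example of the three layers, with hypothetical data, might look like this: bronze keeps everything as-is, silver deduplicates and standardizes, and gold summarizes per region for the business:

```python
from collections import defaultdict

bronze = [  # raw, "as-is": duplicates and bad values included
    {"order": "A1", "region": "EU", "amount": "10.0"},
    {"order": "A1", "region": "EU", "amount": "10.0"},  # duplicate
    {"order": "A2", "region": "US", "amount": "-5.0"},  # invalid amount
    {"order": "A3", "region": "EU", "amount": "7.5"},
]

def to_silver(rows):
    """Deduplicate and standardize types: 'safe to drink'."""
    seen, clean = set(), []
    for row in rows:
        amount = float(row["amount"])
        if row["order"] in seen or amount < 0:
            continue
        seen.add(row["order"])
        clean.append({**row, "amount": amount})
    return clean

def to_gold(rows):
    """Summarize for the business: total sales per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)

gold = to_gold(to_silver(bronze))
```

Note that each layer only ever reads from the one before it, so a fix in the silver logic can always be replayed from bronze.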

4. Orchestration: The Conductor of the Orchestra

In a complex system, some tasks must finish before others can start. You shouldn't try to build a "Gold" report until the "Silver" cleaning is done.

Engineers use Orchestration tools to manage this. They create a DAG (Directed Acyclic Graph)—a map of tasks that move in one direction. If a step fails (like a server going down), the orchestrator can automatically "retry" the task or alert the team before incorrect data reaches the end user.
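Real orchestrators (Airflow is a well-known example) are full platforms, but the core idea fits in a short, hypothetical sketch: walk the DAG so prerequisites run first, and retry a failed task before giving up:

```python
def run_dag(tasks, deps, retries=2):
    """tasks: name -> callable; deps: name -> list of prerequisite names."""
    done = set()

    def run(name):
        if name in done:
            return
        for upstream in deps.get(name, []):  # prerequisites first
            run(upstream)
        for attempt in range(retries + 1):
            try:
                tasks[name]()
                done.add(name)
                return
            except Exception:
                if attempt == retries:
                    raise  # retries exhausted: time to alert the team
    for name in tasks:
        run(name)
    return done
```

Because the graph is acyclic, this walk always terminates; "Gold" simply cannot start until its "Silver" prerequisite has succeeded.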

5. Data Quality: The Gatekeeper

Even if the pipes are working, the "water" might be poisoned. For example, if a bug causes every customer’s age to be listed as 999, the system hasn't "broken," but the data is useless.

Data engineers program Sanity Tests into the pipeline:

  • Range Checks: "Age must be between 0 and 120."

  • Null Checks: "Every order must have a customer ID."

  • Uniqueness: "No two transactions can have the same ID."
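The three checks above can be sketched as plain assertions over a batch of rows (field names are illustrative; real pipelines often use a dedicated library for this):

```python
def sanity_check(orders):
    """Raise AssertionError if any row fails a basic sanity test."""
    ids = [o["transaction_id"] for o in orders]
    for o in orders:
        # Range check: an age of 999 fails here instead of reaching a report
        assert 0 <= o["age"] <= 120, f"age out of range: {o['age']}"
        # Null check: every order must carry a customer ID
        assert o.get("customer_id"), "missing customer ID"
    # Uniqueness: no two transactions may share an ID
    assert len(ids) == len(set(ids)), "duplicate transaction IDs"
    return True
```

Run inside the pipeline, a failed check stops bad data at the gate rather than letting it flow quietly into dashboards.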

Conclusion

Data Engineering is the backbone of the AI and analytics revolution. By applying software principles like automation, version control, and testing to data (a practice known as DataOps), engineers ensure that the foundation of the business is stable, scalable, and—most importantly—trusted.
