Data Modeling & ETL

Data modeling and ETL work together to meet business goals by facilitating business intelligence.

Data modeling defines the relationships between data objects or tables. It is a visual representation of information. The goal is to illustrate the types of data used and stored within the system, the relationships among these data objects, the ways the data can be grouped and organized, and its formats and attributes.

Types of data models include conceptual, logical, and physical data models.

Conceptual data (domain) models offer a big-picture view of what the system will contain, how it will be organized and which business rules are involved.

Logical data models are less abstract and provide greater detail about the concepts and relationships. These indicate data attributes, such as data types and their corresponding lengths, and show the relationships among entities.

Physical data models provide a schema for how the data will be physically stored within a database. This model's design can be used to implement a relational database.

Data modeling follows an iterative workflow: identify the entities, determine the key properties of each entity, define the relationships among entities, map attributes to the entities completely, assign keys as needed, decide on a degree of normalization that balances the need to reduce redundancy against performance, and then finalize and validate the data model.
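
For instance, the sketch below captures the output of such a workflow as a minimal physical model for two assumed entities, customer and orders, using Python's built-in sqlite3 module; the entity names, attributes, and keys are illustrative assumptions rather than part of the original text.

    import sqlite3

    # Entities: customer and orders; relationship: one customer places many orders.
    # Keys: surrogate primary keys plus a foreign key on orders; the schema is
    # normalized so customer details are stored once and referenced from orders.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        email       TEXT UNIQUE
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        order_date  TEXT NOT NULL,
        total_cents INTEGER NOT NULL
    );
    """)
    conn.close()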

 

Types of data modeling include:

Relational data models – data segments are explicitly joined through the use of tables, reducing database complexity; these models frequently employ SQL for data management. Relational databases work well for maintaining data integrity and minimizing redundancy, and they are often used in transaction processing.

Entity-relationship (ER) data models – use formal diagrams to represent the relationships between entities in a database.

Dimensional data models – increase redundancy in order to make it easier to locate information for reporting and retrieval; typically used in OLAP systems (see the sketch after this list).
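
To make the contrast between the normalized model above and a dimensional model concrete, here is a small star-schema sketch (one fact table plus two dimension tables) in the same sqlite3 style; the table and column names are illustrative assumptions.

    import sqlite3

    # Star schema: the fact table references compact dimension tables, and some
    # attributes (e.g. product category) are stored redundantly to speed reporting.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE dim_date (
        date_key  INTEGER PRIMARY KEY,  -- e.g. 20240131
        full_date TEXT,
        year      INTEGER,
        month     INTEGER
    );
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        name        TEXT,
        category    TEXT                -- denormalized, repeated per product
    );
    CREATE TABLE fact_sales (
        date_key      INTEGER REFERENCES dim_date(date_key),
        product_key   INTEGER REFERENCES dim_product(product_key),
        quantity      INTEGER,
        revenue_cents INTEGER
    );
    """)
    conn.close()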

 

Some benefits of data modeling are increased consistency in documentation and system design across the enterprise; improved application and database performance; easier data mapping throughout the organization; improved communication between developers and business intelligence teams; and an easier, faster database design process at the conceptual, logical, and physical levels.

What is ETL & ELT?

Extract, Transform, Load (ETL) is a pipelined process that receives/ingests (extracts) data from one or many sources, processes (transforms) that data, and stores (loads) the cleaned, transformed, and aggregated data for analysis and for driving business decisions.
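
As a minimal end-to-end sketch of these three steps, the example below extracts from a hypothetical sales.csv file, transforms with pandas, and loads into a local SQLite database standing in for the analytics store; the file, column, and table names are assumptions.

    import sqlite3
    import pandas as pd

    # Extract: read raw records from a source file (hypothetical path).
    raw = pd.read_csv("sales.csv")

    # Transform: clean and aggregate before loading.
    clean = raw.dropna(subset=["revenue"]).copy()
    clean["order_date"] = pd.to_datetime(clean["order_date"]).dt.date
    daily = (
        clean.groupby("order_date", as_index=False)["revenue"]
             .sum()
             .rename(columns={"revenue": "daily_revenue"})
    )

    # Load: write the curated result into the analytics store.
    with sqlite3.connect("analytics.db") as conn:
        daily.to_sql("daily_revenue", conn, if_exists="replace", index=False)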

 

As organizations evolve and accumulate business logic, there is a shift toward loading data from multiple sources first and then performing many different transformations for different data consumers. That process is known as Extract, Load, Transform (ELT).
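
In ELT, the transformation happens inside the warehouse after the raw load. The sketch below assumes raw data has already been loaded into a hypothetical raw.orders table in BigQuery and uses the google-cloud-bigquery client to run the transformation as SQL.

    from google.cloud import bigquery

    client = bigquery.Client(project="my_project")  # hypothetical project id

    # Transform inside the warehouse: build a curated table from the raw load.
    sql = """
    CREATE OR REPLACE TABLE analytics.daily_revenue AS
    SELECT DATE(order_ts) AS order_date,
           SUM(revenue)   AS daily_revenue
    FROM raw.orders
    GROUP BY order_date
    """
    client.query(sql).result()  # wait for the job to finish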

 

Data characteristics that ETL should be capable of handling are velocity – responsiveness to the speed of arriving data, especially for low-latency applications; volume – processing and storage for expanding data sizes; variety – accommodating a broader range of data formats and data contexts; veracity – quantifying and improving data quality; and value – maximizing the utility of data by shaping and aligning it with business goals.

 

Well-designed ETL pipelines bring benefits of competitiveness – the ability to act on faster, vaster, and more complex data in a timely, cost-efficient, business-relevant manner; agility – meeting the needs of data users such as analysts, data scientists, marketers, executives, sales, and managers by designing data ingestion pipelines that meet standards of usability, flexibility, and reusability; ROI – shaping data to increase its value and relevance to the business, enabling use cases such as operational automation and analytical insights while shortening time-to-value; quality improvements – fewer data errors, which fosters increased trust in and usage of data around the organization; and scalability – allowing data use to grow with the business's use cases.

ETL components (low-level functionality) are:

Extract – pulls data from a variety of data source types, such as relational databases (RDBMS), data warehouses (DWH), data lakes, document stores, and graph databases. Data source locations include online platforms, raw data files, and on-prem databases. Data source formats include structured and unstructured data.
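
A sketch of the extract step pulling from two different source types: a relational database (sqlite3 standing in for any RDBMS) and an online platform exposing a JSON API; the connection target and URL are hypothetical.

    import json
    import sqlite3
    import urllib.request

    # Source 1: relational database (any RDBMS driver would follow this pattern).
    with sqlite3.connect("source.db") as conn:
        customers = conn.execute(
            "SELECT customer_id, name, email FROM customer"
        ).fetchall()

    # Source 2: an online platform exposing a JSON API (hypothetical URL).
    with urllib.request.urlopen("https://api.example.com/v1/orders") as resp:
        orders = json.load(resp)

    print(f"extracted {len(customers)} customers and {len(orders)} orders")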

 

Transform – this phase consists of shaping data to align with business needs. GCP services commonly used for data transformation include Dataflow and Dataproc.
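
As one possible shape for this phase, the sketch below uses the Apache Beam Python SDK (the programming model behind Dataflow) to parse and filter raw records; the Cloud Storage paths are hypothetical, and running on Dataflow would additionally require pipeline options for the DataflowRunner.

    import apache_beam as beam

    def parse_line(line):
        # Shape raw CSV text into a structured record.
        order_id, revenue = line.split(",")
        return {"order_id": order_id, "revenue": float(revenue)}

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Read"   >> beam.io.ReadFromText(
                              "gs://my-bucket/raw/orders.csv",  # hypothetical path
                              skip_header_lines=1)
            | "Parse"  >> beam.Map(parse_line)
            | "Filter" >> beam.Filter(lambda rec: rec["revenue"] > 0)
            | "Write"  >> beam.io.WriteToText("gs://my-bucket/curated/orders")
        )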

 

Load – this phase consists of data landing in a variety of SQL and NoSQL data stores, which capture real-world relationships and allow for different data querying and access patterns. BigQuery is a popular Google service for storing data, whether using ETL or ELT processes.
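
A sketch of a batch load into BigQuery with the google-cloud-bigquery client, reading from Cloud Storage; the bucket, dataset, and table identifiers are assumptions.

    from google.cloud import bigquery

    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,  # let BigQuery infer the schema for this sketch
    )

    load_job = client.load_table_from_uri(
        "gs://my-bucket/curated/orders-00000-of-00001",  # hypothetical path
        "my_project.analytics.orders",                   # hypothetical table id
        job_config=job_config,
    )
    load_job.result()  # wait for the load to complete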

ETL system perspective

ETL systems have evolved and matured from ad hoc data queries and scheduled CRON jobs to sophisticated pipeline frameworks and best practices. The demand for more elaborate ETL grows as downstream fields such as data science expand in sophistication and grow in their data demands.

 

These pipelines process data in two forms: batch processing – where data is grouped for processing at ad hoc, triggered, or scheduled intervals; and stream processing – where incoming data is a continuous stream that is processed using a range of time-bound rules.
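
As an illustration of a time-bound rule, the sketch below buckets a continuous stream of (timestamp, value) events into fixed one-minute windows before aggregating; streaming frameworks provide such windowing natively, and the event shape here is an assumption.

    from collections import defaultdict

    WINDOW_SECONDS = 60  # fixed one-minute windows

    def window_start(event_ts: float) -> float:
        # Align each event timestamp to the start of its window.
        return event_ts - (event_ts % WINDOW_SECONDS)

    def aggregate_stream(events):
        # events: iterable of (timestamp_seconds, value) pairs arriving over time.
        totals = defaultdict(float)
        for ts, value in events:
            totals[window_start(ts)] += value
        return dict(totals)

    # Example: three events, two of which fall in the same one-minute window.
    print(aggregate_stream([(0.0, 1.5), (30.0, 2.0), (75.0, 4.0)]))
    # {0.0: 3.5, 60.0: 4.0}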

 

Orchestration of these pipelines aims to manage complex workloads by arranging task execution by means of directed acyclic graphs (DAGs), which define the sequence of and dependencies between data jobs. GCP provides a managed data pipeline service for this:
Cloud Composer – a managed Apache Airflow service that runs ETL workloads on a fully managed Kubernetes (GKE) cluster, allowing resources, and therefore costs, to be scaled up or down as needed.
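
Cloud Composer runs Apache Airflow, so a pipeline is expressed as a DAG such as the sketch below; the task bodies are hypothetical placeholders for the extract, transform, and load steps described earlier.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        ...  # pull data from the sources

    def transform():
        ...  # shape data to business needs

    def load():
        ...  # land data in the warehouse

    with DAG(
        dag_id="daily_etl",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        t_extract = PythonOperator(task_id="extract", python_callable=extract)
        t_transform = PythonOperator(task_id="transform", python_callable=transform)
        t_load = PythonOperator(task_id="load", python_callable=load)

        # The DAG encodes the sequence and dependencies between jobs.
        t_extract >> t_transform >> t_load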

 

Automated deployment based on Infrastructure as Code (IaC) brings consistency to a system's setup and takes advantage of continuous integration and continuous delivery (CI/CD) for pipelines, bringing the same consistency to their testing process.
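
On the CI/CD side, a pipeline's transform logic can be unit-tested on every commit. The sketch below exercises a hypothetical parse_line transform with pytest; both the function and the test cases are illustrative assumptions.

    # test_transforms.py -- run with pytest in the CI pipeline
    import pytest

    def parse_line(line: str) -> dict:
        # The transform under test (hypothetical; mirrors the earlier sketch).
        order_id, revenue = line.split(",")
        return {"order_id": order_id, "revenue": float(revenue)}

    def test_parse_line_returns_typed_record():
        record = parse_line("A-100,19.99")
        assert record == {"order_id": "A-100", "revenue": 19.99}

    def test_parse_line_rejects_malformed_input():
        with pytest.raises(ValueError):
            parse_line("not-a-valid-row")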

 

Holistic development extracts the most value out of data assets by carefully planning the development and operational cost of data ingestion infrastructure. GCP provides integrated cloud services that help coordinate and govern pipelines in a more holistic fashion, including Cloud Data Fusion.
