Data mesh and data virtualization: the winning duo


An innovative and still-evolving data analytics paradigm, data mesh was designed to transform monolithic architectures, such as data warehouses and data lakes, into a more decentralized one. Let’s take a look at its main principles and at why organizations benefit from pairing it with data virtualization.

What is the data mesh?

The data mesh responds to the challenges inherent in centralized and monolithic data architectures, namely:

1 – A lack of business knowledge within the teams responsible for data: centralized data teams too often have to deal with poorly understood data to solve equally poorly understood business problems. As a result, a lot of back and forth between the data team and the business teams slows down the process and affects the quality of the end results.

2 – Lack of flexibility in centralized data management platforms: centralizing all data on a single platform can be problematic, because the needs of large organizations are too heterogeneous to be met by one platform.

3 – Slow data delivery and slow responses to change requests: each business request requires integrating data into the centralized architecture and modifying flows at every level of the system. This makes the architecture rigid and prone to failure when changes occur.

The goal of the data mesh is to solve these problems by making organizational units (called “domains”) responsible for managing and exposing their own data to the rest of the organization. Domains understand better than anyone how their data should be used, which reduces the iterations needed to meet business needs and improves quality. The approach also removes the bottleneck of a centralized infrastructure and gives domains the autonomy to use the tools best suited to their own circumstances.

Nevertheless, it also introduces obvious risks, such as the creation of data silos, duplication of effort across domains, and a lack of unified governance. To address these risks, the data mesh introduces several additional concepts:

  • Data as a product: the data exposed by the different domains must be easily discoverable, understandable, and usable by other units (a minimal sketch of such product metadata follows this list).
  • Self-service data platforms: building and managing a data infrastructure is complex. Not all areas will have the appropriate resources, and duplication of effort should be avoided. Domains should be able to use a self-service platform to automate or simplify tasks such as data integration and transformation, security policy enforcement, data traceability, and identity management.
  • Federated IT governance: for the data products created by different domains to interoperate, a certain level of standardization is necessary. This includes the semantics of entities common to multiple domains (for example, customer and product entities) as well as technical aspects such as the addressability of data products and identity management. Some security policies can also be applied globally. Wherever possible, all of these standardizations and policies should be applied automatically.
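
To make “data as a product” a little more concrete, here is a minimal sketch of the kind of metadata a domain might publish so that its product is discoverable and addressable. The field names and catalog conventions are invented for illustration; real platforms define their own schemas.

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """Illustrative metadata a domain could publish to the data catalog."""
    name: str          # addressable identifier, e.g. "sales.orders"
    domain: str        # owning organizational unit
    owner: str         # contact responsible for quality and support
    description: str   # what the data means, for discoverability
    endpoint: str      # where consumers reach it (SQL, REST, ...)
    version: str = "1.0.0"                         # supports evolution without breakage
    tags: list[str] = field(default_factory=list)  # aids search in the catalog

orders = DataProduct(
    name="sales.orders",
    domain="sales",
    owner="sales-data-team@example.com",  # hypothetical contact
    description="Confirmed orders, one row per order line.",
    endpoint="https://data.example.com/sales/orders",  # hypothetical address
    tags=["sales", "orders"],
)
```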

Because it provides unified data access, data security, and a data governance layer on top of distributed and heterogeneous data systems, data virtualization is clearly a key technology for implementing a data mesh.

Build data products with data virtualization

Data virtualization enables domains to quickly implement data products by creating virtual models on top of any data source. Thanks to its ease of use and its ability to minimize data replication, it allows data products to be created much faster than with traditional alternatives. It also makes it faster to iterate through multiple versions of a data product until business needs are met (Gartner estimates productivity savings of over 45% when using data virtualization).
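
As a rough illustration of the idea, and not of any vendor’s API, the sketch below uses DuckDB as a stand-in virtualization engine: a virtual view joins a Parquet file and a CSV file in place, without first copying the data into a warehouse. The file and column names are hypothetical.

```python
import duckdb

con = duckdb.connect()  # in-memory engine; the sources stay where they are

# A "virtual model": a view over two heterogeneous sources, defined
# declaratively and evaluated on demand rather than materialized.
con.execute("""
    CREATE VIEW customer_orders AS
    SELECT c.customer_id, c.name, o.order_id, o.amount
    FROM 'customers.parquet' AS c            -- hypothetical Parquet source
    JOIN read_csv_auto('orders.csv') AS o    -- hypothetical CSV source
      ON o.customer_id = c.customer_id
""")

# Consumers query the model without knowing where or how the data is stored.
rows = con.execute(
    "SELECT name, SUM(amount) FROM customer_orders GROUP BY name"
).fetchall()
```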

Virtual models provide a semantic layer by presenting data in a consumer-friendly way, while hiding the complexity of the underlying systems, such as the location of the data or the formats of the sources. Data products are exposed through standardized interfaces such as SQL, REST, OData, GraphQL, or even MDX, without the developer having to write any code. Data products can also be automatically published to a global corporate data catalog, which serves as the organization’s data marketplace.
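
Commercial virtualization platforms generate these interfaces automatically. As a hand-rolled approximation of the REST case, a domain could wrap the same virtual view in a small web endpoint; FastAPI and the route below are illustrative choices, not part of any data mesh specification.

```python
import duckdb
from fastapi import FastAPI

app = FastAPI(title="sales data products")  # hypothetical service
con = duckdb.connect()
con.execute(
    "CREATE VIEW customer_orders AS SELECT * FROM 'customer_orders.parquet'"
)  # hypothetical source file

@app.get("/customer-orders")
def customer_orders(limit: int = 100):
    # Serve the virtual model as JSON; consumers never see the source format.
    res = con.execute("SELECT * FROM customer_orders LIMIT ?", [limit])
    cols = [d[0] for d in res.description]
    return [dict(zip(cols, row)) for row in res.fetchall()]
```

Run, for example, with `uvicorn products:app` if the module is saved as products.py.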

Preserving the autonomy of the domains

Another essential benefit of data virtualization in such an architecture is that it allows domains to autonomously select and scale the data sources that implement their products. For example, many business departments already have their own domain-specific data analysis systems (e.g., data marts) that they can reuse almost effortlessly, without introducing new skills into their teams.

They can also directly reuse applications specifically tailored to their domains (for example, SaaS applications). If necessary, the caching and query acceleration capabilities offered by the data virtualization platform can be used to ensure adequate performance and avoid interfering with other internal processes running on these systems. For additional isolation and autonomy between domains, the data virtualization servers used by each domain can also scale independently.
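
The caching behavior mentioned above can be approximated with a simple time-to-live cache around query execution. Real platforms offer far more sophisticated acceleration (materialized summaries, partial caching, and so on), so treat this purely as a sketch of the principle.

```python
import time
from functools import wraps

def ttl_cache(seconds: float):
    """Cache results for a fixed window so repeated requests
    do not keep hitting the underlying operational system."""
    def decorator(fn):
        store = {}  # arguments -> (expiry timestamp, result)
        @wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit and hit[0] > now:
                return hit[1]  # still-fresh cached copy
            result = fn(*args)
            store[args] = (now + seconds, result)
            return result
        return wrapper
    return decorator

@ttl_cache(seconds=300)
def run_query(sql: str):
    ...  # placeholder: dispatch the query to the source system
```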

Of course, domains can always choose to run a data warehouse / data lake process for certain types of data when they have the appropriate skills. For example, a central data lake infrastructure may be a good choice for products requiring machine learning. However, this is not necessary for every domain, nor for all of their data.

Even in this case, the resulting products remain accessible through the unified data virtualization layer, which ensures consistency and governance and also offers the organization additional capabilities such as a semantic layer, data cataloging, and data access through multiple technologies.

Federated IT governance

Data virtualization also naturally enables the implementation of the principle of federated governance. First, the layered structure of virtual models makes it easy to reuse definitions across domains. Common entities can therefore be defined once, with a consistent representation in every data product, which ensures their interoperability. It also lets developers easily reuse data from other domains without duplicating the integration work.
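
Continuing the earlier DuckDB stand-in, the sketch below defines a shared, canonical “customer” entity once and lets a domain view build on top of it instead of re-integrating the source. All names are illustrative.

```python
import duckdb

con = duckdb.connect()

# Canonical "customer" entity, defined once and reused by every domain.
con.execute("""
    CREATE VIEW customer AS
    SELECT customer_id, name, UPPER(country) AS country
    FROM 'crm_customers.parquet'   -- hypothetical source
""")

# A domain view layers on the shared definition, so "customer" has the
# same meaning in this product as in every other domain's products.
con.execute("""
    CREATE VIEW revenue_by_customer AS
    SELECT c.customer_id, c.country, SUM(o.amount) AS revenue
    FROM customer AS c
    JOIN read_csv_auto('orders.csv') AS o   -- hypothetical domain source
      ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.country
""")
```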

The data virtualization layer also allows organizations to automate the enforcement of global data security policies (such as masking salary data in all data products unless the user has an HR role), protects source systems from direct access, and provides a single point at which to apply other cross-domain standardizations (e.g., naming conventions, addressability, and versioning).
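
The salary-masking policy mentioned above could look roughly like the following when enforced once in the virtualization layer; the role names and the function itself are invented for illustration.

```python
def apply_masking(row: dict, user_roles: set[str]) -> dict:
    """Global policy: hide salary from anyone outside HR.
    Enforced once in the virtualization layer, it covers every
    data product instead of being re-implemented per source."""
    if "HR" in user_roles:           # hypothetical role name
        return row
    masked = dict(row)
    if "salary" in masked:
        masked["salary"] = "***"     # column-level masking
    return masked

row = {"employee_id": 7, "name": "Ada", "salary": 95000}
apply_masking(row, {"HR"})           # -> full row
apply_masking(row, {"ENGINEERING"})  # -> salary masked
```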

Data mesh is a new approach to the design and development of data architectures. Unlike a centralized and monolithic architecture based on a data warehouse or a data lake, a data mesh is a highly decentralized data architecture. To minimize data silos, avoid duplication of effort, and ensure consistency, the data mesh paradigm offers a unified infrastructure that allows domains to create and share data products while enforcing standards for interoperability, quality, governance, and security.

Data virtualization solutions have been designed precisely to provide a unified, governed, and secure data layer on top of multiple distributed data systems, making them ideally suited to implementing data mesh principles.




