Organizations are adopting modern data management approaches, such as semantic knowledge graphs, to link data across the enterprise and accelerate value from their data lake investments.
Data lakes can store a wide variety of data types and quickly ingest huge volumes of data, which has led to their widespread adoption. Gartner defines a data lake as a collection of storage instances of various data assets, held in a near-exact or even exact copy of the source format of the originating data stores. Data lakes therefore hold promise for supporting modern enterprise data architectures. They continue to be successful in physically consolidating enterprise data; however, they can fall short in generating value for business users. This is because the bulk of the data within the data lake is raw and stored in its original form, which forces companies to spend significant time and money preparing it for analysis.
When paired with a data lake, the data lakehouse, an approach that combines data warehouse elements with data lake elements, helps organizations locate data from across the enterprise while keeping storage cost-effective. It also provides the opportunity to apply artificial intelligence at the compute layer and reduces the need to maintain expensive, fragile ETL pipelines, in contrast to rigid and costly traditional data warehouses. However, while lakehouses address the problem of access to data, they have not yet democratized that access so that non-technical users can self-serve and collaborate to generate the rapid insights needed to keep pace with consumer preferences and changing business dynamics.
In the past, organizations have connected business intelligence tools directly to their data lakes, but this has led to other issues, such as increased latency, reduced collaboration and reuse, and the inability to link data across domains to provide context. These solutions have also hampered self-service data exploration that supports enriching analytics and eliciting new insights.
To solve these challenges, organizations are adopting modern data management approaches such as enterprise knowledge graphs to link data across the enterprise and accelerate value from their data lake investments. By linking enterprise data to business semantics, knowledge graphs reduce the cost of data integration and help generate powerful insights for complex business challenges, all while enabling more agile data operations.
Semantic layers link data to real-world use cases
The semantic layer is a data layer that sits between data storage and analytics. It presents a logically rich view of information as a set of interrelated business concepts and underpins the implementation of the knowledge graph. Operating through this semantic data layer, the enterprise knowledge graph enables users to explore and exploit connections across their data landscape with business context, so that they can fully and accurately understand any particular scenario. For example, users can:
- Ask questions based on business concepts and their interrelationships. Mapping concepts to the underlying metadata (i.e., tables, views, and attributes) creates a fast path for sharing data across applications.
- Run flexible, federated queries quickly across data in the data lake and other structured, semi-structured, or unstructured sources to support ad hoc analysis. By connecting and querying data within and beyond the data lake, organizations can achieve just-in-time cross-domain analytics for richer, faster insights without creating data-proliferation challenges.
- Reduce data wrangling and data movement by easily sharing results through visualizations that enhance data storytelling and enable self-service analytics directly within the reusable semantic layer.
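As a minimal, hypothetical sketch of the pattern behind these capabilities, consider a graph of triples that maps business concepts to the physical tables and columns that implement them. Plain Python stands in here for an RDF store and SPARQL; all names are illustrative, not a real product's API.

```python
# A minimal, hypothetical sketch of a semantic layer: business concepts
# (Customer, Order) are linked by triples to the physical tables that
# implement them, so a question phrased in business terms can be
# resolved to physical metadata. Plain Python tuples stand in for an
# RDF store; names are invented for illustration.

TRIPLES = {
    # (subject, predicate, object)
    ("Customer", "mappedTo", "crm.customers"),   # concept -> table
    ("Customer", "hasAttribute", "customer_id"),
    ("Order", "mappedTo", "sales.orders"),
    ("Order", "hasAttribute", "customer_id"),
    ("Order", "relatedTo", "Customer"),          # concept -> concept
}

def objects(subject: str, predicate: str) -> set[str]:
    """All objects for a (subject, predicate) pair: a one-hop query."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}

def physical_sources(concept: str) -> set[str]:
    """Resolve a business concept, plus concepts related to it,
    to the physical tables that back them."""
    concepts = {concept} | objects(concept, "relatedTo")
    return {t for c in concepts for t in objects(c, "mappedTo")}

# Asking about "Order" in business terms surfaces both backing tables,
# because the semantic layer knows that Order relates to Customer.
print(sorted(physical_sources("Order")))  # ['crm.customers', 'sales.orders']
```

In a production semantic layer the same idea is expressed in RDF and queried with SPARQL, but the mechanism is the same: concept-to-metadata links let one business question fan out to every physical source that implements it.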
How Boehringer Ingelheim used the semantic layer to transform its data lake
As the world’s largest private pharmaceutical company, Boehringer Ingelheim has long had multiple teams of researchers working independently to develop new treatments. However, data was often siloed within these groups, making it difficult to link target, gene, and disease data across different parts of the company. The teams experimented with many different technology-stack approaches: some built data lakes, but insufficient virtualization capabilities necessitated ETL pipelines to move data. Others tried to pre-define all requirements up front in an RDBMS, but that approach could not support the necessary levels of complexity or flexibility.
Ultimately, they realized they needed an approach that would create a technology foundation for data sharing across the entire company, linking data from its different parts to increase research and operational efficiencies, scale up production, and accelerate drug research. To support these goals, Boehringer Ingelheim began implementing an enterprise knowledge graph platform as a semantic layer on its data lake, making information easier to navigate, query, and analyze. The semantic layer served as a unified store for 90% of their R&D data, with metadata connected to the knowledge graph from all workflow systems; for example, it integrated data on how samples were created and stored, identified which studies were ongoing or completed, and recorded how specific data points were produced.
The semantic layer allowed bioinformaticians to access and work with data without first having to clean it and map it to the appropriate entities. Users could search for a specific disease, study, or gene and then explore the results in a Wikipedia-like experience. It also allowed analysts to see, directly in the data model, how one piece of data relates to the rest of the R&D data, enabling them to use the query builder’s intuitive user interface to pull reports from the knowledge graph without requiring SPARQL knowledge.
The knowledge graph allowed bioinformaticians to easily identify useful signals within large sets of noisy data and answer very specific questions. This is possible because they can query the connected data directly, using the associated data dictionary, and move immediately to analysis without any integration or cleanup.
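To illustrate the kind of "very specific question" a connected graph answers directly, here is a hypothetical sketch of a two-hop traversal: which studies investigate genes associated with a given disease? The entities and links below are invented for illustration; in practice this would be a SPARQL query over the R&D graph rather than hand-written Python.

```python
# Hypothetical R&D graph fragments, invented for illustration:
# genes are linked to diseases, and studies are linked to genes.

ASSOCIATED_WITH = {               # gene -> diseases
    "GeneA": {"Diabetes"},
    "GeneB": {"Diabetes", "Asthma"},
    "GeneC": {"Asthma"},
}
INVESTIGATES = {                  # study -> genes
    "Study1": {"GeneA"},
    "Study2": {"GeneB", "GeneC"},
}

def studies_for_disease(disease: str) -> set[str]:
    """Two-hop traversal: disease <- gene <- study.
    No ETL or joins are prepared in advance; the links already
    exist in the graph, so the answer falls out of a traversal."""
    genes = {g for g, ds in ASSOCIATED_WITH.items() if disease in ds}
    return {s for s, gs in INVESTIGATES.items() if gs & genes}

print(sorted(studies_for_disease("Diabetes")))  # ['Study1', 'Study2']
```

Because the relationships are first-class data rather than something an analyst reconstructs with joins, cross-domain questions like this require no integration step at query time.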
Analysts were also able to operate more efficiently because R&D data was made available through a standardized protocol. They no longer needed to consult data catalogs, resort to workarounds to discover where data lived, or spend time understanding how different data sets were organized in order to combine them. Instead, they simply pointed at the knowledge graph and asked questions using the natural language interface.
Finally, by using the knowledge graph’s virtualization capabilities, the organization saved money on redundant data storage and on costly, time-consuming ETL operations. Virtualization created a single central access point for data scientists to work through, while allowing data to remain in the relational databases and other environments where it already lived. The data models accompanying this integration also made the organization more efficient by avoiding redundant research, reusing previous answers, and focusing on new opportunities grounded in existing knowledge.
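The virtualization idea can be sketched as a single access point that routes each query to the source system where the data already lives, instead of copying everything into one store via ETL. This is a minimal illustration under invented names, not any vendor's actual API.

```python
# Hypothetical sketch of data virtualization: one access point routes
# each query to the source where the data already lives; nothing is
# copied or duplicated. Source names and records are invented.

SOURCES = {
    # source name -> in-place data (stand-ins for an RDBMS, a file store, ...)
    "assay_db":  [{"sample": "S1", "result": 0.82}],
    "trials_db": [{"study": "Study2", "status": "ongoing"}],
}

CONCEPT_TO_SOURCE = {"AssayResult": "assay_db", "Study": "trials_db"}

def query(concept: str) -> list[dict]:
    """Single access point: resolve a business concept to its source
    system and fetch records in place -- no ETL, no second copy."""
    return SOURCES[CONCEPT_TO_SOURCE[concept]]

print(query("Study"))  # [{'study': 'Study2', 'status': 'ongoing'}]
```

The cost saving follows directly from the design: each record exists once, in its system of origin, and the mapping layer is the only thing that must be maintained as sources change.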