Why you need a Semantic Layer

Sergi Gomez
Co-founder of Saivo

In today's fast-paced world of data and AI, there's something that's redefining how people work and interact with data: the Semantic Layer.

At SAIVO, we have been evangelizing about the importance of the Semantic Layer ever since we were inspired by the keynote “The Metrics System” given by Drew Banin, dbt Labs' co-founder1. Since then, we have helped many companies adopt it, and we have seen its real impact first-hand.

However, despite its growing importance, many people still don't know what it is or what practical benefits it offers. That's why we have decided to write this article. We want to explain the essence of the Semantic Layer, and why organizations need to embrace it to get ready for the AI era.

Here’s a quick preview of what we’ll cover:

  1. An intro to the role of semantics in business
  2. The problems that motivated the creation of the Semantic Layer
  3. A deep dive into what the Semantic Layer is and its added value
  4. Why it will become critical in the AI era
  5. A technical perspective on how data modeling is simplified with the Semantic Layer

Semantics in business

Semantics are the way we give meaning to words and numbers. They serve as the bridge between abstract concepts and their representation in our world, and they are essential for communication and understanding.

Semantics have been ingrained in our cultures and languages since the beginning of human history. For instance, when ancient civilizations were defining the first units of measure, they were using semantics. Or when scientists defined the concept of gravity—it was not just a word but an entire system of understanding and semantics.

Today, in the business world, semantics are everywhere. When we look at sales numbers, website visits, or how many people bought a new product, we are using semantics. We take these numbers and turn them into something we understand: "This many people visited our site," or "We made this much money last month", and we create meaningful concepts like "Active User" or "Monthly Recurring Revenue".

We refer to these concepts as metrics. We calculate them and show them off in fancy charts and graphs. We display them in PowerPoint presentations. We create business narratives around them. Metrics are the language that business users speak.

We all know that every business needs to track key metrics or KPIs. Metrics guide us, they show what's working well and what's not, and where we might need to put more effort or resources. Tom Blomfield, a partner at Y Combinator, explains it really well:

With better metrics you'll make better decisions. It's like flying an airplane. With no instruments, you're flying blind. You don't know what's happening to the aircraft, you are not in control. Having great metrics is like having great instruments in an aircraft. They let you tweak, iterate, and make sure you're really in control of your startup. [...] As an investor, it's really easy to tell founders who are in command of their metrics versus founders who aren't. It's really impressive when founders can talk about what % of their signups are DAU or WAU. It's a big differentiator when a founder can talk so fluently about these metrics.

Every metric has a lot of semantics and business rules involved. Let's take the Churn Rate as an example. This number shows how many customers stop using our service over time. But to make sense of Churn Rate, we need to agree on how we define the "churn" event. Is it when they cancel a service, don't renew it, or is it something more nuanced?

It's crucial to have an agreed and accurate definition of each metric, because otherwise, the resulting numbers will be wrong. Following the Churn Rate example, imagine you're presenting to the C-level team, and you've got this big, fancy chart showing the Churn Rate. But if that number is based on a shaky understanding of what "churn" really means, people will question the numbers they see. Suddenly, they're not just questioning the chart; they're questioning your work and whether they can trust the data you provide. In business, confidence in the data is everything. Keeping your metrics accurate and updated means everyone can trust in the direction you're pointing, and that trust is gold.

In business intelligence, there is another element with lots of semantics involved: Dimensions. Dimensions are the categories that we use to segment, filter, group, slice, and dice the metrics. Some dimensions are simple, like country or product_category. But then, there are dimensions where things get a bit more complex. Take the dimension is_recurrent_customer, for example. This dimension splits our metrics, like Net Revenue, into new versus recurrent customers. Here, semantics play a big role because defining who counts as a "recurrent customer" isn't always straightforward. Is it someone who buys twice in a year? Or does a customer need to make a purchase every month to be considered recurrent? This is where clear definitions become crucial. So, just like with metrics, getting the semantics right in dimensions is key to keeping your data accurate.
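To make this concrete, here is a minimal sketch of how the is_recurrent_customer logic might be expressed in SQL. This is purely illustrative: the orders table and its customer_id and order_date columns are hypothetical, and it picks one of the debatable definitions from above ("more than one purchase in the trailing twelve months"):

```sql
-- Hypothetical sketch: flag a customer as "recurrent" if they made
-- more than one purchase in the trailing 12 months.
-- Table and column names (orders, customer_id, order_date) are illustrative.
SELECT
    customer_id,
    CASE WHEN COUNT(*) > 1 THEN TRUE ELSE FALSE END AS is_recurrent_customer
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '12 months'
GROUP BY customer_id
```

Notice that the business rule (the `> 1` threshold and the 12-month window) lives inside the query itself; change the definition of "recurrent" and every query like this one has to change with it.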

As a side note, semantics are sometimes also referred to as ontology. However, just for your context, while semantics is about the meaning behind words and numbers, ontology goes a step further. Think of ontology as the blueprint that shows how all these meanings (or semantics) fit together in the big picture of a business. It's like comparing a dictionary (semantics) to a detailed map of a city (ontology). The dictionary can tell you what individual words mean, but the map (or ontology) shows you how everything is connected.

The problems that gave rise to the Semantic Layer

Semantics used to be embedded directly within Excel formulas or SQL queries. The process went something like this: people with the business knowledge would express the rules and semantics of a metric in plain English to an analyst, who would then take that definition and convert it into a formula or query.

However, whenever business teams need to refine or update the logic of a metric, they have to ask data teams to do it. Additionally, when trying to access or understand the metrics they need, questions like which metric to use, or where it can be found, are very common. This constant back-and-forth and the lack of self-service capabilities for business teams slow down the pace at which businesses generate insights and make data-driven decisions. Meanwhile, data teams find a significant portion of their time consumed by responding to ad-hoc requests from their non-technical colleagues, which prevents them from working on higher-leverage (and more fulfilling) tasks. As we all know, self-service has always been the eternal promise in our field. And, while we have seen important innovations (such as BI platforms that offer user-friendly, no-code chart builders), true self-service remains largely unachieved. This is, in large part, due to the inefficient processes associated with creating and managing our metrics' semantics.

On top of that, companies today use an ever-growing number of data tools, including Business Intelligence (BI) platforms, Google Sheets, and analytical notebooks, among others. This proliferation of data applications within companies results in semantics scattered across numerous places. This means that the translation from "business language to code" has to be replicated across all the different tools, leading to duplicated work and, worse still, inconsistencies among metrics (e.g., three different definitions of "churn" found across various places).

The inefficiencies described above lead to two primary issues for organizations: 

  • Poor data quality
  • Low agility

While low agility hinders companies from moving fast, the impact of poor data quality is even greater. In an era where data is increasingly seen as a critical business asset, ensuring its quality and consistency is crucial. Indeed, data quality is often cited as the foremost priority on data teams' roadmaps today2.

Figure extracted from the 2024 State of Analytics Engineering report by dbt Labs

In response to these challenges, some companies have pushed to write all these definitions down in documentation tools like Google Docs, Confluence, or Notion, attempting to construct their version of a "semantic layer." Yet, these "business glossaries" are disconnected from data applications and are not automatically updated, meaning any change in definition, such as for "churn," doesn't get reflected in the BI tools automatically. Someone has to go to each data application and make the update manually, which doesn't solve the inefficiencies previously mentioned. We need something that is seamlessly connected and synchronizes with the platforms where metrics are consumed. Enter the Semantic Layer.

What is the Semantic Layer?

Put simply, the Semantic Layer is the "business representation" of your data. It acts as a translation layer between data structures—tables in a database or data warehouse—and business language. Also known as the "Metrics Layer" or "Metrics Store", it offers a unified and universal way for everyone within the company to access metric data using business-friendly concepts.

Consider it as an abstraction layer; it abstracts all the business logic from SQL (or whichever language is used in your transformation pipelines) and encapsulates it into a higher-level form of metadata within configuration files, primarily using YML syntax. These YML files contain all the logic, rules, instructions, and definitions needed to transform raw numbers in a database into metrics like "Active User" or "Total Production Costs".
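As an illustration, a metric definition in one of these YML files might look roughly like the following. This is a simplified sketch loosely based on dbt's MetricFlow syntax; the model and column names (fct_orders, order_total, etc.) are hypothetical, and the exact schema varies by provider, so consult the official documentation for the real spec:

```yaml
# Illustrative sketch only; names are hypothetical.
semantic_models:
  - name: orders
    model: ref('fct_orders')
    entities:
      - name: order_id
        type: primary
      - name: customer_id
        type: foreign
    dimensions:
      - name: ordered_at
        type: time
        type_params:
          time_granularity: day
    measures:
      - name: order_total
        agg: sum

metrics:
  - name: revenue
    label: Revenue
    type: simple
    type_params:
      measure: order_total
```

The point is that the business logic (what "Revenue" means, how it aggregates, which dimensions it can be sliced by) now lives in declarative metadata rather than being hardcoded in every query.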

However, while we now have the language (syntax, instructions, etc.) to define our semantics, a crucial component remains. The language for interacting with data in a relational database is SQL, not YML. You can't directly query the data with your metadata. There needs to be a software intermediary capable of translating the instructions in a YML file into a SQL query that can be run against the database. This piece of software dynamically compiles the business language encoded in the metadata into the corresponding SQL query3, executes it on the database, and sends the resulting table to downstream data applications.
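For instance, if a user asks for monthly revenue, the compiler might generate SQL roughly along these lines. This is a hedged sketch: the table and column names are illustrative, and the SQL each tool actually emits will differ:

```sql
-- Illustrative only: what a request like "revenue by month" could
-- compile to. Table and column names are hypothetical.
SELECT
    DATE_TRUNC('month', ordered_at) AS month,
    SUM(order_total) AS revenue
FROM fct_orders
GROUP BY 1
ORDER BY 1
```

The user never writes (or sees) this SQL; they only ask for the metric and the dimensions they care about.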

There are providers in the market with platforms that do this compilation. The ones that are gaining broader adoption, particularly among modern tech companies, are dbt and Cube. This article will not delve into a comparison between them (that will be covered in a dedicated post). These players provide the two essential components described above: the programming language (typically offered as an open-source library) for writing the YML metadata, and a Cloud platform that compiles that code as a service.

To clarify, the concept of the semantic layer is not new4; some BI platforms incorporate their own semantic layer, with Looker's LookML being the most notable. However, the current innovation lies in decoupling the semantic layer from the BI tools. This is bigger than you may think, because it allows different data applications to access a centralized repository of metrics, ensuring that values are consistent across platforms. The next section presents the main benefits the Semantic Layer brings to the table.

What's the value of the semantic layer?

As we have seen, the essence of the Semantic Layer is the following: we define the metrics once, in business language, and centralize them so that every data application (BI and others) can access them and get consistent results. As you'll see, all the benefits that derive from this revolve around enhancing data quality and efficiency, both for business users and for data teams. Let's break it down.

Semantics are defined once and consolidated in a central repository

  • A single source of truth: Metrics are no longer scattered and duplicated across different platforms (such as BI tools, Google Sheets, Notebooks, etc.), ensuring that everyone in the organization gets the same answer, every time. So, if you're looking at something like “Monthly Churn Rate for the last quarter,” you can trust that it’s right because it's based on agreed-upon definitions.
  • Avoids duplication of work: Acting as the central repository for metrics, the Semantic Layer eliminates redundant efforts. This efficiency frees up substantial time and resources for higher-value tasks. Considering that data scientists spend, on average, around 80% of their time preparing and managing data for analysis, the impact is very significant.

Semantics can be accessed using business-friendly concepts:

  • Facilitates self-service to end users: It creates a common language for data that is accessible to all. Whether you’re in marketing, sales, or any other team, you can find and understand the metrics without having to bug the tech team every time. And, while I wouldn't argue that helping people answer ad-hoc questions is the biggest pain in analytics, this autonomy will certainly accelerate the process of insight generation, pushing the boundaries of self-service analytics higher.

Semantics are written as code, so data teams adopt the best practices of software development5:

  • It enables data governance in metrics: Metrics and dimensions are now version-controlled, and this approach ensures more collaboration and control for updating and managing metrics. For instance, consider the metric “Customer Lifetime Value” (CLV). If there's a need to update the CLV calculation, the proposed changes are pushed to the central repo, then reviewed and approved through a collaborative process involving both data and business teams.
  • Data teams are more efficient: As we explain in a dedicated section at the end of the article, the semantic layer allows teams to improve the development, reusability, and maintainability of the code.

👉🏻 In summary, the Semantic Layer empowers every member of the organization—whether in marketing, business operations, or beyond—to query data in a business-friendly language, using their preferred tools, and receive accurate and consistent answers. But why is this more important than ever today? We discuss this in the next section.

A necessary piece in today's AI era

We have seen that generative AI is fundamentally changing the way we interact with data, by providing a natural language interface that enables a more conversational BI experience. As Bob Muglia, former CEO of Snowflake, noted, "English will be the primary language in BI tools".

This brings tremendous value for everyone. For data consumers, self-service will reach new heights. They'll be able to ask any (quantitative) question to LLMs and receive immediate metric values. Data teams will direct their focus towards more impactful (and fulfilling!) tasks, moving away from answering ad-hoc questions and building charts for business teams.

The evolution of self-service

However, we all know about the concerns around LLMs' tendency to "hallucinate". While we can accept a certain degree of unpredictability in other fields (like marketing copy or image generation), because there is no right or wrong answer, data work is very precise and the numbers have to be 100% right. Moreover, LLMs struggle with relational databases due to their lack of contextual understanding. For instance, how can an LLM know which column in a HubSpot table represents the correct lead identifier when there are dozens of them containing the keyword 'lead' (lead_124958, lead_34835, etc.)?

As explained here, building a good text-to-SQL solution requires architectures that can include all of the necessary context in the prompt. As you might have guessed, this context and these guardrails are provided by the Semantic Layer. It acts as a vital intermediary between the raw data stored in databases and the LLMs that parse and interpret natural language queries. Therefore, the presence of a Semantic Layer is crucial if we want the AI model to produce reliable and consistent results. For more on this topic, I recommend this talk given by Paul Blankley from the Zenlytic team.

👉🏻 Put simply, Semantic Layers will become the data interface for LLMs. And the evidence is clear. You can see it here.

Moreover, as the cost of generating code decreases rapidly, with a proliferation of new models and metrics everywhere, data governance will become critical to avoid organizational chaos. Particularly for LLMs, having centrally managed, high-quality metadata will be key. This metadata will act as the knowledge base that LLMs can draw upon to ensure their outputs are accurate and aligned with the organization's definitions, without hallucinating.

In this new AI era, the Semantic Layer is not just another piece in your data stack; we see it as the foundation upon which the future of conversational BI and AI-driven data analysis will be built. And the potential of this synergy (semantic layer + LLMs) goes far beyond BI. As Tristan Handy, dbt Labs' CEO, said:

There is a whole unmapped territory that we are not addressing with BIs. The Semantic Layer opens up lots of opportunities and experiences. In the world, there are ~30-40M BI users, vs ~1Bn knowledge workers that would like to interact with data in their day-to-day work.

For data teams: Data modeling becomes simpler

The Semantic Layer is redefining data modeling. As we have seen, semantics are defined as metadata in a higher layer of the data stack, and hence are abstracted (or decoupled) from the transformation layer (the 'T' in the ETL/ELT pipeline). This structural shift means that logic previously hardcoded into dbt models is now managed by the Semantic Layer. For example, metric aggregations no longer require rigid OLAP tables, and the relationships between entities (e.g., primary and foreign keys) are defined once at the beginning of the project, with the semantic layer seamlessly performing all the recurring join operations afterward. This evolution presents several key benefits for data teams:

  • dbt models become lighter: The dbt code has fewer lines and reduced complexity, so it's easier to maintain.
  • Semantic changes don't affect the core data models: You can change the definition of a metric, or the relationship between two entities, without editing the dbt code. This provides more stability to core data models. This is particularly advantageous considering that semantic updates are much more frequent than changes in foundational models (for instance, think about how many times you change the logic of a metric vs. how many times you change the grain of a model).
  • Metrics are constructed modularly (DRYer): Complex metrics are assembled from simpler ones (like reusable building blocks), and this boosts efficiency and minimizes code redundancy.
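To illustrate the "define joins once" idea, here is a hedged sketch of how two semantic models might share a customer entity, in dbt MetricFlow-style YML (the model and column names are hypothetical). Once the entity is declared in both models, the semantic layer can generate the join automatically whenever a query combines measures from one model with dimensions from the other:

```yaml
# Illustrative sketch only; names are hypothetical.
# The shared "customer" entity tells the semantic layer how to join
# orders (measures) with customers (dimensions) on demand.
semantic_models:
  - name: orders
    model: ref('fct_orders')
    entities:
      - name: customer
        type: foreign
        expr: customer_id
    measures:
      - name: order_total
        agg: sum

  - name: customers
    model: ref('dim_customers')
    entities:
      - name: customer
        type: primary
        expr: customer_id
    dimensions:
      - name: country
        type: categorical
```

With this in place, asking for order_total by country requires no hand-written JOIN: the entity declarations carry enough information for the layer to produce it.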

There are different standalone Semantic Layers in the industry. Our preferred one is dbt's, whose new version was released in June 2023. In a future article, we will delve into a comparison of the different Semantic Layers available in the market and provide guidance on how to get started on your Semantic Layer journey. In the meantime, don't hesitate to reach out to the SAIVO team if you have any questions about the Semantic Layer.

-

  1. You can watch it here: https://www.getdbt.com/coalesce-2021/keynote-the-metrics-system
  2. Survey provided by dbt Labs, the 2024 State of Analytics Engineering: https://www.getdbt.com/resources/reports/state-of-analytics-engineering-2024
  3. This process of compilation involves mapping the business terms to the specific data fields and structures within the database, applying any necessary filters, calculations, and transformations along the way.
  4. It was created by SAP in 1991 with their Business Objects (BOs). Here's a good article that explores the history of the Semantic Layer: airbyte.com/blog/the-rise-of-the-semantic-layer-metrics-on-the-fly
  5. Cloud Data Warehouses, like Snowflake or BigQuery, have paved the way for consolidating diverse data sources into a central place. The Semantic Layer builds upon this foundation, applying the concept of a single source of truth not just to raw data but also to the metrics and dimensions derived from it.