In today's fast-paced world of data and AI, there's something that's redefining how people work and interact with data: the Semantic Layer.
At SAIVO, we have been evangelizing the importance of the Semantic Layer ever since we were inspired by the keynote “The Metrics System” given by Drew Banin, dbt Labs’ co-founder. Since then, we have helped many companies adopt it, and we have seen its real impact first-hand.
However, despite its growing importance, many people still don't know what it is and what practical benefits it offers, which is why we have decided to write this article. We want to explain the essence of the Semantic Layer, and why organizations need to embrace it to get ready for the AI era.
Semantics are the way we give meaning to words and numbers. They serve as the bridge between abstract concepts and their representation in our world, and they are essential for communication and understanding.
Semantics have been ingrained in our cultures and languages since the beginning of human history. For instance, when ancient civilizations were defining the first units of measure, they were using semantics. Or when scientists defined the concept of gravity—it was not just a word but an entire system of understanding and semantics.
Today, in the business world, semantics are everywhere. When we look at sales numbers, website visits, or how many people bought a new product, we are using semantics. We take these numbers and turn them into something we understand: "This many people visited our site," or "We made this much money last month", and we create meaningful concepts like "Active User" or "Monthly Recurring Revenue".
We refer to these concepts as metrics. We calculate them and show them off in fancy charts and graphs. We display them in PowerPoint presentations. We create business narratives around them. Metrics are the language that business users speak.
We all know that every business needs to track key metrics or KPIs. Metrics guide us, they show what's working well and what's not, and where we might need to put more effort or resources. Tom Blomfield, a partner at Y Combinator, explains it really well:
With better metrics you'll make better decisions. It's like flying an airplane. With no instruments, you're flying blind. You don't know what's happening to the aircraft, you are not in control. Having great metrics is like having great instruments in an aircraft. They let you tweak, iterate, and make sure you're really in control of your startup. [...] As an investor, it's really easy to tell founders who are in command of their metrics versus founders who aren't. It's really impressive when founders can talk about what % of their signups are DAU or WAU. It's a big differentiator when a founder can talk so fluently about these metrics.
Every metric has a lot of semantics and business rules involved. Let's take the Churn Rate as an example. This number shows how many customers stop using our service over time. But to make sense of Churn Rate, we need to agree on how we define the "churn" event. Is it when they cancel a service, don't renew it, or is it something more nuanced?
It's crucial to have an agreed and accurate definition of each metric, because otherwise, the resulting numbers will be wrong. Following the Churn Rate example, imagine you're presenting to the C-level team, and you've got this big, fancy chart showing the Churn Rate. But if that number is based on a shaky understanding of what "churn" really means, people will question the numbers they see. Suddenly, they're not just questioning the chart; they're questioning your work and whether they can trust the data you provide. In business, confidence in the data is everything. Keeping your metrics accurate and updated means everyone can trust in the direction you're pointing, and that trust is gold.
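To make the ambiguity concrete, here is a sketch of two defensible but conflicting "churn" definitions, written against a hypothetical `subscriptions` table (the table and column names are illustrative, not from any specific warehouse):

```sql
-- Hypothetical schema: subscriptions(customer_id, status, canceled_at, expires_at)

-- Definition A: churn = explicit cancellation during January 2024
SELECT COUNT(DISTINCT customer_id) AS churned_customers
FROM subscriptions
WHERE canceled_at >= DATE '2024-01-01'
  AND canceled_at <  DATE '2024-02-01';

-- Definition B: churn = subscription expired in January 2024 without renewal
SELECT COUNT(DISTINCT customer_id) AS churned_customers
FROM subscriptions
WHERE expires_at >= DATE '2024-01-01'
  AND expires_at <  DATE '2024-02-01'
  AND status <> 'renewed';
```

Both queries produce a number labeled "churned customers", yet they can disagree substantially — which is exactly why the definition must be agreed upon before the chart is built.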
In business intelligence, there is another element with lots of semantics involved: Dimensions. Dimensions are the categories that we use to segment, filter, group, slice, and dice the metrics. Some dimensions are simple, like `country` or `product_category`. But then, there are dimensions where things get a bit more complex. Take the dimension `is_recurrent_customer`, for example. This dimension splits our metrics, like Net Revenue, into new versus recurrent customers. Here, semantics play a big role because defining who counts as a "recurrent customer" isn't always straightforward. Is it someone who buys twice in a year? Or does a customer need to make a purchase every month to be considered recurrent? This is where clear definitions become crucial. So, just like with metrics, getting the semantics right in dimensions is key to keeping your data accurate.
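As an illustration, here is how one possible definition — "recurrent" means at least two purchases in the trailing 12 months — could be encoded in SQL against a hypothetical `orders` table (schema and threshold are illustrative assumptions):

```sql
-- Hypothetical schema: orders(customer_id, order_id, ordered_at)
-- Assumed definition: "recurrent" = 2+ purchases in the last 12 months
SELECT
  customer_id,
  COUNT(order_id) >= 2 AS is_recurrent_customer
FROM orders
WHERE ordered_at >= CURRENT_DATE - INTERVAL '12 months'
GROUP BY customer_id;
```

Change the threshold or the time window and the split of Net Revenue between "new" and "recurrent" changes with it.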
As a side note, semantics are sometimes also referred to as ontology. However, just for your context, while semantics is about the meaning behind words and numbers, ontology goes a step further. Think of ontology as the blueprint that shows how all these meanings (or semantics) fit together in the big picture of a business. It's like comparing a dictionary (semantics) to a detailed map of a city (ontology). The dictionary can tell you what individual words mean, but the map (or ontology) shows you how everything is connected.
Semantics used to be directly embedded within Excel formulas or SQL queries. The process went something like this: people with the business knowledge expressed the rules and semantics of a metric in plain English to an analyst, who then took that definition and converted it into a formula or query.
However, whenever business teams needed to refine or update the logic of a metric, they had to ask data teams to do it. Additionally, when trying to access or understand the metrics they need, questions like which metric to use, or where it can be found, are very common. This constant back-and-forth and the lack of self-service capabilities for business teams slow down the pace at which businesses generate insights and make data-driven decisions. Meanwhile, data teams find a significant portion of their time consumed by responding to ad-hoc requests from their non-technical colleagues, which prevents them from working on higher-leverage (and more fulfilling) tasks. As we all know, self-service has always been the eternal promise in our field. And, while we have seen important innovations (such as BI platforms that offer user-friendly, no-code chart builders), true self-service remains largely unachieved. This is, in big part, due to the inefficient processes associated with creating and managing our metrics' semantics.
On top of that, companies today use an ever-growing number of data tools, including Business Intelligence (BI) platforms, Google Sheets, and analytical notebooks, among others. This proliferation of data applications within companies results in semantics scattered across numerous places. The translation from "business language to code" has to be replicated across all the different tools, leading to duplicated work and, worse still, inconsistencies among metrics (e.g., three different definitions of "churn" found across various places).
The inefficiencies described above lead to two primary issues for organizations: low agility and low data quality.
While low agility hinders companies from moving fast, the impact of low data quality is even higher. In an era where data is increasingly seen as a critical business asset, ensuring its quality and consistency is crucial. Indeed, data quality is often cited as the foremost priority on data teams' roadmaps today.
In response to these challenges, some companies have pushed to write all these definitions down in documentation tools like Google Docs, Confluence, or Notion, attempting to construct their version of a "semantic layer." Yet, these "business glossaries" are disconnected from data applications and are not automatically updated, meaning any change in definition, such as for "churn," doesn't get reflected in the BI tools automatically. Someone has to go to each data application and make the update manually, which doesn't solve the inefficiencies previously mentioned. We need something that is seamlessly connected and synchronizes with the platforms where metrics are consumed. Enter the Semantic Layer.
Put simply, the Semantic Layer is the "business representation" of your data. It acts as a translation layer, turning data structures—tables in a database or a data warehouse—into business language. Also known as the "Metrics Layer" or "Metrics Store", it offers a unified, universal way for everyone within the company to access metric data using business-friendly concepts.
Consider it an abstraction layer: it abstracts all the business logic out of SQL (or whichever language your transformation pipelines use) and encapsulates it, as a higher-level form of metadata, within configuration files, primarily using YML syntax. These YML files contain all the logic, rules, instructions, and definitions needed to transform the raw numbers in a database into metrics like "Active User" or "Total Production Costs".
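To give a feel for what this looks like, here is a sketch loosely modeled on dbt's Semantic Layer syntax. The field names and structure are simplified and illustrative — consult the provider's documentation for the exact spec:

```yaml
# Illustrative only -- loosely modeled on dbt's Semantic Layer syntax;
# model names, columns, and fields are hypothetical.
semantic_models:
  - name: orders
    model: ref('fct_orders')
    entities:
      - name: customer
        type: foreign
        expr: customer_id
    dimensions:
      - name: ordered_at
        type: time
      - name: country
        type: categorical
    measures:
      - name: net_revenue
        agg: sum
        expr: order_amount - refunds

metrics:
  - name: net_revenue
    label: "Net Revenue"
    type: simple
    type_params:
      measure: net_revenue
```

The point is that the business rule ("net revenue is order amount minus refunds") lives in one governed file, not in a dozen SQL queries.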
However, while we now have the language (syntax, instructions, etc.) to define our semantics, a crucial component remains. The language for interacting with data in a relational database is SQL, not YML; you can't directly query the data with your metadata. There needs to be a software intermediary capable of translating the instructions in a YML file into a SQL query that can be run against the database. This piece of software dynamically compiles the business language encoded in the metadata into the corresponding SQL query, executes that query on the database, and sends the resulting table to downstream data applications.
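For instance, a request for "Net Revenue by month" might be compiled into something like the following. This is illustrative output only — the actual SQL generated varies by provider and warehouse dialect, and the table name is hypothetical:

```sql
-- Illustrative compiled output; real providers generate
-- provider- and dialect-specific SQL.
SELECT
  DATE_TRUNC('month', ordered_at) AS metric_time__month,
  SUM(order_amount - refunds)     AS net_revenue
FROM analytics.fct_orders
GROUP BY 1
ORDER BY 1;
```

The consumer never writes this query; they ask for the metric, and the compiler produces and runs the SQL on their behalf.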
There are providers in the market with platforms that do this compilation. The ones that are gaining broader adoption, particularly among modern tech companies, are dbt and Cube. This article will not delve into a comparison between them (that will be covered in a dedicated post). These players provide the two essential components described above: the programming language (typically offered as an open-source library) for writing the YML metadata, and a Cloud platform that compiles that code as a service.
To clarify, the concept of the semantic layer is not new; some BI platforms incorporate their own semantic layer, with Looker's LookML being the most notable. However, the current innovation lies in decoupling the semantic layer from the BI tools. This is bigger than you may think, because it gives different data applications access to a centralized repository of metrics, ensuring that values are consistent across platforms. The next section presents the main benefits the Semantic Layer brings to the table.
As we have seen, the essence of the Semantic Layer is the following: we define the metrics once, in business language, and centralize them so that every data application (BI and others) can access them and get consistent results. As you'll see, all the benefits that derive from this revolve around enhancing data quality and efficiency, both for business users and for data teams. Let's break it down.
These benefits stem from three properties:

- Semantics are defined once and consolidated in a central repository.
- Semantics can be accessed using business-friendly concepts.
- Semantics are written as code, so data teams adopt the best practices of software development.
👉🏻 In summary, the Semantic Layer empowers every member of the organization—whether in marketing, business operations, or beyond—to query data in a business-friendly language, using their preferred tools, and receive accurate and consistent answers. But, why is this more important than ever today? We discuss this in the next section.
We have seen that generative AI is fundamentally changing the way we interact with data by providing a natural language interface that enables a more conversational BI experience. As Bob Muglia, former CEO of Snowflake, noted, "English will be the primary language in BI tools”.
This brings tremendous value for everyone. For data consumers, self-service will reach new heights: they'll be able to ask LLMs any (quantitative) question and receive immediate metric values. Data teams will redirect their focus toward more impactful (and fulfilling!) tasks, moving away from answering ad-hoc questions and building charts for business teams.
However, we all know about the concerns around LLMs' tendency to "hallucinate". While we can accept a certain degree of unpredictability in other fields (like marketing copy or image generation), because there is no right or wrong answer there, data work is very precise and the numbers have to be 100% right. Moreover, LLMs struggle with relational databases due to a lack of contextual understanding. For instance, how can an LLM know which column in a HubSpot table represents the correct lead identifier when there are dozens of them containing the keyword 'lead' (lead_124958, lead_34835, etc.)?
As explained here, building a good text-to-SQL solution requires architectures that can include all of the necessary context in the prompt. As you might have guessed, this context and these guardrails are provided by the Semantic Layer. It acts as a vital intermediary between the raw data stored in databases and the LLMs that parse and interpret natural language queries. The presence of a Semantic Layer is therefore crucial if we want the AI model to produce reliable and consistent results. For more on this topic, I recommend this talk given by Paul Blankley from the Zenlytic team.
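As a hypothetical sketch, the context a Semantic Layer could hand to an LLM alongside a user's question might look something like this (the structure and descriptions are purely illustrative):

```yaml
# Hypothetical metadata injected into the LLM prompt.
# With this context, "What was churn last month?" resolves to one
# governed definition instead of a guess over raw columns.
available_metrics:
  - name: churn_rate
    description: "Share of customers who canceled or failed to renew in the period."
  - name: net_revenue
    description: "Order amount minus refunds."
available_dimensions:
  - name: country
  - name: is_recurrent_customer
    description: "True if the customer made at least two purchases in the last 12 months."
```

Instead of reverse-engineering dozens of ambiguous columns, the model picks from a small, curated vocabulary of governed metrics and dimensions.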
👉🏻 Put simply, Semantic Layers will become the data interface for LLMs. And the evidence is clear. You can see it here.
Moreover, as the cost of generating code drops rapidly and new models and metrics proliferate everywhere, data governance will become critical to avoid organizational chaos. For LLMs in particular, centrally managed, high-quality metadata will be key. This metadata will act as the knowledge base that LLMs can draw upon to ensure their outputs are accurate and aligned with the organization's definitions, without hallucinating.
In this new AI era, the Semantic Layer is not just another piece in your data stack; we see it as the foundation upon which the future of conversational BI and AI-driven data analysis will be built. And, the potential of this synergy (semantic layer + LLMs) goes far beyond BIs. As Tristan Handy, dbt's CEO, said:
There is a whole unmapped territory that we are not addressing with BIs. The Semantic Layer opens up lots of opportunities and experiences. In the world, there are ~30-40M BI users, vs ~1Bn knowledge workers that would like to interact with data in their day-to-day work.
The Semantic Layer is redefining data modeling. As we have seen, semantics are defined as metadata in a higher layer of the data stack, and are hence abstracted (or decoupled) from the transformation layer (the 'T' in the ETL/ELT pipeline). This structural shift means that logic previously hardcoded into dbt models is now managed by the Semantic Layer. For example, metric aggregations no longer require rigid OLAP tables, and the relationships between entities (e.g., primary and foreign keys) are defined once at the beginning of the project, after which the semantic layer seamlessly performs the recurring join operations. This evolution presents several key benefits for data teams.
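For example, entity relationships might be declared once in metadata along these lines (illustrative syntax, loosely inspired by dbt-style semantic models), after which the layer generates the joins on demand:

```yaml
# Illustrative syntax: keys are declared once; the semantic layer
# generates the corresponding JOINs whenever a query needs them.
semantic_models:
  - name: customers
    entities:
      - name: customer
        type: primary     # customers.customer_id is the primary key
        expr: customer_id
  - name: orders
    entities:
      - name: customer
        type: foreign     # orders.customer_id references customers
        expr: customer_id
```

No analyst ever writes the `JOIN orders ON orders.customer_id = customers.customer_id` clause again; it is derived from these declarations.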
There are different standalone Semantic Layers in the industry. Our preferred one is dbt's, whose new version was released in June 2023. In a future article, we will compare the different Semantic Layers available in the market and provide guidance on how to get started on your Semantic Layer journey. In the meantime, don't hesitate to reach out to the SAIVO team if you have any questions about the Semantic Layer.