March 27, 2023
  • UPM is our internal standalone library to perform static analysis of SQL code and enhance SQL authoring.
  • UPM takes SQL code as input and represents it as a data structure called a semantic tree.
  • Infrastructure teams at Meta leverage UPM to build SQL linters, catch user mistakes in SQL code, and perform data lineage analysis at scale.

Executing SQL queries against our data warehouse is important to the workflows of many engineers and data scientists at Meta for analytics and monitoring use cases, either as part of recurring data pipelines or for ad-hoc data exploration.

While SQL is extremely powerful and very popular among our engineers, we've also faced some challenges over the years, namely:

  • A need for static analysis capabilities: In a growing number of use cases at Meta, we must understand programmatically what happens in SQL queries before they are executed against our query engines, a task called static analysis. These use cases range from performance linters (suggesting query optimizations that query engines cannot perform automatically) to data lineage analysis (tracing how data flows from one table to another). This was hard for us to do for two reasons: First, while query engines internally have some capabilities to analyze a SQL query in order to execute it, this query analysis component is typically deeply embedded inside the query engine's code. It is not easy to extend, and it is not intended for consumption by other infrastructure teams. Second, each query engine has its own analysis logic, specific to its own SQL dialect; as a result, a team that wants to build a piece of analysis for SQL queries would have to reimplement it from scratch inside each SQL query engine.
  • A limiting type system: Initially, we used only the fixed set of built-in Hive data types (string, integer, boolean, etc.) to describe table columns in our data warehouse. As our warehouse grew more complex, this set of types became insufficient, as it left us unable to catch common categories of user errors, such as unit errors (imagine making a UNION between two tables, both of which contain a column called timestamp, but one is encoded in milliseconds and the other one in nanoseconds), or ID comparison errors (imagine a JOIN between two tables, each with a column called user_id, where the IDs are in fact issued by different systems and therefore cannot be compared).

How UPM works

To address these challenges, we have built UPM (Unified Programming Model). UPM takes in an SQL query as input and represents it as a hierarchical data structure called a semantic tree.

For example, if you pass in this query to UPM:

SELECT
  COUNT(DISTINCT user_id) AS n_users
FROM login_events

UPM will return this semantic tree:

                arguments=[ColumnRef(name="user_id", parent=Table("login_events"))],
    parent=Table("login_events"),
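To make the shape of such a tree concrete, here is a minimal Python sketch. Only `ColumnRef`, `Table`, and the `arguments` and `parent` fields appear in the fragment above; the other class names, field names, and the `referenced_columns` helper are assumptions made for illustration and should not be read as UPM's actual API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Table:
    name: str


@dataclass(frozen=True)
class ColumnRef:
    name: str
    parent: Table


# The node types below are hypothetical stand-ins, not UPM's real classes.
@dataclass(frozen=True)
class CallExpression:
    function: str
    arguments: Tuple[ColumnRef, ...]


@dataclass(frozen=True)
class SelectItem:
    name: str
    value: CallExpression


@dataclass(frozen=True)
class SelectQuery:
    items: Tuple[SelectItem, ...]
    parent: Table


login_events = Table("login_events")

# A tree for the example query: one output column, n_users, computed from user_id.
tree = SelectQuery(
    items=(
        SelectItem(
            name="n_users",
            value=CallExpression(
                function="COUNT_DISTINCT",
                arguments=(ColumnRef(name="user_id", parent=login_events),),
            ),
        ),
    ),
    parent=login_events,
)


def referenced_columns(query: SelectQuery) -> list:
    """Walk the select list and collect every column it reads as 'table.column'."""
    columns = []
    for item in query.items:
        for arg in item.value.arguments:
            columns.append(f"{arg.parent.name}.{arg.name}")
    return columns
```

Because the query is now a typed tree rather than a string, a tool can answer a question such as "which columns does this query read?" with a simple traversal: here `referenced_columns(tree)` returns `["login_events.user_id"]`.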

Other tools can then use this semantic tree for different use cases, such as:

  1. Static analysis: A tool can inspect the semantic tree and then output diagnostics or warnings about the query (such as a SQL linter).
  2. Query rewriting: A tool can modify the semantic tree to rewrite the query.
  3. Query execution: UPM can act as a pluggable SQL front end, meaning that a database engine or query engine can use a UPM semantic tree directly to generate and execute a query plan. (The term front end in this context is borrowed from the world of compilers; the front end is the part of a compiler that converts higher-level code into an intermediate representation that will ultimately be used to generate an executable program.) Alternatively, UPM can render the semantic tree back into a target SQL dialect (as a string) and pass that to the query engine.
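The three use cases above can be sketched over a toy tree. Everything in this Python example is hypothetical: the node classes, the lint rule, and the `lint`, `rewrite_approx`, and `render` functions are illustrations, not UPM's API. The rewrite targets `APPROX_DISTINCT`, an aggregate function that exists in Presto; whether any UPM-based linter suggests it is an assumption.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


# Hypothetical, simplified node types (not UPM's real classes).
@dataclass(frozen=True)
class Call:
    function: str  # e.g. "COUNT_DISTINCT"
    column: str


@dataclass(frozen=True)
class SelectItem:
    alias: str
    value: Call


@dataclass(frozen=True)
class SelectQuery:
    items: Tuple[SelectItem, ...]
    table: str


def lint(query: SelectQuery) -> List[str]:
    """Static analysis: inspect the tree and return warnings without running anything."""
    return [
        f"{item.alias}: exact distinct count may be slow; consider an approximate count"
        for item in query.items
        if item.value.function == "COUNT_DISTINCT"
    ]


def rewrite_approx(query: SelectQuery) -> SelectQuery:
    """Query rewriting: return a new tree with COUNT_DISTINCT swapped for APPROX_DISTINCT."""
    new_items = tuple(
        replace(item, value=replace(item.value, function="APPROX_DISTINCT"))
        if item.value.function == "COUNT_DISTINCT"
        else item
        for item in query.items
    )
    return replace(query, items=new_items)


def render(query: SelectQuery) -> str:
    """Rendering: turn the tree back into a SQL string for a query engine."""

    def render_call(call: Call) -> str:
        if call.function == "COUNT_DISTINCT":
            return f"COUNT(DISTINCT {call.column})"
        return f"{call.function}({call.column})"

    select_list = ", ".join(
        f"{render_call(item.value)} AS {item.alias}" for item in query.items
    )
    return f"SELECT {select_list} FROM {query.table}"


query = SelectQuery(
    items=(SelectItem("n_users", Call("COUNT_DISTINCT", "user_id")),),
    table="login_events",
)
```

With these definitions, `lint(query)` yields one warning, and `render(rewrite_approx(query))` produces `SELECT APPROX_DISTINCT(user_id) AS n_users FROM login_events`. The point of the design is that all three tools operate on the same tree and none of them needs its own SQL parser.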

A unified SQL language front end

UPM allows us to provide a single language front end to our SQL users so that they only need to work with a single language (a superset of the Presto SQL dialect), whether their target engine is Presto, Spark, or XStream, our in-house stream processing service.

This unification is also beneficial to our data infrastructure teams: Teams that own SQL static analysis or rewriting tools can use UPM semantic trees as a standard interop format, without worrying about parsing, analysis, or integration with different SQL query engines and SQL dialects. Similarly, much like Velox can act as a pluggable execution engine for data management systems, UPM can act as a pluggable language front end for data management systems, saving teams the effort of maintaining their own SQL front end.

Enhanced type-checking

UPM also allows us to provide enhanced type-checking of SQL queries.

In our warehouse, each table column is assigned a "physical" type from a fixed list, such as integer or string. Additionally, each column can have an optional user-defined type; while it does not affect how the data is encoded on disk, this type can supply semantic information (e.g., Email, TimestampMilliseconds, or UserID). UPM can take advantage of these user-defined types to improve static type-checking of SQL queries.

For example, an SQL query author might want to UNION data from two tables that contain information about different login events:

In the query on the right, the author is trying to combine timestamps in milliseconds from the table user_login_events_mobile with timestamps in nanoseconds from the table user_login_events_desktop, an understandable mistake, as the two columns have the same name. But because the tables' schemas have been annotated with user-defined types, UPM's typechecker catches the error before the query reaches the query engine; it then notifies the author in their code editor. Without this check, the query would have completed successfully, and the author might not have noticed the mistake until much later.
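A check of this kind can be sketched in a few lines of Python. This is a simplified illustration, not UPM's typechecker: the `ColumnType` class, the `check_union` function, the column name `login_timestamp`, the physical type `bigint`, and the type name `TimestampNanoseconds` are all assumptions. Only the two table names and `TimestampMilliseconds` come from the text above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ColumnType:
    physical: str                       # how the data is encoded, e.g. "bigint"
    user_defined: Optional[str] = None  # optional semantic annotation


def check_union(
    left_table: str,
    left_schema: Dict[str, ColumnType],
    right_table: str,
    right_schema: Dict[str, ColumnType],
) -> List[str]:
    """Return type errors for a UNION of two tables, matching columns by name."""
    errors = []
    for column, left_type in left_schema.items():
        right_type = right_schema.get(column)
        if right_type is None:
            errors.append(f"column '{column}' is missing from {right_table}")
        elif left_type.physical != right_type.physical:
            errors.append(
                f"column '{column}': physical types differ "
                f"({left_type.physical} vs. {right_type.physical})"
            )
        elif (
            left_type.user_defined
            and right_type.user_defined
            and left_type.user_defined != right_type.user_defined
        ):
            # Same encoding on disk, different meaning: a physical-only check misses this.
            errors.append(
                f"column '{column}': cannot UNION {left_type.user_defined} "
                f"({left_table}) with {right_type.user_defined} ({right_table})"
            )
    return errors


# Hypothetical schemas: both columns are stored as bigint but carry different units.
mobile = {"login_timestamp": ColumnType("bigint", "TimestampMilliseconds")}
desktop = {"login_timestamp": ColumnType("bigint", "TimestampNanoseconds")}

errors = check_union(
    "user_login_events_mobile", mobile, "user_login_events_desktop", desktop
)
```

Here `errors` contains a single message about the unit mismatch, even though the physical types are identical. In this sketch, columns without a user-defined type are only checked physically, which is one way stricter rules could be rolled out gradually.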

Column-level data lineage

Data lineage, that is, understanding how data flows within our warehouse and through to consumption surfaces, is a foundational piece of our data infrastructure. It allows us to answer data quality questions (e.g., "This data looks incorrect; where is it coming from?" and "Data in this table were corrupted; which downstream data assets were impacted?"). It also helps with data refactoring ("Is this table safe to delete? Is anyone still depending on it?").

To help us answer these critical questions, our data lineage team has built a query analysis tool that takes UPM semantic trees as input. The tool examines all recurring SQL queries to build a column-level data lineage graph across our entire warehouse. For example, given this query:

INSERT INTO user_logins_daily_agg
SELECT
   DATE(login_timestamp) AS day,
   COUNT(DISTINCT user_id) AS n_users
FROM user_login_events
GROUP BY 1

Our UPM-powered column lineage analysis would deduce these edges:

   from: "user_login_events.login_timestamp",
   to: "user_logins_daily_agg.day",
   transform: "DATE"

   from: "user_login_events.user_id",
   to: "user_logins_daily_agg.n_users",
   transform: "COUNT_DISTINCT"

By putting this information together for every query executed against our data warehouse each day, the tool shows us a global view of the full column-level data lineage graph.
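As a rough illustration of how such edges could be derived and then queried, consider the Python sketch below. It assumes a heavily simplified, hypothetical representation of an INSERT ... SELECT statement in which each output column comes from exactly one source column through one function. The class and function names are invented for this example and do not describe the lineage tool's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


# Hypothetical, simplified stand-ins for semantic tree nodes.
@dataclass(frozen=True)
class SelectItem:
    alias: str          # output column name
    function: str       # transformation applied, e.g. "DATE"
    source_column: str  # the single input column it reads


@dataclass(frozen=True)
class InsertSelect:
    target_table: str
    source_table: str
    items: Tuple[SelectItem, ...]


def lineage_edges(query: InsertSelect) -> List[Dict[str, str]]:
    """Emit one column-level edge per output column of the query."""
    return [
        {
            "from": f"{query.source_table}.{item.source_column}",
            "to": f"{query.target_table}.{item.alias}",
            "transform": item.function,
        }
        for item in query.items
    ]


def downstream(edges: List[Dict[str, str]], column: str) -> Set[str]:
    """Follow edges transitively to find every column derived from `column`."""
    graph: Dict[str, List[str]] = {}
    for edge in edges:
        graph.setdefault(edge["from"], []).append(edge["to"])
    seen: Set[str] = set()
    stack = [column]
    while stack:
        for successor in graph.get(stack.pop(), []):
            if successor not in seen:
                seen.add(successor)
                stack.append(successor)
    return seen


query = InsertSelect(
    target_table="user_logins_daily_agg",
    source_table="user_login_events",
    items=(
        SelectItem("day", "DATE", "login_timestamp"),
        SelectItem("n_users", "COUNT_DISTINCT", "user_id"),
    ),
)
edges = lineage_edges(query)
```

Collecting the edges from every query into one list gives a graph over which a question like "which downstream data assets were impacted?" becomes a reachability search: `downstream(edges, "user_login_events.user_id")` returns `{"user_logins_daily_agg.n_users"}`.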

What's next for UPM

We look forward to more exciting work as we continue to unlock UPM's full potential at Meta. Eventually, we hope all Meta warehouse tables will be annotated with user-defined types and other metadata, and that enhanced type-checking will be strictly enforced in every authoring surface. Most tables in our Hive warehouse already leverage user-defined types, but we are rolling out stricter type-checking rules gradually, to facilitate the migration of existing SQL pipelines.

We've already integrated UPM into the main surfaces where Meta's developers write SQL, and our long-term goal is for UPM to become Meta's unified SQL front end: deeply integrated into all our query engines, exposing a single SQL dialect to our developers. We also intend to iterate on the ergonomics of this unified SQL dialect (for example, by allowing trailing commas in SELECT clauses and by supporting syntax constructs like SELECT * EXCEPT <some_columns>, which already exist in some SQL dialects) and to ultimately raise the level of abstraction at which people write their queries.