03.02.2021
Open Source

Announcing Cubed - An Open Source Data Mart as a Service Platform

Haotian Zhang (Software Dev Engineer II), Guruganesh Kotta (Sr. Software Dev Engineer), Michael Natkovich (Software Engineering Sr. Director)

Today, we are excited to open source Cubed. Cubed is a “data mart as a service” platform that hides the complexity of data pipeline infrastructure in the background, and presents to users a neat and straightforward data mart creation experience. Cubed aims to democratize analytics by bringing data to each team in the company.

Motivation

Most large Internet companies have one or more huge data warehouses that store users’ online activities and use them for applications such as clickstream analysis and ad targeting. However, it’s very difficult to directly derive insights from such wide and sparse datasets given their sizes. For many of the analytics tasks, we only need a certain slice of data. For example, for tasks like calculating e-commerce customer conversions, data analysts may only need a small subset of data related to commerce.

Data marts can come to the rescue, but they are difficult to build. To build a data mart end-to-end, a team needs to take care of many error-prone tasks such as data ingestion, data modeling, onboarding to distributed big data systems, storage choices and optimization, and data pipeline CI/CD, and the list goes on.

Features

Given the pain points shared above, we built Cubed, with the following core features:

  1. Connects a set of tech stacks that are essential for building ETL data pipelines, and establishes a pattern to create data marts end-to-end. Once the Cubed service and its associated clusters are set up, the stack can be reused to create other pipelines conveniently.
  2. Allows data analysts to create data marts on demand. They are not required to write any code or configurations. They can experiment with the data, define data marts on the UI, and deploy the pipeline in one click.
  3. Business Intelligence (BI) tool agnostic. Analysts can easily plug in any of their favorite BI tools that support Druid (such as Turnilo, Superset, Yavin, Looker, or Tableau) for visualizations and reporting.
  4. Allows analysts to create data marts specifically designed for funnel conversion analysis.

Aside from the core features above, Cubed also:

  1. Makes it easy to onboard multiple schemas.
  2. Employs Apache DataSketches - a software library of stochastic streaming algorithms - which can produce approximate results orders of magnitude faster than exact computation, with mathematically proven error bounds.
  3. Employs hive-funnel-udf, enabling funnel analysis to be performed easily and efficiently on Hive tables.
  4. Allows analysts to run ad-hoc queries to test their data marts against streaming data via Bullet - a real-time query engine for very large data streams, or test their funnel marts against batch data via Hive, before deploying a pipeline.
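
To give a feel for how a sketch trades a small, bounded error for speed and memory, here is a minimal K-Minimum-Values (KMV) estimator for distinct counts. It is an illustrative sketch only: it does not use the Apache DataSketches library, and the function names (hash01, kmv_estimate) and the choice of k are our own assumptions rather than anything taken from Cubed.

```python
import hashlib
import heapq
from typing import Iterable, List, Set


def hash01(value: str) -> float:
    """Map a string to a pseudo-uniform float in (0, 1]."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2**64


def kmv_estimate(items: Iterable[str], k: int = 1024) -> float:
    """Estimate the number of distinct items, retaining only k hash values."""
    heap: List[float] = []      # negated hashes, so heap[0] is the largest kept
    retained: Set[float] = set()  # the same hashes, for duplicate detection
    for item in items:
        h = hash01(item)
        if h in retained:
            continue  # duplicate of a value we already keep
        if len(heap) < k:
            heapq.heappush(heap, -h)
            retained.add(h)
        elif h < -heap[0]:
            # New hash is smaller than the largest kept one: swap them.
            evicted = -heapq.heappushpop(heap, -h)
            retained.discard(evicted)
            retained.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct values: count is exact
    kth_smallest = -heap[0]
    return (k - 1) / kth_smallest
```

The estimator keeps at most k hash values no matter how many rows it sees, which is why this style of algorithm scales to very large inputs. A larger k costs more memory and gives a tighter estimate.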

Data Mart & Funnel Mart

Cubed supports the creation and management of data marts and funnel marts. A data mart is a subset of the data warehouse and is usually oriented to a specific business line or team. A data mart can be created by slicing and dicing the source data to break it down into smaller and more manageable parts. To be more specific, an analyst can apply filters (e.g. sector equals “sports”, country equals “us”, etc.), project a subset of dimensions (e.g. she may only be interested in studying the customers’ gender, age, and location), and construct multiple metrics (e.g. the sum of purchase amount) to define a data mart on Cubed.
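
As a rough illustration of how filters, dimensions, and metrics combine into one aggregation query, consider the sketch below. It is not Cubed's actual code or query format: the DataMart class, the to_sql method, and the table and column names (events, purchase_amount, and so on) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DataMart:
    """Hypothetical data mart definition: filters + dimensions + metrics."""
    source_table: str
    filters: List[str]       # SQL predicates, combined with AND
    dimensions: List[str]    # columns to keep and group by
    metrics: Dict[str, str]  # output name -> aggregate expression

    def to_sql(self) -> str:
        """Compose a single SELECT ... GROUP BY query from the definition."""
        select_items = self.dimensions + [
            f"{expr} AS {name}" for name, expr in self.metrics.items()
        ]
        sql = f"SELECT {', '.join(select_items)} FROM {self.source_table}"
        if self.filters:
            sql += " WHERE " + " AND ".join(self.filters)
        if self.dimensions:
            sql += " GROUP BY " + ", ".join(self.dimensions)
        return sql


mart = DataMart(
    source_table="events",
    filters=["sector = 'sports'", "country = 'us'"],
    dimensions=["gender", "age", "location"],
    metrics={"total_purchase": "SUM(purchase_amount)"},
)
print(mart.to_sql())
```

The point of the sketch is that the three choices an analyst makes on the UI map directly onto the WHERE, GROUP BY, and aggregate parts of a query, which is what makes a code-free definition possible.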

 

Data Mart Creation Interface in Cubed

 

When a data mart is deployed for regular loading and becomes available in mainstream BI tools, the analyst will be able to interactively work with the predefined dimensions and metrics to perform data analysis. Cubed uses Apache Druid as its analytics store because of its high performance on OLAP queries.

Cubed also allows analysts to create data marts specific to funnel analysis; we call such data marts “funnel marts”. Funnel analysis is very difficult to perform on large datasets: as the number of steps goes up, the query can take very long to run. In Cubed, an analyst can predefine a set of user interaction points using predicates represented by SQL filters, and connect these points to represent user journeys. This defines a funnel mart. Cubed tracks the unique user identifiers (e.g. cookies) in the data source to calculate the number of distinct users landing on each point. This calculation is fast and scalable thanks to the integration of the Hive Funnel UDF and Apache DataSketches.
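
To make the idea concrete, here is a small in-memory sketch of counting distinct users at each funnel step. It is an assumption-laden illustration, not the Hive Funnel UDF: the funnel_counts function, the event shape, and the rule that a user must hit the steps in order are all choices made for this example, and exact counting stands in for the DataSketches-based approximation.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Event = Dict[str, str]
Step = Callable[[Event], bool]  # a predicate, analogous to a SQL filter


def funnel_counts(events: Iterable[Tuple[str, Event]], steps: List[Step]) -> List[int]:
    """Count distinct users who reached each step, in order.

    `events` must be sorted by time; each item is (user_id, event).
    """
    progress: Dict[str, int] = {}  # user_id -> number of steps completed
    for user_id, event in events:
        done = progress.get(user_id, 0)
        # Only the user's next expected step can be completed by this event.
        if done < len(steps) and steps[done](event):
            progress[user_id] = done + 1
    return [
        sum(1 for done in progress.values() if done > i)
        for i in range(len(steps))
    ]


steps: List[Step] = [
    lambda e: e["page"] == "home",
    lambda e: e["page"] == "cart",
    lambda e: e["page"] == "checkout",
]
events = [
    ("u1", {"page": "home"}),
    ("u2", {"page": "home"}),
    ("u1", {"page": "cart"}),
    ("u3", {"page": "cart"}),      # never visited home, so not in the funnel
    ("u1", {"page": "checkout"}),
    ("u2", {"page": "checkout"}),  # skipped cart, so stays at step 1
]
print(funnel_counts(events, steps))
```

Dividing each count by the previous one gives the step-to-step conversion rate that a funnel chart displays.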


Funnel Mart Creation Interface in Cubed

 

When a funnel mart is deployed for regular loading and becomes available in mainstream BI tools, the analyst will be able to perform data analysis on demand by dragging and dropping dimensions and funnel steps on the BI tool. The analyst can visualize the big picture of user journeys (by using a Sankey chart or similar visualization), or drill down into each individual funnel to learn more about their conversion rates.

 

An Example Dashboard Employing a Sankey Chart and Funnel Charts

 

System Overview

 High-level User Interaction Flow

A data analyst interacts with the Cubed web service to define data marts. If a Bullet service is configured, she can test run the data mart against real-time streaming data. If building a funnel mart, she can test run funnel queries against Hive. The test run results can help the analyst fine-tune her marts before deploying them. Once the analyst decides to launch the mart, Cubed generates and packages all the files and configurations, and deploys the ETL pipeline to Oozie.

 

Data Flow of Pipelines Generated by Cubed

The deployed pipeline regularly runs the composed queries against Hive and loads the results into Druid. The analyst can plug BI tools into Druid and run high-performance analytics and reporting on the results Cubed produces.

 

Getting Started

Interested in bringing Cubed to your company?

  • Follow these steps to quickly boot up a Cubed service locally with sample schemas.
  • Onboard your own schemas by following these steps. To enable deploying data marts, you will need to integrate with HDFS, Hive, Oozie, and Druid clusters. Registering the hive-funnel-udf additionally unlocks Cubed's funnel mart deployment capability. A full-fledged Cubed service can also make use of Bullet to perform real-time data mart cardinality estimation and test runs against streaming data.
  • Feel free to contact us with any questions.

 

What’s Ahead

We are planning to make Cubed more versatile - including adding the capability to query low-latency (streaming) data marts. We are also planning to add more data sources and scheduler (e.g. Airflow) support so that Cubed can be applied to more systems. In addition, we are going to add authentication and authorization to Cubed. These are the major areas of focus in the near future, and we warmly welcome contributors to join this project.

 

Acknowledgments

Thank you to Josh Walters, Zeyu (Troy) Tao, Kaiyu Zheng, Jigar Patel, and Tushar Sircar for their contributions to this project.