Today, we are excited to open source Cubed. Cubed is a “data mart as a service” platform that hides the complexity of data pipeline infrastructure in the background, and presents to users a neat and straightforward data mart creation experience. Cubed aims to democratize analytics by bringing data to each team in the company.
Most large Internet companies have one or more huge data warehouses that store users’ online activities and use them for applications such as clickstream analysis and ad targeting. However, it’s very difficult to derive insights directly from such wide and sparse datasets given their size. Many analytics tasks need only a certain slice of the data. For example, to calculate e-commerce customer conversions, data analysts may only need the small subset of data related to commerce.
Data marts can come to the rescue, but they are difficult to build. To build a data mart end-to-end, a team needs to take care of many error-prone tasks such as data ingestion, data modeling, onboarding to distributed big data systems, storage choices and optimization, and data pipeline CI/CD, and the list goes on.
Given the pain points shared above, we built Cubed, with the following core features:
Aside from the core features above, Cubed also:
Cubed supports the creation and management of data marts and funnel marts. A data mart is a subset of the data warehouse and is usually oriented to a specific business line or team. A data mart can be created by slicing and dicing the source data to break it down into smaller and more manageable parts. To be more specific, an analyst can apply filters (e.g. sector equals “sports”, country equals “us”, etc.), project a subset of dimensions (e.g. she may only be interested in studying the customers’ gender, age, and location), and construct multiple metrics (e.g. the sum of purchase amount) to define a data mart on Cubed.
Data Mart Creation Interface in Cubed
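To make the three ingredients of a data mart definition (filters, projected dimensions, and metrics) more concrete, here is a minimal Python sketch that turns such a definition into a SQL aggregation query. It is an illustration only, not Cubed's actual code or schema: the `DataMart` and `Metric` classes, the `warehouse.events` table, and all column names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Metric:
    name: str       # output column name (hypothetical)
    aggregate: str  # SQL aggregate function, e.g. "SUM"
    column: str     # source column to aggregate


@dataclass
class DataMart:
    """Hypothetical data mart definition: filters + dimensions + metrics."""
    name: str
    source_table: str
    filters: List[str]     # SQL predicates, combined with AND
    dimensions: List[str]  # projected columns, also used for grouping
    metrics: List[Metric]

    def to_sql(self) -> str:
        # Dimensions come first, followed by one aggregate per metric.
        select_items = list(self.dimensions) + [
            f"{m.aggregate}({m.column}) AS {m.name}" for m in self.metrics
        ]
        sql = f"SELECT {', '.join(select_items)}\nFROM {self.source_table}"
        if self.filters:
            sql += "\nWHERE " + " AND ".join(self.filters)
        if self.dimensions:
            sql += "\nGROUP BY " + ", ".join(self.dimensions)
        return sql


# Assumed example values, mirroring the filters and metric mentioned above.
mart = DataMart(
    name="sports_us_purchases",
    source_table="warehouse.events",
    filters=["sector = 'sports'", "country = 'us'"],
    dimensions=["gender", "age", "location"],
    metrics=[Metric("total_purchase_amount", "SUM", "purchase_amount")],
)
print(mart.to_sql())
```

The printed query selects the three dimensions plus the summed purchase amount, restricted to the two filters and grouped by the dimensions. A real deployment would also need partitioning by load time and a mapping to the analytics store, which this sketch leaves out.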
When a data mart is deployed for regular loading and becomes available in mainstream BI tools, the analyst will be able to interactively work with the predefined dimensions and metrics to perform data analysis. Cubed uses Apache Druid as its analytics store because of its high performance on OLAP queries.
Cubed also allows analysts to create data marts specific to funnel analysis, which we call “funnel marts”. Funnel analysis is very difficult to perform on large datasets: as the number of steps goes up, the query can take a very long time to run. In Cubed, an analyst can predefine a set of user interaction points using predicates represented by SQL filters, and connect these points to represent user journeys. This defines a funnel mart. Cubed tracks the unique user identifiers (e.g. cookies) in the data source to calculate the number of distinct users landing on each point. This calculation is fast and scalable thanks to the integration of the Hive Funnel UDF and Apache DataSketches.
Funnel Mart Creation Interface in Cubed
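The idea of counting distinct users at each point of a journey can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how Cubed computes funnels: steps are plain Python predicates rather than SQL filters, exact sets stand in for the approximate distinct counting that Apache DataSketches provides, and the `funnel_counts` function, the event layout, and the step names are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Event = Dict[str, str]
Step = Tuple[str, Callable[[Event], bool]]  # (step name, predicate)


def funnel_counts(
    events: List[Tuple[str, int, Event]],  # (user_id, timestamp, event)
    steps: List[Step],
) -> List[Tuple[str, int]]:
    """Count distinct users who reach each step, in order."""
    # Group events by user so each journey can be replayed on its own.
    by_user = defaultdict(list)
    for user_id, ts, event in events:
        by_user[user_id].append((ts, event))

    reached = [set() for _ in steps]
    for user_id, user_events in by_user.items():
        next_step = 0
        # Replay the user's events in time order; a user only advances
        # to step i + 1 after an earlier event matched step i.
        for _, event in sorted(user_events, key=lambda pair: pair[0]):
            if next_step < len(steps) and steps[next_step][1](event):
                reached[next_step].add(user_id)
                next_step += 1
    return [(name, len(users)) for (name, _), users in zip(steps, reached)]


# Assumed three-step journey and a handful of made-up events.
steps = [
    ("view_product", lambda e: e["page"] == "product"),
    ("add_to_cart", lambda e: e["page"] == "cart"),
    ("checkout", lambda e: e["page"] == "checkout"),
]
events = [
    ("u1", 1, {"page": "product"}),
    ("u1", 2, {"page": "cart"}),
    ("u1", 3, {"page": "checkout"}),
    ("u2", 1, {"page": "product"}),
    ("u2", 2, {"page": "cart"}),
    ("u3", 1, {"page": "cart"}),     # cart before product: not counted for add_to_cart
    ("u3", 2, {"page": "product"}),
    ("u4", 1, {"page": "checkout"}),  # never entered the funnel
]
print(funnel_counts(events, steps))
```

Holding a set of user IDs per step is exact but memory-hungry at warehouse scale, which is the kind of cost that sketch-based distinct counting is meant to avoid.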
When a funnel mart is deployed for regular loading and becomes available in mainstream BI tools, the analyst will be able to perform data analysis on demand by dragging and dropping dimensions and funnel steps on the BI tool. The analyst can visualize the big picture of user journeys (by using a Sankey chart or similar visualization), or drill down into each individual funnel to learn more about their conversion rates.
An Example Dashboard Employing a Sankey Chart and Funnel Charts
High-level User Interaction Flow
A data analyst interacts with the Cubed web service to define data marts. If a Bullet service is configured, she can test run the data mart against real-time streaming data. When building a funnel mart, she can test run funnel queries against Hive. The test run results help the analyst fine-tune her marts before deploying them. Once the analyst decides to launch a mart, Cubed generates and packages all the files and configurations, and deploys the ETL pipeline to Oozie.
Data Flow of Pipelines Generated by Cubed
The deployed pipeline regularly runs the composed queries against Hive and loads the results into Druid. The analyst can plug BI tools into Druid and run high-performance analytics and reporting on the results Cubed produces.
Interested in bringing Cubed to your company?
We are planning to make Cubed more versatile, including adding the capability to query low-latency (streaming) data marts. We are also planning to add support for more data sources and schedulers (e.g. Airflow) so that Cubed can be applied to more systems. In addition, we are going to add authentication and authorization to Cubed. These are the major areas of focus for the near future, and we warmly welcome contributors to join this project.
Thank you to Josh Walters, Zeyu (Troy) Tao, Kaiyu Zheng, Jigar Patel, and Tushar Sircar for their contributions to this project.