Multidimensional Data Modeling in Pentaho
Pentaho Business Analytics is built on the Mondrian online analytical processing (OLAP) engine. OLAP relies on a multidimensional data model that, when queried, returns a dataset that resembles a grid. The rows and columns that describe and bring meaning to the data in that grid are dimensions, and the hard numerical values in each cell are the measures or facts. In Pentaho Analyzer, dimensions are shown in yellow and measures are in blue.
OLAP requires a properly prepared data source in the form of a star or snowflake schema that defines a logical multi-dimensional database and maps it to a physical database model. Once you have your initial data structure in place, you must design a descriptive layer for it in the form of a Mondrian schema, which consists of one or more cubes, hierarchies, and members. Only when you have a tested and optimized Mondrian schema is your data prepared on a basic level for end-user tools like Pentaho Analyzer.
Pentaho also offers expanded functionality for Pentaho Analysis Enterprise Edition customers, including:
- The Pentaho Analyzer visualization tool.
- A pluggable Enterprise Cache with support for highly scalable, distributable cache implementations including Infinispan and Memcached.
Use of these features requires a Pentaho Analysis Enterprise Edition license installed on the Pentaho Server and workstations that have Schema Workbench and Metadata Editor. A special Pentaho Server package must also be installed; this process is covered in the Installation documentation.
All relevant configuration options for these features are covered in this section.
Prepare Your Data
To prepare data for use with the Pentaho Analyzer and Report Designer, you should perform the following basic tasks:
- Design a Star or Snowflake Schema
The entire process starts with a data warehouse. This section will not attempt to explain how to build this structure -- there are entire books on the subject, and an entire consulting industry dedicated to it already. The end result should be data model in the star or snowflake schema pattern. You don't have to worry too much about getting the model exactly right on your first try. Just cover all of your anticipated business needs; part of the process is coming back to the data warehouse design step and making changes to your initial data model after you've discovered what your operational needs are.
- Populate the Star/Snowflake Schema
Once your data model is designed, the next step is to populate it with actual data, thereby creating your data warehouse. The best tool for this job is Pentaho Data Integration, an enterprise-grade extract, transform, and load (ETL) application.
- Build a Mondrian Schema
Now that your initial data warehouse project is complete, you must build a Mondrian schema to organize and describe it in terms that Pentaho Analysis can understand. You can use the Pentaho Schema Workbench to create an analysis schema.
- Initial Testing
At this point you should have a multi-dimensional data structure with an appropriate metadata layer. You can now start using the data inspection tools to drill down into your data and see if your first attempt at data modeling was successful. In all likelihood, it will need some adjustment, so take note of all of the schema limitations that you're unhappy with during this initial testing phase. Do not be concerned with performance issues at this time -- just concentrate on the completeness and comprehensiveness of the data model.
- Adjust and Repeat Until Satisfied
Use the notes you took during the testing phase to redesign your data warehouse and Mondrian schema appropriately. Adjust hierarchies and relational measure aggregation methods. Create virtual cubes for analyzing multiple fact tables by conforming dimensions. Re-test the new implementation and continue to refine the data model until it matches your business needs perfectly.
- Test for Performance
Once you're satisfied with the design and implementation of your data model, you should try to find performance problems and address them by tuning your data warehouse database, and by creating aggregation tables. The testing can only be reasonably done by hand, using Pentaho Analyzer. Take note of all of the measures that take an unreasonably long time to calculate. Also, enable SQL logging and locate slow-performing queries, and build indexes for optimizing query performance.
- Create Aggregation Tables
Using your notes as a guide, create aggregation tables in Pentaho Aggregation Designer to store frequently computed Analyzer reports. Re-test and create new aggregation tables as necessary. If you are working with a relatively small data warehouse or a limited number of dimensions, you may not have a real need for aggregation tables. However, be aware of the possibility that performance issues may come up in the future. Check in with your users occasionally to see if they have any particular concerns about the speed of their BI content.
- Deploy to Production
Your data warehouse and Mondrian schema have been created, tested, and refined. You're now ready to put it all into production. You may need to train or purchase Pentaho training for those in your organization who use Pentaho client tools.
With the initial data structure in place, you can use dimensional modeling to design a descriptive layer. Dimensional modeling is the process of transforming data from multiple sources in non-human-friendly formats into a single data source that is organized to support business analytics. Below is a typical workflow for developing a dimensional model:
- Collect user requirements for business logic and processes.
- Considering the entirety of your data, break it down into subjects.
- Isolate groups of facts into one or more fact tables.
- Design dimensional tables which draw relationships between levels (fact groups).
- Determine which members of each level are useful for each dimensional table.
- Build and publish a Mondrian (Pentaho Analysis) schema and collect feedback from users.
- Refine your model based on user feedback, continue iterating through this list until users are productive.
Or, expressed as a series of questions:
- What topics or subjects are important to the users who are analyzing the data? What do your users need to learn from the data?
- What are the important details your users will need to examine in the data?
- How should each data column relate to other data columns?
- How should datasets be grouped and organized?
- What are some useful short descriptions for each dimensional level in a hierarchy (for each element, decide what is useful within that element; for instance, in a dimensional table representing time, your levels might be year, month, and day, and your members for the year level might be 2003, 2004, 2005).
- How effective is this dimensional model for the intended user base? How can it be improved?
Pentaho Data Integration offers data inspection tools to make dimensional modeling much easier than more traditional methods. Through PDI, you can quickly adjust your business logic, the granularity of your fact tables, and the attributes of your dimension tables, then generate a new model and push it out to a test environment for evaluation.
Understanding Data Cubes
Another name for a dimensional model is a cube. Each cube represents one fact table and several dimensional tables. This model should be useful for reporting and analysis on the subject of the data in the fact table. However, if you want to cross-reference this data with another cube -- if you need to analyze data across two or more cubes, or need to combine information from two fact tables on the same subject but with different granularity -- then you must create a virtual cube. The XML elements that compose a virtual cube are explained in detail below.
Virtual cubes cannot presently be created through Pentaho Data Integration's model perspective; you must use Schema Workbench instead.
The <CubeUsages> element specifies the cubes that are imported into the virtual cube. It holds <CubeUsage> elements.
The <CubeUsage> element specifies the base cube that is imported into the virtual cube. Alternatively you can define a <VirtualCubeMeasure> and use similar imports from the base cube without defining a <CubeUsage>.The cubeName attribute specifies the name of the base cube. The ignoreUnrelatedDimensions attribute determines whether or not the measures from this base cube will have non-joining dimension members pushed to the top level member. This attribute is false by default because it is still experimental.
The <VirtualCubeDimension> element imports a dimension from one of the constituent cubes. If you do not specify the cubeName attribute, this means you are importing a shared dimension.
The <VirtualCubeMeasure> element imports a measure from one of the constituent cubes. It is imported with the same name. If you want to create a formula or rename a measure as you import it, use the <CalculatedMember> element instead.
Virtual cubes are useful for situations where there are fact tables of different granularities (for instance, one Time fact table might be configured on a Day level, another at the Month level), or fact tables of different dimensionalities (for instance one on Products, Time and Customer, another on Products, Time and Warehouse), and you need to present the results to users who don't know how the data is structured.
Any common dimensions -- shared dimensions which are used by both constituent cubes -- are automatically synchronized. In this example, [Time] and [Products] are common dimensions. So if the context is ([Time]..[Q2], [Products].[Productname].[P-51-D Mustang]), measures from either cube will relate to this context.
Dimensions which only belong to one cube are called non-conforming dimensions. The [Gender] dimension is an example of this; it exists in the Sales cube, but not Warehouse. If the context is ([Gender].[F], [Time]..[Q1]), it makes sense to ask the value of the [Unit Sales] measure (which comes from the [Sales] cube) but not the [Units Ordered] measure (from [Warehouse]). In the context of [Gender].[F], [Units Ordered] has value NULL.
Map a Model with Schema Workbench
With a physical multidimensional data model in place, you must create a logical model that maps to it. A Mondrian schema is essentially an XML file that performs this mapping, thereby defining a multidimensional database structure. You can create Mondrian schemas using the Pentaho Schema Workbench.
Configure Mondrian Engine
The Pentaho Analysis (Mondrian) engine is configurable through a properties file. Mondrian options allow for enhanced engine and data source performance and functionality under certain conditions.
Mondrian Cache Control
You can configure and control the cache infrastructure that the Pentaho Analysis engine uses for OLAP data.
The following topics help to extend your knowledge of multidimensional data beyond basic setup and use:
- Analysis Schema Security
You can define user roles to secure access to your schema.
- About Multidimensional Expression Language (MDX)
You can query your schema using the Multidimensional Expression Language.
- Mondrian Schema Element Quick Reference
You can use this guide to understand each element of your schema.
- OLAP Log Output
You can examine logs to fine tune performance OLAP queries.