Hitachi Vantara Lumada and Pentaho Documentation

Plan and build your Data Catalog

As a data steward, you can start planning and building Data Catalog for your data analysts to use.

Planning your Data Catalog

With tens or even hundreds of data sources in your enterprise data lake catering to the curiosities and needs of thousands of users, it is helpful to plan your data catalog before building it with Lumada Data Catalog.

Use the following guidelines to plan your Data Catalog:

  • Plan data sources to be added along with their path details

    One of the ways Data Catalog achieves data security is by controlling which virtual folders each role can access. For example, if you add a data source at the root path, controlling access to specific content under that path for all roles is difficult. In this case, the Data Catalog admin relies on virtual folders, defined by a path and include/exclude patterns, for access control. Also note that Data Catalog does not allow data sources with overlapping paths, whereas virtual folder paths can overlap through the use of include and exclude patterns.

  • Plan custom roles

    Default Data Catalog roles (SysAdmin and Guest) are predefined with a limited set of permissions and access to all virtual folders in the Data Catalog. The permissions are categorized in predefined roles: Administrator, Steward, Analyst, and Guest. You can create custom roles with a tailored set of permissions and assign the scope for virtual folders.

    To assign virtual folders to a role, you can create custom roles with finer control over role-based accessibility and job management. To plan your custom roles, see Role-based access control (RBAC). You may also want to plan how many admin users are required to manage your data lake. Based on your organizational structure, you may want an admin for each business function, division, or region.

  • Plan tag domains and tag glossary

    Tags are labels that users can attach to data or resources to mark a particular data pattern or resource that contributes to their business value. These tags can then be automatically propagated throughout the data lake to identify similar data patterns. Tags can be grouped into tag domains with a multi-level hierarchy and assigned to roles to perform data analysis. The actions that can be performed with tags differ depending on the permissions assigned to the roles. For example, admins are granted the most control, while guest users may be limited to just viewing tags and tag suggestions. Tag domains also help with the separation of business domains: users in one business function cannot see tags from another business function unless those tag domains are assigned to them.

  • Plan job management functions

    Plan which roles are allowed to run jobs. Data Catalog lets any role with the job profiling permission enabled run profiling jobs as basic sequences with default attributes. However, for resources that require custom system or Data Catalog attributes, an admin user may need to create job templates that apply these custom attributes when profiling such resources. A basic plan using the previous guidelines can help you build your Data Catalog efficiently.
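Because Data Catalog does not allow data sources with overlapping paths, it can help to check a planned list of data sources before adding them. The following is a minimal sketch of such a check, not part of Data Catalog itself. The function names and the sample source names and paths are hypothetical, and the sketch assumes that all paths are POSIX-style and belong to the same underlying file system.

```python
from itertools import combinations
from pathlib import PurePosixPath


def paths_overlap(a: str, b: str) -> bool:
    """Return True if the two paths are equal or one is an ancestor of the other."""
    pa, pb = PurePosixPath(a), PurePosixPath(b)
    return pa == pb or pa in pb.parents or pb in pa.parents


def find_overlaps(sources: dict[str, str]) -> list[tuple[str, str]]:
    """Return pairs of planned data source names whose paths overlap.

    `sources` maps a hypothetical data source name to its planned path.
    """
    return [
        (name_a, name_b)
        for (name_a, path_a), (name_b, path_b) in combinations(sorted(sources.items()), 2)
        if paths_overlap(path_a, path_b)
    ]


# Hypothetical plan: "sales_eu" sits inside "sales", so the pair is reported.
planned = {
    "sales": "/data/sales",
    "sales_eu": "/data/sales/eu",
    "hr": "/data/hr",
}
print(find_overlaps(planned))
```

A data source planned at the root path (`/`) overlaps with every other path in the same file system, which is one more reason to prefer narrower data source paths and to use virtual folders for finer-grained scoping.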

You are now ready to start building your Data Catalog.

Building your Data Catalog

The following general steps describe how to build your Data Catalog.

Step 1: Add data sources

Adding data sources is the first step towards building your Data Catalog. Data sources are the building blocks of your catalog. In this step, you connect the different data sources in your data lake, including HDFS, Hive, Oracle, Teradata, Redshift, and Snowflake, whether they are on premises or hosted in the cloud.

See Manage data sources for how to add data sources to the Lumada Data Catalog.

Step 2: Create virtual folders

Data Catalog allows admins to create virtual folders from groups of resources belonging to the same data source. These can then be delegated to the target role group for data analysis and cataloging. In this way, Data Catalog employs its own authorization layer over the native layer using role-based access control (RBAC).

See Manage virtual folders for more information about creating virtual folders.
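The planning guidelines describe a virtual folder as a path combined with include and exclude patterns. The following sketch illustrates that idea for planning purposes only. It is not Data Catalog code: the function name and sample paths are hypothetical, and the patterns use shell-style globbing from Python's `fnmatch` module as an assumption. Check the product documentation for the pattern syntax that Data Catalog actually accepts.

```python
from fnmatch import fnmatchcase


def in_virtual_folder(resource, root, include=("*",), exclude=()):
    """Return True if `resource` would fall inside the planned virtual folder.

    The resource must sit under `root`, match at least one include pattern,
    and match no exclude pattern. Patterns are applied to the path relative
    to `root` (glob syntax is an assumption made for this sketch).
    """
    root = root.rstrip("/")
    if resource != root and not resource.startswith(root + "/"):
        return False
    relative = resource[len(root):].lstrip("/")
    included = any(fnmatchcase(relative, pattern) for pattern in include)
    excluded = any(fnmatchcase(relative, pattern) for pattern in exclude)
    return included and not excluded


# Hypothetical virtual folder: EU sales data, without temporary files.
folder = {"root": "/data/sales", "include": ("eu/*",), "exclude": ("*.tmp",)}
print(in_virtual_folder("/data/sales/eu/orders.csv", **folder))   # True
print(in_virtual_folder("/data/sales/eu/scratch.tmp", **folder))  # False
```

Two virtual folders defined this way can share the same root path and still expose different resources, which matches the guideline that virtual folder paths may overlap when include and exclude patterns are used.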

Step 3: Add tag domains and tag glossary

Depending on the workflows carried out by the different admins, you may want to create functional tag domains and any parent tags that are commonly used. These tags can also be created by the regional admins.

See Managing tag domains for tag domain and tag management. See Getting started with tags and tag propagation for information about tags, different types of tags, and tag propagation.

Step 4: Add custom roles

If the default roles in Data Catalog are too rigid for your application, you can create custom roles with custom-defined functions using the role-based access control tools. For example, you may want to create a regional administrator role that is not permitted to run jobs, or a divisional data steward role with metadata level resource access.

See Managing roles for more information about creating and managing custom roles.
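Before creating custom roles, it can be useful to write the plan down and check it for gaps. The sketch below is a planning aid only and does not use any Data Catalog API. The `RolePlan` class, its fields, and the role and virtual folder names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RolePlan:
    """One planned custom role (all fields are illustrative)."""
    name: str
    can_run_jobs: bool
    virtual_folders: frozenset[str] = field(default_factory=frozenset)


def check_role_plan(roles, known_folders):
    """Return a list of human-readable problems found in the role plan."""
    problems = []
    assigned = set()
    for role in roles:
        # A role scoped to a folder that does not exist is probably a typo.
        for folder in sorted(role.virtual_folders - known_folders):
            problems.append(f"{role.name}: unknown virtual folder {folder!r}")
        assigned |= role.virtual_folders
    # A folder that no role can reach will not be analyzed by anyone.
    for folder in sorted(known_folders - assigned):
        problems.append(f"no role is scoped to virtual folder {folder!r}")
    return problems


# Hypothetical plan: a regional admin who cannot run jobs, and a divisional steward.
roles = [
    RolePlan("regional_admin", can_run_jobs=False, virtual_folders=frozenset({"emea_sales"})),
    RolePlan("division_steward", can_run_jobs=True, virtual_folders=frozenset({"emea_sales", "emea_hr_typo"})),
]
for problem in check_role_plan(roles, known_folders={"emea_sales", "emea_hr"}):
    print(problem)
```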

Step 5: Add users

After you have set up the roles and their virtual folders, it is time to add the users. A best practice is to add at least the divisional or departmental admins and delegate their share of the virtual folders. These admins can then add their own stewards and analysts.

See Manage users for more information about managing users.

Step 6: Profile data

If you decide to delegate job profiling to your functional admins, you may need to create job templates with custom Data Catalog or system parameters. However, a best practice is to perform the initial profiling of your data lake as the Data Catalog service user. With this approach, Lumada Data Catalog is fully functional when users start their data analysis and cataloging work.

See Managing jobs for more information on profiling jobs. See Managing job templates for creating and managing job templates and sequences.

Step 7: Run tag discovery

Assuming that you have defined seed tags and regex tags for data that needs to be identified in the Data Catalog, now is the time to run the initial tagging discovery job. The tagging discovery job looks for data that matches the built-in tags and user-defined value tags and regex tags, then populates the glossary with the discovered suggestions. See Getting started with tags and tag propagation for more information about tags. To learn how to run the tagging discovery job, see Managing jobs.

Step 8: Tag curation and learning

After the catalog glossary is populated with the discovered tag suggestions, any user, especially a user with an analyst role, can curate the suggestions by accepting the desired tags and rejecting the undesired ones.

Depending on how the user-defined value tags and regex tags are defined, the discovered tag suggestions may include false positives, and some matching data may be missed (false negatives). In such cases, data stewards can update the regex definitions or reassign seed values, then re-run the tagging discovery job. Repeating this cycle helps Data Catalog learn and improves the quality of tag discovery.
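One way to reduce these iterations is to try a candidate regex against a small set of labeled sample values before updating the tag definition. The following is a minimal sketch of that check. It uses Python's `re` module and full-string matching as assumptions; the regex dialect and matching behavior of Data Catalog regex tags may differ. The function name, the patterns, and the sample values are hypothetical.

```python
import re


def evaluate_regex(pattern, labeled_values):
    """Compare a candidate regex against labeled sample values.

    `labeled_values` maps a sample value to True if the tag should apply.
    Returns (false_positives, false_negatives) as two lists of values.
    """
    compiled = re.compile(pattern)
    false_positives, false_negatives = [], []
    for value, should_match in labeled_values.items():
        matched = compiled.fullmatch(value) is not None
        if matched and not should_match:
            false_positives.append(value)
        elif should_match and not matched:
            false_negatives.append(value)
    return false_positives, false_negatives


# Hypothetical samples for a US ZIP code tag.
samples = {
    "12345": True,
    "12345-6789": True,
    "123456": False,
    "ABCDE": False,
}

# A loose first attempt accepts a six-digit value and misses the ZIP+4 form.
print(evaluate_regex(r"\d{5,}", samples))          # (['123456'], ['12345-6789'])
# A tightened definition classifies all four samples correctly.
print(evaluate_regex(r"\d{5}(-\d{4})?", samples))  # ([], [])
```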

Steps 7 and 8 can be repeated multiple times until your users are satisfied with the discovered tag quality.

Next steps: You are now ready to use the Data Catalog. For more information, see Data Catalog overview.