Hitachi Vantara Lumada and Pentaho Documentation

Plan and build your Data Catalog

As a data steward, you can start planning and building Data Catalog for your data analysts to use.

Planning your Data Catalog

With tens or even hundreds of data sources in your enterprise data lake catering to the curiosities and needs of thousands of users, it is helpful to plan your data catalog before building it with Data Catalog.

The following is a guideline for planning your Data Catalog:

  • Plan data sources to be added along with their path details

    One of the ways Data Catalog achieves data security is by controlling which virtual folders each role can access. Data Catalog does not allow data sources with overlapping paths. Virtual folder paths, however, can overlap when you use include and exclude patterns.

  • Plan custom roles

    The default Data Catalog roles (Administrator, Steward, Analyst, and Guest) have defined behavior with access to all virtual folders in the Data Catalog. To assign specific virtual folders to a role, you can emulate the default roles in custom roles, which give finer control over role-based accessibility and job management. To plan your custom roles, see Role-based access control. You may also want to plan how many admin users are required to manage your data lake. Based on your organizational structure, you may want an admin for each business function, division, or region.

  • Plan tag domains and tag glossary

    Tags are labels that users can attach to data or resources to mark a particular data pattern or resource that contributes to their business value. These tags can then be automatically propagated throughout the data lake to identify similar data patterns. Tags can be grouped into tag domains with a multi-level hierarchy and assigned to roles to perform data analysis. The actions that can be performed with tags differ by role. For example, admins are granted the most control, while guest users are limited to viewing tags and tag suggestions. Tag domains also help with the separation of business domains: users in one business function cannot see tags from another business function unless those tags are assigned to them.

  • Plan job management functions

    Plan which roles are allowed to run jobs. Data Catalog lets any role with the job profiling permission enabled run profiling jobs as basic sequences with default attributes. However, for resources that require custom system or Data Catalog attributes, an admin user may need to create job templates that apply these custom attributes when profiling such resources.

A basic plan that follows the previous guidelines can help you build your Data Catalog efficiently.
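Because Data Catalog does not allow data sources with overlapping paths, it can help to check a planned list of data source paths before you add them. The following is a minimal planning sketch, not part of Data Catalog. The source names and paths are hypothetical, and it assumes all paths are POSIX-style paths on the same file system.

```python
from itertools import combinations
from pathlib import PurePosixPath


def paths_overlap(a: str, b: str) -> bool:
    """Return True if the paths are equal or one is an ancestor of the other."""
    pa, pb = PurePosixPath(a), PurePosixPath(b)
    return pa == pb or pa in pb.parents or pb in pa.parents


def overlapping_sources(sources: dict[str, str]) -> list[tuple[str, str]]:
    """List pairs of planned data source names whose root paths overlap."""
    return [
        (name1, name2)
        for (name1, path1), (name2, path2) in combinations(sorted(sources.items()), 2)
        if paths_overlap(path1, path2)
    ]


# Hypothetical planning worksheet: data source name -> root path.
planned_sources = {
    "hdfs_finance": "/data/finance",
    "hdfs_finance_raw": "/data/finance/raw",
    "hdfs_sales": "/data/sales",
}

# Reports the finance pair, because /data/finance contains /data/finance/raw.
print(overlapping_sources(planned_sources))
```

Any pair reported by this check is a candidate for merging into one data source and splitting with virtual folders instead.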

You are now ready to start building your Data Catalog.

Building your Data Catalog

The following are the general steps for building your Data Catalog.

Step 1: Add data sources

Adding data sources is the first step towards building your Data Catalog. They are the building blocks in configuring your catalog. Here you have the opportunity to connect the different data sources in your data lake, including HDFS, Hive, Oracle, Teradata, Redshift, and Snowflake, whether they are on premises or hosted in the cloud.

See Manage data sources for how to add data sources to the Lumada Data Catalog.

Step 2: Create virtual folders

Data Catalog allows admins to create virtual folders from groups of resources belonging to the same data source. These can then be delegated to the target role group for data analysis and cataloging. In this way, Data Catalog employs its own authorization layer over the native layer using role-based access control (RBAC).

See Manage virtual folders for more information about creating virtual folders.

Step 3: Add tag domains and tag glossary

Depending on the workflows carried out by the different admins, you may want to create functional tag domains and any parent tags that are commonly used. These tags can also be created by the regional admins.

See Managing tag domains for tag domain and tag management. See Tags and tag propagation for information about tags, different types of tags, and tag propagation.

Step 4: Add custom roles

If the default roles in Data Catalog are too rigid for your application, you can create custom roles with custom-defined functions using the role-based access control tools. For example, you may want to create a regional administrator role that is not permitted to run jobs, or a divisional data steward role with metadata level resource access.

See Managing roles for more information about creating and managing custom roles.
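Before creating custom roles, it can be useful to write the role plan down and check that every virtual folder is delegated to at least one custom role. The following is a planning sketch only. The `PlannedRole` fields, role names, folder names, and access level labels are hypothetical and do not reflect Data Catalog's actual role model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedRole:
    """One row of a hypothetical role planning worksheet."""
    name: str
    virtual_folders: frozenset[str]
    can_run_jobs: bool
    access_level: str  # hypothetical labels: "metadata" or "data"


def unassigned_folders(folders: set[str], roles: list[PlannedRole]) -> set[str]:
    """Return the virtual folders that no planned role has access to."""
    assigned = set().union(*(role.virtual_folders for role in roles))
    return folders - assigned


# Hypothetical plan mirroring the examples in the text above.
planned_roles = [
    PlannedRole("regional_admin_emea", frozenset({"emea_sales", "emea_finance"}),
                can_run_jobs=False, access_level="data"),
    PlannedRole("division_steward_finance", frozenset({"emea_finance"}),
                can_run_jobs=True, access_level="metadata"),
]
planned_folders = {"emea_sales", "emea_finance", "apac_sales"}

# apac_sales is not covered by any planned role in this example.
print(unassigned_folders(planned_folders, planned_roles))
```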

Step 5: Add users

After you have set up the roles and their virtual folders, it is time to add the users. A best practice is to add at least the divisional or departmental admins and delegate their share of the virtual folders. These admins can then add their own stewards and analysts.

See Manage users for more information about managing users.

Step 6: Profile data

If you decide to delegate job profiling to the functional admins, you may need to create job templates with custom Data Catalog or system parameters. However, a best practice is to perform the initial profiling of your data lake as the Data Catalog service user. With this method, your Lumada Data Catalog is fully functional when users start their data analysis and cataloging work.

See Managing jobs for more information on profiling jobs. See Managing job templates for creating and managing job templates and sequences.

Step 7: Run tag discovery

Assuming that you have defined seed tags and regex tags for data that needs to be identified in the Data Catalog, now is the time to run the initial tag discovery job. The tag discovery job looks for data that matches the built-in tags, user-defined value tags, and regex tags, and populates the catalog glossary with the discovered suggestions. See Tags and tag propagation for more information.

Step 8: Tag curation and learning

After the catalog glossary is populated with the discovered tag suggestions, any user, especially those users with an analyst role, can then curate these tags by accepting desired tags or rejecting undesired ones.

Depending on the definition of the user-defined value tags and regex tags, tag discovery may suggest tags that do not apply (false positives) or miss data that should have been tagged (false negatives). In such cases, the data stewards update the regex definitions or reassign seed values and re-run tag discovery. Tag curation requires user interaction to modify tag definitions and re-run tag discovery in order to help Data Catalog learn tag discovery.

Steps 7 and 8 can be repeated multiple times until the user is satisfied with the discovered tag quality.
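When refining a regex tag definition between discovery runs, it can help to test candidate patterns against a small set of hand-labeled values first. The following sketch uses Python's `re` module outside of Data Catalog. The sample values and patterns are hypothetical, and the regex dialect that Data Catalog uses may differ from Python's, so treat the result as a rough check only.

```python
import re


def evaluate_regex(pattern: str,
                   labeled_values: list[tuple[str, bool]]) -> dict[str, list[str]]:
    """Compare full-string regex matches against the expected labels."""
    compiled = re.compile(pattern)
    false_positives, false_negatives = [], []
    for value, should_match in labeled_values:
        matched = compiled.fullmatch(value) is not None
        if matched and not should_match:
            false_positives.append(value)
        elif not matched and should_match:
            false_negatives.append(value)
    return {"false_positives": false_positives,
            "false_negatives": false_negatives}


# Hypothetical labeled samples for a US ZIP code tag: (value, should it match?).
samples = [
    ("94105", True),
    ("94105-1234", True),
    ("12345678", False),
    ("9410", False),
]

loose = evaluate_regex(r"\d{5,}", samples)          # first attempt
tight = evaluate_regex(r"\d{5}(-\d{4})?", samples)  # refined definition
print(loose)
print(tight)
```

In this example the first pattern both over-matches and under-matches the samples, while the refined pattern classifies all four values as labeled.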

Next steps: You are now ready to use the Data Catalog. For more information, see Data Catalog overview.