Hitachi Vantara Lumada and Pentaho Documentation

Plan and build your Data Catalog

As a data steward, you can plan and build a Data Catalog for your data analysts to use.

Planning your Data Catalog

An enterprise data environment can contain tens or even hundreds of data sources serving thousands of users, so it helps to plan your data catalog before you build it with Lumada Data Catalog.

Use the following guidelines to plan your Data Catalog:

  • Plan data sources to be added along with their path details

    One way Data Catalog secures data is by controlling which virtual folders each role can access. For example, if you add a data source at the root path, controlling access to specific content under that path for every role is difficult. In this case, the Data Catalog admin relies on virtual folders, defined by a path and include/exclude patterns, for access control. Note that Data Catalog does not allow data sources with overlapping paths. Virtual folder paths, however, can overlap when include and exclude patterns are used.

  • Plan custom roles

    The default Data Catalog role 'global_administrator' is predefined with permission to create the custom roles that your organization requires. Permissions are grouped into predefined tiers: Administrator, Steward, and Business (which includes the Analyst and Guest roles). You can create custom roles with a tailored set of permissions and scope them to specific virtual folders.

    To assign virtual folders to a role, create custom roles, which give finer control over role-based access and job management. To plan your custom roles, see Role-based access control (RBAC). You may also want to plan how many admin users are required to manage your data environment. Based on your organizational structure, you may want an admin for each business function, division, or region.

  • Plan business glossaries and business terms

    Business terms are tags that users with permission can attach to data or resources to mark a particular data pattern or resource that contributes to their business value. These terms can then be automatically propagated throughout the data environment to identify similar data patterns. Terms can be grouped into glossaries with a multi-level hierarchy and assigned to roles to perform data analysis. The actions that can be performed with terms depend on the permissions assigned to each role. For example, admins are granted the most control, while guest users may be limited to viewing terms and term suggestions. Glossaries also help separate business domains. For example, users from one business function may not be able to see terms from another business function unless they are assigned permission to view them.

  • Plan job management functions

    Plan which roles are allowed to run jobs. Any role with the job profiling permission enabled can run profiling jobs as basic sequences with default attributes. However, for resources that require custom system or Data Catalog attributes, an admin user may need to create job templates that apply these custom attributes when profiling such resources.

A basic plan based on the previous guidelines helps you build your Data Catalog efficiently.
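The path guidelines above can be checked on paper before anything is configured. The following is a minimal planning sketch, not a Data Catalog API: the function names are hypothetical, paths are treated as POSIX-style strings, and shell-style wildcards from Python's `fnmatch` module stand in for include/exclude patterns, whose actual syntax in Data Catalog may differ.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def paths_overlap(a: str, b: str) -> bool:
    """Return True if the two paths are equal or one contains the other."""
    pa, pb = PurePosixPath(a), PurePosixPath(b)
    return pa == pb or pa in pb.parents or pb in pa.parents


def find_overlaps(source_paths: dict[str, str]) -> list[tuple[str, str]]:
    """List every pair of planned data sources whose paths overlap.

    `source_paths` maps a planned data source name to its path
    (both are planning-sheet values, assumed for illustration).
    """
    names = sorted(source_paths)
    return [
        (first, second)
        for i, first in enumerate(names)
        for second in names[i + 1:]
        if paths_overlap(source_paths[first], source_paths[second])
    ]


def in_virtual_folder(resource: str, root: str,
                      include: tuple[str, ...] = ("*",),
                      exclude: tuple[str, ...] = ()) -> bool:
    """Decide whether a resource would fall inside a planned virtual folder.

    Patterns are matched against the resource path relative to `root`.
    A resource is in scope if it matches any include pattern and no
    exclude pattern (an assumed rule for this sketch).
    """
    res, base = PurePosixPath(resource), PurePosixPath(root)
    if res != base and base not in res.parents:
        return False
    relative = res.relative_to(base).as_posix()
    included = any(fnmatchcase(relative, p) for p in include)
    excluded = any(fnmatchcase(relative, p) for p in exclude)
    return included and not excluded
```

Running `find_overlaps` over the planned source list flags pairs such as a root path and any path beneath it, which are the cases the guideline warns about. `in_virtual_folder` lets you test whether two virtual folders that share a path still separate content cleanly through their patterns.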

You are now ready to start building your Data Catalog.

Building your Data Catalog

Follow these general steps to build your Data Catalog.

Step 1: Add data sources

Adding data sources is the first step towards building your Data Catalog. Data sources are the building blocks of your catalog. In this step, you connect the different data sources in your data lake, including HDFS, Hive, Oracle, Teradata, Redshift, and Snowflake, whether they are on premises or hosted in the cloud.

See Manage data sources for how to add data sources to the Lumada Data Catalog.

Step 2: Create virtual folders

Data Catalog allows admins to create virtual folders from groups of resources belonging to the same data source. These can then be delegated to the target role group for data analysis and cataloging. In this way, Data Catalog employs its own authorization layer over the native layer using role-based access control (RBAC).

See Manage virtual folders for more information about creating virtual folders.

Step 3: Add glossaries and business terms

Depending on the workflows carried out by the different admins, you may want to create functional business glossaries and parent terms that are commonly used. These terms can also be created by the regional admins.

See Managing business glossaries for glossary and business term management. See Getting started with business terms and term propagation for information about terms, different types of terms, and term propagation.

Step 4: Add custom roles

If the default roles in Data Catalog are too rigid for your application, you can create custom roles with custom-defined functions using the role-based access control tools. For example, you may want to create a regional administrator role that is not permitted to run jobs, or a divisional data steward role with metadata level resource access.
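Before creating roles in the product, it can help to write the role plan down as data and check it for gaps. The sketch below is only a planning aid under stated assumptions: `PlannedRole`, `validate_plan`, the `can_run_jobs` flag, and the folder names are all hypothetical and do not correspond to Data Catalog objects or permission names. Only the three tier names come from the planning guidelines.

```python
from dataclasses import dataclass

# Tier names taken from the planning guidelines.
VALID_TIERS = {"Administrator", "Steward", "Business"}


@dataclass(frozen=True)
class PlannedRole:
    """One row of a role planning sheet (hypothetical structure)."""
    name: str
    tier: str
    can_run_jobs: bool
    virtual_folders: frozenset[str] = frozenset()


def validate_plan(roles: list[PlannedRole],
                  known_folders: set[str]) -> list[str]:
    """Return a list of human-readable problems found in the role plan."""
    problems = []
    seen_names = set()
    for role in roles:
        if role.name in seen_names:
            problems.append(f"{role.name}: duplicate role name")
        seen_names.add(role.name)
        if role.tier not in VALID_TIERS:
            problems.append(f"{role.name}: unknown tier '{role.tier}'")
        if not role.virtual_folders:
            problems.append(f"{role.name}: no virtual folders assigned")
        # Flag folders that are referenced but not yet planned.
        for folder in sorted(role.virtual_folders - known_folders):
            problems.append(f"{role.name}: unknown virtual folder '{folder}'")
    return problems
```

For instance, a regional administrator that must not run jobs would be recorded as `PlannedRole("regional_admin", "Administrator", False, frozenset({"emea_sales"}))`, and an empty list from `validate_plan` indicates that every role has a valid tier and refers only to virtual folders already in the plan.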

See Managing roles for more information about creating and managing custom roles.

Step 5: Add users

After you have set up the roles and their virtual folders, it is time to add the users. Users are added and assigned roles using the Keycloak identity provider. A best practice is to add the division or department admins and then delegate the applicable virtual folders to those admins. They can then add their own stewards and analysts.

See Manage users for more information about managing users.

Step 6: Profile data

If you decide to delegate job profiling to your functional admins, you may need to create job templates with custom Data Catalog or system parameters. However, a best practice is to perform the initial profiling of your data lake as the Data Catalog service user. With this approach, your Lumada Data Catalog is fully functional when users start their data analysis and cataloging.

See Managing jobs for more information on profiling jobs. See Managing job templates for creating and managing job templates and sequences.

Step 7: Run business term discovery

Assuming that you have defined seed terms and regular expression (regex) terms for the data that needs to be identified in the Data Catalog, now is the time to run the initial business term discovery job. The business term discovery job looks for data that matches the built-in terms, user-defined value terms, and regex terms, then populates the glossary with the discovered suggestions. See Getting started with business terms and term propagation for more information about terms. To learn how to run the business term discovery job, see Managing jobs.

Step 8: Business term curation and learning

After Data Catalog glossaries are populated with the discovered business term suggestions, any user with permission, especially those users with an analyst role, can then curate by accepting desired terms or rejecting undesired ones.

Depending on how the user-defined value terms and regex terms are defined, the suggestions may include false positives or miss expected matches (false negatives). In such cases, data stewards update the regex definitions or reassign seed values, then re-run the business term discovery job. Repeating this cycle helps Data Catalog learn term discovery and makes term curation more efficient.
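A regex definition can be sanity-checked offline against a few hand-labeled values before term discovery is re-run. The sketch below uses Python's standard `re` module and does not reflect how Data Catalog performs discovery internally. The function name, the sample values, and the ZIP code patterns are assumptions chosen for illustration.

```python
import re

# Hand-labeled sample values: (value, should_the_term_match).
SAMPLES = [
    ("12345", True),
    ("12345-6789", True),
    ("123456", False),
    ("98765", True),
    ("1234", False),
]


def evaluate_regex_term(pattern: str,
                        labeled_values: list[tuple[str, bool]]) -> dict[str, list[str]]:
    """Sort labeled values into true positives, false positives, and false negatives.

    A value counts as matched only if the whole value matches the pattern.
    """
    compiled = re.compile(pattern)
    result = {"true_positive": [], "false_positive": [], "false_negative": []}
    for value, expected in labeled_values:
        matched = compiled.fullmatch(value) is not None
        if matched and expected:
            result["true_positive"].append(value)
        elif matched:
            result["false_positive"].append(value)
        elif expected:
            result["false_negative"].append(value)
    return result


# A loose first attempt and a refined definition for comparison.
loose = evaluate_regex_term(r"\d{5,}", SAMPLES)
refined = evaluate_regex_term(r"\d{5}(-\d{4})?", SAMPLES)
```

With these samples, the loose pattern accepts the six-digit value and misses the hyphenated one, while the refined pattern produces neither kind of error. A non-empty `false_positive` or `false_negative` list is a signal to adjust the definition before the next discovery run.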

Steps 7 and 8 can be repeated multiple times until your users are satisfied with the discovered term quality.

Next steps: You are now ready to use the Data Catalog. For more information, see Data Catalog user features.