Hitachi Vantara Lumada and Pentaho Documentation

Data Catalog assets

Lumada Data Catalog manages and represents data through its own logical data units. These units are described briefly here.

Virtual folders

Use virtual folders to create smaller groups of resources belonging to a data source for easier management. Data resources can be members of multiple folders, so you can create folders with overlapping sets of data resources.

Collections

When you have a growing set of data that extends across multiple files in Lumada Data Catalog, you can view the data as a single resource known as a collection.

When files are added to one of the directories identified as part of the collection, Lumada Data Catalog must run schema/profile discovery before the collection reflects the newly added data.

When a folder becomes a collection, the files inside the folder no longer appear individually in search results. Instead, search results show a single representation of all the files. The collection can be made up of files in a single folder, or of files in many folders that all sit under a single top-level folder.
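
The effect on search results can be illustrated with a minimal Python sketch. This is not Lumada Data Catalog code: the function name `collapse_into_collections` and the sample paths are hypothetical, and the sketch only models the idea that every file under a collection's top-level folder is represented by a single entry.

```python
from pathlib import PurePosixPath


def collapse_into_collections(paths, collection_roots):
    """Return search entries, showing each collection once instead of its files.

    Hypothetical helper for illustration only; not a Data Catalog API.
    """
    roots = [PurePosixPath(r) for r in collection_roots]
    results = []
    for p in paths:
        path = PurePosixPath(p)
        # A file belongs to a collection if a collection root is one of its ancestors.
        owner = next((r for r in roots if r in path.parents), None)
        entry = str(owner) if owner is not None else str(path)
        if entry not in results:
            results.append(entry)
    return results


files = [
    "/lake/sales/2023/jan.csv",
    "/lake/sales/2023/feb.csv",
    "/lake/sales/2024/jan.csv",
    "/lake/hr/employees.csv",
]
# "/lake/sales" is treated as a collection; its files span several subfolders.
search_results = collapse_into_collections(files, ["/lake/sales"])
print(search_results)
```

In this sketch the three sales files collapse into one entry for the collection, while the file outside the collection still appears individually.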

Datasets

Datasets in Data Catalog let users group resources that have the same schema but span different folders in your data lake into a single virtual unit for easier management.

Datasets can be considered user-defined virtual collections: their members share a matching schema but may have different path specifications or hierarchies in the physical layout of your data lake, irrespective of their data source type.
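
As a rough illustration of grouping by schema rather than by location, the following Python sketch buckets resources whose field lists match. The resource records, paths, and the `group_by_schema` helper are assumptions made for this example; they do not reflect how Data Catalog stores or compares schemas.

```python
from collections import defaultdict

# Hypothetical resources: different folders and file types, some sharing a schema.
resources = [
    {"path": "/lake/eu/orders.csv",
     "schema": ("order_id", "customer_id", "amount")},
    {"path": "/archive/us/orders_2022.parquet",
     "schema": ("order_id", "customer_id", "amount")},
    {"path": "/lake/hr/employees.csv",
     "schema": ("employee_id", "name")},
]


def group_by_schema(resources):
    """Group resource paths by their schema, ignoring physical location."""
    groups = defaultdict(list)
    for resource in resources:
        groups[resource["schema"]].append(resource["path"])
    return dict(groups)


candidates = group_by_schema(resources)
# Resources with the same schema can form one dataset, whatever their paths.
orders_dataset = candidates[("order_id", "customer_id", "amount")]
print(orders_dataset)
```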

Data Objects

Data Objects are virtual resources that users create by defining a set of join conditions between resources. Say a user has access to multiple resources: one tabulating employee contact information, another tabulating employee skills, and yet another tabulating employee assignments to particular projects. This user wants to gather the contact information and skill sets of all employees assigned to a given project. The Data Objects feature enables this user to build such a resource by defining join conditions between these resources.
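
The employee scenario above can be sketched with ordinary SQL joins. The example below uses Python's built-in `sqlite3` module purely as a stand-in; the table names, column names, and sample rows are hypothetical, and Data Catalog's own way of defining join conditions is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Three hypothetical resources: contacts, skills, and project assignments.
conn.executescript("""
CREATE TABLE contacts (employee_id INTEGER, name TEXT, email TEXT);
CREATE TABLE skills (employee_id INTEGER, skill TEXT);
CREATE TABLE assignments (employee_id INTEGER, project TEXT);
INSERT INTO contacts VALUES (1, 'Ana', 'ana@example.com'),
                            (2, 'Ben', 'ben@example.com');
INSERT INTO skills VALUES (1, 'SQL'), (2, 'Spark');
INSERT INTO assignments VALUES (1, 'Apollo'), (2, 'Zephyr');
""")

# Join conditions link the three resources on employee_id.
rows = conn.execute(
    """
    SELECT c.name, c.email, s.skill
    FROM assignments AS a
    JOIN contacts AS c ON c.employee_id = a.employee_id
    JOIN skills AS s ON s.employee_id = a.employee_id
    WHERE a.project = ?
    ORDER BY c.name
    """,
    ("Apollo",),
).fetchall()
print(rows)
conn.close()
```

The result combines contact details and skills for everyone assigned to the chosen project, which is the kind of combined view a Data Object is meant to provide.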

Custom Properties

Lumada Data Catalog lets business users create Custom Properties that collect additional metadata about resources, specific to their business environment or engagement. For example, you could define a custom property that holds a business user name for a resource. Alternatively, you could define a property whose values are used by system-level processes.

These properties can be grouped into Custom Property Groups based on their business value or category, and can be moved individually or in bulk across Custom Property Groups. Custom Property Groups can in turn be used as custom facets in the search results through search dimensions.

Search Dimensions and Custom Facets

The Search Dimensions feature gives the admin control over which facets are visible in an end user's search results. Once search dimensions are defined for a particular role, users with that role see their search results categorized by those dimensions.

For example, the admin may want to limit the search results for an analyst role to categories such as Rating, Last Profiled Time, Resource Tag, Virtual Folder, and a custom facet, Claims, that pertains to the analyst role's business function. Any search results shown to an analyst user are then faceted only by the dimensions the admin has set: Virtual Folder, Resource Tag, Last Profiled Time, Rating, and Claims.
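
The analyst example can be modelled as a simple role-to-facets mapping. This Python sketch is illustrative only: `SEARCH_DIMENSIONS`, `facets_for`, and the sample facet counts are hypothetical, and showing all facets to a role with no dimensions defined is an assumption of this sketch, not documented product behavior.

```python
# Hypothetical admin configuration: search dimensions per role.
SEARCH_DIMENSIONS = {
    "analyst": ["Virtual Folder", "Resource Tag", "Last Profiled Time",
                "Rating", "Claims"],
}


def facets_for(role, result_facets):
    """Keep only the facets that the role's search dimensions allow."""
    allowed = SEARCH_DIMENSIONS.get(role)
    if allowed is None:
        # Assumption for this sketch: no dimensions defined means no filtering.
        return dict(result_facets)
    return {name: result_facets[name] for name in allowed if name in result_facets}


# Facets computed for some search, with made-up counts per facet value.
result_facets = {
    "Rating": {"5 stars": 3},
    "Owner": {"etl_user": 7},
    "Claims": {"open": 2},
    "Virtual Folder": {"finance": 4},
}
analyst_view = facets_for("analyst", result_facets)
print(list(analyst_view))
```

Here the "Owner" facet is dropped for the analyst because it is not among the dimensions configured for that role.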

Job templates and sequences

Templates are predefined job definitions that administrators create to run specific job sequences on specific clusters. When creating a template, you can pass system or Spark-specific parameters as command-line arguments for the job sequences, such as driver memory, executor memory, or the number of threads required for a given cluster size. You can also override default ldc script parameters, for example to profile Collections as a single resource or to profile the last partition. These templates can then be promoted to the end users who handle such clusters, enabling efficient job execution.

Sequences are Lumada Data Catalog's job sequences that are exposed in the UI so that users with job execution privileges can run them. Note, however, that these sequences run with Lumada Data Catalog's default parameters, which the user cannot override when using the Sequence option.

Rules engine

Lumada Data Catalog's rules engine supports rule-based tagging, selective custom property updates, and conditional metadata processing by letting users define SQL-like rules that trigger actions under specific data or metadata conditions. There are two types of rules: the Data Rule, which operates on the data of the resources in the Catalog, and the Metadata Rule, which operates on their metadata. Either type can drive subsequent tag associations, property updates, or conditional processing decisions.
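
The distinction between the two rule types can be sketched in Python. This is a conceptual model only: it does not use Data Catalog's SQL-like rule syntax, and the resource structure, rule functions, and tag names are all hypothetical.

```python
import re

# Hypothetical resource: metadata describes it, rows are its data.
resource = {
    "metadata": {"name": "claims_2023.csv", "field_names": ["claim_id", "ssn"]},
    "rows": [
        {"claim_id": 1, "ssn": "123-45-6789"},
        {"claim_id": 2, "ssn": ""},
    ],
}

SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")


def metadata_rule(res):
    """Metadata rule: looks only at what is recorded about the resource."""
    return "ssn" in res["metadata"]["field_names"]


def data_rule(res):
    """Data rule: inspects the actual values stored in the resource."""
    return any(SSN_PATTERN.match(row["ssn"]) for row in res["rows"])


def apply_rules(res, rules):
    """Return the tags whose rule condition holds for the resource."""
    return [tag for tag, condition in rules if condition(res)]


rules = [("has_ssn_field", metadata_rule), ("contains_ssn_values", data_rule)]
tags = apply_rules(resource, rules)
print(tags)
```

A metadata rule can be evaluated without reading any rows, whereas a data rule has to examine the values themselves; both end in the same kind of action, here a tag association.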