Hitachi Vantara Lumada and Pentaho Documentation

Data Catalog assets

Lumada Data Catalog provides data management and data representation with its own logical data entities, including the following assets:

Virtual folders

Use virtual folders to create smaller groups of resources belonging to a data source for easier management. Data resources can be members of multiple folders, so you can create folders with overlapping sets of data resources.
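
The overlapping membership described above can be pictured with a small sketch. This is only an illustration in plain Python, not the Data Catalog API: the folder names, resource paths, and the `folders_containing` helper are all hypothetical.

```python
# Hypothetical model: a virtual folder is a named set of resource paths
# taken from one data source. None of these names come from Data Catalog.
virtual_folders = {
    "finance": {"/lake/sales/2023.csv", "/lake/claims/q1.csv"},
    "audit": {"/lake/claims/q1.csv", "/lake/hr/staff.csv"},
}


def folders_containing(resource):
    """Return the sorted names of every virtual folder that includes the resource."""
    return sorted(
        name for name, members in virtual_folders.items() if resource in members
    )


# A resource can belong to more than one folder, so folders may overlap.
overlap = virtual_folders["finance"] & virtual_folders["audit"]
print(folders_containing("/lake/claims/q1.csv"))
print(overlap)
```

Modeling each folder as a set keeps membership checks cheap and makes the overlap between two folders a plain set intersection.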

Collections

When you have a growing set of data that extends across multiple files in Lumada Data Catalog, you can view the data as a single resource known as a collection.

When files are added to one of the directories identified as part of the collection, Lumada Data Catalog needs to run schema and profile discovery to reflect the newly added data in the collection.

When a folder becomes a collection, the files inside the folder no longer appear individually in search results. Instead, search results show a single representation of all the files. The collection can be made up of files in a single folder or files in many folders all under a single top-level folder.
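
As a rough sketch of that search behavior, the following Python snippet collapses every file under a collection's top-level folder into a single result entry. The paths and the `search_results` function are assumptions made for illustration; they do not reflect how Data Catalog implements search.

```python
from pathlib import PurePosixPath


def search_results(files, collection_roots):
    """Collapse files under a collection root into one entry; keep other files as-is."""
    roots = [PurePosixPath(r) for r in collection_roots]
    results = []
    for f in files:
        path = PurePosixPath(f)
        # A file belongs to a collection if the root is one of its ancestors,
        # at any depth below the top-level folder.
        root = next((r for r in roots if r in path.parents), None)
        entry = str(root) if root is not None else str(path)
        if entry not in results:
            results.append(entry)
    return results


# Hypothetical data lake layout.
files = [
    "/lake/events/2023/01/part-0.csv",
    "/lake/events/2023/02/part-0.csv",
    "/lake/events/2024/01/part-0.csv",
    "/lake/reference/countries.csv",
]
results = search_results(files, ["/lake/events"])
print(results)
```

Here the three files under the hypothetical `/lake/events` folder are represented once, while the unrelated file still appears individually.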

Datasets

Datasets are single virtual units that group resources sharing the same schema, even when those resources span different folders in your data lake. Creating datasets in Data Catalog makes these resources easier to manage.

You might consider datasets as user-defined virtual collections: their members have matching schemas but can have different path specifications or hierarchies with respect to physical location in your data lake, irrespective of their data source type.
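
To make the "same schema, different locations" idea concrete, here is a minimal Python sketch that finds resources with identical column lists across unrelated paths. In Data Catalog the user decides which resources form a dataset; this snippet only shows how candidates could be identified. The resource list and the `group_by_schema` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical resources: (path, ordered column names). The paths sit in
# different folders and could come from different data source types.
resources = [
    ("/lake/eu/orders/2023.csv", ("order_id", "customer_id", "amount")),
    ("/lake/us/orders/2023.csv", ("order_id", "customer_id", "amount")),
    ("/warehouse/customers", ("customer_id", "name")),
]


def group_by_schema(items):
    """Map each schema (tuple of column names) to the paths that share it."""
    groups = defaultdict(list)
    for path, columns in items:
        groups[columns].append(path)
    return dict(groups)


# Only schemas shared by more than one resource are dataset candidates.
candidates = {
    schema: paths
    for schema, paths in group_by_schema(resources).items()
    if len(paths) > 1
}
print(candidates)
```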

Data objects

Data objects are virtual resources created by users who define a set of join conditions between resources. Users may have access to multiple resources which they can join together as a data object to find specific information easily.

For example, a user may have access to a resource for tabulating employee contact information, another resource for tabulating employee skills, and a third resource for tabulating employee assignments for a specified project. If this user wants to gather contact information and skillset information for all employees assigned to a specific project, then they can create a data object to build a resource by defining join conditions between all three of these resources.
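
The employee example above amounts to a three-way join. The sketch below reproduces it with Python's built-in `sqlite3` module purely to show what the join conditions express; the table names, column names, and sample rows are invented for illustration, and Data Catalog's own interface for defining data objects is not shown.

```python
import sqlite3

# In-memory database standing in for three separate resources.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE contacts (employee_id INTEGER, name TEXT, email TEXT);
    CREATE TABLE skills (employee_id INTEGER, skill TEXT);
    CREATE TABLE assignments (employee_id INTEGER, project TEXT);
    INSERT INTO contacts VALUES (1, 'Ana', 'ana@example.com'), (2, 'Ben', 'ben@example.com');
    INSERT INTO skills VALUES (1, 'SQL'), (2, 'Spark');
    INSERT INTO assignments VALUES (1, 'Apollo'), (2, 'Gemini');
    """
)

# Join conditions: all three resources are linked on employee_id,
# then filtered to a single project.
rows = conn.execute(
    """
    SELECT c.name, c.email, s.skill
    FROM assignments AS a
    JOIN contacts AS c ON c.employee_id = a.employee_id
    JOIN skills AS s ON s.employee_id = a.employee_id
    WHERE a.project = ?
    ORDER BY c.name, s.skill
    """,
    ("Apollo",),
).fetchall()
print(rows)
```

The result contains contact and skill information only for employees assigned to the hypothetical project `Apollo`.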

Custom properties

Custom properties collect additional metadata about resources specific to a business user's environment or engagement. For example, you could define a custom property to include a business user's name for a resource. Or, you could define a property that includes values that are used by system-level processes.

Additionally, you can group these properties together in custom property groups based on their business value or category. Custom properties can be moved individually or in bulk across custom property groups. You can then use custom property groups as custom facets in the search results with search dimensions.

Search dimensions and custom facets

As the admin, use search dimensions to control the visibility of facets in the search results for an end user. When a search dimension is defined for a specified role, the users with that role can then see the search results categorized by the search dimensions defined for that role.

For example, the admin can limit the search results for the Analyst role to the categories Rating, Resource Tag, Virtual Folder, and the custom facet Claims, which is specific to business users. Analyst users then see their search results faceted only by those four dimensions.
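
The Analyst example can be sketched as a simple role-based filter over facets. This is an illustrative model in Python, not the product's configuration format: the `search_dimensions` mapping, the `visible_facets` function, the `Owner` facet, and the facet counts are all hypothetical. Returning no facets for a role without search dimensions is an assumption of this sketch.

```python
# Hypothetical mapping, as an admin might configure it: role -> search dimensions.
search_dimensions = {
    "Analyst": ["Rating", "Resource Tag", "Virtual Folder", "Claims"],
}


def visible_facets(role, all_facets):
    """Keep only the facets that are search dimensions for the given role."""
    # Assumption of this sketch: a role with no dimensions sees no facets.
    allowed = search_dimensions.get(role, [])
    return {name: values for name, values in all_facets.items() if name in allowed}


# Hypothetical facets computed for one search, with per-value hit counts.
all_facets = {
    "Rating": {"5 stars": 12},
    "Resource Tag": {"PII": 4},
    "Virtual Folder": {"finance": 7},
    "Claims": {"open": 3},
    "Owner": {"jdoe": 9},
}
analyst_view = visible_facets("Analyst", all_facets)
print(sorted(analyst_view))
```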

Job templates and sequences

Job templates are predefined by administrators to run specific job sequences on specific clusters. A job template passes system or Spark-specific parameters to the job sequence as command-line arguments, such as driver memory, executor memory, or the number of threads required for a given cluster size. A template can also override the default Data Catalog parameters. For example, you can set the incremental profile to false, profile a collection as a single resource, or force a full profile instead of the default sampling option.

Contact your system administrator to determine the template that is best suited for your data cluster.

Sequences are the job sequences in Lumada Data Catalog that users with the proper permissions can execute. These jobs always run with the default parameters; the Sequence option does not let you override them.
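
The difference between the two options can be summarized in a short Python sketch: a sequence runs with defaults only, while a template layers overrides and Spark arguments on top of them. Every name here (`DEFAULTS`, the parameter keys, `run_sequence`, `run_template`, the memory values) is hypothetical, and the default values are inferred from the description above rather than taken from the product.

```python
# Hypothetical defaults, inferred from the text: incremental profiling on,
# collections not profiled as a single resource, sampling instead of full profile.
DEFAULTS = {
    "incremental": True,
    "collection_as_single_resource": False,
    "full_profile": False,
}


def run_sequence(name):
    """A sequence always uses the default parameters; no overrides are accepted."""
    return {"sequence": name, "params": dict(DEFAULTS)}


def run_template(name, overrides, spark_args):
    """A template merges its overrides over the defaults and adds Spark arguments."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {
        "sequence": name,
        "params": {**DEFAULTS, **overrides},
        "spark_args": dict(spark_args),
    }


plain = run_sequence("profile")
tuned = run_template(
    "profile",
    overrides={"incremental": False, "full_profile": True},
    spark_args={"driver_memory": "4g", "executor_memory": "8g"},
)
print(plain["params"])
print(tuned["params"], tuned["spark_args"])
```

Rejecting unknown override keys is a design choice of this sketch; it keeps a mistyped parameter from being silently ignored.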

Rules engine

With Lumada Data Catalog's rules engine, you can define, execute, and manage tag-based rules. These rules can evaluate data and metadata properties to add tags, remove tags, modify custom properties on data assets, and generate reports.

Users define SQL-like rules that trigger selective actions when specific data or metadata conditions are met. Data rules operate on the data of the resources in Data Catalog, and metadata rules operate on their metadata. With both kinds, you define the conditions and then associate tags or update properties.
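
The condition-then-action shape of a tag rule can be sketched in Python as follows. This does not use or imitate Data Catalog's SQL-like rule syntax, which is not shown in this article. The `Resource` and `TagRule` classes, the `apply_rule` function, the `row_count` metadata key, and the tag names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Resource:
    name: str
    metadata: dict
    tags: set = field(default_factory=set)


@dataclass
class TagRule:
    # Condition evaluated against each resource (here: its metadata).
    condition: Callable[[Resource], bool]
    add_tags: frozenset = frozenset()
    remove_tags: frozenset = frozenset()


def apply_rule(rule, resources):
    """Apply the rule in place and return the names of matching resources as a small report."""
    matched = []
    for resource in resources:
        if rule.condition(resource):
            resource.tags |= rule.add_tags
            resource.tags -= rule.remove_tags
            matched.append(resource.name)
    return matched


# Hypothetical metadata rule: tag empty resources and drop their "verified" tag.
rule = TagRule(
    condition=lambda r: r.metadata.get("row_count") == 0,
    add_tags=frozenset({"empty"}),
    remove_tags=frozenset({"verified"}),
)
resources = [
    Resource("orders", {"row_count": 0}, {"verified"}),
    Resource("customers", {"row_count": 120}, {"verified"}),
]
matched = apply_rule(rule, resources)
print(matched, [sorted(r.tags) for r in resources])
```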