Managing datasets

This article discusses how to manage datasets if you are a non-administrative user with permissions to manage properties for datasets and data objects in Lumada Data Catalog.

After administrators create datasets, they can delegate management of the dataset properties and features to non-admin roles. Non-admin users with the required permissions can manage the following dataset properties:

  • Name
  • Description
  • Path specification
  • Reported Schema

They can also perform the following tasks:

  • Add resource
  • Add tag
  • View as list
  • Run profiling and discovery jobs

If you have permissions to manage datasets, then the Data Catalog dashboard displays a Manage menu in the menu bar. You can update properties, view member properties, and add or delete member resources.

Updating dataset properties

You can update the name, description, path specifications, and include/exclude patterns of an existing dataset in Lumada Data Catalog.

Manage Dataset [Selected Dataset]

Name and description updates do not alter the profiling information, but changes to the path specification and include/exclude patterns can alter the dataset metadata. Such changes affect only resources added after the update, and path validation is likewise applied only to those resources. Existing dataset resources are not affected.

Settings tab

In Lumada Data Catalog, you can update dataset names by selecting the Settings tab and entering the new name as described in the following table.

Field | Description
Dataset Name | Data Catalog identifies a dataset by the unique name entered in this field. The name must begin with a letter and contain only alphanumeric characters, hyphens, and underscores.
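
The naming rule above can be checked before you type a name into the Settings tab. The following is a minimal sketch, not part of Data Catalog: the function name `is_valid_dataset_name` is hypothetical, and the pattern assumes that "letter" and "alphanumeric" mean ASCII characters only. It does not check uniqueness, which only Data Catalog itself can verify.

```python
import re

# Assumption: ASCII only. A name begins with a letter, followed by any
# number of letters, digits, hyphens, or underscores.
_DATASET_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_-]*")


def is_valid_dataset_name(name: str) -> bool:
    """Return True if the name satisfies the documented character rules."""
    return _DATASET_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["sales_2023-q1", "1sales", "sales data"]:
        print(candidate, is_valid_dataset_name(candidate))
```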

Path specifications tab

In Lumada Data Catalog, you can select the Path Specifications tab to update path specifications as described in the following table.

Field | Description
Path Specification | Enter the source path with the include/exclude patterns to create the qualifying template for the dataset.
Source Path | Enter the absolute path of the virtual folder that will be a part of this dataset. This path becomes the template against which all new resources added to the dataset are compared. New dataset resources must belong to this source path or to a subset of the source path. Note: The new source path must conform to the original virtual folder path specification.
Include Pattern | Specify the regex pattern for the resources in the virtual folder that you want to include in this dataset.
Exclude Pattern | Specify the regex pattern for the resources under the Source Path specification that you want to exclude from this dataset.
Multiple Path Specifications | Include one or more path specifications for the dataset as long as the paths belong to the same virtual folder specified in the Virtual Folder field. When paths belong to the same virtual folder, resources across the virtual folder source can be added to the dataset. Different combinations of include and exclude patterns make it possible to include or exclude specific types of resources.
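
To reason about which resources a set of path specifications would accept, it can help to model the rules outside the product. The sketch below is an illustration only: `PathSpec` and `qualifies` are hypothetical names, and it assumes that the include and exclude regexes are searched against the full resource path. How Data Catalog applies the patterns internally is not stated here, so treat the result as an approximation.

```python
import re
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import List, Optional


@dataclass
class PathSpec:
    """Hypothetical stand-in for one row of the Path Specifications tab."""
    source_path: str
    include: Optional[str] = None  # regex; None means "include everything"
    exclude: Optional[str] = None  # regex; None means "exclude nothing"

    def matches(self, resource_path: str) -> bool:
        resource = PurePosixPath(resource_path)
        source = PurePosixPath(self.source_path)
        # The resource must be the source path itself or sit below it.
        if resource != source and source not in resource.parents:
            return False
        # Assumption: patterns are searched against the full resource path.
        if self.include and not re.search(self.include, resource_path):
            return False
        if self.exclude and re.search(self.exclude, resource_path):
            return False
        return True


def qualifies(resource_path: str, specs: List[PathSpec]) -> bool:
    """A resource qualifies if at least one path specification accepts it."""
    return any(spec.matches(resource_path) for spec in specs)


if __name__ == "__main__":
    specs = [PathSpec("/data/sales", include=r"\.csv$", exclude=r"/tmp/")]
    for path in ["/data/sales/2023/a.csv", "/data/sales/tmp/a.csv"]:
        print(path, qualifies(path, specs))
```

Comparing whole path components rather than string prefixes keeps a sibling folder such as `/data/sales_old` from being mistaken for a subfolder of `/data/sales`.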

Reported schema tab

In Data Catalog, you can update the reported schema on the Reported Schema tab. A reported schema is a user-defined schema that represents the expected schema for the resources in the dataset.

The reported schema is currently used only for display convenience. Data Catalog does not validate the member resources against the reported schema; it only lists the reported schema alongside the discovered schema of the member resources in a single-file view for the dataset.
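
Because no validation is performed, any comparison between the reported and discovered schemas is up to you. The following sketch shows one way to do that by hand. It assumes you have copied both schemas into plain dictionaries that map field names to data type names; `compare_schemas` is a hypothetical helper, not a Data Catalog function.

```python
from typing import Dict, List


def compare_schemas(reported: Dict[str, str],
                    discovered: Dict[str, str]) -> Dict[str, List]:
    """Summarize how a discovered schema differs from the reported one."""
    # Fields the reported schema expects but the resource does not have.
    missing = sorted(f for f in reported if f not in discovered)
    # Fields the resource has that the reported schema does not mention.
    unexpected = sorted(f for f in discovered if f not in reported)
    # Fields present in both, but with different data types.
    type_mismatch = sorted(
        (f, reported[f], discovered[f])
        for f in reported
        if f in discovered and reported[f] != discovered[f]
    )
    return {
        "missing": missing,
        "unexpected": unexpected,
        "type_mismatch": type_mismatch,
    }


if __name__ == "__main__":
    reported = {"id": "integer", "name": "string", "price": "float"}
    discovered = {"id": "long", "name": "string", "qty": "integer"}
    print(compare_schemas(reported, discovered))
```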

You can update an existing field or define a new field by using the following features:

  • Edit/Create Schema Field dialog box

    Specify the field name and expected data type.

  • Custom Data Type field

    In addition to standard built-in data types (integer, string, float, Boolean, byte, short, long, and double), you can add custom data type options to support user-defined data types required for specific applications.

For each schema field in the Reported Schema tab, you can click the Action icon to select from the following menu options:

  • Select Edit to redefine the name, label, data type, and description of a schema field.
  • Select Insert field above or Insert field below to insert schema fields in an existing schema.
  • Select Delete to delete a schema field.

Add member resources to a dataset

A dataset can have multiple source path specifications. Each source path specification can have an include pattern and an exclude pattern, which allow resources from different virtual folders and formats to be combined as a logical collection, or dataset.

Perform the following steps to add a resource to a dataset:

Procedure

  1. Navigate to Manage Datasets.

  2. Click the Action menu (icon) and select Add resource from the drop-down menu.

  3. Enter the absolute path of the resource and click Add.

Results

Before adding a resource, Data Catalog validates its path against the path specification template for the dataset. If the resource path matches one of the source paths in the specification template, the resource is added to the dataset. Otherwise, a corresponding error message appears.

At this point, Data Catalog does not check that the resource exists. If a resource does not exist in your data lake, it can still be added to the dataset provided it satisfies the path specifications.

Note: Although Data Catalog does not initially check whether a resource exists, the check is performed and handled when the dataset is profiled. If applicable, an error is logged in the log file (/var/log/waterlinedata/wd-ui.log).

View dataset member resources

As with Collections, you can view the member resources of a dataset by selecting View as list from the Action menu (icon). The member resources of a dataset can be found in the following three locations in Lumada Data Catalog:

  • The Manage Datasets menu
  • The Browse Datasets list
  • Dataset single resource view (SRV)

Perform the following steps to view member resources in a dataset.

Procedure

  1. Click the Action menu (icon) and select View as list to navigate to a page displaying the list of the member resources for that dataset.

  2. (Optional) Click the member resource for the single resource view.

  3. (Optional) Click Filter to filter this view by member.

Delete dataset member resources

You can delete member resources from datasets.

Perform the following steps to delete a member resource from a dataset:

Procedure

  1. Navigate to the single resource view of the dataset.

  2. Click the Action menu (icon) and select Remove from dataset in the drop-down menu.

    The member resource is removed from the dataset.

Next steps

After you delete a member resource, re-profile the dataset so that it reflects the deletion, as you would for any other resource in Lumada Data Catalog.