
Utility jobs

You can use the Data Catalog utility tools to update dashboards and to troubleshoot discovery and metadata details. Extended profiling generates resource metadata that is used for discoveries and resides in the discovery cache. Use the following tools to troubleshoot specific data in the discovery cache for qualifying discovery operations, to update dashboards, and to resolve metadata inconsistencies.

Caution: Because these tools may expose specific information from Data Catalog's discovery cache, they are strictly for administrator use and should not be exposed to general users.

To open Tools from any page, select the Tools icon in the left toolbar of the navigation pane. The Tools page opens.

Tools include:

Note: Tools are only available if you have administrator permissions. These utilities are generally used with the LDC Agent component.
  • Maintaining the discovery cache (DiscoveryCacheCompactor)

    Removes unused MapFile directories and merges small MapFile directories. It is a distributed replacement for RepoCacheMaintenance.

  • Run a Data Rationalization job

    Reports details about duplicated and overlapping data. Use this utility to populate the Data Rationalization dashboard.

  • Encrypt text

    Use this tool to encrypt your plain text values for use in a Data Catalog configuration.

  • Utils

    You can perform the following actions on CSV files:

The output from some of these utilities can be large. As a best practice, redirect the console output to a temporary file.

Caution: These utilities may expose actual data in their output. For example, some of these utilities expose actual top K values without masking them. Use caution to make sure your data is not exposed beyond its intended audience.

Maintaining the discovery cache

As a best practice, you should maintain the discovery cache by removing unused MapFile directories and merging small MapFile directories. The Data Catalog discovery cache is a repository containing large properties of entities generated during profiling and term propagation. Large properties are used in term propagation, lineage discovery, and in some instances regular expression evaluation.

The location and content of the discovery cache are controlled by configuration settings made in the Configurations user interface page. To change a configuration setting, go to Management and click View Configuration on the Configuration card. Then enter a keyword in the search box to search for a specific configuration setting.

  • URI for discovery cache metadata store

    Defines the URI of the discovery cache metadata files. Its default value is s3a://ldc-discovery-cache. You should configure large properties only on the default agent.

  • Relative location for a large properties metadata store

    Defines the location relative to the large properties URI for a large properties metadata store. Its default value is "/".

You should change the URI and location to fit your environment, and both should be accessible by the users running jobs.
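
For example, in an HDFS-backed environment, these settings might be set to values similar to the following (the host name and path shown are illustrative):

URI for discovery cache metadata store: hdfs://namenode.example.com:8020/user/ldcuser/.ldc_hdfs_metadata
Relative location for a large properties metadata store: /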

Important: The properties above must be adjusted during Data Catalog installation. If these properties change after installation, re-run discovery profiling starting with extended profiling.

The large properties of data resources and term propagation information in the discovery cache are stored in the resource_lp and tag_lp directories.

Important: Do not manually remove the resource_lp and tag_lp directories.

To view the metadata for term and large properties, enter the following command:

$ hdfs dfs -ls /user/ldcuser/.ldc_hdfs_metadata/

The output is similar to the following example:

$ hdfs dfs -ls /user/ldcuser/.ldc_hdfs_metadata/
Found 3 items
drwxr-xr-x  - ldcuser ldcuser    0 2021-09-28 05:25 /user/ldcuser/.ldc_hdfs_metadata/Built-in_Terms
drwxr-xr-x  - ldcuser ldcuser    0 2021-09-28 05:45 /user/ldcuser/.ldc_hdfs_metadata/resource_lp
drwxr-xr-x  - ldcuser ldcuser    0 2021-09-28 05:45 /user/ldcuser/.ldc_hdfs_metadata/tag_lp
$ 

Each directory contains the map files with generated unique names.

Note: When you install Data Catalog for the first time, you should use the following command to manually clean out these directories so you can start with an empty repository:

$ hdfs dfs -rm -R /user/ldcuser/.ldc_hdfs_metadata

The following configuration setting controls compression:

  • Minimum number of objects in discovery cache MapFile

    If a MapFile contains fewer objects than this number, its objects are relocated during discovery cache compression. The default is 50.

Compress the discovery cache

Perform the following steps to use the cache compactor to remove unused MapFile directories and to merge small MapFile directories.
Note: As a best practice, configure at least 6 GB each of driver memory and executor memory for the cache compactor.

Procedure

  1. Navigate to the default agent and enter the following command to see the command parameters:

    ./ldc compact -help

    The syntax is: ldc compact -numPartitions [number of partitions] -remove [false/true] -compact [false/true]

    where:

    • -numPartitions

      (Optional) Number of partitions. The default value is 10.

    • -remove

      (Optional) If set to true, unreferenced discovery cache files are removed. If set to false, unreferenced discovery cache files are reported, but not removed. The default value is true.

    • -compact

      (Optional) If set to true, small or partially referenced discovery cache map file directories are merged according to the configuration settings for the percentage of discovery cache unreferenced objects in one MapFile and the minimum number of objects in a discovery cache MapFile. If set to false, the directories are reported, but not merged. The default value is true.

  2. Enter a command with the parameters you want to specify.
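
    For example, to report unreferenced discovery cache files without removing them, while still merging small MapFile directories, you might enter a command similar to the following (the partition count shown is illustrative):

    ./ldc compact -numPartitions 20 -remove false -compact true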

Results

The submitted job is executed per the parameters you specified.

Run a Data Rationalization job

You must run a Data Rationalization job to gather all the necessary information for populating the Data Rationalization dashboard. Resources are analyzed by their overlap ratio, which is the amount of overlap divided by the cardinality. An overlap ratio of 1.0 means the resources are a 100 percent match.

Before you run a Data Rationalization job, you must first run the applicable Data Profiling, Format Discovery, and Schema Discovery jobs on your data. Also, allocate at least 6 GB of memory for the executors and driver for this job.

Perform the following steps to run a Data Rationalization job:

Procedure

  1. Open the Data Canvas page.

  2. Select the data sources that you want to investigate.

  3. Click the Action menu and select Process.

  4. Click Data Rationalization.

  5. Add any of the following parameters and values to the Command line text box according to your requirements.

    Note: All parameters are optional. If a value is not specified, the default value is used.
    • -firstDataSource

      Specifies the first data source to use for overlap analysis. If not defined, all data sources are analyzed.

    • -secondDataSource

      Specifies the second data source to use for overlap analysis. If the second data source is not defined, all data sources are analyzed.

    • -incremental

      Specifies whether to perform incremental analysis. Set to true to perform overlap analysis for data sources only if the analysis has not been done before or requires an update. Set to false to redo the overlap analysis. The default is true.

    • -reprocess

      Specifies whether to reprocess data resources. Set to true to perform overlap analysis for data resources that have already been processed. Set to false to perform overlap analysis only for resources that were re-profiled after the last analysis. The -incremental parameter must be set to false for this parameter to take effect. The default is false.

    • -fth_copy

      Specifies the overlap ratio threshold for assessing a field as a copy. The formula is (1.0 - cardinality ratio < -fth_copy value). The default is 0.1.

    • -fth_overlap

      Specifies the precision used to cut off accidental field overlaps. The field is considered an overlap if both the source and target overlap ratios are greater than this value. The default is 0.2.

    • -rth_copy

      Specifies the precision required to detect whether resources are copies. Resources are assessed as copies if the ratio of the matching fields in the compared resources is less than this value. The default is 0.1.

    • -rth_overlap

      Specifies the precision in overlap relationships needed to cut off accidental resource overlaps. An overlap relationship is denoted between resource1 and resource2 only if max([resource1 overlapped fields count]/[resource1 fields count], [resource2 overlapped fields count]/[resource2 fields count]) > [this value]. The default is 0.32.

    • -same_semantics

      Specifies whether to perform semantic analysis. Set to true to perform overlap analysis only for pairs of source and target fields that have the same internally discovered semantic attributes, such as temporal, free text, numeric, or string. To extend the scope of fields compared for overlap analysis, set this value to false. Setting this value to false may increase the accuracy of your overlap analysis, but may also decrease performance. The default is true.

    • -exclude_field_list

      Specifies a list of field names to exclude from overlap analysis.
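
    For example, to force a full re-analysis between two specific data sources, you might enter parameters similar to the following in the Command line text box (the data source names are illustrative):

    -firstDataSource SalesDB -secondDataSource StagingDB -incremental false
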
  6. Click Save.

Results

The job runs and displays on the Job Activity page.

Encrypt text

You can use the Encrypt tool to encrypt plain text values in a Lumada Data Catalog configuration. Some use cases for encryption might be for API keys, Kubernetes secrets, or discovery cache attributes. Perform the following steps to encrypt text:

Procedure

  1. In the left-hand navigation pane, click the Tools icon.

  2. Under Encryption Tool, enter the text you want to encrypt in the Plain Text field and click Encrypt.

    The encrypted result displays in the Encrypted result field. Plain text and encrypted text enclosed with enc(…) are interchangeable.
  3. (Optional) You can copy the encrypted text by clicking the Copy Result icon.

    You can use the result in place of any plain text value, for example, for large properties such as AWS secret keys (see the example after this procedure). You do not need to paste encrypted text into data source passwords, because passwords are automatically encrypted when they are created or updated.
  4. (Optional) You can clear the plain text value by clicking Reset.
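
For example, a configuration setting that expects an AWS secret key can be given either the plain-text key or its encrypted form. The encrypted value is supplied in the enc(…) wrapper, similar to the following placeholder (the actual ciphertext is generated by the Encrypt tool):

enc(<encrypted-value>)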

Lineage export and import

You can export lineages discovered by Data Catalog and import user-defined lineages with the Lineage – Import/Export tool. You can also use this tool to delete lineages that were exported as a CSV file.

The lineage relationships, both resource level and field level, are exported in CSV format in a pre-defined structure. Lineage import also requires that user-defined lineages be submitted in this same pre-defined format. Lineage exports and imports are based on lineage targets, not sources.

When limiting lineage export to a specific virtual folder or path, all lineages are exported to defined targets in the virtual folder or path.

Export-import CSV structure

Data Catalog exports lineages in a pre-defined fixed format. The following list describes the columns of this fixed format:

  • Column A, external_id (default: system generated): (Required) Unique ID of the entity represented by the CSV line, operation, or operation execution.
  • Column B, external_source_name (default: "LDC"): Name of the external source. Use LDC for lineages created manually in Data Catalog.
  • Column C, target_data_source_name (default: ""): Name of the target data source.
  • Column D, target_resource_path (default: ""): Target data resource path.
  • Column E, lineage_type (default: ""): Lineage type (options: INFERRED, INFERRED_ALTERNATIVE, FACTUAL, IMPORTED, HDFS2HIVE, OTHER).
  • Column F, lineage_kind (default: ""): Lineage kind (options: COPY, PARTIAL_COPY, JOIN, UNION, UNION_PART, MASK, ENCRYPT, STANDARDIZE, CALCULATION, SUBSET, SUPERSET, HIVE_EXTERNAL, OTHER).
  • Column G, lineage_exec_type (default: ""): Lineage operation level (options: lineage_exec_type_resource, lineage_exec_type_field).
  • Column H, target_resource_field (default: ""): Target field.
  • Column I, resource_lineage_reference (default: ""): GUID of the operation execution entity.
  • Column J, principal (default: ""): Principal or lineage creator as group:user.
  • Column K, source_data_source_name (default: ""): Source data source name.
  • Column L, source_resource_path (default: ""): Source data resource path.
  • Column M, source_resource_field (default: ""): Source field.
  • Column N, lineage_state (default: ""): Lineage state (options: ACCEPTED, REJECTED, SUGGESTED, IMPORTED).
  • Column O, description (default: ""): Lineage description.
  • Column P, code (default: ""): Transformation code.
  • Column Q, operation_reference (default: ""): GUID of the Operation entity.
  • Column R, operation_type (default: "operation_execution"): Lineage entity type (options: operation, operation_execution).
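
As an illustration only, a field-level lineage row laid out in this column order might look similar to the following (all IDs, names, paths, and values are hypothetical):

lin-0001,LDC,SalesDB,/sales/orders.csv,FACTUAL,COPY,lineage_exec_type_field,order_id,opx-0001,analysts:jdoe,StagingDB,/staging/orders_raw.csv,order_id,ACCEPTED,Copied during nightly load,order_id = order_id,op-0001,operation_execution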

For every lineage you export, the details for the lineage operation include the operation execution details.

The following sample is a portion of the CSV file exported:

CSV export example

The following details apply to the selected sample of the CSV export:

  • Row 75: Identifies the lineage and shows it was REJECTED.
  • Row 76: Lists all the information in the DETAILS panel for the lineage. It identifies lineage type, kind, status, description, and code.
  • Rows 77 – 94: Lists the field level lineage details for the selected lineage operation. The Code column (Col P) identifies the source field to target field mapping found in the View Mapping or Code section under the DETAILS tab.

The operation_reference values (Column Q in this example) for all rows following the lineage identifier row (Row 75 per this example) are the same and represent the external_id of the lineage operation.

The following lineage types can occur:

  • Inferred

    Primary lineage discovered by Data Catalog. The AI engine identifies one primary lineage based on proprietary algorithms.

  • Inferred alternative

    Lineage discovered by Data Catalog. If more than one source is a candidate for primary lineage, the AI engine marks these inferred lineages as Inferred_alternative.

  • Factual

    Any lineage added manually.

  • Imported

    Any lineage imported from third-party applications such as Apache Atlas.

  • HDFS2HIVE

    HDFS to HIVE lineage identified as part of schema discovery.

Export a lineage

Perform the following steps to export lineages:

Procedure

  1. On the Home page from the left-side menu bar, click Tools.

    The Tools page opens.
  2. Click the Lineage – Import/Export tab.

    The Lineage – Import/Export Tool opens.
  3. Click Export.

    The export-specific fields appear.
  4. In Virtual Folder Selection, select the virtual folder to limit the export target.

    The default value is All of the virtual folders.
  5. Specify the following parameter in the Enter Parameters text box if you want to apply it to the export:

    • -path

      (Optional) Limit the lineage export to this path.

    The specified parameters appear in the Command Summary section.
  6. Click Submit.

Results

Data Catalog exports the CSV file to the exports directory under the local agent. The exportLineageToCSV job (run through /agent/bin/ldc) also executes, as shown in the following example with the -path parameter specified:

./ldc exportLineageToCSV -virtualFolder vf -path pathtoDataEntity

Save exported lineage to your local machine

The ExportLineage command exports the CSV file to the exports directory under the local agent. If you want to save the exported lineage to your local machine, perform the following steps:

Procedure

  1. On the Home page from the left-side menu bar, click Management.

    The Manage Your Environment page opens.
  2. Select Job Activity under Job Management.

    The Job Activity window opens.
  3. Expand the exportLineageToCSV activity by clicking the down arrow on the activity row, then click the right arrow.

    The Instance Steps dialog box appears.
  4. Click Download Supported Files.

Results

The exported CSV file is saved to your local machine.

Import a lineage

Perform the following steps to import lineages:

Procedure

  1. On the Home page from the left-side menu bar, click Tools.

    The Tools page opens.
  2. Click the Lineage – Import/Export tab.

    The Lineage – Import/Export Tool opens.
  3. Click Import.

    The import-specific fields appear.
  4. Specify the location of the CSV file in Upload File.

  5. (Optional) Select Undo to delete all defined lineages from the virtual folder associated with the Upload File. Otherwise, deselect Undo to retain all defined lineages and insert the imported lineage.

    The Undo option is deselected by default.

    Note: The use of parameters in the Enter Parameters field is not currently supported.
  6. Click Submit.

Results

The submitted job is executed according to the parameters you specified. The ImportLineage command imports the lineage into Data Catalog.
Note: S-Fit and T-Fit values under View Field Mapping are available for imported lineages.

Next steps

See Lineage best practices for best practices for updating existing lineages.

Lineage best practices

You should adhere to the following best practices when updating existing lineages in Data Catalog.

  • You can update existing lineages by exporting them to a CSV file, curating the lineages in the CSV file, then importing the transformed lineages back into Data Catalog.
  • When updating lineages, do not manipulate the external_id, lineage_exec_type, resource_lineage_reference, or operation_reference fields. Errors occur when any of these fields are modified.
    • The external_ids are generated by Data Catalog and are unique to the system.
    • The lineage_exec_type field identifies resource lineage versus field lineage.
    • The resource_lineage_reference and operation_reference fields depend on the external_id field.
  • The principal, description, and code fields are descriptive. No validations are performed on the content of these fields while importing lineages.
  • The lineage_type and lineage_kind fields are also considered descriptive when importing lineages. No validations are performed on the user updates to these fields.
  • Validations are performed on any changes to the target_data_source_name field, the source_data_source_name field, the target path, the target field, the resource path, the resource field, and the lineage_state. Changes to resource names as part of path name changes are not supported.
  • Data Catalog does not offer field curation.

Removing lineages

You can also use the Lineage – Import/Export tool to remove lineages from Data Catalog. When you select Undo for Import, the import process removes all lineages specified in the import CSV file.

Model - Import/Export tool

You can use the Model - Import/Export tool, a migration utility that functions like a data model mapper, to export and import entities according to the applicable environment:

  • In environments with different data mappings, you can export or import the basic definitions of Data Catalog assets or the data model.
  • In environments with matched data mappings, you can export or import most of the asset metadata.

The following metadata can be exported and imported:

  • Data sources
  • User roles
  • Glossaries
  • Business terms
  • Term associations
  • Virtual folders
  • Custom properties

Exporting metadata

You can export metadata from Data Catalog and save the metadata in a CSV format.

The entities are exported to a file in CSV format with the following file name convention:

exported_<entity>.csv

For example:

exported_termassociations.csv
exported_customproperties.csv
exported_glossaries.csv
exported_folders.csv
exported_userroles.csv
exported_sources.csv
exported_businessterms.csv

Export metadata

Follow the steps below to export metadata in a CSV format:

Procedure

  1. On the Home page from the left-side menu bar, click Tools.

    The Tools page opens.
  2. Select the Model – Import / Export tab.

    The Model – Import / Export Tool page opens.
  3. Click Export.

    The export-specific fields appear.
  4. Select the asset type of the exported metadata.

    The default value is All of the types.
  5. In the Enter Parameters text box, enter any optional command line parameters, such as [--num-executor], [--driver-memory] or [--executor-memory], that you want to apply to the export.

    The specified parameters appear in the Command Summary section.
  6. Click Submit.

Results

The entities are exported to a file in the CSV format to the exports directory under the local agent.

Importing metadata

You can import metadata that has been saved in the pre-defined Data Catalog CSV file format.

To import user roles, data sources, and term associations, you must maintain the following import-export compatibility:

  • Use only the CSV metadata files that were exported earlier from Data Catalog.
  • Use the same version of the Model - Import/Export Tool that was used for exporting the metadata files.

Metadata file structures

The data in each CSV file follows a fixed column structure. When you are importing metadata into Data Catalog, a specific import sequence is required due to internal dependencies. For example, if you are importing business terms, make sure the required glossary is available in Data Catalog. If not, import the glossary first and then import the business terms. These dependencies are described for each of the metadata types below.

Data sources

When you are importing data sources, make sure to create and connect agents, and profile the data before importing any term associations.

The data source import file should have the following columns in this order:

  • DATASOURCE_NAME
  • DATASOURCE_DESC
  • DATASOURCE_TYPE
  • DATASOURCE_PATH
  • DATASOURCE_URL
  • DATASOURCE_USER
  • DATASOURCE_PASSWORD
  • DATASOURCE_JDBC_DRIVER
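
For example, a data source row might look similar to the following (all names and connection details are illustrative, and the password can be supplied as an encrypted enc(…) value):

SalesDB,Sales transactions database,JDBC,,jdbc:postgresql://dbhost.example.com:5432/sales,ldc_user,enc(<encrypted-password>),org.postgresql.Driver
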
User roles

The user roles import file should have the following columns in this order:

  • USER_ROLE_NAME
  • USER_ROLE_DESC
  • USER_POLICY_NAMES
  • VIRTUAL_FOLDER_NAMES
  • GLOSSARY_NAMES
  • METADATA_ACCESS
  • DATA_ACCESS
  • IS_DEFAULT_ROLE
  • IS_ALLOW_JOB_EXECUTION
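
For example, a user role row might look similar to the following (the role, policy, folder, and glossary names are illustrative, and the access and flag values shown are assumptions about the expected formats):

Data Steward,Curates sales metadata,Steward Policy,Sales Folder,Sales Glossary,true,false,false,true
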
Glossaries

The glossaries import file should have the following columns in this order:

  • GLOSSARY_NAME
  • GLOSSARY_DESC
  • GLOSSARY_COLOR
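
For example, a glossary row might look similar to the following (the name, description, and color value are illustrative):

Sales,Business terms for the sales domain,#FF9900
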
Business terms

Before importing any business terms, make sure you have created or imported the corresponding glossary.

The business terms import file should have the following columns in this order:

  • GLOSSARY_NAME
  • PARENT_LOGICAL_PATH
  • TERM_NAME
  • TERM_DESCRIPTION
  • TERM_METHOD
  • TERM_REGEX
  • TERM_MINSCORE
  • TERM_MINLEN
  • TERM_MAXLEN
  • TERM_ENABLED
  • TERM_LEARNING_ENABLED
  • EXTERNAL_ID
  • EXTERNAL_SOURCE_NAME
  • TERM_SYNONYMS
  • ENTITY
  • ANCHOR
  • CREATE_RESOURCE_TERM_ASSOCIATION
  • DEFAULT_OR_CUSTOM
  • EXPRESSION

You can use the term name to show hierarchy in business terms by using dot notation for naming the business term, as shown in the following example:

1. "Sales","Orders.Jan.Week1","Tag for Orders",VALUE,,0.0,0,0,true,true,,,""

The example above identifies a business term Week1 with the parent business term Jan and grandparent business term Orders under the Sales glossary.

Term associations

Before importing any term associations, make sure you have profiled the virtual folder and have imported the corresponding business terms and glossary.

The term associations import file should have the following columns in this order:

  • GLOSSARY_NAME
  • TERM_NAME
  • DATASOURCE_NAME
  • RESOURCE_NAME
  • FIELD_NAME
  • TERM_PROPAGATION
  • TERM_ASSOCIATION_STATE
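
For example, a term association row might look similar to the following (all names are illustrative, and the propagation and state values shown are assumptions about the expected formats):

Sales,Orders.Jan.Week1,SalesDB,orders.csv,order_date,true,ACCEPTED
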
Virtual folders

Before importing CSV files for virtual folders, you must create or import the respective parent data source.

The virtual folders import file should have the following columns in this order:

  • VIRTUAL_FOLDER_NAME
  • VIRTUAL_FOLDER_DESC
  • VIRTUAL_FOLDER_PARENT_NAME
  • VIRTUAL_FOLDER_DATASOURCE
  • VIRTUAL_FOLDER_PATH
  • VIRTUAL_FOLDER_INCLUDE
  • VIRTUAL_FOLDER_EXCLUDE
  • IS_VIRTUAL_FOLDER_ROOT
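
For example, a virtual folder row might look similar to the following (all names and paths are illustrative):

Sales Folder,Raw sales data files,,SalesDB,/data/sales,*.csv,,true
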
Custom properties

The custom properties import file should have the following columns in this order:

  • CUSTOM_PROPERTY_NAME
  • CUSTOM_PROPERTY_DISPLAY_NAME
  • CUSTOM_PROPERTY_DATATYPE
  • CUSTOM_PROPERTY_DESC
  • CUSTOM_PROPERTY_GROUP_NAME
  • CUSTOM_PROPERTY_GROUP_DESC
  • IS_CUSTOM_PROPERTY
  • IS_CASE_SENSITIVE
  • IS_SEARCHABLE
  • IS_FACE_TABLE
  • CUSTOM_PROPERTY_READ_ROLE
  • CUSTOM_PROPERTY_WRITE_ROLE
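
For example, a custom property row might look similar to the following (all names are illustrative, and the data type and flag values shown are assumptions about the expected formats):

data_owner,Data Owner,STRING,Business owner of the data asset,Governance,Governance-related properties,true,false,true,true,Admin,Admin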

Import metadata

Follow the steps below to import metadata:

Procedure

  1. On the Home page from the left-side menu bar, click Tools.

    The Tools page opens.
  2. Select the Model – Import / Export tab.

    The Model – Import / Export Tool opens.
  3. Click Import.

    The import-specific fields appear.
  4. Select the asset type for the imported metadata.

  5. Specify the location of the metadata file in Upload File.

  6. In the Enter Parameters text box, enter any optional command line parameters, such as [--num-executor], [--driver-memory] or [--executor-memory], that you want to apply to the import.

    The specified parameters appear in the Command Summary section.
  7. Click Submit.

Results

The metadata is imported into Data Catalog according to the parameters you specified.