
Utility jobs

You can use the Data Catalog utility tools to update dashboards and to troubleshoot discovery and metadata details. Extended profiling generates resource metadata that is used for discoveries and resides in the discovery cache. Use the following tools to troubleshoot specific data in the discovery cache for qualifying discovery operations, to update dashboards, and to resolve metadata inconsistencies.

Caution: Because these tools may expose specific information from Data Catalog's discovery cache, they are strictly for administrator use and should not be exposed to general users.

To open Tools from any page, select the Tools icon in the left toolbar of the navigation pane. The Tools page opens.

Tools include:

Note: Tools are only available if you have administrator permissions. These utilities are generally used with the LDC Agent component.
  • Maintaining the discovery cache (DiscoveryCacheCompactor)

    Removes unused MapFile directories and merges small MapFile directories. It is a distributed replacement for RepoCacheMaintenance.

  • Run a Data Rationalization job

    Reports details about duplicated and overlapping data. Use this utility to populate the Data Rationalization dashboard.

  • Encrypt text

    Use this tool to encrypt your plain text values for use in a Data Catalog configuration.

  • Utils

    You can perform the following actions on CSV files:

The output from some of these utilities can be large. As a best practice, redirect the console output to a temporary file.

Caution: These utilities may expose actual data in their output. For example, some of these utilities expose actual top K values without masking them. Use caution to make sure your data is not exposed beyond its intended audience.

Maintaining the discovery cache

As a best practice, you should maintain the discovery cache by removing unused MapFile directories and merging small MapFile directories. The Data Catalog discovery cache is a repository containing large properties of entities generated during profiling and term propagation. Large properties are used in term propagation, lineage discovery, and in some instances regular expression evaluation.

The location and content of the discovery cache are controlled by configuration settings made in the Configurations user interface page. To change a configuration setting, go to Management and click View Configuration on the Configuration card. Then enter a keyword in the search box to search for a specific configuration setting.

  • URI for discovery cache metadata store

    Defines the URI of the discovery cache metadata files. Its default value is s3a://ldc-discovery-cache. You should configure large properties only on the default agent.

  • Relative location for a large properties metadata store

    Defines the location relative to the large properties URI for a large properties metadata store. Its default value is "/".

You should change the URI and location to fit your environment, and both should be accessible by the users running jobs.
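
For example, in an HDFS-backed environment, these settings might be set to values similar to the following (the host name and path shown are illustrative):

URI for discovery cache metadata store: hdfs://namenode.example.com:8020/user/ldcuser/.ldc_hdfs_metadata
Relative location for a large properties metadata store: /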

Important: The properties above must be adjusted during Data Catalog installation. If these properties change after installation, re-run discovery profiling starting with extended profiling.

The large properties of data resources and term propagation information in the discovery cache are stored in the resource_lp and tag_lp directories.

Important: Do not manually remove the resource_lp and tag_lp directories.

To view the metadata for term and large properties, enter the following command:

$ hdfs dfs -ls /user/ldcuser/.ldc_hdfs_metadata/

The output is similar to the following example:

$ hdfs dfs -ls /user/ldcuser/.ldc_hdfs_metadata/
Found 3 items
drwxr-xr-x  - ldcuser ldcuser    0 2021-09-28 05:25 /user/ldcuser/.ldc_hdfs_metadata/Built-in_Terms
drwxr-xr-x  - ldcuser ldcuser    0 2021-09-28 05:45 /user/ldcuser/.ldc_hdfs_metadata/resource_lp
drwxr-xr-x  - ldcuser ldcuser    0 2021-09-28 05:45 /user/ldcuser/.ldc_hdfs_metadata/tag_lp
$ 

Each directory contains the map files with generated unique names.

Note: When you install Data Catalog for the first time, you should use the following command to manually clean out these directories so you can start with an empty repository:

$ hdfs dfs -rm -R /user/ldcuser/.ldc_hdfs_metadata

The following configuration setting controls compression:

  • Minimum number of objects in discovery cache MapFile

    If a MapFile contains fewer objects than this number, its objects are relocated during discovery cache compression. The default is 50.

Compress the discovery cache

Perform the following steps to use the cache compactor to remove unused MapFile directories and to merge small MapFile directories.
Note: As a best practice, configure at least 6 GB each of driver memory and executor memory for the cache compactor.

Procedure

  1. Navigate to the default agent and enter the following command to see the command parameters:

    ./ldc compact -help

    The syntax is: ldc compact -numPartitions [number of partitions] -remove [false/true] -compact [false/true]

    where:

    • -numPartitions

      (Optional) Number of partitions. The default value is 10.

    • -remove

      (Optional) If set to true, unreferenced discovery cache files are removed. If set to false, unreferenced discovery cache files are reported, but not removed. The default value is true.

    • -compact

      (Optional) If set to true, small or partially referenced discovery cache map file directories are merged according to the configuration settings for the percentage of discovery cache unreferenced objects in one MapFile and the minimum number of objects in a discovery cache MapFile. If set to false, the directories are reported, but not merged. The default value is true.

  2. Enter a command with the parameters you want to specify.
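
    For example, to report unreferenced discovery cache files without removing them, while still merging small MapFile directories, you might enter a command similar to the following (the partition count shown is illustrative):

    ./ldc compact -numPartitions 20 -remove false -compact true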

Results

The submitted job is executed per the parameters you specified.

Run a Data Rationalization job

You must run a Data Rationalization job to gather all the necessary information for populating the Data Rationalization dashboard. Resources are analyzed by their overlap ratio, which is the amount of overlap divided by the cardinality. An overlap ratio of 1.0 means the resources are a 100 percent match.

Before you run a Data Rationalization job, you must first run the applicable Data Profiling, Format Discovery, and Schema Discovery jobs on your data. Also, allocate at least 6 GB of memory for the executors and driver for this job.

Perform the following steps to run a Data Rationalization job:

Procedure

  1. Open the Data Canvas page.

  2. Select the data sources that you want to investigate.

  3. Click the Action menu and select Process.

  4. Click Data Rationalization.

  5. Add any of the following parameters and values to the Command line text box according to your requirements.

    Note: All parameters are optional. If a value is not specified, the default value is used.
    • -firstDataSource

      Specifies the first data source to use for overlap analysis. If not defined, all data sources are analyzed.

    • -secondDataSource

      Specifies the second data source to use for overlap analysis. If the second data source is not defined, all data sources are analyzed.

    • -incremental

      Specifies whether to perform incremental analysis. Set to true to perform overlap analysis for data sources only if the analysis has not been done before or requires an update. Set to false to redo the overlap analysis. The default is true.

    • -reprocess

      Specifies whether to reprocess data resources. Set to true to perform overlap analysis for data resources that have already been processed. Set to false to perform overlap analysis only for resources that were re-profiled after the last analysis. The -incremental parameter must be set to false for this parameter to take effect. The default is false.

    • -fth_copy

      Specifies the overlap ratio threshold for assessing a field as a copy. The formula is (1.0 - cardinality ratio < -fth_copy value). The default is 0.1.

    • -fth_overlap

      Specifies the precision used to cut off accidental field overlaps. The field is considered an overlap if both the source and target overlap ratios are greater than this value. The default is 0.2.

    • -rth_copy

      Specifies the precision required to detect whether resources are copies. Resources are assessed as copies if the ratio of the matching fields in the compared resources is less than this value. The default is 0.1.

    • -rth_overlap

      Specifies the precision in overlap relationships needed to cut off accidental resource overlaps. An overlap relationship is denoted between resource1 and resource2 only if max([resource1 overlapped fields count]/[resource1 fields count], [resource2 overlapped fields count]/[resource2 fields count]) > [this value]. The default is 0.32.

    • -same_semantics

      Specifies whether to perform semantic analysis. Set to true to perform overlap analysis only for pairs of source and target fields that have the same internally discovered semantic attributes, such as temporal, free text, numeric, or string. To extend the scope of fields compared for overlap analysis, set this value to false. Setting this value to false may increase the accuracy of your overlap analysis, but may also decrease performance. The default is true.

    • -exclude_field_list

      Specifies a list of field names to exclude from overlap analysis.
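
    For example, to force a full re-analysis between two specific data sources, you might enter parameters similar to the following in the Command line text box (the data source names are illustrative):

    -firstDataSource SalesDB -secondDataSource StagingDB -incremental false
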
  6. Click Save.

Results

The job runs and displays on the Job Activity page.

Encrypt text

You can use the Encrypt tool to encrypt plain text values in a Lumada Data Catalog configuration. Some use cases for encryption might be for API keys, Kubernetes secrets, or discovery cache attributes. Perform the following steps to encrypt text:

Procedure

  1. In the left-hand navigation pane, click the Tools icon.

  2. Under Encryption Tool, enter the text you want to encrypt in the Plain Text field and click Encrypt.

    The encrypted result displays in the Encrypted result field. Plain text and encrypted text enclosed with enc(…) are interchangeable.
  3. (Optional) You can copy the encrypted text by clicking the Copy Result icon.

    You can use the result in place of any plain text value, for example, for large properties such as AWS secret keys (see the example after this procedure). You do not need to paste encrypted text into data source passwords, because passwords are automatically encrypted when they are created or updated.
  4. (Optional) You can clear the plain text value by clicking Reset.
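
For example, a configuration setting that expects an AWS secret key can be given either the plain-text key or its encrypted form. The encrypted value is supplied in the enc(…) wrapper, similar to the following placeholder (the actual ciphertext is generated by the Encrypt tool):

enc(<encrypted-value>)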

Lineage export and import

You can export lineages discovered by Data Catalog and import user-defined lineages with the Lineage – Import/Export tool. You can also use this tool to delete lineages that were exported as a CSV file.

The lineage relationships, both resource level and field level, are exported in CSV format in a pre-defined structure. Lineage import also requires that user-defined lineages be submitted in this same pre-defined format. Lineage exports and imports are based on lineage targets, not sources.

When limiting lineage export to a specific virtual folder or path, all lineages are exported to defined targets in the virtual folder or path.

Export-import CSV structure

Data Catalog exports lineages in a pre-defined fixed format. The following list describes the columns of this fixed format:

  • Column A, external_id (default: system generated): (Required) Unique ID of the entity represented by the CSV line, operation, or operation execution.
  • Column B, external_source_name (default: "LDC"): Name of the external source. Use LDC for lineages created manually in Data Catalog.
  • Column C, target_data_source_name (default: ""): Name of the target data source.
  • Column D, target_resource_path (default: ""): Target data resource path.
  • Column E, lineage_type (default: ""): Lineage type (options: INFERRED, INFERRED_ALTERNATIVE, FACTUAL, IMPORTED, HDFS2HIVE, OTHER).
  • Column F, lineage_kind (default: ""): Lineage kind (options: COPY, PARTIAL_COPY, JOIN, UNION, UNION_PART, MASK, ENCRYPT, STANDARDIZE, CALCULATION, SUBSET, SUPERSET, HIVE_EXTERNAL, OTHER).
  • Column G, lineage_exec_type (default: ""): Lineage operation level (options: lineage_exec_type_resource, lineage_exec_type_field).
  • Column H, target_resource_field (default: ""): Target field.
  • Column I, resource_lineage_reference (default: ""): GUID of the operation execution entity.
  • Column J, principal (default: ""): Principal or lineage creator as group:user.
  • Column K, source_data_source_name (default: ""): Source data source name.
  • Column L, source_resource_path (default: ""): Source data resource path.
  • Column M, source_resource_field (default: ""): Source field.
  • Column N, lineage_state (default: ""): Lineage state (options: ACCEPTED, REJECTED, SUGGESTED, IMPORTED).
  • Column O, description (default: ""): Lineage description.
  • Column P, code (default: ""): Transformation code.
  • Column Q, operation_reference (default: ""): GUID of the Operation entity.
  • Column R, operation_type (default: "operation_execution"): Lineage entity type (options: operation, operation_execution).
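
As an illustration only, a field-level lineage row laid out in this column order might look similar to the following (all IDs, names, paths, and values are hypothetical):

lin-0001,LDC,SalesDB,/sales/orders.csv,FACTUAL,COPY,lineage_exec_type_field,order_id,opx-0001,analysts:jdoe,StagingDB,/staging/orders_raw.csv,order_id,ACCEPTED,Copied during nightly load,order_id = order_id,op-0001,operation_execution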

For every lineage you export, the details for the lineage operation include the operation execution details.

The following sample is a portion of the CSV file exported:

CSV export example

The following details apply to the selected sample of the CSV export:

  • Row 75: Identifies the lineage and shows it was REJECTED.
  • Row 76: Lists all the information in the DETAILS panel for the lineage. It identifies lineage type, kind, status, description, and code.
  • Rows 77 – 94: Lists the field level lineage details for the selected lineage operation. The Code column (Col P) identifies the source field to target field mapping found in the View Mapping or Code section under the DETAILS tab.

The operation_reference values (Column Q in this example) for all rows following the lineage identifier row (Row 75 per this example) are the same and represent the external_id of the lineage operation.

The following lineage types can occur:

  • Inferred

    Primary lineage discovered by Data Catalog. The AI engine identifies one primary lineage based on proprietary algorithms.

  • Inferred alternative

    Lineage discovered by Data Catalog. If more than one source is a candidate for primary lineage, the AI engine marks these inferred lineages as Inferred_alternative.

  • Factual

    Any lineage added manually.

  • Imported

    Any lineage imported from third-party applications such as Apache Atlas.

  • HDFS2HIVE

    HDFS to HIVE lineage identified as part of schema discovery.

Export a lineage

Perform the following steps to export lineages:

Procedure

  1. On the Home page from the left-side menu bar, click Tools.

    The Tools page opens.
  2. Click the Lineage – Import/Export tab.

    The Lineage – Import/Export Tool opens.
  3. Click Export.

    The export-specific fields appear.
  4. In Virtual Folder Selection, select the virtual folder to limit the export target.

    The default value is All of the virtual folders.
  5. Specify the following parameter in the Enter Parameters text box if you want to apply it to the export:

    • -path

      (Optional) Limit the lineage export to this path.

    The specified parameters appear in the Command Summary section.
  6. Click Submit.

Results

Data Catalog exports the CSV file to the exports directory under the local agent. The exportLineageToCSV job (run through /agent/bin/ldc) also executes, as shown in the following example with the -path parameter specified:

./ldc exportLineageToCSV -virtualFolder vf -path pathtoDataEntity

Save exported lineage to your local machine

The ExportLineage command exports the CSV file to the exports directory under the local agent. If you want to save the exported lineage to your local machine, perform the following steps:

Procedure

  1. On the Home page from the left-side menu bar, click Management.

    The Manage Your Environment page opens.
  2. Select Job Activity under Job Management.

    The Job Activity window opens.
  3. Expand the exportLineageToCSV activity by clicking the down arrow on the activity row, then click the right arrow.

    The Instance Steps dialog box appears.
  4. Click Download Supported Files.

Results

The exported CSV file is saved to your local machine.

Import a lineage

Perform the following steps to import lineages:

Procedure

  1. On the Home page from the left-side menu bar, click Tools.

    The Tools page opens.
  2. Click the Lineage – Import/Export tab.

    The Lineage – Import/Export Tool opens.
  3. Click Import.

    The import-specific fields appear.
  4. Specify the location of the CSV file in Upload File.

  5. (Optional) Select Undo to delete all defined lineages from the virtual folder associated with the Upload File. Otherwise, deselect Undo to retain all defined lineages and insert the imported lineage.

    The Undo option is deselected by default.

    Note: The use of parameters in the Enter Parameters field is not currently supported.
  6. Click Submit.

Results

The submitted job is executed according to the parameters you specified. The ImportLineage command imports the lineage into Data Catalog.
Note: S-Fit and T-Fit values under View Field Mapping are available for imported lineages.

Next steps

See Lineage best practices for best practices for updating existing lineages.

Lineage best practices

You should adhere to the following best practices when updating existing lineages in Data Catalog.

  • You can update existing lineages by exporting them to a CSV file, curating the lineages in the CSV file, then importing the transformed lineages back into Data Catalog.
  • When updating lineages, do not manipulate the external_id, lineage_exec_type, resource_lineage_reference, or operation_reference fields. Errors occur when any of these fields are modified.
    • The external_ids are generated by Data Catalog and are unique to the system.
    • The lineage_exec_type field identifies resource lineage versus field lineage.
    • The resource_lineage_reference and operation_reference fields depend on the external_id field.
  • The principal, description, and code fields are descriptive. No validations are performed on the content of these fields while importing lineages.
  • The lineage_type and lineage_kind fields are also considered descriptive when importing lineages. No validations are performed on the user updates to these fields.
  • Validations are performed on any changes to the target_data_source_name field, the source_data_source_name field, the target path, the target field, the resource path, the resource field, and the lineage_state. Changes to resource names as part of path name changes are not supported.
  • Data Catalog does not offer field curation.

Removing lineages

You can also use the Lineage – Import/Export tool to remove lineages from Data Catalog. When you select Undo for Import, the import process removes all lineages specified in the import CSV file.

Model - Import/Export tool

You can use the Model - Import/Export tool, a migration utility that functions like a data model mapper, to export and import entities according to the applicable environment:

  • In environments with different data mappings, you can export or import the basic definitions of Data Catalog assets or the data model.
  • In environments with matched data mappings, you can export or import most of the asset metadata.

The following metadata can be exported and imported:

  • Data sources
  • User roles
  • Glossaries
  • Business terms
  • Term associations
  • Virtual folders
  • Custom properties

Exporting metadata

You can export metadata from Data Catalog and save the metadata in a CSV format.

The entities are exported to a file in CSV format with the following file name convention:

exported_<entity>.csv

For example:

exported_termassociations.csv
exported_customproperties.csv
exported_glossaries.csv
exported_folders.csv
exported_userroles.csv
exported_sources.csv
exported_businessterms.csv

Export metadata

Follow the steps below to export metadata in a CSV format:

Procedure

  1. On the Home page from the left-side menu bar, click Tools.

    The Tools page opens.
  2. Select the Model – Import / Export tab.

    The Model – Import / Export Tool page opens.
  3. Click Export.

    The export-specific fields appear.
  4. Select the asset type of the exported metadata.

    The default value is All of the types.
  5. In the Enter Parameters text box, enter any optional command line parameters, such as [--num-executor], [--driver-memory] or [--executor-memory], that you want to apply to the export.

    The specified parameters appear in the Command Summary section.
  6. Click Submit.

Results

The entities are exported to a file in the CSV format to the exports directory under the local agent.

Importing metadata

You can import metadata that has been saved in the pre-defined Data Catalog CSV file format.

To import user roles, data sources, and term associations, you must maintain the following import-export compatibility:

  • Use only the CSV metadata files that were exported earlier from Data Catalog.
  • Use the same version of the Model - Import/Export Tool that was used for exporting the metadata files.

Metadata file structures

The data in each CSV file follows a fixed column structure. When you are importing metadata into Data Catalog, a specific import sequence is required due to internal dependencies. For example, if you are importing business terms, make sure the required glossary is available in Data Catalog. If not, import the glossary first and then import the business terms. These dependencies are described for each of the metadata types below.

Data sources

When you are importing data sources, make sure to create and connect agents, and profile the data before importing any term associations.

The data source import file should have the following columns in this order:

  • DATASOURCE_NAME
  • DATASOURCE_DESC
  • DATASOURCE_TYPE
  • DATASOURCE_PATH
  • DATASOURCE_URL
  • DATASOURCE_USER
  • DATASOURCE_PASSWORD
  • DATASOURCE_JDBC_DRIVER
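
For example, a data source row might look similar to the following (all names and connection details are illustrative, and the password can be supplied as an encrypted enc(…) value):

SalesDB,Sales transactions database,JDBC,,jdbc:postgresql://dbhost.example.com:5432/sales,ldc_user,enc(<encrypted-password>),org.postgresql.Driver
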
User roles

The user roles import file should have the following columns in this order:

  • USER_ROLE_NAME
  • USER_ROLE_DESC
  • USER_POLICY_NAMES
  • VIRTUAL_FOLDER_NAMES
  • GLOSSARY_NAMES
  • METADATA_ACCESS
  • DATA_ACCESS
  • IS_DEFAULT_ROLE
  • IS_ALLOW_JOB_EXECUTION
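
For example, a user role row might look similar to the following (the role, policy, folder, and glossary names are illustrative, and the access and flag values shown are assumptions about the expected formats):

Data Steward,Curates sales metadata,Steward Policy,Sales Folder,Sales Glossary,true,false,false,true
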
Glossaries

The glossaries import file should have the following columns in this order:

  • GLOSSARY_NAME
  • GLOSSARY_DESC
  • GLOSSARY_COLOR
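
For example, a glossary row might look similar to the following (the name, description, and color value are illustrative):

Sales,Business terms for the sales domain,#FF9900
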
Business terms

Before importing any business terms, make sure you have created or imported the corresponding glossary.

The business terms import file should have the following columns in this order:

  • GLOSSARY_NAME
  • PARENT_LOGICAL_PATH
  • TERM_NAME
  • TERM_DESCRIPTION
  • TERM_METHOD
  • TERM_REGEX
  • TERM_MINSCORE
  • TERM_MINLEN
  • TERM_MAXLEN
  • TERM_ENABLED
  • TERM_LEARNING_ENABLED
  • EXTERNAL_ID
  • EXTERNAL_SOURCE_NAME
  • TERM_SYNONYMS
  • ENTITY
  • ANCHOR
  • CREATE_RESOURCE_TERM_ASSOCIATION
  • DEFAULT_OR_CUSTOM
  • EXPRESSION

You can use the term name to show hierarchy in business terms by using dot notation for naming the business term, as shown in the following example:

1. "Sales","Orders.Jan.Week1","Tag for Orders",VALUE,,0.0,0,0,true,true,,,""

The example above identifies a business term Week1 with the parent business term Jan and grandparent business term Orders under the Sales glossary.

Term associations

Before importing any term associations, make sure you have profiled the virtual folder and have imported the corresponding business terms and glossary.

The term associations import file should have the following columns in this order:

  • GLOSSARY_NAME
  • TERM_NAME
  • DATASOURCE_NAME
  • RESOURCE_NAME
  • FIELD_NAME
  • TERM_PROPAGATION
  • TERM_ASSOCIATION_STATE
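
For example, a term association row might look similar to the following (all names are illustrative, and the propagation and state values shown are assumptions about the expected formats):

Sales,Orders.Jan.Week1,SalesDB,orders.csv,order_date,true,ACCEPTED
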
Virtual folders

Before importing CSV files for virtual folders, you must create or import the respective parent data source.

The virtual folders import file should have the following columns in this order:

  • VIRTUAL_FOLDER_NAME
  • VIRTUAL_FOLDER_DESC
  • VIRTUAL_FOLDER_PARENT_NAME
  • VIRTUAL_FOLDER_DATASOURCE
  • VIRTUAL_FOLDER_PATH
  • VIRTUAL_FOLDER_INCLUDE
  • VIRTUAL_FOLDER_EXCLUDE
  • IS_VIRTUAL_FOLDER_ROOT
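
For example, a virtual folder row might look similar to the following (all names and paths are illustrative):

Sales Folder,Raw sales data files,,SalesDB,/data/sales,*.csv,,true
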
Custom properties

The custom properties import file should have the following columns in this order:

  • CUSTOM_PROPERTY_NAME
  • CUSTOM_PROPERTY_DISPLAY_NAME
  • CUSTOM_PROPERTY_DATATYPE
  • CUSTOM_PROPERTY_DESC
  • CUSTOM_PROPERTY_GROUP_NAME
  • CUSTOM_PROPERTY_GROUP_DESC
  • IS_CUSTOM_PROPERTY
  • IS_CASE_SENSITIVE
  • IS_SEARCHABLE
  • IS_FACE_TABLE
  • CUSTOM_PROPERTY_READ_ROLE
  • CUSTOM_PROPERTY_WRITE_ROLE
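
For example, a custom property row might look similar to the following (all names are illustrative, and the data type and flag values shown are assumptions about the expected formats):

data_owner,Data Owner,STRING,Business owner of the data asset,Governance,Governance-related properties,true,false,true,true,Admin,Admin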

Import metadata

Follow the steps below to import metadata:

Procedure

  1. On the Home page from the left-side menu bar, click Tools.

    The Tools page opens.
  2. Select the Model – Import / Export tab.

    The Model – Import / Export Tool opens.
  3. Click Import.

    The import-specific fields appear.
  4. Select the asset type for the imported metadata.

  5. Specify the location of the metadata file in Upload File.

  6. In the Enter Parameters text box, enter any optional command line parameters, such as [--num-executor], [--driver-memory] or [--executor-memory], that you want to apply to the import.

    The specified parameters appear in the Command Summary section.
  7. Click Submit.

Results

The metadata is imported into Data Catalog according to the parameters you specified.