Hitachi Vantara Lumada and Pentaho Documentation

Data Catalog utility jobs

You can use the Data Catalog utility tools to update dashboards, troubleshoot discovery operations, and resolve metadata inconsistencies. Extended profiling generates resource metadata, used for discoveries, that resides in the discovery cache. Use the following tools to troubleshoot specific data in the discovery cache when qualifying discovery operations, updating dashboards, and resolving metadata inconsistencies.

Caution: Because these tools may expose specific information from Data Catalog's discovery cache, they are strictly for administrator use and should not be exposed to general users.

These utilities are generally invoked from the agent component.

  • Cache Compactor (DiscoveryCacheCompactor)

    Removes unused MapFile directories and merges small ones. It is a distributed replacement for RepoCacheMaintenance.

  • Data Rationalization (Run a Data Rationalization job)

    Reports details about duplicated and overlapping data. Use this utility to populate the Data Rationalization dashboard.

  • Lineage Check (lineage_check)

    Reports details about inferred lineage between child and parent data resources. Use this utility to check if lineage could be discovered.

  • Tag Check (tag_check)

    Reports details for tag association suggestions between specific resource field(s) and any tag in question. Use this utility to get explanations why a tag was not propagated.

  • Utils (OriginPropagation, ExportLineage, and ImportLineage)

    Propagates information across the lineage relationships to show the origin of the data, exports all the lineages, or imports all lineages listed in the CSV files.

  • Utils Yarn (correctFrequency and TFARemover)

    Calculates precise cardinalities and most frequent values. You can also use this tool to remove suggested tag-field associations.

The output size from some of these utilities may be large. As a best practice, you should redirect the console output to a temporary file.

Caution: These utilities may expose actual data in their output. For example, some of these utilities expose actual top K values without masking them. Use caution to ensure your data is not exposed beyond its intended audience.
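As an illustration of redirecting output, the following commands are a sketch only: the agent install path, driver, and file names are assumptions, and the exact arguments depend on the utility you run.

```
# Illustrative: run a utility from the agent host and redirect console
# output to a temporary file; the path and arguments are assumptions.
/opt/ldc/agent/bin/ldc utils \
    -driver com.hitachivantara.datacatalog.lineage.ExportLineage \
    -file /tmp/ldc-export.csv > /tmp/ldc-utils.out 2>&1

# Restrict access to the captured output, since it may contain actual data.
chmod 600 /tmp/ldc-utils.out
```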

Open and run a utility tool

Perform the following steps to access and run a Data Catalog utilities tool.

Procedure

  1. Navigate to Manage, then click Tools.

    The Tools dialog box displays.

    Tools dialog box

    Note: Tools only displays in Manage if you have administrator permissions.
  2. From the Tools panel on the left, select the utility tool you want to use.

    The dialog box updates with the tool you selected.
  3. Enter the utility parameters and the specific values you want to set in the Command line text box.

  4. Click Submit job.

Results

The submitted job is executed per the command and parameters you specified.

Discovery cache maintenance

As a best practice, you should maintain the discovery cache by removing unused MapFile directories and merging small directories. Data Catalog discovery cache is a repository containing large properties of entities generated during profiling and tag propagation. Large properties are used in Tag Propagation, Lineage Discovery, and in some instances Regex evaluation.

The location and content of the Discovery Cache are configured and controlled by two configuration.json properties:

  • ldc.metadata.hdfs.large_properties.uri

    URI of the discovery cache metadata files. Its default value is hdfs://<host>:8020.

  • ldc.metadata.hdfs.large_properties.path

    Location relative to discovery cache URI of discovery cache metadata store. Its default value is /user/ldcuser/.ldc_hdfs_metadata.

You should change the URI and location to fit your environment, and both should be accessible by the user running jobs.

Important: The above properties must be adjusted during installation of the Lumada Data Catalog product. If these properties change after installation, re-run discovery profiling, starting with extended profiling.

The large properties of data resources and tag propagation information in the discovery cache are stored in the resource_lp and tag_lp directories, as shown in the following example:

.wld_hdfs_metadata directory

Important: Do not manually remove the resource_lp and tag_lp directories.

Each directory contains a number of MapFiles with generated unique names, as shown in the following example:

resource_lp directory

Note: When installing Data Catalog for the first time, you can manually clean out these directories to start with an empty repository by running one of the following commands:

$ hdfs dfs -rm -R /user/ldcuser/.ldc_hdfs_metadata
$ hdfs dfs -rm -R /user/waterlinesvc/.ldc_hdfs_metadata

Compaction is controlled by configuration parameters, as shown in the following examples:

  • ldc.metadata.discovery.cache.unferenced

    "ldc.metadata.discovery.cache.unferenced": {
        "value": 50.0,
        "type": "FLOAT",
        "restartRequired": false,
        "readOnly": false,
        "description": "If below this percentage, the objects from the MapFile will be relocated during discovery cache compaction (DiscoveryCacheCompactor).",
        "label": "percent of discovery cache unreferenced objects in one mapfile",
        "category": "DISCOVERY",
        "defaultValue": 50.0,
        "visible": false
    }
    
  • ldc.discovery.cache.objects.min

    "ldc.discovery.cache.objects.min": {
        "value": 50,
        "type": "INTEGER",
        "restartRequired": false,
        "readOnly": false,
        "description": "If below this number, the objects from the MapFile will be relocated during discovery cache compaction (DiscoveryCacheCompactor).",
        "label": "Minimum number of objects in discovery cache MapFile",
        "category": "DISCOVERY",
        "defaultValue": 50,
        "visible": false
    }

Use the cache compactor

Perform the following steps to use the cache compactor to remove unused MapFile directories and to merge small directories:

Procedure

  1. Navigate to Tools under Manage, then click Cache Compactor.

    The Cache Compactor dialog box displays.

    Cache Compactor dialog box

    Note: Tools only displays in Manage if you have administrator permissions.
  2. Enter the following parameters you want to specify in the Command line text box:

    • -agent

      Agent name.

    • -compact

      (Optional) If set to true, small or partially referenced discovery cache map file directories are merged according to ldc.metadata.discovery.cache.unferenced and ldc.discovery.cache.objects.min configurations. If set to false, the directories are reported, not merged. The default value is true.

    • -remove

      (Optional) If set to true, unreferenced discovery cache files are removed. If set to false, unreferenced discovery cache files are reported, not removed.

    • -numPartitions

      (Optional) Number of partitions. The default value is 10.

  3. Click Submit job.

Results

The submitted job is executed per the parameters you specified.
Note: Lumada Data Catalog strongly recommends setting at least 8 GB each for driver-memory and executor-memory for the cache compactor.
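For example, the following parameters (the agent name localAgent is illustrative) report unreferenced files and small MapFile directories without merging or removing them:

```
-agent localAgent
-compact false
-remove false
-numPartitions 10
```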

Check lineage

You can use the lineage check utility tool to understand why resources are not being discovered as lineage relationships and to help tune Data Catalog's lineage discovery process to ensure it finds applicable lineage relationships.

Perform the following steps to report the suitability of two resources as candidates for a lineage relationship:

Procedure

  1. Navigate to Tools under Manage, then click Lineage Check.

    The Lineage Check dialog box displays.

    Lineage Check dialog box

    Note: Tools only displays in Manage if you have administrator permissions.
  2. Enter the following parameters you want to specify in the Command line text box:

    • -agent

      Agent name.

    • -parentVirtualFolder

      Name of the parent virtual folder.

    • -parentDataResource

      A fully qualified path to parent data resource.

    • -childVirtualFolder

      Name of child virtual folder.

    • -childDataResource

      A fully qualified path to child data resource.

    • -validation

      (Optional) If set to true, the significance of parent-child relationship is validated using modification timestamps for HDFS and HIVE data resources. The default value is true.

  3. Click Submit job.

Results

The submitted job is executed per the parameters you specified.
The following blocks of code are examples of setting the lineage check parameters in the Command line text box:
  • Example 1

    -parentVirtualFolder MySource
    -parentDataResource /user/demo/raw/rest_insp/restaurants_nyc.csv
    -childVirtualFolder MySource
    -childDataResource /user/demo/pub/classified/blended_violations.csv
    
  • Example 2

    -parentVirtualFolder MyHIVE
    -parentDataResource /default.table1
    -childVirtualFolder MyHIVE
    -childDataResource /default.table1_vw
    

Check tag propagation

You can use reports for tag association suggestions between specific resource field(s) and any tag in question to understand why the tag was or was not propagated.

Perform the following steps to report on tag association suggestions:

Procedure

  1. Navigate to Tools under Manage, then click Tag Check.

    The Tag Check dialog box displays.

    Tag Check dialog box

    Note: Tools only displays in Manage if you have administrator permissions.
  2. Enter the following parameters you want to specify in the Command line text box:

    • -agent

      Agent name.

    • -virtualFolder

      Virtual folder name.

    • -dataResource

      A fully qualified path to data resource.

    • -domain

      Tag domain name to which the tag in question belongs; may use shortcut -domain p for Built-in_Tags Domain.

    • -tag

      Tag name for which propagation is being checked.

    • -field

      (Optional) Full path to field. If this parameter is not set, all fields are checked.

  3. Click Submit job.

Results

The submitted job is executed per the parameters you specified.
The following block of code is an example of setting the tag check parameters in the Command line text box:
-virtualFolder MySource
-dataResource /user/wlddev/data/src/regex_all.csv
-mdsType Solr
-domain p
-tag "Global City"

Origin propagation

The origin is where a data set comes into the cluster. The origin of a data set is usually a specific resource, such as the data source or the virtual folder. In Data Catalog, origin information propagates across lineage relationships. The origins of a resource are the data sources related to it either directly (belongs to) or indirectly through imported, inferred, or manually created (factual) lineages.

You can use a utility tool to examine origins in the single resource view.

Perform the following steps to investigate origin propagations across lineage relationships:

Procedure

  1. Navigate to Tools under Manage, then click Utils.

    The Utils dialog box displays.

    Utils dialog box

    Note: Tools only displays in Manage if you have administrator permissions.
  2. Enter the following parameters you want to specify into the Command line text box:

    • -agent

      Agent name. This parameter is only required if multiple agents are registered.

    • -driver

      The driver of the utility. For origin propagation, set -driver to com.hitachivantara.datacatalog.origin.OriginPropagation.

    • -mode

      The type of origin. Specify one of the following options to propagate the selected type of origins:

      • inferred

        Only origins through inferred (accepted and suggested) lineages.

      • imported

        Only origins through factual imported lineages.

      • all

        (default) Origins through all lineages (inferred and imported).

  3. Click Submit job.

Results

The submitted job is executed per the parameters you specified.
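As an example (the agent name localAgent is illustrative), the following parameters propagate origins across all lineage types:

```
-driver com.hitachivantara.datacatalog.origin.OriginPropagation
-agent localAgent
-mode all
```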

Lineage export and import utility

You can export lineages discovered by Data Catalog and import user-defined lineages with Data Catalog's lineage export and import utility. You can also use the utility to delete lineages that were previously exported to a CSV file with the ExportLineage action.

The lineage relationships, both resource level and field level, are exported in CSV format with a pre-defined structure. Lineage import also expects user-defined lineages to be submitted in this pre-defined format. Lineage export and import operate with respect to lineage targets, not sources.

When you limit a lineage export to a specific virtual folder or path, all lineages whose targets reside in that virtual folder or path are exported.

Export-import CSV structure

Data Catalog exports lineages in a pre-defined fixed format. The following table describes the 18 columns of this fixed format:

Col# | Col Name | Default Value | Description
0 | external_id | system generated | (Required) Unique ID of the entity represented by the CSV line, operation, or operation execution
1 | external_source_name | "LDC" | Name of the external source. Use LDC for lineages created manually in Data Catalog.
2 | target_data_source_name | "" | Name of the target data source
3 | target_resource_path | "" | Target data resource path
4 | lineage_type | "" | Lineage type (options: INFERRED, INFERRED_ALTERNATIVE, FACTUAL, IMPORTED, HDFS2HIVE, OTHER)
5 | lineage_kind | "" | Lineage kind (options: COPY, PARTIAL_COPY, JOIN, UNION, UNION_PART, MASK, ENCRYPT, STANDARDIZE, CALCULATION, SUBSET, SUPERSET, HIVE_EXTERNAL, OTHER)
6 | lineage_exec_type | "" | Lineage operation level (options: lineage_exec_type_resource, lineage_exec_type_field)
7 | target_resource_field | "" | Target field
8 | resource_lineage_reference | "" | Operation execution entity GUID
9 | principal | "" | Principal or lineage creator as group:user
10 | source_data_source_name | "" | Source data source name
11 | source_resource_path | "" | Source data resource path
12 | source_resource_field | "" | Source field
13 | lineage_state | "" | Lineage state (options: ACCEPTED, REJECTED, SUGGESTED, IMPORTED)
14 | description | "" | Lineage description
15 | code | "" | Transformation code
16 | operation_reference | "" | Operation entity GUID
17 | operation_type | "operation_execution" | Lineage entity type (options: operation, operation_execution)
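To make the column layout concrete, the following Python sketch (illustrative only, not part of Data Catalog) lists the 18 columns in order and writes one hypothetical resource-level FACTUAL lineage row. Every value in the row is invented for illustration; only the column names and order come from the table above.

```python
import csv
import io

# The 18 fixed columns of the export-import CSV, in order (from the table above).
COLUMNS = [
    "external_id", "external_source_name", "target_data_source_name",
    "target_resource_path", "lineage_type", "lineage_kind",
    "lineage_exec_type", "target_resource_field", "resource_lineage_reference",
    "principal", "source_data_source_name", "source_resource_path",
    "source_resource_field", "lineage_state", "description", "code",
    "operation_reference", "operation_type",
]

# A hypothetical resource-level lineage row: only the columns relevant to a
# manually created (FACTUAL) lineage are populated; the rest keep their
# "" defaults.
row = dict.fromkeys(COLUMNS, "")
row.update({
    "external_id": "op-0001",          # placeholder ID
    "external_source_name": "LDC",
    "target_data_source_name": "MySource",
    "target_resource_path": "/user/demo/pub/out.csv",
    "lineage_type": "FACTUAL",
    "lineage_exec_type": "lineage_exec_type_resource",
    "source_data_source_name": "MySource",
    "source_resource_path": "/user/demo/raw/in.csv",
    "lineage_state": "ACCEPTED",
    "operation_type": "operation_execution",
})

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=COLUMNS)
writer.writerow(row)
line = buf.getvalue().strip()
print(line)
```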

For every lineage you export, the lineage operation details include the operation execution details, as shown in the following example:

Lineage operation details

Note: The formatting is removed in the display to highlight the relevant lineage operation and operation execution in the details.

The following sample is a portion of the CSV file exported from the above example:

CSV export example

The following details apply to the above portion of the CSV export:

  • Row 75: Identifies the lineage operation. The REJECTED lineage for resource Retiree180.csv per this example.
  • Row 76: Lists all the information in the DETAILS panel for the lineage. It identifies lineage type, kind, status, description, and code.
  • Rows 77 – 94: Lists the field level lineage details for the selected lineage operation. The Code column (Col P) identifies the source field to target field mapping found under the View Mapping or Code section under the DETAILS tab.

The operation_reference values (Column Q in this example) for all rows following the lineage identifier row (Row 75 per this example) are the same and represent the external_id of the lineage operation.

The following lineage types can occur:

  • Inferred

    Primary lineage discovered by Data Catalog. The AI engine identifies one primary lineage based on proprietary algorithms.

  • Inferred alternative

    Lineage discovered by Data Catalog. If more than one source is a candidate for the primary lineage, the AI engine marks these inferred lineages as Inferred_alternative.

  • Factual

    Any lineage added manually.

  • Imported

    Any lineage imported from third-party applications, like Atlas or Navigator.

  • HDFS2HIVE

    HDFS to HIVE lineage as part of the schema discovery.

Exporting lineages

Perform the following steps to export lineages:

Procedure

  1. Navigate to Tools under Manage, then click Utils.

    The Utils dialog box displays.

    Utils dialog box

    Note: Tools only displays in Manage if you have administrator permissions.
  2. Enter the following parameters you want to specify into the Command line text box:

    • -agent

      Agent name. This parameter is only required if multiple agents are registered.

    • -driver

      The driver of the utility. For lineage export, set -driver to com.hitachivantara.datacatalog.lineage.ExportLineage.

    • -file

      The path of the CSV file to be generated by the export action.

    • -virtualFolder

      (Optional) The lineage export limited to this virtual folder. If this parameter is not set, all lineages in the data lake are exported.

    • -path

      (Optional) The lineage export limited to this path.

    • -separator

      (Optional) Separators used in the exported CSV file. The default value is ','.

  3. Click Submit job.

Results

The submitted job is executed per the parameters you specified.
As an example, the following sample command was entered into the Utils dialog box to export lineages:
-driver com.hitachivantara.datacatalog.lineage.ExportLineage
-agent localAgent
-file /home/waterlinesvc/lineage-utility/Exported-lineage.csv

The following is an example of the console output from the export job:

INFO  | 2020-04-10 02:24:06,563 |  | Main [main]  - Starting Lumada Data Catalog 2019.3 Build:420 Patch:undefined with Oracle Corporation Java, Version 1.8.0_112 from /usr/jdk64/jdk1.8.0_112/jre, Locale en_US
INFO  | 2020-04-10 02:24:06,564 |  | Main [main]  - Arguments [-action, utils, -driver, com.hitachivantara.lineage.ExportLineage, -file, /home/waterlinesvc/lineage-utility/Exported-lineage.csv]
INFO  | 2020-04-10 02:24:06,583 | utils | ClientConfigurationLoader [main]  - In loadConfig Loading configuration from file: meta-client-configuration.json
INFO  | 2020-04-10 02:24:06,660 | utils | MetadataClientServicefactory [main]  - Preparing metadataclient
INFO  | 2020-04-10 02:24:06,713 | utils | Main [main]  - Calling com.hitachivantara.lineage.ExportLineage with options {action=utils, file=/home/waterlinesvc/lineage-utility/Exported-lineage.csv, COMMAND=/opt/ldc/agent/bin/ldc utils -virtualFolder null -path null}.
WARN  | 2020-04-10 02:24:06,719 | utils | LockManager [main]  - Zookeeper based locking is disabled. Jobs should not be run concurrently
INFO  | 2020-04-10 02:24:06,719 | utils | ExportLineage [main]  - Getting metadata client instance.
INFO  | 2020-04-10 02:24:06,719 | utils | MetadataClientServicefactory [main]  - Preparing metadataclient
INFO  | 2020-04-10 02:24:06,719 | utils | ExportLineage [main]  - Utility parameters validated.
Virtual folder: [null]
Path: [null]
File: [/home/waterlinesvc/lineage-utility/Exported-lineage.csv]
Field delimiter: [,]
INFO  | 2020-04-10 02:24:06,719 | utils | ExportLineage [main]  - Starting to export lineage information.
INFO  | 2020-04-10 02:24:06,719 | utils | ExportLineage [main]  - Fetching all lineages.
INFO  | 2020-04-10 02:24:38,127 | utils | ExportLineage [main]  - Successfully processed 50 lineages so far...
INFO  | 2020-04-10 02:25:12,254 | utils | ExportLineage [main]  - Successfully processed 100 lineages so far...
INFO  | 2020-04-10 02:25:40,729 | utils | ExportLineage [main]  - Successfully processed 150 lineages so far...
INFO  | 2020-04-10 02:25:55,552 | utils | ExportLineage [main]  -

        Finish lineage export. 189 lineages were successfully exported to file /home/waterlinesvc/lineage-utility/Exported-lineage.csv. Encounter 0 errors
INFO  | 2020-04-10 02:25:55,554 | utils | ExportLineage [main]  -

        Elapsed time = PT1M48.832S
INFO  | 2020-04-10 02:25:55,554 | utils | ExportLineage [main]  - Successfully exported all lineages.
INFO  | 2020-04-10 02:25:55,554 | utils | Main [main]  - Elapsed time PT1M49.53S
INFO  | 2020-04-10 02:25:55,554 | utils | Main [main]  -

Importing lineages

Perform the following steps to import lineages:

Procedure

  1. Navigate to Tools under Manage, then click Utils.

    The Utils dialog box displays.

    Utils dialog box

    Note: Tools only displays in Manage if you have administrator permissions.
  2. Enter the following parameters you want to specify in the Command line text box:

    • -agent

      Agent name. This parameter is only required if multiple agents are registered.

    • -driver

      The driver of the utility. For lineage import, set -driver to com.hitachivantara.datacatalog.lineage.ImportLineage.

    • -file

      (Required) The path to the CSV file to be imported.

    • -separator

      (Optional) Separator used in the CSV file; any character except the pipe character ('|') can be used. The default value is ','.

    • -replaceConfig

      (Optional) CSV file with configured replacements. You can use the configuration file to make replacements to the lineages during the import process. See Replacements during import for details.

    • -undo

      (Optional) When set to true, all lineages defined in the specified import file are removed. The default value is false.

  3. Click Submit job.

Results

The submitted job is executed per the parameters you specified.
Note: S-Fit and T-Fit values under View Field Mapping are not available for imported lineages. Because of the nature of this user-established mapping, Data Catalog's engine cannot restore these values from the imported CSV.
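For example, the following parameters (the agent name and file path are illustrative) import lineages from a curated CSV file:

```
-driver com.hitachivantara.datacatalog.lineage.ImportLineage
-agent localAgent
-file /home/waterlinesvc/lineage-utility/Curated-lineage.csv
```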

Next steps

The following best practices apply while updating existing lineages:
  • Existing lineages can be updated by exporting them to a CSV, curating the lineages in the CSV, then importing the transformed lineages back into Data Catalog.
  • When updating lineages, do not manipulate the external_id, lineage_exec_type, operation_reference and resource_lineage_reference fields. The external_ids are generated by Data Catalog and are unique to the system. The resource_lineage_reference and operation_reference fields depend on the external_id field. The lineage_exec_type field identifies resource lineage versus field lineage. Errors occur when any of these fields are modified.
  • The principal, description, and code fields are descriptive. No validations are performed on the content of these fields while importing lineages.
  • Although the lineage_kind and lineage_type fields are relevant when Data Catalog exports inferred lineages, any changes to these fields during import are treated as descriptive. No validations are performed on user updates to these fields.
  • Validations are performed on any changes to the target_data_source_name field, the source_data_source_name field, the target path, the target field, the resource path, the resource field, and the lineage_state. Changes to resource names as part of path name changes are not supported.
  • Data Catalog does not offer field curation.

Removing lineages

You can also use the export/import lineage utility to remove lineages from Data Catalog. When the undo option of the ImportLineage action is set to true, it removes all lineages specified in the import CSV file, as shown in the following example command call:

-driver com.hitachivantara.datacatalog.lineage.ImportLineage -file ~waterlinesvc/Exported-Lineage/export_apr2020.csv -undo true

Replacements during import

You can use the Replace-config file to make replacements to the lineages during the import process. You then supply the path of this file as the optional -replaceConfig parameter when importing the CSV lineages.

For example, in the environment where the CSV lineages are being imported, the paths to specific resources may differ. One option is to manually edit the CSV lineage file to alter some or every occurrence of the specified resource path. As another example, mapping an 'account' field of one resource to the 'record' field of a downstream resource may be more applicable per business logic than mapping it to the 'account' field in that resource. You could search for and make those changes in the CSV lineage file.

The Replace-config file automates these replacements at run time when the import command is triggered. The Replace-config file is a CSV file with the following four comma-separated (",") fields:

Column # | Column Name | Description
0 | field name from CSV header | (Required) Name of the field being replaced (from the CSV header).
1 | value to be replaced | (Required) Value to be replaced.
2 | new value (replacement) | (Optional) Replacement value. If it is not defined, the matching value is replaced with an empty value.
3 | replacement strategy | (Optional) One of the following four values. The default value is ALL.
  • ALL: Replace the entire matching field value.
  • START: Replace the matching value at the beginning of the field.
  • END: Replace the matching value at the end of the field.
  • OCCURRENCE: Replace the first occurrence of the matching value.

The following sample is an example Replace-config file:

principal,,wlddev:waterlineservice-user
target_data_source_name,COPY,LANDING,ALL
target_resource_path,/user/wlddev/lin_ju_stg/SRC_JOIN2,/user/wlddev/lin_ju_stg/SRC_JOIN,START
target_resource_path,/user/wlddev/lin_ju_src/SRC_JOIN2,/user/wlddev/lin_ju_src/SRC_JOIN,START
target_resource_path,TRG_JOIN2,TRG_JOIN,OCCURRENCE
source_data_source_name,COPY,LANDING,ALL
source_resource_path,/user/wlddev/lin_ju_stg/SRC_JOIN2,/user/wlddev/lin_ju_stg/SRC_JOIN,START
source_resource_path,/user/wlddev/lin_ju_src/SRC_JOIN2,/user/wlddev/lin_ju_src/SRC_JOIN,START
source_resource_path,TRG_JOIN2,TRG_JOIN,OCCURRENCE
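The four replacement strategies behave like ordinary string operations. The following Python sketch is illustrative only; it is not Data Catalog code, and ALL is interpreted here as replacing the field value when it matches exactly, which is one plausible reading of the table above.

```python
def apply_replacement(value: str, old: str, new: str, strategy: str = "ALL") -> str:
    """Apply one Replace-config rule to a single field value (illustrative)."""
    if strategy == "ALL":
        # Replace the entire field value when it matches exactly.
        return new if value == old else value
    if strategy == "START":
        # Replace the matching value at the beginning of the field.
        return new + value[len(old):] if old and value.startswith(old) else value
    if strategy == "END":
        # Replace the matching value at the end of the field.
        return value[:-len(old)] + new if old and value.endswith(old) else value
    if strategy == "OCCURRENCE":
        # Replace only the first occurrence of the matching value.
        return value.replace(old, new, 1)
    raise ValueError(f"unknown strategy: {strategy}")

# Rules patterned after the sample Replace-config file above:
print(apply_replacement("COPY", "COPY", "LANDING", "ALL"))  # LANDING
print(apply_replacement("/user/wlddev/lin_ju_stg/SRC_JOIN2/part-0",
                        "/user/wlddev/lin_ju_stg/SRC_JOIN2",
                        "/user/wlddev/lin_ju_stg/SRC_JOIN", "START"))
print(apply_replacement("TRG_JOIN2_TRG_JOIN2", "TRG_JOIN2", "TRG_JOIN",
                        "OCCURRENCE"))  # TRG_JOIN_TRG_JOIN2
```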

Correct frequency

Profiling results are approximate for very large resources with very high cardinalities (the tipping point is configurable). You can use this utility to configure data resources for precise calculation.

Note: When changing the defaults of the command line parameters, be aware that precise calculations can be time-consuming for fields with very high cardinalities.

Perform the following steps to modify the frequency of precise calculations:

Procedure

  1. Navigate to Tools under Manage, then click Utils Yarn.

    The Utils Yarn dialog box displays.

    Utils Yarn dialog box

    Note: Tools only displays in Manage if you have administrator permissions.
  2. Enter the following parameters you want to specify in the Command line text box:

    • -agent

      Agent name. This parameter is only required if multiple agents are registered.

    • -driver

      The driver of the utility. For correct frequency, set -driver to com.hitachivantara.datacatalog.cli.MostFrequentCounter.

    • -virtualFolder

      (Optional) Data resources from this virtual folder. If this parameter is not set, all applicable data resources from the repository are used.

    • -path

      (Optional) Profiling path (resource specific filter). By default, the entire virtual folder is used.

    • -min_card

      (Optional) Minimum reported cardinality to recalculate the most frequent distribution for the fields. The default value is 500000.

    • -min_topk_count

      (Optional) Minimum most frequent reported count to recalculate the most frequent distribution for the fields. The default value is 16.

    • -calc_card

      (Optional) If set to true, recalculate the precise cardinality for the fields selected for the most frequent distribution recalculation. The default value is true.

    • -lines_out

      (Optional) Number of top most frequent values reported for the old and new exact distributions. The default value is 20.

  3. Click Submit job.

Results

The submitted job is executed using the parameters you specified.
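For example (the agent and virtual folder names are illustrative), the following parameters recalculate precise cardinalities and most frequent values for resources in one virtual folder:

```
-driver com.hitachivantara.datacatalog.cli.MostFrequentCounter
-agent localAgent
-virtualFolder MySource
-min_card 500000
-calc_card true
```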

Remove tag field associations

You can remove suggested tag-field associations.

Perform the following steps to remove suggested tag-field associations:

Procedure

  1. Navigate to Tools under Manage, then click Utils Yarn.

    The Utils Yarn dialog box displays.

    Utils Yarn dialog box

    Note: Tools only displays in Manage if you have administrator permissions.
  2. Enter the following parameters you want to specify in the Command line text box:

    • -agent

      Agent name. This parameter is only required if multiple agents are registered.

    • -driver

      The driver of the utility. For removing tag-field associations, set -driver to com.hitachivantara.datacatalog.cli.bucket.TFARemover.

    • -domain

      Tag domain name.

    • -tag

      (Optional) Tag name. All suggested tag associations for this tag are removed. If the domain is manually created, Data Catalog removes its suggested tag associations. If the tag name is not defined, suggested associations for all tags in the domain are removed.

  3. Click Submit job.

Results

The submitted job is executed using the parameters you specified.
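For example, the following parameters (the agent name is illustrative; -domain p is the documented shortcut for the Built-in_Tags domain) remove suggested associations for a single tag:

```
-driver com.hitachivantara.datacatalog.cli.bucket.TFARemover
-agent localAgent
-domain p
-tag "Global City"
```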

Run a Data Rationalization job

You must run the Data Rationalization job to gather all the necessary information for populating the Data Rationalization dashboard. Resources are analyzed by their overlap ratio, which is the amount of overlap divided by the cardinality. An overlap ratio of 1.0 means the resources are a 100 percent match.
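The overlap-ratio arithmetic can be sketched in a few lines of Python; the counts below are invented for illustration.

```python
# Overlap ratio = amount of overlap divided by cardinality.
def overlap_ratio(overlap_count: int, cardinality: int) -> float:
    return overlap_count / cardinality

# 1,000 shared values out of 1,000 distinct values: a 100 percent match.
print(overlap_ratio(1000, 1000))  # 1.0

# 320 shared values out of 1,000 distinct values.
print(overlap_ratio(320, 1000))   # 0.32
```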

Before you run the Data Rationalization job, first run the applicable profile, discovery, and lineage jobs on your data. Also, allocate at least 6 GB of memory for the executors and driver for this job.

Perform the following steps to run the Data Rationalization job:

Procedure

  1. Click the Manage tab, then click Tools.

  2. Select Data Rationalization from the menu on the left side of the page.

  3. Add any of the following parameters and values to the Command line according to your requirements.

    Note: All parameters are optional. If a value is not specified, the default value is used.

    Parameter | Description
    -firstDataSource | Specifies the first data source to use for overlap analysis. If not defined, all data sources are analyzed.
    -secondDataSource | Specifies the second data source to use for overlap analysis. If the second data source is not defined, all data sources are analyzed.
    -incremental | Specifies whether to perform incremental analysis. Set to true to perform overlap analysis for data sources only if the analysis has not been done before or requires an update. Set to false to redo the overlap analysis. The default is true.
    -reprocess | Specifies whether to reprocess data resources. Set to true to perform overlap analysis for data resources that have already been processed. Set to false to perform overlap analysis only for resources that were re-profiled after the last analysis. The -incremental parameter must be set to false for this parameter to take effect. The default is false.
    -fth_copy | Specifies the overlap ratio threshold for assessing a field as a copy. The formula is (1.0 - cardinality ratio < -fth_copy value). The default is 0.1.
    -fth_overlap | Specifies the precision to cut off accidental field overlaps. The field is considered an overlap if both the source and target overlap ratios are greater than this value. The default is 0.2.
    -rth_copy | Specifies the precision required to detect whether resources are copies. Resources are assessed as copies if the ratio of the matching fields in the compared resources is less than this value. The default is 0.1.
    -rth_overlap | Specifies the precision in overlap relationships needed to cut off accidental resource overlaps. Overlap relationships are denoted between resource1 and resource2 only if max([resource1 overlapped fields count]/[resource1 fields count], [resource2 overlapped fields count]/[resource2 fields count]) > [this value]. The default is 0.32.
    -same_semantics | Specifies whether to restrict overlap analysis to fields with the same semantics. Set to true to perform overlap analysis only for pairs of source and target fields that have the same internally discovered semantic attributes, such as temporal, free text, numeric, or string. To extend the scope of fields compared, set this value to false. Setting this value to false may increase the accuracy of your overlap analysis, but may also decrease performance. The default is true.
    -exclude_field_list | Specifies a list of field names to exclude from overlap analysis.
  4. Click Save.

Results

The job runs and displays on the Job Activity page.
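For example, the following parameters (the data source names are illustrative) redo the overlap analysis for two specific data sources:

```
-firstDataSource MySource
-secondDataSource MyHIVE
-incremental false
```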

Next steps