Hitachi Vantara Lumada and Pentaho Documentation

Data rationalization

Attention: This feature is available in version 6.1 and later.

You can use the data rationalization feature to identify and analyze copies and overlaps of the data in your data lake. The tools in the data rationalization dashboard help you to discover the owners of duplicate files, the number of duplicates, and the amount of space that they are consuming. Identifying duplicates can help your organization save on storage costs, reduce the size of your data lake, and improve your search performance.

After you run a Data Rationalization job to provide the most recent data in your dashboard, you can begin to search for overlaps and copies. An overlap is a file that contains part, but not all, of the data that another file contains, while a copy is a form of overlap in which the contents match exactly (100 percent overlap). A copy of a file can have the same contents even if the file name is different. Because duplicate files may have different names, they are gathered into copy groups. You can use the analytic and drill-down tools in the dashboard to investigate the duplicates and overlaps within these copy groups.
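
To make the idea of a copy group concrete, the following minimal Python sketch groups files whose contents are byte-for-byte identical, regardless of file name. It is an illustration only: the `copy_groups` function, the sample file names, and the use of a SHA-256 content hash are assumptions made for this example and do not describe how Data Catalog detects copies internally.

```python
import hashlib
from collections import defaultdict


def copy_groups(files):
    """Group file names whose contents are byte-for-byte identical.

    `files` maps a file name to its content as bytes (illustrative input).
    Only groups with two or more members are returned.
    """
    by_digest = defaultdict(list)
    for name, content in files.items():
        # Identical content yields an identical digest, whatever the name.
        digest = hashlib.sha256(content).hexdigest()
        by_digest[digest].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Hypothetical sample data: two exact copies and one partial file.
files = {
    "sales_2020.csv": b"id,amount\n1,10\n2,20\n",
    "sales_backup.csv": b"id,amount\n1,10\n2,20\n",
    "sales_partial.csv": b"id,amount\n1,10\n",
}
groups = copy_groups(files)
print(groups)
```

In this sketch the two identical files land in one copy group even though their names differ, while the partial file is left out because it is an overlap rather than an exact copy.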

Data rationalization provides your organization with the following benefits:

  • You can de-clutter your data lake for improved search performance.
  • You can identify duplicate files and delete them to save on storage costs and reduce the volume of files.
  • You can identify deprecated versions of files to make sure users are using the correct version.
  • You can manage the redundancy of sensitive data to mitigate security risks and ensure compliance with security regulations.

Tour the Rationalization dashboard

You can use the Data Rationalization dashboard to view details of duplicate copies of files that contain overlapping data in your data lake. This dashboard offers a drill-down interactive menu for copy or overlap analysis of resources grouped by size, owner, volume, and source so you can identify business assets to help manage redundancies. For security, the data comparison is restricted to the resources that the user has access to.

Before you can use the Data Rationalization dashboard, you must first run a Data Rationalization job to provide data for the dashboard to display. The dashboard displays data as of the time the job was run.

Note: After you have de-cluttered the data in your data sources, you must run the profile, discovery, lineage, and rationalization jobs to reflect your changes in the dashboard.

To open the Data Rationalization dashboard, on the Home page select Dashboard Rationalization.

Once the dashboard appears, you can view the following summary information:

  • As Of Time

    Time that the Data Rationalization job was run. All information will be current up to this date and time.

  • Unique Resources

    Total number of resources that are not copies of anything else, including tables, files, and collection roots.

  • Copy Resources

    Total number of resources that are a copy of at least one other resource.

  • Overlapping Relationships

    Total number of resources that overlap with at least one other resource. Overlapping refers to files that either contain part of another file or are copies of another file.

  • Copy Groups

    Total number of discovered copy groups.

In the Data Rationalization dashboard, select the Copies tab to use tools for investigating the duplicate data, or select the Overlaps tab to use tools for investigating the overlapping data.

Working with copies

After you run the Data Rationalization job, the Data Rationalization dashboard displays summary information of the analysis. At a glance, you can view the numbers of unique resources compared to copy resources and the total number of overlapping relationships and copy groups.

When copy groups are present, you can view a further breakdown of the data as reports based on properties:

  • Top Copy Groups By Volume

    The amount of storage space consumed by the five largest groups of duplicate files.

  • Top Copy Groups By Resource Count

    The number of files in the five groups that contain the most duplicate files.

  • Top Copy Groups By Resource Size

    The size of the files with duplicate copies consuming the most resources. For example, the group may contain a few large files, so those copies consume a lot of storage space.

  • Top Data Sources With Most Copies

    The data sources with the largest number of copies.

  • Top Owners By Volume Of Copies

    The data source owners with the largest number of copies.
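
The ranking behind reports such as Top Copy Groups By Volume and Top Copy Groups By Resource Count can be sketched as follows. This is a minimal illustration, not the dashboard's implementation: the `top_copy_groups` function and the `(copy_group, size_in_bytes)` input shape are hypothetical.

```python
from collections import defaultdict


def top_copy_groups(resources, n=5):
    """Return (top groups by volume, top groups by resource count).

    `resources` is a list of (copy_group, size_in_bytes) tuples.
    This input shape is an assumption made for the example.
    """
    volume = defaultdict(int)
    count = defaultdict(int)
    for group, size in resources:
        volume[group] += size  # total storage consumed by the group
        count[group] += 1      # number of files in the group
    by_volume = sorted(volume.items(), key=lambda kv: kv[1], reverse=True)[:n]
    by_count = sorted(count.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return by_volume, by_count


# Hypothetical data: group_a has few large files, group_b has many small ones.
resources = [
    ("group_a", 500), ("group_a", 500),
    ("group_b", 10), ("group_b", 10), ("group_b", 10),
]
by_volume, by_count = top_copy_groups(resources)
print(by_volume)
print(by_count)
```

The sample shows why the two reports can rank groups differently: a group with a few large files leads by volume, while a group with many small files leads by resource count.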

You can explore all the copy groups within a report by clicking the View All button on each report. The available data for that copy group appears in a table format. For a quick look, you can mouse over a bar graph to view the name of the copy group with either the exact count of the group or the volume of the group by size in megabytes.

You can drill down into data sources and files to quickly investigate your duplicates by clicking the bar graph in a report to view further information about that copy group or data source. For example, to explore the top copy group identified in the Top Copy Groups By Resource Count report, click the bar graph of the copy group to view the list of included resources. From here, you can click a resource to open it in the Data Catalog single resource view to explore the fields, or export the list to the team best suited to resolve the copy issues.

Figure: Copy Group details

To export your list, click the Export Table as CSV link. An export dialog box opens with a link to the Export your findings to a CSV file page.

Filtering your view

The Data Rationalization dashboard features filters that let you drill further down into your data. Click Filters to filter your data, then mouse over the group that you want to filter and select the filters for the data that you want to include. You can select multiple filters for a more detailed view. Only data for the selected filters displays in the dashboard. To remove filters, deselect them individually, or click the Clear All Filters link to remove all filters.

Figure: Applying filters

Working with overlaps

Overlaps contain mostly the same data, but are not perfect copies. For example, modifications to a small percentage of the file may have occurred from enrichment activities, such as adding or deleting a couple of rows or columns in a table. To view the overlaps detected in your system, click Overlaps to display the tab.
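
As a rough illustration of what "mostly the same data" means, the following Python sketch computes the fraction of one table's rows that also appear in another. The `overlap_ratio` function and the row-level comparison are assumptions made for this example; the source does not describe how Data Catalog actually measures overlap.

```python
def overlap_ratio(rows_a, rows_b):
    """Fraction of the distinct rows in `rows_a` that also appear in `rows_b`.

    A result of 1.0 means every row of `rows_a` is present in `rows_b`.
    """
    a, b = set(rows_a), set(rows_b)
    if not a:
        return 0.0
    return len(a & b) / len(a)


# Hypothetical tables: the enriched version replaces one of four rows.
original = [("1", "10"), ("2", "20"), ("3", "30"), ("4", "40")]
enriched = [("1", "10"), ("2", "20"), ("3", "30"), ("5", "50")]
ratio = overlap_ratio(original, enriched)
print(ratio)
```

Here three of the four original rows survive in the enriched table, so the tables overlap heavily without being exact copies.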

The Overlaps tab offers three reports: the Top Resources With Most Overlaps, Top Data Sources With Most Overlaps, and the Top Owners With Most Overlaps. Drilling down into the reports helps you determine if these overlaps are the result of prep activity around a file and if this particular file needs to be stored in your data lake.

You can explore all the resources and data sources within a report by clicking the View All button on each report. The available data for that resource or data source appears in a table format. For a quick look, you can mouse over a bar graph to view the name of the source with either the exact count of the overlaps or the volume of the overlaps by size in megabytes (MB).

As with the Copies tab reports, you can drill down into data sources and files to investigate your overlaps by clicking the bar graph in a report. From here, you can click a resource to open it in the Data Catalog single resource view to explore the fields, or export the list to the team best suited to resolve the overlap issues.

To export your list, click the Export Table as CSV link. An export dialog box opens with a link to the Export your findings to a CSV file page.

Export your findings to a CSV file

You can export your findings as a report to a CSV file to analyze offline or share with others. The process of exporting includes two parts: you first generate a report of your findings, then download the report to a CSV file. You can generate reports from virtual folders, single resource view details, search results, or the Data Rationalization dashboard.

Attention: This feature is available in version 6.1 and later.

Perform the following steps to export your findings to a CSV file.

Procedure

  1. Click Export as CSV or Export Table as CSV to start generating the export data.

    The Export CSV Settings dialog box displays.
  2. Select which properties to export for each resource. Click Select All to include all the properties listed.

  3. Click Export to generate the data.

    After the CSV data values are successfully generated, a confirmation message appears in the header with an exports link.
  4. If you are ready to download the generated information at this point, click exports in the header message.

    The Exports page opens. This page provides a summary of your exported reports, including the report name, the report type (where it was generated), the generation interval, and the report size. Any report listed here is automatically deleted within seven days from the time the report is generated.
    Note: If you want to wait until later to download the generated CSV data, you can access the Exports page through the Exports option in your User Profile menu.
  5. From the reports table, click More actions, and then select Download report.

    Figure: Exports page

    The generated CSV file is downloaded to the location specified for the Path to exports configuration property during your installation of Data Catalog. See Managing configurations if you need to reconfigure the Path to exports property to a different location.
  6. (Optional) To delete a report, click More actions, and then select Delete report.

Results

After the report is downloaded, you can access the CSV file offline from the Path to exports location and share it with others.
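
For offline analysis, a downloaded CSV file can be summarized with a short script. The following Python sketch counts exported resources per owner. The column names `resource` and `owner`, the sample rows, and the `copies_per_owner` function are hypothetical; the actual columns depend on the properties you selected in the Export CSV Settings dialog box.

```python
import csv
import io
from collections import Counter


def copies_per_owner(csv_file, owner_column="owner"):
    """Count rows per owner in an exported CSV.

    `owner_column` is an assumed column name; adjust it to match
    the header row of your own export.
    """
    reader = csv.DictReader(csv_file)
    return Counter(row[owner_column] for row in reader)


# Stand-in for an exported file. A real export would be opened with
# open(path, newline="") from the Path to exports location.
sample = io.StringIO(
    "resource,owner\n"
    "/data/a.csv,alice\n"
    "/data/b.csv,bob\n"
    "/data/c.csv,alice\n"
)
counts = copies_per_owner(sample)
print(counts.most_common())
```

A summary like this can help you decide which owner or team should receive the exported list.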