Hitachi Vantara Lumada and Pentaho Documentation

Managing data objects

Data objects are virtual resources created when users define sets of join conditions between resources. For example, you have access to multiple resources that tabulate employee contact information, employee skills, and their assignments on a particular project. To gather employee contact and skill set information for all employees assigned on a project, you can build a data object by defining join conditions between these resources.
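
To make the employee example concrete, the following Python sketch joins three small in-memory tables on a shared employee ID. It only illustrates the idea of a join-defined virtual resource. The table contents, the field names (emp_id, email, skill, project), and the build_data_object helper are hypothetical, and Data Catalog itself builds data objects through its user interface rather than through code like this.

```python
# Hypothetical sample resources; all names and values are illustrative only.
contacts = [
    {"emp_id": 1, "email": "ana@example.com"},
    {"emp_id": 2, "email": "raj@example.com"},
    {"emp_id": 3, "email": "li@example.com"},
]
skills = [
    {"emp_id": 1, "skill": "SQL"},
    {"emp_id": 2, "skill": "Spark"},
    {"emp_id": 3, "skill": "Hive"},
]
assignments = [
    {"emp_id": 1, "project": "apollo"},
    {"emp_id": 3, "project": "apollo"},
    {"emp_id": 2, "project": "zephyr"},
]


def build_data_object(project):
    """Inner-join the three resources on emp_id for one project."""
    contact_by_id = {row["emp_id"]: row for row in contacts}
    skill_by_id = {row["emp_id"]: row for row in skills}
    joined = []
    for assignment in assignments:
        if assignment["project"] != project:
            continue
        emp_id = assignment["emp_id"]
        # Keep only employees present in every resource (inner join).
        if emp_id in contact_by_id and emp_id in skill_by_id:
            joined.append({
                "emp_id": emp_id,
                "email": contact_by_id[emp_id]["email"],
                "skill": skill_by_id[emp_id]["skill"],
                "project": project,
            })
    return joined


apollo_team = build_data_object("apollo")
```

Here apollo_team holds one combined record per employee assigned to the hypothetical apollo project, which is the kind of result the join conditions of a data object describe.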

You can create join conditions in two ways:

  • With joins suggested by Lumada Data Catalog, which are inferred from tag associations.
  • With user-defined joins, single or composite.

You can profile, tag, search, and browse data objects like any other resource in Data Catalog. You can also view a visual interpretation of the relationships between all the resources in the data object. Using the graphic view, you can browse the possible join paths and profile the data object to assess its accuracy and relevance.

Create a data object

Only profiled and tagged data can be part of a data object. Before you create a data object, make sure the resources you want to include are profiled and tagged so that they appear in the join suggestions as potential data object members.

Perform the following steps to create a data object:

Procedure

  1. Navigate to the detailed resource view of the resource you want to be the initial building block for your data object.

  2. Click the Data Objects tab to create a new data object.

  3. Click New Data Object to start building a data object.

    The Suggested joins appear in the list as Join Keys. The Tags are identified on the resources with their join Cardinality relationship:

    • one_many
    • many_one
    • one_one
    • many_many
    The Join Keys are field names for the resources considered for potential joins.
  4. Click a suggested join to further inspect the join statistics.

    The JOIN STATS display the Overlap statistics, which identify the overlap between the two resources. If the JOIN STATS look favorable, click Join to establish the relationship between the two resources.

    In the previous example, 77% of the values in CUSTOMER_RISK_PROFILE.CSV overlap with 59% of the values in LAD_WILSHIRE_PARTIES.CSV in a MANY_ONE relationship, which makes this pair a good join candidate.

    Note: You can save the data object at this point or continue building it. The rest of this task shows a manual join. Go to step 6 if you decide to save your data object now.

    If fields overlap between resources, you can establish manual joins without having tag associations discovered for those fields. Manual joins can be simple with single field joins, or composite with multiple field joins as shown in the following example.

  5. When you have a first name-last name combination and you want to create a unique relationship between resources, perform the following steps to establish a composite manual join:

    1. Exit Edit mode and, in the graph, click the Manual join icon on the file name to which you want to add the manual join. This example uses the Customer_Risk_Profile.csv file.

    2. Enter the fields in Key 1 for the base resource, which is Customer_Risk_Profile.csv in this case.

    3. In Resource 2, enter the resource that is expected to contain the values in Key 1. Start typing the field names and select the best match from the list that displays.

    4. Select the field names for Key 2.

    5. Click Join to establish a composite manual join.

  6. Click Save to save the data object and provide details such as Data object name and Data object description.

    The data object name must begin with an alphabetic character and can include alphanumeric characters, hyphens, and underscores. It cannot contain dots, because Data Catalog uses the dot separator to identify lineage or parentage between nested folders. Also, make sure that the name strings do not match any Data Catalog reserved names.

  7. Click Save to save the data object.
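
The overlap and cardinality figures mentioned in steps 3 and 4 can be approximated with a short Python sketch, which may help when judging whether a single or composite join looks favorable. It assumes that overlap means the share of distinct key values on one side that also occur on the other, and that the first half of a cardinality label describes the first resource. This article does not state how Data Catalog computes either figure, and the join_stats helper and sample rows are hypothetical.

```python
from collections import Counter


def join_stats(left_rows, right_rows, left_keys, right_keys):
    """Estimate overlap and cardinality for a single or composite join.

    Assumption: overlap is the share of distinct key values on one side
    that also occur on the other side. Data Catalog's exact formula is
    not documented in this article.
    """
    # A composite key is represented as a tuple of several field values.
    left_counts = Counter(tuple(r[k] for k in left_keys) for r in left_rows)
    right_counts = Counter(tuple(r[k] for k in right_keys) for r in right_rows)
    shared = set(left_counts) & set(right_counts)

    left_overlap = len(shared) / len(left_counts) if left_counts else 0.0
    right_overlap = len(shared) / len(right_counts) if right_counts else 0.0

    # A side counts as "one" when no shared key value repeats on that side.
    left_side = "one" if all(left_counts[k] == 1 for k in shared) else "many"
    right_side = "one" if all(right_counts[k] == 1 for k in shared) else "many"
    return {
        "left_overlap": left_overlap,
        "right_overlap": right_overlap,
        "cardinality": f"{left_side}_{right_side}",
    }


# Hypothetical rows: several risk profiles can point at one party.
profiles = [
    {"first": "Ana", "last": "Diaz"},
    {"first": "Ana", "last": "Diaz"},
    {"first": "Raj", "last": "Rao"},
    {"first": "Li", "last": "Wu"},
]
parties = [
    {"fname": "Ana", "lname": "Diaz"},
    {"fname": "Raj", "lname": "Rao"},
    {"fname": "Eve", "lname": "Moss"},
    {"fname": "Sam", "lname": "Kerr"},
]
stats = join_stats(profiles, parties, ["first", "last"], ["fname", "lname"])
```

With this sample data, two of the three distinct name pairs in profiles and two of the four in parties are shared, and only the profiles side repeats a shared key, so the sketch reports a many_one relationship. Passing a single field name in each key list gives the simple, non-composite case.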

Update a data object

Perform the following steps to update a data object:

Procedure

  1. Navigate to Browse, click Data Objects, then select the data object you want to update.

    The Graph View of the data object appears. You can continue building an existing data object from any of its members: make that member resource the focus block in the Graph View, then define join conditions from it. In the following example, LAD_Investment_Advisor_Clients.csv is the focus block.
  2. Select a focus for the next join condition, then click Edit.

    In this example, the edit mode of the CustomerRiskAssessment data object shows the join suggestions for the new focus.
  3. Select the suggested join or establish another manual join to complete the data object.

  4. Save the data object with the same name.

    The updated data object appears.
  5. Click the List icon in the upper-right corner of the Graph View.

    A list of the data objects that include the focus resource appears.
  6. Select the data object from the list to update the Graph View with the selected data object.

Next steps

Any update on a data object, such as new joins or deleted joins, requires you to reprofile the data object to reflect the changes.

Using a data object as a template

An existing data object can be used as a template to build another data object. Data Catalog creates a copy of the template data object and opens it in Edit mode.

You can make edits or updates as described in the previous task, then save the copy as a new data object.
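
Saving the copy requires a new name, which is presumably subject to the same naming rules given in Create a data object. As a minimal sketch, those rules can be checked before you type a name into the dialog box. The helper below assumes that alphabetic and alphanumeric mean ASCII letters and digits, and it takes the reserved names as a caller-supplied set because this article does not list them. The function name and its parameters are hypothetical.

```python
import re

# Starts with a letter; then letters, digits, hyphens, or underscores.
# Dots fail the check because they are not in the allowed character set.
_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_-]*")


def is_valid_data_object_name(name, reserved_names=frozenset()):
    """Check a candidate name against the documented naming rules.

    reserved_names is a hypothetical caller-supplied set; the actual
    Data Catalog reserved names are not listed in this article.
    """
    if name in reserved_names:
        return False
    return _NAME_PATTERN.fullmatch(name) is not None
```

For example, CustomerRiskAssessment and risk_2024-v2 pass this check, while 2024_risk (leading digit) and risk.v2 (contains a dot) do not.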

Profiling data objects

Once you assemble a data object, you can retrieve its metadata as defined by the join conditions using the profiling and tagging operations, like any other resource in Lumada Data Catalog. Additionally, you can apply the Refresh joins operation, which applies only to data objects.

Refresh joins

You can refresh any join to verify that the join statistics are still favorable. Although suggested joins already display these statistics to aid your join selection decision, manual joins are limited because join statistics are unavailable at join time. For this reason, refreshing joins is particularly useful after you have established manual joins.
Note: The Refresh join job provides an estimate of the join statistics over 1,000 rows. The Profile job is more accurate, however, because it examines and fetches join statistics over the entire data profile.
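
The difference between a 1,000-row estimate and a full scan can be seen in a small Python sketch. It assumes that the sample is simply the first N rows of each resource and that overlap is the share of distinct values on the first side that also occur on the second. Neither detail is stated in this article, and the overlap_ratio helper and its data are hypothetical.

```python
def overlap_ratio(left_values, right_values, sample_size=None):
    """Share of distinct left values that also occur on the right.

    sample_size=None scans every row (analogous to a full Profile job);
    an integer limits the scan to the first N rows of each side
    (analogous to the Refresh join estimate).
    Assumption: the sample is simply the first N rows; how Data Catalog
    picks its 1,000 rows is not stated in this article.
    """
    left = set(left_values[:sample_size])
    right = set(right_values[:sample_size])
    if not left:
        return 0.0
    return len(left & right) / len(left)


# Hypothetical key columns with 5,000 rows each.
left_ids = list(range(5000))
right_ids = list(range(2500, 7500))

estimate = overlap_ratio(left_ids, right_ids, sample_size=1000)
full = overlap_ratio(left_ids, right_ids)
```

In this deliberately extreme case, the first 1,000 rows of each column share no values, so the estimate is 0.0, while the full scan finds that half of the left values have a match. Real data is rarely this skewed, but the sketch shows why a favorable or unfavorable estimate is worth confirming with a profile.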

Perform the following steps to refresh a join.

Procedure

  1. Navigate to Browse, then click Data Objects.

    The Data Objects page opens.
  2. Click the data object for which you want to refresh joins, and click Edit.

  3. Click the Joins tab.

    The existing joins appear.
  4. Click the row of the join to refresh.

  5. Click the Refresh icon to refresh the join.

    The Refresh join job is submitted.

Profile a data object

Perform the following steps to profile a data object:

Procedure

  1. Navigate to Browse, then click Data Objects.

    The Data Objects page opens.
  2. Click the data object you want to profile, then click View Profile.

  3. On the Profile view, click the More actions icon and select Profile data object.

    A Profiling mode dialog box prompts you to choose whether to profile the data object incrementally (processing only the changes since the last profile) or non-incrementally (profiling the entire data object).
  4. Choose either Yes to process only the changes since last profile or No to profile the entire data object.

    A profile job is triggered for the data object. If the join conditions yield any records, then the resulting data object's metadata profile is a cumulative data profile of the participating resources. Note that the data object sensitivity is determined by the highest sensitivity of the member resources. In addition, the data object assumes the origin of the member resource on which the data object is built.
  5. (Optional) To edit the data object or view the participating members, click Edit or Graph View respectively.
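
The sensitivity rule from step 4 amounts to taking a maximum over the member resources. The sketch below assumes that sensitivity levels can be compared as numbers, with a larger value meaning more sensitive. The actual Data Catalog levels are not described in this article, and the function name, member names, and level values are hypothetical.

```python
def data_object_sensitivity(member_sensitivities):
    """Return the highest sensitivity among the member resources.

    Assumption: sensitivity levels are comparable numbers where a larger
    value means more sensitive. The actual Data Catalog levels are not
    described in this article.
    """
    if not member_sensitivities:
        raise ValueError("a data object needs at least one member resource")
    return max(member_sensitivities.values())


# Hypothetical members and levels.
members = {
    "Customer_Risk_Profile.csv": 3,
    "LAD_Wilshire_Parties.csv": 1,
    "LAD_Investment_Advisor_Clients.csv": 2,
}
level = data_object_sensitivity(members)
```

In practice this means that adding one highly sensitive resource raises the sensitivity of the whole data object.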

Delete a data object

You can delete a data object. However, note that deleting a data object cannot be undone.

Perform the following steps to delete a data object:

Procedure

  1. Navigate to Browse, then click Data Objects.

    The Data Objects page opens.
  2. On the row of the data object you want to delete, click the More actions icon and select Delete.

    A confirmation dialog box appears, warning you that deleting a data object cannot be undone.
  3. Click Delete to confirm your action.

Export a data object to a CSV file

Once data objects are created, Lumada Data Catalog treats data objects like any other resource and you can export one to a CSV file for offline viewing.
Note: The Export To CSV menu option is available only if the data object is profiled.

Perform the following steps to export a data object to a CSV file.

Procedure

  1. Navigate to Browse, then click Data Objects.

    The Data Objects page opens.
  2. Click the data object you want to export to a CSV file, then click View Profile.

  3. Click the More actions icon and select Export To CSV.

    The Export To CSV dialog box appears with the name of the data object entered as the Report Name.
  4. Accept the default Report Name or enter a new one and click Next.

  5. Select the fields to export into the CSV file.

  6. Click Export to start the CSV file download process.

    The CSV file is downloaded into the designated folder configured in the browser, which is typically the Download folder. You can view the following sample exported CSV file.

Generating a Hive view

Once data objects are created, Lumada Data Catalog treats these data objects like any other resource and you can generate a corresponding Hive view. The data object Hive view generation has specific pre-conditions and restrictions as described in the following section.

Hive view generation checklist

  • Confirm the Hive Server is configured with Lumada Data Catalog JAR files.

    Before you can generate Hive views, the Hive server must be updated with Lumada Data Catalog JAR files. See Updating Hive Server with LDC JARs for details.

  • When generating a Hive view for a data object, confirm the active user has authorization for at least one Hive database available through Data Catalog.

    The user generating the Hive view must have system-level write access to the directory containing the member resources of the data object. Data Catalog uses these generated tables to generate the data object Hive view.

  • To create a Hive view, Data Catalog must have at least the schema information for the resource.

    Because data objects only support profile job operations, Hive view generation of data objects requires that the data object be profiled before creating its Hive view.

  • Hive view generation is only supported for data objects with members belonging to different virtual folders (HDFS and Hive only).

    If any two data object members belong to the same virtual folder, Data Catalog cannot create a corresponding Hive view. It displays an error message and also writes a message to the log files.

  • When you create a Hive view, it displays in Data Catalog immediately.

    However, detailed metadata and data are available only after you profile the Hive view.
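
Part of the checklist above can be expressed as a simple pre-check, which might be useful for reasoning about why a Hive view request would be rejected. The sketch covers only three items: the profiled state, Hive database authorization, and the virtual folder rule. It does not check the JAR configuration or file system write access. The Member class, the hive_view_blockers function, and all parameter names are hypothetical and are not part of any Data Catalog API.

```python
from dataclasses import dataclass


@dataclass
class Member:
    """A data object member and the virtual folder it belongs to."""
    name: str
    virtual_folder: str


def hive_view_blockers(members, is_profiled, authorized_hive_databases):
    """Return a list of reasons why Hive view generation would fail."""
    blockers = []
    if not is_profiled:
        blockers.append("data object is not profiled")
    if not authorized_hive_databases:
        blockers.append("user has no authorized Hive database")
    folders = [m.virtual_folder for m in members]
    # The checklist requires members to sit in different virtual folders.
    if len(set(folders)) != len(folders):
        blockers.append("two or more members share a virtual folder")
    return blockers


# Hypothetical members in two different virtual folders.
members = [
    Member("Customer_Risk_Profile.csv", "hdfs_raw"),
    Member("lad_wilshire_parties", "hive_sales"),
]
blockers = hive_view_blockers(members, True, ["analytics"])
```

An empty list means none of the three modeled conditions stands in the way. Any other result names the checklist items to resolve first.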

Generate a Hive view

You can profile and tag the Hive view of a data object like any other resource in Lumada Data Catalog.

Perform the following steps to generate a Hive view of a data object:

Procedure

  1. Navigate to Browse, then click Data Objects.

    The Data Objects page appears.
  2. Click the data object for which you want to generate a Hive view, and click View Profile.

  3. Click the More actions icon, then select Generate Hive view.

    The Generate Hive View dialog box opens, with the name of the data object entered as the Hive View Name.
  4. Select a Hive virtual folder and a Hive database, accept the default Hive View Name or enter a new one, then click Next.

    By default, Data Catalog uses associated tags to auto-fill the Field Names window.
  5. Accept the default or enter the field names you want to use for the Hive view and click Next.

  6. Click Generate Hive View.
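
For orientation, a Hive view over a join is ordinarily defined with a HiveQL CREATE VIEW ... AS SELECT statement. The Python sketch below assembles such a statement for a two-table composite join. It is an illustration only: this article does not show the statement that Data Catalog actually generates, and the hive_view_ddl function plus every database, table, and column name here are hypothetical. The identifiers are not validated, so the helper is unsuitable for untrusted input.

```python
def hive_view_ddl(database, view_name, left, right, join_keys, fields):
    """Build a CREATE VIEW statement for a two-table join.

    join_keys: list of (left_column, right_column) pairs.
    fields: list of (source_expression, view_field_name) pairs.
    """
    select_list = ", ".join(f"{src} AS `{alias}`" for src, alias in fields)
    # Several key pairs joined with AND form a composite join condition.
    on_clause = " AND ".join(
        f"l.`{lk}` = r.`{rk}`" for lk, rk in join_keys
    )
    return (
        f"CREATE VIEW `{database}`.`{view_name}` AS "
        f"SELECT {select_list} "
        f"FROM `{left}` l JOIN `{right}` r ON {on_clause}"
    )


# Hypothetical names throughout.
ddl = hive_view_ddl(
    "analytics",
    "customer_risk",
    "risk_profile",
    "parties",
    [("first_name", "fname"), ("last_name", "lname")],
    [("l.`risk_score`", "risk_score"), ("r.`party_id`", "party_id")],
)
```

The view field names in the second element of each fields pair correspond loosely to the Field Names you accept or edit in step 5.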

Data objects access logic

Lumada Data Catalog applies access logic that keeps data objects from reaching an unintended audience while exposing enough metadata for users to request data access.

Depending on your access level, your access to data objects follows these rules:

  • Resource Read Access level is NATIVE

    • You cannot see any join options unless you have access to both resources involved in the join.
    • You cannot enter joins that involve resources you have no data access to.
    • Any data object that you construct under these restrictions can be profiled and viewed normally, because you have access to all of its resources.
  • Resource Read Access level is METADATA

    • You can see all join options and can enter joins, but may not be able to see stats on join conditions.
    • For single column joins, you cannot run or see join-related stats, join cardinality, or overlaps, unless you have access to both resources in question.
    • For composite joins, you cannot run or see join and join keys stats, join cardinality, overlaps, selectivity, or null-related stats unless you have access to both resources in question.
    • You cannot run or view data object profiles unless you have access to all resources included in the data object.
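
The rules above can be summarized as a small decision function. This is a sketch of the logic as described in this section, not Data Catalog code. The function names, the boolean flags, and the returned dictionary keys are hypothetical. Treating join statistics as visible under NATIVE whenever both resources are accessible is an inference from the statement that such data objects are viewed normally.

```python
def join_permissions(access_level, has_left_access, has_right_access):
    """Return what a user may do with one join, per the rules above.

    access_level: "NATIVE" or "METADATA" (the Resource Read Access level).
    has_*_access: whether the user has data access to each resource.
    """
    both = has_left_access and has_right_access
    if access_level == "NATIVE":
        # Joins are hidden entirely without access to both resources.
        return {"see_join": both, "enter_join": both, "see_stats": both}
    if access_level == "METADATA":
        # Joins are visible and can be entered; stats need data access.
        return {"see_join": True, "enter_join": True, "see_stats": both}
    raise ValueError(f"unknown access level: {access_level!r}")


def can_profile(member_access):
    """Profiles need data access to every member resource."""
    return all(member_access)


# A METADATA-level user with data access to only one of two resources.
perms = join_permissions("METADATA", True, False)
```

In this example the user can see and enter the join but cannot see its statistics, and cannot run or view a profile of any data object that includes the inaccessible resource.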