Hitachi Vantara Lumada and Pentaho Documentation

Exploring your data

In Lumada Data Catalog you can create several virtual data units that make it easier for you to manage the metadata of your data sources. Use this article to learn how to access and manage these data units based on your access level.

  • Virtual folders

    A virtual folder is a group of resources with the same data source type. Because data resources can be part of multiple virtual folders, virtual folders can have overlapping sets of data resources. The root virtual folder can contain virtual folders, collections, and files. See the Exploring virtual folders and Manage virtual folders topics for more information.

  • Collections

    A collection is a group of files with the same format and schema that Data Catalog recognizes, arranges hierarchically, and categorizes during profiling. See View a collection.

  • Datasources

    With Data Catalog, you can connect to data sources in file systems, Hive, relational databases such as AWS, Oracle, SQL databases, and JDBC sources. You can browse the schema, tables, and fields of your data sources. See Manage data sources.

Data Canvas

Lumada Data Catalog provides a graphical user interface for investigating your data called the Data Canvas. The Data Canvas view offers detailed insights into resource metadata so that you can understand your data before you put it to practical use. Click the Explore Your Data card on the landing page to open the Data Canvas view and begin exploring your data. If you have not added any data sources to Data Catalog, you must first add one. See Manage data sources.

The Data Canvas is divided into four primary areas: (1) the Top Navigation pane, (2) the Navigation pane, (3) the Content pane, and (4) the Data Summaries pane. Select a data element in the Navigation pane to view its details in the Content pane. The information that displays varies according to the type of resource you have selected. For example, if you select a folder or schema, its metadata displays in the Content pane. In the example below, the data lineage and a user-defined description display in the Content pane, and the metadata for the selected table displays in the Data Summaries pane.

The Data Canvas view

Data Canvas top navigation pane

The top navigation pane on the Data Canvas displays the Explore Your Data title and a search field. Enter a search term in the SEARCH field to perform a global search of all the data sources. See Searching Data Catalog.

Navigation pane

Navigate the tree of data resources to find the one you want to explore in the canvas. The navigation pane contains the following:

  • Find field
  • Data tree view
  • Action menu
  • Actions button

Enter a search term in the Find field to search for resources such as Virtual Folders, Folders, Collections, Schemas, Tables, Files, or Fields within the navigation pane.

The navigation pane displays the data resources and virtual folders you have added to Data Catalog in a tree view. A checkbox displays next to each resource in the tree. When you select the data source you want to work with in the navigation pane, the Content pane displays the structure of that data source. When you select an individual resource, the resource name is highlighted in the tree view and the metadata of that resource displays in the Content pane of the canvas view. The name and path of the selected item display in the banner area of the Content pane.

You can click the three-dot Action menu on the right of each resource in the navigation tree to scan, process, or bookmark that resource. Selecting Scan will immediately scan the resource. Selecting Process opens the Process Selected Items page, which lists the processes you can run on your resource. Selecting Bookmark adds a bookmark to your Bookmarks page.

Another way to perform an action on a resource is to select the checkbox next to the resource in the Navigation pane, click Actions at the bottom of the Navigation pane, then click Process. This also opens the Process Selected Items page.

Content pane

Information about the selected resource displays in the Content pane and also in the Data Summaries pane. The information that is displayed in the two panes depends on the type of resource selected. In the example below, a table is selected in the navigation pane. The numbered items in the illustration below reference the areas in the Data Canvas content pane. When you view the contents of a column or field, the displayed details include the resource-level metadata along with Data Catalog's data analysis, discovered and seeded or accepted terms, cardinality for fields, and sample values.

Content pane

Item | Name | Description
1 | Data banner | Displays the name, path, and type icon that identify the resource. It is possible for a resource to have multiple origins.
2 | Actions button | Click to view the actions available for processing, saving, and copying the data, depending on the selected asset type. The actions you can take in the data content area are: change to a Galaxy view of the data, process the selected data, and copy the resource ID of the data.
3 | Data tabs | Click to view additional information about the resource.
4 | Data Lineage | Displays information about lineage.
5 | Key Metrics | Displays information about Sensitivity and Activity.
6 | Close button | Click to close the page and return to the Home page.

Data summaries pane

The Data Summaries pane contains the following informational areas:

Tab views

Use these tabs to view different details and perform actions on the resource. You can also apply filters and edit the content of the pages displayed by these tabs.

  • Summary tab

    Review overview information about the resource, including the path details, associated and suggested resource tags, profiling status, resource owner, and description of the resource. See Data summaries.

  • Details tab

    Explore detailed information about the table, including the column names, data types, length, keys, business terms, and owners. You can also apply filters and edit the details about the resource.

  • Properties tab

    View the property name, description, and value details about the resource. You can also add and remove properties, apply filters, and edit the value of a property.

  • Glossary tab

    Explore the business terms and business entities, and link information on the resource. You can also add and remove business terms or business entities, and apply filters.

Under the Tab views there are tools to rate the data, to see who is viewing the data, and to bookmark resources.

Item | Description
Rating | Indicates the popularity of the resource based on the overall user ratings. It is a computed property: the average of all the ratings. The rating is a fair indication of how the resource is regarded, whether the rating is low or high. A low rating highlights an issue with the resource, such as incomplete data.
Bookmarks | You can mark a resource as a favorite by clicking the Bookmarks icon for quicker future reference. See Bookmarking resources.
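
The Rating row describes a computed property that averages all user ratings. As a minimal sketch of that calculation, assuming ratings are plain numbers (the function name `overall_rating` and the numeric scale are illustrative assumptions, not part of Data Catalog):

```python
def overall_rating(ratings):
    """Return the average of all user ratings, or None when nobody has rated yet.

    Hypothetical helper: Data Catalog's actual rating scale and storage
    are not described in this article.
    """
    if not ratings:
        return None
    return sum(ratings) / len(ratings)


if __name__ == "__main__":
    # Four users rated the resource; one low rating pulls the average down.
    print(overall_rating([5, 4, 2, 5]))
```

Returning `None` for an unrated resource keeps "no ratings" distinct from a genuinely low rating.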

Key Metrics

Explore key metrics and metadata for insights into the selected resource to make quick decisions. The processing engine discovers the resource metadata, and the AI engine calculates other data. Depending on your selected resource, you may view the following information panels:

Key Metrics panel

Shows metadata attributes about the data source. The AI engine uses these attributes to provide a deeper insight into the resource. For example, when you request access to a resource, a high sensitivity level warrants a thorough check of the permission and access levels of your user role. Each of these discovered and calculated attributes is described below.

Sensitivity

Computed metadata attribute that identifies the sensitivity of the resource. The sensitivity is based on the highest sensitivity level of any term (field or resource) associated or suggested on the resource. If the AI engine identifies a single term with high sensitivity, the resource sensitivity is automatically set to high (High Sensitivity icon). Sensitivity levels are defined as follows:

Icon | Level | Description
High sensitivity | High | Identifies highly-sensitive data for personally identifying information, such as bank account numbers, SSN, PAN, credit card numbers, and employee IDs.
Medium sensitivity | Medium | Identifies medium sensitivity for data like names, addresses, and marital status.
Low sensitivity | Low | Identifies low sensitivity for data that may be personally identifiable but with lower sensitivity.
Non-sensitive | Non-sensitive | Identifies data that AI has determined is not sensitive because no sensitivity label was associated with that resource after processing and curation.
Unknown sensitivity | Unknown sensitivity | Identifies data that needs to be processed to determine the sensitivity.
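
The rule described above (a resource takes the highest sensitivity level of any associated or suggested term) can be sketched as follows. This is only an illustration: the `Sensitivity` enum, the function name, and the use of `None` to represent an unprocessed resource are assumptions, not Data Catalog's internal model.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    """Hypothetical ordering of the levels in the table above."""
    NON_SENSITIVE = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def resource_sensitivity(term_levels):
    """Return the highest sensitivity among a resource's terms.

    term_levels is None  -> resource not processed yet (unknown), returns None.
    term_levels is empty -> no sensitivity label was associated, so non-sensitive.
    """
    if term_levels is None:
        return None
    return max(term_levels, default=Sensitivity.NON_SENSITIVE)


if __name__ == "__main__":
    terms = [Sensitivity.LOW, Sensitivity.HIGH, Sensitivity.MEDIUM]
    # A single high-sensitivity term makes the whole resource high.
    print(resource_sensitivity(terms).name)
```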

Activity

A computed attribute measuring how well this resource is curated in terms of content updates performed on the resource like description, lineage, tags, and reviews. The amount of the Activity icon that is filled indicates the level of curation, as follows:

Level | Description
High | Curation level is more than 30 and is considered high.
Medium | Curation level is between 10 and 30.
Low | Curation level is less than 10.
None | Not curated.
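
The thresholds in the table can be read as a simple mapping from a numeric curation level to an Activity level. The sketch below assumes that a curation level of 0 means "not curated" and that the boundary values 10 and 30 fall in Medium; the article does not state either detail, and `activity_level` is a hypothetical name.

```python
def activity_level(curation_level):
    """Map a numeric curation level to the Activity level shown in the table.

    Assumptions (not confirmed by the article): 0 or less means not curated,
    and the boundaries 10 and 30 are treated as Medium.
    """
    if curation_level <= 0:
        return "None"
    if curation_level < 10:
        return "Low"
    if curation_level <= 30:
        return "Medium"
    return "High"


if __name__ == "__main__":
    for level in (0, 4, 10, 30, 31):
        print(level, activity_level(level))
```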

Status panel

Lists the resource profile status of jobs that run on the resource. Click the displayed status of a profiled source to view the details and the timestamp of the last update.

  • Profiled

    The discovery jobs ran successfully. Displays the timestamp of the last successful profile job.

  • Not Yet

    No Data Catalog job has processed the file yet.

Business Terms

Lists associated discovered and suggested business terms for the resource. If no terms are associated with a resource, you can click Add Term to open the Business Terms dialog box and add terms on the resource. Both associated and suggested terms are listed with any "overflow" terms indicated by an overflow number link. Click this link to display all the terms in a separate dialog box.

Source Properties

Lumada Data Catalog discovers seven resource properties by default. The Source Properties panel summarizes the Properties tab by listing the properties, including the connection, resource type and name.

Statistics panel

Provides information about discovered metadata in a field including the data type and frequency.

Data statistics for the column are shown in a donut chart. The donut chart shows how much of the data consists of null values.

Data Catalog assigns the mixed data type to a field after examining the data in that field. For example, consider a date of birth (dob) field. In one resource, the discovered data type of the field is string, but the donut chart displays two different inferred data types: string and Boolean. In the dob field of another resource, the discovered data type is string and the donut chart shows that Data Catalog infers both string and dateTime data types. Examining the corresponding sample values reveals that all values have some sort of date format. This feature is controlled by a profiling flag for number and date formats, which is described in Managing configurations in the Administration Guide.

The value distributions of the data are shown in a frequency graph. The frequency graph indicates the number of times a given value appears in a data sample. The value field shows the 20 most frequent values for that entity.
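
As a rough illustration of what the frequency graph summarizes, the sketch below counts how often each value occurs in a sample and keeps the 20 most frequent. The function name `top_values` and the choice to skip null (`None`) values are assumptions made for this example.

```python
from collections import Counter


def top_values(sample, limit=20):
    """Return (value, count) pairs for the most frequent values in a sample.

    Null values (None) are skipped here; that is an assumption of this
    sketch, not documented Data Catalog behavior.
    """
    counts = Counter(v for v in sample if v is not None)
    return counts.most_common(limit)


if __name__ == "__main__":
    sample = ["CA", "NY", "CA", None, "TX", "CA", "NY"]
    for value, count in top_values(sample):
        print(value, count)
```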

Feature | Description
Min | The minimum value in the column.
Null Count | Number of entries that are null.
Date Time Count | The time and date counts found in the field.
Selectivity | The ratio of Cardinality to the total number of values in the field, that is, the percentage of unique values. Use this field characteristic along with the Cardinality value when selecting a good seed for tagging. See Cardinality and selectivity calculations.
Max | The maximum value in the column.
Value Count | Total number of records in that file. The value distribution in each field is shown in the frequency graph.
String Count | When string data is found in the field, this indicates the number of strings.
Cardinality | The number of unique values in a field, where a low cardinality number indicates many repeated values. Use this field characteristic along with the Selectivity value when selecting a good seed for tagging. See Cardinality and selectivity calculations.
Density | The ratio of the Null Count to the Max count. Indicates the density of the data without the null values, where a number denotes that many rows are missing data for this field.
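
Several of these features follow directly from the values in a field. The sketch below computes a few of them exactly, in memory, for a small list of values. It is only an illustration of the definitions in the table: Data Catalog itself works on profiled data and may approximate cardinality, and the function name, the dictionary keys, and the choice to count nulls in the selectivity denominator are assumptions.

```python
def field_statistics(values):
    """Compute a few of the statistics described in the table for one field.

    values may contain None to represent nulls. Nulls are counted in the
    total but excluded from cardinality, min, and max (an assumption).
    """
    non_null = [v for v in values if v is not None]
    total = len(values)
    cardinality = len(set(non_null))  # number of unique values
    return {
        "value_count": total,
        "null_count": total - len(non_null),
        "cardinality": cardinality,
        # Selectivity: ratio of cardinality to the total number of values.
        "selectivity": cardinality / total if total else 0.0,
        "min": min(non_null, default=None),
        "max": max(non_null, default=None),
    }


if __name__ == "__main__":
    stats = field_statistics(["CA", "NY", "CA", None, "TX", "CA"])
    for name, value in stats.items():
        print(name, value)
```

A field with low cardinality and low selectivity, such as a state code, has many repeated values; a field whose selectivity is close to 1 has mostly unique values.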

Item Properties

Displays the original name of the database for a table, or the field name and value for a column-level resource. The panel shows whether the selected item is nullable, along with the column size, the data type, and the database name. You can click View All to see detailed information on the Properties page for the resource.

Data summaries

Data Catalog offers rich field metadata in graphical formats like donut charts, value histograms, and unique value counts to help you quickly analyze data. Data Catalog also provides quality metrics and sample values in addition to profiled samples. You can view the metadata for the most frequent values from each field. Quality metrics such as maximum, minimum, cardinality, and selectivity are calculated on all field values during profiling.

To open a data type profile, navigate to the column in the resource that you want to view and click it to explore the field-level data.

When viewing column details, you can see the resource field-level metadata along with data analysis, discovered and seeded or accepted field tags, cardinality for fields, and sample values. To show metadata in the resource field, you need native access to the resource or metadata level as governed by the RBAC settings for your user role.

Depending on the selected resource level or data element, you can view different summaries of information, including the following resource metrics:

Description

Displays a description of the resource that is imported from the source. You can contribute resource information to the knowledge base by writing content and including links to other articles in Lumada Data Catalog. You can also access and communicate the usage history of the resource. To edit a description, click Edit Description. You can use rich text formatting in the Description dialog box for content with formatted text (font style, size, color, bold, and italic), mathematical formulas, code blocks, highlighting, quote blocks, links, and inline images. You can also use the description text in a search string to search Data Catalog for resources containing similar strings.

Imported Description

Displays the description imported from the resource. You cannot edit imported descriptions. They are imported either from Hive/JDBC descriptions when these resources are processed, or through API calls for non-Hive/JDBC resources.

Data Quality

Communicates detailed data quality metrics for fields as derived from business rules. Trend indications reflect recent changes in the associated metric.

  • Accuracy reflects the reality of the data and its conformity with a verifiable source.
  • Completeness indicates whether the data is sufficient to be meaningful and actionable.
  • Consistency measures whether the same information, stored and used in multiple instances, has matching values.
  • Uniqueness indicates whether the information is the only recorded instance in the data set used.
  • Validity indicates whether the values conform to a required range or other requirement.

Data Lineage

Provides a snapshot of the upstream sources and downstream destinations of the resource.

Data Patterns

When you view a field, displays the common data patterns profiled by Data Catalog and their frequency. Patterns are shown in a form such as NN-AN-AAAA, where N stands for a numeric character and A for an alphanumeric character. Special separator characters, such as the hyphen in this example, are also included so that you can identify the different distributions.

If your user role does not permit you access to the field or viewing level of the information, the Data Patterns pane will not be displayed.
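
To make the pattern notation concrete, here is a small sketch that turns values into patterns of the NN-AN-AAAA form and counts how often each pattern occurs. The mapping used (digits become N, letters become A, every other character is kept as a separator) is an assumption chosen to reproduce the example form; the exact rules Data Catalog applies during profiling are not described here, and both function names are hypothetical.

```python
from collections import Counter


def value_pattern(value):
    """Translate a string into a pattern: digit -> N, letter -> A, other kept.

    Assumed mapping for illustration only.
    """
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("N")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)  # separators such as '-' or '/' are preserved
    return "".join(out)


def pattern_frequencies(values):
    """Return (pattern, count) pairs, most frequent pattern first."""
    return Counter(value_pattern(v) for v in values).most_common()


if __name__ == "__main__":
    codes = ["12-A7-XKCD", "34-B1-ABCD", "9-Z"]
    for pattern, count in pattern_frequencies(codes):
        print(pattern, count)
```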

Sample Data

Shows the most frequently occurring values for the field along with the frequency and distribution when viewing a column. Text names and values are truncated after 256 characters. If your data includes strings that have only numbers, such as zip codes, Data Catalog displays the values as numbers without leading zeros. You can identify resources that have been sample-profiled and other resource-level information.

To view this pane, your role must allow Sample Data Access through native system permissions.

If your user role has administrative privileges, you can configure these values. If not, contact your administrator for details.
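
The two display behaviors described for sample values (truncation after 256 characters, and numeric-only strings shown as numbers without leading zeros) can be sketched as below. This is an illustration under assumptions: `display_value` and `MAX_DISPLAY_CHARS` are hypothetical names, and how Data Catalog actually applies these rules is not specified in this article.

```python
MAX_DISPLAY_CHARS = 256  # truncation length stated in the text above


def display_value(raw):
    """Return a sample value roughly as the text above describes its display.

    Strings made only of ASCII digits are shown as numbers, which drops
    leading zeros (for example, a zip code). Other strings are truncated.
    """
    if raw.isascii() and raw.isdigit():
        return str(int(raw))
    return raw[:MAX_DISPLAY_CHARS]


if __name__ == "__main__":
    print(display_value("02134"))           # zip code loses its leading zero
    print(len(display_value("x" * 300)))    # long text is cut to the limit
```

The zip code case is worth remembering when you compare sample values against the source: the displayed number can differ from the stored string.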

Bookmarking resources

In Lumada Data Catalog, you can add a resource as a bookmark. In the Collaboration bar of the Data Canvas, click the Bookmarks icon for a file, table, database, or folder to include as a resource in your bookmarks list. You have easy access to bookmarked resources from the toolbar. Bookmarks are private for each user role and are stored individually.

To delete a bookmark that is no longer useful, click the Bookmarks icon on the toolbar, locate and select the name of the bookmark you want to delete, and then click Remove.

Note: Bookmarks are maintained by Data Catalog for every user role separately and are not visible to other users. You cannot share your bookmarks with other users. Also note that root virtual folders and data sources cannot be bookmarked.

Cardinality and selectivity calculations

Lumada Data Catalog calculates cardinality based on a data subset in the field. The result is an approximate value that is less accurate if the number of unique values in the field is more than 2000. This "tipping point" of 2000 values can be configured by your Data Catalog administrator.

When fields include more than one type of value (all numbers or a combination of numbers and letters), the data profiles apply from the perspective of one data type. For example, the text profile shows statistics calculated with only the text values and indicates the non-text and null values.

Cardinality and selectivity for a certain data type show what part of the total field values are of that type. The sum of cardinality across all data types is the number of values in the field. The sum of selectivity across all data types is 1.
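
One reading of the paragraph above is that each data type accounts for a share of the field's values, and those shares add up to 1. The sketch below illustrates only that reading. The type classification (`number`, `text`, `null`), the function names, and treating nulls as their own category are all assumptions for the example; they are not Data Catalog's type inference.

```python
from collections import Counter


def classify(value):
    """Very rough type classification, for illustration only."""
    if value is None:
        return "null"
    try:
        float(value)
        return "number"
    except ValueError:
        return "text"


def type_shares(values):
    """Return the share of the field's values that belongs to each type."""
    counts = Counter(classify(v) for v in values)
    total = len(values)
    return {kind: count / total for kind, count in counts.items()}


if __name__ == "__main__":
    field = ["10", "20", "abc", None, "30", "xyz", "40", "50"]
    shares = type_shares(field)
    print(shares)
    print(sum(shares.values()))
```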

Sample-profiled resources

In Lumada Data Catalog's browser, you can identify resources that have been sample-profiled and other resource-level information.

If a large resource is profiled with the -sample parameter, the Statistics panel identifies its aggregated analysis and lists the number of rows sampled. The Status panel details also identify this resource as sample-profiled.