Hitachi Vantara Lumada and Pentaho Documentation

Exploring your data

In Pentaho Data Catalog, you can explore and discover data scattered across various sources, whether databases, cloud storage, or data lakes. Its search and discovery capabilities give you quick access to the data you need, while comprehensive metadata helps you understand data structures, quality, and ownership.

Data Canvas

Pentaho Data Catalog provides a graphical user interface for investigating your data called the Data Canvas. The Data Canvas view shows detailed resource metadata to help you understand your data and how it can be used. Click the Explore Your Data card on the landing page to open the Data Canvas view and begin exploring your data. If you have not added any data sources to Data Catalog, you must first add one. See Manage data sources.

The Data Canvas is divided into three primary areas: the header (1), the navigation pane (2), and the content canvas (3). Select a data element in the navigation pane to view its details in the content area. In the example below, the metadata for the selected table opens in the content pane.

The Data Canvas view showing table metadata

Navigation pane

Navigate the tree of data resources to find the one you want to explore in the Data Canvas. Expand the data source and select the resources you want to work with, then view the structure of your data source in the Content pane. You can also enter a search term in the Search field to search for resources such as folders, schemas, tables, files, or fields within the navigation pane.

When you select an individual resource, the resource name is highlighted in the tree view and the metadata of that resource displays in the Content pane. You can view the name of the selected item and the path in the banner. In addition, select Process to open the Choose Process page, which lists the processes you can run on your resource. For more information, see Data profiling and data identification.

Content pane

You can view details about the selected resource in the Content pane in the respective tabs. The details displayed depend on the type of resource selected. For example, if you select a table, then you can view the contents of a column or field, the resource-level metadata with Data Catalog's data analysis, cardinality for fields, and sample values.

Data Canvas Content Pane

The following table identifies the key details available in the Content pane for a table resource:

Item | Name | Description
1 | Data banner | Displays the name, path, and type icon that identify the resource.
2 | Actions button | Click to view the actions available for processing, saving, and copying the data, depending on the selected asset type. The actions you can take in the data content area are:
  • Process

    Process the selected data.

  • View Galaxy

    Change to a Galaxy view of the data.

  • Copy Path

    Copy the data path.

  • Migrate*

    Choose the location and move the selected data assets.

  • Delete*

    Delete the file from the file server.

    Caution: Once you delete a data asset, you cannot recover it.
3 | Data tabs | Click to view additional information about the resource:
  • Summary
  • Details
  • Properties
  • Glossary
  • Comment
* The action item appears only when:
  1. Data Catalog has Data Storage Optimizer integrated.
  2. You have imported the data source in Data Storage Optimizer. For more information, see Importing a data source.
  3. You have selected a data asset (file type) that Data Storage Optimizer supports.

Data tabs

In Data Canvas, use individual tabs to view different details and perform actions on the resource. The tabs that appear vary according to the selected resource.

Summary tab

In Data Catalog, you can view metadata in graphical formats like value histograms and unique value counts to help you analyze data quickly. You can also view sample values and profiled samples.

To open a data type profile, navigate to the column in the resource you want to view and click it to explore the field-level data.

When viewing column details, you can see the resource field-level metadata along with data analysis, cardinality for fields, and sample values. To show metadata in the resource field, you need native access to the resource or metadata level as governed by the RBAC settings for your user role.

Depending on the selected resource level or data element, you can view different summaries of information, including the following resource metrics:

  • Description

    Displays a description of the resource that is imported from the source. You can contribute resource information to the knowledge base to write content and include links to other articles in Data Catalog. To edit the description, click Edit Description, which will open a dialog box where you can format the text using tools like bold, italic, underline, and strikeout. You can also align text, insert code blocks, and add links as needed.

  • System Information

    When you choose an unstructured file, it displays the timestamps for file creation, modification, and last access.

    Note: In certain file systems, when a file's modification date is earlier than its creation date, certain APIs, like the SMB network client, may display the more recent date as the modification date.
  • Statistics

    When you select a table, you can view the Field Count and Row Count statistics. When you select a column in a table, the Statistics pane shows the following key details:

    Feature | Description
    Null Count | The number of entries that are null.
    Cardinality | The number of unique values in a field. A low cardinality indicates many repeated values.
    HLL | An estimate of the cardinality of the data, with a margin of error of roughly 2%.
    Blank Count | The number of entries that are blank.
    Min Width | The minimum number of characters in a value in the column.
    Max Width | The maximum number of characters in a value in the column.
    Avg Width | The average number of characters in a value in the column.
  • Data Patterns

    In Data Catalog, data pattern analysis offers recommendations based on detected patterns and their frequency. These recommendations include regular expressions (RegEx) at different levels of pattern-matching precision: loose, moderate, and strict. Data Catalog gives you the flexibility to choose the most appropriate patterns. Simplifying the values to just the characters 'A', 'a', 'n', and 's' reveals the underlying data patterns more clearly. After obtaining a set of simplified patterns along with their respective frequency counts, candidate regular expressions can be generated. The following options demonstrate possible expressions tailored to the desired level of strictness:

    Pattern | Description
    ^\w{2}\d{5}$ | Loose Pattern: This pattern is less strict and excludes the last value in the example with 80% confidence.
    ^[K]\w\d{5}$ | Strict first letter and five digits: This expression maintains strict criteria for the first letter while allowing for variability in the subsequent characters.
    ^[K]\w\d{5,6}$ | Loose on the second character: This pattern ensures 100% confidence but introduces flexibility for the second character.
    ^[K][A,L,T,W]\d{5,6}$ | More Strict Pattern: This expression imposes stricter conditions while maintaining 100% confidence.
    ^[A-Z][A-Z]\d{5,6}$ | Another 100% confidence pattern that differs in its structure.

    Caution: If your user role does not grant access to the field or viewing level of the information, the Data Patterns pane does not appear.
  • Sample Data

    When you view a column, shows random values for the field along with their frequency and distribution. Text names and values are truncated after 200 characters. You can identify resources that have been sample-profiled and other resource-level information.

    To view this pane, your role must allow Sample Data Access through native system permissions. If your user role has administrative privileges, you can configure these values. If not, contact your administrator for details.

  • Properties panel

    Displays a summary of the resource properties, like the last update time stamp, name, version, and type of the resource.

  • Business Terms panel

    Lists associated business terms for the resource. You can also click Add Term to open the Business Terms dialog box and add terms to the resource. For more information, see Manage business glossary.

  • Tags panel

    Lists the tags associated with the resource. You can also add tags such as “quality:45” (the key must be unique) to the resource, which helps to identify the resource by its tagged keywords.

  • Custom Properties panel

    Lists the first five custom properties associated with the resource. Custom properties refer to user-defined metadata attributes or fields that can be associated with various data assets, such as databases, tables, files, or documents, to provide additional context and information about those assets. To add a custom property, click Add Custom Property and provide the required information. In addition, go to the Properties tab to see the complete list of custom properties added to the resource.
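The Data Patterns analysis described in the list above can be illustrated with a small sketch. The character mapping used here (A for uppercase letters, a for lowercase letters, n for digits, s for anything else) and the sample values are assumptions made for illustration; they are not confirmed to match how Data Catalog computes its patterns. The candidate expressions are the ones listed in the Data Patterns table.

```python
import re
from collections import Counter


def simplify(value: str) -> str:
    """Reduce a value to class symbols (assumed: A=upper, a=lower, n=digit, s=other)."""
    symbols = []
    for ch in value:
        if ch.isdigit():
            symbols.append("n")
        elif ch.isalpha() and ch.isupper():
            symbols.append("A")
        elif ch.isalpha():
            symbols.append("a")
        else:
            symbols.append("s")
    return "".join(symbols)


def pattern_counts(values):
    """Count how often each simplified pattern occurs."""
    return Counter(simplify(v) for v in values)


def confidence(regex: str, values) -> float:
    """Fraction of values matched by a candidate expression."""
    compiled = re.compile(regex)
    return sum(1 for v in values if compiled.match(v)) / len(values)


# Hypothetical sample values; the last one has six digits instead of five.
samples = ["KA12345", "KL23456", "KT34567", "KW45678", "KA123456"]

candidates = [
    r"^\w{2}\d{5}$",
    r"^[K]\w\d{5,6}$",
    r"^[K][A,L,T,W]\d{5,6}$",
    r"^[A-Z][A-Z]\d{5,6}$",
]

if __name__ == "__main__":
    print(pattern_counts(samples))
    for regex in candidates:
        print(f"{regex}: {confidence(regex, samples):.0%}")
```

With these hypothetical samples, the loose expression matches four of the five values (80%), while the other three match all of them, which mirrors the confidence levels quoted in the table.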

Details tab

Contains detailed information about child resources. You can view the items available in the selected resource, along with some additional information. The information varies based on the resource selected. For example, if you select a data source, you can view available items like schemas for structured data and folders for file systems. You can also see the number of tables and columns, including associated tags. For schemas, you can see additional information like the row count and the last profiled date and time.

Properties tab

View the custom properties added to the resource and the details like name and value. You can also add custom properties and edit the value of a property. For more information, see Resource properties.

Glossary tab

Explore the business term information for the resource, such as category, domain, definition, and purpose. You can also add business terms to the resource. For more information, see Business Glossary.

Comment tab

Post a comment or view the comments posted on the selected resource along with the comment owner. You can reply to any comment and delete your own comments. In the comment dialog box, you can format the text using tools like bold, italic, underline, and strikeout. You can also align text, insert code blocks, and add links as needed.

Comments help you share information with other users and collaborate within the application, for example by starting a discussion thread to identify and resolve a problem.

Note: You can only delete the comments you posted. However, if you are an admin, you can delete any comment.

Processing data

Processing data involves the steps needed to extract meaningful insights and make effective use of your data. Two significant stages in this process are Metadata Ingest and Data Profiling, which apply to both structured and unstructured data. Additionally, data processing involves Data Identification for structured data.

Metadata Ingest

Ingests the metadata for file system, object store, and JDBC data sources.

Data Profiling

Data Profiling is a crucial step for any data analysis. It is the process in which Data Catalog examines file and JDBC data sources and gathers statistics about the data. It profiles data in the cluster and uses its algorithms to compute detailed properties, including field-level data quality metrics and data statistics.

Data Identification

Data Identification is an essential process in managing structured data. It involves tagging data to make it easier to search, retrieve, and analyze. By associating dictionaries and data patterns with tables and columns, you can ensure that data is appropriately categorized and easily accessed when needed.

Caution: You must run Data Profiling before proceeding with any Data Identification activities.

Processing unstructured data

Perform the following steps to process the unstructured data:

Before you begin

You must perform Metadata Ingest and Data Profiling to process unstructured data and view its properties.

Procedure

  1. Select the unstructured resource you want to investigate in Data Canvas.

    This can be a file or a folder.
  2. Click Process.

    The Choose Process pane opens with Metadata Ingest and Data Profiling options.
  3. In the Metadata Ingest tile, click Start to begin the metadata ingest process.

    You can view the status of metadata ingest on the Manage Workers page.
  4. To perform the data profiling, click the Data Profiling tile.

    The Profiling page opens with the following options to configure data profiling:

    Note: When configuring data profiling, it is recommended to use the default settings as they are suitable for most situations.

    Field | Description
    Ingest Properties | Parses document metadata from files.
    Compute Checksum | Calculates checksums for each file.
    Files Modified More Than Day(s) Ago | Filters file processing by modification timestamp.
    Files Modified More Than Day(s) Ago | Filters file processing by access timestamp.
    Extensions (Enter to add value. Leave empty to use all extensions) | Specify the document extensions, such as .pdf, .doc, .txt, and so on. Profiling is performed only for the specified extensions.
    Additional File Processing Threads | The number of processing threads for file processing per job. Keep this low if you are running many jobs.
    Persistence Threads | The number of persistence (write) threads per job. Keep this low if you are running many jobs.
    Supported Max File Size | Files larger than this size are skipped. Example: 100 MB
  5. Click Start.

    You can view the status of data profiling on the Manage Workers page.
  6. Go to Data Canvas and select an unstructured document to view its properties.

Results

The document properties are displayed in the Document Properties pane.
Note: The properties displayed vary according to the type of unstructured data selected.
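The file filters in the profiling options above (extensions, modification age, and maximum file size) can be pictured as a simple predicate applied to each file. The following is a minimal sketch under stated assumptions: the names `FileInfo` and `should_profile` are hypothetical, and the exact filter semantics (for example, that a file qualifies only if it was modified more than the given number of days ago) are an interpretation of the field names, not confirmed Data Catalog behavior.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from pathlib import PurePosixPath
from typing import Iterable, Optional


@dataclass
class FileInfo:
    """Hypothetical description of a file considered for profiling."""
    path: str
    size_bytes: int
    modified_at: datetime


def should_profile(
    file: FileInfo,
    now: datetime,
    extensions: Iterable[str] = (),
    modified_more_than_days: Optional[int] = None,
    max_size_bytes: Optional[int] = None,
) -> bool:
    """Return True if the file passes every configured filter."""
    allowed = {e.lower().lstrip(".") for e in extensions}
    if allowed:
        # An empty extension list means all extensions are accepted.
        ext = PurePosixPath(file.path).suffix.lower().lstrip(".")
        if ext not in allowed:
            return False
    if modified_more_than_days is not None:
        # Assumed semantics: keep only files older than the threshold.
        if now - file.modified_at <= timedelta(days=modified_more_than_days):
            return False
    if max_size_bytes is not None and file.size_bytes > max_size_bytes:
        # Files larger than the supported maximum are skipped.
        return False
    return True


MB = 1024 * 1024
now = datetime(2024, 1, 31)
report = FileInfo("/docs/report.PDF", 5 * MB, datetime(2024, 1, 1))
video = FileInfo("/docs/demo.pdf", 500 * MB, datetime(2024, 1, 1))
```

For example, `should_profile(report, now, extensions=["pdf", ".doc"], modified_more_than_days=7, max_size_bytes=100 * MB)` accepts the report, while the same call with `video` rejects it because of its size.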

Processing structured data

Perform the following steps to process the structured data:

Before you begin

You must perform Metadata Ingest, Data Profiling, and Data Identification to process structured data.

Procedure

  1. Select the structured resource you want to investigate in Data Canvas.

    This can be a table or column.
  2. Click Process.

    The Choose Process pane opens with Metadata Ingest, Data Profiling, and Data Identification options.
  3. In the Metadata Ingest tile, click Start to begin the metadata ingest process.

    You can view the status of metadata ingest on the Manage Workers page.
  4. To perform the data profiling, click the Data Profiling tile.

    The Profiling page opens with an option to configure data profiling. You can use Skip Recent (days) to skip profiling for recently profiled tables. For example, if the days field is set to 7, any table profiled within the last 7 days will be skipped.
    Note: When configuring data profiling, it is recommended to use the default settings as they are suitable for most situations.
  5. To perform data identification, click the Data Identification tile.

    Note: You must perform data profiling before proceeding with data identification.
    If data profiling has not been done, Data Catalog highlights it as Required. You can start data profiling from the Data Identification pane by clicking Start.
  6. Click Select Methods, select Dictionaries and Patterns, click Apply, and then click Start.

    You can view the status of data identification on the Manage Workers page.
  7. Go to Data Canvas to view tags.
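The Skip Recent (days) option from step 4 can be sketched as a cutoff comparison against each table's last profiled time. The function name `tables_to_profile` and the table names are hypothetical, and treating never-profiled tables as always eligible is an assumption.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def tables_to_profile(
    last_profiled: Dict[str, Optional[datetime]],
    now: datetime,
    skip_recent_days: int,
) -> List[str]:
    """Return tables that were not profiled within the last skip_recent_days days."""
    cutoff = now - timedelta(days=skip_recent_days)
    return [
        name
        for name, profiled_at in last_profiled.items()
        # Assumption: a table that has never been profiled is always included.
        if profiled_at is None or profiled_at < cutoff
    ]


now = datetime(2024, 6, 15)
history = {
    "orders": datetime(2024, 6, 12),    # profiled 3 days ago, skipped
    "customers": datetime(2024, 6, 1),  # profiled 14 days ago, included
    "invoices": None,                   # never profiled, included
}
selected = tables_to_profile(history, now, skip_recent_days=7)
```

With the days field set to 7, `selected` contains `customers` and `invoices`, and `orders` is skipped.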

Cardinality calculation

In Pentaho Data Catalog, cardinality is a measure of the uniqueness of values within a table column relative to the total number of rows in that table. It helps you understand the data's uniqueness and can assist in data analysis and profiling within Data Catalog.

Note: Cardinality calculation is particularly relevant for RDBMS data sources.

Once you've processed the data source within Data Catalog, go to Data Canvas and select a column. You can see the Cardinality score in the Statistics panel under the Summary tab.
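To make the column-level metrics in the Statistics panel concrete, the following sketch computes them for a small in-memory column. It is an illustration only: the function name `column_statistics` is hypothetical, and the choices to treat whitespace-only strings as blank, to exclude nulls from the width metrics, and to report a uniqueness ratio (distinct values divided by total rows) are assumptions rather than documented Data Catalog behavior.

```python
from typing import Optional, Sequence


def column_statistics(values: Sequence[Optional[str]]) -> dict:
    """Compute simple profiling metrics for one column of string values."""
    non_null = [v for v in values if v is not None]
    widths = [len(v) for v in non_null]
    distinct = len(set(non_null))  # exact count of unique non-null values
    return {
        "null_count": len(values) - len(non_null),
        "blank_count": sum(1 for v in non_null if v.strip() == ""),
        "cardinality": distinct,
        # Assumed definition: unique values relative to the total row count.
        "uniqueness_ratio": distinct / len(values) if values else 0.0,
        "min_width": min(widths, default=0),
        "max_width": max(widths, default=0),
        "avg_width": sum(widths) / len(widths) if widths else 0.0,
    }


# Hypothetical column with one blank and one null entry.
stats = column_statistics(["red", "green", "red", "", None, "blue"])
```

Here the exact distinct count is computed with a set, which is practical only for small columns; for large tables, an estimate is a common alternative to an exact count.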

Data sampling

In Data Catalog, the data sampling process is initiated without the need for a preliminary pre-analysis step. Samples are selected in a semi-random manner, with a focus on including the most recent entry in the respective table whenever possible.

The overall process involves worker implementations tailored to the specific database type. Inputs for this process include essential information like REST polling endpoints, challenge tokens for authentication, database connection details for both the source and Data Catalog, and the scope of data to be ingested. The process allows for the selection of a defined number of samples, typically ranging from 20 to 50, and can optionally purge existing schema and relationships for testing purposes.

The process outputs include progress updates during processing, such as estimated completion percentage and error counts, along with the final sample set sent to the database, which replaces any existing samples. Because multiple workers can run concurrently, detailed status information about active workers may be viewable on a Worker page.
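The semi-random selection described in this section can be sketched as follows. This is not the worker implementation: the function name `sample_rows` is hypothetical, and the sketch assumes the rows are already ordered from oldest to newest so that the last row is the most recent entry.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_rows(rows: Sequence[T], sample_size: int, seed: Optional[int] = None) -> List[T]:
    """Pick a random sample that always contains the most recent row."""
    if sample_size <= 0 or not rows:
        return []
    if len(rows) <= sample_size:
        # Fewer rows than requested: return everything.
        return list(rows)
    rng = random.Random(seed)
    most_recent = rows[-1]  # assumes rows are ordered oldest to newest
    others = rng.sample(list(rows[:-1]), sample_size - 1)
    return others + [most_recent]


# Hypothetical table of 1000 row identifiers, sampled down to 20 rows.
rows = list(range(1000))
picked = sample_rows(rows, sample_size=20, seed=1)
```

The resulting sample has 20 distinct rows, and its last element is always the most recent row (999 in this example). The seed parameter is there only to make the sketch reproducible.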