Hitachi Vantara Lumada and Pentaho Documentation

Manage worker processes

Pentaho Data Catalog (PDC) uses worker processes to perform nearly all of its data analytics functions. Most jobs consist of a single primary worker process that Data Catalog launches in response to a user action or a scheduled action. Some primary worker processes might also start secondary worker processes.

Worker processes

The following table lists the worker processes:

Process: Test Connection
Description: Returns detailed success or failure information for each step of the test. Data Catalog starts this worker process when you configure or update a data source connection. Data Catalog marks the data source “OFFLINE” until a successful test completes.
Actions performed:
  • Connect to data source
  • Authenticate
  • Retrieve list of schemas and store in MongoDB

Process: Metadata Ingest
Description: Ingests the metadata for one or more schemas.
Actions performed:
  • Read schema from data source and store in MongoDB

Process: Data Profiling
Description: Generates a variety of statistics and intermediate data with a single pass through the source data. Typically, this is the first process you run on your data.
Actions performed:
  • Create bitset
  • Create HyperLogLogs (HLL) for full data
  • Generate statistics (numeric and string related)
  • Generate data patterns
  • Lucene Indexing (optional)
  • Extract samples for viewing (<100)

Process: Data Identification
Description: Identifies and tags columns and tables using ontology information (dictionaries, aliases), along with underlying data and metadata.
Actions performed:
  • Tag columns based on dictionaries
  • Tag columns based on metadata and aliases

Process: Key Discovery
Description: Performs a variety of key discovery actions. Foreign key discovery requires that Data Profiling of the data sources has completed.
Actions performed:
  • Foreign key discovery
  • Superkey identification
  • Composite key discovery
  • Compound key discovery
  • Secondary key discovery
  • Natural and Surrogate key identification

Process: Data Quality
Description: Performs a full data quality (DQ) analysis on the underlying data, using regular expressions and other configurable business rules.
Actions performed:
  • RegEx matching
  • Data pattern analysis
  • Update column statistics
  • Evaluate column DQ rules
  • Evaluate row-relative DQ rules

Process: Sensitive Data Discovery (SDD)
Description: Performs the tasks beyond data identification for SDD. This process uses flows, lineage, Foreign Keys, and more to put together the items comprising PI and PII.
Actions performed:
  • Generate separate SDD Lucene Index which cross-references data
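
To make the Data Quality actions above more concrete, the following is a minimal Python sketch of how a regular-expression column rule could be evaluated. It is an illustration only and does not represent Data Catalog's implementation or API: the ColumnRule class, the evaluate_column_rule function, the min_match_ratio threshold, and the sample ZIP code pattern are all hypothetical names and values chosen for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class ColumnRule:
    """Hypothetical column-level DQ rule: values must fully match a regex."""
    name: str
    pattern: str
    min_match_ratio: float  # fraction of non-null values that must match


def evaluate_column_rule(values, rule):
    """Return (match_ratio, passed) for one column of values."""
    compiled = re.compile(rule.pattern)
    non_null = [v for v in values if v is not None]
    if not non_null:
        # Assumption for this sketch: an empty column fails the rule.
        return 0.0, False
    matches = sum(1 for v in non_null if compiled.fullmatch(str(v)))
    ratio = matches / len(non_null)
    return ratio, ratio >= rule.min_match_ratio


# Hypothetical rule: at least 75% of non-null values look like US ZIP codes.
us_zip = ColumnRule(name="us_zip_code", pattern=r"\d{5}(-\d{4})?", min_match_ratio=0.75)
ratio, passed = evaluate_column_rule(["94105", "10001-1234", "ABCDE", None, "60614"], us_zip)
print(f"{us_zip.name}: {ratio:.0%} matched, passed={passed}")
```

In this sketch, null values are excluded before the match ratio is computed, so three of the four non-null sample values match. Whether Data Catalog treats nulls this way is not stated here, so treat that choice as an assumption.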

Monitor worker status

From the Manage Your Environment page, you can see the number of completed worker processes and the number of worker alerts on the Workers card.

Note: You might also see a Processed Items region on your Home page if the Processed Items check box is selected in your Landing page options window.

Use the following steps to monitor the status of a worker process:

Procedure

  1. From the Manage Your Environment page, click View Workers to see the completed and in-progress worker processes.

    The Status column shows the current status of each worker process.
  2. Click the up arrow at the beginning of the worker process row to expand the information.

View worker process details

Use the following steps to view details of a worker process:

Procedure

  1. On the Workers page, locate the worker process that you want more information about.

  2. If an up arrow is visible at the beginning of the row for the worker process, click the arrow to expand the information.

  3. Click the View Details icon (>) at the end of the row.

    The View Worker Details window opens. If the process failed, an Exception tab might be available, in addition to the Details tab.
  4. Click Close to close the View Worker Details window.

Cancel a worker process

Use the following steps to cancel a worker process:

Procedure

  1. While a worker process is running, go to the Workers page and locate the worker process you want to cancel.

  2. Click Cancel at the end of the row.

    Data Catalog cancels the worker process and displays Cancelling in the Job Status column.