
Hitachi Vantara Lumada and Pentaho Documentation

Managing jobs

In Lumada Data Catalog, administrators can delegate data processing jobs to user roles. Depending on your role, you can perform administrative data cataloging and processing functions like profiling and term propagation on data nodes.

For sources you can access based on your role and the permissions set by your system administrator, you can run a job either with a job template or with a job sequence. You can select sequences that Data Catalog provides, or if you have a privileged role, you can use custom job templates with your data assets.

  • Templates

    You can use administrator-created job templates to run job sequences that apply to specific clusters. Job templates have system or Spark-specific parameters as command line arguments for the job sequences, such as driver memory, executor memory, or number of threads required based on a cluster size. You can update the default Data Catalog parameters prior to job execution. For example, you can clear the incremental profile, profile a collection as a single resource, or force a full profile instead of the default sampling option. See Managing job templates for details.

    Contact your system administrator to select the template that is best suited for your data cluster.

  • Sequences

    You can use Data Catalog's job sequences to execute jobs. These jobs run with default parameters, and you can optionally update the sequence using command line parameters.
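For instance, the Spark-specific parameters a template carries are typically standard spark-submit options. The following sketch uses a hypothetical helper (not a Data Catalog API) to assemble such a parameter string; the sizing values are illustrative only, not recommendations:

```python
# Hypothetical helper: assembles standard spark-submit options of the kind
# a job template passes on the command line. Not a Data Catalog API; the
# values below are illustrative, not sizing recommendations.
def build_spark_args(driver_mem_gb: int, executor_mem_gb: int, num_executors: int) -> str:
    return (
        f"--driver-memory {driver_mem_gb}g "
        f"--executor-memory {executor_mem_gb}g "
        f"--num-executors {num_executors}"
    )

print(build_spark_args(4, 8, 10))
# → --driver-memory 4g --executor-memory 8g --num-executors 10
```

Your system administrator determines the actual values appropriate for your cluster size.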

Analysts can run jobs by default. Stewards and Guests can only run jobs if the administrator has enabled job execution for their roles.

Run a job template on a resource

You can use a template when you run a Data Catalog job. A template is a custom definition for a given sequence, which may have a custom set of parameters.

For example, you may have a template defined for asset path /DS1/virtualFolder/VFA with a custom parameter set of [-X -Y -Z], and use it to run the same job against a resource in /DS2/virtualFolder/VFB. In this case, only the asset path in the applied template is updated internally to reflect that of VFB.
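The behavior in this example can be sketched as follows. The apply_template function is hypothetical, shown only to illustrate that the custom parameter set is reused as-is while the asset path changes:

```python
# Illustrative sketch only: when a template is applied to a different
# resource, only the asset path changes; the custom parameter set is
# reused as-is. apply_template is hypothetical, not a Data Catalog API.
def apply_template(template: dict, asset_path: str) -> dict:
    job = dict(template)            # copy the template definition
    job["asset_path"] = asset_path  # only the asset path is updated
    return job

template = {"asset_path": "/DS1/virtualFolder/VFA", "params": ["-X", "-Y", "-Z"]}
job = apply_template(template, "/DS2/virtualFolder/VFB")
print(job["asset_path"], job["params"])
# → /DS2/virtualFolder/VFB ['-X', '-Y', '-Z']
```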

To run a job template against a resource, such as a virtual folder or a single resource, perform these steps.

Procedure

  1. Click Data Canvas in the left navigation menu.

    The Explore Your Data page opens.
  2. Use the left navigation pane to drill down to the resource.

  3. Click More actions and then select Process from the menu that appears.

    The Process Selected Items page opens.
  4. Click Select Template.

    The Select from template page opens. Note that the selected resource determines the available job templates.
  5. Select the check box next to the template that you want to run and click Start Now.

Results

The job is submitted to the Data Catalog processing engine.

Run a job sequence on a resource

You can use a sequence when you run a Data Catalog job.
Note: Job sequences run with default parameters, so be mindful when running sequences on large data, which may require additional system or functional parameters for the job to run successfully. Contact your system administrator for the appropriate Spark parameters.
To run a job sequence against a resource such as a virtual folder or a single resource, perform the following steps:

Procedure

  1. Click Data Canvas in the left navigation menu.

    The Explore Your Data page opens.
  2. Use the left navigation pane to drill down to the resource.

  3. Click More actions and then select Process from the menu that appears.

    The Process Selected Items page opens.
  4. Select the type of job sequence to run:

    • Select Template: A template is a custom definition for a given process with a custom set of parameters.
    • Format Discovery: Identifies the format of data resources, marking the resources that can be further processed.
    • Schema Discovery: Applies format-specific algorithms to determine the structure of the data in each resource, producing a list of columns or fields for each resource's catalog entry.
      Note: In addition to alphanumeric characters, only spaces, hyphens, and underscores are supported in column names. The job fails if a column name contains any other special characters.
    • Collection Discovery: Discovers collections of data elements with the same schema.
    • Data Profiling: Applies data-specific logic to compute field-level statistics and patterns for each resource as unique fingerprints of the data. See MongoDB onboarding and profiling example video for a demonstration of creating a MongoDB data source and profiling it.
    • Data Profiling Combo: Starts a combined sequence of processes to profile your data, executing the format discovery, schema discovery, and data profiling processes.
    • Business Term Discovery: Compares and analyzes the computed fingerprints against any defined or seeded label signatures to discover possible matches. Note that users must have Run Term Discovery permissions to run this job.
    • Lineage Discovery: Shows relationships among resources in the form of a lineage graph. Data lineage identifies copies of the same data, merges between resources, and the horizontal and vertical subsets of these resources.
    • Data Rationalization: Finds redundant data copies and overlaps.

    The sequence page opens.
  5. Based on the resource, follow the workflow for the sequence.

  6. Click Incremental Profiling if you want to use incremental processing.

    Note: When you select Fast profiling mode in the sequence flow, the default values for sample-splits and sample-rows are used, as defined in the Agent component's configuration.
  7. In the Enter Parameters field, enter any command line parameters for the job sequence.

  8. Click Start Now.

Results

The job is submitted to the Data Catalog processing engine.

Monitoring job status

To view the status of your jobs, click the Notifications icon. In the list that appears, you can click the More actions (three dots) icon to mark the item as unread, or remove it from the list.

To see job details, click the job notification to open the Job Activity page.

The Job Activity page lists job submission details. You can sort the job list by clicking the Sequence Name, Template Name, or Status column headers. Note that only template jobs include entries for Template Name.

For each job, the status displays in the Status column, which is updated in real time.

To view job details, click the More actions icon at the end of the row for the job and click View Details. You can view sequence steps by clicking the down arrow to expand job information. To view a job log, click the right arrow (>) at the end of a job step to open more details in a dialog box. You can click View Log File or Download Log to view or download the log file, respectively.
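Because the Status column updates in real time, scripted monitoring reduces to a simple polling loop. The following is a generic sketch assuming a caller-supplied get_status function; how the status is actually retrieved from Data Catalog is not shown here:

```python
import time

# Generic polling loop for a job that ends in SUCCESS or FAILED.
# get_status is a caller-supplied callable; the mechanism for retrieving
# the status from Data Catalog is outside the scope of this sketch.
def wait_for_job(get_status, poll_seconds: float = 5.0, timeout: float = 600.0) -> str:
    waited = 0.0
    while waited <= timeout:
        status = get_status()
        if status in ("SUCCESS", "FAILED"):
            return status          # terminal status reached
        time.sleep(poll_seconds)   # wait before checking again
        waited += poll_seconds
    raise TimeoutError("job did not reach a terminal status in time")
```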

Monitor job status

To monitor job status, use the following steps:

Procedure

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.
  2. Click View Jobs on the Job Management card.

    The Job Activity page opens.
  3. Locate the Asset Name of the job, then view its status in the Status column.

Terminate a job

You can terminate a submitted job that has a Submitted or In progress status. The job is canceled based on Spark's job scheduling.

Follow the steps below to terminate a job.

Procedure

  1. Click Management in the left navigation menu.

  2. Click View Jobs on the Job Management card.

    The Job Activity page opens.
  3. Locate the Asset Name of the job you want to terminate.

  4. In the job row, click More actions and select Terminate Instance.

Results

The Status for the job instance indicates Cancelling and finally Cancelled. You are also notified about the job’s status in Notifications.

View job information

When you click the More actions icon and select View Details in the row of a job on the Job Activity page, the Job Info pane opens detailing the execution information.

To view the individual sequence details, click the down arrow in the row of a job on the Job Activity page.

For example, if the job sequence Data Profiling Combo was executed, three instance steps are listed in the Job Activity page in the order they execute: Format, Schema, Profile.

The job details pane provides the execution details of the sequence, as described in the following table:

  • Status: Lists the status of the sequence at run time, as follows:
      • INITIAL/SUBMITTED: The job is waiting. Any new job is in INITIAL status while waiting.
      • IN PROGRESS: The job is executing.
      • SUCCESS: The job finished without issues.
      • FAILED: The job ran into errors or issues.
  • Command: Lists the command executed, including the optional parameters used, if any.
  • Execution id: The execution identifier assigned by Data Catalog.
  • Spark event log: Click View Log File or Download Log to view or download the event file.
  • Total Size: The size of the data asset that was processed.
  • Success: The number of resources within the data asset that were processed successfully. A negative value indicates INITIAL/IN PROGRESS status. This value is only updated after job execution.
  • Skipped: The number of resources within the data asset that Data Catalog skipped, either because of a corrupt resource or an unsupported format.
  • Incomplete: The number of resources within the data asset that could not finish discovery due to issues.
  • Lineage Inserted: The number of lineages inserted during processing.
  • Tag Associations Inserted: The number of tag associations inserted during processing.
  • Tag Associations Removed: The number of tag associations removed during processing.
  • Start: The recorded start time.
  • End: The recorded end time.

If the Skipped or Incomplete count is 1 or more, you can click it for details about the skipped or incomplete resources. These lists are paginated to improve response time when there are large numbers of skipped or incomplete resources.
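Taken together with the Terminate a job section, the statuses above imply a simple lifecycle. The following sketch encodes those transitions as inferred from the descriptions on this page, not from product internals:

```python
# Job status lifecycle as inferred from this page: new jobs wait in
# INITIAL/SUBMITTED, run to SUCCESS or FAILED, and a Submitted or
# In progress job that is terminated passes through CANCELLING to
# CANCELLED. Inferred from the documentation, not product internals.
TRANSITIONS = {
    "INITIAL": {"SUBMITTED"},
    "SUBMITTED": {"IN PROGRESS", "CANCELLING"},
    "IN PROGRESS": {"SUCCESS", "FAILED", "CANCELLING"},
    "CANCELLING": {"CANCELLED"},
}

def is_valid_transition(current: str, new: str) -> bool:
    """Return True if a job may move from `current` to `new`."""
    return new in TRANSITIONS.get(current, set())
```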

Read job info

Follow the steps below to read information about your job execution.

Procedure

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.
  2. Click View Jobs on the Job Management card.

    The Job Activity page opens.
  3. Locate the Asset Name of the job you want to examine.

  4. In the job row, click the down arrow to expand the section.

    The instance step details appear.
  5. Click the right arrow to view an instance step.

    The job details dialog for the instance step opens.
  6. Click View Log File to view the job info.

    You can also click Download Log to output a copy of the log file to your local machine.
  7. Click Close.