Managing jobs
In Lumada Data Catalog, administrators can delegate data processing jobs to user roles. Depending on your role, you can perform administrative data cataloging and processing functions like profiling and term propagation on data nodes.
For sources you can access based on your role and the permissions set by your system administrator, you can run a job either with a job template or with a job sequence. You can select sequences that Data Catalog provides, or if you have a privileged role, you can use custom job templates with your data assets.
Templates
You can use administrator-created job templates to run job sequences that apply to specific clusters. Job templates have system or Spark-specific parameters as command line arguments for the job sequences, such as driver memory, executor memory, or number of threads required based on a cluster size. You can update the default Data Catalog parameters prior to job execution. For example, you can clear the incremental profile, profile a Collection as a single resource, or force a full profile instead of the default sampling option. See Managing job templates for details.
Contact your system administrator to select the template that is best suited for your data cluster.
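The template parameters described above are typically passed as Spark-style command line arguments. As a hypothetical illustration (the exact flags your templates accept may differ, so confirm them with your administrator):

```shell
# Hypothetical template parameter string (illustrative only).
# --driver-memory, --executor-memory, and --num-executors are standard
# spark-submit options; the flags your templates accept may be different.
--driver-memory 4g --executor-memory 8g --num-executors 4
```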
Sequences
You can use Data Catalog's job sequences to execute jobs. These jobs run with default parameters, which you can override with command line parameters.
Analysts can run jobs by default. Stewards and Guests can only run jobs if the administrator has enabled job execution for their roles.
Run a job template on a resource
For example, you may have a template defined for asset path /DS1/virtualFolder/VFA with a custom parameter set of [-X -Y -Z], and you can run the same job against a resource in /DS2/virtualFolder/VFB. In this case, Data Catalog internally updates only the asset path in the applied template to reflect that of VFB; the custom parameters are reused as-is.
To run a job template against a resource, such as a virtual folder or a single resource, perform these steps.
Procedure
On the Home page, click Data Canvas in the left-side menu bar.
Use the Navigation pane to drill down to the resource.
Click More actions and then select Process from the menu that appears.
The Process Selected Items page opens.
Click Select Template.
The Select from template page opens. Note that the selected resource determines the available job templates.
Select the check box next to the template that you want to run and click Start Now.
Results
The job starts with the template's parameters. You can monitor its progress on the Job Activity page.
Run a job sequence on a resource
Procedure
On the Home page, click Data Canvas in the left-side menu bar.
Use the Navigation pane to drill down to the resource.
Click More actions and then select Process from the menu that appears.
The Process Selected Items page opens.
Select the type of job sequence to run:

Sequence | Description |
Select Template | A template is a custom definition for a given process with a custom set of parameters. |
Format Discovery | Identifies the format of data resources, marking the resources that can be further processed. |
Schema Discovery | Applies format-specific algorithms to determine the structure of the data in each resource, producing a list of columns or fields for each resource's catalog entry. |
Collection Discovery | Discovers collections of data elements with the same schema. |
Profiling | Applies data-specific logic to compute field-level statistics and patterns for each resource as unique fingerprints of the data. |
Data Profiling Combo | Starts a combined sequence of processes to profile your data: format discovery, schema discovery, and data profiling. |
Business Term Discovery | Compares and analyzes the computed fingerprints with any defined or seeded label signatures to discover possible matches. Note that users must have Run Term Discovery permissions to run this job. |
Lineage Discovery | Shows relationships among resources in the form of a lineage graph. Data lineage identifies copies of the same data, merges between resources, and the horizontal and vertical subsets of these resources. |
Data Rationalization | Finds redundant data copies and overlaps. |

The sequence page opens. Based on the resource, follow the workflow for the sequence.
Click Incremental Profiling if you want to use incremental processing.
Note: When you select Fast profiling mode in the sequence flow, the default values for sample-splits and sample-rows are used as defined in the Agent component's configuration.
In the Enter Parameters field, enter any command line parameters for the job sequence.
Click Start Now.
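As an illustration, a value typed into the Enter Parameters field might resemble the following. The option names here are assumptions modeled on the sample-rows and sample-splits settings mentioned in the note above; consult your administrator for the parameters your sequences actually accept:

```shell
# Hypothetical Enter Parameters value (illustrative only; option names
# modeled on the sample-rows / sample-splits settings named in this guide).
--sample-rows 10000 --sample-splits 8
```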
Results
The job sequence starts. You can monitor its progress on the Job Activity page.
Monitoring job status
Click the Notifications icon to view the status of your jobs. You can click a report to see the job’s processing details.
You can also see job status for the jobs you executed on the Job Activity page.
The Job Activity page lists job submission details. You can sort the list by clicking any table column header except Asset Name, Agent, Time Elapsed, or Submitted By. Note that only template jobs include entries in the Template Name column.
For each job, the status displays in the Status column, which is updated in real time.
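Because the Status column updates in real time, a script that watches a job typically polls until the job reaches a terminal state. The following is a minimal sketch: the terminal status names are assumptions (this guide only names INITIAL and IN PROGRESS), and `poll_status` is a hypothetical stand-in for however you read the Status column in your environment.

```python
import time

# The status names below are assumptions; this guide only names the
# INITIAL and IN PROGRESS values. Your deployment may report others.
TERMINAL_STATUSES = {"SUCCESS", "FAILED", "TERMINATED"}  # assumed names
RUNNING_STATUSES = {"INITIAL", "IN PROGRESS"}            # named in this guide

def is_terminal(status: str) -> bool:
    """Return True when a job has finished and its counters are final."""
    return status.upper() in TERMINAL_STATUSES

def wait_for_job(poll_status, interval_s: float = 5.0,
                 timeout_s: float = 3600.0) -> str:
    """Poll a caller-supplied callable that returns the current Status
    column value, until a terminal status is seen or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = poll_status()
        if is_terminal(status):
            return status
        time.sleep(interval_s)
    raise TimeoutError("job did not reach a terminal status in time")
```

This keeps the polling loop separate from how the status is obtained, so the same helper works whether you scrape the UI, call an API, or read a log.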
Monitor job status
Procedure
On the Home page, click Open Activity on the View Activity card.
The Job Activity page opens.
Locate the Asset Name of the job, then view its status in the Status column.
The job's status is shown.
Terminate a job
Follow the steps below to terminate a job.
Procedure
On the Home page, click Open Activity on the View Activity card.
The Job Activity page opens.
Locate the Asset Name of the job you want to terminate.
In the job row, click More actions and select Terminate Instance.
Results
The job instance is terminated, and its status updates on the Job Activity page.
View job information
When you click the More actions icon and select View Details in the row of a job on the Job Activity page, the Job Info pane opens detailing the execution information.
To view the individual sequence details, click the down arrow in the row of a job on the Job Activity page.
For example, if the job sequence Data Profiling Combo was executed, three instance steps are listed in the Job Activity page in the order they execute: Format, Schema, Profile.
The job details pane provides the execution details of the sequence, as described in the following table:
Fields | Description |
Status | Lists the runtime status of the sequence. |
Command | Lists the command executed, including the optional parameters used, if any. |
Execution id | The execution identifier assigned by Data Catalog. |
Spark event log | Click View Log File or Download Log to view the event file. |
Total Size | Size of the data asset that was processed. |
Success | The number of resources within the data asset that were processed successfully. A negative value indicates INITIAL/IN PROGRESS status. This value is only updated after job execution. |
Skipped | The number of resources within the data asset that Data Catalog skipped, either because of a corrupt resource or an unsupported format. |
Incomplete | The number of resources within the data asset that could not finish discovery due to issues. |
Lineage Inserted | The number of lineages inserted during processing. |
Tag Associations Inserted | The number of tag associations inserted during processing. |
Tag Associations Removed | The number of tag associations removed during processing. |
Start | The recorded start time. |
End | The recorded end time. |
If the Skipped or Incomplete count is 1 or more, you can click it for details about the skipped or incomplete resources. These lists are paginated to improve response time when many resources are skipped or incomplete.
Read job info
Procedure
On the Home page, click Open Activity on the View Activity card.
The Job Activity page opens.
Locate the Asset Name of the job you want to examine.
In the job row, click the down arrow to expand the section.
The instance step details appear.
Click the right arrow to view an instance step.
The job details pane for the instance step opens.
Click View Log File to view the job info.
You can also click Download Log to output a copy of the log file to your local machine.
Click Close.