Managing jobs
In Lumada Data Catalog, administrators can delegate data processing jobs to user roles. Depending on your role, you can perform administrative data cataloging and processing functions like profiling and term propagation on data nodes.
For sources you can access based on your role and the permissions set by your system administrator, you can run a job either with a job template or with a job sequence. You can select sequences that Data Catalog provides, or if you have a privileged role, you can use custom job templates with your data assets.
Templates
You can use administrator-created job templates to run job sequences that apply to specific clusters. Job templates have system or Spark-specific parameters as command line arguments for the job sequences, such as driver memory, executor memory, or number of threads required based on a cluster size. You can update the default Data Catalog parameters prior to job execution. For example, you can clear the incremental profile, profile a Collection as a single resource, or force a full profile instead of the default sampling option. See Managing job templates for details.
Contact your system administrator to select the template that is best suited for your data cluster.
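The template parameters described above are typically passed as Spark-style command line arguments. As a hypothetical illustration (the exact flags your templates accept may differ, so confirm them with your administrator):

```shell
# Hypothetical template parameter string (illustrative only).
# --driver-memory, --executor-memory, and --num-executors are standard
# spark-submit options; the flags your templates accept may be different.
--driver-memory 4g --executor-memory 8g --num-executors 4
```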
Sequences
You can use Data Catalog's job sequences to execute jobs. These jobs run with default parameters, which you can override with command line parameters.
Analysts can run jobs by default. Stewards and Guests can only run jobs if the administrator has enabled job execution for their roles.
Run a job template on a resource
For example, you may have a template defined for asset path /DS1/virtualFolder/VFA with a custom parameter set of [-X -Y -Z], and you can run the same job against a resource in /DS2/virtualFolder/VFB. In this case, Data Catalog internally updates only the asset path in the applied template to reflect that of VFB; the custom parameters are reused as-is.
To run a job template against a resource, such as a virtual folder or a single resource, perform these steps.
Procedure
On the Home page, click Data Canvas in the left-side menu bar.
Use the Navigation pane to drill down to the resource.
Click More actions and then select Process from the menu that appears.
The Process Selected Items page opens.
Click Select Template.
The Select from template page opens. Note that the selected resource determines the available job templates.
Select the check box next to the template that you want to run and click Start Now.
Results
The job starts with the template's parameters. You can monitor its progress on the Job Activity page.
Run a job sequence on a resource
Procedure
On the Home page, click Data Canvas in the left-side menu bar.
Use the Navigation pane to drill down to the resource.
Click More actions and then select Process from the menu that appears.
The Process Selected Items page opens.
Select the type of job sequence to run:

Sequence | Description |
Select Template | A template is a custom definition for a given process with a custom set of parameters. |
Format Discovery | Identifies the format of data resources, marking the resources that can be further processed. |
Schema Discovery | Applies format-specific algorithms to determine the structure of the data in each resource, producing a list of columns or fields for each resource's catalog entry. |
Collection Discovery | Discovers collections of data elements with the same schema. |
Profiling | Applies data-specific logic to compute field-level statistics and patterns for each resource as unique fingerprints of the data. |
Data Profiling Combo | Starts a combined sequence of processes to profile your data: format discovery, schema discovery, and data profiling. |
Business Term Discovery | Compares and analyzes the computed fingerprints with any defined or seeded label signatures to discover possible matches. Note that users must have Run Term Discovery permissions to run this job. |
Lineage Discovery | Shows relationships among resources in the form of a lineage graph. Data lineage identifies copies of the same data, merges between resources, and the horizontal and vertical subsets of these resources. |
Data Rationalization | Finds redundant data copies and overlaps. |

The sequence page opens. Based on the resource, follow the workflow for the sequence.
Click Incremental Profiling if you want to use incremental processing.
Note: When you select Fast profiling mode in the sequence flow, the default values for sample-splits and sample-rows are used as defined in the Agent component's configuration.
In the Enter Parameters field, enter any command line parameters for the job sequence.
Click Start Now.
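As an illustration, a value typed into the Enter Parameters field might resemble the following. The option names here are assumptions modeled on the sample-rows and sample-splits settings mentioned in the note above; consult your administrator for the parameters your sequences actually accept:

```shell
# Hypothetical Enter Parameters value (illustrative only; option names
# modeled on the sample-rows / sample-splits settings named in this guide).
--sample-rows 10000 --sample-splits 8
```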
Results
The job sequence starts. You can monitor its progress on the Job Activity page.
Monitoring job status
Click the Notifications icon to view the status of your jobs. You can click a report to see the job’s processing details.
You can also see job status for the jobs you executed on the Job Activity page.
The Job Activity page lists job submission details. You can sort the list by clicking any table column header except Asset Name, Agent, Time Elapsed, or Submitted By. Note that only template jobs include entries in the Template Name column.
For each job, the status displays in the Status column, which is updated in real time.
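Because the Status column updates in real time, a script that watches a job typically polls until the job reaches a terminal state. The following is a minimal sketch: the terminal status names are assumptions (this guide only names INITIAL and IN PROGRESS), and `poll_status` is a hypothetical stand-in for however you read the Status column in your environment.

```python
import time

# The status names below are assumptions; this guide only names the
# INITIAL and IN PROGRESS values. Your deployment may report others.
TERMINAL_STATUSES = {"SUCCESS", "FAILED", "TERMINATED"}  # assumed names
RUNNING_STATUSES = {"INITIAL", "IN PROGRESS"}            # named in this guide

def is_terminal(status: str) -> bool:
    """Return True when a job has finished and its counters are final."""
    return status.upper() in TERMINAL_STATUSES

def wait_for_job(poll_status, interval_s: float = 5.0,
                 timeout_s: float = 3600.0) -> str:
    """Poll a caller-supplied callable that returns the current Status
    column value, until a terminal status is seen or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = poll_status()
        if is_terminal(status):
            return status
        time.sleep(interval_s)
    raise TimeoutError("job did not reach a terminal status in time")
```

This keeps the polling loop separate from how the status is obtained, so the same helper works whether you scrape the UI, call an API, or read a log.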
Monitor job status
Procedure
On the Home page, click Open Activity on the View Activity card.
The Job Activity page opens.
Locate the Asset Name of the job, then view its status in the Status column.
The job's status is shown.
Terminate a job
Follow the steps below to terminate a job.
Procedure
On the Home page, click Open Activity on the View Activity card.
The Job Activity page opens.
Locate the Asset Name of the job you want to terminate.
In the job row, click More actions and select Terminate Instance.
Results
The job instance is terminated, and its status updates on the Job Activity page.
View job information
When you click the More actions icon and select View Details in the row of a job on the Job Activity page, the Job Info pane opens detailing the execution information.
To view the individual sequence details, click the down arrow in the row of a job on the Job Activity page.
For example, if the job sequence Data Profiling Combo was executed, three instance steps are listed in the Job Activity page in the order they execute: Format, Schema, Profile.
The job details pane provides the execution details of the sequence, as described in the following table:
Fields | Description |
Status | Lists the runtime status of the sequence. |
Command | Lists the command executed, including the optional parameters used, if any. |
Execution id | The execution identifier assigned by Data Catalog. |
Spark event log | Click View Log File or Download Log to view the event file. |
Total Size | Size of the data asset that was processed. |
Success | The number of resources within the data asset that were processed successfully. A negative value indicates INITIAL/IN PROGRESS status. This value is only updated after job execution. |
Skipped | The number of resources within the data asset that Data Catalog skipped, either because of a corrupt resource or an unsupported format. |
Incomplete | The number of resources within the data asset that could not finish discovery due to issues. |
Lineage Inserted | The number of lineages inserted during processing. |
Tag Associations Inserted | The number of tag associations inserted during processing. |
Tag Associations Removed | The number of tag associations removed during processing. |
Start | The recorded start time. |
End | The recorded end time. |
If the Skipped or Incomplete count is 1 or more, you can click it for details about the skipped or incomplete resources. These lists are paginated to improve response time when many resources are skipped or incomplete.
Read job info
Procedure
On the Home page, click Open Activity on the View Activity card.
The Job Activity page opens.
Locate the Asset Name of the job you want to examine.
In the job row, click the down arrow to expand the section.
The instance step details appear.
Click the right arrow to view an instance step.
The job details pane for the instance step opens.
Click View Log File to view the job info.
You can also click Download Log to output a copy of the log file to your local machine.
Click Close.