Running PDI-CLI on Azure

You can use PDI-CLI images on Azure to run transformations with the Pan command and jobs with the Kitchen command.

Prerequisites for installing Pentaho on Azure

Observe the following prerequisites before installing Pentaho:

  • A stable version of Docker must be installed on your workstation.
  • You must have an Azure account and subscription to complete this installation.
  • The following software versions are required:

    Application | Supported version
    Docker | v20.10.21 or a later stable version
    Azure CLI | v2.x
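
You can verify the installed versions before proceeding; this is a quick sanity check using standard Docker and Azure CLI commands:

    docker --version   # should report 20.10.21 or a later stable version
    az --version       # should report azure-cli 2.x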

Process overview for PDI-CLI on Azure

Use the following instructions to deploy PDI-CLI on the Azure cloud platform:

  1. Download and unpack PDI-CLI for Azure.
  2. Create an Azure ACR.
  3. Push the PDI-CLI Docker image to ACR.
  4. Create a storage account.
  5. Choose a deployment method:
    • Using Azure Container Instances
    • Using Azure Batch Services

Download and unpack PDI-CLI for Azure

Download and unpack the package files needed to install the PDI-CLI.

Procedure

  1. Navigate to the Support Portal and download the Azure version of the Docker image with the corresponding license file for the applications you want to install on your workstation.

  2. Unpack the image to view the directories and the readme file.

    The image package file (<package-name>.tar.gz) contains the following:
    Name | Content description
    image | Directory containing all the Pentaho source images.
    templates | Directory containing templates for various operations.
    yaml | Directory containing YAML configuration files.
    batch-json | Directory containing a sample batch JSON configuration.
    README.md | File containing detailed information about this release.
  3. In the image directory, unpack the tar.gz file that contains the PDI-CLI Docker image layers.

Create an Azure ACR

Before pushing the PDI-CLI image to Azure, you need to create an Azure Container Registry (ACR).

Procedure

  1. Create an ACR repository to load Pentaho.

    For information on how to create an Azure ACR, see Create an ACR repository.
  2. Record the name of the ACR repository that you have created in the Worksheet for Azure hyperscaler.
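
If you prefer the Azure CLI to the Azure portal, a registry can be created with the standard commands below; the resource group, registry name, and location are placeholder values you choose:

    az group create --name <resource-group> --location <location>
    az acr create --resource-group <resource-group> --name <registry-name> --sku Basic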

Push the PDI-CLI Docker image to ACR

Select and tag the PDI-CLI Docker image and then push it to the ACR registry.

Procedure

  1. Navigate to the image directory containing the PDI-CLI tar.gz files.

  2. Select and load the tar.gz file into the local registry by running the following command:

    docker load -i <pentaho-image>.tar.gz
  3. Record the name of the source image that was loaded into the registry by using the following command:

    docker images
  4. Tag the source image so it can be pushed to the cloud platform by using the following command:

    docker tag <source-image>:<tag> <target-repository>:<tag>
  5. Push the image file into the ACR registry by using the following Docker command:

    docker push <target-repository>:<tag>
    Note: For general Azure instructions on how to push an image to Azure, see Pushing a Docker image.
    The Azure Management Console displays the uploaded image URI.
  6. Record the newly created ACR repository URI in the Worksheet for Azure hyperscaler.
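
Taken together, the steps above might look like the following end to end; the pdi-cli repository name is hypothetical, and az acr login authenticates your local Docker client against the registry:

    az acr login --name <registry-name>
    docker load -i <pentaho-image>.tar.gz
    docker images
    docker tag <source-image>:<tag> <registry-name>.azurecr.io/pdi-cli:<tag>
    docker push <registry-name>.azurecr.io/pdi-cli:<tag>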

Create an Azure storage account for PDI-CLI

Create an Azure Cloud storage account to hold the directories described in the table below.

For instructions on how to create a storage account, see Create a storage account - Azure Storage | Microsoft Learn.
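
The Azure CLI equivalents are shown below for reference; create a file share if you are deploying with ACI, or a blob container if you are deploying with Batch (all names are placeholders):

    az storage account create --name <your-storageaccount-name> --resource-group <resource-group> --sku Standard_LRS
    az storage share create --name <fileshare-name> --account-name <your-storageaccount-name>
    az storage container create --name <container-name> --account-name <your-storageaccount-name>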

Directory | Actions
/root

All the files in the storage account’s fileshare are copied to the PDI-CLI .kettle directory.

If you need to copy a file to PDI CLI’s .kettle directory, drop the file in the file share’s root directory.

Example: If you need to copy your local repositories.xml to PDI CLI’s .kettle directory, place the repositories.xml file in the storage account’s root folder.

metastore

The PDI CLI can execute jobs and transformations. Some of these require additional information that is usually stored in the Pentaho metastore.

If you need to provide your Pentaho metastore to the PDI CLI, copy your local metastore directory to the root of the fileshare. The metastore directory will be copied to the proper location within the Docker image.

jdbc-drivers

If your PDI CLI needs JDBC drivers, add the jdbc-drivers directory to your storage account’s fileshare and place the drivers there.

Any files within this folder will be copied to the PDI CLI’s lib folder.

plugins

If your PDI CLI needs additional plugins installed, add the plugins directory to your file share.

Any files within this folder will be copied to the PDI CLI’s plugins folder. For this reason, the plugins should be organized in their own directories as expected by the PDI CLI.

The relevant files are explained below:

File | Actions
content-config.properties

The content-config.properties file provides the PDI-CLI Docker image with instructions on which storage account files to copy over and where to place them.

The instructions are populated as multiple lines in the following format:

${KETTLE_HOME_DIR}/<some-dir-or-file>=${APP_DIR}/<some-dir>

A template for this file can be found in the templates project directory.
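
As a purely illustrative example (the bundled template is the authoritative reference for this file's semantics), a hypothetical entry mapping a repositories.xml file into the .kettle home directory might read:

    ${KETTLE_HOME_DIR}/repositories.xml=${APP_DIR}/repositories.xml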

content-config.sh

This is a bash script that can be used to configure files, change file and directory ownership, move files around, install missing apps, and so on.

You can add it to the storage account’s file share.

It is executed in the Docker image after the other files are processed.
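
A minimal sketch of such a script is shown below; the user name, paths, and package are hypothetical, and the apt-get call assumes a Debian-based image:

    #!/bin/bash
    # Hypothetical content-config.sh: runs after the storage account files are copied in.
    # Fix ownership of the copied .kettle directory (user name and path are assumptions).
    chown -R pentaho:pentaho /home/pentaho/.kettle
    # Install a missing utility (assumes a Debian-based image).
    apt-get update && apt-get install -y curl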

Procedure

  1. Place your Pentaho license file (file type .lic) in the metastore > .pentaho folder.

  2. Copy any jobs or transformations you want to run into a folder on the storage account.

    Record the path to the job or transformation you want to run to start your batch job under PROJECT_STARTUP in the worksheet.
  3. Record the path to your .pentaho folder under METASTORE_LOCATION in the worksheet.

    Note: Create the directories in the storage account in a container if you are deploying with Batch, or in a file share if you are deploying with ACI. Both methods are described in Deployment methods for Azure hyperscaler.

Deployment methods for Azure hyperscaler

There are two methods that can be used to deploy PDI-CLI, depending on your use case: Azure Container Instances (ACI) and the Azure Batch service.

The following table lists a few differences between ACI and Azure Kubernetes Service (AKS):

Factors | ACI | AKS
Scalability | Limited scalability: With ACI, you can only run one single server instance. Multiple server instances and load balancing cannot be achieved with ACI. | Scalability and high availability: AKS provides automatic scaling and self-healing, which make it ideal for running large and complex workloads that require scalability and high availability.
Flexibility | Limited flexibility: ACI is a managed service, which means you have limited control over the underlying infrastructure. | Flexibility: AKS provides more control over the underlying infrastructure and allows for greater customization and flexibility.
Cost | Cost-effective: ACI is a pay-per-second model, which means you only pay for the time your container is running. | Cost: AKS can be more expensive than ACI, especially for small workloads that do not require scaling.
Maintenance | Minimal maintenance required: ACI is a managed service, so most maintenance tasks are handled by Microsoft. | Maintenance required: AKS requires ongoing maintenance and management, including updates and patches.
Feature set | Limited feature set: ACI lacks some of the advanced features available in AKS, such as automatic scaling, self-healing, and service discovery. | Advanced features: AKS provides service discovery, load balancing, and container orchestration, which make it a powerful tool for managing containerized applications.
Complexity | Simple setup: ACI provides a simple and fast way to run containers without the need to manage a cluster. | Complexity: AKS can be more complex to set up and manage than ACI, especially for users who are not familiar with Kubernetes.

Use Azure Container Instances (ACI)

Perform the following steps to deploy Pentaho on an Azure Container Instance (ACI):

Procedure

  1. Create a Docker ACI context by running the following command:

    docker context create aci <context-name>
  2. Choose the resource group where your ACR is located.

  3. Use the Docker ACI context that you previously created by running the following command:

    docker context use <context-name>
  4. In the docker-compose-pdi-aci.yml file, replace the following values:

    Value | Setting
    <image_uri> | Image URI from the ACR in the format <name>:<tag>
    <fileshare-name> | The file share name created in the storage account
    <your-storageaccount-name> | Your storage account name
  5. Run the docker-compose-pdi-aci.yml file by using the following command:

    docker-compose -f docker-compose-pdi-aci.yml up
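
The actual docker-compose-pdi-aci.yml ships in the package's yaml directory; a minimal sketch of its general shape is given below, using Docker's azure_file volume driver for ACI. The service name and the /data mount path are assumptions, and the placeholders match the table above:

    services:
      pdi-cli:
        image: <image_uri>
        volumes:
          - pdi-data:/data
    volumes:
      pdi-data:
        driver: azure_file
        driver_opts:
          share_name: <fileshare-name>
          storage_account_name: <your-storageaccount-name>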

Use Azure Batch Service

Use the Azure Batch service to run the workload stored in your storage account.

For information on how to upload a file to the file share of the storage account, see Quickstart for creating an Azure container in a storage account | Microsoft Learn.

Procedure

  1. Record the <container-name> of the storage account in the Worksheet.

  2. You can now run your batch job by submitting a job to the Azure Batch service.

    See Quickstart: Use the Azure portal to create a Batch account and run a job - Azure Batch | Microsoft Learn for details on how to submit batch jobs.
  3. Record the <job-id> while creating a job in the worksheet.

    The task can be run using the template JSON file in the batch-json folder, replacing the placeholder values with those recorded in the worksheet.

    Example:

    • STORAGE: /mnt/batch/tasks/workitems/<job-id>/job-1/<task-id>/wd
    • METASTORE_LOCATION: /mnt/batch/tasks/workitems/<job-id>/job-1/<task-id>/wd/metastore
    • PROJECT_STARTUP_JOB: /mnt/batch/tasks/workitems/<job-id>/job-1/<task-id>/wd/<transformation-or-job.ext>
    • (Optional) PARAMETERS: -param:file_name=pvfs://blob/file-formats -param:file_name2=pvfs://blob/file-formats
  4. Navigate to Batch Accounts > Jobs > Choose your job > Add (JSON editor).

  5. Replace the text with your JSON file content and create the task.

    Environment variables
    Variable | Function
    STORAGE | Copies the data in the storage account to the container's data folder
    LOG_LEVEL | Assigns the logging level
    PROJECT_STARTUP | Assigns the path of the KTR/KJB
    METASTORE_LOCATION | Configures the metastore path from where the metastore content and configuration is downloaded. The metastore folder should contain the contents of your local metastore folder. The data within that folder's .pentaho folder is copied to the user's .pentaho folder for that container.
    PARAMETERS | Configures any additional parameters you want to pass to the KTR/KJB, such as: -param:file_name=pvfs://blob/file-formats -param:file_name2=pvfs://blob/file-formats

    Parameters to be replaced
    Variable | Description
    <image_uri> | Image URI from the ACR with <name>:<tag>
    <container-name> | Container name created in the storage account
    <storage-account-name> | Storage account name
    <fileshare-name> | File share name created in the storage account
    <transformation-or-job.ext> | Transformation or job in the file share that you want to run, with the extension ktr or kjb respectively
    <job-id> | Name of the job created in the batch pool
    <task-id> | Unique ID for the task
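
For orientation, a stripped-down task definition in the style of the batch-json template might look like the following. The field names follow the Azure Batch task JSON schema; the empty commandLine (deferring to the image's entrypoint) and the LOG_LEVEL value are assumptions, and the placeholders are those from the worksheet:

    {
      "id": "<task-id>",
      "commandLine": "",
      "containerSettings": {
        "imageName": "<image_uri>"
      },
      "environmentSettings": [
        { "name": "STORAGE", "value": "/mnt/batch/tasks/workitems/<job-id>/job-1/<task-id>/wd" },
        { "name": "LOG_LEVEL", "value": "Basic" },
        { "name": "PROJECT_STARTUP", "value": "/mnt/batch/tasks/workitems/<job-id>/job-1/<task-id>/wd/<transformation-or-job.ext>" },
        { "name": "METASTORE_LOCATION", "value": "/mnt/batch/tasks/workitems/<job-id>/job-1/<task-id>/wd/metastore" }
      ]
    }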

Worksheet for Azure hyperscaler

Use the following worksheet to record important information needed during installation and configuration of Pentaho.

Variable | Description
<image_uri> | Image URI from the ACR with <name>:<tag>
<your-namespace-name> [OR] <your-AKS-namespace-name> | Record your AKS namespace name: <namespace-name>
<fileshare-name> | Record the file share name created in the storage account.
<your-secret-name> | Record your AKS secret name.
<your-storage-account-name-base64encoding> | Run the command echo -n "<your-storage-account-name>" | base64 and record the output of this command as the value.
<your-storage-account-key-base64encoding> | Run the command echo -n "<your-storage-account-key>" | base64 and record the output of this command as the value.
<your-storageaccount-name> | Record your storage account name.
<container-name> | Record the container name used for the storage account [Use Azure Batch service].
<job-id> | Record the job ID of the job submitted to the Azure Batch service.
<transformation-or-job.ext> | Transformation or job in the file share that you want to run in PDI-CLI, with the extension ktr or kjb respectively.
<task-id> | Unique ID for the task.
Environment Variables
Variable | Function
STORAGE | Copies the data present in the storage account to the container's data folder.
LOG_LEVEL | Assigns the logging level.
DB_HOST_NAME | Assigns the host name of the corresponding database.
DB_PORT | Assigns the port value of the corresponding database.
METASTORE_ZIP | Applies if the metastore is placed as a zip file in the container. Given that path, its contents are unzipped to your .pentaho folder for that container.
METASTORE_SRC_DIR | Provides a metastore path in the storage. The metastore folder should contain the contents of your local metastore folder. The data within that folder's .pentaho folder is copied to your .pentaho folder for that container.
CARTE_CONFIG_FILE | Name of the carte-config.xml file located in the storage account.
PROJECT_STARTUP | Assigns the name of the KTR/KJB.
METASTORE_LOCATION | Provides a metastore path in the storage. The metastore folder should contain the contents of your local metastore folder. The data within that folder's .pentaho folder is copied to your .pentaho folder for that container.
PARAMETERS (optional) | Any additional parameters you want to pass to the KTR/KJB. Example: -param:file_name=pvfs://blob/file-formats -param:file_name2=pvfs://blob/file-formats