Running PDI-CLI on AWS
You can use PDI-CLI images to run transformations with the Pan command and jobs with the Kitchen command on AWS.
Prerequisites for installing PDI-CLI on AWS
Observe the following prerequisites before installing Pentaho:
- A stable version of Docker must be installed on your workstation.
- You must have an AWS account to complete this installation.
- Amazon AWS CLI must be installed on your workstation.
- The following software versions are supported:
| Application | Supported version |
| --- | --- |
| Docker | v20.10.21 or a later stable version |
| AWS CLI | v2.x |
Process overview for running PDI-CLI on AWS
Use the following steps to deploy PDI-CLI on the AWS cloud platform:
- Download and extract Pentaho for AWS
- Create an Amazon ECR
- Load and push the Pentaho Docker image to ECR
- Create an S3 bucket
- Configure and execute PDI-CLI
Download and extract Pentaho for AWS
Download and extract the package that contains the files needed to install Pentaho.
Procedure
Navigate to the Support Portal and download the AWS version of the Docker image with the corresponding license file for the applications you want to install on your workstation.
Extract the image to view the directories and the readme file.
The image package file (<package-name>.tar.gz) contains the following:

| Directory or file name | Content description |
| --- | --- |
| image | Directory containing all the Pentaho source images. |
| yaml | Directory containing YAML configuration files and various utility files. |
| README.md | File containing a link to detailed information about what we are providing for this release. |
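On a workstation with a standard shell, the extraction step can be sketched as follows; <package-name> is a placeholder for the actual file you downloaded:

```shell
# Extract the package downloaded from the Support Portal
# (<package-name> is a placeholder, not a real file name).
tar -xzf <package-name>.tar.gz

# The extracted contents should include the image and yaml
# directories and the README.md file.
ls
```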
Create an Amazon ECR for PDI-CLI
Before pushing the Pentaho image to AWS, you need to create an Amazon ECR.
For information on how to create an Amazon ECR, see instructions for creating a private repository on AWS.
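If you prefer the AWS CLI over the console, a private repository can be created with a command along these lines; the repository name and region are assumptions, not values mandated by Pentaho:

```shell
# Create a private ECR repository (name and region are examples).
# The command output includes the repositoryUri to use when
# tagging and pushing the image.
aws ecr create-repository \
  --repository-name pentaho/pdi-cli \
  --region us-east-1
```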
Load and push the PDI-CLI Docker image to ECR
Select and tag the Pentaho Docker image and then push it to the ECR registry.
Procedure
Navigate to the image directory containing the Pentaho tar.gz files.
Select and load the tar.gz file into the local registry by running the following command:
docker load -i <pentaho-image>.tar.gz
Record the name of the source image that was loaded into the registry by using the following command:
docker images
Tag the source image so it can be pushed to the cloud platform by using the following command:
docker tag <source-image>:<tag> <target-repository>:<tag>
Push the image file into the ECR registry using the following Docker command:
docker push <target-repository>:<tag>
The AWS Management Console displays the uploaded image URI. For general AWS instructions on how to push an image to AWS, see Pushing a Docker image.
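Note that pushing to a private ECR registry normally requires authenticating Docker with ECR first. A minimal sketch of the full sequence, assuming a placeholder account ID (123456789012), region (us-east-1), and repository name (pentaho/pdi-cli):

```shell
# Authenticate Docker to the private ECR registry
# (account ID, region, and repository name are placeholders).
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin \
  123456789012.dkr.ecr.us-east-1.amazonaws.com

# Tag the loaded image with the ECR repository URI, then push it
docker tag pentaho/pdi-cli:latest \
  123456789012.dkr.ecr.us-east-1.amazonaws.com/pentaho/pdi-cli:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/pentaho/pdi-cli:latest
```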
Record the newly created ECR repository URI in the Worksheet for AWS hyperscaler.
Create an S3 bucket for PDI-CLI
You must create an S3 bucket to deploy the PDI-CLI image on AWS.
Procedure
Create an S3 bucket.
To create an S3 bucket, see Creating a bucket. To upload a file to S3, see Uploading objects. Record the newly created S3 bucket name in the Worksheet for AWS hyperscaler.
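As an alternative to the console, the bucket can be created from the AWS CLI; the bucket name and region below are examples only:

```shell
# Create the S3 bucket (name and region are examples)
aws s3 mb s3://my-pentaho-bucket --region us-east-1

# Verify the bucket now appears in your account
aws s3 ls
```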
Upload files into the S3 bucket.
After the S3 bucket is created, manually create any needed directories as shown in the table below and upload the relevant files to the appropriate directory by using the AWS Management Console. The relevant Pentaho directories are described below:

| Directory | Actions |
| --- | --- |
| root | All the files in the root of the S3 bucket are copied to the /home/pentaho/data-integration/data directory. If you need to copy a file to the /home/pentaho/data-integration/data directory, drop the file in the root directory of the S3 bucket. This directory also contains the Pentaho license files. |
| jdbc-drivers | If your Pentaho installation needs JDBC drivers, add the jdbc-drivers directory to your S3 bucket and place the drivers in it. Any files within this directory are copied to Pentaho's lib directory. |
| plugins | If your Pentaho installation needs additional plugins, add the plugins directory to your S3 bucket and copy your plugins into it. Any files within this directory are copied to Pentaho's plugins directory. For this reason, the plugins should be organized in their own directories as expected by Pentaho. |
| metastore | Pentaho can execute jobs and transformations, some of which require additional information that is usually stored in the Pentaho metastore. If you need to provide your metastore, copy your local .pentaho directory to the metastore directory of the S3 bucket (you can use a different name by passing a variable). From there, the contents of the .pentaho directory are copied to the /home/pentaho/.pentaho folder within the Docker image. |
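The layout described above can also be populated from the AWS CLI; the bucket name and file names in this sketch are assumptions, and the exact metastore layout should be checked against your environment:

```shell
# Files in the bucket root are copied to /home/pentaho/data-integration/data
# (file and bucket names below are examples)
aws s3 cp my-license-file.lic s3://my-pentaho-bucket/

# JDBC drivers placed here end up in Pentaho's lib directory
aws s3 cp my-jdbc-driver.jar s3://my-pentaho-bucket/jdbc-drivers/

# Plugins placed here end up in Pentaho's plugins directory,
# each plugin in its own subdirectory
aws s3 cp ./my-plugin s3://my-pentaho-bucket/plugins/my-plugin --recursive

# The local .pentaho directory provides the metastore content
aws s3 cp ~/.pentaho s3://my-pentaho-bucket/metastore/ --recursive
```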
The relevant Pentaho files are described below:

| File | Actions |
| --- | --- |
| content-config.properties | Used by the Pentaho Docker image to specify which S3 files to copy and where to place them. The instructions are written one per line in the following format: ${KETTLE_HOME_DIR}/<some-dir-or-file>=${APP_DIR}/<some-dir>. A template for this file can be found in the templates project directory. The template has an entry where the file context.xml is copied to the required location within the Docker image: ${KETTLE_HOME_DIR}/context.xml=${APP_DIR}/context.xml |
| content-config.sh | A bash script that can be used to configure files, change file and directory ownership, move files around, install missing applications, and so on. You can add it to the S3 bucket. It is executed in the Docker image after the other files are processed. |
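As an illustration only, a content-config.sh could look like the following; the paths and file names are assumptions, not part of the Pentaho contract beyond the script being executed after the other files are processed:

```shell
#!/bin/bash
# Hypothetical content-config.sh sketch. It runs inside the Docker
# image after the S3 files have been copied; all paths and file
# names below are assumptions for illustration.
set -euo pipefail

APP_DIR=/home/pentaho/data-integration

# Fix ownership of the files copied from the S3 bucket
chown -R pentaho:pentaho "${APP_DIR}/data"

# Example: relocate a driver that was uploaded to the bucket root
if [ -f "${APP_DIR}/data/my-jdbc-driver.jar" ]; then
  mv "${APP_DIR}/data/my-jdbc-driver.jar" "${APP_DIR}/lib/"
fi
```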
Configure and execute PDI-CLI
Configure and run AWS Batch using the PDI-CLI image.
Refer to the AWS instructions for the following steps at Getting Started with AWS Batch.
Procedure
Navigate to the AWS Batch home page.
Create a compute environment by selecting Compute environments and following the instructions.
Create a job queue by selecting Job queues and following the instructions.
Create a job definition by selecting Job definitions and following the instructions.
Provide the image name in the section for configuring the container.
Create a job by selecting Jobs and following the instructions.
In the Environment Variables section, configure the following variables:

| Variable | Description |
| --- | --- |
| PROJECT_S3_LOCATION | Sets the S3 path from which the project data is downloaded into the container. Example: s3://pentaho-samples/ |
| METASTORE_LOCATION | Sets the path from which the metastore content and configuration are downloaded into the /home/pentaho/.pentaho path of the container. Example: metastore |
| PROJECT_STARTUP_JOB | Path of the KJB file to execute. Example: jobs/run_job_write_to_s3/read_csv_from_s3_job.kjb |
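If you submit the job from the AWS CLI rather than the console, the same three variables can be passed as container overrides; the job name, queue, and job definition names below are placeholders:

```shell
# Submit a Batch job with the PDI-CLI environment variables
# (job name, queue, and job definition are placeholders)
aws batch submit-job \
  --job-name pdi-sample-run \
  --job-queue pdi-queue \
  --job-definition pdi-cli-jobdef \
  --container-overrides '{
    "environment": [
      {"name": "PROJECT_S3_LOCATION", "value": "s3://pentaho-samples/"},
      {"name": "METASTORE_LOCATION", "value": "metastore"},
      {"name": "PROJECT_STARTUP_JOB", "value": "jobs/run_job_write_to_s3/read_csv_from_s3_job.kjb"}
    ]
  }'
```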
Worksheet for AWS hyperscaler
Use the following worksheet for important information needed during installation and configuration of Pentaho.
| Variable | Record your setting |
| --- | --- |
| ECR_IMAGE_URI (only Platform/PDI Server and Carte Server) | |
| RDS_HOSTNAME (only Platform/PDI Server and Carte Server) | |
| RDS_PORT (only Platform/PDI Server and Carte Server) | |
| S3_BUCKET_NAME | |
| EKS_CLUSTER_NAME (only Platform/PDI Server and Carte Server) | |