
Installation on Kubernetes

Requirements

Minimum Hardware Requirements
  • 16 GB RAM
  • 8 Cores
  • 100 GB storage
  • Though there is no hard requirement for the operating system, Linux is typically used.
Kubernetes Cluster
  • Kubernetes version 1.21.x+
  • Suggested Cluster configuration of 1 Master and 2 Worker nodes.
Software for cluster
  • Helm version 3.6.x+
  • Kubectl 1.21.x+
Software for ldc-load-images.sh
  • Docker (preferably the latest stable version)
  • Command-line utilities tar and jq.
Miscellaneous
  • Knowledge of your organization’s networking environment.
  • Root permissions for your designated server.
  • Ability to connect to your organization’s data sources.
  • Access to an existing private container registry owned by your organization. Alternatively, you can create a registry on your Kubernetes host using Docker (see Appendix A for more information).
  • Internet access (to pull public Docker images).
  • An organization owned object store (e.g. AWS S3 Bucket).
  • A web browser to access Data Catalog; supported web browsers include the latest stable versions of:
    • Google Chrome
    • Microsoft Edge
    • Mozilla Firefox
    • Apple Safari

You will also need the helm chart (TGZ file), the Docker images (TAR GZ file) and the ldc-load-images.sh script on your Kubernetes host server.

Loading Docker images

By default, the Hitachi Vantara-owned Docker images are not publicly available. These images are included in the artifacts as a TAR GZ file and must first be loaded into a private container registry using the provided ldc-load-images.sh script. You may need to log in to your private container registry before running the script.

On the terminal of your Kubernetes cluster, run the ldc-load-images.sh script, specifying your private registry and the path to where the images are stored:

./ldc-load-images.sh -r <private registry> --images <path to ldc-images TAR GZ file>
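
For example, if your private registry is an Azure Container Registry and the image bundle sits in the current directory, the invocation might look like the following (the registry name and file name here are illustrative only and will differ in your environment):

./ldc-load-images.sh -r myregistry.azurecr.io --images ./ldc-images.tar.gz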

If executed successfully, the following Docker images will be available on your private registry (note that the tags for each Docker image will vary per Lumada Data Catalog release):

  • lumada-catalog/app-server:<app-server tag>
  • lumada-catalog/agent:<agent tag>
  • lumada-catalog/mongodb-migration-tool:<mongodb-migration-tool tag>
  • lumada-catalog/spark:<spark tag>
  • lumada-catalog/mongodb:<mongodb tag>
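
If you want to confirm the images were pushed and your registry exposes the standard Docker Registry HTTP API v2, you can list its repositories as shown below (managed registries such as Azure Container Registry provide their own CLI for this; the credentials shown are placeholders):

curl -u <registry user>:<registry password> https://<private registry>/v2/_catalog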

Minimal Helm chart values

You will need to customize certain values that you provide to the Helm chart. This can be done by creating a custom-values.yml file, which can be used to override default helm chart configuration during install.

Create this file in the same location where your Data Catalog artifacts are stored. At a minimum, some Data Catalog services must be exposed so that you can access them in your browser. Two examples of exposing these services are shown below:

Minimal configuration with services exposed using NodePort - Recommended for local debug purposes only

This example configuration will expose:

  • Keycloak
    • HTTP Port 30880
  • App-server (Data Catalog UI)
    • HTTPS port 31083
    • HTTP port 31080 (this is a default setting in the helm chart)
keycloak:
  service:
    type: NodePort
    nodePort: 30880
app-server:
  service:
    type: NodePort
    httpsNodePort: 31083
  keycloak:
    authServerUrl: "http://<k8s node hostname>:30880/auth"
global:
  registry: <private registry>

Where <k8s node hostname> is the hostname of the server running the Kubernetes cluster, and <private registry> is the registry containing the loaded Data Catalog Docker images. This is an arbitrary value based on how your organization’s registry was set up (e.g. an Azure container registry would follow the format myregistry.azurecr.io).
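
If you are unsure which value to use for <k8s node hostname>, you can list the cluster nodes and their addresses with kubectl (whether the internal or external address is appropriate depends on how your cluster and network are set up):

kubectl get nodes -o wide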

Minimal configuration with services exposed using an Ingress controller - Recommended for production

For ingress, it is assumed that the host of your Kubernetes cluster already has the relevant DNS configuration (domain name that you own and can create DNS records for) for your cluster.

In this example, the configuration will expose services on the following subdomains:

  • Keycloak via keycloak-dev1.hv.com
  • App-server (Data Catalog UI) via app-server-dev1.hv.com
keycloak:
  ingress:   
    enabled: true
    hosts:
    - host: keycloak-dev1.hv.com  
      paths:
      - path: /
        pathType: Prefix
    tls:
    - hosts:
      - "keycloak-dev1.hv.com"
      secretName: keycloak-ingress-certs
app-server:
  ingress:
    enabled: true
    hosts:
    - host: app-server-dev1.hv.com
      paths:
      - path: /
        pathType: Prefix
    tls:
    - hosts:
      - "app-server-dev1.hv.com"
      secretName: app-server-ingress-certs
  keycloak:
    authServerUrl: "https://keycloak-dev1.hv.com/auth"
global:
  registry: <private registry>

Where <private registry> is the registry containing the loaded Data Catalog Docker images. This is an arbitrary value based on how your organization’s registry was set up (e.g. an Azure container registry would follow the format myregistry.azurecr.io).
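
The tls sections above reference Kubernetes secrets (keycloak-ingress-certs and app-server-ingress-certs) that must already exist in the namespace you deploy to. Assuming you already have a certificate and private key file for each host, one way to create them is with kubectl (the file names below are illustrative):

kubectl create secret tls keycloak-ingress-certs --cert=keycloak-dev1.crt --key=keycloak-dev1.key -n <namespace>
kubectl create secret tls app-server-ingress-certs --cert=app-server-dev1.crt --key=app-server-dev1.key -n <namespace>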

Customizing Helm chart values for production

You can customize Helm chart values for your production environment. Use the following guidelines.

Large Properties

Large Properties configuration determines where metadata from your job runs is stored. By default, the large properties location is set to the local MinIO component included in the Helm chart, which is suitable for debug purposes. For production, however, it is recommended that this location be an object store owned by your organization (e.g. AWS S3).

You can override the default large properties settings by adding the following agent configuration overrides (under app-server) in your custom-values.yml file. This covers the minimum settings needed to point your large properties location to your organization’s object store:

app-server:
  <other app-server custom values>
  configurationOverrides:
    - propertyKey: ldc.metadata.hdfs.large_properties.uri
      value: <large properties URI>
      component: __template_agent
    - propertyKey: ldc.metadata.hdfs.large_properties.attributes
      value:
        - fs.s3a.access.key=<access key>
        - fs.s3a.secret.key=<secret key>
        - fs.s3a.endpoint=<object store endpoint>
        - fs.s3a.path.style.access=true
        - fs.s3a.threads.max=20
        - fs.s3a.connection.maximum=200
      component: __template_agent
    - propertyKey: ldc.metadata.hdfs.large_properties.path
      value: <large properties path>
      component: __template_agent

Where:

  • <access key>, <secret key> and <object store endpoint> refer to the connection details to your organization’s object store. This will be used by Data Catalog to connect and write to your object storage.
  • <large properties URI> is the URI of your object store. For an S3 bucket the URI can be set to one of two formats, depending on which environment the agent is running on:
    • For most cases - s3a://<Bucket Name>
    • For a remote agent running on EMR - s3://<Bucket Name>
  • <large properties path> refers to the location in the object store where metadata will be stored.
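
As an illustration only, for a hypothetical AWS S3 bucket named ldc-metadata in the us-east-1 region, the three overrides might resolve to values like the following (the bucket name, endpoint region, and path are examples, not defaults):

app-server:
  configurationOverrides:
    - propertyKey: ldc.metadata.hdfs.large_properties.uri
      value: s3a://ldc-metadata
      component: __template_agent
    - propertyKey: ldc.metadata.hdfs.large_properties.attributes
      value:
        - fs.s3a.access.key=<access key>
        - fs.s3a.secret.key=<secret key>
        - fs.s3a.endpoint=s3.us-east-1.amazonaws.com
        - fs.s3a.path.style.access=true
        - fs.s3a.threads.max=20
        - fs.s3a.connection.maximum=200
      component: __template_agent
    - propertyKey: ldc.metadata.hdfs.large_properties.path
      value: /large_properties
      component: __template_agent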

If you need to change a single agent’s large properties configuration, you can do this in the UI by going to Management, clicking Configurations, and changing the following MISC values for your agent:

  • Attributes for discovery cache metadata store
  • URI for discovery cache metadata store
  • Relative location for a large properties metadata store

Spark History Server

The Spark history location is where Spark execution logs are written. By default, the Spark History Server location is on the local MinIO component included in the Helm chart. You can set a new Spark History Server location on an S3 bucket by overriding the spark-history-server configuration in your custom-values.yml file:

spark-history-server:
  <other spark history server configuration>
  historyServer:
    events:
      s3:
        createBucket: true
        # either use existingSecretName OR access keys
        existingSecretName: "<existing secret name>"
        accessKey: "<access key>"
        secretKey: "<secret key>"
        # aws s3 session token (optional)
        sessionToken: ""
        endpoint: "<endpoint>"
        bucket: <bucket name>
        eventLogPath: "<event log path>"
        # s3 event history file system log directory
        historyPath: "<history path>"

Where:

  • <existing secret name> refers to a Kubernetes secret that holds the credentials to your S3 bucket (this would have been set up by your Kubernetes admin separately). Note that if you are using an existing secret name, accessKey and secretKey do not need to be set here.
  • <access key> and <secret key> refer to the connection details to your organization’s S3 bucket. Note that if you are using keys to access your S3 bucket, you do not need to define existingSecretName.
  • <endpoint> is the endpoint of the S3 or MinIO bucket. By default, this points to local MinIO accessible to the agent via http://<release name>-minio-bundled:9000.
  • <bucket name> is the name of the S3 or MinIO bucket where the Spark history events are stored.
  • <event log path> and <history path> are locations on the bucket that logs are stored in. By default, these are both set to /events/.

MongoDB

Though Data Catalog comes with its own MongoDB, you can point it to an external MongoDB instance by providing the URI in your custom-values.yml:

app-server:
  mongodbURI: <MongoDB URI>

Where the MongoDB URI will be in the form mongodb://<username>:<password>@<mongodb host>.
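
For example, a hypothetical external instance might be referenced like this (the credentials, host, and port are placeholders for illustration only):

app-server:
  mongodbURI: mongodb://catalog_user:catalog_password@mongodb.example.com:27017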

Keycloak

Though Data Catalog comes with its own Keycloak component for authentication, you can point Data Catalog to your organization’s own Keycloak instance by overriding the Keycloak configuration in your custom-values.yml:

keycloak:
  authServerUrl: <auth server url>
  callbackUrl: <callback Url>
  realm: <realm>
  clientID: <client ID>  
  clientSecret: <client secret>
  authUser: <username>
  authPass: <password>
  # -- Mapping of OAuth profile fields (on the right) to those in LDC Application (on the left)
  userFields:
    id: sub
    email: email
    username: preferred_username
    firstName: given_name
    lastName: family_name

Where:

  • <auth server url> is the (accessible) base URL of your Keycloak realm's authorization endpoint.
  • <callback Url> is the URL that Keycloak will redirect the user to after authentication is granted.
  • <realm> is the name of your Keycloak realm.
  • <client ID> is a value that will match your Application Name, resource, or OAuth Client Name.
  • <client secret> is the value of your OAuth client secret.
  • <username> and <password> are the credentials of the user used for role syncing.

Seed JDBC JAR location

JDBC jars are required by Data Catalog agents to access and process data sources. For the local agent, custom-values.yml can be configured to pull additional seed JDBC jars in two ways:

  • s3 – During initialization, the agent pulls JDBC jars from a specified location (e.g. your organization’s S3 bucket)
  • http – During initialization, the agent pulls the jars it requires from a provided list of download URLs.

This is useful if the local agent unexpectedly terminates, as a replacement agent will retrieve the required jars during initialization.

S3 configuration

By default, the local agent pulls JDBC jars from a bucket in the local MinIO component included in the Helm chart. In production, it is recommended that this location be set to your organization’s own object store, such as an AWS S3 or MinIO bucket. You can override your local agent’s jar file location by adding the following to your custom-values.yml file.

agent:
  seedJDBC:
    sources:
      - s3
    # default values used in the default secret
    s3:
      existingSecretName: "<secret name>"
      accessKey: <access key>
      secretKey: <secret key>
      bucket: <bucket name>
      path: <path>
      endpoint: "<endpoint>"

Where:

  • <access key> and <secret key> refer to the connection details to your organization’s S3 or MinIO bucket.
  • <endpoint> is the endpoint of the S3 or MinIO bucket. By default, this points to local MinIO accessible to the agent via http://<release name>-minio-bundled:9000.
  • <bucket name> is the name of the S3 or MinIO bucket.
  • <path> refers to the location in the S3 or MinIO bucket where the jars will be placed. By default, this is set to ext/jdbc.
  • <secret name> refers to the Kubernetes secret that holds the credentials for the object store. By default, this is set to <release name>-minio-bundled.

HTTP configuration

For HTTP configuration, you can provide a list of download links to different jar files in your custom-values.yml:

agent:
  seedJDBC:
    sources:
      - http
    http:
      # array of links
      list:
      - <download link to a JDBC jar>
      - <download link to another JDBC jar>
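
As an illustration, each list entry is a plain download URL. For example, a PostgreSQL JDBC driver hosted on Maven Central would follow the pattern below (the choice of driver is only an example; substitute the version you need):

https://repo1.maven.org/maven2/org/postgresql/postgresql/<version>/postgresql-<version>.jar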

Agent Spark JAR staging

Not to be confused with the Spark History Server, this configuration is defined on the agent’s side and provides a path where the Spark driver can upload additional jars for executors.

By default, this location is on the local MinIO component included in the helm chart, which is suitable for debug purposes. For production, however, it is recommended that this location be set to an object store owned by your organization (e.g. AWS S3). This can be done by overriding the agent configuration in your custom-values.yml file:

agent:
  spark:
    jarUpload:
      endpoint: "<endpoint>"
      existingSecretName: "<secret name>"
      accessKey: <access key>
      secretKey: <secret key>
      secretToken:
      bucket: <bucket name>
      path: <path>

Where:

  • <secret name> refers to the Kubernetes secret that holds the credentials for the object store. By default, this is set to <release name>-minio-bundled.
  • <access key> and <secret key> refer to the connection details to your organization’s S3 or MinIO bucket.
  • <endpoint> is the endpoint of the S3 or MinIO bucket. By default, this points to local MinIO accessible to the agent via http://<release name>-minio-bundled:9000.
  • <bucket name> is the name of the S3 or MinIO bucket.
  • <path> refers to the location in the object store where the jars will be placed. By default, this is set to /cluster_jars.

Once you have made your edits, save the file, and run a helm upgrade using the updated custom-values.yml file:

helm upgrade --wait <release name> ldc-7.0.1.tgz -f custom-values.yml -n <namespace>

Where <release name> and <namespace> refer to values that were set during the initial helm installation process.

Deploy Helm

In this example, a helm install is done where the helm release name is ldc7, namespace is ldc, the custom values file is called custom-values.yml and the helm chart used is called ldc-7.0.1.tgz.

Procedure

  1. Create a namespace where your Data Catalog components will reside.

    kubectl create namespace ldc
  2. Use the following command to deploy the chart, specifying the release name, paths to the helm chart and custom values, and the namespace created in the previous step:

    helm install --wait ldc7 ldc-7.0.1.tgz -f custom-values.yml -n ldc

    Typically, this helm install takes a few minutes to fully complete; a quick way to check pod status while you wait is shown after this procedure.

  3. On your web browser, confirm that you can access the Data Catalog login page. The URL will vary based on what was set in custom-values.yml:

    • If the Node Port configuration has been used, this will be at https://<hostname>:<app-server.service.httpsNodePort>
    • If the Ingress controller configuration has been used, this will depend on what was set under app-server.ingress.hosts.
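
To verify that all Data Catalog pods have started (for example, while waiting for the install to complete, or if the login page is not reachable), list the pods in the namespace created in step 1:

kubectl get pods -n ldc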

Uninstall

To uninstall the helm chart, run the helm uninstall command, passing the release name and namespace that was specified during install.

helm uninstall ldc7 -n ldc

Troubleshooting

Use the following guidelines for troubleshooting your installation or uninstall issues.

Install

If a server is reused to install Data Catalog, a resource conflict message may appear if the previous release was not cleanly uninstalled (or the uninstall failed):

Error: rendered manifests contain a resource that already exists.
Unable to continue with install: existing resource conflict:

If this error appears, manually delete any pre-existing resources:

kubectl delete all -l "app.kubernetes.io/instance=ldc" -l "release=ldc7"

Uninstall

In most cases, the provided helm uninstall command should suffice. However, if the uninstall does not complete successfully, you can run the command without the uninstall hooks:

helm uninstall <release name> --no-hooks -n <namespace>

If there are any orphaned or incomplete jobs, these can be found and deleted manually:

kubectl get jobs -o name | grep <release name>
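
Any matching jobs returned by that command can then be removed. One way to do this in a single step, assuming the jobs live in the release namespace, is to pipe the list into kubectl delete:

kubectl get jobs -n <namespace> -o name | grep <release name> | xargs kubectl delete -n <namespace>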