
Hitachi Vantara Lumada and Pentaho Documentation

System requirements

The Lumada Data Catalog software builds a metadata catalog from data assets residing in tools and databases. It profiles the data assets to produce field-level data quality statistics and to identify representative data so users can efficiently analyze the content and quality of the data.

Data Catalog requires specific external components and applications to operate optimally. This article lists those components and applications, along with details of their use and the versions that Lumada Data Catalog supports. If you have questions about your particular computing environment, contact Hitachi Vantara Lumada and Pentaho Support.

Kubernetes

You need a Kubernetes cluster to set up the Data Catalog components. The following table lists the requirements for the Kubernetes installation.

| Category | Description |
| --- | --- |
| Minimum hardware requirements | 16 GB RAM; 8 cores; 100 GB storage. Though there is no hard requirement for operating systems, Linux or a similar operating system is typically used. |
| Kubernetes cluster | Kubernetes version 1.23 |
| Software for cluster | Helm version 3.8; kubectl 1.21 |

For installation instructions, see Installation on Kubernetes.
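Before installing, you can confirm that the tooling on your workstation meets the minimums above. The following is a minimal sketch; the `version_ge` helper and its use of `sort -V` are illustrative, not part of the product:

```shell
# Check that installed tool versions meet the minimums listed above
# (Kubernetes 1.23, Helm 3.8, kubectl 1.21).
version_ge() {
  # Succeeds when $1 >= $2, comparing dotted version strings.
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# The live versions in your environment come from, for example:
#   kubectl version --client
#   helm version --short
version_ge "3.8.2" "3.8" && echo "Helm 3.8.2 meets the 3.8 minimum"
version_ge "1.20.0" "1.21" || echo "kubectl 1.20.0 is below the 1.21 minimum"
```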

Web browsers

Data Catalog supports the publicly available major versions of the following web browsers.

| Browser | Version |
| --- | --- |
| Google Chrome* (Recommended) | 105.0.5195.102 (Official Build) (x86_64) |
| Microsoft Edge* | 42.17134.1098.0 |
| Mozilla Firefox* | 104.0.2 (64-bit) |
| iOS Safari | 15.6.1 (17613.3.9.1.16) |

* Backward version compatibility depends on the changes made in the browser libraries. Contact Hitachi Vantara Lumada and Pentaho Support with questions about specific version compatibility.

Authentication

Keycloak 18.0.2 is the default identity and access management solution for Lumada Data Catalog; it lets you create a user database with custom roles and groups. Keycloak is installed during the Helm deployment of Lumada Data Catalog.
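Once deployed, Keycloak issues tokens through the standard OpenID Connect endpoints. The sketch below uses placeholder values throughout (host, realm, client ID, and credentials are assumptions); substitute the values from your deployment:

```shell
# All of these values are placeholders; use the ones from your deployment.
KEYCLOAK_URL="https://catalog.example.com:30843"   # Keycloak NodePort
REALM="myrealm"
CLIENT_ID="my-client"

# Keycloak 18 typically serves realms at /realms/<name> (the older /auth
# prefix is disabled by default). Request a token with the password grant:
curl -sk -X POST \
  "$KEYCLOAK_URL/realms/$REALM/protocol/openid-connect/token" \
  -d "grant_type=password" \
  -d "client_id=$CLIENT_ID" \
  -d "username=myuser" \
  -d "password=mypassword" \
  || echo "token request failed (placeholder values in this sketch)"
```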

Port and firewall requirements

If Data Catalog users, including service users, need to access Data Catalog across a firewall, allow access to the following ports at the cluster IP address.

Note: The ports in the following table are the defaults that Data Catalog uses when installed. You can change these ports during installation. If you need to change them after installation, contact Hitachi Vantara Lumada and Pentaho Support.
| Component | Port | Description |
| --- | --- | --- |
| Secure Data Catalog browser application (HTTPS) | 31083 | Grants users access to the Data Catalog browser application |
| Authentication and access management | 30843 | Manages user authentication and access through Keycloak |
| Metadata repository (MongoDB) | 30017 | Stores metadata collected from processing functions in a MongoDB repository |
| Object storage (MinIO) | 30900 | Serves as object storage for debugging purposes |
| Metadata repository REST API endpoint (HTTP) | 31088 | Used internally by the Data Catalog Application Server and agent components to communicate with the repository |
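From the far side of the firewall, you can verify that the listed ports are reachable. This is a sketch using the default ports from the table above and a placeholder cluster address:

```shell
# Placeholder address; replace with your cluster IP or hostname.
CATALOG_HOST="catalog.example.com"
# Default NodePorts from the table above.
LDC_PORTS="31083 30843 30017 30900 31088"

for port in $LDC_PORTS; do
  # -z: scan without sending data; -w 5: five-second timeout.
  if nc -z -w 5 "$CATALOG_HOST" "$port" 2>/dev/null; then
    echo "port $port reachable"
  else
    echo "port $port blocked or host unreachable"
  fi
done
```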

Data sources

Lumada Data Catalog can process data from different data sources. The following table lists the different types of supported data sources.

| Category | Data Source | Version |
| --- | --- | --- |
| RDBMS | PostgreSQL | v14 |
| RDBMS | Oracle | v19 |
| RDBMS | MSSQL Server | v2019 |
| RDBMS | MySQL | v8 |
| RDBMS | Snowflake | 5.30.2 |
| RDBMS | IBM DB2 | 11.5.7.0 |
| Document DB | MongoDB | 5.0.9 |
| Object store | MinIO | 8.0.10 |
| Object store | Amazon S3 | |
| Object store | Hitachi Content Platform (HCP) | 9.4.0 |
| Object store | Azure Data Lake Storage (ADLS) | |
| Columnar DB | Hadoop Distributed File System (HDFS) | 3.2.1 (for EMR); 3.1.1 (for CDP and HDP) |
| Columnar DB | Apache Hive | 3.1.3 (for EMR and CDP); 3.1.0 (for HDP) |
| Columnar DB | Vertica | 11.1.1 |
| Virtual DB | Denodo | 8.0 |
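Before registering a data source, it can help to confirm that the source is reachable from the host that will process it. A sketch for a hypothetical PostgreSQL v14 source using the standard `pg_isready` client tool (the host, port, and database names below are placeholders):

```shell
# Placeholder connection details for an example PostgreSQL v14 source.
PG_HOST="pg.example.com"
PG_PORT="5432"          # default PostgreSQL port
PG_DB="sales"

# pg_isready exits 0 when the server is accepting connections.
if pg_isready -h "$PG_HOST" -p "$PG_PORT" -d "$PG_DB" >/dev/null 2>&1; then
  echo "PostgreSQL source reachable"
else
  echo "PostgreSQL source not reachable from this host"
fi
```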

Remote agent

Agents are responsible for initiating, executing, and monitoring jobs that communicate with the data sources, process the data, and create fingerprints. Refer to the following sections for the requirements for remote agent installation, supported distributions, and Kerberos environments.

General

| Category | Description |
| --- | --- |
| Hardware | 8 cores; 64 GB RAM; 100 GB storage |
| Miscellaneous | Data Catalog has already been set up on a Kubernetes cluster; the server hosting your remote agent can connect to the Data Catalog server |
Distribution

The remote agent supports the following distributions, which vary in their requirements. Choose the distribution most suitable for your Data Catalog setup.

| Distribution | Version |
| --- | --- |
| Amazon Elastic MapReduce (EMR) | Version 6.2.1. Note: When prompted by the remote agent script, set up the remote agent using the Lumada Data Catalog service user for Hadoop. |
| Cloudera Data Platform (CDP) | Version 7.1.3 |
Kerberos environments

Additionally, you can enable Kerberos on your remote agent server. Because Kerberos-enabled environments add extra security between your remote agent and the Data Catalog cluster, additional configuration is required. Before installing, verify the following:

  • Your Hadoop administrator has created a service user on your environment.
  • A Kerberos keytab file has already been set up for your service user on the Kerberos machine.
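You can confirm that the keytab works before installing the agent. The service principal and keytab path below are placeholders; substitute the values your Hadoop administrator provided:

```shell
# Placeholder service principal and keytab path; use your own values.
PRINCIPAL="ldc-service@EXAMPLE.COM"
KEYTAB="/etc/security/keytabs/ldc-service.keytab"

# Obtain a ticket non-interactively from the keytab, then list it.
if kinit -kt "$KEYTAB" "$PRINCIPAL" 2>/dev/null; then
  klist
else
  echo "kinit failed: check that the keytab and principal match"
fi
```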

See Remote agent installation for more information.