
Pre-installation steps

Before starting the Lumada Data Catalog installation process, you must first set up any dependent external components and collect other essential information about your cluster configuration.

Basic pre-install preparation includes the following tasks:

  • Review system requirements

    Review the external components and applications that Data Catalog supports. See System requirements.

  • Verify components

    Verify the proper functioning of the various components that Data Catalog interacts with. See Component validations.

You must also make the following changes outside of the Lumada Data Catalog environment or on other cluster nodes:

  • Validate component functioning
  • Configure components for Data Catalog

If you do not have control over or access to these components, you may need to plan in advance to perform these changes or contact your system administrator.

System requirements

Lumada Data Catalog requires specific external components and applications to operate optimally. This section provides a list of those components and applications along with details of their use and versions we support. If you have questions about your particular computing environment, please contact support at the Hitachi Vantara Lumada and Pentaho Support Portal.

Capacity planning

The Data Catalog stores data in the following locations:

  • Discovery Cache storage

    The Discovery Cache uses HDFS or S3 storage to store the metadata of profiled resources.

  • Repository storage

    The MongoDB database stores the transactional metadata.

Discovery Cache storage

The Discovery Cache stores the metadata gathered by Data Catalog. Data Catalog gathers metadata about every resource it profiles in the data lake. This metadata forms the fingerprints of the data lake and is stored in the Discovery Cache in the blob storage location.

The Discovery Cache also holds the data used by the business term propagation and lineage discovery algorithms. Typically, this storage is in HDFS, but in non-Hadoop deployments, an S3 bucket or other cloud storage can be used.

The Discovery Cache URI must be owned by the Data Catalog service user with full read, write, and execute privileges. Other non-service users must not be granted write permissions to this location. As a best practice, the Discovery Cache should be periodically backed up.

The size requirements are 4 KB per column.
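
As an illustration of these ownership, permission, and backup practices, the following sketch prepares an HDFS-based Discovery Cache location. The path, service user, group, and backup location are placeholders for the values used in your environment.

    $ hdfs dfs -mkdir -p /user/ldc/discovery-cache
    $ hdfs dfs -chown -R ldc:ldc /user/ldc/discovery-cache
    # Remove write access for non-service users
    $ hdfs dfs -chmod -R 700 /user/ldc/discovery-cache
    # Periodic backup (best practice): copy the cache to a dated backup location
    $ hdfs dfs -cp -p /user/ldc/discovery-cache /backups/discovery-cache-$(date +%Y%m%d)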

Repository storage

Data Catalog uses a MongoDB database for storing audit logs and transactional data, including the sample values and sample data shown in resource views. The Data Catalog-specific data is stored in a separate collection. Since this database contains sample values and sample data, the collection should be accessible only to the Data Catalog service users.

The repository storage requirements are 2 GB of MongoDB storage per 100K resources or 1M resource fields.
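
For example, you can check the current size of the repository database from the MongoDB shell and compare it against this guideline. The connection string and database name below are placeholders for your own deployment; use the legacy mongo shell in place of mongosh on older MongoDB versions.

    # Report database sizes scaled to GB
    $ mongosh "mongodb://mongodb-host:27017/ldc-repository" --eval 'printjson(db.stats(1024*1024*1024))'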

Cluster compute resources

Data Catalog performs most of its computation as Spark jobs on the cluster. It profiles new and updated files, so after profiling existing files in a data lake, it needs resources only to manage the ingress of data.

Hitachi Vantara suggests a memory setting of 2 GB for the Apache YARN container.

The suggested Spark memory requirements are:

  • Driver memory: 2 GB.
  • Executor memory: Proportional to the size of the records. A size estimate is roughly 12 times the size of a single row of data being processed. For example, if your data contains rows that are 100 MB, the executor memory should be at least 1200 MB.
    Note: A row is defined as a row of data, not a single line in the file. For a JSON file that does not include line breaks, a row is a single row of data, not the entire file.
  • When you run deep profiling jobs on data sets in the order of 10,000 resources, you should increase executor memory to as high as 12 GB.
  • When you run business term discovery jobs on data sets of 5,000 resources, increase executor memory to 12 GB.
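
How these values are applied depends on how your Data Catalog jobs are launched. As a generic illustration only, the equivalent options on a spark-submit command line would look like the following; the application JAR and its arguments are placeholders, not the actual Data Catalog job script.

    # Generic Spark job with the suggested driver and executor memory settings
    $ spark-submit --master yarn --driver-memory 2g --executor-memory 12g <application>.jar <arguments>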

You can manage cluster resources that Data Catalog consumes for any given operation using YARN.

For information about tuning Spark jobs, see the Apache Spark tuning guide at https://spark.apache.org/docs/2.3.2/tuning#tuning-spark.

Example estimate

Data Catalog stores data in the following locations:

  • Blob storage

    4 KB per column

  • MongoDB database

    The following example provides an estimate of the storage you would need to set up for a 250 TB data lake. It assumes an environment with the following conditions:

    • Both Hive tables and HDFS data are included in the Data Catalog.
    • All HDFS data appears in a Hive table.
    • Files and tables have an average of 30 columns, which correspond to fields in Data Catalog.
    • Each table has a large set of backing files, such as daily files for multiple years.

The example data lake contains 2500 tables of 100 GB each. Each table has 700 files, and each table/file has 30 fields/columns.

  • Total tables and files: 2500 tables with 700 files each = 1.75M resources.
  • Tables: 2500 tables x 30 columns = 75,000 columns.
  • Files: 2500 tables x 700 files per table x 30 columns = 52,500,000 columns.
  • Total of about 53 million columns.
  • HDFS storage: 53 M x 4 KB = 212 GB.
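
The arithmetic above can be reproduced with a short shell calculation; small differences come from rounding. The values are the illustrative figures from this example, not measurements from a real cluster, and the repository line simply applies the 2 GB per 100K resources guideline from Repository storage.

    $ TABLES=2500; FILES_PER_TABLE=700; COLUMNS=30
    $ RESOURCES=$(( TABLES + TABLES * FILES_PER_TABLE ))               # 1,752,500 resources (~1.75M)
    $ TOTAL_COLUMNS=$(( RESOURCES * COLUMNS ))                         # 52,575,000 columns (~53M)
    $ echo "Discovery Cache: $(( TOTAL_COLUMNS * 4 / 1000000 )) GB"    # 4 KB per column, ~210 GB
    $ echo "Repository: $(( RESOURCES / 100000 * 2 )) GB"              # 2 GB per 100K resources, ~34 GB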

In addition, Data Catalog produces logs that are stored on the host node. A best practice is to plan for 200 MB a day and to store several weeks of logs, for a total of 6 to 10 GB.

Data sources

Lumada Data Catalog can process data from file systems, Hive, and relational databases. The following table lists the different types of supported data sources:

Data source type     Supported versions                        Notes
Amazon DynamoDB      2019.11.21                                -
Aurora               MySQL Compatible 5.6.10                   -
                     PostgreSQL Compatible 12.4
Azure                CosmosDB                                  -
                     PostgreSQL 12.4
                     SQL DB
                     SQL Server
                     Synapse Analytics
Cloud/Blob storage   S3                                        For Azure only
                     Azure Blob Storage (wasb://)
                     Azure Data Lake Storage (adl://)
                     Azure Data Lake Storage Gen2 (abfs://)
DB2 LUW              -                                         -
Denodo               8.0                                       For LDC 7.1
HBase                -                                         -
HDFS                 As supported by the distribution          -
Hive                 As supported by the distribution          -
MSSQL                15.x                                      -
MySQL                8.0                                       -
Oracle               11g                                       -
                     19c
PostgreSQL           14.1                                      -
Redshift             1.0.15345                                 -
SAP HANA             2.00.030.00.1522210459                    Certification available on demand
Snowflake            5.30.2 / latest cloud version available   IOTA
Teradata             16.20.49.01                               Certification available on demand
Vertica              10.01.0001                                -

Data Catalog stores the connection information to the data sources, including their access URLs and credentials for the LDC service user, in a data sources entity. This entity stores the details of the connections established between the Data Catalog web server and engine and the data sources.

Connectors supported

The following table lists the supported connectors:

Connector      Version                Notes
Apache Atlas   2.0.0 with CDP 7.1.7   -
               1.1.0 with HDP 3.1     -
Denodo         8.0                    For LDC 7.1

Authentication

Data Catalog integrates with your existing Linux and Hadoop authentication mechanisms. When you install Data Catalog, Keycloak authentication is installed by default.

Web browsers

Data Catalog supports major versions of web browsers that are publicly available.

Browser           Supported version
Google Chrome     98.0.4758.102 (Official Build) (x86_64)
Mozilla Firefox   97.0 (64-bit)
iOS Safari        15.3 (16612.4.9.1.8, 16612)

Note: For Google Chrome and Mozilla Firefox, backward version compatibility is contingent upon changing the browser libraries. See the Hitachi Vantara Lumada and Pentaho Support Portal for any specific version compatibility.

Data Catalog ports

Data Catalog uses the following default ports and protocols for the specified components:

Component                         Protocol   Port
Data Catalog application server   HTTPS      31083
Keycloak administration console   HTTPS      30880
MinIO console                     HTTPS      30901
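
After installation, you can confirm that these ports are reachable from a client machine; the host name below is a placeholder for your Data Catalog host.

    $ nc -zv datacatalog-host 31083    # Data Catalog application server
    $ nc -zv datacatalog-host 30880    # Keycloak administration console
    $ nc -zv datacatalog-host 30901    # MinIO console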

Minimum node requirements

The following sections list the minimum node requirements for your system:

Kubernetes requirements

The requirements for Kubernetes include:

  • Minimum of 20 GB, typically 100 GB, of disk space. The node does not store data or metadata, just the software and logs.
  • Minimum of 4 CPUs, running at least 2 to 2.5 GHz, typically 8 CPUs.
  • Minimum of 6 GB of RAM, typically 16 GB.
  • Bonded Gigabit Ethernet or 10 Gigabit Ethernet.
  • Linux operating system. The netstat and lsof commands are useful.
  • JDK version 1.8.x.
  • Installed Java Cryptography Extension (JCE) policy file, available from Oracle at www.oracle.com/technetwork/java/javase/downloads/index.html.
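
On a typical Linux node, the following commands give a quick check against these minimums; they do not enforce any thresholds themselves, so compare their output with the values listed above.

    $ nproc           # CPU count (minimum 4, typically 8)
    $ free -g         # RAM in GB (minimum 6, typically 16)
    $ df -h           # available disk space (minimum 20 GB, typically 100 GB)
    $ java -version   # expect a 1.8.x JDK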

Agent

Agents are processing engine components that can run on multiple remote clusters, and are installed on the edge nodes of a Hadoop cluster.

The requirements include:

  • Minimum of 20 GB, typically 100 GB, of disk space. The node does not store data or metadata, just the software and logs.
  • Minimum of 4 CPUs, running at least 2 to 2.5 GHz, typically 16 CPUs if running Spark in YARN client mode.
  • Minimum of 6 GB of RAM, typically 64 GB if running Spark in YARN client mode.
  • Bonded Gigabit Ethernet or 10 Gigabit Ethernet.
  • Linux operating system. The netstat and lsof commands are useful.
  • JDK version 1.8.x.
  • Installed Java Cryptography Extension (JCE) policy file, available from Oracle at www.oracle.com/technetwork/java/javase/downloads/index.html.

Multi-byte support

Data Catalog handles cluster data transparently, assuming the data is stored in formats that Data Catalog supports. Data Catalog does not enforce any additional limitations beyond what Hadoop and its components enforce. However, there are locations where the configuration of your Hadoop environment should align with the data you are managing. These locations include:

  • Operating system locale.
  • Character set supported by Hive client and server.
  • Character set supported by MongoDB.
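
For example, you can confirm that the operating system locale uses a multi-byte-capable encoding such as UTF-8:

    $ locale          # expect a UTF-8 locale, for example LANG=en_US.UTF-8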

The Data Catalog browser application allows users to enter multi-byte characters to annotate HDFS data. Where Data Catalog interfaces with other applications, such as Hive, Data Catalog enforces the requirements of the integrated application.

Component validations

You must verify the proper functioning of the various components that interact with Data Catalog.

Perform the following validations:

  1. Validate Spark environment variables.
  2. Validate the user authentication method.

Once these validations are complete, you can start configuring the components for Data Catalog compatibility.

Validate Spark environment variables

You need to run a smoke test to verify that all underlying Spark environment variables are correctly set up. Data Catalog jobs run in a manner similar to a Spark SQL context.

Perform the following steps to run the smoke test:

Procedure

  1. Open spark-shell:

    $ /usr/bin/spark-shell

  2. Import the following packages to create SparkConf, SparkContext, and HiveContext objects:

    scala> import org.apache.spark.SparkConf

    scala> import org.apache.spark.SparkContext

    scala> import org.apache.spark.sql.hive.HiveContext

  3. Create a new SQL context and run a test query against an existing Hive table, replacing database.table with a table you can access:

    scala> val sqlContext = new HiveContext(sc)

    scala> sqlContext.sql("select count(*) from database.table").collect().foreach(println)

    If the count prints without errors, the Spark environment variables and the Hive integration are set up correctly.

Validate the user authentication method

Before installing Data Catalog on a secure cluster, you need to identify what security measures your cluster employs for authentication so you can integrate Data Catalog to use the sanctioned security channels. You need to know the authentication method and configuration details. You also need to validate that you can access your authentication system.
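
For example, if your cluster uses Kerberos, you can confirm that the Data Catalog service user can obtain a valid ticket before installing. The keytab path, principal, and realm below are placeholders.

    $ kinit -kt /etc/security/keytabs/ldc.service.keytab ldc@EXAMPLE.COM
    $ klist    # confirm that a valid ticket-granting ticket is listed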

Keycloak authentication

Keycloak authentication is installed by default.