System requirements

Processing engine

Data Catalog uses Apache Spark for profiling jobs against HDFS and Hive data. The application code is transferred to each cluster node and is executed against the data that resides on that node. The results are collected in the Postgres repository, the Discovery Cache, or both. Data Catalog then runs Spark jobs against the resulting metadata to determine tag association suggestions. See Discovery Cache storage.

Data Catalog jobs are standard Spark jobs, so your organizational practices for tuning cluster operations apply to running and tuning Data Catalog jobs.

Hadoop distributions

Data Catalog must be installed on the edge node within the Hadoop cluster. The Hadoop and Hive clients installed on this edge node must have access to the Hadoop NameNode and HiveServer2.

The following table lists the supported distributions and the compatible applications for each Hadoop distribution on the edge node.

Distribution	Component	Version	Notes
Cloudera	CDH	6.1	-
Cloudera	Apache Spark™	2.4	-
Cloudera	Solr	8.4.1	Installed separately
Cloudera	HDFS	3.0.0	-
Cloudera	HIVE	2.1.1	-
Cloudera	Postgres	11.9	Installed separately
Cloudera	Navigator	-	Certification available on demand
Cloudera	CDP	7.1.3	-
Cloudera	Apache Spark	2.4	-
Cloudera	Solr	8.4.1	-
Cloudera	HDFS	3.1.1	-
Cloudera	HIVE	3.1.3	-
Cloudera	Postgres	11.9	-
Cloudera	Atlas	2.0.0	-
Hortonworks	HDP	3.1.0	-
Hortonworks	Apache Spark	2.3.2	-
Hortonworks	Solr	8.4.1	Installed separately
Hortonworks	HDFS	3.1.1	-
Hortonworks	HIVE	3.1.0	-
Hortonworks	Postgres	11.9	Installed separately
Hortonworks	Atlas	1.1.0	-
MEP (MapR Ecosystem Pack)	MapR	6.1.0	-
MEP (MapR Ecosystem Pack)	Apache Spark	2.4.0	-
MEP (MapR Ecosystem Pack)	Solr	8.4.1	Installed separately
MEP (MapR Ecosystem Pack)	MapR-FS	6.1.0.20180926230239.GA	-
MEP (MapR Ecosystem Pack)	HIVE	2.3.3	-
MEP (MapR Ecosystem Pack)	Postgres	11.9	Installed separately
AWS	EMR	5.30.1	-
AWS	Apache Spark	2.4.5	-
AWS	Solr	8.4.1	Installed separately
AWS	HDFS	2.8.5	-
AWS	HIVE	2.3.6	-
AWS	Postgres	11.9	Installed separately
Microsoft Azure	HDI	4.0	-
Microsoft Azure	Apache Spark	2.4.4	-
Microsoft Azure	Solr	8.4.1	Installed separately
Microsoft Azure	HDFS	3.1.1	-
Microsoft Azure	HIVE	3.1.0	-
Microsoft Azure	Postgres	11.9	Installed separately
Microsoft Azure	Atlas	-	NA (HDP Only)

Indexing engine

Data Catalog uses Solr to provide an integrated repository and index solution. Configuration settings for Solr are described in Apache Solr configuration.

Note: As a best practice, install Solr on a separate node from Data Catalog. If your environment is set up differently, take that into account before altering Solr configurations as directed in the Apache Solr configuration articles.

Distribution	Supported versions	Unsupported
Apache Solr	8.4.1	5.5.4, 6.1 and 7.5
Lucidworks Solr	8.4.1	on HDP; 5.x, 6.1 and 7.5

Capacity planning

Data Catalog stores data in three locations:

Solr storage
The Solr Repository stores the indexes required for Data Catalog searches.
Discovery Cache storage
The Discovery Cache uses HDFS or S3 storage to store the metadata of profiled resources.
Repository storage
The Postgres database stores the transactional metadata.

Solr storage

Solr storage requirements are 1.5 KB per document or Data Catalog item, including data sources, resources, fields, tags, tag domains, and tag associations. You must multiply this value by the replication factor configured for the Solr collection.

If you do not have details for calculating this volume, you can estimate it by using 10% of the size of the data you intend to catalog.

Discovery Cache storage

The Discovery Cache stores the metadata that Data Catalog gathers about every resource it profiles in the data lake. This metadata forms the fingerprints of the data lake and is stored in the Discovery Cache on local storage.

The Discovery Cache stores tag propagation and lineage discovery algorithms. Typically, this local storage is in HDFS, but in non-Hadoop deployments, an S3 bucket or other cloud storage can be used.

The Discovery Cache URI must be owned by the Data Catalog service user with full read, write, and execute privileges. Other non-service users must not be granted write permissions to this location. As a best practice, the Discovery Cache should be periodically backed up.
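
The ownership and permission rules above can be sketched for an HDFS-backed Discovery Cache as follows. This is an illustrative sketch, not part of the Data Catalog installer: the cache path, the service user name `ldcuser`, and the default mode `755` are assumptions. The sketch only builds `hdfs dfs -chown` and `hdfs dfs -chmod` command lines and checks that a mode gives the owner full access without granting write access to group or other users.

```python
def mode_is_acceptable(mode):
    """Check an octal mode string against the Discovery Cache rules.

    The owner needs read, write, and execute; group and others must not
    have write permission.
    """
    bits = int(mode, 8)
    owner_has_rwx = (bits & 0o700) == 0o700
    others_cannot_write = (bits & 0o022) == 0
    return owner_has_rwx and others_cannot_write


def discovery_cache_permission_commands(cache_uri, service_user, mode="755"):
    """Build the hdfs CLI commands that apply ownership and permissions.

    cache_uri and service_user are site-specific values (hypothetical here).
    The commands are returned as argument lists and are not executed.
    """
    if not mode_is_acceptable(mode):
        raise ValueError(f"mode {mode} does not satisfy the Discovery Cache rules")
    return [
        ["hdfs", "dfs", "-chown", "-R", service_user, cache_uri],
        ["hdfs", "dfs", "-chmod", "-R", mode, cache_uri],
    ]


if __name__ == "__main__":
    # Hypothetical path and user, shown for illustration only.
    for cmd in discovery_cache_permission_commands(
        "hdfs:///user/ldcuser/discovery_cache", "ldcuser"
    ):
        print(" ".join(cmd))
```

A stricter mode such as `750` or `700` also passes the check. For S3 or other cloud storage, the equivalent control is the bucket or container access policy rather than these commands.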

The size requirements are 4 KB per column.

Repository storage

Data Catalog uses a PostgreSQL database for storing audit logs and transactional data, including the sample values and sample data shown in resource views. The Data Catalog-specific data is stored in a separate database, ldc_db (default). Because this database contains sample values and sample data, the ldc_db database should be accessible only to the Data Catalog service users.

The repository storage requirements are 2 GB of Postgres storage per 100K resources or 1M resource fields.

Data at rest
Data Catalog does not suggest any method to secure data at rest on the Postgres server. You should secure data according to your organization's policies.
Data in transit
Data Catalog offers support for securing data in transit when its clients are communicating with the Postgres server. Securing data in transit between the Data Catalog and Postgres should be done according to the guidelines listed in the Postgres documentation. See Secure TCP/IP Connections with SSL.

Note: You can choose to use an existing instance of Postgres in your enterprise. The details required by Data Catalog in this scenario are covered in the Part 2: Perform installation in the browser section.

Cluster compute resources

Data Catalog performs most of its computation as Spark jobs on the cluster. It profiles new and updated files, so after profiling existing files in a data lake, it needs resources only to manage the ingress of data.

Hitachi Vantara suggests a memory setting of 2 GB for the Apache YARN container.

The suggested Spark memory requirements are:

Driver memory: 2 GB.
Executor memory: Proportional to the size of the records. A size estimate is roughly 12 times the size of a single row of data being processed. For example, if your data contains rows that are 100 MB, the executor memory should be at least 1200 MB.
Note: A row is defined as a row of data, not a single line in the file. For a JSON file that does not include line breaks, the row is the single row of data, not the entire file.
When you run deep profiling jobs on data sets on the order of 10,000 resources, you should increase executor memory to as high as 12 GB.
When you run tag discovery jobs on data sets of 5,000 resources, increase executor memory to 12 GB.
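
The sizing rule above can be expressed as a short calculation. The following sketch assumes the 12x multiplier and the 2 GB driver memory stated in this section, and it builds the standard `spark-submit` memory options from them. The 1024 MB floor is an assumption added here, chosen to match Spark's default executor memory of 1g; it is not a Data Catalog setting.

```python
import math

ROW_MULTIPLIER = 12    # executor memory is roughly 12x the size of one row
DRIVER_MEMORY = "2g"   # suggested driver memory from this section


def executor_memory_mb(row_size_mb):
    """Estimate executor memory in MB from the size of a single row in MB."""
    return math.ceil(ROW_MULTIPLIER * row_size_mb)


def spark_submit_memory_args(row_size_mb, floor_mb=1024):
    """Build spark-submit memory options for a given row size.

    floor_mb is an assumed lower bound so that small rows do not produce
    an executor smaller than Spark's default.
    """
    memory_mb = max(executor_memory_mb(row_size_mb), floor_mb)
    return [
        "--driver-memory", DRIVER_MEMORY,
        "--executor-memory", f"{memory_mb}m",
    ]


if __name__ == "__main__":
    # Rows of 100 MB, as in the example in this section.
    print(" ".join(spark_submit_memory_args(100)))
```

For deep profiling or tag discovery jobs on thousands of resources, the result of this calculation would be overridden with the larger values listed above, up to 12 GB.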

You can manage cluster resources that Data Catalog consumes for any given operation using YARN.

For information about tuning Spark jobs, see the Spark article https://spark.apache.org/docs/2.3.2/tuning#tuning-spark

Example estimate

Data Catalog stores data in three locations:

Solr storage
1.5 KB per Data Catalog item, including data sources, resources, fields, tags, tag domains, and tag associations. This value must be multiplied by the replication factor configured for the Solr collection. If you do not have details for calculating this volume, consider using 10% of the size of the data you intend to catalog.
HDFS storage
4 KB per column.
PostgreSQL database
2 GB per 100K resources or 1M resource fields.

The following example provides an estimate of the storage you would need to set up for a 250 TB data lake. It assumes an environment with the following conditions:
- Both Hive tables and HDFS data are included in the Data Catalog.
- All HDFS data appears in a Hive table.
- Files and tables have an average of 30 columns, which are fields in Data Catalog.
- Each table has a large set of backing files, such as daily files for multiple years.

The example data lake contains 2500 tables of 100 GB each. Each table has 700 files, and each table/file has 30 fields/columns.

Total tables and files: 2500 tables with 700 files each = 1.75M resources.
Tables: 2500 tables x 30 columns = 75,000 columns.
Files: 2500 tables x 700 files per table x 30 columns = 52,500,000 columns.
Total of about 53 million columns.
HDFS storage: 53 M x 4 KB = 212 GB.
Solr storage: about 55 M items (53 M fields plus 1.75 M resources) x 1.5 KB = 82.5 GB (x Solr collection replication factor).

In addition, Data Catalog produces logs that are stored on the host node. A best practice is to plan for 200 MB per day and to store several weeks of logs, for a total of 6 to 10 GB.
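
The arithmetic in this example can be reproduced with a short script. The sketch below uses the per-item figures from this section and decimal units (1 GB = 1,000,000 KB), as the example does. Two points are assumptions rather than documented behavior: the Solr item count is taken as resources plus fields only, and the PostgreSQL figure uses the larger of the per-resource and per-field estimates, because the text gives the two rates as alternatives.

```python
KB_PER_GB = 1_000_000  # decimal units, matching the worked example


def estimate_storage(tables, files_per_table, columns_per_resource,
                     solr_replication_factor=1):
    """Estimate Data Catalog storage needs for a data lake, in GB."""
    files = tables * files_per_table
    resources = tables + files
    columns = resources * columns_per_resource
    items = resources + columns  # assumption: ignores tags and tag domains

    discovery_cache_gb = columns * 4 / KB_PER_GB
    solr_gb = items * 1.5 / KB_PER_GB * solr_replication_factor
    # Assumption: take the larger of the two PostgreSQL sizing rates.
    postgres_gb = 2 * max(resources / 100_000, columns / 1_000_000)

    return {
        "resources": resources,
        "columns": columns,
        "items": items,
        "discovery_cache_gb": discovery_cache_gb,
        "solr_gb": solr_gb,
        "postgres_gb": postgres_gb,
    }


if __name__ == "__main__":
    estimate = estimate_storage(tables=2500, files_per_table=700,
                                columns_per_resource=30)
    for name, value in estimate.items():
        print(f"{name}: {value:,.2f}")
```

Because the script does not round the column and item counts up to 53 M and 55 M, its Discovery Cache and Solr figures come out slightly lower than the rounded values in the example.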

Data sources

Lumada Data Catalog can process data from file systems, Hive, and relational databases. The following table lists the different types of supported data sources:

Data source type	Supported versions	Notes
Aurora	MySQL Compatible 5.6.10	-
Aurora	PostgreSQL Compatible 11.6	-
Cloud/Blob storage	S3	-
Cloud/Blob storage	ADL	For Azure only
Cloud/Blob storage	WASB	Certification available on demand
Cloud/Blob storage	GCP	Certification available on demand
HDFS	As supported by the distribution	-
HIVE	As supported by the distribution	-
Oracle	11g Express Edition	Release 11.2.0.2.0 - 64bit production
MSSQL	14.30.3030	-
MySQL	5.7.29	-
PostgreSQL	11.9	-
Redshift	-	Certification available on demand
SAP-HANA	-	Certification available on demand
Snowflake	3.x	-
Sybase	-	Certification available on demand
Teradata	-	Certification available on demand

Data Catalog stores the connection information for these data sources, including their access URLs and the credentials for the LDC service user, in a data sources entity. This entity stores the details of the connections established between the Data Catalog web server and engine and the sources.

Authentication

Data Catalog integrates with your existing Linux and Hadoop authentication mechanisms. When you install Data Catalog, you can select one user authentication method from the following options: SSH, Kerberos, or LDAP.

SSH authentication

When using SSH for user authentication, the Data Catalog Application Server communicates with the host system on the listen address and port defined in the /etc/ssh/sshd_config file. The person who installs Data Catalog needs this address and port information.

Note: SSH assumes the password authentication mechanism is available to the Data Catalog Application Server. Because the password authentication mechanism is not available to the Data Catalog Application Server in cloud configurations, SSH authentication does not work on systems using Amazon AWS, Google Compute, or other cloud configurations.
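
To find the listen address and port that the installer needs, you can read the `Port` and `ListenAddress` keywords from the sshd_config file. The following sketch is a simplified parser written for illustration: it assumes whitespace-separated keywords, does not follow `Include` directives, and returns port 22 when no `Port` line is present, which is the OpenSSH default. An empty address list means that sshd listens on all local addresses.

```python
def ssh_listen_settings(config_text):
    """Extract Port and ListenAddress values from sshd_config text.

    Returns (ports, addresses). Keywords are matched case-insensitively,
    and anything after a '#' is treated as a comment.
    """
    ports = []
    addresses = []
    for raw_line in config_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        keyword, value = parts[0].lower(), parts[1].strip()
        if keyword == "port":
            ports.append(int(value))
        elif keyword == "listenaddress":
            addresses.append(value)
    return (ports or [22], addresses)


if __name__ == "__main__":
    with open("/etc/ssh/sshd_config", encoding="utf-8") as handle:
        print(ssh_listen_settings(handle.read()))
```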

Kerberos authentication

If the cluster is controlled using Kerberos authentication, there are two ways to configure Data Catalog to interact with the KDC (Key Distribution Center):

All user interactions are controlled through Kerberos.
Only service user operations are controlled through Kerberos.

LDAP authentication

You can configure Data Catalog to interact with LDAP for user authentication.

For more information about customizing configurations for these authentication mechanisms, see Component validations.

In addition to these basic authentication integrations, Data Catalog supports federated authentication mechanisms. The configuration settings for these additional methods are covered in the security topics.

Web browsers

Data Catalog supports major versions of web browsers that are publicly available prior to the finalization of the Lumada Data Catalog release.

Browser	Supported Version	Notes
Google Chrome	84.0.4147.105 (Official Build) (64-bit)	Backward version compatibility is contingent upon changing the browser libraries. Contact our support team for any specific version compatibility.
Microsoft Edge	42.17134.1098.0	Backward version compatibility is contingent upon changing the browser libraries. Contact our support team for any specific version compatibility.
Mozilla Firefox	79.0 (64-bit)	Backward version compatibility is contingent upon changing the browser libraries. Contact our support team for any specific version compatibility.
MacOS Safari	13.1.2	Catalina

Data Catalog ports

Data Catalog uses the following ports and protocols for the specified components:

Component	Protocol	Port
Data Catalog Application server	HTTP	8082
Data Catalog Application server	HTTPS	4039
Data Catalog Metadata server	HTTP	4242
PostgreSQL Server	TCP or TCP-over-SSL	5432

Note: To change the ports, see Configure LDC ports.

Firewall considerations

If Data Catalog users, including service users, need to access Data Catalog across a firewall, allow access to the following ports at the cluster IP address.

Note: These are the default ports that Data Catalog uses when installed. If you need to change these ports after installation, see Configure LDC ports.

Component	Port	Use
Data Catalog browser application	8082	Grants users access to the Data Catalog browser application
Secure Data Catalog browser application (with HTTPS)	4039	Grants users secure access to the Data Catalog browser application
Metadata repository REST API endpoint	4242	Used internally by the Data Catalog Application Server and agent components to communicate with the repository
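
After the firewall rules are in place, you can confirm from a client machine that the ports in the table above accept TCP connections. The following sketch is one way to do that with the Python standard library. The host name is a placeholder, and the port numbers are the installation defaults listed above; adjust them if your ports were changed. A successful connection shows only that the port is reachable, not that the service behind it is healthy.

```python
import socket

# Default ports from the firewall table; change these if yours differ.
DEFAULT_PORTS = {
    "Data Catalog browser application (HTTP)": 8082,
    "Data Catalog browser application (HTTPS)": 4039,
    "Metadata repository REST API endpoint": 4242,
}


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_ports(host, ports=None, timeout=3.0):
    """Check each named port on the host and return {name: reachable}."""
    ports = DEFAULT_PORTS if ports is None else ports
    return {name: port_is_open(host, port, timeout) for name, port in ports.items()}


if __name__ == "__main__":
    # "ldc.example.com" is a placeholder for your cluster address.
    for name, reachable in check_ports("ldc.example.com").items():
        print(f"{name}: {'open' if reachable else 'unreachable'}")
```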

Minimum node requirements

The following sections list the minimum node requirements for each component of your system:

Data Catalog Application Server
Data Catalog Metadata Server
Agent
Solr server

Data Catalog Application Server

The Data Catalog Application Server can be installed either on a Hadoop edge node as a traditional single cluster installation, or in a non-Hadoop virtual machine outside the cluster as a distributed, multi-cluster installation.

The requirements include:

Minimum of 20 GB, typically 100 GB of disk space. The node does not store data or metadata, just the software and logs.
Minimum of 4 CPUs, running at least 2 to 2.5 GHz, typically 8 CPUs.
Minimum of 6 GB of RAM, typically 16 GB.
Bonded Gigabit Ethernet or 10 Gigabit Ethernet.
Linux operating system. The netstat and lsof commands are useful.
JDK version 1.8.x.
Installed Java Cryptography Extension (JCE) policy file available from Oracle at www.oracle.com/technetwork/java/javase/downloads/index.html.

Agent

Agents are processing engine components that can run on multiple remote clusters, and are installed on the edge nodes of a Hadoop cluster.

The requirements include:

Minimum of 20 GB, typically 100 GB disk space. The node does not store data or metadata, just the software and logs.
Minimum of 4 CPUs, running at least 2 to 2.5 GHz, typically 16 CPUs if running Spark in YARN client mode.
Minimum of 6 GB of RAM, typically 64 GB if running Spark in YARN client mode.
Bonded Gigabit Ethernet or 10 Gigabit Ethernet.
Linux operating system. The netstat and lsof commands are useful.
JDK version 1.8.x.
Installed Java Cryptography Extension (JCE) policy file available from Oracle at www.oracle.com/technetwork/java/javase/downloads/index.html.

Data Catalog Metadata Server

The Data Catalog Metadata Server is installed close to the Solr nodes and acts as an API endpoint for profiling jobs to communicate with the metadata repository.

The requirements include:

Minimum of 20 GB, typically 100 GB disk space. The node does not store data or metadata, just the software and logs.
Minimum 4 CPUs, running at least 2 to 2.5 GHz, typically 8 CPUs.
Minimum 6 GB of RAM, typically 16 GB.
Bonded Gigabit Ethernet or 10 Gigabit Ethernet.
Linux operating system. The netstat and lsof commands are useful.
JDK version 1.8.x.

Solr server

These requirements assume that the operating system and other components of the node are properly provisioned in terms of memory. The following values provide a baseline for an example Solr server environment:

OS: 4 GB minimum.
Solr runtime: 2 GB.
Solr Index: 8 GB.
Note: This index size is proportional to the number of documents in the repository. An 8 GB index accounts for approximately 1 million Solr documents. As your index size increases, you may be able to improve performance by increasing the RAM available on the Solr server.
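
Because the index size scales with the number of documents, the baseline above can be extended to other repository sizes. The following sketch assumes simple linear scaling of the index at 8 GB per 1 million documents, added to the OS and Solr runtime baselines. It is a rough planning aid, not a measured sizing model.

```python
def solr_server_ram_gb(documents, os_gb=4.0, runtime_gb=2.0,
                       index_gb_per_million=8.0):
    """Estimate Solr server memory in GB for a given document count.

    Assumes the index portion grows linearly with the number of documents.
    """
    index_gb = index_gb_per_million * documents / 1_000_000
    return os_gb + runtime_gb + index_gb


if __name__ == "__main__":
    for count in (1_000_000, 2_500_000):
        print(f"{count:,} documents: {solr_server_ram_gb(count):.1f} GB")
```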

Multi-byte support

Data Catalog handles cluster data transparently, assuming the data is stored in formats that Data Catalog supports. Data Catalog does not enforce any additional limitations beyond what Hadoop and its components enforce. However, there are locations where the configuration of your Hadoop environment should align with the data you are managing. These locations include:

Operating system locale.
Character set supported by Hive client and server.
Character set supported by Solr.

The Data Catalog browser application allows users to enter multi-byte characters to annotate HDFS data. Where Data Catalog interfaces with other applications, such as Hive, Data Catalog enforces the requirements of the integrated application.
