Installing Lumada Data Catalog on Amazon EMR

You can use Amazon EMR as the computation platform for Lumada Data Catalog. As with any other platform, the catalog can contain resources from S3 buckets, HDFS file system, Hive databases, and other relational databases.

Requirements

Data Catalog supports the following configuration on Amazon EMR:

  • Amazon EMR version 5.6.1 (running Hadoop 2.7.3 and Spark 2.0; ZooKeeper; Hive)
  • Amazon EMR version 5.12 (running Hadoop 2.8.3 and Spark 2.2.1; ZooKeeper 3.4.10; Hive 2.3.2)

  • Recommended Solr version 7.5.0 running in SolrCloud mode with a single shard (non-HDFS storage)
  • Data Catalog service user requires the S3 bucket actions s3:ListBucket and s3:GetObject
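
For reference, the sketch below shows one minimal IAM policy granting these two actions and attaching it to a hypothetical IAM user. The bucket name (my-ldc-bucket), user name (ldc-service), and policy name are placeholders; your account may instead grant the permissions through a role or instance profile.

    $ cat > ldc-s3-policy.json <<'EOF'
    {
      "Version": "2012-10-17",
      "Statement": [
        { "Effect": "Allow", "Action": "s3:ListBucket",
          "Resource": "arn:aws:s3:::my-ldc-bucket" },
        { "Effect": "Allow", "Action": "s3:GetObject",
          "Resource": "arn:aws:s3:::my-ldc-bucket/*" }
      ]
    }
    EOF
    $ aws iam put-user-policy --user-name ldc-service \
        --policy-name ldc-s3-read --policy-document file://ldc-s3-policy.json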

Sizing estimates

If you plan to install Data Catalog and Solr on the same node, the node should have at least 64 GB RAM. This corresponds to an EC2 m5.2xlarge instance. Alternatively, configure Data Catalog and Solr on separate nodes in the same cluster.

Launching the Amazon EMR instance

Depending on the volume of data you intend to process, you can include as few as one master node and one compute node in your cluster. If you choose to use spot pricing, leave enough headroom for price surges so you don't lose your instances unexpectedly. Be sure to instantiate the cluster in the same region as your S3 bucket data.
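
For illustration only, a minimal two-node cluster matching the supported EMR 5.12 configuration could be launched with the AWS CLI as shown below; the cluster name, key pair, and region are placeholders you must adapt to your account:

    $ aws emr create-cluster \
        --name ldc-cluster \
        --release-label emr-5.12.0 \
        --applications Name=Hadoop Name=Spark Name=Hive Name=ZooKeeper \
        --instance-type m5.2xlarge \
        --instance-count 2 \
        --ec2-attributes KeyName=my-key-pair \
        --use-default-roles \
        --region us-east-1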

Preparation

Before installing Data Catalog, make sure you have read and followed the Pre-installation Validations for your environment, specifically:

Downloading and installing Solr

Refer to the Installing standalone Solr document for instructions on installing Solr in SolrCloud mode for EMR.
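
For orientation, starting Solr in SolrCloud mode and creating the recommended single-shard collection typically looks like the following sketch; the ZooKeeper host and collection name are placeholders, and the linked document remains the authoritative procedure:

    $ bin/solr start -cloud -z <zookeeper-host>:2181 -p 8983
    $ bin/solr create -c ldc -shards 1 -replicationFactor 1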

Downloading and installing Postgres

We highly recommend installing the complementary Postgres package provided with the Data Catalog distribution.

Download the Postgres package and follow the on-screen installation instructions.
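
Once Postgres is installed, you can sanity-check connectivity with psql before continuing; the host, port, user, and database below are illustrative and should match whatever the package set up:

    $ psql "postgresql://ldcuser@localhost:5432/ldc" -c "SELECT version();"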

Downloading the Data Catalog

If you haven't already, download the Data Catalog distribution from the location provided by Hitachi Vantara and upload it to the AWS node (the master node) where you will run Data Catalog. If your organization has subscribed to support, you can find the location through the Hitachi Vantara Lumada and Pentaho Support Portal.

The Data Catalog is packaged as three separate installers:

  • Application server: ldc-app-server-X.run
  • Metadata server: ldc-metadata-server-X.run
  • Agent: ldc-agent-X.run

Where X is the specific version you want to install.

Installation Sequence

The installation of components must follow this sequence:

  1. Application server
  2. Metadata server
  3. Agent(s)

WARNING
  • This install path assumes that the user running the installation has sudo access. If needed, the installer creates the Data Catalog service user.
  • If you do not have sudo access, create the directories for the software and logs in advance, then choose Custom install and specify the Data Catalog service user.

As the Data Catalog service user, extract the installer from the tar package.
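
For example, assuming a bundle name along these lines (the actual archive name varies by release):

    $ tar -xvf ldc-distribution-<version>.tar
    $ chmod +x ldc-app-server-*.run ldc-metadata-server-*.run ldc-agent-*.run
    $ ls -1 ldc-*.run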

The following is a generic installation on a non-Kerberized environment. For environment-specific installations, refer to those sections in the Installation guide.

If installing on a Kerberized environment, please refer to the Installation with Special Cases.

Note: If you cannot find installation instructions for your specific environment, contact our support team; if your environment is supported, we will try to accommodate your request.

Application server

Follow the topics below to install Data Catalog:

Part 1: Perform the command line installation

Follow the steps below to start installing Data Catalog in your environment.

Procedure

  1. Stop any previously running Data Catalog processes.

  2. As a user with root access, execute:

    ./ldc-app-server-*.run
    Verifying archive integrity...  100%   MD5 checksums are OK. All good.
    Uncompressing Lumada Data Catalog App Server Installer  100%
    
    
    This program installs Data Catalog Application Server.
    
    Press ^C at any time to quit.
    
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                   LUMADA DATA CATALOG APPLICATION SERVER INSTALLER
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Enter the name of the Data Catalog service user [ldcuser]: wlddev
    Enter install location [/opt/ldc] :
    Enter log location [/var/log/ldc] :
    #~~~~~~~~~~~~~~~~~~~~~~~
       SELECTION SUMMARY
    #~~~~~~~~~~~~~~~~~~~~~~~
         Data Catalog service user : wlddev
                    Install location : /opt/ldc (will be created)
                        Log location : /var/log/ldc (will be created)
    Proceed? [Y/n]:
    [sudo] password for wlddev:
    Created directory /opt/ldc
    Created directory /opt/ldc/app-server
    Created directory /var/log/ldc
    Copying files ... done.
    
    Installed app-server to /opt/ldc/app-server
    Starting services ............ done.
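
Before continuing in the browser, you can optionally confirm that the application server is listening on port 8082; a non-error HTTP status from the setup page indicates the services started correctly:

    $ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8082/setup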

Next steps

Part 2: Continue the installation in the browser

The Data Catalog installation process continues in a browser window pointing to the host where you are installing Data Catalog, at port 8082.

To continue the installation in the browser, perform the following steps:

Procedure

  1. Navigate to the setup link at http://<ldc node>:8082/setup.

    The browser opens the Welcome to Lumada Data Catalog page.
  2. Click Let's Get Started.

    The End User License Agreement page appears.
  3. Read the license terms and conditions, then select the checkbox to accept the license agreement, and click I agree.

  4. On the Connect with Solr page, enter the following fields and settings to set up the Data Catalog Solr collection repository:

    1. In Solr Client Mode, choose the client mode corresponding to your Solr installation.

      For EMR we recommend using Cloud mode.

      These are the same values that you configured in Install Solr and create a collection.

    2. In the Solr Server Url field, enter the URL of the Solr server.

      Review the Solr connection URL to ensure it matches the location of the Solr server (the default is the same node where Data Catalog is installed).
    3. In the Solr Zookeeper Ensemble field, enter the ZooKeeper ensemble.

    4. In the Lumada Data Catalog Collection Name field, enter the Solr collection name as defined in the previous Solr steps.

    5. In Solr Authentication Mode, select an authentication mode - None, Basic, or Kerberos.

    6. Click Test Connection.

      The Connection Successful! message appears.

      If the test does not succeed, check to see if Solr is running and that the Data Catalog service user has access to the collection.

    7. Click Next step.

    The Connect with Postgres page appears.
  5. Enter the following fields and settings to set up the Postgres database which will be used for audit logs and Discussions:

    1. In the Postgres Driver Class field, enter the class used to communicate with the Postgres database.

    2. In the Url field, enter the location of the Postgres database installation.

    3. In the Postgres User field, enter the username used to access Postgres.

    4. In the Postgres Password field, enter the password for the above username used to access Postgres.

    5. Click Test Connection.

      The Postgres Connection Successful! message appears.

      If the test does not succeed, check that Postgres is running and that the specified user has access to the database.

    6. Click Next step.

    The Large properties storage page appears.
  6. Enter the following fields for the location in your cluster (typically HDFS) which will be used to store metadata information required for running jobs and identifying tags.

    1. In the Large Metadata Storage Uri field, enter the fully qualified URI location to store intermediate metadata for Lumada Data Catalog processing jobs.

      This URI will be automatically detected by the installer, but if incorrect, enter the URI of the HDFS name node. If HA is enabled, this will be the HA URI of the HDFS service.

      Set the storage location URI if different from the local HDFS. For example, this location could be an S3 bucket.

    2. In the Parent Path field, enter the path of the parent where you want to store the metadata (typically the home directory of the Data Catalog service user).

      This path must be writable by the Data Catalog service user (see the verification sketch after this procedure). Subsequently, when jobs run, a .wld_hdfs_metadata directory is created under this path.
    3. Click Test Connection.

      The Connection Successful! message appears.

      If you haven't already configured the Hadoop proxy settings for the Data Catalog service user as mentioned earlier, Test Connection might fail. Configure those settings, then make sure the client configuration has been propagated to the entire cluster and that the cluster has been restarted.

    4. Click Next step.

    The Repository Bootstrap page appears.

    Data Catalog begins the repository and roles bootstrap process. Verify that the Solr schema and the Postgres schema have been created, and that the roles, job sequences, and built-in tags are bootstrapped successfully. The Solr collection creation should happen only once.

  7. When ready, click Next step to continue.

    The Authentication method page appears.
  8. Select and configure your user authentication scheme.

    In this step you select the authentication method that allows Data Catalog to validate users who log in to the web application.

    For EMR instances, we recommend that you point to an LDAP server.

    The subsequent steps presume that you have already configured your authentication method that will validate the users logging in to Data Catalog.

    1. In Authentication type, select the authentication type. For LDAP, enter the following fields.

    2. In LDAP Auth Mode, select an authentication mode: bind-only, bind-search, or search-bind.

      These are covered in detail in LDAP Authentication Modes.
    3. In the LDAP Url field, enter the URL for the authentication type.

      The default entry is a free, third-party LDAP provider. The URL should begin with "ldap://", or "ldaps://" if you are using a secure connection to the server. The standard LDAP server port is 389 (636 for SSL).
    4. In the Auth Identity Pattern field, enter the identity pattern for the authentication.

      The pattern must contain the username literal, which will be replaced with the actual user ID.

      This string must include the phrase "uid={USERNAME}" and can include other LDAP configuration parameters, such as those that specify users and groups. You can add a search root to the URL to restrict user identity searches to only a part of the LDAP directory.

      For example, "uid={USERNAME},dc=subsidiary,dc=com" (an ldapsearch check using this pattern appears after this procedure).

    5. In the Lumada Data Catalog Administrator field, enter the user who will administer Data Catalog.

      We recommend entering ldcuser. However, you can enter a different name here, now or later. This user is granted the Data Catalog administrator role, which is configured to have access to all data sources and tag domains.

      Use this login to add additional users and continue configuration tasks.

    6. In Test Authentication, enter the user credentials of the administrator in the Username and Password fields to test the authentication.

    7. Click Test Login.

      The Login Successful message appears.
    8. Click Next step to continue to the Restart page.

  9. On the Metadata REST server details page, copy the Metadata server installation command for later reference, but do not execute it yet. You will need this information when installing the Metadata REST server.

    The Application server installation will automatically create a token for the Metadata REST server, which will be used for initializing and registering the Metadata server with the application server.

    The same metadata server token can also be obtained from the UI after restarting the application server: under Manage Tokens, select metadata-rest-server and then Install Metadata Rest Server.

  10. Click Restart to apply the changes.

    You may have to restart the Data Catalog services from the command line to make sure the changes are applied successfully. After the changes are applied, Data Catalog is ready.

  11. Log in with your Data Catalog administrator credentials to begin cataloging your data lake.

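The sketch below shows one way to pre-check two of the settings above from the command line: that the Parent Path from step 6 is writable by the Data Catalog service user, and that the LDAP URL and identity pattern from step 8 can authenticate a user. All paths, hosts, and DNs are placeholders:

    # Create the parent path (if needed) and confirm ownership:
    $ hdfs dfs -mkdir -p /user/ldcuser
    $ hdfs dfs -chown ldcuser:ldcuser /user/ldcuser
    $ hdfs dfs -ls /user/ldcuser

    # Bind against LDAP using the identity pattern, prompting for the password:
    $ ldapsearch -H ldap://ldap.example.com:389 \
        -D "uid=ldcuser,dc=subsidiary,dc=com" -W \
        -b "dc=subsidiary,dc=com" "(uid=ldcuser)"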

Metadata server

The metadata server installation command is auto-generated by the app-server installer for convenient installation of the metadata server.

After restarting the app-server, you can execute the command on the node where you want to install the metadata server:

./ldc-metadata-server-5.1.run -- --init --endpoint proton:8082 \
--client-id metadata-rest-server \
--token 4236cea0-93ad-416d-9b38-919392ac6059 \
--public-host proton \
--port 4242

Where:

  • --init initializes the metadata server (syncs repository configuration from the app-server).
  • --endpoint is the app-server URL to connect to.
  • --token is the authentication token.
  • --public-host is the public host of the metadata server to be reported to agents when they subsequently register.
  • --port is the port on which to run.

Sample output:

./ldc-metadata-server-*.run -- --init --endpoint proton:8082 \
--client-id metadata-rest-server \
--token 4236cea0-93ad-416d-9b38-919392ac6059 \
--public-host proton \
--port 4242

Verifying archive integrity...  100%   All good.
Uncompressing Lumada Data Catalog Metadata Server Installer  100%


This program installs Lumada Data Catalog Metadata Server.

Press ^C at any time to quit.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
               LUMADA DATA CATALOG METADATA SERVER INSTALLER
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    1. Express Install          (Requires superuser access)
    2. Upgrade
    3. Exit

Enter your choice [1-3]: 1
Enter the name of the Lumada Data Catalog service user [ldcuser]: waterlinesvc
Enter install location [/opt/ldc] :
Enter log location [/var/log/ldc] :
Enter the Solr server version [8.4.1]: 7.5.0
Is Kerberos enabled? [y/N]: y
Full path to Lumada Data Catalog service user keytab : /home/waterlinesvc/waterlinesvc.keytab
Lumada Data Catalog service user's fully qualified principal : waterlinesvc@WATERLINEDATA.COM
~~~~~~~~~~~~~~~~~~~~~~~
   SELECTION SUMMARY
~~~~~~~~~~~~~~~~~~~~~~~
Lumada Data Catalog service user : waterlinesvc
                Install location : /opt/ldc
                    Log location : /var/log/ldc
                Kerberos enabled : true
            Kerberos keytab path : /home/waterlinesvc/waterlinesvc.keytab
              Kerberos principal : waterlinesvc@WATERLINEDATA.COM
             Solr server version : 7.5.0
Proceed? [Y/n]: y
Removed existing directory /opt/ldc/metadata-server
Directory /opt/ldc exists.
Created directory /opt/ldc/metadata-server
Directory /var/log/ldc exists.
Copying files ... done.

Installed metadata-server to /opt/ldc/metadata-server
Generating certificate ...
SLF4J: Class path contains multiple SLF4J providers.
SLF4J: Found provider [org.slf4j.simple.SimpleServiceProvider@38af3868]
SLF4J: Found provider [org.apache.logging.slf4j.SLF4JServiceProvider@77459877]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual provider is of type [org.slf4j.simple.SimpleServiceProvider@38af3868]
[main] INFO com.hitachivantara.cli.WldSSLCertificateGenerator - Absolute path for keystore is : /opt/ldc/metadata-server/conf/keystore
[main] INFO com.hitachivantara.utils.WldSSLUtility - SSL certificate successfully generated. Storing certificate in keystore
[main] INFO com.hitachivantara.utils.WldSSLUtility - SSL certificate successfully stored in keystore
        Certificate fingerprint (SHA-256): 28baed0ff68461d6079e8faccb7132d835abb1f66589b2c6d11dcbd313c69f12
Executing command: "/opt/ldc/metadata-server/bin/metadata-server" init --endpoint http://ec2-xx-xxx-xx-xx.aa-aaaa-1.compute.amazonaws.com:8082 --client-id metadata-rest-server --token d60766a6-7c5c-49e5-b86e-dd759fd640eb --public-host ec2-xx-xxx-xx-xx.aa-aaaa-1.compute.amazonaws.com --port 4242 --no-exec false
Initializing application
done.
removed ‘/tmp/tmp.UqOj0eBmhY’
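
Assuming the metadata server serves HTTPS on the configured port (it generates a self-signed certificate during installation, hence curl's -k flag), you can verify it is reachable with a quick check like the one below; the host and port come from the example above:

    $ curl -sk -o /dev/null -w "%{http_code}\n" https://proton:4242/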

Agents

Follow the steps below to create a new Agent.

Procedure

  1. In Lumada Data Catalog, navigate to Manage Agents.

    The Create Agent dialog box appears.

  2. Copy the command generated in the Register Agent dialog box.

  3. Use the command to install the agent as follows:

    ./ldc-agent-2019.3.run -- --register --endpoint cdh:8082 --agent-id radf0e60f224ad436e --agent-token c6cd59db-6225-4698-9dd5-ac12f5d5e434

    Sample output:

    ./ldc-agent-*.run -- --register --endpoint http://ec2-xx-xxx-xx-xx.aa-aaaa-1.compute.amazonaws.com:8082 --agent-id ra8be27f45bd764a58 --agent-token d9417b45-2cd9-401b-927c-4d5c4912c614          
    
    Verifying archive integrity...  100%   All good.
    Uncompressing Lumada Data Catalog Agent Installer  100%
    
    
    This program installs Lumada Data Catalog Agent.
    
    Press ^C at any time to quit.
    
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                   LUMADA DATA CATALOG AGENT INSTALLER
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        1. Express Install          (Requires superuser access)
        2. Upgrade
        3. Exit
    
    Enter your choice [1-3]: 1
    Enter the name of the Lumada Data Catalog service user [ldcuser]:
    Enter install location [/opt/ldc] :
    Enter log location [/var/log/ldc] :
    Enter HIVE version [3.1.2]: 2.1.1
    Is Kerberos enabled? [y/N]: n
    ~~~~~~~~~~~~~~~~~~~~~~~
       SELECTION SUMMARY
    ~~~~~~~~~~~~~~~~~~~~~~~
    Lumada Data Catalog service user : ldcuser
                    Install location : /opt/ldc
                        Log location : /var/log/ldc
                    Kerberos enabled : false
    Proceed? [Y/n]:
    Directory /opt/ldc exists.
    Created directory /opt/ldc/agent
    Directory /var/log/ldc exists.
    Copying files ... done.
    
    Installed agent to /opt/ldc/agent
    Executing command: "/opt/ldc/agent/bin/agent" register --endpoint http://ec2-xx-xxx-xx-xx.aa-aaaa-1.compute.amazonaws.com:8082 --agent-token d9417b45-2cd9-401b-927c-4d5c4912c614 --agent-id ra8be27f45bd764a58 --no-exec false
    Registering agent
    done.
    The installation then continues in the browser at the public IP address for the node where Data Catalog was installed: http://<public_ip_address>:8082/setup.

    Attention: Do NOT try to log in to a fresh installation without first running the Lumada Data Catalog installer setup procedure.
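
To double-check the registration from the command line, you can search the agent log for the registration messages shown in the sample output; the log path below assumes the default log location chosen during installation:

    $ grep -i "registering agent" /var/log/ldc/agent/*.log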

Final EMR setup

  1. Immediately after installation, before defining any S3 data sources, place the requisite JARs on the classpath and restart the app-server.

    If you are running on the EMR node itself, link the existing JARs into the app-server's ext/ directory:

    $ ln -s /usr/share/aws/aws-java-sdk/aws-java-sdk-core-1.11.xxx.jar /opt/ldc/app-server/ext/
    $ ln -s /usr/share/aws/aws-java-sdk/aws-java-sdk-s3-1.11.xxx.jar /opt/ldc/app-server/ext/

    Alternatively, if you are running on a non-EMR EC2 instance or a plain, non-Hadoop VM, download them from Maven:

    $ wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-core/1.11.221/aws-java-sdk-core-1.11.221.jar -P /opt/ldc/app-server/ext/
    $ wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3/1.11.221/aws-java-sdk-s3-1.11.221.jar -P /opt/ldc/app-server/ext/
    

    For a non-EMR EC2 instance, in addition to the files above, you must also remove the following JARs from the agent installation:

    <Agent Dir> $ rm lib/dependencies/httpclient-4.5.10.jar
    <Agent Dir> $ rm lib/dependencies/httpcore-4.4.12.jar
    <Agent Dir> $ rm lib/dependencies/joda-time-2.2.jar
    <Agent Dir> $ rm lib/ldc/ldc-execution-bigquery-2019.3.jar
  2. Download the hadoop-aws JAR into the app-server's ext/ directory.

    This is needed for the S3A file scheme to work (a quick s3a:// listing check appears after these steps).

    $ wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.9.2/hadoop-aws-2.9.2.jar -P /opt/ldc/app-server/ext
    
  3. Restart the app-server:

    $ <LDC-HOME>/app-server/bin/app-server restart
  4. Check the version of the installed JARs.

    $ ls /usr/share/aws/emr/emrfs/lib/

  5. Link the following JARs to ensure the Data Catalog Agent has access to the JARs needed for HDFS and Spark.

    Use the version number from the output of the previous command to set the version in these commands:

    $ ln -s /usr/share/aws/emr/emrfs/lib/emrfs-hadoop-assembly-<version>.jar <LDC-HOME>/agent/ext/
    $ ln -s /usr/lib/hadoop-lzo/lib/hadoop-lzo.jar <LDC-HOME>/agent/ext/

  6. In the ldc script under the <Agent Dir>/bin path, make the following changes:

    1. Update the SPARK_HIVE_SITE_PATH value to point to /etc/spark/conf/hive-site.xml.

    2. In the block below, change $DEFAULT_HIVE_SITE_PATH to $DEFAULT_SPARK_SITE_PATH:

    $PROFILE)
        shift
        buildJarList
        reorder "$@"
        FILES="${FILES},$DEFAULT_HIVE_SITE_PATH"
  7. Restart Agent.

    $ <LDC-HOME>/agent/bin/agent restart

  8. If your EMR version is earlier than 5.6, link the following HiveServer2 JDBC driver JARs into the Data Catalog installation:

    $ ln -s /usr/lib/hive/lib/hive-exec.jar <LDC-HOME>/agent/ext/
    $ ln -s /usr/lib/hive/lib/hive-service.jar <LDC-HOME>/agent/ext/
    $ ln -s /usr/lib/hive/lib/hive-jdbc.jar <LDC-HOME>/agent/ext/
  9. Restart HiveServer2.
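
After the restarts, a quick way to confirm that the hadoop-aws and AWS SDK JARs are wired up correctly is to list an S3 bucket through the s3a:// scheme as the Data Catalog service user; the bucket name is a placeholder:

    $ hadoop fs -ls s3a://my-ldc-bucket/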

Build your Lumada Data Catalog

Now that Data Catalog is installed and running, the next step is to connect to the data you want to include in the catalog. For information on how to create a data source, see Manage data sources.

Note: When adding a data source on EMR:
  • Make sure that the HDFS connection URL reflects s3://
  • Also note the difference between the HDFS URL (s3a://) and the Agent Local URL (s3://)