Configure the Data Catalog service user

As a best practice, configure a dedicated service user to own the installation directory and to run Lumada Data Catalog jobs and services. These articles refer to the service user as "ldcuser". If you do not create an "ldcuser" user, designate another user as the dedicated user for running Data Catalog jobs and services.

Caution: Because Data Catalog needs extensive access privileges to produce a catalog of HDFS files, the user account that runs Data Catalog jobs must be created in accordance with all enterprise security requirements.

The Data Catalog service user (ldcuser) must be a valid user in the system used by your enterprise to authenticate cluster users.

Authentication

Authentication for your service user depends on whether you are using Kerberos or secure MapR.

Kerberos
Configure a principal and keytab. Ask your Kerberos administrator to configure a principal name for the Data Catalog service user and create a corresponding keytab file. You need this information to configure the Data Catalog application server and to run Data Catalog jobs. Because the principal needs to renew tickets automatically, verify that the principal has the allow_renewable attribute.
Secure MapR
Use the maprlogin utility and a username/password pair to generate a valid MapR ticket. See Installing Lumada Data Catalog on MapR for details.

Configure the HDFS proxy user

You need to configure the Data Catalog service user to act on behalf of the users logged on to the application. To do so, configure the Data Catalog service user as an HDFS proxy user for the hosts and groups that access the Data Catalog web application.

Note: Applying changes to cluster configuration files may require restarting cluster components. Coordinate with cluster administrators to include this change in the maintenance schedule.

Perform the following steps to configure the Data Catalog service user to use HDFS trusted delegation.

Procedure

Add the Data Catalog service user (ldcuser for example) to the HDFS or MapR-FS superuser group.
Enable the secure impersonation properties for the Data Catalog superuser in the core-site.xml file on your Hadoop nodes through the Advanced/Custom Configuration sections for the HDFS service in Cloudera Manager or Ambari.
These properties are the hosts and groups that represent users who use the application. The Data Catalog service user needs to be able to act on behalf of these users.
Specify the Data Catalog server node in the proxy host.
You can also specify a comma-separated list of fully qualified host names or group names. Alternatively, you can include all hosts or all groups using an asterisk (*) as the property value. Change ldcuser to the name you are using for the Data Catalog service user, as shown in the following example code:
```
<property>
    <name>hadoop.proxyuser.ldcuser.groups</name>
    <value>*</value>
    <description>Allow the superuser 'ldcuser' to impersonate any user</description>
</property>
<property>
    <name>hadoop.proxyuser.ldcuser.hosts</name>
    <value>*</value>
    <description>The superuser 'ldcuser' can connect from any host to impersonate a user</description>
</property>
```
Note: If running on a Kerberized system, use the same username in the same format when configuring Data Catalog for connecting to the HDFS cluster through Kerberos.
Restart components as necessary to apply the changes on the cluster.
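The asterisk values in the example above allow impersonation from any host and for any group. If you prefer the comma-separated form, the following Python sketch shows one way to render the two properties for a restricted set of hosts and groups. It is an illustration only, not part of Data Catalog: the helper names, the host name, and the group names are hypothetical, and the script assumes Python 3.9 or later for ET.indent.

```python
import xml.etree.ElementTree as ET


def proxyuser_properties(service_user, hosts, groups):
    """Build the two hadoop.proxyuser.* <property> elements for core-site.xml."""
    elements = []
    for suffix, values in (("groups", groups), ("hosts", hosts)):
        if not values:
            # Require an explicit choice; pass ["*"] to allow everything.
            raise ValueError(f"no {suffix} given for {service_user}")
        prop = ET.Element("property")
        ET.SubElement(prop, "name").text = f"hadoop.proxyuser.{service_user}.{suffix}"
        # The property value is a comma-separated list, or "*" for all.
        ET.SubElement(prop, "value").text = ",".join(values)
        elements.append(prop)
    return elements


def render(elements):
    """Serialize the elements as an indented XML fragment."""
    chunks = []
    for element in elements:
        ET.indent(element, space="    ")
        chunks.append(ET.tostring(element, encoding="unicode"))
    return "\n".join(chunks)


if __name__ == "__main__":
    props = proxyuser_properties(
        "ldcuser",
        hosts=["ldc-app.example.com"],    # hypothetical Data Catalog server node
        groups=["analysts", "stewards"],  # hypothetical user groups
    )
    print(render(props))
```

Paste the printed properties into the same custom core-site.xml section you would use for the asterisk form, then restart components as described above.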

Edge node directories

You may be required to have root access to configure the directories needed for installing and running Data Catalog.

The Data Catalog service user requires full access to the following locations on the edge node:

Directory	Typical Locations
Software installation location	/opt/ldc/`<app-server/metadata-server/agent>`
Logs	/var/log/ldc
Temporary storage	/tmp

Data Catalog does not have any specific requirement for where it is installed. As a best practice, you should install Data Catalog in the same way other Hadoop cluster edge node applications are installed. Some clusters use /usr/lib. Others use /opt.

Data Catalog can also be installed in other locations, such as a directory under the home directory of the Data Catalog service user (for example, /home/ldcuser/ldc). If necessary, you can nest these artifacts inside the same directory structure.
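Before installing, it can help to confirm that the service user really has full access to the chosen locations. The following Python sketch is one possible preflight check, not a Data Catalog tool. The function name is hypothetical, and the paths listed are the typical locations from the table, so adjust them to your installation. Run it as the service user, because a check run as root reports access that the service user may not have.

```python
import os


def check_full_access(paths):
    """Return {path: [problems]} for the user running this script."""
    problems = {}
    for path in paths:
        if not os.path.isdir(path):
            problems[path] = ["missing directory"]
            continue
        # Full access to a directory means read, write, and execute (traverse).
        missing = [
            label
            for label, mode in (("read", os.R_OK), ("write", os.W_OK), ("execute", os.X_OK))
            if not os.access(path, mode)
        ]
        if missing:
            problems[path] = missing
    return problems


if __name__ == "__main__":
    # Typical locations; replace with the directories used on your edge node.
    locations = ["/opt/ldc", "/var/log/ldc", "/tmp"]
    for path, missing in check_full_access(locations).items():
        print(f"{path}: {', '.join(missing)}")
```

An empty result means that every listed directory exists and is readable, writable, and executable by the current user.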

Hue and Ambari access

As a convenience, if you plan to use Ambari Views or Hue to manage HDFS files and monitor MapReduce jobs, create a corresponding user account for the Data Catalog service user on Hue. Alternatively, to identify jobs run by the service user, use that username to filter the job lists.

Data access

Your service user needs permission to read the data it will profile for the catalog. The form of this permission depends on the data source type.

HDFS access

Your Data Catalog service user needs READ access to files in HDFS to include the files in the catalog. One way to provide the service user with READ access is to include this user in the file system group, such as "hdfs" or "mapr".

If your cluster is secured with Apache Ranger™, Apache Sentry™, or Access Control Lists (ACLs), set the access permissions shown in the following table so that the Data Catalog service user has the applicable file access.

HDFS user and area of access	Read	Write	Execute
Browsing HDFS directories in Data Catalog	X		X
Profiling HDFS files in Data Catalog	X
HDFS staging area for profiling results (.wld_hdfs_metadata directory)	X	X	X
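If you manage access with HDFS ACLs, the permissions in the table translate into the usual rwx notation. The following Python sketch builds the corresponding hdfs dfs -setfacl commands as strings so that an administrator can review them before running them. It is an illustration under stated assumptions: the area keys, the function name, and the example paths are hypothetical, and the commands work only where ACLs are enabled on the cluster.

```python
import shlex

# Permissions from the table above, written in HDFS ACL notation.
ACCESS_BY_AREA = {
    "browse": "r-x",   # browsing HDFS directories
    "profile": "r--",  # profiling HDFS files
    "staging": "rwx",  # staging area for profiling results
}


def setfacl_command(user, area, path):
    """Build an `hdfs dfs -setfacl` command granting `user` the area's permissions."""
    perms = ACCESS_BY_AREA[area]
    # Quote the path so that spaces or shell metacharacters stay literal.
    return f"hdfs dfs -setfacl -m user:{user}:{perms} {shlex.quote(path)}"


if __name__ == "__main__":
    # Hypothetical paths; substitute the directories and files in your cluster.
    print(setfacl_command("ldcuser", "browse", "/data/landing"))
    print(setfacl_command("ldcuser", "profile", "/data/landing/orders.csv"))
```

Generating the commands rather than executing them keeps the change reviewable, which suits the security requirements noted earlier for this account.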

Hive access

Your Data Catalog service user needs SELECT access to Hive tables to include the tables in the catalog. If your cluster security is ensured using Apache Ranger or Apache Sentry, set the following access permissions to ensure that the Data Catalog service user has applicable table access.

Hive area of access	Hive operation
Profile existing tables	SELECT
Browse existing tables	SHOW DATABASE

Hive table and view generation

To generate a Hive table or a view of specified resource types in the catalog, you must have the following access:

CREATE TABLE privileges to at least one database in Hive.
WRITE access to the HDFS directory containing the resource for which the Hive table or view is being generated.

You can use this precondition as a means of controlling which users can generate a Hive table or a view for the resources in Data Catalog.

Lumada Data Catalog supports Hive table or view generation for the following file formats:

Avro
CSV
ORC
