Configure the Data Catalog service user
As a best practice, configure a dedicated service user to own the installation directory and to run Lumada Data Catalog jobs and services. These articles refer to the service user as "ldcuser". If you do not create an "ldcuser" account, designate another user as the dedicated user for running Data Catalog jobs and services.
The Data Catalog service user (ldcuser) must be a valid user in the system used by your enterprise to authenticate cluster users.
Authentication
Authentication for your service user depends on whether you are using Kerberos or secure MapR.
Kerberos
Configure a principal and keytab: ask your Kerberos administrator to configure a principal name for the Data Catalog service user and to create a corresponding keytab file. You need this information to configure the Data Catalog application server and to run Data Catalog jobs. Because the principal must renew its tickets automatically, verify that the principal has the allow_renewable attribute.
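As an illustration of how the principal and keytab are used together, the following minimal sketch builds and runs a keytab-based kinit login. The function names, the principal ldcuser@EXAMPLE.COM, and the keytab path are hypothetical placeholders, not values defined by Data Catalog; substitute the values supplied by your Kerberos administrator.

```python
import subprocess
from typing import List


def kinit_command(principal: str, keytab: str) -> List[str]:
    """Build the kinit argument list for a keytab-based login (no password prompt)."""
    # -k: authenticate with a keytab; -t: path to that keytab file.
    return ["kinit", "-k", "-t", keytab, principal]


def login_from_keytab(principal: str, keytab: str) -> None:
    """Obtain a ticket-granting ticket; raises CalledProcessError if kinit fails."""
    subprocess.run(kinit_command(principal, keytab), check=True)


if __name__ == "__main__":
    # Hypothetical principal and keytab path, shown only to illustrate the command shape.
    cmd = kinit_command("ldcuser@EXAMPLE.COM", "/etc/security/keytabs/ldcuser.keytab")
    print(" ".join(cmd))
```

After a successful login with MIT Kerberos tools, running klist -f as the service user lists the ticket flags, where R marks a renewable ticket.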
Secure MapR
Use the maprlogin utility and a username/password pair to generate a valid MapR ticket. See Installing Lumada Data Catalog on MapR for details.
Configure the HDFS proxy user
Perform the following steps to configure the Data Catalog service user to use HDFS trusted delegation.
Procedure
Add the Data Catalog service user (ldcuser for example) to the HDFS or MapR-FS superuser group.
Enable the secure impersonation properties for the Data Catalog superuser in the core-site.xml file on your Hadoop nodes through the Advanced/Custom Configuration sections for the HDFS service in Cloudera Manager or Ambari.
These properties are the hosts and groups that represent users who use the application. The Data Catalog service user needs to be able to act on behalf of these users. Specify the Data Catalog server node in the proxy host.
You can also specify a comma-separated list of fully qualified host names or group names. Alternatively, you can include all hosts or all groups using an asterisk (*) as the property value. Change ldcuser to the name you are using for the Data Catalog service user, as shown in the following example code:
<property>
  <name>hadoop.proxyuser.ldcuser.groups</name>
  <value>*</value>
  <description>Allow the superuser 'ldcuser' to impersonate any user</description>
</property>
<property>
  <name>hadoop.proxyuser.ldcuser.hosts</name>
  <value>*</value>
  <description>The superuser 'ldcuser' can connect from any host to impersonate a user</description>
</property>
Note: If running on a Kerberized system, use the same username in the same format when configuring Data Catalog for connecting to the HDFS cluster through Kerberos.
Restart components as necessary to apply the changes on the cluster.
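If you prefer to restrict impersonation to specific hosts and groups rather than using an asterisk, the comma-separated form of the two properties can be generated as in the following sketch. The helper name proxyuser_properties, the host ldc-edge01.example.com, and the groups analysts and datastewards are assumptions made for illustration; only the hadoop.proxyuser.<user>.hosts and hadoop.proxyuser.<user>.groups property names come from the procedure above.

```python
from typing import Iterable
from xml.sax.saxutils import escape


def proxyuser_properties(user: str, hosts: Iterable[str], groups: Iterable[str]) -> str:
    """Render the two hadoop.proxyuser.* properties as core-site.xml fragments."""

    def prop(kind: str, values: Iterable[str]) -> str:
        # Multiple hosts or groups are joined into one comma-separated value.
        value = ",".join(values)
        return (
            "<property>\n"
            f"  <name>hadoop.proxyuser.{escape(user)}.{kind}</name>\n"
            f"  <value>{escape(value)}</value>\n"
            "</property>"
        )

    return prop("groups", groups) + "\n" + prop("hosts", hosts)


if __name__ == "__main__":
    # Hypothetical edge node host and user groups.
    print(
        proxyuser_properties(
            "ldcuser",
            hosts=["ldc-edge01.example.com"],
            groups=["analysts", "datastewards"],
        )
    )
```

Passing ["*"] for both arguments produces the same wildcard values as the example code in the procedure.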
Edge node directories
You may be required to have root access to configure the directories needed for installing and running Data Catalog.
The Data Catalog service user requires full access to the following locations on the edge node:
Directory | Typical Locations |
Software installation location | /opt/ldc/<app-server/metadata-server/agent> |
Logs | /var/log/ldc |
Temporary storage | /tmp |
Data Catalog does not have any specific requirement for where it is installed. As a best practice, you should install Data Catalog in the same way other Hadoop cluster edge node applications are installed. Some clusters use /usr/lib. Others use /opt.
Data Catalog can also be installed in other locations, such as the home directory of the Data Catalog service user (for example, /home/ldcuser/ldc). If necessary, you can nest the installation, log, and temporary directories inside the same directory structure.
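One way to confirm the access requirement before installing is to run a small check as the service user. The sketch below is an illustrative assumption, not part of Data Catalog: the function name check_full_access is hypothetical, and the paths are the typical locations from the table above, which may differ on your edge node.

```python
import os
from typing import Dict, Iterable


def check_full_access(paths: Iterable[str]) -> Dict[str, bool]:
    """Map each path to True if it is a directory the current user can read, write, and traverse."""
    mode = os.R_OK | os.W_OK | os.X_OK
    return {p: os.path.isdir(p) and os.access(p, mode) for p in paths}


if __name__ == "__main__":
    # Typical locations; adjust to match where Data Catalog is installed.
    results = check_full_access(["/opt/ldc", "/var/log/ldc", "/tmp"])
    for path, ok in results.items():
        print(f"{path}: {'ok' if ok else 'missing or not fully accessible'}")
```

Because os.access evaluates permissions for the user running the script, run it as the service user (ldcuser in these articles) rather than as root.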
Hue and Ambari access
As a convenience, if you plan to use Ambari Views or Hue to manage HDFS files and monitor MapReduce jobs, create a corresponding user account for the Data Catalog service user on Hue. Alternatively, to identify jobs run by the service user, use that username to filter the job lists.
Data access
Your service user needs permission to read the data it will profile for the catalog. The form of this permission depends on the data source type.
Your Data Catalog service user needs READ access to files in HDFS to include the files in the catalog. One way to provide the service user with READ access is to include this user in the file system group, such as "hdfs" or "mapr".
If your cluster is secured with Apache Ranger™, Apache Sentry™, or Access Control Lists (ACLs), set the access permissions shown in the following table so that the Data Catalog service user has the required file access.
HDFS user and area of access | Read | Write | Execute |
Browsing HDFS directories in Data Catalog | X | X | |
Profiling HDFS files in Data Catalog | X | | |
HDFS staging area for profiling results (.wld_hdfs_metadata directory) | X | X | X |
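Where HDFS ACLs are the mechanism in use, a row of the table above can be translated into an ACL entry for the hdfs dfs -setfacl command. The following sketch assumes ACLs are enabled on the NameNode; the helper names and both HDFS paths are hypothetical, and the commands are only printed, not executed.

```python
from typing import List


def acl_spec(user: str, read: bool = False, write: bool = False, execute: bool = False) -> str:
    """Build an HDFS ACL entry such as 'user:ldcuser:r--' from the table's Read/Write/Execute columns."""
    perms = ("r" if read else "-") + ("w" if write else "-") + ("x" if execute else "-")
    return f"user:{user}:{perms}"


def setfacl_command(path: str, spec: str) -> List[str]:
    """Build the argument list that adds or modifies (-m) one ACL entry on a path."""
    return ["hdfs", "dfs", "-setfacl", "-m", spec, path]


if __name__ == "__main__":
    # Profiling a file needs Read; the staging area needs Read, Write, and Execute.
    # Both paths are hypothetical examples.
    profiling = setfacl_command("/data/sales/orders.csv", acl_spec("ldcuser", read=True))
    staging = setfacl_command(
        "/user/ldcuser/.wld_hdfs_metadata",
        acl_spec("ldcuser", read=True, write=True, execute=True),
    )
    print(" ".join(profiling))
    print(" ".join(staging))
```

With Apache Ranger or Apache Sentry, the same Read, Write, and Execute combinations would instead be granted through policies in those tools.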
Your Data Catalog service user needs SELECT access to Hive tables to include the tables in the catalog. If your cluster security is ensured using Apache Ranger or Apache Sentry, set the following access permissions to ensure that the Data Catalog service user has applicable table access.
Hive area of access | Hive operation |
Profile existing tables | SELECT |
Browse existing tables | SHOW DATABASE |
Hive table and view generation
To generate a Hive table or a view of specified resource types in the catalog, you must have the following access:
- CREATE TABLE privileges on at least one database in Hive.
- WRITE access to the HDFS directory containing the resource for which the Hive table or view is being generated.
You can use this precondition as a means of controlling which users can generate a Hive table or view for resources in Data Catalog.
Lumada Data Catalog supports Hive table or view generation for the following file formats:
- Avro
- CSV
- ORC