Manage data sources

Lumada Data Catalog can process data from file systems, Hive, and the following relational databases:

  • Amazon Aurora
  • MSSQL
  • MySQL
  • Oracle
  • PostgreSQL
  • Redshift
  • SAP-HANA
  • Snowflake
  • SYBASE
  • Teradata

To process data from these systems, Data Catalog stores the connection information to these data sources, including their access URLs and credentials for the service user, in a data source definition. For more information on the Data Catalog service user, see the Installation Guide.

In addition to traditional data sources, you can also create registered data sources to represent data assets that may or may not be fingerprinted by Data Catalog but are needed in Data Catalog for lineage derivation or discussions, such as a COBOL copybook. Registered data sources are non-processed data sources that do not offer any discovered data information or provide any visible data samples. They are also referred to as offline sources and are indicated with the registered data source icon.

Registered data sources have a flat structure with no child virtual folders; all resources are listed at the same hierarchical level. To make the resources of a registered data source browsable, add individual data resources to it using a POST request to the /v2/dataresource/generic endpoint. Refer to the POST /v2/dataresource/generic reference for more information.
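
The exact request schema is documented in the endpoint reference. As an illustration only, a minimal sketch in Python (using the requests package) might look like the following; the host, token, and payload field names are assumptions, not values from this guide.

```python
# Hypothetical sketch: add a data resource to a registered data source.
# The host, credentials, and payload field names are illustrative
# placeholders; consult the POST /v2/dataresource/generic reference
# for the actual schema.
import requests

resp = requests.post(
    "https://catalog.example.com/v2/dataresource/generic",  # placeholder host
    headers={"Authorization": "Bearer <token>"},            # placeholder auth
    json={
        "dataSourceName": "cobol-copybooks",  # hypothetical field name
        "resourceName": "claims.cpy",         # hypothetical field name
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```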

If your role has the Manage Data Source privilege, you can create data source definitions from the Data Catalog home page by clicking Manage, then clicking Data Sources and Add Data Source. On the Manage Data Sources page, you can create connections to the data sources you want to include in your catalog and test a data source connection. You can also complete tasks such as updating the settings for existing data sources, creating and updating virtual folders for data sources, and deleting a data source.

Note: Every time you add a data source, Data Catalog automatically creates its corresponding root virtual folder in the repository.

AWS S3 data sources

To connect to an Amazon Web Services (AWS) Simple Storage Service (S3) bucket, the data source URL must contain the Elastic MapReduce (EMR) file system name of the S3 bucket, for example, s3://acme-impressions-data/. Access requirements differ depending on whether you are running Lumada Data Catalog on an EMR instance or on another type of instance.

The following list describes access requirements for S3 access from supported file systems:

  • For S3 access from EMR file systems

    • The EMR role must include the s3:GetObject and s3:ListBucket actions for the bucket. By default, the EMR_DefaultRole includes "s3:Get*" and "s3:List*" for all buckets.
    • The bucket must allow the EMR role principal to perform at least the s3:GetObject and s3:ListBucket actions, as sketched after this list.
  • For S3 access from non-EMR file systems

    • Data Catalog uses the AWS command line interface to access S3 data. These commands send requests using access keys, which consist of an access key ID and a secret access key.
    • You must specify the logical name for the cluster root. This value is defined by dfs.nameservices in the hdfs-site.xml configuration file.
  • For S3 access from AWS S3 and MapR file systems

    You must identify the root of the MapR file system with "maprfs:///". For more details, refer to Installing Lumada Data Catalog on MapR.

  • For S3 access from a Kerberized cluster

    You must specify the connection URL, along with the keytab and principal created for the Data Catalog service user.

    • The Kerberos user name in the Data Catalog configuration, the cluster proxy settings, and the KDC principal are all case-sensitive.
    • Kerberos principal names are case-sensitive, but operating system names can be case-insensitive. A mismatch can cause problems that are difficult to diagnose.
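
To make the EMR bucket-policy requirement concrete, the following is a minimal sketch (not an official setup script) that grants an EMR role the two required actions using boto3. The role ARN, account ID, and bucket name are placeholders; substitute your own values.

```python
# Sketch: allow an EMR role to perform s3:GetObject and s3:ListBucket
# on a bucket. The ARN, account ID, and bucket name are placeholders.
import json
import boto3

bucket = "acme-impressions-data"  # example bucket name from this section
emr_role_arn = "arn:aws:iam::123456789012:role/EMR_DefaultRole"  # placeholder

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": emr_role_arn},
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            f"arn:aws:s3:::{bucket}",    # s3:ListBucket targets the bucket
            f"arn:aws:s3:::{bucket}/*",  # s3:GetObject targets the objects
        ],
    }],
}

boto3.client("s3").put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```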

Add an HDFS data source

You can add data to Lumada Data Catalog from files in HDFS, AWS S3, and the Azure ADLS and Blob file systems.
Note: Every time you add a data source, Data Catalog automatically creates its corresponding root virtual folder in the repository.

Perform the following steps to add an HDFS data source:

Procedure

  1. Navigate to Manage, then click Data Sources.

    The Manage Data Sources page appears.
  2. Click Add Data Source, then click HDFS.

    If you are adding an AWS S3 data source, see AWS S3 data sources for more information.

    The Create Data Source – HDFS dialog box opens.
  3. Complete the following fields:

    1. In the NAME field, enter the name of your data source.

      This name is used in the Data Catalog interface and command line and should be something your Data Catalog users will recognize.
      Note: Names must start with a letter, and must contain only letters, digits, hyphens, or underscores. White spaces in names are not supported.
    2. (Optional) In the DESCRIPTION field, enter a description of the data source.

    3. In the Agents field, select the Data Catalog Agent that will service this data source.

      This Agent is responsible for triggering and managing profiling jobs in Data Catalog for this data source.
    4. In the URL field, enter the location of the HDFS root.

      If the cluster is configured for high availability (HA), the <name node> portion of the URL can be a logical nameservice name rather than a specific host, for example, hdfs://<name node>:8020 (see the sketch after this procedure).

      Other examples include:

      • s3://<bucket-name>
      • gs://<bucket-name>
      • wasb://<container-name>
      • adl://<data-lake-storage-path>
      • maprfs:///
    5. In the SOURCE PATH field, enter the HDFS directory that this data source includes.

      It can be the root of HDFS, or it can be a specific high-level directory. Enter a directory based on your needs for access control. To indicate the root of the file system, use the slash "/".
    6. Expand the Advanced pane to display the Agent Local URL field. In the Agent Local URL field, specify the URL that the Data Catalog Agent will use to connect to the data source.

      If the Data Catalog Agent is co-located within the same cluster as the Data Catalog Application Server, you may leave this field empty.
    7. Click Test Connection.

      The connection settings are verified. If the settings that you entered are correct, a success message appears.
  4. Click Add.

    The newly created HDFS data source appears in the data source list.
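
If you are unsure of the logical nameservice name to use in an HA URL, it can be read from the cluster's hdfs-site.xml, where the dfs.nameservices property is defined. A minimal sketch, assuming the standard Hadoop configuration file format and a typical /etc/hadoop/conf path that may differ on your cluster:

```python
# Sketch: look up dfs.nameservices in hdfs-site.xml to build an HA URL
# such as hdfs://<nameservice>/. The file path is a common default and
# may differ on your cluster.
import xml.etree.ElementTree as ET

def read_nameservice(path="/etc/hadoop/conf/hdfs-site.xml"):
    root = ET.parse(path).getroot()
    for prop in root.findall("property"):
        if prop.findtext("name") == "dfs.nameservices":
            return prop.findtext("value")
    return None

nameservice = read_nameservice()
if nameservice:
    print(f"HA data source URL: hdfs://{nameservice}/")
```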

Next steps

After the HDFS data source connection is successfully created, use the Test a data source connection task to test the connection.

Add a HIVE data source

If you are running HIVE on the Hadoop platform, you can add a Lumada Data Catalog data source from the HIVE database. However, note that profiling a HIVE database is a little slower than profiling other traditional databases.
Note: Every time you add a data source, Data Catalog automatically creates its corresponding root virtual folder in the repository.

Perform the following steps to add a HIVE data source:

Procedure

  1. Navigate to Manage, then click Data Sources.

    The Manage Data Sources page appears.
  2. Click Add Data Source, then click HIVE.

    The Create Data Source – HIVE dialog box opens.
  3. Complete the following fields:

    1. In the NAME field, enter the name of your data source.

      This name is used in the Data Catalog interface and command line and should be something your Data Catalog users will recognize.
      Note: Names must start with a letter, and must contain only letters, digits, hyphens, or underscores. White spaces in names are not supported.
    2. (Optional) In the DESCRIPTION field, enter a description of the data source.

    3. In the Agents field, select the Data Catalog Agent that will service this data source.

      This Agent is responsible for triggering and managing profiling jobs in Data Catalog for this data source.
    4. In the URL field, enter the Hive URL.

      The Data Catalog service user uses this connection string to access HIVE: jdbc:hive2://<hostname>:10000. This value can be a variable node or a list of nodes for high availability (HA).
      Note: If the connection to HIVE is Kerberized, this URL must contain the Kerberos connection details (see the sketch after this procedure):
      • Hostname and port

        Enter the HiveServer2 host. By default, the port is 10000.

      • Domain

        Enter a domain corresponding to the realm configured for the cluster. Use a domain_realm entry from the krb5.conf file on the HIVE host. This value is used to construct the principal parameter of the URL. Also specify the keytab and principal created for the Data Catalog service user.

    5. In the SOURCE PATH field, enter the HIVE directory that this data source includes.

      It can be the HIVE root (indicated by a slash "/") or a specific database (indicated by a slash followed by the database name, for example, "/default").
    6. Expand the Advanced pane to display the Agent Local URL field. In the Agent Local URL field, specify the URL that is used by the Data Catalog Agent to connect to the data source.

      If the Data Catalog Agent is co-located within the same cluster as the Data Catalog Application Server, you may leave this field empty.
    7. Click Test Connection.

      The database connection settings are verified. If the settings that you entered are correct, a success message appears.
  4. Click Add.

    The newly created HIVE data source appears in the data source list.
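
For a Kerberized connection, the URL built from the host, port, and realm details in step 4 typically includes a principal parameter. The following is a minimal sketch of assembling such a URL; the host, realm, and service principal are placeholders, and your cluster's values may differ.

```python
# Sketch: assemble a Kerberized HiveServer2 JDBC URL. All values below
# are placeholders; take the real ones from your cluster's krb5.conf
# and HiveServer2 configuration.
host = "hiveserver2.example.com"    # HiveServer2 host
port = 10000                        # default HiveServer2 port
realm = "EXAMPLE.COM"               # realm from a krb5.conf domain_realm entry
principal = f"hive/{host}@{realm}"  # HiveServer2 service principal

url = f"jdbc:hive2://{host}:{port}/default;principal={principal}"
print(url)
```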

Next steps

After the HIVE data source connection has been successfully created, use the Test a data source connection task to test the connection.

Add a JDBC data source

Lumada Data Catalog can connect to the following relational databases using JDBC connectors:
  • Amazon Aurora
  • MSSQL
  • MySQL
  • Oracle
  • PostgreSQL
  • Redshift
  • SAP-HANA
  • Snowflake
  • SYBASE
  • Teradata
Note: Every time you add a data source, Data Catalog automatically creates its corresponding root virtual folder in the repository.

Perform the following steps to add a JDBC data source:

Procedure

  1. Navigate to Manage, then click Data Sources.

    The Manage Data Sources page appears.
  2. Click Add Data Source, then select JDBC.

    The Create Data Source – JDBC dialog box opens.
  3. Complete the following fields:

    1. In the NAME field, enter the name of your data source.

      This name is used in the Data Catalog interface and command line and should be something your Data Catalog users will recognize.
      Note: Names must start with a letter, and must contain only letters, digits, hyphens, or underscores. White spaces in names are not supported.
    2. (Optional) In the DESCRIPTION field, enter a description of the data source.

    3. In the Agents field, select the agent that will service this data source.

      This agent is responsible for triggering and managing profiling jobs in Data Catalog for this data source.
    4. In the DATABASE TYPE field, select the database type.

      If the relational database type is not listed in the drop-down list, select the Other option and enter the details. For example, you can select Other to add a database type such as SAP-HANA or Sybase.

      Attention: When you use the Other option to enter the database type, you must provide the DRIVER CLASS and import the corresponding JDBC JARs, which may require restarting the Data Catalog web application server.
    5. In the DRIVER CLASS field, enter the driver class for the database type.

      To connect Data Catalog to a database, you need the driver class for that database. Data Catalog auto-fills the DRIVER CLASS field based on the database type selected from the drop-down list.
    6. In the URL field, enter the connection URL of the database.

      For example, a MySQL URL looks like jdbc:mysql://localhost:<port_no>/<database_name>.

    7. In the SOURCE PATH field, enter the path that this data source includes. It can be the root of the JDBC source or a specific high-level directory. To include all databases, use "/".

    8. In the Username field, enter the name of the default user in the database.

      Note: Make sure the specified user can access the data in the JDBC database. Data Catalog can only process the required data if the user has access to the data within the JDBC data source.
    9. In the Password field, enter the password for the default user in the database.

    10. Expand the Advanced pane to display the Agent Local URL field. In the Agent Local URL field, specify the URL the agent uses to connect to the data source.

      If the agent is co-located within the same cluster as the application server, you may leave this field empty.
    11. Click Test Connection.

      The database connection settings are verified. If the settings that you entered are correct, a success message appears.
      Note: If you encounter errors such as "ClassNotFoundException" or "NoClassDefFoundError", your JDBC driver is not available on the class path. See Installing JDBC drivers for details. A connectivity check sketch follows this procedure.
  4. Click Add.

    The newly created JDBC data source appears in the data source list.
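
When the Test Connection step fails with driver errors, it can help to verify the driver class, URL, and credentials outside Data Catalog. A minimal sketch, assuming the third-party jaydebeapi package, a MySQL database, and a locally available driver JAR; all values are placeholders.

```python
# Sketch: verify a JDBC driver class, URL, and credentials independently
# of Data Catalog. The driver class, URL, credentials, and JAR path are
# placeholders; substitute your own values.
import jaydebeapi

conn = jaydebeapi.connect(
    "com.mysql.cj.jdbc.Driver",            # DRIVER CLASS field value
    "jdbc:mysql://localhost:3306/sales",   # URL field value
    ["catalog_svc", "secret"],             # Username / Password field values
    "/opt/drivers/mysql-connector-j.jar",  # JAR providing the driver class
)
curs = conn.cursor()
try:
    curs.execute("SELECT 1")  # trivial query to prove connectivity
    print("Connection OK:", curs.fetchall())
finally:
    curs.close()
    conn.close()
```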

Next steps

Use the Test a data source connection task to test the connection.

After a JDBC data source connection has been successfully created by a Data Catalog service user, any other user must provide their own security credentials to connect to and access this JDBC database.

Add a registered data source

Registered data sources have limited functionality in Lumada Data Catalog. You can search a registered data source, and it can be the subject of a discussion. However, to browse a registered data source, you must first add data resources to it using a POST request to the /v2/dataresource/generic endpoint. Refer to the POST /v2/dataresource/generic reference for more information.

Perform the following steps to add a registered data source:

Procedure

  1. Navigate to Manage, then click Data Sources.

    The Manage Data Sources page appears.
  2. Click Add Data Source, then select Registered.

    The Create Data Source - Registered dialog box opens.
  3. In the NAME field, enter the name of your data source.

    This name is used in the Data Catalog interface and command line and should be something your Data Catalog users will recognize.
    Note: Names must start with a letter, and must contain only letters, digits, hyphens, or underscores. White spaces in names are not supported.
  4. (Optional) In the DESCRIPTION field, enter a description of the data source.

  5. Click Add.

    The newly created registered data source appears in the data source list.

Test a data source connection

Before you begin

Before performing this task, make sure that you have restarted your cluster. Otherwise, you may see test connection failures.

After you have successfully created or updated a data source connection, you can test the connection. You can also skip the connection test when adding or updating the data source and test it later.

Perform the following steps to test the connection:

Procedure

  1. Click the user icon on the top right of the Data Catalog home page and then click Profile settings.

    The User Profile page displays.
  2. On the Account tab, click Connect next to the name of the data source you want to test.

    The Data Source User Credentials dialog box opens.
  3. Enter your credentials and click Test connection.

    If the test succeeds, you see two test connection statuses: one for the Agent connection and the other for the App-Server connection.

Edit a data source

You can edit a data source. For example, you may want to increase the scope of the data by modifying the SOURCE PATH field. If your data source path was Data/Landing, you can modify the path to /Data to increase the data scope. Note that you cannot narrow the scope by adding a subdirectory to the data source path, such as Data/Landing/2017. The source path scope can only be expanded.

Perform the following steps to edit a data source:

Procedure

  1. Navigate to Manage and click Data Sources.

  2. Locate the data source that you want to edit and then click the name of the data source to open the Settings tab for the data source.

  3. Edit the fields, then click Save.

Results

If any connection details have changed, Data Catalog automatically prompts you to test the connection to validate data source connectivity before your changes can be saved. See Test a data source connection to test the data source connection.

Note: While most properties of a connected data source can be edited, the following exceptions apply:
  • The source path scope can only be expanded. Data Catalog does not permit narrowing the scope for an already connected data source.
  • Two data sources cannot have overlapping source paths. Data Catalog performs a check for overlapping paths at the time of data source creation and does not proceed if overlapping paths are found.

Delete a data source

You can delete a data source from the Manage Data Sources page or from the data source Settings tab.

Perform the following steps to delete a data source:

Procedure

  1. Click Manage and then click Data Sources.

    The Manage Data Sources page appears.
  2. Do one of the following:

    • Locate the data source you want to delete and click the More actions icon at the end of the row. On the drop-down menu, click Delete.
    • Click the data source you want to delete. On the data source Settings tab, click the More actions icon, and then click Delete.
    The Delete Data Source dialog box lists the detected dependencies of the data source.
  3. Review the dependencies and click either Delete or Cancel.

Results

After the data source is deleted, a confirmation message appears.
Important: A data source in Data Catalog holds the connection information for an external database or HDFS system, while virtual folders are just a logical mapping of the connection. Deleting a data source in Data Catalog deletes all of its dependencies, including, but not limited to, the corresponding:
  • Virtual folder representations and children.
  • Asset associations in job templates.
  • Tag associations.

Allow time between deleting a data source and the actual deletion of all dependencies. This time depends on the data source size and the number of dependencies.

Caution: Plan carefully and allow time when reusing the names of deleted data sources. If you reuse the name of a recently deleted data source for a new data source, an error may occur, especially if the deleted data source was large. Data source deletion continues as a background job, and updating Data Catalog documents takes time. If you encounter this situation, try again later.

Connecting to a data source outside the cluster

If the Lumada Data Catalog Application Server is installed outside the cluster and the Data Catalog Agent is located on the cluster, you need to modify the data source settings to add a data source URL that is relative to the agent, known as the agent-relative data source URL.

To add the URL, expand the Advanced pane in the Create Data Source dialog box and enter an agent-relative data source URL in the Agent Local URL field. You can also enter the agent-relative data source URL in the Advanced pane of the Settings tab when managing an existing data source.

For efficient network access, the agent might use a local network hostname or IP address for the data source, while the application server connection might be routed through a different remote network. This configuration can also be used when the application server cannot connect to the data source directly but the agent can. In such cases, you can select the Skip Connection Test check box to skip validating network connectivity.