Manage data sources

With Lumada Data Catalog, you can process data from file systems and relational databases. Data sources can contain structured or unstructured data. Supported unstructured data sources are PDF documents and Word documents in .doc and .docx format. You can also use JSON documents in a NoSQL MongoDB database as a data source. The following data sources are supported:

  • Azure Data Lake Storage Gen 1, Gen 2
  • AWS S3
  • DB2 11.5.7
  • Denodo 8
  • HCP
  • HDFS
  • HIVE 3.1.2
  • Minio (S3)
  • MongoDB 5.0
  • MSSQL 2019
  • MySQL 8
  • Oracle 11g, 12, 19c
  • PostgreSQL 12.4

Additionally, you can process data from the following JDBC sources using the Other data source type:

  • Snowflake 3.13
  • Vertica 10.1ce, 11.1ce

To process data from these systems, Data Catalog establishes a data source definition. This definition stores the connection information for your sources of data, including their access URLs and the credentials for the service user.

To exclude selected MongoDB databases from scan or schema jobs, list the databases to ignore in the MongoDB databases to be restricted configuration setting.
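
To decide which database names to list in that setting, you can enumerate the databases visible to the service user with the MongoDB Java driver. The following is a minimal sketch, assuming the mongodb-driver-sync library; the connection string is a placeholder:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;

    public class ListMongoDatabases {
        public static void main(String[] args) {
            // Placeholder connection string; substitute your own host and credentials.
            try (MongoClient client = MongoClients.create(
                    "mongodb://user:password@mongo-host.example.com:27017")) {
                // Print every database name visible to this user; any of these
                // names can be added to the restricted-databases setting.
                for (String name : client.listDatabaseNames()) {
                    System.out.println(name);
                }
            }
        }
    }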

You can connect to an Apache Atlas data source. See Apache Atlas integration.

Note: For the latest supported versions, refer to the release notes.

Add a data source

If your role has the Manage Data Sources privilege, perform the following steps to create data source definitions.

Specify data source identifiers

Perform the following steps to identify your data source within Data Catalog:

Procedure

  1. Click Go to Management on the Welcome page, or click Management in the left toolbar of the navigation pane.

    The Manage Your Environment page opens.
  2. Click Data Source, then Add Data Source; or click Add New, then Add Data Source.

    The Create Data Source page opens.
  3. Specify the following basic information for the connection to your data source:

    Data Source Name: Specify the name of your data source. This name is used in the Data Catalog interface, so it should be something your Data Catalog users recognize.
    Note: Names must start with a letter and contain only letters, digits, and underscores. White space in names is not supported.
    Data Source ID (Optional): Specify a permanent identifier for your data source. If you leave this field blank, Data Catalog generates a permanent identifier for you.
    Note: You cannot modify the Data Source ID for this data source after you specify or generate it.
    Description: Specify a description of your data source.
    Agent: Select the Data Catalog agent that will service your data source. This agent is responsible for triggering and managing profiling jobs in Data Catalog for this data source.
    Data Source Type: Select the database type of your source. You are then prompted to specify additional connection information based on the file system or database type you are trying to access.
  4. Specify additional connection information based on the file system or database type you are trying to access.

    See the following sections for details:

ADLS data source

You can connect to an instance of Microsoft’s Azure Data Lake Storage (ADLS) system using a shared key, OAuth 2.0, or another configuration method. Regardless of the method you choose, specify the following base fields:

Source Path: Directory where this data source is included. It can be the root of the file system or a specific high-level directory. To include all databases, use "/".
Note: Make sure the specified user can access the data in the ADLS file system. Data Catalog can only process the required data if the user has access to the data within the data source.
File System: The parent location that holds the files and folders.
Account Name: The name given to your storage account during creation.

If you are using the OAuth 2.0 configuration method, you must also specify the client credentials, such as ClientID, Client Secret, and Client Endpoint.
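
If you want to verify OAuth 2.0 client credentials before entering them, the following is a minimal sketch using the Azure SDK for Java (azure-identity and azure-storage-file-datalake). The account, tenant, and client values are placeholders, and the tenant ID is an assumption here; it is typically embedded in your Client Endpoint value:

    import com.azure.identity.ClientSecretCredential;
    import com.azure.identity.ClientSecretCredentialBuilder;
    import com.azure.storage.file.datalake.DataLakeServiceClient;
    import com.azure.storage.file.datalake.DataLakeServiceClientBuilder;

    public class AdlsOAuthCheck {
        public static void main(String[] args) {
            // Client ID and Client Secret correspond to the OAuth 2.0 fields
            // on the Create Data Source page; all values are placeholders.
            ClientSecretCredential credential = new ClientSecretCredentialBuilder()
                    .clientId("my-client-id")
                    .clientSecret("my-client-secret")
                    .tenantId("my-tenant-id")
                    .build();

            DataLakeServiceClient client = new DataLakeServiceClientBuilder()
                    .endpoint("https://myaccountname.dfs.core.windows.net")
                    .credential(credential)
                    .buildClient();

            // Listing file systems confirms that the credentials can authenticate.
            client.listFileSystems().forEach(fs -> System.out.println(fs.getName()));
        }
    }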

AWS S3 data source

You can connect to an Amazon Web Services (AWS) Simple Storage Service (S3) bucket with your data source URL containing the Elastic MapReduce (EMR) file system name of the S3 bucket, for example, s3://acme-impressions-data/. Access requirements differ depending on whether you are running Lumada Data Catalog on an EMR instance or on another instance type.

Specify the following additional fields for AWS access:

Source Path: Directory where this data source is included.
Endpoint: Location of the bucket. For example, s3.<region containing S3 bucket>.amazonaws.com.
Access Key: User credential to access data in the bucket.
Secret Key: Password credential to access data in the bucket.
Bucket Name: The name of the S3 bucket in which the data resides.
Note: For S3 access from non-EMR file systems, Data Catalog uses the AWS command line interface to access S3 data. These commands send requests using access keys, which consist of an access key ID and a secret access key. You must specify the logical name for the cluster root; this value is defined by dfs.nameservices in the hdfs-site.xml configuration file. For MapR file systems, you must identify the root of the MapR file system with maprfs:///.
URI Scheme: Version of S3 used for the bucket. You can select either S3 or S3A.
Assume Role: For S3 access from EMR file systems, the EMR role must include the s3:GetObject and s3:ListBucket actions for the bucket. By default, the EMR_DefaultRole includes s3:Get* and s3:List* for all buckets. The bucket must allow the EMR role principal to perform at least the s3:GetObject and s3:ListBucket actions.
Additional Properties: Any additional properties needed to connect, using the syntax property = value. For S3 access with Kerberos, you must specify the connection URL and the keytab and principal created for the Data Catalog service user. The Kerberos user name in the Data Catalog configuration, the cluster proxy settings, and the KDC principal are all case-sensitive; Kerberos principal names are case-sensitive, but operating system names can be case-insensitive.
Note: A case mismatch can cause problems that are difficult to troubleshoot.
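
To confirm that an access key and secret key grant s3:ListBucket access before creating the data source, the following is a minimal sketch using the AWS SDK for Java v2; the region, keys, and bucket name are placeholders:

    import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
    import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
    import software.amazon.awssdk.regions.Region;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;

    public class S3AccessCheck {
        public static void main(String[] args) {
            // Access Key and Secret Key fields from the data source definition.
            AwsBasicCredentials creds = AwsBasicCredentials.create("my-access-key", "my-secret-key");

            try (S3Client s3 = S3Client.builder()
                    .region(Region.US_EAST_1) // Region of the bucket's endpoint.
                    .credentialsProvider(StaticCredentialsProvider.create(creds))
                    .build()) {
                // A successful ListObjectsV2 call confirms s3:ListBucket access.
                ListObjectsV2Request request = ListObjectsV2Request.builder()
                        .bucket("acme-impressions-data") // Bucket Name field.
                        .maxKeys(5)
                        .build();
                s3.listObjectsV2(request).contents()
                        .forEach(obj -> System.out.println(obj.key()));
            }
        }
    }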

HCP data source

You can add data to Data Catalog from Hitachi Content Platform (HCP) by specifying the following additional fields:

Source Path: Directory where this data source is included.
Endpoint: Location of the bucket (host name or IP address).
Access Key: The access key of the S3 credentials used to access the bucket.
Secret Key: The secret key of the S3 credentials used to access the bucket.
Bucket Name: The name of the bucket in which the data resides.
URI Scheme: The version of S3 used for the bucket.
Additional Properties: Any additional properties needed to connect.
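
Because HCP exposes an S3-compatible API, the same kind of check works with the endpoint overridden. A minimal sketch, again assuming the AWS SDK for Java v2; the endpoint, keys, and bucket name are placeholders:

    import java.net.URI;
    import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
    import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
    import software.amazon.awssdk.regions.Region;
    import software.amazon.awssdk.services.s3.S3Client;

    public class HcpAccessCheck {
        public static void main(String[] args) {
            try (S3Client hcp = S3Client.builder()
                    // Endpoint field: the HCP host name or IP address.
                    .endpointOverride(URI.create("https://hcp-tenant.example.com"))
                    .region(Region.US_EAST_1) // Required by the SDK builder.
                    .credentialsProvider(StaticCredentialsProvider.create(
                            AwsBasicCredentials.create("my-access-key", "my-secret-key")))
                    .build()) {
                // Listing the bucket confirms the credentials and endpoint are valid.
                hcp.listObjectsV2(b -> b.bucket("my-bucket").maxKeys(5))
                        .contents()
                        .forEach(obj -> System.out.println(obj.key()));
            }
        }
    }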

HDFS data source

You can add data to Data Catalog from files in HDFS file systems by specifying the following additional fields:

Configuration Method: How to configure the connection. For example, to configure the connection using a URL, select URI.
Source Path: An HDFS directory that this data source includes. It can be the root of HDFS, or it can be a specific high-level directory. Enter a directory based on your needs for access control. To indicate the root of the file system, use the slash "/".
URL: Location of the HDFS root, for example, hdfs://<name node>:8020. If the cluster is configured for high availability (HA), this URL may be a variable name without a specific port number; the <name node> address can be a variable name for high availability. Other examples include the following (a connectivity check sketch follows the list):
  • s3://<bucket-name>
  • gs://<bucket-name>
  • wasb://<container-name>
  • adl://<data-lake-storage-path>
  • maprfs:///
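
To confirm that an HDFS URL is reachable and that the service user can read the source path, the following is a minimal sketch using the Hadoop FileSystem API (hadoop-client); the name node address is a placeholder:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsAccessCheck {
        public static void main(String[] args) throws Exception {
            // URL field from the data source definition (placeholder host).
            URI hdfsRoot = URI.create("hdfs://namenode.example.com:8020");

            try (FileSystem fs = FileSystem.get(hdfsRoot, new Configuration())) {
                // Listing the source path verifies connectivity and permissions.
                for (FileStatus status : fs.listStatus(new Path("/"))) {
                    System.out.println(status.getPath());
                }
            }
        }
    }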

HIVE data source

You can add data to Data Catalog from a HIVE database by specifying the following additional fields:

Configuration Method: How to configure the connection. For example, to configure the connection using a URL, select URI.
Source Path: The HIVE database that this data source includes. It can be the HIVE root, or it can be a specific database. Enter a database based on your needs for access control. To indicate the HIVE root, use the slash "/". To indicate a specific database, use a slash "/" followed by the database name, for example, /default, where default is the name of the HIVE database.
URL: Location of the HIVE root. For example, jdbc:hive2://localhost:10000.
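
You can test such a URL outside Data Catalog with a plain JDBC connection. A minimal sketch, assuming the hive-jdbc driver is on the classpath and using placeholder credentials:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveConnectionCheck {
        public static void main(String[] args) throws Exception {
            // URL field from the data source definition.
            String url = "jdbc:hive2://localhost:10000/default";

            try (Connection conn = DriverManager.getConnection(url, "hive-user", "hive-password");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SHOW DATABASES")) {
                // Each row is a database the connected user can see.
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }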

JDBC data source

You can add a Data Catalog data source connection to the following relational databases using JDBC connectors:

  • MSSQL
  • MySQL
  • Oracle
  • PostgreSQL

Other JDBC sources include:

  • Denodo
  • Snowflake
  • Vertica

Specify the following additional fields:

Configuration Method: How to configure the connection. For example, to configure the connection using a URL, select URI.
Source Path: Directory where this data source is included. It can be the root of JDBC or it can be a specific high-level directory. To include all databases, use the slash "/".
Note: Make sure the specified user can access the data in the JDBC database. Data Catalog can only process the required data if the user has access to the data within the JDBC data source.
URL: Connection URL of the database. For example, a MySQL URL would look like jdbc:mysql://localhost:<port_no>/.
Driver Name: Driver class for the database type. Data Catalog auto-fills this field based on the database type selected from the drop-down list.
Note: When you select Other JDBC as the database type, you must provide the driver class and import the corresponding JDBC JARs. Importing the JARs restarts the agent used to run the data source's profiling jobs.
Username: Name of the default user in the database.
Password: Password for the default user in the database.
Database Name: Name of the related database.

After a JDBC data source connection has been successfully created by a Data Catalog service user, any other user must provide their security credentials to connect to and access this JDBC database.

Note: If you encounter errors such as ClassNotFoundException or NoClassDefFoundError, your JDBC driver is not available on the classpath.
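
The following is a minimal sketch that checks both failure modes: whether the driver class is on the classpath and whether the URL and credentials are valid. The MySQL driver class, URL, and credentials are placeholders; substitute the values for your database type:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;

    public class JdbcConnectionCheck {
        public static void main(String[] args) {
            String driverClass = "com.mysql.cj.jdbc.Driver"; // Driver Name field.
            String url = "jdbc:mysql://localhost:3306/";      // URL field.

            try {
                // Fails with ClassNotFoundException if the JDBC JAR is missing.
                Class.forName(driverClass);
                try (Connection conn = DriverManager.getConnection(url, "dbuser", "dbpassword")) {
                    System.out.println("Connected: "
                            + conn.getMetaData().getDatabaseProductVersion());
                }
            } catch (ClassNotFoundException e) {
                System.err.println("Driver not on classpath: " + driverClass);
            } catch (SQLException e) {
                System.err.println("Connection failed: " + e.getMessage());
            }
        }
    }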

Test and add your data source

After you have specified the detailed information according to your data source type, perform the following steps to test and add your data source to Data Catalog:

Procedure

  1. Click Test Connection to test your connection to the specified data source.

    If you are testing a MySQL connector and you get the following error, you need a more recent MySQL connector library:
    java.sql.SQLException: Client does not support authentication protocol requested by server. plugin type was = 'caching_sha2_password'

    1. Go to the MySQL :: Download Connector/J page and select the "Platform Independent" option.
    2. Download the compressed (.zip) file, copy it to /opt/ldc/agent/ext (where /opt/ldc/agent is your agent installation directory), and unpack it.
  2. (Optional) Enter a Note for any information you need to share with others who might access this data source.

  3. Click Create Data Source to establish your data source connection.

Next steps

You can also update the settings for existing data sources, create virtual folders, update existing virtual folders for data sources, and delete data sources.
Note: Every time you add a data source, Data Catalog automatically creates its corresponding root virtual folder in the repository.

Add an external data source

Through an external data source, you can integrate Apache® Atlas with Data Catalog. For a given resource in Data Catalog, you can use this integration to perform the following actions:
  • Push business terms to Atlas.
  • Pull lineage information from Atlas.

If your role has the Manage Data Sources privilege, perform the following steps to create an external data source for Apache Atlas:

Procedure

  1. Click Management in the left toolbar of the navigation pane.

    The Manage Your Environment page opens.
  2. Click Add New on the Data Sources card, then Add External Data Source.

    The Create External Data Source page opens.
  3. Specify the following information for the connection to your external data source:

    External Data Source Name: Specify the name of your data source. This name is used in Data Catalog, so it should be something your Data Catalog users recognize.
    Note: Names must start with a letter and contain only letters, digits, and underscores. White space in names is not supported.
    Description: Specify a description of your data source.
    External Data Source Type: Select Atlas to establish a connection between your data source and the Atlas service.
    URL: Connection URL for the Atlas service. This URL should include the host name and port for the Atlas service.
    Atlas Username: Name of the Atlas user with the applicable permissions to perform the import and export operations.
    Atlas Password: Password for the Atlas user.
    Atlas Cluster Name: Name of the cluster containing Atlas.
  4. Click Test Connection to test your connection to the specified data source.

  5. (Optional) Enter a Note for any information you need to share with others who might access this data source.

  6. Click Create Data Source to establish your data source connection.

Results

The data source is created and the count of external data sources is incremented on the Data Sources card.
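
If Test Connection fails, you can check the Atlas URL and credentials outside Data Catalog by calling Atlas's V2 REST API with basic authentication. A minimal sketch using the JDK HTTP client; the host, port, and credentials are placeholders:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.Base64;

    public class AtlasConnectionCheck {
        public static void main(String[] args) throws Exception {
            String atlasUrl = "http://atlas-host.example.com:21000"; // URL field.
            // Atlas Username and Atlas Password fields, basic-auth encoded.
            String auth = Base64.getEncoder()
                    .encodeToString("atlas-user:atlas-password".getBytes());

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(atlasUrl + "/api/atlas/v2/types/typedefs"))
                    .header("Authorization", "Basic " + auth)
                    .GET()
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            // Status 200 indicates the URL, username, and password are valid.
            System.out.println("HTTP status: " + response.statusCode());
        }
    }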

Edit a data source

You can edit a data source as needed.

Two data sources can have overlapping source paths, even using the same Source Path and URL, but they must have different names. For example, if a data source named ds1 has the path "/" and the URL hdfs://aa:2000, you can create another data source named ds2 with the same path and URL.

Note: Some details about an existing data source, such as the source path and data source type, cannot be changed and are unavailable for editing.

Perform the following steps to edit a data source:

Procedure

  1. Navigate to Management and click Data Sources.

  2. Locate the data source that you want to edit and then click the View Details (>) icon at the right end of the row for the data source.

    The Data source page opens.
  3. Edit the fields, then click Test Connection to verify your connection to the specified data source.

  4. Click Save Data Source.

Remove a data source

You remove a data source by removing its related root virtual folder. A data source in Data Catalog holds the connection information for an external database or HDFS system, while a virtual folder is the logical mapping of the connection. Removing the root virtual folder of a data source deletes all of its dependencies, including, but not limited to, any related virtual folder representations and their children, asset associations in job templates, and term associations.

Perform the following steps to remove the root virtual folder of a data source:

Procedure

  1. Navigate to Management, then click Data Sources.

  2. Locate the data source that you want to remove, then click the View Details (>) icon at the right end of the row for the data source.

    The Data source page opens.
  3. Click Remove Data Source.

    The Delete dialog box opens, listing the detected dependencies of the data source.
  4. Review the dependencies, enter the name of the data source to be removed, and click Confirm.

Results

A confirmation message appears after the data source is removed.
Caution: Allow time between removing a data source and the actual removal of all of its dependencies. This time depends on the size of the data source and the number of dependencies; removal continues as a background job while Data Catalog documents are updated. Plan carefully before reusing the name of a removed data source: if you reuse the name of a recently removed data source for a new one, an error may occur, especially if the removed data source was large. If you encounter this situation, try again later.