Connecting to Virtual File Systems

Last updated
Save as PDF

You can connect to most of your Virtual File Systems (VFS) through VFS connections in PDI. A VFS connection is a stored set of VFS properties that you can use to connect to a specific file system. In PDI, you can add a VFS connection and then reference that connection whenever you want to access files or folders on your Virtual File System. For example, you can use the VFS connection for Hitachi Content Platform (HCP) in any of the HCP transformation steps without the need to repeatedly enter your credentialing information for data access.

With a VFS connection, you can set your VFS properties with a single instance that can be used multiple times. The VFS connection supports the following file system types:

Google Cloud Storage

The Google Cloud Storage file system. See Google Cloud Storage for more information on this protocol.
Snowflake Staging

A staging area used by Snowflake to load files. See Snowflake staging area for more information on this protocol.
Amazon S3 / MinIO
- Simple Storage Service (S3) accesses the resources on Amazon Web Services. See Working with AWS Credentials for Amazon S3 setup instructions.
- MinIO accesses data objects on an Amazon compatible storage server. See the MinIO Quickstart Guide for MinIO setup instructions.
HCP

The Hitachi Content Platform. You must configure HCP and PDI before accessing the platform. See Access to HCP for more information.
Microsoft Azure

The Microsoft Azure Storage services. You must create an Azure account and configure Azure Data Lake Storage Gen2 and Blob Storage. See Access to Microsoft Azure for more information.

After you create a VFS connection, you can use it with PDI steps and entries that support the use of VFS connections. The VFS connection is saved in the repository. If you are not connected to a repository, the connection is saved locally on the machine where it was created.

If a VFS connection in PDI is not available for your Virtual File System, you may be able to access it with the VFS browser. See VFS browser for further details.

Before you begin

You must first perform a few setup tasks if you need to access the Google Cloud or the Hitachi Content Platform (HCP).

Access to Google Cloud

To access Google cloud from PDI, you must have a Google account and service account credentials in the form of a JSON format key file. Additionally, you must set permissions for your Google Cloud accounts. To create service account credentials, see https://cloud.google.com/storage/docs/authentication.

Perform the following steps to set up your system to use Google Cloud storage:

Procedure

Download the service account credentials file that you have created using the Google API Console to your local machine.
Create a new system environmental variable on your operating system named GOOGLE_APPLICATION_CREDENTIALS.
Set the path to the downloaded JSON service account credentials file as the value of the GOOGLE_APPLICATION_CREDENTIALS variable.

Results

You are now ready to access files from the Google Cloud Storage file system in PDI.

Access to HCP

Hitachi Content Platform (HCP) is a distributed storage system that can be used through a VFS connection in the PDI client.

Within HCP, access control lists (ACLs) grant privileges to the user to perform a variety of file operations. Namespaces, owned and managed by tenants, are used for logical groupings, access and permissions, and object metadata such as versioning, retention and shred settings. For more information about HCP, see the Introduction to Hitachi Content Platform.

Perform the following steps to setup access to HCP:

NoteThe following process assumes that you have HCP tenant permissions and that namespaces have been created. For more information, see Tenant Management Console.

NoteTo create a successful VFS connection to HCP, object versioning must be configured in HCP Namespaces.

Procedure

Log on to the HCP Tenant Management Console.
Click Namespaces and then select the Name you want to configure.
In the Protocols tab, click HTTP(S), and verify Enable HTTPS and Enable REST API with Authenticated access only are selected.
In the Settings tab, select ACLs.
Select the Enable ACLs check box and, when prompted, click Enable ACLs to confirm.

Results

This completes the setup of HCP for accessing files in the PDI client.

Access to Data Catalog

Data Catalog has the capability of accessing and registering data stored either on Amazon, MinIO, or HDFS. Multiple connections in PDI are required to access Data Catalog and the resources that it manages:

To access Data Catalog, you will need to know the authentication type used, instance URL, and your account credentials.
To access S3 resources from Data Catalog, see Create a VFS connection.
To access HDFS resources from Data Catalog, see Connecting to a Hadoop cluster with the PDI client.

Access to Microsoft Azure

To access Azure services from PDI, you must first create and configure your Azure Data Lake Gen 1, or Azure Data Lake Storage Gen2 and Blob Storage services. you should also enable the hierarchical namespace to maximize file system performance..

Access to the Azure services requires an Azure account with an active subscription. See Create an account for free.
Access to Azure Storage requires an Azure Storage account. See Create a storage account.

Create a VFS connection

Perform the following steps to create a VFS connection in PDI:

Procedure

Start the PDI client (Spoon) and create a new transformation or job.
In the View tab of the Explorer pane, right-click the VFS Connections folder, then click New.
The New VFS connection dialog box opens.
In the Connection Name field, enter a name that uniquely describes this connection.
The name can contain spaces, but it cannot include special characters, such as #, $, and %.
In the Connection Type field, select from one of the following types:
- Amazon S3 / MinIO( Default):
  - Simple Storage Service (S3) accesses the resources on Amazon Web Services. See Working with AWS Credentials for Amazon S3 setup instructions.
  - MinIO accesses data objects on an Amazon compatible storage server. See the MinIO Quickstart Guide for MinIO setup instructions.
- Azure Data Lake Gen 1
  
  The Microsoft Azure Storage services. You must create an Azure account and configure Azure Data Lake Gen 1 storage. See Access to Microsoft Azure for more information.
- Azure Data Lake Gen 2 / Blob
  
  The Microsoft Azure Storage services. You must create an Azure account and configure Azure Data Lake Storage Gen2 and Blob Storage. See Access to Microsoft Azure for more information.
- Catalog
  
  The Lumada Data Catalog. You must configure your Data Catalog connection before accessing the platform. Enter the authentication type, connection URL and account credentials. To access data resources from Data Catalog, an S3 or HDFS connection is also required. See the Lumada Data Catalog documentation for details.
- Google Cloud Storage:
  
  The Google Cloud Storage file system. See Google Cloud Storage for more information on this protocol.
- HCP:
  
  The Hitachi Content Platform. You must configure HCP and PDI before accessing the platform. You must also configure object versioning in HCP Namespaces. See Access to HCP for more information.
- Snowflake Staging:
  
  A staging area used by Snowflake to load files. See Snowflake staging area for more information on this protocol.
Based on the connection type selected, the details section opens below the Description section.
(Optional) Enter a description for your connection in the Description field.
Click Next.
In the Connection Detailssection, enter the information according to your selected Connection Type.

If you selected Amazon S3 / MinIO, choose one of the following options:
- For Amazon: Select the Default S3 Connection check box to enable use of Amazon S3.
- For MinIO: Select the Default S3 connection check box to enable use of MinIO. Also, select the PathStyle Access check box to enable path-style access. Otherwise, S3 bucket-style access is used.
If you selected Azure Data Lake Gen 1 on the previous page, specify your Azure account name in the Account Fully Qualified Domain Name field, and enter your Application ID, Client Secret, and Oauth 2.0 token endpoint.

If you selected Azure Data Lake Gen 2 / Blob on the previous page, choose one of the following Authentication Type options:
- For Account Shared Key: Enter your shared account key credential in the Account Shared Key field.
- For Azure Active Directory: Enter your Azure Active Directory credentials in the Application (client) ID, Client Secret, and Directory (tenant) ID fields.
- For Shared Access Signature: Enter your SAS token in the Shared Access Signature field.
For all three Authentication Type options, you must also specify your Azure account name in the Account Fully Qualified Domain Namd field, the Block Size, the Buffer Count, the Max Block Upload Size, and your Access Tier.

NoteYou can add a predefined variable file to the fields that have icon. Place the cursor in the required field and enter Ctrl+Space. Select the required file from the list of files available. The variable must be a predefined variable and not a runtime variable. The variables are defined in the Kettle.properties. For more information on variables, see Kettle Variables.
(Optional) Click Test to verify your connection.
Click OK to save the changes for the created connection.

Results

You can now use your connection to specify VFS information in your transformation steps or job entries, such as the Snowflake entries or HCP steps. See PDI and Snowflake and PDI and Hitachi Content Platform (HCP) for more information about these entries and steps.

Edit a VFS connection

Perform the following steps to edit an existing VFS connection:

Procedure

Right-click the VFS Connections folder and select Edit.
When the Edit VFS Connection dialog box opens, select the Pencil icon next to the section you want to edit.

Delete a VFS connection

Perform the following steps to delete a VFS connection:

Procedure

Right-click the VFS Connections folder.
Select Delete, then Yes, Delete.

Results

The deleted VFS connection no longer appears under the VFS Connections folder in the View tab of the PDI Explorer pane.

Access files with a VFS connection

After you have created a VFS connection, you can use the VFS Open and Save dialog box to access files in the PDI client.

Follow these instructions to access files with the VFS Open or Save dialog box in the PDI client.

Procedure

In the PDI client, select File Open URL to open a file or File Save as to save a file.
The VFS Open or Save As dialog box opens, as shown in the following example:

Recently accessed files appear in the right panel.
Navigate to your folders and files through the VFS connection category in the left panel.
When you select a folder or file, the Recents drop-down list updates to show the navigation path to your file location.
(Optional) Click on the navigation path to show and copy the Pentaho file path, even if it is a Virtual File System (VFS) path. See Pentaho address to a VFS connection for details on the Pentaho file path for the Virtual File System.
Select the file and click Open or Save depending on whether you are opening or saving a file.

Results

The Open or Save As dialog box closes when you have access to your folder or file.

NoteIf you are not connected to the Pentaho Repository, you can select a folder or file in the Open or Save As dialog box, click on it again, and rename it.

Pentaho address to a VFS connection

The Pentaho address is the Pentaho virtual file system (pvfs) location within your VFS connection. When you are locating a file under the VFS connection category in the file access dialog box, the directory path in your Virtual File System appears in the address text box. VFS navigation path in the Open dialog box in the PDI client

When you click in the address bar, the Pentaho address to the file appears in the text box. PVFS file path in the Open dialog box in the PDI client

You can copy and paste a Pentaho address into file path options of PDI steps and entries that support VFS connections.

NoteYou must use the Pentaho virtual file system for anything related to Amazon S3. In addition, file paths and permissions in your existing transformations and jobs that use Amazon S3 are supported when Amazon S3 is the Default S3 Connection.

Steps and entries supporting VFS connections

You may have a transformation or a job containing a step or entry that accesses a file on a Virtual File System.

The following steps and entries support VFS connections:

VFS browser

In some transformation steps and job entries, a Virtual File System (VFS) browser is used in place of VFS connections and the Open dialog box. When you use the VFS browser, specify a VFS URL instead of the VFS connection. The files are accessed using HTTP and the URLs contain schema data that identify a protocol to use. Your files can be local or remote, and can reside in compressed formats such as TAR and ZIP. For more information, see the Apache Commons VFS documentation.

Before you begin

If you need to access a Google Drive or the Hitachi Content Platform (HCP), you must perform a few setup tasks.

Access to a Google Drive

Perform the following setup steps to initially access your Google Drive.

Procedure

Follow the "Step 1" procedure in the article "Build your first Drive app (Java)" in the Google Drive APIs documentation.
This procedure turns on the Google Drive API and creates a credentials.json file.
Rename the credentials.json file to client_secret.json. Copy and paste the renamed file into the data-integration/plugins/pentaho-googledrive-vfs/credentials directory.
Restart PDI.
The Google Drive option does not appear when creating a VFS connection until you copy and paste the client_secret.json file into the credentials directory and restart PDI.
Log in to your Google account.
Enter you Google account credentials and log in. The Google Drive permission window displays.
Click Allow to access your Google Drive Resources.

Results

After this initialization, Pentaho stores a security token called StoredCredential in the data-integration/plugins/pentaho-googledrive-vfs/credentials directory. With this token, you can access your Google Drive resources whenever you log in to your Google account. If this security token is deleted, you are prompted to log in to your Google account after restarting PDI. If you change your Google account permissions, you must delete the token and repeat the previous steps to generate a new token.

NoteIf you want to access your Google Drive via a transformation running directly on your Pentaho Server, copy then paste the StoredCredential and the client_secret.json files into the pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-googledrive-vfs/credentials directory on your Pentaho Server.

Set up HCP credentials

To use the VFS browser with your HCP files, you must set up your HCP credentials. Before setting up your credentials, verify you already established access to the HCP platform by performed the tasks specified in Access to HCP .

Perform the following steps to set up your HCP credentials.

Procedure

Depending on the operating system, create the following subdirectory in the user’s home directory:
- Linux: ~/.hcp/
- Windows: C:\Users\username\.hcp\
Create a file named credentials and save it to the \.hcp directory.
Open the credentials file then add the parameters and values shown in the following code:
```
[default]
hcp_username=[username]
hcp_password=[password]
accept_self_signed_certificates=false
```
Insert the HCP namespace username and password, and change accept_self_signed_certificates to true if you want to enable a security bypass.

NoteYou can also use obfuscated or encrypted usernames and passwords.
Save and close the file.
For the Pentaho Server setup, stop and start the server.

Results

This completes the setup for VFS browser access to HCP.

Access files with the VFS browser

Perform the following steps to access your files with the VFS browser.

Procedure

Select File Open URL in the PDI client to open the VFS browser.
The Open File dialog box opens.
In the Location field, select the type of file system. The following file systems are supported:
- Local: Opens files on your local machine. Use the folders in the Name panel of the Open File dialog box to select a resource.
- Hadoop Cluster: Opens files on any Hadoop cluster except S3. Click the Hadoop Cluster drop-down box to select your cluster, then the resource you want to access.
- HDFS: Opens files on Hadoop distributed file systems. Select the cluster you want for the Hadoop Cluster option, then select the resource you want to access.
- Google Drive: Opens files on the Google file system. You must configure PDI to access the Google file system. See Access to a Google Drive for more information.
In the Open from Folder field, enter the VFS URL.

The following addresses are VFS URL examples for the Open from Folder field:
- Local: ftp://userID:password@ftp.myhost.com/path_to/file.txt
- HDFS: hdfs://myusername:mypassword@mynamenode:port/path
(Optional) Select another value in the Filter menu to filter on file types other than transformations and jobs, which is the default value.
(Optional) Select a file or folder and click the X icon in the upper-right corner of the browser to delete it.
(Optional) click the + icon in the upper-right corner of the browser to create a new folder.

Next steps

NoteVFS dialog boxes are configured through specific transformation parameters. See Configure VFS options for more information.

Supported steps and entries

The following steps and entries support the VFS browser:

Amazon EMR Job Executor (introduced in v9.0)
Amazon Hive Job Executor (introduced in v9.0)
AMQP Consumer (introduced in v9.0)
Avro Input (introduced in v8.3)
Avro Output (introduced in v8.3)
ETL metadata injection
File Exists (Job Entry)
Hadoop Copy Files
Hadoop File Input
Hadoop File Output
JMS Consumer (introduced in v9.0)
Job Executor (introduced in v9.0)
Kafka consumer (introduced in v9.0)
Kinesis Consumer (introduced in v9.0)
Mapping (sub-transformation)
MQTT Consumer (introduced in v9.0)
ORC Input (introduced in v8.3)
ORC Output (introduced in v8.3)
Parquet Input (introduced in v8.3)
Parquet Output (introduced in v8.3)
Oozie Job Executor (introduced in v9.0)
Simple Mapping (introduced in v9.0)
Single Threader (introduced in v9.0)
Sqoop Export (introduced in v9.0)
Sqoop Import (introduced in v9.0)
Transformation Executor (introduced in v9.0)
Weka Scoring (introduced in v9.0)

NoteIf you have a Pentaho address to an established VFS connection, you can copy and paste your PVFS location into the file or folder option of any of the steps and entries listed here. You do not need to click Browse to access the VFS browser.

The VFS dialog boxes are configured through specific transformation parameters. See Configure SFTP VFS for more information on configuring options for SFTP.

Configure VFS options

The VFS browser can be configured to set variables as parameters for use at runtime. A VFS Configuration Sample.ktr sample transformation containing some examples of the parameters you can set is located in the data-integration/samples/transformations directory. For more information on setting variables, see Specifying VFS properties as parameters. For an example of configuring an SFTP VFS connection, see Configure SFTP VFS.

Pentaho+ documentation has moved!

The new product documentation portal is here. Check it out now at docs.hitachivantara.com.

Before you begin

Access to Google Cloud

Access to HCP

Access to Data Catalog

Access to Microsoft Azure

Create a VFS connection

Edit a VFS connection

Delete a VFS connection

Access files with a VFS connection

Pentaho address to a VFS connection

Steps and entries supporting VFS connections

VFS browser

Before you begin

Access to a Google Drive

Set up HCP credentials

Access files with the VFS browser

Supported steps and entries

Configure VFS options