Connecting to Virtual File Systems
You can connect to most of your Virtual File Systems (VFS) through VFS connections in PDI. A VFS connection is a stored set of VFS properties that you can use to connect to a specific file system. In PDI, you can add a VFS connection and then reference that connection whenever you want to access files or folders on your Virtual File System. For example, you can use the VFS connection for Hitachi Content Platform (HCP) in any of the HCP transformation steps without repeatedly entering your credentials for data access.
With a VFS connection, you can set your VFS properties with a single instance that can be used multiple times. The VFS connection supports the following file system types:
Google Cloud Storage
The Google Cloud Storage file system. See Google Cloud Storage for more information on this protocol.
Snowflake Staging
A staging area used by Snowflake to load files. See Snowflake staging area for more information on this protocol.
Amazon S3 / MinIO
- Simple Storage Service (S3) accesses the resources on Amazon Web Services. See Working with AWS Credentials for Amazon S3 setup instructions.
- MinIO accesses data objects on an Amazon compatible storage server. See the MinIO Quickstart Guide for MinIO setup instructions.
HCP
The Hitachi Content Platform. You must configure HCP and PDI before accessing the platform. See Access to HCP for more information.
Microsoft Azure
The Microsoft Azure Storage services. You must create an Azure account and configure Azure Data Lake Storage Gen2 and Blob Storage. See Access to Microsoft Azure for more information.
After you create a VFS connection, you can use it with PDI steps and entries that support the use of VFS connections. The VFS connection is saved in the repository. If you are not connected to a repository, the connection is saved locally on the machine where it was created.
If a VFS connection in PDI is not available for your Virtual File System, you may be able to access it with the VFS browser. See VFS browser for further details.
Before you begin
You must first perform a few setup tasks if you need to access Google Cloud Storage, the Hitachi Content Platform (HCP), Lumada Data Catalog, or Microsoft Azure.
Access to Google Cloud
Perform the following steps to set up your system to use Google Cloud storage:
Procedure
Download the service account credentials file that you have created using the Google API Console to your local machine.
Create a new system environment variable on your operating system named GOOGLE_APPLICATION_CREDENTIALS.
Set the path to the downloaded JSON service account credentials file as the value of the GOOGLE_APPLICATION_CREDENTIALS variable.
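For example, on Linux you could add the following line to your shell profile, where the path is a hypothetical location of your downloaded credentials file:
export GOOGLE_APPLICATION_CREDENTIALS=/home/myuser/gcp/service-account.json
On Windows, create the variable through Control Panel > System > Advanced system settings > Environment Variables and set its value to a path such as C:\Users\myuser\gcp\service-account.json.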
Results
Access to HCP
Within HCP, access control lists (ACLs) grant privileges to the user to perform a variety of file operations. Namespaces, owned and managed by tenants, are used for logical groupings, access and permissions, and object metadata such as versioning, retention and shred settings. For more information about HCP, see the Introduction to Hitachi Content Platform.
Perform the following steps to set up access to HCP:
Procedure
Log on to the HCP Tenant Management Console.
Click Namespaces and then select the Name you want to configure.
In the Protocols tab, click HTTP(S), and verify Enable HTTPS and Enable REST API with Authenticated access only are selected.
In the Settings tab, select ACLs.
Select the Enable ACLs check box and, when prompted, click Enable ACLs to confirm.
Results
Access to Data Catalog
Data Catalog can access and register data stored on Amazon S3, MinIO, or HDFS. Multiple connections are required in PDI to access Data Catalog and the resources that it manages:
- To access Data Catalog, you will need to know the authentication type used, instance URL, and your account credentials.
- To access S3 resources from Data Catalog, see Create a VFS connection.
- To access HDFS resources from Data Catalog, see Connecting to a Hadoop cluster with the PDI client.
Access to Microsoft Azure
To access Azure services from PDI, you must first create and configure your Azure Data Lake Gen 1, or Azure Data Lake Storage Gen2 and Blob Storage services. You should also enable the hierarchical namespace to maximize file system performance.
- Access to the Azure services requires an Azure account with an active subscription. See Create an account for free.
- Access to Azure Storage requires an Azure Storage account. See Create a storage account.
Create a VFS connection
Procedure
Start the PDI client (Spoon) and create a new transformation or job.
In the View tab of the Explorer pane, right-click the VFS Connections folder, then click New.
The New VFS connection dialog box opens.
In the Connection Name field, enter a name that uniquely describes this connection. The name can contain spaces, but it cannot include special characters, such as #, $, and %.
In the Connection Type field, select one of the following types:
Amazon S3 / MinIO (Default):
- Simple Storage Service (S3) accesses the resources on Amazon Web Services. See Working with AWS Credentials for Amazon S3 setup instructions.
- MinIO accesses data objects on an Amazon compatible storage server. See the MinIO Quickstart Guide for MinIO setup instructions.
Azure Data Lake Gen 1
The Microsoft Azure Storage services. You must create an Azure account and configure Azure Data Lake Gen 1 storage. See Access to Microsoft Azure for more information.
Azure Data Lake Gen 2 / Blob
The Microsoft Azure Storage services. You must create an Azure account and configure Azure Data Lake Storage Gen2 and Blob Storage. See Access to Microsoft Azure for more information.
Catalog
The Lumada Data Catalog. You must configure your Data Catalog connection before accessing the platform. Enter the authentication type, connection URL and account credentials. To access data resources from Data Catalog, an S3 or HDFS connection is also required. See the Lumada Data Catalog documentation for details.
Google Cloud Storage:
The Google Cloud Storage file system. See Google Cloud Storage for more information on this protocol.
HCP:
The Hitachi Content Platform. You must configure HCP and PDI before accessing the platform. You must also configure object versioning in HCP Namespaces. See Access to HCP for more information.
Snowflake Staging:
A staging area used by Snowflake to load files. See Snowflake staging area for more information on this protocol.
(Optional) Enter a description for your connection in the Description field.
Click Next.
In the Connection Details section, enter the information according to your selected Connection Type.
If you selected Amazon S3 / MinIO, choose one of the following options:
- For Amazon: Select the Default S3 Connection check box to enable use of Amazon S3.
- For MinIO: Clear the Default S3 Connection check box to enable use of MinIO. Also, select the PathStyle Access check box to enable path-style access; otherwise, S3 bucket-style access is used, as shown in the example URLs below.
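The two access styles produce different request URLs. For example, with a hypothetical bucket named mybucket and a MinIO server at minio.example.com:9000:
Path-style access: https://minio.example.com:9000/mybucket/myfile.csv
Bucket-style (virtual-hosted) access: https://mybucket.s3.amazonaws.com/myfile.csv
MinIO deployments typically require path-style access because the bucket name appears in the URL path rather than in the host name.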
If you selected Azure Data Lake Gen 1 on the previous page, specify your Azure account name in the Account Fully Qualified Domain Name field, and enter your Application ID, Client Secret, and OAuth 2.0 token endpoint.
If you selected Azure Data Lake Gen 2 / Blob on the previous page, choose one of the following Authentication Type options:
- For Account Shared Key: Enter your shared account key credential in the Account Shared Key field.
- For Azure Active Directory: Enter your Azure Active Directory credentials in the Application (client) ID, Client Secret, and Directory (tenant) ID fields.
- For Shared Access Signature: Enter your SAS token in the Shared Access Signature field.
For all three Authentication Type options, you must also specify your Azure account name in the Account Fully Qualified Domain Name field, and set the Block Size, Buffer Count, Max Block Upload Size, and Access Tier.
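For example, for a hypothetical Azure storage account named mystorageaccount, the fully qualified domain name is typically mystorageaccount.dfs.core.windows.net for Data Lake Storage Gen2 or mystorageaccount.blob.core.windows.net for Blob Storage.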
Note: You can add a predefined variable to the fields that display the variable icon. Place the cursor in the required field and press Ctrl+Space, then select the required variable from the list. The variable must be a predefined variable, not a runtime variable. Predefined variables are defined in the kettle.properties file. For more information on variables, see Kettle Variables.
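For example, you could define a hypothetical variable in the kettle.properties file and reference it in a connection field:
AZURE_ACCOUNT_FQDN=mystorageaccount.dfs.core.windows.net
With this entry in place, put the cursor in the Account Fully Qualified Domain Name field, press Ctrl+Space, and select ${AZURE_ACCOUNT_FQDN} from the list.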
(Optional) Click Test to verify your connection.
Click OK to save the changes for the created connection.
Results
Edit a VFS connection
Procedure
In the View tab of the Explorer pane, right-click the VFS connection you want to edit, then select Edit.
When the Edit VFS Connection dialog box opens, select the Pencil icon next to the section you want to edit.
Delete a VFS connection
Procedure
Right-click the VFS connection you want to delete.
Select Delete, then Yes, Delete.
Results
Access files with a VFS connection
Follow these instructions to access files with the VFS Open or Save dialog box in the PDI client.
Procedure
Open or save a file in the PDI client using the VFS. The Open or Save As dialog box opens. Recently accessed files appear in the right panel.
Navigate to your folders and files through the VFS connection category in the left panel.
When you select a folder or file, the Recents drop-down list updates to show the navigation path to your file location.
(Optional) Click the navigation path to show and copy the Pentaho file path, even if it is a Virtual File System (VFS) path. See Pentaho address to a VFS connection for details on the Pentaho file path for the Virtual File System.
Select the file and click Open or Save depending on whether you are opening or saving a file.
Results
Pentaho address to a VFS connection
The Pentaho address is the Pentaho virtual file system (pvfs) location within your VFS connection. When you are locating a file under the VFS connection category in the file access dialog box, the directory path in your Virtual File System appears in the address text box.
When you click in the address bar, the Pentaho address to the file appears in the text box.
You can copy and paste a Pentaho address into file path options of PDI steps and entries that support VFS connections.
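For example, assuming a VFS connection named mySalesData (a hypothetical name), the Pentaho address of a file available through that connection might look like the following, where the first path segment after pvfs:// is the connection name and the remainder is the path within the connected file system:
pvfs://mySalesData/reports/2024/orders.csv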
Steps and entries supporting VFS connections
You may have a transformation or a job containing a step or entry that accesses a file on a Virtual File System.
The following steps and entries support VFS connections:
- Avro Input
- Avro Output
- Bulk load from MySQL into file
- Bulk load into MSSQL
- Bulk load into MySQL
- Copybook Input
- CSV File Input
- De-serialize from file
- Fixed file input
- Get data from XML
- Get File Names
- Get Files Rows Count
- Get SubFolder names
- Google Analytics
- GZIP CSV Input
- Job (job entry)
- JSON Input
- JSON output
- ORC Input
- ORC Output
- Parquet Input
- Parquet Output
- Query HCP
- Read metadata from Copybook
- Read metadata from HCP
- Text File Output
- Transformation (job entry)
- Write metadata to HCP
VFS browser
In some transformation steps and job entries, a Virtual File System (VFS) browser is used in place of VFS connections and the Open dialog box. When you use the VFS browser, you specify a VFS URL instead of a VFS connection. Files are accessed through URLs whose scheme identifies the protocol to use. Your files can be local or remote, and can reside in compressed formats such as TAR and ZIP. For more information, see the Apache Commons VFS documentation.
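For example, the scheme at the start of the URL determines how the file is read (host names and paths below are hypothetical):
sftp://myuser@myhost/path/to/file.txt accesses a file over SFTP.
zip:file:///C:/archives/data.zip!/folder/file.txt accesses a file inside a local ZIP archive, using the Apache Commons VFS layered URL syntax.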
Before you begin
If you need to access a Google Drive or the Hitachi Content Platform (HCP), you must perform a few setup tasks.
Access to a Google Drive
Procedure
Follow the "Step 1" procedure in the article "Build your first Drive app (Java)" in the Google Drive APIs documentation.
This procedure turns on the Google Drive API and creates a credentials.json file.
Rename the credentials.json file to client_secret.json, then copy the renamed file into the data-integration/plugins/pentaho-googledrive-vfs/credentials directory.
Restart PDI.
The Google Drive option does not appear when you create a VFS connection until you copy the client_secret.json file into the credentials directory and restart PDI.
Log in to your Google account.
Enter your Google account credentials and log in. The Google Drive permission window displays.
Click Allow to access your Google Drive Resources.
Results
Set up HCP credentials
Perform the following steps to set up your HCP credentials.
Procedure
Depending on the operating system, create the following subdirectory in the user’s home directory:
- Linux: ~/.hcp/
- Windows: C:\Users\username\.hcp\
Create a file named credentials and save it to the .hcp directory.
Open the credentials file then add the parameters and values shown in the following code:
[default]
hcp_username=[username]
hcp_password=[password]
accept_self_signed_certificates=false
Insert the HCP namespace username and password, and change accept_self_signed_certificates to true if you want to enable a security bypass.
Note: You can also use obfuscated or encrypted usernames and passwords.
Save and close the file.
For the Pentaho Server setup, stop and start the server.
Results
Access files with the VFS browser
Procedure
Open the VFS browser in the PDI client. The Open File dialog box opens.
In the Location field, select the type of file system. The following file systems are supported:
- Local: Opens files on your local machine. Use the folders in the Name panel of the Open File dialog box to select a resource.
- Hadoop Cluster: Opens files on any Hadoop cluster except S3. Click the Hadoop Cluster drop-down box to select your cluster, then the resource you want to access.
- HDFS: Opens files on Hadoop distributed file systems. Select the cluster you want for the Hadoop Cluster option, then select the resource you want to access.
- Google Drive: Opens files on the Google file system. You must configure PDI to access the Google file system. See Access to a Google Drive for more information.
In the Open from Folder field, enter the VFS URL.
The following addresses are VFS URL examples for the Open from Folder field:
- FTP: ftp://userID:password@ftp.myhost.com/path_to/file.txt
- HDFS: hdfs://myusername:mypassword@mynamenode:port/path
(Optional) Select another value in the Filter menu to filter on file types other than transformations and jobs, which is the default value.
(Optional) Select a file or folder and click the X icon in the upper-right corner of the browser to delete it.
(Optional) Click the + icon in the upper-right corner of the browser to create a new folder.
Next steps
Supported steps and entries
The following steps and entries support the VFS browser:
- Amazon EMR Job Executor (introduced in v9.0)
- Amazon Hive Job Executor (introduced in v9.0)
- AMQP Consumer (introduced in v9.0)
- Avro Input (introduced in v8.3)
- Avro Output (introduced in v8.3)
- ETL metadata injection
- File Exists (Job Entry)
- Hadoop Copy Files
- Hadoop File Input
- Hadoop File Output
- JMS Consumer (introduced in v9.0)
- Job Executor (introduced in v9.0)
- Kafka consumer (introduced in v9.0)
- Kinesis Consumer (introduced in v9.0)
- Mapping (sub-transformation)
- MQTT Consumer (introduced in v9.0)
- ORC Input (introduced in v8.3)
- ORC Output (introduced in v8.3)
- Parquet Input (introduced in v8.3)
- Parquet Output (introduced in v8.3)
- Oozie Job Executor (introduced in v9.0)
- Simple Mapping (introduced in v9.0)
- Single Threader (introduced in v9.0)
- Sqoop Export (introduced in v9.0)
- Sqoop Import (introduced in v9.0)
- Transformation Executor (introduced in v9.0)
- Weka Scoring (introduced in v9.0)
The VFS dialog boxes are configured through specific transformation parameters. See Configure SFTP VFS for more information on configuring options for SFTP.
Configure VFS options
The VFS browser can be configured to set variables as parameters for use at runtime. A VFS Configuration Sample.ktr sample transformation containing some examples of the parameters you can set is located in the data-integration/samples/transformations directory. For more information on setting variables, see Specifying VFS properties as parameters. For an example of configuring an SFTP VFS connection, see Configure SFTP VFS.
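As a rough illustration only, such a transformation could define a parameter that pairs a VFS property name with a value, for example a parameter named vfs.sftp.StrictHostKeyChecking set to no to relax host-key checking for SFTP access; the exact parameter names and supported values are those listed in the sample transformation and in Configure SFTP VFS.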