Virtual File System Browser
Some transformation steps and job entries use virtual file system (VFS) dialog boxes in place of traditional local file system windows. With VFS file dialog boxes, you can specify a VFS URL instead of a typical local path. Each URL begins with a scheme that identifies the protocol used to access the file. See http://commons.apache.org/vfs/apidocs/index.html for VFS scheme documentation. Your files can be local or remote. They can also reside in archive or compressed formats such as .tar or .zip.
Perform the following steps to access your files with the VFS browser:
- Select File > Open URL in the PDI client to open the VFS browser as shown in the following figure:
- Choose the file system type of your Location. The following file systems are supported:
- Local – Opens files on your local machine. Use the folders in the Name panel of the Open File dialog box to select a resource.
- Hadoop Cluster – Opens files on any Hadoop cluster except S3. Click the Hadoop Cluster drop-down box to select your desired cluster, then the resource you want to access.
- S3 – Opens files on Amazon Simple Storage Service (S3) in Amazon Web Services. For instructions on setting up AWS credentials, see Working with AWS Credentials.
- HDFS – Opens files on any Hadoop distributed file system except MapR. Select your desired cluster for the Hadoop Cluster option, then select the resource you want to access.
- MapRFS – Opens files on the MapR file system. Use the folders in the Name panel of the Open File dialog box to select a MapR resource.
- Google Cloud Storage – Opens files on the Google Cloud Storage file system.
- Google Drive – Opens files on Google Drive. You must configure PDI before you access Google Drive for the first time. See Access to a Google Drive for more information.
- HCP – Opens files on the Hitachi Content Platform. You must configure HCP and PDI before accessing the platform. See Access to HCP for more information.
The following addresses are VFS URL examples:
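The URLs in the Python sketch below are illustrative placeholders written in the general scheme-prefixed style used by Apache Commons VFS. The host names, user names, and paths are hypothetical, and `vfs_scheme` is a helper invented for this illustration rather than part of PDI. The sketch only shows how the leading scheme of a URL identifies the protocol to use.

```python
def vfs_scheme(url: str) -> str:
    """Return the leading scheme of a VFS-style URL, lowercased."""
    scheme, separator, _rest = url.partition(":")
    if not separator or not scheme:
        raise ValueError(f"no scheme found in {url!r}")
    return scheme.lower()


# Hypothetical example URLs; hosts, users, and paths are placeholders.
EXAMPLE_URLS = [
    "file:///home/someuser/somedir/input.csv",
    "zip:file:///home/someuser/archive.zip!/input.csv",
    "sftp://someuser@somehost.example.com/pub/input.csv",
    "hdfs://namenode.example.com:8020/user/someuser/input.csv",
]

if __name__ == "__main__":
    for url in EXAMPLE_URLS:
        # Show which protocol each URL asks the VFS layer to use.
        print(f"{vfs_scheme(url):5} <- {url}")
```

Note that a layered URL such as the `zip:` entry reports the outermost scheme, which is the archive format wrapped around the inner `file:` location.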
Access to a Google Drive
Perform the following setup steps before you access your Google Drive for the first time:
- Turn on the Google Drive API, which generates a credentials.json file. See https://developers.google.com/drive/api/v3/quickstart/java for details.
- Rename your credentials.json file to client_secret.json and copy it into the data-integration/plugins/pentaho-googledrive-vfs/credentials directory, then restart PDI. The Google Drive option will not appear as a Location until you copy the client_secret.json file into the credentials directory and restart.
- Select Google Drive as your Location. You are prompted to log in to your Google account.
- After you log in, the Google Drive permission screen appears.
- Click Allow to access your Google Drive resources.
Pentaho then stores a security token called StoredCredential under the data-integration/plugins/pentaho-googledrive-vfs/credentials directory. With this token, you can access your Google Drive resources whenever you are logged in to your Google account. If this security token is ever deleted, you will be prompted again to log in to your Google account after restarting PDI. If you ever change your Google account permissions, you must delete the token and repeat the above steps to generate a new token.
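The rename-and-copy step above can also be scripted. The following Python sketch is one possible way to do it, assuming you pass in the location of the downloaded credentials.json file and of your data-integration directory. The function names are hypothetical helpers, not part of PDI, and you still need to restart PDI afterward.

```python
import shutil
from pathlib import Path

# Relative path of the credentials directory, as given in the steps above.
CREDENTIALS_SUBDIR = Path("plugins") / "pentaho-googledrive-vfs" / "credentials"


def install_client_secret(downloaded_json, data_integration_dir) -> Path:
    """Copy credentials.json into the plugin directory as client_secret.json."""
    target_dir = Path(data_integration_dir) / CREDENTIALS_SUBDIR
    # The directory normally ships with the plugin; creating it here is
    # only a convenience for this sketch.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "client_secret.json"
    shutil.copyfile(downloaded_json, target)
    return target


def has_stored_token(data_integration_dir) -> bool:
    """Report whether the StoredCredential token file is present."""
    token = Path(data_integration_dir) / CREDENTIALS_SUBDIR / "StoredCredential"
    return token.is_file()
```

A call such as `install_client_secret("credentials.json", "/opt/data-integration")` uses a placeholder install path; substitute your own. `has_stored_token` can help confirm whether a login prompt is expected after a restart.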
If you want to access your Google Drive from a transformation running directly on your Pentaho Server, copy the StoredCredential and client_secret.json files into the pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-googledrive-vfs/credentials directory on that server.
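As a rough sketch of the server copy described above, the Python function below copies both files from a PDI client installation to a Pentaho Server installation. It assumes both directories are reachable from the machine running the script. The function name and its arguments are hypothetical, and only the relative paths come from this section.

```python
import shutil
from pathlib import Path

# Relative paths taken from this section.
PLUGIN_CREDENTIALS = Path("plugins") / "pentaho-googledrive-vfs" / "credentials"
SERVER_KETTLE_DIR = Path("pentaho-solutions") / "system" / "kettle"
FILES_TO_COPY = ("StoredCredential", "client_secret.json")


def copy_google_credentials_to_server(data_integration_dir, pentaho_server_dir):
    """Copy the token and client secret from the PDI client to the server."""
    source_dir = Path(data_integration_dir) / PLUGIN_CREDENTIALS
    target_dir = Path(pentaho_server_dir) / SERVER_KETTLE_DIR / PLUGIN_CREDENTIALS

    # Fail early if either file has not been generated yet.
    missing = [name for name in FILES_TO_COPY if not (source_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing in {source_dir}: {', '.join(missing)}")

    target_dir.mkdir(parents=True, exist_ok=True)
    return [
        Path(shutil.copy2(source_dir / name, target_dir / name))
        for name in FILES_TO_COPY
    ]
```

The early check means the function refuses to copy only one of the two files, which would otherwise leave the server with an incomplete set of credentials.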
Access to HCP
Hitachi Content Platform (HCP) is a distributed storage system that can be used with the VFS browser. Within HCP, access control lists (ACLs) grant user privileges to perform various file operations. Namespaces, owned and managed by tenants, are used for logical groupings, access and permissions, as well as object metadata such as retention and shred settings. See the HCP documentation to learn more about HCP’s features and functions.
The process below assumes that you have HCP tenant permissions and that namespaces have been created.
Perform the following steps to set up access to HCP:
- Log on to the HCP Tenant Management Console.
- Click Namespaces, then select the name of the namespace you want to configure.
- In the Protocols tab, click HTTP(S), and verify that Enable HTTPS and Enable REST API with Authenticated access only are selected.
- In the Settings tab, select ACLs.
- Select the Enable ACLs check box and, when prompted, click Enable ACLs to confirm. This completes the setup of HCP. Next, set up the credentials.
Perform the following steps to set up credentials:
- Depending on the OS, create the following subdirectory in the user’s home directory:
- Linux: ~/.hcp/
- Windows: C:\Users\<username>\.hcp\
- Create a file named ‘credentials’ and save it to the .hcp directory.
- Open the credentials file, then add the parameters and values shown below:
[default]
hcp_username=[username]
hcp_password=[password]
accept_self_signed_certificates=false
Insert the HCP namespace username and password. Change ‘accept_self_signed_certificates’ to ‘true’ only if you want to accept self-signed certificates, which bypasses the usual certificate check.
You can also use obfuscated or encrypted usernames and passwords.
- Save and close the file.
- If you are setting up the Pentaho Server, stop and restart the server. This completes the setup for VFS browser access to HCP.
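The credential steps above can be sketched in Python as follows. This is a minimal illustration, not a Pentaho utility: `write_hcp_credentials` and its `home` argument are hypothetical names, and it assumes that a file written by Python's `configparser` without spaces around the `=` sign matches the layout shown above closely enough for PDI to read.

```python
import configparser
from pathlib import Path
from typing import Optional


def write_hcp_credentials(username: str, password: str,
                          accept_self_signed: bool = False,
                          home: Optional[Path] = None) -> Path:
    """Create <home>/.hcp/credentials with a [default] section."""
    base = Path(home) if home is not None else Path.home()
    hcp_dir = base / ".hcp"
    hcp_dir.mkdir(parents=True, exist_ok=True)

    # interpolation=None keeps a '%' in the password as a literal character.
    parser = configparser.ConfigParser(interpolation=None)
    parser["default"] = {
        "hcp_username": username,
        "hcp_password": password,
        "accept_self_signed_certificates": str(accept_self_signed).lower(),
    }

    path = hcp_dir / "credentials"
    with path.open("w", encoding="utf-8") as handle:
        # Write key=value without surrounding spaces, as in the sample above.
        parser.write(handle, space_around_delimiters=False)
    return path
```

Leaving `home` unset writes to the current user's home directory, which corresponds to ~/.hcp/ on Linux and C:\Users\<username>\.hcp\ on Windows. The password is stored in plain text by this sketch, so treat the resulting file accordingly.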
Add and Delete Folders or Files
You can also use the VFS browser to create folders and to delete files or folders on your file system. A default filter is applied so that initially only Kettle transformation and job files are displayed. To view other files, click the Filter drop-down and select the type of file you want to view. To delete a file or folder, select it and click the X in the upper-right corner of the VFS browser. To create a new folder, click the + in the upper-right corner of the VFS browser, enter the new folder name, and click OK.
Supported Steps and Entries
Supported transformation steps and job entries open the VFS browser instead of the traditional file open dialog box. With the VFS browser, you specify a VFS URL instead of a file path to access your resources.
The following steps and entries support the VFS browser:
- Avro Input
- Avro Output
- ETL Metadata Injection
- File Exists
- Hadoop Copy Files
- Hadoop File Input
- Hadoop File Output
- Mapping (sub-transformation)
- ORC Input
- ORC Output
- Parquet Input
- Parquet Output
VFS dialog boxes are configured through certain transformation parameters. Refer to Configure SFTP VFS for more information on configuring options for SFTP.
Configure VFS Options
You can configure the VFS browser by setting variables as parameters for use at runtime. The sample transformation VFS Configuration Sample.ktr, located in the data-integration/samples/transformations directory, contains examples of the parameters you can set. For more information on setting variables, see VFS Properties. For an example of configuring an SFTP VFS connection, see Configure SFTP VFS.