PDI and Snowflake
Snowflake is an analytic data warehouse running completely on a cloud infrastructure. Snowflake supports loading popular data formats like JSON, Avro, Parquet, ORC, and XML. Using Pentaho Data Integration (PDI), you can load your data into Snowflake and define jobs in PDI to efficiently orchestrate warehouse operations, paying only for the storage and computing resources actually used when you use them. See the Snowflake Documentation to learn more about Snowflake and how it works.
Using the Snowflake job entries in PDI, data engineers can set up virtual warehouses, bulk load data, and stop the warehouse when the process is complete. You can scale Snowflake virtual warehouses up and down, or suspend them when not in use to reduce costs.
Snowflake job entries in PDI
PDI has six job entries you can use to load data and manage warehouses in Snowflake.
In PDI, you can bulk load files into your Snowflake data warehouse:
Using this job entry, you can load a vast amount of data into a warehouse on Snowflake in a single session, provided you have sized your warehouse correctly. You can load data from Snowflake stages using the Snowflake VFS connection or directly from AWS S3 using the S3 VFS connection. For example, you may want to upload six months of ORC data from an S3 bucket. Using this job entry, you can define the source and type of data to load, specify the target data warehouse, and provide any needed parameters.
In PDI, you can create, modify, and even delete a Snowflake virtual warehouse to help you automate your virtual warehouse scaling activities. These orchestration entries include:
You can use this job entry to create a Snowflake virtual warehouse. You can set size, scaling, automated suspension, and other properties for your warehouse.
Once you create a warehouse, you can edit its settings using this job entry. Modifying a warehouse is useful if your users typically perform simple queries and only require a small warehouse. However, to meet your ETL service-level agreements (SLA), you may need a larger warehouse during the ETL process. Using this job entry, you can modify the warehouse at the beginning of the ETL process to scale it up, and then modify it to scale it back down when the ETL process is complete.
Use this job entry to delete virtual warehouses. Deleting unwanted virtual warehouses helps you clean up the Snowflake management console.
In PDI, you can dynamically start and stop Snowflake virtual warehouses to help you better control your Snowflake costs. For example, if your employees only work 8 hours a day, then you don’t need to keep your warehouse up for 24 hours a day. Using the Start and Stop job entries, you can turn on the warehouse from 8 AM to 5 PM for day-to-day business activities and again from 11 PM to 2 AM while your ETL processes are running.
See the Snowflake Documentation to learn more about how credits are billed for running virtual warehouses in Snowflake.
Use this job entry to start/resume a virtual warehouse on Snowflake. Warehouses consume credits while running.
You can set this job entry to stop/suspend a virtual warehouse on Snowflake. Suspending a warehouse stops the warehouse from consuming credits once all the servers shut down.
Install the Snowflake plugin in Pentaho 8.3
Perform the following steps to download and install the Snowflake PDI plugin for 8.3:
On the customer portal home page, click Downloads, then click Pentaho 8.3 GA Release in the 8.x list.
At the bottom of the Pentaho 8.3 GA Release page, click the Snowflake Plugin folder in the Box widget and download the snowflake-plugins-1.0.0-xxx-dist.zip file.
Unzip the plugin file in a temporary location.
Open a Command Prompt or Terminal window, navigate to the folder that contains the files you just extracted, and enter one of the following commands at the prompt:
- For Windows: install.bat
- For Linux: ./install.sh
Read the license agreement in the IZPack window.
Select the I accept the terms of this license agreement check box
In the Select the installation path text box, specify one of the following locations for the plugin directory:
NoteThe installation process creates a new directory for the Snowflake plugin in the location you specify. If you need access to the Snowflake entries from both PDI and the Pentaho Server, copy the resulting snowflake directory from your chosen location to the other location after the installation process.
- For PDI: design-tools/data-integration/plugins
- For Pentaho Server: pentaho-server/pentaho-solutions/system/kettle/plugins
Click through the prompts to finish the installation process.If you are updating the Pentaho Server, the installation is complete. If you are updating PDI, continue with the next steps to update the PDI client.
Perform the following steps to update the PDI client only (do not perform these steps on the Pentaho Server):
Navigate to the design-tools/data-integration/plugins/snowflake directory.
(Linux only) Set execute permissions for install-snowflake-vfs.sh, as shown in the following example:
chmod u+x install-snowflake-vfs.sh
Run one of the following commands at the prompt to update the PDI client:
- For Windows: install-snowflake-vfs.cmd
- For Linux: install-snowflake-vfs.sh