Bulk load into Snowflake

The Bulk load into Snowflake job entry in PDI loads vast amounts of data into a Snowflake virtual warehouse in a single session. For more information about working with Snowflake in PDI, see PDI and Snowflake.

This entry automates Snowflake's COPY INTO command to populate your Snowflake data warehouse with your PDI data, eliminating the need for repetitive SQL scripting. To use the Bulk load into Snowflake job entry, you must size your virtual warehouse, define the source and type of data to load, specify the target data warehouse, and then provide any needed parameters.
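For orientation, the SQL that the entry generates resembles the following sketch. The stage, schema, and table names here are hypothetical, and the exact statement depends on the settings described in the tabs below:

    -- A representative COPY INTO statement; all names are placeholders.
    COPY INTO my_schema.my_table
      FROM @my_stage/input/
      FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ';' SKIP_HEADER = 1)
      ON_ERROR = 'ABORT_STATEMENT';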

For more information about using Snowflake, including best practices for bulk loading data, see the Snowflake documentation.

Before you begin

Before using the Bulk load into Snowflake job entry in PDI, you must have the following items:

  • A Snowflake account.
  • If you are using Pentaho 8.3, you must have installed the Snowflake plugin. See Install the Snowflake plugin in Pentaho 8.3 for download and installation instructions.
  • A connection to the database where you want to load your data.
  • An S3 connection or a VFS connection to a Snowflake staging area for your data source.
  • A table and schema set up in the database where you want to place your data. On first use, you need to create the table (see the example after this list).
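A minimal sketch of creating such a table, assuming a hypothetical SALES schema and a three-column delimited file:

    -- Hypothetical target table; match the columns and types to your data.
    CREATE TABLE SALES.ORDERS (
      ORDER_ID  NUMBER,
      CUSTOMER  VARCHAR,
      AMOUNT    NUMBER(12,2)
    );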

General

The following field is general to this job entry:

  • Entry Name: Specify the unique name of the job entry on the canvas. You can customize the name or leave it as the default.

Options

The Bulk load into Snowflake entry includes four tabs to define the data input source, the output database and location, options, and advanced options for loading PDI data into Snowflake. Each tab is described below.

Input tab

Use the options in this tab to define your input source for the Snowflake COPY INTO command:

Source: Choose one of the following input source types:

  • S3

    The input source is an S3 bucket.

  • Snowflake Staging Area

    The input source is files in a Snowflake staging area.

Click Select to specify the file, folder, prefix, or variable of the S3 bucket or staging location to use as the input for the Snowflake COPY INTO command. See "Syntax" in the Snowflake documentation for more details on specifying this option.
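As an illustration, a Snowflake stage location can point at a folder or at a single file (the stage and table names below are hypothetical):

    -- Load every file under a folder in a named stage...
    COPY INTO my_schema.my_table FROM @my_stage/exports/;
    -- ...or load one staged file by name.
    COPY INTO my_schema.my_table FROM @my_stage/exports/orders_2020.csv.gz;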

What file type is your source?: Select the file type of the input source. You can select one of the following types:
  • Delimited text

    The input source is character-delimited UTF-8 text.

  • Avro

    The input source is a file in the Avro data serialization format.

  • JSON

    The input source is a JavaScript Object Notation (JSON) data file containing a set of either objects or arrays.

  • ORC

    The input source is an Optimized Row Columnar (ORC) file containing Hive data.

  • Parquet

    The input source is a Parquet file of nested data structures in a flat columnar format.

  • XML

    The input source is a file in XML format.

Compression: Select the type of compression applied to your input source:
  • None
  • Auto
  • BZIP2
  • GZIP
  • Deflate
  • Raw deflate
  • Brotli
  • Zstd
For Parquet files, the Compression options are:
  • None
  • Auto
  • Snappy
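In the generated statement, this selection presumably maps to the file format's COMPRESSION property; a sketch with hypothetical names:

    -- AUTO lets Snowflake detect compression from the file extension.
    COPY INTO my_schema.my_table
      FROM @my_stage/exports/
      FILE_FORMAT = (TYPE = 'CSV' COMPRESSION = GZIP);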

Depending on what file type you selected for What file type is your source?, the following file settings appear at the bottom of this tab:

Delimited text: Specify the following settings for a delimited text file:
  • Leading rows to skip

    Specify the number of rows to use as an offset from the beginning of the file. This option is useful to skip header lines.

  • Delimiter

    Specify the character used to separate a data field. The default value is a semicolon (;).

  • Quote character

    Specify the character used to enclose a data field. The default value is a double quotation mark (").

  • Remove quotes

    Select one of the following values to indicate whether quotation characters should be removed from a data field during the bulk load:

    • Yes: Remove the quotation characters.
    • No: Retain the quotation characters.
  • Empty as null

    Select one of the following values to indicate whether empty data values should be set to null during the bulk load:

    • Yes: Set empty data values to null.
    • No: Leave the data values empty.
  • Trim whitespace

    Select one of the following values to remove trailing and leading whitespace from the data during the bulk load:

    • Yes: Remove the whitespace.
    • No: Retain the whitespace.
Note: For delimited text files, you must have a table in your database with all the columns you need defined.
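Taken together, these settings correspond roughly to CSV file format options in the generated COPY INTO statement; a sketch with hypothetical names:

    -- Delimited text: skip one header row, semicolon delimiter,
    -- remove quotes, set empty fields to NULL, trim whitespace.
    COPY INTO my_schema.my_table
      FROM @my_stage/exports/
      FILE_FORMAT = (
        TYPE = 'CSV'
        SKIP_HEADER = 1
        FIELD_DELIMITER = ';'
        FIELD_OPTIONALLY_ENCLOSED_BY = '"'
        EMPTY_FIELD_AS_NULL = TRUE
        TRIM_SPACE = TRUE
      );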
Avro: No additional settings.
JSON: Specify the following settings for a JSON file:
  • Ignore UTF8 errors

    Select one of the following values to indicate whether to ignore UTF-8 errors in the data during the bulk load:

    • Yes: Ignore UTF-8 errors.
    • No: Do not ignore UTF-8 errors.
  • Allow duplicate elements

    Select one of the following values to indicate whether to allow duplicate elements in the data during the bulk load:

    • Yes: Allow duplicate elements.
    • No: Do not allow duplicate elements.

    Note: Snowflake only uses the last duplicate value and discards the others.
  • Strip null values

    NULL values are stored as null in JSON files. Select one of the following values to indicate whether to delete NULL values from the data during the bulk load:

    • Yes: Strip the NULL values.
    • No: Store the NULL values in a variant column.
  • Parse octal numbers

    Select one of the following values to indicate whether to parse octal numbers during the bulk load:

    • Yes: Parse octal numbers.
    • No: Do not parse octal numbers.
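These settings correspond to Snowflake's JSON file format options; a sketch with hypothetical names:

    -- JSON into a VARIANT column; strip nulls, tolerate bad UTF-8.
    COPY INTO my_schema.my_json_table
      FROM @my_stage/json/
      FILE_FORMAT = (
        TYPE = 'JSON'
        STRIP_NULL_VALUES = TRUE
        IGNORE_UTF8_ERRORS = TRUE
        ALLOW_DUPLICATE = FALSE
        ENABLE_OCTAL = FALSE
      );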
ORC: Additional file settings for ORC files.

Parquet: Additional file settings for Parquet files.

XML: Specify the following settings for an XML file:
  • Ignore UTF8 errors

    Select one of the following values to indicate whether to replace UTF-8 encoding errors during the bulk load:

    • Yes: Replace invalid UTF-8 sequences with Unicode character U+FFFD.
    • No: Invalid UTF-8 sequences produce an encoding error (default).
  • Preserve space

    Select one of the following values to indicate whether to preserve leading and trailing spaces in element content during the bulk load:

    • Yes: Preserve spaces.
    • No: Do not preserve spaces (default).
  • Strip outer element

    Select one of the following values to indicate whether to remove the outer XML element, and expose the second level elements as separate documents during the bulk load:

    • Yes: Remove the outer XML element.
    • No: Do not remove the outer XML element (default).
  • Enable Snowflake data

    Select one of the following values to indicate whether to enable recognition of Snowflake semi-structured data tags from the data during the bulk load:

    • Yes: Enable recognition of Snowflake tags (default).
    • No: Disable recognition of Snowflake tags.
  • Auto convert

    Select one of the following values to indicate whether to convert numeric and Boolean values from text to native representation during the bulk load:

    • Yes: Convert numeric and Boolean values (default).
    • No: Do not convert numeric and Boolean values.
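These settings map onto Snowflake's XML file format options (the last two are expressed as inverted DISABLE_* properties); a sketch with hypothetical names:

    -- XML: strip the outer element so each child element becomes a row.
    COPY INTO my_schema.my_xml_table
      FROM @my_stage/xml/
      FILE_FORMAT = (
        TYPE = 'XML'
        STRIP_OUTER_ELEMENT = TRUE
        PRESERVE_SPACE = FALSE
        IGNORE_UTF8_ERRORS = TRUE
        DISABLE_SNOWFLAKE_DATA = FALSE  -- Enable Snowflake data = Yes
        DISABLE_AUTO_CONVERT = FALSE    -- Auto convert = Yes
      );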
Note: If you have unstructured data, you must have a variant column in your database table to store the data for the following file types (see the example after this list):
  • JSON
  • ORC
  • Parquet
  • XML
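A minimal sketch of such a table (the names are hypothetical):

    -- A single VARIANT column holds each semi-structured document.
    CREATE TABLE my_schema.my_json_table (
      RAW VARIANT
    );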

Output tab

Use the options in this tab to specify the target database, schema, and table for the bulk load:

Database connection: Select your database connection from a list of existing Snowflake connections.

If you do not have an existing connection, click New. If you need to modify an existing connection, click Edit. See Define Data Connection for instructions.

Note: If timeout errors occur, see Snowflake timeout errors to troubleshoot.

Schema: Select the schema to use for the bulk load. The Bulk load into Snowflake entry reads the schemas that exist in the database to populate this list.

Table name: Select the name of the table to bulk load. The Bulk load into Snowflake entry reads the table names from your selected schema to populate this list.

Note: You can only load one table at a time. To load multiple tables, use multiple Bulk load into Snowflake entries, or parameterize the schema and table names to change the table that is loaded.

Note: You can use variables in the Schema and Table name fields (see the example below).
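For example, a job that loads a different table on each run might set the fields to PDI variables (the variable names are hypothetical):

    Schema:      ${TARGET_SCHEMA}
    Table name:  ${TARGET_TABLE}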

Options tab

Use the options in this tab to define how the data is loaded using the Snowflake COPY INTO command:

Truncate table: Select to remove all the data in the table before bulk loading the current data. When the Truncate table option is cleared, the data is appended during a new data load (default).

On error: Select one of the following values to indicate what to do if an error occurs during the bulk load (see the sketch after these options):
  • Abort: Stops loading data into the warehouse.
  • Skip file: Skips loading the file that caused the error and continues processing.
  • Skip file number: Number of errors that can occur before the file is skipped in the loading process.
  • Skip file percent: Percentage of errors in the record count that can occur before the file is skipped in the loading process.
  • Continue: Ignores the errors and continues processing.

Skip file after (x) errors: Enter the number of errors that can occur before the file is skipped in the loading process. This field displays when the Skip file number value is selected.

Skip file after (%) errors: Enter the percentage of errors that can occur before the file is skipped in the loading process. This field displays when the Skip file percent value is selected.
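These choices correspond to the ON_ERROR copy option of COPY INTO; a sketch with hypothetical names:

    -- Skip a file once it produces more than 10 errors.
    COPY INTO my_schema.my_table
      FROM @my_stage/exports/
      FILE_FORMAT = (TYPE = 'CSV')
      ON_ERROR = 'SKIP_FILE_10';
    -- Percentage form: skip a file when 5% of its records error out.
    -- ON_ERROR = 'SKIP_FILE_5%'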

Advanced options tab

Use this tab to configure parameters for the Snowflake COPY INTO command. Any Name/Value pair you add is passed to Snowflake as a parameter, but validating the parameter is your responsibility. See Snowflake's documentation for further details on these parameters. The Force parameter below is provided as an example.

Force: Specify whether to force loading of files into the database:
  • True: Loads data into the table even if the data has already been loaded from that file before. This option can potentially duplicate data in a table.
  • False: Ignores staged data files already loaded into the table (default).
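For reference, a Force/true pair corresponds to the FORCE copy option; a sketch with hypothetical names:

    -- Reload staged files even if they were loaded before (may duplicate rows).
    COPY INTO my_schema.my_table
      FROM @my_stage/exports/
      FILE_FORMAT = (TYPE = 'CSV')
      FORCE = TRUE;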