Cassandra Output
The Cassandra Output step writes data to a column family (table) of an Apache Cassandra database using CQL (Cassandra Query Language).
General
Enter the following information in the transformation step field.
- Step Name: Specifies the unique name of the Cassandra Output step on the canvas. You can customize the name or leave it as the default.
Options
The Cassandra Output transformation step features several tabs with fields. Each tab is described below.
Connection Tab
This tab includes the following options:
Option | Description |
---|---|
Cassandra host | Specifies the host name for the connection to the Cassandra server. |
Cassandra port | Specifies the port number for the connection to the Cassandra server. |
Socket timeout | Sets an optional connection timeout period, specified in milliseconds. |
Username | Specifies the username used to authenticate against the target keyspace and/or column family (table). |
Password | Specifies the password used to authenticate against the target keyspace and/or column family (table). |
Keyspace | Specifies the keyspace (database) name. |
Write Options Tab
The Cassandra Output step provides a number of options that control what data is written to the target Cassandra keyspace (database) and how it is written. This tab includes the following options:
Option | Description |
---|---|
Column family (table) to write to | Specifies the column family (table) to which the incoming rows are written. |
Get column family names | Populates Column family (table) to write to with the names of all the column families that exist in the specified keyspace. |
Consistency level | Specifies an explicit write consistency level. The Cassandra default is ONE. |
Commit batch size | Specifies the number of rows to send with each commit. |
Batch insert timeout | Specifies the number of milliseconds to wait for a batch to completely insert before splitting it into smaller sub-batches. You must specify a value lower than Socket timeout, or leave the field empty for no timeout. |
Sub batch size | Specifies the sub-batch size (in number of rows) to use if the batch must be split because Batch insert timeout is reached. |
Insert unlogged batches | Select for non-atomic batch writing. |
Time to live (TTL) | Specifies how long a written column is retained. When the time expires, that column is deleted. |
Incoming field to use as the key | Indicates which incoming field to use as the key. The list is populated with the names of the incoming transformation fields. |
Get fields | Populates Incoming field to use as the key with the list of fields from the incoming PDI stream. |
Use CQL version 3 | Uses CQL version 3 for queries. |
Use Thrift I/O | Uses Thrift I/O. |
Show schema | Opens a dialog box that shows metadata for the column family specified in Column family (table) to write to. |
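The relationship between Commit batch size and Sub batch size can be illustrated with a minimal Python sketch. It is not the step's actual implementation; the function name `split_into_sub_batches` and the row dictionaries are hypothetical, and the sketch only shows how a commit batch that exceeded Batch insert timeout might be divided into sub-batches of at most Sub batch size rows.

```python
from typing import List, Sequence


def split_into_sub_batches(rows: Sequence[dict], sub_batch_size: int) -> List[List[dict]]:
    """Split one commit batch into consecutive sub-batches of at most sub_batch_size rows."""
    if sub_batch_size < 1:
        raise ValueError("sub_batch_size must be at least 1")
    return [list(rows[i:i + sub_batch_size]) for i in range(0, len(rows), sub_batch_size)]


# A hypothetical commit batch of 5 rows, split using a sub-batch size of 2.
batch = [{"id": n} for n in range(5)]
sub_batches = split_into_sub_batches(batch, 2)
print([len(b) for b in sub_batches])  # [2, 2, 1]
```

Row order is preserved, and the last sub-batch holds whatever rows remain.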
Important: Cassandra Output does not check the types of incoming columns against matching columns in the Cassandra metadata. Incoming values are formatted into appropriate string values for use in a textual CQL INSERT statement according to PDI's field metadata. If the resulting values cannot be parsed by the Cassandra column validator for a particular column, an error occurs.
Cassandra Output converts PDI's dense row format into sparse data by ignoring incoming field values that are null.
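The dense-to-sparse conversion described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code the step runs: the helper names `format_value` and `build_sparse_insert`, the table name `users`, and the simplified literal formatting (booleans, numbers, and strings only) are all hypothetical.

```python
from typing import Any, Dict


def format_value(value: Any) -> str:
    """Render a Python value as a CQL literal (simplified: only a few types handled)."""
    # bool must be tested before int, because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    # Treat everything else as text: single-quote it and double embedded quotes.
    return "'" + str(value).replace("'", "''") + "'"


def build_sparse_insert(table: str, row: Dict[str, Any]) -> str:
    """Build a textual INSERT statement that omits fields whose value is None."""
    present = {name: value for name, value in row.items() if value is not None}
    columns = ", ".join(present)
    values = ", ".join(format_value(v) for v in present.values())
    return f"INSERT INTO {table} ({columns}) VALUES ({values});"


# The null "email" field is dropped, so no column is written for it.
row = {"id": 42, "name": "O'Brien", "email": None, "active": True}
statement = build_sparse_insert("users", row)
print(statement)
# INSERT INTO users (id, name, active) VALUES (42, 'O''Brien', true);
```

Because the values are rendered as text without checking the target column types, a value that the column validator cannot parse would only fail when Cassandra receives the statement.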
Schema Options Tab
This tab includes the following options:
Option | Description |
---|---|
Host for schema updates | Specifies the host name for the connection to the Cassandra schema. |
Port for schema updates | Specifies the port number for the connection to the Cassandra schema. |
Create column family | Creates the named column family if one does not already exist. |
Table creation WITH clause | Specifies additions to the table creation WITH clause. |
Truncate column family | Indicates whether any existing data should be deleted from the named column family before inserting incoming rows. |
Update column family meta data | Updates the column family metadata with information on incoming fields not already present. If this option is not selected, any unknown incoming fields are ignored unless the Insert fields not in column meta data option is selected. |
Insert fields not in column meta data | Inserts any incoming fields that are not present in the column family metadata, with respect to the default column family validator. This option has no effect if Update column family meta data is selected. |
Use compression | Compresses (gzip) the text of each BATCH INSERT statement before transmitting it to the node. |
CQL to execute before inserting first row | Specifies any CQL statements to execute before inserting the first row. |
Update Column Family Metadata
Selecting the Update column family meta data option updates the column family metadata with information on incoming fields not already present. If this option is not selected, any unknown incoming fields are ignored unless Insert fields not in column meta data is selected. If the latter is selected, any incoming fields that are not present in the column family metadata are inserted with respect to the default column family validator. The Insert fields not in column meta data option has no effect if Update column family meta data is selected.
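The interaction between the two options can be summarized with a small Python sketch. The function names and field names are hypothetical and the logic only mirrors the rules stated above; it is not taken from the step's source code.

```python
from typing import List, Sequence, Set


def fields_to_write(incoming: Sequence[str], known: Set[str],
                    update_metadata: bool, insert_unknown: bool) -> List[str]:
    """Return the incoming fields that are written, given the two option settings."""
    if update_metadata or insert_unknown:
        # Unknown fields are kept: either the metadata is updated to include them,
        # or they are inserted under the default column family validator.
        return list(incoming)
    # Neither option selected: unknown incoming fields are ignored.
    return [name for name in incoming if name in known]


def fields_added_to_metadata(incoming: Sequence[str], known: Set[str],
                             update_metadata: bool) -> List[str]:
    """Return the fields added to the column family metadata (only when updating)."""
    if not update_metadata:
        return []
    return [name for name in incoming if name not in known]


incoming = ["id", "name", "signup_date"]
known = {"id", "name"}

print(fields_to_write(incoming, known, False, False))    # ['id', 'name']
print(fields_to_write(incoming, known, False, True))     # ['id', 'name', 'signup_date']
print(fields_added_to_metadata(incoming, known, True))   # ['signup_date']
print(fields_added_to_metadata(incoming, known, False))  # []
```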
Pre-Insert CQL
You have the option of executing an arbitrary set of CQL statements before the first incoming PDI row is inserted. This is useful for creating or dropping secondary indexes on columns. Clicking CQL to execute before inserting first row opens a CQL editor. You can enter multiple CQL statements as long as each is terminated by a semicolon.
Pre-insert CQL statements are executed after any column family metadata updates for new incoming fields, and before the first row is inserted. This allows indexes to be created for columns corresponding to new incoming fields.
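As an illustration of semicolon-terminated pre-insert statements, the Python sketch below holds a short CQL script that drops and recreates a secondary index, and splits it into individual statements. The table name `users`, the column `email`, the index name `users_email_idx`, and the helper `split_cql_statements` are hypothetical; the splitting is deliberately naive and assumes no semicolons appear inside string literals.

```python
from typing import List


def split_cql_statements(script: str) -> List[str]:
    """Split a CQL script on semicolons (naive: ignores semicolons inside literals)."""
    return [part.strip() + ";" for part in script.split(";") if part.strip()]


# Hypothetical pre-insert script: each statement ends with a semicolon.
pre_insert_cql = """
DROP INDEX IF EXISTS users_email_idx;
CREATE INDEX users_email_idx ON users (email);
"""

statements = split_cql_statements(pre_insert_cql)
for statement in statements:
    print(statement)
```

In this sketch the statements keep the order in which they were entered, so the index is dropped before it is recreated.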
Metadata Injection Support
All fields of this step support metadata injection. You can use this step with ETL Metadata Injection to pass metadata to your transformation at runtime.