Skip to main content

Pentaho+ documentation has moved!

The new product documentation portal is here. Check it out now at docs.hitachivantara.com

 

Hitachi Vantara Lumada and Pentaho Documentation

Understanding PDI data types and field metadata

Parent article

This section is for users who want to maximize the efficiency of their transformation and job results.

As a best practice for producing consistent, predictable outcomes when working with your data in PDI, you must consider how the Pentaho engine processes different data types and field metadata in transformations and jobs. For example, steps like Avro Input and Text File Input, require additional considerations to best meet your working requirements for specific data types, mathematical operations, number conversions, and formatting.

NoteAs a rule, data is never modified by metadata inside of PDI. Data is only modified when PDI writes to files or similar objects, but not to databases. Refer to the sections below that apply to your use case.

Data type mappings

PDI data types map internally to Java data types, so the Java behavior of these data types applies to the associated fields, parameters, and variables used in your transformations and jobs. The following table describes these mappings.

PDI data typeJava data typeDescription
BigNumberBigDecimalAn arbitrary unlimited precision number.
BinaryByte[]An array of bytes that contain any type of binary data.
BooleanBooleanA boolean value true or false.
DateDateA date-time value with millisecond precision.
IntegerLongA signed long 64-bit integer.
Internet AddressInetAddressAn Internet Protocol (IP) address.
NumberDoubleA double precision floating point value.
StringStringA variable unlimited length text encoded in UTF-8 (Unicode).
TimestampTimestampAllows the specification of fractional seconds to a precision of nanoseconds.

Using the correct data type for math operations

Using the correct data type for math operations helps ensure expected results from your transformations and jobs. The Number, BigNumber, and Integer types offer specific solutions for different computing needs. The following table highlights the behaviors and possible uses for each data type. For information about the proper method to round or truncate numbers, see Applying calculations and rounding.

PDI data typeDescription
Number

Use for general math with real numbers:

  • Decimal precision is not guaranteed.
  • Normally precise within 15 to 16 decimal points.
  • 15.4 may be represented as 15.400000000000000000001111111
  • 15.498723528092515678989712397 may be 15.498723528092515701252…
BigNumber

Use to get exact results from math of decimal numbers:

  • Guarantees precision to about 2 billion decimal places.
  • Requires more memory than Integer or Number.
  • 15.498723528092515678989712397 will always be 15.498723528092515678989712397
Integer

Use for math without a fraction or a decimal component:

  • Handles minimum and maximum values ranging from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807

Using the fields table properties

You define properties for the fields to read or write using the fields table. The properties in the fields table determine the field-level processing options for your row data, including the metadata attributes. Some commonly used steps that include a fields table are Split Fields, Select Values, Text File Output, and Concat Fields.

When using the fields table the following definitions and processing rules apply.

NoteDepending on the transformation step or job entry, some fields tables may feature only a portion of the columns listed below.

PDI fields table

  • Name

    The name of the field.

  • Type

    The type of the field. For example, String, Date, or Number. See Data type mappings for more information.

  • Format

    Defines the format mask to use when converting the value to, or reading the value from, a string. The Format drop-down menu offers suggestions, but you can enter your own mask. Format is only used when converting a non-string data type to a string data type. Format overrides Length and Precision. See Applying formatting for formatting details.

  • Length

    Defines the length to use when converting the value to, or reading the value from, a string. The numbers before the decimal point, or a value that is longer than the maximum length, will not be truncated. Length, also called Precision in some databases, is a metadata component. PDI converts to the required metadata type when the data is resulted to a string, not during the transformation (or job) or if resulted to non-string data types. See Output type examples for a listing of string and non-string types. Length is not used when Format is specified.

  • Precision

    Defines the number of digits after the decimal point to use when converting the value to, or reading the value from, a string. The numbers before the decimal point will not be truncated. Precision, also called Scale in some databases, is a metadata component. PDI converts to the required metadata type when the data is resulted to a string, not during the transformation (or job) or if resulted to non-string data types. See Output type examples for a listing of string and non-string types. Precision is not used when Format is specified.

  • Currency

    Used in conjunction with Format to interpret numbers such as $10,000.00 or E5.000,00. If the format mask contains the Unicode currency symbol ¤ (\u00A4), then it replaces the symbol by the value in the currency column. In Pentaho, you must use the copy and paste method to apply this symbol. See Common Formats for information on valid number formats.

  • Decimal

    Represents the character that replaces the period (.) in the format mask. Only applies when converting the value to, or reading the value from, a string.

  • Group

    Represents the character that replaces the comma (,) in the format mask. Only applies when converting the value to, or reading the value from, a string.

  • Null if

    Converts the value to null if the input value matches.

    NoteThis value is case-sensitive.
  • Default

    Defaults to this value if the value is null.

  • Trim type

    Defines the type of trimming to perform on the input or the output string. Trimming removes the white space on either side of a string. Options are both, left, right, or none.

  • Repeat

    Determines how null rows are handled. If the value in this row is null, then the value from the last row where the column was not null is used.

Applying formatting

Format masks define how data returned for a field is converted to, or from, a string. For example, a field might return the value "7000", but you want to display it as "$7,000.00". To do this, you apply a format mask to the field. The original data is not truncated when using a format mask.

As shown in the table below, when Format is used with Decimal, the period (.) in the format mask is replaced with the indicated character. Alternatively, when Format is used with Group, the comma (,) in the format mask is replaced with the indicated character. See Common Formats for information on valid number formats.

Input / Output valueFormatDecimalGroup
10.0#.#.,
1,546.99#,###.##.,
1g546d99#,###.##dg
€1.546,99€#,###.00,.
$1,546.99$#,###.00.,

The following table shows that when Format, Decimal, Group, Length, and Precision are used together. Format always overrides Length and Precision.

InputFormatDecimalGroupLengthPrecisionString outputNumber output
10.0#.#.,5210.010.0
10.0.,52010.0010.0
10.01.,2110.010.01
1,546.99#,###.##.,1031,546.991546.99
1,546.990#,###.000.,01,546.9901546.99

Applying calculations and rounding

Number and date calculations performed in PDI do not apply the Format, Length, and Precision properties. For example, using the table below, A + B + B = 30.1 If you preview B, it will appear as 10.0, so you would think 10.02 + 10.0 + 10.0 = 30.02. However, because B was never converted to a string for the calculation, 10.02 + 10.04 + 10.04 = 30.1.

FieldInputFormatDecimalGroupLengthPrecision
A10.02#.0.,51
B10.04.,51

If you want to truncate a string, use the Strings cut step.

If you want to round or truncate a number, use the following Calculator step features:

  • Round function
  • Floor and Ceil functions

Alternatively, you can convert the date or number to a string in the Select Values step, which applies the formatting specified in the metadata.

Output type examples

The following table provides examples of the string and non-string output types in PDI. Note that Format, Length, Precision, Decimal, and Group apply only when reading from, or outputting to, a string.

String output type exampleNon-string output type example
PreviewTable Output when the target field is a number.
Text File OutputAvro Output when the target field is a number.
JSON OutputParquet Output when the target field is a number.
XML OutputORC Output when the target field is a number.
Table Output when the target field is a varchar.Any binary output type
Anything displayed by PDI including logs, error messages, and prompts.