Understanding PDI data types and field metadata
This section is for users who want to maximize the efficiency of their transformation and job results.
As a best practice for producing consistent, predictable outcomes when working with your data in PDI, you must consider how the Pentaho engine processes different data types and field metadata in transformations and jobs. For example, steps like Avro Input, Text File Input, Catalog Input, and Catalog Output require additional considerations to best meet your working requirements for specific data types, mathematical operations, number conversions, and formatting.
Data type mappings
PDI data types map internally to Java data types, so the Java behavior of these data types applies to the associated fields, parameters, and variables used in your transformations and jobs. The following table describes these mappings.
PDI data type | Java data type | Description |
BigNumber | BigDecimal | An arbitrary unlimited precision number. |
Binary | Byte[] | An array of bytes that contain any type of binary data. |
Boolean | Boolean | A boolean value true or
false. |
Date | Date | A date-time value with millisecond precision. |
Integer | Long | A signed long 64-bit integer. |
Internet Address | InetAddress | An Internet Protocol (IP) address. |
Number | Double | A double precision floating point value. |
String | String | A variable unlimited length text encoded in UTF-8 (Unicode). |
Timestamp | Timestamp | Allows the specification of fractional seconds to a precision of nanoseconds. |
Using the correct data type for math operations
Using the correct data type for math operations helps ensure expected results from your transformations and jobs. The Number, BigNumber, and Integer types offer specific solutions for different computing needs. The following table highlights the behaviors and possible uses for each data type. For information about the proper method to round or truncate numbers, see Applying calculations and rounding.
PDI data type | Description |
Number |
Use for general math with real numbers:
|
BigNumber |
Use to get exact results from math of decimal numbers:
|
Integer |
Use for math without a fraction or a decimal component:
|
Using the fields table properties
You define properties for the fields to read or write using the fields table. The properties in the fields table determine the field-level processing options for your row data, including the metadata attributes. Some commonly used steps that include a fields table are Split Fields, Select Values, Text File Output, and Concat Fields.
When using the fields table the following definitions and processing rules apply.
- Name
The name of the field.
- Type
The type of the field. For example, String, Date, or Number. See Data type mappings for more information.
- Format
Defines the format mask to use when converting the value to, or reading the value from, a string. The Format drop-down menu offers suggestions, but you can enter your own mask. Format is only used when converting a non-string data type to a string data type. Format overrides Length and Precision. See Applying formatting for formatting details.
- Length
Defines the length to use when converting the value to, or reading the value from, a string. The numbers before the decimal point, or a value that is longer than the maximum length, will not be truncated. Length, also called Precision in some databases, is a metadata component. PDI converts to the required metadata type when the data is resulted to a string, not during the transformation (or job) or if resulted to non-string data types. See Output type examples for a listing of string and non-string types. Length is not used when Format is specified.
- Precision
Defines the number of digits after the decimal point to use when converting the value to, or reading the value from, a string. The numbers before the decimal point will not be truncated. Precision, also called Scale in some databases, is a metadata component. PDI converts to the required metadata type when the data is resulted to a string, not during the transformation (or job) or if resulted to non-string data types. See Output type examples for a listing of string and non-string types. Precision is not used when Format is specified.
- Currency
Used in conjunction with Format to interpret numbers such as $10,000.00 or E5.000,00. If the format mask contains the Unicode currency symbol ¤ (
\u00A4
), then it replaces the symbol by the value in the currency column. In Pentaho, you must use the copy and paste method to apply this symbol. See Common Formats for information on valid number formats. - Decimal
Represents the character that replaces the period (.) in the format mask. Only applies when converting the value to, or reading the value from, a string.
- Group
Represents the character that replaces the comma (,) in the format mask. Only applies when converting the value to, or reading the value from, a string.
- Null if
Converts the value to null if the input value matches.
NoteThis value is case-sensitive. - Default
Defaults to this value if the value is null.
- Trim type
Defines the type of trimming to perform on the input or the output string. Trimming removes the white space on either side of a string. Options are both, left, right, or none.
- Repeat
Determines how null rows are handled. If the value in this row is null, then the value from the last row where the column was not null is used.
Applying formatting
Format masks define how data returned for a field is converted to, or from, a string. For example, a field might return the value "7000", but you want to display it as "$7,000.00". To do this, you apply a format mask to the field. The original data is not truncated when using a format mask.
As shown in the table below, when Format is used with Decimal, the period (.) in the format mask is replaced with the indicated character. Alternatively, when Format is used with Group, the comma (,) in the format mask is replaced with the indicated character. See Common Formats for information on valid number formats.
Input / Output value | Format | Decimal | Group |
10.0 | #.# | . | , |
1,546.99 | #,###.## | . | , |
1g546d99 | #,###.## | d | g |
€1.546,99 | €#,###.00 | , | . |
$1,546.99 | $#,###.00 | . | , |
The following table shows that when Format, Decimal, Group, Length, and Precision are used together. Format always overrides Length and Precision.
Input | Format | Decimal | Group | Length | Precision | String output | Number output |
10.0 | #.# | . | , | 5 | 2 | 10.0 | 10.0 |
10.0 | . | , | 5 | 2 | 010.00 | 10.0 | |
10.01 | . | , | 2 | 1 | 10.0 | 10.01 | |
1,546.99 | #,###.## | . | , | 10 | 3 | 1,546.99 | 1546.99 |
1,546.99 | 0#,###.000 | . | , | 01,546.990 | 1546.99 |
Applying calculations and rounding
Number and date calculations performed in PDI do not apply the Format, Length, and Precision properties. For example, using the table below, A + B + B = 30.1 If you preview B, it will appear as 10.0, so you would think 10.02 + 10.0 + 10.0 = 30.02. However, because B was never converted to a string for the calculation, 10.02 + 10.04 + 10.04 = 30.1.
Field | Input | Format | Decimal | Group | Length | Precision |
A | 10.02 | #.0 | . | , | 5 | 1 |
B | 10.04 | . | , | 5 | 1 |
If you want to truncate a string, use the Strings cut step.
If you want to round or truncate a number, use the following Calculator step features:
- Round function
- Floor and Ceil functions
Alternatively, you can convert the date or number to a string in the Select Values step, which applies the formatting specified in the metadata.
Output type examples
The following table provides examples of the string and non-string output types in PDI. Note that Format, Length, Precision, Decimal, and Group apply only when reading from, or outputting to, a string.
String output type example | Non-string output type example |
Preview | Table Output when the target field is a number. |
Text File Output | Avro Output when the target field is a number. |
JSON Output | Parquet Output when the target field is a number. |
XML Output | ORC Output when the target field is a number. |
Table Output when the target field is a varchar. | Any binary output type |
Anything displayed by PDI including logs, error messages, and prompts. |