Data Integration Issues
Follow the suggestions in these topics to help resolve common issues associated with Pentaho Data Integration:
- Troubleshooting Transformation Steps and Job Entries
- Troubleshooting Database Connections
- Jobs Scheduled on Pentaho Server Cannot Execute Transformation on Remote Carte Server
Cannot Run a Job in a Repository on a Carte Instance from another Job
- Troubleshoot Pentaho Data Service Issues
- Kitchen and Pan Cannot Read Files from a ZIP Export
- Using ODBC
- Improving Performance when Writing Multiple Files
See Pentaho Troubleshooting articles for additional topics.
Troubleshooting Transformation Steps and Job Entries
If you have issues with transformation steps or job entries, check to see if these conditions apply to your situation.
'Missing Plugins' Error When a Transformation or Job is Opened
If you try to open a transformation or job that contains a step or entry not installed in your copy of the PDI client, the following error message appears:
To resolve this issue, either add the step or entry in the PDI client or find it in the marketplace and install it. See the Install a Plugin with the Marketplace article for details.
Cannot Execute or Modify a Transformation or Job
If you cannot do any of the standard transformation or job tasks (such as run, preview, debug, replay, verify, schedule, copy, export, or save), you do not have the Execute permission.
Have an administrative user check to see if your role has been granted the Execute permission.
Step is Already on Canvas Error
If a step is hidden on the canvas in the PDI client, you may receive a "Step is Already on Canvas" error message when trying to work with its transformation.
Regardless of whether a step is hidden or displayed in the canvas, it is listed in the Steps folder under your transformation in the View tab of the Data Integration perspective. To redisplay the hidden step on the canvas, drag it from the Steps folder in the View tab to the canvas.
Troubleshooting Database Connections
If you have issues with database connection in PDI, check to see if any the following conditions apply to your situation:
It may be possible to read from unsupported databases by using the generic database driver through an ODBC or JDBC connection. Contact Pentaho if you want to access a database type that is not yet in our list of supported components.
You can add or replace database driver files in the lib directory located under ...\design-tools\data-integration.
Database Locks when Reading and Updating from a Single Table
If you create, read, and update steps to or from a single table within a transformation, you will may experience database locking or slowed processing speeds. Reading and updating rows on a table within a single transformation can cause the database to stop updating.
For example, if you have a step which reads from a row within a table (a Table Input step) and you need to update the transformation with the Update step, this could cause locking issues, especially with MS SQL databases. Reading and updating rows in the same transformation in the same table should be avoided.
A general solution compatible with all databases is to duplicate the table to be read/updated, and then create separate read/update steps. Arrange the steps to be executed sequentially within the transformation, each on a different, yet identical, version of the same table. Adjusting database row locking parameters or mechanisms will also resolve this issue.
Force PDI to Use DATE instead of TIMESTAMP in Parameterized SQL Queries
If your query optimizer is incorrectly using the predicate TIMESTAMP, it is because the JDBC driver/database converts the data type from a TIMESTAMP to a DATE. In certain circumstances, this casting prevents the query optimizer of the database from using the correct index. For example, Oracle might state that it cannot use the index, and generates the following error message:
The predicate DATE used at line ID 1 of the execution plan contains an implicit data type conversion on indexed column DATE. This implicit data type conversion prevents the optimizer from selecting indices on table A.
To resolve this issue, use a
Select Values step and set
DATE. These changes force the parameter to be set as a DATE instead of a TIMESTAMP.
PDI Does Not Recognize Changes Made to a Table
If you edit the table layout outside of the PDI client, PDI is not aware of any field changes, deletions, or additions.
Clearing the cache addresses this issue. The cache needs to be cleared of database-related meta information (field names and their types in each used database table). PDI has this cache to increase processing speed. Perform the following steps to clear this information in the cache from within the PDI client:
- Select the connection.
- Select Tools > Database > Clear Cache or Database connections > Clear complete DB cache.
Jobs Scheduled on Pentaho Server Cannot Execute Transformation on Remote Carte Server
You may see an error similar to the following message when trying to schedule a job to run on a remote Carte server:
ERROR 11-05 09:33:06,031 - !UserRoleListDelegate.ERROR_0001_UNABLE_TO_INITIALIZE_USER_ROLE_LIST_WEBSVC! com.sun.xml.ws.client.ClientTransportException: The server sent HTTP status code 401: Unauthorized
You need to make the following configuration changes to remotely execute scheduled jobs:
This process is also required for using the Pentaho Server as a load balancer in a dynamic Carte cluster.
- Stop the Pentaho and Carte servers.
- Copy the
repositories.xmlfile from the .kettle folder on your workstation to the same location on your Carte slave. Without this file, the Carte slave will be unable to connect to the Pentaho Repository to retrieve PDI content.
- Open the pentaho/server/pentaho-server/tomcat/webapps/pentaho/WEB-INF/web.xml file in a text editor.
- Find the
Proxy Trusting Filtersection, then add your Carte server's IP address to the
param-valueelement. The following code block is a sample of the
Proxy Trusting Filtersection:
<filter> <filter-name>Proxy Trusting Filter</filter-name> <filter-class>org.pentaho.platform.web.http.filters.ProxyTrustingFilter</filter-class> <init-param> <param-name>TrustedIpAddrs</param-name> <param-value>127.0.0.1,192.168.0.1</param-value> <description>Comma separated list of IP addresses of a trusted hosts.</description> </init-param> <init-param> <param-name>NewSessionPerRequest</param-name> <param-value>true</param-value> <description>true to never re-use an existing IPentahoSession in the HTTP session; needs to be true to work around code put in for BISERVER-2639</description> </init-param> </filter>
- Find the following code snippet in the web.xml file:
<filter-mapping> <filter-name>Proxy Trusting Filter</filter-name> <url-pattern>i18n</url-pattern> </filter-mapping>
- After the code snippet above, add the following text in your file:
<!-- begin trust --> <filter-mapping> <filter-name>Proxy Trusting Filter</filter-name> <url-pattern>/webservices/authorizationPolicy</url-pattern> </filter-mapping> <filter-mapping> <filter-name>Proxy Trusting Filter</filter-name> <url-pattern>/webservices/roleBindingDao</url-pattern> </filter-mapping> <filter-mapping> <filter-name>Proxy Trusting Filter</filter-name> <url-pattern>/webservices/userRoleListService</url-pattern> </filter-mapping> <filter-mapping> <filter-name>Proxy Trusting Filter</filter-name> <url-pattern>/webservices/unifiedRepository</url-pattern> </filter-mapping> <filter-mapping> <filter-name>Proxy Trusting Filter</filter-name> <url-pattern>/webservices/userRoleService</url-pattern> </filter-mapping> <filter-mapping> <filter-name>Proxy Trusting Filter</filter-name> <url-pattern>/webservices/Scheduler</url-pattern> </filter-mapping> <filter-mapping> <filter-name>Proxy Trusting Filter</filter-name> <url-pattern>/webservices/repositorySync</url-pattern> </filter-mapping> <filter-mapping> <filter-name>Proxy Trusting Filter</filter-name> <url-pattern>/api/session/userName</url-pattern> </filter-mapping> <filter-mapping> <filter-name>Proxy Trusting Filter</filter-name> <url-pattern>/api/system/authentication-provider</url-pattern> </filter-mapping> <!-- end trust -->
- Save and close the file.
- Edit the
carte.sh(Linux) startup script on the machine that runs your Carte server.
- Set an OPT variable to
-Dpentaho.repository.client.attemptTrust=true, as shown in the following sample line of code:
SET OPT=%OPT% "-Dpentaho.repository.client.attemptTrust=true"
- Save and close the file, then start your Carte and Pentaho servers.
You can now schedule a job to run on a remote Carte instance.
Cannot Run a Job in a Repository on a Carte Instance from another Job
If you run a Job on a Carte instance and the repository information does not seem available, and further jobs and transformations cannot be loaded after the job runs a child job, then you may see an error similar to the following message:
RepositoriesMeta - Reading repositories XML file: <HOME>/.kettle/repositories.xml General - I couldn't find the repository with name 'singleDiServerInstance' (...) Secondary - Could not execute job specified in a repository since we're not connected to one (...)
- Using an editor, open the
<HOME>/.kettle/repositories.xmlfile on the server where the Carte instance is located.
- Add a new
<repository>element to it or append the current content with the name
singleDiServerInstance, as shown in the following code block:
<repository> <id>PentahoEnterpriseRepository</id> <name>singleDiServerInstance</name> <description/> <is_default>false</is_default> <repository_location_url>http://localhost:8080/pentaho</repository_location_url> <version_comment_mandatory>N</version_comment_mandatory> </repository>
- Save and close the file.
Troubleshoot Pentaho Data Service Issues
You can list the services running on the Pentaho Server or Carte Server using the
listService command, as shown in the following example URL:
If the data service appears in the list, the Pentaho Server is set up correctly. Check your network and client setup.
If it does not appear in the list, try the following tasks:
- If you stored the transformation in the Pentaho Repository, check the status using this command: http://localhost:8080/pentaho/kettle/status. Make sure the repository name appears in the configuration details section. If the repository name is missing or an error appears, check the configuration of the
slave-server-configand accessibility of the
- If you did not store the transformation in the Pentaho Repository, but instead stored it on the local file system, make sure the service name is listed and accessible from the user account that runs the Pentaho Server in <users home>/.pentaho/metastore/pentaho/Kettle Data Service.
If the service name (the table to query) is still missing, reopen the service transformation and check the data service in the transformation settings.
Kitchen and Pan Cannot Read Files from a ZIP Export
If you are trying to read either transformations from Pan or jobs from Kitchen from a ZIP export but are getting errors, you may have a syntax error in your Pan or Kitchen command.
ZIP files must be prefaced by an exclamation point (!) character. On Linux and other Unix-like operating systems, you must escape the exclamation point with a backslash (\!) as in the following Kitchen example:
Kitchen.bat /file:"zip:file:///C:/Pentaho/PDI Examples/Sandbox/linked_executable_job_and_transform.zip!Hourly_Stats_Job_Unix.kjb"
Although ODBC can be used to connect to a JDBC compliant database, Pentaho does not recommend using it and it is not supported.
For details, this article explains "Why you should avoid ODBC." http://wiki.pentaho.com/pages/viewpage.action?pageId=14850644.
Improving Performance when Writing Multiple Files
You can use the two following Kettle properties to tune how the Text Output step and the Hadoop Output step work in a transformation.
- KETTLE_FILE_OUTPUT_MAX_STREAM_COUNT - This property limits the number of files that can be opened simultaneously by the step. The property only works when the "Accept file name from field?" option (in the File tab) is selected. In this scenario, each row coming into the step has a column containing the file name to which the remaining data in the row is to be written. If each row has a different file name, then the step will open a new file for each row. Once data is written to the file referenced in the row, the file is left open in case the next row references the same file. Leaving the file open prevents the step from aborting. If the transformation is processing many rows and each row has a different file name, then many files will be opened and there is a possibility of exhausting the number of open files allowed by the OS. To avoid this problem, use KETTLE_FILE_OUTPUT_MAX_STREAM_COUNT to define the maximum number of files the step can have open. Once the step reaches the maximum number of open files, it closes one of the open files before opening another. If the file that was closed is referenced in a later row, a different file will be closed and the referenced file will be reopened for append. Setting this parameter to 0 allows for an unlimited number of open files. The default is 1024.
- KETTLE_FILE_OUTPUT_MAX_STREAM_LIFE - This property defines the maximum number of milliseconds between flushes of all files opened by the step. The property is most useful when the "Accept file name from field?" option (in the File tab) is selected. Since the output written to each file is buffered, if many small files are opened and written to, there is a possibility that a large amount of buffered data could exhaust the system memory. Setting this property forces the buffered file data to be periodically flushed, reducing the amount of system memory used by the step. Setting this value to 0 prevents the file streams from being periodically flushed. The default value is 0.
To edit the Kettle properties file, follow the steps in Set Kettle Variables in the PDI Client.