If the DI Repository becomes corrupt, it will be unresponsive, content may be missing or inaccessible, and an error message similar to this will appear in the /data-integration-server/tomcat/logs/catalina.out log file:
ERROR [ConnectionRecoveryManager] could not execute statement, reason: File corrupted while reading record: "page[48970] data leaf table:8 entries:1 parent:49157 keys:[118547] offsets:[737]". Possible solution: use the recovery tool [90030-131], state/code: 90030/90030
If this happens, shut down the DI Server and restore your solution repository from a recent backup.
If you do not have a viable backup, you may be able to minimize data loss by identifying the exact file that is corrupt. To do this, enable debug logging by adding the following XML snippet above the <root> element in the /WEB-INF/classes/log4j.xml file inside your deployed pentaho.war:
<category name="org.pentaho.platform">
<priority value="DEBUG"/>
</category>
Restart the DI Server and retry the action that caused the original error. If the error occurs again, shut down the DI Server and open the catalina.out log file in Tomcat. Look for the last line that appears before the error; it usually contains the name of the file that has been damaged, for example:
reading file with id 'xyz' and path '/public/a.txt'
When you are finished investigating the data corruption, remove the extra logging configuration so that your DI Server log files don't become large and unmanageable.
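Rather than scrolling through the log by hand, you can search it directly. As a sketch (the log path below matches the location given earlier; adjust it to your installation), this shows the five lines preceding each corruption error, which usually include the damaged file's name:

grep -B 5 "ConnectionRecoveryManager" data-integration-server/tomcat/logs/catalina.out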
You can also try to recover your PDI data from the damaged database by using a recovery program, as explained in Using the H2 Database Recovery Tool.
Note: If the database has become corrupt, the damaged rows will not be exported with the recovery tool, and whatever information they contained will be lost.
Using the H2 Database Recovery Tool
The DI Server includes a third-party H2 database recovery tool that enables you to extract raw data from your DI Repository. This is primarily useful in situations where the repository has become corrupt, and you don't have any relevant backups to restore from.
Note: If the database has become corrupt, the corrupted rows will not be exported. Any information contained in corrupted rows is unrecoverable through this method.
The recovery tool is a JAR that must be run via the Java command. The output is a SQL dump that you can then attempt to import after you've re-initialized your DI Server database.
You can read more about the recovery tool on the H2 Web site: http://www.h2database.com/html/advanced.html#using_recover_tool.
Follow the directions below to use the recovery tool.
- Open a terminal on (or establish an SSH connection to) your DI Server.
- Navigate to the /pentaho-solutions/system/jackrabbit/repository/version/ directory.
cd data-integration-server/pentaho-solutions/system/jackrabbit/repository/version/
- Run the h2-1.2.131.jar H2 database JAR with the recovery tool option. This writes an SQL dump of the recoverable data, named db.h2.sql, to the current directory.
java -cp h2-1.2.131.jar org.h2.tools.Recover
- Create a directory to move your old database files to.
mkdir old
- Move the old database files to the directory you just created.
mv db.h2.db db.trace.db old
- Re-initialize the repository with the RunScript option, using the salvaged SQL dump as the source.
java -cp h2-1.2.131.jar org.h2.tools.RunScript -url jdbc:h2:./db -user sa -script db.h2.sql
- The backup directory you created earlier (old in the above example) can be removed once you are certain that you no longer need the corrupted database files. However, it is safer to keep it around in case you need to attempt recovery again later.
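Assuming the same file and JAR names used in the steps above, and that you are already in the version/ directory, the whole recovery sequence can be sketched as:

# Run from .../pentaho-solutions/system/jackrabbit/repository/version/
java -cp h2-1.2.131.jar org.h2.tools.Recover        # writes db.h2.sql
mkdir old
mv db.h2.db db.trace.db old                         # set the corrupt files aside
java -cp h2-1.2.131.jar org.h2.tools.RunScript \
  -url jdbc:h2:./db -user sa -script db.h2.sql      # rebuild the repository from the dump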
You've successfully extracted all of the usable data from your corrupt solution repository, removed the damaged database files, and re-initialized the repository with the salvaged data.
Start the DI Server and check for further errors. If repository errors persist, contact Pentaho support and request developer assistance.
Unable to Use the Database Init Scripts for PostgreSQL
The pg_hba.conf file contains host-based authentication information. If you can't run the SQL scripts that generate the Jackrabbit and Quartz databases, it's probably because the default user accounts for each database don't have the right permissions. To fix this, edit pg_hba.conf to ensure that the local users created by the Pentaho SQL scripts (such as pentaho_user) are able to connect. The default on Debian-based systems is for local connections to use ident authentication, which means that database users must have matching local operating-system accounts. In other words, to continue using ident, you would have to create a local pentaho_user account. It's easier to change the authentication method to something less restrictive, if your IT policy allows that approach.
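As an illustration only (the exact lines and file location depend on your PostgreSQL version and distribution), switching local connections from ident to password-based md5 authentication in pg_hba.conf would look something like this; reload PostgreSQL after editing the file:

# TYPE  DATABASE  USER  METHOD
# Before (Debian default): database users must match OS accounts
# local   all     all   ident
# After: password-based authentication for local connections
local   all       all   md5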