Frequently Asked Questions
- Common Problems
  - Throws java.nio.charset.MalformedInputException
  - Throws java.lang.AssertionError: Expecting type Value but found Delimiter
  - Throws java.lang.OutOfMemoryError
  - Throws java.io.IOException: No space left on device
The character decoding is extremely sensitive. The whole process fails fast when any invalid character is encountered, throwing java.nio.charset.MalformedInputException. This applies to all files accessed by the system, but in practice it only affects the input file.
If this error occurs when reading the input instances file (i.e. during the count stage of the pipeline), the instances file may contain invalid characters. If the exception occurs during some other portion of the pipeline, then there is likely an internal error or bug of some sort.
Before submitting a bug report, please check that the input file is valid by cleaning it. To clean the input instances file, run it through iconv:
$ iconv -f <charset> -t <charset> -c -o <output> <input>
Replace <charset> with the character encoding of the input instances file, <input> with the path to the input instances file, and <output> with the path to the cleaned instances file.
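As a rough illustration of the cleaning step, the sketch below builds a small sample file containing one invalid byte, checks it, and then cleans it. The file names, the sample data, and the choice of UTF-8 are assumptions made for the example. It redirects standard output rather than using -o, since not every iconv build is guaranteed to accept -o.

```shell
#!/bin/sh
# Hypothetical sample data: the second line contains a stray 0xFF byte,
# which is never valid in UTF-8.
workdir=$(mktemp -d)
printf 'apple\tred\nban\377ana\tyellow\n' > "$workdir/instances.tsv"

# Without -c, iconv stops at the first invalid sequence and exits non-zero,
# so a plain conversion doubles as a validity check.
if iconv -f UTF-8 -t UTF-8 "$workdir/instances.tsv" > /dev/null 2>&1; then
    status=valid
else
    status=invalid
fi
echo "input is $status"

# With -c, invalid sequences are dropped from the output. Some iconv builds
# still exit non-zero after dropping characters, so the status is ignored.
iconv -f UTF-8 -t UTF-8 -c "$workdir/instances.tsv" \
    > "$workdir/cleaned.tsv" 2> /dev/null || true
```

Running the validity check first shows whether cleaning is needed at all, which helps when deciding whether a MalformedInputException points to the data or to a bug.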
Having cleaned the data, it is possible that empty strings will have been created, i.e. strings that were previously composed entirely of invalid characters. This will result in a different problem (oh joy), the solution to which is described in Section [java.lang.AssertionError: Expecting type Value but found Delimiter].
This is the old form of an exception that has since been somewhat demystified. Delimiter 9 is the tab character and delimiter 10 is the new-line. See the answer for "Throws java.lang.AssertionError: Expecting delimiter Tab but found New-line" below.
This means that, when parsing an input file, the end of a record was found before it should have been. This usually means the input instances file contains a line with a single entry and no features. Since a no-feature entry has no useful semantics in the thesaurus, such entries are not permitted in the input. Please adjust your feature extraction so that no-feature entries are not produced, or filter the input file before providing it to Byblo.
If the software throws the exception java.lang.AssertionError: Expecting type Value but found Delimiter during the count stage of the pipeline, there may be a problem with the instances file. Empty head or context strings will result in an error, as will strings that contain tab or new-line characters, since these are used as delimiters.
Make sure all head and context strings are non-empty, and that they do not contain the tab character. If unsure, try running the instances file through:
$ awk '/^[^\t]+[\t][^\t]+$/' <input> > <output>
Replace <input> with the path to the input instances file, and <output> with the path to the cleaned instances file.
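Before filtering, it can help to see which lines would be dropped, so that the feature extraction can be fixed at the source. The sketch below is a field-based variant of the same check (exactly two non-empty, tab-separated fields); the sample data and file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample: two valid lines, then a line with no context,
# a line with an empty head, and a line with an extra field.
workdir=$(mktemp -d)
printf 'apple\tred\nbanana\tyellow\ncherry\n\tgreen\nplum\tpurple\tround\n' \
    > "$workdir/instances.tsv"

# Report every malformed line, prefixed with its line number.
awk -F'\t' '!(NF == 2 && $1 != "" && $2 != "") { print NR ": " $0 }' \
    "$workdir/instances.tsv" > "$workdir/rejected.txt"

# Keep only lines with exactly one tab and non-empty text on both sides.
awk -F'\t' 'NF == 2 && $1 != "" && $2 != ""' \
    "$workdir/instances.tsv" > "$workdir/cleaned.tsv"

echo "kept $(($(wc -l < "$workdir/cleaned.tsv"))) lines," \
     "rejected $(($(wc -l < "$workdir/rejected.txt")))"
```

The rejected-line report is only a diagnostic aid; the cleaned file is what would be passed on as the instances file.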
If this occurs at some other stage of the pipeline then there is likely an internal error or bug. Please report the problem, with as much detail as possible, on the issue tracker.
The thesaurus build process can require a very large amount of memory, depending on the size and composition of the input instances data. While some effort has been made to ensure the software runs on commodity hardware, there are situations where it may require several gigabytes of RAM. Here are some things to try if a java.lang.OutOfMemoryError occurs.
- Java will not automatically use all available memory. Its usage is constrained by a memory limit parameter that can be configured by editing the <builddt.sh> script. In <builddt.sh> look for a line, in the constants section, that looks something like:
  readonly JAVA_ARGS="-Xmx16g -d64"
  The -Xmx argument sets the maximum memory allocation for Java. In this example it is set to 16 gigabytes. Increasing this will allow Java to allocate more, and may resolve the OutOfMemoryError exception. However, do not set this larger than the available physical RAM, or Java will use virtual memory swap space. This will cause the software to slow down dramatically, to a point where it may never complete.
- If the error occurs during the all-pairs or sort stages of the pipeline, the memory usage can be reduced by choosing a smaller chunk size. The chunk size can be configured at run time. See Sections [All-pairs-Options] and [Sorting-and-K-Nearest-Neighbours].
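To check a candidate -Xmx value against physical RAM before editing the script, something like the following sketch may be useful. The helper names xmx_to_kib and xmx_fits are hypothetical, and reading MemTotal from /proc/meminfo assumes a Linux system.

```shell
#!/bin/sh
# Convert a -Xmx style size (e.g. 16g, 512m, 2048k) to KiB.
xmx_to_kib() {
    value=${1%[kKmMgG]}      # strip a trailing unit suffix, if any
    unit=${1#"$value"}       # whatever was stripped is the unit
    case "$unit" in
        g|G) echo $((value * 1024 * 1024)) ;;
        m|M) echo $((value * 1024)) ;;
        k|K) echo "$value" ;;
        *)   echo $((value / 1024)) ;;   # no suffix means bytes
    esac
}

# Succeed when the requested heap ($1) fits in the given RAM in KiB ($2).
xmx_fits() {
    [ "$(xmx_to_kib "$1")" -le "$2" ]
}

# Linux-specific assumption: MemTotal in /proc/meminfo is reported in kB.
if [ -r /proc/meminfo ]; then
    mem_kib=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
    if xmx_fits 16g "$mem_kib"; then
        echo "-Xmx16g fits in physical RAM"
    else
        echo "-Xmx16g exceeds physical RAM ($mem_kib KiB); expect swapping"
    fi
fi
```

This only compares the heap limit with total RAM; other processes and the JVM's own non-heap memory also need room, so leaving some headroom is sensible.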
The No space left on device error occurs when there is insufficient hard-disk space available to write some data. The software requires large amounts of disk space to store intermediate files and to write results. It can easily consume terabytes of storage in a typical run, depending on data composition and parameterisation. Two locations are written to during the thesaurus build process: the output directory and the temporary directory. Ensure that space is available at both of these locations, or specify a different location:
- The output directory is where results and non-temporary intermediate files are stored. It can be specified at the command line using the -o <path> switch (see Section [General-Options]); otherwise it defaults to the parent directory of the input file.
- The temporary directory is where intra-process files are stored. These differ from the intermediate files stored in the output directory because they are only used for a short time before deletion, and because they can have no possible function after the build process has completed. By default the software will use the system-defined temporary directory, which may be on a separate device or partition to the output directory. The temporary directory can be specified at run time using the -T <path> command line switch. See [General Options].
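A quick way to see how much space is free at both locations before starting a run is sketched below. The free_kib helper is hypothetical, the output directory is assumed to be the current directory, and ${TMPDIR:-/tmp} is only a guess at the system temporary directory; substitute the paths you intend to pass with -o and -T.

```shell
#!/bin/sh
# Print the free space, in KiB, on the file system holding the given path.
# df -P gives the portable one-line-per-file-system layout; column 4 is
# the available space in 1024-byte blocks when -k is used.
free_kib() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

out_dir=.                  # assumed output directory
tmp_dir=${TMPDIR:-/tmp}    # assumed temporary directory

for dir in "$out_dir" "$tmp_dir"; do
    echo "$dir: $(free_kib "$dir") KiB free"
done
```

If both paths report the same figure they are probably on the same file system, in which case intermediate and temporary files compete for the same space.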