
Training sets structure

Jean-Baptiste Lugagne edited this page Aug 16, 2018 · 5 revisions

This page describes how the training_set variable is structured. It contains all the parameters necessary for training an SVM or a Random Forest classifier. With this description, one can evaluate different parameter settings and try to optimize their results.

Main structure

For more details about these parameters, see the Training set construction GUI page. Parameter names correspond closely between the training set structure and the GUI.

This is the tree structure of a training set right before training. See below for the differences with a training set saved on disk.

training_set: structure. Contains all manually selected pixels and training parameters.

  • training_set.trainingpx: structure. All fields are stack-specific data (most notably manually classified pixels)
    • training_set.trainingpx.[STACK_NAME]: structure. One such structure is created for each stack loaded through the GUI.
      • training_set.trainingpx.[STACK_NAME].stackvarname: string. Name of the stack variable in the MAT file (usually 'stack').
      • training_set.trainingpx.[STACK_NAME].mfile: MatFile. MatFile I/O object to the stack file. (Reloaded from path if accessing it throws an error.)
      • training_set.trainingpx.[STACK_NAME].pixel: Structure. Selected z-pixels for each class.
        • training_set.trainingpx.[STACK_NAME].pixel.[CLASS_NAME]: Vector of pixel indexes. Selected pixels in the image for class "CLASS_NAME". May not exist if no pixels of this class were selected.
      • training_set.trainingpx.[STACK_NAME].path: String. Path to Stack MAT file.
      • training_set.trainingpx.[STACK_NAME].nbframes: Scalar integer. Number of frames in the stack.
      • training_set.trainingpx.[STACK_NAME].currentframe_nb: Scalar integer. Number of current frame displayed (GUI-specific)
      • training_set.trainingpx.[STACK_NAME].currentframe: Grayscale Image. Current frame displayed (GUI-specific).
      • training_set.trainingpx.[STACK_NAME].currentframeRGB: RGB Image. Current frame displayed (GUI-specific).
      • training_set.trainingpx.[STACK_NAME].preloaded_fr: Normally empty (GUI-specific).
  • training_set.rgbmap: Nx3 double array. Colors for all classes in the training set. (Same order as classnames).
  • training_set.classnames: N cell of strings. Names of all classes in the training set. (Same order as rgbmap).
  • training_set.hierarchy: Structure. Describes the hierarchy between the classes in the training set.
    • training_set.hierarchy.[CLASS_NAME]: String. Name of the parent class for class "CLASS_NAME". Classes without parent are top-level classes.
  • training_set.parameters: Structure. Details training parameters.
    • training_set.parameters.maxmemUse: Scalar double. GB of memory to use for training (SVMs).
    • training_set.parameters.frames_subselection: Structure. Parameters for subsampling of frames in the stack. (see frames subselection)
      • training_set.parameters.frames_subselection.type: String. Selects subselection type. Valid values are 'all', 'log', 'lin', and 'custom'.
      • training_set.parameters.frames_subselection.nbframes_linlog: Scalar integer. Number of frames to keep (in case of linear or log subselection).
      • training_set.parameters.frames_subselection.custom_set: Vector of integers. Indexes of the frames to use in the stack. (In case of custom subselection).
      • training_set.parameters.frames_subselection.frames: Vector of integers. Processed subselection (as produced by the function "process_framessubselection" in /utilities/).
    • training_set.parameters.focus_shifting: Structure. Parameters for focus shifting (see focus shifting)
      • training_set.parameters.focus_shifting.status: Boolean. Turns focus shifting on/off.
      • training_set.parameters.focus_shifting.radius: Scalar integer. Focusing radius (how many frames to shift up and down).
    • training_set.parameters.parallel_processing: Structure. Parameters for parallel processing. (see parallel processing).
      • training_set.parameters.parallel_processing.status: Boolean. Turns parallel processing on/off.
      • training_set.parameters.parallel_processing.nbWorkers: Scalar integer. Number of parallel workers to use in the cluster.
      • training_set.parameters.parallel_processing.cluster_profile: String. Name of the cluster profile to use for parallel processing.
    • training_set.parameters.class_specific: Structure. Class-specific parameters. (see class specific parameters)
      • training_set.parameters.class_specific.default: structure. Default class parameters.
        • training_set.parameters.class_specific.default.subsample: Scalar double. Percentage of z-pixels to keep for training. (see Subsampling)
        • training_set.parameters.class_specific.default.SVM: Structure. (See below for this part of the tree.)
        • training_set.parameters.class_specific.default.OptimizeSVM: Structure. (See below for this part of the tree.)
      • training_set.parameters.class_specific.spec: Structure. Class-specific parameters.
        • training_set.parameters.class_specific.spec.[CLASS_NAME]: Structure. Same layout as training_set.parameters.class_specific.default, but it only contains the fields whose values for class "CLASS_NAME" differ from the defaults. For all other parameters, the values in training_set.parameters.class_specific.default are used.
    • training_set.parameters.feature_extraction: Structure. Parameters for Principal Component Analysis (see feature extraction)
      • training_set.parameters.feature_extraction.nbcomponents: Scalar integer. Number of components to keep after PCA.
      • training_set.parameters.feature_extraction.subsampling: Scalar integer. Percentage of datapoints to run PCA on. This prevents memory issues for large datasets.
    • training_set.parameters.nbframes: Scalar integer. Number of frames in the stacks in the dataset.
    • training_set.parameters.frame_processing: Cell of structures. Contains all the preprocessing operations to perform on each frame in the stack upon loading. (see preprocessing functions and the preprocessing function in /utilities/)
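
The default/spec split above means the parameters actually used for a class are the defaults with that class's overrides laid on top. A minimal sketch of that lookup follows. The helper name effective_class_params is hypothetical, and the recursive merge assumes that overrides may be partial nested structures (for example only spec.[CLASS_NAME].SVM.BoxConstraint); the toolbox's own code may resolve this differently.

```matlab
function p = effective_class_params(training_set, classname)
% Hypothetical helper (not part of the toolbox): returns the default
% class parameters with the overrides for "classname" applied on top.
cs = training_set.parameters.class_specific;
p = cs.default;
if isfield(cs, 'spec') && isfield(cs.spec, classname)
    p = overlay(p, cs.spec.(classname));
end
end

function base = overlay(base, override)
% Copy every field of "override" into "base". Nested structures are
% merged field by field (assumption: overrides can be partial).
names = fieldnames(override);
for k = 1:numel(names)
    f = names{k};
    if isfield(base, f) && isstruct(base.(f)) && isstruct(override.(f))
        base.(f) = overlay(base.(f), override.(f));
    else
        base.(f) = override.(f);
    end
end
end
```

For a class with no entry under spec, the function simply returns a copy of the default parameters.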

training_set.parameters.class_specific.default.SVM

SVM branch

For more details on these parameters, see SVM parameters and Matlab's fitcsvm doc page.

  • training_set.parameters.class_specific.default.SVM.DeltaGradientTolerance: Scalar double. Tolerance for gradient difference.
  • training_set.parameters.class_specific.default.SVM.IterationLimit: Scalar integer. Maximal number of numerical optimization iterations.
  • training_set.parameters.class_specific.default.SVM.GapTolerance: Scalar double. Feasibility gap tolerance.
  • training_set.parameters.class_specific.default.SVM.ShrinkagePeriod: Scalar integer. Number of iterations between reductions of active set.
  • training_set.parameters.class_specific.default.SVM.KKTTolerance: Scalar double. Karush-Kuhn-Tucker complementarity conditions violation tolerance.
  • training_set.parameters.class_specific.default.SVM.KernelFunction: String. Kernel function.
  • training_set.parameters.class_specific.default.SVM.KernelScale: Scalar double or string. Kernel scale parameter.
  • training_set.parameters.class_specific.default.SVM.PolynomialOrder: Scalar integer. Polynomial kernel function order.
  • training_set.parameters.class_specific.default.SVM.Standardize: Boolean. Flag to standardize predictor data.
  • training_set.parameters.class_specific.default.SVM.BoxConstraint: Scalar double. Box constraint.
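
The field names above match fitcsvm's name-value arguments, so the branch can be flattened into an argument list. The sketch below shows one way to do that; the helper name svm_namevalue is hypothetical and this is not necessarily how the toolbox passes the parameters. It assumes the structure holds only the SVM fields listed above, and it drops PolynomialOrder for non-polynomial kernels because fitcsvm documents that argument as applying to the polynomial kernel.

```matlab
function nv = svm_namevalue(svm)
% Hypothetical helper: flatten the SVM branch into name-value pairs.
% Example use (needs the Statistics and Machine Learning Toolbox):
%   nv = svm_namevalue(training_set.parameters.class_specific.default.SVM);
%   model = fitcsvm(X, Y, nv{:});
% Assumption of this sketch: PolynomialOrder is only passed along when
% the kernel is 'polynomial'.
isPoly = isfield(svm, 'KernelFunction') && ...
    strcmpi(svm.KernelFunction, 'polynomial');
if isfield(svm, 'PolynomialOrder') && ~isPoly
    svm = rmfield(svm, 'PolynomialOrder');
end
names = fieldnames(svm);
values = struct2cell(svm);
nv = [names'; values'];   % 2-by-N cell: names on row 1, values on row 2
nv = nv(:)';              % column-major flattening interleaves name, value
end
```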

Random forest branch

For more details on these parameters, see Random forest parameters and Matlab's TreeBagger doc page.

Note: Even though this is specific to the random forest branch, the parameters are still nested in training_set.parameters.class_specific.default.SVM. This 'SVM' field should be regarded as a 'classifier parameters' field. We kept the name 'SVM' for historical reasons and for backward compatibility with older training sets.

  • training_set.parameters.class_specific.default.SVM.NumTrees: Scalar integer. Number of trees to bag together.
  • training_set.parameters.class_specific.default.SVM.InBagFraction: Scalar double. Fraction of input data to sample with replacement from the input data for growing each new tree.
  • training_set.parameters.class_specific.default.SVM.MinLeafSize: Scalar integer. Minimum number of observations per tree leaf.
  • training_set.parameters.class_specific.default.SVM.SampleWithReplacement: String. 'on' to sample with replacement or 'off' to sample without replacement.
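
TreeBagger takes the number of trees as its first positional argument and the remaining settings as name-value pairs, so the random forest fields split slightly differently from the SVM ones. A minimal sketch, assuming the 'SVM' field holds only the four random forest parameters listed above; the helper name treebagger_args is hypothetical and is not taken from the toolbox.

```matlab
function [numTrees, nv] = treebagger_args(rf)
% Hypothetical helper: split the random forest parameters into the
% positional NumTrees argument and the remaining name-value pairs.
% Example use (needs the Statistics and Machine Learning Toolbox):
%   [n, nv] = treebagger_args(training_set.parameters.class_specific.default.SVM);
%   forest = TreeBagger(n, X, Y, 'Method', 'classification', nv{:});
numTrees = rf.NumTrees;
rf = rmfield(rf, 'NumTrees');
names = fieldnames(rf);
values = struct2cell(rf);
nv = [names'; values'];   % 2-by-N cell: names on row 1, values on row 2
nv = nv(:)';              % interleave into {name1, value1, name2, ...}
end
```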

training_set.parameters.class_specific.default.OptimizeSVM

This field and its substructure are only valid for the SVM branch. They specify the parameters over which to run the optimization procedure built into Matlab's fitcsvm function. However, because of the structure of our SVM approach (hybrid hierarchical and winner-takes-all), we now recommend using the evaluation procedure that we wrote instead.

For more details, see Matlab's fitcsvm parameters optimization.

  • training_set.parameters.class_specific.default.OptimizeSVM.KernelFunction: Cell array of strings. Names of the kernel functions to try for optimization.
  • training_set.parameters.class_specific.default.OptimizeSVM.KernelScale: Structure. Parameters for kernel scale optimization.
    • training_set.parameters.class_specific.default.OptimizeSVM.KernelScale.Optimize: Boolean. Turns optimization on/off for Kernel scale.
    • training_set.parameters.class_specific.default.OptimizeSVM.KernelScale.Range: vector of 2 doubles. Range of kernel scale values for optimization.
  • training_set.parameters.class_specific.default.OptimizeSVM.PolynomialOrder: Structure. Parameters for polynomial kernel order optimization.
    • training_set.parameters.class_specific.default.OptimizeSVM.PolynomialOrder.Optimize: Boolean. Turns optimization on/off for polynomial order.
    • training_set.parameters.class_specific.default.OptimizeSVM.PolynomialOrder.Range: Vector of 2 integers. Range of polynomial order values to try for optimization.
  • training_set.parameters.class_specific.default.OptimizeSVM.Standardize: Structure. Parameters for optimization of standardization.
    • training_set.parameters.class_specific.default.OptimizeSVM.Standardize.Optimize: Boolean. Turns optimization on/off for standardization. (Since 'Standardize' is a boolean parameter, there is no range to define.)
  • training_set.parameters.class_specific.default.OptimizeSVM.BoxConstraint: Structure. Parameters for optimization of the Box Constraint.
    • training_set.parameters.class_specific.default.OptimizeSVM.BoxConstraint.Optimize: Boolean. Turns optimization on/off for Box constraint.
    • training_set.parameters.class_specific.default.OptimizeSVM.BoxConstraint.Range: Vector of 2 doubles. Range of values to try for optimization for the Box Constraint.
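
fitcsvm's 'OptimizeHyperparameters' argument accepts a cell array of parameter names, so one plausible reading of the Optimize flags above is a simple filter over the substructures. The sketch below only collects the names whose Optimize flag is true. The helper name params_to_optimize is hypothetical, and how the toolbox actually translates this structure, including the KernelFunction cell array and the Range fields, is not covered here.

```matlab
function names = params_to_optimize(opt)
% Hypothetical helper: list the OptimizeSVM entries whose Optimize flag
% is set. Entries that are not structures (such as the KernelFunction
% cell array) are skipped. Assumes each Optimize flag is a scalar.
names = {};
fields = fieldnames(opt);
for k = 1:numel(fields)
    entry = opt.(fields{k});
    if isstruct(entry) && isfield(entry, 'Optimize') && entry.Optimize
        names{end+1} = fields{k}; %#ok<AGROW>
    end
end
end
```

The result could then be passed as fitcsvm(X, Y, 'OptimizeHyperparameters', names), provided the list is not empty.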

Differences between training_set variable saved to disk and training_set variable after GUI

The structure described above is the one fed into the LoadAndExtract function downstream of training set construction (see Training). When a training_set is saved to disk through the GUI, to be reloaded through the GUI later, the structure is slightly different. This means you cannot load a training set saved to disk into memory and feed it directly into the LoadAndExtract function.

There are two very minor differences in the structure, described below. It is frustrating that those two differences persist and make the whole process more cumbersome than it should be. Unfortunately, changing either structure to match the other would require a significant amount of testing to make sure that the change does not create errors in either the GUI or the LoadAndExtract function. I am working on a quick and dirty fix for that for version 1.1.

I'm going to refer to the two different structures as disk sets and post-GUI sets below. The one described on this page is the post-GUI sets structure.

  • The first difference is that the field training_set.parameters in the post-GUI sets is named training_set.training_params in the disk sets.
  • The second difference is that the field training_set.parameters.frame_processing of the post-GUI sets is moved up one level in the disk sets, where it becomes training_set.frame_processing.

And that's it! I know it's ridiculous to have such minor differences make the whole process of training on disk-saved sets a lot more cumbersome, but I will try to fix this.
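
In the meantime, the two differences can be undone by hand. The following is a minimal sketch, assuming they really are the only two differences; the helper name disk_to_postgui is hypothetical and is not part of the toolbox.

```matlab
function ts = disk_to_postgui(disk_set)
% Hypothetical helper: convert a disk set into the post-GUI layout
% described on this page. Only handles the two differences listed above.
ts = disk_set;
% 1) training_params (disk) is named parameters (post-GUI).
if isfield(ts, 'training_params')
    ts.parameters = ts.training_params;
    ts = rmfield(ts, 'training_params');
end
% 2) frame_processing sits at the top level on disk and moves under
%    parameters in the post-GUI layout.
if isfield(ts, 'frame_processing')
    ts.parameters.frame_processing = ts.frame_processing;
    ts = rmfield(ts, 'frame_processing');
end
end
```

A set that is already in post-GUI form passes through unchanged, so the function can be applied without first checking which layout a variable is in.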
