
UniverSVM

Support Vector Machine with Large Scale CCCP Functionality

Author: Fabian Sinz, Version: 1.1
Collaborators: Ronan Collobert, Jason Weston and Leon Bottou

Overview

  • UniverSVM implements standard SVM optimization (including multiclass training and cross validation) suited for large scale problems.

  • UniverSVM implements the concave-convex procedure (CCCP) applied to transductive and sparse SVMs.

  • UniverSVM can train very large scale TSVMs with tens of thousands of unlabeled examples. Its computational complexity is empirically quadratic in the number of examples (labeled + unlabeled) and is cubic in the worst case.

  • UniverSVM can be compiled as a MEX file for MATLAB (thanks to Olivier Chapelle).

  • UniverSVM is free software (except for military purposes); you can redistribute it and/or modify it under the terms of the GNU General Public License and the exception mentioned above. It also includes parts of LIBSVM and LUSH Lisp Universal Shell: Copyright (C) 2002 Leon Bottou, Yann Le Cun, AT&T Corp, NECI.

  • Download source_v1.tar.gz

  • Download a version for Windows written and kindly provided by Matteo Roffilli

    Installation

    Unpack the compressed file. If you downloaded the source, compile it with one of the following commands:
    make all
    for compiling the learning algorithm and the ascii to binary conversion tool.
    make universvm
    for compiling only the learning algorithm.
    make libsvm2bin
    for compiling only the ascii to binary conversion tool.
    make bin2libsvm
    for compiling only the binary to ascii conversion tool.
    make mex
    for compiling UniverSVM as a MEX file.


    Usage

    UniverSVM consists of a single learning module (universvm). This is used both to train a model and to apply the learned model to new examples.

    If a model file is specified after the training file, the learned model is stored there. This is currently not implemented for multiclass learning, cross validation and universum variant 1. For now, a simple rule is: if the model has only one set of alphas, it can be stored in the model file. If a model is supplied with the switch -F, UniverSVM tests that model on the data supplied as training data and on the test data supplied by -T. So

      universvm [options] -T testfile trainfile
    
    has the same effect as doing
       universvm [options] trainfile model
       universvm [options] -F model testfile
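The equivalence above can be sketched as a small Python helper that only builds the argument lists. This is a minimal sketch, not part of UniverSVM: the function names `one_step` and `two_step` and the model file name `text.model` are hypothetical, chosen for illustration.

```python
import shlex

def one_step(train, test, options=()):
    """Train on `train` and test on `test` in a single call (uses -T)."""
    return ["universvm", *options, "-T", test, train]

def two_step(train, test, model, options=()):
    """Train and store the model, then test the stored model (uses -F)."""
    return [
        ["universvm", *options, train, model],
        ["universvm", *options, "-F", model, test],
    ]

# Hypothetical file names, used only to show the resulting command lines.
opts = ["-c", "100"]
print(shlex.join(one_step("text.trn1", "text.tst1", opts)))
for cmd in two_step("text.trn1", "text.tst1", "text.model", opts):
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)`, provided the `universvm` binary is on the PATH. Keep in mind that storing a model is not implemented for every training mode, as noted above.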
    
    UniverSVM is called with the following parameters:

    universvm [options] training_set_file [model_file]

    Available options are:

    -T test_set_file: test model on test set
    -U universum_file : use universum (it's also possible to include universum points
                        with label -2 in the training file)
    -F model_file : Test the model stored in model_file on training AND test data (specified by -T)
    -u unlabeled_data_file : use unlabeled data (transductive SVM).
                     (it's also possible to include unlabeled points
                        with label -3 in the training file)
    -B file format : files are stored in the following format:
              0 -- libsvm ascii format (default)
              1 -- binary format
              2 -- split file format
    -f file : output report file to given destination
    -D file : output function values on test set(s) to given destination
    
    OPTIMIZATION OPTIONS:
    -V universum variant:
              0 -- Standard universum training (default)
              1 -- Train SVM with universum by making it a 3-class multiclass
                   problem and adding the decision rules for {+1,U} vs. -1 and
                   {-1,U} vs. +1 (0=off default)
                   This switch works only for binary at the moment.
              2 -- Train universum with ramp loss. This option requires "-o 1".
    -o optimizer: set different optimizers
              0 -- quadratic program
              1 -- convex concave procedure (if you choose a transductive SVM,
                   this option will be chosen automatically)
    -G gap : set gap parameter for universum (default 0.05)
    -I use_ridge : Add the ridge 1/C to the kernel matrix.
    -r coef0 : set coef0 in kernel function (default 0)
    -c cost : set the parameter C of C-SVC (default 1)
    -C cost : set the parameter C for universum points
    -a cost : set the parameter C for balancing constraint
    -z cost : set the parameter C for unlabeled points
    -m cachesize : set cache memory size in MB (default 256)
    -e epsilon : set tolerance of termination criterion (default 0.001)
    -s s : s parameter for ramp loss (default: -1)
    -S s : s parameter for transductive SVM loss (default: 0)
    
    MODEL OPTIONS:
    -t kernel_type : set type of kernel function (default 0)
              0 -- linear: u'*v
              1 -- polynomial: (gamma*u'*v + coef0)^degree
              2 -- radial basis function: exp(-gamma*|u-v|^2)
              3 -- sigmoid: tanh(gamma*u'*v + coef0)
              4 -- custom: k(x_i,x_j) = X(i,j)
    -d degree : set degree in kernel function (default 3)
    -g gamma : set gamma in kernel function (default 1/k)
    -b bias: use constraint sum alpha_i y_i =0 (default 1=on)
    -w weight: the rhs of the balancing constraint (default = sum(y_i))
    -v n : do cross validation with n folds
    -M k : perform a multiclass training on k classes labeled with k different
            integers >= 0 (default: 0)
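For reference, the kernel formulas listed under -t can be written out in plain Python. This is an illustrative sketch of the formulas only, not the UniverSVM implementation. The function names are hypothetical, and the `gamma`, `coef0` and `degree` arguments correspond to the -g, -r and -d options. The default gamma of 1/k presumably uses the number of features for k, as in LIBSVM; that reading is an assumption.

```python
import math

def dot(u, v):
    """Inner product u'*v of two dense vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))

def linear(u, v):
    # -t 0: u'*v
    return dot(u, v)

def polynomial(u, v, gamma, coef0=0.0, degree=3):
    # -t 1: (gamma*u'*v + coef0)^degree
    return (gamma * dot(u, v) + coef0) ** degree

def rbf(u, v, gamma):
    # -t 2: exp(-gamma*|u-v|^2)
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)

def sigmoid(u, v, gamma, coef0=0.0):
    # -t 3: tanh(gamma*u'*v + coef0)
    return math.tanh(gamma * dot(u, v) + coef0)

u, v = [1.0, 2.0], [3.0, 4.0]
gamma = 1.0 / len(u)  # assumption: k is the number of features
print(linear(u, v), polynomial(u, v, gamma), rbf(u, v, gamma), sigmoid(u, v, gamma))
```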
    
    

    File formats

    We support three types of file format: LIBSVM/SVMLight ascii, binary and split files.

    LIBSVM/SVMLight ascii format
    The input file contains the training examples. Each line represents one training example and has the following format:

    <line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value>
    <target> .=. +1 | -1 | -3 | <int> 
    <feature> .=. <integer>
    <value> .=. <float>

    The target value and each of the feature/value pairs are separated by a space character.

    In classification mode, the target value denotes the class of the example. A target value of +1 marks a positive example and -1 a negative example. So, for example, the line

    -1 1:0.43 3:0.12 9284:0.2

    specifies a negative example for which feature number 1 has the value 0.43, feature number 3 has the value 0.12, feature number 9284 has the value 0.2, and all the other features have value 0. A class label of -3 indicates that this example should be classified using transduction.

    In multiclass classification the class should be a positive integer (remember to specify the "-M k" option).
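The line format described above can be read with a few lines of Python. This is a minimal sketch for inspecting data files, not the parser UniverSVM uses. The function name `parse_line` is hypothetical, and the sketch assumes well-formed input with whitespace-separated fields.

```python
def parse_line(line):
    """Parse '<target> <feature>:<value> ...' into (target, {feature: value}).

    Features that do not appear on the line are implicitly 0 and are
    simply absent from the returned dictionary.
    """
    fields = line.split()
    target = int(fields[0])  # e.g. +1, -1, -3 (unlabeled), or a class index
    features = {}
    for item in fields[1:]:
        index, value = item.split(":", 1)
        features[int(index)] = float(value)
    return target, features

target, features = parse_line("-1 1:0.43 3:0.12 9284:0.2")
print(target, features)
print(features.get(2, 0.0))  # a missing feature is read as 0
```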

    Binary files
    One can convert from ascii files to binary ones with the program libsvm2bin which is included in the source. This makes file sizes smaller and loading times faster.

    Split files
    Split files are a handy way to save disk space if you have copies of the same data with different training/test set splits and/or different target classes. They work by loading an original data file, then specifying a subset of the data to load (by index) as well as an optional relabeling of the data points. The split file format is best shown by an example:

    file_name: mnist.trn.bin
    binary_file: 1
    supply_indices: 1
    supply_new_labels: 1
    3 -3
    4 -3
    7 -3
    ...
    
    The first line specifies a data file to load, in either ascii or binary format. The second line indicates whether that file is binary (set to 1) or not (set to 0). The third line, "supply_indices", specifies whether you wish to load a subset of the given file (set to 1), and the fourth line, "supply_new_labels", indicates whether you wish to relabel the data differently from the original file. Following the first four lines is a list of either <index> <label> pairs (if supply_new_labels is set to 1) or a single <index> per line. These indices (starting from 1) specify examples in the original file.
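A split file like the one above can be generated programmatically. The following is a sketch under stated assumptions: the helper name `format_split_file` is hypothetical, it always sets supply_indices to 1, and it assumes one space between index and label, as in the example. Exact whitespace requirements are not documented here.

```python
def format_split_file(file_name, binary, indices, labels=None):
    """Return the text of a split file.

    indices are 1-based positions in the original data file. If labels
    is given, it must have one new label per index.
    """
    if labels is not None and len(labels) != len(indices):
        raise ValueError("need exactly one label per index")
    lines = [
        f"file_name: {file_name}",
        f"binary_file: {int(bool(binary))}",
        "supply_indices: 1",
        f"supply_new_labels: {int(labels is not None)}",
    ]
    if labels is None:
        lines += [str(i) for i in indices]
    else:
        lines += [f"{i} {y}" for i, y in zip(indices, labels)]
    return "\n".join(lines) + "\n"

# Mark examples 3, 4 and 7 of mnist.trn.bin as unlabeled (label -3).
print(format_split_file("mnist.trn.bin", True, [3, 4, 7], [-3, -3, -3]), end="")
```

The resulting file would then be loaded with the "-B 2" option.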

    Example

    We give an example of text classification taken from Chapelle and Zien, 2005. You will need the following training, testing and unlabeled data files, available here.

    1) Running a standard SVM with no unlabeled data (linear kernel):

    universvm -c 100 -T text.tst1 text.trn1
    
    gives
    Training done ...
    ---------------------------------
               Testing
    ---------------------------------
    Testing on test set with 1896 examples:
    ===========================================
       Accuracy= 81.0127(1536/1896)
    ===========================================
    
    2) Running TSVM:
    universvm -c 100 -z 0.1 -S -0.3 -u text.tst1 -T text.tst1 text.trn1
    gives
    Training done ...
    ---------------------------------
               Testing
    ---------------------------------
    Testing on test set with 1896 examples:
    ===========================================
       Accuracy= 93.7236(1777/1896)
    ===========================================
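When comparing several runs, it can help to extract the accuracy line from the program output. The sketch below assumes the output looks exactly like the transcripts above; the regular expression and the function name `parse_accuracy` are illustrative and may need adjusting for other versions.

```python
import re

# Matches lines such as "Accuracy= 81.0127(1536/1896)".
ACCURACY_RE = re.compile(r"Accuracy=\s*([0-9.]+)\((\d+)/(\d+)\)")

def parse_accuracy(output):
    """Return (percent, correct, total) from the output, or None if absent."""
    match = ACCURACY_RE.search(output)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2)), int(match.group(3))

svm = parse_accuracy("   Accuracy= 81.0127(1536/1896)")
tsvm = parse_accuracy("   Accuracy= 93.7236(1777/1896)")
print(svm, tsvm)
# Additional test examples classified correctly by the TSVM run.
print(tsvm[1] - svm[1])
```

On the two runs above, the TSVM classifies 241 more of the 1896 test examples correctly than the standard SVM.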
    

    Feedback/Bug Reports

    If you find any bugs or have useful feedback, please send me an email. Please do not forget to attach a detailed description of how to reproduce the bug.