scripts/trainEstimation/README.md
This column tries to approximate the fraction of completed nodes, the so-called _search completion_,
By default, the display is based on simple search tree and gap statistics that are collected during
While most of these statistics by themselves show a significant improvement over the classical gap,
1. a linear regression that uses the two values "tree weight" and "SSG" and is guaranteed to be monotone.
Especially the second method requires **careful training to the instances of interest**. Therefore,
The training is performed by the R script "train.R". It depends on the availability of some R packages.
Please make sure that you have the following packages installed, ideally in their newest versions.
If the installation was successful, you should be able to test the scripts on our provided test data:
The directory "testdata/" contains only a handful of example log files to verify that the R packages have been set up successfully.
For good training results in a practical scenario, it is recommended to provide at least 50-100 such log files.
Smaller test beds can be enriched, for example, by running SCIP multiple times per instance with different random seed initializations.
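
As a rough illustration of the seed-based enrichment, the following Python sketch only builds (and does not run) one SCIP command line per seed. The option letters `-s`, `-f`, `-r`, and `-l` are assumed to match the usage message of your SCIP binary, and the log file naming scheme `<instance>_seed<N>.log` is a hypothetical convention, not something prescribed by the training scripts.

```python
from pathlib import Path


def seeded_commands(instance, settings="periodic_report.set", logdir="mydata", seeds=(0, 1, 2)):
    """Return one SCIP command line per random seed (built only, never executed)."""
    stem = Path(instance).stem
    commands = []
    for seed in seeds:
        # Hypothetical naming scheme: one log file per (instance, seed) pair.
        logfile = Path(logdir) / f"{stem}_seed{seed}.log"
        # Assumed SCIP options: -s settings, -f problem file, -r random seed, -l log file.
        commands.append(
            ["scip", "-s", settings, "-f", instance, "-r", str(seed), "-l", str(logfile)]
        )
    return commands


if __name__ == "__main__":
    for cmd in seeded_commands("bell5.mps"):
        print(" ".join(cmd))
```

Each resulting log file is then treated like the log of a separate instance during training.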
On successful termination, the training summarizes its results in several new files in the output directory "output/".
The second step consists of producing meaningful training data in the form of SCIP output on instances of interest.
The required additional output can be enabled in SCIP using the settings file "periodic_report.set" in this directory.
The log files must be stored in a common directory that is passed as an argument to "run_training.sh", one log file per instance.
The following example shows how to create a new subdirectory "mydata/" and produce a log file for the instance "bell5.mps"
The training only considers instances that could be solved, and discards all instances with trees that are too small.
Therefore, it should be ensured that at the end of the data collection, there are instances with interesting trees not too small
If the data set consists of many instances, make sure to also read [how to run automated tests with SCIP](@ref TEST)
Unless the `OUTPUTDIR=results` flag is modified, the necessary log files are then collected in the subdirectory "check/results/"
Please note that the above scripts create an additional log file after the runs are finished, in which
Please (re-)move this file, which can be easily identified by the prefix "check." before proceeding.
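
Setting the additional file aside can also be scripted. The following Python sketch moves every file whose name starts with "check." from the results directory into a separate directory. The function name and the backup directory are hypothetical choices for this illustration.

```python
import shutil
from pathlib import Path


def set_aside_summary_logs(results_dir, backup_dir):
    """Move all files named 'check.*' out of results_dir and return their names."""
    results = Path(results_dir)
    backup = Path(backup_dir)
    backup.mkdir(parents=True, exist_ok=True)
    moved = []
    # sorted() materializes the listing before any file is moved.
    for path in sorted(results.glob("check.*")):
        if path.is_file():
            shutil.move(str(path), str(backup / path.name))
            moved.append(path.name)
    return moved
```

A call such as `set_aside_summary_logs("check/results", "check/summaries")` would leave only the per-instance log files in "check/results/", assuming the default `OUTPUTDIR=results`.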
to obtain information on the approximation/estimation accuracy of the different estimation methods of SCIP on the newly created data set.
The name of the output directory can be changed by providing it as the second argument to run_training.sh
After the training has finished, a comparison of different tree size estimation methods is printed to the console. In addition,
Some intermediate data pipeline results are also stored in the output directory in the form of CSV files.
At termination, "run_training.sh" outputs the estimation accuracy for a total of 13 different techniques to estimate the tree size.
|:--|:---------------|----:|-----:|---------:|---------:|---------:|---------:|:----------------|
For four methods, there are two ways to compute an estimation of tree size: either as an approximation
of search completion, or by computing a time series forecast that takes into account the most recent values and trend development.
Therefore, these methods appear twice in the table, and the column "Group" shows whether a forecast or
We normalize each ratio such that it is bounded from below by 1, which would correspond to a perfect estimate.
For methods that approximate search completion, the Mean Squared Error of the approximation is shown in column "MSE".
The columns 2Accurate etc. give the fraction of records that are within a factor of 2 (3,4) of the actual tree size at termination.
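
To make these accuracy measures concrete, here is a minimal Python sketch. It assumes that the normalization takes the larger of estimate/actual and actual/estimate, and that the MSE compares an approximation of search completion with the actual completion at the time of the record. The data values are made up for the illustration and are not taken from the table.

```python
def normalized_ratio(estimate, actual):
    """Ratio of estimated and actual tree size, flipped if necessary so that it is >= 1."""
    if estimate <= 0 or actual <= 0:
        raise ValueError("tree sizes must be positive")
    return max(estimate / actual, actual / estimate)


def fraction_within(estimates, actuals, factor):
    """Fraction of records whose normalized ratio is at most `factor`."""
    ratios = [normalized_ratio(e, a) for e, a in zip(estimates, actuals)]
    return sum(r <= factor for r in ratios) / len(ratios)


def mean_squared_error(approximations, targets):
    """Mean squared difference between approximated and actual search completion."""
    return sum((x - t) ** 2 for x, t in zip(approximations, targets)) / len(targets)


# Made-up records: estimated and actual tree sizes at termination.
estimates = [100, 500, 80, 3000]
actuals = [200, 400, 400, 1000]

for factor in (2, 3, 4):
    print(factor, fraction_within(estimates, actuals, factor))
```

A ratio of exactly 1 corresponds to a perfect estimate, so overestimation and underestimation by the same factor are treated alike.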
Disclaimer: This table has been produced as a showcase on a subset of 91 publicly available instances, almost all of which are solved by SCIP in less than 100 seconds.
The figures therein are not representative beyond the data set, and the ranking of the methods may change substantially on other data sources.
In the above table, the two learned methods "Random.Forest" and "linear.monotone" outperform the other methods (out of which they are constructed).
In the output directory, the file "monotone.set" contains the linear regression coefficients and can be input into SCIP like a normal settings file.
launches a SCIP interactive shell with the coefficients preloaded. Since the coefficients are user parameters,
We call this a monotone regression because it combines the two values "tree weight" and "SSG", which are monotone.
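
The monotonicity argument can be sketched in a few lines of Python. The sketch assumes that the coefficients are nonnegative and that the SSG, which decreases over the course of the search, enters the regression as 1 - SSG. The coefficient values and the two time series are hypothetical and are not trained values from "monotone.set".

```python
def monotone_completion(tree_weight, ssg, coef_weight, coef_ssg):
    """Linear combination of tree weight and (1 - SSG); assumes nonnegative coefficients."""
    if coef_weight < 0 or coef_ssg < 0:
        raise ValueError("negative coefficients would break monotonicity")
    return coef_weight * tree_weight + coef_ssg * (1.0 - ssg)


def is_nondecreasing(values):
    """True if no value is smaller than its predecessor."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Hypothetical series: tree weight grows from 0 to 1, SSG shrinks from 1 to 0.
tree_weights = [0.0, 0.125, 0.25, 0.5, 1.0]
ssgs = [1.0, 0.75, 0.5, 0.25, 0.0]
coef_weight, coef_ssg = 0.5, 0.5  # hypothetical coefficients

series = [
    monotone_completion(w, s, coef_weight, coef_ssg)
    for w, s in zip(tree_weights, ssgs)
]
print(series, is_nondecreasing(series))
```

Because both inputs move in one direction only, any nonnegative weighting of them can never decrease, which is what makes the resulting search completion approximation monotone.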
The trained regression forest can also be loaded into SCIP. The file "rf_model.rfcsv" in the output directory
The location of this file must be explicitly specified for SCIP by setting the string parameter "estimation/regforestfilename",
In the interactive shell, type `set estimation regforestfilename output/rf_model.rfcsv` to tell SCIP
A loaded regression forest automatically takes precedence over all other methods and is used as search completion approximation in the new
Note that most of the other methods in the table are also readily available. If the tree size should be estimated by an SSG forecasting