Scippy

SCIP

Solving Constraint Integer Programs

scripts/trainEstimation/README.md
Go to the documentation of this file.
1 How to train custom tree size estimation for SCIP {#TRAINESTIMATION}
2 =================================================
3 
4 While watching SCIP's periodic output, users may sometimes wonder to what extent the
5 search process is already finished.
6 As of version 7.0, SCIP features a new display column "compl." that, in many cases, serves as
7 a much better progress bar of the search than the traditional gap.
8 This column tries to approximate the fraction of completed nodes, the so-called _search completion_,
9 as closely as possible.
10 By default, the display is based on simple search tree and gap statistics that are collected during
11 the search.
12 While most of these statistics by themselves show a significant improvement over the classical gap,
13 they can also be combined into even more accurate approximations of the search completion.
14 
15 SCIP provides two ways to combine the individual statistics:
16 
17 1. a linear regression that uses the two values "tree weight" and "SSG" and is guaranteed to be monotone.
18 2. a regression forest on all tree statistics
19 
20 Especially the second method requires **careful training to the instances of interest**. Therefore,
21 SCIP does not come with a built-in regression forest, but can read custom regression forests
22 that have been trained on user data.
23 
24 This tutorial shows how to use the scripts in the directory "scripts/trainEstimation/" of SCIP
25 to train and use custom regressions from user data.
26 
27 
28 Prerequisites: Installation of necessary R packages
29 ---------------------------------------------------
30 
31 The training is performed by the R script "train.R". It depends on the availability of some R packages.
32 Please make sure that you have the following packages installed, ideally in their newest versions.
33 
34 - readr
35 - magrittr
36 - ggplot2
37 - dplyr
38 - knitr
39 - rpart
40 - randomForest
41 - reshape2
42 
43 You can install packages from within an R session interactively by executing
44 
45 ```{.R}
46 install.packages(
47  c(
48  "readr",
49  "magrittr",
50  "ggplot2",
51  "dplyr",
52  "knitr",
53  "rpart",
54  "randomForest",
55  "reshape2"
56  )
57 )
58 ```
59 
60 You will be prompted to specify a CRAN-mirror before downloading the package resources.
61 The above packages are well-maintained standard packages in the R universe. If you
62 experience errors during the installation, please refer to the troubleshooting
63 in the respective package documentation.
64 
65 If the installation was succesful, you should be able to test the scripts on our provided test data:
66 
67 ```
68 ./run_training.sh testdata/
69 ```
70 
71 Th directory "testdata/" contains only a handful of example log files to verify that the R packages have been set up successfully.
72 For good training results in a practical scenario, it is recommended to provide at least 50-100 such log files.
73 Smaller test beds can be enriched, for example, by running SCIP multiple times per instance with different random seed initializations.
74 
75 At successful termination, the training summarizes the training in several new files in the output directory "output/".
76 The output of the script is explained further below.
77 
78 
79 Creation of SCIP Log File Output
80 --------------------------------
81 
82 The second step consists of producing meaningful training data in the form of SCIP output on instances of interest.
83 The required additional output can be enabled using the settings file "periodic_report.set" in this directory for SCIP.
84 The Log files must be stored in a common directory used as argument for "run_training.sh", one log file per instance.
85 Log files must have the file extension ".out".
86 
87 The following example shows how to create a new subdirectory "mydata/" and produce a log file for the instance "bell5.mps"
88 under Linux.
89 
90 ```
91 mkdir mydata/
92 scip -s periodic_report.set -f ../../check/instances/MIP/bell5.mps -l mydata/scip-logfile.out
93 ```
94 
95 The training only considers instances that could be solved, and discards all instances with trees that are too small.
96 Therefore, it should be ensured that at the end of the data collection, there are instances with interesting trees not too small
97 that could be completely solved.
98 
99 If the data set consists of many instances, make sure to also read [how to run automated tests with SCIP](@ref TEST)
100 of the SCIP documentation for one possibility to automate the data collection.
101 Unless the `OUTPUTDIR=results` flag is modified, the necessary log files are then collected in the subdirectory "check/results/"
102 under the SCIP root directory.
103 Please note that the above scripts create an additional log file after the runs are finished, in which
104 log files for each instance are concatenated.
105 Please (re-)move this file, which can be easily identified by the prefix "check." before proceeding.
106 
107 Invocation of run_training.sh
108 -----------------------------
109 
110 Now that the data set has been collected in a subdirectory "mydata/", we can invoke
111 
112 ```
113 ./run_training.sh mydata/
114 ```
115 
116 to obtain information of the approximation/estimation accuracy of the different estimation methods of SCIP on the newly created data set.
117 All information is stored under the default subdirectory "output/".
118 The name of the out directory can be changed by providing it as second argument to run_training.sh
119 
120 ```
121 ./run_training.sh mydata/ /path/to/my/output/
122 ```
123 
124 Interpretation of the Output
125 ----------------------------
126 
127 After the training has finished, a comparison of different tree size estimation methods is printed to the console. In addition,
128 the (user-specified) output directory contains the following files.
129 
130 - monotone.set, a SCIP settings file with linear regression coefficients.
131 - rf_model.rfcsv, a trained regression forest
132 - searchcompletion_mse.pdf, a bar chart that compares the search completion accuracy.
133 
134 In addition, some intermediate data pipeline results are also stored in the output directory in the form of CSV files.
135 
136 ### Output of run_training.sh
137 
138 At termination, "run_training.sh" outputs the estimation accuracy for a total of 13 different techniques to estimate the tree size.
139 An example table could look as follows:
140 
141 | |Method | n| MSE| MeanRatio| 2Accurate| 3Accurate| 4Accurate|Group |
142 |:--|:---------------|----:|-----:|---------:|---------:|---------:|---------:|:----------------|
143 |13 |Random.Forest | 3012| 0.009| 1.540| 0.824| 0.883| 0.910|Learned |
144 |12 |linear.monotone | 3012| 0.049| 2.074| 0.707| 0.822| 0.863|Learned |
145 |1 |treeweight | 3012| 0.074| 2.191| 0.686| 0.791| 0.841|SearchCompletion |
146 |6 |wbe | 3012| NaN| 2.349| 0.655| 0.777| 0.824|Custom |
147 |11 |tree-weight | 3012| NaN| 2.395| 0.651| 0.759| 0.810|Forecast |
148 |8 |leaf-frequency | 3012| NaN| 2.465| 0.666| 0.763| 0.807|Forecast |
149 |9 |open-nodes | 3012| NaN| 2.716| 0.613| 0.714| 0.760|Forecast |
150 |4 |leaf-frequency | 3012| 0.151| 2.755| 0.600| 0.714| 0.754|SearchCompletion |
151 |10 |ssg | 3012| NaN| 3.016| 0.603| 0.698| 0.739|Forecast |
152 |7 |gap | 3012| NaN| 3.308| 0.539| 0.658| 0.708|Forecast |
153 |5 |tree-profile | 3012| NaN| 4.140| 0.463| 0.580| 0.638|Custom |
154 |2 |ssg | 3012| 0.068| 6.301| 0.631| 0.701| 0.758|SearchCompletion |
155 |3 |gap | 3012| 0.259| 9.378| 0.497| 0.594| 0.652|SearchCompletion |
156 
157 For four methods, there are two ways to compute an estimation of tree size: either as an approximation
158 of search completion, or by computing a time series forecast that takes into account the most recent values and trend development.
159 Therefore, these methods appear twice in the table, and the column "Group" shows whether a forecast or
160 a search completion approximation was used.
161 The column n contains the number of records that were aggregated for this table.
162 
163 The methods are sorted by their geometric mean approximation ratio of the true tree size.
164 We normalize each ratio such that it is bounded from below by 1, which would correspond to a perfect estimate.
165 The method at the top, Random.Forest, is the method that yields
166 the most accurate estimate of search tree size at termination.
167 For methods that approximate search completion, the Mean Squared Error of the approximation is shown in column "MSE".
168 The columns 2Accurate etc. give the fraction of records that are within a factor of 2 (3,4) of the actual tree size at termination.
169 Better methods reach higher values in those columns.
170 
171 Disclaimer: This table has been produced as a showcase on a subset of 91 publicly available instances almost all of which are solved by SCIP in less than 100 seconds.
172 The figures therein are not representative beyond the data set, and the ranking of the methods may substantially change on other data sources.
173 
174 
175 Using the Trained Regression in SCIP
176 ------------------------------------
177 
178 In the above table, the two learned methods "Random.Forest" and "linear.monotone" outperform the other methods (out of which they are constructed).
179 In the output directory, the file "monotone.set" contains the linear regression coefficients and can be input into SCIP like a normal settings file.
180 For example,
181 
182 ```
183 scip -s output/monotone.set
184 ```
185 
186 launches a SCIP interactive shell with the coefficients preloaded. Since the coefficients are user parameters,
187 SCIP comes with reasonable default values that have been calibrated on a large data set.
188 We call this a monotone regression because it combines the two values "tree weight" and "SSG", which are monotone.
189 
190 The trained regression forest can also be loaded into SCIP. The file "rf_model.rfcsv" in the output directory
191 contains all learned splits on the input features for all trees in its ensemble as one
192 large data file in an extended CSV format with additional storage size information.
193 The location of this file must be explicitly specified for SCIP by setting the string parameter "estimation/regforestfilename",
194 e.g., via
195 
196 ```
197 estimation/regforestfilename = "/path/to/rf_model.rfcsv"
198 ```
199 as part of settings file for SCIP.
200 In the interactive shell, type `set estimation regforestfilename output/rf_model.rfcsv` to tell SCIP
201 that it should load the regression forest.
202 A loaded regression forest automatically takes precedence over all other methods and is used as search completion approximation in the new
203 display column "compl.".
204 In contrast to the monotone regression, the regression forest is not guaranteed to be monotone.
205 
206 Note that most of the other methods in the table are also readily available. If the tree size should be estimated by an SSG forecasting
207 method, use the "method" parameter
208 
209 ```
210 estimation/method = s
211 ```
212 
213 If the training suggests that tree weight search completion should be used instead, combine the two parameters
214 
215 ```
216 estimation/completiontype = w
217 estimation/method = c
218 ```