Update doc authored by Charles Vernerey
@@ -26,7 +26,7 @@ Database database = new DatReader(path, 0, true).readFiles()
- nbValueFiles: an Integer which represents the number of value files that we want to read (in this example, 0)
- noClasses: a Boolean which is true if we want to ignore the classes of the transactions (in this example, true)
If the noClasses argument is set to false, the first item of each transaction is considered to be its class item and is ignored during the mining. In the example above, the class items would be `{1,2}`.
If nbValueFiles is set to an Integer n > 0, then we read the *.val0, *.val1, ..., *.valn files, where each line of such a file contains the value of an item. See the `data` directory for examples of value files.
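As an illustration, here is a minimal sketch combining both options; the file name is hypothetical and only serves to show the arguments:

```java
// Hypothetical files: data/example.dat with its value files example.val0 and example.val1.
// nbValueFiles = 1 reads the value files, noClasses = false treats the first item
// of each transaction as its class item.
Database db = new DatReader("data/example.dat", 1, false).readFiles();
```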
@@ -273,7 +273,7 @@ for (Pattern generator : generators) {
Finally, we solve our model and find all the generators with frequency >= 1 and length >= 1.
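The enumeration step follows the same scheme as the other examples on this page. Below is only a sketch, assuming model, database, x, freq and length are the objects created earlier in this example (the full code is not shown in this excerpt):

```java
// Sketch of the enumeration loop: x, freq and length are assumed to be the
// Boolean item variables and the frequency/length variables of this example.
Solver solver = model.getSolver();
while (solver.solve()) {
    int[] itemset = IntStream
            .range(0, x.length)
            .filter(i -> x[i].getValue() == 1)
            .map(i -> database.getItems()[i])
            .toArray();
    System.out.println(Arrays.toString(itemset) + ", freq=" + freq.getValue() +
            ", length=" + length.getValue());
}
```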
**ExampleMFIsMIIsMining**
In this example, we want to mine **Maximal Frequent Itemsets** (MFIs) and **Minimal Infrequent Itemsets** (MIIs). First, let us analyse the code of the createModel method, which takes three arguments:
@@ -328,3 +328,195 @@ solver.setSearch(Search.intVarSearch(
As a search heuristic, we order the items by increasing frequency and instantiate them to 1 first.
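Such a strategy can be written with a custom variable selector. The following is only an illustrative sketch (not the code of the example), assuming x is the array of item variables and that item frequencies are obtained with database.computeItemFreq():

```java
// Illustrative sketch: pick the uninstantiated item with the lowest frequency
// and branch on the value 1 first (IntDomainMax on Boolean variables).
int[] itemFreq = database.computeItemFreq();
solver.setSearch(Search.intVarSearch(
        vars -> {
            IntVar best = null;
            int bestFreq = Integer.MAX_VALUE;
            for (int i = 0; i < vars.length; i++) {
                if (!vars[i].isInstantiated() && itemFreq[i] < bestFreq) {
                    best = vars[i];
                    bestFreq = itemFreq[i];
                }
            }
            return best; // null when every variable is instantiated
        },
        new IntDomainMax(),
        x
));
```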
**ExampleSkypatternMining**
In this example, we want to find the set of skypatterns w.r.t. the set of measures `M = {freq(x),area(x),aconf(x)}`.
```java
String dataPath = "src/test/resources/contextPasquier99/contextPasquier99.dat";
Database database = new DatReader(dataPath, 0, true).readFiles();
Model model = new Model("skypattern mining");
```
We load all the required data and we create the model.
```java
List<Measure> M = Arrays.asList(freq(), area(), allConf());
Set<Measure> M_prime = MeasureOperand.maxConvert(M);
```
Then, we create the set of measures M and we compute M' such that M is maximally M'-skylineable.
```java
BoolVar[] x = model.boolVarArray("x", database.getNbItems());
IntVar freq = model.intVar("freq", 1, database.getNbTransactions());
IntVar length = model.intVar("length", 1, database.getNbItems());
IntVar area = freq.mul(length).intVar();
model.sum(x, "=", length).post();
int[] itemFreq = database.computeItemFreq();
IntVar[] itemFreqVar = model.intVarArray(database.getNbItems(), 0, database.getNbTransactions());
for (int i = 0; i < database.getNbItems(); i++) {
    // itemFreqVar[i] = itemFreq[i] if x[i] == 1, else 0
model.arithm(x[i], "*", model.intVar(itemFreq[i]), "=", itemFreqVar[i]).post();
}
String maxFreqId = maxFreq().getId();
IntVar maxFreq = model.intVar(maxFreqId, 0, database.getNbTransactions());
// Compute max value of itemFreqVar
model.max(maxFreq, itemFreqVar).post();
IntVar aconf = freq.mul(10000).div(maxFreq).intVar();
```
Next, we create the variables that hold the value of each measure m in M for the pattern x. Note that aconf is encoded as an integer variable by scaling it by a factor of 10000.
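For instance, if freq(x) = 2 and the most frequent single item of x has frequency 3, the aconf variable holds 2 * 10000 / 3 = 6666, which corresponds to an all-confidence of about 0.67.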
```java
model.post(new Constraint("Cover Size", new CoverSize(database, freq, x)));
model.post(new Constraint("Adequate Closure", new AdequateClosureWC(database, new ArrayList<>(M_prime), x)));
IntVar[] objectives = new IntVar[]{freq, area, aconf};
ParetoMaximizer maximizer = new ParetoMaximizer(objectives);
model.post(new Constraint("Pareto", maximizer));
Solver solver = model.getSolver();
solver.plugMonitor(maximizer);
solver.setSearch(Search.intVarSearch(
new MinCov(model, database),
new IntDomainMin(),
x
));
```
Next, we add the **CoverSize** constraint to link the freq variable to the frequency of x, and the **AdequateClosure** constraint to ensure that x is closed w.r.t. M'. We also add the **Pareto** constraint to ensure that x is not dominated by any previously found pattern. Note that we also plug the Pareto maximizer into the solver, so that each time a new solution is discovered it is added to the archive and all the solutions it dominates are removed.
The search heuristic is **MinCov**: we select the item i such that freq(x U {i}) is minimal and instantiate it to 0 first. This heuristic is interesting for the following reasons:
- non-frequent patterns are filtered quickly
- the pattern with the maximal frequency is instantiated first, which helps to prune the search space efficiently
```java
while (solver.solve());
for (Solution solution : maximizer.getParetoFront()) {
int[] itemset = IntStream
.range(0, x.length)
.filter(i -> solution.getIntVal(x[i]) == 1)
.map(i -> database.getItems()[i])
.toArray();
System.out.println(Arrays.toString(itemset) + ", freq=" + solution.getIntVal(freq) + ", area=" +
solution.getIntVal(area) + ", aconf=" + solution.getIntVal(aconf));
}
```
Finally, we retrieve the Pareto front and print all the skypatterns.
**ExampleMNRsMining**
In this example, we want to extract all the MNRs (minimal non-redundant association rules) with a frequency >= 1 and a confidence >= 20%.
```java
String dataPath = "src/test/resources/contextPasquier99/contextPasquier99.dat";
Database database = new DatReader(dataPath, 0, true).readFiles();
int minFreq = 1;
int minConf = 20;
Model model = new Model("MNRs mining");
```
First, we read the data and we create the model.
```java
BoolVar[] x = model.boolVarArray("x", database.getNbItems());
BoolVar[] y = model.boolVarArray("y", database.getNbItems());
BoolVar[] z = model.boolVarArray("z", database.getNbItems());
for (int i = 0; i < database.getNbItems(); i++) {
model.arithm(x[i], "+", y[i], "<=", 1).post();
model.addClausesBoolOrEqVar(x[i], y[i], z[i]);
}
model.addClausesBoolOrArrayEqualTrue(x);
model.addClausesBoolOrArrayEqualTrue(y);
```
Then, we create three arrays of Boolean variables:
- x: represents the antecedent of the rule
- y: represents the consequent of the rule
- z: represents the union of x and y
We add several constraints:
- x[i] + y[i] <= 1: ensures that an item cannot appear in both the antecedent and the consequent
- model.addClausesBoolOrEqVar(x[i], y[i], z[i]): ensures that z[i] = x[i] OR y[i]
- model.addClausesBoolOrArrayEqualTrue: ensures that the antecedent and the consequent each contain at least one item
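For instance, if x encodes the antecedent {1} and y encodes the consequent {3} (hypothetical items), then z encodes {1, 3} and the candidate rule reads 1 => 3.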
```java
IntVar freqZ = model.intVar("freqZ", minFreq, database.getNbTransactions());
new Constraint("frequent Z", new CoverSize(database, freqZ, z)).post();
IntVar freqX = model.intVar("freqX", minFreq, database.getNbTransactions());
new Constraint("frequent X", new CoverSize(database, freqX, x)).post();
freqZ.mul(100).ge(freqX.mul(minConf)).post();
new Constraint("generator x", new Generator(database, x))
.post();
new Constraint("closed z", new CoverClosure(database, z)).post();
```
Next, we add **CoverSize** constraints to compute the frequencies of z and x, together with a constraint enforcing the minimum confidence of the rule. We also post the **Generator** and **CoverClosure** constraints to ensure that the rule is an MNR.
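For example, with minConf = 20, a candidate rule with freqX = 5 and freqZ = 2 satisfies the confidence constraint, since freqZ * 100 = 200 >= freqX * minConf = 100, i.e. a confidence of 40%.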
```java
Solver solver = model.getSolver();
solver.setSearch(intVarSearch(
new InputOrder<>(model),
new IntDomainMin(),
ArrayUtils.append(x, y, z)
));
while (solver.solve()) {
double conf = (double) freqZ.getValue() / freqX.getValue();
System.out.println(Arrays.toString(getItemset(x, database)) + " => " +
Arrays.toString(getItemset(y, database)) + ", freq=" + freqZ.getValue() + ", conf=" + conf);
}
```
Finally, we order the variables so that x is instantiated before y and y before z, selecting the minimum value first. Each time a new MNR is found, it is printed.
**ExampleDiversity**
In this example, we want to find the set of diverse patterns with a min frequency of 10% and a Jmax of 0.05.
```java
Model model = new Model("Diversity");
Database database = new DatReader(dataPath, 0, true).readFiles();
int theta = (int) Math.round(database.getNbTransactions() * 0.1d);
IntVar freq = model.intVar("freq", theta, database.getNbTransactions());
IntVar length = model.intVar("length", 1, database.getNbItems());
BoolVar[] x = model.boolVarArray("x", database.getNbItems());
model.sum(x, "=", length).post();
model.post(new Constraint("Cover Size", new CoverSize(database, freq, x)));
model.post(new Constraint("Cover Closure", new CoverClosure(database, x)));
```
We read the data and post the constraints.
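For instance, for a database containing 100 transactions, theta = Math.round(100 * 0.1) = 10, so the freq variable ranges from 10 to 100 and only patterns covering at least 10 transactions are considered.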
```java
Overlap overlap = new Overlap(database, x, 0.05, theta);
model.post(new Constraint("Overlap", overlap));
Solver solver = model.getSolver();
solver.plugMonitor(overlap);
solver.setSearch(Search.intVarSearch(
new InputOrder<>(model),
new IntDomainMin(),
x
));
```
Then, we post the **Overlap** constraint to guarantee the diversity of x w.r.t. all previously discovered patterns. Note that the overlap propagator is also plugged into the solver as a monitor, so that each new solution is recorded in its history.
```java
while (solver.solve());
solver.printStatistics();
List<int[]> itemsets = overlap.getItemsetsHistory();
List<BitSet> covers = overlap.getCoversHistory();
System.out.println(itemsets.size());
for (int i = 0; i < itemsets.size(); i++) {
int[] itemset = Arrays.stream(itemsets.get(i))
.map(j -> database.getItems()[j])
.toArray();
System.out.println(Arrays.toString(itemset) + ", cover=" + covers.get(i));
}
```
Finally, we retrieve all the diverse itemsets with their covers and print them.