This folder contains 2 small and 2 larger sample outputs from machine learning systems to test and validate the art.py script (http://www.clips.ua.ac.be/~vincent/software.html).

The format of the instances in the system files is:
    - identifier: this is a dummy for a feature (real instances commonly have more features)
    - gold standard label
    - predicted label

The labels are arranged in such a way that system1 gets an accuracy of 75% and system2 gets an accuracy of 25%.

The actual hypothesis H0 that is tested by randomization is:

    The performance of system1 and system2 is equal.

In this sample, accuracy is taken as the performance measure because the outcome can then be compared with the outcome of a sign-test, but any performance measure can be used.

------------------------------------------------------------------------------------------------------

Exact randomization testing:

    $ python art.py -v system1 system2

This gives a probability for accuracy of 0.625, which is higher than the common significance level of 0.05. This means that there is no significant difference between the accuracy of system1 and system2.

The outcome of 0.625 can be understood as follows:

    The gold standard labels are        : label1 label1 label1 label2
    The predicted labels of system1 are : label1 label1 label1 label1 (accuracy: 75%)
    The predicted labels of system2 are : label2 label2 label2 label2 (accuracy: 25%)

The actual accuracy difference between system1 and system2 is 50%.

The instances are represented as a table in which the 4 columns are the 4 instances and the rows are the feature values. Since there are 4 instances and the class label of each instance can come either from system1 or from system2, it is possible to construct 2^4 = 16 different combinations. Writing out all combinations gives:

    Shuffled system1            acc.     Shuffled system2            acc.     |acc. difference|
    label2 label2 label2 label2 ( 25%)   label1 label1 label1 label1 ( 75%)    50%*
    label1 label2 label2 label2 ( 50%)   label2 label1 label1 label1 ( 50%)     0%
    label2 label1 label2 label2 ( 50%)   label1 label2 label1 label1 ( 50%)     0%
    label2 label2 label1 label2 ( 50%)   label1 label1 label2 label1 ( 50%)     0%
    label2 label2 label2 label1 (  0%)   label1 label1 label1 label2 (100%)   100%*
    label1 label1 label2 label2 ( 75%)   label2 label2 label1 label1 ( 25%)    50%*
    label1 label2 label1 label2 ( 75%)   label2 label1 label2 label1 ( 25%)    50%*
    label1 label2 label2 label1 ( 25%)   label2 label1 label1 label2 ( 75%)    50%*
    label2 label1 label1 label2 ( 75%)   label1 label2 label2 label1 ( 25%)    50%*
    label2 label1 label2 label1 ( 25%)   label1 label2 label1 label2 ( 75%)    50%*
    label2 label2 label1 label1 ( 25%)   label1 label1 label2 label2 ( 75%)    50%*
    label1 label1 label1 label2 (100%)   label2 label2 label2 label1 (  0%)   100%*
    label1 label1 label2 label1 ( 50%)   label2 label2 label1 label2 ( 50%)     0%
    label1 label2 label1 label1 ( 50%)   label2 label1 label2 label2 ( 50%)     0%
    label2 label1 label1 label1 ( 50%)   label1 label2 label2 label2 ( 50%)     0%
    label1 label1 label1 label1 ( 75%)   label2 label2 label2 label2 ( 25%)    50%*

The first 4 columns are the class labels as predicted by the shuffled system1. The accuracy of these predictions is reported in column 5. The next 4 columns are the class labels as predicted by the shuffled system2. Since a shuffled system takes each prediction either from the original system1 or from the original system2, the output of the shuffled system2 is always the complement of the output of the shuffled system1. The accuracies for the shuffled system2 are given in column 10. The absolute accuracy difference between the 2 shuffled systems is given in the last column.

It can be seen that for 10 of the 16 shuffles (indicated with *) the accuracy difference is greater than or equal to the actual accuracy difference. This means that if the predictions of system1 and system2 were interchangeable (i.e. came from the same system), the probability of observing an accuracy difference equal to or greater than the actual difference would be 10/16 = 0.625. This probability is too large to reject the null hypothesis H0 that system1 and system2 are equal.

------------------------------------------------------------------------------------------------------

Approximate randomization testing:

    $ python art.py -v -t0 -n100000 system1 system2

This gives a probability of around 0.6245; this value may vary since it is an approximation. The probability is comparable to the probability for exact randomization.

Instead of computing all shuffles as in exact randomization, the class labels are randomly distributed over the 2 shuffled systems. This is done 100,000 times to obtain a more reliable probability value.

------------------------------------------------------------------------------------------------------

The validity of the approximate randomization test can be checked by rerunning the test and by the sign-test:

    - Rerunning the test gives 0.6266.
    - The sign-test gives a probability of 0.625, which is comparable with the outcome of the approx. randomization.

Both approximate randomization tests and the sign-test lead to the same conclusion: H0 cannot be rejected.

------------------------------------------------------------------------------------------------------

Note that an actual accuracy difference of 50% may seem significant, but because of the small number of instances it is not. To obtain a significant accuracy difference larger files are needed:

    $ python art.py -v system1_large system2_large

system1_large and system2_large have the same accuracies as their smaller counterparts but they contain 1,000 instances. The large number of instances makes exact randomization testing infeasible, but approximate randomization testing remains possible. The probabilities are:

    approx. randomization: 9.999e-05 (run1)
    approx. randomization: 9.999e-05 (run2)
    sign-test            : 1.348e-58

The conclusion is that H0 is rejected and that system1_large has a significantly (level 0.05) different accuracy than system2_large.

Note that the sign-test has a much smaller probability than the approx. randomization probability but the conclusion is the same. The difference comes from the number of shuffles, which is set to 10,000. For approximate randomization testing the probability is computed by:

    (nge + 1)/(N + 1)

    with nge: the number of times the accuracy difference of a shuffle is greater than or equal to the actual acc. difference
         N  : the number of shuffles

In the example, the sign-test gives a very low probability of 1.348e-58. Since the number of shuffles is set to 10,000, the chance is very small that the actual acc. difference (or a greater difference) occurs in those 10,000 shuffles, so nge=0. This gives the smallest computable probability of (0+1)/(10000+1) = 9.999e-05. Increasing the number of shuffles can bring this probability down until the region of 1.348e-58 is reached, but the conclusion is already obvious.

Vincent Van Asch, November 2011
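------------------------------------------------------------------------------------------------------

The numbers quoted in this README for the 4-instance example can be reproduced with a short stand-alone sketch. The code below is not taken from art.py and does not claim to mirror its implementation: all function names (n_correct, shuffled_difference, exact_randomization, approximate_randomization, approx_p_value, sign_test) are hypothetical, the labels are hard-coded from the example instead of being read from the system files, and the sign-test is assumed to be the usual two-sided binomial test on the instances where exactly one system is correct. It requires Python 3.8 or later for math.comb.

```python
import random
from itertools import product
from math import comb

# Labels of the 4-instance example described in this README.
gold = ["label1", "label1", "label1", "label2"]
system1 = ["label1"] * 4   # 3 of 4 correct: accuracy 75%
system2 = ["label2"] * 4   # 1 of 4 correct: accuracy 25%


def n_correct(gold, predicted):
    """Number of instances whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, predicted))


def shuffled_difference(gold, sys_a, sys_b, swaps):
    """Absolute difference in correct counts after swapping the predictions
    of the two systems wherever the corresponding entry of swaps is True."""
    shuffled_a = [b if s else a for a, b, s in zip(sys_a, sys_b, swaps)]
    shuffled_b = [a if s else b for a, b, s in zip(sys_a, sys_b, swaps)]
    return abs(n_correct(gold, shuffled_a) - n_correct(gold, shuffled_b))


def exact_randomization(gold, sys_a, sys_b):
    """Enumerate all 2^n shuffles; return (nge, number of shuffles, p)."""
    actual = shuffled_difference(gold, sys_a, sys_b, [False] * len(gold))
    diffs = [shuffled_difference(gold, sys_a, sys_b, swaps)
             for swaps in product([False, True], repeat=len(gold))]
    nge = sum(d >= actual for d in diffs)
    return nge, len(diffs), nge / len(diffs)


def approx_p_value(nge, n_shuffles):
    """The (nge + 1)/(N + 1) estimate quoted in this README."""
    return (nge + 1) / (n_shuffles + 1)


def approximate_randomization(gold, sys_a, sys_b, n_shuffles=10000, seed=0):
    """Sample random shuffles (each instance swapped with probability 0.5,
    an assumption of this sketch) instead of enumerating all of them."""
    rng = random.Random(seed)
    actual = shuffled_difference(gold, sys_a, sys_b, [False] * len(gold))
    nge = 0
    for _ in range(n_shuffles):
        swaps = [rng.random() < 0.5 for _ in gold]
        if shuffled_difference(gold, sys_a, sys_b, swaps) >= actual:
            nge += 1
    return approx_p_value(nge, n_shuffles)


def sign_test(gold, sys_a, sys_b):
    """Two-sided binomial sign-test on instances where exactly one system
    is correct (assumed definition; ties are ignored)."""
    a_only = sum((a == g) and (b != g) for g, a, b in zip(gold, sys_a, sys_b))
    b_only = sum((b == g) and (a != g) for g, a, b in zip(gold, sys_a, sys_b))
    n, k = a_only + b_only, min(a_only, b_only)
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(k + 1))
    return min(1.0, 2 * tail / 2 ** n)


nge, total, p = exact_randomization(gold, system1, system2)
print("exact randomization : %d/%d = %.3f" % (nge, total, p))
print("approx. randomization: %.4f" % approximate_randomization(gold, system1, system2))
print("sign-test           : %.3f" % sign_test(gold, system1, system2))
print("smallest approx. p  : %.3e" % approx_p_value(0, 10000))
```

For the example labels the exact enumeration finds 10 of 16 shuffles with a difference at least as large as the actual one (0.625), the sign-test also gives 0.625, and the smallest value of (nge + 1)/(N + 1) for 10,000 shuffles prints as 9.999e-05. The approximate value depends on the random seed and should lie close to 0.625.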