JS Programming Contest: Word Classifier (Test Progress)

    First of all, we apologize to all contest participants for the delay in the results. Today we are publishing all the submitted solutions and the official scripts for generating tests and running them, and we will tell you how the checking of the solutions is going.

    The English version of this post is published on GitHub.

    312 solutions to test


    Many thanks to all the participants! We received 603 submissions from 312 participants. Since we test only the most recent on-time submission from each participant, we need to test 312 solutions. This was an unexpected result, and we hope it explains a little why testing is taking so long.

    We have published all the tested solutions in the submissions directory on GitHub. The name of each subdirectory is the unique identifier of a solution; if you sent us a solution, you received an automatic confirmation email containing this identifier. The names or pseudonyms of the authors will be published along with the final results.

    Each directory contains a file solution.js (the competitive solution itself) and, if there is one, a data file data or data.gz (if it is compressed with gzip). This layout matches what the official testing scripts expect. The subdirectory src contains the sources that the author sent in the additional archive; the testing scripts ignore them.

    Of course, it is no longer possible to change your solution. However, you may send us additional files that you would like to see published in the src directory but did not include in the source archive, for example a README. This will not affect the test result, but if it helps us understand the solution better, it may increase your chances of receiving a special prize for originality. If you want to send additional files, write us a letter.

    Test generator


    As promised, we are publishing the test generator. It is a command-line interface to the same module that we used in the web service to generate tests. For the same seed values of the pseudo-random number generator, this generator is guaranteed to produce the same results as the web service. The --help option prints usage instructions.

    At startup, the generator analyzes the dictionary and builds Markov models of different orders (this is why initializing the generator takes considerable time). It then uses a pseudo-random number generator to produce test sets. With probability ½ a random word is picked from the dictionary; in the remaining cases one of 9 models, selected at random, is used for generation: Markov chains of orders 1 through 8 and a white-noise generator. The higher the order of the Markov model, the more its output resembles English words. If a generated word happens to be in the dictionary, it is discarded and a new one is generated. Since this happens more often with high-order models, non-words produced by these models are somewhat less common in the tests than those produced by low-order models. If you are interested in the details, we recommend reading the generator's source code.
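    The generation step described above can be sketched roughly as follows. This is an illustrative sketch only, not the actual generator code (see the published source for the real thing): random is assumed to be a seeded PRNG returning floats in [0, 1), dictionary a Set of valid words, and models an array of 9 word generators (index 0 being white noise, 1 through 8 Markov chains of that order).

```javascript
// Hypothetical sketch of one generation step; the real generator on
// GitHub differs in details.
function generateTestWord(random, dictionary, models) {
    // With probability 1/2, pick a real word from the dictionary.
    if (random() < 0.5) {
        const words = [...dictionary];
        return { word: words[Math.floor(random() * words.length)], isWord: true };
    }
    // Otherwise pick one of the 9 models uniformly and generate a non-word.
    for (;;) {
        const model = models[Math.floor(random() * models.length)];
        const word = model.generate(random);
        // Words that happen to be in the dictionary are discarded and
        // regenerated; this is why high-order models, whose output looks
        // more English-like, contribute slightly fewer non-words.
        if (!dictionary.has(word)) return { word, isWord: false };
    }
}
```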

    Serious testing requires a large number of 100-word blocks. To generate them, we use a text string (the “seed string”) to initialize the master sequence of pseudo-random numbers from which the block numbers are taken. The same string always yields the same sequence of blocks, and the sequence can be continued indefinitely.
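    Schematically, the mechanism looks like this. Note that the hash and the PRNG below are placeholders chosen for illustration; the published generator uses its own algorithms, so this sketch will not reproduce the official block numbers:

```javascript
// Illustrative sketch: derive an endless, reproducible sequence of block
// numbers from a seed string. Same string in, same numbers out.
function blockNumbers(seedString, count) {
    // FNV-1a hash of the seed string (an assumption, not the contest's
    // actual hash) to obtain the initial 32-bit state.
    let state = 2166136261;
    for (const ch of seedString) {
        state = (state ^ ch.charCodeAt(0)) >>> 0;
        state = Math.imul(state, 16777619) >>> 0;
    }
    const numbers = [];
    for (let i = 0; i < count; i++) {
        // Step a 32-bit LCG and emit the top 31 bits as the block number.
        state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
        numbers.push(state >>> 1);
    }
    return numbers;
}
```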

    In a few days, the web version of the test generator will be shut down. Please use the command-line generator instead; it produces exactly the same results on your computer.

    "Combat" test system


    For mass testing of solutions, we have developed a new, more powerful and flexible test system. The --help option prints usage instructions.

    The new test system is functionally equivalent to the old one: for the same tests and the same programs under test, both systems produce the same results. Here is what distinguishes the new system:
    • Several solutions can be tested in parallel. To do this, list them on the command line one after another. (As with the first script, specify the names of the solution directories, not paths to files.)
    • Solutions are run in separate processes, which receive words from the main process and send responses back.
    • Instead of reading thousands of JSON files with tests, the new system generates tests on the fly, using a seed string to initialize the master sequence. It is fully deterministic and reproducible.
    • The new system computes far more statistics than just the percentage of correct answers. During testing, it regularly updates these statistics on the console. It can also write the test results to a machine-readable JSON file.
    • Optionally, the test program can restart and reinitialize a solution whenever a repeat occurs in the test (a word that has already been encountered). This makes it possible to isolate and measure the effect of the training that some solutions implement.
    • You can select the “dangerous” mode, in which a solution is loaded as a normal Node.js module, without a “sandbox”. Use it with caution if you suspect that the Node.js VM distorts a solution's behavior. Since you trust your own solution, it is always a good idea to pass --unsafe to improve performance when testing it.
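    The restart-on-repeat option can be sketched like this. The helper names below are hypothetical (the real test2.js spawns solutions as separate child processes); the point is that reinitializing on a repeat wipes any state a “learning” solution has accumulated, which is what isolates the training effect:

```javascript
// Sketch of the optional restart-on-repeat behaviour. `createSolution`
// returns a fresh solution object with a `test(word)` method.
function makeRestartingRunner(createSolution) {
    let solution = createSolution();
    const seen = new Set();
    let restarts = 0;
    return {
        classify(word) {
            if (seen.has(word)) {            // repeat detected
                solution = createSolution(); // fresh, untrained instance
                restarts++;
            }
            seen.add(word);
            return solution.test(word);
        },
        get restarts() { return restarts; },
    };
}
```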

    Here is what the statistics displayed on screen during testing mean. The top line shows: Total, the total number of words checked; W, the number of words; NW, the number of non-words; NW[0] ... NW[8], the number of non-words produced by each of the models (0 is the white-noise generator, and 1–8 are Markov chains of the corresponding order). For each solution: Correct, the percentage of correct answers (this is exactly what determines the place in the final standings); F-, the percentage of false negatives; F+, the percentage of false positives; F+[0] ... F+[8], the percentage of false positives for each model separately; ms/100w, the average processing time of one block in milliseconds. If a solution is restarted when repeats occur, (R) is displayed to the left of its name. The machine-readable JSON file with the results contains the same information in a self-explanatory format.
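    As a rough sketch, these statistics could be computed from a list of per-word records like the one below. The field names mirror the on-screen display, but the record shape and the normalization (here, F- is taken per word and F+ per non-word) are assumptions for illustration; the real implementation lives in test2.js:

```javascript
// Each record: { isWord, model (0-8 for non-words), answer (true = "word") }.
function computeStats(records) {
    const total = records.length;
    let falseNeg = 0, falsePos = 0, correct = 0;
    let words = 0, nonWords = 0;
    const fpByModel = new Array(9).fill(0);
    const nwByModel = new Array(9).fill(0);
    for (const r of records) {
        if (r.isWord) {
            words++;
            if (r.answer) correct++;
            else falseNeg++;                 // F-: a real word rejected
        } else {
            nonWords++;
            nwByModel[r.model]++;
            if (!r.answer) correct++;
            else { falsePos++; fpByModel[r.model]++; } // F+: non-word accepted
        }
    }
    const pct = (n, d) => (d ? (100 * n) / d : 0);
    return {
        total, words, nonWords,
        correct: pct(correct, total),
        falseNegative: pct(falseNeg, words),
        falsePositive: pct(falsePos, nonWords),
        falsePositiveByModel: fpByModel.map((n, i) => pct(n, nwByModel[i])),
    };
}
```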

    Official test sequence


    We opened the Twitter feed of @ladygaga :-) and took the first tweet that appeared after the deadline. The text of this tweet became the official seed string. To generate the official test sequence, pass the following option to the generate.js and test2.js scripts:

    --text 'Lady Gaga has accompanied Mario Andretti in a two-seater race car at #Indy500 today!'

    (Update: unfortunately, this tweet has already been deleted; only this retweet remains. Since testing has already begun, the seed string will not be changed.)

    The beginning of the sequence (block numbers) is: 416943456, 1130928813, 1690493249, 1560679505, 1146154276, 403245362, 1509995579, 2138666856, 1841363379, 1083285606.

    The number of blocks and self-training


    First, we ran all the solutions on a very small number of blocks to find out which ones had a chance of taking the lead. Then we took 8 promising solutions and ran them on a very large number of blocks. With several different seed strings, the top three positions in the rating stabilized within the first 1,500 blocks: further increasing the number of blocks refined the percentages but did not change the relative positions of the leaders. We decided to test all solutions on 10,000 blocks (one million words).

    Some solutions use an original approach: they remember all the words that the test system presents to them. Since the dictionary contains fewer than 700,000 words, while the variety of pseudo-random non-words is much wider, a word that occurs twice is more likely to be a word than a non-word. For solutions exploiting this phenomenon, accuracy increases over time. In theory, as the number of blocks grows without bound, such a solution can achieve an arbitrarily high score.
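    The trick can be sketched in a few lines. This is not any particular participant's code, just an illustration of the idea; baseTest stands for whatever static classifier a solution already has:

```javascript
// Wrap a static classifier so that any repeated word is answered "word":
// a dictionary word is far more likely to recur than one of the vastly
// more numerous random non-words.
function makeLearningClassifier(baseTest) {
    const seen = new Set();
    return function test(word) {
        if (seen.has(word)) return true; // a repeat is probably a real word
        seen.add(word);
        return baseTest(word);
    };
}
```

The memory cost grows with the number of blocks, but with a million words per full run it stays manageable, and the accuracy gain compounds as repeats become more frequent.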

    This is an interesting idea, but it is a problem for the competition, since in theory, on a sufficiently large test set, such a solution can beat any non-learning solution. This would not only turn this simple technique into a “magic wand” that defeats everything else, but would also leave the jury with the problem of choosing between several learning solutions. If this idea had occurred to us before the start of the competition, we could easily have solved the problem by stating that the test system may restart a solution at any time during testing. However, we explicitly promised that initialization would occur only once.

    We decided to settle on 10,000 blocks and run each solution both with and without restarts, in order to measure the learning effect separately. The results of non-learning solutions have time to stabilize over this run. Training does improve the results of some solutions (so far we see one solution that is in the top ten thanks to training). Of course, learning solutions would score even higher on a larger number of blocks, but that would be unfair to those participants whose programs show high recognition accuracy immediately after initialization.

    In short: we allow learning solutions to learn, but do not use such huge numbers of blocks that the learning effect becomes decisive.

    Status


    At the moment, some solutions are still being tested.

    We did not state any performance requirements in the problem statement. Had we imposed limits, we would have had to describe the hardware and operating system precisely, run the solutions many times for accurate measurements, and deal with the distortions introduced by the sandbox. We wanted to avoid the problems of performance measurement altogether. As a result, some of the solutions we received are very slow.

    Testing of some solutions has not yet finished. It seems we will have to stop the slowest of them ahead of schedule and use the results of a partial run. At the moment, none of these solutions are in the top ten. In addition, at least one solution hangs. We will find out whether the sandbox causes this problem or it is a bug in the solution itself.

    In a few days, when the testing of the slow solutions finishes, we will publish the results. Meanwhile, you can find out your own result without waiting for the full table to be published. Use the new test system and the official seed string above. Set 10,000 blocks and you will get exactly the same result as we do (unless your solution uses Math.random, in which case the results will vary slightly). So far, the best result we have seen on 10,000 blocks is 83.67%.

    Thank you for your patience. Stay with us!
