StupidSolver February 22, 2015 at 19:58

About enumeration on the example of generating crosswords

From the sandbox

With the article “ Crossword Forming Algorithm ”, several heuristics were proposed that were implemented in a program for automatically generating crosswords. Despite the fact that the proposed heuristics are well developed, even they did not allow for a reasonable time to generate a crossword for the most complex of the given grids:

This and all the following drawings are taken from the original article

In this article, I would like to discuss a much simpler solution that allows to solve this problem.

The proposed solution is based on two main ideas:

The correct sorting order
Do not repeat the same actions

All code is written in Scala.

Data structures

Imagine the crossword field in the form of a two-dimensional list of cells (Cell):

val field: List[List[Cell]] = readField("cross.in")

In addition, each cell is either protein and must contain a letter (Letter), or black, and must remain empty (Empty):

sealed abstract class Cell
case object Empty extends Cell // Все пустые клетки одинаковы
case class Letter() extends Cell // А клетки с буквами - нет

A word is a sequence of letters:

class Word(val letters: Seq[Letter])

Dictionary - many words grouped by length:

val dictionary: Map[Int, Set[String]] = readWords("cross.txt", "WINDOWS-1251").groupBy(_.size)

Highlight words

Note that there are no words marked in the field description. We consider a word to be a vertical or horizontal set of (future) letters bounded by empty cells or the edges of the board.
Learn to highlight words in one line:

def extractWords(row: List[Cell]): Seq[Word] = {
  val rest = row.dropWhile(_ == Empty) // Пропускаем пустые клетки
  if (rest.isEmpty) Seq() // Больше клеток нет
  else {
    // Берем не пустые клетки пока можем, и составляем из них слово
    val word = new Word(rest.takeWhile(_ != Empty).map(_.asInstanceOf[Letter]))
    // Вызываемся рекурсивно на остатке и добавляем составленное слово
    word +: extractWords(rest.dropWhile(_ != Empty))
  }
}

To select all the words, select the words from the lines, transpose the field and select the words from the columns. At the same time, we remove one-letter words.

val words = (field.flatMap(extractWords) ++ field.transpose.flatMap(extractWords)).filter(_.letters.size != 1)

For a quick search for intersecting words, we will store information in (future) letters, which words pass through them and the letter number in this word:

case class Letter() extends Cell {
  var words = Array.empty[(Word, Int)]
}
for (word <- words; (letter, index) <- word.letters.zipWithIndex) {
  letter.words :+= (word, index)
}

Brute force

We will pass two mappings (Map) to the enumeration procedure: variants and assignment. Words for each word that has not yet been considered contains a list of words that can be in this place (initially, all words of the required length), and assignment contains fixed values of the words already considered (initially, empty):

// Сигнатура
def search(variants: Map[Word, List[String]], assignment: Map[Word, String])
// Вызов
search(words.map(w => (w, random.shuffle(dictionary(w.letters.size).toList))).toMap, Map.empty)

Note that when you call, the words in the list are mixed, so that with different launches you will get different crosswords.

Now we can write the first version of the search:

def search(variants: Map[Word, List[String]], assignment: Map[Word, String]) {
  if (variants.isEmpty) {
    // Все слова выбраны - выводим расстановку
    dump(assignment)
  } else {
    // Берем первое попавшееся слово и соответствующие варианты
    val (word, vars) = variants.head
    // Перебираем все варианты
    for (variant <- vars) {
      // Удаляем слово и уже рассмотренных
      var newVariants = variants - word
      // Перебираем буквы слова, находим другое слово, проходящее через каждую букву 
      for ((letter, char) <- word.letters.zip(variant); (other, index) <- letter.words; if newVariants.contains(other)) {
        // Оставляем только те варианты, в которых на требуемом месте (index) строит правильная буква
        newVariants = newVariants.updated(other, variants(other).filter(var => var(index) == char))
      }
      // Вызываемся рекурсивно
      search(newVariants, assignment.updated(word, variant))
    }
  }
}

Unfortunately, for a reasonable time, this bust is not able to find a solution even for the simplest of the grids (9x9) mentioned in the original
article:

The correct sorting order

Fortunately, this problem is easy to fix. To do this, just select the correct order of consideration. Now the first word found is taken in the search.

Are there any more effective strategies? There is, and one of them was proposed back in 1965 in an article by SW Golomb, LD Baumert Backtrack Programming // Journal of the ACM Volume 12 Issue 4, 1965 . Unfortunately, I could not find this article in the public domain, but the idea itself is very simple: you need to sort through the one for which there are fewer options.

In our case, this means that we should not take a word arbitrarily, but that which has the least number of suitable options left:

val (word, vars) = variants.minBy(_._2.size)

Note that at the same time we get two nice bonuses for free. Firstly, if there are only options left for a word, then it will be immediately selected and this branch will be cut off without further enumeration. Secondly, if there is only one option left for the word, then it will be put right away, which will reduce the search in the future.

This optimization allows you to instantly find solutions for a 9x9 grid and, in more than 50% of cases, find a solution for a “simple” 15x15 grid:

However, an 11x11 grid and a “complex” 15x15 grid still require too much time.

Do not repeat the same actions

As in most programs, in our search most of the time is occupied by the innermost cycle, in which we remove the “inappropriate” words:

newVariants = newVariants.updated(other, variants(other).filter(_(index) == char))

At first glance, there is no loop in this line, but in fact it lurked in the implementation of the filter method. How can we save time here? And let's cache the results of this cycle!

Note that here lists of words are computed, in which some letters are fixed and some are not known. We will represent such a "partial" word in the form of a mask of the format " ?ол??о", that is, indicating question marks instead of unknown letters.

Now in the enumeration we will operate not with lists of “suitable” words, but with masks + cache the results of the calculation of lists:

def search(masks: Map[Word, String], assignment: Map[Word, String]) {
  if (masks.isEmpty) {
    // Все слова выбраны - выводим расстановку
    dump(assignment)
  } else {
    // Берем слово с минимальным числом вариантов
    val (word, mask) = masks.minBy(w => cache(w._2).size)
    // Перебираем варианты
    for (variant <- cache(mask)) {
      var newVariants = masks - word
      // Перебираем "перекрестные" слова
      for ((letter, char) <- word.letters.zip(variant); (other, index) <- letter.words; if newVariants.contains(other)) {
        val oldMask = masks(other)
        val newMask = oldMask.updated(index, char)
        // При необходимости, добавляем значение в кеш
        cache.getOrElseUpdate(newMask, cache(oldMask).filter(_(index) == char))
        newVariants = newVariants.updated(other, newMask)
      }
      search(newVariants, assignment.updated(word, variant))
    }
  }
}

Before using the new search function, you need to fill the cache with words from the dictionary and generate the source masks for all words:

for ((length, variants) <- dictionary) {
  cache.getOrElseUpdate("?" * length, random.shuffle(variants.toList).toArray)
}
search(words.map(w => (w, "?" * w.letters.length)).toMap, Map.empty)

Adding caching allows the search to find a solution even for the most complex grid: a

Crossword generated in 30 seconds (in fact, in this case it was lucky with a random ohm, the crossword is quickly generated somewhere in one of 20 cases):

acast ahum ahum 
aloah lara lager 
Anami Rockk Herm 
radio transmitter 
mlanzy tinder     
   aarne phloem 
maar ketch actor 
antidepressants 
atasu aapa yauk 
kartli whole    
    insi clouds 
addition 
rahu alla anand 
Loick Kindy 
cake Anna Aedie

conclusions

Using a couple of simple optimizations made it easy to implement a solution that is better than a heuristic solution based on a deep knowledge of the subject area. Choose the correct search order and do not repeat the same counts!

Source
Input Examples
Dictionaries (at the end of the page, link taken from the comment on the original post )

Tags: