Solving Sudoku and the n-Queens problem

I’ve put together a solver for Sudoku, as well as one for the n-Queens problem. This was inspired and informed by some ideas and algorithms in the book Artificial Intelligence: A Modern Approach by Stuart Russell and Peter Norvig. The solvers were quite fun to write and surprisingly easy to put together, both done in an afternoon.

The approaches taken for these two problems are different, so I’ll highlight the key aspects and compare them at the end.

The sudoku solver first uses the AC-3 algorithm to infer reductions in the domain (possible values) of variables before the search. If this reduces the domains to one value per cell, the puzzle is effectively solved. If some variable’s domain becomes the empty set (no value for a cell that can satisfy all constraints), the puzzle is unsolvable. Otherwise, a search is done to find a solution.
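As a rough illustration of the AC-3 step described above, here is a minimal sketch for "must be different" constraints, which is the only constraint type sudoku needs. The names `ac3`, `domains` and `neighbors` are hypothetical and not taken from the solver's code; the sketch only assumes that each cell maps to a set of candidate values and to the set of cells it shares a row, column or box with.

```python
from collections import deque


def ac3(domains, neighbors):
    """Prune domains in place with AC-3 for pairwise 'not equal' constraints.

    domains:   dict mapping variable -> set of candidate values
    neighbors: dict mapping variable -> set of variables it must differ from
    Returns False if some domain becomes empty (unsolvable), True otherwise.
    """
    queue = deque((x, y) for x in domains for y in neighbors[x])
    while queue:
        x, y = queue.popleft()
        # A value of x is unsupported if y has no different value left.
        unsupported = {vx for vx in domains[x]
                       if not any(vx != vy for vy in domains[y])}
        if unsupported:
            domains[x] -= unsupported
            if not domains[x]:
                return False
            # The domain of x shrank, so arcs pointing at x must be rechecked.
            queue.extend((z, x) for z in neighbors[x] if z != y)
    return True
```

With three mutually different variables whose domains are {1}, {1, 2} and {1, 2, 3}, the propagation alone reduces every domain to a single value, which is the "effectively solved" case.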

The search uses backtracking. It is a depth-first search (DFS), going deep down one path before checking others, with in-place modification of the puzzle board as it goes, and undoing the changes when it has gotten stuck and needs to backtrack. The order in which variables and values are tried in the DFS matters, as some orders can prune large parts of the tree. The following heuristics are used to sort variables and values:

• Minimum remaining values (MRV) heuristic — prioritizes cells that have few legal values
• Least constraining value heuristic — after choosing a cell, this prioritizes the value option that least inhibits other cells
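The two heuristics can be sketched in a few lines. This is an illustration rather than the solver's actual code: `select_cell`, `order_values`, `domains`, `neighbors` and `assigned` are hypothetical names, and the sketch assumes the same cell-to-candidates and cell-to-peers mappings a CSP solver would typically keep.

```python
def select_cell(domains, assigned):
    """MRV: pick the unassigned cell with the fewest remaining legal values."""
    unassigned = [cell for cell in domains if cell not in assigned]
    return min(unassigned, key=lambda cell: len(domains[cell]))


def order_values(cell, domains, neighbors, assigned):
    """LCV: try first the values that remove the fewest options from peers."""
    def ruled_out(value):
        # Count the unassigned peers that would lose this value.
        return sum(1 for peer in neighbors[cell]
                   if peer not in assigned and value in domains[peer])
    return sorted(domains[cell], key=ruled_out)
```

The search would call `select_cell` to decide where to branch next and then iterate over `order_values` for that cell, so that the most promising branches are explored first.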

Forward checking is used to maintain arc consistency for a variable after a value is chosen for it. This infers domain reductions in neighbouring variables. The search proceeds until it finds a solution or fails (not solvable). If the puzzle is solvable, a solution is returned. Here is an example, using an “evil difficulty” puzzle found online:

import sudoku

# Taken from https://www.websudoku.com/
evil = [
0,0,7,0,0,5,0,9,0,
6,2,0,1,0,7,0,0,8,
0,1,0,0,0,0,0,0,0,
0,0,0,0,0,3,5,0,0,
0,8,0,0,6,0,0,3,0,
0,0,5,4,0,0,0,0,0,
0,0,0,0,0,0,0,6,0,
9,0,0,6,0,1,0,2,4,
0,6,0,8,0,0,1,0,0
]

solved = sudoku.search(evil)
if solved: sudoku.print_solution(solved)

>>>
8 3 7 | 2 4 5 | 6 9 1
6 2 4 | 1 9 7 | 3 5 8
5 1 9 | 3 8 6 | 7 4 2
- - - - - - - - - - -
2 4 6 | 9 1 3 | 5 8 7
7 8 1 | 5 6 2 | 4 3 9
3 9 5 | 4 7 8 | 2 1 6
- - - - - - - - - - -
1 5 8 | 7 2 4 | 9 6 3
9 7 3 | 6 5 1 | 8 2 4
4 6 2 | 8 3 9 | 1 7 5


The algorithm used for solving sudoku is systematic and exhaustive. In contrast, different algorithms are best suited for the n-Queens problem, another well-known puzzle. The challenge is to place n queens on an n×n board, such that no two queens attack each other.

My n-Queens solver uses the local beam search algorithm, an adaptation of another algorithm named beam search. It starts off with some randomly generated configurations and does a greedy stochastic hill climbing search on each one. If one of them finds a solution, that’s returned. Otherwise, the k best results are used for the next search. This algorithm requires parallel searches, so the solver utilizes multiprocessing. The idea behind hill climbing algorithms is to follow the steepest path up locally. As the nearest hill may not be the highest hill, it is prone to getting stuck in local maxima (dead ends). Randomization is the trick used to solve this, by restarting the search in different places. The randomization also ensures that a solution is eventually found if one exists.
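To make the hill climbing idea concrete, here is a simplified single-process sketch: stochastic hill climbing with random restarts, not the parallel local beam search the solver itself uses. It assumes one queen per column, with `cols[c]` holding the row of the queen in column `c`; the function names are hypothetical.

```python
import random


def attacks(cols):
    """Count attacking pairs; cols[c] is the row of the queen in column c."""
    n = len(cols)
    return sum(1 for a in range(n) for b in range(a + 1, n)
               if cols[a] == cols[b] or abs(cols[a] - cols[b]) == b - a)


def hill_climb(n, rng):
    """Climb from one random start; return a solved board or None if stuck."""
    cols = [rng.randrange(n) for _ in range(n)]
    current = attacks(cols)
    while current > 0:
        # Collect every single-queen move that strictly reduces attacks.
        better = []
        for c in range(n):
            original = cols[c]
            for r in range(n):
                if r != original:
                    cols[c] = r
                    score = attacks(cols)
                    if score < current:
                        better.append((c, r, score))
            cols[c] = original
        if not better:
            return None  # local optimum that is not a solution
        # Stochastic step: pick one of the improving moves at random.
        c, r, current = rng.choice(better)
        cols[c] = r
    return cols


def solve(n, seed=0):
    """Restart from fresh random boards until one climb succeeds (n >= 4)."""
    rng = random.Random(seed)
    while True:
        board = hill_climb(n, rng)
        if board is not None:
            return board
```

Calling `solve(8)` returns a list of eight row indices with no attacking pairs. Because every accepted move strictly lowers the attack count, each climb terminates, and the restarts supply the randomization that gets the search out of dead ends.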

The solution for an 8×8 board is instantaneous, but it took a few minutes on 4 cores to find a solution for 128×128. Supposedly, this algorithm has been used to solve the million-queens problem in under 2 seconds, but it was probably a more optimized solution, and written in a language like C++.

Can the backtracking algorithm used for the sudoku solver be used for the n-Queens problem? It can, but it turns out that it requires much more computation time. It was the approach taken before hill climbing blew it out of the water. When stochastic hill climbing with restarts (a similar algorithm that does not utilize parallelization) was found to work so quickly for the n-Queens problem (in the 1990s), it revived interest in this algorithm for a host of other problems. For example, it is used by airlines for flight scheduling due to its efficiency and its online property of being able to incorporate new data easily.

On the other hand, while hill climbing would also work for sudoku, sudoku appears to be solved with much less computation when using backtracking search. In the n-Queens problem, the solutions are densely distributed in the search space, whereas for sudoku, they are not. The answer to the question, “which algorithm should I use for this problem?” is not an easy one, and according to the no free lunch theorem, there is no one algorithm that is best for all problems.

You can check out the solvers at n-queens-solver and sudoku-solver.

Preserving Color in Style Transfer

Image style transfer is extremely fun and I just had to return to it one more time. I noticed that some people have started adding color transfer, as a recent paper came out on Preserving Color in Neural Artistic Style Transfer.

Throughout this post I’m modifying the following photo I took back in 2011 in Ireland.

Take this Starry Night Van Gogh rendition.

It is… blue.

But by preserving color, the result is much nicer. The color is transferred from the content image to the style image before the stylized version is generated. The mean and covariance across the RGB channels are updated in the style image to match the content image.
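One way to match those statistics is a linear map built from Cholesky factors of the two covariance matrices. The sketch below is an illustration using NumPy, not the code behind the images in this post; `match_color` and `eps` are hypothetical names, and images are assumed to be float arrays of shape (height, width, 3).

```python
import numpy as np


def match_color(style, content, eps=1e-5):
    """Recolour style so its RGB mean and covariance match content's."""
    s = style.reshape(-1, 3)
    c = content.reshape(-1, 3)
    mu_s, mu_c = s.mean(axis=0), c.mean(axis=0)
    # eps * I keeps both covariance matrices positive definite.
    cov_s = np.cov(s, rowvar=False) + eps * np.eye(3)
    cov_c = np.cov(c, rowvar=False) + eps * np.eye(3)
    # With cov = L L^T, the map A = L_c L_s^{-1} sends cov_s to cov_c.
    A = np.linalg.cholesky(cov_c) @ np.linalg.inv(np.linalg.cholesky(cov_s))
    out = (s - mu_s) @ A.T + mu_c
    return out.reshape(style.shape)
```

The result is not clipped, so values can fall slightly outside the original pixel range and may need clamping before display.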

The other method mentioned in the paper is luminance transfer. I will probably implement that as well and add it here. The math is very similar, and simpler in fact: a covariance matrix is not needed, as there is only one channel being changed.
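For a single channel, the matching step reduces to shifting and scaling by the mean and standard deviation. This is a sketch of that step only, under assumptions: the Rec. 601 luma weights are one common choice for computing luminance from RGB, `match_luminance` is a hypothetical name, and the style image is assumed not to be a constant color (otherwise its standard deviation is zero).

```python
import numpy as np

# Rec. 601 luma weights (an assumed choice of luminance definition).
LUMA = np.array([0.299, 0.587, 0.114])


def match_luminance(style, content):
    """Return style luminance rescaled to content's mean and std."""
    ls = style @ LUMA      # (height, width) luminance of the style image
    lc = content @ LUMA    # (height, width) luminance of the content image
    return (lc.std() / ls.std()) * (ls - ls.mean()) + lc.mean()
```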

Sometimes color transfer isn’t needed, but when it is, it’s pretty handy. It’s remarkable that this can be done by simply changing some statistics in the image.

Here are a couple of cool ones from today — they did not require/use color transfer but I had to include them because they are pretty cool!

DCGAN on MNIST

The code for this implementation is on github.

As part of the fast.ai Deep Learning For Coders part 2 course, we implemented the original GAN and DCGAN. The original GAN implementation uses simple multi-layer perceptrons for the generator and discriminator, and it does not work very well. DCGAN uses a convolutional architecture and does better. However, the result in the course notebook did not look impressive despite many thousands of epochs of training. It’s possible to get it to a better state as I have succeeded in doing here. There is room for more improvement still. I suspect the instructor did not try too hard to show that though, because the newer WGAN, introduced a bit later, is better and easier to train. However, this post is about implementing the DCGAN.

It took me a while to get it to work well. Despite trying many things, what seemed to finally do it was switching from the Adam optimizer to RMSprop, and adding weight clipping. These things, together with Dropout and Batch Normalization, stabilized training. I did not do comprehensive experiments, but 5×5 convolutions seemed to perform better on this particular dataset.

Here you can see how the GAN learns to generate increasingly better novel handwritten digits.

The outputs look realistic, but they can be improved. Future improvements and code are on github.

Generating Art with Computers

In the world of deep learning, there’s never a dull moment. The breadth of interesting applications seems unbounded. It’s been applied to reach super-human performance in many areas like game playing and object recognition, and it is integral to exciting new technologies like self-driving cars. Increasingly we find that the knowledge embodied in a trained neural network can be transferred to seemingly unrelated areas. Take for example the combination of two disparate ideas: object recognition and Picasso paintings. A net built for the sole purpose of recognizing objects in a photo has been found to be useful, completely unaltered, for the task of rendering an image in the style of any painting. This is the technology behind the scenes of apps like Prisma, but it is not hard to do yourself, and I have to say that implementing this was a lot of fun.

These images were created, as described in Gatys et al., by passing a random noise image through the VGG-16 convolutional network (with ImageNet weights and top layers removed) and updating the pixels of the input image directly with gradient descent. An error function is devised, which quantifies how poorly the initial seed (random noise at first) balances between appearing like the original image, and at the same time in the style of the chosen painting. The derivative of the loss with respect to the RGB values of the noise image is calculated. The pixels are adjusted in their respective directions, and the process repeats. Amazingly, this works – the image looks more and more like a stylized version of the original image with each iteration of this optimization. This is due to the unreasonable effectiveness of gradient descent, convolutional neural networks, and well designed loss functions.
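The optimization loop itself is simple once the gradient is available. The toy sketch below swaps out the VGG-16 features for the raw pixels, so the loss is just half the squared distance to a target image and its gradient can be written by hand. It only illustrates the idea of running gradient descent on the pixels rather than on network weights; `optimise_pixels` and its parameters are hypothetical.

```python
import numpy as np


def optimise_pixels(target, steps=100, lr=0.1, seed=0):
    """Gradient descent on pixel values for loss 0.5 * sum((image - target)**2)."""
    rng = np.random.default_rng(seed)
    image = rng.random(target.shape)   # start from random noise
    for _ in range(steps):
        grad = image - target          # derivative of the loss w.r.t. each pixel
        image -= lr * grad             # adjust every pixel downhill
    return image
```

In the real method the gradient comes from backpropagating the content and style errors through the network, but the update applied to the image has the same form.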

The error function combines content loss $E_c$ and style loss $E_s$. This balances preserving the high-level features of the original image against reproducing the textures of the painting. The errors compare the values of chosen convolutional layers when the input is the original image, generated image, or painting.

The content error is just the squared Euclidean distance between corresponding filter activations under the stylized and original images.
$\displaystyle E_c^l=\frac{1}{2}\sum _{i,j} {\left(F_{ij}^l-P_{ij}^l\right)^2}$ where $F_{ij}^l$ is the activation of the $i^{th}$ filter at position $j$ in layer $l$ for the stylized image, and $P_{ij}^l$ is the same activation for the original image.

The novelty that makes all this work is the style error. It can be done with a statistic on the channels of the convolutions in the higher layers of the network. This was originally designed to capture texture information in a texture synthesis algorithm. The style error of a layer is the mean squared error between the gramians, $\displaystyle E_s^l=\frac{1}{4 N^2 M^2} \sum _{i,j} {\left(G_{ij}^l-A_{ij}^l\right)^2}$, where $N$ is the number of feature maps in layer $l$, $M$ is the size of each feature map, $G_{ij}^l=\langle v_i,v_j \rangle$ is the inner product of the vectorized feature maps $i$ and $j$ in filter layer $l$ of the style image, and $A^l$ is the corresponding gramian when the input is noise. The inner product (gramian) shows the correlation between each pair of channels, and this captures texture information. The same thing in Keras code:

from keras import backend as K
from keras.metrics import mse  # assumed source of mse; the original import is not shown

def gram_matrix(x):
    # In tensorflow, dim order is x,y,channel
    # Make each row a channel
    chans = K.permute_dimensions(x, (2, 0, 1))
    # vectorize the feature maps
    features = K.batch_flatten(chans)
    # gramian is just an inner product, scaled by the number of elements
    return K.dot(features, K.transpose(features)) / x.get_shape().num_elements()

def style_loss(vi, vj):
    # mean squared error between the two gramians
    return mse(gram_matrix(vi), gram_matrix(vj))
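The two per-layer errors can also be checked numerically outside of Keras. The NumPy sketch below follows the formulas as written, including the $\frac{1}{4 N^2 M^2}$ factor, so it is scaled differently from the Keras version. It assumes the activations of a layer are already flattened to an array of shape (filters, positions), and the function names are hypothetical.

```python
import numpy as np


def content_error(F, P):
    """E_c for one layer: half the squared distance between activations."""
    return 0.5 * np.sum((F - P) ** 2)


def gram(F):
    """Gramian: inner products between every pair of vectorized feature maps."""
    return F @ F.T


def style_error(F, S):
    """E_s for one layer, with N feature maps of size M."""
    N, M = F.shape
    return np.sum((gram(F) - gram(S)) ** 2) / (4 * N**2 * M**2)
```

For example, with `F = np.eye(2)` compared against all-zero activations, `content_error` gives 1.0 and `style_error` gives 2 / 64 = 0.03125.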


The total error is $\alpha E_c + \beta E_s$ with the weights on each error as hyperparameters. The results are quite astounding. Playing with weights $\alpha$ and $\beta$, and with the weights on the contribution of each layer $l$ towards the style loss allows for a large range of interesting results. This picture was created by decreasing the weight of the content loss.

One major drawback of this method can be seen in the creation above. The style loss must be tempered, or else it obstructs features of the image, like the face. The third set of images, in the style of “Woman in a Hat with Pompoms and a Printed Shirt” is another example. So portraits in general are not the best target. However, landscapes and the like come out amazing. The reason for this is that the style loss function does not take into account anything about the style image except the texture of the image. There are alternative statistics that have been tried. One successful result that I have not yet tried comes from the study of Markov Random Fields, used classically for image synthesis. The idea is to calculate the loss between patches of the filters, rather than the whole filter at once, where the loss of each patch of the generated image is calculated against the most similar patch (by cross correlation) for the painting.

Another drawback is that each generated image/painting combo must be calculated separately. This has been addressed by Johnson et al. and others by training a feed-forward transformation network that turns an input image directly into a stylized image of a particular style, with a loss network (such as VGG-16 as above) used only during training to score the output. The benefit is the speed – once trained, generating a stylized version of an input image is about three orders of magnitude faster. The drawback is that the transformation network takes much longer to train and is only able to output images of the specific style it was trained on. However, this is the type of solution that can scale, for example to video. I have replicated exactly the network architecture as described in Johnson et al. Here’s an example result:

I had trouble getting rid of artifacts showing up in some input images, like the blotch of white on the right of the stylized image. However, I did not train the network for very long, primarily to avoid large AWS GPU server bills, so the results are not that great. I’ll probably come back to this soon, as I am building my own GPU server! There are so many ideas to explore with this.