Wordscapes with OpenCV
2022-10-04

I don't play a lot of games, but occasionally one catches my eye, usually because it looks like something I could have fun automating.
This time, the game was Wordscapes, a cross between Scrabble and a crossword. The aim is to find the words in the grid using a given set of letters; each letter can only be used once, and there are bonus points for extra words!
The mechanics of the game are trivial to automate: we just need to find all the words that can be made from the letters we have available. I'll spare you most of the details; we're just programmatically searching through a dictionary for matching words.
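For the curious, that search boils down to something like this (a rough sketch; the dictionary file and function name are my own placeholders rather than what the project actually uses):

from collections import Counter

def playable_words(letters, dictionary_path="words.txt", min_length=3):
    # A word is playable if it uses each available letter at most once
    available = Counter(letters.lower())
    with open(dictionary_path) as dictionary:
        for word in (line.strip().lower() for line in dictionary):
            if len(word) >= min_length and not (Counter(word) - available):
                yield word

print(sorted(playable_words("lineup")))

Right, let's get to the good bit.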
Parsing the Game Screen
The interesting parts of this problem revolve around interacting with the game. The game is designed for humans, not computers, so we'll have to do some computer vision work to extract the required information; OpenCV is ideal for this.
The first step of the process is isolating the wheel. First, we convert the image to greyscale (colour information isn't important to us) and then threshold it to give us a purely black and white representation. We also normalise the image to always have a black background with white letters (sometimes the wheel is white with black letters).
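Roughly, that preprocessing looks something like this (the filename and the threshold value are illustrative, not the exact values the code uses):

import cv2

screen = cv2.imread("screenshot.png")

# Drop the colour information, then threshold to pure black and white
grey = cv2.cvtColor(screen, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(grey, 127, 255, cv2.THRESH_BINARY)

# Normalise: if most of the image is white, invert it so we always end up
# with white letters on a black background
if cv2.countNonZero(binary) > binary.size / 2:
    binary = cv2.bitwise_not(binary)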
We can now easily mask out the remainder of the image as the wheel is always in the same position. To achieve this, all we need to do is bitwise AND a white circle over the letter wheel to leave us with just the letters.
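In code, that masking is roughly the following (the centre and radius are placeholders for wherever the wheel sits on your device):

import cv2
import numpy as np

# binary is the normalised black-and-white screenshot from the previous step.
# Build a mask that is white inside the wheel and black everywhere else.
mask = np.zeros(binary.shape, dtype=np.uint8)
cv2.circle(mask, (540, 1500), 400, 255, -1)  # filled white circle over the wheel

# Bitwise AND keeps only the pixels inside the circle: just the letters
wheel = cv2.bitwise_and(binary, mask)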
Our final computer vision step is to isolate the individual characters. As we're creating a generic solution that should work on any level with any number of characters, we can't use absolute coordinates to do this; we need to find the letters dynamically. The OpenCV findContours function provides us with a way to do this.
Contours can be explained simply as a curve joining all the continuous points (along the boundary), having same color or intensity. The contours are a useful tool for shape analysis and object detection and recognition. OpenCV
So, using findContours followed by boundingRect leaves us with each letter nicely isolated.
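A sketch of that step, carrying on from the masked wheel above (the size filter is a guess at a sensible way to drop noise, and the OpenCV 4 return values are assumed):

import cv2

# wheel is the masked black-and-white image of the letter wheel
contours, _ = cv2.findContours(wheel, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

letters = []
for contour in contours:
    x, y, w, h = cv2.boundingRect(contour)
    if w > 20 and h > 20:  # ignore tiny specks of noise
        letters.append((x, y, wheel[y:y + h, x:x + w]))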
OCR - Optical Character Recognition
Optical character recognition [...] (OCR) is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text Wikipedia
Now that we've extracted each character, we just need to identify which letter each one is. We could use machine learning to do this, but there's a simpler way to achieve the same result.
If we store a copy of each letter in a library (manually labelling it the first time we see it) and simply XOR it with the letter we want to identify, we'll be left with a diff of the two images. The result with the fewest differing pixels is our letter.
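A minimal sketch of that lookup, assuming the library is just a dict mapping a letter to its stored image and that everything has been resized to the same dimensions:

import cv2

def identify(unknown, library):
    # XOR the unknown letter against each stored letter; the entry that
    # leaves the fewest differing (white) pixels is the best match
    def difference(name):
        return cv2.countNonZero(cv2.bitwise_xor(unknown, library[name]))
    return min(library, key=difference)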
Let's run through a quick example. Take the letter "T" (currently unidentified) and XOR it with each letter in our library:
It's quite clear that when our unidentified letter is XORed with our library's "T", comparatively few pixels differ. Here's that same data represented as a graph; it's easy to see that the letter "T" is the closest match:
Game Interaction
We've identified our letters and we've searched through a dictionary for valid words; all that's left is to swipe them in and we're done!
Given that we already know the position of the letters on the screen thanks to the boundingRect function, we just need a way to actually send the swipe commands. Thankfully, Google provides us with a tool called Android Debug Bridge (adb for short) that allows us to interact with our device from a connected computer.
The adb shell command allows us to execute commands on the device; pair this with the input command and we're able to send a single swipe. If we wanted to bring up our app drawer from the home screen, for example, we could do something like this: adb shell input swipe 100 1000 100 100.
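From code, that single swipe can be fired off with something like the following (a subprocess sketch rather than the project's actual wrapper; the optional final argument to input swipe is the duration in milliseconds):

import subprocess

def swipe(x1, y1, x2, y2, duration_ms=300):
    # Wraps "adb shell input swipe" for a single straight-line swipe
    subprocess.run(
        ["adb", "shell", "input", "swipe",
         str(x1), str(y1), str(x2), str(y2), str(duration_ms)],
        check=True,
    )

swipe(100, 1000, 100, 100)  # the app drawer example from above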
Unfortunately, that's as far as this command can take us. As I mentioned above, the input command only allows us to send a single swipe, rather than the multiple continuous swipes that are needed to spell out our word.
The sendevent Command
Let's go one level lower, to the sendevent command. Unfortunately, as far as I've been able to find, this command isn't very well documented, so we're going to have to do a bit of work to figure out how to use it. All we get when we run sendevent --help is this: usage: sendevent DEVICE TYPE CODE VALUE, plus a single descriptive line of text stating Sends a Linux input event.
Thankfully, sendevent is an open source tool, so we can take a look at the source code. It appears as though the command is basically just taking a few parameters, storing them in a struct and writing them to a file. Not overly helpful in explaining what's going on.
There are a few comments and lines of code that, after some more searching, lead us to the Linux kernel documentation on event codes and documentation on the multi-touch (MT) protocol.
Using this documentation and cross-checking values with the output of the getevent command, we have enough information to emulate a continuous swipe on the screen. For example, to draw a triangle on the screen, we might do something like this:
sendevent /dev/input/event1 3 57 23 # EV_ABS ABS_MT_TRACKING_ID CONTACT SLOT
sendevent /dev/input/event1 3 53 360 # EV_ABS ABS_MT_POSITION_X X POSITION
sendevent /dev/input/event1 3 54 300 # EV_ABS ABS_MT_POSITION_Y Y POSITION
sendevent /dev/input/event1 3 48 10 # EV_ABS ABS_MT_TOUCH_MAJOR TOUCH DIAMETER
sendevent /dev/input/event1 3 58 10 # EV_ABS ABS_MT_PRESSURE PRESSURE
sendevent /dev/input/event1 1 330 1 # EV_KEY BTN_TOUCH KEY DOWN
sendevent /dev/input/event1 0 0 0 # EV_SYN SYN_REPORT
sendevent /dev/input/event1 3 53 580 # EV_ABS ABS_MT_POSITION_X X POSITION
sendevent /dev/input/event1 3 54 70 # EV_ABS ABS_MT_POSITION_Y Y POSITION
sendevent /dev/input/event1 0 0 0 # EV_SYN SYN_REPORT
sendevent /dev/input/event1 3 53 140 # EV_ABS ABS_MT_POSITION_X X POSITION
sendevent /dev/input/event1 3 54 700 # EV_ABS ABS_MT_POSITION_Y Y POSITION
sendevent /dev/input/event1 0 0 0 # EV_SYN SYN_REPORT
sendevent /dev/input/event1 3 53 360 # EV_ABS ABS_MT_POSITION_X X POSITION
sendevent /dev/input/event1 3 54 300 # EV_ABS ABS_MT_POSITION_Y Y POSITION
sendevent /dev/input/event1 0 0 0 # EV_SYN SYN_REPORT
sendevent /dev/input/event1 0 0 0 # EV_SYN SYN_REPORT
sendevent /dev/input/event1 1 330 0 # EV_KEY BTN_TOUCH KEY UP
Playing the Game
Now that we can interact with the device in the correct manner, everything falls into place; we can finally, automatically, play a level:
So we've achieved our goal of automatically completing a level of Wordscapes!
Speeding Things Up
Yeah, I was a bit disappointed too; I wanted to see some blazingly fast level completions, but it seems that launching sendevent for each and every fraction of a packet isn't particularly fast (who'd have thought).

Fortunately, we know exactly what sendevent is doing thanks to our earlier research. We can just replicate what we found in the source code and execute it as fast as we like in our own codebase, without having to launch a separate process each time.
Replicating sendevent boils down to creating this struct and writing it to /dev/input/event1:
struct input_event {
struct timeval time;
__u16 type;
__u16 code;
__s32 value;
};
A quick and easy job with the following Python: struct.pack('<QQHHi', 0, 0, event_type, code, value).
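Put together, writing a single point of a swipe looks roughly like this (the device path and event codes match the sendevent listing above; a 64-bit device is assumed for the two 8-byte timeval fields, and the helper name is my own):

import struct

EV_SYN, EV_KEY, EV_ABS = 0, 1, 3  # event types used in the sendevent calls above

def pack_event(event_type, code, value):
    # 64-bit input_event layout: timeval (two 8-byte fields, left as zero),
    # followed by type, code and value
    return struct.pack('<QQHHi', 0, 0, event_type, code, value)

with open('/dev/input/event1', 'wb', buffering=0) as device:
    device.write(pack_event(EV_ABS, 53, 360))  # ABS_MT_POSITION_X
    device.write(pack_event(EV_ABS, 54, 300))  # ABS_MT_POSITION_Y
    device.write(pack_event(EV_SYN, 0, 0))     # SYN_REPORT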
Once we've translated our calls from sendevent to this pack call and written the results directly to the device file, we're done! This is what we end up with:
That's a speed-up of 687% (from 24.4 seconds down to 3.1). In fact, we can go faster; it's now the device/game that struggles to keep up with processing the input.
I'll leave you with one more level, further into the game, for your enjoyment: