# Enginursday: Artificial Text

Playing around with auto-generating content, or predicting next words.

I got interested in the topic of statistically generating words after reading about it as a modern way to ‘brute force’ passwords. There are plenty of papers out there on the topic, such as this pseudorandom one. I started to implement my own when I realized that rather than predicting characters in words, predicting words in sentences could be much more interesting. One could make a chat bot that initially only says weird things. I suppose it might be great for hiding spam links in forum posts. Knowing which words are likely to follow previous ones is one input to swipe or predictive keyboards. Google’s autocomplete feature could be based on the same general concept.

I decided to make up my own algorithm rather than research the ‘correct’ way to do it. The basis behind my approach is to read in a large amount of source material to train with. This is used to generate probabilities that one word follows another.
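The actual project is written in C#, but the training step can be sketched in a few lines of Python. This is an illustrative reimplementation of the idea described above, not the article's code; all names are my own:

```python
from collections import Counter, defaultdict

def train(sentences):
    """For each word, count how often every other word follows it."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().rstrip(".").split()
        for current, nxt in zip(words, words[1:]):
            followers[current][nxt] += 1
    return followers

# The trivial example from the next section:
model = train(["This is a sentence.",
               "This is a sentence too.",
               "This is a word too."])
print(dict(model["a"]))  # {'sentence': 2, 'word': 1}
```

Dividing each count by the node's total then yields the probability that a given word comes next.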

### Trivial Example

Assume we have a training set that contains the following three sentences:

• This is a sentence.
• This is a sentence too.
• This is a word too.

Processing these three sentences produces something like the following simplified graph. Each node points to the possible nodes that can follow.

Simplified graph formed by three sentences

At each node, the probability of a given next word can be found by dividing the number of times that word appeared next by the total number of words that ever followed the node.

The edges between the nodes This, is, and a each have a probability of 1 (3/3 = 1). The node a has a more interesting set of edges leaving it. Since sentence has a count of 2, and word has a count of 1, the total count, or number of valid sentences that can continue from a, is 3. 2/3 of the sentences use the word ‘sentence’ after ‘a’, while 1/3 of them follow ‘a’ with ‘word’.
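That division is simple enough to sketch directly. A hypothetical helper in Python (not from the article's C# code), applied to the counts for the node a in the trivial example:

```python
def edge_probabilities(counts):
    """Turn follower counts into probabilities by dividing by the node's total."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Follower counts for the node 'a' in the three training sentences:
probs = edge_probabilities({"sentence": 2, "word": 1})
print(probs)  # {'sentence': 0.666..., 'word': 0.333...}
```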

Probability graph formed by three sentences

You may notice in the graph above that the sentence node is a little different than the rest. Half of sentences containing the word ‘sentence’ end there, while half continue on to the word ‘too’.
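One way to model that terminal behavior is to count a special end-of-sentence marker as just another follower. A sketch in Python under that assumption (the `<end>` sentinel is my own invention, not something from the article's implementation):

```python
from collections import Counter, defaultdict

END = "<end>"  # hypothetical sentinel; any token outside the vocabulary works

def train_with_ends(sentences):
    """Count followers, treating 'the sentence ends here' as a follower too."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().rstrip(".").split() + [END]
        for current, nxt in zip(words, words[1:]):
            followers[current][nxt] += 1
    return followers

model = train_with_ends(["This is a sentence.",
                         "This is a sentence too.",
                         "This is a word too."])
print(dict(model["sentence"]))  # {'<end>': 1, 'too': 1}
```

The node for ‘sentence’ now carries a 1/2 probability of ending the sentence and a 1/2 probability of continuing with ‘too’, matching the graph above.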

### Implementation

I’m learning C#, so I wrote an implementation in that. It can be found in this GitHub repository. It’s far from polished; I played with some language features in places for fun. Features like terminating on nodes such as sentence aren’t fully implemented, and at the time of this writing there are known bugs, many noted in comments. That doesn’t stop the existing code from proving the concept.

For source material, I used free ebooks from Project Gutenberg during development. I ran the app on “Alice’s Adventures in Wonderland” and on “Grimms' Fairy Tales”. Below is the output from those runs. I’ve printed out the top five words used to start sentences (with the number of times each was seen), followed by five sample auto-generated sentences.

```
Enter name of text file to read
pg11.txt
You entered 'pg11.txt', attempting read of that file
Whole file was read!
Done processing training data
Top 5 most common sentence starting words:
1: the (576)
2: i (163)
3: you (155)
4: and (110)
5: alice (94)

The company generally you what it was her the the the fight was over and morcar the house and tried to open it before said the the white rabbit but the banquet what i do you a very soon the the the the the march hare and this the world you like the the same the court and you in the the the work you she what the executioner the caterpillar and the the wood for you you if the place of the the united states and i get it home.
The following which you you may not the the project gutenberg you to get in a hoarse feeble voice.
E below.
However he was in fact there's no notion how the door she is i the company generally you you couldn't cut off the fact is you throw the dormouse and why the middle of the regular course.
Ah that's the great question is what.
```

Output from: Project Gutenberg’s “Alice’s Adventures in Wonderland”, by Lewis Carroll

Output from: The Project Gutenberg EBook of “Grimms' Fairy Tales”, by The Brothers Grimm
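The generation step that produced sentences like those above is essentially a weighted random walk over the graph. A minimal Python sketch of that idea, using hard-coded follower counts from the trivial example and my own `<end>` sentinel for sentence termination (none of this is the article's actual C# code):

```python
import random

# Follower counts for the trivial three-sentence example.
model = {
    "this": {"is": 3},
    "is": {"a": 3},
    "a": {"sentence": 2, "word": 1},
    "sentence": {"too": 1, "<end>": 1},
    "word": {"too": 1},
    "too": {"<end>": 2},
}

def generate(model, start="this", max_words=20):
    """Walk the graph, picking each next word with probability count/total."""
    words = [start]
    while len(words) < max_words:
        followers = model.get(words[-1])
        if not followers:
            break
        choice = random.choices(list(followers), weights=followers.values())[0]
        if choice == "<end>":
            break
        words.append(choice)
    return " ".join(words)

print(generate(model))  # e.g. "this is a sentence too"
```

`random.choices` already handles the count-to-probability division internally, so the raw frequencies can be used as weights directly.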

### Taking it Further

Years ago, some people generated papers that were good enough to be accepted into conferences. Here is an automatic CS paper generator.

This project was for fun, and I might spend more free time improving the existing code. There are plenty of opportunities for improvement that might be fun. If you are inclined, have fun playing!

• Infinitely many monkeys sitting at infinitely many typewriters will eventually produce the works of Shakespeare.

Even if you manage to set up the monkeys and reliable typewriters, you still have to FIND the one that’s writing Shakespeare.

I looked into this also many years ago, but in terms of auto-generating music. Play (somewhat) random notes and monitor some biofeedback sensors. Adjust the probability of note sequences based on what causes the desired biofeedback result (relaxed, excited, etc) and you have music personally tailored to your responses.

Hey! I can buy all that stuff at Sparkfun now! Maybe I’ll try again.

• Alice in Wonderland did some odd seeding. I found “the” in 5 doubles, 2 triples, and a quintuple on your example. I’m curious how that happened- a bug in the generation code, strange text in the source, or what. A quick drop in on the html version on gutenberg.org didn’t pop up with even a single instance of double “the”.

• You are absolutely correct. I don’t think that should happen. There aren’t any double “the”s in the source text, so a the node should never lead to another the node. That book was full of all sorts of made-up words like “Beau–ootiful Soo–oop!” which I didn’t address. That is one of the known bugs mentioned, but not explicitly. This was a fun side project that didn’t get the time I would have liked to give it. It’s pretty much the first thing I typed out, with little effort spent fixing things. I’ll hopefully revisit this. The sentences currently generated feel way too long, too.

• Infinitely many monkeys sitting at infinitely many typewriters will eventually produce the works of Shakespeare.

However, they’ll also produce a LOT of garbage. And actual monkeys at actual typewriters will produce a lot of jammed typewriters (from someone who learned to type on an actual typewriter).
