Enginursday: Artificial Text

Playing around with auto-generating content, or predicting next words.


I got interested in the topic of statistically generating words after reading about it as a modern way to 'brute force' passwords. There are plenty of papers out there on the topic, such as this pseudorandom one. I started to implement my own when I realized that rather than predicting characters in words, predicting words in sentences could be much more interesting. One could make a chat bot that initially only says weird things. I suppose it might be great for hiding spam links in forum posts. Knowing which words are likely to follow previous ones is one input to swipe or predictive keyboards. Google's autocomplete feature could be based on the same general concept.

I decided to make up my own algorithm rather than research the 'correct' way to do it. The basis behind my approach is to read in a large amount of source material to train with. This is used to generate probabilities that one word follows another.

Trivial Example

Assume we have a training set that contains the following three sentences:

  • This is a sentence.
  • This is a sentence too.
  • This is a word too.

Processing these three sentences produces something like the following simplified graph. Each node points to the possible nodes that can follow.

[Figure: Simplified graph formed by three sentences]
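
The graph for the three example sentences can be built with a few lines of code. What follows is a minimal Python sketch, not the C# code from the repository: the names `tokenize` and `build_graph` are made up for illustration, and lowercasing the words and dropping punctuation are assumptions about how the text is cleaned up.

```python
import re
from collections import Counter, defaultdict


def tokenize(sentence):
    """Lowercase a sentence and split it into words, dropping punctuation."""
    return re.findall(r"[a-z']+", sentence.lower())


def build_graph(sentences):
    """Map each word to a Counter of the words seen immediately after it."""
    graph = defaultdict(Counter)
    for sentence in sentences:
        words = tokenize(sentence)
        # Pair every word with the word that follows it in the sentence.
        for current, following in zip(words, words[1:]):
            graph[current][following] += 1
    return graph


training = [
    "This is a sentence.",
    "This is a sentence too.",
    "This is a word too.",
]
graph = build_graph(training)
for word, followers in graph.items():
    print(word, dict(followers))
# this {'is': 3}
# is {'a': 3}
# a {'sentence': 2, 'word': 1}
# sentence {'too': 1}
# word {'too': 1}
```

Each key is a node, and each entry in its counter is an outgoing edge together with the number of times that edge was seen.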

At each node, the probability of a given next word is the number of times that word followed the current one, divided by the total number of times any word followed it.

The edges between the nodes This, is, and a have a probability of 1 (3/3 = 1). The node a has a more interesting set of edges leaving it. Since sentence has a count of 2 and word has a count of 1, the total count, or number of valid sentences that can continue from a, is 3. Two thirds (2/3) of the sentences use the word 'sentence' after 'a', while one third (1/3) of them follow 'a' with 'word'.
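
The same arithmetic in a short Python sketch. The function name `next_word_probabilities` is hypothetical, and the counts are typed in by hand from the example rather than read from a file. Using `Fraction` keeps the results exact, so they match the fractions in the text.

```python
from collections import Counter
from fractions import Fraction


def next_word_probabilities(followers):
    """Turn follower counts into probabilities: count / total count."""
    total = sum(followers.values())
    return {word: Fraction(count, total) for word, count in followers.items()}


# Counts of the words that followed 'a' in the three example sentences.
followers_of_a = Counter({"sentence": 2, "word": 1})
probs = next_word_probabilities(followers_of_a)
for word, p in probs.items():
    print(f"{word}: {p}")
# sentence: 2/3
# word: 1/3

# 'is' was always followed by 'a', so that edge has probability 3/3 = 1.
print(next_word_probabilities(Counter({"a": 3})))
# {'a': Fraction(1, 1)}
```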

[Figure: Probability graph formed by three sentences]

You may notice in the graph above that the sentence node is a little different than the rest. Half of the sentences containing the word 'sentence' end there, while the other half continue on to the word 'too'.
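
One way to model a node where sentences can stop is to treat the end of a sentence as just another follower. The sketch below assumes a made-up `<end>` marker and a hypothetical function name `build_graph_with_end`. It is one possible approach, not necessarily how the C# code handles termination.

```python
import re
from collections import Counter, defaultdict

END = "<end>"  # hypothetical marker meaning "the sentence stops here"


def build_graph_with_end(sentences):
    """Count followers for each word, with END as a possible follower."""
    graph = defaultdict(Counter)
    for sentence in sentences:
        words = re.findall(r"[a-z']+", sentence.lower()) + [END]
        for current, following in zip(words, words[1:]):
            graph[current][following] += 1
    return graph


graph = build_graph_with_end([
    "This is a sentence.",
    "This is a sentence too.",
    "This is a word too.",
])
print(dict(graph["sentence"]))  # {'<end>': 1, 'too': 1}
print(dict(graph["too"]))       # {'<end>': 2}
```

With the marker in place, 'sentence' ends the sentence half of the time and continues to 'too' the other half, which matches the graph.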

Implementation

I'm learning C#, so I wrote an implementation in that. It can be found in this GitHub repository. It's far from polished. I played with some language features in places for fun. Features like terminating on nodes such as sentence aren't fully implemented. At the time of this writing, there are known bugs. Many are noted in the code comments. That doesn't stop the existing code from proving the concept.
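
For readers who would rather experiment in Python, here is a rough sketch of the whole idea: train on some text, then walk the graph and pick each next word at random, weighted by how often it was seen. This is not a port of the C# code. The `<start>` and `<end>` markers, the function names, and the `max_words` cap are all assumptions made for this example.

```python
import random
import re
from collections import Counter, defaultdict

START, END = "<start>", "<end>"  # hypothetical sentence boundary markers


def train(text):
    """Build follower counts from raw text, one sentence at a time."""
    graph = defaultdict(Counter)
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[a-z']+", sentence.lower())
        if not words:
            continue
        path = [START] + words + [END]
        for current, following in zip(path, path[1:]):
            graph[current][following] += 1
    return graph


def generate(graph, rng, max_words=30):
    """Random walk from START, weighting each step by follower counts."""
    words = []
    current = START
    while len(words) < max_words:  # assumed cap to avoid runaway sentences
        followers = graph[current]
        if not followers:
            break
        current = rng.choices(list(followers), weights=list(followers.values()))[0]
        if current == END:
            break
        words.append(current)
    return " ".join(words).capitalize() + "."


text = "This is a sentence. This is a sentence too. This is a word too."
graph = train(text)

# Most common sentence starting words, with the number of times seen.
print(graph[START].most_common(5))  # [('this', 3)]

rng = random.Random(0)  # seeded so that runs are repeatable
for _ in range(3):
    print(generate(graph, rng))
```

With such a tiny training set, the walk can only reproduce one of the three training sentences. A whole book gives it far more room to wander.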

For source material, I used free ebooks from Project Gutenberg during development. I ran the app on “Alice’s Adventures in Wonderland” and on “Grimms' Fairy Tales”. Below is the output from those runs. I've printed out the top five words used to start sentences (with the number of times each was seen), followed by five sample auto-generated sentences.

Enter name of text file to read pg11.txt
You entered 'pg11.txt', attempting read of that file
Whole file was read!
Done processing training data
Top 5 most common sentence starting words:
1: the (576)
2: i (163)
3: you (155)
4: and (110)
5: alice (94)
The company generally you what it was her the the the fight was over and morcar the house and tried to open it before said the the white rabbit but the banquet what i do you a very soon the the the the the march hare and this the world you like the the same the court and you in the the the work you she what the executioner the caterpillar and the the wood for you you if the place of the the united states and i get it home.
The following which you you may not the the project gutenberg you to get in a hoarse feeble voice.
E below.
However he was in fact there's no notion how the door she is i the company generally you you couldn't cut off the fact is you throw the dormouse and why the middle of the regular course.
Ah that's the great question is what.

Output from: Project Gutenberg's "Alice's Adventures in Wonderland", by Lewis Carroll

Enter name of text file to read pg2591.txt
You entered 'pg2591.txt', attempting read of that file
Whole file was read!
Done processing training data
Top 5 most common sentence starting words:
1: the (2989)
2: and (1134)
3: he (524)
4: you (454)
5: it (395)
He as she how it i pray as she i and he and the doctor for you the mother of the little tailor when the cask for he and and said we go at that as what the wife at his a long the apple and i and you go the horses harnessed to his in when my ball for and you but the mountain and last of the cold wind and take you the water as the king his a hurricane over the cat one so and the miser and rocks away the fine and it all the kitchen the lightnings played and smite the pan and the moon and well with his way and and the anvil with and so the sack of having the witch had she to the robber ran back the seed came up there the king's who in the fox before she at the weather grew so and household tales in the widow heard that the merchant would have the little brother and he they laid her and looked at the egg that and merely to my misery for and the ground he i and then you it and was so all i must then i and and you the words i will have and own the most beautiful child under the robber ran back i have long but the peasant again the little tailor he the chimney and he and i and one but of a fork and then have the head and and it the door it you go and it and then a large hall with one of you the straw the enchanted maiden drive round her the ship and and the three dresses one to you and fat she to and find you the castle all the well he the blue light and many a man and the well he and the horses harnessed to the princess the least they had on they to the liquor ran upon the astonishment of everyone he a reward for she when they then it the nuts are and he the trade went well to me and and of a wife and for the terms of the world and i am the girl and you it then it you and the virtues of that to the mouse you the wife but the king the cows and pray my rampion like the horses shook themselves and the people who at the cook had they then i and bolt the head and the young count entered into the water and the little house they for she i and you it the many a stone and ill.
Before him and his mother and you see with the dish had they she next morning when the bridegroom in he the eleven maidens had the cause of water in thy stable and the fire in the prince had his way and the bear the huntsmen and somebody took better care of one to and good and and pray as she and taking her the latter to that the other and when the heath again they soon fell asleep.
And she fell fainting to the cart and he i and took a husband but the blow at the old woman went a tale the bridegroom in the ringdove sang from his long the shoe and their way and fell on the boy the thorns and and the one in the raging beast which was much too heavy and i and as they at there she and you now and and he the fox and the chamber next morning the poor the sausage and he when my coat a wife i am not the hazeltree and the carter went the king had the next day when at the king's son and the door the same she and carried the same as the ghost down a golden goblet full and it at the huntsman for the bear the young wife he one he after he in the dark and the queen in the young wife he when everything that the door and the travelling musicians old sultan the forest she with all at the pretty insects enjoy themselves i take the token from a tree it he if the cloak and he but the morning to his long the axe and i and the third he had in and one and and good food they you look the end of the shoe and it the middle and were in there i take the twelve years drew near the beautiful bird that i and is she and what i and the weather grew so and the laws that the roots with the bacon.
Of you the same she and then in she that with the king as the words that the barrel so she and earls and they all and is with the salad and wishing to drink in she his footing and the seven of and you when the queen and he the hearth and said the snow had his if he one as when the queen's window.
You the bushes that the huntsman had the garden the sausage and will not the wife and said the trough.

Output from: The Project Gutenberg EBook of "Grimms' Fairy Tales", by The Brothers Grimm

Taking it Further

Years ago some people generated some papers that were good enough to be accepted into conferences. Here is an automatic CS paper generator.

This project was for fun, and I might spend more free time improving the existing code. There are plenty of opportunities for improvement that might be fun. If you are inclined, have fun playing!


Comments 4 comments

  • Robotguy / about 8 years ago / 2

    Infinitely many monkeys sitting at infinitely many typewriters will eventually produce the works of Shakespeare.

    Even if you manage to set up the monkeys and reliable typewriters, you still have to FIND the one that's writing Shakespeare.

    I looked into this also many years ago, but in terms of auto-generating music. Play (somewhat) random notes and monitor some biofeedback sensors. Adjust the probability of note sequences based on what causes the desired biofeedback result (relaxed, excited, etc) and you have music personally tailored to your responses.

    Hey! I can buy all that stuff at Sparkfun now! Maybe I'll try again.

  • Eri(c||k)^5 / about 8 years ago / 1

    Alice in Wonderland did some odd seeding. I found "the" in 5 doubles, 2 triples, and a quintuple on your example. I'm curious how that happened- a bug in the generation code, strange text in the source, or what. A quick drop in on the html version on gutenberg.org didn't pop up with even a single instance of double "the".

    • .Brent. / about 8 years ago / 1

      You are absolutely correct. I don't think that should happen. There aren't any double "the"s in the source text, so a the node should never lead to another the node. That book was full of all sorts of made up words like "Beau--ootiful Soo--oop!" which I didn't address. That is one of the known bugs mentioned, but not explicitly. This was a fun side project that didn't get the time I would have liked to have given it. It's pretty much the first thing I typed out, with little effort spent fixing things. I'll hopefully revisit this. The sentences currently generated feel way too long, too.
