themolotov.net


Random Phonetic Password Generator

I created a random phonetic password generator.

One night while talking to Alan, he mentioned that he was working on some homework. His assignment was creating a Java-based program that would generate random but pronounceable passwords. On a whim yesterday at work, I took a crack at it, but using PHP instead.

I think it turned out really well.

I assumed a few things when writing it:

  • A PHP script would be smaller.
  • I'm sure the exercise was more a test of Java skills than of the generator's implementation.
  • Given the above, he was given a data set.

I had no dataset except my own experience; my test was more in the implementation. After I finished, I was quite proud, but I wanted to look up any research out there on the subject. I actually found the source for a Java program, but was a little disappointed. As far as I can tell, its dataset is just an array for each letter which contains, for each letter of the alphabet, the probability of it occurring after that letter. I can't help but think this effort, while noble, is misguided.

The intent is to generate a random pronounceable password, not words hacked together. With probability alone, you drastically reduce the entropy of a random system (especially with only 26 characters!). Granted, I do pretty much the same thing, but I only list which letters I imagined may come after a letter. I excluded some intentionally, imagining strange words like 'ctshacsm' which would occur if you allowed words to be generated letter by letter on the sole premise that each new letter is likely to come after the previous one. To counter this, I required that two consecutive vowels or two consecutive consonants be followed by the opposite kind, as a weak measure to ensure that the results stay pronounceable. Still, there are improvements to be made.
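
For illustration, here is a rough sketch of that rule-based idea in Python rather than the PHP I actually wrote. The FOLLOWERS table, the reduced alphabet, and the function name generate are all made up for this example; they are not the lists from my script. Every letter in the table has at least one vowel and one consonant among its followers, so the "opposite kind" rule can always be satisfied.

```python
import random

VOWELS = set("aeiou")

# Hypothetical follower lists (an assumption for this sketch, not the real data):
# for each letter, the letters that are allowed to come right after it.
FOLLOWERS = {
    "a": "bdklmnrstiu", "e": "dlmnrstao", "i": "dklmnstao",
    "o": "bdklmnrstu",  "u": "bdlmnrstai",
    "b": "aeioulr",     "d": "aeiour",    "k": "aeioulr",
    "l": "aeioudkt",    "m": "aeioub",    "n": "aeioudkst",
    "r": "aeioudkmnst", "s": "aeiouklmnt", "t": "aeiour",
}

def generate(length, rng=random):
    """Build a word of `length` letters (length >= 1) from the follower lists."""
    word = rng.choice(sorted(FOLLOWERS))
    while len(word) < length:
        options = list(FOLLOWERS[word[-1]])
        if len(word) >= 2:
            a_is_vowel = word[-2] in VOWELS
            b_is_vowel = word[-1] in VOWELS
            if a_is_vowel == b_is_vowel:
                # Two vowels or two consonants in a row: force the opposite kind.
                options = [c for c in options if (c in VOWELS) != b_is_vowel]
        word += rng.choice(options)
    return word

if __name__ == "__main__":
    print([generate(8) for _ in range(5)])
```

Because every follower is chosen uniformly from a plain list, nothing here is weighted by how common a letter pair is; the lists only rule combinations in or out.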

I imagine that if we met halfway, we could incorporate some Bayesian probability into this generation, but that would require an entirely different approach. You could also easily add linguistic rules to my method so that verbal awkwardness is avoided. Further still, it would be interesting to classify words based on sounds (hard, soft, etc.) in order to avoid two hard syllables together or anything else that might be cumbersome.

I don't know the likelihood that this generator would produce actual words. I leave the math to a later date or someone else.

this entry

Mood: Happy
Music: The Shins
Location: home


comments

1

Alan

Thursday, November 30, 2006

The first half of the assignment consisted of taking an input text file and parsing it, keeping track of how many times one letter follows another. It then uses this 2nd-order entropy (1-current letter, 2-next letter; hence 2nd order) to produce the semi-pronounceable words.
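
As a rough sketch of that counting-and-sampling idea (in Python, not the Java used for the assignment), something like the following would do. The function names count_followers and generate, and the choice to restart from a random letter at a dead end, are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict

def count_followers(words):
    """Map each letter to a Counter of the letters seen right after it."""
    counts = defaultdict(Counter)
    for word in words:
        for current, nxt in zip(word, word[1:]):
            counts[current][nxt] += 1
    return counts

def generate(counts, length, rng=random):
    """Pick each next letter with probability proportional to its count."""
    starts = sorted(counts)
    word = rng.choice(starts)
    while len(word) < length:
        followers = counts.get(word[-1])
        if not followers:
            # Assumption: if nothing ever followed this letter, restart
            # from any letter that does have followers.
            word += rng.choice(starts)
            continue
        letters = list(followers)
        weights = [followers[c] for c in letters]
        word += rng.choices(letters, weights=weights)[0]
    return word

if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog".split()
    table = count_followers(sample)
    print([generate(table, 8) for _ in range(5)])
```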

However, for extra credit, I used 3rd-order entropy. Given one letter, keep track of how often the next letter follows it. Given two consecutive letters, keep track of how often the next letter follows the current 2-letter combination.
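
A minimal Python sketch of that two-table idea follows. The names count_contexts and next_letter are hypothetical, and falling back from the two-letter context to the one-letter context when the pair was never seen is an assumption about how the two tables might be combined.

```python
import random
from collections import Counter, defaultdict

def count_contexts(words, context_len=2):
    """Map each context of 1..context_len letters to counts of the next letter."""
    counts = defaultdict(Counter)
    for word in words:
        for i in range(1, len(word)):
            for k in range(1, context_len + 1):
                if i - k >= 0:
                    counts[word[i - k:i]][word[i]] += 1
    return counts

def next_letter(counts, word, rng=random, context_len=2):
    """Prefer the longest context seen in the sample, then try shorter ones."""
    for k in range(context_len, 0, -1):
        followers = counts.get(word[-k:]) if len(word) >= k else None
        if followers:
            letters = list(followers)
            weights = [followers[c] for c in letters]
            return rng.choices(letters, weights=weights)[0]
    return None  # nothing in the sample ever followed this ending

if __name__ == "__main__":
    table = count_contexts(["then", "thus", "that"])
    print(next_letter(table, "th"))
```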

The higher the order, the more likely you are to produce actual dictionary words. When generating passwords, this would obviously be a bad thing (kinda ruins the whole point). For more extra credit, I could have chosen to compare each generated word with a dictionary. If it is found, retry. Otherwise, accept.
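
The retry-on-dictionary-hit check could look roughly like this in Python. The name generate_non_word, the max_tries cap, and the tiny stand-in dictionary are all assumptions for the sketch; make_candidate stands for whatever generator is being used.

```python
def generate_non_word(make_candidate, dictionary, max_tries=1000):
    """Call make_candidate() until it yields something not in the dictionary."""
    for _ in range(max_tries):
        candidate = make_candidate()
        if candidate.lower() not in dictionary:
            return candidate  # not a known word: accept
        # Found in the dictionary: retry.
    raise RuntimeError("no non-dictionary candidate found")

if __name__ == "__main__":
    words = {"tree", "road"}                       # stand-in dictionary
    candidates = iter(["tree", "road", "trelo"])   # stand-in generator output
    print(generate_non_word(lambda: next(candidates), words))  # prints trelo
```

The cap on retries is only there so that a generator that keeps producing real words cannot loop forever.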

2

molotov

Saturday, December 2, 2006

1.) What kind of text files were these? Random text, Lorem ipsum? Technical documents? It would be interesting to see the different results from different types of files; you'd probably be able to see the different linguistic styles and things like that. It would seem though that parsing text files like that would produce... skewed results.

2.) The entropy thing is a good idea; I had not thought of it that way. I was thinking more of creating an algorithm based on rules, not results based on patterns. How did you deal with spaces and punctuation? Were those probabilities ignored? I'd think for a user-submitted length, they'd have to be - for random length, they'd have to be considered.

3.) For higher levels of entropy (once you pass the three or four letter mark, where an immense number of actual words exist), it would seem like you'd basically have a random-word chooser, not generator. You're absolutely right, that would ruin the whole point. What's the best method then? I think entropy rules and probabilities could only get you so far if you were going for pronounceability. I think you'd have to incorporate a sophisticated set of phonetic and linguistic rules. It would involve a dataset of all the phonemes in our language and comparing the previous two or three characters (actually the phonemes of those characters). With the method that I used (which, the more I look at it now, is quite disappointing), you basically end up with chunks of 'possible' syllables, which completely ignore the phonemes and pronounceability of the end result.

I'd be interested in seeing 50 or so words that your program generates to see the effects of the third-order entropy in action.

3

meatbot

Sunday, December 17, 2006

Hey Smartman,

I have made another little compilation for you to give to you when I return from Aridzona. I'm going to see how my poor old dad & mum are doing. Hope his ribs are healed. OUch!

Can you e-mail 2 me a URL for your kewl password generator?
Seems like you had a couple of versions... I thought it was great and wanted to fiddle around with it some more. :-)

Happy XMess!

Yippee!;-)

4

molotov

Sunday, December 17, 2006

Telegram for antcopter:

Email sent STOP Hope parents are well STOP Have a safe trip STOP.


I bought a textbook on linguistics to help me better understand how words are formed. I'd like to go back and refine the generator after I get some of it under my belt. Have a safe trip dude - if I don't see you before the break, I'd like Erin and I to take you to dinner after you get back.

345, out.

5

Alan

Wednesday, January 31, 2007

What kind of text files were these? Random text, Lorem ipsum? Technical documents?

It was completely up to us. My professor did not care what we used, as long as we included the text when we turned in the assignment. I tested it with a variety of English texts and found that the shorter the text, the more often actual words would be embedded in my "random" passwords.

It would seem though that parsing text files like that would produce... skewed results.

The larger the sample file, the less skewed the results would be.

How did you deal with spaces and punctuation? Were those probabilities ignored?

Punctuation was ignored, spaces were a special case. When parsing the text file, if I am looking at a "t" and the next character is a space, I do not count it as a "follow character." But, I do not count the first character of the next word as a "follow character" either. I simply do not credit that "t" with one at all. I just pick up with the next word.
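
A small Python sketch of that parsing rule, scanning the text one character at a time, might look like this. The function name count_followers_in_text is made up, and treating digits the same way as punctuation is an assumption, since only punctuation and spaces were mentioned.

```python
from collections import Counter, defaultdict

def count_followers_in_text(text):
    """Count letter-follows-letter pairs, never pairing across a space."""
    counts = defaultdict(Counter)
    previous = None
    for ch in text.lower():
        if ch.isspace():
            # Word boundary: the last letter gets no follower credited,
            # and the next word's first letter is not a follower either.
            previous = None
        elif ch.isalpha():
            if previous is not None:
                counts[previous][ch] += 1
            previous = ch
        # Anything else (punctuation; digits too, by assumption) is skipped
        # without ending the current word.
    return counts

if __name__ == "__main__":
    table = count_followers_in_text("It's a hat. That!")
    print(dict(table["h"]))
```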

I'd think for a user-submitted length, they'd have to be - for random length, they'd have to be considered.

When doing it the way I mentioned above, you do not need to know what length will be needed. The words will still be semi-pronounceable, assuming your sample text is large enough. Take my "t" for example. Nothing was credited with following it, but if my sample was large enough, there will be plenty of other characters to follow a "t" somewhere in the text.

it would seem like you'd basically have a random-word chooser, not generator...What's the best method then?

Given this method as an assignment, I had no choice but to implement it in this way. I think increasing to 4th- or 5th-order entropies would be too much to avoid actual words, especially if I do not want to implement the dictionary-check. 2nd-order works, but sometimes can result in requiring that "semi" is part of the program name. I think 3rd-order with a very large sample text would be the best route.

I'd be interested in seeing 50 or so words that your program generates to see the effects of the third-order entropy in action.

I'll try to stick my code in an applet and put it on my website.





All Content Copyright Jon Gartman 2006, unless otherwise noted.
This site is part of the molonet.
Spiral out, keep going.