Generating random project names using Chichewa words
Modern-day developers are no strangers to automatically generated project names. They have become common, and you may run into them if you use, for example:
Vercel for deploying projects
Docker (Desktop)
... and many other modern tools that generate names for things
I was thinking about this and figured it would be an interesting undertaking to try to generate some names from Chichewa words.
First of all, where do we get the words? The internet is full of Chichewa-speaking folks nowadays, and I guess I could just scrape a bunch of news sites and social media platforms for Chichewa content, but for this tiny experiment that's overkill. I thought we could mostly make use of this old-but-gold resource by my friend, Edmond Kachale, which he worked on years ago with his colleague Prof. Kevin Scannell.
The name gen algorithm
Initially, I tried to come up with something that picked random nouns from the Chichewa project I mentioned above, but the output wasn't appealing at all. I then changed the code to combine adjectives, adverbs, pronouns, and nouns in different ways, but it ended up being an exercise in linguistics which I am not qualified to perform.
In the end, I settled on a simple "algorithm": shuffle the list of words, pick the first N words from the shuffled list, and join them with a hyphen (-). I then modified the code to append a suffix extracted from a Base64-encoded string, so that names remain somewhat unique even when the "random" order produces the same words again.
I decided this would work better if I came up with my own list of words for the corpus.
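import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.Base64;
import java.util.Collections;
import java.util.List;

public class ChichewaNameGen {
    // SecureRandom seeds itself, so no manual setSeed call is needed.
    private static final SecureRandom rng = new SecureRandom();

    public static void main(String... args) throws Exception {
        // One word per line; copy into a mutable list so it can be shuffled.
        var words = new ArrayList<>(Files.readAllLines(Paths.get("./words.txt")));
        // args[0]: words per name (default 3), args[1]: number of names (default 10)
        var N = Integer.parseInt(args.length >= 1 ? args[0] : "3");
        var k = Integer.parseInt(args.length >= 2 ? args[1] : "10");
        for (int i = 0; i < k; i++) {
            System.out.println(generate(words, N));
        }
    }

    public static String generate(List<String> words, int maxWords) {
        // Shuffle in place, then take the first maxWords entries
        // (capped at the list size so subList cannot overrun).
        Collections.shuffle(words, rng);
        var w = String.join("-", words.subList(0, Math.min(maxWords, words.size())))
            .replace("'", "").toLowerCase();
        return w + "-" + randomHash(w);
    }

    public static String randomHash(String w) {
        // Base64-encode the current time plus the phrase, drop the padding,
        // then cut a 9-character window at a random offset.
        var alpha = Base64.getEncoder()
            .encodeToString((System.currentTimeMillis() + w)
                .getBytes(StandardCharsets.UTF_8)).replace("=", "");
        var start = rng.nextInt(alpha.length() - 8);
        return alpha.substring(start, start + 9);
    }
}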
Creating a Chichewa word list for quirkier phrases
I wanted a list of words that generates names that roll off the tongue somewhat nicely, or are at least quirky enough to be mildly interesting. I came up with 100 random but relatable Chichewa words and placed them, one per line, in words.txt, the file the code above reads. After several test runs, I found the results more pleasing.
Below is an example run that generates 5 random names of 3 words each from the list:
$ java ChichewaNameGen.java 3 5
# output of the random words with a substring from a base64 string
fotokoza-ndemanga-tilipo-XRpbGlwbw
zikwanje-malonje-malemba-WFsZW1iYQ
matumba-chiyambi-mwamva-tbXdhbXZh
basiketi-kufika-lolemba-sb2xlbWJh
kufika-chitedze-makope-1tYWtvcGU
I am confident that those are somewhat quirky phrases. I am not going to try to translate. Okay, I can't resist with the first one ;)
Roughly (fotokoza)-(ndemanga)-(tilipo)
-> (explain)-(your-thoughts/opinions)-(we-are-here/around)
How many names can we generate
The number of random names/phrases we can generate depends heavily on the number of words in the dataset. With 100 words and N = 3, this approach gives 100 * 99 * 98 = 970,200 possible phrases, since the order of the words matters and no word repeats within a name.
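To see how the count changes with the size of the word list or the number of words per name, the arithmetic can be sketched as below. The class and method names (NameSpace, orderedSelections) are made up for this illustration and are not part of the generator above.

```java
class NameSpace {
    // Number of ordered selections of k distinct words out of n:
    // n * (n - 1) * ... * (n - k + 1). Order matters because
    // "a-b-c" and "c-b-a" are different names.
    static long orderedSelections(int n, int k) {
        if (k < 0 || k > n) {
            return 0;
        }
        long total = 1;
        for (int i = 0; i < k; i++) {
            // multiplyExact throws instead of silently overflowing
            total = Math.multiplyExact(total, (long) (n - i));
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(orderedSelections(100, 2)); // 9900
        System.out.println(orderedSelections(100, 3)); // 970200
        System.out.println(orderedSelections(100, 4)); // 94109400
    }
}
```

Adding a fourth word grows the space far more than adding a few more words to the list would, at the cost of longer names.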
The last part of the name is a 9-character substring of a Base64 string, and it's this bit that helps reduce collisions if we ever need to generate many more names, for example in some kind of service with millions of users. That should make us safe-ish from running out of random names in most applications of this kind of thing. Collisions can still occur, but we have more leeway for generating new names/phrases before running out of options.
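To get a rough feel for why the suffix matters, here is a back-of-the-envelope sketch using the standard birthday approximation. It assumes every 3-word phrase is equally likely and that draws are independent, which is only an approximation of the shuffle. It also deliberately ignores the suffix, since the suffix is derived from a timestamp and the phrase itself and is not uniformly random. The class and method names are hypothetical.

```java
class CollisionEstimate {
    // Birthday approximation: probability that at least two of `draws`
    // independent, uniform picks from `space` possible names coincide.
    // p ~= 1 - exp(-draws * (draws - 1) / (2 * space))
    static double collisionProbability(long draws, double space) {
        double pairs = draws * (draws - 1) / 2.0;
        return -Math.expm1(-pairs / space);
    }

    public static void main(String[] args) {
        // Ordered 3-word phrases from a 100-word list, without the suffix
        double phrases = 100.0 * 99 * 98;
        for (long draws : new long[] {100, 1_000, 10_000}) {
            System.out.printf("%d names -> collision probability %.3f%n",
                draws, collisionProbability(draws, phrases));
        }
    }
}
```

Under these assumptions, the chance of at least one repeated phrase is already around 40% after a thousand names, so the words alone would not get us very far without the extra suffix.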
Conclusion
This was an interesting little experiment. I was pleased with the results and plan on using this somewhere in a future project. I might come back to this idea later and try to use the Chichewa dataset properly to create phrases that are built up grammatically, but for now a random list of words works. Here is a link to the Gist with the 100 random Chichewa words.
Thanks to Seth, Chimwemwe and Walter for reviewing drafts of this post.