We have added a new sample, WordCount, in our 0.5 release of ActorFx. WordCount is meant to demonstrate ActorFx’s ability to effect dataflow-type computations over arbitrary node topologies, employing a MapReduce-type algorithm to count the occurrences
of all words in a selected set of textual documents.
Running the Sample
(Before running the app, make sure that you have a local Actor Runtime deployed, and that the EmptyActorApp has been deployed to that cluster.)
The WordCount app looks like this when it starts up:
You need to specify a few things before initiating the word count:
- You need to select some textual files for processing; you do this by pressing the “Change …” button and selecting your input files. WordCount comes with a number of Shakespeare texts that can be chosen, or you can point to any arbitrary set of text
- You are also asked to specify the number of mappers and the number of reducers; you can have anywhere from 1 to 4 (inclusive) of each. If you don’t specify these values, they default to 1.
After specifying the input files and the topology, the WordCount window looks something like this:
You can now press the “Start Count” button to kick off the word count operation. When you do, you are presented with a screen showing the progress of the calculation (the yellow blocks on the top are the mapper actors, the green blocks in the middle
are reducer actors, and the red block on the bottom is the aggregator actor):
Once the calculation completes, a “Top 5” dialog will pop up:
Those are the top 5 most frequent words found in your texts, along with the number of occurrences of each. Pressing the “OK” button on the “Top 5” dialog will cause it and the progress dialog to disappear, leaving you with just the original application
Back at the original application window, you can now search for words and their frequency of occurrence in your texts. Your search term is interpreted as a regular expression. You can see from this example how you would specify “find all words
that start with ‘love’”, and get back the number of occurrences of each word found:
The WordCount Algorithm
The way that we implemented WordCount was to launch three classes of actors: some mapper actors, some reducer actors, and an aggregation actor. Each are populated with the necessary logic to perform their respective functions by sending an assembly
containing actor methods to each.
The user-specified textual files are “dealt” to the mappers in a round-robin fashion. For each input file, a mapper will:
- Ascertain the number of reducers and construct a Dictionary<string,int> for each.
- For each word in the input file (excluding a set of common “don’t care” words), the mapper will determine the hash function for the word and thereby determine the appropriate reducer index ( = hash value % #reducers) for the word. The word is then
put into the appropriate dictionary, either incrementing the existing count for the word or setting the count to 1 for new words.
- After all words have been processed, all dictionaries are serialized to files. The mapper then passes name of each produced file to the corresponding reducer.
- If there are more input files, process the next file. Otherwise, send a “done” message to each reducer.
For each input filename that a reducer receives, it will read in the serialized dictionary and merge its information into a running dictionary that is tracking overall word counts for that reducer. When the reducer has received “done” messages from
each feeding mapper, it will serialize its running dictionary to a file, and pass the name of that file to the aggregator.
The aggregator will merge each reducer’s input into a final dictionary. Once it receives and processes input from all reducers, it will signal that the computation has completed. The aggregator can then be queried for frequency information about
specific words or regular expressions.
The following diagram summarizes the algorithm (assuming the topology in our example above):
(And if you look at the code, you’ll note that the mapper logic and reducer logic have been purposefully slowed down, so that the app takes a while to complete and the app-runner has the satisfaction of seeing meaningful status data rather than the app just
Some final notes
The WordCount sample demonstrates many of the features of ActorFx:
- Actor behavior is assigned via the passing in of an assembly containing actor methods.
- File names and “done” messages are passed from one tier to the next using actor-to-actor method calls.
- Status information is emitted and collected using ActorFx pub/sub mechanisms.
The sample has a lot of room for improvement:
- It can currently only be run on a local “simulated” Actor Runtime cluster, because it relies on a local file system to marshal results from one actor to another. In the future we can use Azure blobs, or something like that, to hold and convey reduction
information between levels; this would allow us to run the sample on Azure.
- It could be enhanced so that it constructs inverted indices and can therefore tell you how many times a certain word occurs in *each* input file, and where in the file a word occurs.
- The sample in its current form is not tolerant of node failure. To be more Hadoop-like, and more resistant to failure, we could incorporate a “scheduler” actor that marshaled information from one node to the next, and re-started calculations when
node failure was detected.