Thursday, July 29, 2010

Feeling charitable toward Baylor’s IDC cubs

The reason I come off as a nasty bastard on this blog is that I harbor quite a bit of anger toward the creationist bastards who duped me as a teenager. The earliest stage of overcoming my upbringing was the worst time of my life. I wanted to die. Consequently, I am deadly serious in my opposition to “science-done-right proves the Bible true” mythology. William A. Dembski provokes me especially with his prevarication and manipulation. He evidently believes that such behavior is moral if it serves higher ends in the “culture war.” My take is, shall we say, more traditional.

When Robert J. Marks II, Distinguished Professor of Engineering at Baylor University, and Fellow of the Institute of Electrical and Electronics Engineers (IEEE), began collaborating with Dembski, I did not rush to the conclusion that he was like Dembski. But it has become apparent that he is willing to play the system. For instance, Dembski was miraculously elevated to the rank of Senior Member of the IEEE, which only 5% of members ever reach, in the very year that he joined the organization. To be considered for elevation, a member must be nominated by a fellow.

Although Marks was the founding president of the progenitor of the IEEE Computational Intelligence Society, which addresses evolutionary computation (EC), he and his IDCist collaborators go to the IEEE Systems, Man, and Cybernetics Society for publication. He is fully aware that reviewers there are unlikely to know much about EC, and are likely to give the benefit of the doubt to a paper bearing his name. I would love to see him impugn the integrity of his and my colleagues in the Computational Intelligence Society by claiming that they don’t review controversial work fairly. But it ain’t gonna happen.

I’ve come to see Marks as the quintessential late-career jerk, altogether too ready to claim expertise in an area he has never engaged vigorously. He is so cocksure as to publish a work of apologetics with the title Evolutionary Computation: A Perpetual Motion Machine for Design Information? (Chap. 17 of Evidence for God, M. Licona and W. A. Dembski, eds.). He states outright some misapprehensions that are implicit in his technical publications. Here’s the whopper: “A common structure in evolutionary search is an imposed fitness function, wherein the merit of a design for each set of parameters is assigned a number.” Who are you, Bob Marks, to say what is common and what is not in a literature you do not follow? Having scrutinized over a thousand papers in EC, and perused many more, I say that you are flat-out wrong. There’s usually a natural, not imposed, sense in which some solutions are better than others. Put up the references, Distinguished Professor Expert, or shut up.

Marks and coauthors cagily avoid scrutiny of their (few) EC sources by dumping on the reviewers references to entire books, i.e., with no mention of specific pages or chapters. This is because their EC veneer will not withstand a scratch. The chapter I just linked to may seem to contradict that, given its references to early work in EC by Barricelli (1962), Crosby (1967), and Bremmerman [sic] et al. (1966). [That's Hans-Joachim Bremermann.] First, note the superficiality of the references. Marks did not survey the literature to come by them. The papers appear in a collection of reprints edited by David Fogel, Evolutionary Computation: The Fossil Record (IEEE Press, 1998). Marks served as a technical editor of the volume, just as I did, and he should have cited it.

Although Marks is an electrical engineer, he has been working with two of Baylor’s graduate students in computer science, Winston Ewert and George Montañez. I would hazard a guess that there is some arrangement for the students to turn their research with Marks into masters’ theses. I’ve been sitting on some errors in their most recent publication, Efficient Per Query Information Extraction from a Hamming Oracle, thinking that the IDC cubs would get what they deserved if they included the errors in their theses. Well, I’ve got a soft spot for students, and I’m feeling charitable today. But there’s no free lunch for Marks. He has no business directing research in EC, his reputation in computational intelligence notwithstanding, and I hope that the CS faculty at Baylor catch on to the fact.

On first reading the paper, I was deeply annoyed by the combination of a Chatty-Cathy, self-reference-laden introduction focusing on “oracles,” irrelevant to the majority of the paper, with a non-survey of the relevant literature in the theory of EC. Ewert et al. dump in three references to books, without discussion of their content, at the beginning of their 4-1/2 page section giving Markov-chain analyses of evolutionary algorithms. It turns out that one of the books does not treat EC at all — I contacted the author to make sure.

As I have discussed here and here, two of the algorithms they analyze are abstracted from defective Weasel programs that Dawkins supposedly used in the mid-1980's. It offends me to see these whirlygigs passed off as objects worthy of analysis in the engineering literature.

Yet again, they express the so-called average active information per query as $$I_\oplus = {{I_\Omega} \over Q} = \frac{\log N^L}{Q} = {{L \log N} \over Q},$$ where Q is not the simple random variable it appears to be, but is instead the expected number of trials (“queries”) a procedure requires to maximize the number of characters in a “test” string that match a “target” string. Strings are over an alphabet of size N, and are of length L. Unless you have something to hide, you write $$I_\oplus ={{L \log N} \over {E[T]}},$$ where T is the random number of trials required to obtain a perfect match of the target. This is a strange idea of an average, and it appears that a reviewer said as much. Rather than acknowledge the weirdness overtly, Ewert et al. added a cute “yeah, we know, but we do it consistently” footnote. Anyone without a prior commitment to advancing “intelligence creates active information” ideology would simply flip the fraction over to get the average number of trials per bit of endogenous information IΩ, $$\frac{1}{I_\oplus} = E\left[{T \over {I_\Omega}}\right] = {{E[T]} \over {L \log N}}.$$ This has a clear interpretation as expected performance normalized by a measure of problem hardness. But when it’s “active information or bust,” you’re not free to go in any sensible direction available to you. I have to add that I can’t make a sensible connection between average active information per query and active information. Given a bound K on the number of trials to match the target string, the active information is $$I_+ = \log \Pr\{T \leq K\} + {L \log N}.$$ Do you see a relationship between I+ and I that I’m missing?

By the way, I happened upon prior work regarding the amount of information required to solve a problem. The scholarly lassitude of the IDC “maverick geniuses” glares out yet again.

On second reading, I bothered to do sanity checking of the plots. I saw immediately that the surfaces in Fig. 2 were falling off in the wrong directions. For fixed alphabet size N, the plots show the average active information per query increasing as the string length L increases, when it obviously should decrease. The problem is harder, not easier, when the target string is longer. Comparing Fig. 5 to Figs. 3 and 4, it’s easy to see that the subscripts for N and L are reversed somewhere. But what makes Fig. 3 cattywampus is not so simple. Ewert et al. plot $$I_\oplus(L, N) = \frac{L \log N}{E[T_{N,L}]}$$ instead of $$I_\oplus(L, N) = \frac{L \log N}{E[T_{L,N}]}.$$ That is, the matrix of expected numbers of trials to match the target string is transposed, but the matrix of endogenous information values is not.

The embarrassment here is not that the cubs got confused about indexing of square matrices of values, but that a team of four, including Marks and Dembski, shipped out the paper for review, and then submitted the final copy for publication, with nary a sanity check of the plots. From where I sit, it appears that Ewert and Montañez are getting more in the way of indoctrination than advisement from Marks and Dembski. Considering that various folks have pointed out errors in every paper that Marks and Dembski have coauthored, you’d think the two would give their new papers thorough goings-over.

It is sad that Ewert and Montañez probably know more about analysis of algorithms than Marks and Dembski do, and evidently are forgetting it. The fact is that $$E[T_{L,N}] = \Theta(N L \log L)$$ for all three of the evolutionary algorithms they consider, provided that parameters are set appropriately. It follows that $$I_\oplus = \Theta\left(\frac{L \log N}{N L \log L}\right) = \Theta\left(\frac{\log N}{N \log L}\right).$$ In the case of (C), the (1, λ) evolutionary algorithm, setting the mutation rate to 1 / L and the number of offspring λ to N ln L does the trick. (Do a lit review, cubs — Marks and Dembski will not.) From the perspective of a computer scientist, the differences in expected numbers of trials for the algorithms are not worth detailed consideration. This is yet another reason why the study is silly.

Methinks it is like the OneMax problem

The optimization (not search) problem addressed by Ewert et al. (and the Weasel program) is a straightforward generalization of a problem that has been studied heavily by theorists in evolutionary computation, OneMax. In the OneMax problem, the alphabet is {0, 1}, and the fitness function is the number of 1's in the string. In other words, the target string is 11…1. If the cubs poke around in the literature, they’ll find that Dembski and Marks reinvented the wheel with some of their analysis. That’s the charitable conclusion, anyway.

Winston Ewert and George Montañez, don’t say the big, bad evilutionist never gave you anything.

Wednesday, July 28, 2010

Creeping elegance, or shameless hacking?

In my previous post, I did not feel great about handling processes in the following code for selection, but I did not see a good way around it.

from heapq import heapreplace

def selected(population, popSize, nSelect, best = None):
if best == None:
best = nSelect * [(None, None)]
threshold = best[0][0]
nSent = 0

for process in processes:
if nSent == popSize: break
process.submit(population[nSent], threshold)
nSent += 1

for process, x, score in scored(popSize):
if score > threshold:
heapreplace(best, (score, x))
threshold = best[0][0]
if nSent < popSize:
process.submit(population[nSent], threshold)
nSent += 1

return best

What I really want is for selected to know nothing about parallel processing, and for the generator of scored individuals to know nothing about selection. The problem is that threshold changes dynamically, and the generator needs a reference to it. As best I can tell, there are no scalar references in Python. Having taught LISP a gazillion times, I should have realized immediately that I could exploit the lexical scoping of Python, and pass to the generator a threshold-returning function defined within the scope of selected.

def selected(self, population, nSelect, best = None):
if best is None:
best = nSelect * [(None, None)]

threshold = lambda: best[0][0]

for x, score in evaluator.eval(population, threshold):
if score > best[0][0]:
heapreplace(best, (score, x))

return best

Perhaps I should not be blogging about my first Python program. Then again, I’m not the worst programmer on the planet, and some folks may learn from my discussion of code improvement. This go around, I need to show you the generator.

def eval(self, population, threshold):
popSize = len(population)
nSent = 0

for process in self.processes:
if nSent == popSize: break
process.submit(population[nSent], threshold())
nSent += 1

for unused in population:
yield process.result()
if nSent < popSize:
process.submit(population[nSent], threshold())
nSent += 1

This is a method in class Evaluator, which I plan to release. No knowledge of parallel processing is required to use Evaluator objects. The __init__ method starts up the indexed collection of processes, each of which knows its own index. It also opens a Pipe through which processes send their indexes when they have computed the fitness of individuals submitted to them. The Evaluator object’s Connection to the pipe is named isReady.

The first for loop comes from the original version of selected. Iteration over population in the second for loop is just a convenient way of making sure that a result is generated for each individual. In the first line of the loop body, a ready process is identified by receiving its index through the isReady connection. Then the generator yields the result of a fitness evaluation. The flow of control stops flowing at this point, and resumes only when selected returns to the beginning of its for loop and requests the next result from the eval generator.

When execution of the generator resumes, the next unevaluated individual in the population, if any, is submitted to the ready process, along with the value of a call to the threshold function. The call gives the current value of best[0][0], the selection threshold.

By the way, the Pipe should be a Queue, because only the “producer” processes, and not the “consumer” process, send messages through it. But Queue is presently not functioning correctly under the operating system I use, Mac OS X.

Monday, July 26, 2010

Efficient selection with fitness thresholds, heaps, and parallel processing — easier done than said

The obvious approach to selection in an evolutionary algorithm is to preserve the better individuals in the population and cull the others. This is known as truncation selection. The term hints at sorting a list of individuals in descending order of fitness, and then truncating it to length nSelect. But that is really not the way to do things. And doing selection well is really not that hard. After providing a gentle review of the considerations, I’ll prove my point with 18 lines of code.

A principle of computational problem solving is not to waste time determining just how bad a bad solution is. Suppose we’re selecting the 3 fittest of a population of 10 individuals, and that the first 3 fitness scores we obtain are 90, 97, and 93. This means that we’re no longer interested in individuals with fitness of 90 or lower. If it becomes clear in the course of evaluating the fourth individual that its fitness does not exceed the threshold of 90, we can immediately assign it fitness of, say, 0 and move on to the next individual.

Use of the threshold need not be so simple. For some fitness functions, a high threshold reduces work for all evaluations. An example is fitness based on the Levenshtein distance of a string of characters t from a reference string s. This distance is the minimum number of insertions, deletions, and substitutions of single characters required to make the strings identical. Fitness is inversely related to distance. Increasing the threshold reduces the number of possible alignments of the strings that must be considered in computing the distance. In limited experiments with an evolutionary computation involving the Levenshtein distance, I’ve halved the execution time by exploiting thresholds.

A natural choice of data structure for keeping track of the nSelect fittest individuals is a min heap. All you need to know about the heap is that it is stored in an indexed data structure, and that the least element has the least index. That is, the threshold element is always heap[0] when indexing is zero-based. The heap is initialized to contain nSelect dummy individuals of infinitely poor fitness. When an individual has super-threshold fitness, it replaces the threshold element, and the heap is readjusted.

Evolutionary computations cry out for parallel processing. It is cruel and immoral to run them sequentially on computers with multiple processors (cores). But I have made it seem as though providing the fitness function with the selection threshold depends upon sequential evaluation of individuals. There are important cases in which it does not. If parents compete with offspring for survival, then the heap is initialized at the beginning of the run, and is reinitialized only when the fitness function changes — never, in most applications. Also, if the number of fitness evaluations per generation exceeds the number of processors, as is common with present technology, then there remains a sequential component in processing.

The way I’ve approached parallel processing is to maintain throughout the evolutionary run a collection of processes dedicated to fitness evaluation. The processes exist when fitness evaluation cum selection begins. First an individual is submitted, along with the threshold, to each process. Then fitness scores are received one by one. For each score received, the heap and threshold are updated if necessary, and an unevaluated individual is submitted, along with the threshold, to the process that provided the score. In the experiments I mentioned above, I’ve nearly halved the execution time by running two cores instead of one. That is, the combined use of thresholds and two fitness-evaluation processes gives almost a factor-of-4 speedup.

The Python function

I’m going to provide an explanation that any programmer should be able to follow. But first look the code over, considering what I’ve said thus far. The heap, named best, is an optional parameter. The variable nSent registers the number of individuals that have been submitted to evaluation processes. It steps from 0 to popSize, the size of the population.

from heapq import heapreplace

def selected(population, popSize, nSelect, best = None):
if best == None:
best = nSelect * [(None, None)]
threshold = best[0][0]
nSent = 0

for process in processes:
if nSent == popSize: break
process.submit(population[nSent], threshold)
nSent += 1

for process, x, score in scored(popSize):
if score > threshold:
heapreplace(best, (score, x))
threshold = best[0][0]
if nSent < popSize:
process.submit(population[nSent], threshold)
nSent += 1

return best

If no heap is supplied, best is set to an indexed collection of nSelect (unfit, dummy) pairs represented as (None, None). This works because any (fitness, individual) pair is greater than (None, None). The expression best[0][0] yields the fitness of the least fit individual in the heap, i.e., the threshold fitness for selection.

The first for loop submits to each of the waiting processes an individual in population to evaluate, along with threshold. [My next post greatly improves selected by eliminating the direct manipulation of processes.] The loop exits early if there is a surplus of processes. The processes are instances of a subclass of multiprocessing.Process that I have defined, but am “hiding” from you. I am illustrating how to keep the logic of parallel processing simple through object-oriented design. You don’t need to see the code to understand perfectly well that process.submit() communicates the arguments to process.

The second for loop iterates popSize times, processing triples obtained from scored. Despite appearances, scored is not a function, but a generator. It does not return a collection of all of the triples. In each iteration, it yields just one (process, x, score) to indicate the process that most recently communicated an evaluation (x, score). This indicates not only that the fitness of individual x is score, but that process is waiting to evaluate another individual. If the new score exceeds the selection threshold, then (score, x) goes into the best heap, and threshold is updated. And then the next unevaluated individual in the population, if any, is submitted along with the threshold to the ready process.

When the loop is exited, each individual has had its chance to get into the best heap, which is returned to the caller. By the way, there’s an argument to be made that when the best heap is supplied to the function, an individual with fitness equal to that of the worst in the heap should replace the worst. Presumably the heap contains parents that are competing with offspring for survival. Replacing parents with offspring when they are no better than the offspring can enhance escape from fitness plateaus.

Tuesday, July 13, 2010

Sure mutation in Python

In Python, “lazy” mutation goes something like this:

for i in range(len(offspring)):     if random() < mutation_rate:         offspring[i] = choice(alphabet) 

The random() value is uniform on [0, 1), and the choice function returns a character drawn uniformly at random from alphabet. It follows from my last post that this can be made right within the implementation of an evolutionary algorithm by defining

adjusted_rate = \     len(alphabet) / (len(alphabet) - 1) * mutation_rate

and using it in place of mutation_rate. “And that’s all I have to say about that.”

If you want a mutation operator that surely mutates, the following code performs well:

from random import randint alphabet = 'abcdefghijklmnopqrstuvwxyz ' alphaSize = len(alphabet) alphaIndex = \     dict([(alphabet[i], i) for i in range(alphaSize)]) def mutate(c):     i = randint(0, alphaSize - 2)     if i >= alphaIndex[c]:         i += 1     return alphabet[i]

Here alphaIndex is a dictionary associating each character in the alphabet with its index in the string alphabet. The first character of a string is indexed 0. Thus the expressions alphaIndex['a'] and alphaIndex['d'] evaluate to 0 and 3, respectively. For all characters c in alphabet,

alphaIndex[c] == alphabet.index(c).

Looking up an index in the dictionary alphaIndex is slightly faster than calling the function alphabet.index, which searches alphabet sequentially to locate the character. The performance advantage for the dictionary would be greater if the alphabet were larger.

The function mutate randomly selects an index other than that of character c, and returns the character in alphabet with the selected index. It starts by calling randint to get a random index i between “least index” (0) and “maximum index minus 1.” The trick is that if i is greater than or equal to the index of the character c that we want to exclude from selection, then it is bumped up by 1. This puts it in the range alphaIndex[c] + 1, …, alphaSize - 1 (the maximum index). All indexes other than that of c are equally likely to be selected.

Monday, July 12, 2010

The roly poly and the cockroach

You may have noted in my post on Dembski's Weasel-whipping that I gave .0096 as the mutation rate of the Dobzhansky program, a conventional (1, 200) evolutionary algorithm (EA), while Yarus indicates that “1 in 100 characters in each generation” are mutated. Well, the slight discrepancy is due to the fact that the program uses the “lazy” mutation operator that I slammed as a bug in the algorithms analyzed by Ewert, Montañez, Dembski, and Marks [here]. I should explain that what is a roly poly in one context is a big, fat, nasty cockroach in another.

To mutate is to cause or to undergo change. That is, mutation is actual change, not an attempt at change. The lazy mutation operator simply overwrites a character in a phrase with a character drawn randomly from the alphabet, and sometimes fails to change the character. For an alphabet of size N, the mutation rate is (N − 1) / N times the probability that the operator is invoked. For the Dobzhansky program, N = 27, and
26 / 27 × .01 ≈ .0096.
The difference between .01 and .0096 is irrelevant to what Yarus writes about the program.

Correcting an EA implementation that uses a lazy mutation operator is trivial. Set the rate at which the operation is performed to N / (N − 1) times the desired mutation rate. Goodbye, roly poly.

No such trick exterminates the cucaracha of Ewert et al. Their algorithms (A) and (B), abstracted from the apocryphal Weasel programs, apply the lazy mutation operator to exactly one randomly selected character in each offspring phrase. The alphabet size ranges from 1 to 100, so the probability that an offspring is a mutant ranges from 0 / 1 to 99 / 100. As I explained in my previous post, it is utterly bizarre for the alphabet size to govern mutation in this manner. The algorithms are whirlygigs, of no interest to biologists and engineers.

Ewert et al. also address as algorithm (C) what would be a (1, 200) EA, were its “mutation” always mutation. The rate of application of the lazy mutation operator is fixed at 1 / 20. It is important to know that the near-optimal mutation rate for phrases of length L is 1 / L. With 100 characters in the alphabet, the effective mutation rate is almost 1 / 20, and the algorithm is implicitly tuned to handle phrases of length 20. For a binary alphabet, the effective mutation rate is just 1 / 40, and the algorithm is implicitly tuned to handle phrases of length 40. This should give you a strong sense of why the mutation rate should be explicit in analysis — as it always is in the evolutionary computation literature that the “maverick geniuses” do not bother to survey.

Regarding the number of mutants

I may have seemed to criticize the Dobzhansky program for generating many non-mutant offspring. That was not my intent. I think it’s interesting that the program performs so well, behaving as it does.

With the near-optimal mutation rate of 1 / L, the probability of generating a copy of the parent, (1 − 1 / L)L, converges rapidly on e-1 ≈ 0.368. Even for L as low as 25, an average of 72 in 200 offspring are exact copies of the parent.

It would not have been appropriate for Yarus to tune the mutation rate to the length of the Dobzhansky quote. That’s the sort of thing we do in computational problem solving. It’s not how nature works. I don’t make much of the fact that Yarus had an expected 109, rather than 73, non-mutant offspring per generation.

Edit: Yarus possibly changed the parameter settings of the program from those I’ve seen. I really don’t care if he did. I’m trying to share some fundamentals of how (not) to analyze evolutionary algorithms.