Steve Dietz: Tell me about the process of creating Consensual Fantasy Engine. Who did what? How did it come together?
Paul Vanouse: Basically, I'd met Peter [Weyhrauch] in the first few months at Carnegie Mellon [University], while trying to hunt down source code for an "Elizaesque" AI [artificial intelligence] program. The work that he was doing in interactive drama on the Oz project [at the university] was really interesting, since it was more about drama and narrative than about faster computer algorithms. We talked about perhaps collaborating in the future, although I couldn't imagine on what at the time. So when I saw the O.J. Simpson chase, I immediately called Peter with an idea for a real-life interactive fiction. Peter, being a total kitsch-master, loved the idea.
Consensual Fantasy Engine took about a year from conception. I had a very simple idea about how to create a branching-video story that would somehow respond to the audience. Peter brought in the idea of having a sea of clips that could be used in multiple ways by the automated computer critic--basically, the subject of his Ph.D. thesis. Peter and I then decided on ways of annotating appropriated movie clips and a basic structure that the piece should follow (playfully based on Vladimir Propp's structuralist framework). I was originally planning on using an audience polling method similar to Pittsburgh's planetarium (multiple-choice buttons on the armrests of the seats), but upon reflecting on the O.J. Simpson chase, I realized that mass-audience applause-metering was more appropriate.
The work broke down as follows: I wrote all of the video-clip playing, sequencing, digital speech, and applause-metering things in HyperCard, while Peter wrote most of the heavy-duty artificial intelligence stuff in Lisp. (We pretty much split the task of choosing variables and fleshing out plot structures.) We then linked the two programs using Apple Events on the Mac, so that we could take full advantage of the rich AV resources of multimedia, while also having the efficient data processing of Lisp. I suppose that the most brutal end of the task was digitizing hundreds of clips from Hollywood films. Just finding them all involved watching more than 120 generally cornball movies over that period and picking out the right sections to appropriate and re-code.
SD: In Consensual Fantasy Engine, the film clips are stitched together on the fly. This is determined in part by audience responses, but I believe you also refer to a kind of algorithmic "film critic" that makes aesthetic decisions about how the clips should be edited. How does this work?
PV: OK, basically what happens is that each question the audience responds to has some type of psychographic meaning that can be inferred from it. For instance, someone who answers the question, "What is the purpose of the U.S. legal system?" with "To serve the wealthy" probably wants to see court corruption [and] expects things to be good for O.J. (since he's wealthy)--probably wants as many references to money as possible. Each new goal, then, has its own critic function that can look at variables attached to long sequences of clips and score each sequence based on how well it fits that criterion. Other critics include plot experts, mood critics, etc. The plot expert is based on the Proppian sense of drama, that there should be conflict that leads to confrontation that leads to struggle that leads to a resolution, and that there are other causalities that any good story should have. Before each four-minute segment of the show, the program suggests to the critics thousands of 12-movie-clip sequences in a somewhat random manner. Then each of these sequences is graded by each critic, and their grades are summed into a sort of GPA. The highest-scoring sequence is then played to the audience. Of course, what is interesting about doing this computationally is that this entire selection process only takes about three seconds on a new PowerPC.
So, although we selected each of the 300 movie clips the piece can use and annotated each clip with up to 110 variables, then built functions that "make sense" of all the variables, each show is still something of a surprise for us, because all of the different operations impact each other in ways that are difficult to predict. For instance, what if the program finds that the sequence that will best satisfy the audience at a given moment shows the hero shooting a cop? That means that from that point on, the hero is going to need to avoid the police at all costs. But what if other answers to later questions give us the sense that [the audience] really thinks that O.J. is good and should go free? Then the system will usually try to find a scapegoat who tricked O.J. into the crime, a la "Othello," or have O.J.'s victory be one in which he is never understood but never caught, i.e., escaping to Tibet. Thus, the system often has to deal with conflicting goals, and it is surprising how it manages to resolve them.
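[To make the selection process concrete, here is a minimal Lisp sketch of the generate-and-test loop Vanouse describes: random candidate sequences are proposed, every critic grades each one, the grades are summed, and the top scorer is played. It is an illustration only, not the actual code; every name in it is a hypothetical stand-in.]

;; Hypothetical stand-ins for the real plot, mood, and theme experts,
;; each returning a numeric grade for a candidate sequence.
(defun plot-critic (sequence goals) (declare (ignore sequence goals)) (random 10))
(defun mood-critic (sequence goals) (declare (ignore sequence goals)) (random 10))
(defun theme-critic (sequence goals) (declare (ignore sequence goals)) (random 10))

(defparameter *critics* (list #'plot-critic #'mood-critic #'theme-critic))

(defun make-random-sequence (clip-pool n)
  "Pick N clips at random (with replacement) from CLIP-POOL."
  (loop repeat n collect (elt clip-pool (random (length clip-pool)))))

(defun score-sequence (sequence audience-goals)
  "Sum every critic's grade for SEQUENCE into its overall 'GPA'."
  (reduce #'+ *critics*
          :key (lambda (critic) (funcall critic sequence audience-goals))))

(defun choose-next-segment (clip-pool audience-goals &key (candidates 2000))
  "Propose thousands of random 12-clip sequences and keep the best scorer."
  (let (best best-score)
    (dotimes (i candidates best)
      (let* ((seq (make-random-sequence clip-pool 12))
             (score (score-sequence seq audience-goals)))
        (when (or (null best-score) (> score best-score))
          (setf best seq
                best-score score))))))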
SD: To clarify, I assume that the 12-clip sequences are preproduced. You selected a specific sequence from the films and marked it up with up to 110 variables, which the various critics then score, yes?
PV: No, at least not if I understand you correctly. Each movie clip has this set of variables--the 12-clip sequences have never been put together by us in advance. Rather, every time the engine is run, it puts together the sequences in response to the audience's answers. The critics then grade each clip and its relationship to other clips in the sequence, or in past sequences, to determine the grade of a particular sequence.
SD: Who or what specifically are the critics? What areas do they cover?
PV: Whew, there are a bunch, and many would need some explanation, but basically there are different categories of specialization--plot, mood, theme, etc. Plot experts may include fight experts, confrontation experts, chase experts (both car and foot), trial experts, and behind-the-scenes experts (to deal with things that are happening parallel to the hero's adventure). Mood experts are simpler: Basically, an expert (or two) looks at clips for the mood of the hero and the general viewing mood--seriousness vs. slapstick, etc. Theme experts check to see whether other audience interests such as arts or sports are brought forth, or whether racial, conspiracy, or authoritarian themes are coming through.
SD: We do a lot of markup of information in the museum world, and I'm interested in your process with the film clips. Did you use any existing vocabulary other than Propp's? How did you control the vocabulary input? Are the categories hierarchical in any way? Do you have a list of the 110 categories/variables?
PV: Basically, we altered Propp's theory to fit a television format--which required extending his notion of "struggle" and making other slight modifications. The alterations were done somewhat in keeping with the structural analysis of television done by media theorist John Fiske in his book Television Culture.
The idea was not to control vocabulary very much, but rather to work back and forth with adding clips, testing the engine to see where we needed more information, then adding variables. This method goes along with our idea to build not some "general" knowledge engine--as classical AI seeks--but to build in only selected knowledge of constructing heroic narratives. Furthermore, we knew that we couldn't build in a top-down manner--we needed to see many of the clips to determine possible meanings, and we needed to view full scenarios to see where our data abstractions broke down.
The full list of variables would be a bit of a chore to describe, but to give you an idea, here are some that would fall in the category of action variables, specifically "chase":
(running NIL)
(driving NIL)
(chase NIL)
(in-contact NIL)
(close 5)
(lose-contact NIL)
(acquire-contact NIL)
(not-in-contact NIL)
(foot-to-car NIL)
(initiatory NIL)
You can see that some variables are binary and others are scaled in degrees. There are, of course, many other variables that would alter the meaning of these, such as inside/outside, whether with police or villain, escape of both short- and long-term varieties, etc. So even this categorization doesn't come close to describing the complexity of a chase without the myriad variables that could relate to other aspects of the clip. These particular ones are mostly valuable for making sure that a chase would show someone getting closer before actually catching someone.
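[Again purely as illustration, not the piece's actual code: the chase variables above might attach to a clip as a simple Lisp association list, and a toy chase critic could then reward sequences in which the pursuer closes the distance before contact is made. All names here are hypothetical.]

;; A hypothetical clip annotated with a few of the chase variables.
(defparameter *example-clip*
  '((chase . t) (driving . t) (close . 5) (in-contact . nil)))

(defun clip-value (clip variable)
  "Look up one variable's value in a clip's annotation list."
  (cdr (assoc variable clip)))

(defun chase-critic (sequence goals)
  "Score higher when CLOSE rises across the sequence before IN-CONTACT occurs."
  (declare (ignore goals))
  (let ((score 0) (closest 0))
    (dolist (clip sequence score)
      (let ((close (or (clip-value clip 'close) 0)))
        (cond ((clip-value clip 'in-contact)
               ;; Contact should only pay off after the pursuer has closed in.
               (incf score (if (>= closest 5) 10 -10)))
              ((> close closest)
               (setf closest close)
               (incf score 2)))))))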
SD: Did you use any weighting factors in averaging the critics' responses?
PV: Plot critics had a bit more sway than some, but it wasn't as simple as just saying that they had X amount more weight. All of the critics returned scores that could be really high or low, so it works much like the applause meter the audience is using: if all the critics but one are pretty lukewarm on something, they might clap really mildly. But if one critic is just amazed by a sequence, he might hoot and holler and stamp his feet in support, which, when added to the others, gives that critic most of the vote.
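[A hypothetical bit of arithmetic makes the point: because critics return raw, unbounded scores that are simply summed, one ecstatic critic can swamp several lukewarm ones.]

;; Three lukewarm critics and one ecstatic one: the enthusiast carries the vote.
(reduce #'+ '(2 3 1 40))   ; => 46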
SD: How do you and Peter personally measure the success of your algorithms?
PV: It's really based on viewing. If it makes sense, we're happy, but we like it when a clip is used in a really unconventional way that we hadn't expected. We also like algorithms that are balanced enough that, even with the same audience responses, [they] return relatively different sequences. This is also a factor of how many clips can fit a certain type of story, of course, not just the algorithms themselves. One really interesting thing that we saw it doing was using flashback clips a lot in the conclusion, which turned out to be a perfect way to wrap up the story, but it wasn't intentional when we were coding the clips.
SD: Would you describe your new Terminal Time project briefly? Specifically, how the experience with Consensual Fantasy Engine influenced some of the decisions you are making with Terminal Time.
PV: In 1996, a reporter in Copenhagen asked me, "So what is your next project? What do you do after the O.J. Simpson spectacle?" He expected me to say the Waco story or something, but in straight-faced jest I said, "Oh, the story of man, you know, the history of the world from Neanderthal life to the third millennium." It was totally a joke at the time, but afterward it seemed the obvious next step.
One of the things that is unavoidable with the Consensual Fantasy Engine is an experience of fracture between clips. Since we used appropriated material, in one clip the hero would be Nick Nolte, in the next Steve McQueen, which gave the plot this very odd effect. I felt that it was important in the Consensual Fantasy Engine to do this, because it called attention to the constructed nature of the news media. Anyway, it made me look toward other styles that would also dictate a new way of thinking about the interactive program's structure--in Terminal Time's case, documentary. Documentary film is really about this visual fracture. It is what gives it its feeling of "authenticity." We don't care if the underlying images switch from a pan of an old photograph to a reenactment to a field of wheat, so long as the narration is consistent. Well, this gave us the ability to still work experimentally with image gathering, and also with narration. We could use a digital voice to narrate, which made for much more sophisticated possibilities for the system to make decisions. For instance, it could go through a sentence and alter the adjectives to reflect certain perspectives. Basically, we felt that we could create a system that had much more control over AV sequencing and would also more closely mirror an audience's expectations.
This is just one formal/technical concern, of course. Having the Consensual Fantasy Engine as a base experiment gave us a great jumping-off point for a lot of things. We realized that the questions were probably the most exciting moment for the audiences, that groups of people looking for a good time will answer much differently than, say, an individual taking a poll, and that the questions needed to be better thought through to get as much variance from the system as possible. We're still trying to work a few more interactive question-and-answer moments into the show. We also realized that we needed to give audiences a bit more information before the questions begin, so we've tried to make an intro to get them into the mood, and to make it less necessary to have a presenter around to warm them up. Lastly, we've realized that we should try to schedule the event with time to go through the entire process twice so that audiences can understand and enjoy the fact that different answers dramatically affect the outcome.
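[Returning to the adjective-swapping idea above: a purely speculative Lisp sketch, not Terminal Time's actual code, of how a narration template might be filled in to reflect an inferred audience perspective before being handed to the digital voice. The topics, perspectives, and word choices are all hypothetical.]

;; Hypothetical adjective table keyed by topic and perspective.
(defparameter *adjectives*
  '((expansion (triumphalist . "glorious") (skeptical . "brutal"))))

(defun narrate (template topic perspective)
  "Fill TEMPLATE with the adjective matching TOPIC and PERSPECTIVE."
  (let ((adj (cdr (assoc perspective (cdr (assoc topic *adjectives*))))))
    (format nil template adj)))

;; (narrate "The ~a expansion of the empire began." 'expansion 'skeptical)
;; => "The brutal expansion of the empire began."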