This post is one in a series about GANce.
As it stood, the three main features that would comprise the upcoming collaboration with Won Pound (slated for release mid-April) were:
- Projection Files (using a StyleGAN2 network to project each of the individual frames in a source video, resulting in a series of latent vectors that can be manipulated and fed back into the network to create synthetic videos)
- Audio Blending (using alpha compositing to combine a frequency domain representation of an audio signal with a series of projected vectors)
- Network Switching (feeding the same latent vector into multiple networks produced in the same training run, resulting in visually similar results)
These features are detailed in the previous post, and their combined effect can be seen in this demo:
Knowing we had enough runway to add another large feature to the project, and feeling particularly inspired following a visit to Clifford Ross’ exhibit at the Portland Museum of Art, I began exploring the relationship between the projection source video and the output images synthesized by the network.
Motivation
Part of the Ross exhibit at the PMA was a short film called Harmonium Mountain. The one screened at the PMA isn’t available online, but its sequel is, and it contains my favorite part.
Seeing the black and white infrared region of the image fly across the mountain vista forces one to think about the world outside of human perception. This relationship is posed to the viewer all at once, with the black and white infrared being overlaid only on a region of interest rather than the entire frame. You’re able to see both conflicting views at once.
This type of dialog was missing from GANce outputs, and is very hard to come by in synthesized art as well.
One of the goals of the AI art of the day (I’m thinking specifically about midjourney and CLIP) is to abstract away the process and technique (skill) and focus solely on intent. This enables artists to iterate quickly, tweaking inputs and filtering the outputs until the resulting synthesis matches their vision. Both artist and viewer are happily unaware of the source material that formed the training dataset.
This looming increase in artistic productivity follows a trend first described by John Maeda in his book Design by Numbers:
I recall the renowned Japanese art director Katsumi Asaba relating the story from his youth about the time he could draw 10 parallel lines within one millimeter using a bamboo pen that refined over many months, not to mention the fine motor skills he had acquired over many years. This form of disciplined approach toward understanding one’s medium is what we traditionally associated with skill, and at one time valued as one of the many important steps to becoming a skilled designer. But today, such superhuman skill is immediately trivialized by an amateur, who using a software tool, can effortlessly draw 50 perfect lines within the same millimeter. Due to the advent of the computer, mechanical skills have taken secondary importance to the skills required to use complex software tools. (p. 20)
It is naïve to associate GAN images with classic ‘remix’ techniques. Unlike musical sampling or collage, where the source material is visible but reinterpreted, GAN outputs are new images. It’s functionally impossible to take a GAN output and work backwards to surmise the discrete set of source images.
Having photographed the 50,000 portraits of Won Pound ourselves, and then shot and edited the input videos that would become projection files, we have a unique opportunity to clearly articulate the relationship between the source material and the synthesized outputs. Like the light sources in Ross’ film, Pound’s real face is present both in the source video and in the synthesized outputs.
As a proof of concept, I wrote a script that drew face-containing regions of the input video over the synthesized output:
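The core of that proof of concept, copying a region of the source frame over the synthesized frame, can be sketched in a few lines. This is only an illustration under simplifying assumptions: images are nested lists of grayscale values, and `Box` and `overlay_region` are hypothetical names that do not come from the GANce repo.

```python
from typing import List, NamedTuple

Image = List[List[int]]  # hypothetical: rows of grayscale pixel values


class Box(NamedTuple):
    """Hypothetical axis-aligned region; right and bottom are exclusive."""

    left: int
    top: int
    right: int
    bottom: int


def overlay_region(source: Image, synthesized: Image, box: Box) -> Image:
    """Return a copy of `synthesized` with the pixels inside `box` taken from `source`."""
    output = [row[:] for row in synthesized]  # copy so the input is left untouched
    for y in range(box.top, box.bottom):
        output[y][box.left:box.right] = source[y][box.left:box.right]
    return output


source_frame = [[1] * 4 for _ in range(4)]
synthesized_frame = [[0] * 4 for _ in range(4)]
result = overlay_region(
    source_frame, synthesized_frame, Box(left=1, top=1, right=3, bottom=3)
)
```

Here the middle 2x2 block of `result` comes from the source frame, while everything outside the box is still the synthesized frame.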
Refining Overlays
Deciding when to draw the overlays became the next question. After a lot of input tweaking and filtering, I landed on the three main parameters needed to make this decision. These are expressed as a group of CLI arguments in `music_into_networks.py projection-file-blend`.
Music Mapped Overlay
Close readers of the repo will notice that the notion of having a music-mapped overlay discriminator exists in the API but not in the CLI. I wanted to try to find conditions in the music that would prompt an overlay programmatically, but ended up abandoning this concept. There were very few good overlay conditions in the first place, and filtering them down even more erased them completely from some songs.
Image Hashing
I started by comparing image hashes of the source and synthesized images. Luckily, there was already a package on PyPI to do this. This is a comparison of each of the hashing techniques. The values represent the difference between the hashes of the source and synthesis:
Looping this a few times reveals the trends in the data. The hashing technique that created the most visually interesting results was a perceptual hash. Additionally, in the production version of this implementation, only the hashes of the bounding boxes are compared, not the hashes of the whole images.
I was really pleased with how well the phash technique dealt with model switching; I had been concerned that switches would cause large fluctuations in this value. This somewhat numerically validates the assertion in the previous post stating that adjacent networks from the same training job always produce visually similar outputs for the same input.
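To make the idea of a hash difference concrete, here is a minimal sketch. It is not the perceptual hash used in the project: it swaps in a much simpler average hash, skips the downscaling step that real image hashes perform, and treats an image as a nested list of grayscale values. The names `average_hash` and `hash_difference` are my own for illustration.

```python
from typing import List

Image = List[List[int]]  # hypothetical: rows of grayscale values, 0-255


def average_hash(image: Image) -> List[bool]:
    """One bit per pixel: True where the pixel is brighter than the image mean."""
    pixels = [value for row in image for value in row]
    mean = sum(pixels) / len(pixels)
    return [value > mean for value in pixels]


def hash_difference(a: List[bool], b: List[bool]) -> int:
    """Hamming distance: the number of bit positions where the hashes disagree."""
    if len(a) != len(b):
        raise ValueError("hashes must be the same length")
    return sum(bit_a != bit_b for bit_a, bit_b in zip(a, b))


source_region = [[10, 200], [10, 200]]
similar_region = [[20, 220], [5, 180]]  # same layout, slightly different values
different_region = [[200, 10], [200, 10]]  # bright and dark sides swapped

close = hash_difference(average_hash(source_region), average_hash(similar_region))
far = hash_difference(average_hash(source_region), average_hash(different_region))
```

A small difference means the two regions share the same coarse structure, so comparing the difference against a threshold gives a per-frame yes/no signal.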
Bounding Box Distance
Originally, a bounding box covering all face keypoints was used to define the overlay region. After some experimentation, tracking the eye region only was more compelling. Pound is looking at you through his synthetic avatar. The bounding box distance is calculated as follows:
```python
distance_boxes: Optional[DistanceBoxes] = bounding_box_distance(
    a_boxes=foreground_bounding_boxes, b_boxes=background_bounding_boxes
)

box_flag = distance_boxes is not None and (distance_boxes.distance < min_bbox_distance)
```
Where `min_bbox_distance` is a threshold passed in via the CLI, and `bounding_box_distance` is defined as:
```python
import itertools
from typing import List, Optional

# Assumed: `distance` is scipy.spatial.distance.
from scipy.spatial import distance

# BoundingBox, DistanceBoxes and bounding_box_center are defined elsewhere in the repo.


def bounding_box_distance(
    a_boxes: List[BoundingBox], b_boxes: List[BoundingBox]
) -> Optional[DistanceBoxes]:
    """
    Calculate the minimum distance between two sets of bounding boxes.
    :param a_boxes: Left side.
    :param b_boxes: Right side.
    :return: The pair of boxes with the smallest distance between their centers
    (in pixels), or None if either list is empty.
    """

    return min(
        [
            DistanceBoxes(
                distance=float(
                    distance.euclidean(bounding_box_center(a_box), bounding_box_center(b_box))
                ),
                a_box=a_box,
                b_box=b_box,
            )
            for a_box, b_box in itertools.product(a_boxes, b_boxes)
        ],
        key=lambda distance_box: distance_box.distance,
        default=None,
    )
```
We used a constant threshold of 50 pixels for the entire project.
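For a worked example with actual numbers, here is a self-contained sketch of the same check. The `BoundingBox` layout (top-left corner plus width and height), `box_center` and `min_center_distance` are hypothetical stand-ins rather than the repo's definitions, and `math.dist` replaces the Euclidean distance call.

```python
import itertools
import math
from typing import List, NamedTuple, Optional, Tuple


class BoundingBox(NamedTuple):
    """Hypothetical stand-in for the repo's type: a rectangle in pixel coordinates."""

    x: int
    y: int
    width: int
    height: int


def box_center(box: BoundingBox) -> Tuple[float, float]:
    return (box.x + box.width / 2, box.y + box.height / 2)


def min_center_distance(
    a_boxes: List[BoundingBox], b_boxes: List[BoundingBox]
) -> Optional[float]:
    """Smallest center-to-center distance over every pairing, or None if a list is empty."""
    return min(
        (
            math.dist(box_center(a), box_center(b))
            for a, b in itertools.product(a_boxes, b_boxes)
        ),
        default=None,
    )


MIN_BBOX_DISTANCE = 50.0  # the project-wide threshold quoted above

source_eyes = [BoundingBox(x=100, y=100, width=40, height=20)]  # center (120, 110)
synthesized_eyes = [BoundingBox(x=118, y=124, width=40, height=20)]  # center (138, 134)

closest = min_center_distance(source_eyes, synthesized_eyes)
box_flag = closest is not None and closest < MIN_BBOX_DISTANCE

# With no face detected on one side there is no distance, so no overlay.
missing = min_center_distance(source_eyes, [])
missing_flag = missing is not None and missing < MIN_BBOX_DISTANCE
```

The centers are offset by 18 and 24 pixels, which is a distance of 30 pixels, so this frame passes the 50 pixel threshold.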
Track Length
The final consideration was ‘track’ length, or the frame count for a given overlay period. Really short occurrences with short gaps in between were distracting. The parameter `--track-length` controls the minimum number of consecutive frames that must pass the overlay criteria before an overlay is actually drawn.
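One way to express this kind of filter, assuming the per-frame decisions are already available as a list of booleans, is to group consecutive equal flags and clear any passing run that is too short. `filter_short_tracks` is a hypothetical name for this sketch and is not the function used in the repo.

```python
import itertools
from typing import List


def filter_short_tracks(flags: List[bool], track_length: int) -> List[bool]:
    """Clear every run of consecutive True flags that is shorter than `track_length`."""
    output: List[bool] = []
    for value, run in itertools.groupby(flags):
        run_length = len(list(run))
        keep = value and run_length >= track_length
        output.extend([keep] * run_length)
    return output


per_frame = [True, True, False, True, True, True, True, True, False, True]
filtered = filter_short_tracks(per_frame, track_length=5)
```

In this example only the five-frame run in the middle survives; the two-frame run at the start and the single frame at the end are dropped.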
This of course means that you have to have the results of those two computations for every frame in the video in memory at once to be able to filter in this way. This meant a major overhaul of the synthesis process into an actual pipeline, where functions that consume and return `Iterator`s are chained together. This brings my live audio ambitions closer to reality, as the input can now be infinitely long.
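As a rough illustration of the pattern (not the actual GANce stages), each step below takes an iterator and returns an iterator, so nothing is computed until the consumer asks for the next item. The stage names and the integer "frames" are hypothetical placeholders.

```python
import itertools
from typing import Iterator, Tuple


def frames() -> Iterator[int]:
    """Hypothetical endless source; yields frame indices in place of real images."""
    return itertools.count()


def score(frame_indices: Iterator[int]) -> Iterator[Tuple[int, bool]]:
    """Hypothetical per-frame stage: pair each frame with a pass/fail flag."""
    for index in frame_indices:
        yield index, index % 3 != 0


def keep_passing(scored: Iterator[Tuple[int, bool]]) -> Iterator[int]:
    """Forward only the frames whose flag is set."""
    for index, passed in scored:
        if passed:
            yield index


# Chaining the stages builds the pipeline without consuming any input yet.
pipeline = keep_passing(score(frames()))

# Pull just five results out of the endless stream.
first_five = list(itertools.islice(pipeline, 5))
```

Because each stage only pulls what it needs from the one before it, the source can be endless and the consumer decides how much work gets done.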
Those interested in understanding this new design pattern should start with `write_source_to_disk_forward` to understand how the frames move around. If I ever need it again for another project, `iterator_on_disk.py` will probably become a standalone module.
Here’s an example output with a track length of 5 frames, and the final demo to conclude this post: