We were asked by IYOIYO studio to help build Infinite Bad Guy, a Google and YouTube collaboration with Billie Eilish. Infinite Bad Guy is an interactive, synchronized celebration of fan remakes, covers and remixes of Billie Eilish’s hit track “Bad Guy”. Check it out, it’s pretty cool. It even won a Webby!

IYOIYO wrote an exhaustive behind-the-scenes look at the techniques used to execute the project, which you can read here.

We wanted to provide a bit more context on the video side of things and highlight some of the workflows, successes and pitfalls we had.

To provide some preliminary context, at Special Circumstances we build computational cinematography tools — machine learning software inspired by the creative language and terminology used by filmmakers and visual artists. You can read more about computational cinematography, and learn about creative metadata, from our co-founder Rahul’s article “AI For Filmmaking.”

Our main task was to assist IYOIYO in understanding the diversity of content and help build a viable taxonomy to visually and semantically classify as many of the Billie Eilish “Bad Guy” covers as possible.

We wanted to give creators as many opportunities as possible to shine, and to help users explore the content in fun and exciting ways.

Initial examination of the content showed dramatic differences in visual style, editing technique, shooting location, and production effort: from single-take “bedroom webcam covers” by talented vocalists, to fully edited and choreographed shot-for-shot remakes, and everything in between.

We needed to find a way to understand the content and see if there were reliable quantitative techniques that would help organize the archive of content Billie Eilish fans had uploaded to YouTube.

We initially identified some obvious contenders for grouping by just exploring YouTube manually:

  • Single-take videos — locked-off shots usually featuring a ‘singing head’, i.e. typically vocal covers.
  • Instrument videos — a subset of single-take videos, focusing on guitar, piano, or other instrumental performances.
  • Dance videos.
  • Edited music videos — we noticed that bands covering Bad Guy in their own genre would often cut together new music videos with their own distinct visual style.
  • Remakes — shot-for-shot remakes inspired by the original Bad Guy music video.
  • Lyric videos — typically having a stylized background image and featuring animated text.

Needless to say we had a certain song stuck in our head by now.

Our first task was to throw some random samples of user content into Synopsis, our computational cinematography software, and see if our intuition panned out.

We analyzed a sample of 200 random Bad Guy user videos from a bucket of content we had manually identified and tagged, and loaded them into an early build of Synopsis Inspector, our interactive media and metadata inspector, to visually explore the content.

Here’s a behind-the-scenes screen recording we shared with IYOIYO of our initial exploration and interpretation — pardon the voiceover:

Visually exploring a random sampling of roughly 200 or so Bad Guy videos and trying to intuit the results. Video © their respective owners, courtesy YouTube.

Amusingly, we were able to cluster two “rubber duck” videos — so clearly something was working:

“duck, duck … piano?” Video © their respective owners, courtesy YouTube.

One important detail is that Synopsis at the time did “shot” analysis, not necessarily full “edit” analysis. The manner in which we summarize a single shot doesn’t necessarily translate to summarizing an entire edit.

The more edits in a video, and the more visually diverse those edits are, the more information we have to capture and represent in a useful embedding. This also complicates tagging, since locations, lighting, focus, framing, etc. all change the more edits there are.

That said, it seemed promising that we could extract some useful signal via video analysis.

One of the initial areas we wanted to explore was how well we could identify shot-for-shot remakes. These videos were the result of a lot of work by fans, tended to have very high view counts, and made for a fun viewing experience as well. We needed to ensure we didn’t miss any so we could feature them in the final experience.

One of the important signatures for identifying remakes was the shot pattern. Did edits happen at roughly the same time, in the same cadence? Did the visuals on screen roughly match or evoke the original? Did we have a sequence of close-ups, interiors with a blurry background, followed by a long shot with an orange background, and then a suburban daytime exterior group shot? Did the durations of these edits match? This is what we mean by ‘shot pattern’.
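To make the idea of a shot pattern concrete, here is a small data sketch. The timings, tags, and names below are invented purely for illustration; the real taxonomy and labels were richer:

```python
from dataclasses import dataclass

@dataclass
class Shot:
    start: float     # seconds from the top of the video
    duration: float  # seconds until the next cut
    framing: str     # e.g. "close_up", "long_shot" -- illustrative tags only
    setting: str     # e.g. "interior", "exterior"

# A hand-sketched pattern; all values are hypothetical.
original_pattern = [
    Shot(0.0, 3.2, "close_up", "interior"),
    Shot(3.2, 4.1, "long_shot", "interior"),
    Shot(7.3, 2.8, "group_shot", "exterior"),
]

def cut_times(pattern):
    """The edit cadence: the timestamp at which each shot begins."""
    return [shot.start for shot in pattern]

print(cut_times(original_pattern))  # [0.0, 3.2, 7.3]
```

A candidate remake whose cut times and shot tags land close to this sequence is a strong match; one whose cadence diverges is probably a different kind of video.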

Naively, one might expect remake videos to ‘just line up’, but each video might have a different-length intro, some additional pre-roll or a title sequence that threw alignment off, or a dramatically different frame rate. Some videos also featured remixed songs with different tempos. Clearly things wouldn’t just magically line up.

As we were exploring this problem, our colleagues tackling the audio alignment side of things ran into similar issues on the music side. Tracks would have a similar overall structure, but locally, the tempo might be different, or there could be a longer intro or outro, or more extremely, some improvisation thrown in for fun.

One of the numerical methods we all landed on was Dynamic Time Warping (DTW) — a time-series analysis method that measures similarity between two temporal sequences. We needed a way to add a DTW similarity metric to Synopsis Inspector, as well as a viable input signal to feed the algorithm.
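As a refresher, DTW finds the lowest-cost monotone alignment between two series, allowing samples to repeat so that sequences with different local tempos can still match. A minimal textbook sketch of the algorithm (illustrative only — in practice you would use an optimized library implementation rather than this O(n·m) loop):

```python
import numpy as np

def dtw_distance(a, b):
    """Cumulative DTW cost between two 1-D series a and b."""
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # each step may repeat a sample in either series, or advance both
            cost[i, j] = d + min(cost[i - 1, j],      # repeat b[j-1]
                                 cost[i, j - 1],      # repeat a[i-1]
                                 cost[i - 1, j - 1])  # advance both
    return cost[n, m]

# Two series with the same overall shape but different local tempo:
original = [0, 1, 2, 3, 2, 1, 0]
stretched = [0, 0, 1, 1, 2, 3, 3, 2, 1, 0]
print(dtw_distance(original, stretched))  # 0.0: shapes align despite lengths
print(dtw_distance(original, [5] * 7))    # 26.0: no good alignment exists
```

A low cumulative cost means the two series follow the same trajectory even if one is locally slower or faster — exactly the situation with covers that share the original’s structure but drift in tempo.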

Unfortunately, at the time most DTW implementations we could integrate into our desktop macOS framework were designed for music information retrieval tasks and worked with mono audio signals. This meant the dimensionality of each sample was 1:

Example time series for a single channel value

For our video tasks, each time-series sample had a dimensionality in the hundreds — the per-frame class probabilities — and, even worse, in the thousands if we leaned on our network’s backbone embedding.

Example time series for a 3 channel value. Our real data was hundreds of channels per sample.

Wanting to avoid a costly, slow, and error-prone reimplementation, we decided to quickly try some dimensionality reduction techniques to see if we could use an existing implementation and get a sense of whether DTW would be viable for finding Bad Guy shot-for-shot remakes.

Example dimensionality reduction of our sample 3-channel data above. Being able to go from multi-channel to single-channel allowed us to use off-the-shelf DTW frameworks meant for audio. It’s lossy, but it works.

We landed on using an “inter-frame similarity score”. This meant throwing out semantic information (for example, the “shot framing” predictions) and only capturing how similar each frame was to the next via a distance metric, which produced the single-dimensional signal we could feed to DTW.

Intuitively, at dramatic edits similarity should be low, since the picture drastically changes from one cut to the next. Within a slow-moving single shot the similarity scores should be high, since the image stays roughly the same. Shots with dynamism should land somewhere in the middle.
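That intuition can be sketched as a cosine similarity between consecutive per-frame embedding vectors. The function name and toy embeddings below are hypothetical illustrations, not the actual Synopsis metric:

```python
import numpy as np

def interframe_similarity(frames):
    """Collapse a (num_frames, channels) embedding series into a 1-D signal.

    Each value is the cosine similarity between consecutive frame
    embeddings: near 1.0 within a static shot, dropping sharply on a cut.
    """
    frames = np.asarray(frames, dtype=float)
    a, b = frames[:-1], frames[1:]
    dots = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dots / np.maximum(norms, 1e-12)  # guard against zero vectors

# Toy 3-channel embeddings: three near-identical frames, then a hard cut.
shot = np.array([
    [1.00, 0.00, 0.2],
    [0.90, 0.10, 0.2],   # same shot, slight motion
    [0.95, 0.05, 0.2],
    [0.00, 1.00, 0.8],   # cut: a very different frame
])
print(interframe_similarity(shot))  # high, high, then a sharp drop at the cut
```

The resulting 1-D signal is exactly the kind of mono “track” the audio-oriented DTW frameworks expect.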

Here is the result of our initial DTW explorations on user video:

Exploring similarity via Dynamic Time Warping in Synopsis Inspector. Video © their respective owners, courtesy YouTube.

Clearly DTW was providing some useful signal even when throwing out spatial and semantic frame information.

Now that we had some familiarity with the data and a sense of what we could reliably discern, we needed to take cues from the UI and UX teams to understand how they imagined fun and interesting ways to explore the data and find correlations.

With the design teams’ input, we worked with IYOIYO and Google’s creative teams to explore various ontologies and figure out what we could auto-tag with existing models within Google and via Synopsis, and which classes required new tagging infrastructure. IYOIYO built a very nice Python notebook labeling UI, which you can see here:

Screenshot of video labeling interface with videos blurred out. Image courtesy IYOIYO Studios.

As we gained confidence that we could pull some useful signals via visual analysis, we had to pivot to tackling full scale deployment engineering problems.

The visual analysis code needed to run internally at Google, and work on anonymized input data we didn’t have access to for privacy reasons. Our rough-and-ready experiments and macOS desktop analysis would not suffice.

We moved analysis first to Google Colab to ensure numerical consistency on a different runtime, and then passed code to IYOIYO for integration with their tooling. Finally, Google would do their magic behind the curtain and run both video and audio analysis on whatever insane cluster they had at their disposal.

One pain point we ran into is that as we were prototyping, we migrated from TensorFlow training to PyTorch training. This seemed innocuous at first glance, and we were getting good results using a PyTorch-to-TensorFlow converted model in Google Colab.

Unbeknownst to us, at the time the available model conversion tooling wasn’t handling NCHW / NHWC tensor order for some runtimes. This is a technical detail one can usually ignore when sticking with a single environment, but it’s an implementation detail that matters when moving models between frameworks, runtimes and hardware. You can read more about NCHW vs NHWC here.
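The difference between the two layouts is easy to demonstrate with NumPy; the transpose below is the standard fix when moving activations between them:

```python
import numpy as np

# A batch of 2 RGB frames, 4x4 pixels, in PyTorch's default NCHW layout:
# (batch, channels, height, width)
nchw = np.arange(2 * 3 * 4 * 4, dtype=np.float32).reshape(2, 3, 4, 4)

# TensorFlow conventionally expects NHWC: (batch, height, width, channels).
nhwc = nchw.transpose(0, 2, 3, 1)

print(nchw.shape)  # (2, 3, 4, 4)
print(nhwc.shape)  # (2, 4, 4, 3)

# Feeding an NCHW tensor where NHWC is expected (or vice versa) without
# this transpose silently scrambles the channel data. For some input sizes
# the shapes even coincide, so nothing crashes -- the outputs are just wrong.
```

The same element is addressed with permuted indices in each layout, which is why a conversion tool that drops the transpose produces a model that loads but misbehaves.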

Our tests on Google Colab didn’t hit this issue on the GPU runtime; however, the internal cluster had slightly different hardware and runtime, and our latest-and-greatest PyTorch model wouldn’t load. Frustratingly, we were unable to find a fast enough fix, so we ended up running an older TensorFlow model. They call it the bleeding edge for a reason, don’t they?

We are super proud to have played a role in making Infinite Bad Guy. The creative requirements reinforced the value of creative metadata.

There’s a reason that film professionals use the terminology they do — each term describes an important concept, from composition and framing to lighting, symmetry and more. These concepts are incredibly valuable and are part of the culture of creating and consuming visual media. Not only that, they are learnable, and can be leveraged for really fun and valuable quantitative analysis — so much so that we started Special Circumstances around that exact idea.

Thanks to Billie for an incredibly fun and catchy song, and of course to IYOIYO, Google Creative Lab and YouTube for the incredible opportunity.

If you are tackling challenging problems in film and video, don’t hesitate to reach out.

— Special Circumstances is Anton Marini & Rahul Somani.

Special Circumstances is a Computational Cinematography R&D company. We build next generation tools for filmmakers.