Experiments with Computational Cinematography
The Walkbox event took place in the fall of 2017, long before our company Special Circumstances was formed. The ideas and lessons learned from Walkbox helped inform Synopsis, our computational cinematography toolkit, and also helped seed the music video editing techniques used in Trash, a fun, social, mobile video editor which is now part of VSCO.
This project helped highlight that production and post production are very close to a revolution in new, content-aware tools that will change the industry and overhaul existing workflows used by filmmakers of all kinds.
We wanted to share some lessons, insights and challenges we faced.
In collaboration with Technology Humans and Taste and AV&C, we designed and created Walkbox, an immersive experience machine of epic proportions in Shanghai that capped off Michael Kors’ global The Walk campaign.
Walkbox made each guest the star of their own brand fashion film, through the experience of a director-led film shoot moving down a 50’ walk against an LED wall that placed the subjects in glamorous locations shot around the world.
By night’s end, we’d published over 300 unique, personal, and beautiful films, shared broadly by celebrities, influencers, and guests, and engaged with by millions online.
Besides the various creative challenges Walkbox entailed, the major technical hurdle was to reduce content acquisition and editing time to near realtime while retaining high quality and engaging content. The target: as one model walks off the runway, a 30 second video is edited, complete with titles and credits, color correction, and a sound track, then exported and ready for posting to social media while the next model steps onto the runway.
Typically content is edited after a shoot, not on premises. On the occasions that content is edited on premises, there is a significant lag time before final edits are available. A large number of editors and edit workstations are required for fast turnaround, which comes with its own logistics, media and data management, and workflow issues. We wanted to avoid a large bull pen of editors and equipment, and streamline the process as much as possible while retaining creative control and agency in the output.
Creatively, our clients wanted edits to feel deliberate, like a short form fashion film with consistent branding and messaging. With those constraints in mind we helped design an on-premises production to post production pipeline which allowed both human creativity and machine analysis to shine. It worked as follows:
- 3 camera feeds would be fed into an analysis system, which would generate useful and relevant creative metadata to power downstream edit logic.
- A machine assisted editing tool would leverage the creative metadata and create multiple rough cuts which would then be exported as a Final Cut Pro project for human review. Rule based edit logic with some fuzziness baked in would provide a way to tweak rough cuts the day of the show.
- Our “human in the loop”, an experienced senior quality control editor, could then review each rough cut, select the best, and quickly adjust the edit to taste if necessary.
These 3 phases were acquisition, analysis and edit, and finally review.
Step 1 — Acquisition
We had 3 cameras follow a model on their runway walk, with a director monitoring the shots and providing feedback to camera operators. The runway was backed by a large format video wall, providing backdrop content and dynamism. The video backdrop was designed to have 3 chapters; an intro, hero “paparazzi” moment, and an outro; each featuring various global locations, and with their own display and design logic. This provided structure and pacing and variety to each walk.
Because this was a live show, we didn’t have the luxury of multiple takes; we had to get usable shots in a single walk. To that end, each camera operator was asked to get specific types of shots and hold particular framing, so we had enough coverage to work with for our edit.
We chose close up, medium and long framing, with camera operators also capturing moments useful for insert edits, such as accessories or garment highlights.
We also shot over-cranked, so we could slow our models down a bit, provide a more epic feel, and highlight the flow of the garments on display.
Step 2 — Analysis and Edit
For our analysis phase, we designed a machine learning model which understood ‘creative cinematic concepts’ such as shot framing (close up, medium, long etc). Overall, we had roughly 10 classes relating to creative cinematic terminology in our experimental model, which was trained with TensorFlow and converted to run on CoreML on a Mac Pro.
If you are curious about computational cinematography and creative metadata, our co founder Rahul’s article here on AI and film making is a great primer.
We needed to be able to analyze at least 3 times faster than realtime on 1080p ProRes 422 HQ content so that we could keep up with our 3 camera sources. CoreML, while fairly new at the time, allowed us to easily get high performance via multiple Metal accelerated GPUs, integrate with creative tools like Final Cut Pro and leverage high performance video SDKs like AVFoundation and IOSurface. Combined with Grand Central Dispatch and NSOperation, this gave us a multi threaded, GPU accelerated video decode and analysis pipeline.
Each camera’s generated metadata was then ingested into a custom editing tool we developed named “Sync”. Sync was built with AVFoundation, namely the AVMutableComposition API, which provided the foundational components of a functional editing timeline. AVComposition supports edits, effects, transitions and the like. These APIs sat underneath a rule based editing framework which took predictions from our CoreML model and determined sequences of edit decisions using fuzzy procedural logic.
We didn’t have the time to train and integrate sequence based neural networks like an RNN or LSTM to make edit decisions for us, so our edit ruleset was procedural, mixed with fuzzy logic and embracing a bit of random chance and happy accident. By chaining edit logic ‘closures’ together we could generate sequences of any length, similar to a Markov chain.
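As a rough illustration of that throughput budget, the arithmetic can be sketched in Python. The function names are hypothetical, and the 30 second walk length in the usage note is an assumed figure for illustration, not a measurement from the event.

```python
def required_speedup(num_cameras: int, analyzers: int = 1) -> float:
    """Speed-up over realtime each analyzer needs so that analyzing every
    camera's footage takes no longer than the walk itself."""
    return num_cameras / analyzers


def analysis_seconds(walk_seconds: float, num_cameras: int, speedup: float) -> float:
    """Wall-clock time to analyze one walk's footage from all cameras
    on a single analyzer running at `speedup` times realtime."""
    return walk_seconds * num_cameras / speedup
```

With 3 cameras and one analysis machine, `required_speedup(3)` gives 3.0, and an assumed 30 second walk analyzed at 3 times realtime takes `analysis_seconds(30, 3, 3.0)`, or 30 seconds: the analysis just keeps pace with the next walk.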
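The idea of chaining rule closures can be sketched in a few lines of Python. This is a minimal illustration and not Sync's actual code: the framing labels, the transition weights and the function names are all assumptions made up for the example.

```python
import random
from typing import Callable, Dict, List, Optional

# A rule is a closure that, given a random source, picks the next shot framing.
Rule = Callable[[random.Random], str]


def make_rule(weights: Dict[str, float]) -> Rule:
    """Build a closure that picks the next framing by weighted chance."""
    options = list(weights)
    probs = [weights[o] for o in options]

    def rule(rng: random.Random) -> str:
        return rng.choices(options, weights=probs, k=1)[0]

    return rule


# Hypothetical weights: one rule per current framing. No rule lists its own
# framing, so the same framing is never used twice in a row.
RULES: Dict[str, Rule] = {
    "long": make_rule({"medium": 0.6, "close": 0.4}),
    "medium": make_rule({"close": 0.5, "long": 0.3, "insert": 0.2}),
    "close": make_rule({"medium": 0.5, "long": 0.5}),
    "insert": make_rule({"medium": 0.7, "long": 0.3}),
}


def generate_sequence(start: str, length: int, seed: Optional[int] = None) -> List[str]:
    """Chain rules: each chosen framing selects the rule that picks the next one."""
    rng = random.Random(seed)
    sequence = [start]
    while len(sequence) < length:
        sequence.append(RULES[sequence[-1]](rng))
    return sequence
```

Calling `generate_sequence("long", 8)` returns eight framings starting with a long shot. Because the next choice depends only on the current framing, the chain behaves like a Markov chain, and adjusting the weights is one way rough cuts could be tweaked without retraining anything.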
Sync produced rough cuts based on what were deemed the most salient moments in each camera’s metadata stream, and edited our camera sources to a set of pre-selected sound tracks.
The sound tracks provided an important ‘editing structure’ for us to place selects from our video metadata and put limits on how long each edit should be. Audio signatures such as BPM, valence and energy were used to help provide timing for our editing decisions.
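To illustrate how tempo can drive edit timing, here is a minimal Python sketch that places cuts on beat boundaries. The energy thresholds, the beats-per-cut values and the function names are assumptions for the example; the article does not describe the exact mapping Sync used.

```python
from typing import List


def beats_per_cut_for_energy(energy: float) -> int:
    """Hypothetical mapping: more energetic tracks get faster cutting.
    `energy` is assumed to be normalized to the range 0..1."""
    if energy >= 0.7:
        return 2
    if energy >= 0.4:
        return 4
    return 8


def cut_times(bpm: float, track_seconds: float, beats_per_cut: int = 4) -> List[float]:
    """Return the timeline positions (seconds) where each edit starts,
    spaced a whole number of beats apart."""
    step = (60.0 / bpm) * beats_per_cut  # seconds between cuts
    times = []
    n = 0
    while n * step < track_seconds:
        times.append(round(n * step, 3))
        n += 1
    return times
```

At 120 BPM a beat lasts half a second, so cutting every 4 beats over a 30 second track yields a new edit every 2 seconds, 15 edits in total. The beat grid is what limits how long each edit can be.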
In the end, we decided to provide 3 rough cuts each with slightly different color and edit rules that our creative team chose to help diversify the edits. Edit generation was fully automatic once we rolled on the first runway walk. The editor on their workstation simply had a new Final Cut Pro project pop up on their desktop waiting for them moments after a runway walk was completed.
Sync was designed to export to Final Cut Pro by leveraging FCPXML, so we were able to bake in various creative looks, slow motion and color correction, as well as place title cards and end credits. It was also of critical importance to provide head and tail for each edit so our editors had leeway to adjust the timing. This meant the editor could spend more time adjusting the structure of the edit and not noodling around with key frames or applying effects, helping us dramatically increase edit throughput.
Head and tail refers to ‘extra’ video at the beginning and end of an edit, so that an editor can adjust the timing of an edit without adjusting the entire sequence and having a cascade of fixes to make.
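A minimal Python sketch of computing head and tail for a clip follows. The `Clip` type, the function name and the handle length are hypothetical; the point is only that the media made available extends past the used portion on both sides, clamped to the bounds of the source take.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Clip:
    source_start: float  # where the used portion begins in the source take (seconds)
    duration: float      # length of the used portion (seconds)


def with_handles(clip: Clip, handle: float, source_duration: float) -> Tuple[float, float]:
    """Return the (start, end) of source media to make available to the editor:
    the used portion plus `handle` seconds of head and tail, clamped to the take."""
    start = max(0.0, clip.source_start - handle)
    end = min(source_duration, clip.source_start + clip.duration + handle)
    return start, end
```

For a 2 second clip starting 10 seconds into a 40 second take, `with_handles(Clip(10.0, 2.0), 1.0, 40.0)` returns `(9.0, 13.0)`, leaving the editor a second of slack on either side to slip or trim the cut.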
In the end, we generated over 900 rough cuts and 300 final edits in the roughly 4 hours of the event. The last edit was completed moments after the final walk.
Here’s a sample of some edits we helped create:
One thing to note about this project’s format is that we had the benefit of creative constraints that we could leverage to our advantage.
Firstly this wasn’t a narrative film with different scenes, dialogue or continuity constraints. That easily reduced the complexity by an order of magnitude or more. Fashion shoots and edits can be free form, whimsical and tend to be creatively very open in how they are edited. We could use that to our advantage.
Secondly, because each camera is a single take, our editing logic can roughly keep the same ‘absolute’ time, bounce between cameras and keep temporal consistency. This is a huge win for this type of content, as having multiple takes would drastically increase the complexity, analysis time and decision making needed to provide a consistent, high quality result.
While we wanted to keep our QC editor’s work to a minimum, there were real world cases our model couldn’t account for that required a human editor’s intervention. For example, awkward edits that landed on or featured an odd look, glance or pose; or, as the event went into the night, obviously drunken stumbles that might not need to be highlighted (as amusing as that could have been).
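The single take simplification can be sketched in Python as follows. This is an illustration under stated assumptions: the camera names and function name are hypothetical, all cameras are assumed to share one clock, and slow motion is ignored for simplicity.

```python
from typing import List, Tuple


def multicam_segments(
    cut_points: List[float], end: float, cameras: List[str]
) -> List[Tuple[str, float, float]]:
    """Build (camera, source_in, source_out) segments for a multicam edit.

    Because every camera recorded the same single take, the source in/out
    times of each segment are simply the timeline times: switching cameras
    never requires searching for a matching moment in another take.
    """
    if len(cameras) != len(cut_points):
        raise ValueError("need exactly one camera choice per cut point")
    bounds = cut_points + [end]
    return [(cam, bounds[i], bounds[i + 1]) for i, cam in enumerate(cameras)]
```

For example, `multicam_segments([0.0, 2.0, 4.0], 6.0, ["A", "B", "C"])` returns `[("A", 0.0, 2.0), ("B", 2.0, 4.0), ("C", 4.0, 6.0)]`: the walk plays continuously while the view bounces between cameras.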
Overall, ensuring we had a human making creative final calls and having the agency to supersede any machine decisions was not just a bonus, but a hugely important workflow necessity for our guests to feel really good about the videos we made for them. In the end, if they aren't comfortable posting the edits, all of the technical wizardry is for nothing.
We opted for “machine assisted creativity”, not “creative automation”. There’s a huge difference between optimizing tedious tasks like applying creative looks and making selects, so that an experienced human operator can focus on creative output, and removing human creativity from the loop.
Finally, our intuition that creative metadata can help power workflow optimizations was proven more or less correct. With only a few classes and a TensorFlow model, we were able to find useful semantic signals in video content and use them to power edit decisions.
The machine learning model that powered Walkbox had roughly 10 classes, and was fine tuned against ImageNet. Our current model has over an order of magnitude more classes and concepts, and is trained using state of the art techniques and leverages many domain insights on a custom data set optimized for the various tasks cinematographers, editors and directors will focus on.
In the next few years production and post production will see an upheaval in workflows, as technologies like mixed reality power advances in virtual production and in camera effects. Machine learning and computer vision are empowering computational cinematography tools like Synopsis and Colourlab.ai.
Content creators will be able to focus on content and creation, not tedious workflow technicalities.