Image courtesy HUSH

Working with our friends at HUSH, we helped build a flagship interactive experience for AT&T and HBO Max. We wanted to outline some of our learnings and share the team’s journey with you.

The goal: Create a scalable destination-worthy retail experience that magically communicates HBO Max’s depth of content, accessible at the speed of 5G infrastructure and systems.

From the outset, HUSH had some concrete ideas of what The Orbit experience should be:

The Orbit experience shares HBO Max’s content library when a customer simply moves her body, or speaks a few simple words. Featured in nationwide AT&T retail flagships, and with a version on mobile / web, The Orbit sets a new bar for uniqueness, quality and inspiration for destination brand experiences. Visit The Orbit online.

Check out the case study video to get an idea:

“The Orbit Case study: Client: HBO Max, Design Agency: HUSH”

Given the size and extent of the catalog we had access to (pretty much the entire catalog of content HBO Max was streaming at the time), we needed to understand if automated content analysis was a viable route to powering the ideal Orbit experience.

Given that task, Special Circumstances’ mission was simple: can we help HUSH automatically select diverse, relevant and important moments from the archive and bubble them up to the interactive front end the HUSH team had imagined?

Building our intuition

Our initial task was to get a handle on the data and understand any issues we would need to wrangle in the archive. We were delivered a set of exploratory data: a few master episodes from various HBO properties. Our goal was to mine the episodes for as much data as possible, and get a grasp of what we could and couldn’t reliably understand about the content.

The first hurdle is understanding the edits. While you might be familiar with the editing style of your favorite series, it’s quite a different task to reverse engineer that style quantitatively and understand the pacing, composition, and stylistic choices in a reliable and repeatable manner.

Read more about Computational Cinematography and creative metadata in our co-founder Rahul’s article on AI and film making.

What you are looking at below is an episode of Euphoria, analyzed with our Synopsis Computational Cinematography software, and loaded up in our internal Cut Detector tool.

Here we are inspecting Kat’s intro sequence/montage for the episode. Notice how each shot is distinct from the previous, with few repetitions. Each thumbnail has mostly unique framing and composition, all speaking to us about the relationship between Kat and her colleagues.

Let’s compare that with a dialogue sequence between Rue and Ali from later in the show:

Euphoria content courtesy HBO

Notice the pattern of shots, the consistent closer framing and the continuity implied by returning to the same composition and subject as the dialogue moves between characters? Here are some more dialogue examples:

Dialogue sequences from Big Little Lies. Images courtesy HBO
Dialogue sequences from True Detective. Images courtesy HBO.
Some dialogue sequences from The Sopranos. Images courtesy HBO.
From The Sopranos. Images courtesy HBO.

The sequence above from The Sopranos has some similarities to a dialogue sequence: it continually returns to close up shots of Adriana, interspersed with wider location shots. However, the dialogue is with herself and about where she is going; it is the inner conversation Adriana is having as she contemplates her fate. Only once you see who is driving do you realize why she is so concerned. Great editing. This is also educational: clearly we won’t always have perfect A/B shot repetition for dialogue. There aren’t hard rules, just stylistic guidelines and editing conventions on what tends to work and what doesn’t.

The main point of all of this is that we will have a lot of “similar” shots, especially in areas of dialogue.

Shot Understanding

As a sequence of edits is made of individual shots cut together, we need to ensure we understand useful information about the shots themselves so we can select for relevance for The Orbit experience.

For example, if you were looking carefully — you might have noticed something missing: There aren’t many full body shots in any of the sequences above. In other words — we rarely see someone from head to toe. This has ramifications for doing any sort of body tracking or full body pose matching.

This intuition is backed up by inspecting our decompiled episodes in aggregate. Here are roughly 10,000 individual shots loaded into Synopsis Inspector.

Roughly 6000 Medium, Medium Close Up and Close Up shots:

Querying for Medium and Close up shots. Images courtesy HBO.

Roughly 1000 Long and Wide and Extreme Wide shots:

Querying for Long and Wide shots. Images courtesy HBO.

Given that almost all episodic content is narrative involving people, and story tends to be driven by dialogue, it intuitively makes sense that we will have a huge bias towards medium and close up shots. This is so we as an audience can focus on expression and performance. Long and wide shots help establish location and the context for the story: the characters’ relation to the space and the relationship between characters. Once the relationships are established in a scene, we typically see tighter shots.

Given the interactions we want for our ideal Orbit experience, we need to automatically make selects so we can find good clean face poses, upper body poses (given the lack of long shots — a great inclusive accessibility side effect too) and spoken audio. This means we need to recognize a slew of features in the stream of edits.

Creative Metadata

Editing is not just about what you keep, it’s mostly about what you remove. Given our goal of memorable, diverse content for the experience, we can feel comfortable rejecting clips with title sequences, establishing shots, wide shots, extreme close ups, insert shots, prop shots, over the shoulder shots, and group or crowd shots without a hero/lead.

This means we need to be able to identify a large variety of signals from our individual shots and have high confidence on what we can and cannot keep.

Given all the above reasoning, we have a loose “acceptance / rejection” heuristic for shots that are most likely useful in our experience:

  • Clean singles (i.e., one main subject in the shot).
  • Medium, medium close up and close up shots only.
  • No over the shoulder shots.
  • No group or crowd shots.
  • No titles, establishing or prop shots.
  • No gore, or (suggested) nudity / violence.

Additionally, given the dramatic differences in mood from shows like Euphoria to classics like Friends, creatively we wanted to be able to distinguish between the lighting, colors, composition and locations so we could experiment with content presentation. This meant we should keep track of:

  • Lighting cues.
  • Color cues.
  • Location cues.
  • Composition cues.
Our query: “shot.type.cleansingle and (shot.framing.medium or shot.framing.closeup) and not shot.type.overtheshoulder”. Images courtesy HBO.
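
To make the acceptance / rejection heuristic concrete, here is a minimal Python sketch of the same filter. It is not the Synopsis query engine; the `Shot` class, its field names and the label strings are hypothetical stand-ins for whatever creative metadata the model infers.

```python
from dataclasses import dataclass

# Hypothetical labels; the real model's vocabulary may differ.
ACCEPTED_FRAMINGS = {"medium", "medium_close_up", "close_up"}


@dataclass(frozen=True)
class Shot:
    framing: str                   # e.g. "close_up", "wide", "extreme_close_up"
    shot_type: str                 # e.g. "clean_single", "over_the_shoulder", "group"
    flagged_content: bool = False  # gore, or (suggested) nudity / violence


def accept(shot: Shot) -> bool:
    """Return True if the shot passes the loose acceptance heuristic."""
    return (
        shot.shot_type == "clean_single"          # one main subject only
        and shot.framing in ACCEPTED_FRAMINGS     # medium through close up
        and not shot.flagged_content              # nothing unsuitable for retail
    )


shots = [
    Shot("close_up", "clean_single"),
    Shot("wide", "establishing"),
    Shot("medium", "over_the_shoulder"),
    Shot("medium_close_up", "clean_single", flagged_content=True),
]
selects = [s for s in shots if accept(s)]
```

Requiring a clean single already excludes over the shoulder, group, title, establishing and prop shots, so the only extra checks needed are framing and flagged content.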

For our audio portion, HUSH made a set of fun phrases we wanted users to discover (for example, “I love you”), utterances like ‘huh’, ‘haha’ and ‘ummm’, and iconic phrases from lead properties (“Winter is coming”). We needed to detect and extract those moments from our content as well.

Modeling Cinema

Building the model to understand the cinematic cues we need is no cakewalk. Creating multiple datasets, trying out different architectures and picking optimal hyperparameters is a painstaking process that’s further complicated by the need to train for multiple tasks in a cohesive fashion.

It’s crucial to be able to rapidly iterate over new ideas and datasets. We used frameworks like fast.ai and IceVision to build custom pipelines that allowed us to quickly try out high level ideas while giving us the freedom to fine tune nitty gritty details when we wanted to optimize further.

For model tracking, we relied on the excellent W&B platform. For coordinating labelling tasks, we relied on a mix of custom workflows and tools as well as Label Studio.

When we built the model that powers Synopsis, we arrived at an architecture that allowed for easy extension, with a shared backbone that generalizes extremely well for cinematic tasks. This allowed us to extend our model in a short time to accommodate new needs for The Orbit.

For the macOS apps, we used Apple’s coremltools to convert the trained PyTorch model into a CoreML model to leverage the latest M1 chip to the fullest.

Content Ingestion Pipeline

Now that we have a sense that we can reliably extract relevant selects from a slew of different types of content, we need to think about real world deployment concerns.

Given the amount of data (nearly petabyte scale), and the fact we would be handling some unreleased content from HBO and their partners, we needed to build our Content Ingestion Pipeline on HBO managed AWS infrastructure to meet their content security policy requirements. We worked closely with HBO’s internal DevOps team to ensure our instances were locked down, made no external calls, opened no unexpected ports, ran on their private VPC stack and passed all security checks, and that all auth ran through an HBO managed bastion server.

Architecturally, the pipeline needed to read from an ingest bucket of media and, for each individual video master file:

  • Run edit detection, edit extraction, and visual and audio analysis.
  • Run initial machine acceptance / rejection scoring based on inferred creative metadata.

For edits passing the initial machine acceptance criteria:

  • For close up shots, or medium close up shots, compute facial key points.
  • For medium or medium long shots, compute partial body pose.
  • Store computed features for later serving to the interactive front end.
  • Transcode to appropriate deliverable codecs for web and flagship and normalize resolution and audio channel layouts.

For audio phrase analysis, the pipeline needed to parse subtitles to find our phrases, and then align subtitle timing (which isn’t exact by any means) to actual spoken words with frame accurate timing.
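
As a rough illustration of the first half of that step, the sketch below searches subtitle cues for a target phrase and returns a padded time window to hand to a separate audio alignment stage. It assumes SRT formatted subtitles, which may not match the formats in the archive; the function names and the half second padding are our own choices for the example. It does not perform the word level alignment itself.

```python
import re
from datetime import timedelta

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:03,500".
CUE_TIME = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _tokens(text):
    """Lowercase a string and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def parse_srt(srt_text):
    """Return a list of (start_seconds, end_seconds, text) cues."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            match = CUE_TIME.search(line)
            if match:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
                start = timedelta(hours=h1, minutes=m1, seconds=s1, milliseconds=ms1)
                end = timedelta(hours=h2, minutes=m2, seconds=s2, milliseconds=ms2)
                text = " ".join(lines[i + 1:])
                cues.append((start.total_seconds(), end.total_seconds(), text))
                break
    return cues


def find_phrase(cues, phrase, padding=0.5):
    """Return padded (start, end, text) windows for cues containing the phrase."""
    target = _tokens(phrase)
    if not target:
        return []
    hits = []
    for start, end, text in cues:
        words = _tokens(text)
        found = any(
            words[i:i + len(target)] == target
            for i in range(len(words) - len(target) + 1)
        )
        if found:
            # Subtitle timing is only approximate, so widen the search window.
            hits.append((max(0.0, start - padding), end + padding, text))
    return hits
```

Each returned window is only a candidate region. Frame accurate timing still has to come from aligning the audio inside that window.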

Additionally, the pipeline needed to ensure that the phrase was spoken by on screen talent. Modern editing often has continuous dialogue running across edits, so we will often see reaction shots to the dialogue rather than the character speaking. That’s not ideal for our experience. This meant edit detection for visual tracking needed to happen independently of the phrase detection, extraction and audio alignment.

This meant one more analysis phase and transcode path.

Finally, given the high profile nature of the content, once the automated process had whittled the content down it needed to pass a final human review phase, and then content would be ready to be served to our various front ends.

Wowza.

We worked closely with HUSH to help design the various components of the pipeline and make sure everything spoke with one another. We focused our efforts on the initial edit detection and creative metadata inference, which ran the first pass of automated content filtering, and on the transcoding / format conversions. HUSH’s interactive prowess powered the downstream video and audio analysis, the incredible fabrication, and the front end interaction and content design.

Master of None

One large technical hurdle the team had to tackle in the Content Analysis Pipeline was the fact that video archives have been compiled over time. This meant files were digitized when encoding standards and formats were different. Additionally, some content is licensed from 3rd party broadcasters with different internal standards and encoding targets. Finally, much older content was created when video broadcast standards weren’t even digital, and workflows were dramatically different.

This meant that we had to understand the diversity of input formats and ensure that all parts of the pipeline could properly handle whatever was thrown at us.

In short, we had to ensure that our pipeline could properly ingest, analyze and transcode any combination of the following formats:

Container Formats:

  • MXF
  • Quicktime

Video Codecs:

  • ProRes 422 & ProRes 422 HQ
  • Motion JPEG 2000
  • NVEnc / NVDec H.264
  • HAP

Resolutions:

  • 480p
  • 720p
  • 1080p
  • 2k

Framerates (interlaced and progressive):

  • 23.976
  • 24
  • 25
  • 29.97
  • 30
  • 59.94
  • 60

Color Profiles:

  • Rec 601
  • Rec 709
  • Rec 2020

And finally SMPTE Time code support.
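
One practical consequence of supporting these frame rates: the fractional NTSC rates are decimal roundings of exact ratios (24000/1001, 30000/1001 and 60000/1001), and treating them as floats or as their nominal integer rates causes timing drift over a long master. A minimal sketch of exact frame/time conversion follows; the helper names are ours for illustration, not part of any pipeline API.

```python
from fractions import Fraction

# Exact NTSC-derived rates, keyed by their common rounded names.
NTSC_RATES = {
    "23.976": Fraction(24000, 1001),
    "29.97": Fraction(30000, 1001),
    "59.94": Fraction(60000, 1001),
}


def frame_to_seconds(frame_index, rate):
    """Exact presentation time of a frame, as a Fraction of seconds."""
    return Fraction(frame_index) / rate


def seconds_to_frame(seconds, rate):
    """Index of the frame on screen at the given time (non-negative times)."""
    return int(Fraction(seconds) * rate)


rate = NTSC_RATES["23.976"]
frames_exact = seconds_to_frame(3600, rate)            # one hour at 24000/1001
frames_nominal = seconds_to_frame(3600, Fraction(24))  # one hour at a true 24
drift = frames_nominal - frames_exact                  # frames of error per hour
```

Over one hour the nominal and exact frame counts differ by dozens of frames, which is more than enough to break frame accurate phrase extraction.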

To make the video portion even more nuanced, we found issues with the latest builds of FFmpeg, where certain MXF masters with Motion JPEG 2000 streams from some encoder vendors would cause unrecoverable decoder errors. We had to dig through older FFmpeg releases to find a version which supported all of the features and codecs we needed while still being able to run hardware accelerated video encode via CUDA alongside up to date drivers.

This was not fun.

If you happen to work with professional video and ML, we built a Docker container with FFmpeg compiled with NVEnc and HAP support, alongside GPU accelerated PyTorch and ONNX installs, as the base image for the Content Ingestion Pipeline. You can find it here. We hope it’s helpful!

Audio Formats:

Audio master formats proved to be a huge unexpected complication, as we needed to identify English speaking stereo channel pairs to feed to the audio analysis / phrase alignment portion of the pipeline.

Audio channel layouts varied from vendor to vendor and even within shows from different seasons (they might be finished by different post production facilities). Additionally, audio channel layouts often had no embedded channel metadata which would otherwise indicate where we might find 2 English stereo channels. This is primarily due to the historical lack of metadata authoring tools and reliance on burned in slate guidance rather than machine marked metadata.

The difference between modern channel layout metadata on the left vs no audio channel metadata on the right. Pop quiz: on the right, which one of the 6 tracks (12 channels) is the English language stereo pair we want? I don’t know either. Bonus points: which timecode track should I use? Neither. It’s the unnamed “Other #3”.

For example, we might have layouts with 3 stereo pairs each as single mono tracks — one Spanish pair, one English pair and one music only. We might have some 5.1 channels thrown in on a single track. Or 6 unmarked mono tracks that are mixed for 5.1 — next to our previous 6 unmarked mono stereo tracks.

We could have up to 21 unmarked mono tracks / channels in a single episode and need to deduce which was the correct English stereo pair to extract and pass downstream.
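
A first triage step for this problem can be sketched as follows: list every audio stream in a master and flag the ones whose metadata is too thin to trust. This is an illustrative sketch, not the pipeline's actual logic. It assumes the JSON text comes from `ffprobe -v error -show_streams -of json <file>`, and the `needs_review` rule is our own simplification.

```python
import json


def summarize_audio_streams(ffprobe_json):
    """Summarize audio streams from ffprobe JSON and flag unmarked ones."""
    probe = json.loads(ffprobe_json)
    summary = []
    for stream in probe.get("streams", []):
        if stream.get("codec_type") != "audio":
            continue
        tags = stream.get("tags") or {}
        language = tags.get("language")
        layout = stream.get("channel_layout")
        summary.append({
            "index": stream.get("index"),
            "channels": stream.get("channels"),
            "channel_layout": layout,
            "language": language,
            # No layout or no usable language tag: a human (or a smarter
            # heuristic) has to decide what this track actually contains.
            "needs_review": not layout or language in (None, "", "und"),
        })
    return summary
```

Streams that come back with `needs_review` set are the ones that cannot be routed automatically from metadata alone.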

Some masters had channel assignments marked on the slate (the opening portion before picture). That helped, however those assignments were not always correct.

To be clear, this is a pretty standard affair. Huge archives from diverse sources have errata, and aren’t all conformed to a single digital encoding standard.

Human Review

Given the high profile nature of the content we were working with, we all agreed that relying on a purely automated system wasn’t tenable.

We couldn’t have an accidental gory or nude moment slip by, and we wanted to ensure the overall feeling of the experience was upbeat and showed characters in the best light, with no awkward poses, faces or moments.

This meant the pipeline needed one final component, a human review portal to ensure all content passed standards.

We also leveraged the review opportunity to make an “audio track assignment portal” for any content where we had no great guesses at our target stereo pairs.

Resulting Architecture

Image courtesy HUSH

Abstractly, the pipeline is a metaphorical sieve. We begin with an episode which may consist of 500 to 2000 individual edits. Machine curation runs and we filter out 60–70% of that content automatically. Next we run our deeper vision feature extraction and, given the confidence values of particular predictions, we may filter yet again. Next is human review, acting as an additional filter and resulting in only the juiciest moments from each episode.
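
To put rough numbers on the first stage of the sieve, the short calculation below works out how many edits survive machine curation. It assumes the 60 to 70 percent figure is the share of edits rejected; the later vision and human review stages reduce the count further by amounts the figures above do not specify.

```python
def survivors(edit_count, rejected_percent):
    """Edits remaining after machine curation rejects the given percentage."""
    return edit_count * (100 - rejected_percent) // 100


for edit_count in (500, 2000):
    low = survivors(edit_count, 70)   # heaviest filtering
    high = survivors(edit_count, 60)  # lightest filtering
    print(f"{edit_count} edits -> {low} to {high} candidates")
# 500 edits -> 150 to 200 candidates
# 2000 edits -> 600 to 800 candidates
```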

Given all of the constraints, the final pipeline looked more or less like this:

A simplified overview of The Orbit’s back end Content Ingestion Pipeline, which ran on AWS

The above workflow phases mapped very well to isolated stand alone computational tasks, and so the various internal teams built their pipeline components with those tasks in mind, understanding we should parallelize as many tasks as possible.

We ran all analysis on a cluster of GPU instances (around 150 to 200 machines given Amazon’s regional capacity at the time) which were fed enqueued jobs from SQS. Each job was a specific episode or movie master file. Due to the size of the files (near 100 GB each) quite a lot of time was spent on file copies and disk reads for analysis and transcoding.
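
The queue-fed worker pattern described above can be sketched in a few lines. This is an assumption-laden illustration rather than the production worker: the message body format (a JSON object with a `master_key` field) and the `handle_master` callback are hypothetical, and `sqs_client` is expected to behave like a boto3 SQS client, whose `receive_message` and `delete_message` calls are used here.

```python
import json


def process_next_job(sqs_client, queue_url, handle_master):
    """Pull one job from the queue, process it, then acknowledge it.

    Returns True if a job was processed, False if the queue was empty.
    """
    response = sqs_client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,   # one master file per job
        WaitTimeSeconds=20,      # long polling to avoid busy loops
    )
    messages = response.get("Messages", [])
    if not messages:
        return False

    message = messages[0]
    job = json.loads(message["Body"])     # hypothetical body: {"master_key": "..."}
    handle_master(job["master_key"])      # analysis + transcode for one master

    # Delete only after successful processing, so a crashed worker's job
    # becomes visible again and is retried by another instance.
    sqs_client.delete_message(
        QueueUrl=queue_url,
        ReceiptHandle=message["ReceiptHandle"],
    )
    return True
```

Because each job is a single master file, workers stay independent and the cluster can scale up or down with whatever GPU capacity is available.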

Our first pass machine analysis would transcode a single high quality HAP file of an edit for the flagship experience, and multiple web variants encoded to H.264 by the GPU, and extract each audio channel as a stand alone mono track to drive the audio track assignment interface if needed.

Computed features were saved at the end of each task run into a feature storage database built by HUSH, allowing us to not have to re-compute an entire episode if we wanted to tweak some downstream algorithms.

Post human review, HUSH designed a final pipeline stage that would group together shots for diversity in the experience and to ease the load on web and flagship front end clients. This pass would lean on features from all phases of analysis to programmatically create groups of content we called “buckets”.

Bucketing helped with repeats from the same show, and let the design team curate each experience to feel unique with no repeats for the same user across multiple sessions. Technically this also meant the front end teams could pre-fetch media for any given session to reduce UI latency.
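
HUSH's bucketing logic drew on many features and is not reproduced here. As a simplified illustration of the core idea only, the sketch below groups approved shots into fixed size buckets so that no bucket contains two shots from the same show and no shot is used twice. The function name and the `(shot_id, show)` input shape are hypothetical.

```python
from collections import defaultdict, deque


def build_buckets(shots, bucket_size):
    """Group (shot_id, show) pairs into buckets with one shot per show each."""
    if bucket_size < 1:
        raise ValueError("bucket_size must be at least 1")

    by_show = defaultdict(deque)
    for shot_id, show in shots:
        by_show[show].append(shot_id)

    buckets = []
    while True:
        # Shows that still have shots, those with the most remaining first,
        # so large shows are spread across as many buckets as possible.
        available = sorted(
            (show for show in by_show if by_show[show]),
            key=lambda show: (-len(by_show[show]), show),
        )
        if len(available) < bucket_size:
            break  # leftovers cannot fill a bucket without repeating a show
        buckets.append([by_show[show].popleft() for show in available[:bucket_size]])
    return buckets
```

Since every bucket is fixed ahead of time, a front end can fetch an entire bucket's media before a session starts.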

Images courtesy HUSH

Wrapping it up

The Content Ingestion Pipeline was one small part of the overall Orbit design and technology challenge. There is a lot more to talk about: from fabrication and UI / UX design to interactive computer vision components and interaction design. Suffice it to say, the teams at HUSH built an incredible next generation interactive experience. We are proud to have played a key part in it.

This was also the most ambitious use of Synopsis to date. This project featured some of our favorite shows to ever be made (The Wire, The Sopranos, Euphoria, the list goes on), and it was extremely satisfying to see the intuition that paying attention to the creative details in our favorite medium can really pay off.

Finally — we couldn’t ask for a better team to have worked with. HUSH, HBO and AT&T were a pleasure to collaborate with.

If you are tackling challenging problems in film and video, don’t hesitate to reach out.

— Special Circumstances is Anton Marini & Rahul Somani.

Special Circumstances is a Computational Cinematography R&D company. We build next generation tools for film makers.
