The Intersection of Large Language Models (like ChatGPT) and Video

These days, no matter where you look, there are companies announcing product integrations with ChatGPT (now with GPT-4) and similar large language model (LLM) technologies. In some cases, these initial integrations are quick add-ons. In others, LLM-tech will form the foundation for future development. 

Over the past four months, we’ve been looking closely at how LLM-based technologies will form the foundation for effectively breaking down and indexing video (particularly video with dense speech). Let’s take a look at the problem.

The Problem: Video is Dense

Most long-form videos (say, longer than 10 minutes) are pretty dense, especially if there’s a lot of spoken material. But let’s say you spent a lot of time and money producing a video, and it contains rich material you’d like to identify and repurpose. How would you do that today? You’d send it out for someone to work on, or you’d find someone (maybe you) to take on the thankless task of watching the whole video to identify the most salient moments. No matter where you start, you’d probably have the video transcribed.

The Next Problem: Transcription Accuracy

Properly transcribing video (and audio files, for that matter) can be super frustrating. There’s company-specific lingo, obscure terms and jargon, hard-to-pronounce proper names, a range of accents and speaking patterns, and even wonky punctuation and capitalization. Typical transcription tools advertise single-digit percentage error rates, but in practice many bad transcripts come from poor use of context, not from a high count of individual word errors.
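
To make the error-rate point concrete, here is a minimal sketch of word error rate (WER), the metric usually behind those advertised percentages: the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. The function name and the sample sentences are ours, for illustration only, and real evaluations typically normalize punctuation and casing more carefully than this.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Classic dynamic-programming edit distance, computed over words
    # instead of characters. prev[j] holds the distance between the
    # reference words seen so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (made up): one mangled proper name, otherwise perfect.
reference = "thanks to priya from the analytics team for the quarterly numbers"
hypothesis = "thanks to maria from the analytics team for the quarterly numbers"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

A transcript like this scores a single-digit error rate, yet the one wrong word is the speaker's name, which is exactly the kind of mistake that makes a transcript feel bad.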

Enter the Large Language Model (LLM)

In the past six months, we’ve started to see large language models applied to transcription. These models, which are built on a massive amount of data collected from across the Internet, keep getting better at predicting what text comes next after preceding text. For video and audio transcription, that means they tend to do a much better job of predicting which spoken word most naturally follows the prior spoken words in context. And because these models are trained on so much of the world’s text, they are far more likely to catch the right proper nouns, terms, and even capitalization.

We just implemented one of these models, and we were blown away by the increase in transcription quality across a wide range of video subjects. On the downside, we were also surprised by how much extra work was required to tune the model so that it worked well.

Indexing Segments

Let’s say we now have a well-formed transcript. Now we can apply our natural language processing (NLP) algorithms to accurately identify keywords and topics. In other words, we can ask the machine to identify what a video is effectively “about.” Here at ContentGroove, we’ve created another layer of proprietary tech to identify sentence boundaries and, using those sentences, iterate through an entire transcript looking for clusters of sentences (effectively clips or segments) that are topically rich.
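
As a rough illustration of the idea (and emphatically not our proprietary implementation), the sketch below splits a transcript into sentences with a naive regular expression, scores each sentence by how many topic keywords it contains, and slides a fixed-size window over the sentences to find the most keyword-dense cluster. The function names, the fixed window size, and the hand-picked keyword set are all assumptions made for this example.

```python
import re
from typing import List, Set, Tuple


def split_sentences(transcript: str) -> List[str]:
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", transcript.strip())
    return [p for p in parts if p]


def keyword_hits(sentence: str, keywords: Set[str]) -> int:
    """Count how many words in the sentence are topic keywords."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    return sum(1 for w in words if w in keywords)


def richest_segment(sentences: List[str], keywords: Set[str],
                    window: int = 3) -> Tuple[int, int, int]:
    """Return (start, end, score) for the window of consecutive sentences
    with the most keyword hits; end is exclusive. Ties keep the earliest."""
    if not sentences:
        raise ValueError("need at least one sentence")
    window = min(window, len(sentences))
    scores = [keyword_hits(s, keywords) for s in sentences]

    best_start = 0
    best_score = current = sum(scores[:window])
    for start in range(1, len(sentences) - window + 1):
        # Slide the window: add the entering sentence, drop the leaving one.
        current += scores[start + window - 1] - scores[start - 1]
        if current > best_score:
            best_start, best_score = start, current
    return best_start, best_start + window, best_score


# Made-up transcript and keywords, purely for illustration.
transcript = (
    "Welcome to the show. Today we talk about weather. "
    "Transcription accuracy matters a lot. Language models improve transcription. "
    "Models predict words in context. Thanks for watching."
)
keywords = {"transcription", "models", "context", "accuracy"}

sentences = split_sentences(transcript)
start, end, score = richest_segment(sentences, keywords)
print(" ".join(sentences[start:end]))
```

In this toy version the keywords are supplied by hand; in a real pipeline they would come from the keyword and topic identification step, and segment boundaries would not be limited to a fixed number of sentences.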

The Next LLM Implementation: Summary

A really terrific use of LLM-tech is to summarize a segment of text. That’s what we do next. With segments auto-selected in the previous step, we can now run those segments through our LLM-based summary model to create an appropriate title for each segment. Here we researched numerous models for this purpose and found one that we could tune effectively for this use case. Today, this model works only in English, so we are starting the process of training it on other languages.

Our summary tech allows us to not only name segments that we auto-generate but in the future to name segments that our users create manually.

Combining the Ingredients

After you load video or audio files into our system, you’ll see that we automatically break down the media file using the process described above. We list our auto-generated segments and allow our users to click through them to jump to pre-selected segments to view/listen/modify. From there, it’s all about tuning and moving on to the next step, which could be to generate highlights, create derivative content for additional (e.g., social) channels, or to share internally. 

Want to do this programmatically? Please check out our public API.

The Future

In the (near) future, we will implement other technologies so that we can increase the richness of what we index from a complex video file. With the foundation described above, we will be able to overlay speaker identification, emotion/sentiment, text OCR, and even machine vision to provide even more context to our users.

Try It

Almost everything you’ve read about above emerged into the mainstream in the past six months. We think it’s a compelling set of tech to apply to the ever-growing mass of video and audio files on the web, and we’re looking forward to hearing your thoughts on our approach. Try all of this out here.

Note: this entire blog post was written by a human without the aid of ChatGPT, GPT-4, OpenAI, Bing, Bard, DALL-E, DeepMind, or even Grammarly. 😉