An AI-generated comic book image of a robot singing into a microphone.

It’s not live, it’s Memorex AI

This week’s activity is inspired in part by a creepy/creeping thought I (Amy) had over the winter break last year. My car radio dial landed on a station playing the Delilah show, a call-in song request show that has been on the air almost non-stop for, get this, nearly 40 years. As I listened to the show, which I had done periodically since I was a teenager, I wondered how much longer Delilah was likely to continue her show. Then a creepy question infiltrated my brain: what if Delilah continues to offer her show not as a live human but as an AI-generated Delilah voice, answering calls and taking requests? I shuddered at the idea.

As it turns out, generative AI companies are already way ahead of us. From a podcast featuring an AI-generated George Carlin comedy routine to campaign ads using AI-generated video and voice, we are already seeing generative AI being used to simulate human voice and video for a variety of purposes. It’s an intimidating continuum of options. You can have text read to you in any number of voices (Gwyneth Paltrow, Snoop Dogg, “Mr. President”, etc.) at Speechify.com, practice “relationship skills” on Blush (the AI dating app that “builds real relationships”), or check out YOV (You, Only Virtual), which promises that “we Never Have to Say Goodbye to those we love” because we can interact with their Versona, an AI-generated “essence” (voice and video) of a loved one.

A screenshot of virtual-influencer Lu do Magalu's Instagram page showing 6.8 million followers.

The use of voice and video generative AI tools to imitate/simulate humans will increase significantly as those tools become more sophisticated, cheaper, and easier to use (not to mention harder to identify as AI-generated). We should expect to see the likeness of dead actors or musicians coming “back to life,” the propagation of virtual influencers and virtual romantic partners, and of course, lots of mis/disinformation (which we’ll discuss in next week’s activity). In education, early explorations of AI-generated classmates in online courses and hologram lecturers have governmental education agencies scrambling to prepare for impacts on students, faculty, and institutions.

This realm of AI brings all of the issues we’ve seen with text and image creation to new heights and adds new complications and questions. For example, what rights do we have to protect our voice and image from being recreated with a generative AI tool? This question was a key part of the recent SAG-AFTRA actors’ and WGA writers’ strikes. The actors, for example, fought to require employers/production companies to get “clear and conspicuous” consent from actors to use their voice or likeness, and to pay actors for “the amount of days they would have been required to perform scenes featuring their digital likeness.” Writers and actors were able to advocate for control over their likeness and livelihoods, but what about people who are no longer alive or those who are unable to advocate for themselves or give consent (e.g., children)? And how would someone know that their likeness is being used for a generative AI simulation? What recourse is available if someone’s likeness is being weaponized against them (e.g., catfishing, bullying, scams)?

As the lines blur between human voice/video and AI-generated voice/video, and as AI-generated audio/video gets used more widely, it will become increasingly difficult for humans to prove that they are human when interacting with technology. We’ve already grown accustomed to having to prove we’re human for security purposes (e.g., CAPTCHA, which stands for Completely Automated Public Turing test for telling Computers and Humans Apart, aka those annoying ‘identify the squiggly letters’ tests). As bad actors (and even AI) have outfoxed those security mechanisms, CAPTCHAs have become complex enough that people struggle to complete them accurately. It’s estimated that humans collectively spend the equivalent of roughly 500 years every day proving they are human via CAPTCHAs while also providing free labor to train AI models.

Other human verification patterns, like using a reverse image search to debunk fake or photoshopped images, or holding up a piece of paper with a specific message on it (or a driver’s license, etc.), will likely not hold up against audio/image/video generated by AI. Tinder verification now requires users to submit a short video selfie, which it then matches to profile pictures using facial recognition. But even that level of verification can be outwitted by tools like AccuFace that use AI to map live video to an AI face (it’s worth watching this video). That software is available for about $250 right now.

A screenshot of the Speechify homepage showing various celebrity voices available to read to you.

There is also some portion of society that seems relatively unconcerned about what’s real and what is AI-generated. There have already been a number of highly successful AI social media influencers, like lilmiquela (a “🤖 19-year-old Robot living in LA 💖” with 2.6 million followers on Instagram), whom you can see starring in this BMW ad. It may feel a bit odd to some of us, but when you consider the long history of Photoshopping images, the increasing role of CGI in movies, talking geckos as brand voices (far less risky than humans), and Tupac’s hologram performance, AI personas may be just a bit farther down that continuum.

This kind of AI expansion of possibilities may also have some benefits. On the simpler end of the spectrum, Speechify lets you get any content read aloud with a voice, and speed, of your choice. Celebrity voices, like Snoop Dogg and Gwyneth Paltrow, are available, but users can also choose (or make) voices that improve the experience for them. The positive impact of text-to-speech is backed by research going back to the 1980s, and it’s clear that AI will increase the accessibility and quality of this type of content. More sophisticated interactions are being constructed to teach children positive self-talk, prompt wellness practices based on online behavior, and quite a bit more. The consequences of more sophisticated attempts to leverage AI in mental health will be unclear for some time. AI may well undermine social skills for some people while helping others learn them. It will be impossible to draw neat boxes around a technology that has this degree of power, complexity, and diversity of applications.

It’s worth staying informed about developments in the AI-generated voice and image space, even if you don’t intend to ever use those tools. While these technologies may provide benefits (like improved accessibility of digital content), they carry tremendous risks that we should all be aware of and prepared for. What can you do to prepare? One suggestion security experts offer is for families to have a safe word they can use to verify if a phone call from a loved one is actually from them, not an AI-generated bot (with a human bad actor directing it) seeking to phish sensitive data. Our personal, institutional, and governmental information security strategies will likely be upended as these technologies evolve and we’ll need to keep educating ourselves to try to protect and secure our data.

Activity: Make a mixed-media character

Step 1: Casting call

Of course, you can use these technologies to simulate audio/video of living humans, including yourself, and you’re welcome to try that, but we prefer to stick with fictional characters for this activity. The tools we are asking you to use are not super sophisticated (i.e., you will not get awesome results); we wanted to stick to free tools, and the free tools are just not as good as the paid ones. And, as always, take a look at the Nope! alternatives.

Generate a character description that seems like it’d be interesting to you. This can be based on a real character or something you’ve made up. Use ChatGPT or Bard to assist you if you’re feeling stuck or to help improve your description.

Step 2: Take a headshot

Take your character description and use it to generate an image of your character. You might go back to one of the image generators you liked previously or try something new.

Step 3: Make it move

Use your image as the base for a video. Hugging Face has a space for image-to-video that makes short clips based on still images.

If you’d prefer to just enter a text description to produce video, hotshot.co offers free accounts and you can also use the Hugging Face modelscope space if you don’t mind waiting a bit. Both services created some very odd visuals in our testing.

Step 4: Give it a voice

Voice generation can be based on an existing audio file. We’ve collected a few source files and put them up here if you’d like to download them and use them in your experiments.

We also have a video walkthrough on how to use these models on Hugging Face. The video will demonstrate how two different voice-to-voice models work. Those of you who speak other languages will definitely want to check out how these tools perform with languages other than English.

Step 5: What hath AI wrought?

Share your prompt back and show us how these AI services succeeded (or failed) to live up to your expectations. Use the form below to submit your media and reflections on what the future holds.

Step 6: Dig deeper

Character submissions

An AI-generated comic book image of a robot looking at itself in the mirror.

Nope!

Running much of this software on your computer is more complicated than you’re likely to want to tackle right now. Most of the online options outside of Hugging Face require logins of some kind.

That being said, you may already have some pretty complex options available on your phone or in other services that you already use. More recent iPhones offer “animated Memoji” options that allow you to control cartoon heads with your facial movements. Snapchat filters have long had elements of facial recognition and movement tracking built in. Even if you’ve never used Snapchat, their software page can give you a pretty good idea of what’s possible. Even Zoom does some live AI video manipulation for those of you who are frequently looking for something to try while stuck in online meetings.

Speechify, which we referenced a few times already, gives you a pretty good idea what AI is able to do regarding voice generation and celebrity imitation.
