Update 2024-02-10

A few days ago Microsoft announced that OpenAI’s text-to-speech voices are (finally) available on Azure OpenAI Service.

Needless to say I wanted to see how they worked, and this article you’re reading offered the perfect excuse to do it. I had written it just after watching the OpenAI DevDay keynote, to see if I could use their own models to transcribe, summarize, illustrate and narrate the whole thing back to me.

Which apparently, I could (click that video 😳).


So now, having all the models available in Azure OpenAI, I set out to update my code to support them as well. It was a pretty seamless transition (same API as OpenAI, yay!), the only frustrating bit was figuring out which model was available in which region, and which API version each one required. The API versions were fun to debug, I’ll tell you – setting the wrong one would result in a very friendly and utterly self-explanatory NotFoundError: Error code: 404 - {'error': {'code': '404', 'message': 'Resource not found'}} message.

Read on to see what using both the Azure OpenAI and OpenAI versions of the following models looks like: GPT-4, Whisper, Text-to-Speech, and DALL·E 3.

Lessons Learned

  1. Whisper is fun to use and works really well. It will misunderstand some of the words, but you can get around that either by prompting it or by using GPT or good old string.replace on the transcript. It’s also relatively cheap.
  2. Text-to-speech is impressive – the voices sound quite natural, albeit a bit monotonous. There is a “metallic” aspect to the voices, like some sort of compression artifact. It’s reasonably fast to generate, too – it took 33 seconds to generate 3 minutes of audio. It blew my mind when I noticed they breathe in at times 😱.
  3. GPT-4 Turbo works rather well, especially for smaller prompts (~10k tokens). I remember reading some research saying that after about ~75k tokens it stops taking into account the later information, but I didn’t even get near that range.
  4. DALL·E is... interesting 🙂. It can render rich results and compositions, and some of them look amazing, but the lack of control (no seed numbers, no ControlNet, just prompt away and hope for the best) coupled with its pricing ($4.36 to render only 55 images!) makes it a no-go for me, especially compared to open-source models like Stable Diffusion XL.

Process

Here’s my process, step by step 😅.

Having downloaded the keynote somehow, I used ffmpeg to extract the audio track with this command:

ffmpeg -i "OpenAI DevDay, Opening Keynote.mp4" -vn -acodec libmp3lame -q:a 5 OpenAIKeynote.mp3

That -q:a bit specifies the mp3 compression level: the lower it’s set, the higher the quality and the corresponding file size. Since the Whisper API limits file uploads to 25 MB, setting -q:a 5 got me the highest audio quality that fit in a single API call.

Then, it was all Python.

These are the packages I’ve used, by the way. python-dotenv worked great for loading the configuration and avoiding having to make my API keys available to all processes.

openai==1.2
pydub==0.25.1
python-dotenv==1.0.0
tiktoken==0.5.1
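
For reference, this is roughly what the .env file looks like – the key names are simply the ones the code below reads, and every value here is a placeholder:

OPENAI_API_KEY=sk-...
AZURE_OPENAI_API_VERSION=2023-12-01-preview
AZURE_OPENAI_WHISPER_API_KEY=...
AZURE_OPENAI_WHISPER_API_ENDPOINT=https://<your-whisper-resource>.openai.azure.com/
AZURE_OPENAI_WHISPER_MODEL=<your-whisper-deployment-name>
AZURE_OPENAI_TTS_API_KEY=...
AZURE_OPENAI_TTS_API_ENDPOINT=https://<your-tts-resource>.openai.azure.com/
AZURE_OPENAI_TTS_API_VERSION=2024-02-15-preview
AZURE_OPENAI_TTS_MODEL=<your-tts-deployment-name>
AZURE_OPENAI_CHAT_KEY=...
AZURE_OPENAI_CHAT_ENDPOINT=https://<your-chat-resource>.openai.azure.com/
AZURE_OPENAI_CHAT_MODEL=<your-gpt-4-turbo-deployment-name>
AZURE_OPENAI_DALLE_KEY=...
AZURE_OPENAI_DALLE_ENDPOINT=https://<your-dalle-resource>.openai.azure.com/
AZURE_OPENAI_DALLE_MODEL=<your-dall-e-3-deployment-name>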

If you’re using Azure, just make sure use_azure is set to True, and make sure to call the appropriate client and model (whisper, tts, chat, and dalle) for each task.

You might wonder why I’m creating a client for each model – it’s because not all models are available in all regions, and then it becomes a game of remembering which client is in which region and has access to which model 🤕. I’ve found it much simpler to use dedicated clients for each model, even when they’re pointing to the same endpoints.

One area where I’ve diverged from this is the API_VERSION: I’m using 2023-12-01-preview for everything except TTS, where it’s 2024-02-15-preview (notice it’s from the future 😉, I’m writing this on February 10).

import json
import requests
from IPython.display import display, Markdown
from dotenv import dotenv_values
import tiktoken

config = dotenv_values(".env")

use_azure = True

if use_azure:
    from openai import AzureOpenAI
    whisper_client = AzureOpenAI(
        api_key=config['AZURE_OPENAI_WHISPER_API_KEY'],
        azure_endpoint=config['AZURE_OPENAI_WHISPER_API_ENDPOINT'],
        api_version=config['AZURE_OPENAI_API_VERSION'])
    
    tts_client = AzureOpenAI(
        api_key=config['AZURE_OPENAI_TTS_API_KEY'],
        azure_endpoint=config['AZURE_OPENAI_TTS_API_ENDPOINT'],
        api_version=config['AZURE_OPENAI_TTS_API_VERSION'])

    chat_client = AzureOpenAI(
        api_key=config['AZURE_OPENAI_CHAT_KEY'],
        azure_endpoint=config['AZURE_OPENAI_CHAT_ENDPOINT'],
        api_version=config['AZURE_OPENAI_API_VERSION'])

    dalle_client = AzureOpenAI(
        api_key=config['AZURE_OPENAI_DALLE_KEY'],
        azure_endpoint=config['AZURE_OPENAI_DALLE_ENDPOINT'],
        api_version=config['AZURE_OPENAI_API_VERSION'])

    whisper_model = config['AZURE_OPENAI_WHISPER_MODEL']
    chat_model = config['AZURE_OPENAI_CHAT_MODEL']
    tts_model = config['AZURE_OPENAI_TTS_MODEL']
    dalle_model = config['AZURE_OPENAI_DALLE_MODEL']
else:
    from openai import OpenAI
    client = OpenAI(api_key=config['OPENAI_API_KEY'])
    
    whisper_model = 'whisper-1'
    chat_model = 'gpt-4-1106-preview'    
    tts_model = 'tts-1-hd' # or 'tts-1'
    dalle_model = 'dall-e-3'

path_to_keynote_audio = "./OpenAIKeynote.mp3"
path_to_keynote_transcript = "./OpenAIKeynote_transcript.txt"
path_to_keynote_summary = "./OpenAIKeynote_summary.txt"
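
A quick note on the snippets that follow: they all reference a single client, which is what the plain OpenAI branch creates. For the Azure branch, each call should go through its matching per-task client (whisper_client, tts_client, and so on). A small lookup dict – purely hypothetical, not something in my notebook – would make that explicit:

if use_azure:
    # one entry per task, pointing at the dedicated Azure client
    clients = {"whisper": whisper_client, "tts": tts_client,
               "chat": chat_client, "dalle": dalle_client}
else:
    # with plain OpenAI, the single client covers everything
    clients = {task: client for task in ("whisper", "tts", "chat", "dalle")}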

Whisper

Whisper was a pleasure to use. Simple API call, the ability to use prompts to fix spelling issues, relatively fast and cheap. It took 2:20 minutes and cost 27 cents to transcribe the 45:35-minute keynote (which lines up with the listed $0.006 per minute).

Notice I didn’t bother with prompting to improve its named entity recognition – I wanted to see some results as fast as possible and wasn’t going to let some context info get in the way of that. Screenwriters call this foreshadowing.

audio_file = open(path_to_keynote_audio, "rb")
transcript = client.audio.transcriptions.create(
  model=whisper_model,
  file=audio_file,
  response_format="text"
)

with open(path_to_keynote_transcript, "w") as file:
    file.write(transcript)
tokens_count = len(tiktoken.encoding_for_model('gpt-4').encode(transcript))
print(f'Processing {tokens_count} tokens')

Processing 8967 tokens

Hmm, 8.9k tokens. Guess we’ll see if GPT-4 Turbo is worth its salt.

GPT-4 Turbo – Summarization

response = client.chat.completions.create(
  model=chat_model,
  messages=[
    {"role": "system", "content": "You are an expert at summarizing long videos, able to extract their key points and present them in a concise and energetic manner."},
    {"role": "user", "content": f"""Can you summarize this transcript and extract the main points?
     ```{transcript}```"""}
  ]
)

summary = response.choices[0].message.content
display(Markdown(summary))

The first time I ran this code was Thursday (Nov 9) evening and, well, I got this: RateLimitError: Error code: 429 - {'error': {'message': 'Request too large for gpt-4-1106-preview in organization org-REDACTED on tokens per min (TPM): Limit 10000, Requested 10018. The input or output tokens must be reduced in order to run successfully. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}.

Just my luck, 18 tokens too many (note that I was using an earlier version of the prompt; I haven’t checked how many tokens this one takes).

I checked my API limits and, lo and behold, only 10k tokens per minute.

The next day however, I noticed the limits were raised to 150k tokens per minute, making GPT-4 Turbo a bit more usable.
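
Had the limit stayed at 10k, the workaround would have been chunking the transcript with tiktoken and summarizing the pieces separately (then summarizing the summaries). A rough sketch, not something I ended up needing:

encoding = tiktoken.encoding_for_model('gpt-4')

def chunk_by_tokens(text, max_tokens=8000):
    # split the text into token-bounded chunks that fit under a 10k TPM limit
    tokens = encoding.encode(text)
    return [encoding.decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

chunks = chunk_by_tokens(transcript)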

This is the summary I got by the way:

Key Points from the OpenAI Dev Day Keynote by Sam Altman:

  1. Company Updates: OpenAI has made significant progress over the last year, introducing models like ChatGPT, GPT-4, and Dolly 3, voice and vision capabilities, and an enterprise version of ChatGPT. OpenAI has reached 2 million developers and 100 million weekly active users, with ChatGPT’s popularity growing through word of mouth.
  2. User Stories: Personal stories showcased how ChatGPT has been used in people’s lives, ranging from translation and writing assistance to helping visually impaired users and enhancing learning experiences.
  3. Announcement of GPT-4 Turbo: GPT-4 Turbo is the new model with several improvements:
    • Extended context length up to 128,000 tokens.
    • More control over model responses through JSON mode and reproducible outputs.
    • Up-to-date world knowledge as of April 2023.
    • New modalities like Dolly 3 image model, text-to-speech, and Whisper V3 for speech recognition.
    • Customization features like fine-tuning for better model adaptation.
    • Higher rate limits and introduction of Copyright Shield for legal protection.
  4. Pricing Updates: GPT-4 Turbo is significantly cheaper than GPT-4, with a large reduction in cost for prompt and completion tokens, making it more accessible to developers.
  5. Microsoft Partnership: Sam Altman welcomed Satya Nadella from Microsoft to discuss their partnership, focusing on developing infrastructure to support foundation models and aiming to bring the benefits of AI to everyone while emphasizing safety.
  6. Introduction of GPTs: Sam spoke about the creation of specialized versions of ChatGPT tailored for specific use cases, called GPTs, which can be programmed with language and shared publicly. The GPT store was announced for distribution and discovery of GPTs, with a revenue-sharing model for popular GPTs.
  7. Assistance API and Enhanced Developer Experience: The Assistance API simplifies building assistive agents with features such as persistent threads, built-in retrieval, Python in a sandbox environment, and improved function calling. Demonstrations illustrated these features in action within interactive apps.
  8. Future Vision: OpenAI’s goal is to move towards AI Agents that will be capable of more complex actions. The organization aims for iterative deployment and continuous updates based on feedback. Sam closed with gratitude towards the OpenAI team and a promise of more advanced developments for the next year.

The keynote was focused on demonstrating significant advancements in AI capabilities, with an emphasis on how these tools empower individuals and developers to create a wide array of applications and services that can potentially transform daily life and industries.

Cool, “Dolly” and “Assistance API”. Guess I should have prompted Whisper after all. Well, nothing that can’t be fixed with a simple string.replace.

summary_clean = summary\
                    .replace('Assistance API', 'Assistants API')\
                    .replace('Dolly', 'DALL·E')

with open(path_to_keynote_summary, "w") as file:
    file.write(summary_clean)
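
For next time, though, the proper fix is upstream: Whisper accepts a prompt parameter on the same transcription call, and feeding it the names it kept mishearing should steer the spelling. A sketch I haven’t actually run on this recording:

transcript = client.audio.transcriptions.create(
  model=whisper_model,
  file=open(path_to_keynote_audio, "rb"),
  response_format="text",
  # spell out the proper nouns so they come back as DALL·E and Assistants API, not Dolly and Assistance
  prompt="OpenAI DevDay keynote by Sam Altman, covering GPT-4 Turbo, DALL·E 3, Whisper, GPTs and the Assistants API."
)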

Text-to-speech

Man, I wish this had a better name. Like “shout”1 or something similar. Shout is a good name, I like it – I’ll use it from now on.

Alright, so Shout has two modes, the cheap one and the good one. I’ve used the good one, because you know, I’m that kind of guy.

And, to tell you the truth, I was quite impressed with how good it sounds. While a bit monotonous, it’s not particularly robotic, and the voices offer a good mix of personalities.

response = client.audio.speech.create(
  model=tts_model,
  voice="shimmer",
  input=summary_clean
)
response.stream_to_file(f'./summary-{tts_model}.mp3')

All in 33 seconds.
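
One caveat I happened to dodge: the input for a single speech request is capped at roughly 4,096 characters (worth double-checking against the current docs), and my summary fit. For anything longer, I’d split the text and stitch the audio back together with pydub – a sketch, not what I ran:

from pydub import AudioSegment

# naive split on paragraphs, keeping each piece under the ~4,096 character cap
pieces, current = [], ""
for paragraph in summary_clean.split("\n"):
    if len(current) + len(paragraph) > 4000:
        pieces.append(current)
        current = ""
    current += paragraph + "\n"
pieces.append(current)

combined = AudioSegment.empty()
for i, piece in enumerate(pieces):
    part = client.audio.speech.create(model=tts_model, voice="shimmer", input=piece)
    part.stream_to_file(f'./summary-part-{i}.mp3')
    combined += AudioSegment.from_mp3(f'./summary-part-{i}.mp3')
combined.export(f'./summary-{tts_model}.mp3', format="mp3")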

GPT-4 Turbo – Prompting DALL·E

I wasn’t going to prompt DALL·E “manually”, that’s what we have AIs for.

response = client.chat.completions.create(
  model=chat_model,
  messages=[
    {"role": "system", "content": "You are a visualization expert who excels at identifying and describing the best images to illustrate key points of speeches. You generate excellent and diverse prompts that can be used to have DALL-E generate those images."},
    {"role": "user", "content": f"""Can you identify and describe the four best images that illustrate each point in the speech summary below?
     Output them as a python dictionary, with keys containing the point number and values containing arrays with the four DALL-E prompts.
     ```{summary}```"""}
  ]
)

Notice I’m sending it summary instead of summary_clean 🤦‍♂️. It was late and I was tired. I really should have prompted Whisper 😮‍💨.

Here is the requested Python dictionary with keys representing the point numbers from the speech and values containing arrays with four DALL-E prompts for each key point. Each prompt is designed to help visualize the mentioned updates, user stories, announcements, partnerships, and visions.

dalle_prompts = {
    1: [
        "A collage of logos for ChatGPT, GPT-4, Dolly 3 and enterprise versions with a chart showing growth to 2 million developers",
        "A dynamic graph skyrocketing upwards symbolizing ChatGPT's user growth to 100 million weekly active users",
        "A global map highlighting word-of-mouth spread of ChatGPT's popularity with speech bubbles in various languages",
        "3D models of ChatGPT, GPT-4, and Dolly 3 with voice and vision icons and a celebratory ribbon for company milestones"
    ],
    2: [
        "A visually impaired person using voice commands to interact with ChatGPT for assistance",
        "A student and teacher using ChatGPT on a tablet for enhanced learning experiences in a classroom setting",
        "A person working on a laptop surrounded by floating texts in multiple languages symbolizing translation help",
        "An author with a futuristic AI assistant typing out a book, depicting writing assistance"
    ],
    3: [
        "A futuristic AI brain expanding with a label '128,000 tokens extended context length for GPT-4 Turbo'",
        "A user selecting advanced options in a stylish JSON interface for controlling model responses",
        "An AI robot holding a newspaper dated April 2023, symbolizing the up-to-date knowledge of GPT-4 Turbo",
        "Icons representing Dolly 3 image model, text-to-speech, whisper V3 around an enhanced version of GPT-4 Turbo with customization gears"
    ],
    4: [
        "A price tag with 'GPT-4 Turbo' showing a slashed price compared to 'GPT-4', implying significant cost reduction",
        "A graph comparing GPT-4 and GPT-4 Turbo costs with the latter having a downward trend in prices",
        "A shopping cart with GPT-4 Turbo and a smaller dollar sign compared to a cart with GPT-4, highlighting affordability",
        "GPT-4 and GPT-4 Turbo depicted as fuel pumps, with GPT-4 Turbo pumping out more tokens for less money"
    ],
    5: [
        "Satya Nadella and Sam Altman shaking hands over a network of connected AI brains, symbolizing partnership",
        "Infographic of Microsoft and OpenAI logos with a two-way arrow emphasizing infrastructure development for AI",
        "A globe encircled by beams of light connecting OpenAI and Microsoft logos, representing worldwide AI benefits",
        "A shield with an 'AI safety' badge enveloping OpenAI and Microsoft logos together on a cloud computing backdrop"
    ],
    6: [
        "A marketplace bustling with various GPTs in virtual booths, illustrating the GPT store environment",
        "Customized ChatGPTs in a digital library setting, where each book cover represents a specific use case",
        "A 'GPT store' app icon on a smartphone with popular GPTs being downloaded and dollar signs indicating revenue",
        "Various GPTs as specialized robots each performing a unique task, showcasing their tailored capabilities"
    ],
    7: [
        "An AI assistant represented by a hologram overseeing threads, information retrieval, and Python code in a sandbox environment",
        "Developers surrounding an interactive holographic screen showing the Assistance API in action with real-time assistance",
        "A digital workflow diagram depicting the Assistance API's features being used to build an assistive chatbot",
        "A simplified depiction of a person easily dragging and dropping API components into an app structure"
    ],
    8: [
        "A futuristic AI agent as a humanoid robot interacting with various devices and performing complex tasks",
        "A circular diagram showing iterative deployment process with continuous updates and feedback loops",
        "Sam Altman standing in front of a digital AI development timeline, expressing thanks and leading towards a high-tech future",
        "A neon signboard of 'The Future of AI' with OpenAI logo at the center and next-gen robots around it"
    ]
}

print(dalle_prompts)

These prompts are intended to help create images that are both illustrative and evocative of the different points made during Sam Altman’s speech. They focus on the forward-thinking nature of OpenAI’s work, the practical applications of their AI models, and the vision for the future of technology as it intertwines with human experiences.

Not bad, not bad at all. The ideas are reasonable enough, but I’m not entirely convinced these are the best prompts to get DALL·E to illustrate said ideas.
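
Also, getting the dict out of a free-form reply is a bit brittle. A tidier route – fitting, since JSON mode was one of the keynote announcements – would have been asking for JSON and parsing it directly. A sketch of what that could look like (and yes, it should get summary_clean this time):

response = client.chat.completions.create(
  model=chat_model,
  response_format={"type": "json_object"},  # JSON mode: the reply is guaranteed to be valid JSON
  messages=[
    {"role": "system", "content": "You are a visualization expert. Reply with a JSON object mapping each point number to an array of four DALL-E prompts."},
    {"role": "user", "content": f"""Identify and describe the four best images that illustrate each point in the speech summary below.
     ```{summary_clean}```"""}
  ]
)
# note: the JSON object keys come back as strings, not ints
dalle_prompts = json.loads(response.choices[0].message.content)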

Which reminds me:

DALL·E 3 (HD)

I wanted to see the best DALL·E can offer, so I went ahead and coughed up $0.080 / image for HD quality.

def render_with_dalle(prompt, path):
    try:
        response = client.images.generate(
            model=dalle_model,
            prompt=prompt,
            size="1024x1024",
            quality="hd",
            n=1,
        )
    except Exception as e:
        print(e)
        return

    image_url = response.data[0].url
    r = requests.get(image_url)
    with open(path, 'wb') as f:
        f.write(r.content)

for index, prompts in dalle_prompts.items():
    print(index)
    if index == 1:
        continue
    for prompt_index, prompt in enumerate(prompts):
        print(prompt)
        render_with_dalle(prompt, f'./image-{index}-{prompt_index}.png')

Oh, and in case you were wondering about that try/except block when the rest of the code gleefully ignored any error responses, well, apparently prompting A visually impaired person using voice commands to interact with ChatGPT for assistance will sometimes result in a BadRequestError: Error code: 400 - {'error': {'code': 'content_policy_violation', 'message': 'Your request was rejected as a result of our safety system. Image descriptions generated from your prompt may contain text that is not allowed by our safety system. If you believe this was done in error, your request may succeed if retried, or by adjusting your prompt.', 'param': None, 'type': 'invalid_request_error'}}.

Guess it must have been a really bad image, especially knowing that images such as this or this don’t trip up the safety system at all.
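
Since the error message itself suggests retrying, the pragmatic fix is probably to wrap the call in a couple of blind retries – a quick sketch, not what I actually ran:

def render_with_retries(prompt, path, attempts=3):
    # retry the occasional content_policy_violation refusal a few times before giving up
    for attempt in range(attempts):
        try:
            response = client.images.generate(
                model=dalle_model,
                prompt=prompt,
                size="1024x1024",
                quality="hd",
                n=1,
            )
        except Exception as e:
            print(f"attempt {attempt + 1} failed: {e}")
            continue
        with open(path, 'wb') as f:
            f.write(requests.get(response.data[0].url).content)
        return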

One thing that tripped me up was that I couldn’t use any seed numbers for my images to ensure repeatability. The API doesn’t allow setting seeds in the requests, nor does it return any seeds in the responses. Coming over from Stable Diffusion, that’s a pretty big minus, which, coupled with its significant price ($2 for 25 images 🥶), makes it unlikely for me to use it for anything but toy projects. Or for free with Bing Image Creator.

Conclusion

This was fun and exciting. We are reaching that point where anyone will be able to create anything, and the only differentiator will be good taste. And the ability to pay for best-in-class models, of course.

Anyways, I quite liked playing with the API and seeing what works and what doesn’t. Especially on launch week when everybody hammers the servers 🚀!

The only thing I didn’t like was fiddling with Camtasia, trying to align the audio with the images, and figuring out whether I should use any transitions (I didn’t), how many images to use (almost all of them), how to order them, how long they should stay onscreen, the exact chapter animations to use, etcetera.

I think I could have saved myself some of this effort by using Whisper to transcribe the Shout-generated audio and output it as srt subtitles (yes, Whisper can do that), then using the time information in there to programmatically align the images. I just thought of this 😕. I guess writing does help one think clearer.
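
A rough sketch of what that would look like – srt is just another response_format on the same transcription call:

import re

srt = client.audio.transcriptions.create(
  model=whisper_model,
  file=open(f'./summary-{tts_model}.mp3', "rb"),
  response_format="srt"  # numbered blocks with "00:00:05,240 --> 00:00:09,120" style timestamps
)

# the start timestamps are what would drive when each image appears
timestamps = re.findall(r'(\d{2}:\d{2}:\d{2},\d{3}) -->', srt)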

Maybe next time.

Here’s the video again, for your viewing pleasure. Let me know what you think.


  1. I know, I know, I have an amazing sense of humor ↩︎