How AI Programs Can Handle More Than Just Text
What is multimodal AI?
Multimodal AI is a type of artificial intelligence that can understand and generate different types of data, such as text, images, audio, and video. For example, a multimodal AI chatbot can talk to you using voice, show you pictures, and write captions for them.
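To make this concrete, one way to picture a multimodal request is as a single message that bundles text and an image together. The snippet below is a purely illustrative Python sketch; the dictionary keys and structure are assumptions made for illustration, not any particular vendor's API.

```python
# Hypothetical illustration only: the field names below are assumptions,
# not a real API. The point is that one request can mix several data types.
multimodal_request = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this photo? Write a short caption."},
                {"type": "image", "url": "https://example.com/bookshelf.jpg"},
            ],
        }
    ]
}

print(multimodal_request["messages"][0]["content"][0]["text"])
```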
Why is multimodal AI important?
Multimodal AI is important because it can handle a wider range of tasks than text-only AI, and it can make interactions feel more natural and humanlike. For example, a multimodal AI chatbot can help you with tasks such as:
- Splitting a bill from a photo of a receipt
- Describing what the owner of a bookshelf might be like from a photo of its contents
- Giving directions to a landmark from a photo
- Identifying insects from photos
Multimodal AI can also help people with disabilities, such as those who are blind or have low vision, by describing the scenes around them.
How is multimodal AI made?
Multimodal AI is made by combining AI models that each handle a different type of data. There are two main ways to do this:
- Stacking: This is when one AI model translates one type of data into another, and then feeds it to another AI model. For example, an image captioning model can turn a photo into a text description, and then give it to a text-based chatbot.
- Grafting: This is when parts of different AI models are merged into a single model. For example, a text-based chatbot can have components of an image recognition model grafted onto it.
Both methods require training the multimodal AI model on data sets that pair different types of data, such as images with captions or audio with text; the sketches below illustrate both approaches.
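As a rough illustration of stacking, the sketch below chains two hypothetical single-modality models: `caption_model` (image to text) and `chat_model` (text to text). Both names are placeholders for whatever captioning and chatbot models are available, not real libraries.

```python
# A minimal sketch of stacking, assuming two hypothetical models:
# `caption_model` turns an image into text, and `chat_model` answers
# questions about text. Neither name refers to a real library.

def answer_about_image(image_path: str, question: str, caption_model, chat_model) -> str:
    """Answer a question about an image by chaining two single-modality models."""
    # Step 1: the vision model translates the image into a text description.
    caption = caption_model.caption(image_path)  # e.g. "A receipt totaling $42.80"
    # Step 2: the text-only chatbot reasons over that description.
    prompt = f"Here is a description of an image: {caption}\n\nQuestion: {question}"
    return chat_model.generate(prompt)
```

Grafting, by contrast, merges the models' internals. The toy PyTorch snippet below shows one common pattern: project image-encoder features into the language model's embedding space with a small linear layer and feed them in alongside the text tokens. The dimensions and tensors are dummy values for illustration only.

```python
import torch
import torch.nn as nn

# Toy dimensions; real models are far larger.
vision_dim, text_dim = 512, 768
num_patches, num_text_tokens = 16, 8

# Stand-in for the output of a frozen image encoder: one vector per image patch.
patch_features = torch.randn(1, num_patches, vision_dim)

# The "graft": a small projection that maps image features into the
# language model's embedding space so they can sit alongside text tokens.
projector = nn.Linear(vision_dim, text_dim)
image_tokens = projector(patch_features)

# Stand-in for the chatbot's own text token embeddings.
text_tokens = torch.randn(1, num_text_tokens, text_dim)

# The fused sequence is what the language model would actually process.
fused_input = torch.cat([image_tokens, text_tokens], dim=1)
print(fused_input.shape)  # torch.Size([1, 24, 768])
```

In either case, the projector or the combined pipeline still has to be trained on paired data such as images with captions, which is why the training sets mentioned above matter for both approaches.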
What are some examples of multimodal AI?
Some examples of multimodal AI are:
- ChatGPT: This is a chatbot made by OpenAI that uses GPT-4V, a large language model that can also handle images. It can talk to you using voice and text, and show you images and captions.
- Bard: This is a chatbot made by Google that uses PaLM 2, another large language model that can also handle images and audio. It can also talk to you using voice and text, and show you images and captions.
- Be My Eyes: This is an app that helps people who are blind or have low vision by connecting them to volunteers or AI agents that describe the scenes around them using voice. It uses OpenAI’s multimodal version of GPT-4.
What are some challenges and limitations of multimodal AI?
Multimodal AI is still imperfect and comes with several challenges and limitations, such as:
- Hallucination: This is when the multimodal AI makes up information that is not in the data, or misinterprets what is there. For example, it might claim to see something in an image that is not present, or state something about a text that is not true.
- Privacy: This is when the multimodal AI collects or exposes sensitive information from the data. For example, it might reveal personal details from a photo or a voice recording.
- Accuracy: This is when the multimodal AI makes errors in understanding or generating data. For example, it might misread numbers on a receipt or garble words in a caption.
These challenges and limitations require careful testing and improvement of multimodal AI models and their data sets. They also require users to be aware of, and cautious about, what they share with and receive from multimodal AI agents.