Creative Bloq
Technology
Joseph Foley

This new AI image reader is frighteningly clever

Screenshots from the Blip-2 AI image reader

AI is making a big impact in all kinds of sectors at the moment. In the creative fields, most of the attention (and controversy) has revolved around AI text-to-image generators like DALL-E 2. But there are also tools being developed that work the other way around.

Blip-2 is an AI model, developed by Salesforce Research and hosted on the machine learning platform Replicate, that can caption images and answer questions about them... sometimes. Just don't take its answers as gospel.

Blip-2 can caption images and answer questions about them... sometimes (Image credit: Joseph Foley / Replicate)

Replicate demonstrates the effectiveness of Blip-2 with some solid examples. Presented with an image of the Golden Gate Bridge and asked what body of water it crosses, the model correctly responds 'San Francisco Bay'. It's also able to tell us that pandas come from China and that Marina Bay Sands is located in Singapore.

All you do is upload an image and click submit if you want a caption, or add a question if you're seeking specific information. The model then runs its predictions on Nvidia A100 GPU hardware, and you can feed earlier answers back in as context for follow-up questions. It sounds clever, and it has several uses – automatic captioning, or sorting and classifying images for archiving, for example. But when it comes to trying to find out something we might not already know, its predictions can be very unreliable.
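If you'd rather script that workflow than click through the web demo, models on Replicate can also be called from code. Below is a minimal sketch using Replicate's Python client; the model slug and input names ('caption', 'question') are assumptions based on the workflow described above, so check the model's page on Replicate for the exact identifier and input schema.

```python
# Minimal sketch using Replicate's Python client (pip install replicate).
# Assumes the REPLICATE_API_TOKEN environment variable is set. The slug and
# input names below are assumptions; consult the Blip-2 model page for the
# real identifier and schema, and pin a specific version in practice.
import replicate

MODEL = "salesforce/blip-2"  # hypothetical slug

# Captioning: upload an image and ask for a description.
caption = replicate.run(
    MODEL,
    input={"image": open("hummingbird.jpg", "rb"), "caption": True},
)
print(caption)  # e.g. "a hummingbird is flying near some flowers"

# Visual question answering: pass a question instead of requesting a caption.
# Earlier answers can be fed back in (e.g. via a "context" input, if the
# model exposes one) to ask follow-up questions.
answer = replicate.run(
    MODEL,
    input={
        "image": open("hummingbird.jpg", "rb"),
        "question": "What species of hummingbird is this?",
    },
)
print(answer)  # treat the answer with due scepticism
```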

No rufous tail here... (Image credit: Joseph Foley / Replicate)
Nor here (Image credit: Joseph Foley / Replicate)

I tested it out by uploading some of my own photos of various subjects. First up, a hummingbird. It gave it the caption "a hummingbird is flying near some flowers". That could maybe save me some time if I'm processing a ton of images, but it's not massively informative. I'd like to know what species of hummingbird it is. I ask the question, and it tells me it's a rufous-tailed hummingbird. Only it isn't; it's a glittering emerald. I try with another species of bird, and it insists that this one is also a rufous-tailed hummingbird.

OK, so maybe it only got trained on one species of hummingbird. Let's try a mammal. Nobody needs AI to tell them what a panda or an elephant is, so I want to go for something that at least offers a bit of a challenge. A Patagonian mara, say. On the first try, this sends the model into complete fantasy land. It identifies the sleepy rodent as a 'saber-toothed tapir', a species that it seems to have completely made up, since there is no reference online to such an animal ever having existed, in reality or in fiction.

Wait, wait; I know the answer to this one! (Image credit: Joseph Foley / Replicate)

I crop the image closer and try again. This time Blip-2 gets excited. It's so convinced it has the answer that it spits out the word 'kangaroo' several times over. I'll take that to be a 'blip'. OK, well, there are plenty of specialist apps like iNaturalist for identifying animals. Let's try something else.

It doesn't do hugely well on buildings, other than things with an Eiffel Tower level of fame. It identified the Kavanagh Building, a much-photographed 1930s landmark skyscraper in Buenos Aires, as a nondescript hotel in 'So Paulo' (presumably São Paulo) in Brazil. I was, however, impressed that Blip-2 identified a mountain landscape in southwestern Argentina as being in Chile. I mean, that's just over the border, and the scenery is comparable. But then 'close-ish' isn't really good enough to be useful when it comes to captioning an image.

Hmm, it's on the right continent (Image credit: Joseph Foley / Replicate)
Bingo! (Image credit: Joseph Foley / Replicate)

Admittedly, it does better on some images. When asked what dance a couple were dancing, it correctly responded 'tango'. It's also able to identify the logos of major companies, such as TikTok. Captions are also generally accurate, if extremely vague. I was disappointed that, when fed DALL-E 2's famous astronaut riding a horse, Blip-2 only came up with 'a white horse with a man on it' (although when asked what the man was wearing, it did recognise that he's in a space suit).

Captions are generally accurate but vague (Image credit: Joseph Foley / Replicate)
It was unable to tell what this man was making with his machine (Image credit: Joseph Foley / Replicate)
It got confused by this composite image (Image credit: Joseph Foley / Replicate)
The truth but not the whole truth (Image credit: Joseph Foley / Replicate)

As a final test, I subjected the model to an image of our very own Ian Dean playing Switchback on the PSVR 2. Here it recognised the VR headset but thought there were two of them. Asked what Ian was playing, it said a VR horror game on PS4 – so again, kind of on the right track, but not an answer that can be relied on.

Overall, it seems Blip-2 hasn't been trained on enough material to be able to correctly tell us anything that most people don't already know. Yes, pandas live in China and the Golden Gate Bridge is in San Francisco. Anything a little more obscure, and it starts getting creative.

This seems to be one of the problems with AI tools like Blip-2 and the text generator ChatGPT. Rather like some humans, they don't admit it when they don't know something. Instead, they just make things up. That might be fine if the people using them are going to run checks on the output, but if not, it could soon become even more difficult to trust online articles, images and now captions.
