Catching Out-of-Context Misinformation with Self-supervised Learning
Combating Misinformation: Detecting Out-of-Context Images on Social Media
Hello, everyone. I am Shivangi, a PhD student at TUM, and today I want to bring to light a rampant problem we encounter on social media daily: misinformation. Specifically, I will be discussing one particular type of misinformation, out-of-context images, which I've been researching along with my advisors, Chris and Matthias.
This blog post aims to be as accessible as possible, regardless of your technical background. For a more in-depth exploration of the project, please refer to our project page linked at the end.
Why Should We Care About Detecting Out-of-Context Images?
In our digital age, many of us consume most of our news and information via various social media platforms. But how do we discern which articles are true and which are false?
Fact-checking organizations serve as the answer for many companies. Yet employing third-party fact-checkers costs companies billions of dollars annually. With our project, we aim to automate this fact-checking process, increasing accuracy and reducing manual labour.
Defining the Problem
Misuses of images on social media generally fall into two categories:
- Image Tampering: Altering or manipulating parts of an image to misrepresent its message. Multiple methods already exist to detect such manipulations.
- Genuine Images with Irrelevant Claims: Our project focuses on this category, where an original, untampered image is shared with unrelated news or information, essentially an out-of-context use of the image.
A quintessential example is an image from the 2016 anti-Trump protests that circulated with a false caption claiming it showed people in a migrant caravan burning the American flag. Our task is to detect such misuse of images.
The Challenge of Limited Data
While the amount of data on social media is overwhelming, the subset of images used to spread misinformation is minuscule, making it difficult to train a supervised model for our problem on such limited data.
Therein lies the question central to our project: can we leverage the roughly 300 million non-misleading images shared on Facebook every day to build a model that identifies out-of-context images?
An Unsupervised Strategy for Model Training
Our project constructs an unsupervised, or self-supervised, training scheme. We gather images and captions, treat the caption an image was actually shared with as a positive association, and treat captions from other images as negative associations. Learning these associations between captions and the objects they describe is what ultimately helps us detect out-of-context images; a minimal sketch of the pairing step follows.
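As a concrete illustration, here is a minimal sketch of how such training triplets could be assembled from scraped (image, caption) pairs. The function and variable names are our own illustrative choices, not code from the project:

```python
import random

def make_training_triplets(samples):
    """Build self-supervised triplets from (image, caption) pairs.

    The caption an image was actually posted with serves as the positive
    ("matching") caption; a caption taken from a different image serves as
    the negative ("non-matching") one. No manual labels are required.
    """
    triplets = []
    for i, (image, caption) in enumerate(samples):
        # Pick a caption from any *other* image as the negative example.
        j = random.choice([k for k in range(len(samples)) if k != i])
        _, random_caption = samples[j]
        triplets.append((image, caption, random_caption))
    return triplets
```

Because every scraped post already pairs an image with its own caption, this pairing step needs no human annotation, which is what lets us sidestep the limited-data problem described above.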
Detecting Out-of-Context Images: Test Time
In the testing phase, we identify the most relevant object in the image for each caption and compute the semantic similarity between the two captions.
The principle of our test strategy is simple: if the two captions are semantically different but ground to the same object in the image, the image is likely being used out of context; otherwise, we consider it in-context use. This approach yields approximately 85% accuracy in detecting out-of-context images, outperforming the global-context models in our comparisons. A hedged sketch of this decision rule follows.
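In code, the decision rule might look like the sketch below, assuming boxes in (x1, y1, x2, y2) format. The helpers `most_relevant_box` and `sentence_similarity` stand in for our trained grounding model and an off-the-shelf sentence-similarity model, and the thresholds are illustrative placeholders rather than tuned values:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def is_out_of_context(image, caption1, caption2,
                      most_relevant_box, sentence_similarity,
                      iou_thresh=0.5, sim_thresh=0.5):
    box1 = most_relevant_box(image, caption1)  # object caption 1 grounds to
    box2 = most_relevant_box(image, caption2)  # object caption 2 grounds to
    same_object = iou(box1, box2) > iou_thresh
    different_meaning = sentence_similarity(caption1, caption2) < sim_thresh
    # Same grounded object + semantically different captions -> likely
    # out-of-context use of the image.
    return same_object and different_meaning
```

The key design choice is that the verdict hinges on object-level grounding rather than whole-image similarity, which is what distinguishes our approach from global-context baselines.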
Get Involved in the Fight Against Misinformation
If you are interested in combating digital misinformation, you might want to consider participating in the ACM Multimedia Systems Grand Challenge on Detecting Cheapfakes. The organizers are awarding a significant prize to the challenge's winner; you can find more information and register on their website.
Feel free to reach out if you have any questions regarding this issue. Thank you for your interest and for actively participating in the fight against misinformation on social media.
Please refer to our project page for a more detailed exploration of the topic [insert project link].
Video Transcription
Hello, everyone. I'm Shivangi, a PhD student at TUM, and this talk is about one of the problems we face on social media every day: misinformation. More specifically, I'll be talking about one special type of misinformation, out-of-context images. I'll discuss what out-of-context images are and how they can be detected. This is joint work with my advisors, Chris and Matthias. I come from an academic background, but I've tried to make this talk easy to follow for people from a non-technical background, so even if something is unclear, feel free to ask me questions during or after the talk. You can also find more information on the project page; everything about the project is there. So, why do we want to detect out-of-context images? We're all living in a digital era, and we consume most of our information and news via social media platforms like Facebook, Twitter, and so on.
But we don't know which articles are genuine and which are false. Currently, to identify whether an article is true or false, big companies hand the task to third parties, namely fact-checking organizations, which review each article and decide whether it is true or false.
Companies are spending billions and billions of dollars this way on identifying fake media. With this project, we want to automate the process, so that fact-checkers only need to verify what the model has already made a decision about. When we talk about misuse of images on social media, there are basically two categories. The first is image tampering, where people alter parts or regions of an image to convey a different message or to misrepresent a person. An example is the image on the left, where the original photo showed a student holding a banner that says "Black Lives Matter", and it was tampered to show the student holding a banner that says "Lincoln was racist". There are a lot of methods, tools, and techniques that detect this kind of image tampering, so it is not our focus. What we are interested in is the second image, on the right: a genuine image with an irrelevant claim.
The idea behind these images is that people do not tamper with the image at all. Instead, they take the image and share it with news that it is not associated with; the news could be true or false, but it is not linked to the image. This is what we call out-of-context use of images. An example is the image on the right, where the actual photo is from the 2016 anti-Trump riots, and it was shared with a caption saying that people in a migrant caravan are burning the American flag. This is the focus of this talk: how can we detect this type of misuse of images? So this is how we define out-of-context images: the images are legitimate, with no tampering, but they are shared in a false or misleading context. Here is another example, an image of Obama from 2014, when he was in Maryland to learn about an Ebola vaccine.
After the outbreak of COVID, this image was shared with a caption claiming it shows Obama visiting the Wuhan lab in 2015 and discussing that project. So this is the kind of misuse we want to detect: the image is legitimate, nobody has done anything to it, but it is shared with false or misleading information. There are some more examples. The image on the left is actually of Japan after the earthquake and tsunami, but it was shared with the claim that it shows the garbage patch in the Pacific Ocean. The image on the right is actually of a triceratops puppet in a dinosaur theme park in Indonesia, but it was shared with the claim that it shows a real triceratops in Indonesia. That's the thing about out-of-context images: people just take old images, recycle them, and share them with false claims, and that is what we want to detect with this project.
So now we know the problem, and we could go about building supervised models: take images and captions, feed them to a model, and ask the model to predict out-of-context use. But if you look at other methods that use supervised techniques, they rely on lots and lots of data. For instance, image classification is benchmarked on ImageNet, which has around 14 million images, and for object detection and segmentation people use MS COCO, which is also a very large dataset of around 300,000 images. If you compare the images shared on social media every day with how many of them are used to spread misinformation, the latter is a tiny subset; only a handful of images are used to spread misinformation, and currently, as I said, these are regulated by fact-checking websites like Snopes.
So we have very limited training data for this problem: for detecting out-of-context use of images, we simply don't have enough data to build a supervised model. To give you an estimate, around 300 million photos are shared on Facebook every day, while the fact-checking website Snopes had, as of the date I collected this data, only around 13,000 fact-checked images in total. So you can see the enormous gap between the number of images that are shared and the number of images known to spread misinformation.
So the idea behind this project was: why can't we use those 300 million images, which are not misleading, to somehow build a model that ultimately helps us detect out-of-context use of images? To be precise, we don't actually know whether these images are misleading or not; we only know that each image comes with several claims or news articles it was shared with, and we want to leverage that. So the idea is to build an unsupervised or self-supervised training scheme in which we collect a bunch of images along with the different captions or pieces of information those images were shared with. For instance, at the top you see an image of a train accident that was shared with two captions: the first, "rescue workers at the site of a train accident at the Great Belt Bridge in Denmark", talks about the rescue workers in general, and the second, "the passenger train was struck by a cargo container in Denmark early Wednesday", talks about the train. The second image is an image of a scooter.
This image was shared with two claims: "electric scooter programs have become popular in crowded cities like San Francisco", which talks about the scooter itself, and "a scooter on the sidewalk in downtown San Francisco", which also talks about the scooter. What we do is associate the different captions in our dataset with the images: the top two captions, about the rescue workers and the passenger train, are positively associated with the image on the left, the train accident, and the bottom two captions, about the scooter, are positively associated with the image of the scooter. But the reverse is not true: the scooter captions should have a negative relation to the image of the train accident, and the captions about the train accident and the workers should have a negative relation to the image of the scooter. This is the key idea. Why do we want to learn this association? As I'll explain when we get to test time, it eventually helps us detect out-of-context use of images. The training setup is relatively simple.
We take an image and two captions, which we call the matching and the non-matching caption. The matching caption is the caption that actually appeared with the image in some social media post or news article, and the non-matching caption is simply a random caption that is not associated with the image.
We then use a pre-trained object detector network to find the objects in the image, and a CNN feature extractor to compute features for each detected object. In parallel, a text encoder encodes the captions. In the end we obtain a 300-dimensional vector for every object and for each caption, and we combine each caption's features with all of the object features via dot products. We then apply a max operator over the objects to get a single score; these scores tell us which object is most relevant for a given caption. We train with a loss such that the score for the matching caption is higher than the score for the random caption, which teaches the model a grounding of the objects in the image. A minimal sketch of this scoring step follows.
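Here is a minimal PyTorch sketch of that scoring head and loss, assuming the object and caption features have already been extracted by the pre-trained detector and text encoder; the layer names and input dimensions are illustrative choices, not the project's actual ones:

```python
import torch
import torch.nn as nn

class GroundingScorer(nn.Module):
    """Scores how strongly a caption matches the objects in an image."""

    def __init__(self, obj_dim=2048, txt_dim=768, emb_dim=300):
        super().__init__()
        self.proj_obj = nn.Linear(obj_dim, emb_dim)  # object features -> 300-d
        self.proj_txt = nn.Linear(txt_dim, emb_dim)  # caption features -> 300-d

    def forward(self, obj_feats, cap_feat):
        # obj_feats: (num_objects, obj_dim); cap_feat: (txt_dim,)
        objs = self.proj_obj(obj_feats)              # (num_objects, 300)
        cap = self.proj_txt(cap_feat)                # (300,)
        scores = objs @ cap                          # dot product per object
        return scores.max()                          # most relevant object wins

def margin_loss(model, obj_feats, match_feat, random_feat, margin=1.0):
    """Push the matching caption's score above the random caption's score."""
    s_match = model(obj_feats, match_feat)
    s_random = model(obj_feats, random_feat)
    return torch.clamp(margin - (s_match - s_random), min=0.0)
```

The max over objects is what lets the model discover, without any labels, which object each caption refers to: only the best-matching object has to beat the random caption's score.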
Now let me show you some of the visual grounding scores the model has learned. In the image on the left you see a train accident: the top caption was about rescue workers, and the model gave a positive score to the worker; the bottom caption was about the passenger train, and it gave a positive score to the train. Both captions are legitimate, not misinformation, yet they identify different objects in the image. Similarly, for the image on the right, the top caption, "a large mural depicting a person wearing a mask", talks about the mural, the painting in the background, and received a positive score there, while the bottom caption, "a woman walks past a mural reminding residents of anti-coronavirus hygiene", talks about the woman and received a positive score on her. Again, different objects, but not misinformation. Next are examples where images actually are used as misinformation. In the image in the center, the photograph in reality shows soldiers wearing exo suits in a movie, while the bottom caption claims "a photograph shows Russian soldiers wearing exo suits in Syria". Both captions talk about soldiers wearing exo suits, and both correspond to the same person.
On the right, you see a person standing in a street; in reality, this is an image from Tokyo, but it was falsely shared with a claim that it shows a supreme army general of Antifa. Both captions identify the same person. Once we have identified which objects the captions are talking about, we can define our testing strategy: we take the image and both captions, pass them through the model, and detect out-of-context use. Concretely, we detect the objects, compute features for the objects as well as for both captions, and identify the most relevant object for each caption; the red box marks the most relevant object for caption one, and the other box marks the most relevant object for caption two. Additionally, we use a pre-trained sentence model that computes the sentence similarity between the two captions.
If the two sentences are semantically different but identify the same object in the image, that is a strong signal of out-of-context use; otherwise, we say the image is not out of context. That is how we detect out-of-context images. We were able to get pretty decent results: this chart shows a comparison with a bunch of baseline methods, and our method reaches around 85% detection accuracy, compared to 77% for a baseline that uses sentence similarity alone. What makes our method stand out is that we operate at the object level, identifying which object each caption is talking about, whereas the other methods treat the global context and look at the entire image as a whole.
I think that's it from my side, but I have some news for people who are interested in this kind of problem. ACM Multimedia Systems is organizing a challenge this year based on this problem and this dataset, the Grand Challenge on Detecting Cheapfakes, and they will be awarding around 7,500 USD. If you're interested, this is the link to the website: just go and register, request the dataset, and feel free to participate in the competition. That would be it from my side. I think we still have two minutes left, so if anyone has any questions, feel free to ask; I know I rushed through because of the time limitation, but if you have questions about the topic, please reach out. With that, I'll take my leave, and thank you all for attending.