Adriana Romero Soriano - Seeing the unseen: inferring unobserved information from multi-modal data

Automatic Summary

Unlocking the Power of AI in Inferring the Unseen

Have you ever wondered how we are able to build models of the world around us, even when we are unable to fully observe it? The secret lies in leveraging artificial intelligence (AI) to infer unseen information by reducing uncertainty and filling in missing data. In this blog post, we dive deep into this fascinating concept, guided by the research of Adriana Romero Soriano, a research scientist at Facebook AI Research and adjunct professor at McGill University.

An Introduction to Inferring the Unseen from Multimodal Data

We often confront situations where we only have partial observations of an event or an object. When we observe only part of a handwritten digit, for instance, we can guess at the full figure: it might be a seven, a one, or a four. Such partial observations inherently generate uncertainty when trying to recover the full observation. As Adriana Romero Soriano explains, to improve the models we build of the world, we reduce the number of possible ways to fill in the missing data by acquiring new observations. These fresh observations can complement existing information, resulting in more accurate models.

Examples of Inferring the Unseen

Let's look at some intriguing examples of inferring the unseen:

  • Recovering the full 3D shape of an object from a single view image.
  • Generating a high-resolution version of a low-resolution MRI image.

Improving 3D Shape Reconstruction and MR Image Quality

In order to limit ambiguity and recover better 3D object shapes, Romero Soriano and her team explored leveraging the complementarity of the visual and tactile data modalities. Tactile data offer detailed insight into the local structure of a touched object, as well as its positional information.

The team created a visual-tactile dataset of object-grasp interactions using objects from publicly available computer vision datasets. By fusing visual and tactile signals, they were able to predict and reconstruct the full 3D shape of an object with higher accuracy.

Moving to Magnetic Resonance Imaging (MRI) reconstruction, they innovated by applying an "active acquisition" approach to increase image quality. This involves selectively acquiring additional measurements, personalized to each patient, to ensure a high-fidelity image reconstruction.

The Future of AI in Inferring Unobserved Information

In conclusion, Romero Soriano's ground-breaking AI research aims to mitigate the uncertainty inherent in inferring the unseen, whether a 3D object or an MRI image. Her team achieved this by leveraging complementary data modalities and by equipping models with active acquisition capabilities.

As AI continues to grow, the ability to infer unobserved information promises to make our lives not just easier, but also more precise and accurate. It's a thrilling prospect and we're eager to see what the future holds.


Video Transcription

All right, so hi everyone. Welcome to this session of Women Tech; I hope you're enjoying the conference. My name is Adriana Romero Soriano and I'm a research scientist at Facebook AI Research. I also hold an appointment as an adjunct professor at McGill University.

And today I'm gonna briefly discuss the research that I've been doing over the past few years on the topic of seeing the unseen; in particular, I'll be discussing how to infer unobserved information from multimodal data. As humans, we never fully observe the world around us, and yet we are able to build remarkably useful models of it. Let's see a very simple example of this. Here we have a partial observation, and we know that this partial observation corresponds to a digit. There are many ways in which we could fill in this partial observation, each leading to a fully observed but different digit. For example, some of you might fill in the partial observation in such a way that it becomes a seven; others might fill it in so that it becomes a one, or even a four. And so partial observations naturally induce uncertainty when attempting to recover the full observation, due to the number of possible ways in which we can fill in the missing information.

And so, in order to improve the models that we build, we reduce the many ways in which we can fill in the missing information by acquiring new observations. These new observations can potentially complement the previous information that we already had. For example, we improve our understanding of outer space by acquiring new observations. Similarly, we gather information from underwater by taking deep dives. And we also improve our understanding of our own inner space, or that of other organisms on Earth, by exploring new observations.

And by studying the newly acquired information, we increase our understanding and are able to build models of what was once invisible to us. Similarly, if we go back to our simple example in the world of digits: if we collect additional information, say the additional stroke shown at the bottom of the partial observation on the slide, then the number of possible ways in which we can fill in the missing information is reduced.

And so in this particular case, we reduce the uncertainty in such a way that the most likely digit to be recovered is the digit seven. And so we keep improving our models of the seen and the unseen by acquiring and leveraging information that complements what we already had. Machine learning models are often required to operate in a similar setup: one where they have to infer unobserved information from the observed one. Let's see a couple of examples of such systems. So one might have this image, an image of an object (a model, in particular) that is being held in a hand. And one question that we could ask is: can we recover the full 3D shape of the object given this partial view of it? So here, the partial observation would be the single-view image of the object, and the full observation that we would want to recover is its full 3D shape. And because we only have a single view of the object, there might be different ways in which we can fill in the information that we don't see in order to infer the 3D shape.

As a second example, one could show you this image, which corresponds to an MR image of a knee, but one with a very low resolution. So one possible question would be: how can we make this image high resolution? What would the high-resolution version of this image look like? In this talk, I will not only answer these two questions but also present approaches that allow us to get improved results. Let's start with the first question: can we infer the 3D shape of the object in view? This is the problem of 3D shape reconstruction, and it has been tackled by the computer vision community mostly from single- or multiple-view images. In this work, we focus on the single-view formulation. So the goal of 3D shape reconstruction is to infer the whole 3D shape of an object from a single-view image of it. And as mentioned before, because we only have a single view, there are parts of the object that we don't see, that are occluded, and there might be different ways in which we might fill in the 3D shape. Current 3D shape reconstruction methods already produce relatively successful reconstructions.

If we consider a single view of the chair depicted on the slide, we see that the current model (the one I'm using here is a state-of-the-art model) obtains a relatively high-quality reconstruction. However, as we can observe, there are still some missing details, such as the ones in the top back of the chair. In more challenging scenarios, where ambiguities arise from the unseen parts, there might be many more ways in which we can fill in the missing information about the shape. In the case of, for example, the couch or the car, we don't have a full idea of what these objects look like in practice, and so inferring the full 3D shape becomes more challenging. In order to limit the possible ways in which we could fill in this missing information and recover better 3D object shapes, we explored how to leverage the complementarity of the vision and touch data modalities. So let me try to explain why these two modalities. If we start with vision: although we only have a single-view image of the object that we want to reconstruct, vision already provides strong global context for the object in view.

That is, it provides us with a good grasp of the general shape of the object and of its rough position in space. However, vision has some ambiguities which may hinder the quality of the reconstructed objects; for example, it suffers from occlusions and from bas-relief ambiguities. Tactile signals, on the other hand, provide highly detailed information about the local structure of the object being touched. They also provide important positional information about the object, since the object is very well grounded within the reference frame of the hand touching it. However, touch information is also limited, in that the local information it provides will very often fail to extrapolate to a global understanding of the object, even if we have strong object priors. Therefore, vision and touch provide complementary global and local information about an object.

In order to explore the complementarity of these two kinds of signals, the visual and the tactile, we started by building a visual-tactile dataset of object-grasp interactions. To do so, we simulated object-grasp interactions using an Allegro hand, which is equipped with high-resolution vision-based touch sensors on each of its three fingers and thumb. We used objects from publicly available computer vision datasets, given their ubiquitous use in research. Once we performed an object-grasp interaction, we obtained an image of the interaction, which we use as the visual signal, depicted on the slide as the vision signal.

And we also obtain four haptic signals, four tactile signals, as depicted on the slide. We performed these object-grasp interactions for thousands and thousands of objects, which led to our simulated dataset. Once we had access to the simulated dataset, we designed an approach to fuse visual and tactile signals for 3D shape reconstruction. In this approach, we represent a 3D object with a collection of disjoint mesh surface elements called charts, depicted here as the triangles composing the sphere. Some of the charts are reserved for the tactile signals, and the other charts are reserved for the visual information. In the first step of our pipeline, we predict the touch charts from the touch recordings; for that, we use a neural network to predict the positions of the vertices composing each touch chart. Then we initialize the vision charts as a sphere and project the visual signal, the image of the object, onto all charts. We then feed all the charts, vision and touch, into a mesh deformation process that leverages recent advances in graph neural networks. At the end of this process, we enforce touch consistency; that is, at the end of the deformation, the touch charts are copied back to their original positions.
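To make this pipeline concrete, here is a minimal sketch of the forward pass in Python with PyTorch. It is my own illustration under assumed shapes and module names, and a small MLP stands in for the graph-convolutional deformation network of the actual model:

    import torch
    import torch.nn as nn

    class ChartFusionSketch(nn.Module):
        """Illustrative sketch of fusing vision and touch charts for 3D
        shape reconstruction (hypothetical shapes and names; the real
        model uses graph neural networks for the deformation step)."""

        def __init__(self, verts_per_chart=9, touch_dim=256, img_dim=512):
            super().__init__()
            self.v = verts_per_chart
            # Predicts the 3D vertex positions of one touch chart from
            # one flattened tactile recording.
            self.touch_to_chart = nn.Sequential(
                nn.Linear(touch_dim, 128), nn.ReLU(),
                nn.Linear(128, verts_per_chart * 3),
            )
            # Stand-in for the graph-convolutional mesh deformation:
            # refines every vertex given its position and image features.
            self.deform = nn.Sequential(
                nn.Linear(3 + img_dim, 128), nn.ReLU(),
                nn.Linear(128, 3),
            )

        def forward(self, touch_signals, image_features, vision_charts):
            # 1) Predict touch charts from the tactile recordings.
            touch_charts = self.touch_to_chart(touch_signals).view(-1, self.v, 3)
            # 2) Vision charts are initialized on a sphere; stack all charts.
            charts = torch.cat([touch_charts, vision_charts], dim=0)  # (C, V, 3)
            # 3) "Project" the image features onto every vertex (a crude
            #    stand-in for perceptual feature pooling).
            feats = image_features.expand(charts.shape[0], charts.shape[1], -1)
            # 4) Deform all charts jointly.
            deformed = charts + self.deform(torch.cat([charts, feats], dim=-1))
            # 5) Touch consistency: the touch charts are trusted local
            #    surface measurements, so copy them back unchanged.
            return torch.cat([touch_charts, deformed[len(touch_charts):]], dim=0)

For instance, fusing four grasps with 21 sphere-initialized vision charts, model(torch.randn(4, 256), torch.randn(512), torch.randn(21, 9, 3)), would yield a (25, 9, 3) tensor of deformed chart vertices.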

In this way, we use the visual information to fill in the blanks and recover the full 3D shape of the object; as a result, we obtain a global prediction of the deformed charts representing the object. Let me now show you a visual result depicting what the proposed approach, leveraging vision and touch, can achieve. In the left column, we see the input vision signal of a bottle. Then, from left to right, we have reconstructions from touch only with a single grasp, followed by vision and touch with a single grasp, and then vision and touch with an increasing number of grasps; this means that we perform several object-grasp interactions in order to obtain the final reconstructions. The top row depicts the predicted shape, and the bottom row visualizes the error on top of the predicted surface. From these experiments, we visually confirm that the model gains additional insight into the nature of the object by touching new areas on its surface. And if we look into the details, in the case of one grasp without vision we see quite a bit of error around the object surface, highlighted in red. As soon as the vision signal kicks in,

that is, with one grasp plus vision, we see that the error on the surface is reduced, mainly around the top of the bottle. And as we keep increasing the number of grasps performed, the error on the surface of the object is further decreased. Moving on to the second question that I posed at the beginning of the presentation: what does the high-resolution version of this MR image look like? Before we dig into this problem, let me give you a brief overview of how MRI works. MRI scanners work by successively acquiring measurements in the frequency domain, and from these acquired measurements we can reconstruct an image using the inverse Fourier transform. Generally speaking, the more measurements we take with the MRI machine, the better the resulting image quality.

However, taking many measurements is slow, so a simple way to accelerate the MRI acquisition process is to collect fewer measurements. For example, we could tell the MRI machine to follow the sampling trajectory depicted under the machine on the slide, where each of the white vertical lines represents one measurement to be acquired. From those acquired measurements, we can then apply the inverse Fourier transform and obtain a reconstruction of the image. As you see on the slide, this results in low-quality images, because we have limited the number of measurements that we have acquired.
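To illustrate why fewer measurements degrade the image, here is a small NumPy sketch, my own illustration rather than code from the talk, that keeps only a subset of vertical lines in the frequency domain (k-space) and reconstructs the image with the inverse Fourier transform:

    import numpy as np

    def undersampled_reconstruction(image, acceleration=4):
        """Simulate MRI undersampling: keep every `acceleration`-th column
        of k-space (the frequency domain) and reconstruct with the inverse
        FFT. Higher acceleration = fewer measurements = lower quality."""
        kspace = np.fft.fftshift(np.fft.fft2(image))  # fully sampled measurements
        mask = np.zeros_like(kspace)
        mask[:, ::acceleration] = 1.0                 # vertical-line sampling trajectory
        reconstruction = np.fft.ifft2(np.fft.ifftshift(kspace * mask))
        return np.abs(reconstruction)

    # Example with a synthetic 128x128 image standing in for the knee scan:
    recon = undersampled_reconstruction(np.random.rand(128, 128), acceleration=4)

Raising the acceleration factor removes more columns and produces stronger aliasing artifacts in the reconstruction, mirroring the low-quality image shown on the slide.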

And so there is a whole field devoted to improving the quality of MR image reconstruction by leveraging machine learning, and more recently deep learning. Our goal in this case is therefore to infer the unobserved measurements from the observed ones, ensuring high-fidelity image reconstructions.

Once again, there might be many ways in which we can fill in these missing measurements, and they could potentially lead to different diagnoses; hence the importance of inferring the right ones. We tackle the MRI reconstruction problem from the perspective of active acquisition, as a way to reduce uncertainty by acquiring new measurements, and as a result we hope to increase the quality of the reconstructed images. In the active acquisition scenario, if we believe that the quality of the reconstructed image is not good enough, we would like to keep acquiring additional measurements in order to ensure that we end up with a high-fidelity image reconstruction. This means that in some cases the number of required measurements will be small, and in other cases it will be high.

Note that designing active acquisition strategies allows us to adapt the measurements taken to each patient, so we can think of this as a way of personalizing MR scans. Here is a brief overview of how we learn these active acquisition trajectories. We start with a reconstruction model, which takes as input the partial measurements taken by the MRI machine and infers the missing ones. In order to reduce the uncertainty, the many ways in which we can fill in the missing measurements, we design a second neural network which takes as input the output of the reconstruction network. This second network outputs a score for each reconstructed measurement, and the score tells us how valuable it is to acquire that measurement in the long run. In practice, we choose to acquire the measurement with the highest added value according to the second network, and after acquiring it, we go back to the first network with the new information and predict a new image reconstruction.
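Schematically, the procedure alternates between the two networks. The outline below is a hypothetical sketch with my own naming: reconstruct and score stand in for the two trained networks, and reading a column of full_kspace plays the role of the scanner acquiring a measurement.

    import numpy as np

    def active_acquisition(full_kspace, reconstruct, score, budget):
        """Alternate between reconstructing the image and picking the
        next k-space column to measure until the budget is spent."""
        mask = np.zeros(full_kspace.shape[1])
        for _ in range(budget):
            # 1) Reconstruct the image from the columns acquired so far.
            recon = reconstruct(full_kspace * mask, mask)
            # 2) Score each column by its expected long-run value and
            #    rule out the columns that have already been acquired.
            scores = score(recon, mask)
            scores[mask == 1] = -np.inf
            # 3) Acquire the highest-value measurement and repeat.
            mask[int(np.argmax(scores))] = 1.0
        return reconstruct(full_kspace * mask, mask)

    # Placeholder stand-ins for the two trained networks:
    reconstruct = lambda k, m: np.abs(np.fft.ifft2(k))
    score = lambda recon, m: np.random.rand(m.size)
    final = active_acquisition(np.fft.fft2(np.random.rand(128, 128)),
                               reconstruct, score, budget=32)

Because the loop stops when the budget is spent (or, in a quality-driven variant, when the reconstruction is judged good enough), the number of measurements can differ per scan, which is what enables the per-patient personalization mentioned above.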

Let me now show a short video of how the learned active acquisition works. From left to right, you will see the active acquisition trajectory, where new vertical lines appear as we keep adding new measurements, then the ground-truth image, followed by the reconstructed image and its associated error map. As we see in the video, as we keep adding new measurements, the errors decrease and the quality of the reconstructed image improves, getting closer and closer to the ground truth. So, to wrap up this presentation: I started the talk arguing that partial observations entail uncertainty, and that there are many possible ways in which we can fill in the missing information. We then saw two examples of machine learning systems that attempt to infer unobserved information.

And although current models are already able to provide somewhat useful reconstructions of what is unseen, they can be hindered by the uncertainty that arises from partial observation. To address this issue, we discussed two different strategies to reduce the number of possible ways in which the missing information can be filled in, both of which lead to improved results. In particular, we discussed the ability of current models to infer unobserved information.

And we saw that this could be improved by leveraging complementary data modalities, as in the case of combining vision and touch to improve 3D reconstruction systems. We also discussed how to equip models with active acquisition capabilities in order to determine the number of observations needed to recover the missing information with high fidelity, in the context of MRI reconstruction. I'll wrap up the presentation here. I'll stop sharing my screen and take any questions that you might have in the chat.