Data-driven decision making using smart edge computing in IoT environment by Hanane LAMAAZI
Video Transcription
Good evening, everyone. I just want to say that I'm happy to be here today with such a wonderful community, the WomenTech community, and this is my second presentation with them. Before starting, let me introduce myself. I am Hanane Lamaazi, a postdoctoral fellow at Khalifa University in the UAE. I was the gold winner in the WomenTech global community awards last year, and I was happy about that. I'm recognized as a reviewer for Elsevier and IEEE publishers, I have acted as a TPC member for a set of conferences, and I have around 470 citations on Google Scholar and Scopus. Generally, I work in the fields of networks, emerging technologies, IoT, crowdsourcing and so on.

Last year I gave a presentation about the concept of IoT and distributed computation, and I noticed that non-technical attendees don't know much about crowd sensing. So I will try to introduce the concept. Crowd sensing is considered an enabling technology for the Internet of Things, since it provides an optimal solution for collecting information and data from the physical environment, and it relies on interaction between users and devices, or between devices. Generally, there are around five main parameters that constitute the concept of crowd sensing.
The first one is the devices: the smart devices, which can be smartphones or wearable devices, that are used for collecting the information. The second is the concept of workers, which is the terminology used in the research community, especially by researchers working on crowd sensing. Workers can be simple participants who volunteer or work in order to collect some specific information. The third is tasks. A task is a specific request from a user or a company to collect predefined data or a specific type of data or information. Based on this task request raised by users or a third party, the workers start collecting the information using their devices. The fourth one is the area, which means that every sensing process is based on a specific area, for example when the purpose of the sensing is to get information related to something specific, such as some services or some products. The last one is the data. In crowd sensing, everything is about data: collecting information from the physical environment, regardless of the purpose of the sensing, whether it is done with a good intention or a bad intention. The concept is collecting information from the physical environment. So how is the data sensed?
There are two ways to sense the data. The first one is participatory crowd sensing, and the other is the opportunistic one. In participatory crowd sensing, the users volunteer to collect the information. For example, yesterday in the WomenTech community there was a poll or survey asking: what are the challenges that women face when they want to be integrated into the tech domain? They also asked who your enabler was, the person who helped you overcome these difficulties. What is your age? In which year did you face these challenges? With this kind of information, users sometimes volunteer to fill in the forms or surveys with the goal of helping companies improve their products, or even propose completely new products or new solutions, not only products, that can help humanity, or the employees in a specific company, or address issues such as the gender gap between women and men in companies, and so on.
So, depending on the goal, the users can volunteer to provide some specific information. But this is not the case for opportunistic crowd sensing, where your data is collected, sensed and shared without your knowledge. For example, some applications keep running in the background and keep collecting information. Sometimes it happens with our knowledge, but we don't know specifically what type of data is collected. For example, when you install a new application, in the terms and conditions of the product you can always find something like: we want to collect your information, your habits or your preferences in order to suggest or recommend new products, or to use it for a good purpose, and be sure that we will not share your private information, and so on.
This is an example of opportunistic sensing: your data is collected and you don't know where it goes or what the purpose is. The second characteristic of participatory crowd sensing is that the user can determine when, where and how to do the sensing. Why? For the simple reason that in the participatory mode you own your decision, whether you do it in a volunteer way, as in the case of polls and surveys, or you get paid for it: you collect information, send it back to a specific company, and get paid for that.
But for the opportunistic mode, we cannot know when or where our data is collected, and we are unaware of a lot of information about the data that is collected by the applications. An application can just keep collecting information until it has a huge amount of data over a specific time, and then the processing can be done offline, once all the information is stored on a specific platform. The third difference between participatory and opportunistic sensing is the phone or device issues.
As we mentioned for the second characteristic, the user can determine when and where to start the sensing or the collection of information. So they keep in mind that the phone should be fully charged, not broken, and should contain all the sensors, like the camera, the microphone and so on. All those issues can be avoided in participatory crowd sensing. But this is not the case for the opportunistic one. Why? For the simple reason that the data is collected without our knowledge. If our phone is turned off, the sensing process stops, because it depends on the condition of our devices.

The last difference is cost. Participatory sensing is considered a little bit more expensive, not that much, for the simple reason that the workers get paid for doing the sensing process, or the company that collects this information can sell your information to a third company. This is not the case for opportunistic sensing, because our data is sensed even without our knowledge.
So we are not getting paid for anything. Here are some examples of mobile crowd sensing applications. The well-known ones are YouTube and Facebook. For example, I started looking into how to make my presentation nice, so I went to YouTube and started watching some videos. After a short time, I found other recommended videos related to what I was looking for. This is it: your information, your habits and your preferences are collected by those companies, and then they use them to recommend other products related to what you are looking for, or something you may not even know about, but they propose it to you.
Take Uber, for example, and let's use the terminology of mobile crowd sensing. A task could be a client who wants to reach a specific destination, so he raises the request in the Uber application. The workers are the drivers. Once they get this request, the nearest driver to that client picks up the client and drops him off at the location he wants to go to.

To summarize this first part: everything in mobile crowd sensing is about collecting information and collecting data. But the questions we can have in mind are: does this data have some characteristics, some dimensions? Can this data be used as it is, or does it need extra processing? What is the quality of the data? Can this data be attacked, falsified or changed by third parties? So here we will see the main factors that can affect the quality of data. The first one is the resource constraint. It depends on the devices that collect the information, whatever the type of device, for example a smartphone, wearable devices, a dash camera, or just camera sensors and so on. The device plays a big role in providing good or bad quality of data.
There is also the manufacturer regulation. For example, the regulation used in an iPhone is not the same as with Samsung, and it is not the same for Huawei and so on. Every manufacturing company has its own specific regulation that can put some rules or constraints on collecting the information. Then there are the device's characteristics. For example, a dash camera can have a microphone and a camera in order to collect videos.
But with a smartphone, in addition to the microphone and camera, we can have other sensors, such as sensors for temperature measurement or other things. As a consequence, this diversity can produce very high heterogeneity of data, so we cannot have data that is all of the same type. It can also produce incompleteness, meaning that the data can be incomplete, or only half of the data is sent because a problem happened on your device, and so on. The data can have low accuracy. It can be noisy data.

The second factor is the network. What transmission technology are you using to send your data? Is it a Wi-Fi connection? That will be better in terms of speed, because it is faster than, for example, Bluetooth, and it can allow you to transmit a huge amount of data. A Bluetooth connection, on the other hand, can be good when you don't have access to the internet, because you can still use Bluetooth to share your data. So data coming from different transmission technologies can be affected by a high transmission delay, depending on the technology. There can also be connection loss, for example if you use Wi-Fi or 4G.
And suddenly you don't have any data on your phone, so you cannot send or share your information. It can also have a high overhead.

The third one is the area of interest. It depends on the location of the workers and the location of the task, that is, where I want to sense. Whether this area is crowded or there are only a few people, and what kind of overall environment it is, can affect the quality of data. As a consequence, we can have redundant data. For example, if we are two workers in the same place doing the same process and task, then we will surely share the same information, so we will get redundant data. And since we have redundant data, the amount of data becomes huge, because it is the same data repeated many times due to the workers' location. We can also have loss because of congestion, or because of interference between the different transmissions of data. This is due to the network characteristics: if all of us send the same data at the same time, and it is a very big amount of data, the data can be lost just because of this congestion problem.
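The redundancy problem described here, where co-located workers report the same thing for the same task, can be illustrated with a small sketch. This is not a method from the talk: the `Reading` fields, the grid `cell_size` and the rounding precision are all assumptions chosen for illustration, and Python is used only as a convenient notation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    # Hypothetical record of one sensed value; field names are illustrative.
    worker_id: str
    task_id: str
    lat: float
    lon: float
    value: float


def drop_redundant(readings, cell_size=0.001, value_precision=1):
    """Keep one reading per (task, grid cell, rounded value).

    Two workers standing in the same grid cell who report (almost) the
    same value for the same task are treated as redundant.
    """
    seen = set()
    kept = []
    for r in readings:
        key = (
            r.task_id,
            round(r.lat / cell_size),   # coarse latitude cell
            round(r.lon / cell_size),   # coarse longitude cell
            round(r.value, value_precision),
        )
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept


readings = [
    Reading("w1", "t1", 24.4539, 54.3773, 31.02),
    Reading("w2", "t1", 24.4539, 54.3773, 31.04),  # same place, same value
    Reading("w3", "t1", 24.4700, 54.3773, 31.02),  # same task, different place
]
unique = drop_redundant(readings)
```

In this example the second worker's reading is dropped, while the third is kept because it comes from a different location. A real system would need a more careful notion of "same place" and "same value".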
The final one is anomalies and outliers. Say we have a task in a specific area where we want to do the sensing. One worker can be close to the task and another worker can be very far from it, but both of them are still able to collect information and send it to the server. In this case, the worker closest to the task will provide data with specific characteristics, but the others, since they are far away, can send different data even for the same task. So we can have anomalies or outliers, which means that our data is not consistent.

Data as a concept can be defined based on a set of dimensions. In the literature, there are four dimensions of data. The first is that the data can be intrinsic, which means that it is defined by an inherent quality existing within the data, like the reputation or the validity of the data. For example, beyond a specific time, the data may no longer be valid. The second is contextual, which means that it is context-based. For example, the information that I want to collect for a crash is not the same as the information that I want to collect at a party. These are two completely different contexts.
So the data can depend on that. It also depends on the time, meaning at which time I want to sense the data, and on the size of the data that I want to collect. The third is representational, which means that it focuses on the data format and understanding: what data do I want to provide, and for which purpose? Representational data is, for example, for statistics, graphs, charts and so on. This kind of data is more about what we can understand from it and how we can present it. The last one is accessibility, which describes not only the availability of the data but also the degree of usage, or how much the data is used.

The question that comes to mind is whether this data can be attacked. It certainly can: since it is something that contains information, whether very sensitive information or just basic daily information, this data can be attacked. There are three types of attacks: data corruption, data exfiltration and data disruption. The data corruption attack occurs when data loses its integrity. It can be due to internal factors or external ones.
The internal ones can be caused by viruses that are installed or stored on your target devices, such as a laptop or a smartphone. The virus can modify the original data, partially delete it or completely destroy it. This can also be related to software and hardware malfunctions, due for example to incompatibility between software or between system versions, or to software and hardware failures. The external ones are associated with the environment: whatever happens in your environment can affect your devices.
It means that your data can be corrupted by, for example, water, storms, power outages and so on. The second one is data disruption. I think this one is not really only an attack; it is something that can also happen through the user himself. This kind of attack can occur when a user loses access to his data due to a virus, or for example software problems, hardware malfunctions, or even the network being down or the connection being lost. I think all students, or people who work in academia or research, run into this problem of data disruption. For example, sometimes we are using Microsoft Word, trying to write our reports, and we forget to save the data, and then something happens to the laptop: the network is down, or the laptop turns off, or whatever.
So when we want to look for that file, we cannot find it. It is lost. This is a simple way to describe data disruption: you have your data, but you cannot access it, for the simple reason that you didn't save it and it is lost.

The last one is the most serious: the data exfiltration attack, which refers to any unauthorized transfer of data from one device to another. It is also described as data theft, data exportation or data extrusion. It can be done by phishing or outbound emails, or by uploading data to other, insecure devices. It can also be done by unauthorized software or shady websites. And finally, it can happen through unsafe behavior in the cloud. So who can exfiltrate the data, or who are the real attackers that are able to exfiltrate it? It can be anyone, from inside the company or the user himself. In terms of a company, it can be an insider or an outsider with malicious intent. It can also be a former employee, an inadvertent insider, a privileged user or a third party.
Here are some examples of data exfiltration that happened to well-known companies. For the Wawa company, data on around 30 million card accounts was stolen using card-stealing malware. For Amazon, customer emails, addresses and phone numbers were stolen by malicious insiders. And the Travelex company lost around five gigabytes of sensitive corporate data to ransomware. Losing this sensitive data can cause very serious problems, not only for the companies, which can lose money, lose the trust of their users and lose a lot of other things, but also for the users themselves. None of us wants our private data to be used by malicious people, but this is happening in this world.

Here is just a flashback to where we used to store our data. What I remember from the beginning is the floppy disk. As for the punch card, punch tape, magnetic tape and hard drive, I don't know that much; I was not yet born. But the others, especially the CD-ROM and the floppy disk, were still in use until, I think, 2000, and we used them to save our data. They had a lot of problems: they were not very efficient, they could not store more than, if I remember, two or four gigabytes, they could get infected very easily by viruses, and they were very easily damaged by external problems. If you were not careful about how or where you kept them, they would be damaged.
And in the end, you would lose your data. Those traditional, old solutions cannot be used to store information today, for the simple reason that the amount of data has become very huge, especially with social media platforms, emerging technologies and the development that is happening in the world. People keep sharing information on a daily basis. One person can share, for example, more than five or ten gigabytes per day just through images or videos between friends and so on. This huge amount of data should really be stored on a platform that is very powerful.

As a solution, cloud platforms were proposed as a centralized platform to store all the data that is sensed or collected from the physical environment. The centralized platform was a very good solution, especially for applications that are delay- and energy-tolerant, and for applications that do not require real-time processing. But with these solutions, a set of challenges occur. The first one is the delay: as I mentioned, it is not an optimal solution for applications that need a real-time answer or have a real-time requirement.
Also, centralized platforms are not able to discover, connect to or select non-connected devices in crowd sensing, such as devices that use Bluetooth or Wi-Fi Direct communication. They usually rely on devices that use a Wi-Fi or 4G connection, since those are able to share a very large amount of data. They also have a problem of computation, since this huge amount of data is all stored in one entity. Before the computational process starts, the data needs to be filtered, manipulated and processed, and then computed in order to get only the data that is needed for the final project. So in order to increase the capability to do very heavy computational processing, the company needs to invest in increasing the power of those entities, which is highly expensive. This is also one of the main challenges of the centralized platform. The solution recently adopted for those challenges that centralized platforms suffer from is edge computing.
Edge computing is considered an enabler of advanced IoT services for large data, since it is placed as an intermediate layer between the cloud, or centralized platform, and the end user. It also allows distributed or parallel computation, and it can take on the same responsibilities as the cloud. But instead of doing the processing on one platform, the processing is distributed over a set of servers: the information is sliced and distributed across a set of servers.

So what are the benefits of edge computing? The first one, as I mentioned, is the parallel computation. It also resolves the problem of distance based on location, since the edge node is placed close to the end user. Then there is the offloading, both for the mobile devices and for the server. The server is no longer in charge of doing the overall processing for the whole data; it only processes the outcomes of the computation done locally by the edge servers. In terms of computation, it also reduces the computational complexity, which is not the same as with a centralized platform. And it also reduces the latency and the threats. For example, suppose all the data in a centralized platform gets lost.
We cannot get it back, since all the data is stored in one place. But with edge computing, the data is stored on a set of servers. If one of those servers gets attacked, we can still get information back from the other servers. Even if it is not the same information, at least we have historical data to predict what is missing from the server that was attacked. This is a reasonably good solution to preserve privacy and to protect the data from loss.

Here is just an illustration of the difference between a centralized platform and a distributed one. As you can see, in the centralized case all the users send their information to the central unit, which is the cloud. But in the distributed architecture, we have a set of servers distributed over an area of interest, and the workers or users send the information to the closest server, not to all of them, only the closest one. This server can do some processing, like filtering the data, removing unnecessary data, or even compressing or zipping the data to make it smaller, and then it sends the result to the cloud.
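The edge-side steps just listed (filter the data, remove what is unnecessary, compress it, then send it to the cloud) can be sketched in a few lines. This is only an illustrative sketch, not the system from the talk: the field names, the `valid_range` bounds and the two function names are assumptions, and the "network" is just a byte string passed between two functions.

```python
import json
import zlib


def edge_preprocess(readings, needed_fields=("task_id", "value"),
                    valid_range=(-50.0, 60.0)):
    """Filter, trim and compress raw readings on a (hypothetical) edge server."""
    low, high = valid_range
    cleaned = []
    for r in readings:
        value = r.get("value")
        # Filtering: drop missing or implausible values.
        if value is None or not (low <= value <= high):
            continue
        # Removing unnecessary data: keep only the fields the task needs.
        cleaned.append({k: r.get(k) for k in needed_fields})
    # Compression: shrink the payload before sending it to the cloud.
    return zlib.compress(json.dumps(cleaned).encode("utf-8"))


def cloud_receive(blob):
    """Decompress and decode the preprocessed payload on the cloud side."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


raw = [
    {"task_id": "t1", "worker_id": "w1", "value": 30.5, "debug": "x" * 50},
    {"task_id": "t1", "worker_id": "w2", "value": None},    # incomplete
    {"task_id": "t1", "worker_id": "w3", "value": 999.0},   # implausible
    {"task_id": "t1", "worker_id": "w4", "value": 29.8},
]
blob = edge_preprocess(raw)
received = cloud_receive(blob)
```

The cloud side then only sees the two valid, trimmed records, which matches the idea that the cloud does not repeat the processing from the beginning.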
So the cloud does not do the processing from the beginning, since it gets only the preprocessed data, and it can do extra processing if needed.

Here in our center, we developed a set of solutions: first, solutions that use this concept of edge computing in order to process the data locally on the edge servers, and also solutions that consider the quality of the data provided by the workers. The first one was a framework that deploys a set of entities, or edge nodes. Those edge nodes are placed close to the task and to the end users. Then they select workers based on their quality of service, they start the sensing process, collecting information from the users, and then they send it to the cloud. As a finding, we found that our solution is good in terms of selecting workers with high quality of service, and it can also save cost or budget and provide more consistent data.
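To make the idea of an edge node selecting nearby workers by quality of service more concrete, here is a minimal sketch. It is not the framework from the talk, whose details are not given in this presentation: the `Worker` fields, the 0-to-1 `qos` score, the `radius` and the greedy budget rule are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Worker:
    # Hypothetical worker profile; fields are illustrative only.
    worker_id: str
    x: float
    y: float
    qos: float    # assumed quality-of-service score in [0, 1]
    cost: float   # assumed payment requested by the worker


def select_workers(workers, task_xy, radius, budget):
    """Greedy sketch: among workers within `radius` of the task,
    pick the highest-QoS ones that still fit in the budget."""
    tx, ty = task_xy
    nearby = [w for w in workers
              if math.hypot(w.x - tx, w.y - ty) <= radius]
    nearby.sort(key=lambda w: w.qos, reverse=True)
    chosen, spent = [], 0.0
    for w in nearby:
        if spent + w.cost <= budget:
            chosen.append(w)
            spent += w.cost
    return chosen


workers = [
    Worker("a", 0.0, 0.0, qos=0.90, cost=2.0),
    Worker("b", 1.0, 1.0, qos=0.50, cost=2.0),
    Worker("c", 10.0, 10.0, qos=0.99, cost=1.0),  # high QoS but too far away
    Worker("d", 0.0, 1.0, qos=0.80, cost=2.0),
]
selected = select_workers(workers, task_xy=(0.0, 0.0), radius=2.0, budget=4.0)
selected_ids = [w.worker_id for w in selected]
```

With a budget of 4, the sketch picks workers `a` and `d`: worker `c` is excluded by distance and worker `b` by the remaining budget. A greedy rule like this is only one possible choice and makes no claim to be optimal.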
Similarly, the second solution is based on workers that provide a high quality of data. We keep reselecting workers based on the data, and we check whether the data fits the expected quality of service required by the task. Once this condition is satisfied, we stop the selection. This solution provides more consistent data and requires less delay, less energy consumption and fewer computational resources.

As a takeaway: we talked about crowd sensing, or crowdsourcing, which can be participatory or opportunistic. We talked about the data dimensions and the data quality, and we mentioned that this data can be attacked by malicious users. The data was stored on a centralized platform at the beginning, and then, to overcome those problems, the community proposed a new solution, which is edge computing. And now we are aiming for solutions that consider the problem of data and provide a very high quality of data. If you want to connect with me, this is my LinkedIn, my Gmail, and my Google Scholar and Scopus profiles for my research. Thank you so much.
If you have any questions, don't hesitate to contact me.

No, actually, we are a center that is doing research related to this concept. But no, we just publish in journals, and that's it. We have not yet turned it into a company solution, and we have not yet prototyped the solution. We proposed it and developed it as an application, but we have not yet built a prototype to take it to industry.

OK. Any other questions? If you want to ask me anything, you can contact me. I will share my LinkedIn profile, and I will be happy to answer your questions.

OK, fine. Applications in the oil industry? Actually, I don't have any idea about this, but it depends on what you mean. For us, applications are more about developing programs that allow some computation. If you mean a real application that exists, it depends on the purpose for which you want to use this application: to collect information, or just to compute cost, for example. It depends on what you want.
You can find it, but I cannot mention anything specific, because I really don't have any idea about the oil industry in particular. OK. Thank you so much, and I hope to see you again. Thank you for your attendance.