Storing your data in the cloud: doing it right
An Introduction to Optimal Cloud Storage - Object Storage
Hello, everyone! I'm Roy Wasserman, a Senior Principal Software Engineer at Red Hat. I've spent many years developing software, with a particular focus on storage systems. Today, I want to introduce you to one type of storage that has transformed the cloud era: object storage.
What is Object Storage?
Object storage is a newer kind of storage than block storage or file system storage, built for zettabyte-scale data (on the order of 1,300 exabytes, or 1.3 zettabytes). It is highly scalable and cost-effective. In an object storage system, data is stored in a flat namespace: objects live in buckets and are referenced by name. Unlike file system storage, where directories can contain other directories, a bucket cannot contain another bucket.
Here are the key features of object storage:
- Flat Namespace: Buckets cannot be nested, which makes the storage easier to manage and reduces its cost.
- Immutable: An object is always written as a whole. You can overwrite the entire object or read any part of it, but you cannot update part of an object in place.
- Rich Metadata: Users can attach their own metadata to objects, which helps with analytics and object search.
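The three features above can be pictured with a small in-memory model. This is only a sketch for illustration, not how any real object store is implemented: the `Bucket` and `StoredObject` classes and the object names are made up for this example. It also shows the prefix idea described in the talk, where "folders" are simply a naming convention on top of the flat namespace.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class StoredObject:
    data: bytes
    metadata: Dict[str, str] = field(default_factory=dict)


class Bucket:
    """Toy in-memory bucket: a flat mapping from object name to object."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._objects: Dict[str, StoredObject] = {}

    def put(self, key: str, data: bytes,
            metadata: Optional[Dict[str, str]] = None) -> None:
        # A write always replaces the whole object; there is no partial update.
        self._objects[key] = StoredObject(data, dict(metadata or {}))

    def get(self, key: str, start: int = 0, end: Optional[int] = None) -> bytes:
        # Reads may cover the whole object or any byte range of it.
        return self._objects[key].data[start:end]

    def metadata(self, key: str) -> Dict[str, str]:
        # User-defined metadata travels with the object.
        return dict(self._objects[key].metadata)

    def list_objects(self, prefix: str = "") -> List[str]:
        # "Folders" are only a naming convention: filter object names by prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))


album = Bucket("holiday-album")
album.put("2024/trip/beach.jpg", b"beach photo bytes", {"camera": "phone"})
album.put("2024/trip/city.jpg", b"city photo bytes")
album.put("2023/garden.jpg", b"garden photo bytes")
print(album.list_objects("2024/trip/"))
```

Listing with the prefix `2024/trip/` returns only the two objects whose names start with it, even though the bucket itself has no nested structure.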
Object Storage and The Cloud
Object storage became extremely popular in the cloud era, taking off after AWS announced S3 (Simple Storage Service) in 2006. With S3, all storage access and management is done through a REST API over HTTP requests, which makes access to remote storage much easier.
Examples of object storage can be seen in:
- AWS S3
- Google Cloud Storage
- Red Hat Ceph
- DigitalOcean (uses Ceph for its cloud services)
Application of Object Storage - Photo and Document Storage
Photo Storage: Object storage is an excellent fit for large, immutable objects such as photos or videos. For large objects, object storage offers a multipart upload feature: the object is divided into parts that are uploaded separately, and possibly in parallel. This shortens upload time and lets an interrupted upload continue from the parts that were already stored.
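The multipart flow described in the talk has three steps: initiate the upload and receive an upload ID, upload each part with that ID, then send a complete command, after which the object becomes visible. A minimal sketch of that flow against a toy in-memory store follows. `ToyObjectStore` and `split_into_parts` are hypothetical names invented for this example, and the tiny part size is chosen only to keep the example small.

```python
import uuid
from typing import Dict, List


class ToyObjectStore:
    """In-memory stand-in for an object store; not a real client."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}
        self._uploads: Dict[str, dict] = {}

    def initiate_upload(self, key: str) -> str:
        # Step 1: announce the upload and receive an upload ID to keep.
        upload_id = uuid.uuid4().hex
        self._uploads[upload_id] = {"key": key, "parts": {}}
        return upload_id

    def upload_part(self, upload_id: str, part_number: int, data: bytes) -> None:
        # Step 2: parts may arrive in any order, so they can be sent
        # in parallel or re-sent after a network problem.
        self._uploads[upload_id]["parts"][part_number] = data

    def uploaded_parts(self, upload_id: str) -> List[int]:
        # Lets a client resume by checking which parts are already stored.
        return sorted(self._uploads[upload_id]["parts"])

    def complete_upload(self, upload_id: str) -> None:
        # Step 3: only now is the object created and visible.
        upload = self._uploads.pop(upload_id)
        parts = upload["parts"]
        self.objects[upload["key"]] = b"".join(parts[n] for n in sorted(parts))


def split_into_parts(data: bytes, part_size: int) -> List[bytes]:
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]


store = ToyObjectStore()
photo = bytes(range(256)) * 40  # 10,240 bytes standing in for a large photo
upload_id = store.initiate_upload("album/photo.jpg")
for number, part in enumerate(split_into_parts(photo, 4096), start=1):
    store.upload_part(upload_id, number, part)
store.complete_upload(upload_id)
```

As the talk notes, multipart upload needs at least three requests instead of one, so for small objects a single regular upload is usually the better choice.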
Document Storage: Documents change far more often than photos, but they are relatively small and are edited at human speed, so they can still be a good fit for object storage. One significant advantage is versioning, which is enabled per bucket: every overwrite adds a new version instead of deleting the old data. A user can read any previous version of a document by its version ID and even revert to it.
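Versioning can be pictured with a toy model like the one below. It is a sketch under assumptions: the `VersionedBucket` class, its simple `v1`, `v2` version IDs, and the choice to implement a revert as "write the old content again as the newest version" are all illustrative and not taken from any particular object store.

```python
import itertools
from typing import Dict, List, Optional, Tuple


class VersionedBucket:
    """Toy bucket with versioning enabled: overwrites add versions."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[Tuple[str, bytes]]] = {}
        self._counter = itertools.count(1)

    def put(self, key: str, data: bytes) -> str:
        # Old data is kept; the new content becomes the current version.
        version_id = f"v{next(self._counter)}"
        self._versions.setdefault(key, []).append((version_id, data))
        return version_id

    def get(self, key: str, version_id: Optional[str] = None) -> bytes:
        versions = self._versions[key]
        if version_id is None:
            return versions[-1][1]  # no version given: read the latest
        for vid, data in versions:
            if vid == version_id:
                return data
        raise KeyError(version_id)

    def revert(self, key: str, version_id: str) -> str:
        # Assumption: revert by writing the old content as a new version.
        return self.put(key, self.get(key, version_id))

    def stored_bytes(self, key: str) -> int:
        # Every kept version consumes space.
        return sum(len(data) for _, data in self._versions[key])


docs = VersionedBucket()
first = docs.put("report.txt", b"first draft")
docs.put("report.txt", b"second draft")
docs.revert("report.txt", first)
```

The `stored_bytes` helper makes the cost visible: every version that is kept still occupies space, which is why old versions should eventually be deleted.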
Life Cycle Management and Data Security
Lifecycle management is one of the most useful features of object storage. You configure rules, and when a rule's condition is met the storage automatically expires the data or moves it to a different storage tier. Rules can delete old versions or incomplete multipart uploads, freeing up space without manual cleanup.
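As an illustration, the rules mentioned in the talk (delete non-current versions after 30 days, clean up multipart uploads not completed within 7 days, archive after a year) could be written as an S3-style lifecycle configuration. The sketch below only builds the configuration as a Python dictionary; the rule IDs, the helper name `build_lifecycle_configuration`, and the `GLACIER` storage class are choices made for this example. Note one assumption: S3 lifecycle transitions are driven by object age rather than by last read time, so the archive rule here approximates the "not read for a year" idea from the talk.

```python
import json
from typing import Any, Dict


def build_lifecycle_configuration(noncurrent_days: int = 30,
                                  abort_multipart_days: int = 7,
                                  archive_after_days: int = 365,
                                  archive_storage_class: str = "GLACIER") -> Dict[str, Any]:
    """Build an S3-style lifecycle configuration covering the whole bucket."""
    everything = {"Prefix": ""}  # an empty prefix matches every object
    return {
        "Rules": [
            {
                # Expiration: delete versions that are no longer current.
                "ID": "expire-old-versions",
                "Status": "Enabled",
                "Filter": everything,
                "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
            },
            {
                # Clean up parts of multipart uploads that never completed.
                "ID": "abort-incomplete-multipart-uploads",
                "Status": "Enabled",
                "Filter": everything,
                "AbortIncompleteMultipartUpload": {
                    "DaysAfterInitiation": abort_multipart_days
                },
            },
            {
                # Tiering: move objects to an archive storage class by age.
                "ID": "archive-old-objects",
                "Status": "Enabled",
                "Filter": everything,
                "Transitions": [
                    {"Days": archive_after_days,
                     "StorageClass": archive_storage_class}
                ],
            },
        ]
    }


lifecycle = build_lifecycle_configuration()
print(json.dumps(lifecycle, indent=2))
```

With an S3 SDK, a dictionary of this shape is what would be sent as the bucket's lifecycle configuration; the exact call depends on the SDK in use.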
When it comes to data security, there are critical measures to consider:
- Transport Layer Encryption: Always use HTTPS to secure the transport of your data over networks.
- Data at Rest Encryption: Have the storage encrypt data when it arrives (server-side encryption). For stronger privacy, use client-side encryption so that the data is already encrypted before it reaches the storage.
- Access Control: Set access rights on buckets and objects, grant the most limited access that works, and avoid public buckets.
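One way to enforce the first point on an S3-compatible store is a bucket policy that denies any request not made over HTTPS. The sketch below builds such a policy document using the `aws:SecureTransport` condition key; the helper name and the bucket name `example-bucket` are placeholders invented for this example, and other object stores may use different policy mechanisms.

```python
import json
from typing import Any, Dict


def deny_insecure_transport_policy(bucket_name: str) -> Dict[str, Any]:
    """Build a bucket policy that rejects requests made without HTTPS."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                # The condition is true for plain HTTP requests.
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


policy = deny_insecure_transport_policy("example-bucket")
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

Because the statement is an explicit deny, it applies regardless of what other statements allow, so plain HTTP access stays blocked even if a broader allow rule is added later.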
In conclusion, object storage is a scalable, cost-effective storage system perfect for the cloud era. To fully leverage it, you should make good use of its unique features and always take steps to maintain data safety and privacy.
For any further inquiries, feel free to leave a message in the conference chat or over Slack. Thank you.
Video Transcription
Hello, everyone. Welcome to "Storing your data in the cloud: doing it right". Today I'm going to introduce you to object storage, the right way to store data in the cloud. I'm Roy Wasserman. I'm a senior principal software engineer at Red Hat. I've been developing software for many, many years, almost half of my career in open source. I love low-level programming, especially storage, and I've been working on different distributed storage systems for quite a while. In the last few years I've been focusing on Ceph, which is software-defined storage, completely open source. It provides object storage, and also block and file. Today I'm an OpenShift Data Foundation architect at Red Hat. OpenShift Data Foundation, which is based on Ceph, provides persistent data services for container applications running on OpenShift, Red Hat's platform-as-a-service infrastructure for the hybrid cloud. And I'm from Israel, if anyone is interested, working from home even before COVID. So today I want to introduce you to object storage. I want to tell you what's special about it and why it's such a good fit for the cloud, then go over some very cool, unique features that make it an even better fit for the cloud. And as we are talking about the cloud, we also have to have a section about security and privacy. Storage is a very important piece of infrastructure that has been around for quite a while.
It has actually gone through a revolution, and in today's cloud era it has transformed even more. Usually, when we talked about storage, we would divide it into block storage and file system storage. But today we also have a new kind of storage called object storage.
So why do we need this third type? Basically, what happened is that we create so much data, zettabyte-scale data. It's something like 1,300 exabytes, which is actually zettabytes, and it's growing exponentially. That means we need a different kind of storage system to store the data, one that can be very, very scalable and also cost-effective, because it's quite expensive to store so much data, and block storage and file systems have inherent limitations that make it harder. That's why there is object storage. So what is object storage? First of all, it's a flat namespace. We have buckets, buckets contain objects, and we reference an object by name. So to refer to an object, you give the bucket name and the object name. But a bucket cannot contain another bucket, unlike a file system, where directories can have other directories inside. We are talking about a flat namespace. This limitation helps reduce the cost of the storage and makes managing the storage much easier. The second difference from files is that objects are immutable. What do I mean by immutable? I don't mean that objects cannot change and be updated.
But the way we write them is as one entity: you cannot update part of an object. You overwrite it completely. You can of course read whatever part of the object you want, but when you write an object, it's always the full object. And because it's much easier to manage the storage with object storage, we can add very rich metadata to the object and better access control, and also user metadata: as a user of object storage, I can store my own metadata on the object. We will see examples where it can help us. It can help with analytics, with searching for objects, and so on. Object storage has been around for quite a while. People don't know it, but it dates from the middle of the nineties. But it didn't break out. It was a niche: EMC had Centera, which provided what they called content-addressed storage, but that was it. What changed and made object storage so popular is the cloud. In 2006 AWS announced their cloud service: EC2 as the compute cloud, and a storage service that comes with it called S3, Simple Storage Service. And I really like that. S3 is object storage with one addition that is very important: all the access to the storage and all the management is done with a RESTful API. When I refer to a RESTful API, I mean that the API always uses HTTP requests, which is perfect for the cloud, because in the cloud we always have lots of services all around.
Usually all our access is remote, and HTTP makes it easier. It's quite easy to use HTTP to access a remote server or remote storage. The firewalls are already open, everything is standard networking, and it's much, much easier than the storage protocols.
And today object storage is dominating the cloud, with of course AWS S3, which is huge. Every big cloud vendor has its own object storage: Google Cloud Storage, for example, Azure storage, and so on. And on premises we have Ceph, which also provides S3-compatible object storage. And I want to mention DigitalOcean, because DigitalOcean is using Ceph to provide its cloud storage. So let's look at some examples of uses of object storage. The classic one, as we say, is photos. If you think about it, photos are by nature immutable because of the format of photos. Even if we just edit the photo, let's say rotate it, we have to write the photo as one unit. You never write part of the photo. So object storage is a classic fit for photos. This is Flickr. I don't know what's going on with Flickr now, but I always use Flickr as an example because they use Ceph as well. Here in Flickr we have albums: an album is a bucket, and of course each photo is a different object. So our first problem is uploading the photos to the cloud. Photos are actually getting bigger, since we have better cameras on our phones.
So when we upload a photo to the cloud, it's very annoying if we have a network failure or some other problem and we need to upload it all over again. And it's not always photos. Sometimes you want a video, and those are even bigger. So we need a different way to upload the objects to the cloud, and object storage has what we call multipart upload to solve the problem. With multipart upload, as the name says, we take the big object, cut it into parts, and upload each part separately. It's a transaction-based protocol. First you have an initiate command: you go to the object store and tell it, "I want to upload this object," and you give it the name. You can tell it how many parts and the part size, but you don't have to. The object storage will return an upload ID, which you need to keep and use. Then we upload each part separately, each with the correct upload ID. When we are done, we send the complete command, and only then is the object created on the object storage and visible there.
So now when I upload a big object, first of all, in case I have a network problem, all the parts that were already successfully uploaded will still be there. I can also parallelize the upload, getting much better network throughput and a much shorter upload time. I can pause and resume an upload. For example, my phone only uploads photos when I'm in a Wi-Fi area. If I go out of the Wi-Fi area, the upload pauses, and when I come back, it continues from the last part that I hadn't uploaded yet. You can upload an object when you don't know its final size, for example when you stream video. And you can use it as a transaction, instead of what you would use with a file system. But there are some pitfalls in using multipart upload. First of all, you need to understand the performance: instead of one request, we have three. So if the object is small, you want to make sure that you don't use multipart upload, because you can get up to three times slower uploads. Also, if you use a regular upload, you should know that the object size is limited to 5 GB. So for objects larger than 5 GB you should always use multipart upload, which is limited to 5 TB.
Luckily, when we use S3, we have lots of frameworks and SDKs that can handle all of this for us. You just need to make sure that the defaults are set correctly and that the threshold for using multipart upload is correct. Another problem is what happens if I never complete the upload. I'm storing the parts in the storage, so I'm using capacity and actually paying for it, but it's a waste. Those incomplete uploads need to be cleaned up. You can do it manually, and later in the presentation I'm going to show you how you can do it automatically. Let's take another example: documents. At first it seems that documents are not a good fit for object storage, because documents are actually very mutable. You change them a lot, in every part. But as they are relatively small and the updates are infrequent, since human frequency is low compared to computers, they actually are a good fit. And object storage adds a very important feature to documents, which you can see here: I can keep versions of the document. So if I made a mistake and want to go back, I can go to the storage and get another version. What happens if I don't specify which version I want to read? I'll always read the current version, the latest. But I can specify the version ID and get an old version, and if I want to, I can revert to it.
So basically versioning is a bucket feature. You enable it per bucket, and then every time you overwrite the object, the old data won't be deleted. Instead, a new version is added, as you can see. One problem here is that it uses lots of space. If I have a document of, say, 5 MB, and I have 10 versions of it, I'm using 50 MB. That's a lot. So what we need to remember when we use versioning is to delete the old versions that we no longer need, so we free up space. Sadly, I don't remember to do anything. But here the object storage helps us. You can see in this example that the old versions are only kept for 40 days. The object storage will delete them for us. This is, I think, the best feature that object storage has, and it's called lifecycle. You can configure a rule on the object storage, and when the condition is met, the object storage will execute the rule. You have two types of rules: one is expiration, deleting the data, and one is tiering, moving the object. For example, with expiration, I can tell the storage that I want to delete all the versions that are not the current version and are older than 30 days.
That is one kind of rule. As another kind of rule, I can say I want to delete all the multipart uploads that haven't completed for a week, for seven days. This will clean up all the failed multipart uploads and free the space automatically, without any involvement on my part. Tiering is usually used for archiving: I can say that any object that wasn't read for the last year should be moved to the archive, since it's not used. It's a very cool feature, and I think it's very useful. Another example: I said we have a flat namespace, but you can actually have a much more complex hierarchy. For example, here you have folder one containing folder two, which contains folder three. You can do it in object storage with what is called a prefix. A prefix is basically a key, a metadata key, that you can add to an object. So, for example, an object in folder one will have the prefix /folder1, an object in folder two will have the prefix /folder1/folder2, and so on. And when we list a bucket with a prefix, we get only the objects that have this prefix.
So if I want to see only the objects in folder two, I give that prefix to the listing and get them. As we are in the cloud, it's very important to make sure your data is secure and to keep it private. I've sadly seen many misconfigurations that made the data public. So first of all, use HTTPS, unless it's your own private network or your VPN. Sadly, it's not the default. I wish it were, so make sure you use HTTPS and the transport is encrypted. Next, the storage itself: always encrypt the data. When the storage encrypts the data as it receives it, this is server-side encryption, but that's not good enough for your privacy. Object storage has client-side encryption, meaning the data is already encrypted in the SDK, even before it gets to the storage, so when the storage gets it, it's already encrypted. This way the provider cannot read your data. You have two types of keys: you can provide your own keys or use a key management system. We also have access control for buckets. One of the options, and it's not the default, is a public bucket. Please don't use public buckets. There are leaks from public buckets every few months. Today AWS can actually warn you if you have such buckets. We have lots of granularity of access control for buckets and objects. Sadly, I don't have time to go over them. Please use them, and try to use the most limited access you can.
We also have bucket policies and user policies that allow you much better granularity. For example, you can share a bucket between users, you can give read-only access to an anonymous user, and you can restrict access by IP or netmask. So, to sum up what I presented to you today: object storage is designed for the scale of the cloud, its unique features make it perfect for the cloud, and it can help you develop much better applications for the cloud. But to use the features, please use the storage APIs and don't use it as a file system, as that won't allow you to use features like versioning, lifecycle, and so on. We are in the cloud, so make sure the data is safe and private. And if you use a private cloud, please set up a gateway for your object storage. Feel free to ping me with questions. I assume it will be in the conference chat, or I'm always on Slack. Thank you.