20 May 2015

Building a file explorer on top of Amazon S3
Amazon S3 is a simple file storage solution that is great for storing content, but how well does it stack up when used as the storage mechanism for a web-based file explorer?
Recently I was tasked with doing just this for a client. Furthermore, as opposed to the existing solution (which used CKFinder and synchronised copies of the files between our own server and the bucket), I needed to connect to an S3 bucket directly. In this post I’ll talk about how we did it.
Don’t reinvent the wheel… or maybe we should!
Initially it seemed like the best thing to do was to use an existing implementation of a web-based file explorer that connects to S3. I assumed there must be a good solution that someone had already taken the time to build. A look online turned up two possible contenders: CKFinder with an S3 plugin and ELFinder with an S3 plugin. Both were written in PHP, which wasn’t ideal, but we could live with it.
I decided to first try ELFinder to see how it would perform. I started uploading a few files and folders and it worked great, so I decided to see how it would perform with more than a few files in the bucket.
To do this I connected to another bucket containing a copy of the live environment we were planning to use: tens of thousands of files across about 2,600 directories. The performance was really bad – it took almost a minute to load, and about the same to complete any sort of action, such as navigating.
So why was it so slow?
S3 is not the same as a traditional file system. It works on a key/value model: the path of the file (including the file name) makes up the key, and the value is the contents of the file.
You can search for files based on a prefix, which offers reasonable performance for displaying the contents of a directory. However, S3 provides no efficient way of listing the directory structure.
So in order to list even the top level of directories, you would need to request all of the keys from S3, iterate over them and work out the top level directory names based on the prefix of each key. This made ELFinder really slow, because it would do this on every request.
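To make the cost concrete, here is a minimal sketch (class and method names are mine, not from the original code) of the work a naive explorer has to repeat on every request: pull every key in the bucket and derive the top-level directory names from the prefix of each one.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Derive the top-level "directories" from the full list of S3 keys.
// With tens of thousands of keys, doing this per request is what made
// ELFinder so slow.
public class TopLevelDirs {
    static Set<String> topLevelDirs(List<String> keys) {
        Set<String> dirs = new TreeSet<>();
        for (String key : keys) {
            int slash = key.indexOf('/');
            if (slash > 0) {                 // key lives under a "folder"
                dirs.add(key.substring(0, slash));
            }
        }
        return dirs;
    }
}
```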
Stick to the basics and do it well
We created a RESTful web service on the backend, caching all of the keys that were in S3. With those keys we were able to construct an in-memory tree structure, so that we could quickly retrieve the list of folders within a directory. We also created an in-memory hash map containing a list of files for each directory for fast retrieval.
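The cache can be sketched roughly like this – one pass over all keys builds both the directory tree and the per-directory file map. The class and method names are illustrative, not the actual implementation.

```java
import java.util.*;

// Hypothetical sketch of the in-memory cache: one pass over all keys
// builds a directory tree (for folder navigation) and a map from each
// directory to its immediate files (for fast listing).
public class S3DirectoryCache {
    // directory path -> immediate child directory names
    final Map<String, SortedSet<String>> childDirs = new HashMap<>();
    // directory path -> immediate file names
    final Map<String, List<String>> files = new HashMap<>();

    void rebuild(Collection<String> keys) {
        childDirs.clear();
        files.clear();
        for (String key : keys) {
            String[] parts = key.split("/");
            String dir = "";                          // "" is the bucket root
            for (int i = 0; i < parts.length - 1; i++) {
                childDirs.computeIfAbsent(dir, d -> new TreeSet<>()).add(parts[i]);
                dir = dir.isEmpty() ? parts[i] : dir + "/" + parts[i];
            }
            files.computeIfAbsent(dir, d -> new ArrayList<>()).add(parts[parts.length - 1]);
        }
    }

    Set<String> foldersIn(String dir) { return childDirs.getOrDefault(dir, Collections.emptySortedSet()); }
    List<String> filesIn(String dir)  { return files.getOrDefault(dir, List.of()); }
}
```

With this in place, listing a directory is a couple of map lookups instead of a scan over every key in the bucket.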
To reduce the complexity of the project we opted for a minimum viable product (MVP) approach, which spared us from building features that would take additional time to implement and support.
We supported navigating, deleting files, uploading files, creating folders, renaming files, downloading files and deleting folders (but only when a folder was empty). This last caveat on deleting folders reduced complexity, as well as acting as a safeguard against accidentally deleting the entire contents of a folder.
We didn’t need functionality to rename a folder. This was fortunate, because S3 does not support renaming objects: we would have had to loop over every file in the target folder and copy each one to a new key containing the new directory name. That would have been a slow process, as we could potentially have had to copy many gigabytes of data just to rename a folder.
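To illustrate why a folder rename is so expensive, this sketch (hypothetical names) computes the key rewrites such a rename would require. In S3, each rewrite would then be a server-side copy followed by a delete of the original object – there is no in-place rename.

```java
import java.util.List;
import java.util.stream.Collectors;

// Every key under the old prefix must be rewritten; each rewrite implies
// a copy plus a delete in S3, potentially moving gigabytes of data.
public class FolderRename {
    static List<String> renamedKeys(List<String> keys, String oldPrefix, String newPrefix) {
        return keys.stream()
                   .filter(k -> k.startsWith(oldPrefix + "/"))
                   .map(k -> newPrefix + k.substring(oldPrefix.length()))
                   .collect(Collectors.toList());
    }
}
```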
Empty folders, easy right?
Supporting empty folders was not straightforward, because S3 doesn’t support them natively. Because the entire path of a file is stored within its key, there is no natural way to create an ‘empty’ folder.
We ended up implementing empty folders by using a hidden file with a specific key that represented that the folder was empty. This worked great, except for directories that weren’t created by our application – once our in-memory cache got updated, the empty folder would be gone without the user deleting it. So our implementation assumed that some pre-existing conditions would be met in the S3 bucket in order for things to function correctly.
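The marker-file convention can be sketched like this. The marker name `.placeholder` is illustrative – the real application used its own specific key.

```java
import java.util.List;

// Sketch of the hidden-file convention for empty folders: creating an
// "empty" folder means PUTting a zero-byte object at the marker key, and
// a folder counts as empty when the marker is its only content.
public class EmptyFolders {
    static final String MARKER = ".placeholder";   // illustrative name

    // Key to PUT (as a zero-byte object) when the user creates an empty folder.
    static String markerKeyFor(String folderPath) {
        return folderPath + "/" + MARKER;
    }

    static boolean isEmpty(List<String> filesInFolder) {
        return filesInFolder.size() == 1 && filesInFolder.get(0).equals(MARKER);
    }
}
```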
We also had to provide a way of searching for files within a given directory. This is something S3 can’t do for us: a bucket can only be searched by key prefix, so we needed to implement our own search functionality.
Thankfully, having a cache of files and using AngularJS filters made searching for a file reasonably straightforward to implement, while providing a fast user experience. However, it did mean sending the user the whole list of files in a directory so that they could filter it client-side. Fortunately, during our testing we didn’t notice any performance hit.
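The actual filtering ran client-side via AngularJS filters; this is the same case-insensitive substring match expressed in Java purely for illustration.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Case-insensitive substring search over the cached file list for one
// directory -- the server-side analogue of the AngularJS filter.
public class FileSearch {
    static List<String> search(List<String> filesInDir, String query) {
        String q = query.toLowerCase(Locale.ROOT);
        return filesInDir.stream()
                         .filter(name -> name.toLowerCase(Locale.ROOT).contains(q))
                         .collect(Collectors.toList());
    }
}
```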
Our file system implementation was shaping up to be significantly faster than anything else available. We were caching all of the file structure in memory so that when a user was loading the application or navigating through the directories there would be no need to make a request to S3.
Every 5 minutes the application would refresh the cache, a process that took around 15-20 seconds. This was much better than doing it on every request; the only time the S3 bucket needed to be accessed was during the refresh.
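The periodic refresh could be wired up with a single background thread along these lines (a sketch with hypothetical names, not the actual implementation):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// One background thread rebuilds the cache on a fixed schedule, so user
// requests never have to touch S3 directly.
public class CacheRefresher {
    final AtomicLong refreshCount = new AtomicLong();

    void refresh() {
        // In the real application this re-listed all keys from S3 and
        // rebuilt the in-memory tree and file map (~15-20 seconds).
        refreshCount.incrementAndGet();
    }

    ScheduledExecutorService start(long period, TimeUnit unit) {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleAtFixedRate(this::refresh, 0, period, unit);
        return ses;
    }
}
```

On startup something like `new CacheRefresher().start(5, TimeUnit.MINUTES)` would kick off the schedule.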
Another problem with the existing S3 file system products available for PC/Mac was that they authenticated using AWS credentials, which would have presented a security risk for our business users. With our implementation we were able to keep the AWS IAM role on the server and give each user their own credentials to log in to the application.
We also received a late requirement to restrict read and/or write access to some directories. We implemented this with Spring Security, hiding restricted directories from a user and blocking write access to some files. This feature was not supported in any available S3 file explorer that I could find.
We also provided two display modes for the file explorer: one simply listed the files, while the other displayed image thumbnails. When implementing this, the hash code that S3 provides for each file proved really useful in ensuring we didn’t download and regenerate a thumbnail every time one was requested. When a thumbnail was requested, the image would be downloaded from S3 to the web server.
The size of the file was reduced using the ImgScalr Java library and the result was saved on the web server along with the hash code. When the same thumbnail was requested again, the application would simply fetch the file’s hash code from S3, compare it to the one from the previous request, and return the cached thumbnail if they matched.
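The cache decision boils down to comparing the hash S3 reports for the object (its ETag) against the hash recorded when the thumbnail was last generated. A minimal sketch, with illustrative names:

```java
import java.util.HashMap;
import java.util.Map;

// Only re-download and re-scale an image when its S3 hash (ETag) differs
// from the one stored alongside the cached thumbnail.
public class ThumbnailCache {
    // object key -> ETag recorded when the cached thumbnail was generated
    final Map<String, String> cachedEtags = new HashMap<>();

    boolean needsRegeneration(String key, String currentEtag) {
        return !currentEtag.equals(cachedEtags.get(key));
    }

    void recordGenerated(String key, String etag) {
        cachedEtags.put(key, etag);
    }
}
```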
Amazon S3 provides a highly redundant, highly available and highly scalable storage service. However, it is important to keep in mind that it is not a directory-based file system. Instead, it stores files in key-value pairs with no real directory functionality.
In our case S3 presented something of an impedance mismatch, so in implementing our solution we had to think outside the box to create a fast and reliable file explorer. I really enjoyed implementing this solution and tackling the various problems that came up. The end result has provided a lot of value to the customer, and we delivered a solution that will outperform most others.