Better File Uploads in Ruby with Shrine. Part 1

This is the first part of a series of posts about Shrine. The purpose of the series is to show the advantages of Shrine over the existing file upload libraries.


It's been over a year since I started developing Shrine. In that time Shrine has gained a lot of interesting functionality, its ecosystem has grown significantly, and quite a few developers have started using it in production.

Before delving into the advantages, I want to take a step back and look in detail at what motivated the development of Shrine in the first place.

In particular, I want to talk about the limitations of the existing upload libraries. I think it's important to know about these limitations so that you can make the choice that best suits your needs.

Requirements




The requirements were as follows:

  1. Files must be uploaded directly to Amazon S3
  2. Files must be processed and deleted in the background
  3. Processing can be performed on upload
  4. Sequel integration
  5. Ability to use it with web frameworks other than Rails

In my opinion, the first two points are the most important, because they make for an optimal user experience when working with forms. But the remaining points shouldn't be left without attention either:

1. Uploading directly to Amazon S3 (or a similar service) lets you streamline the file upload process.
It brings a number of advantages: reduced resource consumption, horizontal scaling with storage encapsulated elsewhere, and compatibility with cloud platforms like Heroku, which don't provide writable persistent disk and impose a time limit on request execution.

2. Processing and deleting files in background jobs makes it possible to work with files asynchronously. Whether you store files on the local filesystem or on external storage such as Amazon S3, this greatly improves the user experience. Background jobs are also necessary to maintain high application throughput, because web workers won't be tied up by slow requests.

3. On-the-fly processing works fine for small files, especially when you create several versions of a file, for example different sizes of an image. For large files such as videos, on the other hand, processing on upload is necessary. Therefore, we need a library that can work with any type of file.

4. Being able to use it with ORMs other than ActiveRecord is also very important, since more functional and performant Ruby ORMs, such as Sequel, have appeared.

5. Finally, worthy alternatives to Rails have appeared in the Ruby community, so the library needs to integrate easily with any web framework.

Now let's go through the existing libraries and look at their main shortcomings in light of these requirements.

Paperclip




Easy file attachment management for ActiveRecord

We could say goodbye to Paperclip right away, because of its hard dependency on ActiveRecord. But since it is a very widespread library among ActiveRecord users, let's go over the rest of the requirements anyway.

Direct uploads


Paperclip has no direct upload capabilities. You can use aws-sdk to generate a URL and parameters for a direct upload to S3, and then assign the model attributes in the same way as when uploading a file through Paperclip.

However, Paperclip works with only a single storage. For this to work, all direct uploads have to go straight into your primary S3 storage. That is a security problem, because an attacker can upload files without ever attaching them, and as a result a lot of orphan files can pile up. It would be much easier if S3 could clean that up for you.

Background jobs


For background jobs there is delayed_paperclip. However, delayed_paperclip starts its jobs only after the file has been fully uploaded. This means that if you don't want to, or can't, do direct uploads to S3, your users will have to wait for the file to be uploaded twice (first to the application, then to storage) before any background processing even begins. And that is very slow.
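
For reference, the model-level API is simple enough; here is a sketch based on delayed_paperclip's documented process_in_background macro (the Photo model and :image attachment are just examples):

class Photo < ActiveRecord::Base
  has_attached_file :image, styles: { thumb: "200x200>" }          # regular Paperclip attachment
  validates_attachment_content_type :image, content_type: /\Aimage\/.*\z/
  process_in_background :image   # delayed_paperclip: generate the styles in a background job
end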

In addition, delayed_paperclip doesn't support deleting files in the background. That's a big minus, because an HTTP request has to be made for every version of the file (if you keep several versions on S3). Don't expect this feature to appear either, since Paperclip also checks the existence of each version before deleting it. You can of course disable deletion entirely, but then you have an orphan files problem again.

Finally, delayed_paperclip is now coupled to ActiveJob, which means it can no longer be used directly with backgrounding libraries.

False positives in MIME type spoofing detection


Paperclip ships with functionality for detecting whether someone is trying to spoof the MIME type of a file. However, this detection often misfires, which means there is a chance of getting a validation error even when the file extension matches the file's content. That is quite a deciding factor, because such false positives can be very annoying for users.

Of course, you can disable this functionality, but that leaves the application vulnerable to file upload attacks.
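
For completeness, Paperclip does let you whitelist individual mappings that trip the detector, though that only patches the cases you discover one by one; a rough sketch assuming its documented content_type_mappings option (the csv mapping is just an example):

# Tell the spoof detector that ".csv" files reported by the OS as plain text are legitimate
Paperclip.options[:content_type_mappings] = {
  csv: "text/plain"
}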

CarrierWave




Classier solution for file uploads for Rails, Sinatra and other Ruby web frameworks

CarrierWave was the answer to Paperclip keeping its configuration directly in the model: it encapsulates upload logic in dedicated uploader classes.
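
To illustrate that design, a minimal uploader and its mounting could look like this (a sketch using CarrierWave's documented API; the ImageUploader and Photo names are examples):

class ImageUploader < CarrierWave::Uploader::Base
  include CarrierWave::MiniMagick       # image processing via the mini_magick gem
  storage :file

  version :thumb do
    process resize_to_limit: [200, 200] # all upload configuration lives in the uploader class
  end
end

class Photo < ActiveRecord::Base
  mount_uploader :image, ImageUploader  # the model only mounts the uploader
end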

CarrierWave also has a Sequel integration.

Unfortunately, for the carrierwave_backgrounder and carrierwave_direct extensions CarrierWave's ORM integration isn't enough; you need a lot of additional ActiveRecord-specific code to make them work.

Direct uploads


As mentioned earlier, the CarrierWave ecosystem has a solution for direct uploads to S3: carrierwave_direct. It works by letting you create a form that uploads straight to S3, and then assign the S3 key of the uploaded file to your uploader.


<%= direct_upload_form_for @photo.image do |f| %>
  <%= f.file_field :image %>
  <%= f.submit %>
<% end %>

However, what if you need to handle multiple direct uploads to S3? The README notes that carrierwave_direct is intended for single uploads only. And what about a JSON API? This is just an ordinary form; all it really does is generate the URL and parameters for the upload to S3, so why can't carrierwave_direct return that information as JSON?

But what if, instead of reimplementing the entire S3 request generation logic on top of fog-aws, it simply relied on aws-sdk?

# aws-sdk
s3      = Aws::S3::Resource.new
bucket  = s3.bucket("my-bucket")
object  = bucket.object(SecureRandom.hex)
presign = object.presigned_post

# HTML form version
<form action="<%= presign.url %>" method="post" enctype="multipart/form-data">
  <% presign.fields.each do |name, value| %>
    <input type="hidden" name="<%= name %>" value="<%= value %>">
  <% end %>
  <input type="file" name="file">
</form>

# JSON version
{ "url": presign.url, "fields": presign.fields }

This approach has a number of advantages: it isn't tied to Rails, it works for a JSON API, it supports multiple file uploads (the client can simply request this data once per file), and it's more reliable (since the parameters are now generated by the officially supported gem).

Background jobs


First, it's worth noting that carrierwave_direct provides instructions for setting up background processing. However, setting up background jobs correctly is a genuinely difficult task, so it makes sense to rely on a library that does it for you.

Which brings us to carrierwave_backgrounder. This library does support background processing, but in my experience it has been unstable (1, 2). In addition, it doesn't support deleting files in the background, which is a deciding factor when there are multiple versions to delete.
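
For reference, its intended usage is appealingly small; here is a sketch based on carrierwave_backgrounder's documented model macros (the Photo model is an example, and the gem's README also requires an extra image_tmp column for this setup):

class Photo < ActiveRecord::Base
  mount_uploader :image, ImageUploader
  store_in_background :image   # move the file to permanent storage (and process it) in a background job
end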

Even if we got past all of that, carrierwave_backgrounder cannot be combined with carrierwave_direct. As I mentioned, I want to upload files directly to S3 and process and delete them in background jobs, but these two libraries appear to be incompatible, which means I can't get the behaviour I need out of CarrierWave for my use cases.

Closing unresolved issues on GitHub


I understand that people can be thankless towards the maintainers of popular open-source libraries, and that we should be kinder and more respectful to each other. Nevertheless, I cannot understand why the CarrierWave developers close unresolved issues.

One such closed issue concerns CarrierWave needlessly performing processing before validation. This is a serious security hole, because an attacker can feed an arbitrary file to your image processor: validations of file size / MIME type / dimensions run only after processing. That leaves your application vulnerable to attacks like ImageTragick, image bombs, or simply uploading huge images.

Refile




Ruby file uploads, take 3

Refile was created by Jonas Nicklas, the author of CarrierWave, as a third attempt at improving file uploads in Ruby. Like Dragonfly, Refile was designed around on-the-fly processing. Having suffered through CarrierWave's complexity, I found Refile's simple and modern design really promising, so I started contributing to it, and soon I was invited to join the project.

Refile.attachment_url(@photo, :image, :fit, 400, 500) # resize to 400x500
#=> "/attachments/15058dc712/store/fit/400/500/ed3153b9cb"

Some of Refile's new ideas include: temporary and permanent storage as first-class citizens, clean storage abstractions, an IO abstraction, a clean internal design (no god objects), and direct uploads out of the box. Thanks to Refile's clean design, creating the Sequel integration was pretty straightforward.
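
The temporary/permanent split shows up right in the configuration; here is a sketch based on the refile and refile-s3 READMEs (the credentials and bucket name are placeholders):

require "refile/s3"

aws = {
  access_key_id:     "xyz",
  secret_access_key: "abc",
  region:            "us-east-1",
  bucket:            "my-bucket",
}
Refile.cache = Refile::S3.new(prefix: "cache", **aws)   # temporary storage for fresh uploads
Refile.store = Refile::S3.new(prefix: "store", **aws)   # permanent storage for attached files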

Direct uploads


Refile is the first file upload library that ships with built-in support for direct uploads, which lets the attached file be uploaded asynchronously the moment the user selects it. You can upload the file through your Rack app or directly to S3, with Refile generating the S3 request parameters; there is even a JavaScript library that does everything for you.

<%= form.attachment_field :image, presigned: true %>

There is also a great performance optimization here. When you upload a file directly to S3, it lands in a directory of the bucket that is marked as "temporary". Then, when validation passes and the record is saved, the uploaded file is moved to permanent storage. If both the temporary and the permanent storage are on S3, then instead of re-uploading the file Refile simply issues an S3 COPY request.
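
This isn't Refile's actual code, but at the SDK level the optimization boils down to a server-side copy instead of a re-upload (the bucket and key names below are just examples):

s3     = Aws::S3::Resource.new
bucket = s3.bucket("my-bucket")

# promote a cached upload to permanent storage without downloading and re-uploading the content
bucket.object("store/ed3153b9cb").copy_from(bucket.object("cache/ed3153b9cb"))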

In short, my requirement for direct uploads was met.

Background jobs


One of Refile's limitations is the lack of support for background jobs. You might think that, since Refile processes files on the fly and has the S3 COPY optimization, background jobs aren't needed here.

However, the S3 COPY request is still an HTTP request and adds to the duration of the form submission. Moreover, the speed of an S3 COPY depends on the file size, so the larger the file, the slower the copy.

Besides, Amazon S3 is just one of many cloud storage services; you might use another service that better suits your needs but that doesn't have such an optimization, or doesn't even support direct uploads.

On-the-fly processing


I think on-the-fly processing is great for images that are stored locally and are quick to process. However, if you store the originals on S3, Refile will serve the first request for a version much more slowly, since it needs to download the original from S3 first. In that case you have to think about adding background jobs that pre-generate all the versions.

And if you upload larger files, such as videos, it's usually better to process them on upload rather than on the fly. Refile currently doesn't support that.

Dragonfly




A Ruby gem for on-the-fly processing - suitable for image uploads in Rails, Sinatra and more

Dragonfly is another on-the-fly processing solution, one that has been around much longer than Refile and that, in my opinion, has far more advanced and flexible on-the-fly processing capabilities.
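
To give a flavour of that flexibility, Dragonfly can generate processing URLs straight from a model attachment; here is a sketch based on its documented dragonfly_accessor and thumb APIs (the Photo model is an example):

class Photo < ActiveRecord::Base
  dragonfly_accessor :image   # adds image / image= accessors backed by Dragonfly
end

photo.image.thumb("400x500#").url
#=> a URL under the Dragonfly app that resizes the image when it is first requested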

Dragonfly doesn't work with Sequel. That was to be expected, and I would even be willing to write an adapter, but the generic model behaviour appears to be mixed in with behaviour specific to ActiveRecord models, so it isn't clear how to do it.

There is also no support for background jobs or direct uploads. You can implement the latter by hand, but it will have the same drawbacks as with Paperclip.

There is another important point. Serving files through an image server (a Dragonfly on-the-fly processing app) is a completely separate responsibility. In other words, you can use another upload library that comes with everything (direct uploads, background jobs, multiple ORMs, etc.) to get files into storage, and still use Dragonfly to serve them.

map "/attachments" do
  run Dragonfly.app # doesn't care how the files were uploaded
end

Attache




Another approach to file uploads

Attache is a relatively new library that also supports on-the-fly processing. Its difference from Dragonfly and Refile is that Attache is designed to run as a separate service, so files are both uploaded and served through the Attache server.

Attache has an ActiveRecord integration for linking uploaded files to database records, and it supports direct uploads. But it still lacks the ability to move and delete files in background jobs. In addition, Attache is not flexible enough.

Note that, like Dragonfly, Attache doesn't need to be integrated with the model - you can use Shrine for that. This year I attended RedDotRubyConf in Singapore, where I happened to meet the author of Attache, and after a very interesting discussion about the problems of file uploads we agreed it would make sense to use Shrine for the attachment logic and simply plug Attache in as a backend.

That way Attache can keep doing what it does best - serving files - while delegating attachment handling to Shrine.

Finally


Support for direct uploads, managing files in the background, processing on upload, and the ability to use it with other ORMs are what I really expect from an upload library. However, none of the existing libraries satisfied all of these requirements.

So I decided to create a new library, Shrine, building on the knowledge gathered from the existing ones.

The goal of Shrine is not to get in your way, but to provide the functionality and flexibility to handle the full variety of file handling tasks well.

It's an ambitious goal, but after a year of active development and research I feel I have achieved it. At the very least, Shrine has more features than any other Ruby file upload library. In the rest of this series I will walk you through all the cool things you can do with Shrine, so stay tuned!


Original: Better File Uploads with Shrine: Motivation
Other articles in the series can be found in the author's blog.

