Storage of a large number of files

    image

    Good health, Habrazhiteli! In the process of working on the project of a dating site, it became necessary to organize the storage of photos of users. According to the terms of reference, the number of photos of one user is limited to 10 files. But there may be tens of thousands of users. Especially considering the fact that the project in its current form already exists from the beginning of the “zero”. That is, there are already thousands of users in the database. Almost any file system, as far as I know, reacts very negatively to the large number of child nodes in the folder. From experience, I can say that problems begin after 1000-1500 files / folders in the parent folder.

    Disclaimer I googled before writing the article and found several solutions to the issue under discussion (for example, here or here ). But I did not find a single solution that exactly corresponds to mine. In addition, in this article I just share my own experience in solving the problem.

    Theory


    In addition to the storage task as such, there was also a condition in the TK, according to which the ability to leave captions and photographs was needed. Of course, without a database here is not enough. That is, the first thing we do is create a table in which we prescribe the mapping of meta-data (signatures, titles, etc.) with files on the disk. Each file corresponds to one line in the database. Accordingly, each file has an identifier.

    A small digression. Let's talk about autoincrement. On a dating site there can be a dozen or two thousand users. The question is how many users in general go through the project for all the time of its existence. For example, the active audience of "dating-ru" is several hundred thousand. However, just imagine how many users have retired during the lifetime of this project; how many users have not been activated yet. And now add our legislation, which obliges you to store information about users for at least six months ... Sooner or later, 4 of a billion UNSIGNED INT kopecks will run out. By this best for the primary-key to take BIGINT .

    Now let's try to imagine a BIGINT number.. This is 8 bytes. Each byte is from 0 to 255. 255 child nodes is quite normal for any file system. That is, we take the file identifier in hexadecimal representation, break it into chunks of two characters each. We use these chunks as folder names, the latter being the name of a physical file. PROFIT!

    0f/65/84/10/67/68/19/ff.file

    Elegant and simple. File extension here is not important. Anyway, the file will be given to the script, which will give the browser in particular the MIME type, which we will also store in the database. In addition, storing information about a file in the database allows you to override the path to it for the browser. Let's say the file we actually have is relative to the project directory along the way /content/files/0f/65/84/10/67/68/19/ff.file. And in the database you can write him the URL, for example,/content/users/678/files/somefile. SEO-Schnick now, probably, quite smiled. All this allows us not to worry more about where to place the file physically.

    Table in db


    In addition to the identifier, MIME type, URL and physical location, we will store in the table md5 and sha1 files for screening out the same files if necessary. Of course, we also need to store relationships with entities in this table. Suppose the user ID to which the files belong. And if the project is not very big, then in the same system we can store, say, photos of the goods. By this, we will also store the name of the class of the entity to which the entry belongs.

    By the way, about the birds. If you close the folder with .htaccess for external access, the file can only be obtained through a script. And in the script it will be possible to determine access to the file. Looking ahead a bit, I’ll say that in my CMS (where the aforementioned project is being sawn) access is determined by basic user groups, of which I have 8 - guests, users, managers, admins, unactivated, blocked, remote and super admins. A super admin can do absolutely everything, so he is not involved in the definition of access. If the user has a super-admin flag, then he is super-admin. It's simple. That is, we will determine the accesses of the remaining seven groups. Access is simple - either to give the file, or not to give. Total, you can take the field type TINYINT .

    And one moment. According to our legislation, we will have to physically store custom pictures. That is, we need to somehow mark the pictures as deleted, instead of physically deleting. It is most convenient to use a bit field for these purposes. I usually use the INT type field in such cases . So with a margin, so to speak. Moreover, I have already established tradition to place the DELETED flag in the 5th bit from the end. But it is not fundamentally the same.

    What we have in the end:

    createtable`files` (
      `id`bigintnotnull auto_increment, -- Первичный ключ`entity_type`char(32) notnulldefault'', -- Тип сущности`entity`bigintnull, -- ID сущности`mime`char(32) notnulldefault'', -- MIME-тип`md5`char(32) notnulldefault'', -- MD5`sha1`char(40) notnulldefault'', -- SHA1`file`char(64) notnulldefault'', -- Физическое расположение`url`varchar(250) notnulldefault'', -- URL`meta`textnull, -- Мета-данные в формате JSON или сериализованного массива`size`bigintnotnulldefault'0', -- Размер`created`     datetime notnull, -- Дата создания`updated`     datetime null, -- Дата редактирования`access`      tinyint notnulldefault'0', -- Битовый доступ`flags`intnotnulldefault'0', -- Флаги
      primary key (`id`),
      index (`entity_type`),
      index (`entity`),
      index (`mime`),
      index (`md5`),
      index (`sha1`),
      index (`url`)  
    ) engine = InnoDB;

    Dispatcher class


    Now we need to create a class with which we will upload files. The class must provide the ability to create files, replace / change files, delete files. In addition, it is worth considering two points. First, the project can be transferred from server to server. So in the class you need to define a property that contains the root directory of files. Secondly, it will be very unpleasant if someone bangs a table in the database. So you need to provide the ability to recover data. With the first all in general is understandable. As for data backup, we will only reserve something that cannot be restored.

    ID — recovered from the physical location of the
    entity_type file — cannot be restored;
    entity — cannot be restored;
    mime- restored with the help of expansion finfo
    the md5 - recovering from the very file
    sha1 - recovering from the file itself
    file - recovering from the physical location of the file
    url No - does not recover
    the meta - not restored
    size bed - recovering from the file itself
    Created - you can take the information from the file
    updated - can take information from the file
    access - not restored
    flags - not restored

    Immediately you can drop the meta-information. It is not critical for the functioning of the system. And for more rapid recovery, you still need to save the MIME type. Total: entity type, entity ID, MIME, URL, access, and flags. In order to increase the reliability of the system, we will store backup information on each final folder separately in the folder itself.

    Class code
    <?phpclassBigFiles{
        const FLAG_DELETED = 0x08000000; // Пока только флаг "Удалён"/** @var mysqli $_db */protected $_db       = null;
        protected $_webRoot  = '';
        protected $_realRoot = '';
        function__construct(mysqli $db = null){
            $this->_db = $db;
        }
        /**
         * Установка/чтение корня для URL-ов
         * @param string $v  Значение
         * @return string
         */publicfunctionwebRoot($v = null){
            if (!is_null($v)) {
                $this->_webRoot = $v;
            }
            return$this->_webRoot;
        }
        /**
         * Установка/чтение корня для файлов
         * @param string $v  Значение
         * @return string
         */publicfunctionrealRoot($v = null){
            if (!is_null($v)) {
                $this->_realRoot = $v;
            }
            return$this->_realRoot;
        }
        /**
         * Загрузка файла
         * @param array  $data    Данные запроса
         * @param string $url     URL виртуальной папки
         * @param string $eType   Тип сущности
         * @param int    $eID     ID сущности
         * @param mixed  $meta    Мета-данные
         * @param int    $access  Доступ
         * @param int    $flags   Флаги
         * @param int    $fileID  ID существующего файла
         * @return bool
         * @throws Exception
         */publicfunctionupload(array $data, $url, $eType = '', $eID = null, $meta = null, $access = 127, $flags = 0, $fileID = 0){
            $meta = is_array($meta) ? serialize($meta) : $meta;
            if (empty($data['tmp_name']) || empty($data['name'])) {
                $fid = intval($fileID);
                if (empty($fid)) {
                    returnfalse;
                }
                $meta = empty($meta) ? 'null' : "'" . $this->_db->real_escape_string($meta) . "'";
                $q = "`meta`={$meta},`updated`=now()";
                $this->_db->query("UPDATE `files` SET {$q} WHERE (`id` = {$fid}) AND (`entity_type` = '{$eType}')");
                return $fid;
            }
            // File data
            $meta  = empty($meta) ? 'null' : "'" . $this->_db->real_escape_string($meta) . "'";
            $finfo = finfo_open(FILEINFO_MIME_TYPE);
            $mime  = finfo_file($finfo , $data['tmp_name']);
            finfo_close($finfo);
            // FID, file nameif (empty($fileID)) {
                $eID = empty($eID) ? 'null' : intval($eID);
                $q = <<<sql
    insert into `files` set
        `mime`       = '{$mime}',
        `entity`     = {$eID},
        `entityType` = '{$eType}',
        `created`    = now(),
        `access`     = {$access},
        `flags`      = {$flags}
    sql;$this->_db->query($q);
                $fid = $this->_db->insert_id;
                list($ffs, $fhn) = self::fid($fid);
                $url = $this->_webRoot . $url . '/' . $fid;
                $fdir = $this->_realRoot . $ffs;
                self::validateDir($fdir);
                $index = self::getIndex($fdir);
                $index[$fhn] = array($fhn, $mime, $url, ($eID == 'null' ? 0 : $eID), $access, $flags);
                self::setIndex($fdir, $index);
                $fname = $ffs . '/' . $fhn . '.file';
            } else {
                $fid = intval($fileID);
                $fname = $this->fileName($fid);
            }
            // Move file
            $fdir = $this->_realRoot . $fname;
            if (!move_uploaded_file($data['tmp_name'], $fdir)) {
                thrownewException('Upload error');
            }
            $q = '`md5`=\'' . md5_file($fdir) . '\',`sha1`=\'' . sha1_file($fdir) . '\','
               . '`size`=' . filesize($fdir) . ',`meta`=' . $meta . ','
               . (empty($fileID) ? "`url`='{$url}',`file`='{$fname}'" : '`updated`=now()');
            $this->_db->query("UPDATE `files` SET {$q} WHERE (`id` = {$fid}) AND (`entity_type` = '{$eType}')");
            return $fid;
        }
        /**
         * Чтение файла
         * @param string $url         URL
         * @param string $basicGroup  Базовая группа пользователя
         * @throws Exception
         */publicfunctionread($url, $basicGroup = 'anonimous'){
            if (!ctype_alnum(str_replace(array('/', '.', '-', '_'), '', $url))) {
                header('HTTP/1.1 400 Bad Request');
                exit;
            }
            $url = $this->_db->real_escape_string($url);
            $q = "SELECT * FROM `files` WHERE `url` = '{$url}' ORDER BY `created` ASC";
            if ($result = $this->_db->query($q)) {
                $vars = array();
                $ints = array('id', 'entity', 'size', 'access', 'flags');
                while ($row = $result->fetch_assoc()) {
                    foreach ($ints as $i) {
                        $row[$i] = intval($row[$i]);
                    }
                    $fid = $row['id'];
                    $vars[$fid] = $row;
                }
                if (empty($vars)) {
                    header('HTTP/1.1 404 Not Found');
                    exit;
                }
                $deleted = false;
                $access  = true;
                $found   = '';
                $mime    = '';
                foreach ($vars as $fdata) {
                    $flags   = intval($fdata['flags']);
                    $deleted = ($flags & self::FLAG_DELETED) != 0;
                    $access  = self::granted($basicGroup, $fdata['access']);
                    if (!$access || $deleted) {
                        continue;
                    }
                    $found   = $fdata['file'];
                    $mime    = $fdata['mime'];
                }
                if (empty($found)) {
                    if ($deleted) {
                        header('HTTP/1.1 410 Gone');
                        exit;
                    } elseif (!$access) {
                        header('HTTP/1.1 403 Forbidden');
                        exit;
                    }
                } else {
                    header('Content-type: ' . $mime . '; charset=utf-8');
                    readfile($this->_realRoot . $found);
                    exit;
                }
            }
            header('HTTP/1.1 404 Not Found');
            exit;
        }
        /**
         * Удаление файла (файлов) из хранилища
         * @param mixed $fid  Идентификатор(ы)
         * @return bool
         * @throws Exception
         */publicfunctiondelete($fid){
            $fid = is_array($fid) ? implode(',', $fid) : $fid;
            $q = "delete from `table` where `id` in ({$fid})";
            $this->_db->query($q);
            $result = true;
            foreach ($fid as $fid_i) {
                list($ffs, $fhn) = self::fid($fid_i);
                $fdir = $this->_realRoot . $ffs;
                $index = self::getIndex($fdir);
                unset($index[$fhn]);
                self::setIndex($fdir, $index);
                $result &= unlink($fdir . '/'. $fhn . '.file');
            }
            return $result;
        }
        /**
         * Помечает файл(ы) флагом "удалено"
         * @param int  $fid    Идентификатор(ы)
         * @param bool $value  Значение флага
         * @return bool
         */publicfunctionsetDeleted($fid, $value=true){
            $fid = is_array($fid) ? implode(',', $fid) : $fid;
            $o = $value ? ' | ' . self::FLAG_DELETED : ' & ' . (~self::FLAG_DELETED);
            $this->_db->query("update `files` set `flags` = `flags` {$o} where `id` in ({$fid})");
            returntrue;
        }
        /**
         * Имя файла
         * @param int $fid  Идентификатор
         * @return string
         * @throws Exception
         */publicfunctionfileName($fid){
            list($ffs, $fhn) = self::fid($fid);
            self::validateDir($this->_realRoot . $ffs);
            return $ffs . '/' . $fhn . '.file';
        }
        /**
         * Обработка идентификатора файла.
         * Возвращает массив с папкой к файлу и шестнадцатиричное представление младшего байта.
         * @param int $fid  Идентификатор файла
         * @return array
         */publicstaticfunctionfid($fid){
            $ffs = str_split(str_pad(dechex($fid), 16, '0', STR_PAD_LEFT), 2);
            $fhn = array_pop($ffs);
            $ffs = implode('/', $ffs);
            returnarray($ffs, $fhn);
        }
        /**
         * Проверка каталога файла
         * @param string $f  Полный путь к каталогу
         * @return bool
         * @throws Exception
         */publicstaticfunctionvalidateDir($f){
            if (!is_dir($f)) {
                if (!mkdir($f, 0700, true)) {
                    thrownewException('cannot make dir: ' . $f);
                }
            }
            returntrue;
        }
        /**
         * Чтение резервного индекса
         * @param string $f  Полный путь к файлу резервного индекса
         * @return array
         */publicstaticfunctiongetIndex($f){
            $index = array();
            if (file_exists($f . '/.index')) {
                $_ = file($f . '/.index');
                foreach ($_ as $_i) {
                    $row = trim($_i);
                    $row = explode('|', $row);
                    array_walk($row, 'trim');
                    $rid = $row[0];
                    $index[$rid] = $row;
                }
            }
            return $index;
        }
        /**
         * Запись резервного индекса
         * @param string $f      Полный путь к файлу резервного индекса
         * @param array  $index  Массив данных индекса
         * @return bool
         */publicstaticfunctionsetIndex($f, array $index){
            $_ = array();
            foreach ($index as $row) {
                $_[] = implode('|', $row);
            }
            return file_put_contents($f . '/.index', implode("\r\n", $_));
        }
        /**
         * Проверка доступности
         * @param string $group  Название группы (см. ниже)
         * @param int    $value  Значение доступов
         * @return bool
         */publicstaticfunctiongranted($group, $value=0){
            $groups = array('anonimous', 'user', 'manager', 'admin', 'inactive', 'blocked', 'deleted');
            if ($group == 'root') {
                returntrue;
            }
            foreach ($groups as $groupID => $groupName) {
                if ($groupName == $group) {
                    return (((1 << $groupID) & $value) != 0);
                }
            }
            returnfalse;
        }
    }


    Consider some points:

    - realRoot - the full path to the folder with the file system ending in a slash.
    - webRoot - path from the site root without a leading slash (see below why).
    - As a DBMS, I use the MySQLi extension .
    - In fact, in the upload method, the first argument passes information from the $ _FILES array .
    - If, when calling the update method, to transfer the ID of an existing file, it will be replaced if the input array in the tmp_name is non-empty.
    - You can remove and change the file flags for several pieces at once. For this you need to transfer instead of the file identifier either an array with identifiers, or a string with them separated by commas.

    Routing


    Actually, it all comes down to several lines in htaccess in the site root (it is assumed that mod_rewrite is enabled):

    RewriteCond%{REQUEST_URI} ^/content/(.*)$
    RewriteCond%{REQUEST_FILENAME} !-f
    RewriteRule ^(.+)$ content/index.php?file=$1 [L,QSA]

    “Content” is the folder in the root of the site in my case. Needless to say, you can name the folder differently. And of course, index.php itself, which in my case is stored in the content folder:

    <?php
        $dbHost = '127.0.0.1';
        $dbUser = 'user';
        $dbPass = '****';
        $dbName = 'database';
        try {
            if (empty($_REQUEST['file'])) {
                header('HTTP/1.1 400 Bad Request');
                exit;
            }
            $userG = 'anonimous';
            // Вот тут будем определять группу юзера; любое решение на Ваш выбор
            $files = new BigFiles(new mysqli($dbHost,$dbUser,$dbPass,$dbName));
            $files->realRoot(dirname(__FILE__).'/files/');
            $files->read($_REQUEST['file'],$userG);
        } catch (Exception $e) {
            header('HTTP/1.1 500 Internal Error');
            header('Content-Type: text/plain; charset=utf-8');
            echo $e->getMessage();
            exit;
        }
    

    Well, by itself, we close the file system itself from external access. Put the content/filesfile .htaccesswith only one line in the root of the folder :

    Deny from all

    Total


    This solution avoids the loss of file system performance due to an increase in the number of files. At least troubles in the form of thousands of files in one folder can be precisely avoided. And at the same time, we can organize and control access to files by human-readable addresses. Plus compliance with our grim legislation. Immediately make a reservation, this solution is NOT a complete way to protect content. Remember: if something is played in the browser, it can be downloaded for free.

    Also popular now: