Finding Fat (The Quest For FAT)
When developing a certain hardware and software complex, it was necessary to create a client device, which for other devices should look like a regular USB flash drive, or, more formally, a USB Mass Storage Device. The device is unusual in that it must simulate the FAT file system with files of a sufficiently large size (2GB or more) for the outside world, despite the fact that the files themselves on the device, of course, are absent and are on the network. Anyway, these are not files, but some kind of audio streams.
The task, at first glance, is simple: for each request to read a block (SCSI command) we give the contents of this block. A block can either belong to any of the “files” or contain FAT overhead information.
The first thought was, of course, to pack the FAT image with, for example, bzip2 and unpack it on the device as needed. Three problems arise immediately:
Well, not to mention the fact that you have to port bzip2 to the microcontroller.
So something else had to be come up.
The problem can be posed as follows: it is necessary to write a code that will take some description of the file system as an input in some form, and return its contents for each request of the sector number. The content is either service information or file data, which are taken from the corresponding audio stream at a given URL.
This formulation of the question leads us to a system of rules:
Note that we are talking about sectors, not “clusters,” since a cluster is a concept of the FAT file system itself. Devices work at the block level, they are also sectors. Let's say our “playlist” contains 10 “files” of 2Gb each (2Gb is such a practical approach to infinity). If each rule has a size of one byte, which, of course, is impossible, then we get it. byte for all the rules. Somewhat more reasonable. But, of course, the rules are not unique to each sector. We will set the rules for the ranges of sectors. This leads us to a set of rules:
We will also try to pack the sectors themselves. Since we are not faced with the task of compressing data - the data themselves are not available in the device and are taken from the Web, you just need to somehow provide more or less compact presentation of the service data of the file system itself. At first glance, there are many repeating sequences in this data, so we will encode as follows: repetitive sequences are presented as Non- repeating sequences , presented as In addition, those sequences that we have already encoded, or it would be nice not to re-insert them, but refer to them. Probably, we have another sequence. Perhaps in the process of implementation, other sequences may appear for a more compact representation of the file system structures.
All this is very similar to the command system of a virtual machine, and since there are calls, that is, a stack. The simplest known virtual machine is one of the varieties of a fort. In fact, this is a reverse
Polish entry on steroids, with an added stack of return addresses from calls, which eliminates the need to bother with organizing function frames: everything is extremely simple - when returning from a call, remove the
top word from the R stack and go to the address to which it points .
In addition, the token threaded code (and this will be it) for the two-stack machine has a very good density, which, in this case, suits us very well.
Interpreting such a code is fast, it turns out just an average of five times slower than the native code, and also very simple.
So, we have some kind of coding system, a system of rules and some kind of virtual machine on which these rules should be played.
It remains to generate these rules from a certain description, get a bytecode and implement a machine for its interpretation. And only then we will see what happened.
With the implementation of a virtual machine, the situation is simple: it will work on a microcontroller, respectively, so far there are no C options here. True, it is possible that there is nothing to write there - it will turn out to be generated in some
way.
All that remains is the generation of the rule system from the description, the description itself, the code generation and the description of the commands of this code. In addition, it would be good to check the rules not sequentially, but somehow more reasonably: organize checks in the form
comparison tree, so that the number of comparisons per sector would be of the order of the binary logarithm of the number of comparisons.
We’ve finished the initial analysis, we would have to make a prototype and see what we get.
We will need to generate and, possibly, read binary data of various dimensions and endianesses (FAT service data are written in low-endian format) and work with nested data structures.
What would it be implemented on? C, C ++ or maybe Python? Or Ruby? Joke.
Of course, we will do it at Haskell: the task is not the simplest, it requires some kind of performance, and we have little time. Well, anyway, the server that will call this code is also implemented on
Haskell, so the choice is quite natural.
Let's get started.
The central thing in the system is the “rules”. They convert the description of the file system, the code is generated from them. We describe them: In addition, there is a description of the file system itself, which consists of directories and files, with some specifics of the FAT itself. Here we dwell in more detail. The strange constructors DirDor and DirDotDot are nothing more than the '.' Directories and '..', which - here's a surprise - are top-notch, physically present directory entries. Fortunately, they are only links and do not require the allocation of clusters. Otherwise, everything is pretty obvious: the first attribute of type constructors is a unique identifier. It obviously can be useful to us in order to understand the firmware from which “file” the data was requested.
The second attribute is the file name. In the case of a file, we also add its size and data. This, of course, is not the data of the file itself, but some indication of the firmware of the device where to get this data. There you can write, for example, a syssh structure or stream URL. Therefore ByteString.
Now we need to somehow construct the Entry, taking into account the requirements of the file system: each directory, except the root, must contain the entries '.' and '..', they must refer to the corresponding directories, should not be
identical record names, names should not contain forbidden characters and so on and so forth. It can be seen that manually creating this structure is difficult, in addition, the user of the API must deal with this, and he will definitely mix something up and everything will break, but this is a serious matter. So it’s better to prohibit the import of Entry-type content from our module, and provide the user with some more convenient and error-protected solution. Something like: It looks good, even not knowing the language can understand what is described here. It’s easy to implement: to generate something, there is already a ready-made Monad Writer.
In addition, we will need to distribute unique identifiers, so State is also useful, where we put some kind of counter. Since we want to cross State and Writer, the monad transformer will not hurt us. Something like this: Each function takes parameters like name, size, and another monadic value for constructing nested records. Each such calculation will be run in a separate Writer, and the State will be dragged along to ensure that the identifiers are unique. So, we have set the directory structure, now we need to get the rules out of it somehow. To do this, somehow place the data files and directories on the "disk". We assume that they are placed sequentially, first directories, then files:
The code as a whole is quite obvious: we take all the records, remove the '.' and '..' which do not have their own clusters, but only point to foreign ones, make directories go first, then files (there is no difference, but it’s more logical,
and the volume table will be faster to read), we select sectors (it’s more convenient for us work with sectors, “clusters” is an artificial concept) and that’s it.
It is worth noting the universe function from the uniplate module. It allows you to list all the elements of a nested structure in a list (if desired, with list comprehension) to avoid the routine writing of recursive traversal functions.
It is for her sake that we declared the type Entry deriving (Data, Typeable) above.
Having files placed by sectors, it costs us nothing to generate rules for them:
the encodeBlock function here can encode a ByteString into a sequence of rules, writeEntries generates directory entries and encodes them, and mergeRule tries to combine the range of sectors of successive rules.
The generation of one directory entry looks something like this: Here we use the exceptionally useful PutM monad from Data.Binary.Put, which allows you to output data of any capacity and endianity to a lazy byte string. So, we have the directory structure of the FAT volume, we have their allocation by sector and the corresponding rules. What do we have left? Here you need to step back a little and remember the FAT device. If you do not go into unnecessary details that are widely available on the Web and in literature, then FAT32 is designed like this:
So far, we only have rules for DATA. FAT1 and FAT2 are cluster allocation tables. Each file or directory (which is also a file) occupies a chain of clusters in the data area, and each cluster in the data area is represented by a 32-bit value in FAT1 and FAT2 (they are identical).
Each FAT cell contains the number of the next cluster of the file, the last cluster is marked with a special value. The number of the first cluster of the file is indicated in the directory entry. The data we have is placed sequentially, so that in each cell of the chain the number N + 1 will be written, where N is the previous value.
Here the first problem arises: for our calculated 10 x 20Gb, this table will occupy as many as 655360 32-bit values, which again exceeds the available RAM. However, these rules cannot be compressed.
our primitive RLE packing algorithm, since there are no duplicate values. However, since we were able to generate this sequence once, we will probably be able to generate it again, already on the device.
A closer look showed that the values in one sector of the allocation table depend on the maximum value in the previous one, and in general, the sequence is determined by the expression:
where Na is the first value for this sector, Nsect is the number of the requested sector (it will be at the top of the stack of our fort machine), M, BASE and STEP are the constants calculated statically, Ni is the i-th number of the sequence, and in total in the sector, obviously 512/4.
Thus, we have acquired a new sequence that generates a series of values based on dynamic data (sector numbers). We add types for this sequence and adjacent ones: Looking ahead, we will add another rule for the callback, which should be called after the generation of the file data sector so that the device firmware takes a buffer and fills it with real data.
It would be possible to immediately generate a table in the form of a set of rules, but for some reason I needed it in binary form, in addition there is already a debugged function for encoding binary strings, and it is easy to
make mistakes in direct generation .
This table is quite large, and in the case of a large data area and a small cluster size, poor Haskell has a hard time.
At some point, from a large lazy Word32 list, the application felt really bad, so I had to quickly rewrite it to lazy byte lines and use runPut / runGet to put 32-bit values there and retrieve them.
Surprisingly, this led to an acceleration of about ten times and everything began to work at an acceptable speed, although, of course, you should rewrite it in such a way as to immediately generate rules and not create data.
But for the concept, it will.
We omit the generation functions of the table and the rules for it, they are quite large, but quite obvious: The table encoding function first associates each sector with one REQ a rule (NSER _ _ _), then considers sectors in pairs and if two sectors form a common sequence of values, then the rule for the sector is replaced by the rule for the range of sectors, the result is compact enough to bring it here:
It looks promising, clearly better than a two megabyte piece of data.
The second copy of the table is accurate to constants, so in the future you can replace this sequence by subtracting the constant from the offset and calling the first table. But that later.
So, we have FAT1, FAT2 and DATA. It remains to get only BootSect and FAT32 Info. This is static binary data, so we again use Data.Binary.Put, and then we pack it into the rules.
These two modules (Put and Get) are literally indispensable and personally I quote them higher than the binary patterns in Erlang, although this is subjective. We put our packer on the result, merge the rules into ranges, and get the final list of rules that describes our entire file system.
So, we have a set of rules. It remains to generate a comparison tree for them and
compile it all in bytecode.
Let's start with the tree: It may not be the best option, but the rules turned out to be less than a hundred, so far you should not worry. It is up to the virtual machine, the set of commands, and the compiler: Alas, here a simple Haskell type system starts to be missed: I want to set compile-time invariants for the commands and their classes, so that, for example, it would be impossible to create a command with the wrong opcode. But you just can’t do this, but you don’t want to introduce a separate type for each opcode, an existential data type for a command and still do not want to use metaprogramming to generate opcodes.
Put it off until better times, let’s get by with what we have. Anyway, for the implementation of the virtual machine, you will have to write tests, so the errors that come up will pop up there.
So, there is a virtual machine command system, now you need to compile a comparison tree into it, built from our rules: Here we use the favorite way to generate all sorts of things using eDSL built on top of the Writer monad. The generation of flat code from the comparison tree leads to a lot of “snot”, for example, to long chains of exit from blocks:
jumping into the next blocks and so on. normalize eliminates these disgraces, and breaks the code into blocks, each of which begins with a label, and ends with an unconditional jump to the next block command. There are no conditional or unconditional jump commands inside the block; they are valid only at the end. We need such blocks to calculate label offsets. After it, you can merge blocks, getting rid of unnecessary transitions completely.
We will write the Show instance for our bytecode for beautiful fort printing and see what we get after optimizing the blocks: Not perfect, but there are no snot, the general code is partially allocated to the procedures, there is a branch tree. It will do. It remains to run it on something, for this we need to implement, finally, the virtual machine itself.
It can be simply written in C, since only the opcodes change significantly, but experience has shown that it is better to generate it all than to monitor the consistency of the opcodes and C code later. There is no way to verify this, and the situation when the compiler produces one, and vm wants to interpret something completely different, is quite likely. So it’s better to generate everything. Again, we outline the mini-eDSL for generating C, so as not to bother closing the brackets, indents and semicolons.
Again Writer, no variety ... Let's see what we got:
Well, that should be. An important nuance: in order for switch to be compiled into the transition table, it is necessary that the values of its labels go sequentially and do not have holes. And probably fit in bytes. In case of violation of these heuristics, C compilers can generate a comparison tree, which in this case does not suit us at all. We provided the opcode sequence with the definition of the Enum instance for our Opcode type (see above).
What a pity that such a low-level, it would seem, C does not have a standard way to navigate to a variable address, even though GCC supports such an extension. Only not for all interesting platforms there is GCC, so we restrict ourselves to switch-based interpretation.
Our virtual machine is ready. Let's write tests for her. This is easy - let the test VM take a bytecode stream as input, generate the contents of the buffer as a result of their interpretation and send it to the output stream. Each test case, therefore, will be considered passed if the contents of the buffer ultimately meet expectations.
We’ll write tests ... ... and test cases: and a shell to run them: Run, fix all the problems and crashes in the core (surprisingly few) and run everything together: Everything works as expected: the image of the file system is generated, checked and mounted. The content is as described on our eDSL.
The size of the compiled rules file in this case is a little over 2Kb and lends itself to further optimization, 2Kb is quite an acceptable size for dynamic downloads even via GSM / EDGE, not to mention 3G.
Fort performance also lends itself to optimization, not to mention the fact that in the most extreme case it can be compiled in C and then into the native processor code.
Here's a short story about the benefits of Haskell in the national economy.
The task, at first glance, is simple: for each request to read a block (SCSI command) we give the contents of this block. A block can either belong to any of the “files” or contain FAT overhead information.
The first thought was, of course, to pack the FAT image with, for example, bzip2 and unpack it on the device as needed. Three problems arise immediately:
- A compressed image of the file system from the heel of 100 megabyte files and several directories takes something around 125Kb, which exceeds the amount of RAM on the device.
- Bzip2, apparently, does not know anything about sectors and works at the thread level, therefore, it will be necessary to pack sector-by-sector, therefore, the image size will be extremely prohibitive: as you know, packing with small blocks is unprofitable - the overhead is large compared to the gain from compression.
- Image preparation (mounting an empty file, creating a file system, packing) takes a lot of time and is hardly acceptable for implementation on a server with estimated thousands of users. And the image itself in the process of creation requires some fairly noticeable disk space.
Well, not to mention the fact that you have to port bzip2 to the microcontroller.
So something else had to be come up.
The problem can be posed as follows: it is necessary to write a code that will take some description of the file system as an input in some form, and return its contents for each request of the sector number. The content is either service information or file data, which are taken from the corresponding audio stream at a given URL.
This formulation of the question leads us to a system of rules:
Номер сектора => Содержимое сектора
Note that we are talking about sectors, not “clusters,” since a cluster is a concept of the FAT file system itself. Devices work at the block level, they are also sectors. Let's say our “playlist” contains 10 “files” of 2Gb each (2Gb is such a practical approach to infinity). If each rule has a size of one byte, which, of course, is impossible, then we get it. byte for all the rules. Somewhat more reasonable. But, of course, the rules are not unique to each sector. We will set the rules for the ranges of sectors. This leads us to a set of rules:
2*1024*1024*1024 * 10 / 512 = 41 943 040
Номер сектора (A) => Содержимое сектора
Диапазон секторов (A,B) => Содержимое сектора
We will also try to pack the sectors themselves. Since we are not faced with the task of compressing data - the data themselves are not available in the device and are taken from the Web, you just need to somehow provide more or less compact presentation of the service data of the file system itself. At first glance, there are many repeating sequences in this data, so we will encode as follows: repetitive sequences are presented as Non- repeating sequences , presented as In addition, those sequences that we have already encoded, or it would be nice not to re-insert them, but refer to them. Probably, we have another sequence. Perhaps in the process of implementation, other sequences may appear for a more compact representation of the file system structures.
(Признак RLE, Количество, Символ)
(Признак Sequence, Последовательность символов)
(Вызов последовательности, идентификатор)
All this is very similar to the command system of a virtual machine, and since there are calls, that is, a stack. The simplest known virtual machine is one of the varieties of a fort. In fact, this is a reverse
Polish entry on steroids, with an added stack of return addresses from calls, which eliminates the need to bother with organizing function frames: everything is extremely simple - when returning from a call, remove the
top word from the R stack and go to the address to which it points .
In addition, the token threaded code (and this will be it) for the two-stack machine has a very good density, which, in this case, suits us very well.
Interpreting such a code is fast, it turns out just an average of five times slower than the native code, and also very simple.
So, we have some kind of coding system, a system of rules and some kind of virtual machine on which these rules should be played.
It remains to generate these rules from a certain description, get a bytecode and implement a machine for its interpretation. And only then we will see what happened.
With the implementation of a virtual machine, the situation is simple: it will work on a microcontroller, respectively, so far there are no C options here. True, it is possible that there is nothing to write there - it will turn out to be generated in some
way.
All that remains is the generation of the rule system from the description, the description itself, the code generation and the description of the commands of this code. In addition, it would be good to check the rules not sequentially, but somehow more reasonably: organize checks in the form
comparison tree, so that the number of comparisons per sector would be of the order of the binary logarithm of the number of comparisons.
We’ve finished the initial analysis, we would have to make a prototype and see what we get.
We will need to generate and, possibly, read binary data of various dimensions and endianesses (FAT service data are written in low-endian format) and work with nested data structures.
What would it be implemented on? C, C ++ or maybe Python? Or Ruby? Joke.
Of course, we will do it at Haskell: the task is not the simplest, it requires some kind of performance, and we have little time. Well, anyway, the server that will call this code is also implemented on
Haskell, so the choice is quite natural.
Let's get started.
The central thing in the system is the “rules”. They convert the description of the file system, the code is generated from them. We describe them: In addition, there is a description of the file system itself, which consists of directories and files, with some specifics of the FAT itself. Here we dwell in more detail. The strange constructors DirDor and DirDotDot are nothing more than the '.' Directories and '..', which - here's a surprise - are top-notch, physically present directory entries. Fortunately, they are only links and do not require the allocation of clusters. Otherwise, everything is pretty obvious: the first attribute of type constructors is a unique identifier. It obviously can be useful to us in order to understand the firmware from which “file” the data was requested.
data Rule = REQ Int [Chunk] | RANGE Int Int [Chunk] deriving Show
data Chunk = SEQ BS.ByteString
| RLE Int Word8
deriving (Eq, Ord)
data Entry = DirRoot Int [Entry]
| DirDot Int
| DirDotDot Int
| Dir Int String [Entry]
| File Int String Int BS.ByteString
deriving (Eq, Ord, Data, Typeable, Show)
The second attribute is the file name. In the case of a file, we also add its size and data. This, of course, is not the data of the file itself, but some indication of the firmware of the device where to get this data. There you can write, for example, a syssh structure or stream URL. Therefore ByteString.
Now we need to somehow construct the Entry, taking into account the requirements of the file system: each directory, except the root, must contain the entries '.' and '..', they must refer to the corresponding directories, should not be
identical record names, names should not contain forbidden characters and so on and so forth. It can be seen that manually creating this structure is difficult, in addition, the user of the API must deal with this, and he will definitely mix something up and everything will break, but this is a serious matter. So it’s better to prohibit the import of Entry-type content from our module, and provide the user with some more convenient and error-protected solution. Something like: It looks good, even not knowing the language can understand what is described here. It’s easy to implement: to generate something, there is already a ready-made Monad Writer.
fileContents = ...
fatSample2 = filesystem $ do
file "file0" (16384) fileContents
dir "A" $ do
file "file1" (megs 100) fileContents
dir "C" $ do
file "file3" (megs 100) fileContents
file "file4" (megs 100) fileContents
file "file5" (megs 100) fileContents
dir "E" $ emptyDir
dir "B" $ do
file "file2" (megs 50) emptyFile
In addition, we will need to distribute unique identifiers, so State is also useful, where we put some kind of counter. Since we want to cross State and Writer, the monad transformer will not hurt us. Something like this: Each function takes parameters like name, size, and another monadic value for constructing nested records. Each such calculation will be run in a separate Writer, and the State will be dragged along to ensure that the identifiers are unique. So, we have set the directory structure, now we need to get the rules out of it somehow. To do this, somehow place the data files and directories on the "disk". We assume that they are placed sequentially, first directories, then files:
newtype EntryIdT m a = EntryIdT {
runF :: (WriterT [Entry] (StateT (Int, Int) m)) a
} deriving (Monad, MonadWriter [Entry], MonadState (Int, Int))
type EntryIdM = EntryIdT Identity
runEntryIdM :: (Int, Int) -> EntryIdM a -> ([Entry], (Int, Int))
runEntryIdM init f = runState (execWriterT (runF f)) init
filesystem :: EntryIdM () -> Entry
filesystem f = DirRoot 0 dirs
where dirs = fst $ runEntryIdM (1,0) f
dir :: String -> EntryIdM () -> EntryIdM ()
file :: String -> Int -> (EntryInfo -> BS.ByteString) -> EntryIdM ()
data AllocEntry = AllocEntry { beginSect :: Int
, endSect :: Int
, entry :: Entry
} deriving (Show)
allocate :: ClustSize32 -> Int -> Entry -> [AllocEntry]
allocate cl from = eFix . eAlloc . eOrder . filter eFilt . universe
where eFilt (File _ _ _ _) = True
eFilt (Dir _ _ _) = True
eFilt (DirRoot _ _) = True
eFilt _ = False
eOrder = uncurry (++) . partition (not.isFile)
eAlloc = reverse . snd . foldl fentry (from, [])
fentry (n, xs) e =
let sectors = entryLen cl e `div` fatSectLen
begin = n
end = begin + sectors - 1
n' = n + sectors
allocated = AllocEntry begin end e
in (n', allocated : xs)
eFix = id
The code as a whole is quite obvious: we take all the records, remove the '.' and '..' which do not have their own clusters, but only point to foreign ones, make directories go first, then files (there is no difference, but it’s more logical,
and the volume table will be faster to read), we select sectors (it’s more convenient for us work with sectors, “clusters” is an artificial concept) and that’s it.
It is worth noting the universe function from the uniplate module. It allows you to list all the elements of a nested structure in a list (if desired, with list comprehension) to avoid the routine writing of recursive traversal functions.
It is for her sake that we declared the type Entry deriving (Data, Typeable) above.
Having files placed by sectors, it costs us nothing to generate rules for them:
generateData :: Maybe CalendarTime -> ClustSize32 -> [AllocEntry] -> [Rule]
generateData ct cl es = mergeRules $ execWriter $ do
forM_ es $ \(AllocEntry {beginSect = a, endSect = b, entry = e}) -> do
case e of
DirRoot _ es -> writeEntries a b es
Dir _ _ es -> writeEntries a b es
File _ _ _ bs -> tell [RANGE a b (encodeBlock (BS.take (fatSectLen) bs))]
where
...
the encodeBlock function here can encode a ByteString into a sequence of rules, writeEntries generates directory entries and encodes them, and mergeRule tries to combine the range of sectors of successive rules.
The generation of one directory entry looks something like this: Here we use the exceptionally useful PutM monad from Data.Binary.Put, which allows you to output data of any capacity and endianity to a lazy byte string. So, we have the directory structure of the FAT volume, we have their allocation by sector and the corresponding rules. What do we have left? Here you need to step back a little and remember the FAT device. If you do not go into unnecessary details that are widely available on the Web and in literature, then FAT32 is designed like this:
entryRecordShort :: String
-> Int
-> Int
-> Maybe CalendarTime
-> [ATTR]
-> BS.ByteString
entryRecordShort nm size clust clk a = runPut $ do
putNameASCII nm -- Name
putWord8 (fatAttrB a) -- Attr
putWord8 0 -- NTRes
putWord8 0 -- CrtTimeTenth
putWord16le cT -- CrtTime
putWord16le cD -- CrtDate
putWord16le cD -- LstAccDate
putWord16le cHi -- FstClusHI
putWord16le cT -- WrtTime
putWord16le cD -- WrdDate
putWord16le cLo -- FstClusLO
putWord32le (fromIntegral size) -- FileSize
where ...
| BootSect | FAT32 Info | FAT1 | FAT2 | DATA |
So far, we only have rules for DATA. FAT1 and FAT2 are cluster allocation tables. Each file or directory (which is also a file) occupies a chain of clusters in the data area, and each cluster in the data area is represented by a 32-bit value in FAT1 and FAT2 (they are identical).
Each FAT cell contains the number of the next cluster of the file, the last cluster is marked with a special value. The number of the first cluster of the file is indicated in the directory entry. The data we have is placed sequentially, so that in each cell of the chain the number N + 1 will be written, where N is the previous value.
Here the first problem arises: for our calculated 10 x 20Gb, this table will occupy as many as 655360 32-bit values, which again exceeds the available RAM. However, these rules cannot be compressed.
our primitive RLE packing algorithm, since there are no duplicate values. However, since we were able to generate this sequence once, we will probably be able to generate it again, already on the device.
A closer look showed that the values in one sector of the allocation table depend on the maximum value in the previous one, and in general, the sequence is determined by the expression:
Na = BASE + (Nsect - M) * STEP Ni <- [Na, Na + 1 ..]
where Na is the first value for this sector, Nsect is the number of the requested sector (it will be at the top of the stack of our fort machine), M, BASE and STEP are the constants calculated statically, Ni is the i-th number of the sequence, and in total in the sector, obviously 512/4.
Thus, we have acquired a new sequence that generates a series of values based on dynamic data (sector numbers). We add types for this sequence and adjacent ones: Looking ahead, we will add another rule for the callback, which should be called after the generation of the file data sector so that the device firmware takes a buffer and fills it with real data.
data Chunk = SEQ BS.ByteString
| RLE Int Word8
| SER Word32 Word32
| NSER Word32 Int Word32 -- base offset step
| CALLBACK Word8
deriving (Eq, Ord)
It would be possible to immediately generate a table in the form of a set of rules, but for some reason I needed it in binary form, in addition there is already a debugged function for encoding binary strings, and it is easy to
make mistakes in direct generation .
This table is quite large, and in the case of a large data area and a small cluster size, poor Haskell has a hard time.
At some point, from a large lazy Word32 list, the application felt really bad, so I had to quickly rewrite it to lazy byte lines and use runPut / runGet to put 32-bit values there and retrieve them.
Surprisingly, this led to an acceleration of about ten times and everything began to work at an acceptable speed, although, of course, you should rewrite it in such a way as to immediately generate rules and not create data.
But for the concept, it will.
We omit the generation functions of the table and the rules for it, they are quite large, but quite obvious: The table encoding function first associates each sector with one REQ a rule (NSER _ _ _), then considers sectors in pairs and if two sectors form a common sequence of values, then the rule for the sector is replaced by the rule for the range of sectors, the result is compact enough to bring it here:
type ClusterTable = BS.ByteString
genFAT :: ClustSize32 -> Int -> [AllocEntry] -> ClusterTable
encodeFAT :: Int -> ClusterTable -> [Rule]
REQ 32 [SEQ [F8], RLE 2 255, SEQ [0F], RLE 3 255, SEQ [0F], RLE 3 255, SEQ [0F], RLE 3 255, SEQ [0F], RLE 3 255, SEQ [0F], RLE 3 255, SEQ [0F], RLE 3 255, SEQ [0F], SEQ [08], RLE 3 0, SEQ [09], RLE 3 0, SEQ [0A], RLE 3 0, RLE 3 255, SEQ [0F], SER 12 128] RANGE 33,231 [NSER 129 33 128] REQ 232 [SER 25601 25610, RLE 3 255, SEQ [0F], SER 25612 25728] RANGE 233 431 [NSER 25729 233 128] REQ 432 [SER 51201 51210, RLE 3 255, SEQ [0F], SER 51212 51328] RANGE 433 631 [NSER 51329 433 128] REQ 632 [SER 76801 76810, RLE 3 255, SEQ [0F], SER 76812 76928] RANGE 633 831 [NSER 76929 633 128] REQ 832 [SER 102401 102410, RLE 3 255, SEQ [0F], SER 102412 102528] RANGE 833 931 [NSER 102529 833 128] REQ 932 [SER 115201 115210, RLE 3 255, SEQ [0F], RLE 468 0] RANGE 933 1056 [RLE 512 0]
It looks promising, clearly better than a two megabyte piece of data.
The second copy of the table is accurate to constants, so in the future you can replace this sequence by subtracting the constant from the offset and calling the first table. But that later.
So, we have FAT1, FAT2 and DATA. It remains to get only BootSect and FAT32 Info. This is static binary data, so we again use Data.Binary.Put, and then we pack it into the rules.
These two modules (Put and Get) are literally indispensable and personally I quote them higher than the binary patterns in Erlang, although this is subjective. We put our packer on the result, merge the rules into ranges, and get the final list of rules that describes our entire file system.
fatGenBoot32 :: FAT32GenInfo -> BS.ByteString
fatGenBoot32 info = addRsvd $ runPut $ do
-- BOOT AREA sect0
putBytes [0xEB, 0x58, 0x90] -- 0 JmpBoot
putBytes bsOEMName -- OEMName
putWord16le bps -- BytesPerSec
putWord8 spc -- SectPerClust
putWord16le rsvd -- ReservedSecCnt
putWord8 2 -- NumFATs
putWord16le 0 -- RootEntCnt
putWord16le 0 -- TotSec16
putWord8 0xF8 -- Media
putWord16le 0 -- FAT16Sz
putWord16le 0x3F -- SectPerTract
putWord16le 0xFF -- NumHeads
putWord32le 0 -- HiddSec
putWord32le sectNum -- TotSec32
-- FAT32 Structure
putWord32le fsect -- FATSz32
-- ...
-- и так далее
So, we have a set of rules. It remains to generate a comparison tree for them and
compile it all in bytecode.
Let's start with the tree: It may not be the best option, but the rules turned out to be less than a hundred, so far you should not worry. It is up to the virtual machine, the set of commands, and the compiler: Alas, here a simple Haskell type system starts to be missed: I want to set compile-time invariants for the commands and their classes, so that, for example, it would be impossible to create a command with the wrong opcode. But you just can’t do this, but you don’t want to introduce a separate type for each opcode, an existential data type for a command and still do not want to use metaprogramming to generate opcodes.
data CmpTree = GEQ Int CmpTree CmpTree | CODE [Rule]
deriving (Show)
mkCmpTree :: [Rule] -> CmpTree
mkCmpTree r = mkTree' rulemap
where rulemap = M.fromList $ map (\x -> (fsect x, x)) r
splitGeq n m =
let (a, b, c) = M.splitLookup n m
in (a, c `M.union` (maybe M.empty (M.singleton n) b))
mkTree' xs | M.null xs = CODE []
| M.size xs < 3 = CODE (map snd (M.toList xs))
| otherwise =
let ks = map fst $ M.toAscList xs
n = ks !! (length ks `div` 2)
(le, geq) = splitGeq n xs
in GEQ n (mkTree' le) (mkTree' geq)
- классификатор команд, команды могут кодироваться разнообразно
- для достижения максимальной компактности
class OpcodeCL a where
isRLE :: a -> Bool
arity0 :: a -> Bool
arity1 :: a -> Bool
arity2 :: a -> Bool
arity3 :: a -> Bool
firstCode :: a
lastCode :: a
data Opcode = DUP | DROP
| CONST | CRNG
| JNZ | JZ | JGQ | JNE | JMP | CALLT | CALL | RET
| NOT | EQ | NEQ | GT | LE | GQ | LQ | RNG
| LOADS2 | LOADS3 | LOADS4 | LOADS5 | LOADS6 | LOADS7
| LOADS8 | LOADS9 | LOADS10 | LOADSN
| SER | NSER | NSER128
| RLE1 | RLE2 | RLE3 | RLE4 | RLE5 | RLE6 | RLE7 | RLE8
| RLE16 | RLE32 | RLE64 | RLE128 | RLE256 | RLE512 | RLEN
| OUTLE | OUTBE | OUTB
| NOP
| CALLN
| DEBUG
| EXIT
deriving (Eq, Ord, Enum, Show)
data CmdArg = W32 Word32 | W16 Word16 | W8 Word8 | ADDR Addr
data Addr = ALabel Label | AOffset Int
data Cmd = Cmd0 Opcode
| CmdConst Word32
| Cmd1 Opcode CmdArg
| Cmd2 Opcode CmdArg CmdArg
| Cmd3 Opcode CmdArg CmdArg CmdArg
| CmdJmp Opcode Addr
| CmdCondJmp Opcode Addr
| CmdLabel Label
| RawByte Word8
type Label = Int
type Block = (Label, [Cmd])
Put it off until better times, let’s get by with what we have. Anyway, for the implementation of the virtual machine, you will have to write tests, so the errors that come up will pop up there.
So, there is a virtual machine command system, now you need to compile a comparison tree into it, built from our rules: Here we use the favorite way to generate all sorts of things using eDSL built on top of the Writer monad. The generation of flat code from the comparison tree leads to a lot of “snot”, for example, to long chains of exit from blocks:
mkVMCode :: CmpTree -> [Block]
mkVMCode xs = normalize maxl code
-- skip
scanT :: CmpTree -> GenM ()
scanT (GEQ n left right) = do
s <- newLabel
l <- runGen' (scanT left) >>= withLabel
r <- runGen' (scanT right) >>= withLabel
_ex <- newLabel
label s
dup
cnst n
jgq (labelOf r)
block l >> jmp _ex
block r >> label _ex
scanT (CODE []) = op0 EXIT
scanT (CODE rules) = mapM_ scanR rules
scanR :: Rule -> GenM ()
scanR ( REQ n code ) = do
s <- newLabel
code' <- runGen' (scanmC code) >>= withLabel
ex <- newLabel
label s
dup
cnst n
jne ex
block code'
label ex
scanR ( RANGE a b code ) = do
s <- newLabel
code' <- runGen' (scanmC code) >>= withLabel
ex <- newLabel
label s
dup
crng a b
jz ex
block code'
label ex
-- skip
L1:
...
JMP L2
L2:
JMP L3
L3:
JMP L4
L4:
EXIT
jumping into the next blocks and so on. normalize eliminates these disgraces, and breaks the code into blocks, each of which begins with a label, and ends with an unconditional jump to the next block command. There are no conditional or unconditional jump commands inside the block; they are valid only at the end. We need such blocks to calculate label offsets. After it, you can merge blocks, getting rid of unnecessary transitions completely.
We will write the Show instance for our bytecode for beautiful fort printing and see what we get after optimizing the blocks: Not perfect, but there are no snot, the general code is partially allocated to the procedures, there is a branch tree. It will do. It remains to run it on something, for this we need to implement, finally, the virtual machine itself.
...
L215:
DUP
CONST 2122
JGQ L220
DUP
CRNG 00000843 00000849
JZ L235
RLE512 00
EXIT
L220:
DUP
CRNG 0000084A 000C8869
JZ L223
LOADS2
BYTE 48
BYTE 45
RLE2 4C
LOADS7
BYTE 4F
BYTE 20
BYTE 57
BYTE 4F
BYTE 52
BYTE 4C
BYTE 44
RLE2 21
CALLN 00
EXIT
L223:
DUP
CRNG 000C886A 000E1869
JZ L235
RLE512 00
CALLN 00 ;; а вот и вызов коллбэка прошивки --- значит, это сектор данных
EXIT ;; файла
L235:
EXIT
...
L0:
LOADS5
BYTE 02
BYTE 08
BYTE 20
BYTE 00
BYTE 02
RET
...
It can be simply written in C, since only the opcodes change significantly, but experience has shown that it is better to generate it all than to monitor the consistency of the opcodes and C code later. There is no way to verify this, and the situation when the compiler produces one, and vm wants to interpret something completely different, is quite likely. So it’s better to generate everything. Again, we outline the mini-eDSL for generating C, so as not to bother closing the brackets, indents and semicolons.
Again Writer, no variety ... Let's see what we got:
stubs :: String
stubs =
envFile $ do
comment "top of the file"
put "#include <stdint.h>"
put "#include \"emufatstubs.h\""
defines
...
stmt (pt codeType ++ op `assign` "code")
endl
push a "n"
put "for(;;)"
braces $ indented $ do
put "switch(*op)"
braces $ do
forM_ codes $ \op -> do
put (printf "case %s:" (show op))
indented $ decode op
endl
put "default:"
indented $ exit
exitLabel
indented $ stmt "return 0"
...
decode (CRNG) = do
skip "1"
stmt (tmp0 `assign` pop a)
stmt (tmp1 `assign` decode32) >> skip "4"
stmt (tmp2 `assign` decode32) >> skip "4"
push a ( _and (tmp0 `gq` tmp1) (tmp0 `lq` tmp2) )
next
decode (CALL) = do
skip "1"
stmt (tmp0 `assign` decode32) >> skip "4"
stmt (push' r pc')
jump tmp0
...
#define DEFSTACK(n, t, l) ...
#define RESET(a) ...
#define PTOP(a) ...
#define TOP(a) ...
#define POP(a) ...
#define PUSH(a,v) ...
#define NEXT(x) ...
#define JUMP(x, b, o) ...
#define SKIP(x, n) ...
#define PC(x, b) ...
#define DECODE32(op) ...
#define DECODE8(op) ...
...
DEFSTACK(a, uint32_t, 16);
DEFSTACK(r, uint32_t, 8);
uint32_t tmp0;
uint32_t tmp1;
uint32_t tmp2;
uint32_t tmp3;
...
uint8_t *op = code;
PUSH(a, n);
for(;;)
{
switch(*op)
{
...
case CRNG:
SKIP(op, (1));
tmp0 = POP(a);
tmp1 = DECODE32(op);
SKIP(op, (4));
tmp2 = DECODE32(op);
SKIP(op, (4));
PUSH(a, ((tmp0 >= tmp1) && (tmp0 <= tmp2)));
NEXT(op);
...
case CALL:
SKIP(op, (1));
tmp0 = DECODE32(op);
SKIP(op, (4));
PUSH(r, PC(op, code));
JUMP(op, code, tmp0);
...
case EXIT:
goto _exit;
default:
goto _exit;
}
}
_exit:
return 0;
...
Well, that should be. An important nuance: in order for switch to be compiled into the transition table, it is necessary that the values of its labels go sequentially and do not have holes. And probably fit in bytes. In case of violation of these heuristics, C compilers can generate a comparison tree, which in this case does not suit us at all. We provided the opcode sequence with the definition of the Enum instance for our Opcode type (see above).
What a pity that such a low-level, it would seem, C does not have a standard way to navigate to a variable address, even though GCC supports such an extension. Only not for all interesting platforms there is GCC, so we restrict ourselves to switch-based interpretation.
Our virtual machine is ready. Let's write tests for her. This is easy - let the test VM take a bytecode stream as input, generate the contents of the buffer as a result of their interpretation and send it to the output stream. Each test case, therefore, will be considered passed if the contents of the buffer ultimately meet expectations.
We’ll write tests ... ... and test cases: and a shell to run them: Run, fix all the problems and crashes in the core (surprisingly few) and run everything together: Everything works as expected: the image of the file system is generated, checked and mounted. The content is as described on our eDSL.
testJne = makeTest $ do
[l1, l2] <- replicateM 2 newLabel
cnst 1
cnst 2
jne l1
exit
label l1
cnst 0xCAFEBABE -- 1
outle
cnst 1
cnst 1
jne l2
cnst 0xCAFEBABE -- 2
outle
exit
label l2
cnst 0xFFFFFFFF
outle
tests = testSuite $ do
...
test "testJne" testJne (assert $ do
a <- getWord32le
b <- getWord32le
return $ a == 0xCAFEBABE && b == 0xCAFEBABE)
runTest :: String -> Test -> IO Bool
runTest path (T{tname=nm, tcode=code, tcheck = tc})= do
let bin = toBinary code
(inp,out,err,pid) <- runInteractiveProcess path [] Nothing Nothing
BS.hPut inp bin
hClose inp
res <- BS.hGetContents out
let r = tc res
hPutStrLn stderr (printf "test %-24s : %s" nm (if r then "PASSED" else "FAILED !"))
return r
...
case args of
...
... -> mapM_ (runTest path) tests
...
...
...
test testJgq : PASSED
test testJne : PASSED
test testCallRet1 : PASSED
...
...
helloFile = const $ BS8.pack "HELLO WORLD!!"
fatSample2 = filesystem $ do
file "file0" (16384) helloFile
dir "A" $ do
file "file1" (megs 100) helloFile
dir "C" $ do
file "file3" (megs 100) helloFile
file "file4" (megs 100) helloFile
file "file5" (megs 100) helloFile
dir "E" $ emptyDir
dir "B" $ do
file "file2" (megs 50) emptyFile
...
$ ./FatGen bin | cbits/genfat 1000000 > fat.img
521106 / 1000000 ( 13027 kb/s)
$ fsck.vfat ./fat.img
dosfsck 3.0.9, 31 Jan 2010, FAT32, LFN
Free cluster summary uninitialized (should be 15863)
./fat.img: 10 files, 115209/131072 clusters
$ sudo mount -o loop ./fat.img /mnt/test2/
$ find /mnt/test2/
/mnt/test2/
/mnt/test2/FILE0
/mnt/test2/A
/mnt/test2/A/FILE1
/mnt/test2/A/C
/mnt/test2/A/C/FILE3
/mnt/test2/A/C/FILE4
/mnt/test2/A/C/FILE5
/mnt/test2/A/C/E
/mnt/test2/B
/mnt/test2/B/FILE2
The size of the compiled rules file in this case is a little over 2Kb and lends itself to further optimization, 2Kb is quite an acceptable size for dynamic downloads even via GSM / EDGE, not to mention 3G.
Fort performance also lends itself to optimization, not to mention the fact that in the most extreme case it can be compiled in C and then into the native processor code.
Here's a short story about the benefits of Haskell in the national economy.