[Go] [JS] Once again on processing MARC formats
Greetings. I have already written two articles about MARC formats (both on Geektimes).
Today's article covers the technical details: I cleaned up the code of my solution, removed the magic from it, and generally tidied it up.
Under the cut: a friendship between Go and JS, and hatred of MARC formats.
So, let's start with the “core” package for working with MARC formats. It is written in Go and has 63% test coverage.
https://github.com/t0pep0/marc21
The “head” of the whole package is the MarcRecord structure:
type MarcRecord struct {
	Leader         *Leader
	directory      []*directory
	VariableFields []*VariableField
}
Only two functions work with it, a reader and a writer:
func ReadRecord(r io.Reader) (record *MarcRecord, err error)
func (mr *MarcRecord) Write(w io.Writer) (err error)
Frankly, I see no point in dwelling on them. The only thing to note is that ReadRecord returns err == io.EOF when it reaches the end of the Reader.
Moving on, we are interested in the Leader and VariableField structures, and in why VariableFields is a slice rather than a hash map. The reason: contrary to all standards and common sense, two fields with different content can share the same tag. Looking ahead, the same is true of SubField.
type Leader struct {
	length               int
	Status               byte
	Type                 byte
	BibLevel             byte
	ControlType          byte
	CharacterEncoding    byte
	IndicatorCount       byte
	SubfieldCodeCount    byte
	baseAddress          int
	EncodingLevel        byte
	CatalogingForm       byte
	MultipartLevel       byte
	LengthOFFieldPort    byte
	StartCharPos         byte
	LengthImplemenDefine byte
	Undefine             byte
}
The leader structure is, honestly, nothing interesting: just a set of flags, and the unexported fields are used only for serialization and deserialization. Two methods are attached to it, one for serialization and one for deserialization; they are called from ReadRecord and Write (the same holds for the other structures).
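The package's own deserialization code is not shown in this article, so the following is only a sketch of what such a step might look like. It assumes that the struct fields map, in order, onto the 24 character positions of a MARC 21 leader (00-04 record length, 12-16 base address of data, single bytes elsewhere). The leader type is a local copy of the struct above, and parseLeader and the sample leader string are hypothetical, not part of marc21.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// leader is a local copy of the article's Leader struct, for illustration only.
type leader struct {
	length               int
	Status               byte
	Type                 byte
	BibLevel             byte
	ControlType          byte
	CharacterEncoding    byte
	IndicatorCount       byte
	SubfieldCodeCount    byte
	baseAddress          int
	EncodingLevel        byte
	CatalogingForm       byte
	MultipartLevel       byte
	LengthOFFieldPort    byte
	StartCharPos         byte
	LengthImplemenDefine byte
	Undefine             byte
}

// leaderSize is the fixed leader length in MARC 21.
const leaderSize = 24

// parseLeader is a hypothetical helper: it splits a 24-byte leader
// into its numeric and single-byte parts.
func parseLeader(b []byte) (*leader, error) {
	if len(b) != leaderSize {
		return nil, errors.New("leader must be exactly 24 bytes")
	}
	// Positions 00-04: record length as zero-padded decimal digits.
	length, err := strconv.Atoi(string(b[0:5]))
	if err != nil {
		return nil, fmt.Errorf("record length: %w", err)
	}
	// Positions 12-16: base address of data, same encoding.
	base, err := strconv.Atoi(string(b[12:17]))
	if err != nil {
		return nil, fmt.Errorf("base address: %w", err)
	}
	return &leader{
		length:               length,
		Status:               b[5],
		Type:                 b[6],
		BibLevel:             b[7],
		ControlType:          b[8],
		CharacterEncoding:    b[9],
		IndicatorCount:       b[10],
		SubfieldCodeCount:    b[11],
		baseAddress:          base,
		EncodingLevel:        b[17],
		CatalogingForm:       b[18],
		MultipartLevel:       b[19],
		LengthOFFieldPort:    b[20],
		StartCharPos:         b[21],
		LengthImplemenDefine: b[22],
		Undefine:             b[23],
	}, nil
}

func main() {
	// Made-up sample leader, exactly 24 characters long.
	l, err := parseLeader([]byte("00714cam a2200205 a 4500"))
	if err != nil {
		panic(err)
	}
	fmt.Println(l.length, l.baseAddress, string(l.Status), string(l.Type))
}
```

The two numeric fields stay unexported in the original struct, which fits this reading: they are derived while reading or writing and are of no use to a caller editing the record.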
type VariableField struct {
	Tag           string
	HasIndicators bool
	Indicators    []byte
	RawData       []byte
	Subfields     []*SubField
}
This is the "variable field" structure. A few points worth noting right away: tags are three characters long, and RawData could have been a string, but I found a byte slice more convenient to work with. During serialization, if the field has no subfields (len(Subfields) == 0), RawData is written; otherwise RawData is ignored.
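To make the earlier point about repeated tags concrete, here is a minimal sketch of looking up fields by tag in a slice. A map keyed by tag would keep only one field per tag, while a slice preserves every occurrence and the record order. The variableField type is a trimmed local stand-in for the structure above, and fieldsByTag and the sample data are hypothetical, not part of marc21.

```go
package main

import "fmt"

// variableField is a trimmed local stand-in for the article's VariableField.
type variableField struct {
	Tag     string
	RawData []byte
}

// fieldsByTag is a hypothetical helper: it returns every field
// carrying the given tag, in the order they appear in the record.
func fieldsByTag(fields []*variableField, tag string) []*variableField {
	var out []*variableField
	for _, f := range fields {
		if f.Tag == tag {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	// Made-up data: the tag "650" occurs twice with different content.
	fields := []*variableField{
		{Tag: "650", RawData: []byte("Libraries")},
		{Tag: "245", RawData: []byte("A title")},
		{Tag: "650", RawData: []byte("Cataloging")},
	}
	for _, f := range fieldsByTag(fields, "650") {
		fmt.Printf("%s %s\n", f.Tag, f.RawData)
	}
}
```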
type SubField struct {
	Name string
	Data []byte
}
Name is a single character.
Data is a byte slice; again, a string could have been used, but I decided ...
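The rule about RawData and subfields can be sketched as follows. This is not the package's code: the byte layout is an assumption based on the usual MARC 21 / ISO 2709 convention (each subfield preceded by the delimiter 0x1F and its one-character code, each field closed by the terminator 0x1E), and fieldBody and the local types are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
)

const (
	subfieldDelimiter = 0x1F // precedes every subfield code (assumed layout)
	fieldTerminator   = 0x1E // closes every field (assumed layout)
)

// subField and variableField are trimmed local stand-ins for the article's types.
type subField struct {
	Name string
	Data []byte
}

type variableField struct {
	Tag        string
	Indicators []byte
	RawData    []byte
	Subfields  []*subField
}

// fieldBody is a hypothetical helper: it builds the bytes of one field.
// With no subfields it writes RawData; otherwise RawData is ignored and
// the indicators plus the subfields are written instead.
func fieldBody(f *variableField) []byte {
	var buf bytes.Buffer
	if len(f.Subfields) == 0 {
		buf.Write(f.RawData)
	} else {
		buf.Write(f.Indicators)
		for _, sf := range f.Subfields {
			buf.WriteByte(subfieldDelimiter)
			buf.WriteString(sf.Name)
			buf.Write(sf.Data)
		}
	}
	buf.WriteByte(fieldTerminator)
	return buf.Bytes()
}

func main() {
	control := &variableField{Tag: "001", RawData: []byte("12345")}
	title := &variableField{
		Tag:        "245",
		Indicators: []byte("10"),
		RawData:    []byte("ignored"), // dropped because Subfields is non-empty
		Subfields:  []*subField{{Name: "a", Data: []byte("A title")}},
	}
	fmt.Printf("%q\n", fieldBody(control))
	fmt.Printf("%q\n", fieldBody(title))
}
```

The practical consequence is that setting RawData on a field that also has subfields has no effect on the output.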
The package has no special quirks. I will say just one thing up front: before adding a field, make sure it contains at least something besides the tag. Otherwise you risk spending a lot of time pondering lofty matters and trying to understand why the export to OPAC / IRBIS does not work.
Sample code that does not modify the data and simply copies records from one file to another:
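One way to guard against that mistake is a small check before appending. The sketch below is an assumption about how such a guard could look; appendField, errEmptyField, and the local types are hypothetical and do not exist in marc21.

```go
package main

import (
	"errors"
	"fmt"
)

// subField and variableField are trimmed local stand-ins for the article's types.
type subField struct {
	Name string
	Data []byte
}

type variableField struct {
	Tag       string
	RawData   []byte
	Subfields []*subField
}

// errEmptyField marks a field that carries a tag but no data at all.
var errEmptyField = errors.New("field has a tag but no data")

// appendField is a hypothetical helper: it refuses to add a field
// that has neither RawData nor subfields.
func appendField(fields []*variableField, f *variableField) ([]*variableField, error) {
	if len(f.RawData) == 0 && len(f.Subfields) == 0 {
		return fields, fmt.Errorf("tag %s: %w", f.Tag, errEmptyField)
	}
	return append(fields, f), nil
}

func main() {
	var fields []*variableField

	// A tag-only field is rejected and the slice stays unchanged.
	fields, err := appendField(fields, &variableField{Tag: "999"})
	fmt.Println(len(fields), err)

	// A field with data is accepted.
	fields, err = appendField(fields, &variableField{Tag: "001", RawData: []byte("12345")})
	fmt.Println(len(fields), err)
}
```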
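package main

import (
	"io"
	"os"

	"github.com/t0pep0/marc21"
)

func main() {
	orig := os.Args[1]
	result := os.Args[2]
	// Do not ignore open/create errors: a nil file would fail later in a confusing way.
	origFile, err := os.Open(orig)
	if err != nil {
		panic(err)
	}
	defer origFile.Close()
	resultFile, err := os.Create(result)
	if err != nil {
		panic(err)
	}
	defer resultFile.Close()
	for {
		rec, err := marc21.ReadRecord(origFile)
		if err != nil {
			// io.EOF signals the normal end of the input file.
			if err == io.EOF {
				break
			}
			panic(err)
		}
		// And here, do whatever you want with the record...
		err = rec.Write(resultFile)
		if err != nil {
			panic(err)
		}
	}
}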
Now let's move on to https://github.com/HerzenLibRu/BatchMarc
In essence, this is the otto JS interpreter (https://github.com/robertkrimen/otto/) combined with the library described above.
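package main

import (
	"io"
	"os"

	"github.com/t0pep0/marc21"
)

func main() {
	marcFile, err := os.Open(os.Args[1])
	if err != nil {
		panic(err)
	}
	defer marcFile.Close()
	outFile, err := os.Create(os.Args[2])
	if err != nil {
		panic(err)
	}
	defer outFile.Close()
	// Read the JS rules once; the same script is run for every record.
	jsBytes, err := os.ReadFile(os.Args[3])
	if err != nil {
		panic(err)
	}
	jsRules := string(jsBytes)
	for {
		rec, err := marc21.ReadRecord(marcFile)
		if err != nil {
			if err == io.EOF {
				break
			}
			panic(err)
		}
		if rec == nil {
			break
		}
		// A fresh destination record and JS machine per source record.
		res := new(marc21.MarcRecord)
		js := NewJSMachine(rec, res)
		err = js.Run(jsRules)
		if err != nil {
			panic(err)
		}
		if err = res.Write(outFile); err != nil {
			panic(err)
		}
	}
}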
The difference from the previous code is that here we also read the JS file and create a JS machine for each record, passing it the rules.
Let's take a closer look at the JS machine and its constructor.
type jsMachine struct {
	otto        *otto.Otto
	source      *marc21.MarcRecord
	destination *marc21.MarcRecord
}

func NewJSMachine(source, destination *marc21.MarcRecord) (js *jsMachine) {
	js = new(jsMachine)
	js.otto = otto.New()
	// Evaluate the bundled JS source (classJS) before exposing the Go functions.
	js.otto.Run(classJS)
	// Expose the two Go callbacks to scripts under their JS names.
	js.otto.Set("LoadSource", js.fillSource)
	js.otto.Set("WriteResult", js.getResult)
	js.source = source
	js.destination = destination
	return js
}

// Run executes the user's rules script in this machine.
func (js *jsMachine) Run(src string) (err error) {
	_, err = js.otto.Run(src)
	return err
}
As you can see, everything is simple and unremarkable; embedding was deliberately not used.
Two functions are added to the standard otto setup, LoadSource and WriteResult, along with the class constructors (MarcRecord, Leader, VariableField, VariableSubField).
I will not go into the implementations of these functions, but I will point out one interesting thing about otto: it has an Object type to which JS values can be converted. Object has a Call method (the same applies to Set and Get), which lets you call a method of that value. The catch is that Object.Call cannot call a method on a nested object.
source := call.Argument(0)
if !source.IsObject() {
	return otto.FalseValue()
}
object := source.Object()
// This is the right way: get the nested object first, then call its method.
jsValue, _ := object.Get("VariableField")
jsVariableFields := jsValue.Object()
jsValue, _ = jsVariableFields.Call("length")
// And this is the wrong way: a dotted path is not resolved by Call.
jsValue, _ = object.Call("VariableField.length")
Notably, in this case otto complains about a type error, which is why the right solution took a long time to occur to me.
A few words about the JS side. There are no artificially created variables: simply create an instance with the MarcRecord constructor and fill it with LoadSource(instance); to send the changes back to Go at the end of the script, call WriteResult(instance).
Pull requests and issues are welcome.