January 22, 2011 at 17:11

Using bulkloader for backup, recovery and data migration

Bulkloader is an interface in Google App Engine for uploading data to and downloading data from storage on Google's servers. It is convenient for backup, recovery and migration of application data; however, documentation and usage examples are extremely scarce, and on a complex application you will run into various problems and bugs. I spent quite a while digging through various sources, reading the SDK source code and bug reports, and writing my own workarounds, and now I am ready to present some of the results as a detailed article.

Keep in mind that the article is very long.

I will not go into the details of creating App Engine applications; this topic has been covered many times and there are plenty of examples, including in Russian. Synthetic examples are usually hard to follow, though, so we will look at a "real" application: a personal blog engine, a topic familiar and understandable to everyone. As the backup file format we will choose plain XML.

I will not retell the documentation here either.

Terminology

There is no established Russian terminology, so I took some liberty and call a “Kind” either a “class” or a “type”. An entity remains an entity.

Import = deserialization = recovery. Export = serialization = backup.

Preparation

From here on we use GAE SDK version 1.4.1 or higher on a local Unix/Linux machine; on Windows everything is almost the same. Working with an application on Google's servers has certain nuances, but you can read about them in the official documentation; here we work only with the local server.

The main GAE SDK programs (appcfg.py, dev_appserver.py) must be runnable from the console (for example, with the SDK paths added to the appropriate environment variables).

What is bulkloader

Bulkloader is a Python framework, and to use it you will have to write not only configuration files but also code. Having mastered the framework, though, you get a very powerful mechanism for saving and restoring the data of your App Engine application.

You choose the format in which data is stored on the local machine, and bulkloader converts it according to rules you define (for import and export). At the end of the article there is a link where you can learn more about bulkloader.

For bulkloader to work, the application must first expose the API access point. So enable remote_api by adding the following section to the application config (app.yaml), if it is not there already:

builtins:
- remote_api: on

This section exposes the API access point at http://servername/_ah/remote_api; for a local server with default settings this is http://localhost:8080/_ah/remote_api.

Data schema

Let's start with the application data schema. Everything is clear for the blog: Articles (Article), Comments (ArticleComment), Rendered Articles (RenderedArticle). Comments are presented in a tree. An article rendered in html is stored in a separate storage entity.

Entity classes refer to each other as follows:

Article → RenderedArticle (link to the rendered article in the article)
ArticleComment → Article (link to the article from the comment)
ArticleComment → ArticleComment (link to the parent comment)

from google.appengine.ext import db


class RenderedArticle(db.Model):
    # Article body rendered to HTML, stored as a separate entity.
    html_body = db.TextProperty(required=True)


class Article(db.Model):
    shortcut = db.StringProperty(required=True)
    title = db.StringProperty(required=True)
    body = db.TextProperty(required=True)
    html_preview = db.TextProperty()
    rendered_html = db.ReferenceProperty(RenderedArticle)
    published = db.DateTimeProperty(auto_now_add=True)
    updated = db.DateTimeProperty(auto_now_add=True)
    tags = db.StringListProperty()
    is_commentable = db.BooleanProperty()
    is_draft = db.BooleanProperty()


class ArticleComment(db.Model):
    # None for a top-level comment.
    parent_comment = db.SelfReferenceProperty()
    name = db.StringProperty()
    email = db.StringProperty()
    homepage = db.StringProperty()
    body = db.TextProperty(required=True)
    html_body = db.TextProperty(required=True)
    published = db.DateTimeProperty(auto_now_add=True)
    article = db.ReferenceProperty(Article)
    ip_address = db.StringProperty()
    is_approved = db.BooleanProperty(default=False)
    is_subscribed = db.BooleanProperty(default=False)

The models show that many different data types are used: two kinds of references, dates, strings, Boolean values and lists. Looking ahead, I will note that lists and references caused the most problems.

Checking the operation of bulkloader

We fill the database and check the operation of bulkloader through the API:

appcfg.py download_data --email=doesntmatter -A wereword --url=http://localhost:8080/_ah/remote_api --kind=Article --filename=out.dat

The -A parameter gives the application name, --email accepts any string (it does not matter for the local server), and --kind names the entity class we want to download; note also the download_data action. Once the command finishes (just press Enter at the password prompt), the current directory will contain a backup file for the specified entity class (out.dat) and a number of log files (named bulkloader-*). By default the backup is stored in SQLite3 format, so you can open out.dat in any SQLite3 viewer and study it. Its structure is of little practical use (for migration, for example), so next we will write a config (and other related files) for bulkloader so that the data is exported in a more convenient format.
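If no SQLite viewer is at hand, the file can also be inspected from Python. The following is only a minimal sketch using the standard sqlite3 module; the helper names list_tables and count_rows are mine, and the sketch makes no assumption about which tables bulkloader actually creates in out.dat, it simply lists whatever is there.

```python
import sqlite3


def list_tables(path):
    """Return the names of all tables in the SQLite file at `path`."""
    conn = sqlite3.connect(path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        return [name for (name,) in rows]
    finally:
        conn.close()


def count_rows(path, table):
    """Count the rows of one table; pass a name returned by list_tables()."""
    conn = sqlite3.connect(path)
    try:
        # Table names cannot be bound as SQL parameters, so quote the identifier.
        quoted = '"' + table.replace('"', '""') + '"'
        return conn.execute("SELECT COUNT(*) FROM " + quoted).fetchone()[0]
    finally:
        conn.close()
```

Calling list_tables("out.dat") and then count_rows("out.dat", name) for each returned name gives a quick idea of what the default backup contains.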

Writing a configuration file for bulkloader

The current version of the SDK supports two export/import formats, CSV and XML; we will use the second. The configuration file is an ordinary YAML file that describes how data is transformed when it is exported from or imported into the storage. The official documentation explains how to generate a basic config from the application, but we will write it from scratch. We will call this file config.yaml. I usually create a separate backup directory in the application tree and put everything needed there; it hardly overlaps with the main application.

The config starts with the python_preamble section, which lists the Python modules needed during export/import. Here is the "gentleman's set" of modules: base64 and re are standard Python modules, google.* are modules from the SDK, and helpers is our own module, the file helpers.py in the current directory. helpers.py will hold various workarounds and other useful functions for importing/exporting data; for now just create an empty file with that name, we will add the code later.

python_preamble:
- import: base64
- import: re
- import: google.appengine.ext.bulkload.transform
- import: google.appengine.ext.bulkload.bulkloader_wizard
- import: google.appengine.ext.db
- import: google.appengine.api.datastore
- import: google.appengine.api.users
- import: helpers

The next section of the config is transformers. It describes the "converters" that turn entities into the local backup format and back. Here you must describe every field of the entity class that you need. Each entity class is described in its own kind section. Here is the simplest example of such a section, describing the converter for the Article class:

transformers:
- kind: Article
  connector: simplexml # connector type
  connector_options: # connector options
    xpath_to_nodes: "/blog/Articles/Article" # XPath to the element that holds the data of one object
    style: element_centric # XML style, element-centric in this case
  property_map:
    - property: __key__
      external_name: key
      export_transform: transform.key_id_or_name_as_string

A short note: XPath support is very weak; in practice you can use only expressions of the form "/AAA/BBB/CCC".

Now download the data from the server using the just created config (option --config):

appcfg.py download_data --email=doesntmatter -A wereword --url=http://localhost:8080/_ah/remote_api --kind=Article --config=config.yaml --filename=Article.xml

And we get the resulting XML containing data about two objects:

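<blog>
  <Articles>
    <Article>
      <key>6</key>
    </Article>
    <Article>
      <key>8</key>
    </Article>
  </Articles>
</blog>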

Please note that XML includes only those fields that we described in the configuration file in the transformers section; in our case, this is only a record key. In the export_transform parameter, we specified a specific converter for this field - transform.key_id_or_name_as_string. This is a function from google.appengine.ext.bulkload.transform module. For fields of another type, other converter functions are used, and a regular lambda expression in python can act as such a converter.

And now the whole piece of the config that describes the Article entity class:

- kind: Article
  connector: simplexml
  connector_options:
    xpath_to_nodes: "/blog/Articles/Article"
    style: element_centric
  property_map:
    - property: __key__
      external_name: key
      export_transform: transform.key_id_or_name_as_string
    - property: rendered_html
      external_name: rendered-html
      export_transform: transform.key_id_or_name_as_string
      # deep key! It's required here!
      import_transform: transform.create_deep_key(('Article', 'key'), ('RenderedArticle', transform.CURRENT_PROPERTY))
    - property: shortcut
      external_name: shortcut
    - property: body
      external_name: body
    - property: title
      external_name: title
    - property: html_preview
      external_name: html-preview
    - property: published
      external_name: published
      export_transform: transform.export_date_time('%Y-%m-%dT%H:%M')
      import_transform: transform.import_date_time('%Y-%m-%dT%H:%M')
    - property: updated
      external_name: updated
      export_transform: transform.export_date_time('%Y-%m-%dT%H:%M')
      import_transform: transform.import_date_time('%Y-%m-%dT%H:%M')
    - property: tags
      external_name: tags
      import_transform: "lambda x: x is not None and len(x) > 0 and eval(x) or []"
    - property: is_commentable
      external_name: is-commentable
      import_transform: transform.regexp_bool('^True$')
    - property: is_draft
      external_name: is-draft
      import_transform: transform.regexp_bool('^True$')

Let's analyze it in detail. Each field of the object gets its own property entry, which describes the data conversion rules for that field.

The external_name parameter sets the name of the corresponding element in the XML file.

The import_transform parameter holds the function for importing data; it converts a value from the backup to the data type of the field. You can think of it as deserialization.

The export_transform parameter holds the function that converts the field to the text written to the backup, that is, data serialization.

For simple types (String, for example), an explicit description of the import and export functions is not necessary; the standard one is used, which is quite enough. We will talk about other types separately.

Let's start with the rendered_html field. First, it is a reference to an object of another class (RenderedArticle in our case). Second, this RenderedArticle object is a child of the corresponding Article object. Therefore, during deserialization we have to "construct" a valid reference to the object from the values of two fields, which is done with the standard transform.create_deep_key method:

    - property: rendered_html
      external_name: rendered-html
      export_transform: transform.key_id_or_name_as_string
      # deep key! It's required here!
      import_transform: transform.create_deep_key(('Article', 'key'), ('RenderedArticle', transform.CURRENT_PROPERTY))

Note that the import_transform and export_transform parameters must contain an expression that evaluates to a function taking one argument and returning one value. In the example above we see a function call with specific arguments: this function is a factory that returns a function already prepared for the data conversion. transform.create_deep_key accepts several two-element tuples, each describing one level in the chain of object relations; a tuple contains the name of the entity class and the name of the element (from the XML file) whose value is used. The key is generated from these values.

In our case the chain consists of two objects, and we use transform.CURRENT_PROPERTY to avoid spelling out the element name of the current field in the relationship chain. In principle, you could write that name explicitly instead of transform.CURRENT_PROPERTY.
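To make the factory idea more concrete, here is a simplified stand-alone model of it. This is not the SDK code: make_deep_path and the CURRENT_PROPERTY sentinel below are my own stand-ins, the row is a plain dict of element values instead of the bulkloader state object, and the result is a flat path list rather than a real datastore key.

```python
# Sentinel standing in for transform.CURRENT_PROPERTY in this sketch.
CURRENT_PROPERTY = object()


def make_deep_path(*path_info):
    """Return a converter that builds a flat path [kind, id, kind, id, ...].

    Each item of path_info is a (kind, element_name) tuple, one per level
    of the relationship chain, from the root object down to the target.
    """
    def convert(value, row):
        path = []
        for kind, element_name in path_info:
            if element_name is CURRENT_PROPERTY:
                # Use the value of the field that is being converted right now.
                id_or_name = value
            else:
                # Look the value up among the other elements of the same record.
                id_or_name = row[element_name]
            path.extend([kind, id_or_name])
        return path
    return convert


to_path = make_deep_path(("Article", "key"),
                         ("RenderedArticle", CURRENT_PROPERTY))
row = {"key": "6", "rendered-html": "7"}
print(to_path(row["rendered-html"], row))
# ['Article', '6', 'RenderedArticle', '7']
```

The outer call runs once, when the config is loaded, and the inner function runs for every record, which is why the config contains a call with arguments and not a bare function name.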

Date fields also require a special approach, but here everything is simple: we use the function generators from the SDK and pass the date/time format template as the argument:

    - property: published
      external_name: published
      export_transform: transform.export_date_time('%Y-%m-%dT%H:%M')
      import_transform: transform.import_date_time('%Y-%m-%dT%H:%M')
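One consequence of the chosen template is easy to miss: it has no seconds, so a value does not survive the round trip exactly. The sketch below shows this with the standard datetime module only; the two small factories are stand-ins I wrote for illustration and are not the SDK functions themselves.

```python
from datetime import datetime

FORMAT = "%Y-%m-%dT%H:%M"


def export_date_time(fmt):
    """Stand-in factory: returns a datetime -> text converter."""
    return lambda value: value.strftime(fmt)


def import_date_time(fmt):
    """Stand-in factory: returns a text -> datetime converter."""
    return lambda text: datetime.strptime(text, fmt)


original = datetime(2011, 1, 19, 6, 29, 25)
text = export_date_time(FORMAT)(original)
restored = import_date_time(FORMAT)(text)

print(text)                  # 2011-01-19T06:29
print(restored == original)  # False, the seconds were dropped
```

If exact timestamps matter for the migration, a template that includes seconds (for example '%Y-%m-%dT%H:%M:%S') avoids the loss.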

For fields holding a list of strings, the standard method is used for serialization, so nothing needs to be written for export, but import requires a special approach:

    - property: tags
      external_name: tags
      import_transform: "lambda x: x is not None and len(x) > 0 and eval(x) or []"

When exporting (serializing), a list of strings is converted to an element of this form:

<tags>[u'x2', u'another string']</tags>

However, an empty list of strings is converted to an empty element:

<tags></tags>

When such a field is imported with the standard converter, the empty value becomes None, which is obviously not a valid list and will cause problems when the application tries to read the field. Therefore we use a lambda expression that performs the (relatively) correct conversion. However, due to a bug in the SDK this still will not help you much, because the error is in the field type validator.
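The lambda above passes the backup text to eval, which will execute any expression found in the file. As a possible alternative (my suggestion, not part of the original config), the same conversion can be sketched with ast.literal_eval, which accepts only literals. The function name parse_string_list is hypothetical; it could live in helpers.py and be referenced as helpers.parse_string_list. It does not address the validator bug mentioned above.

```python
import ast


def parse_string_list(value):
    """Convert the exported text of a string list back into a list.

    None and empty (or whitespace-only) text become an empty list.
    """
    if value is None or not value.strip():
        return []
    # literal_eval only evaluates Python literals, never arbitrary code.
    parsed = ast.literal_eval(value)
    if not isinstance(parsed, list):
        raise ValueError("expected a list literal, got %r" % (value,))
    return parsed
```

Unlike the one-line lambda, this version also fails loudly when the element contains something other than a list, instead of silently storing it.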

When working with Boolean fields, we also use a simple converter for deserialization:

    - property: is_commentable
      external_name: is-commentable
      import_transform: transform.regexp_bool('^True$')

In standard export, Boolean values are converted to the strings “True” and “False”. On import we use a more general rule: only the string “True” is converted to True, and everything else becomes False.
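The rule is easy to check outside App Engine. Below is a small stand-in written with the standard re module; it mimics how I understand such a converter to behave (a regular expression match turned into a Boolean) and is not the SDK implementation. The flags parameter is an assumption added for illustration.

```python
import re


def regexp_bool(pattern, flags=0):
    """Stand-in factory: the returned converter is True only on a match."""
    def convert(value):
        return bool(re.match(pattern, value, flags))
    return convert


to_bool = regexp_bool("^True$")
print(to_bool("True"))   # True
print(to_bool("False"))  # False
print(to_bool("true"))   # False, the match is case-sensitive
```

Any text that does not match, including an empty element, ends up as False, so a typo in the backup file silently turns a flag off.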

The resulting XML file with the exported objects of the Article class looks something like this:


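<blog>
  <Articles>
    <Article>
      <body>aaa bbb ccc</body>
      <updated>2011-01-20T08:19</updated>
      <key>6</key>
      <is-draft>False</is-draft>
      <title>this is new article</title>
      <html-preview>aaa bbb ccc</html-preview>
      <tags></tags>
      <shortcut>short-cut-1295418565</shortcut>
      <rendered-html>7</rendered-html>
      <published>2011-01-19T06:29</published>
      <is-commentable>True</is-commentable>
    </Article>
    <Article>
      <body>ff gg hh</body>
      <updated>2011-01-19T06:30</updated>
      <key>8</key>
      <is-draft>False</is-draft>
      <title>another article</title>
      <html-preview>ff gg hh</html-preview>
      <tags>[u'x2']</tags>
      <shortcut>short-cut-1295418590</shortcut>
      <rendered-html>9</rendered-html>
      <published>2011-01-19T06:29</published>
      <is-commentable>True</is-commentable>
    </Article>
  </Articles>
</blog>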

Working with object relationship chains

Relations or dependencies between objects are built using the parent argument when creating an object of some entity class. The new object falls into the same group of entities as the one specified in parent. This approach allows you to use, for example, transactions to maintain data integrity. The chains of relations during import and export must be handled in a special way. And there are several nuances that we will consider below.

So, we have the Article entity class, objects of this type are articles, it contains the source code of the article in the markup language, a small preview and other service information. And the text of the article rendered in html code is stored in a separate object of the RenderedArticle class. Separation of the rendered text into a separate entity class was done in order to circumvent the restriction on the total size of an object adopted by App Engine, and in fact Article and RenderedArticle objects act in a one-to-one relationship. The RenderedArticle object is created in the same entity group as the Article object.

Here is the part of config.yaml for the RenderedArticle entity class:

- kind: RenderedArticle
  connector: simplexml
  connector_options:
    xpath_to_nodes: "/blog/RenderedArticles/Article"
    style: element_centric
  property_map:
    - property: __key__
      external_name: key
      export:
        - external_name: ParentArticle
          export_transform: transform.key_id_or_name_as_string_n(0)
        - external_name: key
          export_transform: transform.key_id_or_name_as_string_n(1)
      import_transform: transform.create_deep_key(('Article', 'ParentArticle'), ('RenderedArticle', transform.CURRENT_PROPERTY))
    - property: html_body
      external_name: html-body

Notice how the export of the key is described in the example above. First, on export the single key field of the object is split into two elements in the backup. Second, on import the key field is “assembled” from the values of two elements, ParentArticle and key. The call transform.key_id_or_name_as_string_n(0) returns a function that, applied to the key field, returns the specified component of the composite key.
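The splitting step can be modelled in a few lines. In this sketch a composite key is represented as a flat list of kind and id pairs, which is my simplification and not the real datastore key type; key_component is a hypothetical name for a factory that plays the role of transform.key_id_or_name_as_string_n.

```python
def key_component(index):
    """Stand-in factory: the converter returns the id or name at `index`.

    The key is modelled as a flat list [kind0, id0, kind1, id1, ...],
    so level n of the chain lives at position 2 * n + 1.
    """
    def convert(path):
        return str(path[2 * index + 1])
    return convert


# A RenderedArticle with id 7 whose parent is the Article with id 6.
path = ["Article", 6, "RenderedArticle", 7]
print(key_component(0)(path))  # 6, written to the ParentArticle element
print(key_component(1)(path))  # 7, written to the key element
```

Index 0 is the root of the entity group and the last index is the object itself, which matches the order of the tuples passed to create_deep_key on import.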

The generated XML based on this config looks something like this:

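<blog>
  <RenderedArticles>
    <Article>
      <ParentArticle>6</ParentArticle>
      <html-body>aaa bbb ccc</html-body>
      <key>7</key>
    </Article>
    <Article>
      <ParentArticle>8</ParentArticle>
      <html-body>ff gg hh</html-body>
      <key>9</key>
    </Article>
  </RenderedArticles>
</blog>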

Now let's look at the export and import of ArticleComment objects. Recall that comments form a tree, that is, a comment can have a “parent” comment; in addition, each comment has a reference to the article it belongs to.

- kind: ArticleComment
  connector: simplexml
  connector_options:
    xpath_to_nodes: "/blog/Comments/Comment"
    style: element_centric
  property_map:
    - property: __key__
      external_name: key
      export_transform: transform.key_id_or_name_as_string
      import_transform: transform.create_deep_key(('Article', 'article'), ('ArticleComment', transform.CURRENT_PROPERTY))
    - property: parent_comment
      external_name: parent-comment
      export_transform: transform.key_id_or_name_as_string
      import_transform: helpers.create_deep_key(('Article', 'article'), ('ArticleComment', transform.CURRENT_PROPERTY))
    - property: article
      external_name: article
      export_transform: transform.key_id_or_name_as_string
      import_transform: transform.create_foreign_key('Article')
    - property: name
      external_name: name
    - property: body
      external_name: body

At first glance, everything looks simple, but at one point the “default” behavior of the converters breaks down. Note that the parent_comment field may be None, which indicates a top-level comment. If we use the transform.create_deep_key method during the import process, then we get an error on the value None:

BadArgumentError: Expected an integer id or string name as argument 4; received None (a NoneType).

I also filed a bug report about this error, but so far I have not received any reaction from the developers. To work around this bug, we use the helpers.py file, where we place a replacement for the transform.create_deep_key method. The workaround is very simple: we generate the key only if the value is not None:

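from google.appengine.ext.bulkload import transform

def create_deep_key(*path_info):
    """None-tolerant replacement for transform.create_deep_key."""
    f = transform.create_deep_key(*path_info)
    def create_deep_key_lambda(value, bulkload_state):
        # An empty reference stays empty instead of raising BadArgumentError.
        if value is None:
            return None
        return f(value, bulkload_state)
    return create_deep_key_lambda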

In the comments, I can tell you more about what is happening in this function, if anyone is interested.

This way, an optional reference to an object is restored correctly.

Now for the article field, which contains a reference to the article the comment belongs to. To restore the reference to the object we use the transform.create_foreign_key method. It works like transform.create_deep_key, but without taking relationship chains into account. I want to draw attention to a potential bug here: if the reference to the object is empty, during recovery you will run into exactly the same error as a couple of paragraphs above.
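The same guard can be written once for single-argument converters. This is only a sketch: skip_none is a hypothetical helper, and I am assuming that the converter returned by transform.create_foreign_key takes just the value. In the config it might then be used as helpers.skip_none(transform.create_foreign_key('Article')).

```python
def skip_none(converter):
    """Wrap a single-argument converter so that None passes through untouched."""
    def guarded(value):
        if value is None:
            # An empty reference stays empty; the wrapped converter is not called.
            return None
        return converter(value)
    return guarded


# Demonstration with a fake converter, since the SDK is not imported here.
def fake_foreign_key(value):
    return ("Article", value)


safe = skip_none(fake_foreign_key)
print(safe("6"))   # ('Article', '6')
print(safe(None))  # None
```

Converters that also receive bulkload_state, like the create_deep_key replacement above, should keep that parameter spelled out explicitly in their own wrapper rather than go through this one.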

Conclusion

It is already quite possible to work with bulkloader, but only very carefully. You need to keep an eye on announcements and reread the documentation after each SDK release, since not all changes make it into the changelog. Working with binary data was left out of this article, but everything there is simple:

    - property: data
      external_name: data
      export_transform: base64.b64encode
      import_transform: transform.blobproperty_from_base64

Next time we’ll talk about localization features in GAE-python-django applications.

References

Bulkloader documentation home page
PDF with the bulkloader presentation from Google I/O 2010; I highly recommend looking at it, since it describes the entire operation scheme of this subsystem in detail
Bulkloader demo, examples from the presentation
A test project on which I "run in" methods of working with GAE.
