
Generating Dummy Data with Mimesis: Part II

Previously, we published an article on how to generate dummy data with Mimesis, a library for the Python programming language. This article is a continuation of that one, so we will not cover the basics of working with the library. If you missed the first article, were too lazy to read it, or simply didn't want to, you probably want to now, because this article assumes that the reader is already familiar with those basics. In this part we will talk about best practices and about several features of the library that are, in our opinion, useful.
A Note
First of all, I would like to note that Mimesis was not designed for use with a specific database or ORM. The main task the library solves is providing valid data. For this reason, there are no strict rules for working with the library, but there are recommendations that will help keep your test environment in order and prevent the growth of entropy in the project. The recommendations are quite simple and fully consistent with the spirit of Python (if you disagree, we look forward to your comments).
Structuring
Although, as stated above, the library is not intended for use with a specific database or ORM, the need for test data most often arises precisely in web applications that perform some operations (most often CRUD) on a database. We have some recommendations for organizing test data generation in such applications.
The functions that generate the data and write it to the database should be kept close to the models or, even better, defined as static methods of the model they relate to, following the example of the _bootstrap() method from the previous article. This way you avoid hunting through files when the structure of the model changes and you need to add a new field. The Patient() model from the previous article demonstrates the idea well:
from sqlalchemy.exc import IntegrityError

# `db` is the SQLAlchemy object of the Flask application.
class Patient(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    email = db.Column(db.String(120), unique=True)
    phone_number = db.Column(db.String(25))
    full_name = db.Column(db.String(100))
    weight = db.Column(db.String(64))
    height = db.Column(db.String(64))
    blood_type = db.Column(db.String(64))
    age = db.Column(db.Integer)

    def __init__(self, **kwargs):
        super(Patient, self).__init__(**kwargs)

    @staticmethod
    def _bootstrap(count=500, locale='en', gender='female'):
        # gender needs a default: a parameter without one cannot follow
        # parameters that have defaults.
        from mimesis import Personal
        person = Personal(locale)

        for _ in range(count):
            patient = Patient(
                email=person.email(),
                phone_number=person.telephone(),
                full_name=person.full_name(gender=gender),
                age=person.age(minimum=18, maximum=45),
                weight=person.weight(),
                height=person.height(),
                blood_type=person.blood_type()
            )
            db.session.add(patient)
            try:
                db.session.commit()
            except IntegrityError:
                # For example, a duplicate email: skip this record.
                db.session.rollback()
Keep in mind that the example above is a model from a Flask application that uses SQLAlchemy. Dummy data generators for applications built with other frameworks are organized in a similar way.
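To illustrate that the layout does not depend on Flask, here is a minimal sketch of the same idea using only the standard library. Everything in it is an assumption made for the example: the plain Patient class, its table, and the random.Random calls that stand in for a Mimesis provider. It is not code from the library or from the previous article.

```python
import random
import sqlite3


class Patient:
    """Plain-Python stand-in for a model; not tied to any ORM."""

    TABLE_SQL = (
        "CREATE TABLE IF NOT EXISTS patient ("
        "id INTEGER PRIMARY KEY, email TEXT UNIQUE, age INTEGER)"
    )

    @staticmethod
    def _bootstrap(conn, count=500, seed=None):
        """Insert up to `count` dummy rows; return how many were stored."""
        rng = random.Random(seed)
        conn.execute(Patient.TABLE_SQL)
        stored = 0
        for _ in range(count):
            # Hypothetical stand-in for provider calls such as person.email().
            email = "user{}@example.org".format(rng.randint(1, 10 * count))
            age = rng.randint(18, 45)
            try:
                conn.execute(
                    "INSERT INTO patient (email, age) VALUES (?, ?)",
                    (email, age),
                )
                stored += 1
            except sqlite3.IntegrityError:
                # Duplicate email: skip this row and keep going.
                continue
        conn.commit()
        return stored
```

The generator lives next to the table definition, so a new column means touching one class only. A call such as Patient._bootstrap(sqlite3.connect(":memory:"), count=50) fills a throwaway database.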
Creating Objects
If your application expects data in one specific language and only in it, it is best to use the Generic() class, which provides access to all provider classes through a single object, rather than creating multiple instances of the provider classes individually. With Generic() you will get rid of extra lines of code.
Right:
>>> from mimesis import Generic
>>> generic = Generic('ru')
>>> generic.personal.username()
'sherley3354'
>>> generic.datetime.date()
'14-05-2007'
Wrong:
>>> from mimesis import Personal, Datetime, Text, Code
>>> personal = Personal('ru')
>>> datetime = Datetime('ru')
>>> text = Text('ru')
>>> code = Code('ru')
At the same time, this is fine:
>>> from mimesis import Personal
>>> p_en = Personal('en')
>>> p_sv = Personal('sv')
>>> # ...
That is, importing provider classes separately makes sense only if you need nothing beyond the data that the imported class provides; in all other cases it is recommended to use Generic().
Writing data to the database
If you need to generate data and write it to the database, we strongly recommend generating the data in batches rather than all 600k records at once. Remember that the database, the ORM, and so on may impose restrictions of their own. The smaller the chunks of data generated for writing, the faster the writing.
Good:
>>> User()._bootstrap(count=2000, locale='de')
Very bad:
>>> User()._bootstrap(count=600000, locale='de')
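If you do need a large total, one way to follow this recommendation is to split it into several small calls. The sketch below is only an illustration: batch_sizes() and bootstrap_in_batches() are hypothetical helpers, not part of Mimesis, and the default batch size of 2000 is simply taken from the "good" example above.

```python
def batch_sizes(total, batch_size=2000):
    """Yield chunk sizes that add up to `total`."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    remaining = total
    while remaining > 0:
        size = min(batch_size, remaining)
        yield size
        remaining -= size


def bootstrap_in_batches(bootstrap, total, batch_size=2000, **kwargs):
    """Call `bootstrap(count=..., **kwargs)` once per batch."""
    for size in batch_sizes(total, batch_size):
        bootstrap(count=size, **kwargs)
```

With a model like the ones above, the call could look like bootstrap_in_batches(User._bootstrap, total=600000, locale='de'), which issues 300 calls of 2000 records each instead of one huge call.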
Downloading Images
The Internet() class has several methods that generate links to images. For testing, links to images hosted on remote resources are usually enough; however, if you still want a set of random images locally, you can download the images generated by the corresponding Internet() methods using the download_image() function from the utils module:
>>> from mimesis.utils import download_image
>>> from mimesis import Internet
>>> img_url = Internet().stock_image(category='food', width=1920, height=1080)
>>> download_image(url=img_url, save_path='/some/path/')
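If you are curious what such a helper roughly involves, the sketch below saves a file from a URL using only the standard library. It is an illustration under our own assumptions, not the code of download_image(): the save_image name, its return value, and its error handling are all hypothetical.

```python
import os
import urllib.request
from urllib.parse import urlparse


def save_image(url, save_path):
    """Download `url` into the directory `save_path`.

    Returns the path of the saved file, or None if the download failed.
    """
    # Use the last path component of the URL as the file name.
    name = os.path.basename(urlparse(url).path) or "image"
    target = os.path.join(save_path, name)
    try:
        urllib.request.urlretrieve(url, target)
    except OSError:
        # urllib's URLError is a subclass of OSError, so this covers
        # both network failures and filesystem errors.
        return None
    return target
```

Returning None instead of raising keeps a test data script running when a single remote image is unavailable; whether that is the right trade-off depends on your project.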
User Providers
The library supports a large amount of data, and in most cases it will be quite enough. For those who want to create their own providers with more specific data, this is supported and is done as follows:
>>> from mimesis import Generic
>>> generic = Generic('en')
>>> class SomeProvider():
...     class Meta:
...         name = "some_provider"
...
...     @staticmethod
...     def one():
...         return 1
>>> class Another():
...     @staticmethod
...     def bye():
...         return "Bye!"
>>> generic.add_provider(SomeProvider)
>>> generic.add_provider(Another)
>>> # ...
>>> generic.some_provider.one()
1
>>> generic.another.bye()
'Bye!'
Everything is simple and clear without comment, so we will clarify only one point: the name attribute of class Meta is the name through which the methods of the custom provider class are accessed. By default, this name is the class name in lowercase.
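The naming rule can be written down in a few lines. The function below is only a sketch of the rule as described here, not the library's actual implementation, and provider_name() is a hypothetical helper.

```python
def provider_name(cls):
    """Return the attribute name a provider class would be exposed under."""
    meta = getattr(cls, "Meta", None)
    name = getattr(meta, "name", None)
    # Fall back to the lowercase class name when Meta.name is absent.
    return name if name else cls.__name__.lower()


class SomeProvider:
    class Meta:
        name = "some_provider"


class Another:
    pass


print(provider_name(SomeProvider))  # some_provider
print(provider_name(Another))       # another
```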
Built-in Providers
Most countries where a particular language is official have data that is specific to those countries only, for example CPF for Brazil or SSN for the USA. Such data can cause inconvenience and disrupt order (or at least annoy) because it would be present in all objects, regardless of the chosen locale. You can see for yourself by looking at an example of how it would look (the code will not work):
>>> from mimesis import Personal
>>> person = Personal('ru')
>>> person.ssn()
>>> person.cpf()
I think everyone will agree that it looks really bad. We, as perfectionists, made sure that the Brazilian CPF would not bother a “Pole”, and for this reason the provider classes that supply this kind of locale-specific data are placed in a separate subpackage (mimesis.builtins) in order to preserve a class structure that is common to all languages and their objects.
This is how it works:
>>> from mimesis import Generic
>>> from mimesis.builtins import BrazilSpecProvider
>>> generic = Generic('pt-br')
>>> class BrazilProvider(BrazilSpecProvider):
...
...     class Meta:
...         name = "brazil_provider"
...
>>> generic.add_provider(BrazilProvider)
>>> generic.brazil_provider.cpf()
'696.441.186-00'
In general, you do not have to add built-in classes to a Generic() object. In the example, this was done only to demonstrate in which cases it would be appropriate to add a built-in provider class to a Generic() object. You can use such a class directly, as shown below:
>>> from mimesis.builtins import RussiaSpecProvider
>>> ru = RussiaSpecProvider()
>>> ru.patronymic(gender='female')
'Петровна'
>>> ru.patronymic(gender='male')
'Бенедиктович'
What data do you need most often in your work? What is missing from the library and should be added right away? We would be very glad to hear your wishes, recommendations, and comments.
Link to the project: here.
Link to the documentation: here.
Link to the first part of the article: here.
That's all from me, friends. Good luck to you, and may the force be with you!