Does your team need a Data Engineer?

We often come across great English-language articles that seem useful to our team, and we decided it would be good to share translations of them with Habr readers. Today we have prepared a translation of an article by Tristan Handy, the founder of Fishtown Analytics.

The role of the data engineer in modern startups is changing rapidly. Are you sure you understand when and why your team might need such a specialist?

I often talk with leading figures in the analytics world and notice that their understanding of the data engineer's role on a team is often inaccurate. This can create difficulties for the entire data team, and I would like companies to learn to avoid such problems.

In this article I want to share my thoughts on when, how, and why it is worth hiring a data engineer. My reasoning is based on my experience at Fishtown Analytics, where I have worked with more than a hundred venture-backed startups and helped them build their data teams, as well as on what I have learned from talking with people at various data processing companies.

If you lead a team of data experts, this post is for you.

The role of the data engineer is changing

Modern software automates more and more of the tedious work involved in data analysis and processing.

In 2012, a venture-funded startup needed at least one data engineer to analyze the full range of its data. That specialist had to extract data from different systems so that analysts and corporate clients could work with it. The data often had to be transformed in some way to make it easier to analyze. Without a data engineer, analysts would simply have had no data to work with, so building a data team often began with hiring a data engineer.

As of 2019, most of this can be done with ready-made solutions. In most cases, you and a team of analysts can build a data pipeline yourselves, without the help of a person with extensive experience in data science. And this pipeline will not be bad at all: modern ready-made tools are well suited to such problems.

Analysts and data scientists gained the ability to build pipelines themselves only recently, about 2-3 years ago. This happened mainly thanks to three products: Stitch, Fivetran, and dbt (I should say that dbt is a product of my company, Fishtown Analytics). They were released shortly after Amazon Redshift, when analytics teams at startups realized that they needed to build data warehouses. It took several more years for these products to mature: in 2016, we were still pioneers.

Today a pipeline built with Stitch, Fivetran, or dbt is much more reliable than a custom one built with Airflow. I know this not from theory but from my own experience. I am not saying that it is impossible to build reliable infrastructure with Airflow, only that most startups do not. At Fishtown Analytics we have worked with more than a hundred analytics teams at different startups, and this scenario has repeated itself many times. We constantly help people move from homegrown pipelines to ready-made solutions, and the effect is positive every time.

Data engineers should not write ETL

In 2016, Jeff Magnusson wrote a seminal post, "Data Engineers Should Not Write ETL." It was the first post I can remember that called for such a change. Here is my favorite part of it:

“Over the past 5 years, data processing tools and technologies have evolved. Most technologies have matured to the point where they can adapt to your needs, unless, of course, you need to process petabytes of data or billions of events per day.

If you do not need to go beyond the capabilities of these technologies, you most likely do not need a team of highly specialized programmers to develop additional solutions.

If you do manage to hire them, they will soon get bored. If they get bored, they will leave you for Google, Facebook, LinkedIn, or Twitter, places where their expertise is really needed. If they are not bored, they are most likely pretty mediocre. And mediocre programmers are really good at building huge amounts of overcomplicated, unworkable nonsense that they call “solutions.””

I really like this quote, because it not only stresses that today you do not need data engineers to solve most ETL problems, but also explains why you should not ask them to solve these problems at all.

If you hire data engineers and ask them to build pipelines, they will think that their job is to build pipelines. That means tools like Stitch, Fivetran, and dbt will look like a threat to them rather than a source of leverage. They will find reasons why ready-made pipelines do not meet your particular data needs and why analysts should not do data transformation on their own. They will write code that is fragile, hard to maintain, and inefficient. And you will depend on this code, because it underlies everything else your team does.

Run from such specialists as you would from the plague. The pace of your analytics team will drop dramatically, and you will spend all your time solving infrastructure problems, which is not what generates revenue for your business.

If not ETL, then what?

So do you need a data engineer for your team? Yes.

Even with new tools that allow analysts and data scientists to build pipelines themselves, data engineers are still an important part of any professional data team. However, the tasks they should work on and the order in which it makes sense to hire data people have changed. I will cover when to hire below; for now, let's talk about what data engineers at modern startups are responsible for.

Your data engineers should neither build pipelines for which ready-made solutions already exist nor write SQL data transformations. This is what they should focus on:

organization and optimization of basic data infrastructure,
building and maintaining custom pipelines,
supporting the data team by improving the design and performance of pipelines and queries,
building non-SQL data transformations.

Organization and optimization of basic data infrastructure

Although data engineers at startups no longer need to manage Hadoop clusters or configure hardware for Vertica, there is still work to do in this area. Making sure your technology runs at peak capability gives you a significant improvement in performance, cost, or both. This usually involves the following tasks:

building monitoring infrastructure to track the status of pipelines,
monitoring all jobs that affect cluster performance,
performing regular maintenance,
tuning table schemas (partitioning, compression, distribution) to minimize costs and improve performance,
developing custom data infrastructure when no ready-made solutions exist.
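As an illustration of the first task in this list, here is a minimal sketch of a freshness check that flags tables whose last load is too old. The function name, the table names, and the 24-hour threshold are all hypothetical; in a real setup the load timestamps would come from your warehouse or from your ETL tool's metadata.

```python
from datetime import datetime, timedelta, timezone


def find_stale_tables(last_loaded_at, max_age, now=None):
    """Return the sorted names of tables whose latest load is older than max_age.

    last_loaded_at: dict mapping table name -> timezone-aware datetime of last load.
    max_age: timedelta after which a table counts as stale (assumed threshold).
    """
    now = now or datetime.now(timezone.utc)
    return sorted(
        name
        for name, loaded_at in last_loaded_at.items()
        if now - loaded_at > max_age
    )


# Hypothetical example data: "events" has not been refreshed for 30 hours.
reference_time = datetime(2019, 6, 1, 12, 0, tzinfo=timezone.utc)
example_loads = {
    "orders": reference_time - timedelta(hours=1),
    "events": reference_time - timedelta(hours=30),
}
print(find_stale_tables(example_loads, timedelta(hours=24), now=reference_time))
```

A check like this could run on a schedule and feed an alert, so that a silently broken pipeline is noticed before analysts build reports on outdated data.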

These tasks are often overlooked in the early stages of development, but they become critical as the team and the data grow. In one project, we gradually reduced the cost of building a table in BigQuery from $500 to $1 per day by optimizing the table partitions. This really matters.

Uber is a good example of a company that has succeeded at this. Data specialists at Uber created a tool called Queryparser, which automatically tracks all queries to their data infrastructure and collects statistics on resource consumption and usage patterns. Uber's data engineers can use this metadata to tune their infrastructure accordingly.

Data engineers are also often responsible for building and maintaining the CI/CD pipeline that manages the data infrastructure. In 2012, many companies had very weak infrastructure for version control, management, and testing, but this is now changing, and data engineers are the ones driving it.

Finally, data engineers at leading companies are often involved in building tools that do not yet exist off the shelf. For example, Airbnb engineers created Airflow because they had no way to efficiently build data processing DAGs (directed acyclic graphs). And Netflix engineers are responsible for building and maintaining a complex infrastructure for developing and running tens of thousands of Jupyter Notebooks.

You can simply buy most of your basic infrastructure, but someone still needs to maintain it. And if you are a truly progressive company, you probably want to expand the capabilities of existing tools. Data engineers can help with both.

Building and maintaining custom pipelines

Although data engineers no longer need to hand-code data transfer for Postgres or Salesforce, vendors offer “only” about 100 integrations. Most of our clients can immediately cover 75 to 90% of the data sources they work with.

In practice, integration happens in waves. As a rule, the first wave covers the main application database and event tracking, and the second covers marketing systems, such as ESPs, and advertising platforms. Ready-made solutions for both waves are already on the market. Once you go deeper into data from SaaS vendors specific to your domain, you need data engineers to build and maintain these niche data pipelines.

For example, e-commerce companies interact with a host of different products in ERP, logistics, and delivery. Many of these products are very specific, and almost none of them have ready-made integrations. Expect your data engineers to be building such integrations for the foreseeable future.

Building and maintaining reliable data pipelines is hard. If you decide to invest in building them, be prepared for it to cost more than you budgeted and for maintenance to take more effort than you planned. The first version of a pipeline is easy to build, but making it keep the data in your warehouse consistent is difficult. Do not commit to maintaining your own data pipeline until you are sure the business case is there. Once you do, take the time to make it reliable. Consider using Singer, an open source framework from the creators of Stitch; we have built about 20 integrations with it.

Supporting the data team by improving the design and performance of pipelines and queries

One of the changes we have seen in data engineering over the past five years is the rise of ELT, a variant of ETL that transforms data after it is loaded into the warehouse rather than before. The nature of this change and the reasons for it are already well covered elsewhere. What I want to emphasize is that this shift has a huge impact on who builds these pipelines.

If you are writing Scalding code to scan terabytes of event data in S3 and then load it into Vertica, you probably need a data engineer. But if your event data (exported from Google Analytics 360) is already in BigQuery, it is already fully accessible in a high-performance, scalable environment. The difference is that this environment speaks SQL, which means that analysts can now build their own data transformation pipelines.

This trend began in 2014, when Looker launched the PDT tool. It intensified over the past two years, as entire data teams began building data DAGs of 500+ nodes and processing large data sets with dbt. By now the model is deeply rooted in modern teams and has given analysts more autonomy than ever before.

Switching to ELT means that data engineers no longer need to perform most data transformation tasks. It also means that teams without engineers can go a long way using transformation tools built for analysts. However, data engineers still play an important role in building data transformation pipelines. There are two situations where their participation is crucial:
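To make the ELT idea concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a warehouse such as BigQuery. The table names (raw_events, user_event_counts) and the sample data are hypothetical. The point is only the order of operations: raw records are loaded untouched, and the transformation is then expressed in SQL inside the database.

```python
import sqlite3


def run_elt(raw_events):
    """Load raw (user_id, event) rows as-is, then transform them with SQL."""
    conn = sqlite3.connect(":memory:")  # in-memory stand-in for a warehouse

    # Load: copy the raw records into the database without reshaping them.
    conn.execute("CREATE TABLE raw_events (user_id INTEGER, event TEXT)")
    conn.executemany("INSERT INTO raw_events VALUES (?, ?)", raw_events)

    # Transform: build a derived table with plain SQL, after loading.
    conn.execute(
        """
        CREATE TABLE user_event_counts AS
        SELECT user_id, COUNT(*) AS n_events
        FROM raw_events
        GROUP BY user_id
        """
    )

    rows = conn.execute(
        "SELECT user_id, n_events FROM user_event_counts ORDER BY user_id"
    ).fetchall()
    conn.close()
    return rows


print(run_elt([(1, "view"), (1, "click"), (2, "view")]))
```

Because the transformation step is just a SQL statement, an analyst who knows SQL can own it without writing any extraction code.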

1. When you need to improve performance

Sometimes business logic requires a particularly complex transformation, and it is useful to bring in a data engineer to assess how a particular approach to building tables affects performance. Many analysts do not have much experience optimizing performance in analytical data warehouses, and this is an excellent reason to start working with a more specialized expert.

2. When the code gets too complicated

Analysts are good at solving business problems with data, but they often do not think about how to write extensible code. At first glance it is easy to start building tables in a database, but things can quickly get out of control. Bring in a data engineer who can think through the overall architecture of your warehouse and develop the particularly complex transformations, otherwise you risk being left with a tangle that is almost impossible to unravel.

Building non-SQL data transformations

SQL can satisfy most data transformation needs at first, but it cannot solve every problem. For example, you often need to add geodata to the warehouse by taking a latitude and longitude and mapping them to a specific region. Many modern analytical warehouses cannot yet do this (although this is beginning to change!), so the best solution may be to build a pipeline in Python that enriches the data in your warehouse with region information.

Another obvious use of Python (or other non-SQL languages) is machine learning. If you have personalized product recommendations, a demand forecasting model, or a churn prediction algorithm that takes data from your warehouse and produces weights, you can add them as end nodes of your SQL processing DAG.

Most modern companies that build such non-SQL transformations use Airflow. dbt is used for the SQL-based part of the data DAG, and non-SQL nodes are added as leaves. This gives you the best of both approaches: data analysts can still be primarily responsible for SQL-based transformations, while data engineers are responsible for production ML code.
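As a rough illustration of such a Python enrichment step, here is a sketch that maps a point to a region using the classic ray-casting point-in-polygon test. The regions are hypothetical rectangles, and the code treats longitude and latitude as flat coordinates, which is a simplification; a production pipeline would more likely rely on a dedicated geospatial library and real boundary data.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices in order."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point matter.
        if (y1 > lat) != (y2 > lat):
            # Longitude at which this edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def region_for_point(lon, lat, regions):
    """Return the name of the first region containing the point, or None."""
    for name, polygon in regions.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None


# Hypothetical regions: two adjacent rectangles.
REGIONS = {
    "west": [(0, 0), (5, 0), (5, 10), (0, 10)],
    "east": [(5, 0), (10, 0), (10, 10), (5, 10)],
}

print(region_for_point(2, 3, REGIONS))
print(region_for_point(7, 3, REGIONS))
```

A pipeline built around a function like this would read coordinates from the warehouse, compute the region for each row, and write the result back as an extra column or lookup table.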
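The structure described here can be sketched with the standard library's graphlib module (Python 3.9+). This is not how dbt or Airflow are implemented or configured; it is only a toy model of a dependency graph in which SQL models come first and a non-SQL node hangs off the end. All node names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of nodes it depends on (hypothetical names).
DAG = {
    "stg_orders": set(),
    "stg_customers": set(),
    "customer_orders": {"stg_orders", "stg_customers"},  # SQL model
    "churn_model": {"customer_orders"},  # non-SQL (Python) node
}
NON_SQL_NODES = {"churn_model"}


def execution_order(dag):
    """Return one valid order in which every node runs after its dependencies."""
    return list(TopologicalSorter(dag).static_order())


def leaves(dag):
    """Return the nodes that no other node depends on."""
    depended_on = set().union(*dag.values())
    return set(dag) - depended_on


order = execution_order(DAG)
print(order)
print(leaves(DAG) == NON_SQL_NODES)
```

In this toy graph the only leaf is the machine learning node, which mirrors the division of labor above: analysts own everything upstream in SQL, and engineers own the leaf.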

When does your team need a data engineer?

The changing role of the data engineer also means rethinking the order in which you hire. It used to be thought that you needed data engineers first, because analysts and data scientists had nothing to work with until a data platform was ready. Today, analysts and data scientists can work independently and build the first version of the data infrastructure using ready-made tools. Consider hiring a data engineer when your startup shows any of these 4 signs of scale:

there are 3 analysts / data scientists on your team,
your BI platform has 50 active users,
the largest table in your warehouse reaches 1 billion rows,
you know that you will need to build 3 or more custom data pipelines over the next few quarters, and all of them are critical.
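The four signs above can be written down as a small checklist function. This is only a sketch: the class and field names are hypothetical, and it assumes each number is meant as an "at least" threshold.

```python
from dataclasses import dataclass


@dataclass
class TeamSnapshot:
    """Hypothetical summary of a data team's current scale."""
    analysts: int                   # analysts / data scientists on the team
    bi_active_users: int            # active users of the BI platform
    largest_table_rows: int         # rows in the largest warehouse table
    critical_custom_pipelines: int  # critical custom pipelines planned soon


def scale_signals(team):
    """Return the names of the signs of scale that the team currently shows."""
    checks = {
        "analysts": team.analysts >= 3,
        "bi_active_users": team.bi_active_users >= 50,
        "largest_table_rows": team.largest_table_rows >= 1_000_000_000,
        "critical_custom_pipelines": team.critical_custom_pipelines >= 3,
    }
    return [name for name, triggered in checks.items() if triggered]


small_team = TeamSnapshot(2, 20, 10_000_000, 1)
growing_team = TeamSnapshot(4, 60, 10_000_000, 0)
print(scale_signals(small_team))
print(scale_signals(growing_team))
```

Any non-empty result is a prompt to consider hiring a data engineer, not an automatic decision.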

If you have not yet run into any of these situations, your data team can probably work on its own using ready-made technologies, support from outside consultants, and advice from peers (for example, in the Locally Optimistic or dbt communities on Slack).

The main thing to understand is that a data engineer has no business value on their own; their main job is to make your analysts more productive. Your data team interacts with stakeholders, measures KPIs, and builds reports and models, and these are the people who help your business move in the right direction every day. Hire a data engineer to strengthen an existing, large team: if the efficiency of four of your analysts rose by 33% after you hired a data engineer, it was most likely a good decision.

Data engineers benefit the business by helping your analysts and data scientists be more productive.

In my opinion, if you decide to expand your data team, the best ratio is about 5 to 1: five analysts / data scientists per data engineer. If you work with particularly large or unusual data sets, this ratio may change, but it is a good guideline.
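The arithmetic behind the 33% example can be spelled out as a short sketch. It rests on a simplifying assumption that is not in the original: one data engineer costs roughly as much as one analyst, so the hire pays off once the added capacity reaches one analyst-equivalent. The function names are hypothetical.

```python
def added_capacity(analysts, productivity_gain):
    """Extra output, in analyst-equivalents, gained across the whole team."""
    return analysts * productivity_gain


def break_even_gain(analysts):
    """Productivity gain at which the team gains one analyst-equivalent.

    Assumes one engineer costs about as much as one analyst.
    """
    return 1 / analysts


# Four analysts who each become 33% more productive.
print(round(added_capacity(4, 0.33), 2))
# Gain needed for a team of four to break even under the assumption above.
print(break_even_gain(4))
```

Under this assumption, four analysts gaining 33% each add about 1.32 analyst-equivalents, which is more than the single hire, while the break-even gain for a team of four is 25%.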

Who is worth hiring?

As the role of the data engineer changes, so do the requirements for the ideal candidate. My esteemed colleague Michael Kaminsky put it very well in our correspondence on this subject, so I will quote him here:

“I think of all these changes, first and foremost, as a change in the data engineer's role on the team. From a builder of infrastructure, the data engineer has become a supporting link for a wider group of specialists. This is a significant change, and some data engineers (those who would like to focus on building infrastructure) are not always happy about it.
I think the most important thing for startups is to hire a data engineer who has the energy and desire to build tools for the team of analysts / data scientists. If you hire a data engineer who just wants to dig around in the backend and hates working with less technical people, it will most likely end badly. I look for data engineers who are happy to work with analysts and researchers and are ready to say: "What you are doing looks completely inefficient to me, and I want to change it for the better."”

I fully agree with Michael. Today the best data engineers at startups are the backbone and support of the team; they take part in almost everything the data team does. They should enjoy teamwork and be motivated by the success of the whole team.

If you have read this far, thanks for reading :) This topic really matters to me. Please leave a comment if you think I am completely wrong; I would be interested to hear about your experience with data engineers on your team.

Finally, if you decide to hire a data engineer right now: my company conducts quite a few interviews with such specialists, since we think it is a good way to keep up with the industry. If you want to put a prospective team member through a final check before making an offer, we will be happy to conduct a final interview with your candidates. Just write to us!

Comment from Gleb Sologub, Director of Analytics at Skyeng:

We have 30+ full-time analysts at Skyeng and not a single data engineer yet. This is possible because our entire data infrastructure is built on the cloud services that Tristan is talking about. We use Amazon Redshift as our analytical warehouse, Stitch and Matillion ETL to collect data from 40+ production databases, Segment to collect events, Redash and Tableau for reports and dashboards, and Amazon SageMaker for ML.

The analyst's job at our company is to help a manager deal with a business problem and make a decision. At the start of each task, the analyst needs to understand the problem, come up with hypotheses and MVP solutions to test them, and find out what data is needed and whether it is already in the analytical warehouse. If it is not, any of our analysts can set up a simple pipeline that regularly adds and updates the necessary data in the warehouse, or build a tableau from existing tables in the warehouse.

However, this approach brings a number of problems, and they are exactly what a data engineer should handle according to Tristan. For example, we currently solve performance problems mostly extensively, by adding resources, since all cloud tools make this easy: click a couple of buttons and you have added several nodes to the cluster; choose a more expensive plan and your data arrives more often and faster.

But at some point it becomes more cost-effective to hire a data engineer who will optimize the infrastructure and budget, optimize the pipelines and data storage schemas, set up all kinds of monitoring, learn to catch and fix suboptimal queries, and help the team of analysts do their work more efficiently. We have just reached this point and opened a vacancy that matches about 90% of what Tristan writes about. If you like this role and these tasks, here is an example of such a job at Skyeng.
