
Hi and welcome to my e-book about "14 tips for data engineers".

My name is Bartosz Konieczny and I am a data engineer with more than 10 years of experience in software engineering.

Aside from my professional work, I write articles about data and Scala on www.waitingforcode.com.

It's another one of my e-books. Previously I published documents about "10 useful points to know about Python before starting to code" and a glossary of "89 data terms to know". If you have subscribed to my mailing list, I will send them to you soon.

If you have not read my previous data documents, send me an e-mail and I will share them with great pleasure! They are not a prerequisite for the points included here, but they may help you in your daily data work.

In this document I would like to share with you 14 tips that can be applied to any data-oriented project. All of them are the fruit of my past experience as a software engineer who made some data engineering mistakes at the beginning. And since it's better to learn from mistakes made by others, I will share mine with you :-)

I divided them into 3 big sections. Each of them represents a particular moment to apply them - either before, during or after completing the coding work.

Before you start to code

Explore
It's the first step to take when you start working on a new data project. Even if you have already been working with a given dataset, you should always take some time to discover it and ensure that its structure hasn't evolved. Of course, if you have the chance to deal with fully structured data, you will only need to read the schema. Otherwise (the JSON case), you will need to play a little with the data to discover the underlying rules and identify potentially nullable fields.

The exploration will also give you some input about potential data quality issues like inconsistent date formats or types. Again, this applies mostly to semi-structured data (JSON); you should encounter fewer problems with fully structured and documented datasets.

To explore a dataset there is nothing better than ad-hoc querying languages, where SQL is the king. If you prefer a more programmatic way, you can use notebooks like Jupyter, Zeppelin or Databricks Notebook. On the other hand, if you prefer a less interactive way, you can opt for another solution and, for instance, write an Apache Spark data exploration application which will generate some reports about the dataset you will work on (doing it from a notebook is better though).

During data exploration, also pay attention to personal data and check which policies apply to it.

Think fault-tolerance
Data processing frameworks provide built-in fault-tolerance. That means they handle different strategies, like checkpointing, for you. But that doesn't mean you should ignore everything related to fault-tolerance. Despite these included guarantees, you should always think about the delivery semantics provided by the fault-tolerance mechanisms, like possible data loss or data duplication. You should also simulate the situations when things go wrong, in order to discover whether recovery requires some manual intervention or can be automated.

Recovery
Write your code with the potential risk of reprocessing in mind. Bad things happen and sometimes your unit tests may not cover your application well enough to detect them. The error may also be purely human: someone can delete the dataset.
In such cases, you should always be able to recompute the data. Think about the budget too, since reprocessing big volumes of data may be costly.

During the development

Start small
Try to write your basic data processing logic and apply it to a small part of the main dataset. It will help you validate the logic without having to wait a long time for feedback or spend a lot of money on expensive compute resources.

To select your test data, you can simply take the first X files if you're working in batch, or filter specific events if you're doing streaming processing. It's up to you to define the filtering condition. For instance, if you have some numeric values, like customer ids, you can always take the events where that id is smaller than 2000.
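
With Apache Spark the sampling could look like the following sketch, where the file paths and the customer_id column are assumptions made for the illustration.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object StartSmall {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("start-small").master("local[*]").getOrCreate()

    // Batch case: read only the first few files instead of the whole dataset
    val sampleFiles = Seq("/data/raw/part-00000.json", "/data/raw/part-00001.json")
    val batchSample = spark.read.json(sampleFiles: _*)

    // Filtering case: keep only the events matching a simple condition,
    // for example the customers with an id smaller than 2000
    val filteredSample = batchSample.filter(col("customer_id") < 2000)

    filteredSample.show(20, truncate = false)
    spark.stop()
  }
}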

Test soon
Once you have made sure that your initial data processing logic works fine on the small part of the main dataset, try to apply it to the main one as soon as possible. It will help you validate the logic and also see whether you already need to add some code optimizations. This test will also give you some input about the required resources, the scaling rules to define, and so on.

It's good to run these tests on the same dataset every time. Thanks to that you will be able to see whether the changes you made improved the execution time and the generated results.

If you don't know, check locally first

You are not supposed to know everything about everything. It's quite normal to have areas of doubt, even about a framework you're using daily. To resolve them, whenever possible, think local-first. Unless you know how to read and understand the logs, it's very often much easier to work with a small dataset (see the "Start small" point), add some breakpoints in your IDE and check locally what happens to a given feature.

It happened to me when I wanted to check how Apache Spark works with GZIP compressed files. To analyze the behavior, I didn't need to process thousands of them. Instead, I simply took 3 files, processed them locally and analyzed what happened with the help of breakpoints.
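
The local check looked roughly like this sketch (the file names are placeholders); since GZIP files are not splittable, each file lands in its own partition, which is exactly the kind of behavior you can confirm with a breakpoint.

import org.apache.spark.sql.SparkSession

object GzipLocalCheck {
  def main(args: Array[String]): Unit = {
    // A local master is enough to observe how Spark handles GZIP files
    val spark = SparkSession.builder()
      .appName("gzip-local-check")
      .master("local[2]")
      .getOrCreate()

    // Three sample files copied locally; paths are placeholders
    val logs = spark.read.textFile(
      "samples/logs-1.gz", "samples/logs-2.gz", "samples/logs-3.gz")

    // Put a breakpoint here in your IDE: GZIP is not splittable,
    // so each file should end up in its own partition
    println(s"Partitions: ${logs.rdd.getNumPartitions}")
    println(s"Lines: ${logs.count()}")

    spark.stop()
  }
}

Three files and a local master were enough to answer the question in a few minutes.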
Note
Always remember to note your observations. If you added more compute power, changed the configuration, the partitioning or whatever else, write it down and describe how it impacted your pipeline. It will let you avoid going back and forth around the same solutions. It will also help your colleagues learn new approaches and give you some help, since they will know what you are working on. Maybe it will even inspire them to do the same and give you information back. You can go even further and share your observations with the community!

TDD
The data landscape is continuously evolving and very often you won't master all the currently used tools. That's the reason why you can make bad decisions about frameworks or patterns. For instance, you may start by writing your data to a data store that scales poorly, or end up concluding that a given framework doesn't satisfy you.

But I have some good news. Even if you test your processing logic with different frameworks (Apache Spark or Apache Flink, to quote only a few of them) and data stores, you can still do it smoothly and without fear of breaking something. The only rule is: write your business logic as a well-tested abstraction.

Thanks to that, you will be able to plug this abstraction into any execution environment without the fear of breaking it between your tests. It will also help you with the overall maintenance since, after every small change, you will be able to detect whether the change broke the processing or not.

Without unit tests, your judgment about a potential runtime gain will be compromised because there will always be a risk that the improvement comes from a regression.
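
To make the idea concrete, here is a minimal sketch of such an abstraction and its unit test, assuming ScalaTest as the test dependency and an invented keepPayingCustomers rule; the point is that the function knows nothing about where the DataFrame comes from, so it can be plugged into a batch job, a streaming query or a test without changes.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// The business logic lives in a plain function over DataFrames
object SessionLogic {
  def keepPayingCustomers(orders: DataFrame): DataFrame =
    orders.filter(col("amount") > 0)
}

// A minimal ScalaTest suite; in a real project it would live under src/test
import org.scalatest.funsuite.AnyFunSuite

class SessionLogicTest extends AnyFunSuite {
  private val spark = SparkSession.builder()
    .appName("tdd-test").master("local[1]").getOrCreate()
  import spark.implicits._

  test("orders with a zero amount are filtered out") {
    val input = Seq(("alice", 10.0), ("bob", 0.0)).toDF("customer", "amount")
    val result = SessionLogic.keepPayingCustomers(input).collect()
    assert(result.map(_.getString(0)).toSet == Set("alice"))
  }
}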

Debug
Debugging data processing-oriented applications is not easy. Since the processing is distributed and works on big volumes of data, adding breakpoints could be overkill. A good alternative is... the logs. If you think that some code can fail, always try to catch the failure along with its context. For example, if you are processing Apache HTTP logs, the context will be the log line causing the exception. This point goes with the "Start small" and "Test soon" ones because during the first iterations of your project, you will need to detect as many data quality issues and processing errors as possible.

Also, if you need to debug a streaming application with continuously arriving data, you can add debug code directly to the running application. Let's take the example of filtering. If you see that you're filtering out too much data, you can always add debugging messages to your code like "Filtered because of filter#1 for record...", "Filtered because of filter#2 for record...". Maybe there are some types of records you didn't expect when you defined the filters. Or maybe your unit tests didn't cover all the scenarios. If you add debugging code to your running application, always remember either to remove it later or to lower its logging level so as not to generate too much noise.

Iterate
Do not expect to write the final version of your data app in the first iteration every time. Very often you will repeat the above steps over several iterations before reaching a deployable version. It's important to keep that in mind and not try to implement everything at once.

For example, if you are doing a classical filter-map-aggregate operation, you can start by writing the code for the filtering stage in the first iteration, cover it with unit tests, test it on the small and the final datasets, and only if everything is OK, move on to the next step. Thanks to that you will not only be able to isolate and test your features more easily but also to split the project's work between different team members.

After you finish a new version

Avoid big-bang
That's true for software engineering in general. If you make some changes in your logic and you want to test them at scale, try to isolate the changes and avoid a big-bang deployment. With a big-bang deployment you will need much more time to figure out which change caused the regressions, of course only if they weren't caught by your tests. Testing one change, or several very small ones, at a time should help you isolate the behavior better and figure out what caused the change.

Monitor and be alerted

That's especially true for streaming projects, but not only for them. You should be aware of every bad and good thing that happens in your data pipelines, so during your data project take special care of the monitoring and alerting part. And collaborate too. Ask the colleagues who will use the data what the key metrics are and expose these metrics as metadata of the generated dataset. You can later use them to validate the dataset and avoid exposing it to the end users if it doesn't meet the requirements. For instance, if you see that the dataset you generated today has half as much data as one week ago for a similar volume of input data, you can consider it wrong and keep it in your staging area instead of pushing it to production.
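
As an illustration, a simplified validation step could look like this sketch, where the paths, the dates and the 50% threshold are assumptions; a real pipeline would plug the alert into your monitoring system instead of a println.

import org.apache.spark.sql.SparkSession

object PublishWithValidation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dataset-validation").getOrCreate()

    // Paths, dates and the 50% threshold are placeholders for this sketch
    val today = spark.read.parquet("/data/staging/2023-06-08")
    val todayCount = today.count()
    val lastWeekCount = spark.read.parquet("/data/published/2023-06-01").count()

    if (todayCount < lastWeekCount * 0.5) {
      // Keep the dataset in staging and alert instead of exposing it to the users
      println(s"ALERT: only $todayCount rows vs $lastWeekCount one week ago - not publishing")
    } else {
      today.write.parquet("/data/published/2023-06-08")
    }

    spark.stop()
  }
}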

Take nothing for granted

I had that problem with one of my first data pipelines. It was a batch job processing the data accumulated during the last hour. For weeks everything was fine: I was getting the output regularly and the volumes were consistent. Unfortunately, one day it started to behave strangely because the volume of data kept decreasing, until it reached barely 100 MB, which was very little. After some analysis I figured out the problem. The input files were arriving with more and more latency and, therefore, I was processing less and less data per hour.

It taught me not to take everything for granted. Even though you're expecting your data on an hourly basis, always try to ensure that the data you're processing is really the data you want to process. For instance, you can add a precondition before starting your pipeline and check whether the volume of your input data falls within a specific range. You can also add a check after the processing to verify whether the number of output rows is consistent with the results generated by previous runs.

Dry run
It's the last step before the production stage. Always try to deploy your whole application with the infrastructure it will use in production before really moving to production. You can output the generated dataset to some temporary storage that won't be used by the final users, but you should use production data for the tests. It will help to put your application into its real context and tell you whether you are ready to release it or not.

The reason why it's important to make some dry runs is that it will often be difficult to reproduce your production environment, especially if you have really big volumes of data. If that's not the case, make the dry run on your development environment but in production conditions. Otherwise, you have no choice but to deploy it to production without public exposure and, if everything goes well, enable the exposure part.
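
A simple way to implement it is to make the output location configurable, as in this sketch where the mode flag and the paths are invented for the illustration.

import org.apache.spark.sql.SparkSession

object ConfigurableOutputJob {
  def main(args: Array[String]): Unit = {
    // "dry-run" vs "production" is a hypothetical flag passed at submit time
    val mode = if (args.nonEmpty) args(0) else "dry-run"
    val outputPath =
      if (mode == "production") "/data/published/report"
      else "/data/dry-run/report" // temporary storage, invisible to end users

    val spark = SparkSession.builder().appName(s"report-job-$mode").getOrCreate()

    // The input is always the production data, so the dry run stays realistic
    val report = spark.read.parquet("/data/production/events")
      .groupBy("country").count()

    report.write.mode("overwrite").parquet(outputPath)
    spark.stop()
  }
}

The same binary then runs in both modes, so the dry run exercises exactly the code that will go to production.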
You can find a summary of the above 14 points in the following picture:

That's all. The above steps should improve your data engineering life. If you asked me to summarize them in a few words, I would advise you to start small and work in short iterative cycles on well unit-tested code.

Bartosz Konieczny
www.waitingforcode.com
