Moving your data pipelines from an on-premise solution to a public cloud provider can be a daunting endeavor. For many businesses the benefits of moving to the cloud far outweigh the risks, but what blockers and pitfalls are data engineers and software developers likely to run into? And how should the team be set up to maximize those benefits and to succeed?
Before answering these questions, let us first explore the tremendous potential of moving to a cloud platform and why cloud migration is increasing so rapidly.
Cost Reduction. Owning and maintaining your own Data Centre can be expensive. As well as the hardware refresh costs, there is the overhead of having to manage outages for software upgrades and physical fixes. Moving to the cloud offers the potential to manage and accurately predict your costs.
Flexibility. Perhaps one of the biggest motivators for migrating to a cloud platform is the flexibility and reliability it offers. Multiple server types, predefined machine images, and the latest software versions are all within easy reach with reliability built into your service.
Scalability. Having the ability to gain that extra bit of computing power when you most need it (and then dropping back down again) is a major factor for cloud adoption.
Disaster Recovery/Security. Data Centre based computing requires additional hardware and storage in an external location as part of a full and proper disaster recovery strategy. It also requires a mechanism for maintaining the data transfer to ensure (and prove) that no data has been lost. This is taken care of in the cloud, as multiple copies are kept across both availability zones and regions to ensure restoration can be done as quickly as possible.
Immediately usable (and useful) set of tools. Most of the big cloud providers offer a set of tools that can be utilized on the platform and can be extremely useful for getting your applications up and running very quickly. Everything from networking to Machine Learning tools can be used (at a price).
While moving to the cloud is clearly good from a business point of view, how does it affect the way data engineers have to work?
What is Data Engineering?
Data engineers are the designers, builders, and managers of data pipelines. They develop the architecture and the processes, and they own the performance and data quality of the overall solution. To that end, they need to be specialists in architecting distributed systems, creating reliable pipelines, combining data sources, and building and maintaining data stores.
The role has evolved in the last few years as software engineers have been required to learn more about data and traditional database engineers have been required to learn software engineering languages as businesses have moved away from enterprise warehouse solutions to distributed ‘big data’ pipelines.
Skills required by a Data Engineer
As such, data engineers require skills in a number of technical disciplines. These include scripting (such as Linux shell scripting and Python), object-oriented programming (particularly Java and Scala) and, of course, SQL and how its syntax varies between different applications. The role also requires an understanding of distributed systems, data ingestion and processing frameworks, and storage engines. Experienced data engineers know the strengths and weaknesses of each tool and what it is best used for. There is also a requirement to know the basics of DevOps, particularly when installing new tools, running statistical experiments and implementing machine learning for Data Scientists.
So, with all those skills and knowledge, what is the problem? Surely migrating straight to the cloud is easy?
Well, not quite. Despite a decent knowledge of operations, Data Engineers are NOT DevOps. They do not have a deep level of understanding of networks, VPCs, subnets, security and infrastructure languages (like Terraform). Plus, as more companies look to move to a multi-cloud strategy, the complexity of cloud account structures requires specialist knowledge. There is also an increasing requirement to help users navigate the intricacies that arise from using multiple data mining tools. Analysts in the past never had to worry about how big a cluster must be to make sure their query completes in a reasonable time, nor did they have to interpret a SQL failure message that reads like a Java runtime error. The feature development teams often do not have the time to help with this, so something else is required.
Data and Application Operations (DOPS & APO)
DOPS works with engineers and network teams and is responsible for the support of the managed data and shared cloud accounts. This includes VPCs, subnets, Identity and Access Management, whitelisting and blacklisting, and ensuring that account security complies with the governance team's requirements. The team also supports application deployments, is the point of contact for any infrastructure issue resolution and is the focal point for upgrades and patching (where required).
The DOPS team also provides a vital service in helping to manage cloud costs. Developers (and testers for that matter) have a habit of launching clusters to serve their needs – in the development, testing and production environments – but then often forget to tear them down afterward. This means the company can end up spending a fortune on virtual instances that are simply not being used. The DOPS team will monitor, alert and generally police this to make sure it does not occur.
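The monitoring and alerting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Cluster` record, its fields, the four-hour idle threshold and the sample inventory are all hypothetical, and a real implementation would read cluster state from the cloud provider's own APIs rather than from a hand-built list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Cluster:
    # Hypothetical inventory record, not a real cloud provider object.
    name: str
    owner: str
    environment: str
    last_activity: datetime
    hourly_cost: float


def find_idle_clusters(clusters, now, max_idle=timedelta(hours=4)):
    """Return clusters with no recorded activity for longer than max_idle."""
    return [c for c in clusters if now - c.last_activity > max_idle]


def wasted_spend(cluster, now):
    """Rough cost of the idle period, in the same currency as hourly_cost."""
    idle_hours = (now - cluster.last_activity).total_seconds() / 3600
    return round(idle_hours * cluster.hourly_cost, 2)


# Made-up sample data for illustration only.
now = datetime(2020, 1, 15, 9, 0)
inventory = [
    Cluster("etl-dev-01", "alice", "development", datetime(2020, 1, 14, 17, 0), 3.50),
    Cluster("etl-prod-01", "pipeline", "production", datetime(2020, 1, 15, 8, 45), 12.00),
]

idle = find_idle_clusters(inventory, now)
for cluster in idle:
    print(f"ALERT {cluster.name} (owner {cluster.owner}): idle, approx. {wasted_spend(cluster, now)} spent")
```

Whether an idle cluster is torn down automatically or only reported to its owner is a policy decision; alerting first is the gentler option in a production environment.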
The APO team is more focused on supporting the teams that are using the data. If you run an application that auto-scales a cluster based on the CPU utilization needed for a query to complete, it is in the interests of the company to have someone who is an expert in query optimization, or the costs are going to spike with poorly written queries. That is where the APO team comes in. Its members are experts not only in rewriting queries for speed but also in teaching users how to do this themselves. They also monitor query and table usage so that a deprecation program can be created for low-usage tables, and they directly support engineering teams with ‘proofs of concept’ for new external applications. With the rate of new products entering the data processing market, providing a service to evaluate new products is vital and ensures continued innovation within the engineering team.
So DOPS ensures the cloud infrastructure is supported across all the data teams and APO ensures that the applications and users are supported.
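As a rough illustration of the table usage monitoring mentioned above, the following Python sketch counts how often each table appears in a query log and lists the tables that fall below a usage threshold. The log format, the table names and the threshold of five queries are assumptions made for the example; in practice the usage data would come from the query engine's own audit or history logs.

```python
from collections import Counter


def deprecation_candidates(all_tables, query_log, min_queries=5):
    """Return tables referenced by fewer than min_queries logged queries."""
    usage = Counter(table for entry in query_log for table in entry["tables"])
    return sorted(t for t in all_tables if usage[t] < min_queries)


# Hypothetical catalogue and query log for illustration only.
tables = ["bookings", "clicks", "legacy_rates"]
query_log = (
    [{"user": "ana", "tables": ["bookings", "clicks"]}] * 5
    + [{"user": "raj", "tables": ["bookings", "legacy_rates"]}]
)

candidates = deprecation_candidates(tables, query_log)
print("Candidates for deprecation review:", candidates)
```

A list like this is a starting point for a conversation with table owners, not a decision in itself, since a table queried once a quarter may still matter.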
Perfect. Now you are all set up to fully utilize the power of the cloud. Or are you?
What about data quality and data availability?
As the business world starts to rely more and more on machine learning, the accuracy of the underlying data that ML models are trained on has become far more important.
It is no longer acceptable to have ‘mostly’ good data; even the smallest amount of ‘bad’ data can cause inaccuracies in predictive analytics.
As data engineers, we bear the brunt of any criticism, and rightly so – data scientists often bemoan the fact that much of their time is spent cleaning up data rather than building the models they are trained to build. We are the first part of a long chain and the world of data engineering has to embrace this responsibility.
This is the usual timeline for a Production failure:
- Production Support is alerted to a failure in the middle of the night
- They apply a ‘Band-aid’ fix to get the application up and running again
- The next day they inform the development team who own the code to assess options
- The development team then plan the reprocessing of bad data to stop users from having to halt their work
- A permanent fix is suggested, estimated and then put on the backlog (often never to be seen again!)
The other issue with data quality is that feature development teams can spend multiple days within a sprint simply trying to get to the bottom of failures. This means that promised roadmap items get pushed further and further back, making the teams less efficient and causing frustration and mistrust from the stakeholders.
So, what can we do about it?
Step forward the Data Reliability Engineering team!
Data Reliability Engineering (DRE)
DRE is what you get when you treat data operations as a software engineering problem. Following the philosophy of SRE, Data Reliability Engineers spend 20% of their time on operations and 80% on development. This is not about being a production support team – this is about being a talented and experienced development team that specializes in data pipelines across multiple technical disciplines.
The 6-step mission of DRE is:
- To apply engineering practices to identify and correct data pipeline failures
- To use specialist knowledge to analyze pipelines for weaknesses and potential failure points and fix them
- To determine better ways of coping with failures and increase automation of reprocessing functionality
- To work with pipeline developers to advise of potential DQ issues with new designs
- To utilize and contribute to Open Source DQ Software products
- To improve the ‘first to know rate’ for DQ issues
So, the DRE team owns the failure, the fix, and the message out to users. It can call on the feature team developers for help if specialist knowledge is required, but aims to handle as much as possible in-house, thus freeing feature teams to continue with their roadmap.
One of the other main functions of DRE is to set up and own the data quality platform. Whether it is an open-source DQ solution like Apache Griffin or Quibble or a set of Spark libraries like AWS Deequ, the team must work out the best way of alerting about data issues and, where possible, set up an ‘ETL circuit breaker’ within the job flow that will kill a job if any anomalies are detected on key fields. This stops incorrect data from getting into the Production tables, which is one of the things that most frustrates users. The DRE team is actively encouraged to contribute back to these open source projects with any improvements or modifications it has made.
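To make the ‘ETL circuit breaker’ idea concrete, here is a minimal sketch in plain Python. It does not use the API of Apache Griffin, Deequ or any other DQ product; the function names, the null-rate check and the sample rows are all assumptions chosen for illustration. The point is only the control flow: validate the batch on its key fields first, and raise an error that halts the job before anything is written.

```python
class CircuitBreakerTripped(Exception):
    """Raised to stop the job before bad data reaches production tables."""


def null_rate(rows, field):
    """Fraction of rows where the field is missing or None."""
    if not rows:
        return 1.0  # assumption: an empty batch is treated as anomalous
    missing = sum(1 for row in rows if row.get(field) is None)
    return missing / len(rows)


def check_batch(rows, key_fields, max_null_rate=0.0):
    """Return the rows unchanged, or raise if any key field looks anomalous."""
    failures = {}
    for field in key_fields:
        rate = null_rate(rows, field)
        if rate > max_null_rate:
            failures[field] = rate
    if failures:
        raise CircuitBreakerTripped(f"anomalies on key fields: {failures}")
    return rows


def run_job(rows, key_fields, load):
    """Validate first; the load step only runs if the checks pass."""
    load(check_batch(rows, key_fields))


# Hypothetical batches and an in-memory stand-in for a production table.
production_table = []
good_batch = [{"booking_id": 1, "price": 120.0}, {"booking_id": 2, "price": 80.0}]
bad_batch = [{"booking_id": 3, "price": None}]

run_job(good_batch, ["booking_id", "price"], production_table.extend)
try:
    run_job(bad_batch, ["booking_id", "price"], production_table.extend)
except CircuitBreakerTripped as exc:
    print("Job killed:", exc)
```

A null-rate check is only one possible anomaly test; row counts, value ranges or distribution checks would slot into `check_batch` in the same way.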
But…does that mean the feature teams simply throw Data Quality responsibilities over the fence to DRE? Certainly not! Each team still has a responsibility for its own pipeline, and DQ should be a core element of the architecture and design. The DRE team works with both developers and Product teams to make sure that DQ is included in estimates, and is itself part of the sign-off process for QA/UAT.
Is DRE the complete solution to all Data Quality problems? Unfortunately not – bad data issues will always occur, as edge cases for data in particular are so hard to predict. However, having a dedicated engineering team for DQ shines a light on issues, provides transparency to stakeholders and data consumers and builds trust between data engineering, data science and the analysts who are so dependent on correct data.
To sum up, migrating to the cloud can be very difficult for engineers and users alike but, if handled in the right way, can lead to gains in flexibility and scalability and to lower costs, as well as creating an environment for exploratory analytics and accelerated innovation. This team structure gives the feature teams the time and space to meet the needs of the business, provides two operations teams to make sure the infrastructure and users are all supported, and adds a capability team that helps build trust by improving data quality and availability.
Torq Pagdin, Director of Data Engineering at Hotels.com, lays out how he has set up his team to maximize the power of data in a public cloud platform.