Simplifying big data testing

The trends around the use of big data are changing fast, and so are the accompanying challenges. Big data testing is one of the key challenges in a successful big data implementation. Umesh Kulkarni, Head of Presales and Business Solutions, and Sagar Pise, Senior Technical Architect, L&T Infotech, explain.

Organisations often struggle to define a testing strategy for structured and unstructured data validation, to work with non‑relational databases, to set up an optimal test environment and to find the right talent to test complex ecosystems.

The key questions we posed during our TEST Focus Groups roundtable discussions were: how do organisations devise a one‑stop big data testing solution in the knowledge that each big data project has many different aspects and priorities for its quality assurance? And how do they do it in a simplified and structured way?

Big data versus ‘Lots of Data’

At the outset of our discussions, it was important to clarify what is meant by a ‘big data’ problem, and identify what makes a ‘big data’ problem different from a ‘lots of data’ problem.

The three key characteristics of big data are:

  • Volume – data above a few terabytes.
  • Velocity – data needs to be processed quickly.
  • Variety – different types of structured/unstructured data.

High‑level indicators that an organisation is dealing with big data are:

  • Traditional tools and software technologies are failing to process data.
  • Data processing needs a distributed architecture.
  • Data processing involves structured, semi‑structured and unstructured data sets.

As discussions continued, and the roundtable participants shared their experiences of how their respective organisations are leveraging big data, it was acknowledged that there is a major change in how data is generated and consumed. Previously, only a few entities or companies generated data and the rest of the world consumed that data. Today, everyone is generating data and everyone is consuming data. And whilst most organisations have started focusing on big data testing, very few have a clearly defined big data test strategy.

Big data flow and assurance checkpoints

When implementing a big data testing strategy, it is important to understand the big data flow and assurance checkpoints.

Data coming from heterogeneous data sources, such as Excel files, flat files, fixed‑length files, XML, JSON and BSON, binary files, MP4, Flash files, WAV, PDF files, Word documents, HTML files, etc., needs to be dumped into big data stores so that it can be processed further and meaningful information extracted from it. Before moving these data files into big data stores, it is advisable to verify source file metadata and checksums as the first‑level assurance checkpoint.
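
As an illustration of this first checkpoint, the following Python sketch compares a file's size and SHA‑256 checksum against expected values. It assumes the source system supplies those expected values in some form of manifest; the function names (file_sha256, verify_source_file) and the sample payload are hypothetical and not taken from any particular tool.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_source_file(path, expected_size, expected_sha256):
    """Return a list of problems; an empty list means the file passed."""
    problems = []
    actual_size = os.path.getsize(path)
    if actual_size != expected_size:
        problems.append(
            f"size mismatch: expected {expected_size}, got {actual_size}"
        )
    if file_sha256(path) != expected_sha256:
        problems.append("checksum mismatch")
    return problems


# Demonstration with a throwaway file standing in for a source extract.
payload = b"id,amount\n1,10\n2,20\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    sample_path = tmp.name

# Assumed manifest values: in practice these would come from the source system.
manifest_sha256 = hashlib.sha256(payload).hexdigest()

clean_result = verify_source_file(sample_path, len(payload), manifest_sha256)
tampered_result = verify_source_file(sample_path, len(payload) + 1, "0" * 64)
os.remove(sample_path)
```

Reading in chunks keeps memory use flat for large files. Which metadata to compare (size, row count, timestamp, encoding) would depend on what the source system can actually provide.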

Once heterogeneous source files have been dumped into big data stores, pre‑big data processing validation should verify that the files were dumped according to the dumping rules and in full, with no discrepancies in the form of extra, duplicate or missing data.
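
A check for extra, duplicate and missing data can be sketched as a reconciliation of record keys between the source and the store. The example below is a minimal in‑memory version that assumes every record carries a key; the function name reconcile and the sample keys are illustrative only. At real volumes the same comparison would normally run inside the distributed engine rather than in a single process.

```python
from collections import Counter


def reconcile(source_keys, loaded_keys):
    """Compare record keys from the source with keys found in the store."""
    source_counts = Counter(source_keys)
    loaded_counts = Counter(loaded_keys)

    # Keys present in the source but absent from the store.
    missing = sorted(set(source_counts) - set(loaded_counts))
    # Keys present in the store that never existed in the source.
    extra = sorted(set(loaded_counts) - set(source_counts))
    # Keys loaded more times than they appear in the source.
    duplicate = sorted(
        key
        for key, count in loaded_counts.items()
        if key in source_counts and count > source_counts[key]
    )
    return {"missing": missing, "extra": extra, "duplicate": duplicate}


source = ["a1", "a2", "a3", "a4"]
loaded = ["a1", "a2", "a2", "a5"]
report = reconcile(source, loaded)
```

An empty list under all three headings indicates that the load is complete for the keys compared.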

Once data dumping is completed, an initial level of data profiling and cleansing is carried out, followed by execution of the actual functional algorithms in distributed mode. Testing of the cleansing rules and the functional algorithms forms the main assurance checkpoints at this layer.
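
Cleansing rules lend themselves to small, repeatable tests that run before the full distributed job. The sketch below uses two made‑up rules (trim whitespace, and treat empty strings as missing) and a hypothetical required field, customer_id; real rules would come from the project's business requirements.

```python
def cleanse(record):
    """Hypothetical cleansing rule: trim strings, turn empty strings into None."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
            if value == "":
                value = None
        cleaned[key] = value
    return cleaned


def split_by_quality(records, required_fields):
    """Cleanse each record, then separate accepted records from rejected ones."""
    accepted, rejected = [], []
    for record in records:
        cleaned = cleanse(record)
        if all(cleaned.get(field) is not None for field in required_fields):
            accepted.append(cleaned)
        else:
            rejected.append(cleaned)
    return accepted, rejected


raw = [
    {"customer_id": " C001 ", "city": "Pune "},
    {"customer_id": "   ", "city": "Mumbai"},
    {"customer_id": "C003", "city": ""},
]
accepted, rejected = split_by_quality(raw, required_fields=["customer_id"])
```

Keeping the rejected records, rather than silently dropping them, makes it possible to count and inspect them later.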

After the data processing algorithms have been executed, the refined data is passed to downstream systems such as the enterprise data warehouse for historical data analysis and reporting systems. Here, report testing and ETL testing act as assurance checkpoints.

All of these high‑level assurance checkpoints will ensure:

  • Data completeness – wherein end‑to‑end data validation among heterogeneous big data sources is tested.
  • Data transformation – wherein structured and unstructured data validations are performed based on business rules.
  • Data quality – wherein rejected, ignored and invalid data is identified.
  • Performance and scalability – wherein scalable technical architecture used for data processing is tested.

Traditional data processing versus big data processing

Discussions around the differences between traditional data processing and big data processing models raised questions about tools and technology, skill sets, processes and templates, and cost.

When considering the big data characteristics of volume, velocity and variety, the following challenges and recommended approaches to deal with them were highlighted at the roundtable sessions:

  • 100% test coverage versus adequate test coverage: focus on adequate test coverage, prioritised by project and situation.
  • Setting up a production‑like environment for performance testing: simulate or stub components that are not ready or available during the testing phase.
  • High scripting effort when dealing with a variety of data: requires strong collaboration among testing, development, IT and all other teams, along with a strong skill set.
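
The stubbing approach mentioned above can be sketched as follows. The example assumes a processing step that depends on a geo‑lookup service which is not yet available in the test environment; the service, the class StubGeoLookup and the function enrich are all hypothetical names used only for illustration.

```python
class StubGeoLookup:
    """Stand-in for a hypothetical geo-lookup service that is not yet ready."""

    def __init__(self, canned):
        self.canned = canned  # fixed answers keyed by IP address
        self.calls = []       # record of every lookup, for later assertions

    def country_for(self, ip):
        self.calls.append(ip)
        return self.canned.get(ip, "UNKNOWN")


def enrich(events, lookup):
    """Processing step under test: add a country to each event."""
    return [dict(event, country=lookup.country_for(event["ip"])) for event in events]


events = [{"ip": "10.0.0.1"}, {"ip": "10.0.0.9"}]
stub = StubGeoLookup({"10.0.0.1": "IN"})
enriched = enrich(events, stub)
```

Because the processing step receives the lookup as a parameter, the stub can be swapped for the real service once it becomes available, without changing the step itself.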

Key components of a big data testing strategy

The recommended big data testing strategy should include the following:

  • Big test data management.
  • Big data validation.
  • Big data profiling.
  • Big data security testing.
  • Failover testing.
  • Big data environment testing.
  • Big data performance testing.

Emerging big data skill sets and ongoing training

During the final part of the discussion, there were different views on the technical skill set required for big data testing. As big data testing is still an emerging area, it was commonly agreed that a big data tester needs strong technical skills, at least until the industry has standard big data testing tools. Big data testers also need a strong knowledge of the business domain and its processes in order to carry out effective big data testing. A typical QA‑BA model will play a major role in closing gaps in business domain and process knowledge. Additionally, because big data technologies change frequently and the big data tool stack keeps growing, ongoing training will be necessary.

Edited for web by Jordan Platt