Project’s Innovations

There is a need for tangible big data analytics tools and services that meet the requirements of users and practitioners. These solutions must fundamentally improve existing technology, methods, standards and processes, and facilitate interaction with all big data processing components, so that they address real needs.

Several existing frameworks (GraphLab, Map/Reduce, etc.) and systems (Cloudera, Bigtop, etc.) relate to the processing of distributed big data. However, the difficulty of their programming paradigms slows down the emergence of novel products and services built on top of these frameworks and systems. Furthermore, existing solutions fail to engage non-IT experts in a more direct interaction with enterprise workflows for extracting actionable knowledge from big data.
Organizations have traditionally valued and protected their data. Even within the same organization, there are roadblocks to the free flow of data. We are at a crucial point in time where organizations are realizing that access to data is not a zero-sum game. Sharing can lead to multiplicative effects that will benefit the economy and society.

There is a lack of commonly agreed standards and frameworks, which makes data integration a very challenging and costly process. Some technologies and efforts can help in accommodating data from multiple heterogeneous sources, such as standards for common semantic data models and formats, Linked Data (http://linkeddata.org), data anonymization, and data aggregation. However, the degree of data sharing and re-use remains unsatisfactory, and technological advances are needed to foster data sharing and re-use.
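As a concrete illustration of how Linked Data can accommodate data from heterogeneous sources, the following is a minimal Python sketch using the rdflib library: two RDF exports that share a common vocabulary are merged into one graph and queried uniformly with SPARQL. The file paths and the FOAF vocabulary are illustrative assumptions, not references to actual project datasets.

    from rdflib import Graph

    # Merge two hypothetical RDF exports that use a shared vocabulary (here FOAF).
    g = Graph()
    g.parse("data/source_a.ttl", format="turtle")   # illustrative path, not a real dataset
    g.parse("data/source_b.ttl", format="turtle")   # illustrative path, not a real dataset

    # Because both sources follow the same semantic data model, a single SPARQL
    # query spans them transparently.
    results = g.query("""
        SELECT ?entity ?name WHERE {
            ?entity <http://xmlns.com/foaf/0.1/name> ?name .
        }
    """)
    for row in results:
        print(row.entity, row.name)
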
There exist several platforms for big data, such as Cloudera and Bigtop, some of which have been built within EU-funded projects, such as Big Data Europe and Bigfoot. Platforms like Big Data Europe significantly improve flexibility, usability, and failover recovery. However, some aspects of a platform that would ultimately yield a safe environment for non-IT experts to experiment on and re-use big data still need to be further developed and integrated.
There exist some tools and methods for processing noisy, incomplete, and complex data. Noisy data is handled through various filtering methods or through appropriately robust machine learning models, such as those based on the Huber loss. Incomplete data is handled via interpolation; in more advanced methods, the analytics algorithms are resilient to incomplete data by construction, as in machine learning with sparsity, low-rank, and other regularization models. Finally, complex data can be handled via cascading. However, there is still a need for advances in real-time, streaming processing of heterogeneous data.
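To illustrate the first two techniques above, the following minimal Python sketch (using pandas and scikit-learn on synthetic data) interpolates missing values and fits a Huber-loss regressor that is robust to injected outliers, alongside an L1-regularized (sparse) model; all data and parameter values are illustrative assumptions rather than project results.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import HuberRegressor, Lasso

    # Synthetic, sensor-like series with injected outliers (noise) and gaps (incompleteness).
    rng = np.random.default_rng(0)
    t = np.arange(200)
    y = 0.5 * t + rng.normal(0, 1, size=200)
    y[::37] += 40                                             # outliers
    series = pd.Series(y).mask(rng.random(200) < 0.1)         # ~10% missing values

    # Incomplete data: simple linear interpolation of the gaps.
    filled = series.interpolate(method="linear", limit_direction="both")

    # Noisy data: the Huber loss down-weights outliers compared to ordinary least squares.
    X = t.reshape(-1, 1)
    robust = HuberRegressor().fit(X, filled)

    # Regularization (here L1/sparsity) as an example of models that are resilient by structure.
    sparse = Lasso(alpha=0.1).fit(X, filled)

    print("Huber slope:", robust.coef_[0], "Lasso slope:", sparse.coef_[0])
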
The Experimental Protocol includes:
  • dependent and independent verification and validation variables of the experiments to be conducted;
  • a statistical power analysis to determine the number of experimental subjects required by the cross-sectorial experiments (see the sketch after this list);
  • ways to access and engage the required number of experimental subjects;
  • a concrete and coherent experimentation schedule for real-life industrial cases;
  • industrially validated benchmarks that can demonstrate significant increases in various parameters of data processing, such as the speed of data analysis, the size of data assets that can be processed, and so on;
  • a verification and validation approach including standards and benchmarks of the Big Data domain.
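For the power-analysis item above, a minimal Python sketch (using statsmodels) shows how the required number of experimental subjects can be estimated a priori; the effect size, significance level, and target power below are illustrative assumptions, not values prescribed by the protocol.

    from statsmodels.stats.power import TTestIndPower

    # A-priori power analysis for a two-group comparison (e.g., workflows executed
    # with and without the project's tools). All numbers below are illustrative.
    analysis = TTestIndPower()
    n_per_group = analysis.solve_power(
        effect_size=0.5,   # assumed medium effect size (Cohen's d)
        alpha=0.05,        # significance level
        power=0.8,         # desired statistical power
    )
    print(f"Required experimental subjects per group: {int(round(n_per_group))}")  # ~64
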

The BDV Reference Model (BDV SRIA, European Big Data Value Strategic Research and Innovation Agenda, Version 4.0, October 2017) has been developed by the BDVA, taking into account input from technical experts and stakeholders along the whole Big Data Value chain, as well as interactions with other related PPPs. The BDV Reference Model may serve as a common reference framework to locate Big Data technologies on the overall IT stack. It addresses the main concerns and aspects to be considered for Big Data Value systems. The BDV Reference Model distinguishes between two different elements. On the one hand, it describes the elements that are at the core of the BDVA; on the other, it outlines the features that are developed in strong collaboration with related European activities.

[Figure: The BDV Reference Model (source: BDVA)]


The BDV Reference Model is structured into horizontal and vertical concerns.

  • Horizontal concerns cover specific aspects along the data processing chain, starting with data collection and ingestion, and extending to data visualisation. It should be noted that the horizontal concerns do not imply a layered architecture. As an example, data visualisation may be applied directly to collected data (the data management aspect) without the need for data processing and analytics.
  • Vertical concerns address cross-cutting issues, which may affect all the horizontal concerns. In addition, vertical concerns may also involve non-technical aspects.
It should be noted that the BDV Reference Model is compatible with the emerging ISO JTC1 WG9 Big Data Reference Architecture.
In this context, the project:
  • Develops data processing tools and techniques applicable in real-world settings, and demonstrates a significant increase in the speed of data throughput and accessibility;
  • Releases a safe environment for methodological big data experimentation, for the development of new products, services, and tools;
  • Develops technologies that increase the efficiency and competitiveness of all EU companies and organisations that need to manage vast and complex amounts of data;
  • Offers tools and services for fast ingestion and consolidation of both realistic and fabricated data from heterogeneous sources;
  • Facilitates simultaneous batch and real-time processing of Big Data;
  • Offers enhancement of real-time data with batch (historical) data, off-loading of compute-intensive operations to GPUs, and parallel processing of many heterogeneous input channels (see the sketches after this list);
  • Provides a pool of algorithms from traditional ETL/aggregation algorithms to graph processing;
  • Extends Big Data runtimes and tools to support fast and scalable data analytics;
  • Takes advantage of in-database computation, indexing and filtering and improves programmer’s productivity through simple interfaces and a sequential programming paradigm;
  • Develops a distributed large-scale framework for powerful and scalable Data Processing;
  • Promotes management of heterogeneous and federated infrastructures including Cloud and GPU resources and orchestration across diverse resource providers;
  • Integrates infrastructure elasticity capabilities offered by the COMPSs and Hecuba runtime environments;
  • Automates procedures that would otherwise require human interaction, saving resources (money and time) and in some cases reducing even human error;
  • Enables telecom companies to process the immense amounts of data constantly produced by their customers; this processing allows the extraction of critical insights about customer behavior and system performance, and enables the allocation of appropriate resources where needed, for better optimization and resource utilization;
  • Enables better system maintenance and predictive resource allocation, thus reducing customer churn and complaints;
  • Supports probing customers for new products and markets; the availability of big data during such explorations helps determine whether preliminary efforts merit further investment or should be abandoned when they prove non-promising;
  • Introduces real-time big data analytics processing on the plant floor (e.g. car manufacturing).
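
As a concrete illustration of the simultaneous batch/real-time processing and of the enhancement of real-time data with batch (historical) data mentioned in the list above, the following is a minimal sketch using Spark Structured Streaming in Python; the Kafka broker, topic, storage path, and simplified payload are illustrative assumptions, and running it requires the Spark Kafka connector package.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("realtime-enrichment").getOrCreate()

    # Batch (historical) side: pre-computed per-customer aggregates (illustrative path).
    history = spark.read.parquet("hdfs:///warehouse/customer_history")

    # Real-time side: live events from an assumed Kafka topic; the payload is
    # simplified to a bare customer id for the sake of the sketch.
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "events")
              .load()
              .selectExpr("CAST(value AS STRING) AS customer_id"))

    # Stream-static join: each incoming event is enhanced with its historical profile.
    enriched = events.join(history, on="customer_id", how="left")

    query = (enriched.writeStream
             .outputMode("append")
             .format("console")
             .start())
    query.awaitTermination()
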
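Similarly, the off-loading of compute-intensive operations to GPUs can be sketched in Python with the CuPy library; the matrix sizes are arbitrary and the snippet falls back to NumPy on machines without a CUDA GPU, so it illustrates only the pattern, not the project's actual kernels.

    import numpy as np

    try:
        import cupy as cp      # GPU array library, assumed available on a CUDA node
        xp = cp
    except ImportError:
        xp = np                # CPU fallback so the sketch still runs everywhere

    # Off-load a compute-intensive operation (a large matrix product) to the GPU
    # when available; the sizes are illustrative only.
    a = xp.random.rand(4096, 4096)
    b = xp.random.rand(4096, 4096)
    c = a @ b

    # Move only the reduced result back to the host for downstream CPU-side steps.
    result = c.sum()
    total = float(result) if xp is np else float(cp.asnumpy(result))
    print("checksum:", total)
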