The COMPSs Programming Model is formed mainly by two main components: its programming interface and its runtime module. The COMPSs runtime module provides the applications with a data-driven task scheduling and implements a best-effort approach to improve data locality, among other features. In I-BiDaaS, we are improving the task management of this execution platform; implementing automatic and transparent task granularity based on the access patterns of each application.
COMPSs1 is a task-based programming model that aims to ease the development of parallel applications for distributed infrastructures. Its native language is Java but it also provides bindings for C/C++ and Python (PyCOMPSs)2. The framework is based on the idea that, in order to create parallel applications, the programmer does not need to be aware of all the underlying computing infrastructure details and does not need to deal with the intricacies of parallel paradigms. The programmer just needs to specify which are the methods and functions that may be considered tasks (“parallelizable units”) and to provide details on the parameters of the tasks. Then, the sequential code is automatically instrumented by the COMPSs runtime to detect the defined tasks and to build a task graph, which includes the data dependencies between them, thus producing a workflow of the application at execution time. Besides, COMPSs is responsible for scheduling and executing the application tasks in the available computing nodes as well as handling data dependencies, data transfers, and exploiting data locality whenever possible. Finally, thanks to the abstraction layer that COMPSs provides, the same application can be executed either in Clusters, Grids, Clouds, Fog and Edge. COMPSs is used as a programming model for the development of new algorithms for the I-BiDaaS pool of ML algorithms that are not currently available.
By using COMPSs, Qbeast, and Hecuba, it is possible to provide fast interactive analysis and arbitrary approximate queries. Qbeast organizes data both according its multidimensional properties and following a random uniform distribution. In that way, it is possible to select uniform samples of the data that satisfy the query efficiently and incrementally. Furthermore, Qbeast integrates with COMPSs through Hecuba, allowing distributed applications to run in parallel seamlessly, while taking advantage of data locality and tasks dependencies. The Advanced ML module will invoke the application/method implemented with COMPSs to start its execution.
Find out more: http://compss.bsc.es/.
1 R. Badia, J. Conejero, C. Diaz, J. Ejarque, D. Lezzi, F. Lordan, C. Ramon-Cortes, and R. Sirvent, Comp superscalar, an interoperable programming framework, SoftwareX, 32-36, 2015
2 E. Tejedor, Y. Becerra, G. Alomar, A. Queralt, R. M. Badia, J. Torres, T. Cortes, J. Labarta, PyCOMPSs: Parallel computational workflows in Python, The International Journal of High Performance Computing Applications, 2017
Hecuba is a set of tools and interfaces developed at BSC, that aims to facilitate programmers an efficient and natural interaction with a non-relational database. It consists of a database – Hecuba DB (a key-value database – Apache Cassandra or ScyllaDB) and Hecuba interface (between Python applications and key-value databases) with the Qbeast storage system. Using Hecuba, applications can access data like regular objects stored in memory and Hecuba translates the code at runtime into the proper code, according to the backing storage used in each scenario. Hecuba integrates with the COMPSs programming model1. This integration provides the user with automatic and mostly transparent parallelization of the applications. Using both tools together enables data-driven parallelization, as the COMPSs runtime can enhance data locality by scheduling each task on the node that holds the target data.
Apache Cassandra is a distributed and highly scalable key-value database. Cassandra implements a non-centralized architecture, based on peer-to-peer communication, in which all nodes of the cluster can receive and serve queries. Data in Cassandra is stored on tables by rows, which are identified by a key chosen by the user. This key can be composed of one or several attributes, and the user has to specify which of them compound the partition key and which of them the clustering key. A partitioning function uses the partition key to decide how rows are distributed among the nodes.
ScyllaDB is a drop-in replacement for Apache Cassandra that targets applications with real-time requirements. The primary goal of ScyllaDB is to reduce the latency of the database operations by increasing the concurrency degree using a thread-per-core event-driven approach thus lowering lock contentions and better exploiting the modern multi-core architectures. The programmer interface and the data model are the same as with Apache Cassandra. The peer-to-peer architecture that provides scalability and availability is also inherited from Cassandra.
The current implementation of Hecuba supports Python applications that use data stored in memory, in both Apache Cassandra database or ScyllaDB (Hecuba DB).
Hecuba supports the interaction between Apache Cassandra and both the advanced machine learning module and the TDF. The first step to integrate Hecuba in the I-BiDaaS platform was to define the data models that fit the requirements of the applications and, if it was necessary, to extend the Hecuba interface to support those models. The machine learning algorithms developed in I-BiDaaS use the interface provided by Hecuba to access the database. Regarding the interaction with the TDF, it has been carried out through the UM component. For this, a Hecuba adapter for UM was developed. This Hecuba adapter writes the data using Hecuba from the appropriate UM channel in the Hecuba DB, as needed. As an alternative, an intermediate module can be developed that takes as input the data to be stored and uses the Hecuba interface to write that data into the storage system.
In I-BiDaaS, we have enhanced Hecuba with the necessary extensions to support the data models of the applications that use databases. Also, we improved the data-driven scheduling model implemented in Hecuba to use data partitioning techniques that are automatic and transparent to the programmer and enhance data locality while deciding optimal task granularity.
Find out more:http://datadriven.bsc.es/hecuba/
1 R. Badia, J. Conejero, C. Diaz, J. Ejarque, D. Lezzi, F. Lordan, C. Ramon-Cortes, and R. Sirvent, Comp superscalar, an interoperable programming framework, SoftwareX, 32-36, 2015
Advanced ML module (UNSPMF). This module contains a pool of machine learning and analytics algorithms implemented in COMPSs programming model1 and Python. The utilization of COMPSs allows the users to program their applications through a sequential paradigm with an annotation of code pieces that can be parallelized, while the actual parallelization is done automatically by the COMPSs runtime module.
For example, we implement the alternating direction method of multipliers (ADMM2) utilizing COMPSs and CVXPY3, a mature Python library for convex optimization. The implementation requires minimal programming effort to adapt the code to various optimization problems. In other words, our implementation integrates two mature open source libraries (COMPSs and CVXPY), where CVXPY per-node solvers are orchestrated through COMPSs, and COMPSs is harnessed to achieve good parallelization.
The pool of implemented algorithms currently consists of:
All these algorithms are publicly available in the I-BiDaaS knowledge repository. Furthermore, UNSPMF has contributed a version of the ADMM for Lasso to the Dislib project.
Find out more:I-BiDaaS knowledge repository
1 Task-based programming in COMPSs to converge from HPC to big data. Javier Conejero, Sandra Corella, Rosa M Badia, and Jesus Labarta, The International Journal of High Performance Computing Applications 1–16, 2017.
2 Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Foundations and Trends in Machine Learning, 3(1):1–122, 2011. (Original draft posted November 2010.)
3 CVXPY: A Python-Embedded Modeling Language for Convex Optimization, S. Diamond and S. Boyd, Journal of Machine Learning Research, 17(83):1-5, 2016.
The Orchestrator in the 1st integrated version of the I-BiDaaS platform was created by ITML for orchestrating the I-BiDaaS components; it operates as a middleware between the User Interface and the rest of the infrastructure.
The orchestrator is a REST API with DB support that provides:
In order to effectively provision the resources that are required to follow dynamic properties of the application, a resource management and orchestration module (RMO), which is provided by ATOS, sits on top of the infrastructure layer. The software modules involved have been used to programmatically interact with the set of virtual and physical resources that compose the application, managing the interactions and interconnections among the different parts of a cloud based application. Aside from being able to automate individual tasks, the orchestrator bundles them together into distributed large optimized workflows. Offering a unified automation process under one central umbrella allows to work on automation and analytics considering the whole service (workflow), while avoids vendor lock-in offering an abstraction layer on top of different types of providers.
The brain of the RMO modules is the resource orchestrator, which is based on Cloudify, allows the platform to describe the deployments using a common specification across different types of Cloud Service Providers. The adaptation engine module monitors that the service conditions are fulfilled, this is evaluated through the information exposed by the cloud orchestrator module, the set of modules manage the lifecycle of the service deployments, being able to apply pre-defined elasticity rules at runtime. The deployment descriptors (or blueprints) are TOSCA compatible and are packaged together with the configuration files and the installation scripts required for running the application. In our current solution, OpenStack and Amazon AWS has been the main technologies used during the project course. The technology enablers selected after a careful evaluation are capable to accommodate I-BiDaaS requirements using open source licenses.
Streaming analytics and pattern matching module (FORTH). The module will be used for high-speed stream processing and pattern matching. More specifically, FORTH’s current solution is a real-time high speed stream processing and pattern matching engine, tailored for continuous incoming stream data. The engine implements string searching and regular expression matching as underlying functionalities, using pre-compiled DFA automata originated from sets of fixed strings and/or regular expressions. More specifically, patterns given as fixed strings or regular expressions are compiled into a DFA automaton. Then, the pattern searching engine reports any pattern occurrence inside the given input data. In order to provide acceleration for the searching computations and processing, the solution utilizes the characteristics of general purpose GPUs (GPGPUs). High-end commodity GPGPUs offer multiple advantages, such as thousands of cores, low energy consumption and low cost. FORTH implements a stream processing and pattern matching engine specifically on accelerators. The currently deployed solution in FORTH is able to process and analyze multiple Gbit/s of real-time streams of data. The engine mentioned will be offered as an API that could be utilized via various demanding applications of data analytics, offloading computationally intensive tasks to GPUs for acceleration.
The GPU-accelerated pattern matching sub-module, developed by FORTH will be utilized by the SAG's Apama Stream Processing sub-module, in order to offload compute-intensive filtering operations. Several queries, that utilize string searching or regular expression matching to filter incoming data, can be benefited in terms of performance, by offloading these operations to a highly-parallel computational device, such as the GPU. Obviously, the integration of the GPU-accelerated pattern matching sub-module into the Apama sub-module, will introduce extra data copies between the two modules. Indeed, the data will need to be copied from the Apama stream processing engine to the memory space of the GPU, and vice versa. In order to amortize these costs, we will explore different approaches for data copies, including IPC abstractions provided by the operating system (such as sockets/pipes, etc.), as well as modern language artifacts, such as Java Native Interface (JNI).
FORTH will provide a commodity cluster for the Infrastructure Layer. The commodity cluster consists of multiple modern high-end General Purpose GPUs (GPGPUs), other accelerators (such as the Intel Phi), and multi-core processors that support the new Intel SGX technology (ensuring data integrity and privacy). Utilizing a cluster of commodity accelerators, such as GPUs, is vital for the I-BiDaaS architecture, due to the nature of FORTH’s offered solutions. The high-speed string and regular expression searching module (that will be used in the GPU-accelerated streaming analytics and pattern matching sub-module), requires the utilization of hardware accelerators in order to promote the optimization of computationally intensive workloads.
To efficiently offload the requested workloads accordingly to the appropriate resources, an adaptive to high traffic volumes scheduler will be used, offering high performance and low power consumption. More specifically, the commodity GPGPUs cluster consists of modern high-end discrete and integrated GPUs and provides the ability to fit into different computational and performance requirements. FORTH’s cluster includes also modern multi-core CPUs that support Intel Software Guard Extensions (Intel SGX).
Data Fabrication Platform – DFP (IBM). DFP is a web-based central platform for generating high-quality, realistic data for testing, development, and training. The platform provides a consistent and organizational wide methodology for creating test data for databases and files. Users can specify rules, based on which the platform fabricates realistic synthetic data and places the data in a database or a file. The data and the meta-data logic is extracted automatically and is augmented by application logic and testing logic modelled by the user. The modelling is performed via rules (constraints) that the platform provides. Once the user requests the generation of a certain amount of data into a set of test databases or test files, the platform core constructs a constraint satisfaction problem (CSP) and runs the IBM integrated solver to find a solution (which is the generated data) such that the modelled rules as well as the internal data consistency requirements are all satisfied.
The platform is capable of generating data from scratch, inflating existing databases or files, moving existing data, and transforming data from previously existing resources, such as old test databases, old test files and also production data. In essence, the platform provides a comprehensive and hybrid solution that is capable of creating a mixture of synthetic and real data according to user requirements. The generated data can be written into most common databases, e.g., DB2, Oracle, Ms SQL Server, PostgreSQL and also SQLite, or into files with common formats, e.g., CSV, XML, JSON and positional flat files.
The new aspects planned for I-BiDaaS are to extend the current platform to enable automated modelling process which is known to be complex and time consuming. The platform will eventually be able to generate new data from rules which were automatically learned and extracted from real/production data. The fabricated data should be similar to the training data thus realistic. Data similarity in this context is defined as different data that has similar statistical properties both column wise and cross-column wise. Such properties include numerical, dates, words, distributions, means, variances, quantiles, confidence intervals, patterns, distinctness, uniqueness, cross-column correlations and more. The platform modelling automation feature will use either basic inputs such as a column/fields name, its data type and any database constraint rules, in order to find the best match for existing rules in a set of categorized, pre-defined rules. The advanced option will use data analysis to automatically construct the data modelling rules that will generate similar data. This option needs to take into consideration cases where the analysis of the data may leak information and avoid using such data. An example where such a thing may happen and how to mitigate this could be a database containing a column with credit card numbers. The data analysis may report it as uniformly distributed data where each value’s cardinality is one. Also, the number of unique values and the number of distinct values are going to be similar to one another but also similar to the total number of the sampled values. This use case could be a generic filtering criterion. The TDF tool can be installed locally on-premise or as an external service that has the relevant access permissions to the user’s data sources and data targets. Within I-BiDaaS, the data will be fabricated into standard relational databases supported by TDF, and “pushed” to other platform modules through UM via a MQTT protocol.
Find out more: IBM test data fabrication
Apama streaming analytics sub-module (SAG) is an event processing platform. It monitors rapidly moving event streams, detects and analyzes important events and patterns of events, and immediately acts on events of interest according to your specifications. Event-based applications differ from traditional applications instead of continuously executing a sequence of instructions, they listen for and respond to relevant events. Events describe changes to particular real-world or computer-based objects. Technically, they are collections of attribute-value pairs that describe a change in an object. Rather than executing a sequence of activities at some defined point, an event-based system waits and responds appropriately to an asynchronous signal as soon as it happens. In this way, the response is as immediate (or real-time) as possible.
One can use Apama monitors or Apama queries to work with events. Apama queries allow business analysts and developers to create scalable applications to process events originating from very large populations of real-world entities. Scaling, both vertically (same machine) and horizontally (across multiple machines), is inherent in Apama query applications. Scaled-out deployments, involving multiple machines, will use distributed cache technology to maintain and share application state. This makes it easier to deploy across multiple servers, and keep the application active even if some servers are taken down for maintenance or have failed.
Apama can work together with Java and C++ programs. The interaction is controlled by Apama and the external routines are called directly from the Apama program or the control is given to the external program which injects code and events into a running Apama correlator. In this approach, both parts run in separate processes which might be desirable, but I-BiDaaS will consider and test both approaches. The Apama client library, which allows Java or C++ code in an external process to send/receive messages to/from the Apama correlator, should be used to call the Hecuba routines (BSC) and connect to the code running on the hardware-accelerated streaming analytics sub-module by FORTH. Development effort is required to accomplish this.
Find out more: Documentation and a free Community Edition for testing can be downloaded from www.apamacommunity.com.
Universal Messaging module – UM (SAG). UM is a Message Orientated Middleware (MOM) product that guarantees message delivery across public, private and wireless infrastructures. It provides a guaranteed messaging functionality without the use of a web server or modifications to firewall policy. UM uses a standard publish-subscribe methodology. A user can publish data to a channel or subscribe to a channel and receive all the data which is sent to the respective channel. An admin user or the moderator can manage and monitor all the UM channels. UM includes a heavily optimized Java process capable of delivering high throughput of data to large numbers of clients while ensuring that latencies are kept to a minimum.
UM’s messaging clients support synchronous and asynchronous middleware models. Publish Subscribe and Queues functionality are supported and can be used independently or in combination with each other. UM’s messaging clients can be developed in a wide range of languages on a wide range of platforms. Java, C# and C++ running on Windows, Solaris and Linux are all supported. Mobile devices and Web technologies such as Silverlight all exist as native messaging clients. UM supports many open standards at different levels from network protocol support through to data definition standards. At the network level, UM will run on an TCP/IP enabled network supporting normal TCP/IP based sockets, Secure Sockets Layer (SSL) enabled TCP/IP sockets, HTTP and HTTPS, providing multiple communications protocols. UM provides support for the Java Message Standard (JMS) and supports the MQTT and Advanced Message Queuing Protocol (AMQP). Within I-BiDaaS, UM will communicate with DFP via MQTT messaging, while for the Hecuba set of tools an adapter for subscribing to UM channels will be developed.
Find out more: Software AG official site.
The webMethods Hybrid Integration Platform can be deployed on-premises, in public and private cloud environments, or as a managed service. Additionally, you can choose from several cloud-hosted hybrid integration services, including:
webMethods Integration Cloud—an Enterprise Integration Platform-as-a-Service (eiPaaS) that helps you connect cloud-based and on-premises applications and rapidly deploy integrations to the cloud. Integration Cloud provides prebuilt connectors, graphical mapping and orchestration of integration flows, and preconfigured “recipes” that enable faster integrations for popular SaaS apps. You get quicker business implementations, the ability to scale on demand, and secure cloud to on-premises hybrid connectivity.
webMethods API Cloud—an API management Platform-as-a-Service (apiPaaS) that provides the end-to-end API management capabilities, delivered in the cloud as a subscription service. You can safely and securely expose your APIs to your developer communities, as well as manage the entire lifecycle of your APIs. The API developer portal helps you attract and grow your developer ecosystem, and the API gateway protects and secures your APIs from threats and malicious attacks.
Find out more: Software AG official site.
This is a visual analytics software that provides a fast way to explore and analyze streaming data and events. MashZone allows to analyze both real-time and at-rest data. For I-BiDaaS, custom data feeds and dashboards will be mostly used. Dashboards can be customized by adding custom style templates for the dashboard application and the dashboard content. Additionally, custom components can be created via the pluggable widget framework.
In the context of I-BiDaaS, MashZone will facilitate integration by exploiting native existing interfaces with UM and Apama.
Find out more: Software AG official site.
Advanced Visualisation Toolkit sub-module – AVT (AEGIS). This module provides the means to visualise a number of indicators deriving from the analysis of Big Data. It enables the final user to explore data at a high level through a number of interconnected, interactive visualisations that also allow drilling into more detailed information so as to reveal hidden relationships and insights. To achieve this goal, the toolkit follows an internal architecture that comprises of 3 layers, namely: data, business and presentation. This architecture provides the flexibility to adapt to various domains and data sources including data in batches (e.g. direct access to databases, APIs, etc) or in streams (e.g. MQTT).
The Data layer includes the original data and the mechanisms that enable their retrieval by the business layer of the AVT. The data include a number of Indicators that are domain specific and can be calculated from the original data; e.g., for a yearly log of telephone calls an Indicator could be the number of calls per day for the given year. Being domain specific, the Indicators and the means to calculate and retrieve them will be developed within the context of the relevant task of the project (T2.4 – Design and development of interactive visualization tools).
The business layer of AVT handles the processing of the received Indicator data in a format suitable for the visualisations. A cache mechanism is also available to enhance performance. Finally, this layer serves the REST services required by the visualisations to offer their functionalities.
The presentation layer includes a timeline analysis component and multiple visualisations. The timeline analysis component offers the functionalities for temporal analysis of the Indicator values. Through the timeline control, the end user may move back in time, narrow the viewed time window, and compare two different time points, in order to get insights of the data. The timeline control drives the displayed data in all the other visualisations of the currently opened view inside the AVT. These visualisations include a set of interactive graphs and charts that form the heart of the AVT. Standard chart types (e.g. bar, line, etc.) but also advanced visualisations like streaming data monitoring widgets, interactive graphs and spatiotemporal visualisations are employed to offer diverse data representations. Different visualizations are used to display different types of data, in order to make the understanding of data easier. Changes of the time window propagate to the rest of the visualisations, so that they always follow the selected time window. This way, end users can get a situational awareness of the system (where original data comes from) at any given time. They can check to see if any abnormal situation seems to be about to happen (or already took place) and generally search for patterns in events and Indicator values that can reveal correlations and help them decide on appropriate actions.
Find out more: AEGIS official site.
This is a simulated dataset based on both real phone interactions and conversations usually performed by real call centres. It is comprised of several simulated customer interactions with an agent representative, both roles performed by actors. Phone call recordings are performed using different mobile and landline devices. The scripting, both from customer and agent, aims to develop typical scenarios by telco-oriented call centre operations. Both raw waveform recordings and speech transcription are provided. The latter as obtained by an automatic speech recognition (ASR) prototype developed by TID. The word segmentation timestamps are also provided for those recognized. Additionally, a confidence score is also provided per token basis.
This is a synthetic data stream based on real-time, cell network events. These events are picked up by the antennas that are closer to the mobile phone thus providing an approximate location of the device. Every transaction of a mobile phone generates one of those events. A transaction can be, for instance, placing or receiving a call, sending or receiving an SMS, asking for a specific URL in your mobile phone browser, or sending a text message or a data transaction from/to any mobile phone app. There are also some synchronization events like, for instance, turning your mobile phone on or off, or when switching between location area networks (relatively big geographical areas comprising several cell towers).
‘Advanced analysis of bank transfer payment in financial terminal’ use case dataset is a dataset generated collecting most of the relevant and additional contextual information of a bank transfer done by a CAIXA employee in a bank office. It is executed by the employee in the name of a customer that authenticates itself and orders it, by identifying all the relational tables that can contain information of the transfer, the customer, the receiver of the transfer, the office and terminal where it is executed and the employee that proceeded with it. The dataset was used in order to identify anomalies that lead to potentially fraudulent bank transfers or bad practices done in the offices.
The generated dataset provides data on the relationships between customers in order to build part of the social graph of the bank. The data was synthetically generated based on real data coming from a set of restricted tables (relational database), with information related to the customers and their IP address when connecting online.
The generated dataset provides data on the relationships between customers in order to build part of the social graph of the bank. The data was collected from the logs of connections to the online banking of CaixaBank customers and stored in the entity’s Datapool for security and fraud prevention reasons. More concretely the data is coming from a set of restricted tables (relational database), with information related to the customers and their IP address when connecting online, used afterwards to see potential relationships between users’ and skip or enhance the security controls on eventual bank transfers between the connected users.
This dataset contains the information of mobile-to-mobile bank transfers ordered through online banking (web or application) by CaixaBank’s customers. It was generated for the assessing that the controls applied to user authentication are applied adequately (e.g. second factor authentication) in accordance with PSD2 regulation and depending on the context of the bank transfer.
The dataset has been generated after receiving unstructured sets of a large amount of heterogeneous data from several sources and levels from the production line. CRF analysed all information and selected seventeen parameters (e.g. piston speed in the first and second phase, piston stroke, intensification pressures) of the production of the engine block by die-casting. The synthetic dataset was generated in order to analyse the provided parameters and try to cluster them by identifying those that are most representative of each cluster. The clusterization is useful to identify the behaviour of parameters and help to understand those that affect the quality of the process and products.
The MES dataset contains sensors data associated with the type of vehicle being produced. This dataset has to be used with SCADA dataset. Comparing the date and time between the data of the MES and the SCADA it is possible to correlate the information contained in the SCADA dataset with the vehicle processed.
The dataset contains sensors data of welding process in the sub-assembly lines of the vehicle. Different types of sensors are installed on the production line and acquire different data information (e.g. acceleration, velocity, pressure, temperature and so on). All of the sensors record their perception of the surroundings, uploading and transfer this information to a server that manages the data. For example, accelerometers are used for measuring vibration and shock on machines and basically anything that moves. Therefore, the monitoring of vibrations is important to check the status of a machine and the analysis of the trend of vibrations over time allows to predict the onset of deterioration and to intervene in time before the failure for the Predictive maintenance. This dataset can be used to analyse sensor data and obtain thresholds for anomalous measurements. There are 147 sensors.