Automotive big data: Applications, workloads and infrastructures

2015 
Data is increasingly affecting the automotive industry, from vehicle development, to manufacturing and service processes, to online services centered around the connected vehicle. Connected, mobile and Internet of Things devices and machines generate immense amounts of sensor data. The ability to process and analyze this data to extract insights and knowledge that enable intelligent services, new ways to understand business problems, improvements of processes and decisions, is a critical capability. Hadoop is a scalable platform for compute and storage and emerged as de-facto standard for Big Data processing at Internet companies and in the scientific community. However, there is a lack of understanding of how and for what use cases these new Hadoop capabilities can be efficiently used to augment automotive applications and systems. This paper surveys use cases and applications for deploying Hadoop in the automotive industry. Over the years a rich ecosystem emerged around Hadoop comprising tools for parallel, in-memory and stream processing (most notable MapReduce and Spark), SQL and NOSQL engines (Hive, HBase), and machine learning (Mahout, MLlib). It is critical to develop an understanding of automotive applications and their characteristics and requirements for data discovery, integration, exploration and analytics. We then map these requirements to a confined technical architecture consisting of core Hadoop services and libraries for data ingest, processing and analytics. The objective of this paper is to address questions, such as: What applications and datasets are suitable for Hadoop? How can a diverse set of frameworks and tools be managed on multi-tenant Hadoop cluster? How do these tools integrate with existing relational data management systems? How can enterprise security requirements be addressed? What are the performance characteristics of these tools for real-world automotive applications? To address the last question, we utilize a standard benchmark (TPCx-HS), and two application benchmarks (SQL and machine learning) that operate on a dataset of multiple Terabytes and billions of rows.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    27
    References
    50
    Citations
    NaN
    KQI
    []