Spark Scalability Analysis in a Scientific Workflow

Renan Souza, Vitor Silva, Pedro Miranda, Alexandre A. B. Lima, Patrick Valduriez, Marta Mattoso

2017

Spark is being successfully used for big data parallel processing in many business domains (social media, finance, retail). Spark's scalability, usability, and large user community have motivated developers from scientific domains (bioinformatics, oil and gas, astronomy) to try it. However, the profile of scientific applications, e.g., black-box programs and intense file writes, differs from that of traditional business workflows, which may affect Spark's scalability. We present a scalability analysis of Spark in a real case study in the oil and gas domain. We explore workloads on a 936-core HPC cluster processing 330 GB of scientific data. We show that Spark scales very well when running long-lasting scientific tasks, but its performance is lower for short-duration tasks.

Keywords:

Data science
Social media
Big data
Usability
Workflow
Parallel processing
Scalability
Spark (mathematics)
Computer science
