Transaction aware tape-infrastructure monitoring

2014 
Administrating a large scale, multi protocol, hierarchical tape infrastructure like the CERN Advanced STORage manager (CASTOR)[2], which stores now 100 PB (with an increasing step of 25 PB per year), requires an adequate monitoring system for quick spotting of malfunctions, easier debugging and on demand report generation. The main challenges for such system are: to cope with CASTOR's log format diversity and its information scattered among several log files, the need for long term information archival, the strict reliability requirements and the group based GUI visualization. For this purpose, we have designed, developed and deployed a centralized system consisting of four independent layers: the Log Transfer layer for collecting log lines from all tape servers to a single aggregation server, the Data Mining layer for combining log data into transaction context, the Storage layer for archiving the resulting transactions and finally the Web UI layer for accessing the information. Having flexibility, extensibility and maintainability in mind, each layer is designed to work as a message broker for the next layer, providing a clean and generic interface while ensuring consistency, redundancy and ultimately fault tolerance. This system unifies information previously dispersed over several monitoring tools into a single user interface, using Splunk, which also allows us to provide information visualization based on access control lists (ACL). Since its deployment, it has been successfully used by CASTOR tape operators for quick overview of transactions, performance evaluation, malfunction detection and from managers for report generation.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    2
    References
    1
    Citations
    NaN
    KQI
    []