Big Data Journey: From Collection to Analysis to Predictive Use

Written by Dr. Allan M. Zarembski

RAILWAY AGE, MARCH 2020 ISSUE: The December 2019 Big Data in Railroad Maintenance Planning Conference, held at the University of Delaware, continued to highlight the advances the railroad industry is making in addressing the growing volume of data from inspection and other systems, and in learning how to apply it. This was clearly shown in the keynote address, in which Jeffrey D. Knueppel, P.E., now retired as General Manager of the Southeastern Pennsylvania Transportation Authority (SEPTA), presented how his agency is already using a range of data collection, data analysis and data management tools in the everyday running of SEPTA. In fact, he titled his presentation “SEPTA’s Journey Into Big Data” and described a range of Big Data applications that are implemented or being implemented at SEPTA.

Knueppel noted that SEPTA has started its move away from manual data management and into the realm of automated data collection and analysis, and is moving toward the next step of using data in a predictive manner, in five major focus areas:

  • Customer experience.
  • Workforce development and support.
  • Rebuilding the system.
  • SEPTA as a business.
  • Safety and security.

As can be seen in Figure 1, Data Analytics is viewed by SEPTA as being fundamental to making improvements in all five of these focus areas. Knueppel went on to show how Big Data and corresponding Big Data tools and techniques are being applied within SEPTA. These data analytic tools have been developed internally and by several conference sponsors and presenters. Thus, for example, in the key conference areas of maintenance management for rolling stock and track, Knueppel gave specific application examples:

  • Real-time remote condition monitoring, with automatic notices of unusual conditions, such as the “TekTracking” System Proof of Concept Application for Remote Monitoring at 40th Street Trolley Portal.
  • Use of automated track inspection systems and associated analysis algorithms for Just-in-Time Replacement, such as the use of GREX’s “Aurora” system for wood tie replacement.
  • Use of centralized vehicle onboard diagnostic systems with remote access, such as the Wi-Tronix Violet System purchased for SEPTA’s new locomotives.
  • Use of new-generation automated inspection systems, including Remote Bridge Monitoring; GPS-Enabled Drones (Above-Ground); GPS-Denied Drones (Under Ground); Head-End Video; Geometry Car; UT Testing; and Ground-Penetrating Radar.

In all cases, the goal is to convert Data into Action, as shown in Figure 2.

The more than 25 technical presentations addressing Big Data issues in track, equipment and operations followed the keynote speaker’s lead in addressing what is being done right now with the increasing amount of data being collected in all aspects of the railway industry. There is an increasing use of Data Science, the interdisciplinary field that uses evolving analysis tools and techniques to extract knowledge or insights from data in various forms, either structured or unstructured. Associated Data Analytic tools are being integrated into a new structure, Rail Data Science, as illustrated in Figure 3.

Rail Data Science, sometimes referred to as Railroad Big Data Analytics, has been divided into nine basic steps, as follows:

  • Understand the problem(s).
  • Data collection and investigation.
  • Data cleaning and integration.
  • Feature engineering.
  • Model development and feature selection.
  • Model training and parameter tuning.
  • Model evaluation and comparison.
  • Model prediction and decision-making.
  • User feedback and possible further refinement (possibly returning to step 1, so that the whole process iterates).
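The nine steps above can be read as a loop over a few small functions. The Python sketch below is only an illustration under stated assumptions, not a workflow presented at the conference: the records, the function names and the single-threshold "model" are all hypothetical, and the model is scored on its own training data purely to keep the example short.

```python
# Hypothetical raw inspection records (step 2):
# (segment id, gauge deviation in mm, defect found at a later inspection?)
RAW = [
    ("A1", 2.0, False), ("A2", 9.5, True), ("A3", None, False),  # None = missing reading
    ("A4", 3.1, False), ("A5", 8.2, True), ("A6", 7.9, True),
    ("A7", 1.4, False), ("A8", 6.0, False),
]

def clean(records):
    """Step 3: drop records with a missing measurement."""
    return [r for r in records if r[1] is not None]

def features(records):
    """Step 4: the only feature here is the absolute deviation."""
    return [(abs(x), y) for _, x, y in records]

def accuracy(samples, threshold):
    """Step 7: fraction of samples the threshold rule classifies correctly."""
    return sum((x >= threshold) == y for x, y in samples) / len(samples)

def train(samples, candidates):
    """Steps 5-6: keep the candidate threshold with the best accuracy."""
    return max(candidates, key=lambda t: accuracy(samples, t))

def predict(x, threshold):
    """Step 8: flag a segment for attention."""
    return x >= threshold

samples = features(clean(RAW))
threshold = train(samples, candidates=[2.0, 4.0, 6.0, 7.0, 9.0])
score = accuracy(samples, threshold)
print(f"threshold={threshold} mm, training accuracy={score:.2f}")
```

Step 9 would correspond to re-running the loop when field feedback shows flagged segments that did not need work, or missed ones that did. A real study would also hold out separate data for the evaluation step.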

While most railways have developed the tools for extensive data collection, the cleaning, storing, collating and integration of those data remain major challenges, as illustrated in Figures 4 and 5. Likewise, structuring the data in preparation for and in conjunction with the model development steps (Figure 5) represents a critical part of any effective data modeling activity.
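To make the integration challenge concrete, the following Python sketch aligns two inspection sources by location before any modeling is done. It is a minimal illustration with assumed details: the 0.1-mile segment length, the two data sources, the field layout and the aggregation rules (worst geometry reading, mean fouling index) are hypothetical choices, not a description of any railway's system.

```python
from collections import defaultdict

SEGMENT_THOUSANDTHS = 100  # hypothetical segment length: 0.1 mile, in thousandths of a mile

def segment_of(milepost):
    """Index of the fixed-length segment that contains a milepost."""
    # Work in integer thousandths of a mile to avoid floating-point edge cases.
    return int(round(milepost * 1000)) // SEGMENT_THOUSANDTHS

def by_segment(readings):
    """Cleaning: drop missing values, then group the rest by segment."""
    groups = defaultdict(list)
    for milepost, value in readings:
        if value is not None:
            groups[segment_of(milepost)].append(value)
    return groups

def integrate(geometry, gpr):
    """Integration: keep segments covered by both sources (an inner join)."""
    geo, fouling = by_segment(geometry), by_segment(gpr)
    return {
        seg: (max(geo[seg]), sum(fouling[seg]) / len(fouling[seg]))
        for seg in sorted(geo.keys() & fouling.keys())
    }

# Hypothetical readings: (milepost, value)
geometry = [(12.31, 4.0), (12.37, 6.0), (12.52, 3.0), (12.58, None)]  # deviation, mm
gpr = [(12.35, 0.30), (12.55, 0.10), (12.71, 0.20)]                   # fouling index

joined = integrate(geometry, gpr)
for seg, (worst_dev, mean_fouling) in joined.items():
    print(seg, worst_dev, mean_fouling)
```

Segment 127 is dropped because it has no geometry reading, which is the kind of coverage gap that makes collating multi-source inspection data difficult in practice.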

Likewise, the analysis and modeling tools represent critical steps in the Data Analytics process. The tools currently being used by railways, suppliers and researchers range from predictive analytic tools such as Logistic Regression and Bayesian Inference to Machine Learning and Deep Learning techniques, Image Recognition, Blockchain Technology, Language Recognition and Text Analytics. One non-traditional approach uses Text Analysis and Latent Semantic Analysis (LSA), a Language Recognition technique used to analyze relationships between sets of documents and the terms they contain, to look at rail safety data in a new way.
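As a rough illustration of how LSA relates documents through the terms they contain, the Python sketch below builds a term-document count matrix and keeps its two strongest singular directions. Everything in it is an assumption made for illustration: the five short "reports" are invented, NumPy is assumed to be available, raw word counts stand in for the weighted counts (such as TF-IDF) that are more common in practice, and it is not the analysis presented at the conference.

```python
import numpy as np

# Invented incident descriptions, for illustration only.
docs = [
    "broken rail",      # 0
    "rail fracture",    # 1
    "fracture weld",    # 2: shares no word with document 0
    "signal failure",   # 3
    "signal delay",     # 4
]

# Term-document matrix: one row per term, one column per document.
vocab = sorted({w for d in docs for w in d.split()})
A = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

def cosine_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return unit @ unit.T

# Similarity on raw word counts: documents 0 and 2 look unrelated.
raw_sim = cosine_matrix(A.T)

# LSA: truncated singular value decomposition, keeping k latent dimensions.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vectors = (s[:k, None] * Vt[:k]).T   # one k-dimensional vector per document
lsa_sim = cosine_matrix(doc_vectors)

print("raw 0 vs 2:", round(float(raw_sim[0, 2]), 3))
print("LSA 0 vs 2:", round(float(lsa_sim[0, 2]), 3))
print("LSA 0 vs 3:", round(float(lsa_sim[0, 3]), 3))
```

In this toy case, documents 0 and 2 have no word in common, yet they end up close in the latent space because both co-occur with the vocabulary of document 1, while the signal-related documents stay apart.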

The use of data for improved operations, maintenance and safety was an ongoing theme for the conference. This included applications in all aspects of railroad operations, including track, rolling stock and transportation.

On the track side, data analytics addressed many of the key aspects of track maintenance and safety, ranging from rail wear prediction, broken rail safety and tie inspection to the prediction of track geometry degradation and the associated risk of derailment. Several presentations used tools such as Logistic Regression analysis to forecast the probability of track geometry degradation as a function of supplemental measurement data, which provide increased prediction accuracy over the traditional traffic and MGT (million gross tons) inputs. This is illustrated in Figure 6, which shows the range of additional input variables that can be introduced using such Data Analytic techniques, and Figure 7, which presents an example output showing the probability of developing a geometry defect as a function of several of these key input variables.
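A logistic regression of this kind can be sketched in a few lines of Python. The example below is a minimal illustration under stated assumptions, not the model from any of the presentations: the two inputs (annual tonnage scaled as MGT/100 and a ballast fouling index between 0 and 1), the eight training records, and the learning rate and iteration count are all hypothetical, and the fit uses plain gradient descent rather than a statistics package.

```python
import math

# Hypothetical training records: (MGT / 100, ballast fouling index, defect developed?)
DATA = [
    (0.2, 0.1, 0), (0.3, 0.2, 0), (0.4, 0.1, 0), (0.5, 0.3, 0),
    (0.3, 0.7, 1), (0.6, 0.6, 1), (0.7, 0.8, 1), (0.8, 0.5, 1),
]

def sigmoid(z):
    """Logistic function: maps a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def fit(data, lr=1.0, epochs=20000):
    """Fit bias and two weights by batch gradient descent on the log loss."""
    b = w1 = w2 = 0.0
    n = len(data)
    for _ in range(epochs):
        gb = g1 = g2 = 0.0
        for x1, x2, y in data:
            err = sigmoid(b + w1 * x1 + w2 * x2) - y  # prediction error
            gb += err
            g1 += err * x1
            g2 += err * x2
        b -= lr * gb / n
        w1 -= lr * g1 / n
        w2 -= lr * g2 / n
    return b, w1, w2

def defect_probability(params, mgt_scaled, fouling):
    """Estimated probability that a geometry defect develops."""
    b, w1, w2 = params
    return sigmoid(b + w1 * mgt_scaled + w2 * fouling)

params = fit(DATA)
p_high = defect_probability(params, 0.7, 0.8)  # heavy traffic, fouled ballast
p_low = defect_probability(params, 0.2, 0.1)   # light traffic, clean ballast
print(f"high-risk segment: {p_high:.2f}, low-risk segment: {p_low:.2f}")
```

The supplemental variable enters the model as just another input alongside tonnage, which is how additional measurement data can sharpen a forecast that would otherwise rest on traffic alone.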

Likewise, the use of data analytics in both transportation and rolling stock (equipment) was discussed for a range of issues:

  • On-Time Performance.
  • Conflicts (Merging/Diverging Routes; Meets/Passes).
  • Rolling Stock (Equipment) Maintenance and Failures.
  • Locomotive Maintenance and Failures.
  • Train Handling.
  • Positive Train Control.
  • Safety.

This included the use of Data Analytics to predict anomalous events, as shown in Figure 8, and to address the effect of unplanned (anomalous) events on on-time performance, as shown in Figure 9.

Data Analytics is leading the way in the move from Corrective (Reactive) maintenance to Predictive (Preventive) maintenance. This is, in fact, the path that SEPTA’s Knueppel discussed and showed in Figure 1. Helping in this move is the development of the concept of a “Digital Twin,” a digital replica of a living or non-living physical entity (Figure 10). By bridging the physical and the virtual world, data is transmitted seamlessly, allowing the virtual entity to exist simultaneously with the physical entity. The evolution of Data Analytics is moving toward the concept of a Digital Twin.

For example, in the case of track data, the static and dynamic data collected by the broad array of inspection tools currently available (and those being implemented in the near future) are locational and as such can be referenced to a digital Asset Register. This holds regardless of whether the data come from measuring, monitoring, IoT devices, or inspections and reports. These data in turn can be brought together in a Digital Twin, allowing Maintenance Engineers to:

  • Evaluate all relevant data and decide on what to do, when, by whom, with what information.
  • Generate work orders for fault correction and (condition-based) maintenance, plus reports.
  • Work from one source of truth at strategic, tactical and operational levels, allowing a clear line of sight for forecasting, planning and predictive maintenance.
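The idea of referencing location-based data to an asset register, and generating work orders from it, can be sketched as follows. This Python example is a deliberately small illustration and not a description of any vendor's Digital Twin product: the class names, asset identifiers, measurement names and limits are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Reading:
    milepost: float
    source: str    # e.g. geometry car, IoT sensor, inspection report
    name: str      # hypothetical measurement name
    value: float

@dataclass
class Asset:
    asset_id: str
    start_mp: float
    end_mp: float
    readings: list = field(default_factory=list)

class TrackTwin:
    """Toy asset register that collects readings by location."""

    def __init__(self, assets):
        self.assets = assets

    def locate(self, milepost):
        """Find the asset whose milepost range contains the location."""
        for asset in self.assets:
            if asset.start_mp <= milepost < asset.end_mp:
                return asset
        return None

    def ingest(self, reading):
        """Attach a reading to its asset, whatever system produced it."""
        asset = self.locate(reading.milepost)
        if asset is not None:
            asset.readings.append(reading)
        return asset

    def work_orders(self, limits):
        """List a work order for every reading above its limit."""
        orders = []
        for asset in self.assets:
            for r in asset.readings:
                limit = limits.get(r.name)
                if limit is not None and r.value > limit:
                    orders.append(
                        f"{asset.asset_id}: inspect {r.name} "
                        f"({r.value} > {limit}, from {r.source})"
                    )
        return orders

twin = TrackTwin([Asset("TRK-001", 10.0, 10.5), Asset("TRK-002", 10.5, 11.0)])
twin.ingest(Reading(10.2, "geometry car", "gauge_dev_mm", 4.0))
twin.ingest(Reading(10.7, "geometry car", "gauge_dev_mm", 9.0))
twin.ingest(Reading(10.8, "IoT sensor", "rail_temp_c", 41.0))

orders = twin.work_orders({"gauge_dev_mm": 8.0, "rail_temp_c": 50.0})
print(orders)
```

Because every reading is keyed to the same register entry, the planner sees geometry car and sensor data for a segment in one place, which is the "one source of truth" point made in the list above.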

Thus, we continue to witness an evolution in data analysis (Data Analytics or Data Science), moving from Deep Learning, to the Internet of Things (IoT), to Cognitive Computing, to the Digital Twin (Figure 11). This evolution is clearly evident in the presentations at each year’s Big Data in Railroad Maintenance conference, with each new conference providing a new focus on what has been accomplished in the “mining” of the railroads’ Big Data and the implementation of data analytics to develop predictive models and tools for both maintenance and safety. The University of Delaware expects even more insightful information to be available at its Big Data 2020 conference, which will be held Dec. 16-17, 2020, at the University of Delaware Newark campus. For more information, contact Professor Allan M. Zarembski at [email protected].


  • Knueppel, J., “SEPTA’s Journey Into Big Data,” SEPTA, 2019 Big Data in Railroad Maintenance Planning Conference.
  • Zarembski, A. M., “The Emerging Role of Data Science in Railroad Maintenance Management,” Railway Age, May 2018.
  • Attoh-Okine, N., “Big Data and Differential Privacy: Analysis Strategies for Railway Track,” Wiley, May 2017.
  • Wilczek, K., “Use Cases of Big Data Technology in Track Maintenance,” Plasser & Theurer, 2019 Big Data in Railroad Maintenance Planning Conference.
  • Liu, Xiang, “Artificial Intelligence-Aided Broken Rail Derailment Risk Analysis,” Rutgers University, 2019 Big Data in Railroad Maintenance Planning Conference.
  • Henderson, R., “Big Data Driven Decisions to Transform Track Maintenance,” Bentley, 2019 Big Data in Railroad Maintenance Planning Conference.
  • Attoh-Okine, N., “Vanilla Lite Data Analysis Techniques in Railway Track Engineering – Time to Let Go,” 2019 Big Data in Railroad Maintenance Planning Conference.
  • Williams, T., and Betak, J., “Using LDA Topic Modelling to Identify Themes in British and American Railroad Accidents,” 2019 Big Data in Railroad Maintenance Planning Conference.
  • Stark, T., and Thompson, H., “Track Geometry Defects Using Site-Specific Fouled Ballast Monitoring,” 2019 Big Data in Railroad Maintenance Planning Conference.
  • Zarembski, A. M., “Probabilistic Relationship for Development of a Severe Track Geometry Defect Based on Ballast Condition as Measured by GPR,” 2019 Big Data in Railroad Maintenance Planning Conference.
  • Pelli, Eric, “Driving Railroad Optimization with Improved Data Management and Analytics,” Collins Aerospace, 2019 Big Data in Railroad Maintenance Planning Conference.
  • Fusting, C., and Wall, N., “Applying MLOps to Maximize Customer Value: A Case Study on Improving Industry Reference Systems,” RailInc, 2019 Big Data in Railroad Maintenance Planning Conference.
  • Tegelberg, Erland, “Asset Management: Limits and Opportunities for Big Data,” Strukton Rail, 2019 Big Data in Railroad Maintenance Planning Conference.
  • Wikipedia,
  • Henderson, Robert, “Big Data Driven Decisions to Transform Track Maintenance,” Bentley, 2019 Big Data in Railroad Maintenance Planning Conference.

Allan M. Zarembski, Ph.D., P.E., FASME, Hon. Mbr. AREMA, is Professor of Practice and Director of the Railroad Engineering and Safety Program, Department of Civil and Environmental Engineering, University of Delaware-Newark.
