Airline itinerary choice modeling using machine learning

2018 
This paper deals with the airline itinerary choice problem. Consider, for example, a customer searching for flights from London to New York, travelling next week on Tuesday and coming back on Saturday. The search request is processed by a travel provider (e.g., an online travel agent) that proposes between 50 and 100 different alternatives (itineraries) to the customer. The itineraries differ in their attributes, among others the number of stops, the total trip duration, and the price. The question is: which one is (probably) going to be selected by the customer? There is growing interest within the travel industry in better understanding how customers choose between itineraries when searching for flights. Such an understanding can help travel providers, whether airlines or travel agents, better adapt their offer to market conditions and customer needs, thus increasing their revenue. It can be used for filtering alternatives, sorting them, or even changing some of their attributes in real time (e.g., adjusting the price).

The field of customer choice modelling is dominated by traditional statistical approaches, such as the Multinomial Logit (MNL) model, that are linear with respect to the features and tightly bound to their assumptions about the distribution of the error term (Gaussian, Gumbel, etc.). While these models offer the dual advantages of simplicity and interpretability, they lack the flexibility to handle correlations between alternatives or non-linear effects of alternative attributes. A large part of the existing modelling work focuses on adapting the assumed distributions so that they match observed behaviour. Nested (NL) and Cross-Nested (CNL) Logit models are good examples: they add terms for highly specific feature interactions so that substitution patterns between sub-groups of alternatives can be captured.

In this work, we present an alternative modelling approach based on machine learning techniques. The selected machine learning methods do not require any assumption about the distribution of errors; they can capture non-linear relationships between feature values and the target class, handle collinear features, and offer more modelling flexibility in general. In particular, we have chosen to work with Random Forests. A Random Forest is an ensemble of decision trees whose predictions are aggregated; a decision tree is a tree of nodes, each of which applies a threshold test to a single variable. Each tree in a Random Forest receives a random subset of the input features and is then grown deterministically from those features. Random Forests are well suited to our problem: the model's bifurcations (branches) automatically partition customers into segments and, at the same time, capture non-linear relationships between the attributes of the alternatives and the characteristics of the decision maker.
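To make this concrete, the sketch below fits a Random Forest on per-alternative records, where each row describes one itinerary returned by a search (price, trip duration, number of stops, days booked in advance, Saturday-night stay) and the label indicates whether it was booked. This is only a minimal sketch under our own assumptions: the framing as per-alternative binary classification, the feature names, and the file itineraries.csv are illustrative, not the paper's exact formulation.

```python
# Minimal sketch, not the paper's exact setup: per-alternative binary classification
# with hypothetical feature names and a hypothetical input file "itineraries.csv".
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# One row per itinerary shown in a search session; "booked" is 1 for the chosen itinerary.
data = pd.read_csv("itineraries.csv")
features = ["price", "trip_duration", "n_stops", "days_to_departure", "saturday_night_stay"]

X_train, X_test, y_train, y_test = train_test_split(
    data[features], data["booked"], test_size=0.2, random_state=0
)

# Splits on e.g. days_to_departure or saturday_night_stay let the trees implicitly
# separate business-like from leisure-like searches, without an explicit segment label.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Per-itinerary booking probabilities, usable for ranking the alternatives of a session.
booking_prob = model.predict_proba(X_test)[:, 1]
```

Within one shopping session, the predicted per-itinerary probabilities can then be renormalised so that they sum to one, which yields choice probabilities comparable to those produced by an MNL model.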
Indeed, there are two main customer segments to take into account for our particular problem: business and leisure air passengers behave very differently when booking flights. Business passengers tend to favour alternatives with convenient schedules, such as shorter connection times and departure times matching their preferences, while leisure passengers are very price sensitive; in other words, they will accept a longer connection time if it is reflected in a lower ticket price. The difficulty is that the segment is not explicitly known when the customer is searching; however, it can be derived by combining different factors. For example, industry experts know that business passengers tend to book with less anticipation and are not inclined to stay over Saturday nights. These are not black-or-white rules, however, which reinforces the need for a model able to learn them from the data and from actual customer behaviour. Another observed advantage is that Random Forests are fairly quick to train and very quick to predict, which enables fast iteration; indeed, our numerical experiments show that the machine learning methods were faster to train than MNL, both using public libraries with default parameters.

We trained and tested our models on a dataset of flight searches and bookings in six European markets, extracted from GDS (global distribution system) logs. The choice set consists of the results of a flight search request: each search request returns between 10 and 250 different itineraries, one of which has been booked by the customer. Choice sets are grouped by origin-and-destination market.

Our main experiments compare MNL against the machine learning models. We evaluate the performance of the different models by comparing their predicted fractional shares of choices against the actual distribution of choices. Since in our dataset the alternatives are not fixed from one shopping session (choice situation) to the next, we compare the predicted and actual market shares of groups of alternatives, such as flight numbers, flights operated by a given airline, or flights departing during a specific time window (a minimal sketch of this computation is given below). These KPIs are particularly useful, as they are often the final output expected from the model.

Our general finding is that, on most origin-and-destination markets, machine learning models outperform MNL on all relevant metrics. In particular, we found a reduction of more than 70% in the airline-share prediction error on four of the six markets we considered. We also find that training a single model on several markets at once yields similar performance, something which could greatly help in scaling the approach to a large number of markets. The main conclusion of our work is thus that non-linear machine learning methods such as the ones presented here can provide clear benefits to choice modelling applications such as air travel itinerary choice, notably thanks to their better handling of non-linearity, their overall greater flexibility, and their fast training.
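As referenced above, the following sketch shows one way to compute a share-based KPI: per-itinerary choice probabilities are aggregated into predicted airline market shares and compared with the observed shares. The use of mean absolute share error, the column names, and the helper airline_share_error are assumptions made for illustration; the paper's exact KPI definition may differ.

```python
# Minimal sketch of a share-based KPI; the metric (mean absolute share error),
# the column names and the function name are illustrative assumptions.
import numpy as np
import pandas as pd

def airline_share_error(df: pd.DataFrame) -> float:
    """df has one row per itinerary, with columns 'airline',
    'booked' (1 for the chosen itinerary, else 0) and 'predicted_prob'."""
    actual = df.groupby("airline")["booked"].sum()
    predicted = df.groupby("airline")["predicted_prob"].sum()

    # Normalise both into fractional market shares over the whole market.
    actual_share = actual / actual.sum()
    predicted_share = predicted / predicted.sum()

    # Align the airline indices and compare the two share distributions.
    actual_share, predicted_share = actual_share.align(predicted_share, fill_value=0.0)
    return float(np.abs(actual_share - predicted_share).mean())

# The same grouping works for flight numbers or departure-time windows.
```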