Disclaimer: This dissertation has been written by a student and is not an example of our professional work, which you can see examples of here.

Any opinions, findings, conclusions, or recommendations expressed in this dissertation are those of the authors and do not necessarily reflect the views of UKDiss.com.

Pre-game Prediction of Outcomes in NBA Matches

Info: 7080 words (28 pages) Dissertation
Published: 9th Dec 2019

Tagged: Information Technology

Abstract: We treat the problem of predicting the outcome of basketball games in the National Basketball Association professional league. Our models only incorporate information available ahead of the beginning of each match to produce highly accurate pre-game predictions. In particular, we develop a C5.0 decision tree model which can predict the outcome of NBA games with 74.08% accuracy. This figure compares very favourably with those achieved by existing models in the associated scientific literature, and is achieved under robust cross-season testing conditions. Our findings demonstrate that it is possible to successfully leverage past-season data to achieve accurate game-outcome prediction. We achieve this feat by developing models that retain sensitivity to the precise composition of each team’s line-up for a given game.


The applications of data mining to sports are multifold, allowing for the identification of factors that contribute to successful performance, the ranking of same, and the development of predictive models to forecast the outcomes of particular games or tournaments. The extraction of useful knowledge regarding “on-field” performance from in-game data, and the leveraging of the associated competitive advantages, has a proven ability to transform the careers of players, coaches, talent scouts and team owners (as famously portrayed in [1]). Such impacts are magnified by the scale of investment in professional sports, with a recent report estimating that the global sports industry accounts for one percent of global GDP [2]. Moreover, the potential to develop accurate predictive capabilities clearly has profound impacts on the multi-billion-euro gambling industry. Given the above, it is unsurprising that there has been a dramatic increase in both the volume and variety of in-game statistics being captured, as facilitated by the technological innovations underpinning the era of Big Data (see [3]), and in the sophistication of the statistical methods and machine-learning algorithms being applied to their analysis (see [4]).

The sporting context of this study is the game of basketball and its foremost professional league, the National Basketball Association (NBA), where detailed in-game statistics (such as the number and type of shots taken by each player) have been available since the roll out of the Advanced Scout system in the 1995/96 season (per [3]). The research objective of our study is to incorporate similar statistics into the development of an accurate system for predicting the outcome of NBA games, a particularly hard task given that competition within the NBA has been demonstrated to be consistently well-balanced (per [5]). In particular, our predictive models will be constructed, refined, trained, optimised and tested on the basis of the data contained in [6], a recently-released, publicly-accessible resource containing player- and team-based statistics for every game played in the past 15 complete seasons (from 2002/03 to 2016/17).

The temporal granularity of the data set [6] is particularly fine, with statistics being recorded at a “play-by-play” event level, giving rise to a total of 8.7 million records split across 15 separate season files. In [7], this level of granularity is invoked to achieve “in-game” outcome prediction, whereby pre-game statistics are complemented by in-game performance measures that aggregate as a function of game time, allowing for the development of a series of outcome predictions which evolve in accordance with in-game events. It is of course more difficult to predict the outcome of a game using only pre-game data, but these predictions are more broadly applicable, both in coaching and betting terms. We propose to adapt and re-engineer the data in [6] to accommodate pre-game prediction, whereby the most current season-to-date information available before the game begins may be leveraged for prediction. The situations where the training and unseen testing data are drawn from the same season (“in-season” prediction) and from distinct seasons (“cross-season” prediction) are both considered in [7], with substantially better results being achieved in the former scenario, supporting the idea that cross-season prediction is more difficult than in-season prediction. Although there is naturally additional training data available to effect a cross-season approach, the inherent difficulties in such a task are readily apparent when one considers the impact of inter-season variations in squad composition, game rules, prevailing tactical approaches, etc. Despite these difficulties, our treatment of [6] will consider the ambitious undertaking of using the first 12 seasons of game data to train models that can accurately predict the outcome of games in the most recent 3 seasons.

Although there is a substantial degree of commonality in the way in which data mining techniques are applied to predict the outcome of sports matches (per [8]), the particulars of the game of basketball and the format of the NBA merit consideration in the context of our stated goals. A key distinction in this regard is the binary win/lose nature of basketball outcomes, with a tie after 4 quarters (each of 12 minutes’ duration) triggering one or more 5-minute overtime periods until a winning team has been determined. Given the intense and unrelenting nature of the sport, basketball teams can avail of unlimited substitutions, whereby the five players representing a team on court can be repeatedly permuted from a roster of active players, varying between 13 and 15 players in size. The identities of the 30 teams participating in the NBA have remained unchanged over our reference period, aside from some minor instances of re-branding. These 30 teams are split into two 15-team conferences, and further divided into three 5-team within-conference divisions, with the frequency with which two teams meet being affected by their conference and division membership. There are a total of 1230 games in the regular season, with each team playing 41 games at home and 41 games away, and a total of between 60 and 105 further games being played in the “post-season” best-of-7-games playoff series.
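The game counts quoted above can be checked with a few lines of arithmetic. The sketch below is in Python and assumes the standard playoff bracket of 16 qualifying teams (giving 8 + 4 + 2 + 1 series), which is background knowledge rather than something stated in the text.

```python
# Regular season: 30 teams, each playing 41 home + 41 away games.
TEAMS = 30
GAMES_PER_TEAM = 41 + 41
# Each game involves two teams, so divide the team-game count by two.
regular_season_games = TEAMS * GAMES_PER_TEAM // 2

# Playoffs (assumed bracket): 16 teams -> 8 + 4 + 2 + 1 best-of-7 series.
playoff_series = 8 + 4 + 2 + 1
min_playoff_games = playoff_series * 4  # every series ends 4-0
max_playoff_games = playoff_series * 7  # every series goes to a seventh game

print(regular_season_games, min_playoff_games, max_playoff_games)
# -> 1230 60 105
```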

Whereas the phenomenon of home advantage is widely recognised across a variety of sports, its effects are particularly pronounced in the case of basketball, with a relatively stable annual home-win ratio of approximately 60% being recorded since the beginning of the nineties [9] (with associated differences in levels of rest between games being identified as a significant contributory factor [10]). Indeed, we observe a home-win ratio of 60.15% across the totality of the data in [6], and a home-win ratio of 59.27% across our testing period of the last three seasons. Clearly, for our models to demonstrate predictive value, our accuracy should exceed this “No-Information Rate” threshold of 59.27%. Another aspect of basketball that should be reflected in our model development is its high-scoring nature, suggesting that outcomes are less susceptible to chance, whereby skill-based features should contain significant predictive power (as supported by [11]).
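To make the No-Information Rate concrete, the following Python sketch computes the accuracy of a "model" that always predicts the most frequent outcome. The outcome list is toy data invented for illustration, not drawn from [6], and the function name is hypothetical.

```python
from collections import Counter

def no_information_rate(outcomes):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(outcomes)
    return max(counts.values()) / len(outcomes)

# Toy outcomes (illustrative only): "H" = home win, "A" = away win.
toy_outcomes = ["H", "H", "A", "H", "A", "H", "A", "H", "H", "A"]
baseline = no_information_rate(toy_outcomes)
print(f"No-information rate: {baseline:.2%}")  # 6 of 10 are home wins
```

Any model worth deploying should beat this figure on the same test games; on the paper's test seasons the corresponding threshold is the 59.27% home-win ratio.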

As discussed in the “Data Mining Methodology” section, our models address the above considerations by separately incorporating season-to-date, per-game cumulative averages for both the home and away teams across a range of performance measures (number of wins, score, fouls, assists, steals, blocks, turnovers and offensive, defensive and other rebounds). This skill rating of the two teams is enhanced by the introduction of weighted plus-minus figures, separately averaged across each team’s five starting players and the remaining group of players on the roster, and form variables capturing the most recent outcomes. Per the “Data Mining Methodology” and “Results/Evaluation” sections, appropriate combinations and subsets of these features are selected (on the basis of techniques such as k-Means clustering, principal component analysis and random-forest variable-importance indicators) to serve as the predictive input for a suite of machine-learning algorithms, including C5.0 Decision Trees, logistic regression, k-Nearest Neighbours, Random Forests, Rotation Forests, Support Vector Machines (SVMs), Artificial Neural Networks (ANNs) and various ensembles. Our most accurate model is realised by a winnowed C5.0 decision tree, achieving a stable prediction accuracy rate of 74.08% in the difficult context of cross-season testing. Per the discussion that follows, this accuracy result compares very favourably with the related work in this field. We close this article with a presentation of the primary conclusions of our study, and a discussion of potential methods for further enhancing the predictive capabilities of our models.


As discussed above, the analysis of [6] presented in [7] substantially differs from our work, not least in terms of granularity (player-level) and focus (in-game prediction). However, we do note that substantial in-game prediction accuracy is achieved in [7] through the usage of advanced techniques (involving Mixture Density Networks, combined applications of ANNs and mixture density models). The pre-game prediction accuracy of the suite of random-forest classifier models developed in [7] trails that of our models, with the apparent exception of their “adv+enc” model, which appears to possess pre-game accuracy of approximately 73%. However, we note that these findings are with respect to in-season performance, with the pre-game accuracy of these random-forest classifiers collapsing below the No-Information Rate when faced with the task of cross-season prediction. Intriguingly, it remains unclear whether the Mixture Density Network approach can yield accurate pre-game cross-season predictions. With respect to the task of in-season prediction, the authors of [7] fail to address the temporal ordering of the data, thereby allowing potential sources of bias to detract from their work, as training data drawn from later in the season may encode the outcomes of testing data, and performance trends will likely stabilise over time. As above, we guard against this source of bias by ensuring that our training data predates our testing data.

In-game NBA prediction is also considered in [12], but the models developed therein crucially rely on the evolving points differential between the teams, and thus do not translate to the context of pre-game prediction. Of more relevance to our work are the in-game prediction studies [13] and [14], as their models account for in-game substitutions and confirm that line-up changes have a significant impact on game outcome. These studies convinced us of the importance of retaining “player visibility” in our models, whereby the pre-game prediction is sensitive to the identities and features of the players in the match-day squad.

Returning our attention to pre-game prediction, we should expect a baseline accuracy of at least 60%, accounting for the effect of “home advantage”. In [15], a naïve baseline model predicting winners solely on the basis of who won the previous game between the teams was shown to achieve 64% accuracy. The on-the-day pre-game prediction accuracy of basketball experts has been estimated to be just below 70%, per [16], [17] and [18], although these experts had the possibility of withholding their predictions in the face of uncertainty. Per [11], the bettors’ favourite wins 70% of the time.

Per [16] and [19], the state-of-the-art performance with respect to pre-game NBA prediction is widely reported to be the 74.33% average accuracy achieved in [18] through the usage of ANNs. The selection of this approach was influenced by the successful application of neural networks to predict the outcomes of NFL American football games, per [20] and [21] (indeed, per [22], ANNs constitute the most commonly applied class of machine-learning algorithms with respect to sports-prediction tasks). While four varieties of ANNs and two ensembles of same are considered in [18], it is the relatively simple feed-forward network with only one hidden layer which achieves 74.33% accuracy. Our study of this work has revealed a number of factors which detract from the robustness of this figure, which is based on in-season prediction. Firstly, a small sample of 620 games from the 2007/08 season is used for training and validation (with respect to 10-fold cross validation), while the testing set consists of a mere 30 games. Moreover, these games are specifically chosen to be the first 650 games of the season to reduce the effects of injury and player transfers, as the model uses team-level performance metrics only, and is therefore blind to the identities and skill attributes of the players involved in a particular game. More significantly still, these team-level performance metrics are calculated by determining the associated seasonal average across the 2007/08 season, rendering the real-world applicability of the models moot (as such information is of course unknowable throughout the season, with reliable estimates only becoming available after a significant portion of the season has elapsed). Despite these flaws, this study does serve to illustrate the potential benefits of applying dimensionality reduction, as a significant increase in accuracy is achieved through the replacement of 22 original model features with 4 significant principal components (linear combinations of the original model features).

In [23], the problem of cross-season prediction is addressed, with training and testing data being respectively drawn from five consecutive NBA seasons (2005/06-2009/10) and the subsequent season. A comprehensive approach towards data sourcing is presented, with a wide variety of features being incorporated into the models (including the number of days’ rest enjoyed by opposing teams ahead of their contest). However, the encoding of certain features is ad hoc (with a somewhat arbitrary weighting being assigned to the outcomes of past instances of the given contest) and the potential benefits of dimensionality reduction are not explored. The team-level performance metrics for each team (including home/away status, points scored, number of offensive rebounds, etc.) are averaged over the last 10 games played, in contrast to our season-to-date approach. In keeping with this design choice, the decision is made to discard the first two months of data from all training, validation and testing data in the running of 10-fold cross validation. Despite this heavy-handed intervention, the best result achieved with respect to a suite of models encompassing logistic regression, Naïve Bayes classification, ANNs and SVMs is the relatively modest 69.67% prediction accuracy obtained using logistic regression.

A similar suite of models is considered in [24], with two versions of each type of model being generated for each season between 1992/93 and 1996/97, the first based on the previous season’s data only and the second based on the previous season plus season-to-date data. The linear-regression models (with an associated classification threshold value) achieve the best performance, with an average of 70% accuracy across the four seasons. Interestingly, an average increase of approximately one percent is observed across the four types of models through the incorporation of season-to-date data, lending support to the validity of our modeling approach.

For additional context regarding our accuracy figures, we note that a largely similar collection of machine-learning algorithms was applied in [19], [16], [25] and [17], achieving maximum comparable prediction accuracy rates of 67%, 67.7%, 70% and 67% respectively, with elementary methods such as Naïve Bayes classification and logistic regression outperforming the “black-box” methods of SVMs and ANNs.

Although our data set [6] contains player-level “box-score” statistics (including the number of assists, steals, blocks, etc. in a given game), our decision to aggregate these measures to the team level can be justified on the basis of the findings in [26], where it was established that player-level box-score statistics only hold significance for a tiny minority of star players, with 99% of players contributing fewer points, assists or rebounds than the mid-range value.

In terms of predictive power, it appears reasonable to suggest that a team’s recent performances should carry more weight than historical ones, with “form” variables appearing in the models developed in [17], [19] and [16]. In the contrary direction, the incorporation of a form measure was shown to offer no predictive benefit in [18]. In light of the above, we decided to engineer a form variable into our models, while reserving judgment regarding its effectiveness.


The Knowledge Discovery and Data Mining (KDD) process is a highly iterative framework for data mining that prioritises the extraction of valid, novel, potentially useful and interpretable patterns from data sources. In applying the KDD approach, a data analyst (repeatedly) passes through the phases of Data Selection, Pre-processing, Transformation, Data Mining and Interpretation/Evaluation, per Fig. 1.

Fig. 1. Overview of the KDD process

Data Selection: As above, our project is based on the analysis of [6], a large data set detailing 358 features of 8.7 million event-based game records, split across 15 csv files. Motivated by its broad applicability, we decided to consider the task of pre-game prediction, thereby differentiating our study from [7]. We chose to select end-of-fourth-quarter records from each game to ensure consistency with respect to the per-game statistics we sought to derive, as end-of-game statistics would be distorted by the occurrence of overtime periods. This extraction reduced our data source to one data set with 19,061 rows (one for each game) and 359 features (having encoded the associated season number). This data set contained features such as the outcome of each game, the identities of the home and away teams, their box-score totals for that game (e.g. number of home-team assists, blocks, offensive rebounds, etc.) and player-level features for each of the players on the home and away team rosters (e.g. the identity of player number x for the home team, their starting status for that game, their box scores for the game). Crucially, the identity of the player with a given home- or away-team number was represented by a unique personal identifier which was consistent across our 15 reference seasons, whereby game-by-game changes in the identity of the occupier of a given team’s player number 1 position could be captured, and the performance of a given player tracked across different seasons and potential changes in team affiliation.
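The end-of-fourth-quarter extraction described above can be sketched as follows. This is a minimal Python illustration, not the authors' code: the field names (`game_id`, `period`, `home_score`, `away_score`) are hypothetical stand-ins for the 358 columns in [6], and the events are assumed to arrive in chronological order.

```python
def end_of_regulation_records(events):
    """Keep one record per game: the last event logged in the fourth quarter.

    Overtime events (period > 4) are ignored so that per-game totals are
    comparable across games of different lengths.
    """
    last_q4 = {}
    for event in events:  # assumed chronological within each game
        if event["period"] == 4:
            last_q4[event["game_id"]] = event  # later Q4 events overwrite earlier ones
    return last_q4

# Toy play-by-play rows (illustrative only).
toy_events = [
    {"game_id": 1, "period": 4, "home_score": 98, "away_score": 95},
    {"game_id": 1, "period": 4, "home_score": 98, "away_score": 98},
    {"game_id": 1, "period": 5, "home_score": 110, "away_score": 104},  # overtime
    {"game_id": 2, "period": 3, "home_score": 80, "away_score": 70},
    {"game_id": 2, "period": 4, "home_score": 101, "away_score": 90},
]
per_game = end_of_regulation_records(toy_events)
print(per_game[1]["home_score"], per_game[1]["away_score"])
```

Game 1 is tied at the end of regulation in this toy data, so the game outcome would have to come from a separate outcome field, as it does in the data set described above.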

Pre-processing: Given the provenance of our data, this derived data set required little pre-processing. We confirmed that the data set contained no missing values, and noted the encoding of empty slots on a team roster (whereby the game features of player 15 for the home team were recorded as −1 in the case where no such player was on the game-day roster).

Transformation: Our re-purposing of [6] required substantial efforts with respect to data transformation. Clearly, a data set with 359 features raises concerns regarding feature redundancy and result interpretation. Moreover, per [27], models with a large number of predictors are susceptible to over-fitting their predictions to patterns in the training data, compromising their prediction accuracy with respect to unseen testing data. Furthermore, given the changes in occupancy of a team’s player number x role, the validity of basing predictions on player number x attributes is questionable. Thus, we decided to take a team-level view of the data. In particular, for each game and each team, we divided the players into two groups, a five-man “Starters” group and a “Non-Starters” group of varying size, and computed an average “weighted plus minus” figure for each of these groups (accounting for the varying number of Non-Starters). Per [28], the traditional “plus-minus” figure captures the change in score differential across all periods in which the player is on court. Although there exist refinements of this measure in the literature, the simple and natural weighting of a player’s plus-minus figure by the minutes they spent on court (our notion of “weighted plus minus”) appears to be under-utilised. Per the Related Work section, our choice to aggregate to the Starters and Non-Starters groups was motivated by the findings in [13] and [14] regarding the importance of player line-ups. Similarly, recognising that skill plays a particularly important predictive role in basketball, per [11], we have encoded our weighted plus-minus figures into our models to reflect team skill. When accumulated across seasons, as facilitated by the presence of unique personal identifiers as discussed above, these four features (Weighted Plus-Minus for Home Team Starters, Away Team Starters, Home Team Non-Starters and Away Team Non-Starters) possess significant predictive power, underpinning the accurate performance of our models.
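One possible reading of the group-level weighted plus-minus features is sketched below in Python. The text does not give the exact formula, so the weighting used here (plus-minus multiplied by minutes on court, then averaged over the players in each group) is an assumption, as are the field names and the toy roster.

```python
def weighted_plus_minus(plus_minus, minutes):
    """Assumed weighting: scale a player's plus-minus by minutes on court."""
    return plus_minus * minutes

def group_averages(roster):
    """Average the weighted figure separately over starters and non-starters."""
    starters = [weighted_plus_minus(p["plus_minus"], p["minutes"])
                for p in roster if p["starter"]]
    bench = [weighted_plus_minus(p["plus_minus"], p["minutes"])
             for p in roster if not p["starter"]]
    # Dividing by the group size accounts for the varying number of non-starters.
    starters_avg = sum(starters) / len(starters)
    bench_avg = sum(bench) / len(bench) if bench else 0.0
    return starters_avg, bench_avg

# Toy roster (illustrative only): five starters and three non-starters.
toy_roster = [
    {"plus_minus": 5, "minutes": 30, "starter": True},
    {"plus_minus": 3, "minutes": 32, "starter": True},
    {"plus_minus": -2, "minutes": 28, "starter": True},
    {"plus_minus": 4, "minutes": 35, "starter": True},
    {"plus_minus": 0, "minutes": 25, "starter": True},
    {"plus_minus": 1, "minutes": 12, "starter": False},
    {"plus_minus": -3, "minutes": 10, "starter": False},
    {"plus_minus": 2, "minutes": 6, "starter": False},
]
starters_avg, bench_avg = group_averages(toy_roster)
print(starters_avg, bench_avg)
```

In the setting described above, the per-player inputs would be running figures keyed on each player's unique identifier and carried across games and seasons, computed only from games that precede the one being predicted.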

Our other measures of team skill are derived from the existing team-level box-score game totals in [6], as discussed above. We historicise this game-based data to create season-to-date cumulative-average figures for each team’s score per game, fouls per game, offensive rebounds per game, etc. To achieve this, we converted our one-record-per-game data set into a temporary data frame containing two records per game, one for each team. For each game, we identified the last time each team was involved in a game and incorporated the associated measures. Fig. 2 illustrates the functioning of this running-totals data frame: we note the update of measures for Teams 23 and 24, which both played two games within the reference period, and the existence of the games played column, which facilitates the calculation of our cumulative per-game averages.
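The running-totals idea can be illustrated with a short Python sketch. It tracks a single measure (points) for one season; the field names are hypothetical, and the key point is that each game's features are read from the totals before that game's own result is added.

```python
from collections import defaultdict

def pregame_averages(games):
    """Return season-to-date points-per-game features known before each game."""
    totals = defaultdict(lambda: {"games": 0, "points": 0})
    rows = []
    for game in games:  # assumed chronological within a single season
        row = {}
        for side in ("home", "away"):
            t = totals[game[side]]
            # Averages start at zero until a team has played its first game.
            row[f"{side}_ppg"] = t["points"] / t["games"] if t["games"] else 0.0
        rows.append(row)
        # Update running totals only after the pre-game snapshot is taken.
        for side in ("home", "away"):
            t = totals[game[side]]
            t["games"] += 1
            t["points"] += game[f"{side}_points"]
    return rows

# Toy schedule (illustrative only), using team numbers as identifiers.
toy_games = [
    {"home": 23, "away": 24, "home_points": 100, "away_points": 90},
    {"home": 24, "away": 23, "home_points": 110, "away_points": 104},
    {"home": 23, "away": 25, "home_points": 96, "away_points": 99},
]
pregame_rows = pregame_averages(toy_games)
print(pregame_rows)
```

The same pattern extends to fouls, assists, rebounds and the other box-score measures by adding further keys to the running totals.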

Fig. 2. A running totals data frame enabling season-to-date historization

Fig. 2 also contains the “Form” variable we created, influenced by the application of similar predictors in [17], [19] and [16] per the Related Work section. This variable reflects a team’s performance over their last five games. It is encoded as a binary number of length five in the temporary data frame presented in Fig. 2, for ease of update, and as a win percentage in our stable pre-game-totals data frame, for ease of analysis. Again, we see the updating of this variable with respect to Teams 23 and 24 (we note that leading zeros are ignored in these binary form figures).

We note that our weighted plus-minus figures are non-zero at the start of each season, and even at the start of season 1, as they are derived from individualised figures pre-existing at the start of our first season and persisting across seasons. By contrast, our season-to-date per-game cumulative-average figures start at zero for each season, with their representation of team skill developing over the course of the season. Our form variable, on the other hand, clearly requires some initialisation. Unsure of the predictive worth of form variables (per the findings in [18], as above), we chose not to discard the first five games of each season on this basis, which is the approach adopted in [23]. Instead, in keeping with the findings in [19] that encoding the last five games of the previous season offers no predictive benefits, we chose an arbitrary form initialisation of 10101 for every team (this decision errs on the side of caution, as it can only serve to compromise the prediction accuracy of our models). In keeping with this choice, the reference period in Fig. 2 was chosen to reflect some variation in the form variable. We note that there is no consensus in the literature regarding the optimal range of a form variable, with our choice of a five-game range being based on experimentation across a variety of possible values.
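A five-game form variable of this kind can be maintained with simple bit operations. The Python sketch below is illustrative: it assumes the least significant bit holds the most recent result (the text does not specify the bit order), and the function names are hypothetical.

```python
FORM_LENGTH = 5
INITIAL_FORM = 0b10101  # arbitrary start-of-season value used for every team

def update_form(form, won):
    """Shift in the latest result (1 = win, 0 = loss) and drop the oldest one."""
    mask = (1 << FORM_LENGTH) - 1
    return ((form << 1) | int(won)) & mask

def form_win_percentage(form):
    """Convert the binary form figure into a win percentage over five games."""
    return bin(form).count("1") / FORM_LENGTH

# Toy sequence (illustrative only): a win followed by two losses.
form = INITIAL_FORM
history = [form]
for won in (True, False, False):
    form = update_form(form, won)
    history.append(form)

print([format(f, "b") for f in history])  # leading zeros are dropped, as in Fig. 2
print(form_win_percentage(history[-1]))
```

Storing the form as an integer makes the per-game update a single shift-and-mask step, while the win-percentage view is the one that would feed the model.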

The above interventions resulted in the construction of a stable pre-game-totals data frame with 29 features: a game-outcome feature and 28 team-level features (evenly divided between the home and away teams).
Data Mining: In keeping with our study of the literature, as outlined above, we applied a suite of machine-learning algorithms to our prediction task. For clarity, we will restrict our discussion to our findings with respect to C5.0 decision trees, logistic regression, k-Nearest Neighbours, SVMs and ANNs. We chose not to apply Naïve Bayes classifiers as all our 28 predictors were numeric (and thus would require conversion to categorical data through ad hoc binning). The C5.0 algorithm is one of the most commonly used implementations of a decision tree, whereby feature-based decision criteria are ordered on the basis of their ability to reduce entropy (thereby producing relatively homogeneous classification leaf nodes). Importantly, C5.0 tree models incorporate sensible pruning parameters, ensuring that the resultant trees do not grow to too great a depth. Thus, C5.0 output is highly interpretable, and can readily be converted into a deterministic flowchart of if-then rules. The k-Nearest Neighbours method is an instance-based learner, predicting an outcome on the basis of a majority vote across the k closest instances (as measured by Euclidean distances in the multi-dimensional space of normalised features). Binary logistic regression is a well-understood statistical method, whereby the output of a linear model is transformed into an estimate of the probability of belonging to a particular class. ANNs (Artificial Neural Networks) are loosely modelled on the human brain and take the form of directed graphs whose input nodes are in one-to-one correspondence with the predictor features. They contain at least one hidden layer, which attaches learned weights to received input (reflecting node importance) and outputs the evaluation of these weighted vectors by an activation function. SVMs (Support Vector Machines) address binary classification problems by splitting data on the basis of maximising the distance between the data and the associated decision boundary, with the shape of this boundary (e.g. linear, polynomial) being determined by the choice of kernel.
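The entropy-reduction idea behind decision-tree splits can be illustrated with a small Python sketch. It scores a single numeric threshold by information gain; C5.0 itself uses more refined criteria alongside pruning, boosting and winnowing, so this is an illustration of the principle rather than of the C5.0 implementation. The feature values and labels are toy data.

```python
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing gains nothing
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder

# Toy feature (e.g. a difference in some team-skill measure) and outcomes.
toy_values = [-3, -1, 2, 4]
toy_labels = ["A", "A", "H", "H"]
gain_at_zero = information_gain(toy_values, toy_labels, threshold=0)
gain_at_minus_two = information_gain(toy_values, toy_labels, threshold=-2)
print(gain_at_zero, gain_at_minus_two)
```

The threshold at zero separates the two classes perfectly and so removes all of the entropy, whereas the threshold at -2 leaves a mixed right-hand branch and gains less; a tree learner prefers the former.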

As discussed above, we control for sources of bias in our predictions by respecting the temporal ordering of the data. We adopt a systematic approach with respect to the training, tuning and testing of each class of machine-learning algorithms, applying 10-fold cross validation in all cases. In 10-fold cross validation, a data set is evenly partitioned into 10 subsets, with each subset in turn acting as a test set with respect to the corresponding nine-subset training set, thereby ensuring that the model is tested against every instance in the data and yielding a robust measure of model accuracy. Similarly, when seeking to “tune” or optimise model parameters, we again apply 10-fold cross validation. Per [27], an important distinction arises in this setting: firstly, the data is divided into training and testing sets, with the cross-fold validation being applied to the training set only. Testing across the training set in this manner allows for the optimisation of model parameters, with a best model being derived as a result. This model is in turn trained on the full training set and then tested on the original, and crucially unseen, testing set (ensuring that the model optimisation process does not introduce bias through exposure to the testing data). Heeding the advice in [29], we apply stratified sampling in the construction of cross-validation subsets, to better ensure the reliability of our results. In keeping with the iterative nature of the KDD process, our initial ad hoc approach to model training, tuning and testing was manual in nature, involving simple hold-out data splits and the application of loop constructions to build multiple models. Subsequently, we achieved our current systematic approach, incorporating all of the above points, by harnessing the capabilities of the caret package, as outlined in [30].
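The two data-splitting steps described above (a season-based train/test split, followed by stratified folds within the training set) can be sketched in plain Python. This is not the caret workflow used in the study; the function names, field names and toy data are hypothetical.

```python
import random
from collections import defaultdict

def temporal_split(rows, first_test_season):
    """Train on earlier seasons, test on later ones (no look-ahead)."""
    train = [r for r in rows if r["season"] < first_test_season]
    test = [r for r in rows if r["season"] >= first_test_season]
    return train, test

def stratified_folds(labels, k=10, seed=0):
    """Assign indices to k folds so each fold mirrors the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)  # deal each class round-robin
    return folds

# Toy data (illustrative only): one row per season, then 60 home / 40 away wins.
toy_rows = [{"season": s} for s in range(1, 16)]
train_rows, test_rows = temporal_split(toy_rows, first_test_season=13)

toy_labels = ["H"] * 60 + ["A"] * 40
folds = stratified_folds(toy_labels, k=10)
print(len(train_rows), len(test_rows), [len(f) for f in folds])
```

Each fold would serve once as the validation set while the model is fitted on the other nine; the held-out test seasons are touched only after tuning is complete.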

Interpretation/Evaluation: A discussion of our results is presented below.


Given the binary nature of our prediction task, and our indifference with regard to whether a misclassification is a false positive or a false negative (as befits the context), the standard classification accuracy measure (proportion of correct classifications to total classifications) suffices to evaluate the performance of our models. The accuracy rates for our finalised suite of models, trained, tuned and tested in a consistent manner through the application of the caret package as discussed above, are presented in the following table.

Method                    Accuracy
C5.0                      74.08%
SVM                       72.26%
k-NN                      71.98%
h2o AutoML best model     71.92%
Logistic Regression       71.9%
ANN                       71.26%

Here, the “h2o AutoML best model” is the most accurate model returned by h2o’s Automated Machine Learning feature after one hour.

Per the above table, our C5.0 model possesses impressive prediction accuracy. As above, the parameters of this model were uncovered by applying a tuning process via caret. In particular, 10-fold cross validation repeated 5 times revealed the optimum parametrisation with respect to the goal of model accuracy, as illustrated in the following figure.

Fig. 3. Tuning our C5.0 model via caret

Our boosting trials parameter indicates that 20 C5.0 trees are constructed and polled to develop our optimum C5.0 model. Per Fig. 3, as “winnow=TRUE”, our optimum model employs automatic feature selection. Fig. 4 illustrates the reduced set of 18 features chosen by the winnowing process, and their relative importance to the model. As suggested previously, our weighted plus-minus figures underpin the performance of this model. Our derived five-game form figure also demonstrates its worth with respect to this model.

Fig. 4. Attribute usage in our C5.0 model

Finally, our confusion matrix output is presented in Fig. 5. The scale of the boost in accuracy achieved when the optimum model is trained on the full set of training data is perhaps surprising. We note the relatively tight 95% confidence interval around our 74.08% accuracy figure, the Kappa value of 0.448 (which generally represents “moderate” prediction accuracy with respect to an arbitrary classification task, but can be considered high in the context of pre-game sports prediction), and the balanced performance accuracy of our model with respect to the games it predicts as home wins and those it predicts as away wins.

Fig. 5. Our C5.0 model’s confusion matrix
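The accuracy and Kappa statistics that accompany a confusion matrix can be recomputed from its four cells. The Python sketch below uses invented counts, not the values behind Fig. 5, and applies the standard definition of Cohen's Kappa (observed agreement corrected for the agreement expected by chance).

```python
def accuracy_and_kappa(tp, fp, fn, tn):
    """Accuracy and Cohen's Kappa for a 2x2 confusion matrix.

    tp: predicted home win, actual home win
    fp: predicted home win, actual away win
    fn: predicted away win, actual home win
    tn: predicted away win, actual away win
    """
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement: product of predicted and actual marginals per class.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Toy counts (illustrative only).
toy_accuracy, toy_kappa = accuracy_and_kappa(tp=50, fp=15, fn=10, tn=25)
print(f"accuracy={toy_accuracy:.4f}, kappa={toy_kappa:.4f}")
```

Because Kappa discounts the agreement a majority-class guess would achieve, it gives a stricter view of performance than raw accuracy when one outcome (here, home wins) dominates.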

Our final SVM model was derived by employing the same systematic caret-based tuning approach. Its performance is summarised in Fig. 6. The impact of varying the values of the associated parameters with respect to the four kernel choices considered in this project (linear, polynomial, radial and sigmoidal) is illustrated in Fig. 7. The relatively poor performance of an SVM with sigmoidal kernel across the range of parameters considered is noteworthy, as is the extent to which performance diverges with respect to variations in the shape of the decision boundary.

Fig. 6. Our SVM model’s confusion matrix

Fig. 7. Our SVM model’s confusion matrix

The incorporation of automated feature selection allowed us to dispense with the manual selection of significant features. Our analysis of variance table with respect to our logistic regression model demonstrates one method by which such manual feature selection may be achieved, per Fig. 8.

Interestingly, we see some sharp divergence between the features deemed most significant with respect to logistic regression as opposed to those selected by our C5.0 model. While our weighted plus-minus features are again ranked highly (with the addition of our two Starters values making large contributions to the accuracy of our model, per the deviance column), our form features are deemed less significant with respect to this model.
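The deviance column referred to above records how much the model's deviance falls when a feature is added. As a rough illustration of that quantity, the Python sketch below computes the binomial deviance (minus twice the log-likelihood) of a set of predicted win probabilities and the reduction obtained by moving from an intercept-only model to a fitted one. The function names, the five outcomes and the fitted probabilities are all invented for the example; they are not taken from Fig. 8.

```python
import math


def binomial_deviance(outcomes, probabilities):
    """-2 * log-likelihood of 0/1 outcomes under predicted probabilities."""
    log_lik = 0.0
    for y, p in zip(outcomes, probabilities):
        log_lik += math.log(p) if y == 1 else math.log(1.0 - p)
    return -2.0 * log_lik


def deviance_reduction(outcomes, probs_without, probs_with):
    """Drop in deviance when a feature is added to the model."""
    return binomial_deviance(outcomes, probs_without) - binomial_deviance(outcomes, probs_with)


# Hypothetical data: 1 = home win, 0 = away win.
games = [1, 1, 0, 1, 0]
base_rate = sum(games) / len(games)        # intercept-only model
null_probs = [base_rate] * len(games)
fitted_probs = [0.8, 0.7, 0.3, 0.6, 0.4]   # made-up fitted probabilities
drop = deviance_reduction(games, null_probs, fitted_probs)
```

A larger reduction indicates that the added feature explains more of the variation in outcomes, which is the sense in which such a table can guide manual feature selection.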

As discussed above, we have confidence in the robustness of our findings due to the rigorous approach we applied to all aspects of model development, training, tuning and testing. Indeed, when presented with modelling decisions, we chose to err on the side of caution in all instances, guarding against positively biasing our accuracy findings. Similarly, our replacement of manual and ad-hoc model training, tuning and testing methods with a systematic, caret-based approach places our findings on a sound footing.

Fig. 8. The ANOVA table for our logistic regression model


The primary contribution to knowledge made by our work appears to be that the difficulty of cross-season prediction can be ameliorated by retaining model visibility of the specific composition of team line-ups, made possible by the presence of personal identifiers in [6]. Our findings also highlight the usefulness of our weighted plus-minus measures, which appear to be underused in the existing literature. That our highly accurate model takes the form of a decision tree is significant, as this model can be converted into interpretable if-then rules and could conceivably be used by coaches to tailor their line-up selections with respect to the opposition (to allow for the resting of key players on occasions where the model would still predict success). Our accuracy figures may also have implications for the betting industry, although we have not attempted to evaluate whether our high prediction accuracy would give a competitive edge with respect to the odds offered by bookmakers.
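As a small illustration of how a decision tree can be read as if-then rules, the Python sketch below walks a toy binary tree and emits one rule per leaf. Everything in it is hypothetical: the nested-dictionary tree format, the feature names and the thresholds are invented for the example and do not reproduce the structure of our C5.0 model.

```python
def tree_to_rules(node, conditions=()):
    """Flatten a binary decision tree into a list of if-then rule strings."""
    if "label" in node:  # leaf node: emit the accumulated conditions
        premise = " and ".join(conditions) if conditions else "always"
        return [f"if {premise} then predict {node['label']}"]
    feature, threshold = node["feature"], node["threshold"]
    left = tree_to_rules(node["left"], conditions + (f"{feature} <= {threshold}",))
    right = tree_to_rules(node["right"], conditions + (f"{feature} > {threshold}",))
    return left + right


# Hypothetical two-level tree; feature names and thresholds are invented.
toy_tree = {
    "feature": "home_weighted_plus_minus", "threshold": 0.0,
    "left": {"label": "away win"},
    "right": {
        "feature": "home_five_game_form", "threshold": 2,
        "left": {"label": "away win"},
        "right": {"label": "home win"},
    },
}
rules = tree_to_rules(toy_tree)
```

Each rule corresponds to one root-to-leaf path, so a coach could in principle inspect which conditions on a proposed line-up lead to a predicted win.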

Given more time, we would experiment with the potential benefits of restricting our historical data to a more recent range. We would also investigate how our models could be sensibly extended to a player level of granularity. In particular, an interesting topological approach to identifying clusters of “star players” is outlined in [31], and merits further consideration. Our model may also benefit from the appropriate introduction of shooting scores, per [32].


[1] M. Lewis, Moneyball: The Art of Winning an Unfair Game. New York, NY, USA: WW Norton & Company, 2004.

[2] KPMG and CII, “The business of sports,” 2016. [Online]. Available: https://assets.kpmg.com/content/dam/kpmg/in/pdf/2016/09/the-business-of-sports.pdf

[3] E. Morgulev, O. H. Azar, and R. Lidor, “Sports analytics and the big- data era,” International Journal of Data Science and Analytics, 2018.

[4] B. Ofoghi, J. Zeleznikow, C. MacMahon, and M. Raab, “Data mining in elite sports: A review and a framework,” Measurement in Physical Education and Exercise Science, vol. 17, no. 3, pp. 171–186, 2013.

[5] Y. De Saá Guerra, J. M. Martín González, S. Sarmiento Montesdeoca, D. Rodríguez Ruiz, A. García-Rodríguez, and J. M. García-Manso, “A model for competitiveness level analysis in sports competitions: Application to basketball,” Physica A: Statistical Mechanics and its Applications, vol. 391, no. 10, pp. 2997–3004, 2012.

[6] “2018 MIT Sloan Sports NBA Play-by-Play Dataset,” 2018. [Online]. Available: https://www.stats.com/data-science/

[7] S. Ganguly and N. Frank, “The Problem with Win Probability,” in MIT Sloan Sports Analytics Conference, 2018.

[8] M. Haghighat, H. Rastegari, and N. Nourafza, “A review of data mining techniques for result prediction in sports,” Advances in Computer Science: an International Journal, vol. 2, no. 5, pp. 7–12, 2013.

[9] R. Pollard and G. Pollard, “Long-term trends in home advantage in professional team sports in North America and England (1876–2003),” Journal of Sports Sciences, vol. 23, no. 4, pp. 337–350, 2005.

[10] O. A. Entine and D. S. Small, “The Role of Rest in the NBA Home-Court Advantage,” Journal of Quantitative Analysis in Sports, vol. 4, no. 2, pp. 6–16, 2008.

[11] R. Y. Aoki, R. M. Assuncao, and P. O. Vaz de Melo, “Luck is hard to beat: The difficulty of sports prediction,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’17, 2017, pp. 1367–1376.

[12] R. Lin, “Mason: Real-time NBA matches outcome prediction,” Masters Dissertation. Arizona State University, 2017. [Online]. Available: https://repository.asu.edu/

[13] H. S. Bhat, L.-H. Huang, and S. Rodriguez, “Learning stochastic models for basketball substitutions from play-by-play data,” in MLSA@PKDD/ECML, 2015.

[14] M.-H. Oh, S. Keshri, and G. Iyengar, “Graphical model for basketball match simulation,” in MIT Sloan Sports Analytics Conference, 2015.

[15] R. A. Torres, “Prediction of NBA games based on Machine Learning Methods,” University of Wisconsin Madison Introduction to Artificial Neural Networks and Fuzzy Systems Course Project, 2013. [Online]. Available: https://homepages.cae.wisc.edu/ ece539/fall13/project/

[16] J. Perricone, I. Shaw, and W. Świe, “Predicting Results for Professional Basketball Using NBA API Data,” p. 229, 2014. [Online]. Available: http://cs229.stanford.edu/proj2016/report

[17] D. Miljković, L. Gajić, A. Kovačević, and Z. Konjović, “The use of data mining for basketball matches outcomes prediction,” SIISY 2010 – 8th IEEE International Symposium on Intelligent Systems and Informatics, pp. 309–312, 2010.

[18] B. Loeffelholz, E. Bednar, and K. W. Bauer, “Predicting NBA Games Using Neural Networks,” Journal of Quantitative Analysis in Sports, vol. 5, pp. 1–10, 2009.

[19] A. Zimmermann, “Basketball predictions in the NCAAB and NBA: Similarities and differences,” Statistical Analysis and Data Mining, vol. 9, no. 5, pp. 350–364, 2016.

[20] J. Kahn, “Neural network prediction of NFL football games,” 2003. [Online]. Available: http://homepages.cae.wisc.edu/ ece539/project/f03/kahn.pdf

[21] M. C. Purucker, “Neural network quarterbacking,” IEEE Potentials, vol. 15, no. 3, pp. 9–15, 1996.

[22] Z. Ivankovic, M. Rackovic, B. Markoski, D. Radosav, and M. Ivkovic, “Analysis of basketball games using neural networks,” 2010 11th International Symposium on Computational Intelligence and Informatics (CINTI), pp. 251–256, 2010. [Online]. Available: http://ieeexplore.ieee.org/document/5672237/

[23] C. Cao, “Sports Data Mining Technology Used in Basketball Outcome Prediction,” Masters Dissertation. Dublin Institute of Technology, 2012. [Online]. Available: http://arrow.dit.ie/

[24] M. Beckler, H. Wang, and M. Papamichael, “NBA Oracle,” Carnegie Mellon University Machine Learning Course Project, vol. 1, no. 2009, 2013. [Online]. Available: http://www.mbeckler.org/coursework/2008-2009/10701 report.pdf

[25] L. Richardson, D. Wang, C. Zhang, and X. Yu, “NBA Predictions,” pp. 1–10, 2014. [Online]. Available: http://www.stat.cmu.edu/ lrichard/links/nba predictions.pdf

[26] P. O. Vaz de Melo, V. A. Almeida, and A. A. Loureiro, “Can complex network metrics predict the behavior of NBA teams?” Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining – KDD 08, pp. 695–703, 2008.

[27] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. New York, NY, USA: Springer New York Inc., 2001.

[28] “Basketball Reference website,” 2018. [Online]. Available:

[29] R. Kohavi, “A study of cross-validation and bootstrap for accuracy estimation and model selection,” IJCAI, vol. 14, no. 2, pp. 1137–1145,

[30] M. Kuhn, “Building Predictive Models in R Using the caret Package,” Journal of Statistical Software, vol. 28, no. 5, 2008.

[31] P. Y. Lum, G. Singh, A. Lehman, T. Ishkanov, M. Vejdemo-Johansson, M. Alagappan, J. Carlsson, and G. Carlsson, “Extracting insights from the shape of complex data using topology,” Scientific Reports, vol. 3, pp. 1–8, 2013.

[32] G. Page, G. Fellingham, and C. Reese, “Using Box-Scores to Determine a Position’s Contribution to Winning Basketball Games,” Journal of Quantitative Analysis in Sports, vol. 3, no. 4, 2007.
