Hours-ahead automated long short-term memory (LSTM) electricity load forecasting at substation level: Newcastle substation

Electrical energy is of vital importance in our lives. Every country needs this resource to develop its economy, and factories, businesses, and homes are the basis of a country's economic structure. Newcastle, like other cities, is growing day by day in terms of industries, homes, and businesses, and these are the elements that consume the electricity produced in the city. Although Australia has strategically located substations that supply all existing loads with quality power, the load will from time to time exceed the capacity of these substations, and they will not be able to supply the loads that arise as the city grows. To address this problem, we use a deep learning model to improve forecasting accuracy. In this paper, a Long Short-Term Memory (LSTM) recurrent neural network is tested on a publicly available 30-minute dataset containing measured real power data for individual zone substations in the Ausgrid supply area. The performance of the model is compared comprehensively across 4 different configurations of the LSTM. The proposed LSTM approach with 2 hidden layers and 50 neurons outperforms the other configurations, with a mean absolute error (MAE) of 0.0050 in the short-term load forecasting task for substations.


Introduction
Load forecasting has for years been an important process in the energy sector in general and in electric utilities in particular. In industry, the needs served by load forecasting, such as planning, operations, and maintenance, have become more important than before. With the promotion of smart grid technologies, load forecasting is of even greater importance because of its applications in the planning of demand-side management, electric vehicles, and distributed energy resources. Many users of the utilities (households, businesses, and government) produce their own load forecasts, which results in inefficient and ineffective use of resources (Weicong & Yan, 2019). This paper proposes an integrated load forecasting framework that concentrates on the substation level. The tool applies a deep learning model (LSTM) to the Newcastle CBD substation, using the past three years of 30-minute load data for this zone substation in the Ausgrid network, in which normal voltage supplies to the 33 kV feeders are served by that substation. The purpose of the assessment is to predict the loading of the Newcastle CBD substation over the next hours and to automate the process (Ausgrid, Distribution and Transmission Annual Planning Report, 2018). The dataset contains 30-minute metered real power data for individual zone substations in the Ausgrid supply area from January 1st, 2014 to December 31st, 2016, in annual sets.
The data is taken directly from the original Ausgrid zone substation dataset.

This analysis accounts for major factors that influence the real and reactive power consumption at any given time and day. Major factors considered include random ("stochastic") customer behaviour by time of day, day of week, and season, as well as ambient weather conditions (temperature and humidity) that have a significant impact on demand. The LSTM model is used to "predict" the hourly real power demand on randomly selected weekdays (including summer, winter, and shoulder season weekdays) when normal voltage (not reduced voltage) is applied to the feeders. The predicted real power values are then compared with the actual measurements on randomly selected weekdays to determine the accuracy of the predictions. The accuracy of the predictions is well within the target error rate, with a mean absolute error (MAE) of 0.0050.
The rest of the paper is organised as follows. Section I provides the background of the load forecasting community. Section II presents the forecasting framework based on LSTM. Section III introduces the experimental setup. The testing dataset and experimental results are given in Section IV, while Section V concludes the paper.

Related work
Smart grid data has been used for many electricity load forecasting tasks (Wang, Chen, Hong, & Kang, 2018). The data is treated as sequential data. Most commonly, autoregressive integrated moving average (ARIMA) models, Support Vector Machines (SVM), linear regression, and Artificial Neural Networks (ANN) have been tested for forecasting the electricity load.
An ARIMA model is a class of statistical models for analysing and forecasting time series data. It explicitly caters to a suite of standard structures in time series data, and as such provides a simple yet powerful method for making skilful time series forecasts. ARIMA is an acronym that stands for Autoregressive Integrated Moving Average. It is a generalisation of the simpler Autoregressive Moving Average and adds the notion of integration. (Norizan, Maizah, Zuhaimy, & Suhartono, 2010) found that the MAPE of the one-step-ahead out-of-sample forecasts, for horizons ranging from a one-week lead time to a one-month lead time, is in every case less than 1%. Therefore, they proposed that a double seasonal ARIMA model with one-step-ahead forecasts should be considered when forecasting time series data with two seasonal cycles, especially for Malaysian load data. On the other hand, (Bishnu, Motoi, Aya, & Toshiya, 2019) proposed a forecasting method for the electricity load of university buildings using a hybrid model comprising a clustering technique and the ARIMA model; the combination was shown to forecast better than the ARIMA model alone.
A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression problems, although it is mostly used for classification. An SVM approach was proposed during the EUNITE network competition by (Bo-Juen & Ming-Wei, 2004). They found that temperature (or other types of climate information) might not be useful in such a mid-term load forecasting problem and that introducing the time-series concept may improve the forecasts. Support vector machines have been successfully employed to solve nonlinear regression and time series problems. However, SVMs have rarely been applied to forecasting electricity load. Moreover, Simulated Annealing (SA) algorithms were used in the proposed SVMSA (support vector machines with simulated annealing) model. SVMSA was applied by (Ping-Feng & Wei-Chiang, 2005) to load data from Taiwan; the empirical results reveal that the proposed model outperforms the other two models, namely the autoregressive integrated moving average (ARIMA) model and the general regression neural networks (GRNN) model.
Artificial Neural Networks (ANNs), or neural networks (NNs), are biologically inspired computer programs designed to simulate the way in which the human brain processes information. ANNs gather their knowledge by detecting the patterns and relationships in data and learn (or are trained) through experience, not from programming. An ANN is formed from hundreds of single units, artificial neurons, or processing elements (PEs), connected with coefficients (weights), which constitute the neural structure and are organised in layers. The power of neural computations comes from connecting neurons in a network. Each PE has weighted inputs, a transfer function, and one output. The behaviour of a neural network is determined by the transfer functions of its neurons, by the learning rule, and by the architecture itself.
During training, the inter-unit connections are optimized until the error in predictions is minimized and the network reaches the specified level of accuracy. Once the network is trained and tested, it can be given new input information to predict the output. ANNs represent a promising modelling technique, especially for data sets with non-linear relationships, which are frequently encountered in electricity load processes. In terms of model specification, ANNs require no knowledge of the data source but, since they often contain many weights that must be estimated, they require large training sets. In addition, ANNs can combine and incorporate both literature-based and experimental data to solve problems. The various applications of ANNs can be summarised as classification or pattern recognition, prediction, and modelling. (Hong, C., & A., 2002) proposed an artificial neural network (ANN)-based short-term load forecasting technique that considers electricity price as one of the main characteristics of the system load, demonstrating the importance of considering pricing when predicting loading in today's electricity markets. In turn, (Abdollah & Mohammad-Reza, 2013) proposed a new hybrid forecasting method based on the wavelet transform, ARIMA, and ANN for short-term load forecasting. In the proposed model, the autocorrelation function and the partial autocorrelation function are used to assess the stationary or non-stationary behaviour of the load time series. Finally, the outputs of the ARIMA and the ANN are summed. The empirical results show that the proposed hybrid method can suitably improve load forecasting accuracy.

Multilayer neural network
Before developing the LSTM model, we discuss a special type of network called a multilayer perceptron (MLP). This network consists of three layers: an input layer, a hidden layer, and an output layer (Peter J. & Richard A., 2016). The input layer and the output layer are each fully connected to the hidden layer. Such a network with more than one hidden layer is called a deep artificial neural network. A deep network is shown in Figure 1. It is a well-established technique for solving real-life problems such as speech recognition, image classification, and video analysis, as well as weather forecasting, stock market forecasting, and energy demand forecasting. Another deep architecture is the convolutional neural network (CNN). For sequence prediction problems specifically, data scientists use recurrent networks.
LSTM is a unique type of recurrent neural network (RNN) capable of learning long-term dependencies which is very useful for certain types of predictions that require the network to retain information over a very long time. LSTM networks are very suitable for classifying, parsing, and making predictions based on time series and sequence data.
Like a typical neural network, the LSTM is composed of layers and neurons, and input data is propagated through the network to make a prediction. In a feedforward neural network, however, information flows only in the forward direction, from the input nodes through the hidden layers to the output nodes, and there are no cycles or loops in the network. This architecture has some limitations: it does not handle sequential data well, because it considers only the current input and cannot memorize or take into account previous inputs. The LSTM overcomes these problems.
Unlike a feedforward neural network, the LSTM considers not only the current input but also memorizes previous inputs, so that higher accuracy can be achieved. Like an RNN, the LSTM has recurrent connections, so that the state from the previous activation of a neuron at the previous time step is carried into the current step. However, the RNN suffers from two problems. The first is the vanishing gradient problem, which is a particular problem for RNNs because updating the network involves unrolling it for each input time step, in effect creating a very deep network whose weights must be updated. The second is the exploding gradient problem, in which the accumulation of large derivatives makes the model very unstable and incapable of effective learning. The large changes in the model weights create a very unstable network; at extreme values, the weights become so large that they cause overflow, resulting in missing (NaN) weight values that can no longer be updated.
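Both problems can be illustrated numerically with a rough sketch. It assumes a hypothetical scalar linear recurrence h_t = w * h_{t-1}, which is not the network used in this paper: unrolled over T time steps, the gradient of h_T with respect to h_0 is w^T, so it shrinks towards zero when |w| < 1 and grows without bound when |w| > 1.

```python
def gradient_through_time(w: float, steps: int) -> float:
    """Gradient of h_T w.r.t. h_0 for the scalar recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # chain rule: one factor of w per unrolled time step
    return grad


# Illustrative weights only; 50 unrolled time steps.
vanishing = gradient_through_time(0.5, 50)  # shrinks towards zero
exploding = gradient_through_time(1.5, 50)  # grows very large
print(f"w=0.5: {vanishing:.3e}   w=1.5: {exploding:.3e}")
```

In a real recurrent network the factor is a Jacobian matrix rather than a scalar, but the same repeated multiplication is what drives the gradient towards zero or towards overflow.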
The LSTM has a unique formulation that allows it to avoid these problems and to maintain a constant error, which allows it to continue learning over numerous time steps.
In short, the LSTM overcomes the memory problems from which the RNN suffers. One reason for the success of this recurrent network lies in its ability to handle the exploding/vanishing gradient problem, which is a difficult issue to circumvent when training recurrent or very deep neural networks. The LSTM can achieve impressive results in sequential prediction problems and has gained huge popularity in recent years.
Sequence prediction is different from other types of supervised learning problems, because the order of the observations must be preserved when training models and making predictions. Generally, prediction problems that involve sequence data are referred to as sequence prediction problems, and there are four common types: sequence prediction, sequence classification, sequence generation, and sequence-to-sequence prediction. Whichever type of problem is being dealt with, the sequence imposes an explicit order on the observations; this order is very important and must be respected in the formulation of prediction problems that use sequence data as input and output for the model.

Limitations of LSTM
Although LSTMs are good at solving sequence problems and their results are very impressive, they are not failsafe. Their two most problematic aspects arise from overfitting and from their black-box character.
Overfitting occurs whenever a model fits its training set so well that it fails to generalise correctly when we use it on a test set different from the training data set that we used to build it (Jabbar & Khan, 2014). This is a common problem in many machine learning techniques. Neural networks include a multitude of adjustable parameters: the weights that model the connections between neurons (Dietterich, 1995). This large number of parameters makes them prone to overfitting. A multitude of techniques have been proposed that, to a greater or lesser extent, make it possible to avoid this problem (Yunita, 2018), (Schittenkopf, Deco, & Brauer, 1997), (Piotrowski & Napiorkowski, 2013), (Salman & Liu, 2019), (Tetko, Livingstone, & Luik, 1995).
In certain application domains, the most pressing problem with artificial neural networks is our inability to determine how neural networks reach a conclusion. In a neural network, we can look at the input of the network and see what its output is, but its inner workings are something that cannot be described symbolically. For us, a neural network is, to a large extent, a black box. This problem is something on which much remains to be done and which may have implications for the security of systems that employ neural networks internally (Franco, 2019).

LSTM architecture
Before moving on to the LSTM, let us take a quick look at the RNN, to gain a better understanding of the basic concept of recurrent relations.
A Recurrent Neural Network (RNN) is a generalization of the feedforward neural network that has an internal memory. An RNN is recurrent in nature, as it performs the same function for every input of data, while the output for the current input depends on the previous computation. After the output is produced, it is copied and sent back into the recurrent network. To make a decision, the network considers the current input and the output that it has learned from the previous input.
Unlike feedforward neural networks, RNNs can use their internal state (memory) to process sequences of inputs. This makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition. In other neural networks, all the inputs are independent of each other.
But in an RNN, all the inputs are related to each other. First, the network takes X(0) from the input sequence and outputs h(0), which together with X(1) is the input for the next step. Similarly, h(1) together with X(2) is the input for the step after that, and so on. In this way, the network keeps remembering the context while training (Figure 2). The current state is a function of the previous state and the current input. When the activation function is applied, $W$ is a weight, $h$ is the single hidden vector, $W_{hh}$ is the weight at the previous hidden state, $W_{xh}$ is the weight at the current input state, and tanh is the activation function, which implements a non-linearity that transforms the activations to the range [-1, 1]. $y_t$ is the output state.
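For reference, the three relations described above can be written in the standard RNN formulation, which matches the symbols defined in the text. The generic function $f$ and the output weight $W_{hy}$ are assumed names that do not appear in the text.

```latex
\begin{align}
h_t &= f(h_{t-1}, x_t) \\
h_t &= \tanh\left(W_{hh}\, h_{t-1} + W_{xh}\, x_t\right) \\
y_t &= W_{hy}\, h_t
\end{align}
```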

From RNN to LSTM
In an LSTM network, three gates and four steps are present:
Step 1: Forget gate, which decides what information to discard from the cell. The decision is made by a sigmoid function, which looks at the previous state ($h_{t-1}$) and the current input ($x_t$) and outputs a number between 0 (omit this) and 1 (keep this) for each number in the cell state $c_{t-1}$.
Step 2: Input gate, which decides which values from the input are used to update the memory state. The sigmoid function decides which values to let through (0 to 1), and the tanh function gives a weighting to the values that are passed, deciding their level of importance on a scale from -1 to 1.
Step 3: Update of the cell state:
$c_t = f_t * c_{t-1} + i_t * g_t$ (7)
Step 4: Output gate, which decides what to output based on the input and the long-term memory of the cell. The sigmoid function decides which values to let through (0 to 1), and the tanh function gives a weighting to the values that are passed, deciding their level of importance on a scale from -1 to 1; the result is multiplied by the output of the sigmoid (Figure 3).
The LSTM model is summarized by two main functions.
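The four steps can be sketched for a single scalar unit as follows. This is a minimal illustration under the standard LSTM formulation, in which $f_t$, $i_t$ and $o_t$ are sigmoids and $g_t$ is a tanh of a weighted sum of $x_t$ and $h_{t-1}$, and $h_t = o_t * \tanh(c_t)$; it is not the Keras implementation used in this work. The function `lstm_step`, the parameter layout `(w_x, w_h, b)`, and all numeric values are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step for a single scalar unit.

    `params` maps each gate name ("f", "i", "g", "o") to hypothetical
    (w_x, w_h, b) values: input weight, recurrent weight, and bias.
    """
    def pre_activation(gate):
        w_x, w_h, b = params[gate]
        return w_x * x_t + w_h * h_prev + b

    f_t = sigmoid(pre_activation("f"))    # Step 1: forget gate, in (0, 1)
    i_t = sigmoid(pre_activation("i"))    # Step 2: input gate, in (0, 1)
    g_t = math.tanh(pre_activation("g"))  # Step 2: candidate values, in (-1, 1)
    c_t = f_t * c_prev + i_t * g_t        # Step 3: cell state update, equation (7)
    o_t = sigmoid(pre_activation("o"))    # Step 4: output gate, in (0, 1)
    h_t = o_t * math.tanh(c_t)            # Step 4: new hidden state
    return h_t, c_t


# Illustrative parameters and inputs only (not fitted to any data).
params = {gate: (0.5, 0.1, 0.0) for gate in "figo"}
h, c = 0.0, 0.0
for x in [0.2, 0.4, 0.6]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

With vector-valued states, the products become elementwise products and the weights become matrices, but the order of the four steps is unchanged.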

Experimental setup
Deep learning models are implemented in a programming language. In this work, TensorFlow.Keras in Python 3 is used: within this API (Application Programming Interface), an artificial neural network is defined, which is then converted into a set of commands that are executed on the computer. The stages that require intensive hardware resources are the processing of input data, the training of the deep learning model, the storage of the trained model, and the deployment of the model. Of these, training is the most intensive task, because two main operations are performed: one at the forward step and the other at the backward step.

In the forward step, the input is passed through the neural network and after processing the input, an output is generated. In the backward step, the neural network weights are updated based on the error obtained in the forward step. Both operations are essentially matrix multiplications.
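The two steps can be sketched for the simplest possible case. The example assumes a hypothetical one-weight linear model with a squared-error loss and plain gradient descent, not the LSTM trained in this work; in a real network the same two steps are carried out with matrix multiplications.

```python
def train_step(w, b, x, y, lr=0.05):
    """One forward and one backward step for the model y_hat = w * x + b."""
    # Forward step: pass the input through the model and measure the error.
    y_hat = w * x + b
    error = y_hat - y
    loss = error ** 2
    # Backward step: gradients of the squared error w.r.t. each parameter.
    grad_w = 2.0 * error * x
    grad_b = 2.0 * error
    # Update the weights in the direction that reduces the error.
    return w - lr * grad_w, b - lr * grad_b, loss


# Illustrative training pair (x=2.0, y=1.0), repeated for 50 steps.
w, b = 0.0, 0.0
losses = []
for _ in range(50):
    w, b, loss = train_step(w, b, x=2.0, y=1.0)
    losses.append(loss)
print(losses[0], losses[-1])
```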
As deep learning requires considerable hardware to execute these large matrix multiplications efficiently, this work uses the Graphics Processing Units (GPUs) of Google Colaboratory, a virtual processor from Google, to accelerate the models. A GPU can contain from 1 000 to 4 000 cores specialized in data processing; this high density of cores gives the GPU a high level of parallelism, which allows it to execute many computations at the same time.

Data preparation and LSTM model building

Dataset
The selected methods were implemented on a dataset of real power load. The dataset contains 30-minute metered real power data for Newcastle substations in the Ausgrid supply area from January 1st, 2014 to December 31st, 2016, in annual sets. The data is taken directly from the original Ausgrid zone substation dataset but is reformatted here to adhere to the consistent NEAR-WESCML data format for zone substation data. The standard provides a consistent view of zone substation data across distribution businesses like the TransGrid network.
TransGrid carries bulk electricity from generators through high voltage transmission lines, underground cables, and substations. The high voltage electricity is then converted to low voltage electricity suitable for household consumption at substations closer to power users. Distribution networks such as Ausgrid, Endeavour Energy, and Essential Energy deliver electricity through smaller poles and wires to more than 3 million homes and businesses throughout New South Wales (NSW) and the ACT. Load demand very often has a trend due to organic growth, which is closely related to gross domestic product (GDP) and economic development. For that reason, we keep only three years of data, which is not many samples, so that the data set remains as stationary as possible for better training and prediction.

Data preparation
Developing an LSTM is very similar to developing an RNN, but there are still several key differences that we need to be aware of.
The data for our sequence prediction problem needs to be scaled when training a neural network such as an LSTM. When a network is fit on unscaled data that has a range of values, large inputs can slow down the learning and convergence of the network and in some cases prevent the network from learning the problem effectively. There are two types of scaling for our series: normalization and standardization. In this paper we use normalization. Normalization is a rescaling of the data from the original range so that all values lie within the range of 0 to 1. It requires that we know, or can accurately estimate, the minimum and maximum observable values. We may be able to estimate these values from the available data. If the time series has an upward or downward trend, estimating these expected values may be difficult, and normalization may not be the best method to use.
A value is normalized as follows: $y = (x - \min) / (\max - \min)$. In Python, we can use the MinMaxScaler class from scikit-learn (sklearn) to perform the normalization. First, we fit the scaler on the available training data to estimate the minimum and maximum; we then use it to transform and normalize the data.
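A minimal sketch of min-max normalization in plain Python follows. The values are illustrative rather than taken from the Ausgrid data, and the helper names `fit_min_max`, `normalize`, and `denormalize` are hypothetical. The minimum and maximum are estimated from the training data only and then reused for the test data, which is the same fit-then-transform pattern that scikit-learn's MinMaxScaler follows.

```python
def fit_min_max(train):
    """Estimate the minimum and maximum from the training data only."""
    return min(train), max(train)


def normalize(values, lo, hi):
    """Rescale values with y = (x - min) / (max - min)."""
    return [(v - lo) / (hi - lo) for v in values]


def denormalize(scaled, lo, hi):
    """Invert the scaling to recover values in the original units."""
    return [s * (hi - lo) + lo for s in scaled]


train = [10.0, 20.0, 15.0, 30.0]  # illustrative load values, not real data
test = [25.0, 35.0]

lo, hi = fit_min_max(train)
train_scaled = normalize(train, lo, hi)
test_scaled = normalize(test, lo, hi)  # unseen values may fall outside [0, 1]
print(train_scaled, test_scaled)
```

The second test value exceeds the training maximum and is therefore mapped above 1, which shows why a trending series makes the minimum and maximum hard to estimate.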
The second commonly used method is standardization. Standardization of a data set consists of rescaling the distribution of values so that the mean of the observed values is 0 and the standard deviation is 1. This can be thought of as subtracting the mean value or centring the data.
Like normalization, standardization can be useful, and even necessary, in some machine learning algorithms when the data have input values with different scales.
Standardization assumes that our observations fit a Gaussian distribution (bell curve) with a well-behaved mean and standard deviation. We can still standardise our time series data if this expectation is not met, but we may not get reliable results.
Standardization requires that we know, or can accurately estimate, the mean and standard deviation of the observable values. We may be able to estimate these values from the training data.
A value is standardized as follows: $y = (x - \text{mean}) / \text{standard deviation}$.
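Standardization can be sketched in the same way, using only the Python standard library. The data is illustrative and the helper names `fit_standardizer` and `standardize` are hypothetical; the population standard deviation is used here, which is an assumption of this sketch.

```python
import statistics


def fit_standardizer(train):
    """Estimate the mean and (population) standard deviation from training data."""
    return statistics.mean(train), statistics.pstdev(train)


def standardize(values, mean, std):
    """Rescale values with y = (x - mean) / standard_deviation."""
    return [(v - mean) / std for v in values]


train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative values only
mean, std = fit_standardizer(train)
train_standardized = standardize(train, mean, std)
print(mean, std, train_standardized)
```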

Deep learning libraries assume a vectorized representation of our data and assume that the input sequences of all features have the same length. The input to the LSTM must have a three-dimensional form consisting of:
• Samples, which is usually the number of rows in the dataset.
• Time steps, which are the past observations of a feature.
• Features, which are the columns of the dataset.
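The three dimensions listed above can be illustrated with a sliding-window sketch in plain Python. The function name `make_windows`, the forecast horizon of 3, and the placeholder series are assumptions made for illustration; a deep learning library would expect the nested lists to be converted to an array of the same shape.

```python
def make_windows(series, n_steps, horizon=1):
    """Split a univariate series into LSTM inputs and targets.

    Returns X laid out as [samples][time steps][features] with one feature,
    and y holding the next `horizon` values after each window.
    """
    X, y = [], []
    for start in range(len(series) - n_steps - horizon + 1):
        window = series[start:start + n_steps]
        target = series[start + n_steps:start + n_steps + horizon]
        X.append([[value] for value in window])  # one feature per time step
        y.append(list(target))
    return X, y


hourly_load = [float(i) for i in range(30)]  # placeholder values, not real data
X, y = make_windows(hourly_load, n_steps=24, horizon=3)
shape = (len(X), len(X[0]), len(X[0][0]))
print(shape)  # (samples, time steps, features)
```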

Case study
For the three years of our study, we have a sample of 52 560 observations at 30-minute intervals. As we want to make forecasts for the hours ahead, we convert the data to an hourly interval, which leaves a sample of 26 280 observations of real power load. The experiment was conducted using an LSTM network to predict the load for the next few hours based on the last 24 hours. The results show that this is an easy task for the LSTM network architecture, which therefore gives low error ratios on the test dataset.
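The sample sizes quoted above follow from 3 years of 365 days with 48 half-hourly readings per day. The sketch below reproduces that arithmetic and shows one possible conversion to hourly data; averaging each pair of readings is an assumption made here, since the aggregation rule is not stated, and `to_hourly` is a hypothetical helper.

```python
def to_hourly(half_hourly):
    """Average each consecutive pair of 30-minute readings into one hourly value."""
    if len(half_hourly) % 2:
        raise ValueError("expected an even number of 30-minute readings")
    return [(a + b) / 2.0 for a, b in zip(half_hourly[0::2], half_hourly[1::2])]


readings_per_day = 48                       # one reading every 30 minutes
n_half_hourly = 3 * 365 * readings_per_day  # three years of 365 days
n_hourly = len(to_hourly([0.0] * n_half_hourly))
print(n_half_hourly, n_hourly)
```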

ARIMA
With the goal of comparing the performance of the LSTM neural network against a statistical forecasting method, in this section we apply the ARIMA methodology to the load series described in previous sections. The time series is stationary at the 95% level of confidence (source: own estimation with statsmodels). Figure 5 shows the autocorrelation and partial autocorrelation functions, ACF and PACF. The results suggest that the time series is AR(p).

Figure 5. ACF and PACF. Source: own estimation with statsmodels.
We use a function to determine the order, p, that minimizes the AIC. The function indicates that the best order is 29: BEST ORDER 29 BEST AIC: 16890.5838.
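The idea of selecting the order by AIC can be sketched with the Python standard library alone. This is not the statsmodels-based function used in this work, which is not reproduced here. It swaps in Yule-Walker estimates computed with the Levinson-Durbin recursion and the approximation AIC = n ln(sigma^2) + 2p, so its AIC values are on a different scale from those reported above. The function names and the simulated AR(2) series are hypothetical.

```python
import math
import random


def autocovariances(x, max_lag):
    """Biased sample autocovariances for lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    return [sum(d[t] * d[t - k] for t in range(k, n)) / n
            for k in range(max_lag + 1)]


def best_ar_order(x, max_order):
    """Return the (order, aic) pair that minimizes AIC over AR(1..max_order)."""
    n = len(x)
    r = autocovariances(x, max_order)
    phi = []       # AR coefficients of the current order
    sigma2 = r[0]  # innovation variance of the order-0 model
    best = None
    for p in range(1, max_order + 1):
        # Levinson-Durbin: reflection coefficient for order p.
        k = (r[p] - sum(phi[j] * r[p - 1 - j] for j in range(p - 1))) / sigma2
        phi = [phi[j] - k * phi[p - 2 - j] for j in range(p - 1)] + [k]
        sigma2 *= 1.0 - k * k
        aic = n * math.log(sigma2) + 2 * p
        if best is None or aic < best[1]:
            best = (p, aic)
    return best


# Simulated AR(2) series for illustration only (not the load data).
rng = random.Random(0)
series = [0.0, 0.0]
for _ in range(2000):
    series.append(0.6 * series[-1] - 0.3 * series[-2] + rng.gauss(0.0, 1.0))

order, aic = best_ar_order(series, max_order=10)
print("BEST ORDER", order, "BEST AIC:", round(aic, 4))
```

Because the recursion reuses the coefficients of order p - 1 to obtain order p, every candidate order is evaluated in a single pass, which is much cheaper than refitting a full model for each order.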
We used 90% of the data to train the model. The complete model is included in the Annex. It is worth noting that the function is very time consuming. Figure 6 shows the actuals and fitted values, and the 95% confidence interval.

Figure 6. Actuals, fitted and 95% confidence interval. Source: own estimation with statsmodels.

The out-of-sample MSE is 0.3547, which exceeds the MSEs attained by the LSTM neural networks.
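For clarity, the two error measures used in this paper, MAE and MSE, can be computed as in the following sketch. The actual and predicted values are illustrative placeholders, not results from the models, and the function names are hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


actual = [0.50, 0.60, 0.55, 0.70]     # illustrative values only
predicted = [0.52, 0.58, 0.55, 0.66]
mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
print(mae, mse)
```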

Conclusions
The city of Newcastle, like many cities around the world, is constantly growing in terms of industry, homes, and businesses, and it is these elements that consume all the electrical energy produced in this locality. With this upward shift, the electricity consumption needs of this society must be met in any case.
To solve this electricity demand problem, in this work an LSTM model was applied to the electricity load demand data at the Newcastle substation to forecast the demand in the coming hours.