
Analyzing Electricity Consumption Behavior of Households Using Big Data Analytics

Info: 9722 words (39 pages) Dissertation
Published: 9th Dec 2019






With the spread of smart sub-meters, competitive electricity retailers increasingly analyze customer data to understand consumption behavior. Large volumes of sub-meter data give electricity boards the opportunity to learn more about their customers' electricity consumption behaviors. Many analytical solutions report the electricity consumption of a household, but they do not describe the behavior behind that consumption. In this paper, we propose an approach for analyzing the electricity consumption behavior of households using big data analytics, which relates electricity consumption to the type of usage (laundry, kitchen, heaters, air conditioning, etc.). We use big data tools (Hadoop and MapReduce) to find the consumption behavior by implementing two levels of clustering and then fitting the best models to describe the consumption behavior of households in various situations and areas. A data mining technique (k-means) that combines electricity consumption with external factors is applied to obtain the typical dynamics of consumption behavior, based on the difference between any two consumption patterns. To tackle the challenges of big data, MapReduce programming techniques are integrated into a divide-and-conquer approach. The proposed methods identify consumption behaviors by time, type of usage, and season using large data sets.

Keywords: Analyzing Households Power, Load Profiling, Big Data, Electricity Consumption, Behavior Dynamics, Distributed Clustering, Demand Response.



Nations around the world have set aggressive targets to move monopolistic power systems toward liberalized markets, especially on the demand side. In a real-time electricity supply market, electricity load serving entities (ELSEs) are being developed in great numbers [1]. A better understanding of electricity consumption patterns and personalized power management are effective ways to enhance the competitiveness of ELSEs [2]. Meanwhile, smart grids have been revolutionizing electrical generation and consumption through a two-way flow of power and information. As an important information source on the demand side, advanced metering infrastructure (AMI) has gained increasing popularity worldwide; AMI allows ELSEs to obtain electricity consumption data at high frequency, e.g., minutes to hours [3]. Large volumes of electricity consumption data reveal information about customers that ELSEs can potentially use to manage their generation and demand resources efficiently and to provide personalized service.

Load profiling, which refers to the electricity consumption behaviors of customers over a specific period, e.g., one day, summer, or winter, can help ELSEs understand how electricity is actually used by different customers and obtain the customers' load profiles or load patterns. Load profiling plays an important role in Time of Use (ToU) tariff design [4], nodal or customer scale load forecasting [5], demand response and energy efficiency targeting [6], and non-technical loss (NTL) detection [7].

The core of load profiling is data clustering, which can be classified into two categories: direct clustering and indirect clustering [8]. Direct clustering means that clustering methods are applied directly to load data. A large number of clustering techniques have been widely studied, including k-means [9], fuzzy k-means [10], hierarchical clustering [11], self-organizing maps (SOM) [12], support vector clustering [13], subspace clustering [14], and ant colony clustering [15]. The performance of each clustering technique can be evaluated and quantified using various criteria, including the clustering dispersion indicator (CDI), the scatter index (SI), the Davies-Bouldin index (DBI), and the mean index adequacy (MIA) [16].

The deluge of electricity consumption data that comes with the widespread, high-frequency collection by smart meters introduces great challenges for data storage, communication and analysis. In this context, dimension reduction methods can be effectively applied to reduce the size of the load data before clustering, which is defined as indirect clustering. Such clustering can be categorized into two sub-categories: feature extraction-based clustering and time series-based clustering. Feature extraction, which transforms the data in a high-dimensional space into a space of fewer dimensions [17], is often used to reduce the scale of the input data. Principal component analysis (PCA) [18], [19] is a frequently used linear dimension reduction method. It tries to retain most of the covariance of the data features with the fewest artificial variables. Some nonlinear dimension reduction methods, including maps, curvilinear component analysis (CCA) [20], and deep learning [21], have also been applied to electricity consumption data. Moreover, as electricity consumption data are essentially time series, a variety of mature analytical methods such as the discrete Fourier transform (DFT) [22], [23], the discrete wavelet transform (DWT) [24], symbolic aggregate approximation (SAX) [25], and the hidden Markov model (HMM) [26] have been discussed in the literature. These methods are capable of reducing the dimensionality of time series while maintaining some of the original character of the electricity consumption data.

The existing studies on load profiling mainly focus on individual large industrial or commercial customers, medium or low voltage feeders, or combinations of small customers, whose load profiles show much more regularity [25].

It should be noted that although these dynamic characteristics are always "deluged" in a combination of customers, they could be described by several typical load patterns. However, with regard to residential customers, at least two new challenges will be faced. One challenge is the high variety and variability of the load patterns. As indicated by Fig. 1, there are clear differences in the electricity consumption patterns of the two residents. Peak loads have different amplitudes and occur at different times of day, for example. Electricity consumption patterns also vary on a daily basis even for the same customer. In this case, several typical daily load patterns are not fine enough to reveal the actual consumption behaviors. The daily profile should be decomposed into more fine-grained fragments, which change dynamically and must be identified. Moreover, as the consumption behavior of a specific customer is essentially a state-dependent, stochastic process, it is important to explore the dynamic characteristics, e.g., switching and maintaining, of the consumption states and the corresponding probabilities. The other challenge is that of "big data". Considering the high frequency and dimensionality of the data contained in the load curves, data sets in the multi-petabyte range will be analyzed [27]. Traditional clustering techniques are difficult to execute in a "big data world".

To tackle these two challenges, this paper implements a time-based Markov model to formulate the dynamics of customers' electricity consumption behaviors, considering the state-dependent characteristics, which indicate that future consumption behaviors are related to the current states. This assumption is reasonable, as various electricity consumption behaviors last for different periods of time before they can change, as can be abstracted from historical performance. The transitions and relations between consumption behaviors, or rather consumption levels, in adjacent periods are referred to as "dynamics" in this paper. These dynamics have been modeled by Markov models in several works [23]. However, few papers consider the dynamics as a factor for clustering. Profiling of the dynamics could provide useful information for understanding the consumption patterns of customers, forecasting the consumption trends in short time periods, and identifying potential demand response targets. Moreover, this approach formulates the large data set of load curves as several state transition matrices, greatly reducing the dimensionality and scale.

Below are the attributes used for finding the various consumer behaviors.

Data Set Attributes:

  1. date: Date in format dd/mm/yyyy
  2. time: time in format hh:mm:ss
  3. global_active_power: household global minute-averaged active power (in kilowatt)
  4. global_reactive_power: household global minute-averaged reactive power (in kilowatt)
  5. voltage: minute-averaged voltage (in volt)
  6. global_intensity: household global minute-averaged current intensity (in ampere)
  7. sub_metering_1: energy sub-metering No. 1 (in watt-hour of active energy). It corresponds to the kitchen, containing mainly a dishwasher, an oven and a microwave (hot plates are not electric but gas powered).
  8. sub_metering_2: energy sub-metering No. 2 (in watt-hour of active energy). It corresponds to the laundry room, containing a washing-machine, a tumble-drier, a refrigerator and a light.
  9. sub_metering_3: energy sub-metering No. 3 (in watt-hour of active energy). It corresponds to an electric water-heater and an air-conditioner.

Below is the sample data used in this paper, from which many of the major features are extracted to find the electricity consumption behavior.

Fig:1.Daily electricity load profiles of various meters.

In addition to the Markov model, this paper tries to address the "data deluge" issue in three other ways. First, SAX is applied to transform the load curves into symbolic strings, which reduces the storage space and eases the communication traffic between smart meters and data centers. Second, a recently reported effective clustering technique, Fast Search and Find of Density Peaks (FSFDP), is used for the first time to profile electricity consumption behaviors; it has the advantages of low time complexity and robustness to noise points. The dynamics of electricity consumption are described by the differences between every two consumption patterns, as measured by the Kullback–Leibler (K-L) distance. Third, to tackle the challenges of big and dispersed data, the FSFDP technique is integrated into a divide-and-conquer approach to further improve the efficiency of data processing, where adaptive k-means is applied to obtain the representative customers at the local sites and a modified FSFDP method is performed at the global site. The approach could be further applied toward big data applications. Finally, the potential applications of the proposed method to demand response targeting, abnormal consumption behavior detection and load forecasting are analyzed and discussed. In particular, entropy analysis is conducted based on the clustering results to evaluate the variability of consumption behavior for each cluster, which can be used to quantify the potential of price-based and incentive-based demand response.

The contributions of this paper are as follows:

  1. A time-based Markov model is applied to formulate the electricity consumption behavior dynamics instead of the shape of daily load profiles;
  2. Customer segmentation is performed by a highly efficient clustering algorithm named FSFDP, which is robust to noise and needs no iterations;
  3. A distributed clustering framework combining adaptive k-means and FSFDP is proposed to tackle large and distributed data sets;
  4. The applications of the proposed modelling method and profiling algorithm are analyzed and discussed.

Fig:2.Clustering of electricity consumption behaviour dynamics processes.

The rest of the paper is organized as follows. In Section II, the basic methodology for clustering electricity consumption behavior dynamics is introduced. In Section III, a divide-and-conquer distributed clustering algorithm for big data sets is proposed. In Section IV, case studies and some analysis for demand response targeting and distributed clustering are conducted based on public data from Ireland. The potential applications of the proposed method and algorithm are discussed in Section V. Finally, the conclusions are drawn in Section VI.



The proposed methodology for discovering the dynamics of electricity consumption can be divided into six stages, as shown in Fig. 2. The first stage conducts load data preparation, including data cleaning and load curve normalization. The second stage reduces the dimensionality of the load profiles using SAX. The third stage formulates the electricity consumption dynamics of each individual customer using a time-based Markov model. In the fourth stage, the K-L distance is applied to measure the difference between any two Markov models to obtain the distance matrix. The fifth stage performs a modified FSFDP clustering algorithm to discover the typical dynamics of electricity consumption. Finally, the results of the demand response targeting analysis are obtained in the sixth stage. The details of the first five stages are introduced in the following, and the demand response targeting analysis is further explained in the case studies.

A. Data Normalization

Data preparation, including data cleaning, is not the subject of this paper and will not be discussed. To make the load profiles comparable, the normalization process transforms the consumption data of arbitrary value $x = \{x_1, x_2, \ldots, x_H\}$ to the range [0, 1], as shown in (1).

$$x_i' = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}} \tag{1}$$

where $x_i$ and $x_i'$ denote the actual and normalized electricity consumption at time i, and $x_{\min}$ and $x_{\max}$ denote the minimum and maximum consumption over the H periods, respectively.

It should be noted that the normalization is performed on a daily basis instead of over entire periods. This strategy is chosen for at least three reasons. First, it can weaken the impact of anomalous days with critical peaks or bad data injections. Second, it can provide load shapes with little effect from daily or seasonal changes in the maximum values. Third, it can filter out the base load, which has little effect on demand response and reserve, in favor of the fluctuant part, which shows greater potential in demand response.
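The daily min-max normalization described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation; the function names, the `readings_per_day` parameter, and the handling of a completely flat day are assumptions.

```python
def normalize_daily(day_loads):
    """Min-max normalize one day's readings to the range [0, 1]."""
    lo, hi = min(day_loads), max(day_loads)
    if hi == lo:
        # Assumption: a flat day carries no shape, so map every reading to 0.0.
        return [0.0 for _ in day_loads]
    return [(x - lo) / (hi - lo) for x in day_loads]


def normalize_by_day(loads, readings_per_day):
    """Normalize a long series one day at a time, not over the entire period."""
    out = []
    for start in range(0, len(loads), readings_per_day):
        out.extend(normalize_daily(loads[start:start + readings_per_day]))
    return out
```

Because each day is scaled by its own minimum and maximum, a single anomalous peak only distorts the day in which it occurs.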

B. SAX for Load Curves

SAX is a powerful technique for the dimensional reduction and representation of time series data with lower bounding of the Euclidean distance [30]. SAX discretizes numeric time series into symbolic strings by two steps: transforming the load data into a piecewise aggregate approximation (PAA) representation and then symbolizing the PAA representation into a discrete string.

The basic idea of PAA is intuitive and simple: the amplitude values falling in the same time interval are replaced with their mean value, as shown in (2).

$$\bar{x}_i = \frac{1}{k_i - k_{i-1}} \sum_{j=k_{i-1}+1}^{k_i} x_j' \tag{2}$$

where $x_j'$ is the normalized load and j is its index; i is the index of the transformed PAA load data; $k_i$ is the ith time domain breakpoint (with $k_0 = 0$); and $\bar{x}_i$ is the average value of the ith segment [31].

The averaging of the PAA can smooth out large, short-duration "spikes" in load profiles. It has been proven that PAA has all the pruning power of the Haar-based DWT, and it can be defined for queries of arbitrary length at lower computation cost. The transformed PAA time series data are then passed to the SAX algorithm to obtain a discrete symbolic representation. The amplitude axis is partitioned into N intervals, and each symbol $w_p$ corresponds to an amplitude range $[\beta_{p-1}, \beta_p)$. On this basis, the mapping from a PAA approximation $\bar{x}_i$ to a word $w_p$ is obtained as follows:

$$\alpha_i = w_p \quad \text{if } \beta_{p-1} \le \bar{x}_i < \beta_p \tag{3}$$

Fig:3.Electricity consumption data of customers in various day periods over one week and its SAX representation.

Hence, the load curves can be represented by a symbolic string α. For example, Fig. 3 shows the normalized electricity consumption data collected from customers over one week (168 hours) at a sampling interval of 60 seconds. The time axis is divided into five periods each day, labeled "early morning", "morning", "noon", "evening" and "night". These data can be represented with five symbols and a total of 35 periods. For traditional SAX, the time domain is divided into regular intervals, and inside each interval, the average of the amplitude values is calculated.
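The two SAX steps (PAA averaging, then symbol mapping) can be sketched as below. This is an illustrative sketch under stated assumptions: `time_breaks` holds the segment end indices, `amp_breaks` holds only the interior amplitude breakpoints, and each amplitude interval is treated as closed on the left and open on the right. None of these names come from the paper.

```python
from bisect import bisect_right


def paa(series, time_breaks):
    """Replace the values in each time segment with their mean.

    time_breaks lists the end index of every segment, so segments may have
    unequal lengths (non-regular intervals).
    """
    means, start = [], 0
    for end in time_breaks:
        segment = series[start:end]
        means.append(sum(segment) / len(segment))
        start = end
    return means


def symbolize(paa_values, amp_breaks, alphabet="abcdefghij"):
    """Map each PAA value to the symbol of the amplitude interval it falls in."""
    # bisect_right counts the breakpoints <= value, which is the interval index.
    return "".join(alphabet[bisect_right(amp_breaks, v)] for v in paa_values)
```

For instance, `symbolize(paa(day, [6, 9, 17, 24]), [1/3, 2/3], "abc")` would turn 24 hourly readings into a four-symbol string over three consumption levels.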

The main concern of SAX is the determination of the time domain breakpoint ki and the amplitude breakpoint βp. Generally, the time domain is partitioned uniformly, and the amplitude axis is partitioned based on the normal distribution hypothesis [25], [40].

To give the breakpoints a clear physical meaning, this paper adopts non-regular intervals on both the time domain and the amplitude axis. Specifically, the time domain breakpoint ki is determined by taking both the implementation of the ToU tariff and the regular routine of customers into consideration. For example, the time domain can be divided into three intervals according to the definition of peak, flat, and valley periods, within each of which the electricity price is constant under the ToU tariff. As another example, four time periods, named the overnight period, breakfast period, daytime period, and evening period, can be chosen approximately from statistics.

Each word transformed from normalized load profiles by SAX corresponds to a discrete state of the Markov model in the next step. In general, the states should be equally probable for an optimal use of the Markov models. Thus, the amplitude breakpoint βp is determined by the quantiles of the statistical distribution of the amplitudes in the whole data set. For example, if we want to simplify the electricity consumption to three states, the consumption levels correspond to 33.33% and 66.67% of the cumulative distribution function (CDF).
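A minimal sketch of choosing amplitude breakpoints from empirical quantiles, so that the resulting states are roughly equally probable. The function name and the nearest-rank quantile rule are assumptions; other interpolation rules would serve equally well.

```python
def quantile_breakpoints(values, n_states):
    """Return n_states - 1 breakpoints at equally spaced empirical quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    # Nearest-rank rule (assumption): take the value at position p * n / n_states.
    return [ordered[min(n - 1, (p * n) // n_states)] for p in range(1, n_states)]
```

With three states this returns the values near the 33.33% and 66.67% points of the CDF, which can then be used as the interior amplitude breakpoints.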

Fig:4.Power consumption behavior per hour in a day

Another question is how many states are needed in the Markov model. The dissimilarity between the symbolic string produced by SAX and the original load profile is gradually reduced by increasing the number of states [40]. However, too many states may result in a large and sparse transition probability matrix, which may render the transition probability matrix meaningless and cause the "curse of dimensionality" in the clustering step. Thus, the number of states is a tradeoff between the information loss caused by SAX and the size of the transition probability matrix.

Fig:5.Analyzing Electricity Consumption Behavior of Households Using BIGDATA Analytics


C. Time-Based Markov Model

If we want to predict the trend or level of electricity consumption for each customer, we may make full use of their past and present states. If the future consumption level or state depends only on the present state, it is called a Markov property and can be modeled by a Markov chain. Various Markov models have been applied to load forecasting.

For a symbolic string with N symbols, a discrete Markov model with N corresponding states can be applied to model the dynamic characteristics of the consumption levels. However, customers have different dynamic characteristics at different periods because of their regular daily routines. Therefore, a time-based Markov model is applied to formulate these characteristics. For each pair of adjacent periods, a Markov chain can be modeled. Then, the one-step transition number matrix $F^t$ at period t can be calculated. From $F^t$, the transition probability matrix $P^t$ at period t can be further estimated according to (4).

$$\hat{p}^t_{ij} = \frac{f^t_{ij}}{\sum_{m=1}^{N} f^t_{im}} \tag{4}$$

where $f^t_{ij}$ is the entry of $F^t$ counting the one-step transitions from state i to state j at period t, and $\hat{p}^t_{ij}$ denotes the estimated one-step transition probability from state i to state j at period t.

It has been proven that $\hat{P}^t$ is an unbiased estimate of the transition probability matrix $P^t$.
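The estimation of time-based transition matrices from symbolic strings can be sketched as follows. This is a hedged illustration: states are assumed to be encoded as integers, each day is one sequence of states, transitions across midnight are ignored, and a state never visited at a period receives a uniform row. These choices are assumptions, not details taken from the paper.

```python
def transition_matrices(days, n_states):
    """Estimate one transition probability matrix per pair of adjacent periods.

    days: list of per-day state sequences (ints in 0..n_states-1, equal length).
    Returns a list of row-stochastic matrices, where entry [i][j] of matrix t
    estimates the probability of moving from state i at period t to state j
    at period t + 1.
    """
    n_periods = len(days[0]) - 1
    counts = [[[0] * n_states for _ in range(n_states)] for _ in range(n_periods)]
    for seq in days:
        for t in range(n_periods):
            counts[t][seq[t]][seq[t + 1]] += 1  # one-step transition count
    probs = []
    for f_t in counts:
        p_t = []
        for row in f_t:
            total = sum(row)
            if total:
                p_t.append([c / total for c in row])
            else:
                # Assumption: unvisited state, fall back to a uniform row.
                p_t.append([1.0 / n_states] * n_states)
        probs.append(p_t)
    return probs
```

Each row of every returned matrix sums to one, so it can be treated as a probability distribution over the next state.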

In the following, a test of the Markov property of electricity consumption should be conducted to validate the Markov hypothesis. Following the test theorem of the Markov property proposed in [37], a test statistic $\chi^2$ is computed from the one-step transition counts.

Given a significance level α, if $\chi^2 > \chi^2_\alpha((N-1)^2)$ holds, we can be reasonably confident that the electricity consumption of customers has a Markov property.

D. Distance Calculation

The dissimilarity/distance measurement is a fundamental problem in clustering. There exist many ways to compute the distance between two matrices, such as the 1-norm distance and the 2-norm distance (Euclidean distance). However, unlike a general matrix, an N×N state transition probability matrix essentially consists of N probability distributions, where each row (e.g., the ith row) corresponds to a probability distribution of the state of the next period given the current state (e.g., the ith state). The K-L distance is an effective way to quantify the dissimilarity between two probability distributions [29]. Thus, to discriminate between two Markov models with the state transition matrices $P^t_i$ and $P^t_j$, the K-L distance is defined as [43]

Note that $KLD(P^t_i, P^t_j) = KLD(P^t_j, P^t_i)$ is not guaranteed to hold; that is to say, the K-L distance is asymmetric. For the convenience of clustering, we define the symmetric K-L distance $D^t_{ij}$ of two Markov models at period t as

Each customer is modeled by T Markov models for the T periods of the day. We further extend the K-L distance to T periods as follows:
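One plausible reading of the distance described above can be sketched in Python. The exact formulas are not reproduced here, so every choice below is an assumption: the divergence of two transition matrices is taken as the sum of row-wise K-L divergences, the symmetric form is the average of the two directions, the T-period distance is the sum over periods, and a small `eps` avoids taking the logarithm of zero.

```python
import math


def kl_matrix(p, q, eps=1e-9):
    """Sum of row-wise K-L divergences between two row-stochastic matrices."""
    total = 0.0
    for p_row, q_row in zip(p, q):
        for a, b in zip(p_row, q_row):
            a, b = a + eps, b + eps  # smoothing (assumption) to avoid log(0)
            total += a * math.log(a / b)
    return total


def symmetric_kl(p, q):
    """Average of the two directed divergences, so d(p, q) == d(q, p)."""
    return 0.5 * (kl_matrix(p, q) + kl_matrix(q, p))


def customer_distance(models_a, models_b):
    """Distance between two customers, each given as a list of T matrices."""
    return sum(symmetric_kl(p, q) for p, q in zip(models_a, models_b))
```

Applying `customer_distance` to every pair of customers would yield the dissimilarity matrix used by the clustering step.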

The dissimilarity matrix is derived by calculating the K-L distance among all customers according to (8).

E. FSFDP Algorithm

FSFDP is a recently reported clustering algorithm that can effectively recognize clusters regardless of their shape, under the reasonable assumption that cluster centers have a higher local density and a relatively larger distance to the points of higher density. For a data set, the neighbors can be recognized by a soft threshold, such as the Gaussian kernel function, or by a hard threshold as defined in (9). To reduce the computation complexity for big data sets, we employ the hard threshold to calculate the local density

$$\rho_i = \sum_{j} \chi(D_{ij} - d_c) \tag{9}$$

where $\chi(x) = 1$ if $x < 0$ and $\chi(x) = 0$ otherwise; $D_{ij}$ is the dissimilarity/distance between objects i and j; and $d_c$ is the cutoff distance chosen by the principle proposed in [28] or by experience. The minimum distance $\delta_i$ between object i and any other object j of higher density is calculated as follows:

$$\delta_i = \min_{j:\, \rho_j > \rho_i} D_{ij} \tag{10}$$

For the point with the highest local density, the minimum distance is set to $\delta_i = \max_j(D_{ij})$. Thus, an object with a much larger $\delta_i$ has the maximum density in a local or global area. Hence, each object or point has two important quantities: the local density $\rho_i$ and the distance $\delta_i$. We can plot all the points $A_i(\rho_i, \delta_i)$ on a two-dimensional plane, which is called the decision graph. The points with a higher local density and a larger distance than the thresholds $(\rho_0, \delta_0)$ can be identified as density peaks or cluster centers. After these density peaks are found, each remaining point is assigned to the same cluster as its nearest neighbor of higher density.

The proposed clustering method has the following advantages, which is why we adopt it in our study. First, FSFDP is elegant and simple: it needs few parameters, has low time complexity, and has shown high performance in classifying several data sets. After the density peaks are found, the assignment of each object can be performed in a single step without iteration, in contrast with many other clustering methods such as k-means. Second, as a density-based clustering technique, FSFDP can effectively detect non-spherically distributed data and is robust to noise points, which is verified in our case studies. Third, the distribution of objects on the decision graph reveals much information. For example, it is easy to detect outliers or bad data injections, which have a small ρi and a large δi, and to find the objects around the edge of a cluster, which have both a small ρi and a small δi. The number of clusters can be adjusted flexibly according to the distribution of objects by setting different thresholds for ρi and δi.
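The density-peak procedure with the hard threshold can be sketched as follows. This is a minimal sketch rather than the authors' code: it takes a precomputed distance matrix, breaks density ties by index, and uses the illustrative parameter names `rho_min` and `delta_min` for the decision-graph thresholds. It also assumes the densest object passes both thresholds, so every object ends up with a label.

```python
def fsfdp(dist, d_c, rho_min, delta_min):
    """Cluster objects from a symmetric distance matrix (list of lists).

    Returns (rho, delta, labels).
    """
    n = len(dist)
    # Local density with the hard threshold: neighbours closer than d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    # Visit objects from dense to sparse; ties are broken by index (assumption).
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n
    parent = [-1] * n  # nearest neighbour of higher density
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])  # densest object: largest distance
            continue
        j = min(order[:rank], key=lambda h: dist[i][h])
        delta[i], parent[i] = dist[i][j], j
    # Density peaks: both quantities above their thresholds.
    centers = [i for i in order if rho[i] > rho_min and delta[i] > delta_min]
    labels = [-1] * n
    for c, i in enumerate(centers):
        labels[i] = c
    # Single pass, dense to sparse: each object inherits its parent's label.
    for i in order:
        if labels[i] == -1 and parent[i] != -1:
            labels[i] = labels[parent[i]]
    return rho, delta, labels
```

The assignment loop runs once over the objects, which reflects the "single step without iteration" property noted in the text; the quadratic cost lies in the distance matrix and the density computation.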


The rapid growth of electricity consumption data for population-level customers challenges the storage, communication and analysis of the data. Although SAX and the time-based Markov model largely reduce the dimensionality of the load profiles, a centralized clustering technique is not effective for big data. On one hand, the electricity consumption data are collected and distributed on different sites: the data of customers are collected and stored at the substations they belong to. It is costly and time consuming to transmit the whole data set from each distributed site to a central site. On the other hand, the analysis and clustering of large data sets gathered from each distributed site incur a very large time and memory overhead. When applying the FSFDP, the dissimilarity matrix of all the customers should first be obtained, which accounts for most of the computation time. Both the time and space complexity of the FSFDP are $O(N^2)$. In fact, there exist many works on parallel clustering for big data applications. For these algorithms, the whole data set should reside in the same data center and then be distributed to different clients, as in MapReduce on Hadoop. This does not match the practical situation of electricity consumption data collection and storage. Besides, some fully distributed clustering algorithms have also been proposed [34] to tackle these challenges by aggregating the information of local data and then sending it to a central site for central analysis. However, these algorithms do not have the advantages of FSFDP. Thus, this section designs a fully distributed, rather than parallel, clustering algorithm to ease the communication and computation burden while retaining the advantages of the FSFDP through a divide-and-conquer framework.

A. Framework


In the divide-and-conquer framework for distributed clustering, Li denotes the original data on the ith distributed local site; Mi denotes the representative objects selected from the ith distributed local site; and R denotes the global clustering results. Each object corresponds to a customer described by transition probability matrices. The proposed algorithm consists of three steps:

Fig:6.Steps implemented for K-means

Step 1: The SAX and time-based Markov model for individual customers are handled separately. Divide the big data set into k parts, each marked as Li. Note that the data on one distributed site can be further partitioned to make the size of the data sets on each site more even.

Step 2: An adaptive k-means method is performed for each individual part to obtain a certain number of cluster centers. Each cluster center can represent all the objects belonging to this cluster with a small error. All these cluster centers of Li are selected as the representative objects Mi, which are defined as a local model.

Step 3: A modified FSFDP method is applied to all the representative objects (local models) that are centralized and gathered to classify them into several groups R, which are defined as a global model. Then, according to the final clustering result, the cluster label of each local site would be updated.

It is worth noting that the adaptive k-means and the modified FSFDP are not interchangeable in Step 2 and Step 3. K-means, as a partitioning-based clustering algorithm, tries to minimize the within-class distance of all the clusters, which is consistent with the objective of Step 2, i.e., selecting the objects that can represent the remaining objects around them for each individual part. The modified FSFDP applied in Step 3, in turn, brings its advantages to the global clustering. The adaptive k-means method and the modified FSFDP method are described in detail in the next two parts.

B. Local Modeling-Adaptive k-Means


A set of clustering centers will be obtained by k-means, where the sum of the squared distances between each object and its assigned center is minimized. These centroids can be used as a "code book": each object can be represented by the corresponding centroid with the least error. This is called vector quantization (VQ). We try to establish a local model by finding the "code book" that guarantees that the distortion of each object by VQ satisfies the threshold condition according to (11)

where $C_k$ denotes the kth centroid, and θ denotes the distortion threshold.

Traditional k-means needs a given number of centers, which makes it difficult to guarantee that (11) holds. In this paper, an adaptive k-means is adopted to dynamically adjust the number of centers following a simple rule: if an object violates the threshold condition, 2-means (i.e., k-means for k=2) will be applied to partition this cluster further and add a new center to the “code book” [35].
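The splitting rule can be sketched in plain Python as below. This is a simplified illustration under several assumptions: objects are flattened to numeric tuples and compared with squared Euclidean distance, the distortion of an object is its squared distance to its cluster mean, the process starts from a single cluster, and only the violating cluster is re-partitioned. The names `theta` and `k_max` mirror θ and Kmax in the text; everything else is hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    return tuple(sum(c) / len(points) for c in zip(*points))


def two_means(points, iters=20):
    """Split one cluster with 2-means; expects at least two distinct points."""
    center = mean(points)
    # Deterministic seeding (assumption): two mutually distant members.
    c0 = max(points, key=lambda p: sq_dist(p, center))
    c1 = max(points, key=lambda p: sq_dist(p, c0))
    g0, g1 = [c0], [c1]
    for _ in range(iters):
        n0 = [p for p in points if sq_dist(p, c0) <= sq_dist(p, c1)]
        n1 = [p for p in points if sq_dist(p, c0) > sq_dist(p, c1)]
        if not n0 or not n1:
            break
        g0, g1 = n0, n1
        c0, c1 = mean(g0), mean(g1)
    return g0, g1


def adaptive_kmeans(points, theta, k_max):
    """Grow the code book until every object's distortion is within theta."""
    clusters = [list(points)]
    while len(clusters) < k_max:
        violating = None
        for idx, members in enumerate(clusters):
            c = mean(members)
            if max(sq_dist(p, c) for p in members) > theta:
                violating = idx
                break
        if violating is None:
            break  # threshold condition holds for every object
        clusters.extend(two_means(clusters.pop(violating)))
    # Code book entries: (centroid, number of objects it represents).
    return [(mean(m), len(m)) for m in clusters]
```

Returning the member count with each centroid keeps the information a later global step would need to weight the representatives.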


Fig.:7. Adaptive k-means for local modeling based on threshold.

Fig.:8.Decision graph to find density peaks for full periods.

Fig. 7 shows the detailed procedure of the adaptive k-means method. The distortion threshold θ varies depending on different needs. A smaller threshold corresponds to higher clustering accuracy and a larger number of local representative objects, and vice versa. As a supplement to the distortion threshold and as another terminating condition of the iteration, the parameters Kmin and Kmax are given to limit the size of the "code book". The values of Kmin and Kmax can be determined according to the data transmission limits. In particular, if ensuring a certain precision is the priority, the adaptive k-means can start from Kmin = 2 and continue until (11) holds by setting the value of Kmax to positive infinity. The proposed adaptive k-means differs from traditional k-means in at least four aspects:

First, for adaptive k-means, the number of clusters adjusts dynamically depending on whether the distortion threshold condition is satisfied, in contrast to traditional k-means, where it should be pre-determined.

Fig:9.2-D plane mapping for full periods of customers according to their usage type.

Second, the convergence condition of adaptive k-means is given by (11) and Kmax, whereas traditional k-means iterates until the sum of the squared distances between each object and its center no longer decreases.

Third, the proposed algorithm is capable of retaining the information of outliers on each site because these outliers will become separate clusters.

Fourth, this algorithm applies 2-means to the violating cluster separately. Thus, it has a small computational burden and the potential for parallel computation, which makes it applicable to large data sets. In contrast, traditional k-means is conducted on the whole data set.

C. Global Modeling-Modified FSFDP

The original FSFDP algorithm considers the clustered objects equally. However, in a two-level clustering framework, the selected representative models from different local sites might represent “samples” of different populations. It would be reasonable to consider the representativeness of the local models in the centralized clustering. Thus, a modified FSFDP method is proposed, which introduces a weight factor to differentiate the representativeness of the local models. Without loss of generality, the weight factor, Cj, is added to the local density calculation

$$\rho_i = \sum_{j} C_j \, \chi(D_{ij} - d_c) \tag{12}$$

where χ, Dij and dc are defined as in (9), and Cj refers to the weight of the representative points of each cluster in Mi, which is equal to the number of objects belonging to the cluster. The calculation of δi is the same as (10). Similarly, on the basis of the calculated ρi and δi, a decision graph can be drawn to find the density peaks that have a higher local density ρi and a larger distance δi as cluster centers. After the determination of the cluster centers, each of the other objects is assigned to the same cluster as its nearest neighbor of higher density.

Now that each representative object from the distributed sites has its own cluster label, the objects on the distributed sites are relabeled according to the cluster label of their representative object. If two centroids end up in the same cluster, then all their objects belong to the same cluster.
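The relabeling step amounts to a lookup through the representative object. A minimal sketch, with hypothetical identifiers for customers, local representatives, and global labels:

```python
def relabel_objects(object_to_rep, rep_to_global):
    """Give every local object the global cluster label of its representative."""
    return {obj: rep_to_global[rep] for obj, rep in object_to_rep.items()}


# Hypothetical example: two local representatives from different sites
# were placed in the same global cluster "G1".
object_to_rep = {"cust1": "site1_c0", "cust2": "site1_c0", "cust3": "site2_c4"}
rep_to_global = {"site1_c0": "G1", "site2_c4": "G1"}
labels = relabel_objects(object_to_rep, rep_to_global)
```

Here all three customers receive the label of the shared global cluster, even though they were summarized by different local models.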



A. Description of the Data Set

The data used in this research contain 2,075,259 (approximately 2 million) measurement records gathered between December 2006 and November 2010 (47 months).

The measurements are gathered from households that use different meters in different places. The meters used in this research are described below.

Electrical smart meters:

1) sub_metering_1: It corresponds to the kitchen, containing mainly a dishwasher, an oven and a microwave (hot plates are not electric but gas powered).

2) sub_metering_2: It corresponds to the laundry room, containing a washing machine, a tumble drier, a refrigerator and a light.

3) sub_metering_3: It corresponds to an electric water heater and an air conditioner.

Measurement description:

1. (global_active_power*1000/60 - sub_metering_1 - sub_metering_2 - sub_metering_3) represents the active energy consumed every minute (in watt-hours) in the household by electrical equipment not measured in sub-meterings 1, 2 and 3.

2. The dataset contains some missing values in the measurements (nearly 1.25% of the rows). All calendar timestamps are present in the dataset, but for some timestamps the measurement values are missing: a missing value is represented by the absence of a value between two consecutive semicolon attribute separators. For instance, the dataset shows missing values on April 28, 2007.
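The two points above can be sketched in code as follows. This is a minimal example, assuming a semicolon-separated text with a header row that uses the attribute names of this section; the function names are hypothetical, and treating "?" as a missing value in addition to an empty field is a defensive assumption.

```python
import csv
import io

POWER_KEYS = ("global_active_power", "sub_metering_1",
              "sub_metering_2", "sub_metering_3")


def parse_rows(text):
    """Yield one dict per record; missing numeric fields become None."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    for row in reader:
        record = {}
        for key, raw in row.items():
            if key in ("date", "time"):
                record[key] = raw
            else:
                # Empty field = missing value; "?" handled as an assumption.
                record[key] = None if raw in (None, "", "?") else float(raw)
        yield record


def unmetered_wh(record):
    """Energy per minute (Wh) not covered by the three sub-meters, or None."""
    if any(record[k] is None for k in POWER_KEYS):
        return None
    return (record["global_active_power"] * 1000 / 60
            - record["sub_metering_1"]
            - record["sub_metering_2"]
            - record["sub_metering_3"])
```

Returning None for incomplete rows keeps the missing timestamps visible, so that later steps can decide whether to drop or impute them.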

Attribute Information:

1. date: Date in format dd/mm/yyyy

2. time: Time in format hh:mm:ss

3. global_active_power: household global minute-averaged active power (in kilowatt)

4. global_reactive_power: household global minute-averaged reactive power (in kilowatt)

5. voltage: minute-averaged voltage (in volt)

6. global_intensity: household global minute-averaged current intensity (in ampere)

The data set used in this study contains the electricity consumption of 50,000 customers over four years (1410 days) at a granularity of 60 seconds. The whole data set consists of a total of 3.46 million (6445 × 537) daily load profiles. Bad load profiles are roughly identified as those with missing values or all zeroes.
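The rough screening rule for bad load profiles can be written as a short check. In this sketch a daily load profile is assumed to be a list of readings in which a missing value is stored as None; the function names are hypothetical.

```python
def is_bad_profile(profile):
    """A profile is bad if it has any missing value or consists only of zeroes."""
    if any(value is None for value in profile):
        return True
    return all(value == 0 for value in profile)


def keep_good_profiles(profiles):
    """Filter out the bad daily load profiles."""
    return [p for p in profiles if not is_bad_profile(p)]
```

An empty profile is also flagged as bad by this rule, because it contains no non-zero reading.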



B. Modeling Dynamics of Electricity Consumption for Each Customer

According to the regular routine of electrical customers, we reasonably divide a day into five periods:

Early Morning (00:00-06:30),

Morning (06:31-11:30),

Noon (11:31-14:30),

Evening (14:31-19:30) and

Night (19:31-24:00).

On this basis, the load data are transformed into PAA representations, which also vary from 0 to 1. Fig. 6 shows the histogram and CDF of the PAA representations of the whole data set. It can be seen that the higher the consumption, the lower the density.
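A minimal sketch of this transformation is given below. It assumes that the readings of one day have already been normalized to the range 0 to 1 and that the PAA value of a period is the mean of the readings that fall in it; the period boundaries are the five listed above, and the function names are hypothetical.

```python
from datetime import time

# Inclusive upper bound (hour, minute) of each period, in chronological order.
BOUNDS = [
    ((6, 30), "Early Morning"),
    ((11, 30), "Morning"),
    ((14, 30), "Noon"),
    ((19, 30), "Evening"),
    ((23, 59), "Night"),
]


def period_of(t):
    """Return the name of the period that contains the time of day t."""
    key = (t.hour, t.minute)
    for upper, name in BOUNDS:
        if key <= upper:
            return name
    return BOUNDS[-1][1]  # not reached for valid times; kept for safety


def paa_by_period(samples):
    """Mean normalized load per period from (time, load) pairs of one day."""
    sums, counts = {}, {}
    for t, load in samples:
        name = period_of(t)
        sums[name] = sums.get(name, 0.0) + load
        counts[name] = counts.get(name, 0) + 1
    return {name: sums[name] / counts[name] for name in sums}
```

Since each PAA value is an average of normalized readings, it stays within the range 0 to 1, as stated above.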

For further analysis by season, we have divided a year into four seasons. The seasons, defined on the basis of weather conditions, are listed below.

  1. Fall
  2. Winter
  3. Summer
  4. Spring

We use these seasons to show how household consumption behavior varies with weather conditions. Fig. 9 clearly shows how electricity consumption changes from season to season. This behavior helps power distributors manage electricity loads according to demand.

C. Clustering for Full Periods

To obtain the typical dynamic characteristics of electricity consumption and to segment customers into several groups, FSFDP is first applied to the full periods. After calculating the dissimilarity matrix following (8), we plot the local density ρ and distance δ of each customer, calculated according to (9) and (10), respectively, in the decision graph, as shown in Fig. 8. We choose the density peaks with ρ > 10 and δ > 0.5, which yields a total of 40 clusters, marked with different colors in Fig. 8.
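Selecting density peaks from the decision graph and assigning the remaining customers can be sketched as follows. The sketch takes ρ and δ as already computed, uses the two thresholds as parameters, and falls back to the nearest center for an object that has no neighbor of higher density; that fallback and the function names are assumptions of this example.

```python
def select_centers(rho, delta, rho_min, delta_min):
    """Indices of objects whose density and distance both exceed the thresholds."""
    return [i for i, (r, d) in enumerate(zip(rho, delta))
            if r > rho_min and d > delta_min]


def assign_clusters(dist, rho, centers):
    """Label each object with the center index of its cluster.

    Objects are visited in order of decreasing density, so the nearest
    neighbor of higher density is always labeled before it is needed.
    Assumes at least one center was selected.
    """
    n = len(rho)
    labels = [None] * n
    for c in centers:
        labels[c] = c
    for i in sorted(range(n), key=lambda k: -rho[k]):
        if labels[i] is not None:
            continue
        higher = [j for j in range(n) if rho[j] > rho[i]]
        if higher:
            nearest = min(higher, key=lambda j: dist[i][j])
            labels[i] = labels[nearest]
        else:
            # Assumed fallback: no denser neighbor, so use the nearest center.
            labels[i] = min(centers, key=lambda c: dist[i][c])
    return labels
```

For example, `select_centers(rho, delta, rho_min=10, delta_min=0.5)` applies the thresholds chosen above.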

To show the distribution of the 6445 customers, we mapped the customers into a 2-D plane according to their dissimilarity matrix by multidimensional scaling (MDS), as shown in Fig. 9. MDS is a very effective dimensionality reduction method for visualizing the level of similarity among different objects of a data set. It tries to place each object in N-dimensional space such that the between-object distances are preserved as closely as possible. Each point in the plane stands for a customer. Points in the same cluster are marked with the same color. It can be seen that the customers of different clusters are unevenly distributed. Approximately 90% of the customers belong to the 10 larger clusters, whereas the other 10% are distributed among the other 30 clusters. In this way, these 6445 customers are segmented into different groups according to their electricity consumption dynamic characteristics for full periods. Note that the customers in the same cluster have similar electricity consumption behavior dynamics over a certain period rather than similarly shaped load profiles.

D. Clustering for Each Adjacent Periods

Sometimes, we may not be concerned with the dynamic characteristics of full periods and instead concentrate on a certain period of time. For example, to evaluate the demand response potential in noon peak shaving of each customer, the dynamics from Period 1 to Period 2 are much more important; to measure the potential to follow the change of wind power at midnight, the dynamics from Period 4 to Period 1 should be emphasized. Thus, it is necessary to conduct customer segmentation for different adjacent periods. Fig. 10 illustrates the decision graph and 2-D plane mapping of customers for the four adjacent periods.

It can be seen that the distributions of the customers of the four adjacent periods are shaped like bells, and the proposed clustering technique can effectively address the non-spherically distributed data. Unsurprisingly, the dynamics from Period 2 to Period 3 and from Period 3 to Period 4 show more diversity because people become more active during the day, whereas the dynamics from Period 1 to Period 2 and from Period 4 to Period 1 show less diversity because most people are off duty and go to sleep with less electricity consumption. Taking the dynamics from Period 2 to Period 3 as an example, the six most typical dynamic patterns are shown in Fig. 11. The percent in each matrix stands for the percentage of customers who belong to the cluster. For example, approximately 37% of the customers have very similar electricity consumption dynamics to that of Type_1.
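The matrices referred to here can be estimated by counting state changes between two adjacent periods over many days. The following is a minimal sketch, assuming that each day contributes one symbol per period and that the three states are labeled a, b and c; the function name and the input layout are hypothetical.

```python
def transition_matrix(from_states, to_states, symbols="abc"):
    """Estimate P[i][j] = P(next period in state j | current period in state i).

    from_states[k] and to_states[k] are the states of day k in the earlier
    and the later of two adjacent periods.
    """
    index = {s: i for i, s in enumerate(symbols)}
    counts = [[0] * len(symbols) for _ in symbols]
    for s, t in zip(from_states, to_states):
        counts[index[s]][index[t]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A state that never occurs gets an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix
```

Each non-empty row sums to 1, so the matrix is a set of probability distributions, one per starting state.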

E. Distributed Clustering

To verify the proposed distributed clustering algorithm, we divide the 6445 customers into three equal parts. Then, the

Unlike traditional load profiling methods, which mainly focus on the shape of load profiles, this paper performs clustering on the extent and probability of load consumption changes in adjacent time periods, which indicate the dynamic features of customer consumption behaviors. The proposed modelling method has many potential applications. For example, on the decision graphs obtained by FSFDP, such as Fig. 8 and Fig. 10, we can easily find the objects with small ρi and

It is believed that customers with less variability and heavier consumption are suitable for incentive-based demand response programs such as direct load control (DLC) because they are predictable to control, whereas customers with greater variability and heavier consumption are suitable for price-based demand response programs such as time-of-use (ToU) pricing because they have the flexibility to modify their consumption. Note that an N × N state transition probability matrix is essentially a combination of N probability distributions, as mentioned before. Although the dynamic characteristics have been abstracted into 3 × 3 matrices as shown in Fig. 11, we can make intuitive evaluations of the customers for demand response targeting by introducing entropy evaluation to extract further information from the matrices. The variability can be quantified by the Shannon entropy of the state transition matrix.

The distortion threshold θ is carefully selected for the adaptive k-means method, as a larger threshold leads to poor accuracy, whereas a smaller one leads to little compression. We run 100 cases by varying θ from 0.0025 to 0.25 in steps of 0.0025 and calculate the average compression ratio (CR) of the three distributed sites for each case. The CR is defined as the ratio between the volume of the compressed data and the volume of the original data. Specifically, the compressed data refers to the local models obtained by adaptive k-means, and the original data refers to all the objects distributed on each site:

$$\mathrm{CR} = \frac{\text{No. of local models}}{\text{No. of the whole objects}} \tag{13}$$

The lower the CR, the better the compression effect. Fig. 12 shows the relationship between the average compression ratio and the threshold for different periods. To obtain a lower compression ratio and guarantee clustering quality, we choose “knee point” A as a balance, where θ is approximately 0.025 and the average compression ratio is approximately 0.065. Kmin and Kmax are set to 10 and 1000, respectively.

To evaluate the performance of the proposed algorithm, we run both the centralized and the distributed clustering processes. A high consistency between the two results indicates good performance of the distributed algorithm. As shown in Table I, the matching rate between the distributed algorithm and the centralized algorithm can be as high as 96.47%. This indicates that the proposed algorithm achieves high clustering quality with a low CR. In addition, the time and space complexity of the modified FSFDP in global modelling is O((CR·N)^2). This means that the efficiency of the global clustering increases by a factor of (1/CR)^2, where CR < 1 holds. In this case, the efficiency is boosted by a factor of approximately (1/0.065)^2 ≈ 235.
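The compression ratio of (13) and the resulting efficiency gain are simple to check numerically. A minimal sketch with hypothetical function names and hypothetical object counts:

```python
def compression_ratio(n_local_models, n_objects):
    """CR as in (13): number of local models over number of original objects."""
    return n_local_models / n_objects


def global_speedup(cr):
    """Efficiency gain of an O(N^2) method applied to CR*N objects instead of N."""
    return (1 / cr) ** 2


# Hypothetical counts chosen so that CR matches the knee point value.
cr = compression_ratio(65, 1000)
gain = global_speedup(cr)
```

With CR rounded to 0.065, the gain evaluates to about 237; the figure of approximately 235 quoted above is presumably based on the unrounded average compression ratio.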

We implement the proposed distributed clustering algorithm in MATLAB R2015a on a standard PC with an Intel Core i7-4770MQ CPU @ 2.40 GHz and 8.0 GB of RAM. The centralized clustering takes 60.058 sec for 6445 customers. For the distributed clustering algorithm, the time needed for adaptive k-means on the distributed sites ranges from 0.415 sec to 0.542 sec, with an average of 0.472 sec; the time needed for global modelling is only 0.226 sec. Distance calculation consumes most of the time at the global modelling stage. The overall computation time is reduced greatly. Note that the time consumed by adaptive k-means is greater than that of FSFDP because, in contrast to FSFDP, many iterations are needed to satisfy the threshold condition given by (11).


In Fig. 9, we can see that households use the most electricity in winter, mainly for air conditioners (AC) and heaters, which shows the importance of AC and heaters in household consumption.

It can be seen that Type_3 shows the minimum entropy. The 0.994 in the Type_3 matrix means that Type_3 customers have a greater chance of remaining in state c, i.e., the higher consumption level, and are easier to predict. Thus, Type_3 customers may have greater potential for incentive-based demand response during Period 3. However, Type_1 and Type_2 show much higher entropies and have a relatively higher consumption level than Type_3, which makes them much more suitable for price-based demand response. For example, Type_1 and Type_2 customers have almost the same probability of switching from state c to state b as of remaining in state c, which makes them hard to predict but gives them more flexibility to adjust their consumption behaviors.
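The entropy comparison can be reproduced with a short calculation. This is a sketch under assumptions: the entropy of a matrix is taken as the mean Shannon entropy (in bits) of its rows, which may differ from the exact aggregation used in this work, and the example rows are hypothetical apart from the value 0.994.

```python
import math


def row_entropy(row):
    """Shannon entropy in bits of one probability distribution."""
    return -sum(p * math.log2(p) for p in row if p > 0)


def matrix_entropy(matrix):
    """Assumed aggregation: mean entropy over the rows of a transition matrix."""
    return sum(row_entropy(row) for row in matrix) / len(matrix)


# Hypothetical rows for the transitions out of state c.
predictable = [0.003, 0.003, 0.994]   # almost always stays in state c
uncertain = [0.0, 0.5, 0.5]           # equally likely to move to b or stay in c
```

The first row has an entropy close to zero, while the second has an entropy of exactly 1 bit, which matches the reading above: a low entropy means predictable behavior, and a high entropy means more variability.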



In this paper, a novel approach for the clustering of electricity consumption behavior dynamics for large data sets has been proposed. Departing from a static perspective, SAX and a time-based Markov model are utilized to model the dynamic characteristics of each customer. A clustering technique, FSFDP, is then performed to discover the typical dynamics of electricity consumption and to segment customers into different groups. Finally, a time domain analysis and an entropy evaluation are conducted on the results of the dynamic clustering to identify the demand response potential of the customers in each group. The challenges of massive high-dimensional electricity consumption data are addressed in three ways. First, SAX reduces and discretizes the numerical consumption data to ease the cost of data communication and storage. Second, Markov models are used to transform long-term data into several transition matrices. Third, a distributed clustering algorithm is proposed for distributed big data sets. In this project, we have analyzed electricity consumption behavior based on various meters (Kitchen, Hall, Laundry), times (every minute) and day types. The same approach can be extended further to find electricity consumption behavior based on temperature and dynamic data by using Big Data analytics.


  1. U.S. Department of Energy. (2014). Smart of Energy. [Online]. Available: http://energy.gov/oe/technology-  development/smart-grid
  2. I. P. Panapakidis, M. C. Alexiadis, and G. K. Papagiannis, “Load profil- ing in the deregulated electricity markets: A review of the applications,” dential electricity load profiles,” IEEE Trans. Power Syst., vol. 30, pp. 3217–3224, Nov. 2015.
  3. N. Mahmoudi-Kohan, M. P. Moghaddam, M. K. Sheikh-El-Eslami, and E. Shayesteh, “A three-stage strategy for optimal a retailer based on clustering techniques,” Int. J. Elect. Power Energy Syst., vol. 32, no. 10, pp. 1135–1142, 2010.
  4. P. Zhang, X. Wu, X. Wang, and S. Bi, “Short-term load forecasting based on big data technologies,” CSEE J. Power Energy Syst., vol. 1,
  5. no. 3, pp. 59–67, Sep. 2015.  N. Mahmoudi-Kohan, M. P. Moghaddam, M. K. Sheikh-El-Eslami, and S. M. Bidaki, “Improving WFA k-means technique for demand response programs applications,” in Proc. IEEE Power Energy Soc. Gen. [7]              Meeting (PES), Calgary, AB, Canada, 2009, pp. 1–5. C. Leon et al., “Variability and trend-based generalized rule induction model to NTL detection in power companies,” IEEE Trans. Power Syst., vol. 26, no. 4, pp. 1798–1807, Nov. 2011.
  6. Y. Wang et al., “Load profiling and its application to demand response: A  review,”  Tsinghua  Sci.  Technol.,  vol.  20,  no. Apr. 2015.
  7. R. Li, C. Gu, F. Li, G. Shaddick, and M. Dale, “Development of low volt- age network templates—Part I: Substation clustering and classification,” IEEE Trans. Power Syst., vol. 30, no. 6, pp. 3036–3044, Nov. 2015.
  8. K.-L. Zhou, S.-L. Yang, and C. Shen, “A review of electric load classifi- cation in smart grid environment,” Renew. Sustain. Energy Rev., vol. 24, pp. 103–110, Aug. 2013.
  9. G. J. Tsekouras, P. B. Kotoulas, C. D. Tsirekis, E. N. Dialynas, and N. D. Hatziargyriou, “A pattern recognition methodology for evaluation of load profiles and typical days of large electricity customers,” Elect.  Power Syst. Res., vol. 78, no. 9, pp. 1494–1510, 2008.
  10.           S. V. Verdu, M. O. Garcia, C. Senabre, A. G. Marin, and F. J. G. Franco, “Classification, filtering, and identification of electrical customer load patterns through the use of self-organizing maps,” IEEE Trans. Power  Syst., vol. 21, no. 4, pp. 1672–1682, Nov. 2006.
  11. G. Chicco and I. S. Ilie, “Support vector clustering of electrical load pattern data,” IEEE Trans. Power Syst., vol. 24, no. 3, pp. 1619–1628, Aug. 2009.
  1. M. Piao, H. S. Shon, J. Y. Lee, and K. H. Ryu, “Subspace projection method based clustering analysis in load profiling,” IEEE Trans. PowerSyst., vol. 29, no. 6, pp. 2628–2635, Nov. 2014.
  1. G. Chicco, O. M. Ionel, and R. Porumb, “Electrical load pattern grouping based on centroid model with ant colony clustering,” IEEE Trans. PowerSyst., vol. 28, no. 2, pp. 1706–1715, May 2013.
  1. G. Chicco, “Overview and performance assessment of the clustering methods for electrical load pattern grouping,” Energy, vol. 42, no. 1,
    1. 68–80, 2012.
  1. I. K. Fodor, “A survey of dimension reduction techniques,” Center Appl. Sci. Comput., Lawrence Livermore Nat. Lab., Livermore, CA, USA, Tech. Rep. UCRL-ID-148494, 2003.
  1. M. Abrahams and M. Kattenfeld, “Two-stage fuzzy clustering approach for load profiling,” in Proc. 44th Int. Univ. Power Eng. Conf. (UPEC), Glasgow, U.K., 2009, pp. 1–5.
  1. M. Koivisto, P. Heine, I. Mellin, and M. Lehtonen, “Clustering of con-nection points and load modeling in distribution systems,” IEEE Trans.Power Syst., vol. 28, no. 2, pp. 1255–1265, May 2013.
  1. G. Chicco, R. Napoli, and F. Piglione, “Comparisons among clustering techniques for electricity customer classification,” IEEE Trans. PowerSyst., vol. 21, no. 2, pp. 933–940, May 2006.
  1. E. D. Varga, S. F. Beretka, C. Noce, and G. Sapienza, “Robust real-time load profile encoding and classification framework for efficient power systems operation,” IEEE Trans. Power Syst., vol. 30, no. 4,
    1. 1897–1904, Jul. 2015.
  1. S. Zhong and K. S. Tam, “Hierarchical classification of load profiles based on their characteristic attributes in frequency domain,” IEEETrans. Power Syst., vol. 30, no. 5, pp. 2434–2441, Sep. 2015.
  1. J. Torriti, “A review of time use models of residential electricity demand,” Renew. Sustain. Energy Rev., vol. 37, pp. 265–272, Sep. 2014.
  2. Y. Xiao, J. Yang, H. Que, M. J. Li, and Q. Gao, “Application of wavelet-based clustering approach to load profiling on AMI measurements,” in

Proc. IEEE China Int. Conf. Elect.Distrib. (CICED), Shenzhen, China,2014, pp. 1537–1540.

  1. A. Notaristefano, G. Chicco, and F. Piglione, “Data size reduction with symbolic aggregate approximation for electrical load pattern grouping,” IET Gener. Transm. Distrib., vol. 7, no. 2, pp. 108–117, Feb. 2013.
  1. A. Albert and R. Rajagopal, “Smart meter driven segmentation: What your consumption says about you,” IEEE Trans. Power Syst., vol. 28, no. 4, pp. 4019–4030, Nov. 2013.
