Tuesday, June 4, 2019
Partitioning Methods to Improve Obsolescence Forecasting
Partitioning Methods to Improve Obsolescence ForecastingAmol KulkarniAbstract Clustering is an unsupervised classification of observations or data items into groups or clusters. The some(prenominal)er of forgather has been addressed by many re tryers in various disciplines, which serves to reflect its usefulness as one of the steps in exploratory data analysis. This make-up presents an overview of class methods, with a goal of providing useful advice and mentions to identifying the best do of cluster and provide a basic introduction to cluster validation techniques. The aim of assemble methods carried out in this paper is to present useful information which would aid in forecasting obsolescence.INRODUCTIONThere bewilder been more inventions recorded in the past thirty years than all the rest of recorded humanity, and this pace hastens every month. As a result, the reaping life pedal has been decreasing rapidly, and the life cycle of products no longer fit together with the life cycle of their components. This issue is termed as obsolescence, wherein a component tail no longer be obtained from its authoritative manufacturer. Obsolescence arouse be broadly categorized into Planned and Unplanned obsolescence. Planned obsolescence can be considered as a business st prizegy, in which the obsolescence of a product is built into it from its conception. As Philip Kotler termed it Much so-called planned obsolescence is the working of the competitive and technological forces in a free society-forces that lead to ever-improving goods and services. On the other hand, casual obsolescence causes more harm to a burgeoning industry than good. This issue is more prevalent in the electronics industry the procurement life-cycles for electronic components be significantly shorter than the manufacturing and support life-cycle. Therefore, it is passing pith(a) to implement and operate an active management of obsolescence to mitigate and avoid extreme costs 1.On e such(prenominal) product that has been plagued by threat of obsolescence is the digital camera. Ever-since the invention of smartphones there has been a huge dip in the digital camera sales, as can be seemn from Figure 1. The decreasing price, the exponential rate at which the pixels and the resolution of the smart-phones improved can be termed as few of the factors that cannibalized the digital camera market.Figure 1 ecumenical Sales of Digital Cameras (2011-2016) 2 and Worldwide sale of cellphones on the right (2007-2016) 3CLUSTERINGHumans naturally use cluster to understand the world around them. The ability to group sets of objects base on similarities atomic number 18 fundamental to learning. Researchers have sought to take these natural learning methods mathematically and this has birthed the clustering research. To serve well us solve problems at-least approximately as our brain, mathematically precise notation of clustering is important 4. Clustering is a useful tec hnique to explore natural groupings within multivariate data for a social organization of natural groupings, also for feature extraction and summarizing. Clustering is also useful in identifying outliers, forming hypotheses concerning relationships. Clustering can be thought of as partitioning a given space into K groups i.e., 1, , K. One method of carrying out this partitioning is to optimize some national clustering criteria such as the outspace amid apiece observation within a cluster etc. While clustering plays an important role in data analysis and serves as a pre work oning step for a multitude of learning task, our primary interest lies in the ability of clusters to gain more information from the data to improve portent accuracy. As clustering, can be thought of separating classes, it should help in classification task.The aim of clustering is to find useful groups of objects, usefulness being defined by the goals of the data analysis. almost clustering algorithms r equire us to know the occur of clusters onwardhand. However, there is no intuitive way of identifying the optimal sum up of clusters. Identifying optimal clustering is parasitic on the methods employ for measuring similarities, and the parameters used for partitioning, in general identifying the optimal number of clusters. Determining number of clusters is often an ad hoc decision establish on prior knowledge, assumptions, and practical experience is very subjective.This paper performs k-means and k-medoids clustering to gain information from the data structure that could play an important role in divineing obsolescence. It also tries to address the issue of assessing cluster tendency, which is a first and fore most(prenominal) step while carrying out unsupervised machine learning process. Optimization of internal and external clustering criteria will be carried out to identify the optimal number of cluster. Cluster proof will be carried out to identify the most suitable clus tering algorithm.DATA CLEANING deficient take account in a dataset is a common occurrence in real world problems. It is important to know how to handle absentminded data to reduce bias and to produce powerful models. Some clock ignoring the missing data, biases the answers and potentially leads to incorrect conclusion. Rubin in 7 differentiated amongst three types of missing set in the datasetMissing completely at haphazard (MCAR) when qualitys with missing values can be thought of as a random sample of all the cases MCAR occurs rarely in practice.Missing at random (MAR) when conditioned on all the data we have, any remaining missing value is completely random that is, it does not depend on some missing variables. So, missing values can be modelled using the observed data. Then, we can use specialized missing data analysis methods on the available data to correct for the effects of missing values.Missing not at random (MNAR) when data is neither MCAR nor MAR. This is difficult to handle because it will require strong assumptions about the patterns of missing data.While in practice the use of complete case methods which drops the observations containing missing values is quite common, this method has the disadvantage that it is inefficient and potentially leads to bias. Initial approach was to ocularly explore each individual variable with the help of VIM. However, upon learning the limitations of pickaxe in missing values through exploratory data analysis, this approach was abandoned in favor of multiple imputations.Joint moulding (JM) and Fully Conditional Specification (FCS) are the ii emerging general methods in imputing multivariate data. If multivariate distribution of the missing data is a bonny assumption, then Joint Modelling which imputes data based on Markov Chain Monte Carlo techniques would be the best method. FCS specifies the multivariate imputation model on a variable-by-variable reason by a set of conditional densities, one for each incomplete variable. Starting from an initial imputation, FCS d defenselesss imputations by iterating over the conditional densities. A low number of iterations is often sufficient. FCS is attractive as an alternative to JM in cases where no suitable multivariate distribution can be found 8.The Multiple imputations approach involves filling in missing values multiple times, creating multiple complete datasets. Because multiple imputations involve creating multiple predictions for each missing value, the analysis of data imputed multiple times take into account the uncertainty in the imputations and yield accurate standard errors. Multiple imputation techniques have been utilized to impute missing values in the dataset, in the first place because it preserves the relation in the data and it also preserves uncertainty about these relations. This method is by no means perfect, it has its own complexities. The only complexness was having variables of different types (binary, unordered and free burning), thereby making the application of models, which assumed multivariate normal distribution- theoretically inappropriate. There are several complexities that surface listed in 8. In order to address this issue It is convenient to specify imputation model separately for each column in the data. This is called as chained equations wherein the specification occurs at a variable take, which is well understood by the user.The first task is to identify the variables to be included in the imputation process. This generally includes all the variables that will be used in the subsequent analysis irrespective of the presence of missing data, as well as variables that may be predictive of the missing data. There are three specific issues that often come up when selecting variables(1) creating an imputation model that is more general than the analysis model,(2) imputing variables at the item level vs. the summary level, and(3) imputing variables that reflect raw scores vs. s tandardized scores.To help make a decision on these aspects, the distribution of the variables may help guide the decision. For example, if the raw scores of a continuous measure are more normally distributed than the corresponding standardized scores then using the raw scores in the imputation model, will likely mend meet the assumptions of the linear regressions being used in the imputation process.The following image supplys the missing values in the data-frame containing the information regarding digital camera.Figure 2 Missing VariablesWe can see that Effective Pixels has missing values for all its observations. After cross verifying it with the source website, the web scrapper was rewriting to correctly capture this variable from the website. The date variable was converted from a numeric to a date and this enabled the identification of errors in the observation for USB in the dataset. Two cameras that were released in 1994 1995 were shown to have USB 2.0, after searching o nline, it was found out that USB 2.0 was released in the year 2005 and USB 1.0 was released in the year 1996. As, most of the cameras before 1997 used PC-serial port a new level was introduced to the USB variable to indicate this.DATA DESCRIPTIONThe dataset containing the specification of the digital cameras was acquired using rvest -package 5 in R from the url provided in 6. The structure of the data set is as shown in Appendix A. The data-frame contains 2199 observation and 55 variables. Appendix B contains the descriptive statistics of the quantitative variables in the data-frame.Figure 4 The Distribution of Body-Type in the dataset card Most of the compact, Large SLR and ultracompact cameras are discontinued.Figure 5 Plot showing the status of Digital Cameras from 1994-2017Observation Most of the cameras released before 2007 have been discontinued however, we can see that few cameras announced amongst the time period of 1996-2006 are dormant in production. Fewer new cameras h ave been announced after the year 2012, this can be evidenced due to the decreasing number of camera sales presented in Figure 5.Figure 6 Distribution of different Cameras (1994-2017)Observation Between the period of 1996 2012 the digital camera market was dominated by the compact cameras. After 2012, fewer new compact cameras have been announced or are still in production. Same can be said about the fate of ultracompact cameras. In the year 2017, only SLR style mirrorless cameras have been announced, signaling the death of burden and shoot cameras.Figure 7 Plot showing the Change in the Total Resolution and Effective Pixels of Digital Camera over the YearsObservation Total resolution has seen an improvement over the years. The presence of outliers can be seen in the top-left corner of the plot. Although the effective pixel is around 10, the derive resolution is far higher than any of the cameras announced between the period 1996-2001. These could be the cameras that are still i n production as evidenced from Figure 7.ASSESSING CLUSTER angle of inclinationA primary issue with unsupervised machine learning is the fact if carried out blindly, clustering methods will divide the data into clusters, because that is what they are supposed to do. Therefore, before choosing a clustering approach, it is important to decide whether the dataset contains meaningful clusters. If the data does contain meaningful clusters, then the number of clusters is also an issue that needs to be looked at. This process is called assessing clustering tendency (feasibility of cluster analysis).To carry out a feasibility study of cluster analysis Hopkins statistic will be used to assess the clustering tendency of the dataset. Hopkins statistic assess the clustering tendency based on the probability that a given data follows a resembling distribution (tests for spatial randomness). If the value of the statistic is close to zero this implies that the data does not follow uniform distrib ution and thus we can reject the null hypothesis. Hopkins statistic is calculated using the following commandmentWhere xi is the distance between devil neighboring points in a given, dataset and yi demos the distance between two neighboring points of a simulated dataset following uniform distribution. If the value of H is 0.5, this implies that and are close to one another and thus the given data follows a uniform distribution. The next step in the unsupervised learning method is to identify the optimal number of clusters.The Hopkins statistic for the digital camera dataset was found to be 0.00715041. Since Hopkins statistic was quite low, we can conclude that the dataset is highly clusterable. A visual assessment of the clustering tendency was also carried out and the result can be seen in Figure 8.Figure 8 Dissimilarity Matrix of the dataset find out OPTIMAL NUMBER OF CLUSTERSOne simple solution to identify the optimal number of cluster is to perform hierarchical clustering a nd determine the number of clusters based on the dendogram generated. However, we will utilize the following methods to identify the optimal number of clustersAn optimization criterion such as within sum of squares or average out silhouette widthComparing evidence against null hypothesis. ( spreadhead Statistic)SUM OF SQUARESThe basic idea behind partitioning methods like k-means clustering algorithms, is to define clusters such that the entirety within cluster sum of squares is minimized.Where Ck is the kth cluster and W(Ck) is the variation within the cluster. Our aim is to minimize the total within cluster sum of squares as it measures the compactness of the clusters.In this approach, we generally perform clustering method, by varying the number of clusters (k). For each k we direct the total within sum of squares. We then plot the total within sum of squares against the k-value, the location of bend or knee in the plot is considered as an appropriate value of the cluster.AVER AGE SILHOUETTE WIDTH sightly silhouette is a measure of the quality of clustering, in that it determines the how well an object lies within its cluster. The metric can range from -1 to 1, where higher values are better. median(a) silhouette method computes the average silhouette of observations for different number of clusters. The optimal number of clusters is the one that maximizes the average silhouette over a range of contingent values for different number of clusters 9. come silhouette functions similar to within sum of squares method. We carry out the clustering algorithm by varying the number of clusters, then we calculate average silhouette of observation for each cluster. We then plot the average silhouette against different number of clusters. The location with the highest value of average silhouette width is considered as the optimum number of cluster.GAP STATISTICThis method compares the total within sum of squares for different number of cluster with their expected val ues while assuming that the data follows a distribution with no obvious clustering.The reference dataset is generated using Monte Carlo simulations of the sampling process. For each variable (xi) in the dataset we compute its range min(xi), max(xj) and generate n values uniformly from the range min to max.The total within cluster variation for both the observed data and the reference data is computed for different number of clusters. The gap statistic for a given number of cluster is defined as follows denotes the expectation under a sample of size n from the reference distribution. is defined via bootstrapping and computing the average . The gap statistic measures the bending of the observed Wk value from its expected value under the null hypothesis.The estimate of the optimal number of clusters will be a value that maximizes Gapn(k). This implies that the clustering structure is far away from the uniform distribution of points.The standard deviation (sdk) of is also computed i n order to define the standard error sk as followsFinally, we choose the smallest value of the number of cluster such that the gap statistic is within one standard deviation of the gap at k+1Gap(k)Gap(k+1) sk+1The above method and its explanation are borrowed from 10.DATA PRE-PROCESSINGThe issue with K-means clustering is that it cannot handle categorical variables. As the K-means algorithm defines a cost function that computes Euclidean distance between two numeric values. However, it is not possible to define such distance between categorical values. Hence, the need to treat categorical data as numeric. While it is not improper to deal with variables in this manner, however categorical variables lose their meaning once they are treated as numeric.To be able to perform clustering efficiently, Gower distance will be used for clustering. The concept of Gower distance is that for each variable a distance metric that works well for that particular type of variable is used. It is scale d between 0 and 1 and then a linear combination of weights is calculated to create the final distance matrix.PARTITIONING METHODSK-MEANSK-means clustering is the simplest and the most commonly used partitioning method for splitting a dataset into a set of k clusters. In this method, we first choose K initial centroids. Each point is then assigned to the closest centroid, and each collecting of points is assigned to a centroid in the cluster. The centroid of each cluster is updated based on the additional points assigned to the cluster. We repeat his until the centroids find a steady state.Figure 9 Plot Showing total sum of square and Average Silhouette width for different number of clustersWe can see from Figure 9, that the optimal number of clusters suggested by the optimization criteria is 3 clusters using WSS method and 2 clusters using Average Silhouette width method.Considering the dependent variable is factor with two levels, having two clusters does make sense. The disadvant age of optimization criterion to identify the optimal clusters is that, it is sometimes ambiguous. A more sophisticated method is the gap statistic method.Figure 10 Gap Statistic for different number of clustersFrom Figure 10, we can see that the Gap statistic is high for 2 clusters. Hence, we carry out k-means clustering with 2 clusters on a volume basis.Figure 11 Visualizing K-means Clustering MethodThe data separates into two relatively manifest clusters, with the red category in the left region, while the region on the right contains the blue category. There is a limited overlap at the interface between the classes. To visualize K-means it is necessary to bring the number of dimensions down to two. The represent produced by fviz_cluster Factoextra Ver 1.0 11 is not a selection of any two dimensions. The plot shows the projection of the entire data onto the first two principle components. These are the dimensions which show the most variation in the data. The 52.8% indicates t hat the first principle component accounts for 52.8% variation in the data, whereas the second principle component accounts for 23.9% variation in the data. Together both the dimensions account for 76.7% of the variation. The polygon in red and blue represent the cluster means.PARTITIONING AROUND MEDOIDSK means clustering is highly sensitive to outliers, this would incite the assignment of observations to their respective clusters. Partitioning around medoids also known as K-medoids clustering are much more robust compared to k-means.K-medoids is based on the search of medoids among the observation of the dataset. These medoids represent the structure of the data. Much like K-means, after finding the medoids for each of the K- clusters, each observation is assigned to the nearest medoid. The aim is to find K-medoids such that it minimizes the sum of dissimilarities of the observations within the cluster.Figure 12 Plot Showing total sum of square and Average Silhouette width for di fferent number of clustersWe can see from Figure 12, that the optimal number of clusters suggested by the optimization criteria is 3 clusters using WSS method and 2 clusters using Average Silhouette width method. Considering the dependent variable is factor with two levels, having two clusters does make sense. The disadvantage of optimization criterion to identify the optimal clusters is that, it is sometimes ambiguous. A more sophisticated method is the gap statistic method.Figure 13 Gap Statistic for different number of clustersFrom Figure 13, we can see that the Gap statistic is high for 2 clusters. Hence, we carry out partitioning around medoids clustering with 2 clusters on a majority basis.Figure 14 Plot visualizing PAM clustering methodThe data separates into two relatively distinct clusters, with the red category in the lower region, while the upper region contains the blue category. There is a limited overlap at the interface between the classes. fviz_cluster Factoextra Ver 1.0 11 transforms the initial set of variables into a new set of variables through principal component analysis. This dimensionality reduction algorithm operates on the 72 variables and outputs the two new variables that represent the projection of the original dataset.CLUSTER VALIDATIONThe next step in cluster analysis is to find the goodness of fit and to avoid finding patterns in noise and to compare clustering algorithms, cluster validation is carried out. The following cluster validation measures to compare K-means and PAM clustering will be usedConnectivity Indicates the extent to which the observations are placed in the same cluster as their nearest neighbors in the data space. It has a value ranging from 0 to and should be minimizedDunn It is the ratio of shortest distance between two clusters to the largest intra-cluster distance. It has a value ranging from 0 to and should be maximized.Average Silhouette widthThe results of internal validation measures are presented in the table below. K-means for two cluster has performed better for each statistic.Figure 15 Plot Comparing Connectivity and Dunn Index for K-means and PAM for different number of clusters Figure 16 Plot Comparing Average Silhouette width of K-means and PAMClusteringAlgorithmValidation MeasuresNumber of Clusters23456kmeansConnectivity139.9575292.5563406.5429514.3913605.5373Dunn0.06610.02460.02230.02440.0291Silhouette0.43690.31740.28140.26790.2447pamConnectivity156.1004333.754474.4298520.3913635.3687Dunn0.02750.03970.0220.0280.0246Silhouette0.42710.30350.27570.26610.2325Table 1 Presenting the values of different validation measures for K-means and PAMValidation MeasuresScoreMethodClustersConnectivity139.9575kmeans2Dunn0.0661kmeans2Silhouette0.4369kmeans2Table 2 Optimal Scores for the Validation MeasuresCONCLUSIONIn this research work, partitioning methods like K-means and Partitioning around medoids were developed. The performances of these two approaches have been observed on the bas is of their Connectivity, Dunn index and Average Silhouette width. The results indicate that K-means clustering algorithm with K = 2 performs better than partitioning around medoids with two clusters. The findings of this paper will be very useful to predict obsolescence with higher accuracy.FUTURE WORKAdvanced clustering algorithms such as Model based clustering and Density based clustering can be carried out to find the multivariate data structure as most of the variables are categorical.1 Bjoern Bartels, Ulrich Ermel, Peter Sandborn and Michael G. Pecht (2012). Strategies to the Prediction, Mitigation and Management of Product Obsolescence.2 Source Figure 1 https//www.statista.com/statistics/269927/sales-of-analog-and-digital-cameras-worldwide-since-2002/3 Source, Figure 1 https//www.statista.com/statistics/263437/global-smartphone-sales-to-end-users-since-2007/4 S. Still, and W. Bialek, How many Clusters? An Information Theoretic Perspective, Neural Computation, 2004.5 Wickham, Hadley, rvest Easily Harvest (Scrape) Web Pages. https//cran.r-project.org/web/packages/rvest/rvest.pdf, Ver. 0.3.26 https//www.dpreview.com7 Rubin, D.B., Inference and missing data. Biometrika, 1976.8 Multivariate Imputation by enchained Equations Stef van Buuren, Karin Groothuis .9 Learning the k in k-means Greg Hamerly, Charles Elkan10 Robert Tibshirani, Guenther Walther and Trevor Hast
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.