Linguistic Summarization of Structured Patent Data
Patent data play an increasingly important role in economic growth, innovation, technical advantage, business strategy, and even competition between countries. Analyzing patent data is crucial because patents cover a large part of the world's technological information. In this paper, we use the linguistic summarization technique to verify hypotheses about patent data that are stated in the literature.
Discovering User Behaviour Patterns from Web Log Analysis to Enhance the Accessibility and Usability of Website
Finding relevant information on the World Wide Web is becoming more challenging every day. Web usage mining extracts relevant and useful knowledge, such as user behaviour patterns, from web access log records. A web access log records every request that users make for individual files on the website. Web usage mining is important for Customer Relationship Management (CRM), as it helps ensure customer satisfaction in the interaction between the customer and the organization. It also helps improve website structure or design according to users' requirements by analyzing the website's access log file with a log analyzer tool. The focus of this paper is to enhance the accessibility and usability of a guitar-selling website by analyzing its access log with the Deep Log Analyzer tool. The results show that the largest number of users come from the United States and that they use the Opera 9.8 web browser and the Windows XP operating system.
Summarizing Data Sets for Data Mining by Using Statistical Methods in Coastal Engineering
Coastal regions are among the places most heavily used, both by nature and by a growing population. In coastal engineering, the most valuable data describe wave behavior. The volume of these data becomes very large because observations run for hours, days, and months. In this study, statistical methods such as wave spectrum analysis and standard statistical methods are used. The goal is to discover profiles of different coastal areas with these methods and thereby derive an instance-based data set from the big data for analysis with data mining algorithms. In the experimental studies, six sample data sets of wave behavior, obtained from 20 minutes of observations in Mersin Bay, Turkey, were converted to an instance-based form, and different clustering techniques were used to discover similar coastal places. Moreover, this study argues that this summarization approach can be used in other fields that collect big data, such as medicine.
CompPSA: A Component-Based Pairwise RNA Secondary Structure Alignment Algorithm
The biological function of an RNA molecule depends
on its structure. The objective of the alignment is finding the
homology between two or more RNA secondary structures. Knowing
the common functionalities between two RNA structures allows
a better understanding and a discovery of other relationships
between them. In addition, identifying non-coding RNAs (RNAs that are
not translated into a protein) is a popular application in which RNA
structural alignment is the first step. A few methods for RNA
structure-to-structure alignment have been developed. Most of these
methods perform partial structure-to-structure, sequence-to-structure, or
structure-to-sequence alignment. Little attention is given in the
literature to efficient RNA structure representations, and
structure-to-structure alignment methods are lacking. In this paper,
we introduce an O(N^2) Component-based Pairwise RNA Structure
Alignment (CompPSA) algorithm, where structures are given as
a component-based representation and where N is the maximum
number of components in the two structures. The proposed algorithm
compares the two RNA secondary structures based on their weighted
component features rather than on their base-pair details. Extensive
experiments are conducted illustrating the efficiency of the CompPSA
algorithm when compared to other approaches and on different real
and simulated datasets. The CompPSA algorithm shows an accurate
similarity measure between components. The algorithm gives the
flexibility for the user to align the two RNA structures based on
their weighted features (position, full length, and/or stem length).
Moreover, the algorithm proves scalability and efficiency in time and
Development of Prediction Models of Day-Ahead Hourly Building Electricity Consumption and Peak Power Demand Using the Machine Learning Method
To encourage building owners to purchase electricity on the wholesale market and to reduce building peak demand, this study develops models that predict day-ahead hourly electricity consumption and demand using an artificial neural network (ANN) and a support vector machine (SVM). All prediction models are built in Python with the Scikit-learn and PyBrain libraries. The input data for both consumption and demand prediction are time stamp, outdoor dry bulb temperature, relative humidity, air handling unit (AHU) supply air temperature, and solar radiation. Solar radiation, which is unavailable a day ahead, is predicted first, and this estimate is then used as an input to predict consumption and demand. Models to predict consumption and demand are trained with both SVM and ANN, and they depend on cooling or heating and on weekdays or weekends. The results show that ANN is the better option for both consumption and demand prediction. It achieves a coefficient of variation of the root mean square error (CVRMSE) of 15.50% to 20.03% for consumption prediction and 22.89% to 32.42% for demand prediction. In conclusion, the presented models have the potential to help building owners purchase electricity on the wholesale market, but they are not robust enough for use in demand response control.
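The CVRMSE metric quoted above can be illustrated with a minimal sketch. It assumes the simplest common definition (RMSE divided by the mean of the observed values, expressed in percent); the abstract does not state which variant the authors used, and the function name and sample readings below are hypothetical.

```python
import math


def cvrmse(actual, predicted):
    """Return the CVRMSE in percent: 100 * RMSE / mean(actual).

    Assumption: the plain n-denominator form; some guidelines divide by
    (n - p) instead, which this sketch does not do.
    """
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(actual)
    # Root mean square error between observations and predictions.
    rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)
    mean_actual = sum(actual) / n
    return 100.0 * rmse / mean_actual


# Hypothetical hourly consumption readings (kWh) and day-ahead forecasts.
hourly_kwh = [100.0, 120.0, 80.0, 100.0]
forecast_kwh = [110.0, 110.0, 90.0, 90.0]
print(round(cvrmse(hourly_kwh, forecast_kwh), 2))
```

A lower value means the forecast error is small relative to the typical load, which is why the metric is convenient for comparing buildings of different sizes.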
An Improvement of Multi-Label Image Classification Method Based on Histogram of Oriented Gradient
Image Multi-label Classification (IMC) assigns a label or a set of labels to an image. The strong demand for image annotation and archiving on the web has led researchers to develop many algorithms for this application domain. Existing IMC techniques have two drawbacks: they do not take into account the description of the elementary characteristics of the image, nor the correlation between labels. In this paper, we present an algorithm (MIML-HOGLPP) that handles both limitations simultaneously. The algorithm uses the histogram of oriented gradients as its feature descriptor and applies the Label Priority Power-set as a multi-label transformation to address label correlation. The experiments show that MIML-HOGLPP performs better than two existing techniques on some of the evaluation metrics.
Application of Data Mining Techniques for Tourism Knowledge Discovery
Five implementations of three data mining classification techniques were applied to extract important insights from tourism data. The aim was to find the best-performing algorithm among those compared for tourism knowledge discovery. The knowledge discovery from data process was used as the process model, and 10-fold cross-validation was used for testing. Various data preprocessing activities were performed to obtain the final dataset for model building. Classification models of the selected algorithms were built under different scenarios on the preprocessed dataset. The best-performing algorithm on the tourism dataset was Random Forest (76%) before information-gain-based attribute selection and J48 (C4.5) (75%) after selecting the attributes most relevant to the class (target) attribute. In terms of model building time, attribute selection improved the efficiency of all algorithms; the Artificial Neural Network (multilayer perceptron) showed the largest improvement (90%). The rules extracted from the decision tree model are presented. They reveal intricate, non-trivial knowledge that simple statistical analysis would not otherwise discover, although the accuracy of the classification algorithms was mediocre.
Arabic Light Stemmer for Better Search Accuracy
Arabic is one of the most ancient and important languages in the world. It has more than 250 million native speakers, and more than twenty countries have Arabic as one of their official languages. In the past decade, we have witnessed a rapid evolution in smart devices, social networks, and the technology sector, which has created a need for tools and libraries that properly handle the Arabic language in different domains. Stemming is one of the most crucial linguistic fundamentals. It is used in many applications, especially in information extraction and text mining. The motivation behind this work is to enhance the Arabic light stemmer to serve the data mining industry and to make it available to the open source community. The presented implementation enhances the Arabic light stemmer with an improved algorithm that adds a new set of rules and patterns together with an adjusted procedure. The study shows a significant gain in search accuracy, with an average improvement of 10% over previous works.
Using Multi-Arm Bandits to Optimize Game Play Metrics and Effective Game Design
Game designers have the challenging task of building games that engage players to spend their time and money on the game. There are an infinite number of game variations and design choices, and it is hard to determine systematically which design choices will create positive experiences for players. In this work, we demonstrate how multi-arm bandits can be used to automatically explore game design variations to achieve improved player metrics. The advantage of multi-arm bandits is that they allow for continuous experimentation and variation, intrinsically converge to the best solution, and require no special infrastructure beyond the ability to deploy minor game variations to users for evaluation. A user study confirms that multi-arm bandits successfully determined the preferred game variation with the highest play time metrics and that they can be a useful technique in a game designer's toolkit.
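To make the bandit idea concrete, here is a minimal sketch of one common bandit strategy, epsilon-greedy. The abstract does not say which bandit algorithm the authors used, so the strategy, the class and function names, and the simulated play times are all assumptions for illustration only.

```python
import random


class EpsilonGreedyBandit:
    """Epsilon-greedy bandit: mostly exploit the best arm, sometimes explore."""

    def __init__(self, n_arms, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.counts = [0] * n_arms    # how often each game variation was shown
        self.values = [0.0] * n_arms  # running mean reward per variation
        self.rng = random.Random(seed)

    def select_arm(self):
        # With probability epsilon pick a random variation (exploration).
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))
        # Otherwise pick the variation with the best observed mean.
        return max(range(len(self.values)), key=self.values.__getitem__)

    def update(self, arm, reward):
        # Incremental update of the running mean for the chosen variation.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def simulate(mean_play_times, rounds=5000, seed=0):
    """Simulate players whose play time is noisy around a per-variation mean."""
    bandit = EpsilonGreedyBandit(len(mean_play_times), epsilon=0.1, seed=seed)
    noise = random.Random(seed + 1)
    for _ in range(rounds):
        arm = bandit.select_arm()
        reward = noise.gauss(mean_play_times[arm], 1.0)  # hypothetical minutes played
        bandit.update(arm, reward)
    return bandit


bandit = simulate([3.0, 5.0, 4.0])
print(bandit.counts)
```

In this simulation the variation with the highest mean play time ends up being served far more often than the others, while the small exploration rate keeps the other variations under occasional test.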
Predicting Groundwater Areas Using Data Mining Techniques: Groundwater in Jordan as Case Study
Data mining is the process of extracting useful or hidden information from a large database. The extracted information can be used to discover relationships among features, where data objects are grouped according to logical relationships, or to assign unseen objects to one of the predefined groups. In this paper, we investigate four well-known data mining algorithms for predicting groundwater areas in Jordan: Support Vector Machines (SVMs), Naïve Bayes (NB), K-Nearest Neighbor (kNN), and Classification Based on Association Rule (CBA). The experimental results indicate that SVMs outperformed the other algorithms in terms of classification accuracy, precision, and F1 on the groundwater datasets collected from the Jordanian Ministry of Water and Irrigation.
Evaluation of Ensemble Classifiers for Intrusion Detection
One of the major developments in machine learning in the past decade is the ensemble method, which builds a highly accurate classifier by combining many moderately accurate component classifiers. In this work, new ensemble classification methods are proposed, a homogeneous ensemble classifier using bagging and a heterogeneous ensemble classifier using arcing, and their performance is analyzed in terms of accuracy. The classifier ensemble is designed with Radial Basis Function (RBF) and Support Vector Machine (SVM) as base classifiers. The feasibility and benefits of the proposed approaches are demonstrated on standard intrusion detection datasets. The proposed approach consists of three main parts: a preprocessing phase, a classification phase, and a combining phase. A wide range of comparative experiments is conducted on the standard intrusion detection datasets. The performance of the proposed homogeneous and heterogeneous ensemble classifiers is compared with that of other standard homogeneous and heterogeneous ensemble methods. The standard homogeneous ensemble methods include error-correcting output codes (ECOC) and Dagging, and the standard heterogeneous ensemble methods include majority voting and stacking. The proposed ensemble methods significantly improve accuracy over the individual classifiers: the proposed bagged RBF and SVM perform significantly better than ECOC and Dagging, and the proposed hybrid RBF-SVM performs significantly better than voting and stacking. In addition, heterogeneous models give better results than homogeneous models on the standard intrusion detection datasets.
Business-Intelligence Mining of Large Decentralized Multimedia Datasets with a Distributed Multi-Agent System
The rapid generation of a high volume and broad variety of data from the application of new technologies poses challenges for the generation of business intelligence. Most organizations and business owners need to extract data from multiple sources and apply analytical methods to develop their business. As a result, today's decentralized data management environments rely on a distributed computing paradigm. Because data are stored in highly distributed systems, implementing distributed data-mining techniques is a challenge. The aim of such techniques is to gather knowledge from every domain and from all the datasets stemming from distributed resources. As agent technologies contribute significantly to managing the complexity of distributed systems, we consider them for next-generation data-mining processes. To demonstrate agent-based business intelligence operations, we use agent-oriented modeling techniques to develop a new artifact for mining massive datasets.
Predication Model for Leukemia Diseases Based on Data Mining Classification Algorithms with Best Accuracy
In recent years, there has been an explosion in the use of technology that helps to detect diseases. For example, DNA microarrays allow us, for the first time, to obtain a "global" view of the cell. They have great potential to provide accurate medical diagnoses and to help find the right treatment and cure for many diseases. Various classification algorithms can be applied to such microarray datasets to devise methods that predict the occurrence of leukemia. In this study, we compared the classification accuracy and response time of eleven decision tree methods and six rule classifier methods using five performance criteria. The experimental results show that Random Tree produces the best results among the tree classifiers and also takes the least time to build a model. Among the classification rule algorithms, the nearest-neighbor-like algorithm (NNge) is the best because of its high accuracy, and it also takes the least time to build a model.
Performance Comparison of ADTree and Naive Bayes Algorithms for Spam Filtering
Classification is an important data mining technique
and can be used for data filtering in artificial intelligence. Its
broad applicability to all kinds of data has led to its use
in nearly every field of modern life. Classification helps us
to group different items according to the features chosen
as interesting and useful. In this paper, we compare two
classification methods, Naive Bayes and ADTree, used to detect spam
e-mail. This choice is motivated by the fact that the Naive Bayes
algorithm is based on probability calculus while the ADTree algorithm is
based on decision trees. The parameter settings of these
classifiers maximize the true positive rate and
minimize the false positive rate. The experimental results present
classification accuracy and cost analysis with a view to choosing the
optimal classifier for spam detection. They point out the number of attributes to
obtain a tradeoff between number of them and the classification
Cost Sensitive Feature Selection in Decision-Theoretic Rough Set Models for Customer Churn Prediction: The Case of Telecommunication Sector Customers
The telecommunications sector in the global market is undergoing continuous change and development. In this sector, churn analysis techniques are commonly used to analyse why some customers terminate their service subscriptions prematurely. Customer churn is highly significant in this sector because it causes important business losses. Many companies conduct research to prevent these losses while increasing customer loyalty. Although a large quantity of accumulated data is available in this sector, its usefulness is limited by data quality and relevance. In this paper, a cost-sensitive feature selection framework is developed to obtain feature reducts for predicting customer churn. The framework is a cost-based, optional pre-processing stage that removes redundant features for churn management. In addition, this cost-based feature selection algorithm is applied at a telecommunication company in Turkey, and the results obtained with this algorithm are reported.
Data Mining Approach for Commercial Data Classification and Migration in Hybrid Storage Systems
Parallel hybrid storage systems consist of a hierarchy of storage devices that differ in data reading speed. The higher a device sits in the hierarchy, the faster its data reading speed. Thus, migrating the application's important data, which will be accessed in the near future, to the uppermost level reduces the application's I/O waiting time and hence its elapsed execution time. In this research, we implement a trace-driven, two-level parallel hybrid storage system prototype that consists of HDDs and SSDs. The prototype uses data mining techniques to classify the application's data in order to determine its near-future data accesses in parallel with its on-demand requests. The important data (i.e., the data that the application will access in the near future) are continuously migrated to the uppermost level of the hierarchy. Our simulation results show that our data migration approach, integrated with data mining techniques, reduces the application's elapsed execution time by at least 22% across a variety of traces.
A Relationship Extraction Method from Literary Fiction Considering Korean Linguistic Features
Knowledge of the relationships between characters can help readers understand the overall story or plot of a work of literary fiction. In this paper, we present a method for extracting specific relationships between characters from Korean literary fiction. Existing methods for extracting relationships between characters in text are generally statistical or computational methods based on the sentence distance between characters, and they do not consider Korean linguistic features. Furthermore, they have difficulty extracting directed relationships, such as one-sided love, because they consider only the weight of a relationship and not its direction. Therefore, to identify specific relationships between characters, we propose a statistical method that considers linguistic features such as syntactic patterns and speech verbs in Korean. The result of our method is represented as a weighted directed graph of the relationships between the characters. Furthermore, we expect that the proposed method can be applied to relationship analysis between characters in other content, such as movies or TV dramas.
Knowledge-Driven Decision Support System Based on Knowledge Warehouse and Data Mining by Improving Apriori Algorithm with Fuzzy Logic
In recent years, research on knowledge sources, decision support systems, data mining, and the process of knowledge discovery in databases has become increasingly important, and each of these areas is considered to affect the others. In this article, we merge information sources and knowledge sources to propose a knowledge-based system for management, based on storing and retrieving knowledge, in order to manage information and improve decision making and resources. We use data mining and the Apriori algorithm in the knowledge discovery process. One of the problems of the Apriori algorithm is that the user must specify the minimum support threshold. Imagine that a user wants to apply the Apriori algorithm to a database with millions of transactions. The user certainly does not have the necessary knowledge of all the transactions in that database and therefore cannot specify a suitable threshold. Our purpose in this article is to improve the Apriori algorithm. To achieve this goal, we use fuzzy logic to place the data in different clusters before applying the Apriori algorithm to the data in the database, and we also try to suggest the most suitable threshold to the user automatically.
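The sensitivity of Apriori to the user-chosen threshold can be seen in a minimal sketch of the classic (unmodified) algorithm. This is not the fuzzy-logic improvement the article proposes; the function name and the toy shopping baskets are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support.

    Support is the fraction of transactions that contain the itemset.
    """
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)
    # Level 1 candidates: every single item that occurs anywhere.
    current = {frozenset([item]) for basket in baskets for item in basket}
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates that reach the minimum support.
        level = {}
        for candidate in current:
            support = sum(1 for basket in baskets if candidate <= basket) / n
            if support >= min_support:
                level[candidate] = support
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Prune candidates that have an infrequent k-item subset.
        current = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(len(apriori(baskets, 0.5)), len(apriori(baskets, 0.75)))
```

Raising the threshold from 0.5 to 0.75 on these four baskets removes every multi-item pattern, which is the kind of sensitivity that motivates suggesting a threshold automatically.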
A Recognition Method for Spatio-Temporal Background in Korean Historical Novels
The most important elements of a novel are the characters, events, and background. The background represents the time, place, and situation in which characters appear, and it conveys events and atmosphere more realistically. If readers have proper knowledge of a novel's background, it can help them understand the atmosphere of the novel and choose a novel they want to read. In this paper, we target Korean historical novels because the spatio-temporal background plays an especially important role in historical novels among the genres of Korean novels. To the best of our knowledge, no previous study has addressed Korean novels. We build a Korean historical national dictionary. Our dictionary contains historical places and the temple names of kings over many generations, as well as currently existing spatial and temporal words in Korean history. We also present a method for recognizing the spatio-temporal background based on patterns of phrasal words in Korean sentences. Our rules use postpositions for spatial background recognition and temple names for temporal background recognition. Knowledge of the recognized background can help readers understand the flow of events and the atmosphere, and it can be used to visualize the elements of novels.
The Application of Data Mining Technology in Building Energy Consumption Data Analysis
Energy consumption data, in particular those involving
public buildings, are impacted by many factors: the building structure,
climate/environmental parameters, construction, system operating
condition, and user behavior patterns. Traditional methods for data
analysis are insufficient. This paper delves into the data mining
technology to determine its application in the analysis of building
energy consumption data including energy consumption prediction,
fault diagnosis, and optimal operation. Recent literature is reviewed
and summarized, the problems faced by data mining technology in the
area of energy consumption data analysis are enumerated, and research
points for future studies are given.
Questions Categorization in E-Learning Environment Using Data Mining Technique
Nowadays, education cannot be imagined without digital technologies, which broaden the horizons of teaching and learning processes. Several universities offer online courses. For evaluation purposes, e-examination systems are being widely adopted in academic environments, and multiple-choice tests are extremely popular. In the move from traditional examinations to e-examinations, Moodle is being used as a Learning Management System (LMS). Moodle logs every click that students make while attempting questions and navigating in an e-examination. Data mining has been applied in various domains, including retail sales and bioinformatics. In recent years, there has been increasing interest in the use of data mining in e-learning environments, where it has been applied to discover, extract, and evaluate parameters related to students' learning performance. The combination of data mining and e-learning is still in its infancy. Log data generated by students during online examinations can be used to discover knowledge with the help of data mining techniques. In web-based applications, the number of right and wrong answers in a test result is not sufficient to assess and evaluate a student's performance, so assessment techniques must be intelligent enough. If a student cannot answer the question asked by the instructor, an easier question can be asked; otherwise, a more difficult question on a similar topic can be posed. To do so, it is necessary to identify the difficulty level of the questions. The proposed work concentrates on this issue. Data mining techniques, specifically clustering, are used in this work. The method decides the difficulty level of each question and categorizes it as tough, easy, or moderate; the questions are later served to the desired students based on their performance. The proposed experiment categorizes the question set and also groups the students based on their performance in the examination. This will help the instructor to guide the students more specifically.
In short, mined knowledge helps to support, guide, facilitate, and enhance learning as a whole.
Design of Personal Job Recommendation Framework on Smartphone Platform
Recently, Job Recommender Systems have gained
much attention in industries since they solve the problem of
information overload on recruiting websites. We therefore
propose an Extended Personalized Job System that is capable of
providing appropriate jobs to job seekers and recommending
suitable information to them using data mining techniques
and a dynamic user profile. Companies, in turn, can also
interact with the system to publish and update job information.
The system supports various platforms, such as a
web application and an Android mobile application. In this paper, user
profiles, implicit user actions, user feedback, and clustering
techniques from the WEKA libraries were applied and implemented. In
addition, open source tools such as the Yii Web Application Framework,
the Bootstrap Front End Framework, and Android Mobile Technology
were also applied.
A Supervised Learning Data Mining Approach for Object Recognition and Classification in High Resolution Satellite Data
Advances in spatial and spectral resolution of satellite
images have led to tremendous growth in large image databases. The
data we acquire through satellites, radars, and sensors consists of
important geographical information that can be used for remote
sensing applications such as region planning and disaster management.
Spatial data classification and object recognition are important tasks
for many applications. However, classifying objects and identifying
them manually from images is a difficult task. Object recognition is
often considered a classification problem, so this task can be
performed using machine-learning techniques. Although many
machine-learning algorithms exist, the classification is done using
supervised classifiers such as Support Vector Machines (SVM) because the
area of interest is known. We proposed a classification method,
which considers neighboring pixels in a region for feature extraction
and it evaluates classifications precisely according to neighboring
classes for semantic interpretation of region of interest (ROI). A
dataset has been created for training and testing purpose; we
generated the attributes by considering pixel intensity values and
mean values of reflectance. We demonstrated the benefits of using
knowledge discovery and data-mining techniques, which can be applied to
image data for accurate information extraction and classification from
high spatial resolution remote sensing imagery.
Mining Multicity Urban Data for Sustainable Population Relocation
In this research, we propose to conduct diagnostic and
predictive analysis about the key factors and consequences of urban
population relocation. To achieve this goal, urban simulation models
extract the urban development trends as land use change patterns from
a variety of data sources. The results are treated as part of urban big
data with other information such as population change and economic
conditions. Multiple data mining methods are deployed on this data to
analyze nonlinear relationships between parameters. The result
determines the driving force of population relocation with respect to
urban sprawl and urban sustainability and their related parameters.
This work sets the stage for developing a comprehensive urban
simulation model for catering to specific questions by targeted users. It
contributes towards achieving sustainability as a whole.
Multimedia Data Fusion for Event Detection in Twitter by Using Dempster-Shafer Evidence Theory
Data fusion technology can be the best way to extract
useful information from multiple sources of data. It has been widely
applied in various applications. This paper presents a data fusion
approach for multimedia data for event detection in Twitter by using
Dempster-Shafer evidence theory. The methodology applies a mining
algorithm to detect the event. There are two types of data in the
fusion. The first is the features extracted from text by using the bag-of-words
method, calculated using the term frequency-inverse
document frequency (TF-IDF). The second is the visual features
extracted by applying scale-invariant feature transform (SIFT). The
Dempster-Shafer theory of evidence is applied in order to fuse the
information from these two sources. Our experiments indicate
that, compared with approaches using an individual data source, the
proposed data fusion approach can increase the prediction accuracy
for event detection. The experimental results showed that the proposed
method achieved a high accuracy of 0.97, compared with 0.93 with
texts only and 0.86 with images only.
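The fusion step can be illustrated with a minimal sketch of Dempster's rule of combination for two sources. The mass values below are invented for illustration and are not taken from the paper; the function name and the two-hypothesis frame ("event" / "no_event") are likewise assumptions.

```python
def combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of hypotheses to its belief mass.
    Mass assigned to contradictory pairs (empty intersection) is treated as
    conflict and removed by normalization.
    """
    fused = {}
    conflict = 0.0
    for a, mass_a in m1.items():
        for b, mass_b in m2.items():
            common = a & b
            if common:
                fused[common] = fused.get(common, 0.0) + mass_a * mass_b
            else:
                conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; rule is undefined")
    return {h: mass / (1.0 - conflict) for h, mass in fused.items()}


EVENT = frozenset({"event"})
NO_EVENT = frozenset({"no_event"})
EITHER = EVENT | NO_EVENT  # mass left on the whole frame expresses ignorance

# Hypothetical evidence from the text features and from the image features.
text_mass = {EVENT: 0.6, NO_EVENT: 0.1, EITHER: 0.3}
image_mass = {EVENT: 0.5, NO_EVENT: 0.2, EITHER: 0.3}

fused = combine(text_mass, image_mass)
print(round(fused[EVENT], 3))
```

With these made-up masses, the fused belief in "event" is higher than either source gives on its own, because the two sources agree and the conflicting mass is normalized away.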
Using Data Mining Technique for Scholarship Disbursement
This work is on decision tree-based classification for
the disbursement of scholarships. A tree-based data mining
classification technique is used in order to determine the generic rule
to be used to disburse the scholarship. Based on the
rules defined from the tree, the system is able to determine the class (status) to
which an applicant belongs, Granted or Not Granted.
Applicants that fall in the Granted class have successfully
obtained the scholarship, while those in the Not Granted class are
unsuccessful in the scheme. An algorithm that can be used to classify
the applicants based on the rules from tree-based classification was
also developed. Tree-based classification is adopted because of its
efficiency, effectiveness, and ease of comprehension. The
system was tested with the data of National Information Technology
Development Agency (NITDA) Abuja, a Parastatal of Federal
Ministry of Communication Technology that is mandated to develop
and regulate information technology in Nigeria. The system was
found working according to the specification. It is therefore
recommended for all scholarship disbursement organizations.
An Automatic Bayesian Classification System for File Format Selection
This paper presents an approach for the classification of
an unstructured format description for identification of file formats.
The main contribution of this work is the employment of data mining
techniques to support file format selection with just the unstructured
text description that comprises the most important format features for
a particular organisation. Subsequently, the file format identification
method employs a file format classifier and associated configurations to
support digital preservation experts with an estimation of the required file
format. Our goal is to make use of a format specification knowledge
base aggregated from different Web sources in order to select a file
format for a particular institution. Using the naive Bayes method,
the decision support system recommends to an expert the file format
for their institution. The proposed methods facilitate the selection of
file format and the quality of a digital preservation process. The
presented approach is meant to facilitate decision making for the
preservation of digital content in libraries and archives using domain
expert knowledge and specifications of file formats. To facilitate
decision-making, the aggregated information about the file formats is
presented as a file format vocabulary that comprises most common
terms that are characteristic for all researched formats. The goal is to
suggest a particular file format based on this vocabulary for analysis
by an expert. The sample file format calculation and the calculation
results including probabilities are presented in the evaluation section.
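To make the naive Bayes step concrete, the following is a minimal sketch of a multinomial naive Bayes classifier with Laplace smoothing over a word vocabulary. The three format descriptions, the format labels chosen, and the query text are hypothetical examples, not the knowledge base or vocabulary used in the paper.

```python
from collections import Counter, defaultdict
from math import exp, log


def train(samples):
    """Count words per format label; samples is a list of (text, label)."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocabulary = set()
    for text, label in samples:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocabulary.update(words)
    return word_counts, label_counts, vocabulary


def posteriors(model, text):
    """Return a normalized probability per label for a text description."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    log_scores = {}
    for label in label_counts:
        score = log(label_counts[label] / total_docs)  # class prior
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for word in text.lower().split():
            if word in vocabulary:  # words outside the vocabulary are ignored
                score += log((word_counts[label][word] + 1) / denom)
        log_scores[label] = score
    # Convert log scores to probabilities in a numerically stable way.
    top = max(log_scores.values())
    scaled = {label: exp(s - top) for label, s in log_scores.items()}
    norm = sum(scaled.values())
    return {label: v / norm for label, v in scaled.items()}


# Hypothetical format descriptions (assumed for illustration only).
samples = [
    ("lossless open standard long-term raster images", "TIFF"),
    ("lossy compressed small web images", "JPEG"),
    ("open standard documents embedded fonts long-term", "PDF/A"),
]
model = train(samples)
probs = posteriors(model, "open standard lossless images")
recommended = max(probs, key=probs.get)
```

The dictionary `probs` plays the role of the per-format probabilities that an expert would review before accepting the recommendation.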
Spatio-Temporal Data Mining with Association Rules for Lake Van
Throughout history, people have made estimates
and inferences about the future by using their past experiences.
Developing information technologies and improvements in
database management systems make it possible to extract useful
information from the knowledge in hand for strategic decisions.
Therefore, different methods have been developed. Data mining by
association rule learning is one such method. The Apriori algorithm,
one of the well-known association rule learning algorithms, is not
commonly used on spatio-temporal data sets. However, it is possible
to embed time and space features into the data sets and make the Apriori
algorithm a suitable data mining technique for learning spatio-temporal
association rules. Lake Van, the largest lake in Turkey, is a
closed basin. This feature causes the volume of the lake to increase or
decrease as a result of changes in the amount of water it holds. In this study,
evaporation, humidity, lake altitude, amount of rainfall and
temperature parameters recorded in the Lake Van region over the
years are used by the Apriori algorithm, and a spatio-temporal data
mining application is developed to identify overflows and newly formed
soil regions (underflows) occurring in the coastal parts of
Lake Van. Identifying possible reasons for overflows and underflows
may be used to alert the experts to take precautions and make the
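
The idea of embedding time and space features into the data set can be sketched as follows: each record becomes a transaction whose items include a time item and a space item alongside the discretized measurements, and Apriori is then run unchanged. The item names (`season=...`, `zone=...`, `rainfall=...`, `event=...`), the five records, and the thresholds are hypothetical and are not taken from the Lake Van data.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return a dict mapping each frequent itemset to its support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    singles = {frozenset([item]) for t in transactions for item in t}
    current = {s for s in singles if support(s) >= min_support}
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


def association_rules(frequent, min_confidence):
    """Derive (antecedent, consequent, confidence) rules from itemsets."""
    rules = []
    for itemset, sup in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(itemset, size):
                a = frozenset(antecedent)
                confidence = sup / frequent[a]
                if confidence >= min_confidence:
                    rules.append((a, itemset - a, confidence))
    return rules


# Hypothetical records with embedded time (season) and space (zone) items.
records = [
    {"season=spring", "zone=north", "rainfall=high", "event=overflow"},
    {"season=spring", "zone=north", "rainfall=high", "event=overflow"},
    {"season=summer", "zone=north", "rainfall=low", "event=underflow"},
    {"season=spring", "zone=south", "rainfall=high", "event=overflow"},
    {"season=summer", "zone=south", "rainfall=low", "event=none"},
]
frequent = apriori(records, min_support=0.6)
rules = association_rules(frequent, min_confidence=0.9)
```

Because the season and zone are ordinary items, a mined rule such as one linking a season and a rainfall level to an overflow event carries its temporal and spatial context with it.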
Comprehensive Analysis of Data Mining Tools
Due to fast and flawless technological innovation,
a tremendous amount of data is accumulating all over the world in
every domain, such as Pattern Recognition, Machine Learning, Spatial
Data Mining, Image Analysis, Fraudulent Analysis, and the World Wide
Web. This makes it essential to develop several
tools for data mining functionalities. The major aim of this paper is to
analyze various tools which are used to build a resourceful analytical
or descriptive model for handling large amounts of information in a more
efficient and user-friendly way. In this survey, the diverse tools are
illustrated with their extensive technical paradigms, outstanding
graphical interfaces and built-in multipath algorithms, which are
very useful for handling significant amounts of data.
MCOKE: Multi-Cluster Overlapping K-Means Extension Algorithm
Clustering involves the partitioning of n objects into k
clusters. Many clustering algorithms use hard-partitioning techniques
where each object is assigned to exactly one cluster. In this paper we propose
an overlapping algorithm, MCOKE, which allows objects to belong to
one or more clusters. The algorithm is different from fuzzy clustering
techniques because objects that overlap are assigned a membership
value of 1 (one) as opposed to a fuzzy membership degree. The
algorithm is also different from other overlapping algorithms that
require a similarity threshold to be defined a priori, which can be
difficult for novice users to determine.
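The difference between a crisp overlapping membership and a fuzzy degree can be sketched as follows. The abstract does not spell out how the overlap is determined, so this sketch makes an explicit assumption: after a standard k-means run, the threshold is taken from the data as the largest distance between any object and its own centroid, and an object receives membership 1 in every cluster whose centroid lies within that distance. The points and initial centroids are hypothetical, and the code should be read as an illustration rather than as the MCOKE algorithm itself.

```python
from math import dist  # Python 3.8+


def kmeans(points, centroids, iterations=20):
    """Plain k-means (Lloyd's algorithm) with fixed initial centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[j]  # keep an empty cluster's centroid
            for j, members in enumerate(clusters)
        ]
    return centroids


def overlapping_membership(points, centroids):
    """Binary membership matrix using a data-derived distance threshold."""
    # Assumed threshold: largest distance of any object to its nearest centroid.
    max_dist = max(min(dist(p, c) for c in centroids) for p in points)
    membership = [
        [1 if dist(p, c) <= max_dist else 0 for c in centroids]
        for p in points
    ]
    return membership, max_dist


# Hypothetical 2-D objects forming two groups with two boundary objects.
points = [(0, 7), (0, -7), (4, 0), (-4, 0), (6, 0), (10, 0), (14, 0)]
centroids = kmeans(points, [(-4, 0), (14, 0)])
membership, max_dist = overlapping_membership(points, centroids)
```

Each row of `membership` contains only 0s and 1s, so an object in the overlap belongs fully to both clusters, and no threshold has to be supplied by the user.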