Data Mining and Knowledge Discovery - Science topic

This is an ongoing research project.
Questions related to Data Mining and Knowledge Discovery
  • asked a question related to Data Mining and Knowledge Discovery
Question
11 answers
I am looking for tools that can be used to extract and mine Facebook data.
I learned about a tool called Netvizz, but many people say it has been discontinued.
Does anyone have ideas that could help?
Relevant answer
Answer
Yes, Netvizz has been discontinued due to changes in Facebook's API policies. However, there are still several tools available that can be used to extract and mine Facebook data. Here are some alternatives:
  1. Facebook Graph API: The Facebook Graph API is a powerful tool for accessing Facebook data. It allows developers to access public Facebook data and also provides limited access to private data through user authentication. However, it requires programming skills to use effectively.
  2. Socialbakers: Socialbakers is a social media analytics tool that provides insights into Facebook and other social media platforms. It offers a range of features, including audience analysis, content analysis, and campaign tracking.
  3. Brandwatch: Brandwatch is a social media monitoring tool that can be used to extract and analyze data from Facebook and other social media platforms. It offers features such as sentiment analysis, influencer identification, and crisis management.
  4. Keyhole: Keyhole is a social media monitoring tool that can be used to track hashtags and keywords on Facebook and other social media platforms. It offers features such as real-time tracking, sentiment analysis, and influencer identification.
  5. Quintly: Quintly is a social media analytics tool that provides insights into Facebook and other social media platforms. It offers features such as audience analysis, content analysis, and campaign tracking.
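For readers who take the Graph API route, a request is an HTTPS GET against a versioned endpoint. The following is a minimal sketch rather than a complete client: the page ID, access token and version string are placeholders you would supply yourself, and which fields you can actually read depends on the permissions Facebook grants your app.

```python
from urllib.parse import urlencode

GRAPH_HOST = "https://graph.facebook.com"


def build_graph_url(node, edge, fields, access_token, version="v19.0"):
    """Build a Graph API GET URL: /{version}/{node}/{edge}?fields=...

    The version default is an assumption; check which versions are
    currently supported before using it.
    """
    query = urlencode({"fields": ",".join(fields), "access_token": access_token})
    return f"{GRAPH_HOST}/{version}/{node}/{edge}?{query}"


# PAGE_ID and ACCESS_TOKEN are placeholders, not real credentials.
url = build_graph_url("PAGE_ID", "posts", ["message", "created_time"], "ACCESS_TOKEN")
print(url)
```

The URL can then be fetched with any HTTP client; the response is JSON and is paginated, so a real collector would also follow the paging links it returns.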
  • asked a question related to Data Mining and Knowledge Discovery
Question
3 answers
I work in the field of data discovery, so I am learning how to collect data through the APIs of social apps.
Relevant answer
Answer
Lee Mccallum Thank you!!!!!
Sales Aribe Jr. Thank you!!!!!
  • asked a question related to Data Mining and Knowledge Discovery
Question
13 answers
Weka
Relevant answer
Answer
Neural Designer
  • asked a question related to Data Mining and Knowledge Discovery
Question
2 answers
I'm quite new to GMDH, and based on my first reading about this technique I would like to know more. Here are some of the benefits of using the GMDH approach:
1. The optimal complexity of the model structure is found, adequate to the level of noise in the data sample. For real problems with noisy or short data, simplified forecasting models are more accurate.
2. The number of layers and of neurons in hidden layers, the model structure and other optimal NN parameters are determined automatically.
3. It guarantees that the most accurate or unbiased models will be found: the method does not miss the best solution while sorting through all variants (in a given class of functions).
4. Any non-linear functions or features that can influence the output variable may be used as input variables.
5. It automatically finds interpretable relationships in data and selects effective input variables.
6. GMDH sorting algorithms are rather simple to program.
7. TMNN neural nets are used to increase the accuracy of other modelling algorithms.
8. The method uses information directly from the data sample and minimizes the influence of a priori author assumptions about the results of modelling.
9. The approach makes it possible to find an unbiased physical model of the object (law or clusterization), one and the same for future samples.
It seems that items 1, 2, 6 and 7 are really interesting and can be extended to ANNs.
Any suggestion or experience from others?
Relevant answer
Answer
It is not very accurate.
Use ANFIS-PSO instead.
  • asked a question related to Data Mining and Knowledge Discovery
Question
7 answers
I would like to dive into the research domain of explainable AI. What are some of the recent trending methodologies in this domain? What can be a good start to dive into this field?
Relevant answer
Answer
These papers will help you:
1: Visual Analytics in Deep Learning: An Interrogative Survey for the Next Frontiers: https://arxiv.org/pdf/1801.06889.pdf
2: Visual Analytics for Explainable Deep Learning:
3: CNN EXPLAINER: Learning Convolutional Neural Networks with Interactive Visualization :
  • asked a question related to Data Mining and Knowledge Discovery
Question
3 answers
Genomic data privacy is essential when sharing genomic data with the public. How can the privacy of genomic data be protected? Which anonymization models are useless for preserving the privacy of genomic data? Which model is recommended for preserving privacy?
Relevant answer
Answer
I suppose a case could be argued both for and against anonymisation, not only in its value to the individual, but also in its actual effectiveness in protecting individual privacy. At the end of the day, our DNA information itself is a FAR more precise identifier than any other data that is classed as an "identifier". Hence, the need for genomic privacy is more acute than ever...
  • asked a question related to Data Mining and Knowledge Discovery
Question
1 answer
In my observed data I have 12 variables, say X1, X2, ..., X12. I want to know how one variable influences the value of another.
Is this interdependence measure directly related to the number of observed tuples?
Relevant answer
Answer
I think your question is whether the variables are correlated.
Search for "correlation measures" and see if that is what you want.
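As an illustration of one such measure, here is a minimal sketch of the Pearson correlation coefficient computed for every pair of variables. The variable names and values are made up for the example. The coefficient itself is normalised to the range -1 to 1 regardless of how many tuples there are, but with few tuples the estimate is noisy, so more observations make it more trustworthy. Note also that Pearson only captures linear dependence.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # a constant variable has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Map every pair of variable names to their Pearson correlation."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Toy data: three variables observed over five tuples (illustrative only).
data = {
    "X1": [1, 2, 3, 4, 5],
    "X2": [2, 4, 6, 8, 10],
    "X3": [5, 4, 3, 2, 1],
}
corr = correlation_matrix(data)
for name, row in corr.items():
    print(name, {other: round(value, 3) for other, value in row.items()})
```

With 12 variables the same function yields a 12 x 12 symmetric matrix; values near 1 or -1 point to strong linear relationships worth inspecting further.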
  • asked a question related to Data Mining and Knowledge Discovery
Question
8 answers
What are the ways to map a graph from a relational space to a Euclidean space with lower time complexity? Although there are some solutions (such as signal processing and spectral methods), they have a high time complexity.
Relevant answer
Answer
Dear Kamal,
maybe node2vec can be useful for your application: https://snap.stanford.edu/node2vec/
Kind regards,
Djordje
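To give a feel for why walk-based embeddings are cheaper than spectral methods, here is a minimal sketch of the walk-generation stage only. It uses uniform random walks, which is a simplification: node2vec proper biases each step with its return and in-out parameters, and then feeds the walks to a skip-gram model to obtain the vectors. The adjacency list below is a made-up toy graph.

```python
import random


def random_walks(adj, walk_length, walks_per_node, seed=0):
    """Generate uniform random walks over an adjacency-list graph.

    adj maps each node to a list of its neighbours. The cost is
    proportional to (number of nodes * walks_per_node * walk_length).
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        nodes = list(adj)
        rng.shuffle(nodes)  # visit start nodes in a different order each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbours = adj[walk[-1]]
                if not neighbours:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks


# Toy undirected graph (illustrative only).
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
walks = random_walks(adj, walk_length=5, walks_per_node=2)
for walk in walks:
    print(" -> ".join(walk))
```

Each walk plays the role of a sentence whose words are node IDs, so nodes that co-occur in walks end up close together in the learned Euclidean space.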
  • asked a question related to Data Mining and Knowledge Discovery
Question
45 answers
Recently, several works have been published on predictive analytics:
Besides, there is a paper on how to discover a process model using neural networks:
My questions for this discussion are:
  • It seems that the field for machine learning approaches in process mining is not limited to predictions/discovery. Can we formulate the areas of possible applications?
  • Can we use process mining techniques in machine learning? Can we, for example, mine how neural networks learn (in order to better understand their predictions)?
  • If you believe that the subjects are completely incompatible, then, please, share your argument. Why do you think so?
  • Finally, please, share known papers in which: process mining (PM) is applied in machine learning (ML) research, ML is applied in PM research, both PM and ML are applied to solve a problem. I believe, this will be useful for any reader of this discussion.
Relevant answer
Answer
There are actually quite a lot of nice applications of machine learning techniques in the context of business process variant analysis, which is a fairly large subset of the process mining literature.
For example, Folino, Cuzzocrea et al. have done a series of studies on variant analysis (or deviance mining) using various machine learning methods, including ensemble learning and clustering:
We recently conducted a literature survey of methods in the field of variant analysis, many of them based on machine learning techniques:
Related to the above, there is work on Bayesian networks for delay analysis (explanatory rather than predictive):
The above is related to variant analysis and performance mining. But there is also work on anomaly detection in event logs using Bayesian networks:
And using deep learning architectures:
As well as using deep learning models to compute alignments in order to correct anomalies:
And a bit related to the above, there was quite a bit of research on using trace clustering in the context of automated process discovery (e.g. Jochen De Weerdt)
So we can say that process mining and machine learning go well together. One should not forget though that BPM and process mining are application-oriented disciplines - their objective is to design approaches to improve business processes. Whereas machine learning is a horizontal discipline, it seeks to develop methods that can be adapted to a broad range of problems/settings. Process mining has tapped a lot into machine learning, but sure it has a lot more to exploit from it.
  • asked a question related to Data Mining and Knowledge Discovery
Question
8 answers
I want to convert an unweighted graph to a weighted one in order to solve the link prediction problem. Is considering the similarity between nodes the best way to convert an unweighted graph to a weighted graph?
Relevant answer
Answer
Not necessarily. It totally depends on your application. If your dataset comes from online social networks and you want to model the relation strength among individuals, you could also consider the degree of intimacy, trustworthiness, and influence among individuals. Check out these papers:
For the concept of weight on multiplex networks:
Also take a look at Granovetter's paper, since it is possibly the first paper that defined the concept of weak and strong ties in social networks and modeled them as a set of nodes and links: https://sociology.stanford.edu/sites/g/files/sbiybj9501/f/publications/the_strength_of_weak_ties_and_exch_w-gans.pdf
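For the similarity-based option raised in the question, here is a minimal sketch that weights each existing edge by the Jaccard coefficient of its endpoints' neighbourhoods. Jaccard is just one common choice, picked here for illustration; the edge list is a toy example, and application-specific signals such as trust or influence would need data beyond the graph structure.

```python
def jaccard_weights(edges):
    """Weight each undirected edge (u, v) by |N(u) & N(v)| / |N(u) | N(v)|."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    weights = {}
    for u, v in edges:
        common = neighbours[u] & neighbours[v]
        union = neighbours[u] | neighbours[v]  # never empty: u and v are neighbours
        weights[(u, v)] = len(common) / len(union)
    return weights


# Toy unweighted graph given as an edge list (illustrative only).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
weights = jaccard_weights(edges)
for edge, weight in weights.items():
    print(edge, round(weight, 3))
```

An edge whose endpoints share no other neighbours, such as ("c", "d") here, gets weight 0, so in practice you may want to add a small constant or combine the score with other features.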
  • asked a question related to Data Mining and Knowledge Discovery
Question
8 answers
Hi Folks,
I need your help regarding the artificial intelligence context of information retrieval tools, Big Data and data mining in libraries. Can you share any dissertations or theses, research papers, conference papers, book chapters, research projects or articles with me? I would also welcome your comments, thoughts and feedback in the context of university libraries, to support me in designing my PhD questionnaire.
-Yousuf
Relevant answer
Answer
Dear Colleagues and Friends from RG,
In my opinion, in the coming years, one of the key applications of artificial intelligence integrated with other Industry 4.0 technologies, including Big Data Analytics, will be improvement of information search on the Internet.
Conducted scientific research confirms the strong correlation between the development of Big Data technology, Data Science analytics, Data Analytics and the effectiveness of the use of knowledge resources. I believe that the development of Big Data technology and Data Science analytics, Data Analytics and other ICT information technologies, multi-criteria technology, advanced processing of large information sets, and Industry 4.0 technology increases the efficiency of using knowledge resources, including in the field of economics, finance and organization management. In recent years, ICT information technologies, Industry 4.0 etc. have been developing dynamically and are used in knowledge-based economies in particular. These technologies are used in scientific research and business applications in commercial enterprises and in financial and public institutions. Due to the growing importance of this issue in knowledge-based economies, an important issue is the analysis of the correlation between the development of Big Data technology and Data Science analytics, Data Analytics, Business Intelligence and the effectiveness of using knowledge resources to solve key problems of civilization development. The use of Big Data, Data Science, Data Analytics, Business Intelligence and other ICT information technologies as well as advanced data processing Industry 4.0 in the processing of knowledge resources should contribute to increasing the efficiency of knowledge resource processing in knowledge-based economies, including in the field of economics and finance.
In recent years, the scope of applications of Big Data technology and Data Science analytics, Data Analytics in economics, finance and management of organizations, including enterprises, financial and public institutions, has been increasing. Therefore, the importance of implementing analytical instruments for advanced processing of large data sets in enterprises, financial and public institutions, i.e. the construction of Big Data Analytics platforms to support organization management processes in various aspects of operations, including the improvement of customer relations, is also growing.
The use of 5G technology to collect data from the Internet can significantly contribute to improving sentiment analysis of Internet users' opinions and to extending the use of research techniques carried out on Business Intelligence, Big Data Analytics and Data Science platforms, as well as other research techniques using ICT, the Internet and advanced data processing typical of the current fourth technological revolution referred to as Industry 4.0.
In recent years, organization management processes have been improved through the implementation of information technology and advanced data processing technologies Industry 4.0 into the IT analytical platforms Business Intelligence, Big Data Analytics, etc. The technologies of advanced analysis of big data sets Big Data Analytics and research processes carried out on Business Intelligence platforms are used also to improve business management processes. Data collection processes on the Internet can be supported by the use of 5G technology. In recent years, information technology management models in organizations have been enriched with advanced 4.0 industry data processing technologies, including cloud computing, Internet of Things, artificial intelligence, machine learning and more. The use of information systems in built models of information technology management in organizations, etc. is currently taking place in many areas of functioning of various types of business entities. The use of ICT information technologies and advanced data processing technologies i.e. typical for the current technological revolution Industry 4.0 already covers almost the entire functioning of business entities, from computerized sales support systems, logistics, accounting, reporting, risk management to marketing activities on the Internet and designing new products and innovative solutions in information systems. Online banking is starting to dominate, whose development is determined by technological progress in the field of ICT and Industry 4.0 information technologies. Computerization is also increasingly affecting public sector institutions servicing tax systems and settlements of business entities. Business Intelligence analytical platforms have also been developed for several years in the SME sector. Business Intelligence systems supporting analytical processes and organization management are produced by IT companies not only for large corporations. 
The analysis of large information sets in Big Data databases is also developing. Big Data Analytics and Data Science analytical systems are used by more and more types of business entities to analyze both the markets in which they operate and complex processes that are conducted or diagnosed and researched in these enterprises. Computerization also covers financial and economic risk management processes, etc. In all these areas of ICT technology application, building and improving IT technology management models in organizations is also an important issue. Therefore, specific information technology management models should be tailored to the specifics of the operations of a particular business entity, enterprise, company, corporation, public institution or financial institution.
On the other hand, the collection of large data sets about users of specific websites and portals in Big Data database systems generates new categories of information security risk. The database of a social media portal such as Facebook is already a powerful collection of information. Some research centers specializing in the use of large Big Data sets downloaded from social media portals prepare, through sentiment analysis, reports that can be helpful in forecasting future phenomena and processes. Medicine is one of the areas where there are great opportunities in this matter. For example, insurance companies and commercial banks that grant loans may be interested in information posted by users on Facebook and possibly also on other social media sites. Apparently, when analyzing an application for insurance or credit, some insurance companies and commercial banks look at the content of the accounts and profiles that applicants, potential clients and contractors maintain on social media portals.
Another area of application of analytics carried out on large data sets collected in Big Data database systems is sentiment analysis in the field of surveying the opinions of Internet users regarding specific products and/or services and the companies producing them. Large amounts of information downloaded from comments, entries and posts on social media portals are processed in Big Data database systems to determine, e.g., consumer awareness regarding the offer of products and services of specific companies. This type of information is of great importance for the purpose of planning advertising campaigns informing about the mission, idea, product offer and usability features of a given company's offer. This type of data may be relevant to forecasting changing consumer preferences for specific companies' offers. Techniques for collecting analytical data on the Internet can be supported by the use of 5G technology.
I am also involved in research on knowledge management using the Big Data computerized database platforms. In my publications available on the Research Gate portal, I described the key determinants of the development of Big Data technology and the security of information obtained from the Internet, collected and processed in Big Data databases. I also described the development of analytics using Business Intelligence platforms that are used in enterprises. Business Intelligence based analytics, as well as Data Science and Big Data Analytics are increasingly being used to improve business management processes. The development of this analytics based on the implementation of ICT and Industry 4.0 information technologies into analytical processes has a great future ahead of it in the coming years. I invite you to cooperation.
One of the areas in which the possibilities of market analytical technology applications, including data downloaded from internet portals, are growing is the marketing of enterprises and institutions. In recent years, the development of marketing has been determined by the development of Industry 4.0 technology and the development of open innovations on the Internet. Open innovations developed on the Internet concern, among others, free information and marketing services. The possibility of publishing specific content, texts, banners, comments etc. on the Internet and of obtaining free information is a key determinant of the development of information services on the Internet. On the other hand, the largest internet technology corporations earn income mainly from paid marketing services. Therefore, the Internet environment is a kind of mix of free and paid information and marketing services, which are developed simultaneously and interdependently by various Internet companies. Currently, research is conducted into the development of open innovations in the field of free information services, which are the main factor of business success of the largest online technology companies, which include such concerns as Google and social media portals such as Facebook, Instagram, YouTube, Twitter, LinkedIn and others.
The development of internet information services will be determined by technological progress in the field of new ICT, communication technologies and advanced data processing techniques typical of the current technological revolution referred to as Industry 4.0. The development of information processing technology in the era of the current technological revolution called Industry 4.0 is determined by the use of new information techniques, for example in the field of e-commerce and e-marketing. These solutions are the basis for the business success of the largest online technology concerns that offer information search, data collection and processing services in the cloud (e.g. Google) and provide information services on platforms developed in social media portals (e.g. Facebook, Instagram, YouTube, Twitter, LinkedIn, Pinterest, and more).
The current technological revolution referred to as Industry 4.0 is motivated by the development of the following factors: Big Data database technologies, cloud computing, machine learning, Internet of Things, artificial intelligence, Business Intelligence and other advanced technologies of Data Mining.
The information technologies mentioned above, combined with the improvement of ICT and communication technologies, along with the progressive increase in the computing power of computers, will become an important determinant of technological progress in various branches of industry in the coming years. Based on the development of these new technological solutions, the processes of innovatively organized analyses of large information collections gathered in Big Data database systems and in cloud computing, for the purposes of applications in such fields as machine learning, Internet of Things, artificial intelligence and Business Intelligence, have been developing dynamically in recent years. To this can be added other areas of advanced technologies for analyzing large data sets, such as Medical Intelligence, Life Science, Green Energy, etc. Processing and multi-criteria analysis of large data sets in Big Data database systems is performed according to the V4 concept, i.e. Volume (meaning a large amount of data), Value (large values of specific parameters of the information analyzed), Velocity (high speed of new information appearing) and Variety (high information diversity). The above-mentioned advanced technologies for processing and analyzing information are increasingly used for the needs of marketing activities of various business entities that advertise their offer on the Internet or analyze the needs in this regard reported by other entities, including companies, corporations, financial and public institutions. More and more commercially operating business entities and financial institutions conduct marketing activities on the Internet, including on social media portals. The possibilities of collecting market data on the Internet in subsequent years can be significantly expanded by using 5G technology.
The information and communication technologies listed above, combined with the improvement of ICT technologies and the implementation of Business Intelligence analytics into the processes of economic and financial, economic, macroeconomic and market analyzes may be instrumental instruments helpful in the efficient and effective management of economic, investment processes and enterprises, including analyzes carried out for the purposes of improving marketing activities in enterprises. More and more companies, banks and other entities need to carry out multi-criteria analyzes on large data sets downloaded from the Internet describing the markets in which they operate and contractors and clients with whom they cooperate. On the other hand, there are already specialized technology companies that offer this type of analytical services, prepare commissioned reports, which are the result of such multi-criteria analyzes of large data sets obtained from various websites and from entries and comments contained on social media portals. An important research technique that has been developing in recent years, the effects of which are used for the purposes of marketing activities of companies, is sentiment analysis carried out on large data sets collected from the Internet and stored in Big Data database systems.
In order to group the behavior of social media users into specific classes of behavior, these classes must first be defined. Sentiment analysis using large data sets collected from entries and comments from social media portals and transferred to Big Data database platforms can be helpful. Then, when observing the changes in certain types of behavior of users of social media portals, you can analyze the data collected in Big Data according to these observations. In addition, a useful tool can be an analysis of the behavior of users of social media portals based on current posts, entries and comments on specific social media pages, statistical analysis of comments on specific topics of posts. This type of research is carried out by online technology companies that run social media portals and use the results of these studies to develop their viral marketing services, because this field of marketing is a key determinant of revenue generated by these companies from advertising sales on social media portals. The basis of marketing activities conducted in this way are market research conducted by collecting market data from the Internet regarding the offer of individual companies, their competition, demand for specific products and services from Internet users as well as collecting, processing and analyzing this data in Big Data Analytics database and analytical systems. The process of collecting market data from specific websites can be improved by using 5G technology.
Industry 4.0 technologies are also used in the development of transaction systems and transaction security in the field of e-commerce and online banking. The key determinants of the globally developing e-commerce relate primarily to the implementation of ICT information technologies and advanced data processing technologies, i.e. Industry 4.0 technologies typical of the current technological revolution, in computerized, automated transaction systems supporting online trading. In addition, blockchain technology is used for transaction security systems and data transfer on the Internet. The use of ICT information technologies and Industry 4.0 advanced data processing technologies in online transaction systems supporting e-commerce already applies to almost all the functioning of online stores, from computerized sales support systems, logistics, accounting, reporting and risk management to Internet marketing activities and improving security systems for online transactions. Another important determinant of e-commerce development is the development of online mobile banking available on mobile devices and new solutions related to the Internet of Things technology. Online banking, whose development is determined by technological progress in the field of ICT and Industry 4.0 information technologies, is starting to dominate. Computerization is also increasingly affecting public sector institutions servicing tax systems and settlements of business entities. In addition, Business Intelligence analytical platforms supporting the management processes of companies, including those operating in the e-commerce sector, have been developed for several years. The analysis of large information sets in Big Data databases is also developing. Big Data Analytics and Data Science analytical systems are also used by businesses operating in the field of e-commerce.
In recent years, new internet marketing instruments have also been developed, used mainly on social media portals and also by companies operating in the e-commerce sector. Internet technology and fintech companies are also emerging that offer online information services to support marketing management, including the planning of advertising campaigns for products sold via the Internet. To this end, sentiment analyses are used to survey Internet users' opinions regarding the prevailing awareness, recognition, brand image, mission and offer of specific companies. Sentiment analysis is carried out on large data sets downloaded from various websites, including millions of social media sites, and collected in Big Data systems. The analytical data collected in this way are very helpful in planning advertising campaigns carried out in new media, including social media portals. These campaigns advertise products and services sold via the Internet and available at online stores. In view of the above, the development of e-commerce is determined mainly by technological progress in ICT, by the advanced data processing technologies of Industry 4.0, and by new technologies used to secure financial transactions carried out via the Internet, including e-commerce transactions, e.g. blockchain technology. I have described these issues, covering various aspects of the application of information systems and ICT (including Big Data and Business Intelligence) in companies operating on the Internet, in my scientific publications available on the ResearchGate portal. I invite you to cooperate.
Accordingly, in my opinion, the use of 5G technology to collect data from the Internet will contribute significantly to improving sentiment analysis of Internet users' opinions and to extending the use of research techniques based on Business Intelligence, Big Data Analytics, Data Science and other ICT, Internet and advanced data processing technologies typical of the current fourth technological revolution, referred to as Industry 4.0. At present, however, the full range of potential applications of 5G technology, in the economy and elsewhere, is still unknown. These applications will be broad, both in business processes carried out by Internet technology companies and in security institutions. Globally operating Internet technology companies, thanks to the use of 5G technology in research processes, will improve their offer of information, Internet and marketing services addressed to Internet users. On the other hand, national security institutions and the IT risk management departments of companies may also obtain a tool that significantly improves the instruments ensuring a high level of security for information transferred via the Internet, and that helps with other cybersecurity issues. Therefore, research on cybersecurity and e-commerce will be expanded to include the impact of 5G technology on the development of many aspects of these areas of activity of business entities, institutions and citizens, who increasingly use the Internet in various areas of business.
In view of the above, in my opinion, one of the key applications of artificial intelligence integrated with other Industry 4.0 technologies, including Big Data Analytics, in the coming years will be the improvement of information search on the Internet.
Best wishes.
Dariusz Prokopowicz
  • asked a question related to Data Mining and Knowledge Discovery
Question
9 answers
How can I remove overfitting in WEKA? I have used the resample and randomize techniques, but what is the proper way to remove overfitting in WEKA?
Relevant answer
Answer
Use a random forest classifier.
  • asked a question related to Data Mining and Knowledge Discovery
Question
5 answers
I'm searching for some good tools that offer an easy way to apply an evolutionary/genetic algorithm for selecting the best features from a dataset. I was wondering whether this task can be performed in KNIME, WEKA or Orange.
Relevant answer
Answer
Use the WEKA software.
  • asked a question related to Data Mining and Knowledge Discovery
Question
1 answer
As we know, most researchers rely on manual validation by experts for unlabeled user reviews in a specific domain, but is there a newer way? I work with a large dataset, so using experts would be difficult.
If anyone uses a new performance measure or a new way of validating, please inform me.
Thanks in advance.
Relevant answer
Thanks Gopi Battineni , I will check the paper.
  • asked a question related to Data Mining and Knowledge Discovery
Question
9 answers
Do Data Mining and Big Data belong only to the subject of Artificial Intelligence, or are these terms also discussed in the context of Data Literacy or Data Management in library and information science?
  1. Are librarians' data literacy skills the same as data scientists' skills? If data scientists' skills are higher than librarians' data literacy skills, will librarians be replaced in the future job market?
  2. What should librarians do to enhance their data literacy skills?
I am looking for any study (dissertation, model, conference paper, poster) that discusses data literacy in the context of AI (Big Data and data mining) applications in libraries.
_Yousuf
Relevant answer
Answer
Hi Pohammad,
Big data/data mining, business intelligence/analytics, data science, distant reading, knowledge discovery etc. are all different terms used in different disciplines to denote essentially the same thing: statistical analysis and discovery of novel patterns in data, presented in a form conducive to human consumption. Meliha
  • asked a question related to Data Mining and Knowledge Discovery
Question
8 answers
Suppose we have multiple classifiers and need to know which ones are under-fitting and which are overfitting, based on performance factors (classification accuracy and model complexity).
Is there any method to select the dominant classifier (optimal fitting) that balances the two factors mentioned above?
Relevant answer
Thank you, Salah Mortada, for your explanation. However, I would like to draw a figure that compares the different methods used in the experimental results and clearly shows which methods overfit and which under-fit. If you have an example of such an experimental result, I would be grateful.
  • asked a question related to Data Mining and Knowledge Discovery
Question
4 answers
For example: if I want to determine the size of a dataset according to its number of instances.
Dataset 1= 8000 => High
Dataset 2= 2000 => Medium
Dataset 3= 500 => Small
Another example: if I want to determine the size of a dataset according to its number of features.
Dataset 1= 100 => High
Dataset 2= 30 => Medium
Dataset 3= 7 => Small
Relevant answer
Answer
Hi Hayder,
There are no standards for either case. The required number of instances depends on the complexity of the problem. Also, you can extract an unlimited number of features from the available information.
  • asked a question related to Data Mining and Knowledge Discovery
Question
3 answers
Hello dear researchers,
I've decoded some AIS data with https://github.com/schwehr/libais.
I have a question. Some records have fields called UTC-hour, UTC-min and UTC-spare, but other records have just a timestamp.
What should I do with these columns to obtain the time?
Do you know of any other reliable package for decoding AIS data?
Regards,
Relevant answer
Answer
Thanks, Nagdev Amruthnath, for your response.
Sure, I'll attach a sample file so you can see it.
  • asked a question related to Data Mining and Knowledge Discovery
Question
8 answers
What are the major differences between Information Gain and Entropy when we use them to determine credibility or importance in classification?
Relevant answer
Answer
Information gain is the amount of information gained about one random variable or signal from observing another random variable. In classification, it is the reduction in the entropy of the class label obtained by splitting the data on a feature.
Entropy is the average rate at which information is produced by a stochastic source of data; equivalently, it is a measure of the uncertainty associated with a random variable.
An example at:
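Separately from the linked example, the relation between the two quantities can be sketched in a few lines of Python. This is only an illustration: the function names and the toy data (`weather`, `buys`, `noise`) are hypothetical and do not come from the answer above.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their expected entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: does a customer buy (yes/no) given the weather?
weather = ["sunny", "sunny", "rainy", "rainy"]
buys = ["yes", "yes", "no", "no"]
noise = ["a", "b", "a", "b"]

print(entropy(buys))                    # uncertainty of the class on its own
print(information_gain(weather, buys))  # weather fully determines the class
print(information_gain(noise, buys))    # noise says nothing about the class
```

Entropy describes a single variable (here the class), while information gain always compares two: it is large when knowing the feature removes most of the class entropy, and zero when it removes none.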
  • asked a question related to Data Mining and Knowledge Discovery
Question
3 answers
Greetings to everyone,
I have to select relevant features from the KDD99 data set, and I am going to use the bat algorithm. To use the bat algorithm, is it necessary to convert the dataset into binary form or not? I don't know how to proceed further. Can anyone please tell me?
Relevant answer
Answer
Yes, for sure; this application has already been done.
  • asked a question related to Data Mining and Knowledge Discovery
Question
2 answers
Hello.
I've got some AIS data in text format. When I open these files, the contents aren't meaningful.
I want to derive trajectory information from these files but I don't know how to do this. I wonder if anybody can help me.
Regards
Relevant answer
Answer
Dear Danial,
Assuming that you have undecoded AIS data (something in the following form: "!AIVDM,1,1,,A,13u?etPv2;0n:dDPwUM1U1Cb069D,0*24"), you need to:
1) decode the message,
2) check the message type and keep only those that contain data relevant for trajectories (in your case probably types 1,2,3 and 5),
3) identify the fields you need (e.g. MMSI, longitude, latitude, speed) and retrieve the values.
1:
If you are unfamiliar with AIS format you can use the following link:
Also, if you don't want to write a decoder yourself, there are free online versions. Not all of them cover every AIS message type, but most decode the ones you need. You may try the following:
2:
You need to check AIS documentation to identify the message type you need. Types 1, 2 and 3 are for position report and you can use those to retrieve MMSI (unique identifier of a ship), its latitude, longitude, speed and some others.
Perhaps type 5 might also be of interest to you as it contains static and voyage related data.
3:
Once you know which messages you are going to use, follow the format to extract the fields you need.
Best regards,
Andrej Dobrkovic
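For readers who do want to write the decoder themselves, the three steps above can be sketched in Python roughly as follows. This is a minimal sketch, not the method of any particular package: it assumes single-fragment sentences, skips checksum validation, and takes the bit offsets from the publicly documented layout of position reports (types 1, 2 and 3). All function names are hypothetical.

```python
def payload_to_bits(payload):
    """Step 1: undo the 6-bit ASCII armoring of an AIVDM payload."""
    bits = []
    for char in payload:
        value = ord(char) - 48
        if value > 40:
            value -= 8
        bits.append(format(value, "06b"))
    return "".join(bits)


def unsigned(bits, start, length):
    """Read an unsigned integer field from the bit string."""
    return int(bits[start:start + length], 2)


def signed(bits, start, length):
    """Read a two's-complement integer field from the bit string."""
    value = unsigned(bits, start, length)
    if value >= 1 << (length - 1):
        value -= 1 << length
    return value


def decode_position_report(sentence):
    """Steps 2 and 3: keep only types 1-3 and extract the trajectory fields."""
    fields = sentence.split(",")
    bits = payload_to_bits(fields[5])       # sixth comma-separated field is the payload
    msg_type = unsigned(bits, 0, 6)
    if msg_type not in (1, 2, 3):
        return None                         # not a position report
    return {
        "type": msg_type,
        "mmsi": unsigned(bits, 8, 30),
        "speed_knots": unsigned(bits, 50, 10) / 10.0,
        "longitude": signed(bits, 61, 28) / 600000.0,
        "latitude": signed(bits, 89, 27) / 600000.0,
    }


# The sample sentence quoted in the answer above.
report = decode_position_report("!AIVDM,1,1,,A,13u?etPv2;0n:dDPwUM1U1Cb069D,0*24")
print(report)
```

A production decoder would also verify the checksum, reassemble multi-fragment messages (needed for type 5) and treat the reserved "not available" field values specially.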
  • asked a question related to Data Mining and Knowledge Discovery
Question
17 answers
Developing knowledge is one of the most important factors in the development of civilization, including technical progress, technology development, etc. Knowledge is one of the most important production factors in modern knowledge-based economies. In modern knowledge-based economies, information services, Internet and modern information technologies based on advanced information processing are developing dynamically in the most economically developed countries. Currently, the development of knowledge led to the fourth technological revolution known as Industry 4.0.
A particularly important area of knowledge, one that has been developing rapidly in recent years and will probably determine the development of modern economies, is advanced information processing technologies, which rank among the main determinants of the technological revolution called Industry 4.0.
The currently ongoing technological revolution Industry 4.0 is determined by the development of the following advanced information processing technologies: Big Data database technologies, cloud computing, machine learning, Internet of Things, artificial intelligence, Business Intelligence and other advanced data mining technologies and other information technologies.
Knowledge development is therefore a key issue for the continuation of technological progress in the 21st century.
In view of the above, I am asking you the following question:
What is the significance of knowledge in the development of the 21st century civilization?
Please reply
Best wishes
Relevant answer
Answer
Every blade has two sides.
The present technology-based knowledge and its practice may lead to new problems, such as the next generation failing to think in versatile directions. Innovation may stop, or may become totally dependent on technology, for that reason.
As an example, one question may be asked: how many modern-day school students can perform simple calculations without a calculator?
  • asked a question related to Data Mining and Knowledge Discovery
Question
14 answers
Today, social networks have a special place in people's lives, and people spend a lot of time on them. These networks have a number of positive and negative effects on the behavior, culture and lifestyle of individuals and society. How can these impacts be managed so as to improve society? Is there a technical and scientific solution?
Relevant answer
Answer
Direct discussion is best.
  • asked a question related to Data Mining and Knowledge Discovery
Question
9 answers
What is a possible solution for cross-validation on an imbalanced data set? The question has three parts.
  1. Oversample the minority class examples (using SMOTE, ADASYN, etc.), then split the data into 10 folds, train the classifier on nine folds and test on the tenth, repeat this process 10 times and take the average of the metric. What about the overfitting problem in this case?
  2. Alternatively, divide the data set into 10 folds, oversample the minority class examples in nine folds only, train the classifier on them, test the trained classifier on the original (not oversampled) tenth fold, repeat this process 10 times and take the average. What about the distributions in this case, given the basic assumption that the training and test sets follow the same distribution?
  3. If we oversample the minority class until it has the same number of examples as the majority class, is it still necessary to measure F-measure, G-mean and AUC, or is accuracy sufficient?
Relevant answer
Answer
Option 1) is a common misconception among researchers less familiar with the imbalanced data topic. If you oversample the entire data set and perform the cross-validation procedure afterwards, similar examples (or exact replicas, depending on the oversampling algorithm used) will appear in both the training and test partitions. Here, the most problematic issue will be the overoptimism introduced in your design (rather than the overfitting).
Option 2) is the correct way to handle imbalanced data, although proper care must be taken when choosing an oversampling algorithm. An inappropriate choice (e.g. choosing algorithms that create exact replicas of original examples) may lead to overfitting.
The following paper may be very useful, as it explains these two issues (overoptimism and overfitting) in detail, and compares several oversampling algorithms over a benchmark of 86 publicly-available datasets:
Your concern regarding the train-test distribution is valid, although I would argue that it is not that significant, providing that your cross-validation is stratified: each fold should contain the same number of minority/majority examples (for a binary-classification problem). Indeed, you will be oversampling the training set and testing on the original (imbalanced) test set, but your model, improved by an appropriate oversampling method, should generalise well for the test set. Note that improved CV procedures exist that handle partition-induced dataset shift (either prior probability or covariate shift), and perhaps are worthy of consideration:
Regarding Option 3), I believe you are still considering Option 1) which should not be performed, for the reasons explained above. Therefore, you should use appropriate performance measures, more "robust" to imbalanced data than Accuracy (that is clearly biased towards the most represented class).
I understand perhaps this comes a little too late to be helpful in your case, but may help future researchers in the field.
Best regards!
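To make Option 2) concrete, here is a minimal standard-library Python sketch of stratified cross-validation in which only the training folds are oversampled. It is an illustration under stated assumptions: plain random oversampling stands in for SMOTE or ADASYN (so it does create the exact replicas warned about above), and every name in it is hypothetical.

```python
import random
from collections import Counter


def stratified_folds(labels, k, rng):
    """Assign sample indices to k folds so each fold keeps roughly the class ratio."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def random_oversample(indices, labels, rng):
    """Duplicate minority-class indices at random until all classes are the same size."""
    by_class = {}
    for index in indices:
        by_class.setdefault(labels[index], []).append(index)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        balanced.extend(members + extra)
    return balanced


def cv_splits(labels, k=10, seed=0):
    """Yield (oversampled training indices, untouched test indices) for each fold."""
    rng = random.Random(seed)
    folds = stratified_folds(labels, k, rng)
    for i, test in enumerate(folds):
        train = [index for j, fold in enumerate(folds) if j != i for index in fold]
        yield random_oversample(train, labels, rng), test


# Hypothetical data set: 90 majority (0) and 10 minority (1) examples.
labels = [0] * 90 + [1] * 10
for train, test in cv_splits(labels, k=5):
    print(Counter(labels[i] for i in train), Counter(labels[i] for i in test))
```

Because the oversampling happens after the split, no test example (or copy of one) can leak into the training data, and each test fold keeps the original class ratio.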
  • asked a question related to Data Mining and Knowledge Discovery
Question
13 answers
I have a data set in which the change in the weight of materials is recorded over time. Unfortunately, because of special conditions, I cannot record the weight during the first 75 seconds.
- Is there any way to predict the missing initial data (I mean the change in weight during the first 75 seconds)?
- How can I find the equation of the curve that fits the data points?
Any solution using MATLAB, SPSS or Excel is appreciated.
Relevant answer
Answer
Hello,
Please try several statistical forecasting techniques and check which one is suitable for your data.
Good luck!
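As one possible starting point, the sketch below fits a curve and extrapolates it backwards. It is written in Python rather than MATLAB, SPSS or Excel, and it rests on a strong assumption that may not hold for the real material: that the weight follows an exponential law w(t) = a * exp(b * t). The measurements are synthetic and serve only to show the mechanics.

```python
from math import exp, log


def fit_exponential(times, weights):
    """Least-squares fit of w(t) = a * exp(b * t) via a straight line through ln(w)."""
    n = len(times)
    logs = [log(w) for w in weights]
    mean_t = sum(times) / n
    mean_l = sum(logs) / n
    slope = sum((t - mean_t) * (l - mean_l) for t, l in zip(times, logs)) / sum(
        (t - mean_t) ** 2 for t in times
    )
    intercept = mean_l - slope * mean_t
    return exp(intercept), slope


# Synthetic measurements for illustration only, starting at t = 75 s.
times = [75, 100, 125, 150, 175, 200]
weights = [50.0 * exp(-0.01 * t) for t in times]

a, b = fit_exponential(times, weights)
missing = [(t, a * exp(b * t)) for t in (0, 25, 50)]  # extrapolated first 75 seconds
print(a, b, missing)
```

The same idea applies to other model families (linear, power law, polynomial). Whichever one is chosen, values extrapolated outside the measured range are only as trustworthy as the assumed model.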
  • asked a question related to Data Mining and Knowledge Discovery
Question
7 answers
I am studying frequent subgraph pattern mining from transactional graphs. For the experimental study, I need some benchmark data sets. Is there any graph generator for producing synthetic graph data?
Relevant answer
Answer
There is one called Simbrain. You can generate graphs with it, but you may need a short tutorial to understand it, or maybe not.
Thanks,
Manu Mitra
  • asked a question related to Data Mining and Knowledge Discovery
Question
17 answers
For one of my studies, I designed an unsupervised predictive clustering model, and I am now looking for modification and post-processing steps that would let me use that clustering model for classification in a reliable way.
Relevant answer
Answer
For supervised learning we need to have a labeled data set. If not, it is good to run unsupervised learning algorithms to label the unlabeled data automatically. Once the data is labeled using clustering algorithms, it is possible to use supervised learning algorithms. For linking the two tasks, a simple script can be written that connects the output of the clustering as an input to the classification task.
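Such a linking script might look like the following Python sketch. It is only an illustration: the k-means routine is a bare-bones version with a simplified initialization, a 1-nearest-neighbour rule stands in for whatever supervised classifier is actually used, and the data points are hypothetical.

```python
def nearest(point, candidates):
    """Index of the candidate closest to the point (squared Euclidean distance)."""
    distances = [sum((p - c) ** 2 for p, c in zip(point, cand)) for cand in candidates]
    return distances.index(min(distances))


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points serve as initial centroids (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in points:
            clusters[nearest(point, centroids)].append(point)
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def knn_predict(point, train_points, train_labels):
    """1-nearest-neighbour classifier trained on the pseudo-labeled data."""
    return train_labels[nearest(point, train_points)]


# Step 1: cluster the unlabeled data; the cluster index becomes the label.
unlabeled = [(1.0, 1.2), (8.0, 8.5), (0.8, 1.0), (8.3, 7.9), (1.1, 0.9), (7.8, 8.1)]
centroids = kmeans(unlabeled, k=2)
pseudo_labels = [nearest(p, centroids) for p in unlabeled]

# Step 2: feed the pseudo-labeled data to a supervised classifier.
print(pseudo_labels)
print(knn_predict((0.5, 0.5), unlabeled, pseudo_labels))
print(knn_predict((9.0, 9.0), unlabeled, pseudo_labels))
```

The classifier can only be as reliable as the clusters it was trained on, so it is worth checking a sample of the pseudo-labels by hand before trusting the pipeline.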
  • asked a question related to Data Mining and Knowledge Discovery
Question
4 answers
Does anybody know a solver for a large scale sparse QP that works on the GPU?
Or, more in general, can a GPU speed up solvers for sparse QPs?
Relevant answer
  • asked a question related to Data Mining and Knowledge Discovery
Question
6 answers
What are the most widely used, latest and most effective techniques for learning from imbalanced datasets?
The techniques I am aware of:
* Resampling Techniques:
  1. Random Undersampling
  2. Random Oversampling
  3. Synthetic Minority Oversampling Technique
* Throw away minority examples and switch to an anomaly detection framework
* At the algorithm level, or after it
  1. Adjust the class weight (misclassification costs).
  2. Adjust the decision threshold.
  3. Modify an existing algorithm to be more sensitive to rare classes.
* Construct an entirely new algorithm to perform well on imbalanced data.
Are there any other new/effective techniques to look at?
Relevant answer
Answer
Dear Imran,
Kindly refer to the following papers:
1. He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on knowledge and data engineering, 21(9), 1263-1284.
2. Chawla, N. V., Japkowicz, N., & Kotcz, A. (2004). Special issue on learning from imbalanced data sets. ACM Sigkdd Explorations Newsletter, 6(1), 1-6.
3. Han, H., Wang, W. Y., & Mao, B. H. (2005, August). Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing (pp. 878-887). Springer, Berlin, Heidelberg.
4. Ertekin, S., Huang, J., Bottou, L., & Giles, L. (2007, November). Learning on the border: active learning in imbalanced data classification. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management (pp. 127-136). ACM.
5. Cohen, G., Hilario, M., Sax, H., Hugonnet, S., & Geissbuhler, A. (2006). Learning from imbalanced data in surveillance of nosocomial infection. Artificial intelligence in medicine, 37(1), 7-18.
6. Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue, H., & Bing, G. (2017). Learning from class-imbalanced data: Review of methods and applications. Expert Systems with Applications, 73, 220-239.
Please consider the sixth paper with higher priority.
Thanks,
Sobhan.
  • asked a question related to Data Mining and Knowledge Discovery
Question
7 answers
Could you please share some current research trends/topics/techniques in Data Mining and Knowledge Discovery?
Relevant answer
Answer
Dear Imran,
You have to specify the domain so that the trend can be searched. However, you may follow some of the important papers below.
1. Bakhshinategh, B., Zaiane, O. R., ElAtia, S., & Ipperciel, D. (2018). Educational data mining applications and tasks: A survey of the last 10 years. Education and Information Technologies, 23(1), 537-553.
2. Bandaru, S., Ng, A. H., & Deb, K. (2017). Data mining methods for knowledge discovery in multi-objective optimization: Part A-Survey. Expert Systems with Applications, 70, 139-159.
Thanks,
Sobhan
  • asked a question related to Data Mining and Knowledge Discovery
Question
12 answers
Global search vs local search.
Relevant answer
Answer
Dear Ang,
The answers already given to this question are right. If you are looking for more, you may find some interesting answers through the link below.
Thanks,
Sobhan
  • asked a question related to Data Mining and Knowledge Discovery
Question
8 answers
What procedures can we implement in the Transformation step?
Relevant answer
Answer
Dear Deeman,
Balan has stated it correctly. The steps in the figure that you showed are very useful indeed.
  • asked a question related to Data Mining and Knowledge Discovery
Question
5 answers
In Apriori association rule mining, suppose minSupport = 0.25 and minConfidence = 0.58, and for an item set we found a total of 16 association rules:
Rule            Confidence  Support
{1 2 ==> 3}     1           0.4
{3 5 ==> 2}     1           0.4
{1 ==> 2 3}     0.666       0.4
{1 3 ==> 2}     0.666       0.4
{2 3 ==> 1}     0.666       0.4
{5 ==> 2 3}     0.666       0.4
{2 3 ==> 5}     0.666       0.4
{2 5 ==> 3}     0.666       0.4
{1 ==> 3}       1           0.6
{5 ==> 2}       1           0.6
{3 ==> 1}       0.75        0.6
{2 ==> 3}       0.75        0.6
{3 ==> 2}       0.75        0.6
{2 ==> 5}       0.75        0.6
{5 ==> 3}       0.666       0.4
{1 ==> 2}       0.666       0.4
If we want to reorder these rules from most to least important, which factor determines the importance of a rule: support or confidence? For example:
In this rule the confidence is 1 but the support is 0.4:
{1 2 ==> 3}     1           0.4
While in this rule the confidence is 0.75 but the support is 0.6:
{3 ==> 1}       0.75        0.6
Relevant answer
Answer
Yes. Confidence is the parameter.
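Following that answer, one way to produce the ordering is to sort by confidence and use support only to break ties. The Python sketch below does this with the 16 rules copied from the question; the tie-breaking choice and the variable names are assumptions made for illustration.

```python
# (antecedent, consequent, confidence, support), copied from the question.
rules = [
    ("1 2", "3", 1.0, 0.4), ("3 5", "2", 1.0, 0.4),
    ("1", "2 3", 0.666, 0.4), ("1 3", "2", 0.666, 0.4),
    ("2 3", "1", 0.666, 0.4), ("5", "2 3", 0.666, 0.4),
    ("2 3", "5", 0.666, 0.4), ("2 5", "3", 0.666, 0.4),
    ("1", "3", 1.0, 0.6), ("5", "2", 1.0, 0.6),
    ("3", "1", 0.75, 0.6), ("2", "3", 0.75, 0.6),
    ("3", "2", 0.75, 0.6), ("2", "5", 0.75, 0.6),
    ("5", "3", 0.666, 0.4), ("1", "2", 0.666, 0.4),
]

MIN_SUPPORT = 0.25
MIN_CONFIDENCE = 0.58

# Keep rules that pass both thresholds, then rank by confidence and,
# for equal confidence, by support (both descending).
kept = [r for r in rules if r[3] >= MIN_SUPPORT and r[2] >= MIN_CONFIDENCE]
ranked = sorted(kept, key=lambda r: (r[2], r[3]), reverse=True)

for antecedent, consequent, confidence, support in ranked:
    print(f"{{{antecedent} ==> {consequent}}}  conf={confidence}  supp={support}")
```

With this ordering, {1 ==> 3} and {5 ==> 2} (confidence 1, support 0.6) come first, ahead of {1 2 ==> 3} (confidence 1, support 0.4), and {3 ==> 1} follows with the other rules of confidence 0.75.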
  • asked a question related to Data Mining and Knowledge Discovery
Question
10 answers
Generally, a feature selection method is used to select relevant features for classification. But some research works additionally perform optimal feature selection.
Relevant answer
Answer
Hi Purusothaman,
Samer is right. First we select the feature set. Then we may optimally choose the features for better performance of algorithms.
Thanks,
Sobhan
  • asked a question related to Data Mining and Knowledge Discovery
Question
10 answers
Hi everyone! From my research, I noted that when it comes to the evaluation of DM and KM, these two components are evaluated as separate entities. Is there any integrated DM-KM evaluation method that I might have missed? Looking forward to your replies. Cheers.
Relevant answer
Answer
Dear Siti,
I think they are different, but they can be used in an integrated way. DM is usually used to extract information by mining data, and knowledge can be enhanced from it. KM, by contrast, is a broader concept, in which knowledge or information can be obtained by using different tools and techniques, not necessarily only DM.
To understand the theoretical difference, please see the following papers:
[1] Alavi, M., & Leidner, D. E. (2001). Knowledge management and knowledge management systems: Conceptual foundations and research issues. MIS quarterly, 107-136.
[2] Shaw, M. J., Subramaniam, C., Tan, G. W., & Welge, M. E. (2001). Knowledge management and data mining for marketing. Decision support systems, 31(1), 127-137.
[3] Wang, H., & Wang, S. (2008). A knowledge management approach to data mining process for business intelligence. Industrial Management & Data Systems, 108(5), 622-634.
[4] Luan, J. (2002). Data Mining and Knowledge Management in Higher Education-Potential Applications.
Thanks,
Sobhan
  • asked a question related to Data Mining and Knowledge Discovery
Question
3 answers
I am a beginner in the field of text mining. I have implemented an algorithm for text pattern mining and collected a few samples from the Reuters RCV1 dataset. I know about precision, recall and F-score, but I am confused about how to judge relevance. How do I measure how many relevant patterns it can retrieve?
Relevant answer
Answer
  • asked a question related to Data Mining and Knowledge Discovery
Question
6 answers
Hi all,
I would like to ask what techniques, methods or tools are available to identify commonalities and differences among multiple documents. Please let me know.
Relevant answer
Answer
Similarity of Documents
An important class of problems that Jaccard similarity addresses well is that of finding textually similar documents in a large corpus such as the Web or a collection of news articles. We should understand that the aspect of similarity we are looking at here is character-level similarity, not “similar meaning,” which requires us to examine the words in the documents and their uses. That problem is also interesting but is addressed by other techniques, which we hinted at in Section 1.3.1. However, textual similarity also has important uses. Many of these involve finding duplicates or near duplicates. First, let us observe that testing whether two documents are exact duplicates is easy; just compare the two documents character-by-character, and if they ever differ then they are not the same. However, in many applications, the documents are not identical, yet they share large portions of their text.
You can refer to this book, which is a very good source.
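The character-level Jaccard similarity described in the excerpt can be sketched in Python as follows. This is a minimal illustration: the shingle length `k` and the two sample sentences are arbitrary choices, and the function names are hypothetical.

```python
def shingles(text, k=5):
    """Return the set of all k-character substrings (shingles) of the text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc1 = "The quick brown fox jumps over the lazy dog."
doc2 = "The quick brown fox jumped over a lazy dog."

print(jaccard(shingles(doc1), shingles(doc2)))                # near-duplicates score high
print(jaccard(shingles("abcd", k=2), shingles("bcde", k=2)))  # 2 shared of 4 total -> 0.5
```

Comparing every pair of documents this way becomes expensive for a large corpus, so it is best suited to small collections or as a building block for approximate methods.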
  • asked a question related to Data Mining and Knowledge Discovery
Question
38 answers
"The AI Takeover Is Coming" this is what is the news these days. Is it really a trend setter for future years.
What is its impact on manual work? I just wanted to hear the audience's thoughts on this, hence I started a conversation.
Your thoughts and expertise are welcome!
Thanks in advance 
Relevant answer
Answer
The answer I would give is yes, AI will be adopted in the future. It's an easy answer, because AI means different things to different people.
Maybe most people can agree that AI has a self-learning component. This aspect is necessary for any computer program to be able to accomplish tasks which have not explicitly been predicted, and appropriate algorithms developed ahead of time, by a programmer. If nothing else, one can imagine a control system that tests operational modes to determine safe operating limits. Such as, allow fuel flow to increase until temperature is no longer controllable, then set the limit below that point. Autonomous driving can certainly benefit from such learning, so the vehicle becomes safer with experience. Just like human drivers do, only better, because such algorithms wouldn't be encumbered with emotions, anxieties, distractions, fatigue, panic, and so on.
We already have systems available to the public, that take on some of these characteristics. For instance, in cars, modern engine controls and stability controls. These systems are always testing the limits, always learning, and reacting to conditions right now multiple times faster than humans can. Perhaps the familiarity we have with some of these modern controls makes us dismiss them. But hey. Imagine what someone would have thought just 50 years ago, about cars that can save themselves from skidding out of control, or can stop faster than that panicked human standing on the brakes, or can parallel park all  by themselves, or can constantly be tweaking the spark advance, to keep the engine always on the verge of pinging? All of these tasks accomplished not in some totally pre-programmed way, but by taking existing conditions into account, in real time.
That said, some of what passes for AI is not much more than rule-based programming: big, nested, logical if statements that a user would think behave like AI. Then again, isn't that a lot of what human intelligence is? We build a database of effects and their causes, and we act accordingly.
  • asked a question related to Data Mining and Knowledge Discovery
Question
4 answers
If we train a data model once on a dataset using a machine learning algorithm, save the model, and then train it again using the same algorithm and the same dataset and data ordering, will the first model be the same as the second?
I would propose a classification of ML algorithms based on their "determinism" in this respect. At one extreme we would have:
(i) those which always produce an identical model when trained on the same dataset with the records presented in the same order; and at the other extreme we would have:
(ii) those which produce a different model each time, with very high variability.
Two reasons why a resulting model varies could be (a) a random walk somewhere in the machine learning algorithm itself, or (b) the sampling of a probability distribution to assign a component of an optimization function. More examples would be welcome!
Also, it would be great to make an inventory of the main ML algorithms based on their "stability" with respect to retraining under the same conditions (i.e. same data in the same order), e.g. decision tree induction vs support vector machines vs neural networks. Any suggestions for an initial list and ranking would be great!
for quite a comprehensive list of methods.
Relevant answer
Answer
There is an element of chance in the training process. In some software, you can get reproducible answers by using something like set.seed( ) in the R language. Using the seed number again with the same data will then give the same result. Then you can report the software you used with the seed. However in general the different outcomes will be close together, but as with sampling, you will occasionally get outliers (depending on the seed you choose). 
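The same idea can be shown in Python, where `random.Random(seed)` plays the role of `set.seed( )` in R. The sketch below is a toy illustration, not a statement about any particular library: it trains a tiny perceptron whose only sources of randomness are the initial weights and the shuffling of the training records.

```python
import random


def train_perceptron(data, seed, epochs=20):
    """Train a tiny perceptron; all randomness comes from the seeded generator."""
    rng = random.Random(seed)
    weights = [rng.uniform(-1, 1) for _ in range(3)]  # bias, w1, w2
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)                             # random presentation order
        for (x1, x2), target in data:
            output = 1 if weights[0] + weights[1] * x1 + weights[2] * x2 > 0 else 0
            error = target - output
            weights[0] += 0.1 * error
            weights[1] += 0.1 * error * x1
            weights[2] += 0.1 * error * x2
    return weights


# Logical AND as a toy training set.
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

first = train_perceptron(and_data, seed=42)
second = train_perceptron(and_data, seed=42)
other = train_perceptron(and_data, seed=7)

print(first == second)  # same seed, same data: identical weights
print(first, other)     # a different seed generally yields different weights
```

Reporting the seed together with the software version, as suggested above, is what makes such a run repeatable by others.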
  • asked a question related to Data Mining and Knowledge Discovery
Question
4 answers
I am working on sensor data to detect deviations in people's behavior. My data is fully unlabeled, so I read some papers on transfer learning to find a suitable method for detecting the deviations and applying it to data from different sensors, but I have not got an idea yet. If you have an idea, please share it with me. Thank you.
Relevant answer
Answer
You have to have labeled data (with unlabeled data) in order to apply the discussed approach. 
Cheers, 
Samer 
  • asked a question related to Data Mining and Knowledge Discovery
Question
2 answers
Hi ,
I know that most existing probabilistic and statistical term-weighting schemes (TF-IDF and its variations) are based on the linked independence assumption between index terms. On the other hand, semantic information retrieval seeks to exploit the linked dependence between index terms.
Please, I am wondering: when is linked dependence between index terms vital? And when can we neglect it?
Note on the dependence assumption: if two index terms have the same occurrences in a document, they tend to be dependent, and they should have the same term-weight values.
Thanks
Osman
Relevant answer
Answer
Hi Vladimir,
Thank you for your answer, but in Information Retrieval, partially judged document collections have an issue with relevance judgement values. Thus, I think term weights should capture a partial semantic relation, such as term-weight dependence in unjudged documents. The text classification problem, however, does not have this issue.
Best wishes,
Osman
  • asked a question related to Data Mining and Knowledge Discovery
Question
2 answers
latest research in Data Mining and Knowledge Discovery
Relevant answer
Answer
There are millions of articles and book chapters out there, packed with information that might help you in your work. But how do you find what you need?
Text and data mining (TDM) could be the answer. TDM relies on new technologies to provide a better way of filtering and analyzing, helping you understand vast data resources. The TDM tools use natural language processing (NLP) – a form of machine learning. However, TDM is more than just a simple search tool like Google or Bing; it can also analyze the output, detecting new connections and patterns at a volume and speed that would be impossible to achieve manually. Not only does this make research more efficient, saving time and money, it can also transform each step of the process, making research more effective...
  • asked a question related to Data Mining and Knowledge Discovery
Question
30 answers
Data scientist, simulation system, data source ...
Relevant answer
Answer
Dear Dr. Shuang Liu,
I have updated the method and made it more concise. Please send me an email at turkiabdelwaheb@hotmail.fr to confirm that we will work on the issue.
  • asked a question related to Data Mining and Knowledge Discovery
Question
3 answers
WEKA package
Relevant answer
Answer
Hi Mohamed, 
Please have a look at this article:
J. Platt: Fast Training of Support Vector Machines using Sequential Minimal Optimization. In B. Schoelkopf and C. Burges and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, 1998.
HTH.
Samer
  • asked a question related to Data Mining and Knowledge Discovery
Question
3 answers
For image retargeting, which database is required, and is it freely available? Can anything be found at the following link: http://people.csail.mit.edu/mrub/retargetme?
Relevant answer
Answer
Hello gajanan sir,
What is NRID?
  • asked a question related to Data Mining and Knowledge Discovery
Question
3 answers
Can You Answer This Question?
Relevant answer
Answer
Mostly by the fact that biological sequences themselves are not typical data for databases with a common DBMS, such as relational databases. However, if the sequences are annotated, the annotations can probably be data-mined in a fashion very similar to other relational databases.
Also, DBMS, by definition is any software used to manage any database, not just the common database formats. So anybody data-mining biological sequences is probably using a DBMS anyway (in the widest sense of the word).
  • asked a question related to Data Mining and Knowledge Discovery
Question
10 answers
Here I have attached an image in which the blue dots are the actual data points and the red line is my prediction (using linear regression). Can you suggest a good model for this dataset?
It should be noted that the input was 50 dimensional and I am only showing the outputs.
Thank you.
Relevant answer
Answer
You show no red line! 
  • asked a question related to Data Mining and Knowledge Discovery
Question
1 answer
Kafka needs an input method, as it is just a data bus. What are the best data gateways?
Relevant answer
Answer
Hi dear Arun Reddy,
Please check the following linked resources. I hope they are useful for your question.
Best regards, 
  • asked a question related to Data Mining and Knowledge Discovery
Question
4 answers
the dataset must be on student information
Relevant answer
Answer
Thank you very much Professor Alaa
Hope you all the best
  • asked a question related to Data Mining and Knowledge Discovery
Question
6 answers
Hi,
I know that some Support Vector Machine approaches, and other machine learning approaches, reduce the number of samples in the training set in order to reduce computational run-time. This can work very well on large training sets when a small portion of the samples is representative of the whole set. However, it will not perform as well on training sets whose instances vary a lot.
Is there any machine learning method that reduces the computational run-time while still involving all samples in learning?
Thanks
Osman
Relevant answer
Answer
Several accelerated methods for finite-sum minimization have been proposed recently (the SVM optimization problem can be stated in this form).
These methods are very fast, advanced versions of stochastic gradient descent.
Please also refer to non-accelerated methods for finite-sum minimization. The references can be found in the papers I listed above.
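To make the finite-sum view concrete, here is a minimal sketch of plain stochastic subgradient descent on the regularised hinge loss (a Pegasos-style update). Note that this is a non-accelerated baseline, not one of the accelerated methods mentioned above, and the toy data, the function names and the parameter values are assumptions chosen for illustration. Every sample is visited in each epoch, but only one sample is processed per update.

```python
import random


def train_linear_svm_sgd(samples, labels, lam=0.01, epochs=20, seed=0):
    """Pegasos-style stochastic subgradient descent for a linear SVM (no bias term)."""
    rng = random.Random(seed)
    w = [0.0] * len(samples[0])
    order = list(range(len(samples)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)  # each epoch visits every sample once, in random order
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decreasing step size
            x, y = samples[i], labels[i]
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            # Shrink step coming from the L2 regulariser.
            w = [(1.0 - eta * lam) * wj for wj in w]
            # Hinge-loss subgradient step, only if the margin is violated.
            if margin < 1.0:
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w


def predict(w, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1


# Hypothetical, linearly separable toy data.
toy_samples = [(2.0, 2.0), (3.0, 1.0), (2.5, 3.0), (-2.0, -1.0), (-1.0, -3.0), (-3.0, -2.0)]
toy_labels = [1, 1, 1, -1, -1, -1]
weights = train_linear_svm_sgd(toy_samples, toy_labels)
```

The cost of one update depends only on the dimension of a single sample, which is why this family of methods scales to large training sets without subsampling.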
  • asked a question related to Data Mining and Knowledge Discovery
Question
8 answers
Can I use a predictive or machine learning approach to improve the quality of health care? Or can I use it for disease prediction?
Relevant answer
Answer
There are several health-related datasets available publicly; you can look at them and try applying machine learning or modeling. They will serve as an excellent starting point.
  • asked a question related to Data Mining and Knowledge Discovery
Question
9 answers
Dear Professors and research fellows,
Can anyone recommend me some tools or research articles about data extraction from Facebook for data mining and social network analysis please?
Thanks!
Relevant answer
Answer
If you are not particularly interested in FB but are looking for texts from social networks, you may be interested in https://en.wikipedia.org/wiki/List_of_academic_databases_and_search_engines which provides a very useful list of academic databases and search engines.
  • asked a question related to Data Mining and Knowledge Discovery
Question
10 answers
hi,
I am working on a research project whose purpose is to forecast future sales demand. I have annual data for about 27 years, so my data set is obviously small. I am trying to train a model that can forecast sales demand 6 years ahead.
At first, I trained my model with the annual data of each year. To clarify, year 2000's input data are placed in one row with year 2000's actual sales. With this technique everything was good and I got an R-squared of about 99%. The problem is that, to forecast the next 6 years, I need the input data for the next 6 years, for instance the currency rate for the next 6 years, and having to forecast those inputs will decrease my model's forecast accuracy.
I came up with the idea of training each year with the input data of an earlier year. For example, in training, I pair the currency rate of year 1994 with the actual sales of year 2000.
With this technique I can use year 2016's input data to forecast year 2022's sales.
Is this technique logical?
Relevant answer
Answer
Dear Thanh,
Thanks for your time. I have already used the first technique you proposed. The problem with this method is that we are adding parameters that could be wrong. For instance, we are adding another forecast error, the one for the currency rate. Also, all of these forecasts are affected by the CPI rate because of the process of fixing the currency value.
I was hoping to be able to skip forecasting the other variables.
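The lagged-input idea from the question can be sketched as follows. This is only an illustration under stated assumptions: the function name, the dictionary layout and the toy numbers are hypothetical, and the sketch only builds the training pairs, it does not fit a model.

```python
def make_lagged_pairs(inputs_by_year, sales_by_year, lag=6):
    """Pair the inputs observed in year t with the sales observed in year t + lag."""
    pairs = []
    for year in sorted(inputs_by_year):
        target_year = year + lag
        if target_year in sales_by_year:
            pairs.append(
                (year, inputs_by_year[year], target_year, sales_by_year[target_year])
            )
    return pairs


# Hypothetical toy values: one input variable (currency rate) per year.
toy_inputs = {1994: {"currency_rate": 1.00}, 1995: {"currency_rate": 1.10}, 2016: {"currency_rate": 3.50}}
toy_sales = {2000: 120.0, 2001: 135.0}

training_pairs = make_lagged_pairs(toy_inputs, toy_sales, lag=6)
```

With 27 annual observations and a lag of 6 years, this construction leaves 21 training pairs, so the already small data set becomes smaller still.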
  • asked a question related to Data Mining and Knowledge Discovery
Question
7 answers
In my dataset, both response and the independent variables are ordinal with multiple categories as severe (coded as 3), moderate (coded as 2), mild (coded as 1) and none (0)
I run the model by putting both outcome and the variable in descending order to see if there was a  relationship between the response and the ordinal variable, and I've had the results shown below
threshold: response (3)
                 response (2)
                 response (1)
variable (1): exp(B)=3, P=0.039, 95% CI=1.061-9.253
and the reference category is variable (0)
Could you help me to interpret the results?
Relevant answer
Answer
It looks to me as if you are not treating the predictor as ordinal but as continuous.
  • asked a question related to Data Mining and Knowledge Discovery
Question
2 answers
At the moment, I am reviewing the state of the art concerning the concept of the smart city in the context of intelligent IT system deployment.
I am particularly interested in research results, describing lessons learned from the field i.e. best practices, project schedules, engaged resources etc.
Anyone interested in collaboration, please let me know.
  • asked a question related to Data Mining and Knowledge Discovery
Question
9 answers
I want to migrate virtual machines in a CloudSim simulation.
Relevant answer
Answer
Dear Marwa,
If you go to "PowerVmAllocationPolicyMigrationAbstract" class, you will find "findHostForVm", you can implement your algorithm here then call it in "getNewVmPlacement".
Regards,
Dabiah
  • asked a question related to Data Mining and Knowledge Discovery
Question
7 answers
..
Relevant answer
Answer
Hi Hanan, 
Please have a look at these links: 
I'm not sure if you'll find what you want, but you might find their content interesting. You can also have a look at Facebook (Graph) API.
  • asked a question related to Data Mining and Knowledge Discovery
Question
4 answers
I am fine-tuning AlexNet using Caffe on my own data. I tried three different versions of Caffe (the ones released in 2014, 2015 and 2016) and the outputs are quite different; on some datasets the differences are more than 10 percent. Generally, the older version outperforms the newer one. Why?
Relevant answer
Answer
Hi,
The main issue with such frameworks is that the developers might update the source code and change some hyperparameters or operators.
So one possible reason is that they updated the default values of some hyperparameters.
Did you check that all the hyperparameters are the same? Are your experimental settings all equal: batch size, epochs, hyperparameters, activation functions, ...?
I hope this helps. Otherwise, if you can share some source code on GitHub, it would be a pleasure to help you and do some debugging.
  • asked a question related to Data Mining and Knowledge Discovery
Question
9 answers
I'd like to mine web pages to build a dataset of pages taken from a particular website (e.g. news sites). It would target articles not from one section only but from all sections of the site (for instance politics, tech, etc. on CNN.com). The articles should cover 3 years of publication, so I would have all of the articles published in that 3-year period. What tools and techniques can I use?
Relevant answer
Answer
I suggest https://scrapy.org/ (a Python-based web crawler).
Also look at this Quora answer about web crawling services: https://www.quora.com/What-are-the-best-web-crawling-services
  • asked a question related to Data Mining and Knowledge Discovery
Question
3 answers
Mobile OS
Relevant answer
Answer
There are many ways to extract apk file from your mobile, but the easiest way is the following:
Step 1: Install ES File Explorer App
Step 2: Open ES File Explorer App and go to Library -> Apps
Step 3: Select the app for which you want the apk and click on the Backup option (you will see many options after selecting an app).
Step 4: Get your apk file from sdcard/backup/apps folder.
  • asked a question related to Data Mining and Knowledge Discovery
Question
3 answers
Preferably for Mac OS. I found some APIs like Highcharts, but I'm looking for a standalone app. Thanks.
Relevant answer
Answer
Dear Juan José,
I would suggest you download the SigmaPlot software via the following link and try it:
I hope this will be helpful for your research.
Best wishes.
  • asked a question related to Data Mining and Knowledge Discovery
Question
38 answers
I am interested in doing research on big data.
Relevant answer
Answer
You can look into big data in medical research, big data in social media, and big data in clouds.
  • asked a question related to Data Mining and Knowledge Discovery
Question
7 answers
I've been working on a HOG detection application lately and I am training a linear SVM with a data set. The best penalty parameter C is obtained during cross-validation and corresponds to the one giving the lowest false positive and false negative rates. Then a hard negative mining step is done by testing the negative training set against the SVM and adding the hard negative examples to the training set. This step gives a lot of hard negative samples, so I sort them by their probability of being misclassified and keep only those above a confidence threshold (e.g. > 0.9 of being classified as a true sample). I also have another parameter that sets a maximum number of hard negative samples to keep if I find too many of them above the threshold.
In my experience (which isn't much), and after a couple of runs, it appears that having too many hard negative examples doesn't improve the classifier. Also, my confidence threshold and the number of samples I want to keep depend on my C penalty parameter.
I would like to know if there are some rules or tips for choosing the optimal combination of the C parameter and the number of hard negatives, or if the only way is to do multiple runs. Any advice is welcome!
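The selection step described above (threshold first, then a cap on the number of samples kept) can be sketched like this. The function name, the score format and the example values are assumptions for illustration; the scores stand for the classifier's estimated probability that a negative sample is a positive.

```python
def select_hard_negatives(scored_negatives, threshold=0.9, max_keep=None):
    """Keep negatives scored above `threshold`, hardest first, at most `max_keep` of them.

    `scored_negatives` is a list of (sample_id, probability_of_positive) tuples.
    """
    hard = [(sid, p) for sid, p in scored_negatives if p > threshold]
    hard.sort(key=lambda item: item[1], reverse=True)  # most confidently wrong first
    if max_keep is not None:
        hard = hard[:max_keep]
    return hard


# Hypothetical scores for four negative windows.
toy_scores = [("a", 0.95), ("b", 0.50), ("c", 0.99), ("d", 0.91)]
selected = select_hard_negatives(toy_scores, threshold=0.9, max_keep=2)
```

Keeping the threshold and the cap as explicit parameters makes it straightforward to include them in the same cross-validation grid as C.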
Thank you!
Relevant answer
Answer
Thanks for your answer.
The thing with PCA is that a 2D visualisation could show no separation between classes while a 3D visualisation would show a good separation. Then it could be the same between 3D and 4D, so I'm not sure PCA is useful, as we can't see structure beyond 3D.
On the other hand, I changed my data to be balanced. I'm testing different combinations of C and the number of hard negative samples using K-fold cross-validation.
The only problem is that sometimes I get a better score without hard negative mining than with it. I also get just a few false positives and a few false negatives during training. But then if I test on an image I get a lot of false positives and a poor score.
Is my model overfitting during training? Does that mean my data set is not well built? Or can the SVM not make a clear separation between the two classes? Why is the hard negative mining step not improving my results during cross-validation?
Thanks for your help.
  • asked a question related to Data Mining and Knowledge Discovery
Question
3 answers
I am trying to compile my application for TOSSIM. Even though it compiles for the hardware, it throws run-time errors when executing 'make micaz sim'. I was wondering if it is because I am using the 802.15.4 MAC. Does TOSSIM support tkn154? (The given examples do not compile.)
Relevant answer
Answer
  • asked a question related to Data Mining and Knowledge Discovery
Question
5 answers
Need some basic tools for identifying original content in social media
Relevant answer
Answer
Hi Pradeep,
If you want to work on tweets, then please refer to my paper on downloading streaming tweets using the Twitter API.
  • asked a question related to Data Mining and Knowledge Discovery
Question
4 answers
I collected data from TripAdvisor and the users' locations are not in a fixed format. Some of them used a city name and some used their country name.
Is there any way I can merge them and code them as country name?
Relevant answer
Answer
Using OpenRefine, you can query the Google Maps API with little difficulty. It returns JSON with a lot of information about the data you gave it (be it city name, street name, country name, ...). From this JSON it is easy to parse out the 'country' element.
You can use the tutorial in the link, and instead of parsing the latitude, you parse the country.
Of course, when a city exists in multiple countries, some mistakes are possible. That's why you should give Google as much location data as possible (for example if you have city name + continent).
Good luck!
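Outside OpenRefine, the parsing step can be sketched in a few lines. The response below is a hand-written, heavily trimmed example in the shape of a Google Geocoding API reply (a `results` list whose entries carry `address_components` tagged with `types`); it is not real API output, and the function name is an assumption.

```python
import json


def extract_country(geocode_json):
    """Return the long name of the first 'country' component, or None if absent."""
    data = json.loads(geocode_json)
    for result in data.get("results", []):
        for component in result.get("address_components", []):
            if "country" in component.get("types", []):
                return component.get("long_name")
    return None


# Hand-written sample in the shape of a geocoding response (trimmed).
sample_response = json.dumps({
    "status": "OK",
    "results": [{
        "address_components": [
            {"long_name": "Paris", "short_name": "Paris", "types": ["locality", "political"]},
            {"long_name": "France", "short_name": "FR", "types": ["country", "political"]},
        ]
    }],
})

country = extract_country(sample_response)
```

Returning `None` for unmatched locations keeps ambiguous or unknown entries visible so they can be checked by hand.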
  • asked a question related to Data Mining and Knowledge Discovery
Question
9 answers
I'm working on "community detection in networks considering node attributes". In this regard, I need some benchmark networks for testing my proposed algorithm by comparing predicted labels (community assignments) with the real ones (ground truth). These networks should be undirected, include non-overlapping communities, and range from small to big sizes. Edges should show the relations between nodes, nodes should have some personal features that likely affect their community memberships, and the true labels of the nodes should be known as ground truth for evaluating my predicted labels. Although I searched extensively, unfortunately I couldn't find any networks with these characteristics.
I would really appreciate it if anybody could point me to some references or network benchmarks that satisfy my requirements.
Thank you in advance for your time and cooperation.
Best regards,
Esmaeil
Relevant answer
Answer
Try network data sets at these sites:
(i) University of Michigan, (ii) UCI, (iii) Gephi,
(iv) Washington State University, (v) LAW, (vi) Yahoo graphs, (vii) SNAP, (viii) KONECT
  • asked a question related to Data Mining and Knowledge Discovery
Question
5 answers
Dear
I want to compare the performance of two SVM algorithms, SMO and LibSVM, in the WEKA tool:
SMO: cross-validation k = 3, then k = 10, with kernels (Linear, PolyKernel, RBF)
LibSVM: cross-validation k = 3, then k = 10, with kernels (Linear, PolyKernel, RBF, Sigmoid)
I get the results in the attached file.
Is it really the case that SMO with k = 10 and the RBF kernel is my best solution?
If so, which metrics (2 or 3) should I choose for a graph to justify my choice?
Thank you for your collaboration.
Hamid SLIMANI
Relevant answer
Answer
Hi all, 
Here is a summary, with corrections and additional information related to the current discussion.
In WEKA, SMO and LibSVM are different algorithms, but both can be used to perform SVM classification. Precisely, SMO implements John Platt's sequential minimal optimization algorithm for training a support vector classifier, while LibSVM is a wrapper class for the libsvm library that supports the classifiers implemented in that library, including one-class SVMs. Therefore, getting different results is to be expected.
Back to your question: yes, SMO using the RBF kernel with tenfold cross-validation has the best results.
HTH.
Samer
  • asked a question related to Data Mining and Knowledge Discovery
Question
2 answers
I'm looking for a freely available dataset for Arabic microblog retrieval 
Relevant answer
Answer
check this site out for datasets
  • asked a question related to Data Mining and Knowledge Discovery
Question
3 answers
I am doing research on big data stored in the cloud. I would like to ensure that the modified data is intact.
Relevant answer
Answer
Dear Shamiel,
Using cloud storage, users can remotely store their data and enjoy on-demand, high-quality applications and services from a shared pool of configurable computing resources, without the burden of local data storage and maintenance.
These links may be useful for you :
  • asked a question related to Data Mining and Knowledge Discovery
Question
6 answers
Is there any alternative to 'crowd signals' for collecting real data from mobile users across different countries?
Relevant answer
Answer
Of course there are alternatives -- as long as the users are willing to keep an app open.
What sort of data are you looking for?  Anything that can be collected with or by the phone (which is a larger space than you might believe) is easy.  If you include user input, virtually anything is possible.
This question would have been easier to answer if you had provided more details.  Up-votes appreciated if you found this useful.
  • asked a question related to Data Mining and Knowledge Discovery
Question
1 answer
I have obtained some TF-mRNA relationships from the CRSD database. However, I want a more comprehensive database of TF-mRNA relationships to validate my results. I would very much appreciate it if someone could point me to one.
Relevant answer
Answer
Sorry, I cannot help.
  • asked a question related to Data Mining and Knowledge Discovery
Question
7 answers
I have been doing research on huge data sets using Hadoop. Can anyone provide me links for huge data sets for analysis purpose? 
Thanks in advance.
Relevant answer
Answer
Large unstructured datasets such as tweets are well suited to analysis with Hadoop.
One such dataset link is given below:
  • asked a question related to Data Mining and Knowledge Discovery
Question
7 answers
In particular, if yes, are all the major algorithms such as Apriori, FP-Growth and Eclat NP-hard, or only Apriori?
Relevant answer
Answer
This is a good paper about the complexity of frequent itemset mining.
  • asked a question related to Data Mining and Knowledge Discovery
Question
3 answers
I am thinking about working in the area of career counselling, using sentiment analysis of reviews available on the web.
Relevant answer
Answer
Try the Facebook pages of colleges and their Twitter streams.
  • asked a question related to Data Mining and Knowledge Discovery
Question
3 answers
I'm looking for any data on Population Statistics or Modelling for the past periods to compare them with our  results extracted from Wikipedia biographical pages (see attached links).
Relevant answer
Answer
Very nice figure: it immediately shows the effects of the two world wars (and maybe the Spanish flu pandemic) and the discovery of antibiotics.
Concerning population statistics, you can find online the Italian mortality and lifespan data calculated by ISTAT (the Italian official institute of statistics) for the period 1974 to the present.
The publicly accessible URL is attached.
Hope this helps!
  • asked a question related to Data Mining and Knowledge Discovery
Question
8 answers
I'm hoping to be able to examine the frequency of themes arising for an open ended question administered as part of a survey (n>1000). The answers are typically one to two lines of text. I'm hoping to identify a method and software package that will automate this process to some degree rather than coding by hand. Any help would be greatly appreciated. Thanks.
WBW,
TDC
  • asked a question related to Data Mining and Knowledge Discovery
Question
10 answers
Normalization is done to map the data to a uniform scale. For instance, when the inputs to an ANN are on widely different scales, normalization is normally used to get the same range of values for each of the input features. To do this, several standard data normalization techniques such as min-max, softmax, z-score, decimal scaling and Box-Cox are available (this list is not exhaustive and many more techniques are in use). As far as I know, the min-max technique preserves all the relationships in the original dataset exactly. Z-score is often used when responses are on different magnitude scales. But both techniques are sensitive to outliers in the data.
Is there a general guideline to determine the appropriate technique for a particular application? Should the normalization method be solely determined by the range of input features (for removing scaling effect)? Does it depend on the choice of activation functions (logsig [0, 1] or tansig [-1, 1], etc.) as well? Does it depend on the type of the problem we are trying to solve (classification, function approximation, prediction, forecasting of time-series data, etc)?
Relevant answer
Answer
The objectives of data preprocessing include size reduction of the input space, smoother relationships, data normalization, noise reduction, and feature extraction.
I recommend reading chapter 15 of the book: Fuzzy neural intelligent systems: Mathematical foundation and the applications in engineering (see attached PDF).
Data normalization can provide better modeling and avoid numerical problems. Several algorithms can be used to normalize the data:
1) Min-max normalization: a linear scaling algorithm that transforms the original input range into a new data range (typically 0-1).
2) Z-score normalization: the input variable data is converted to zero mean and unit variance.
3) Sigmoidal normalization: a nonlinear transformation that maps the input data into the range -1 to 1 using a sigmoid function.
Outliers usually have large values. Sigmoidal normalization is an appropriate approach for representing such large outlier data.
Li, H., Chen, C. P., & Huang, H. P. (2000). Fuzzy neural intelligent systems: Mathematical foundation and the applications in engineering. CRC Press.
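The three normalizations listed above can be sketched as follows. This is a minimal illustration, not the book's implementation: the function names are assumptions, the population standard deviation is used for the z-score, and the sigmoidal form shown, (1 - e^(-z)) / (1 + e^(-z)) = tanh(z / 2) applied to the z-score, is one common choice among several.

```python
import math
import statistics


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [new_min for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


def z_score(values):
    """Shift to zero mean and scale to unit (population) variance."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def sigmoidal(values):
    """Squash z-scores into (-1, 1); large outliers saturate instead of dominating."""
    return [math.tanh(z / 2.0) for z in z_score(values)]


# Hypothetical feature column with one large outlier.
feature = [1.0, 2.0, 3.0, 4.0, 100.0]
scaled_min_max = min_max(feature)
scaled_z = z_score(feature)
scaled_sigmoid = sigmoidal(feature)
```

On data like this, min-max pushes the four ordinary values close to 0 because the outlier defines the range, while the sigmoidal variant keeps every value strictly inside (-1, 1).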
  • asked a question related to Data Mining and Knowledge Discovery
Question
8 answers
Please have a look at the following documents and give me your valuable opinion. Scientifically, how correct or acceptable is it to obtain continuous data by accumulating different statements, as follows? I am going to use regression analysis in my study. Most of my independent variables are constructed in this way.
Thanks in advance.
Hope for your best cooperation
Relevant answer
Answer
Following the young lady's suggestion, separate your deforestation questions from those on quality of life and make your questions "Likert scale-like".
I think if you want to be 'kool' with your professors or readers, you can use "factor" loadings. Nevertheless, do not report them if they contradict Beatrice's suggestion. So you choose!!
  • asked a question related to Data Mining and Knowledge Discovery
Question
14 answers
I am working on association rule mining for a retail dataset. Can you provide a link to download data where demographic information and items purchased, with quantities, are available?
Relevant answer
Answer
The best way is to generate synthetic data and define:
the number of items, the number of items per transaction, the number of regions and the number of transactions. All it takes is the use of different random number generators. I think this is the best way to go about it. Good luck
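A minimal sketch of such a generator is shown below. Everything in it is an assumption made for illustration: the function name, the record layout, uniform sampling of items and regions, and quantities between 1 and 5. A realistic generator would plant known association rules so that a mining algorithm has something to recover.

```python
import random


def generate_transactions(n_transactions, n_items, max_items_per_transaction, n_regions, seed=0):
    """Generate synthetic retail baskets with a region attribute and item quantities."""
    if max_items_per_transaction > n_items:
        raise ValueError("a basket cannot hold more distinct items than exist")
    rng = random.Random(seed)  # seeded so the data set is reproducible
    transactions = []
    for tid in range(n_transactions):
        basket_size = rng.randint(1, max_items_per_transaction)
        items = rng.sample(range(n_items), basket_size)  # distinct item ids
        transactions.append({
            "id": tid,
            "region": rng.randrange(n_regions),
            "items": {item: rng.randint(1, 5) for item in sorted(items)},  # item -> quantity
        })
    return transactions


synthetic = generate_transactions(n_transactions=100, n_items=20, max_items_per_transaction=5, n_regions=3)
```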
  • asked a question related to Data Mining and Knowledge Discovery
Question
5 answers
I am doing research on extracting new indicators of business performance using sentiment analysis of news headlines. To do that I need a collection of headlines from well-known news agencies like Reuters. If anybody has this data or knows how to get it, please help me.
Relevant answer
Answer
I can suggest you use a dataset from the GDELT project: http://gdeltproject.org/
  • asked a question related to Data Mining and Knowledge Discovery
Question
2 answers
What techniques are involved, and how many are there?
What are the problems with these techniques? Is there any room to enhance them to make them more efficient and reliable?
Type of Unstructured Data : Images (jpg, tiff , gif , etc)
Relevant answer
Answer
I think you can use bag-of-words model on images
  • asked a question related to Data Mining and Knowledge Discovery
Question
5 answers
How to extract the product details from a shopping website?
Relevant answer
Answer
You can use an online web scraping/extraction tool such as the "Easywebextract" tool,
or you need to read the XML file (source file) and analyze it (deleting tags such as <html>).
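The second option, reading the page source and dropping the tags, can be sketched with Python's standard `html.parser` module. The class name, the helper function and the sample markup are assumptions for illustration; a real shop page would need selectors specific to its layout, and this sketch also keeps any text found inside script or style elements.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the non-empty text chunks of a page, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def strip_tags(html):
    """Return the list of text fragments found between the tags of `html`."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical product snippet.
sample_page = '<html><body><h1 class="name">Blue Mug</h1><span class="price">$7.50</span></body></html>'
product_fields = strip_tags(sample_page)
```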
  • asked a question related to Data Mining and Knowledge Discovery
Question
2 answers
I have a dataset which contains 3 types of attributes: categorical, ordinal and numerical. All of these attributes are represented by numbers.
Is it meaningful to apply classifiers which accept numeric attributes as input for classification (without using the nominal2numeric filter)?
I think applying nominal2numeric does not differentiate between these 3 types of attributes, so an incorrect model will be created.
Relevant answer
Answer
I think you have three options.
The first is to consider them all nominal data, as numerical and ordinal data are just special cases of nominal data, being countable and probably finite!
The second option is to create a 3-stage classification process, where the first stage recognizes the nature of the data. The second stage classifies each of the three classes (nominal, ordinal and numerical) on its own according to your criteria or feature vector. In the third stage you fuse the outcomes of the three classifiers from stage two to get the final results.
The third option is to use feature hashing, as indicated in the following URLs:
I hope this helps.
Ahmad Hossny 
University of Adelaide
  • asked a question related to Data Mining and Knowledge Discovery
Question
7 answers
 I am searching for help in the line of prediction advantage, resistance to irrelevant predictor variables in terms of prediction. Thanking you...
Relevant answer
Answer
As mentioned by others, one of the primary advantages of tree models is their human interpretability. But they have other benefits as well: they can handle non-numeric data, mixed data, categorical data, etc.; they can naturally handle missing data; they have lower computational complexity; they can deal with irrelevant inputs; they can perfectly "learn" any training data; etc. In fact, decision trees are perhaps better models than just about any classifier in just about any metric... except accuracy and predictive power. Now you may say, "...well, that is one of the most important metrics we care about; what good is a classifier if it does not perform well?" And this is where ensemble systems come in. By using an ensemble of trees, e.g. a random forest or AdaBoost with trees as the base classifier, you get to keep all the benefits and turn the weakness of the trees into a strength.
Kernel based methods have strong predictive power and relatively less likelihood of overfitting, but they do not have the other benefits listed above.
So, at the end, for most supervised classification problems, an ensemble based tree approach is usually a good first choice.
  • asked a question related to Data Mining and Knowledge Discovery
Question
22 answers
I collected tweets in 2013 and would like to analyse the topics in them.
Relevant answer
Answer
Hi,
In R, you can type
data <- read.csv("filename.csv")
and get all tweets in the data variable.
Then you can follow different paths to analyze your data.
I would apply the strsplit function using vapply and see which words are used most often.
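For readers who prefer Python to R, the same idea (read the CSV, split each tweet into words, count them) can be sketched as follows. The column name `text`, the tokenising pattern and the in-memory sample are assumptions for illustration; for a real file, pass an open file handle instead of the `io.StringIO` object.

```python
import csv
import io
import re
from collections import Counter


def word_frequencies(csv_file, text_column="text"):
    """Count lower-cased word tokens (keeping # and @) over one CSV column."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts.update(re.findall(r"[a-z0-9#@']+", row[text_column].lower()))
    return counts


# Hypothetical two-tweet sample standing in for the real CSV file.
sample_csv = io.StringIO("text\nBig data is big\nData mining rocks\n")
counts = word_frequencies(sample_csv)
top_words = counts.most_common(2)
```

Word counts are only a first look; topic analysis would normally continue with stop-word removal and a topic model on top of these tokens.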
  • asked a question related to Data Mining and Knowledge Discovery
Question
3 answers
How to deal with the matrix input for the practical data mining?
Relevant answer
Answer
Presenting data in matrix form, using it as input and subjecting it to data mining processes has its roots in linear algebra. After all, a matrix may be viewed as a compact way of representing data instances or data objects. However, in data mining one needs results that are interpretable. The subtle difference is that what is considered interpretable in data mining can be very different from what is considered interpretable in linear algebra.
For various algorithms on the topic, refer to the material at the link:
You can also find useful material in the conference proceedings available at the link:
  • asked a question related to Data Mining and Knowledge Discovery
Question
4 answers
I am trying to analyze the text corpus of tickets from a ticket system in order to cluster the requests according to the words used in them. I use the vector space model and have calculated the values for tf, idf, tfxidf and entropy.
Due to the number of dimensions (about 8,500) I would like to extract the key words. Is there any statistical approach depending on a specific threshold for entropy/tfidf? Of course I could say that document frequency has to be at least x and entropy at least y, but how can I verify this point?
Thanks for your help!
Relevant answer
Answer
Dear Yaakov and Amina,
Thanks for your answers. Your papers are very interesting and gave me a good overview of the various techniques.
First I calculated the tf-idf for the whole collection in order to get the most specific words of the corpus.
Now I calculate the tf-idf values for every document and choose the top n elements by tf-idf value. I still have to determine a "good" value for n.
I hope this approach leads to a good result.
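The per-document top-n selection described above can be sketched as follows. The weighting used here (relative term frequency times the natural log of N over document frequency), whitespace tokenisation, alphabetical tie-breaking and the toy tickets are all assumptions for illustration, not the poster's actual setup.

```python
import math
from collections import Counter


def top_keywords(documents, n=3):
    """Return the n highest tf-idf terms of each document."""
    doc_tokens = [doc.lower().split() for doc in documents]
    n_docs = len(doc_tokens)
    doc_freq = Counter()
    for tokens in doc_tokens:
        doc_freq.update(set(tokens))  # count each term once per document
    result = []
    for tokens in doc_tokens:
        term_freq = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        result.append(ranked[:n])
    return result


# Hypothetical ticket texts.
tickets = ["login error on printer", "login error on email", "login password reset"]
keywords = top_keywords(tickets, n=1)
```

A term such as "login", which occurs in every ticket, gets a score of zero and never reaches the top of a list, which is the behaviour wanted when reducing the 8,500 dimensions.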
  • asked a question related to Data Mining and Knowledge Discovery
Question
3 answers
Can someone suggest a corpus containing visualization or chart-related terms?
I have a big dataset (97,000) of visualization images and their related captions. I need to label each image with its respective name. As I have no experience with it, I would rather not go into image analysis.
I would like to use some existing corpus or vocabulary to automatically label these captions.
So, does anyone know of such a corpus? Or can someone suggest a better way to do that?
NOTE: By "visualization images" I mean images whose content is a data visualization or chart, like a line chart, bar plot, dendrogram, etc.
Thanks
Relevant answer
Answer
Hi,
What do you mean by "visualization images"? Are they of a specific type?
For general-purpose images with annotations there are some big corpora:
Imagenet
MSCoco
Im2Text
Hope this is helpful for you.
Best
  • asked a question related to Data Mining and Knowledge Discovery
Question
2 answers
I'm working on applying the Kullback-Leibler and Jensen-Shannon divergences to some texts (e.g. A and B). In some cases A and B do not have the same vocabulary, so I need to assign a value to the unseen n-grams. For the moment I have been setting a small probability myself for those unseen cases. However, in some cases this value is not small enough and I can get negative divergences.
Therefore, I am looking for a smoothing method that does not require a learning corpus. I must use the probability values from the texts themselves and not from a learning corpus. Also, if possible, I would prefer not to use a smoothing method that relies on different n-gram sizes. The reason is that I make use of n-grams, but several of them are skip n-grams.
P.S. For the moment I have been testing Good-Turing smoothing.
Relevant answer
Answer
I agree with Caitlin's answer: use additive smoothing. You can start with the plus-one version, but there is a more general plus-delta expression to test with your data. Here are the details:
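A minimal sketch of plus-delta smoothing over the joint vocabulary of the two texts follows. The function names and the toy n-gram counts are assumptions for illustration. The key point is that both smoothed distributions are renormalised over the same vocabulary, so each sums to 1 and the Kullback-Leibler divergence cannot come out negative.

```python
import math
from collections import Counter


def smoothed_distribution(counts, vocabulary, delta=1.0):
    """Additive (plus-delta) smoothing: P(g) = (count(g) + delta) / (N + delta * |V|)."""
    total = sum(counts[g] for g in vocabulary) + delta * len(vocabulary)
    return {g: (counts[g] + delta) / total for g in vocabulary}


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same keys, all probabilities > 0."""
    return sum(p[g] * math.log(p[g] / q[g]) for g in p)


def js_divergence(p, q):
    """Jensen-Shannon divergence built from two KL terms against the mixture."""
    m = {g: 0.5 * (p[g] + q[g]) for g in p}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical n-gram counts for texts A and B with different vocabularies.
counts_a = Counter("a b a c".split())
counts_b = Counter("a d d".split())
vocabulary = set(counts_a) | set(counts_b)

p = smoothed_distribution(counts_a, vocabulary, delta=0.5)
q = smoothed_distribution(counts_b, vocabulary, delta=0.5)
divergence_ab = kl_divergence(p, q)
```

No learning corpus is needed, and the n-grams are treated as opaque keys, so skip n-grams can be mixed in freely. The choice of delta still matters and is worth testing on the data.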
  • asked a question related to Data Mining and Knowledge Discovery
Question
4 answers
I am doing a research project on Network Intrusion Detection System. How can I select the attributes from the packets captured from the network to make it similar to attributes in KDD dataset?
Relevant answer
Answer
Deepak, you may use tcprewrite to modify the packets, as suggested previously. Secondly, authors have used this data set extensively for classification and identification of intrusive events. I would suggest you think about building the data set through a small-scale testbed.
  • asked a question related to Data Mining and Knowledge Discovery
Question
4 answers
This is for a Special issue in Systems Medicine.
Relevant answer
Answer
Hi,
The research on Bioinformatics is increasing tremendously. All reputed Journals either will charge heavily or may not accept without experimental proof (wet lab).
With the intention to publish the dry lab work with no processing fee, we started a Journal called International Journal of Computational Biology and Bioinformatics (IJCOB), Strings Publications.
Being the editor-in-chief of IJCOB I request you to share your manuscript and publish in IJCOB.
  • asked a question related to Data Mining and Knowledge Discovery
Question
3 answers
Please describe the algorithm and its implementation.
Relevant answer
Answer
If you are looking for open source, StanfordNLP does a decent job. If proprietary works for you, Semantria's APIs do really well.
Typically, there are two ways to approach the sentiment analysis problem: supervised and unsupervised. Supervised learning gives relatively better accuracy; however, training data for it is difficult to find in most cases. Unsupervised learning, on the other hand, is complicated. Please read "Mining and summarizing customer reviews"; it is a seminal paper for sentiment analysis.
Please read the papers and blog posts of Prof. Bing Liu, who has done amazing work in this domain. You can also get an Amazon reviews data dump from Dr. Bing Liu's blog.
  • asked a question related to Data Mining and Knowledge Discovery
Question
13 answers
Hello. Can anyone please suggest which data pre-processing technique would be good to apply before the Naive Bayes algorithm? Thank you.
Relevant answer
Answer
The biggest assumption Naive Bayes makes is that its features or attributes are independent. If we select features so that they are relatively independent, they can give good results. We applied this to text classification; you may want to take a look
  • asked a question related to Data Mining and Knowledge Discovery
Question
3 answers
Especially opinion spam detection.
Relevant answer
Answer
1. Red Opal - a tool that enables users to determine the opinion orientation of products based on their features
2. Weka - machine learning algorithms for data mining, visualisation, data pre-processing, regression, etc.
3. Pattern - tools for POS tagging, network analysis, WordNet, n-gram search, machine learning, etc.
4. LingPipe
5. Gate
6. Apache OpenNLP
7. NLTK
8. Opinion Observer
9. Review Seer Tool
10. Robust Accurate Statistical
  • asked a question related to Data Mining and Knowledge Discovery
Question
18 answers
This may help us to publish our work on high impact factor journals.
Relevant answer
Answer
Please list some journals in the Anna University annexure list in which work can be published in a short period of time and at low publishing cost.
  • asked a question related to Data Mining and Knowledge Discovery
Question
5 answers
Hi. Are there any merits of one over the other for evaluating classifiers (e.g. SVM for image processing problem domain)? 
Relevant answer
Answer
Neither! And both! :-) 
Confusion matrices by themselves do not do anything. By taking combinations and ratios of a confusion matrix's rows/columns and their marginals, you extract lots of useful performance metrics, including accuracy, precision, recall, f-measure, etc. Each of these has its own merits, and there are occasions where one should be preferred over another. For example, if there is a significant class imbalance, you should prefer to assess performance with f-measure over accuracy, as accuracy can give a misleading impression of performance. (For more, read the first link, "The case against accuracy estimation for comparing induction algorithms".) So, confusion matrices are not very useful on their own, but they can certainly help you understand the nature of performance metrics.
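To make the class-imbalance point concrete, here is a small sketch that derives the usual metrics from the four cells of a binary confusion matrix. The counts are made up for illustration (1000 examples, of which 50 are positive); the helper name is an assumption as well.

```python
def metrics(tp, fp, fn, tn):
    """Derive common performance metrics from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical imbalanced problem: 50 positives, 950 negatives.
# The classifier finds only 10 of the positives and raises 5 false alarms.
m = metrics(tp=10, fp=5, fn=40, tn=945)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Accuracy comes out at 0.955 because the negatives dominate, while recall is 0.2 and the f-measure is roughly 0.31, which is a much more honest picture of how the positive class is handled.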
But to get the confusion matrix in the first place, you will have needed to select a threshold somewhere along the line. There are a number of approaches for selecting thresholds, and ROC analysis is quite a powerful option. ROC analysis is particularly useful for threshold selection if your classes have different misclassification costs, e.g. in medical fields false positive classifications are more tolerable than false negatives. ROC analysis can easily let you do cost-sensitive threshold selection, so that you select your threshold optimally for accuracy. (On a related note, there has been some recent work that allows similar threshold selection for f-measure, see "Precision-Recall-Gain Curves: PR Analysis Done Right", link #3 below.)
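As a rough sketch of cost-sensitive threshold selection, the snippet below scans every candidate threshold (each corresponds to one point on the empirical ROC curve) and keeps the one with the lowest total misclassification cost. The scores, labels, cost values and the function name are all hypothetical; a real analysis would use held-out data.

```python
def pick_threshold(scores, labels, cost_fp=1.0, cost_fn=1.0):
    """Return (threshold, cost, fp, fn) minimising cost_fp*FP + cost_fn*FN.

    An example is predicted positive when its score is >= threshold.
    Ties in cost are resolved in favour of the lowest threshold.
    """
    best = None
    # Every distinct score is a candidate; +inf means "predict all negative".
    candidates = sorted(set(scores)) + [float("inf")]
    for t in candidates:
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        cost = cost_fp * fp + cost_fn * fn
        if best is None or cost < best[1]:
            best = (t, cost, fp, fn)
    return best

# Hypothetical classifier scores and true labels (1 = positive).
scores = [0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

# False negatives five times as costly (e.g. a missed diagnosis).
fn_costly = pick_threshold(scores, labels, cost_fp=1.0, cost_fn=5.0)
# False positives five times as costly.
fp_costly = pick_threshold(scores, labels, cost_fp=5.0, cost_fn=1.0)
print("FN costly:", fn_costly)
print("FP costly:", fp_costly)
```

On this toy data the chosen threshold moves from 0.35 when false negatives are expensive to 0.8 when false positives are expensive, which is the behaviour you would expect from moving along the ROC curve.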
The area under ROC curves (AUROC) can also tell you something about the classifier. You often read that the AUROC is the probability that any randomly selected positive example will have a higher score than a randomly selected negative sample. So with this interpretation, AUROC is related to ranking, and this doesn't have an obvious link to accuracy yet. It can be shown, though, that the AUROC is related to accuracy in the following manner: 
[math]\mathbb{E}(acc) = \frac{AUROC}{2} + \frac{1}{4}[/math]
for even misclassification costs/class distributions, and in this manner we can see the relationship between the two. For more information about ROC/AUROC analyses, I recommend reading "An introduction to ROC analysis" by Tom Fawcett, see the second link below. 
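The ranking interpretation and the relationship to accuracy can both be checked on a toy example. The sketch below computes AUROC directly as the fraction of positive/negative pairs that are ranked correctly, then averages accuracy over thresholds placed at each observed score. It assumes balanced classes, no tied scores, and a threshold drawn uniformly from the observed scores; the data and function names are made up for illustration.

```python
def auroc(scores, labels):
    """Probability that a random positive scores above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

def accuracy_at(threshold, scores, labels):
    """Accuracy when predicting positive for score >= threshold."""
    correct = sum(1 for s, y in zip(scores, labels)
                  if (s >= threshold) == (y == 1))
    return correct / len(scores)

# Hypothetical balanced data: 4 positives, 4 negatives, no tied scores.
scores = [0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

auc = auroc(scores, labels)
# Average accuracy over thresholds placed at every observed score.
mean_acc = sum(accuracy_at(t, scores, labels) for t in scores) / len(scores)

print(f"AUROC = {auc}")                    # 0.8125
print(f"mean accuracy = {mean_acc}")       # 0.65625
print(f"AUROC/2 + 1/4 = {auc / 2 + 0.25}") # 0.65625
```

Under these assumptions the averaged accuracy matches AUROC/2 + 1/4 exactly; with unbalanced classes or a different way of choosing thresholds the two numbers will generally differ.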
So, what I'm trying to say is that performance evaluation is complicated. Depending on the problem you are working on you should prefer one evaluation method over others. 
(Note, the comments below are not limited to just image processing.)