Data Mining and Knowledge Discovery - Science topic
It is the research project which is ongoing.
Questions related to Data Mining and Knowledge Discovery
I am looking for tools that can be used to extract and mine Facebook data.
I learned about a tool called Netvizz, but many people say it has been discontinued.
Any idea that can help?
I work in the field of data discovery, so I am learning how to collect data through the APIs of social apps.
I'm quite new to GMDH, and after my first reading on this technique I would like to know more. Here are some of the claimed benefits of the GMDH approach:
1. The optimal complexity of the model structure is found, adequate to the level of noise in the data sample. For real problems with noisy or short data, simplified forecasting models are more accurate.
2. The number of layers and neurons in hidden layers, the model structure and other optimal NN parameters are determined automatically.
3. It guarantees that the most accurate or unbiased models will be found: the method doesn't miss the best solution while sorting through all variants (in a given class of functions).
4. Any non-linear functions or features that can influence the output variable may be used as input variables.
5. It automatically finds interpretable relationships in the data and selects effective input variables.
6. GMDH sorting algorithms are rather simple to program.
7. TMNN neural nets are used to increase the accuracy of other modelling algorithms.
8. The method uses information directly from the data sample and minimizes the influence of a priori author assumptions about the results of modelling.
9. The approach makes it possible to find an unbiased physical model of the object (law or clusterization), one and the same for future samples.
It seems that items 1, 2, 6 and 7 are really interesting and can be extended to ANN.
Any suggestion or experience from others?
I would like to dive into the research domain of explainable AI. What are some of the recent trending methodologies in this domain? What can be a good start to dive into this field?
Genomic data privacy is essential when sharing genomic data with the public. How can the privacy of genomic data be protected? Which anonymization models are useless for preserving the privacy of genomic data? Which model is recommended for preserving privacy?
In my observed data I have 12 variables, say X1, X2, ..., X12. I want to know how one variable influences the value of another.
Is this interdependence measure directly related to the observed number of tuples?
What are the ways to map a graph from a relational space to a Euclidean space with low time complexity? Some solutions exist (such as signal processing and spectral methods), but they have a high time complexity.
Recently, several works have been published on predictive analytics:
- Prediction-based Resource Allocation using LSTM and Minimum Cost and Maximum Flow Algorithm by Gyunam Park and Minseok Song (https://ieeexplore.ieee.org/abstract/document/8786063)
- Using Convolution Neural Networks for Predictive Process Analytics by Vincenzo Pasquadibisceglie et al. (https://ieeexplore.ieee.org/document/8786066)
Besides, there is a paper on how to discover a process model using neural networks:
My questions for this discussion are:
- It seems that the field for machine learning approaches in process mining is not limited to prediction and discovery. Can we formulate the areas of possible applications?
- Can we use process mining techniques in machine learning? Can we, for example, mine how neural networks learn (in order to better understand their predictions)?
- If you believe that the subjects are completely incompatible, then, please, share your argument. Why do you think so?
- Finally, please, share known papers in which: process mining (PM) is applied in machine learning (ML) research, ML is applied in PM research, both PM and ML are applied to solve a problem. I believe, this will be useful for any reader of this discussion.
I want to convert an unweighted graph to weighted for solving the link prediction problem. Is the best way to transfer from an unweighted graph to a weighted graph to consider the similarity between nodes?
Hi Folks,
I need your help regarding artificial intelligence in the context of information retrieval tools, Big Data and data mining in libraries. Could you share any dissertation/thesis, research paper, conference paper, book chapter, research project or article with me? I also welcome your comments, thoughts and feedback in the context of university libraries, to help me design my PhD questionnaire.
-Yousuf
How can I remove over-fitting in WEKA? I have used the resample and randomize techniques, but what is the proper way to remove over-fitting in WEKA?
I'm searching for some good tools that offer an easy way to apply evolutionary/genetic algorithms for selecting the best features from a dataset. I was wondering if this task can be performed in KNIME, WEKA or Orange?
As we know, most researchers rely on manual validation by experts for unlabeled user reviews in a specific domain. Is there a newer way? I work with a large dataset, so using experts will be difficult.
If anyone uses a new performance measure or a new way of validation, please let me know.
Thanks in advance.
Do Data Mining and Big Data fall under the subject of artificial intelligence, or are these terms also discussed in the context of data literacy or data management in library and information science?
- Are librarians' data literacy skills the same as data scientists' skills? If data scientists' skills are higher than librarians' data literacy skills, will librarians be replaced in the future job market?
- What should librarians do to enhance their data literacy skills?
Is there any study (dissertation, model, conference paper, poster) that discusses data literacy in the context of AI (Big Data and data mining) applications in libraries?
_Yousuf
Suppose we have multiple classifiers and need to know which one is under-fitting and which one is over-fitting, based on performance factors (classification accuracy and model complexity).
Is there any method to select the dominant classifier (optimal fitting) that balances the two factors mentioned above?
For example, if I want to categorize the size of datasets according to their number of instances:
Dataset 1= 8000 => High
Dataset 2= 2000 => Medium
Dataset 3= 500 => Small
Another example, if I want to categorize the size of datasets according to their number of features:
Dataset 1= 100 => High
Dataset 2= 30 => Medium
Dataset 3= 7 => Small
Hello dear researchers
I've decoded some AIS data with https://github.com/schwehr/libais.
I have a question. Some records have fields called UTC-hour, UTC-min and UTC-spare, but other records just have a timestamp.
What should I do with these columns to obtain the time?
Do you know any other reliable package to decode AIS data?
Regards,
What are the major differences between information gain and entropy when they are used to determine the credibility or importance of an attribute in classification?
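A minimal sketch of how the two quantities relate may help frame the question: entropy measures the impurity of a set of class labels, while information gain is the reduction in that entropy after splitting on an attribute. The toy attributes `outlook` and `windy` and the labels `play` are hypothetical and only serve as an illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: 'outlook' separates the classes perfectly, 'windy' not at all.
outlook = ["sunny", "sunny", "rain", "rain"]
windy = ["yes", "no", "yes", "no"]
play = ["no", "no", "yes", "yes"]

print(entropy(play))                    # impurity of the labels before any split
print(information_gain(outlook, play))  # gain from splitting on 'outlook'
print(information_gain(windy, play))    # gain from splitting on 'windy'
```

In this sketch entropy describes a single label set, whereas information gain compares label sets before and after a split, which is why attribute ranking is usually done with the gain rather than with raw entropy.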
Greetings to everyone,
I have to select relevant features from the KDD99 data set, and I am going to use the bat algorithm. To use the bat algorithm, is it necessary to convert the dataset into binary form or not? I don't know how to proceed further. Can anyone please advise?
Hello.
I've got some AIS data in text format. When I open these files, the contents aren't meaningful.
I want to derive trajectory information from these files but I don't know how to do this. I wonder if anybody can help me.
Regards
Developing knowledge is one of the most important factors in the development of civilization, including technical progress, technology development, etc. Knowledge is one of the most important production factors in modern knowledge-based economies. In modern knowledge-based economies, information services, Internet and modern information technologies based on advanced information processing are developing dynamically in the most economically developed countries. Currently, the development of knowledge led to the fourth technological revolution known as Industry 4.0.
A particularly important area of knowledge that has been developing rapidly in recent years, and that probably determines the development of modern economies, is advanced information processing technologies, which rank among the main determinants of the technological revolution called Industry 4.0.
The currently ongoing technological revolution Industry 4.0 is determined by the development of the following advanced information processing technologies: Big Data database technologies, cloud computing, machine learning, Internet of Things, artificial intelligence, Business Intelligence and other advanced data mining technologies and other information technologies.
Knowledge development is therefore a key issue for the continuation of technological progress in the 21st century.
In view of the above, I am asking you the following question:
What is the significance of knowledge in the development of the 21st century civilization?
Please reply
Best wishes

Today, social networks have a special place in people's lives, and people spend a lot of time on them. These networks have a number of positive and negative effects on the behavior, culture and lifestyle of individuals and society. How can these impacts be managed in order to improve society? Is there a technical and scientific solution?
What is a possible solution for cross-validation on an imbalanced data set? The question has three parts.
1. Oversample the minority class examples (using SMOTE, ADASYN, etc.), then split the data into 10 folds, train the classifier on nine folds and test on the tenth, repeat this process 10 times and take the average of the metric. What about the overfitting problem in this case?
2. Alternatively, divide the data set into 10 folds, oversample the minority class examples in the nine training folds only, train the classifier, and test it on the original (not oversampled) tenth fold; repeat this process 10 times and take the average. What about the distributions here, given the basic assumption that the training and test sets follow the same distribution?
3. If we oversample the minority class to the same number of examples as the majority class, is it still necessary to report F-measure, G-mean and AUC, or is accuracy sufficient?
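The second variant described above (oversampling only inside the training folds and testing on an untouched fold) can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive recipe: plain random oversampling stands in for SMOTE or ADASYN, and the helper names and the 90/10 toy labels are hypothetical.

```python
import random


def stratified_folds(labels, k, seed=0):
    """Return k lists of indices, each with roughly the original class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)
    return folds


def oversample(indices, labels, seed=0):
    """Randomly duplicate minority-class indices until all classes have equal size."""
    rng = random.Random(seed)
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choice(members) for _ in range(target - len(members)))
    return balanced


def cv_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices); only the training part is oversampled."""
    folds = stratified_folds(labels, k, seed)
    for f, test in enumerate(folds):
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        yield oversample(train, labels, seed + f), test


labels = [0] * 90 + [1] * 10  # hypothetical 9:1 class imbalance
splits = list(cv_splits(labels, k=10))
```

Because duplicates are created after the split, no copy of a test example can end up in the training data, which is the leakage that makes the first variant look optimistic. Each test fold keeps the original class ratio, so metrics computed on it reflect the real distribution.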
I have a data set in which the change in the weight of materials is recorded over time. Unfortunately, because of special conditions, I cannot record the weight during the first 75 seconds.
- Is there any way to predict the initial missed data (I mean the change in the weight in the first 75 seconds)?
- How can I find the equation of the curve that fit the data points?
Any solution using MATLAB, SPSS or Excel is appreciated.
I am studying on frequent subgraph pattern mining from transactional graph. For experimental study, I need some benchmark data sets. Is there any graph generator to generate synthetic data graph?
For one of my studies, I designed an unsupervised predictive clustering model, and now searching for some modification steps and post processing to use that clustering model for classification in a reliable way.
Does anybody know a solver for a large scale sparse QP that works on the GPU?
Or, more in general, can a GPU speed up solvers for sparse QPs?
What are the most widely used, most recent and most effective techniques for learning from imbalanced datasets?
The techniques I am aware of:
* Resampling Techniques:
- Random Undersampling
- Random Oversampling
- Synthetic Minority Oversampling Technique
* Throw away minority examples and switch to an anomaly detection framework
* At the algorithm level, or after it
- Adjust the class weight (misclassification costs).
- Adjust the decision threshold.
- Modify an existing algorithm to be more sensitive to rare classes.
* Construct an entirely new algorithm to perform well on imbalanced data.
Are there any other new/effective techniques to look at?
Could you please share some current research trends/topics/techniques in Data Mining and Knowledge Discovery?
Global search vs local search.
What are the procedures that we can implement in Transformation step?
In Apriori Association Rule if the minSupport = 0.25 and minConfidence = 0.58 and for an item set we found a total of 16 association rules:
Rule Confidence Support
{1 2 ==>3} 1 0.4
{3 5 ==>2} 1 0.4
{1 ==> 2 3} 0.666 0.4
{1 3 ==> 2} 0.666 0.4
{2 3 ==> 1} 0.666 0.4
{5 ==> 2 3} 0.666 0.4
{2 3 ==> 5} 0.666 0.4
{2 5 ==> 3} 0.666 0.4
{1 ==> 3} 1 0.6
{5 ==> 2} 1 0.6
{3 ==> 1} 0.75 0.6
{2 ==> 3} 0.75 0.6
{3 ==> 2} 0.75 0.6
{2 ==> 5} 0.75 0.6
{5 ==> 3} 0.666 0.4
{1 ==>2} 0.666 0.4
If we want to reorder these rules from the most to the least important, which factor determines the importance of a rule: support or confidence? For example:
In this rule the Confidence is 1 but the Support is 0.4
{1 2 ==>3} 1 0.4
While in this rule the Confidence is 0.75 but the Support is 0.6
{3 ==> 1} 0.75 0.6
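The two possible orderings can be compared with a short sketch. The four rules below are copied from the table above; the `rank` helper and its `primary` parameter are hypothetical names, and which measure should come first is exactly the open question, so the code only shows both options side by side.

```python
# (antecedent, consequent, confidence, support): a subset copied from the table above
rules = [
    ((1, 2), (3,), 1.0, 0.4),
    ((3,), (1,), 0.75, 0.6),
    ((1,), (3,), 1.0, 0.6),
    ((5,), (3,), 0.666, 0.4),
]


def rank(rules, primary="confidence"):
    """Sort rules from most to least important; the other measure breaks ties."""
    if primary == "confidence":
        key = lambda r: (r[2], r[3])
    else:
        key = lambda r: (r[3], r[2])
    return sorted(rules, key=key, reverse=True)


by_confidence = rank(rules, "confidence")
by_support = rank(rules, "support")

for antecedent, consequent, confidence, support in by_confidence:
    print(antecedent, "==>", consequent, confidence, support)
```

With confidence as the primary key, {1 2 ==> 3} is placed before {3 ==> 1}; with support as the primary key the order of these two rules is reversed, while {1 ==> 3} comes first in both orderings.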
Generally, a feature selection method is used to select relevant features for classification. But some research works additionally perform optimal feature selection.
Hi everyone! From my research, I noted that when it comes to the evaluation of DM and KM, these two components are being evaluated in a separate entity. Are there any integrated DM-KM evaluation method that I might have missed out? Looking forward to your replies. Cheers.
I am a beginner in the field of text mining. I have implemented an algorithm for text pattern mining, and I have collected a few samples of the Reuters RCV1 dataset. I know about precision, recall and F-score, but I am confused about how to judge relevance. How can I measure how many relevant patterns the algorithm retrieves?
Hi all,
I would like to ask what techniques, methods or tools are available to identify commonalities and differences among multiple documents. Please let me know.
"The AI Takeover Is Coming" is what the news says these days. Is it really a trendsetter for the coming years?
What is its impact on manual work? I would like to hear the audience's thoughts on this, hence I started this conversation.
Your thoughts and expertise are welcome!
Thanks in advance
If we train a data model once on a dataset using a machine learning algorithm, save the model, and then train it again using the same algorithm and the same dataset and data ordering, will the first model be the same as the second?
I would propose a classification of ml algorithms based on their "determinism"
in this respect. On the one extreme we would have:
(i) those which always produce an identical model when trained from the same dataset with the records presented in the same order and on the other end we would have:
(ii) those which produce a different model each time with a very high variability.
Two reasons for why a resulting model varies could be (a) in the machine learning algorithm itself there could be a random walk somewhere, or (b) a sampling of a probability distribution to assign a component of an optimization function. More examples would be welcome !
Also, it would be great to do an inventory of the main ML algorithms based on their "stability" with respect to retraining under the same conditions (i.e. same data in same order). E.g. decision tree induction vs support vector vs neural networks. Any suggestions of an initial list and ranking would be great !
for quite a comprehensive list of methods.
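Reason (a) above can be illustrated with a small sketch: a learner whose only sources of variation are a random initial weight and a random sample order. The function `train_sgd`, the toy data and the seed values are hypothetical; the point is only that fixing the seed of the random number generator makes retraining reproducible, while a different seed yields a different model from the same data.

```python
import random


def train_sgd(data, seed=None, epochs=20, lr=0.01):
    """Fit y = w*x + b by stochastic gradient descent.

    Randomness enters twice: the initial weight and the per-epoch sample order.
    """
    rng = random.Random(seed)
    w, b = rng.uniform(-1, 1), 0.0
    order = list(range(len(data)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = data[i]
            error = (w * x + b) - y
            w -= lr * error * x
            b -= lr * error
    return w, b


data = [(x, 2.0 * x + 1.0) for x in range(5)]  # same dataset, same order, every time

model_a = train_sgd(data, seed=42)
model_b = train_sgd(data, seed=42)  # same seed as model_a
model_c = train_sgd(data, seed=7)   # different seed

print(model_a == model_b)  # identical parameters when the seed is fixed
print(model_a == model_c)  # parameters differ when only the seed changes
```

Under this view, algorithms with no internal randomness (for example, greedy decision tree induction with deterministic tie-breaking) would sit at the deterministic end of the proposed scale, and algorithms with random initialization or sampling would move toward the other end unless their seeds are fixed.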
I am working on sensor data to detect deviations in people's behavior, and my data is fully unlabeled. I have read some papers about transfer learning to find a suitable method for detecting the deviations and applying it to data from different sensors, but I have not found an idea yet. If you have an idea, please share it with me. Thank you.
Hi ,
I know that most existing probabilistic and statistical term-weighting schemes (TF-IDF and its variations) are based on the linked independence assumption between index terms. On the other hand, semantic information retrieval seeks to exploit the linked dependence between index terms.
I am wondering when linked dependence between index terms is vital, and when we can neglect it.
Note: dependence assumption: if two index terms have the same occurrences in the document, this suggests that the index terms are dependent and should have the same term-weight values.
Thanks
Osman
Data scientist, simulation system, data source ...
Which database is required for image retargeting,
and is it freely available? I could not find anything at the following link: http://people.csail.mit.edu/mrub/retargetme.
Here I have attached an image. Where the blue dots are actual data points and the red line is my prediction(using linear regression). Can you suggest any good model for this dataset ?
It should be noted that the input was 50 dimensional and I am only showing the outputs.
Thank you.

Kafka needs an input method, since it is just a data bus. What are the best data gateways?
the dataset must be on student information
Hi,
I know that some Support Vector Machine approaches and other machine learning approaches reduce the number of samples in the training set to reduce the computational run-time. This can work very well on large training sets if small portions (small samples) of them are representative of the instance characteristics. However, it will not perform as well on training sets with a lot of variation in the instances.
Is there any machine learning method that reduces the computational run-time while still involving all samples in the learning process?
Thanks
Osman
Can I use any predictive or machine learning approach to improve the quality of health care? Or can I use it for disease prediction?
Dear Professors and research fellows,
Can anyone recommend me some tools or research articles about data extraction from Facebook for data mining and social network analysis please?
Thanks!
hi,
I am working on research whose purpose is to forecast future sales demand. I have annual data for about 27 years, so my data set is obviously small. I am trying to train a model that can forecast sales demand 6 years ahead.
At first, I trained my model with the annual data of each year. To clarify, the year 2000's input data are set with the year 2000's actual sales in one row. When I trained with this technique, everything was good and I got an R-squared of about 99%. The problem is that, if I want to forecast the next 6 years, I need the input data for the next 6 years, for instance the currency rate for the next 6 years, which will decrease my model's forecast accuracy.
I came up with the idea of training each year with the input data of earlier years. For example, in training, I pair the currency rate of year 94 with the actual sales of year 2000.
With this technique I can use the year 2016's input data in order to forecast the year 2022's sales.
Is this technique logical?
In my dataset, both response and the independent variables are ordinal with multiple categories as severe (coded as 3), moderate (coded as 2), mild (coded as 1) and none (0)
I ran the model with both the outcome and the variable in descending order to see if there was a relationship between the response and the ordinal variable, and I obtained the results shown below:
threshold: response (3)
response (2)
response (1)
variable (1): exp(B)=3, P=0.039, 95% CI=1.061-9.253
and the reference category is variable (0)
Could you help me to interpret the results?
At the moment, I am reviewing the state of the art concerning the concept of the smart city in the context of intelligent IT systems deployment.
I am particularly interested in research results, describing lessons learned from the field i.e. best practices, project schedules, engaged resources etc.
Anyone interested in collaboration, please let me know.
I want to migrate virtual machines in cloudsim simulation
I am fine-tuning AlexNet using Caffe on my own data. I tried three different versions of Caffe (the ones released in 2014, 2015 and 2016) and the outputs are quite different; on some datasets the differences are more than 10 percentage points. Generally, the older version outperforms the newer ones. Why?
I'd like to mine web pages that'd result in a dataset of pages taken from a particular website (eg. news sites). It'd target articles not only from one section but also from the other sections on the site (for instance, politics, tech and etc. from CNN.com). All of these articles are combined and retrieved from the 3 years publication and that means I'd have all of the articles published in the 3 years time. What are the tools and techniques that I can opt to do?
Preferably for Mac OS. I found some APIs like Highcharts, but I'm looking for a standalone app. Thanks.
I've been working on a HOG detection application lately and I am training a linear SVM with a data set. The best penalty parameter C is obtained during cross-validation and corresponds to the one giving the lowest false positive and false negative rates. Then a hard negative mining step is done by testing the negative training set against the SVM and adding the resulting hard negative examples to the training set. This step yields a lot of hard negative samples, so I sort them by their probability of being misclassified and keep only the ones above a confidence threshold (e.g. > 0.9 of being classified as a true sample). I also have another parameter to set a maximum number of hard negative samples to keep if I find too many of them above the threshold.
In my experience (which isn't much), and after a couple of runs, it appears that having too many hard negative examples doesn't improve the classifier. Also, my confidence threshold and the number of samples I want to keep depend on my C penalty parameter.
I would like to know if there are some rules or tips for choosing the optimal combination of the C parameter and hard negatives, or if the only way is to perform multiple runs. Any advice is welcome!
Thank you!
I am trying to compile my application for TOSSIM. Even though it compiles for the hardware, it throws run-time errors when executing 'make micaz sim'. I was wondering if this is because I am using the 802.15.4 MAC. Does TOSSIM support tkn154? (The given examples do not compile.)
Need some basic tools for identifying original content in social media
I collected data from TripAdvisor, and the users' locations are not in a fixed format. Some of them used a city name and some used their country name.
Is there any way I can merge them and code them all as country names?
I'm working on "community detection in networks considering node attributes". In this regard, I need some benchmark networks for testing my proposed algorithm by comparing predicted labels (community assignments) with the real ones (ground truth). These networks should be undirected, include non-overlapping communities and range from small to big sizes; edges should show the relations between nodes, nodes should have some personal features likely to affect their community memberships, and the true labels of the nodes should be known as ground truth for evaluating my predicted labels. Although I searched extensively, I unfortunately couldn't find any networks with these characteristics.
I would really appreciate it if anybody could point me to some references or network benchmarks that satisfy my requirements.
Thank you in advance for your time and cooperation
Best regards,
Esmaeil
Dear
I want to compare the performance of two SVM algorithms, SMO and LibSVM, in the WEKA tool:
SMO cross-validation k = 3 then k = 10 with kernel (Linear, PolyKernel, RBF)
LibSVM cross-validation k = 3 then k = 10 with kernel (Linear, PolyKernel, RBF, Sigmoid)
I get the results in the attached file.
Is it really the case that SMO with k = 10 and RBF is my best solution?
If so, which metrics (2 or 3) should I choose to make a graph that justifies my choice?
Thank you for your collaboration.
Hamid SLIMANI
I'm looking for a freely available dataset for Arabic microblog retrieval
I am doing research on big data stored in the cloud, and I would like to ensure that the modified data is intact.
Is there any alternative to 'crowd signals' to collect real data from mobile users across different countries?
I have obtained part of the TF-mRNA relationships from the CRSD database. However, I want a more comprehensive database of TF-mRNA relationships to validate my results. I would greatly appreciate pointers to a more comprehensive database.
I have been doing research on huge data sets using Hadoop. Can anyone provide me links for huge data sets for analysis purpose?
Thanks in advance.
In particular, if yes, are all major algorithms like Apriori, FP-Growth and Eclat NP-hard, or only Apriori?
I am thinking about working in the area of career counselling using sentiment analysis of reviews available on the web.
I'm looking for any data on Population Statistics or Modelling for the past periods to compare them with our results extracted from Wikipedia biographical pages (see attached links).
I am interested in datasets for knowledge discovery from data streams.
It would be valuable to obtain real datasets, but artificial ones are fine for me as well. I am aware of: http://www.liaad.up.pt/kdus/products/datasets-for-concept-drift.
I'm hoping to be able to examine the frequency of themes arising for an open ended question administered as part of a survey (n>1000). The answers are typically one to two lines of text. I'm hoping to identify a method and software package that will automate this process to some degree rather than coding by hand. Any help would be greatly appreciated. Thanks.
WBW,
TDC
Normalization is done to map the data to a uniform scale. For instance, when the inputs to an ANN are on widely different scales, normalization is normally used to get the same range of values for each of the input features. To do this, several standard data normalization techniques such as min-max, softmax, z-score, decimal scaling and Box-Cox are available (this list is not exhaustive and many more techniques are in use). As far as I know, the min-max technique preserves all the relationships in the original dataset exactly. Z-score is often used when responses are on different magnitude scales. But both techniques are sensitive to outliers in the data.
Is there a general guideline to determine the appropriate technique for a particular application? Should the normalization method be solely determined by the range of input features (for removing scaling effect)? Does it depend on the choice of activation functions (logsig [0, 1] or tansig [-1, 1], etc.) as well? Does it depend on the type of the problem we are trying to solve (classification, function approximation, prediction, forecasting of time-series data, etc)?
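The two techniques discussed above can be written down in a few lines, which also makes the outlier sensitivity visible. This is only a sketch: the `feature` values are hypothetical, and the target ranges [0, 1] and [-1, 1] are chosen to mirror the logsig and tansig output ranges mentioned in the question.

```python
from statistics import mean, pstdev


def min_max(values, low=0.0, high=1.0):
    """Linearly rescale values so that the minimum maps to low and the maximum to high."""
    lo, hi = min(values), max(values)
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]


def z_score(values):
    """Center values on the mean and scale by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


feature = [2.0, 4.0, 6.0, 8.0, 100.0]  # hypothetical feature with one outlier

scaled_01 = min_max(feature)              # range [0, 1]
scaled_pm1 = min_max(feature, -1.0, 1.0)  # range [-1, 1]
standardized = z_score(feature)

print(scaled_01)
print(scaled_pm1)
print(standardized)
```

In the [0, 1] result the four regular values are squeezed into a narrow band near 0 because the single outlier defines the maximum, which is the sensitivity noted above.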
Please have a look at the following documents and give me your valuable opinion. Scientifically, how correct or acceptable is it to obtain continuous data by accumulating different statements as follows? I am going to use regression analysis in my study. Most of my independent variables are constructed in the way mentioned below.
Thanks in advance.
Hope for your best cooperation
I am working on association rule mining for a retail dataset. Can you provide a link to download data where demographic information and the items purchased, with quantities, are available?
I am doing research on extracting new indicators of business performance using sentiment analysis of news headlines. To do that, I need a collection of headlines from well-known news agencies like Reuters. If anybody has this data or knows how to get it, please help me.
What is the number of techniques involved?
What is the problem with these techniques? Is there any room to enhance them to make them more efficient and reliable?
Type of Unstructured Data : Images (jpg, tiff , gif , etc)
How can product details be extracted from a shopping website?
I have a dataset which contains 3 types of attributes: categorical, ordinal and numerical. All of these attributes are represented by numbers.
Is it meaningful to apply classifiers which accept numeric attributes as input for classification (without using the nominal2numeric filter)?
I think applying nominal2numeric does not differentiate between these 3 types of attributes, so an incorrect model will be created.
I am searching for help in the line of prediction advantage, resistance to irrelevant predictor variables in terms of prediction. Thanking you...
I collected tweets in 2013 and would like to analyse the topics in them.
How to deal with the matrix input for the practical data mining?
I am trying to analyze the text corpus of tickets from a ticket system in order to cluster the requests according to the words used in them. I use the vector space model and have calculated the values for tf, idf, tf x idf and entropy.
Due to the amount of dimensions (about 8.500) I would like to extract the key words. Is there any statistical approach depending on a specific threshold for entropy/tfidf? Of course I could say that document frequency has to be at least x and entropy at least y, but how can I verify this point?
Thanks for your help!
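As a starting point for experimenting with such thresholds, the sketch below computes tf-idf and filters terms by document frequency. Everything in it is an assumption made for illustration: the `tickets` corpus is invented, the helper names are hypothetical, and the cut-offs `min_df` and `max_df_ratio` are arbitrary values, not recommended settings.

```python
from collections import Counter
from math import log


def document_frequency(documents):
    """Number of documents in which each term occurs at least once."""
    return Counter(term for doc in documents for term in set(doc))


def tf_idf(documents):
    """Return one {term: tf-idf} dict per tokenized document."""
    n = len(documents)
    df = document_frequency(documents)
    scores = []
    for doc in documents:
        tf = Counter(doc)
        scores.append({t: (c / len(doc)) * log(n / df[t]) for t, c in tf.items()})
    return scores


def select_terms(documents, min_df=2, max_df_ratio=0.8):
    """Keep terms that are neither too rare nor too common across documents."""
    n = len(documents)
    df = document_frequency(documents)
    return sorted(t for t, c in df.items() if c >= min_df and c / n <= max_df_ratio)


# Hypothetical tokenized tickets
tickets = [
    ["printer", "not", "working"],
    ["password", "reset", "not", "possible"],
    ["printer", "paper", "jam"],
    ["password", "expired"],
    ["vpn", "not", "connecting"],
]

scores = tf_idf(tickets)
keywords = select_terms(tickets, min_df=2, max_df_ratio=0.5)
print(keywords)
```

One way to check a chosen threshold, rather than fixing it a priori, is to rerun the clustering for several cut-off values and compare cluster quality or stability on held-out tickets.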
Can someone suggest a corpus containing visualization- or chart-related terms?
I have a big dataset (97,000) of visualization images and their related captions. I need to label each image with its respective name. As I have no experience, I would prefer not to go into image analysis.
I would like to use some existing corpus or vocabulary to automatically label these captions.
Does anyone know of such a corpus, or can someone suggest a better way to do that?
NOTE: By "visualization images" I mean images whose content is a data visualization or chart, like a line chart, bar plot, dendrogram, etc.
Thanks
I'm working on applying the Kullback-Leibler and Jensen-Shannon divergences to some texts (e.g. A and B). In some cases A and B do not have the same vocabulary, so I need to assign a value to the unseen n-grams. So far I have been setting a small probability for those unseen cases by hand. However, in some cases this value is not small enough and I get negative divergences.
Therefore, I am looking for a smoothing method that does not require a learning corpus. I must use the probability values from the texts themselves and not from a learning corpus. Also, if possible, I would rather not use a smoothing method that combines different n-gram sizes. The reason is that I make use of n-grams, but several of them are skip n-grams.
P.S. For the moment I have been testing the Good-Turing smoothing.
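One option that needs no learning corpus is additive (Lidstone) smoothing over the union of the two vocabularies, sketched below. It is a swapped-in technique, not Good-Turing, and the smoothing constant `alpha`, the function names and the two toy bigram lists are hypothetical. The Kullback-Leibler divergence between two properly normalized distributions can never be negative, so negative values usually indicate that the probabilities no longer sum to 1 after a value was assigned by hand; the sketch renormalizes to avoid that.

```python
from collections import Counter
from math import log2


def smoothed_distribution(ngrams, vocabulary, alpha=0.01):
    """Additive smoothing over a shared vocabulary; the probabilities sum to 1."""
    counts = Counter(ngrams)
    total = len(ngrams) + alpha * len(vocabulary)
    return {g: (counts[g] + alpha) / total for g in vocabulary}


def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; p and q share the same keys."""
    return sum(p[g] * log2(p[g] / q[g]) for g in p)


def js(p, q):
    """Jensen-Shannon divergence in bits, bounded between 0 and 1."""
    m = {g: 0.5 * (p[g] + q[g]) for g in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical bigrams extracted from two short texts A and B
text_a = ["the cat", "cat sat", "sat on", "on the", "the mat"]
text_b = ["the dog", "dog sat", "sat on", "on the", "the rug"]

vocab = set(text_a) | set(text_b)
p = smoothed_distribution(text_a, vocab)
q = smoothed_distribution(text_b, vocab)

print(kl(p, q), kl(q, p), js(p, q))
```

Because the vocabulary is just the set of n-grams observed in A or B, the same code applies unchanged to skip n-grams, and no back-off to other n-gram sizes is involved.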
I am doing a research project on Network Intrusion Detection System. How can I select the attributes from the packets captured from the network to make it similar to attributes in KDD dataset?
This is for a Special issue in Systems Medicine.
Please describe the algorithm and its implementation.
Hello. Can anyone please suggest which data pre-processing technique would be good to apply before the Naive Bayes algorithm? Thank you.
This may help us to publish our work on high impact factor journals.
Hi. Are there any merits of one over the other for evaluating classifiers (e.g. SVM for image processing problem domain)?