Web Content Extraction - Science topic

Web content extraction is the process of identifying the Main Content and/or removing additional items such as advertisements, navigation bars, design elements or legal disclaimers. The rapid growth of text-based information on the Web, and the many applications that make use of this data, motivates the need for efficient and effective methods to identify and separate the Main Content (MC) from the additional content items.
Questions related to Web Content Extraction
  • asked a question related to Web Content Extraction
Question
4 answers
I want to compute the adjusted cosine similarity in an item-based collaborative filtering system for two items, a and b, represented by the vectors a={2,3,1,0} and b={1,0,4,2}. I know how cosine similarity works, but I am stuck on the adjusted cosine similarity approach. We are working on a collaborative filtering recommender system in which we need to find similar items using adjusted cosine similarity. Could those who work on CF recommender systems please guide me?
Relevant answer
Answer
You can find a self-explanatory example in the attachment :)
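Since the attachment is not reproduced here, the following is a minimal sketch of the usual adjusted cosine definition: each user's mean rating is subtracted before the cosine is taken over the users who rated both items. It assumes every position in a and b is one user and that the 0 values are real ratings rather than missing ones (use None for missing). The third item c is hypothetical and only added because, with just two items per user, the mean-centred values are always exact opposites and the similarity collapses to -1.

```python
from math import sqrt

def adjusted_cosine(ratings, i, j):
    """Adjusted cosine similarity between item columns i and j.

    ratings: list of per-user rows; None marks a missing rating.
    """
    num = den_i = den_j = 0.0
    for row in ratings:
        if row[i] is None or row[j] is None:
            continue  # only users who rated both items contribute
        rated = [r for r in row if r is not None]
        mean = sum(rated) / len(rated)  # this user's average rating
        di, dj = row[i] - mean, row[j] - mean
        num += di * dj
        den_i += di * di
        den_j += dj * dj
    if den_i == 0 or den_j == 0:
        return 0.0  # similarity is undefined; returning 0 is a convention chosen here
    return num / (sqrt(den_i) * sqrt(den_j))

a = [2, 3, 1, 0]
b = [1, 0, 4, 2]
c = [3, 1, 2, 4]  # hypothetical third item, not from the question

two_items = [list(user) for user in zip(a, b)]       # rows are users
three_items = [list(user) for user in zip(a, b, c)]

print(adjusted_cosine(two_items, 0, 1))    # degenerate two-item case
print(adjusted_cosine(three_items, 0, 1))  # with the extra hypothetical item
```

In a real system the user means would come from all items each user has rated, so the similarity between a and b depends on the rest of the rating matrix, not only on the two vectors.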
  • asked a question related to Web Content Extraction
Question
48 answers
I am doing research on Twitter sentiment analysis related to financial predictions, and I need a historical Twitter dataset going back three years. Last year Twitter announced that it would release historical data for scientific purposes.
Does anybody have an idea how to get this data?
Relevant answer
Answer
Hi all, I finished my project and wanted to pass on my lessons learned.
First off, I was doing stance classification on Twitter threads. This meant that I had to get an entire thread, in real time and at scale, since I was looking at Trump's tweets. So I had somewhat complex and unique requirements, and no budget.
I'm also pretty much fluent in Python.
That being said...
My first tendency was to grab the HTML with BeautifulSoup.
The problem, as mentioned on the landing page shared above by Svitlana Galeshchuk (https://github.com/Jefferson-Henrique/GetOldTweets-python), is that the HTML is created dynamically as you scroll down the page, so just scraping isn't enough to get all responses (or old tweets).
Second, I used the Twitter API, which I discovered won't always give you the full text (this is documented but easy to miss), not to mention all the other limitations of the search and streaming APIs.
Besides that, the streaming API is definitely worth looking at, as it's very powerful. However, your research needs to be about unfolding social networks (obviously), and, less obviously, your analysis needs to accept a semi-random sample, since they don't guarantee you'll see every new tweet.
This was what I used for my research, because by the time I realized the issue with getting full tweet texts I had already built everything, so I grabbed the full text using BeautifulSoup and the Python requests module.
Finally, what I would suggest is using some sort of web simulator like Selenium in order to "scroll down", then parsing with BeautifulSoup.
It sounds like GetOldTweets-python, mentioned above, might be a nice package that includes this, and some other solutions mentioned include this simulator aspect, but if you're trying to get all replies this is pretty much essential.
It's possible to work around using a simulator, but that probably isn't worth it, even though setting up a simulator often involves getting drivers for your browser, setting path variables, etc., all of which are a pain to do but in my opinion probably worth it.
  • asked a question related to Web Content Extraction
Question
1 answer
We have witnessed the power of a regular search engine like Google. There is a semantic search engine like Swoogle as well. However, we are trying to build a semantic search engine with more user friendly display capability and relevant ranking algorithm. Can anybody suggest ideas?
Relevant answer
Answer
Where can I find more formal information about the semantic search of the RG search engine?
  • asked a question related to Web Content Extraction
Question
1 answer
Given a headline and a body of text from an article, find the stance from the following options:
Agrees: The body text agrees with the headline.
Disagrees: The body text disagrees with the headline.
Discusses: The body text discusses the same topic as the headline, but does not take a position.
Unrelated: The body text discusses a different topic than the headline.
What features would you use when trying to build a classifier?
Relevant answer
Answer
Have a look at these articles:
1. SemEval-2016 Task 6: Detecting Stance in Tweets
2. Recognizing Stances in Ideological On-Line Debates
  • asked a question related to Web Content Extraction
Question
8 answers
Good afternoon,
I have to conduct a search of the information and support sources on visual impairment that are available to laypeople on the Internet (i.e. websites, blogs, Facebook...).
The aim is, on the one hand, to know "what is there" on the Internet and, on the other hand, to analyze the resources found to determine their strengths and shortcomings. The latter is not a problem for me (the literature on this topic is quite extensive); what I do not know is whether there is a rigorous procedure to follow when surfing the Internet and selecting the results (e.g. as when a systematic review of the written literature is conducted).
I mean: would it be proper to select, e.g., the first 20 Google search results according to some inclusion-exclusion criteria? Is 20 enough? Is it too few? Where should the limits be?
If you have done something similar, have you followed any methodological guidelines?
Thanks,
Marta
Relevant answer
Answer
Agreed with Henk... Don't confuse yourself with the millions of pieces of information on the Internet. Focus on your topic and continue searching. You will soon find which websites are good on particular points. It is always good to review the corresponding sites and, if possible, verify them against the literature.
  • asked a question related to Web Content Extraction
Question
3 answers
I have read that Facebook's APIs, called Graph and Feed, can be used to retrieve a given user's public profile. But as I noticed, the Feed API is currently not available to developers. Is anyone aware of the possibility of getting the public profile of a given user through the Graph API?
Relevant answer
Answer
The API was changed a while ago. Now it is possible to mine users' profiles only through Facebook apps that have explicit permission from the user.
  • asked a question related to Web Content Extraction
Question
4 answers
I am doing semi-supervised classification with WEKA. The test set of my data (Twitter) is unlabeled: no class is assigned, and the class value is replaced with (?). I used WEKA to convert the test data from CSV to an ARFF file, which automatically sets the datatype of the class attribute to 'string'. When I try to run it after creating the model with my training data, it gives errors regarding the string datatype given to the unlabelled class attribute. My question is: what datatype is suitable for a class attribute containing '?' to avoid such glitches?
Relevant answer
Answer
Hi ,
An attribute of any data type can have the value "?". An attribute with "?" values is most likely nominal; hence, if you have such an attribute with string type (which is weird), simply apply the "StringToNominal" filter. This will convert the string attribute into a nominal one once you specify the index of that attribute.
HTH.
Samer 
  • asked a question related to Web Content Extraction
Question
2 answers
I'm looking for a freely available dataset for Arabic microblog retrieval 
Relevant answer
Answer
check this site out for datasets
  • asked a question related to Web Content Extraction
Question
3 answers
I am planning to use R to build a “word cloud” to find the central theme or core intention of 2100 respondents in a local-level planning process. The questionnaires were filled out in five districts of Nepal - mountain, hill and tarai madhesh.
Relevant answer
Answer
Yes, it is valid. I have developed an application using this technique to find text clones; you can read my research paper to get a good picture of it.
  • asked a question related to Web Content Extraction
Question
5 answers
I am doing research on extracting new indicators of business performance using sentiment analysis of news headlines. To do that, I need a collection of headlines from well-known news agencies like Reuters. If anybody has this data or knows how to get it, please help me.
Relevant answer
Answer
I can suggest using the dataset from the GDELT project: http://gdeltproject.org/
  • asked a question related to Web Content Extraction
Question
3 answers
I am doing research in review mining. I need reviews of mobile phones and hotels. Does anyone have an idea how to extract these reviews?
Relevant answer
Answer
You can retrieve the reviews through the APIs usually made available to developers. Here are some entrypoints you may use to find some info:
  • asked a question related to Web Content Extraction
Question
2 answers
I need to extract text from an image.
I am thinking of partitioning the image into several layers using a Gaussian mixture model based on color. Is this approach correct?
Relevant answer
Answer
THANK YOU,
Text is from any form, but not handwritten and yes, there is different foreground and background color
  • asked a question related to Web Content Extraction
Question
3 answers
Suppose I need to extract the code for the voting portion of a webpage alone. Is there any tool for doing this?
Relevant answer
Answer
We use a Python-based scraper: http://www.crummy.com/software/BeautifulSoup/. After scraping, perform some manual editing and get things done as per your specific requirements.
  • asked a question related to Web Content Extraction
Question
14 answers
I'm interested in finding ontologies in the domain of sustainable territories
Relevant answer
Have you tried https://duckduckgo.com? Some of the search engines above don't work.
  • asked a question related to Web Content Extraction
Question
8 answers
Hi,
maybe someone knows where I can find a webpage dataset for information extraction evaluation. I need a set like:
- domain_1 = { {web_page_1, {relevant entities}}, ..., {web_page_n, {relevant entities}} }
I created a wrapper induction algorithm based on a domain's web pages. This algorithm can extract important entities from these pages (for example, from a movie domain it extracts information such as film title, actors' names, etc. from each page). I created a reference dataset (I labeled 3 domains and 200 documents). But maybe there is another, better reference dataset?
Maybe someone also knows where I can find software to compare with my solution (semi-supervised information extraction from web pages based on HTML structure)?
Relevant answer
Answer
Please have a look at http://tinyurl.com/o8ykn4y. This is the dataset used to evaluate a recent work on IE at web scale. The full description of this work and of the way the corpus was extracted is given in http://www.aclweb.org/anthology/D15-1086. More information is available on the project page of this group of authors: http://oak.dcs.shef.ac.uk/lodie/
  • asked a question related to Web Content Extraction
Question
4 answers
I am working on text classification using an ant colony algorithm, but I am confused about how to compute the feature vector for the test set.
For the training feature vectors, I took the TF-IDF vector of each training document and constructed a feature matrix [docs x terms] from the TF-IDF values.
But how about computing the test set's feature vectors? Should I just use the TF-IDF values from the training set to compute them?
E.g., in the training set, the document frequency of a particular word "apple" is 5. For the test set, should I use the value 5 for "apple", or recompute the TF-IDF based on the test set? Or am I going the wrong way in computing the feature vector?
Thanks in advance!
Relevant answer
Answer
Hi,
As far as I know, one of the main concerns in performance evaluation of machine learning techniques is that you should not use the training set as the test set too. In your case, if you use the TF-IDF or any feature vector values from your training set in your test set, your result will not be accurate enough.
I hope it helps. 
Mostafa
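For what it is worth, one common convention (which may differ from what the answer above has in mind) is to keep the test documents themselves out of training, but to reuse the vocabulary and document frequencies learned from the training set when vectorising test documents, so that both matrices share the same columns. In the "apple" example that would mean keeping the document frequency of 5. Below is a minimal sketch of that convention; the function names fit_idf and transform, the whitespace tokenisation and the smoothed IDF formula are assumptions made for illustration.

```python
from collections import Counter
from math import log

def fit_idf(train_docs):
    """Learn smoothed IDF weights from the training documents only."""
    n = len(train_docs)
    df = Counter()
    for doc in train_docs:
        df.update(set(doc.split()))  # count each term once per document
    return {term: log((1 + n) / (1 + d)) + 1 for term, d in df.items()}

def transform(doc, idf):
    """Vectorise any document (training or test) with the training IDF."""
    tf = Counter(doc.split())
    # Terms never seen in training have no column, so they are dropped.
    return {term: count * idf[term] for term, count in tf.items() if term in idf}

train = ["apple banana apple", "banana cherry", "apple cherry cherry"]
idf = fit_idf(train)

test_vector = transform("apple durian apple", idf)
print(test_vector)  # only "apple" survives; "durian" was never seen in training
```

The test document's term frequencies are its own; only the IDF part comes from training.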
  • asked a question related to Web Content Extraction
Question
4 answers
Hello, everyone
I am implementing web page classification. I am currently testing on a small dataset of about 50 web pages (sport, business, etc.) downloaded from the BBC website. But I need more web pages for further implementation and to calculate the classification accuracy. Therefore, if you know of or have a web page dataset, could you please share it with me or give me links?
Thanks,
pan ei san
Relevant answer
Answer
Are you doing an analysis of the text on these webpages? A lot of other pages aggregate news from different web sources and classify them. News agencies like Associated Press and Reuters might also be places to use. Google News also seems to be good at indexing different news stories.
  • asked a question related to Web Content Extraction
Question
7 answers
Hello, everyone
I am interested in content extraction from HTML web pages. I currently use HTML tags to divide a web page into blocks, and use the tag-to-text ratio, the anchor-text-to-text ratio and title density to extract the main content. But not all HTML tags are appropriate for content extraction. So I want to know which tags are more accurate and more suitable for web page cleaning. Thank you all...
Relevant answer
Answer
try visual ping
you can visually select what you want to extract
  • asked a question related to Web Content Extraction
Question
10 answers
I'm looking for recent developments in automated analysis of Twitter, Facebook, or any other text-based social media streams. What are researchers able to extract? How are facts gathered, summarized, visualized?
If you can point me to recent research, technologies, and specifically conferences dealing with automation of social media content, I'd much appreciate it. VR
Relevant answer
Answer
Martin Hawksey's TAGS is a great tool for harvesting tweets. It's simple to use and data is sent to a Google spreadsheet which can be downloaded as .xlsx 
It also has built-in visualizations so you can see, for example, numbers of interactions and who has interacted with whom.
I've written a little about my experience using it for the ascilite2014 conference.
I hope this helps.
  • asked a question related to Web Content Extraction
Question
12 answers
Datasource which is as well free of cost and permitted to download...
Relevant answer
Answer
  • asked a question related to Web Content Extraction
Question
9 answers
I want to get a Moodle learning dataset in CSV format. Is there any open-source Moodle dataset for research purposes, and can anyone suggest tools to extract Moodle web data in CSV format?
Relevant answer
Answer
Some demo data, including student activity logs/analytics, and grades that can be exported in CSV format are available from the "Mount Orange" Moodle Demo Site: http://moodle.com/any-way-you-want-it-moodles-demo-sites/ 
Did you mean something different, or did you want more or "real" data perhaps?
  • asked a question related to Web Content Extraction
Question
4 answers
Many years ago I read a paper on a hardware implementation of an information retrieval system. It was implemented as a circuit board, where the query would be set by putting jumpers on one side of the board and the result would be indicated by LEDs or the equivalent on another side of the board. The math behind it was very insightful, and I'd love to find it again, but I've been unable to. The paper was written (probably well) before 1975, perhaps even in the 1950's. I vaguely remember that the primary author's name began with an S but that's as far as I've gotten. (I'm not thinking of Vannevar Bush's Memex.)
Can anyone help?
Relevant answer
Answer
Dear Sir,
Check the attached file. Maybe it can help you.
Thanks
  • asked a question related to Web Content Extraction
Question
1 answer
Hello Everyone,
I want to know how to get the DBLP and SIGMOD query sets. If you know the links, could you please share them with me? If the query sets cannot be obtained from such links and you created the test queries yourself when processing queries, please share those with me. Thank you all.
Relevant answer
Answer
I am not sure about a DBLP data set. But, if you can explore it, the following link is useful for getting good data sets for typical analytical problems. I hope this may be of use.
  • asked a question related to Web Content Extraction
Question
12 answers
I have:
- Polarity words.
Example:
- Good: Pol 5.
- Bad: Pol -5.
My assignment:
Determine whether a document is negative or positive. Please tell me how to do that; I'm a newbie in NLP (sentiment analysis).
I want to use polarity for this, not Naive Bayes. So could anyone tell me about an algorithm based on polarity words?
Thanks for your time.
Relevant answer
Answer
Hi Phuong, you can try many algorithms:
There are also many tools you can try, such as Weka, Matlab, RapidMiner, etc.
Note that instead of using only 2 classes (positive and negative) in sentiment analysis, you can consider adding a neutral label as a 3rd class.
This article and paper could serve as further references:
Good Luck
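As a starting point for the polarity-word approach asked about in the question, here is a minimal sketch of a lexicon-based scorer: sum the polarity of every word found in the lexicon and read the sign of the total, with a neutral class for a zero score as suggested above. The lexicon only contains the two words from the question, and the tokenisation and function names are assumptions for illustration; it does not handle negation ("not good") or intensifiers.

```python
import re

# Polarity words from the question; a real lexicon would be much larger.
LEXICON = {"good": 5, "bad": -5}

def polarity_score(text, lexicon=LEXICON):
    """Sum the polarity of every lexicon word that occurs in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(token, 0) for token in tokens)

def classify(text, lexicon=LEXICON):
    """Map the summed score to positive, negative or neutral."""
    score = polarity_score(text, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify("Good food, good service, bad parking"))
print(classify("The delivery was bad"))
```

Dividing the score by the number of matched words is a common variation when documents differ a lot in length.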
  • asked a question related to Web Content Extraction
Question
6 answers
Hello, please can you share info with me about how to count the stop words and tokens for text. I would like clarification with examples. Thanks
Relevant answer
Answer
Stop words are the most commonly used words in the particular language you are working with. Most of the time, 20% to 30% of a normal text document consists of stop words. Stop words can cause problems when searching for phrases that include them, so they are filtered out when the text is processed. The density of a word is calculated as follows:
Density = (frequency(word)/count(word))*100
frequency of word :- number of occurrences of the word in the document.
count(word) :- total number of words in the document.
Token count is the number of tokens in the text, i.e. the number of basic units you have defined for your language; mostly this is the word.
Any word consisting of only 1 or 2 characters won't be of any significance, so we remove all of them.
To remove stop words, we first need to detect the language. There are a couple of ways we can do this:
- Checking the Content-Language HTTP header
- Checking the lang="" or xml:lang="" attribute
- Checking the Language and Content-Language metadata tags
If none of those are set,
You will need a list of stop words per language, which can easily be found on the web.
Try doing it in Python or R; it will be easier for you.
- Mayur
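The counting described above can be sketched in a few lines of Python. This is only an illustration: the stop word set is a tiny made-up sample rather than a real per-language list, and the function name and return format are assumptions. Density follows the formula given in the answer, with the total number of tokens as the denominator.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; use a full per-language list in practice.
STOPWORDS = {"the", "is", "a", "of", "and", "to", "in"}

def token_stats(text, stopwords=STOPWORDS):
    """Count tokens and stop words, and compute word density in percent."""
    tokens = re.findall(r"\w+", text.lower())
    stop_count = sum(1 for t in tokens if t in stopwords)
    # Drop stop words and words of only 1 or 2 characters, as described above.
    kept = [t for t in tokens if t not in stopwords and len(t) > 2]
    freq = Counter(kept)
    density = {w: 100.0 * n / len(tokens) for w, n in freq.items()}
    return {"tokens": len(tokens), "stopwords": stop_count, "density": density}

stats = token_stats("The cat is on the mat and the cat sleeps")
print(stats["tokens"], stats["stopwords"])
print(stats["density"])
```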
  • asked a question related to Web Content Extraction
Question
4 answers
Dear all, I want to get some stop words for web page classification, to use when training classifiers. If you know of a link or how to get these stop words, could you please share it with me? Thanks all.
Relevant answer
Answer
Make a file containing the words from a sample of your pages, one word per line, then:
sort words.txt | uniq -c | sort -nr | head -n 100
The output will be the 100 most common words - this is pretty close to what you want.
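If the page text is not already split into one word per line, roughly the same result can be had in Python. This is a sketch of the same frequency-ranking idea, not a replacement for a curated stop word list; the function name and the letter-only tokenisation are assumptions.

```python
import re
from collections import Counter

def most_common_words(text, n=100):
    """Return the n most frequent words in text as (word, count) pairs."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    return Counter(words).most_common(n)

sample_text = "the cat and the dog and the bird"
print(most_common_words(sample_text, 3))
```

The top of such a ranking still needs a manual check, because frequent topic words from your sample will be mixed in with genuine stop words.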
  • asked a question related to Web Content Extraction
Question
8 answers
I am interested in doing some work in area of semantic web crawling/scraping and using that semantic data to do some discovery.
Relevant answer
Answer
Hi,
Another type of ontology is knowledge graph such as Freebase (https://www.freebase.com/), which allows users to download the weekly data dumps or use API to access the information.
best regards,
  • asked a question related to Web Content Extraction
Question
9 answers
Hello everyone!
Can you advise me which Java edition I should learn for this?
Relevant answer
Answer
Generally, in academic work, the core of a programming language is used, not software technologies.
But if you want to implement a package for classification and information extraction, the decision between web-based (J2EE) and desktop-application-based (J2SE) should be made through careful consideration of customer requirements.
  • asked a question related to Web Content Extraction
Question
18 answers
I have read a couple of articles which are trying to sell the idea that an organization should basically choose between either implementing Hadoop (which is a powerful tool when it comes to unstructured and complex datasets) or implementing a Data Warehouse (which is a powerful tool when it comes to structured datasets). But my question is: can't they actually go together, since Big Data is about both structured and unstructured data?
Relevant answer
Answer
It's very hard to answer this question in general without taking into considerations what your specific needs are. Also, "Data Warehouse" is a pretty general term which basically can mean any kind of technology where you put in your data for later analysis. It can be a classical SQL database, Hadoop (yes, Hadoop can be a Data Warehouse, too), or anything else. Hadoop is a general Map Reduce framework you can also use for a lot of different tasks, including Data Warehousing, but also many other things. You also have to bear in mind that Hadoop itself is a piece of infrastructure which will require a significant amount of coding on your part to do anything useful. You might want to look into projects like Pig or Hive which build on Hadoop and provide a higher level query language to actually do something with your data.
Ultimately you have to ask yourself what existing infrastructure is already in place, how much data you have, what kind of questions you want to answer from your data, and so on, and then use something which fits your needs.
  • asked a question related to Web Content Extraction
Question
39 answers
I'm developing a strategy as a MSc project. I will be monitoring, collecting, and analyzing the data of a Facebook page (posts, comments, likes, shares) and a Twitter profile (tweets, retweets, mentions, and public tweets containing one/two keywords only). Any suggestions would be great. Also, what mining techniques do you recommend? I'm thinking sentiment analysis and would like to use one or two more techniques. What techniques do you recommend?
Thanks
  • asked a question related to Web Content Extraction
Question
1 answer
The network for routing the query is based on a Markov process. If we want to model the time taken to answer a query, is a probabilistic timed automaton a better model?
Relevant answer
Answer
A stochastic queue would be indicated. Look up queueing theory. Start here: "http://en.wikipedia.org/wiki/Queueing_theory". If you wanted to use a number of probabilistic timed automata, you would then have the complexity of having to build in the appropriate statistical properties.
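To give a flavour of what queueing theory offers here, the sketch below uses the simplest textbook model, a single-server M/M/1 queue with Poisson arrivals and exponential service times. Whether that model fits the query-routing network in the question is an assumption; a network of nodes would need a richer model. The function name and the example rates are made up.

```python
def mm1_metrics(arrival_rate, service_rate):
    """Steady-state metrics of an M/M/1 queue (rates in queries per second)."""
    if not 0 < arrival_rate < service_rate:
        raise ValueError("a stable M/M/1 queue needs 0 < arrival_rate < service_rate")
    rho = arrival_rate / service_rate  # server utilization
    return {
        "utilization": rho,
        "mean_in_system": rho / (1 - rho),  # average number of queries present
        "mean_response_time": 1 / (service_rate - arrival_rate),  # waiting + service
    }

# Hypothetical node: 8 queries/s arrive, 10 queries/s can be answered.
print(mm1_metrics(8, 10))
```

The mean response time here is the average time a query spends waiting plus being answered, which is the quantity the question wants to model.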
  • asked a question related to Web Content Extraction
Question
3 answers
I need to extract specific data from related websites. For example, I need to extract data from a specific website providing positive feedback about a type of vehicle. Kindly suggest some good code or an algorithm for this.
Relevant answer
Answer
You need to create a service which fetches the information at regular intervals. You must know the document structure to get the specific field, table or value. If you have any problem with a specific tool, you are welcome to ask.
Regards
Muddsair
  • asked a question related to Web Content Extraction
Question
8 answers
How can data mining algorithms be implemented for web content mining?
Relevant answer
Answer
Web content extraction is a task applied to web pages, not to databases. You scrape unstructured data from the web, put it in structured storage (databases), and then apply data mining algorithms to it. That's the order.
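The order described above (extract, store, then analyse) can be sketched with nothing but the Python standard library. Everything specific here is an assumption for illustration: the pages are hard-coded strings rather than fetched from the web, only the page title is extracted, and the "mining" step is a simple count standing in for a real algorithm.

```python
import sqlite3
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text inside the <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def extract_title(html):
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()

def store_pages(conn, pages):
    """Step 2: put the extracted fields into structured storage."""
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)")
    rows = [(url, extract_title(html)) for url, html in pages.items()]
    conn.executemany("INSERT OR REPLACE INTO pages VALUES (?, ?)", rows)
    conn.commit()

# Step 1 stand-in: hypothetical pages that a crawler would have downloaded.
pages = {
    "http://example.com/a": "<html><head><title>Car review</title></head></html>",
    "http://example.com/b": "<html><head><title>Car review</title></head></html>",
    "http://example.com/c": "<html><head><title>Hotel review</title></head></html>",
}
conn = sqlite3.connect(":memory:")
store_pages(conn, pages)

# Step 3 stand-in: any analysis now runs against the structured table.
query = "SELECT title, COUNT(*) FROM pages GROUP BY title ORDER BY title"
print(conn.execute(query).fetchall())
```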
  • asked a question related to Web Content Extraction
Question
1 answer
Does anybody know of any usable text-mining software programs that do topic modeling and also cover Chinese as a language? This seems harder to find than I had thought. I found things like FudanNLP - (http://code.google.com/p/fudannlp/) and Ictclas (http://www.ictclas.org/ictclas_download.aspx), neither of which I have been able to make work so far. Pingar (http://apidemo.pingar.com/AnalyzeDocument.aspx) doesn't seem to have topic extraction. Mallet does seem to have a Chinese module and does have topic modeling, but I have yet to figure that one out too. Does anybody have any other suggestions?
Relevant answer
Answer
Have you considered commercial software vendors like Basis Tech?
  • asked a question related to Web Content Extraction
Question
4 answers
One of my principal research and development interests is Web Content Extraction. I founded a start-up in this field: www.altiliagroup.com. If anyone is interested in collaborating with us on this topic or in working as principal software architect for Altilia, please let me know.
Relevant answer
Answer
Hi all,
thanks for your interest in my post.
We are searching for companies interested in becoming resellers of our content extraction and management technologies, and for technical people with deep expertise in web content extraction technologies interested in working with us as software architects.