
Xiaoxin Yin
Google Inc. | Google
45 Publications
13,523 Reads
2,905 Citations
Publications (45)
Techniques and systems are disclosed providing improved ranking of results to an online search-based query. One or more user types are identified for a search-based query, and may correspond to a number of user relevant results, and which user results are selected. A user profile can be determined for the respective user types for the search-based...
Named entities are observed in a large portion of web search queries (named entity queries), where each entity can be associated with many different query terms that refer to various aspects of this entity. Organizing these query terms into topics helps understand major search intents about entities and the discovered topics are useful for applicat...
Domain-independent web information extraction can be addressed as a structured prediction problem where we learn a mapping function from an input web page to the structured and interdependent output variables, labeling each block on the page. In this paper, built upon an HTML parser of Internet Explorer that parses and renders a web page based on H...
Recently, answers for fact lookup queries have appeared on major search engines. For example, for the query "Barack Obama date of birth", Google directly shows "4 August 1961" above its regular results. In this paper, we describe FACTO, an end-to-end system for answering fact lookup queries for web search. FACTO extracts structured data from tables on...
Accessing online information from various data sources has become a necessary part of our everyday life. Unfortunately such information is not always trustworthy, as different sources are of very different qualities and often provide inaccurate and conflicting information. Existing approaches attack this problem using unsupervised learning methods,...
The World Wide Web has become the most important information source for most of us. Unfortunately, there is no guarantee for the correctness of information on the web, and different web sites often provide conflicting information on a subject. In this section we study two problems about correctness of information on the web. The first one is Veraci...
Data objects in a relational database are cross-linked with each other via multi-typed links. Links contain rich semantic information that may indicate important relationships among objects, such as the similarities between objects. In this chapter we explore linkage-based clustering, in which the similarity between two objects is measured based on...
A significant portion of web search queries are name entity queries. The major search engines have been exploring various ways to provide better user experiences for name entity queries, such as showing "search tasks" (Bing search) and showing direct answers (Yahoo!, Kosmix). In order to provide the search tasks or direct answers that can satisfy m...
Today the major web search engines answer queries by showing ten result snippets, which need to be inspected by users for identifying relevant results. In this paper we investigate how to extract structured information from the web, in order to directly answer queries by showing the contents being searched for. We treat users' search trails (i.e.,...
In recent years, there has been growing interest in multi-relational classification research and application, which addresses the difficulties in dealing with large relation search space, complex relationships between relations, and a daunting number of attributes involved. Bayesian Classifier is a simple but effective probabilistic classifier whic...
One of the most fundamental problems in web search is how to re-rank result web pages based on user logs. Most traditional models for re-ranking assume each query has a single intent. That is, they assume all users formulating the same query have similar preferences over the result web pages. It is clear that this is not true for a large portion...
Relational databases are the most popular repository for structured data, and are thus one of the richest sources of knowledge in the world. In a relational database, multiple relations are linked together via entity-relationship links. Unfortunately, most existing data mining approaches can only handle data stored in single tables, and cannot be a...
The World Wide Web has become the most important information source for most of us. Unfortunately, there is no guarantee for the correctness of information on the Web. Moreover, different websites often provide conflicting information on a subject, such as different specifications for the same product. In this paper, we propose a new problem, calle...
Relational databases are the most popular repository for structured data, and are thus one of the richest sources of knowledge in the world. Because of the complexity of relational data, it is a challenging task to design efficient and scalable data mining approaches in relational databases. In this paper we discuss two methodologies to address thi...
Online bibliographic databases, such as DBLP in computer science and PubMed in medical sciences, contain abundant information about research publications in different fields. Each such database forms a gigantic information network (hence called BibNet), connecting in complex ways research papers, authors, conferences/journals, and possibly citation i...
Most structured data in real-life applications are stored in relational databases containing multiple semantically linked relations. Unlike clustering in a single table, when clustering objects in relational databases there are usually a large number of features conveying very different semantic information, and using all features indiscriminate...
Different people or objects may share identical names in the real world, which causes confusion in many applications. It is a nontrivial task to distinguish those objects, especially when there is only very limited information associated with each of them. In this paper, we develop a general object distinction methodology called DISTINCT, which com...
The world-wide web has become the most important information source for most of us. Unfortunately, there is no guarantee for the correctness of information on the web. Moreover, different web sites often provide conflicting information on a subject, such as different specifications for the same product. In this paper we propose a new problem ca...
Different people or objects may share identical names in the real world, which causes confusion in many applications. It is a nontrivial task to distinguish those objects, especially when there is only very limited information associated with each of them. In this paper, we develop a general object distinction methodology called DISTINCT, which com...
Relational databases are the most popular repository for structured data, and are thus one of the richest sources of knowledge in the world. In a relational database, multiple relations are linked together via entity-relationship links. Multirelational classification is the procedure of building a classifier based on information stored in multiple r...
Most of today’s structured data is stored in relational databases. Such a database consists of multiple relations that are linked together conceptually via entity-relationship links in the design of relational database schemas. Multi-relational classification can be widely used in many disciplines including financial decision making and medical r...
Data objects in a relational database are cross-linked with each other via multi-typed links. Links contain rich semantic information that may indicate important relationships among objects. Most current clustering methods rely only on the properties that belong to the objects per se. However, the similarities between objects are often indicate...
Relational databases are the most popular repository for structured data, and are thus one of the richest sources of knowledge in the world. In a relational database, multiple relations are linked together via entity-relationship links. Multirelational classification is the procedure of building a classifier based on information stored in multiple r...
Firms hesitate to outsource their network security to outside security providers (called Managed Security Service Providers or MSSPs) because an MSSP may shirk secretly to increase profits. In economics this secret shirking behavior is commonly referred to as the Moral Hazard problem. There is a counter argument that this moral hazard problem is no...
With the fast expansion of computer networks, it is inevitable to study data mining on heterogeneous databases. In this paper we propose MDBM, an accurate and efficient approach for classification on multiple heterogeneous databases. We propose a regression-based method for predicting the usefulness of inter-database links that serve as bridges for...
Clustering is an essential data mining task with numerous applications. However, data in most real-life applications are high-dimensional in nature, and the related information often spreads across multiple relations. To ensure effective and efficient high-dimensional, cross-relational clustering, we propose a new approach, called CrossClus, which...
The effectiveness of automated system management is dependent on the domain-specific information that is encoded within the management framework. Existing approaches for defining the domain knowledge are categorized into white-box and black-box approaches, each of which has limitations. White-box approaches define detailed formulas for system behav...
Previous work on mining transactional databases has focused primarily on mining frequent itemsets, association rules, and sequential patterns. However, interesting relationships between customers and items, especially their evolution with time, have not been studied thoroughly. In this paper, we propose a Gaussian transformation-based regression mod...
To discover knowledge or retrieve information from a relational database, a user often needs to find objects related to certain source objects. There are two main challenges in building an effective object search system: the huge amount of objects in the database and the large number of different relationships between objects. In this paper we intr...
Visualization of IP-based traffic dynamics on networks is a challenging task due to large data volume and the complex, temporal relationships between hosts. We present the architecture of VisFlowConnect-IP, a powerful new tool to visualize IP network traffic flow dynamics for security situational awareness. VisFlowConnect-IP allows an operator to v...
Classification is one of the most popular data mining tasks with a wide range of applications, and lots of algorithms have been proposed to build accurate and scalable classifiers. Most of these algorithms only take a single table as input, whereas in the real world most data are stored in multiple tables and managed by relational database systems....
We present VisFlowConnect-IP, a network flow visualization tool that allows operators to detect and investigate anomalous internal and external network traffic. We model the network on a parallel axes graph with hosts as nodes and traffic flows as lines connecting these nodes. We present an overview of this tool's purpose, as well as a detailed de...
We present several ways to correlate security events from two applications that visualize the same underlying data with two distinct views: system and network. Correlation of security events provides security engineers a better understanding of what is happening for enhanced security situational awareness. Visualization leverages human cognitive abi...
We present the design and implementation of VisFlowConnect, a powerful new tool for visualizing network traffic flow dynamics for situational awareness. The visualization capability provided by VisFlowConnect allows an operator to assess the state of a large and complex network given an overall view of the entire network and filter/drill-down featu...
Most of today's structured data is stored in relational databases. Such a database consists of multiple relations which are linked together conceptually via entity-relationship links in the design of relational database schemas. Multirelational classification can be widely used in many disciplines, such as financial decision-making, medical researc...
We present a visualization design to enhance the ability of an administrator to detect and investigate anomalous traffic between a local network and external domains. Central to the design is a parallel axes view which displays NetFlow records as links between two machines or domains while employing a variety of visual cues to assist the user. We d...
Recent studies in data mining have proposed a new classification approach, called associative classification, which, according to several reports, such as [7, 6], achieves higher classification accuracy than traditional classification approaches such as C4.5. However, the approach also suffers from two major deficiencies: (1) it generates a very lar...
This paper describes an image clustering approach to grouping semantically similar images. In this approach, the similarity between images is estimated using users' relevance feedback information recorded in the user log of an image retrieval system. An algorithm similar to CAST (Cluster Affinity Search Technique) is used to identify clusters of se...
A robust approach is proposed for document skew detection. We use Fourier analysis and SVM to classify textual areas from non-textual areas of documents. We also propose a robust method to determine the skew angle from textual areas. Our approach achieves good performance on documents with large area of non-textual contents.
With the increasing number of hostile network attacks, anomaly detection for network security has become an urgent task. As there have not been highly effective solutions for automatic intrusion detection, especially for detecting newly emerging attacks, network traffic visualization has become a promising technique for assisting network administra...
Most of today's structured data is stored in relational databases. In contrast, most classification approaches only apply on single "flat" data relations. And it is usually difficult to convert multiple relations into a single flat relation without losing essential information. Inductive Logic Programming approaches have proven effective with hig...
We present several ways to correlate security events from two applications that visualize the same underlying data with two distinct views: system and network. Correlation of security events provides security engineers a better understanding of what is happening for enhanced security situational awareness. Visualization leverages human cognitive abi...
Printout. Thesis (M.S.)--University of Illinois at Urbana-Champaign, 2003. Vita. Includes bibliographical references (leaves 41-43).
Project (1)