-- Author: whfcarter
-- Posted: 1/13/2009 8:51:00 PM
-- Our vision of semantic Web search
We have proposed a layer cake for semantic Web search. It includes four layers, from bottom to top, as follows:

Knowledge Engineering Layer focuses on how to create semantic data. It includes knowledge annotation, knowledge extraction and knowledge fusion. In particular, we investigate collaborative annotation based on Wiki technologies. Moreover, we pay much attention to automatically extracting semantic data from Web 2.0 social corpora (e.g. Wikipedia, Del.icio.us).

Indexing and Search Layer focuses on semantic data management. It includes scalable triple store design for the data Web. It further considers building suitable indices on top of those triple stores for fast lookup and query processing. Additionally, it integrates database and information retrieval perspectives to build efficient and effective search engines.

Query Interface and User Interaction Layer focuses on usability issues of semantic search. It includes adapting different query interfaces (i.e. keyword interface, natural language interface) for semantic search. It aims at interpreting user queries as potential system queries with respect to the underlying semantic data. Furthermore, it involves faceted browsing to make it easier for end users to express complex information needs.

These basic infrastructures enable us to build more intelligent applications. For example, we can provide semantic services for Wikipedia. We can also exploit semantic technologies for e-tourism, semantic portals, life science and personal information management.

In the Knowledge Engineering Layer, we have published the following work (2007 - 2008):

Making More Wikipedians: Facilitating Semantics Reuse for Wikipedia Authoring
Published in the 6th International Semantic Web Conference (ISWC 2007)
Abstract
Wikipedia, a killer application in Web 2.0, has embraced the power of collaborative editing to harness collective intelligence.
It can also serve as an ideal Semantic Web data source due to its abundance, influence, high quality and good structure. However, the heavy burden of building up and maintaining such an enormous and ever-growing online encyclopedic knowledge base still rests on a very small group of people. Many casual users may still find it difficult to write high-quality Wikipedia articles. In this paper, we use RDF graphs to model the key elements in Wikipedia authoring, and propose an integrated solution to make Wikipedia authoring easier based on RDF graph matching, with the expectation of making more Wikipedians. Our solution facilitates semantics reuse and provides users with: 1) a link suggestion module that suggests internal links between Wikipedia articles for the user; 2) a category suggestion module that helps the user place her articles in correct categories. A prototype system is implemented and experimental results show significant improvements over existing solutions to link and category suggestion tasks. The proposed enhancements can be applied to attract more contributors and relieve the burden of professional editors, thus enhancing the current Wikipedia to make it an even better Semantic Web data source.

PORE: Positive-Only Relation Extraction from Wikipedia Text
Published in the 6th International Semantic Web Conference (ISWC 2007)
Abstract
Extracting semantic relations is of great importance for the creation of Semantic Web content. It is of great benefit to semi-automatically extract relations from the free text of Wikipedia using the structured content readily available in it. Pattern matching methods that employ information redundancy cannot work well since there is not much redundant information in Wikipedia, compared to the Web. Multi-class classification methods are not reasonable since no classification of relation types is available in Wikipedia.
In this paper, we propose PORE (Positive-Only Relation Extraction), for relation extraction from Wikipedia text. The core algorithm B-POL extends a state-of-the-art positive-only learning algorithm using bootstrapping, strong negative identification, and transductive inference to work with fewer positive training examples. We conducted experiments on several relations with different amounts of training data. The experimental results show that B-POL can work effectively given only a small number of positive training examples and that it significantly outperforms the original positive learning approaches and a multi-class SVM. Furthermore, although PORE is applied in the context of Wikipedia, the core algorithm B-POL is a general approach for Ontology Population and can be adapted to other domains.

An Unsupervised Model for Exploring Hierarchical Semantics from Social Annotations
Published in the 6th International Semantic Web Conference (ISWC 2007)
Abstract
This paper deals with the problem of exploring hierarchical semantics from social annotations. Recently, social annotation services have become more and more popular in the Semantic Web. They allow users to annotate web resources arbitrarily, thus largely lowering the barrier to cooperation. Furthermore, by providing abundant meta-data resources, social annotation might become a key to the development of the Semantic Web. On the other hand, social annotation has its own apparent limitations, for instance, 1) ambiguity and synonym phenomena and 2) lack of hierarchical information. In this paper, we propose an unsupervised model to automatically derive hierarchical semantics from social annotations. Using the social bookmark service Del.icio.us as an example, we demonstrate that the derived hierarchical semantics can compensate for those shortcomings. We further apply our model to another data set from Flickr to verify our model’s applicability in different environments.
The experimental results demonstrate our model’s efficiency.

Catriple: Extracting Triples from Wikipedia Categories
Published in the 3rd Asian Semantic Web Conference (ASWC 2008)
Abstract
As an important step towards bootstrapping the Semantic Web, many efforts have been made to extract triples from Wikipedia because of its wide coverage, good organization and rich knowledge. One important kind of triple is about Wikipedia articles and their non-isa properties, e.g. (Beijing, country, China). Previous work has tried to extract such triples from Wikipedia infoboxes, article text and categories. The infobox-based and text-based extraction methods depend on the infoboxes and suffer from low article coverage. In contrast, the category-based extraction methods exploit the widespread categories. However, they rely on predefined properties. This is too effort-consuming and explores only very limited knowledge in the categories. This paper automatically extracts properties and triples from the less explored Wikipedia categories so as to achieve wider article coverage with less manual effort. We manage to realize this goal by utilizing the syntax and semantics brought by super-sub category pairs in Wikipedia. Our prototype implementation outputs about 10M triples with a 12-level confidence ranging from 47.0% to 96.4%, which cover 78.2% of Wikipedia articles. Among them, 1.27M triples have a confidence of 96.4%. Applications can use the triples with suitable confidence on demand.

In the Indexing and Search Layer, we have published the following work (2007 - 2008):

SOR: a practical system for ontology storage, reasoning and search
Published in the 33rd International Conference on Very Large Data Bases (VLDB 2007)
Abstract
Ontology, an explicit specification of shared conceptualization, has been increasingly used to define formal data semantics and improve data reusability and interoperability in enterprise information systems.
In this paper, we present and demonstrate SOR (Scalable Ontology Repository), a practical system for ontology storage, reasoning, and search. SOR uses a relational DBMS to store ontologies, performs inference over them, and supports the SPARQL language for queries. Furthermore, a faceted search with relationship navigation is designed and implemented for ontology search. This demonstration shows how to efficiently solve three key problems in practical ontology management in an RDBMS, namely storage, reasoning, and search. Moreover, we show how the SOR system is used for semantic master data management.

Effective and Efficient Semantic Web Data Management on DB2
Published in the 27th International Conference on Management of Data (SIGMOD 2008)
Abstract
With the fast growth of the Semantic Web, more and more RDF data and ontologies are created and widely used in Web applications and enterprise information systems. It is reported that the W3C Linking Open Data community project consists of over two billion RDF triples, which are interlinked by about three million RDF links. Recently, efficient RDF data management on top of relational databases has gained particular attention from both the Semantic Web community and the database community. In this paper, we present effective and efficient Semantic Web data management over DB2, including efficient schema and index design for storage, practical ontology reasoning support, and an effective SPARQL-to-SQL translation method for RDF queries. Moreover, we show the performance and scalability of our system through an evaluation against well-known RDF stores and discuss future work.

CE2 – Towards a Large Scale Hybrid Search Engine with Integrated Ranking Support
Published in the 17th Conference on Information and Knowledge Management (CIKM 2008)
Abstract
The Web contains a large amount of documents and, increasingly, also semantic data in the form of RDF triples. Many of these triples are annotations that are associated with documents.
While structured queries are the principal means to retrieve semantic data, keyword queries are typically used for document retrieval. Clearly, a form of hybrid search that seamlessly integrates these formalisms to query both documents and semantic data can address more complex information needs. In this paper, we present CE2, an integrated solution that leverages mature database and information retrieval technologies to tackle challenges in hybrid search on a large scale. For scalable storage, CE2 integrates a database with inverted indices. Hybrid query processing is supported in CE2 through novel algorithms and data structures, which allow advanced ranking schemes to be integrated more tightly into the process. Experiments conducted on DBpedia and Wikipedia show that CE2 can provide good performance in terms of both effectiveness and efficiency.

Semplore: An IR Approach to Scalable Hybrid Query of Semantic Web Data
Published in the 6th International Semantic Web Conference (ISWC 2007)
Abstract
As an extension to the current Web, the Semantic Web will contain not only structured data with machine-understandable semantics but also textual information. While structured queries can be used to find information more precisely on the Semantic Web, keyword searches are still needed to help exploit textual information. It thus becomes very important that we can combine precise structured queries with imprecise keyword searches to have a hybrid query capability. In addition, due to the huge volume of information on the Semantic Web, the hybrid query must be processed in a very scalable way. In this paper, we define such a hybrid query capability that combines unary tree-shaped structured queries with keyword searches.
We show how existing information retrieval (IR) index structures and functions can be reused to index semantic web data and its textual information, and how the hybrid query is evaluated on the index structure using IR engines in an efficient and scalable manner. We implemented this IR approach in an engine called Semplore. Comprehensive experiments on its performance show that it is a promising approach. It leads us to believe that it may be possible to evolve current web search engines to query and search the Semantic Web. Finally, we briefly describe how Semplore is used for searching Wikipedia and an IBM customer’s product information.

Efficient Index Maintenance for Frequently Updated Semantic Data
Published in the 3rd Asian Semantic Web Conference (ASWC 2008)
Abstract
Nowadays, the demand for querying and searching the Semantic Web is increasing. Some systems have adopted IR (Information Retrieval) approaches to index and search Semantic Web data due to their capability to handle Web-scale data and their efficiency in query answering. Additionally, the huge volumes of data on the Semantic Web are frequently updated. Thus, these systems further require effective update mechanisms to handle the data changes. However, the existing update approaches focus only on documents. It still remains a big challenge to update an IR index specially designed for semantic data in the form of finer-grained structured objects rather than unstructured documents. In this paper, we present a well-designed update mechanism on the IR index for triples. Our approach provides a flexible and effective update mechanism by dividing the index into blocks. It reduces the number of update operations during the insertion of triples. At the same time, it preserves the efficiency of query processing and the capability to handle large-scale semantic data. Experimental results show that the index update time is a fraction of that by complete reconstruction w.r.t.
the portion of the inserted triples. Moreover, the query response time is not notably affected. Thus, it is capable of making newly arrived semantic data immediately searchable for users.

In the Query Interface and User Interaction Layer, we have published the following work (2007 - 2008):

PANTO: A Portable Natural Language Interface to Ontologies
Published in the 4th European Semantic Web Conference (ESWC 2007)
Abstract
Providing a natural language interface to ontologies will not only offer ordinary users the convenience of acquiring needed information from ontologies, but also consequently expand the influence of ontologies and the semantic web. This paper presents PANTO, a Portable nAtural laNguage inTerface to Ontologies, which accepts generic natural language queries and outputs SPARQL queries. Based on a special consideration of nominal phrases, it adopts a triple-based data model to interpret the parse trees output by an off-the-shelf parser. Complex modifications in natural language queries such as negations, superlatives and comparatives are investigated. The experiments have shown that PANTO provides state-of-the-art results.

SPARK: Adapting Keyword Query to Semantic Search
Published in the 6th International Semantic Web Conference (ISWC 2007)
Abstract
Semantic search promises to provide more accurate results than present-day keyword search. However, progress with semantic search has been delayed due to the complexity of its query languages. In this paper, we explore a novel approach of adapting keywords to querying the semantic web: the approach automatically translates keyword queries into formal logic queries so that end users can use familiar keywords to perform semantic search. A prototype system named ‘SPARK’ has been implemented in light of this approach. Given a keyword query, SPARK outputs a ranked list of SPARQL queries as the translation result. The translation in SPARK consists of three major steps: term mapping, query graph construction and query ranking.
Specifically, a probabilistic query ranking model is proposed to select the most likely SPARQL query. In the experiment, SPARK achieved an encouraging translation result.

Q2Semantic: A Lightweight Keyword Interface to Semantic Search
Published in the 5th European Semantic Web Conference (ESWC 2008)
Abstract
The increasing amount of data on the Semantic Web offers opportunities for semantic search. However, formal queries hinder casual users in expressing their information needs, as they might not be familiar with the query syntax or the underlying ontology. Because keyword interfaces are easier for casual users to handle, many approaches aim to translate keywords into formal queries. However, these approaches so far feature only very basic query ranking and do not scale to large repositories. We tackle the scalability problem by proposing a novel clustered-graph structure that corresponds to only a summary of the original ontology. The data space reduced in this way is then used in the exploration for the computation of top-k queries. Additionally, we adopt several mechanisms for query ranking, which can consider many factors such as the query length, the relevance of ontology elements w.r.t. the query and the importance of ontology elements. The experimental results obtained with our implemented system Q2Semantic show that we achieve good performance on many datasets of different sizes.

Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-Shaped (RDF) Data
Published in the 25th International Conference on Data Engineering (ICDE 2009)
Abstract
Keyword queries enjoy widespread usage as they represent an intuitive way of specifying information needs. Recently, answering keyword queries on graph-structured data has emerged as an important research topic. The prevalent approaches build on dedicated indexing techniques as well as search algorithms aiming at finding substructures that connect the data elements matching the keywords.
In this paper, we introduce a novel keyword search paradigm for graph-structured data, focusing in particular on the RDF data model. Instead of computing answers directly as in previous approaches, we first compute queries from the keywords, allowing the user to choose the appropriate query, and finally, process the query using the underlying database engine. Thereby, the full range of database optimization techniques can be leveraged for query processing. For the computation of queries, we propose a novel algorithm for the exploration of top-k matching subgraphs. While related techniques search for the best answer trees, our algorithm is guaranteed to compute all k subgraphs with lowest costs, including cyclic graphs. By performing exploration only on a summary data structure derived from the data graph, we achieve promising performance improvements compared to other approaches.

Snippet Generation for Semantic Web Search Engines
Published in the 3rd Asian Semantic Web Conference (ASWC 2008)
Abstract
With the development of the Semantic Web, more and more ontologies are available for exploitation by semantic search engines. However, while semantic search engines support the retrieval of candidate ontologies, the final selection of the most appropriate ontology is still difficult for end users. In this paper, we extend existing work on ontology summarization to support the presentation of ontology snippets. The proposed solution leverages a new semantic similarity measure to generate snippets that are based on the given query. Experimental results have shown the potential of our solution in this problem domain, which is largely unexplored so far.

Bringing them together as a whole:

SearchWebDB: Searching the Billion Triples!
Published in the 7th International Semantic Web Conference (ISWC 2008)
Abstract
In recent years, the amount of structured data in the form of triples available on the Web has been increasing rapidly and has reached more than one billion.
In this paper, we propose an infrastructure for searching the billion triples, called SearchWebDB, that integrates data sources publicly available on the web in such a way that users can ask queries against the billion triples through a single interface. Approximate mappings between schemata as well as data elements are computed and stored in several indices. These indices are exploited by a query engine to perform query routing and result combination in an efficient way. As opposed to a standard distributed query engine requiring the use of formal languages, users can ask queries in terms of keywords through SearchWebDB. These keywords are translated to possible interpretations presented as structured queries. Thus, complex information needs can be addressed without imposing too much of a burden on casual users.

Please find attached the document with a poster for each work. Hope you enjoy it.
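
P.S. For readers new to the keyword-to-structured-query idea that runs through several of the abstracts above (SPARK, Q2Semantic, SearchWebDB), here is a toy sketch in Python. It is not taken from any of the papers. The tiny vocabulary `TERM_INDEX`, the `ex:` names, the functions `build_query` and `translate`, and the scoring rule (a plain product of match scores) are all assumptions made up for illustration; the real systems use much richer mapping, graph exploration and ranking models.

```python
from itertools import product

# Hypothetical mini-vocabulary: keyword -> candidate ontology terms,
# each given as (kind, term, match score). Invented for this example.
TERM_INDEX = {
    "capital": [("property", "ex:capital", 0.9), ("class", "ex:CapitalCity", 0.6)],
    "china": [("entity", "ex:China", 1.0)],
}


def build_query(combo):
    """Turn one combination of mapped terms into a simple SPARQL string."""
    classes = [term for kind, term, _ in combo if kind == "class"]
    props = [term for kind, term, _ in combo if kind == "property"]
    ents = [term for kind, term, _ in combo if kind == "entity"]
    patterns = [f"?x a {c} ." for c in classes]
    if props:
        # Use the first entity as subject if there is one, else a variable.
        subject = ents[0] if ents else "?s"
        patterns += [f"{subject} {p} ?x ." for p in props]
    else:
        # No property keyword: connect ?x to each entity by an unknown property.
        patterns += [f"?x ?p {e} ." for e in ents]
    return "SELECT ?x WHERE { " + " ".join(patterns) + " }"


def translate(keywords):
    """Return (score, query) pairs, best first; empty if a keyword is unknown."""
    if not keywords:
        return []
    # Step 1: term mapping (keyword -> candidate ontology terms).
    options = [TERM_INDEX.get(k.lower(), []) for k in keywords]
    ranked = []
    # Step 2: one candidate query per combination of mapped terms.
    for combo in product(*options):
        score = 1.0
        for _, _, s in combo:
            score *= s
        ranked.append((score, build_query(combo)))
    # Step 3: rank candidates by score, highest first.
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked


if __name__ == "__main__":
    for score, query in translate(["capital", "China"]):
        print(f"{score:.2f}  {query}")
```

The user would then pick one of the ranked candidate queries, and the chosen query would be handed to whatever SPARQL engine is available.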