中文XML论坛 - 专业的XML技术讨论区--显示贴子

以文本方式查看主题

-  中文XML论坛 - 专业的XML技术讨论区  (http://bbs.xml.org.cn/index.asp)
--  『 Semantic Web(语义Web)/描述逻辑/本体』  (http://bbs.xml.org.cn/list.asp?boardid=2)
----  Our vision of semantic Web search  (http://bbs.xml.org.cn/dispbbs.asp?boardid=2&rootid=&id=71338)

--  作者：whfcarter
--  发布时间：1/13/2009 8:51:00 PM

--  Our vision of semantic Web search
We have proposed a layer cake for semantic web search. It includes four layers from bottom to up as follows:

Knowledge Engineering Layer focuses on how to create semantic data. It includes knowledge annotation, knowledge extraction and knowledge fusion. In particular, we investigate collaborative annotation based on Wiki-technologies. Moreover, we pay much attention to automatically extract semantic data from Web 2.0 social corpus (e.g. Wikipedia, Del.icio.us).

Indexing and Search Layer focuses on semantic data management. It includes scalable triple store design for the data Web. It further considers building suitable indices on top of those triple stores for fast lookup or query processing. Additionally, it integrates database and information retrieval perspective for efficient and effective search engines.

Query Interface and User Interaction Layer focuses on usability issues of semantic search. It includes adapting different query interfaces (i.e. keyword interface, natural language interface) for semantic search. It aims at interpreting user queries into potential system queries with respect to the underlying semantic data. Furthermore, it involves faceted browsing to ease the process of expressing complex information needs from end users.

These basic infrastructures enable us to build more intelligent applications. For example, we can provide semantic services for Wikipedia. We can exploit semantic technologies for e-tourism, semantic portal, life science and personal information management as well.

In the Knowledge Engineering Layer, we have published the following work (2007 - 2008)

Making More Wikipedians: Facilitating Semantics Reuse for Wikipedia Authoring
Published in the 6th International Semantic Web Conference (ISWC 2007)

Abstract
Wikipedia, a killer application in Web 2.0, has embraced the power of collaborative editing to harness collective intelligence. It can also serve as an ideal Semantic Web data source due to its abundance, influence, high quality and well structuring. However, the heavy burden of up-building and maintaining such an enormous and ever-growing online encyclopedic knowledge base still rests on a very small group of people. Many casual users may still feel difficulties in writing high quality Wikipedia articles. In this paper, we use RDF graphs to model the key elements in Wikipedia authoring, and propose an integrated solution to make Wikipedia authoring easier based on RDF graph matching, expecting making more Wikipedians. Our solution facilitates semantics reuse and provides users with: 1) a link suggestion module that suggests internal links between Wikipedia articles for the user; 2) a category suggestion module that helps the user place her articles in correct categories. A prototype system is implemented and experimental results show significant improvements over existing solutions to link and category suggestion tasks. The proposed enhancements can be applied to attract more contributors and relieve the burden of professional editors, thus enhancing the current Wikipedia to make it an even better Semantic Web data source.

PORE: Positive-Only Relation Extraction from Wikipedia Text
Published in the 6th International Semantic Web Conference (ISWC 2007)

Abstract
Extracting semantic relations is of great importance for the creation of the Semantic Web content. It is of great benefit to semi-automatically extract relations from the free text of Wikipedia using the structured content readily available in it. Pattern matching methods that employ information redundancy cannot work well since there is not much redundancy information in Wikipedia, compared to the Web. Multi-class classification methods are not reasonable since no classification of relation types is available in Wikipedia. In this paper, we propose PORE (Positive-Only Relation Extraction), for relation extraction from Wikipedia text. The core algorithm B-POL extends a state-of-the-art positive-only learning algorithm using bootstrapping, strong negative identification, and transductive inference to work with fewer positive training examples. We conducted experiments on several relations with different amount of training data. The experimental results show that B-POL can work effectively given only a small amount of positive training examples and it significantly out per forms the original positive learning approaches and a multi-class SVM. Furthermore, although PORE is applied in the context of Wikipedia, the core algorithm B-POL is a general approach for Ontology Population and can be adapted to other domains.

An Unsupervised Model for Exploring Hierarchical Semantics from Social Annotations
Published in the 6th International Semantic Web Conference (ISWC 2007)

Abstract
This paper deals with the problem of exploring hierarchical semantics from social annotations. Recently, social annotation services have become more and more popular in Semantic Web. It allows users to arbitrarily annotate web resources, thus, largely lowers the barrier to cooperation. Furthermore, through providing abundant meta-data resources, social annotation might become a key to the development of Semantic Web. However, on the other hand, social annotation has its own apparent limitations, for instance, 1) ambiguity and synonym phenomena and 2) lack of hierarchical information. In this paper, we propose an unsupervised model to automatically derive hierarchical semantics from social annotations. Using a social bookmark service Del.icio.us as example, we demonstrate that the derived hierarchical semantics has the ability to compensate those shortcomings. We further apply our model on another data set from Flickr to testify our model’s applicability on different environments. The experimental results demonstrate our model’s efficiency.

Catriple: Extracting Triples from Wikipedia Categories
Published in the 3rd Asian Semantic Web Conference (ASWC 2008)

Abstract
As an important step towards bootstrapping the Semantic Web, many efforts have been made to extract triples from Wikipedia because of its wide coverage, good organization and rich knowledge. One kind of important triples is about Wikipedia articles and their non-isa properties, e.g. (Beijing, country, China). Previous work has tried to extract such triples from Wikipedia infoboxes, article text and categories. The infobox-based and text-based extraction methods depend on the infoboxes and suffer from low article coverage. In contrast, the category-based extraction methods exploit the widespread categories. However, they rely on predefined properties. It is too effort-consuming and explores only very limited knowledge in the categories. This paper automatically extracts properties and triples from the less explored Wikipedia categories so as to achieve wider article coverage with less manual effort. We manage to realize this goal by utilizing the syntax and semantics brought by super-sub category pairs in Wikipedia. Our prototype implementation outputs about 10M triples with a 12-level confidence ranging from 47.0% to 96.4%, which cover 78.2% of Wikipedia articles. Among them, 1.27M triples have confidence of 96.4%. Applications can on demand use the triples with suitable confidence.

In the indexing and Search Layer, we have published the following work (2007 - 2008)

SOR: a practical system for ontology storage, reasoning and search
Published in the 33rd International Conference on Very Large Data Bases (VLDB 2007)
Abstract
Ontology, an explicit specification of shared conceptualization, has been increasingly used to define formal data semantics and improve data reusability and interoperability in enterprise information systems. In this paper, we present and demonstrate SOR (Scalable Ontology Repository), a practical system for ontology storage, reasoning, and search. SOR uses Relational DBMS to store ontologies, performs inference over them, and supports SPARQL language for query. Furthermore, a faceted search with relationship navigation is designed and implemented for ontology search. This demonstration shows how to efficiently solve three key problems in practical ontology management in RDBMS, namely storage, reasoning, and search. Moreover, we show how the SOR system is used for semantic master data management.

Effective and Efficient Semantic Web Data Management on DB2
Published in the 27th International Conference on Management of Data (SIGMOD 2008)
Abstract
With the fast growth of Semantic Web, more and more RDF data and ontologies are created and widely used in Web applications and enterprise information systems. It is reported that the W3C Linking Open Data community project consists of over two billion RDF triples, which are interlinked by about three million RDF links. Recently, efficient RDF data management on top of relational databases gains particular attentions from both Semantic Web community and database community. In this paper, we present effective and efficient Semantic Web data management over DB2, including efficient schema and indexes design for storage, practical ontology reasoning support, and an effective SPARQL-to-SQL translation method for RDF query. Moreover, we show the performance and scalability of our system by an evaluation among well-known RDF stores and discuss future work.

CE2 – Towards a Large Scale Hybrid Search Engine with Integrated Ranking Support
Published in the 17th Conference on Information and Knowledge Management (CIKM 2008)

Abstract
The Web contains a large amount of documents and increasingly, also semantic data in the form of RDF triples. Many of these triples are annotations that are associated with documents. While structured query is the principal mean to retrieve semantic data, keyword queries are typically used for document retrieval. Clearly, a form of hybrid search that seamlessly integrates these formalisms to query both documents and semantic data can address more complex information needs. In this paper, we present CE2, an integrated solution that leverages mature database and information retrieval technologies to tackle challenges in hybrid search on the large scale. For scalable storage, CE2 integrates database with inverted indices. Hybrid query processing is supported in CE2 through novel algorithms and data structures, which allow for advanced ranking schemes to be integrated more tightly into the process. Experiments conducted on Dbpedia and Wikipedia show that CE2 can provide good performance in terms of both effectiveness and efficiency.

Semplore: An IR Approach to Scalable Hybrid Query of Semantic Web Data
Published in the 6th International Semantic Web Conference (ISWC 2007)
Abstract
As an extension to the current Web, Semantic Web will not only contain structured data with machine understandable semantics but also textual information. While structured queries can be used to find information more precisely on the Semantic Web, keyword searches are still needed to help exploit textual information. It thus becomes very important that we can combine precise structured queries with imprecise keyword searches to have a hybrid query capability. In addition, due to the huge volume of information on the Semantic Web, the hybrid query must be processed in a very scalable way. In this paper, we define such a hybrid query capability that combines unary tree-shaped structured queries with keyword searches. We show how existing information retrieval (IR) index structures and functions can be reused to index semantic web data and its textual information, and how the hybrid query is evaluated on the index structure using IR engines in an efficient and scalable manner. We implemented this IR approach in an engine called Semplore. Comprehensive experiments on its performance show that it is a promising approach. It leads us to believe that it may be possible to evolve current web search engines to query and search the Semantic Web. Finally, we briefly describe how Semplore is used for searching Wikipedia and an IBM customer’s product information.

Efficient Index Maintenance for Frequently Updated Semantic Data
Published in the 3rd Asian Semantic Web Conference (ASWC 2008)
Abstract
Nowadays, the demand on querying and searching the Semantic Web is increasing. Some systems have adopted IR (Information Retrieval) approaches to index and search the Semantic Web data due to its capability to handle the Web-scale data and efficiency on query answering. Additionally, the huge volumes of data on the Semantic Web are frequently updated. Thus, it further requires effective update mechanisms for these systems to handle the data change. However, the existing update approaches only focus on document. It still remains a big challenge to update IR index specially designed for semantic data in the form of finer grained structured objects rather than unstructured documents. In this paper, we present a well-designed update mechanism on the IR index for triples. Our approach provides a flexible and effective update mechanism by dividing the index into blocks. It reduces the number of update operations during the insertion of triples. At the same time, it preserves the efficiency on query processing and the capability to handle large scale semantic data. Experimental results show that the index update time is a fraction of that by complete reconstruction w.r.t. the portion of the inserted triples. Moreover, the query response time is not notably affected. Thus, it is capable to make newly arrived semantic data immediately searchable for users.

In the Query Interface and User Interaction Layer, we have the following work (2007 - 2008)

PANTO: A Portable Natural Language Interface to Ontologies
Published in the 4th European Semantic Web Conference (ESWC 2007)

Abstract
Providing a natural language interface to ontologies will not only offer ordinary users the convenience of acquiring needed information from ontologies, but also expand the influence of ontologies and the semantic web consequently. This paper presents PANTO, a Portable nAtural laNguage inTerface to Ontologies, which accepts generic natural language queries and outputs SPARQL queries. Based on a special consideration on nominal phrases, it adopts a triple-based data model to interpret the parse trees output by an off-the-shelf parser. Complex modifications in natural language queries such as negations, superlative and comparative are investigated. The experiments have shown that PANTO provides state-of-the-art results.

SPARK: Adapting Keyword Query to Semantic Search
Published in the 6th International Semantic Web Conference (ISWC 2007)

Abstract
Semantic search promises to provide more accurate result than present-day keyword search. However, progress with semantic search has been delayed due to the complexity of its query languages. In this paper, we explore a novel approach of adapting keywords to querying the semantic web: the approach automatically translates keyword queries into formal logic queries so that end users can use familiar keywords to perform semantic search. A prototype system named ‘SPARK’ has been implemented in light of this approach. Given a keyword query, SPARK outputs a ranked list of SPARQL queries as the translation result. The translation in SPARK consists of three major steps: term mapping, query graph construction and query ranking. Specifically, a probabilistic query ranking model is proposed to select the most likely SPARQL query. In the experiment, SPARK achieved an encouraging translation result.

Q2Semantic: A Lightweight Keyword Interface to Semantic Search
Published in the 5th European Semantic Web Conference (ESWC 2008)

Abstract
The increasing amount of data on the Semantic Web offers opportunities for semantic search. However, formal query hinders the casual users in expressing their information need as they might be not familiar with the query’s syntax or the underlying ontology. Because keyword interfaces are easier to handle for casual users, many approaches aim to translate keywords to formal queries. However, these approaches yet feature only very basic query ranking and do not scale to large repositories. We tackle the scalability problem by proposing a novel clustered-graph structure that corresponds to only a summary of the original ontology. The so reduced data space is then used in the exploration for the computation of top-k queries. Additionally, we adopt several mechanisms for query ranking, which can consider many factors such as the query length, the relevance of ontology elements w.r.t. the query and the importance of ontology elements. The experimental results performed against our implemented system Q2Semantic show that we achieve good performance on many datasets of different sizes.

Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-Shaped (RDF) Data
Published in the 25th International Conference on Data Engineering (ICDE 2009)

Abstract
Keyword queries enjoy widespread usage as they represent an intuitive way of specifying information needs. Recently, answering keyword queries on graph-structured data has emerged as an important research topic. The prevalent approaches build on dedicated indexing techniques as well as search algorithms aiming at finding substructures that connect the data elements matching the keywords. In this paper, we introduce a novel keyword search paradigm for graph-structured data, focusing in particular on the RDF data model. Instead of computing answers directly as in previous approaches, we first compute queries from the keywords, allowing the user to choose the appropriate query, and finally, process the query using the underlying database engine. Thereby, the full range of database optimization techniques can be leveraged for query processing. For the computation of queries, we propose a novel algorithm for the exploration of top-k matching sub graphs. While related techniques search the best answer trees, our algorithm is guaranteed to compute all k sub graphs with lowest costs, including cyclic graphs. By performing exploration only on a summary data structure derived from the data graph, we achieve promising performance improvements compared to other approaches.

Snippet Generation for Semantic Web Search Engines
Published in the 3rd Asian Semantic Web Conference (ASWC 2008)

Abstract
With the development of the Semantic Web, more and more ontologies are available for exploitation by semantic search engines. However, while semantic search engines support the retrieval of candidate ontologies, the final selection of the most appropriate ontology is still difficult for the end users. In this paper, we extend existing work on ontology summarization to support the presentation of ontology snippets. The proposed solution leverages a new semantic similarity measure to generate snippets that are based on the given query. Experimental results have shown the potential of our solution in this problem domain that is largely unexplored so far.

Making them as a whole

SearchWebDB: Searching the Billion Triples!
Published in the 7th International Semantic Web Conference (ISWC 2008)

Abstract
In recent years, the amount of structured data in form of triples available on the Web is increasing rapidly and has reached more than one billion. In this paper, we propose an infrastructure for searching the billion triples -- called SearchWebDB -- that integrates data sources publicly available on the web in a way such that users can ask queries against the billion triples through a single interface. Approximate mappings between schemata as well as data elements are computed and stored in several indices. These indices are exploited by a query engine to perform query routing and result combination in an efficient way. As opposed to a standard distributed query engine requiring the use of formal languages, users can ask queries in terms of keywords through SearchWebDB. These keywords are translated to possible interpretations presented as structured queries. Thus, complex information need can be addressed without imposing too much of a burden to the casual users.

Attached please find the document with a poster for each work. Wish you enjoy it.

--  作者：whfcarter
--  发布时间：1/14/2009 9:10:00 AM

--
这篇文章过长，当时也比较偷懒，写的全部是英语，可能对于某些初学者不是很容易阅读，我会将其分不同帖子分别讨论，希望对大家有所帮助。这是对2007年到2008年之间我们组的工作总结，对于我们提出的语义搜索框架作了初步的实现，虽然已经做了很多，但是大家可以发现还有很多没有涉及，同时大家可以从CSWS2008主页上下载俞勇老师的slides，其中对于语义搜索框架的补充以及第二轮实现的初步计划有更多的了解。

--  作者：Humphrey
--  发布时间：1/14/2009 10:06:00 AM

--
一共有15篇之多？而且是两年之内？或许是接触时间较短的原因，我还看不出作为一个系列研究，这些文章之间有什么关系。不过能够感受到这个团队在语义搜索领域的分量。希望能看到您的进一步解说，也希望有机会能多和您交流。谢谢！

--  作者：whfcarter
--  发布时间：1/14/2009 9:45:00 PM

--
前面和iamwym聊起，这里对我们的semantic search做进一步的说明。从狭义的semantic search来说，它就是研究如何对现有的search进行semantic化。而我们这里考虑search的整个生命周期，从crawling->pre-processing->indexing->search。那么在semantic search中，我们不仅有crawling还有extracting，因为相比document来说, semantic数据存在明显的不足，因此为了普及semantic search (无论是semantic data search还是semantic-based document search，甚至hybrid search)都需要考虑这个bootstrap的问题。而对于pre-processing来说原先需要清理html代码，解析HTML code获得dom tree, 抽取hyperlink，以及各种feature (e.g. title, headings ...)。现在我们需要清理semantic data，解析RDF甚至OWL, 可以利用offline reasoning来获得更多facts, 还需要进行data integration (schema level 和data level的)。对于indexing原来需要考虑index terms和documents, 以及根据document生成相关的snippet进行store；现在需要考虑store schema, store data, build indices来支持各种graph pattern的访问(最基本的是triple)。当然search就不用多说了，从document search到object或data search。最后的query interface and user interaction应该相对比较清晰，即支持非sparql之外的如natural language和keyword search，对于user interaction支持faceted browsing，同时考虑结果的展现如snippet的生成等问题。

--  作者：Humphrey
--  发布时间：1/15/2009 8:46:00 AM

--
从全局考虑的语义搜索确实如此，不过我还是有些疑惑：如您所说，语义搜索又有广义狭义之分，但是搜索的语义化和从整个语义搜索流程（生命周期？我可以把它理解为流程吧？）考虑似乎并不存在很大的差异。只不过“搜索语义化”可以采用统计学方法来近似的表现出计算机对语义的“理解”；而您所说的搜索周期所体现的就是对文档数据的解析和查询的推理吧？

--  作者：whfcarter
--  发布时间：1/15/2009 10:16:00 AM

--
"而您所说的搜索周期所体现的就是对文档数据的解析和查询的推理吧？"这只是一部分，我在上面的描述中放在预处理部分。而对于你提到的"只不过“搜索语义化”可以采用统计学方法来近似的表现出计算机对语义的“理解”"，我觉得有些狭隘。semantifying的方法很多，既可以用semantic data增加很多metadata甚至推理，从数据的角度出发，利用smart data，而不改变现有的search algorithm，最典型的例子searchmonkey就是这个做法。另外一个是利用无论统计或者NLP等技术来增加对文档本身的理解和对查询本身的理解，那么这里的典型应用就是Hakia。他属于smart application的范畴。当然我们也可以把两者进行结合，典型的例子是Powerset, 利用NLP+freebase的数据。我前面贴子的意思是说我们关心的是从搜索架构出发需要涉及的各个过程（包括offline和online的），而并非只是search这样一个online的过程而已。

--  作者：Humphrey
--  发布时间：1/15/2009 11:46:00 AM

--
首先感谢您的答复。您的意思是自然语言处理（NLP）还不是纯统计的方法？而且NLP也不能算作推理类的语义实现方法？
“用语义数据增加元数据”我头一次听说，RDF/RDF Schema本身不就是表示语义的数据吗？还要把它做成元数据？那么一条元数据得有多大啊！我确实理解不上去了。如果您有时间可以再做些解释吗？谢谢！

--  作者：whfcarter
--  发布时间：1/15/2009 1:29:00 PM

--
NLP当然不是纯统计的方法，这里有很多基于rule和基于知识表示和推理的。最典型的应用是sowa原来提出的conceptural graph (CG)。当然，近年来，统计的方法很流行，不过这也只是NLP的一个分支，称为计算语言学(computational linguistic)。其次，说一下metadata，故名思意就是对于data的描述的数据。semantic search的一个直观解释是对现有Web search的语义化或performance提高，而现在web search的主要任务是document search，那么我说的利用semantic data作为metadata就是将semantic以annotation的形式对document进行额外的描述，这些semantic data就是document的metadata。这也是yahoo的microsearch和searchmonkey的做法，很自然的处理。当然metadata不一定是semantic data，metadata代表的是对数据的描述，是一种用于建模的知识表示。所以metadata和semantic data是两个层面或者在不同context下说的词，我觉得没有必要去陷入其中问个所以然。

--  作者：Humphrey
--  发布时间：1/15/2009 1:48:00 PM

--
就是说自然语言处理实际上包含两个分支：一个是通过统计的方法，计算词频和归纳相似的词以期获得尽量全面的搜索/检索结果；另一个是使用推理机和知识库对查询提问或文档中的关键词进行逻辑推理。而且在语义网方面的主要语义实现技术就是NLP，是这样吧？
用语义数据做元数据看起来很像我们常用的“中国知网”里对论文的著录，每一篇都有关键词、摘要、作者和相关文献、引用文献及其链接。事实上这也就是一个元数据吧，不过用“语义数据”做元数据，给我的感觉就像是把上述数据用专门的RDF/XML文档格式描述了。其实我不是有意要在这个问题上较真，只是觉得似乎把元数据做得过于庞大和复杂（语义元数据要表示语义嘛，想必得遵循相应的描述格式，也要有比较高的精细程度）可能还是会拖累检索效率。真是难以两全啊！

[此贴子已经被作者于2009-1-15 15:10:52编辑过]

--  作者：whfcarter
--  发布时间：1/15/2009 3:24:00 PM

--
如果希望对NLP有更加深入的了解，可以关注一下NLP的顶级会议ACL中的一些session，以及看一些NLP的介绍或许对你有帮助。你描述的关于论文中的作者等就是最典型的metadata。不过，我不是很理解，你为什么认为semantic data作为metadata就会把它做的很复杂很大。首先semantic data使用RDF/OWL等来进行表示本来就考虑web的特性，你可以在任何时间任何地点描述任何事物，那么你用一定的semantic data来标注一个document里面的attributes等是很自然的事情，如果说一些标准的话，你可以去看一下microformats, eRDF或者RDFa等，它们是一种dialet，基本模型还是RDF或者部分支持它。还有就是从你的描述中，感觉你认为semantic data是很复杂的东西，其实不然，他可以是包含很多axioms等支持复杂推理表达能力很强的数据，如life science等，也可以仅仅使用RDF data graph的结构来表达如FOAF等social network很简单的东西。semantic可以是复杂的也可以是轻量级的，关键是满足当前的应用就是最好的。

--  作者：viaphone
--  发布时间：1/15/2009 10:43:00 PM

--
我有一点点愚见：我认为可以在不同阶段做不同方面semantic化.
1）在Query interface 和user interface 阶段，NLP是必不可少的，不管是统计学方法也好还是其它方法也好，其实本身或多或少的已经就有语义化的味道了。

2) 经过NLP处理后的用户输入到本体知识库的映射，这个过程可以有很多技术可以使用。其中有一项我认为在语义化的过程中比较重要，就是本体语言DL规则可以帮助实现语义扩展。

3)映射过程不能仅仅是词组到本体概念、个体、属性的，关系上的映射更为重要。把用户输入从知识关系上补缺补漏是很关键的一步。

4)还有就是领域规则推理（这一步现在用起来会使我的系统跑得太慢，不敢碰）

5)把知识检索的结果与用这些知识标注过的文档一起作为检索结果。（当然先得有文档标注过程）

=============
以上是一点点个人心德，愚人之见是它们都是搜索语义化的过程的其中一步。
我同意whf的观点，把语义化理解成NLP或搜索过程中的任何一步，都比较狭隘。语义化应该是体现在整个搜索生命周期中的。否则有点像盲人摸象，摸到哪就会认为象长得像哪。
语义化语义化，随语潜入，化义无声。一步一步化吧。见笑见笑

--  作者：whfcarter
--  发布时间：1/16/2009 12:03:00 AM

--
谢谢viaphone的补充，这里列举的各项工作正是从不能的层面和侧面来帮助search的语义化的。其中还有很多工作没有涉及或者做深，有待大家一起努力完成。

--  作者：Humphrey
--  发布时间：1/16/2009 8:26:00 AM

--
拜读了viaphone同志的回复之后，我觉得您的意思我能理解，但是对一些概念更模糊了。由此引发了我提出下列问题，向各位前辈讨教：
其一、自然语言处理似乎不能在语义搜索全局使用，只能用于界面部分。很多文档都是通篇自然语言型（不含计算机程序设计语言代码），那么对这些文档的索引可以应用NLP技术吗？
其二、既然自然语言处理既包括统计方法，又包括推理方法。在语义搜索中间环节却需要“本体语言DL规则”来实现，感觉只是逆向证明自然语言处理不包含推理规则（至少是DL推理规则）。如果是这样那么NLP又只剩下统计方法了，因此我想请示一下NLP的概念和包含的内容。
其三、用户界面（user interface）应该包括查询界面（query interface）和结果反馈界面（个人理解，不清楚是否有此说法，因此不能列举英文原文）吧。现在对语义查询系统界面的分类是什么样的呢？
感谢诸位对我的提问的关注和解答，又要麻烦各位，小可在此先谢过大家了。

--  作者：viaphone
--  发布时间：1/16/2009 8:51:00 AM

--

以下是引用Humphrey在2009-1-16 8:26:00的发言：
拜读了viaphone同志的回复之后，我觉得您的意思我能理解，但是对一些概念更模糊了。由此引发了我提出下列问题，向各位前辈讨教：
其一、自然语言处理似乎不能在语义搜索全局使用，只能用于界面部分。很多文档都是通篇自然语言型（不含计算机程序设计语言代码），那么对这些文档的索引可以应用NLP技术吗？
其二、既然自然语言处理既包括统计方法，又包括推理方法。在语义搜索中间环节却需要“本体语言DL规则”来实现，感觉只是逆向证明自然语言处理不包含推理规则（至少是DL推理规则）。如果是这样那么NLP又只剩下统计方法了，因此我想请示一下NLP的概念和包含的内容。
其三、用户界面（user interface）应该包括查询界面（query interface）和结果反馈界面（个人理解，不清楚是否有此说法，因此不能列举英文原文）吧。现在对语义查询系统界面的分类是什么样的呢？
感谢诸位对我的提问的关注和解答，又要麻烦各位，小可在此先谢过大家了。

一、是的，现在的绝大多数文档都是无结构的，所以对需要语义标注的工作，即把文档与语义知识关联起来。标注工作的文章也是一找一大堆了，大部份还是根据文本统计的方法来实现，你去找找。

二、你的推论“NLP过程不包含推理”有问题。我只是说语义搜索的中间有一个可以依靠DL规则进行语义推理和扩展的环节。与NLP也可有推理并不互斥。至于NLP用什么方法来实现推理，那就是八仙过海各显神通了。

三、界面这方面我没多大研究。。。。

--  作者：whfcarter
--  发布时间：1/16/2009 9:11:00 AM

--
为什么要这么在乎分类呢，类别都是在这一个东西已经有一定研究或者成果之后产生的（类似bottom up的思路，现有data，后有schema）。而研究本身就追求创新，对于做semantic web的人来说，应该是problem-driven, user-driven和data-driven的去看待问题，在已有的分类体系里面，本身就说明你的贡献不会很大，无非是小修小补的。

problem-driven就是你现在遇到什么实际问题，对于enterprise来说，他们经常会采取这样的方法，因为研究和产品不能很脱机，不是做basic research。
user-driven的话，就是考虑用户的需求，这点才能保证研究有价值，在此申明这里不是说basic research
data-driven就是说根据数据的特性来考虑问题，这也就是web为什么永远有做不完的研究，因为数据不断地在变化，从传统的web page到后来的social tagging以及semantic data，不同的数据带来了不同的机遇和挑战，就算对固定的数据如semantic data，在不同的阶段，由于数据的规模，数据的分布，以及关注点不同提出了不同的要求。

还有一点就是要注意交叉学科，我们在讨论中不断引入NLP， HCI (Human Computer Interaction)等其他领域，所以这也是sw研究的特点（对semantic search的研究同样适用）

--  作者：Humphrey
--  发布时间：1/16/2009 9:12:00 AM

--
感谢viaphone同志的热情解答，从您和whfcarter同志的回复中，我学到了好多东西，从技术到理念。能够和你们交流，我感到十分荣幸。谢谢二位同志！

--  作者：viaphone
--  发布时间：1/16/2009 9:59:00 AM

--
过奖....我还是菜鸟....whf才是大牛

--  作者：Humphrey
--  发布时间：1/16/2009 2:29:00 PM

--
通过这两天的交流和思考，我发现自然语言处理是个很特别的学科（或者叫做方法更合适？），它的覆盖范围很大，连计算机语言学也包含在内；其方法也很全面，包括统计学方法和推理方法。
有趣也正是有趣在这里，既然自然语言处理如此兼容并包，从应用角度来看似乎已完全可以取代语义网这个概念（或者与之等同）。我原来一直认为它的范围很狭窄的，看来有必要再了解一下自然语言处理，再做语义搜索？

--  作者：Humphrey
--  发布时间：2/8/2009 9:32:00 AM

--
whfcarter同志提供的文献我已悉数下载，不过在实际使用时遭遇了一些技术性问题。
一个是在解压缩过程结束之后，有压缩包损坏的提示，但是文档还是出来了。
第二个就是，四个压缩包里的文件是同名的，难道它们是一个？还是由一个文档拆分而来的呢？
最后，我惊奇的发现，这个文档的编辑软件是Word2007，真是太强了，我必须安装一个专用的格式兼容软件才能用呢！
虽然可能费一些劲，但是能够从whfcarter同志提供的文件中学到新知是最重要的。在此向whfcarter同志表示感谢和敬意。谢谢！

--  作者：whfcarter
--  发布时间：2/8/2009 12:35:00 PM

--
四个压缩包里面的是一个文件，由于文件太大，我选择了分卷压缩。同时我考虑到向下兼容，所以虽然使用07编辑，但是还是保存为03支持的格式。如果实在不能打开，可以考虑使用Open office最新版本，它可以打开07格式，毕竟07其实就是一些XML文件。

--  作者：vannus
--  发布时间：3/11/2009 12:32:00 AM

--
whfcarter同志的压缩文件已全部下载，但是无法完成解压，好像压缩文件命名不对。

--  作者：Humphrey
--  发布时间：3/11/2009 9:36:00 AM

--
可以解压缩，文件名没有问题。需要注意的是按顺序进行选择，否则不能完成解压过程。

W 3 C h i n a ( since 2003 ) 旗下站点
苏ICP备05006046号《全国人大常委会关于维护互联网安全的决定》《计算机信息网络国际联网安全保护管理办法》

171.875ms