3.6.1 Suggestions for future research
3.6.1.1 The problem with ambiguity
The preceding study may give the impression that semantic relations are clear-cut categories or that a linguistic pattern used to express a particular relation is used to express only that relation. However, neither of the above is true. Semantic relations and linguistic patterns are prone to ambiguity.
First, not all researchers agree on what constitutes meronymy, for example. Cruse (1986, 175) states that "Entities such as groups, classes and collections stand in relations which resemble meronymy with their constituent elements." Winston et al. (1987, 423), however, have this to say: "Collections must be distinguished from classes. The class-member relation is not a meronymic relation because it is not expressed by 'part' but by 'is'..."
An example from my own research will help illustrate the confusion that sometimes arises when trying to determine what relation is being expressed in a given sentence. The following two sentences appear on a single TERMIUM record and describe the same phenomenon. However, one seems to express meronymy, while the other seems to express hyponymy:
a) The earth's surface (the material that is to be classified) is divided into soil and nonsoil.
b) On distingue deux sortes de matériaux à la surface de la terre (objet de la classification): le sol et le non-sol.
The first seems meronymic: both soil and non-soil are parts of the earth's surface. The second is hyponymic: soil and non-soil are kinds of surface material.
Does this mean, then, that a relation that appears to hold between two concepts is not inherent and absolute, but rather, is dependent on point of view or the way the sentence is worded? To answer this question, much detailed research is needed to work out the subtleties of language (and maybe even those of human perception).
Then comes the problem of categorizing linguistic patterns, which can also be ambiguous. If we do decide that sentences a) and b) above are both expressing hyponymy, then the pattern divided into (and its inflectional variants divide into, division into, etc.) is a hyponymic pattern. What happens if we then encounter a sentence such as the following one from TERMIUM:
[Insects are] small arthropod animals characterized, in the adult state, by division of the body into head, thorax, and abdomen, three pairs of legs on the thorax, and, usually, two pairs of membranous wings.
This is very obviously expressing meronymy, i.e. the parts of an insect. Conclusion: the pattern divid* into is used in real language to express both hyponymy and meronymy. A knowledge extraction program would have to include the pattern in both places. As a result, a search for hyponymy will necessarily retrieve, as well, all the meronymic expressions using divid* into, which would constitute noise. This is an unavoidable situation until researchers are able (if possible) to tease out the subtle differences between meronymic and hyponymic sentences using this pattern. The same can be said for other areas of relation overlap. I believe that this would imply a knowledge extraction program capable of performing some amount of semantic analysis--a step far beyond simple character-string recognition.
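The overlap described above can be illustrated with a minimal Python sketch. It is not the Text Analyzer itself: the pattern table, the regular expressions, the function name search and the shortened example sentences are all assumptions made for illustration. The only point it shows is that a pattern listed under two relations returns the same sentences for both searches.

```python
import re

# Hypothetical regex for "divid* into" and "division ... into".
DIVIDE_INTO = re.compile(r"\bdivi(?:d\w*|sion)\b.*?\binto\b", re.IGNORECASE)

# Hypothetical pattern table. DIVIDE_INTO is listed under BOTH relations,
# because real text uses it for hyponymy and for meronymy.
PATTERNS = {
    "hyponymy": [DIVIDE_INTO, re.compile(r"\bkinds? of\b", re.IGNORECASE)],
    "meronymy": [DIVIDE_INTO, re.compile(r"\bparts? of\b", re.IGNORECASE)],
}

def search(relation, sentences):
    """Return every sentence matched by any pattern listed for the relation."""
    return [s for s in sentences if any(p.search(s) for p in PATTERNS[relation])]

# Shortened versions of the two TERMIUM sentences discussed above.
SENTENCES = [
    "The earth's surface is divided into soil and nonsoil.",
    "Insects are characterized by division of the body into head, thorax, and abdomen.",
]

hyponymy_hits = search("hyponymy", SENTENCES)
meronymy_hits = search("meronymy", SENTENCES)
```

Both searches return both sentences, so each result list contains one sentence that is noise for that relation. String matching alone cannot separate them; some form of semantic analysis would be needed.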
A related problem is the assumption that a given well-known pattern expresses only those relations we normally associate with it. For example, what relation does the pattern define express? The automatic answer is probably hyponymy, as in "Geotextiles are defined as permeable textiles used in conjunction with soils or rocks" (genus plus differentiae specificae).
This answer is not wrong. But neither is it totally right, because the word define is not as "clean" as we would like; it is potentially ambiguous. For example, in the context of the Java programming language, "to define" means "to create", "to call into existence". Classes and interfaces are defined within packages. This meaning of define is similar to that seen in, say, WordPerfect. When a user wants to create columns in a document, the columns must first be "defined", i.e. created by the user specifying their number, width, type, etc. The relation expressed in this case is that of "creation" or perhaps "generation": two concepts are related by the fact that one creates the other.
The conclusion is that, across subject fields, some patterns can "switch" categories and end up expressing a different semantic relation. Again, much research is needed on large amounts of text in various subject fields to determine which patterns are relatively fixed and which are ambiguous.
The above discussion about relation overlap tends to paint a negative picture of the fuzziness among relations and patterns expressing them. This fuzziness is not really a major concern in a terminology context, however, because regardless of how sentences expressing semantic relations are extracted, they will be useful anyway. For example, a terminologist performs a meronymic search for bacteria. If the pattern part of is programmed into the knowledge extraction program, a sentence such as "Bacteria are part of the animal kingdom" may be retrieved. This sentence is expressing class membership; hence hyponymy. This is valuable information for the terminologist, who is not about to discard it simply because it was harvested during a search for meronymy.
3.6.1.2 Lexical vs. grammatical patterns
Up to this point, the work for this thesis concentrated on linguistic patterns that are lexical, i.e. words (or character strings) that express particular relations. Another area for further study is grammatical patterns--those parts of speech in certain positions in a sentence that can express those relations. Consider the following sentence:
The oil filter removes abrasive particles that enter the lubrication system before they can cause excessive wear.
It is clear that the oil filter's function is being explained. However, there are no lexical patterns in this sentence that should be entered into a knowledge extraction program; it is the underlying sequence of parts of speech that expresses function: term + action verb.
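The term + action verb pattern can be sketched in Python as follows. This is an illustration under stated assumptions, not the method explored in Chapter 4. The verb list ACTION_VERBS is a hand-built stand-in for a real part-of-speech tagger, the function name find_function_statements is hypothetical, and the second example sentence is invented to provide a non-match.

```python
# Hypothetical, hand-built list of action verbs (third-person singular forms).
# A real system would presumably rely on a part-of-speech tagger instead.
ACTION_VERBS = {"removes", "filters", "pumps", "cools", "prevents"}

def find_function_statements(term, sentences):
    """Return (term, verb) pairs where the term is directly followed by an action verb."""
    term_tokens = term.lower().split()
    n = len(term_tokens)
    hits = []
    for sentence in sentences:
        tokens = sentence.lower().rstrip(".").split()
        # Stop early enough that tokens[i + n] always exists.
        for i in range(len(tokens) - n):
            if tokens[i:i + n] == term_tokens and tokens[i + n] in ACTION_VERBS:
                hits.append((term, tokens[i + n]))
    return hits

SENTENCES = [
    "The oil filter removes abrasive particles that enter the lubrication "
    "system before they can cause excessive wear.",
    "The oil filter is part of the lubrication system.",  # invented non-match
]

function_hits = find_function_statements("oil filter", SENTENCES)
```

Only the first sentence is retrieved, because "oil filter" is followed by "removes". No particular word acts as the pattern here; what matters is the grammatical category of the word that follows the term.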
Chapter 4, Section 4.2 of this thesis presents a brief exploration into using grammatical patterns for knowledge extraction, but is by no means exhaustive.
3.6.2 General conclusions
The first conclusion to be drawn from my research, involving an experimental program called the Text Analyzer, is that knowledge extraction technology can play an important role in the field of terminology. I have been able to demonstrate, by partially simulating a terminology project, that at least some of the conceptual analysis can be semi-automated. The focus for future research on this technology and other systems should be on obtaining optimal recall values by discovering as many lexical and grammatical patterns as possible for each relation and determining the best search window size for each pattern.
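The relationship between search window size and recall can be made concrete with a small Python sketch. Several things here are assumptions: the window is taken to be the maximum number of tokens between a term and a pattern, the two sentences are toy data adapted from the oil filter example, and both are hand-labelled as relevant. The Text Analyzer may define its window differently.

```python
def within_window(tokens, term, pattern, window):
    """True if `term` and `pattern` occur at most `window` tokens apart."""
    term_pos = [i for i, t in enumerate(tokens) if t == term]
    pat_pos = [i for i, t in enumerate(tokens) if t == pattern]
    return any(abs(i - j) <= window for i in term_pos for j in pat_pos)

def recall(retrieved, relevant):
    """Fraction of the relevant items that were actually retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(relevant)

# Toy corpus (illustrative only); both sentences are labelled relevant.
SENTENCES = [
    "the filter removes abrasive particles",
    "the filter in the lubrication system removes particles",
]
RELEVANT = {0, 1}

def retrieve(term, pattern, window):
    """Indices of sentences where term and pattern fall within the window."""
    return {i for i, s in enumerate(SENTENCES)
            if within_window(s.split(), term, pattern, window)}

recall_narrow = recall(retrieve("filter", "removes", 2), RELEVANT)
recall_wide = recall(retrieve("filter", "removes", 5), RELEVANT)
```

With a window of 2 tokens only the first sentence is retrieved, and recall is 0.5. With a window of 5 both are retrieved, and recall is 1.0. A wider window would also be expected to admit more noise on real text, which is why the best size probably has to be determined pattern by pattern.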
The second conclusion is the following: Before humans can program a computer with the "knowledge" required for effective, reliable text analysis, we will need to do a great deal of research into the intricacies and subtleties of language so we ourselves can fully understand semantic relations in general and how they are expressed in real language in particular.