File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/97/w97-1105_metho.xml
Size: 15,687 bytes
Last Modified: 2025-10-06 14:14:51
<?xml version="1.0" standalone="yes"?> <Paper uid="W97-1105"> <Title>A LEXICAL DATABASE TOOL FOR QUANTITATIVE PHONOLOGICAL RESEARCH</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> EXAMPLE </SectionTitle> <Paragraph position="0"> This section shows how the system can be used to support phonological analysis. The language data comes from Dschang, a Grassfields Bantu language of Cameroon, and is structured into a lexicon consisting of 2200 records. Suppose we wished to learn about phonotactic constraints in the syllable rhyme. The following sequence of queries was not artificially constructed, but was issued in an actual session with the system in the field, running the Web server in stand-alone mode. The first query is displayed below. 3</Paragraph> <Paragraph position="2"> The main attribute of interest is the root attribute. 4 The .* expression stands for a sequence of zero or more segments. The expressions $V and $C are variables defined in the vars section of the query form.</Paragraph> <Paragraph position="3"> These are strings, but when surrounded with brackets, as in [$V] and [$C], they function as wild cards which match a single element from the string. The # character is a boundary symbol marking the end of the root. Observe that the root attribute contains two parenthesised subexpressions. These will be called parameters and have a special role in structuring the search output. This is best demonstrated by way of an example. Consider the table below, which is the result 3 The display is only a crude approximation to the HTML form. Note that the query form comes with the variables already filled in so that it is not necessary for the user to supply them, although they can be edited. 
The transcription symbols used in the system have the following interpretation: U=ʉ, @=ə, E=ɛ, O=ɔ, N=ŋ, '=ʔ.</Paragraph> <Paragraph position="4"> 4 In the following discussion, 'attribute' refers to a line in the query form while 'field' refers to part of a database record.</Paragraph> <Paragraph position="5"> of the above query. In this table, the row labels are all the segments which matched the variable $V, while the column labels are just the segments that matched $C.</Paragraph> <Paragraph position="6"> Search Results:</Paragraph> <Paragraph position="8"/> <Paragraph position="10"> There are sufficient gaps in the table to make us wonder if all the segments are actually phonemes. For example, consider o and u, given that they are phonetically very similar ([co] and [u] respectively). We can easily set up o as an allophone of u before k. Only the case of glottal stop needs to be considered. So we revise the form, replacing $V with just the vowels in question, and replacing the $C of the coda with an apostrophe (for glottal stop). We add a term for the syllable onset and resubmit the query. See Figure 2. This time, several attributes are omitted from the display for brevity.</Paragraph> <Paragraph position="11"> We can now conclude that o and u are in complementary distribution, except for the five words corresponding to pf and v onsets. But what are these words? We revise the form again, further restricting the search string as follows: Search Attributes: display: speech word gloss root: .*(pf|v)[ou]'# The display parameter is set to speech word gloss, allowing us to see (and hear) the individual lexical items. The results are shown below.</Paragraph> <Paragraph position="12"> The cells of the output table now contain fragments of the lexical entries. The first part is an icon which, when clicked, plays the speech file. The second part is a gif of the orthographic form of the word. The third part is the English gloss. 
Note that the above nouns have different prefixes (e.g. le-, m-, a-). These are noun class prefixes and are not part of the root field. If we had wanted to take prefixes into consideration then the as attribute, containing a transcription of the whole word, could have been used instead.</Paragraph> <Paragraph position="13"> On listening to the speech files, it was found that the syllables pfo' and pfu' sounded exactly the same, as did vo' and vu'. The whole process up to this point had taken less than five minutes. After some quick informant work to recheck the data and hear the native-speaker intuitions, it was clear that the distinction between o and u in closed syllables was subphonemic.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> MORE POWERFUL QUERIES </SectionTitle> <Paragraph position="0"> Constraining one field and displaying another In some situations we are not interested in seeing the field which was constrained, but another one instead.</Paragraph> <Paragraph position="1"> The next query displays the tone field for monosyllabic roots, classed into open and closed syllables. Although the root attribute is used in the query, the root field is not actually displayed. (This query makes use of a projection function which maps all consonants onto C and all vowels onto V, as will be explained later.) Search Attributes: display: tone root: #C+V(C?)# ($CV-proj) The C+ expression denotes a sequence of one or more consonants, while C? denotes an optional coda consonant. By making C? into a parameter (using parentheses) the search results will be presented in a two-column table, one column for open syllables (with a null label) and one for closed syllables (labelled C). 
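The projection and the coda parameter just described can be sketched in Python. This is only an illustration under stated assumptions, not the system's own Perl code: the consonant and vowel inventories, the toy records and the function name classify_by_coda are hypothetical, and str.translate stands in for the Perl tr function.

```python
import re
from collections import defaultdict

# Hypothetical segment inventories standing in for the $C and $V variables.
CONSONANTS = "pbmtdnkgNfvszl'"
VOWELS = "ieEa@OoUu"

# Projection: map every consonant onto C and every vowel onto V.
CV_PROJ = str.maketrans(CONSONANTS + VOWELS,
                        "C" * len(CONSONANTS) + "V" * len(VOWELS))

# One or more onset consonants, one vowel and an optional coda.
# The parenthesised coda is the parameter that labels the columns.
PATTERN = re.compile(r"#C+V(C?)#")

def classify_by_coda(entries):
    """Group the tone field of monosyllabic roots by the matched coda."""
    columns = defaultdict(list)
    for root, tone in entries:
        match = PATTERN.fullmatch("#" + root.translate(CV_PROJ) + "#")
        if match:
            columns[match.group(1)].append(tone)
    return dict(columns)

# Toy (root, tone) records, invented for illustration only.
toy = [("ku", "L"), ("kup", "H"), ("pfo'", "LH"), ("kupa", "HL")]
table = classify_by_coda(toy)
```

With these toy records the open syllable lands in the column with the empty label and the two closed syllables in the column labelled C; the disyllabic root is skipped because its CV skeleton does not match.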
A minor change to the root attribute, enlarging the scope of the parameter (#C+(VC?)#), will produce the more satisfactory column labels V and VC.</Paragraph> <Paragraph position="2"> Searching for near-minimal sets Finding good minimal sets is a heuristic process. No attempt has been made to encode heuristics into the system. Rather, the aim has been to permit flexible interaction between user and system as a collection of minimal sets is refined. To facilitate this process, the regular expression notation is extended slightly. Recall the way that parameters (parenthesised subexpressions) allowed output to be structured. One of the parameters will be said to be in focus. Syntactically, this is expressed using braces instead of parentheses. Semantically, such a parameter becomes the focus of a search for minimal sets.</Paragraph> <Paragraph position="3"> Typically, this parameter will contain a list of segments, such as {[ou]}, or an optional segment whose presence is to be contrasted with its absence, such as {h?}.</Paragraph> <Paragraph position="4"> In order for a minimal set to be found, the parameter in focus must have more than one possible instantiation, while the other parameters remain unchanged. To see how this works, consider the following example.</Paragraph> <Paragraph position="5"> Suppose we wish to identify the minimal pairs for o/u discussed above, but without having to specify glottal stop in the query, as shown in Figure 3. Note that this is an example of a 3D table.</Paragraph> <Paragraph position="6"> If these were not enough minimal pairs, we could relax the restrictions on the context. For example, if we do not wish to insist on the following consonant being identical across minimal pairs, we can remove the second set of parentheses thus: .*([$C]+){[ou]}[$C]#. This now gives minimal pairs like legOk work and 13gu' year. 
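The role of the parameter in focus in the relaxed query can be sketched in Python. The sketch assumes that the braces are rewritten as an ordinary second capture group, so that group 1 is the context parameter and group 2 the focus; the inventory C, the toy roots and the function name minimal_sets are hypothetical, and the non-greedy prefix is a choice made for this sketch rather than something taken from the system.

```python
import re
from collections import defaultdict

C = "pbmtdnkgNfvszl'"  # hypothetical consonant inventory for $C

# .*([$C]+){[ou]}[$C]# with the focus braces written as a second group.
# The prefix is non-greedy so the whole cluster lands in the context group.
PATTERN = re.compile(r".*?([" + C + r"]+)([ou])[" + C + r"]#")

def minimal_sets(roots):
    """Return the contexts whose focus parameter takes several values."""
    cells = defaultdict(lambda: defaultdict(list))
    for root in roots:
        match = PATTERN.fullmatch(root + "#")
        if match:
            context, focus = match.groups()
            cells[context][focus].append(root)
    # A minimal set needs more than one instantiation of the focus
    # while the remaining parameters (here, the context) stay fixed.
    return {context: dict(by_focus)
            for context, by_focus in cells.items()
            if len(by_focus) > 1}

# Toy roots, invented for illustration only.
toy = ["pfo'", "pfu'", "kup", "tok", "vo'", "vuk"]
sets = minimal_sets(toy)
```

Because the final consonant is not a parameter in this version of the query, vo' and vuk count as a pair in the same way as pfo' and pfu', while kup and tok are dropped since each context has only one value of the focus.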
Observe that the consonant preceding the o/u vowel is fixed across the minimal pair, since this was still parenthesised in the query string.</Paragraph> <Paragraph position="7"> Usually, it is best for minimal pairs to have similar syntactic distribution. We can add a restriction that all minimal pairs must be drawn from the same syntactic category by making the whole part attribute into a parameter as follows.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Search Attributes: </SectionTitle> <Paragraph position="0"> display: tone root: .*([$C]+){[ou]}[$C]# part: (.*) Making the part attribute into a parameter adds an extra dimension to the table of results. We now only see an o/u minimal pair if the other parameters agree. In other words, all minimal pairs that are reported will contain the same consonant cluster before the o/u vowel and will be from the same syntactic category. Variables across attributes There are occasions where we need to have the same variable appearing in different attributes. For example, suppose we wanted to check where the southern dialect</Paragraph> <Paragraph position="2"> root: .*(3[$V]+).* s_dialect: .*$3.* This query makes use of another syntactic extension to regular expressions. An arbitrary one-digit number which appears immediately inside a parameter allows the parameter to be referred to elsewhere. This means that whichever sequence of vowels matches [$V]+ in the root field must also appear somewhere in the s_dialect field.</Paragraph> <Paragraph position="3"> Negative restrictions The simplest kind of negative restriction is built using the set complement operator (the caret). However, this only works for single-character complements. A much more powerful negation is available with the (?! zero-width negative lookahead assertion, available in Perl 5, which I will now discuss.</Paragraph> <Paragraph position="4"> The next example uses the tone attribute. 
Dschang is a tone language, and the records in the lexicon include a field containing a tone melody. Tone melodies consist of the characters H (high), L (low), D (downstep) and F (fall). A single tone has the form D?[HL]F?, i.e. an optional downstep, followed by H or L, followed by an optional fall. The next example finds all entries starting with a sequence of unlike tones.</Paragraph> <Paragraph position="5"> Search Attributes: root: .*(1[$T])(?!$1)[$T].* vars: $T = D?[HL]F? The (1[$T]) expression matches any tone and sets the $1 variable to the tone which was matched. The (?!$1) expression requires that whatever follows the first tone is different, and the final [$T] insists that this same following material is a tone (rather than being empty, for example). 6 distinct tones are sufficiently different from each other, so that one is not an initial substring of another.</Paragraph> <Paragraph position="6"> I earlier introduced the notion of projections. In fact, the system allows the user to apply an arbitrary manipulation to any attribute before the matching is carried out. Here is the query again, this time with the $CV-proj variable filled out.</Paragraph> <Paragraph position="7"> Search Attributes: display: tone root: #C+V(C?)# ($CV-proj) vars: $CV-proj = {tr/$C/C/; tr/$V/V/;} This causes the Perl tr (transliterate) function to be applied to the root attribute before the #C+V(C?)# regular expression is matched on this field.</Paragraph> <Paragraph position="8"> Projections can also be used to simulate second order variables, such as required for place of articulation. Suppose that the language has three places of articulation: L (labial), A (alveolar) and V (velar). We are interested in finding any unassimilated sequences in the data (i.e., adjacent consonants with different places of articulation). The following query does just this. 
Prior to matching, the segments which have a place of articulation value are projected to that value, again using tr. The query expression looks for a sequence of any pair $P$P, where $P is a second order variable ranging over places of articulation.</Paragraph> <Paragraph position="9"> Search Attributes: display: word root: .*(5$P)(?!$5)($P).* ($P-proj) vars: $P-proj=tr/pbmtdnkgN/LLLAAAVVV/;</Paragraph> <Paragraph position="11"> Observe that the second $P must be different from the first, because of the zero-width negative lookahead assertion (?!$5). This states that immediately to the right of this position one does not find an instance of $5, where this variable is the place of articulation found in the first position. The output of the query is a 3 x 3 table showing all words that contain unassimilated consonant sequences.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> SYSTEM OVERVIEW </SectionTitle> <Paragraph position="0"> Lexicon compiler The base lexicon is in Shoebox format, in which the fields are not required to be in a fixed order. To save on runtime processing, a preprocessing step is applied to each field. For example, the contents of the \w field, comprising characters from the Cameroon character set, are replaced by a pointer to a graphics file for the word (i.e. a URL referencing a gif). 7 Each record is processed into a single line, where fields occur in a canonical order and a field separator is inserted, and the compiled lexicon is stored as a DBM file for rapid loading.</Paragraph> <Paragraph position="1"> The query string The search attributes in the query form can contain arbitrary Perl V5 regular expressions, along with some extensions introduced above. A CGI program constructs a query string based on the submitted form data. 
The query string is padded with wild cards for those fields which were not restricted in the query form.</Paragraph> <Paragraph position="2"> The dimensionality of the output and the axis labels are determined by the appearance of 'parameters' in the search attributes. These parenthesised subexpressions are copied directly into the query string. So, for example, the first query above contained the search expression .*([$V])([$C])# applied to the root field. This field occupies the fifth position in the compiled version of a record, and so the search string is as follows. The variable $e matches any sequence of characters not containing the field separator.</Paragraph> <Paragraph position="4"> The search loop Search involves a linear pass over the whole lexicon %LEX. 8 The parameters contained in $search are tied to the variables $1 - $4. These are stored in four associative arrays $dim1 - $dim4 to be used later as axis labels.</Paragraph> <Paragraph position="5"> foreach $entry (keys %LEX) {</Paragraph> <Paragraph position="7"> 7 These gifs were generated using LaTeX along with the utilities pstogif and giftool.</Paragraph> <Paragraph position="8"> 8 Inverting on individual fields was avoided because of the run-time overheads and the fact that this prevents variable instantiation across fields.</Paragraph> <Paragraph position="9"> Finally, a pointer to the entry is stored in the 4D array $hits (appended to any existing hits in that cell). Here we see that the structuring of the output table using parameters is virtually transparent, with Perl itself doing the necessary housekeeping.</Paragraph> <Paragraph position="10"> As an example, suppose that the following lexical entry is being considered at the top of the above loop:</Paragraph> <Paragraph position="12"> By matching this against the query string given in our first example we end up matching .*([$V])([$C])# with #kup#. This results in $1=u and $2=p. 
The entries $dim1{u} and $dim2{p} are incremented, recording these values for later use in the $V and $C axes respectively. Finally, $hits{&quot;u;p;;&quot;} is updated with the index 0107.</Paragraph> <Paragraph position="13"> The display loop This module cycles through the axis labels that were stored in $dim1 - $dim4 and combines them to access the $hits array. At each level of nesting, code is generated for the HTML or LaTeX table output. At the innermost level, the fields selected by the user in the display attribute are used to build the current cell.</Paragraph> </Section></Paper>
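The search and display loops described above can be sketched in Python. This is a hedged reconstruction for a two-parameter query, not the original Perl: the record layout, the field separator, the toy lexicon and the names search and render are assumptions made for illustration, and tab-separated text stands in for the HTML or LaTeX output.

```python
import re

# Hypothetical compiled lexicon: entry index -> one-line record whose
# fields are joined by a separator; the root is the fifth field here.
SEP = ";"
LEX = {
    "0107": "id107;kup;N;L;#kup#",
    "0214": "id214;tok;V;H;#tok#",
    "0350": "id350;ku;N;L;#ku#",
}

V, C = "ieEa@OoUu", "pbmtdnkgNfvszl'"  # hypothetical $V and $C inventories
FIELD = "[^" + SEP + "]*"  # any field content, never crossing a separator

# Four unrestricted fields padded with wild cards, then the root query
# .*([$V])([$C])# whose two groups are the table parameters.
SEARCH = re.compile((FIELD + SEP) * 4 + ".*([" + V + "])([" + C + "])#")

def search(lexicon):
    """Linear pass: record axis labels and the hits for each cell."""
    dim1, dim2, hits = {}, {}, {}
    for index, record in lexicon.items():
        match = SEARCH.fullmatch(record)
        if match:
            v, c = match.groups()
            dim1[v] = dim1.get(v, 0) + 1
            dim2[c] = dim2.get(c, 0) + 1
            hits.setdefault((v, c), []).append(index)
    return dim1, dim2, hits

def render(dim1, dim2, hits):
    """Display loop: one row per $V label, one column per $C label."""
    cols = sorted(dim2)
    lines = ["\t".join([""] + cols)]
    for v in sorted(dim1):
        cells = [",".join(hits.get((v, c), [])) for c in cols]
        lines.append("\t".join([v] + cells))
    return "\n".join(lines)

dim1, dim2, hits = search(LEX)
print(render(dim1, dim2, hits))
```

A tuple key plays the part of the joined key string used for the hits array, and the two dictionaries of counts correspond to the first two of the four axis-label arrays; a real display loop would build each cell from the fields named in the display attribute rather than from the entry index.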