A Computerized Database for Theoretical and Historical Bantu Phonology and Morphology

  A. PROJECT SUMMARY
  C. PROJECT DESCRIPTION
  1. INTRODUCTION
  2. RESULTS FROM PRESENT N.S.F. SUPPORT
    2.1. Development of human resources and the training of graduate students
    2.2. Inputting and collection of lexical material
    2.3. Database design and development
      2.3.1. Progress in database design and implementation
      2.3.2. Database development tools
      2.3.3. Query tools
    2.4. Other tools associated with CBOLD
      2.4.1. Bantu MapMaker
      2.4.2. CBOLD Bibliography
    2.5. Results of past CBOLD research efforts
      2.5.1. Velar palatalization
      2.5.2. High vowel frication
      2.5.3. Nasal consonant harmony
    2.6. List of publications/submissions of CBOLD senior staff and graduate students
  3. THE CURRENT PROPOSAL
    3.1. Complete building the database
    3.2. Building the "etymological backbone"
      3.2.1. Tagging
      3.2.2. Alternatives to tagging: computer-aided reconstruction
      3.2.3. Morpheme alignment and correspondence extraction
    3.3. Continuation of lexicographic work
    3.4. Continuation of linguistic research based on the CBOLD database
      3.4.1. Vowel harmony
      3.4.2. Prefix allomorphy
      3.4.3. Prosodic constraints on stem formation

A. PROJECT SUMMARY

This project aims to produce in Berkeley a Comparative Bantu On-Line Dictionary (CBOLD) designed to support and enhance the theoretical, descriptive and historical linguistic study of the languages in the important Bantu family. The activities of the project include: (i) the expansion and further development of the unified lexical database, which now contains data from some 150 of the approximately 500 Bantu languages; (ii) the production of original and scanned/retyped dictionaries and shorter lexicons to input into the database; (iii) the development of accompanying software and linguistic tools for linguistic research; and (iv) the dissemination through publication of theoretical, descriptive and historical work of CBOLD staff based on Bantu phonology and morphology. Over the next three years, CBOLD will: (i) complete the development of the database; (ii) increase the number of Bantu languages to upwards of 200; (iii) complete the integration of the available data; (iv) incorporate a system of "tagging" of lexical entries with Proto-Bantu (*PB) and other protoforms (e.g. regional Bantu, pre-*PB); and (v) produce linguistic research using the database and its associated tools. Although centered in Berkeley, CBOLD represents an international effort with free exchange of materials both to and from the project. As outlined below, extensive collaborations have been established with scholars in North America, Europe and Africa, who have contributed lexicons, participated in software development, and attended CBOLD-featured conferences.


C. PROJECT DESCRIPTION
1. INTRODUCTION

The present proposal requests a three-year renewal of the Comparative Bantu On-Line Dictionary (CBOLD) project, NSF Grant SBR93-19415 (February 1, 1994-January 31, 1997). In the words of the original funded proposal, this project "aims to produce in Berkeley a Comparative Bantu On-Line Dictionary (CBOLD) designed to support and enhance the theoretical, descriptive and historical linguistic study of the languages in the important Bantu family." As described below, we have realized (and in some cases surpassed) most of the goals outlined in the last proposal, including: (i) the creation of a uniform lexical database now containing data from some 150 of the approximately 500 Bantu languages; (ii) the production of original and scanned/retyped dictionaries and shorter lexicons to input into the database; (iii) the development of accompanying software and linguistic tools for linguistic research; and (iv) the dissemination through publication of theoretical, descriptive and historical work of CBOLD staff based on Bantu phonology and morphology. The present proposal seeks support to continue this effort. Over the next three years, CBOLD will: (i) complete the development of the database; (ii) increase the number of Bantu languages to upwards of 200; (iii) finish the integration of the available data; (iv) incorporate a system of "tagging" of lexical entries with Proto-Bantu (*PB) and other protoforms (e.g. regional Bantu, pre-*PB); and (v) produce linguistic research using the database and its associated tools. Although centered in Berkeley, CBOLD represents an international effort with free exchange of materials both to and from the project. As outlined below, extensive collaborations have been established with scholars in North America, Europe and Africa, who have contributed lexicons, participated in software development, and provided feedback at conference presentations related to the CBOLD project.

2. RESULTS FROM PRESENT N.S.F. SUPPORT

During the past two and one half years, the PI has directed the project "A computerized database for theoretical and historical Bantu phonology and morphology", NSF grant SBR93-19415 (February 1, 1994 to January 31, 1997). During this period CBOLD has:

(i) converted lexicographic data from 25 sources, including several extensive and complete dictionaries, totaling over 148,000 words (see 1a)

(ii) received from other researchers contributions of lexicographic data from about 140 languages totaling 198,000 words (see 1b)

(iii) provided to researchers an integrated comparative database with cross-platform query tools (see §2.2)

(iv) assisted or advised in the conversion of numerous other data sources

(v) developed (in collaboration with other researchers) cartographic tools geared toward Bantu linguistic research (see §2.3.1)

(vi) aggregated bibliographies from disparate sources into a coherent annotated bibliography containing over 6,800 citations for use by the Bantu research community (see §2.3.2)

(vii) provided data and resources used in the production of two completed dissertations and numerous research articles (see §2.4)

Thus, we have accomplished the first major phase of setting up this common resource. In addition, CBOLD has functioned as:

(i) a clearing house to receive and distribute lexical materials

(ii) a provider and developer of other tools, e.g. MapMaker, CBOLD bibliography, index of Bantu language names, database templates (see §2.2.2-§2.2.3)

(iii) an advice center to a growing number of dictionary-makers at universities and in the field who have contacted us

(iv) a focus and catalyst for future development of Bantu linguistics

2.1. Development of human resources and the training of graduate students

The initial funding of the project has contributed to the research and publications of seven graduate students, several of whom have used or are using CBOLD data in their dissertations or whose dissertation research has contributed to the project. To the extent that their efforts have been published, they are listed below in §2.4 and §2.5. CBOLD data is being used in pedagogical and research efforts by a number of researchers in the US and elsewhere (details may be found on the CBOLD web site, http://bantu.berkeley.edu/CBOLD.html). John Lowe, programmer and database administrator, used CBOLD data extensively in his dissertation research (completed in May 1995).

Four of Berkeley's Africanist students have also completed their Ph.D.s and now hold university faculty positions: Josephat Rugemalira (U. Dar es Salaam), Kathleen Hubbard (U.C. San Diego), Cheryl Zoll (M.I.T.), and Joyce Mathangwane (U. Botswana). Others who are continuing to work on CBOLD as graduate students are the following: (i) Jeri Moxley, who is publishing on velar palatalization with the PI, will produce a noun class database that will be used for morphological and semantic work for her dissertation; (ii) Armindo Ngunga (from Mozambique) has produced an extensive Yao P.21 lexicon of 7000+ entries which he has used to prepare several presentations and submitted papers (one on preconsonantal nasality with the PI); (iii) Galen Sibanda (from Zimbabwe) arrived at Berkeley this past academic year to do lexical work on his language, Ndebele S.44. The Berkeley graduate students had exclusive responsibility for organizing a day-long Special Session entitled "Historical issues in African linguistics" during the Berkeley Linguistics Society meeting on February 18, 1994. It is vitally important to the PI that the project contribute to the training of graduate students, and that they be especially encouraged to develop as independent researchers in the fields of general and Bantu linguistics. In addition to the Berkeley students, the PI is frequently contacted by students working on Bantu at other universities and provides them with verbal and written comments on their work. As part of the general openness of the project, CBOLD has encouraged both Berkeley and other graduate students to participate and, wherever possible, to use the lexicons and tools that we produce.

Although we do not repeat all of the rationale in the original funded proposal for creating CBOLD, it can be inferred from the following description of our goals and the range of accomplishments and activities sponsored by the project.

2.2. Inputting and collection of lexical material

CBOLD has surpassed its original goal of producing 15 dictionaries at Berkeley. Successful application of optical character recognition (OCR) technology coupled with heuristic error-detection and correction and determined proofreading and editing by competent and generous researchers all over the world enabled CBOLD to convert 26 dictionaries and lexicons, as summarized in (1) below, including language ID number from Guthrie (1967/71).

(1) Dictionaries converted by CBOLD (by Guthrie Language number; counts are approximate)

Language  ID    Source of Data    Forms
---------------------------------------
Londo     A.11  Kuperus 1985       1800  
Tunen     A.44  Dugast 1967        4190  
Tiene     B.81  Ellington 1977      580  
Koyo      C.24  Hyman/Ndzambo 1996 1600
Bamwe     C.30  Samarin n.d.        126  
Bobangi   C.32  Whitehead 1899     9500  
Lingala   C.36d Dzokange 1979      7200  
Holoholo  D.28  Coupez 1955         728  
Nande     D.42  Kavutirwaki 1978   2224  
Shi       D.53  Polak-Bynon 1978   2391  
Kiga      E.13  Taylor 1959       12700 
Ganda     E.15  Snoxall 1967      11243 
Nyambo    E.21  Rugemalira 1993    1230  
Sukuma    F.21  Mann 1966          3253  
Swahili   G.42  Rugemalira 1993    1399  
Kongo     H.16  Swartenbroeckx    25659 
Laadi     H.16f Jacquot 1982       9300  
Yaka      H.31  Ruttenberg 1971    3791  
Lozi      K.21  Jalla 1937        11200 
Pende     L.11  Gusimana 1972      8700  
Tonga     M.64  Turner 1952        1900  
Cewa      N.31b Sc&Heth. 1957      5295  
Yao       P.21  Sanderson 1954     7433  
Kwanyama  R.21  Turvey 1977        7200  
Shona     S.10  Taylor 1967       38000 
Kalanga   S.16  Mathangwane 1996   3765

In addition, CBOLD has received about 140 lexicons and dictionaries from outside contributors. A list of these materials is given in (5) in §2.4 below. Many of these lexicons have already been exploited for linguistic research. CBOLD has been successful in obtaining copyright releases for most of the copyrighted material. Oxford University Press has been quite generous in this regard. The machine-readable contributions by other scholars have generally been unrestricted, e.g. the Tanzanian Language Survey (TLS) data provided by Prof. Derek Nurse (Memorial University, Newfoundland). To facilitate the distribution of this material, CBOLD has circulated a data-sharing agreement for signature by Bantu researchers. The agreement, designated "The Bantuists' Manifesto", affirms the consent of members of the Bantuist academic community to the dissemination of their data. To date the document has been signed by 32 current and near-term CBOLD contributors, demonstrating their support for this project.

2.3. Database design and development

2.3.1. Progress in database design and implementation

Substantial energy has been invested in developing a database design which will provide the appropriate resources for Bantu research of the sort described in the original proposal. However, space restrictions permit only the most terse description of the implementation details of the database system. At the time of writing (June 1996) a number of the original milestones have been achieved: individual data sources, such as Cewa, Yaka, Yao, and Kalanga, have been converted into database form and used in support of specific research projects (see list of presentations/publications in §2.4). Most of the problems of representation we discovered have been discussed among Bantuists and computational linguists at Berkeley and elsewhere; most methodological and programming problems have been solved and a prototype containing over 100,000 words has been created. Of these, 41,000 have been parsed morphologically and phonologically and the rest will be parsed by the end of the fall of 1996. Substantial design and programming effort remains to implement a prototype containing usefully integrated versions of the converted data. We expect to complete this work and release the first "public" version of an integrated database during the fall of 1996. At this point the database would be distributed to and evaluations solicited from the Bantu research community.

2.3.2. Database development tools

The creation of the phonological and morphological representations of lexical items used in the CBOLD data required the development of a suite of software tools for recognizing and segmenting linguistic structures. Though the literature contains many descriptions of implemented algorithms which adduce or otherwise handle lexical data, few of these are immediately usable for the practical lexicographic data-processing required for the CBOLD database. As a result, the CBOLD tools are mainly custom-built and take advantage of language-specific and Bantu-specific heuristics, while attaining a degree of reusability. The major tools developed are sketched out below.

(i) Several format recognition programs have been written to recognize the structure of dictionary entries in text format, making it possible to extract particular data elements for further processing. These programs take advantage of specific conventions of the dictionary text (e.g. layout and typography) in order to divide the entries into headword, part-of-speech, definition, etc. In the scanned Luganda dictionary (Snoxall 1967), for example, the basic lexicographic data is distinguished (by SGML tags) from the extended dictionary entry (proverbs, example sentences, usage notes, etc.), which, while useful, has not yet found a specific use in the CBOLD research program. It is likely that these example sentence "mini-corpora", a rich source of data for comparative Bantu syntax and semantics, will be useful to other researchers.

(ii) Two database templates in FileMaker Pro have been used by researchers to create dictionaries of Bantu languages. Template I, developed at the Laboratoire Dynamique du Langage (DDL) in Lyon by Joel Brogniart, has been used to create Bantu language dictionaries, several of which are now part of the CBOLD database. It is a highly refined scheme for entering lexicographic data by way of a syllabic template, and for connecting modern forms to preexisting reconstructions. Template II, a simple format for entry of Bantu-specific lexical data, was developed at CBOLD in Berkeley by John Lowe. It has been used to enter data for nine of the dictionaries produced by CBOLD. Both templates are available on the CBOLD FTP site.

(iii) The CBOLD data representation standard is a draft document specifying several varieties of text documents which can be easily loaded into the CBOLD database. The standard, which provides a simple markup language and format specification, is designed to permit contributors to convert existing files into a suitable format, quickly and easily, usually by making a few simple changes with a text editor. Most database export formats are supported (with minor caveats) by the standard. Other tools developed at DDL and CBOLD are designed to facilitate the retranscription of data in different fonts and formats and to insert markup tags into unformatted text. Several of these have been made available to researchers on an individual basis.

(iv) A morphological stemmer. One of the first analyses performed on a machine-readable dictionary (MRD) is the segmentation of headwords into prefix and stem. This is carried out with a simple stemmer which compares a predefined list of prefixes and their common allomorphs with each dictionary headword and inserts a delimiter between the prefix and the stem. Since this program can take a "left-edge in" approach and since only the first and longest possible segmentation is made, there is no ambiguity in the parsing. So, for example, the program can separate prefixes ku- (before consonants) and kw- (before vowels) from the stems of verbs, and is not confused by nouns which have concord prefixes which begin with either o- or omu-/omw-. Of course, this procedure is a heuristic; a certain number of the segmentations are in error, and the entire result must be verified by a linguist familiar with the data. The program can also assign noun class and part-of-speech designations based on the occurrence of prefixes, facilitating the inclusion of this basic categorial information in wordlists.

(v) A finite state transducer calibrated for Bantu syllable structure is used to analyze the stem forms into phonological constituents, assigning constituents (e.g. segments) to canonical CV templates. This analysis makes it possible to search the dictionaries using boolean queries based on phonotactic properties. For example, it is possible to search the database to find all cases where vowels of a certain user-defined class are followed by vowels of another class (i.e. to determine the range of co-occurrence of segments in the V1 and V2 slots in Bantu syllables). This technique is being used by the PI to study vowel harmony properties of specific Bantu languages (cf. §3.4.1).
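
The stemmer and the syllable-template analysis described above can be sketched roughly as follows. This is only a minimal illustration in Python, not the CBOLD implementation: the prefix list, the five-vowel inventory, the one-letter-per-segment assumption, the function names, and the sample headwords are all assumptions made for the example, and a plain character mapping stands in for the finite state transducer.

```python
VOWELS = set("aeiou")  # assumed five-vowel orthography, one letter per segment

# Hypothetical prefix list: (prefix, context), where context says whether the
# stem must begin with a consonant ("C"), a vowel ("V"), or either (None).
PREFIXES = [("omu", "C"), ("omw", "V"), ("ku", "C"), ("kw", "V"), ("o", None)]


def segment_prefix(headword, prefixes=PREFIXES):
    """Split a headword into (prefix, stem), trying the longest prefix first.

    Working in from the left edge and keeping only the first, longest match
    gives exactly one segmentation per headword. The result is a heuristic
    and still has to be checked by a linguist.
    """
    for prefix, context in sorted(prefixes, key=lambda p: len(p[0]), reverse=True):
        stem = headword[len(prefix):]
        if not headword.startswith(prefix) or not stem:
            continue
        starts_with_vowel = stem[0] in VOWELS
        if context == "V" and not starts_with_vowel:
            continue
        if context == "C" and starts_with_vowel:
            continue
        return prefix, stem
    return "", headword


def cv_template(stem):
    """Map each segment of a stem to C or V."""
    return "".join("V" if ch in VOWELS else "C" for ch in stem)


def v1_v2(stem):
    """Return the first two vowels of the stem as (V1, V2), or None."""
    vowels = [ch for ch in stem if ch in VOWELS]
    return (vowels[0], vowels[1]) if len(vowels) >= 2 else None


def find_cooccurrence(headwords, class1, class2):
    """Find headwords whose stem has V1 in class1 and V2 in class2."""
    hits = []
    for word in headwords:
        _, stem = segment_prefix(word)
        pair = v1_v2(stem)
        if pair and pair[0] in class1 and pair[1] in class2:
            hits.append(word)
    return hits
```

With these assumed prefixes, `segment_prefix("kwagala")` yields `("kw", "agala")`, and `find_cooccurrence(words, {"i", "u"}, {"a"})` picks out the headwords whose stems have a high vowel in V1 followed by `a` in V2.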

2.3.3. Query tools

The query interface for the database allows the conventional types of searches to be performed on the data in the database. Since most of the work (in design and programming) has been in preparing appropriate representations and data structures, the query interface is relatively simple given some familiarity with the organizational principles of the database. It is possible to perform cross-index boolean searches (with limited wild card capability) on individual syllabic constituents, classes of constituents, words in glosses, morphological information including part-of-speech and noun class designations, and morphological constituents including distinct grammatical affixes. It is also possible to use extralinguistic information in queries such as source of information, language name, or Guthrie language ID number.
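
A cross-index boolean search of the kind just described might look like the following sketch. It is an illustration in Python rather than the FoxPro query interface itself; the record layout, the field names, and the sample records are invented placeholders, not CBOLD data, and the wild cards are those of Python's standard `fnmatch` module.

```python
from fnmatch import fnmatchcase

# Invented placeholder records; the field names are assumptions.
RECORDS = [
    {"language": "Yao", "guthrie": "P.21", "stem": "lima", "pos": "v",
     "noun_class": None, "gloss": "cultivate"},
    {"language": "Yao", "guthrie": "P.21", "stem": "ndu", "pos": "n",
     "noun_class": "1", "gloss": "person"},
    {"language": "Ganda", "guthrie": "E.15", "stem": "lima", "pos": "v",
     "noun_class": None, "gloss": "cultivate, dig"},
]


def matches(record, criteria):
    """True if the record satisfies every criterion (boolean AND).

    Each criterion maps a field name to a pattern that may contain the
    wild cards * and ?.
    """
    for field, pattern in criteria.items():
        value = record.get(field)
        if value is None or not fnmatchcase(str(value), pattern):
            return False
    return True


def search(records, *alternatives):
    """Return records that satisfy at least one alternative (boolean OR)."""
    return [r for r in records if any(matches(r, alt) for alt in alternatives)]
```

For example, `search(RECORDS, {"guthrie": "P.*", "pos": "v"})` combines an extralinguistic criterion (Guthrie zone P) with a morphological one (verbs), while passing several dictionaries expresses a disjunction.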

2.4. Other tools associated with CBOLD

2.4.1. Bantu MapMaker

A goal of CBOLD is to be able to display graphically the geographic distribution of linguistic features derived from database queries. A step in this direction is the creation of a simple GIS (Geographical Information System) called (Bantu) MapMaker. This Hypercard stack was developed as the result of a collaboration between Prof. Thilo Schadeberg of the University of Leiden and Dr. John Lowe. The application has been through three versions and is quite functional, though still evolving. The latest version has been widely distributed and provides Bantuists with a means to make publication quality linguistic maps of Africa, to perform certain types of GIS-like queries (such as plotting isoglosses and plotting data from external files), and to share map data easily with colleagues via email (MapMaker's internal representation of the information on a map is plain text, similar to the representations of "clickable maps" on the WWW). Sample maps are shown in (2a) and (2b) below. The first plots data points reflecting the geographic and lexical extent of CBOLD's coverage of the Bantu-speaking area. The second, an areal plot based on data extracted from the PI's vowel harmony database (see §3.4.1), shows the distribution of 5- vs. 7-vowel systems in Bantu.

(2) (a) Languages for which CBOLD has data (b) 5 vs. 7 Vowels in Bantu Languages

2.4.2. CBOLD Bibliography

In the course of the project, CBOLD has acquired electronic bibliographies contributed by a number of different researchers. To date, 9,074 citations received from eight sources have been merged into a unified bibliographic database. Citations which matched on a key composed of the author's last name, year of publication, and initial words of the title were aggregated into a single bibliographic entry and their annotations (which include, for example, Guthrie numbers and names of languages treated) were combined. The resulting database contains 6,877 citations (2,680 books, 4,001 journal articles, and 196 other documents) covering virtually all areas of Bantu research. This bibliography is searchable via the WWW (http://bantu.berkeley.edu/CBOLDBibs/BibMain.html) and is also available via FTP in four formats: (i) a "formatted" text file; (ii) a Refer/BibIX format; (iii) an EndNote bibliography; (iv) a FoxPro database, similar in organization to the Refer format. A simple application for using the database in FoxPro is also included.
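
The merging step described above can be sketched as follows. This is a hedged illustration in Python, not the procedure actually used: the record layout (`author`, `year`, `title`, `annotations`), the "Last, First" author format, and the choice of three initial title words are assumptions made for the example.

```python
import re


def citation_key(citation, n_title_words=3):
    """Build the match key: last name, year, and initial words of the title."""
    last_name = citation["author"].split(",")[0].strip().lower()
    title_words = re.findall(r"\w+", citation["title"].lower())[:n_title_words]
    return (last_name, str(citation["year"]), " ".join(title_words))


def merge_citations(citations):
    """Aggregate citations that share a key and combine their annotations."""
    merged = {}
    for cit in citations:
        key = citation_key(cit)
        if key not in merged:
            entry = dict(cit)
            entry["annotations"] = list(cit.get("annotations", []))
            merged[key] = entry
        else:
            # Same work from another source: keep the first record and
            # add any annotations it does not already carry.
            for note in cit.get("annotations", []):
                if note not in merged[key]["annotations"]:
                    merged[key]["annotations"].append(note)
    return list(merged.values())
```

A key this coarse tolerates differences in capitalization, initials, and subtitles between sources, at the cost of occasionally conflating distinct works by the same author in the same year whose titles begin alike.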

2.5. Results of past CBOLD research efforts

Although a second funding period is necessary to reach the goals we have set for CBOLD, the individual dictionaries which have been prepared and entered into the database can be and have been used for linguistic research by the PI (e.g. on synchronic and diachronic aspects of the morphology/phonology interface in Bantu) and his graduate students. In this section we briefly present three examples of such research that have led to publication.

2.5.1. Velar palatalization

Hyman and Moxley (1996) address an apparent problem for the Neogrammarian hypothesis of strict phonetic conditioning of primary sound change and the specific claim by Kiparsky (1973:75) that "no sound change can depend on morpheme boundaries". In many Bantu languages *k and *g are palatalized and affricated to c and j before front vowels only if the velar consonant is morpheme-initial. In order to explain this unusual morphological restriction, an extensive study was undertaken of velar palatalization throughout the Bantu zone. Bantu languages that palatalize velars could be classified into five types, which are systematically related to each other by the nature of environments in which palatalization takes place, as in (3):

(3) Environments of Velar Palatalization in Bantu

                    Type A  Type B  Type C  Type D  Type E
Across Morphemes      +
Morpheme-Internal     +       +
Root-initial          +       +       +
Prefixes              +       +       +       +
ky/gy > c/j           +       +       +       +       +

The authors provide evidence for a diachronic progression of Type E > D > C > B > A, arguing that palatalization was originally restricted but underwent gradual analogical extensions by morphological context. Particularly relevant are languages of Type C for which electronic dictionaries are now available, e.g. Shi D.53, Bemba M.42, Cewa N.31b, and Kalanga S.16 (the last two developed by graduate students J. Moxley and J. Mathangwane).

2.5.2. High vowel frication

Several studies deal with the frication of consonants before the *PB high vowels *i̧ and *u̧. Mathangwane (1996) documents the following changes in Kalanga and provides phonetic explanations for them, based on Ohala.

(4) Development of high vowels in Kalanga

*pi̧ > swi    *ti̧ > tshi    *ki̧ > si
*bi̧ > zwi    *di̧ > dzi     *gi̧ > zi
*pu̧ > fu     *tu̧ > thi     *ku̧ > fu
*bu̧ > vu     *du̧ > du      *gu̧ > vu

Zoll (1995) surveys the phenomenon and provides both cross-linguistic generalizations about these "mutations" and a formal feature-geometric account. Hyman (1996a) is more concerned with the problem that these changes are frequently conditioned differentially by morphological context. As verified via the CBOLD database, languages such as Ganda and Shi mutate *t, *d, *k and *g to [s, z] in all environments before *i̧, while *p and *b mutate to [s, z] only intramorphemically, not across morpheme boundaries. Closely related "Rutara" languages such as Haya, Nyambo, Nkore and Kiga show a similar exceptional labial pattern, but in these languages the frication/non-frication of *k, *g may also be sensitive to morphological environment. Hyman hypothesizes that these changes first occur morpheme-internally and then extend out, hitting consonants differentially according to place of articulation.

2.5.3. Nasal consonant harmony

A third case where phonological change is sensitive to morphology is nasal consonant harmony, the most transparent reflex of which is *[l] > [n], e.g. Bemba -tum-il- > -tum-in- `send to'. A widespread area of the Bantu zone has this phenomenon (Greenberg 1951), but with three variants: (i) nasalization only within a morpheme (Bastin 1983); (ii) nasalization only when the nasal consonant is in the preceding syllable; (iii) "long-distance" nasalization, where the nasal may be several syllables before the [l], e.g. Yaka H.31 -miituk-il- > -miituk-in- `sulk +applicative'. In order to study this phenomenon, the Yaka-French part of Ruttenberg (1970) was scanned, proof-read, and entered into the CBOLD database. As a demonstration of the value of an electronic dictionary, Hyman (1995) conducted an exhaustive study of the distribution of nasality in Yaka and of this non-local assimilatory phenomenon, which is repeated in nearby languages such as Kongo H.16 (Ao 1991, Odden 1994 and Piggott 1995), Pende L.11 (Gusimana 1972) and (optionally) Suku H.32 (Piper 1977). It was found that the only true exceptions to the leftward spread of nasality onto voiced consonants were in borrowings (e.g. kómélésáa `commerçant') and that the l/n alternations were a reflection of a general phonological property of Yaka, not simply a morphologized alternation.

In §3.4 we outline additional on-going projects that will be followed up during the second funding period of CBOLD.

2.6. List of publications/submissions of CBOLD senior staff and graduate students

Hyman, Larry M. 1995. Nasal consonant harmony at a distance: the case of Yaka. Studies in African Linguistics 24,1: 5-30.

Hyman, Larry M. and Joyce Mathangwane. 1997. Tonal domains and depressor consonants in Ikalanga. In L.M. Hyman and C. Kisseberth (eds), Theoretical aspects of Bantu tone. Stanford: C.S.L.I. (about to go to press).

Hyman, Larry M. and Jeri Moxley. 1996. The morpheme in phonological change: velar palatalization in Bantu. Diachronica 13.2 (in press).

Hyman, Larry M. Morphologie et frication diachronique en bantou. Mémoires de la Société de Linguistique de Paris (invited paper to be submitted, summer 1996).

Hyman, Larry M. and Armindo Ngunga. Preconsonantal nasality in Yao (to be submitted to Phonology, summer 1996).

Lowe, John B. 1995. Cross-linguistic lexicographic databases for etymological research, with examples from Sino-Tibetan and Bantu languages. Ph.D. Dissertation, U.C. Berkeley.

Mathangwane, Joyce. 1996. Phonetics and Phonology of Ikalanga: a diachronic and synchronic study. Ph.D. dissertation, University of California, Berkeley.

Ngunga, Armindo. 1996. The role of nasals in Ciyao Segmental Phonology. To appear in Proceedings of Berkeley Linguistic Society 22.

Zoll, Cheryl. 1995. Consonant mutation in Bantu. Linguistic Inquiry 26.536-545.

3. THE CURRENT PROPOSAL

The present proposal requests the funds required to complete the CBOLD database. As indicated in the above sections, we have made considerable progress toward this goal. In some cases we have surpassed our projections, particularly as concerns the conversion and entering of data, to which we devoted considerable energy during the first phase of the project. In other cases progress was slower, either as the result of having to wait for this data to become available, or because of complexities that arose during the first phase of the construction of CBOLD. During the seven months which remain in the first grant period, several major components of the project will be completed. The brief descriptions below cover both the remaining period and the work we plan to accomplish in the subsequent period.

In the next phase of the project, CBOLD intends to:

(i) complete the database and improve its availability to researchers, including those in Africa with limited access to computing and telecommunications facilities

(ii) continue the etymological work required to revise and complete the reconstruction of *PB

(iii) continue to convert printed materials into machine-readable form for inclusion in the database and to gather and disseminate lexicographic data on Bantu languages

(iv) continue work being conducted currently by the PI and his students on the phonology and morphology of Bantu languages based on the CBOLD database

3.1. Complete building the database

The ultimate development target for the CBOLD database as called for by the 1993 proposal is a stand-alone FoxPro-based system running on both Macintosh and Windows platforms. While the current version of the database software implements the basic query functionality called for in the original proposal, it lacks, at the time of this writing, several important components in the user interface; these features will be incorporated during the remaining grant period and refined during the next grant period. Also, the current version of the database contains only a bit more than one third of the total lexical data available. Most of these outstanding development and data conversion tasks, enumerated below, will be completed by early 1997.

(i) We will complete the editing and conversion of the remaining data sources into FoxPro format.

(ii) Also, the entire TLS 1975 (Tanzania Language Survey) corpus, itself nearly a quarter of CBOLD holdings, already in a FoxPro format, will be loaded into the database.

(iii) Certain data sources appear to be rather intractable computationally or linguistically; processing them has been deferred until the major components and data sources are in good shape. It is likely that the conversion of these data will not be completed until sometime during the first year of the requested renewal period.

(iv) Certain features of the query system which have been implemented in a rudimentary way and tested on small scale datasets may not scale up when applied to the entire 300,000 word corpus. Notably, the query subsystem for searching on phonological features exists only in this "alpha" state; this subsystem has both a data component (a feature specification for the transcription used for each data source and language must be provided) and a processing component (converting feature specifications into segments and searching those segments in the database). A certain amount of conceptual development followed by design and implementation remains before this subsystem (which will certainly be among the most useful for phonological research) can be fully integrated into the database.

(v) We will "wring-out" the query functions already developed and have our beta-testers evaluate the user interface before burning CD-ROMs and releasing the database to the research community.

(vi) At the time of the first proposal, the World Wide Web was virtually unknown. Now, however, it is evident that the WWW will be the dissemination and research environment of choice for many types of projects. CBOLD, along with several other Bantu researchers, has prepared niches on the web (http://bantu.berkeley.edu/CBOLD.html); we have already put a number of resources on the site which have been used by other researchers, not only Bantuists. The CBOLD database will be mounted on the WWW; we are currently negotiating with the managers of the SunSite server cluster at Berkeley and some of the principals of the Digital Libraries Project for access to software and server space to use for the CBOLD database. Of course, for researchers without internet access, the FoxPro version will continue to be available or a CD-ROM version of the WWW database will be made available.
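
The processing component described in item (iv) above (converting a feature specification into segments and then searching for those segments) can be illustrated with a minimal Python sketch. The feature table, the feature names, and the function names below are assumptions made for illustration; they are not the CBOLD feature specifications or the FoxPro implementation, and the sketch handles only single-character segments.

```python
# Hypothetical feature specification for one source's transcription.
# The segment inventory and feature values are illustrative only.
FEATURES = {
    "p": {"voiced": False, "place": "labial", "nasal": False},
    "b": {"voiced": True, "place": "labial", "nasal": False},
    "m": {"voiced": True, "place": "labial", "nasal": True},
    "t": {"voiced": False, "place": "coronal", "nasal": False},
    "d": {"voiced": True, "place": "coronal", "nasal": False},
    "n": {"voiced": True, "place": "coronal", "nasal": True},
    "k": {"voiced": False, "place": "velar", "nasal": False},
    "g": {"voiced": True, "place": "velar", "nasal": False},
}


def segments_matching(spec, features=FEATURES):
    """Return the sorted segments whose features include every pair in spec."""
    return sorted(
        seg
        for seg, feats in features.items()
        if all(feats.get(name) == value for name, value in spec.items())
    )


def search(forms, spec):
    """Return the forms containing at least one segment that matches spec."""
    wanted = set(segments_matching(spec))
    return [form for form in forms if any(ch in wanted for ch in form)]


# Toneless stems taken from the data in (8).
forms = ["biri", "tumbi", "tembo", "tuvi"]
hits = search(forms, {"voiced": True, "place": "labial"})
```

A real implementation would need one such table per data source and language, since the same symbol may stand for different segments in different transcriptions.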

(5) Inventory of CBOLD data sources (by Guthrie language number; counts are approximate)

A.11 Kwanyama Turvey 1977/7200
A.11 Londo Kuperus 1985/1800
A.25 Ngoli Burssens 1994/787
A.40 Basa Dautrey 1994/1986
A.44 Tunen Dugast 1967/4190
A.70 Lenje Évariste 1995/1023
A.75 *Fang Lyon.Fang/451
A.75 Fang Medjo 1994/447
A.85b Bekwil DDL/1000
B.10 Myene Mouguiama 1994/2625
B.11c Galwa DDL/1000
B.20 Metombolo DDL/1000
B.20 Pouvi DDL/1000
B.22a Kele DDL/1000
B.25 Kota Piron 1990/1000
B.25 Mahongwe DDL/1000
B.25 Shake DDL/1000
B.25 Tumbidi DDL/1000
B.30 Gevove Van d. Veen 1994/1467
B.40 Isangu Idiata 199x/2000
B.41 Shira Mouguiama 1994/848
B.42 Sangu DDL/1000
B.43 Punu Blanchon 1994/4228
B.50 Wanzi Mouele 1994/3020
B.74 Buma Burssens 1994/616
B.75 Teke DDL/1000
B.81 Tiene Ellington 1977/580
B.85 Yansi Burssens 1994/822
B.86 Dinga Burssens 1994/830
C.30 Bamwe Samarin 19xx/126
C.32 Bobangi Whitehead 1899/9500
C.34 Sakata Burssens 1994/825
C.36 Lingala Dzokange 1979/7200
D.25 Lega Botne 1994/1507
D.28 Holoholo Coupez 1955/728
D.42 Nande Kavutirwaki 1978/2224
D.53 Shi Polak-Bynon 1978/2391
D.61 Rwanda TLS Wstighlnds/1052
D.62 Rundi TLS Wstighlnds/1052
D.65 Hangaza TLS Wstighlnds/1052
D.66 Ha TLS Wstighlnds/1052
D.67 Vinza TLS Wstighlnds/1052
E.11 Nyoro TLS Rutara/1052
E.12 Tooro TLS Rutara/1052
E.13 Kiga Taylor 1959/12700
E.13 Nkore TLS Rutara/1052
E.14 Rukiga TLS Rutara/1052
E.15 Ganda Snoxall 1967/11243
E.21 Nyambo Rugemalira 1993/1230
E.21 Nyambo TLS Rutara/1052
E.22 Haya TLS Rutara/1052
E.23 Zinza TLS Rutara/1052
E.24 Kerebe TLS Rutara/1052
E.24 Kerewe Odden 1995/1554
E.25 Jita TLS Suguti/1052
E.25 Mkwaya TLS Suguti/1052
E.31b Kizu TLS ENyaza/1052
E.42 Gusii TLS ENyaza/1052
E.43 Kuria_Mago TLS ENyaza/1052
E.43 Kuria_Tari TLS ENyaza/1052
E.44 Zanaki TLS ENyaza/1052
E.45 Nata TLS ENyaza/1052
E.47 Ngoreme TLS ENyaza/1052
E.53 Meru TLS 1975/1079
E.62a Machame TLS 1975/1079
E.62a Mochi.unn TLS 1975/530
E.62b Vunjo TLS 1975/1079
E.64 Keni TLS 1975/530
E.65 Gweno TLS 1975/1079
E.74a Dawida TLS 1975/1079
F.12 Bende TLS 1975/1079
F.21 Sukuma Mann 1966/3253
F.21 Sukuma TLS 1975/1053
F.21a Shashi_siz TLS ENyaza/1052
F.22 Nyamwezi M&S 1992/1905
F.22 Nyamwezi TLS 1975/1053
F.24 Kimbu TLS 1975/1053
F.33 Langi TLS Langi/1052
G.23 Sambaa TLS 1975/1079
G.42 Swahili Rugemalira 1993/1399
G.42 Swahili TLS 1975/1079
G.51 Pogoro TLS 1975/1079
G.52 Ndamba TLS 1975/1079
G.61 Lori Burssens 1994/809
H.10 Yombe Mabiala 1994/2122
H.16 Kongo Swart 1973/25659
H.16f Laadi Jacquot 1982/9300
H.31 Yaka Ruttenberg 1971/3791
K.15 Mbunda Burssens 1994/814
K.21 Lozi Jalla 1937/11200
L.11 Pende Gusimana 1972/8700
L.13 Pindi Burssens 1994/616
M.11 Pimbwe TLS 1975/1079
M.13 Fipa TLS 1975/1079
M.14 Rungu TLS 1975/1079
M.15 Mambwe Halemba 1996/5000
M.15 Mambwe TLS 1975/1079
M.21 Ndali TLS 1975/1079
M.21 Wanda TLS 1975/1079
M.22 Namwanga TLS 1975/1079
M.23 Nyiha TLS 1975/1079
M.24 Malila TLS 1975/1079
M.25 Safwa TLS 1975/1079
M.28 Lambya TLS 1975/1079
M.31 Nyakyusa TLS 1975/1079
M.42 Bemba Mann 1995/7200
M.64 Tonga Turner 1952/1900
N.11 Manda TLS 1975/1079
N.12 Ngoni TLS 1975/1079
N.13 Matengo TLS 1975/1079
N.14 Mpoto TLS 1975/1079
N.31B Cewa S&H1957/5295
P.11 Ndengeleko Ndegeleko/1052
P.12 Rufiji TLS BantuLg/1052
P.13 Matumbi TLS BantuLg/1052
P.14 Ngindo TLS BantuLg/1052
P.15 Mbunga TLS 1975/1079
P.21 Yao BantuLg/1052
P.21 Yao Sanderson 1954/7433
P.22 Mwera TLS 1975/1079
P.23 Makonde TLS BantuLg/1052
P.25 Mabia TLS BantuLg/1052
S.10 Shona Taylor 1967/38000
S.16 Kalanga Mathangw.1994/3765
S.31 Tswana Creissels 1995/6500

(NB: sets of historical reconstructions and a few small data sets are not listed here)

3.2. Building the "etymological backbone"

One of the principal goals of the next period is to carry out, in collaboration with Tervuren (Belgium) and others, one of the most far-reaching and extensive systematic revisions of a lexical reconstruction ever attempted. The Tervuren team, under the direction of Dr. Claire Grégoire, plans to provide an estimated 10,000 *PB or regional reconstructed roots. To date 5,474 regional and *PB reconstructions, covering four of the seven proto-vowels, have been provided to CBOLD and other researchers as part of the "Bantu Linguistic Roots II" (BLR2).
The revision of the reconstructions is itself a demanding subproject designed to provide a unified, consistent, and sufficiently aggregated list of reconstructions. Although it was first hypothesized that the current stock of *PB reconstructions would be adequate, it is now clear, as exemplified below, that this revision is a necessary precursor to creating the etymological backbone called for in the first proposal.
Besides the revised reconstructions, the etymological backbone will consist of a lexically exhaustive reconstruction based on substantial and representative sets of data from daughter languages. A prerequisite of this step is of course a streamlined, complete, and comparable set of modern lexicons. Such a set of lexicons is nearly complete and will be extended during the second funding period for etymologization.

The linking of the reconstructions will be accomplished in two ways: (i) manual "tagging" of modern forms, and (ii) computer-aided detection and verification of correspondences and reconstructions. These methods are outlined below.

3.2.1. Tagging

The simplest method for building the backbone is to assign to each cognate form in the daughter language a unique identifier linking it to an established list of reconstructions. CBOLD is using the consolidated list of reconstructions described above as the basis for tagging, illustrated below using the prototype tagging program. Cognate forms in modern languages (top window) are linked by an "etymological tag" to a general set of reconstructions in the list of consolidated reconstructions (bottom window).

(6) Prototype etymological tagger (using the FoxPro database management program)

As shown by the numerous variant reconstructions for bínà `dance' (numbered 537 in (6) above), the reconciliation of the existing Bantu reconstructions with each other is a required first step to effective tagging. Therefore a new numbering scheme for Bantu reconstructions will be devised and a consensus of Bantu researchers sought. This scheme will bring together variant reconstructions (or "allofams", to use Matisoff's term) under a single numeric rubric, thus avoiding the complications of using a numbering scheme like that of Guthrie, in which reconstructed forms were inserted into the sequence after it was completed (resulting in numbers such as 1988 1/2 and 236c) and in which many obviously related forms were not so marked, e.g.
(7) Variant reconstructions (after Guthrie 1967/71, vol. 2, p.130)

ProtoLg   Reconstruction   Gloss          Source     ID #
CB        kímb             wander about   G1967.CB   1060
CB        kímbid           hurry          G1967.CB   1062
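
The idea of bringing variant reconstructions under a single numeric rubric can be sketched as a mapping from existing source IDs to a consolidated set number. The following Python fragment is an illustration only: the set number 1 is a placeholder and not a proposed CBOLD identifier, and the record layout and function name are assumptions rather than the structure of the prototype tagger.

```python
from collections import defaultdict

# The two reconstructions shown in (7); the source IDs are Guthrie's.
reconstructions = [
    {"form": "kímb", "gloss": "wander about", "source_id": "1060"},
    {"form": "kímbid", "gloss": "hurry", "source_id": "1062"},
]

# Hypothetical consolidated numbering: both variants share one set number.
# The value 1 is a placeholder, not an actual CBOLD identifier.
allofam_set = {"1060": 1, "1062": 1}


def group_by_set(records, mapping):
    """Group reconstructed forms by their consolidated set number."""
    groups = defaultdict(list)
    for record in records:
        groups[mapping[record["source_id"]]].append(record["form"])
    return dict(groups)


groups = group_by_set(reconstructions, allofam_set)
```

A modern form tagged with the consolidated number would then be linked to every variant in the set at once, rather than to a single Guthrie number.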

3.2.2. Alternatives to tagging: computer-aided reconstruction

The tagging method, while easy to implement and use, is extremely laborious. In addition, tagging a form with a number does nothing to ensure that the form is indeed regular with respect to any hypothesis of diachronic development. We will therefore introduce a mechanical assistant to aid in the process of linking reconstructions, correspondences, and modern forms into a close-knit dataset. A prototype of relational software of this type has been demonstrated in the Reconstruction Engine (Lowe and Mazaudon 1989, 1994). This program, however, is merely an evaluator of pre-defined correspondences with regard to modern lexicons; it requires as its input correspondence hypotheses already noted or conjectured by the linguist. As is widely recognized, Guthrie's correspondences will not serve for this function, at least not without extensive corrections and additions. Thus, computer-aided techniques are necessary for "correspondence extraction." These techniques, which have been tested on small datasets, will be refined and made available as part of the research apparatus of the CBOLD database. Some pilot results of the application of these computational techniques to Bantu languages are described in the sections immediately below.

3.2.3. Morpheme alignment and correspondence extraction

Based on the literature, we will be able to provide semi-automatic estimates for at least some of the correspondences. Such an etymological skeleton will be fleshed out by hand as part of ongoing research efforts. This method is explicated by Lowe (1995), and similar methods have been described briefly by other linguists as well (Kay 1964, Veatch 1989). Indeed, the algorithm is simply a computational expression of the most common method of cognate-hunting employed by linguists: the collection of words with similar meanings in different languages and subsequent search for recurring phonological patterns. Once a morphological structural analysis of lexical items in individual languages is accomplished, there are techniques which can be used to assist in creating an "etymological alignment". Consider the following data (words for `corpse' or `body' (French corps) in 8 languages):

(8) Near synonyms for `body'/`corpse' in 8 Bantu languages

omu     tuúmbi    N sg   3-4    corpse                     Kerewe     64900        
óomu    biri             3      corps                      Shi        81588        
omu     biri                    body                       Nyambo     88435        
omu     tûmbi                   corpse                     Nyambo     88572        
omu     biri      n             body; substance;           Kiga       98830        
                                fortune...                                         
omu     tûmbi     n             corpse                     Kiga       99836        
en      túmbi     n             corpse(s) of cow(s)        Kiga       100957       
        -biri     n.     3/4    corps humain               Nande      102620       
m       tembo     n.     3      a corpse...                Chewa      112115       
n-      tembo     n.     3      corpse.                    Yao        114062       
n-      tuvi      n.     3      corpse.                    Yao        116829       
ci-     vidividi  n.     7      trunk of the body; torso.  Yao        120620       
n       tumbú     n      3      a corpse                   Kalanga    124486       

Once the phonological parsing (carried out by the finite state transducer described in §2.2.2 above) has provided a syllabic analysis, it is possible to compare constituents across syllable slots and propose these as correspondences:

(9) Above data (8) sorted after phonological parsing

 C1   V11   V12    C2   V2    T11   T12    T2   Language     
 b     i           r    i                       Kiga         
 b     i           r    i                       Nande        
 b     i           r    i                       Nyambo       
 b     i           r    i                       Shi          
 v     i           d    i                       Yao          
                                                             
 t     e           mb   o                       Chewa        
 t     e           mb   o                       Yao          
 t     u           mb   i      H                Kiga         
 t     u           mb   i      HL               Kiga         
 t     u           mb   i      HL               Nyambo       
 t     u           mb   u                  H    Kalanga      
 t     u     u     mb   i            H          Kerewe       
 t     u           v    i                       Yao          
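
The kind of slot analysis shown in (9) can be approximated with a few lines of Python. This is a simplified stand-in, not the finite state transducer used by the project: it assumes stems of the shape C V(V) C V, a five-vowel orthography, and it discards tone rather than recording it in separate columns. Long vowels are kept together in one V1 slot instead of being split into V11 and V12. The function names are hypothetical.

```python
import unicodedata

VOWELS = set("aeiou")  # assumed five-vowel orthography


def strip_tone(form):
    """Remove tone diacritics by dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", form)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def parse_stem(form):
    """Split a C V(V) C V stem into slots, or return None if it does not fit."""
    s = strip_tone(form)
    pos = 0
    slots = {}
    # Alternate between consonant runs and vowel runs, one run per slot.
    for name, want_vowel in (("C1", False), ("V1", True), ("C2", False), ("V2", True)):
        start = pos
        while pos < len(s) and (s[pos] in VOWELS) == want_vowel:
            pos += 1
        slots[name] = s[start:pos]
    if pos != len(s) or not all(slots.values()):
        return None
    return slots


kerewe = parse_stem("tuúmbi")
chewa = parse_stem("tembo")
```

Longer forms such as Yao vidividi do not fit this template and are rejected by the sketch; they would need the morphological analysis mentioned above before alignment.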

(10) A few correspondences extracted from analyzed syllables with *PB reconstruction
Slot   *PB      Kerewe    Shi     Nyambo    Kiga   Nande   Chewa    Yao   Kalanga   
 C1    *t         t                 t        t       t     t         t       t      
 C1    *b                  b                 b       b               v              
 C2    *mb        mb                mb       mb            mb       mb       mb     
 C2    *d                  r                 r       r               d              
                                                                                    
 V1    *u         uu                u        u             e         e       u      
 V1               i                 i        i       i     o         o       u      
 V2      *e              i                 i       i               i                
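
The step from the sorted data in (9) to correspondence rows like those in (10) amounts to collecting, for each slot of each putative cognate set, the segment found in each language. The sketch below assumes the grouping into two sets (labeled "A" and "B" here purely for illustration) has already been proposed; tones are omitted, the Yao form tuvi is left out, and the function name is hypothetical. It does not assign proto-segments, which remains the linguist's (or the Reconstruction Engine's) task.

```python
from collections import defaultdict

# Parsed stems from (9), tones omitted:
# (set label, language, C1, V1, C2, V2). Set labels are hypothetical.
rows = [
    ("A", "Kiga", "b", "i", "r", "i"),
    ("A", "Nande", "b", "i", "r", "i"),
    ("A", "Nyambo", "b", "i", "r", "i"),
    ("A", "Shi", "b", "i", "r", "i"),
    ("A", "Yao", "v", "i", "d", "i"),
    ("B", "Chewa", "t", "e", "mb", "o"),
    ("B", "Yao", "t", "e", "mb", "o"),
    ("B", "Kiga", "t", "u", "mb", "i"),
    ("B", "Nyambo", "t", "u", "mb", "i"),
    ("B", "Kalanga", "t", "u", "mb", "u"),
    ("B", "Kerewe", "t", "uu", "mb", "i"),
]
SLOTS = ("C1", "V1", "C2", "V2")


def extract_correspondences(parsed_rows):
    """Map (set label, slot) to {language: set of reflexes in that slot}."""
    table = defaultdict(lambda: defaultdict(set))
    for set_label, language, *segments in parsed_rows:
        for slot, segment in zip(SLOTS, segments):
            table[(set_label, slot)][language].add(segment)
    return table


corr = extract_correspondences(rows)
```

A language that shows more than one reflex in the same slot of the same set would be flagged for inspection, since it indicates either a conditioning environment or a mistaken grouping.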

Using the modern forms and these correspondences (and others not shown), it is now a simple matter to generate actual protoforms for the reconstructed sets. Two reconstructions may be generated (or "confirmed") by the Reconstruction Engine on the basis of these correspondences and the cognate forms. One (-bèdè 3/4 body #112) may be found in Guthrie 1967. The other (-túmbi) is not in the list of reconciled reconstructions.

There are a number of details of implementation and heuristics not described here. The method as outlined here has a number of limitations which should be clear to anyone who has attempted to reconstruct a proto-lexicon. The technique has been applied successfully to Tibeto-Burman, and early trials with Bantu languages indicate that it is suitable for application in this family as well.

Related to this important aspect of the project, CBOLD, at the urging of collaborating researchers, will convert the entire list of Guthrie's supporting forms (vols. 3 and 4) used to set up his comparative series (and hence the sound correspondences between *PB and the individual languages). This would make these data, the result of twenty years of research, easily accessible to researchers.

3.3. Continuation of lexicographic work

CBOLD will continue to convert dictionaries and other legacy data into machine-readable form; however, the data-gathering aspect of the project will no longer be emphasized. We have begun discussion with Myles Leitch of the SIL for zone C and have cooperated with the SIL in producing a consolidated gloss list for use in gathering data for languages in the NW zones. We have also made contacts with colleagues in Cameroon about acquiring and producing quality lexicons from Zone A, e.g. Tuki, Ndemli, etc. As before, the choice of dictionaries for conversion will be based on: 1) the importance of the language/area; 2) the availability of a good source; 3) commitment of linguist(s) to work with the materials.

3.4. Continuation of linguistic research based on the CBOLD database

Rather than waiting for the comparative database to be completed, the PI and his graduate students have been engaged in exploiting electronically produced materials for individual languages as they have become available. As indicated in §2.4, we have thus been able to produce a number of carefully documented studies based on CBOLD lexicons. The PI's research program continues to center around the role of morphology in synchronic and diachronic Bantu phonology. The next funding period will thus see further research in this area, examples of which are described in the following subsections.

3.4.1. Vowel harmony

Bantu languages are well known for their vowel harmony systems, which have served as input in recent theoretical work in phonology (e.g. Clements 1991; Archangeli and Pulleyblank 1995). It is striking that a huge area of the Bantu zone shows the following front-back asymmetry in vowel height harmony (VHH):

(11) a. Front height harmony (FHH) : *i > e / { e, o } C __

b. Back height harmony (BHH) : *u > o / o C __
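
The asymmetry in (11) can be made concrete with a small Python sketch that applies the two rules to schematic stems. It assumes left-to-right iterative application (each harmonized vowel can trigger harmony on the next) and treats any run of non-vowel characters as the intervening consonant; both points are simplifying assumptions, as is the function name.

```python
VOWELS = set("aeiou")


def apply_vhh(stem):
    """Apply (11a) *i > e / {e, o} C __ and (11b) *u > o / o C __, left to right."""
    out = []
    prev_vowel = None
    for ch in stem:
        if ch in VOWELS:
            # The rules require an intervening consonant, so skip vowels
            # that immediately follow another vowel.
            after_consonant = not (out and out[-1] in VOWELS)
            if after_consonant:
                if ch == "i" and prev_vowel in ("e", "o"):
                    ch = "e"  # front height harmony
                elif ch == "u" and prev_vowel == "o":
                    ch = "o"  # back height harmony
            prev_vowel = ch
        out.append(ch)
    return "".join(out)


# "C" is used as a schematic consonant placeholder.
examples = {s: apply_vhh(s) for s in ("CeCiC", "CoCiC", "CoCuC", "CeCuC", "CaCiC")}
```

Under these assumptions /u/ lowers after /o/ but not after /e/, while /i/ lowers after both, which is the front-back asymmetry at issue.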

As a first step towards explaining the widespread asymmetry, almost unknown outside Bantu, the PI has created a preliminary vowel harmony database which now includes 130 languages and which he plans to extend to at least twice that number. Already a number of properties clearly cluster in languages which have what can be referred to as "canonical Bantu vowel harmony":

(12) a. VHH has the above asymmetry: i.e. independence of FHH and BHH

b. VHH is not conditioned by /a/ (i.e. it patterns with high vowels)

c. VHH does not apply to /a/

d. VHH does not apply to the final vowel (FV) morpheme

e. VHH does not apply to prefix vowels

The database reveals exceptions to each of the above properties, however, particularly in the NW:

(13) a. Some languages have no VHH, e.g. Punu B.43, Lengola D.12, Enya D.14, N. Binja D.26, Chaga E.62, Suku H.32, Mbala H.41, Ruund L.53.

b. Prefixes harmonize in Londo A.11, Bakweri A.22, Nen A.44, Gunu A.62, Bobangi C.32, Mongo C.61, Tetela C.71, Kela C.72, Ombo C.76, Budu D.35, Logooli E.41, Gusii E.42.

c. Final vowel harmonizes in Bobe, Bia, Pinzi (etc.) B.30, Boma B.82, B.74b, Leke C.14, and in the perfective only, Kongo H.10, Yaka H.31.

d. Asymmetry is not found in zones A-B-C and Mituku D.13, Gusii E.42, Kuria E.43, Beembe H.11, Vili H.16d, Laadi H.16f, Mbundu H.21a + perfective in other Kongo H.10 and Yaka H.31.

e. /a/ conditions VHH in Boma B.74b (B.82), Mbundu H.21a, Mbunda K.15, Kwangali K.33, Kwezo K.35, Dciriku K.62, Pende L.11 (K.52), Mbundu R.11, Kwanyama R.21, Ndongo R.22, and Herero R.31.

f. /a/ undergoes VHH in Londo A.11, Bakweri A.22, Nen A.44, Gunu A.62, Kota B.25, Nzebi B.52, Tiene B.81, Boma B.74b (B.82), Leke C.14, Koyo C.24, Mboshi C.25, Doko C.31, Lingala C.36d, Ngombe C.41, Leku C.60, Bembe H.11, Lwalwa L.00 (cf. also Leitch 1996).

While the properties can vary as indicated above, interestingly, no language was found which has prefixal (i.e. right-to-left) VHH cooccurring with the front-back asymmetry. There are numerous questions for which (convincing) answers need to be sought: (i) Was there vowel harmony in *PB? (ii) Where was asymmetric VHH innovated and why do so many languages retain this typological anomaly? (iii) What is the relation between internal VHH, e.g. in verbs, and final VHH, e.g. in disyllabic nouns? Nande D/J.42, for instance, has the asymmetry cited above, such that we obtain -CeCeC-, -CoCeC-, -CoCoC- vs. -CeCuk-, but shows symmetric VHH of the second vowel of CVCV noun stems:

(14) V + FV in Nande CVCV Noun Stems

 V/FV    i̧      i      e      u̧      u      o      a
  i̧      31     ---    8      4      ---    25     35
  i      ---    25     ---    ---    5      29     28
  e      14     ---    70     4      ---    37     28
  u̧      29     ---    4      7      ---    18     32
  u      ---    15     10     ---    43     16     42
  o      18     ---    16     5      ---    46     28
  a      21     21     12     17     10     38     113

Thus, there are 37 disyllabic noun stems of the shape CeCo and none with the shape *CeCu. (Another asymmetry not found in -CVCVC- forms is the non-occurrence of noun stems of the shape *CiCe vs. the presence of CuCe.) One hypothesis is that internal and final harmonies are independent either in origin or, at least, in their being subjected differentially to the three pressures of vowel assimilation, vowel reduction, and vowel peripheralization (to /i, u, a/ in various positions). Hyman (1996b) represents a progress report on only part of a much more extensive study that is projected for the second funding period during which the database will be significantly expanded and improved.
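
The gaps discussed above can be read off the counts in (14) mechanically. The following Python sketch encodes the table and lists the unattested final vowels for a given first vowel. "I" and "U" are ASCII stand-ins chosen here for the two super-close vowels, "---" cells are treated as zero, and the function names are hypothetical.

```python
# Column order of (14); "I" and "U" stand in for the super-close vowels.
ORDER = ["I", "i", "e", "U", "u", "o", "a"]

# Rows of (14): first vowel -> counts by final vowel; None marks a "---" cell.
COUNTS = {
    "I": [31, None, 8, 4, None, 25, 35],
    "i": [None, 25, None, None, 5, 29, 28],
    "e": [14, None, 70, 4, None, 37, 28],
    "U": [29, None, 4, 7, None, 18, 32],
    "u": [None, 15, 10, None, 43, 16, 42],
    "o": [18, None, 16, 5, None, 46, 28],
    "a": [21, 21, 12, 17, 10, 38, 113],
}


def count(v1, fv):
    """Number of CVCV noun stems with first vowel v1 and final vowel fv."""
    return COUNTS[v1][ORDER.index(fv)] or 0


def unattested(v1):
    """Final vowels that never occur after first vowel v1."""
    return [fv for fv in ORDER if count(v1, fv) == 0]


gaps_after_e = unattested("e")
```

The same lookup confirms the second asymmetry noted above: no stems of the shape CiCe, but ten of the shape CuCe.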

3.4.2. Prefix allomorphy

A number of Bantu languages show prefix (and suffix) allomorphy dependent on the number or nature of syllables in the stem to which they attach. One widespread phenomenon is the development of syllabic nasals from the several Bantu *mu- prefixes (Kadima 1969; Bell 1972). Hyman and Ngunga (1996) have investigated "preconsonantal nasality" in Yao and discovered that the N of "NC" may be contrastively moraic and syllabic (N'), moraic but not syllabic, or non-moraic/non-syllabic. Similarly complex cases of allomorphy exist in the often complementary distribution of the class 5 prefixes *i̧- vs. *di-, the first of which is known for its variability in realization (e.g. causing gemination in Ganda, voicing in Shona, aspiration in Cewa etc.). Other morphemes showing important allomorphy include the augment morpheme (de Blois 1970) and the copula, whose properties were extensively documented in Kalanga by Mathangwane (1995), using the 3700 entry lexicon she produced for CBOLD. Another graduate student, J. Moxley, will produce a comparative database of noun class prefixes. This database will be used in support of her Berkeley dissertation, in which, expanding on Kadima (1969), she will document the phonological shape of the noun class prefixes across Bantu, and map out which noun classes have been lost, where, and (if possible) why. Also during this second funding period the PI will continue research on these and other Bantu allomorphies using the CBOLD database.

3.4.3. Prosodic constraints on stem formation

It is well known that certain languages of the Northwest (i.e. Guthrie zones A, B, C) have severe constraints on the nature and number of syllables that can appear in a stem (root + suffixes), e.g. Basaa A.43a (Hyman 1990), Kukuya B.77a (Paulian 1975; Hyman 1988) etc. Another example of the kind of research the PI will conduct during the second funding period is the role of prosodic (e.g. foot, syllable) stem constraints in determining diachronic developments in these languages. A very unusual and dramatic variant of what occurs is seen in the following Tiene B.81 data from Ellington (1977):

(15) a. lók-a 'vomit' lósek-E 'cause to vomit' PB *-es- 'causative'

b. yók-a 'hear' yólek-E 'listen to' PB *-ed- 'applicative'

c. kab-a 'divide' kalab-a 'be divided' PB *-ad-? 'extensive'

d. sook-E 'put in' solek-E 'take out' PB *-ud- 'reversive'

Students of Bantu will immediately recognize the semantic and phonetic relatedness of the apparently infixed material to the reconstructed *PB derivational suffixes in the right column. Had the present-day Tiene affixal material been simply suffixed to the verb roots, as in almost every other Bantu language, we would have expected the derived forms to be lók-es-, yók-el-, kab-el-, and sok-el-, respectively. In fact, as seen in the following derivations, Ellington proposes exactly these underlying forms and derives the surface realizations by a synchronic consonant metathesis rule:

(16) /lók-es-/   /yók-el-/   /kab-el-/   /sook-el-/   Underlying Reps.
                                         sok-el-      Vowel shortening
     lós-ek-     yól-ek-     kal-eb-     sol-ek-      Metathesis

While other more modern analyses are possible, what is important for us is to understand what is motivating this phenomenon. The PI, using a lexical database of Tiene based on Ellington, discovered (i) that verb stems are maximally trisyllabic, i.e. C1VC2VC3V, and (ii) that in such trisyllabic stems, C2 must be coronal and C3 must be non-coronal. Since the external suffixation of -es- or -el- to a C1VC2- root whose C2 is a non-coronal would violate this pattern, the C2 and the [l] or [s] are metathesized. While similar but less severe constraints are found in the nearby Teke languages, this is the only case known where maintaining the distribution of stem consonants by place of articulation forces a metathesis. (Other processes are at work when the normal derivation would produce coronal C2+C3 or non-coronal C2+C3.) This analysis explains the diachronic metathesis of reconstructed stem forms as well, e.g. PB *-kúkut- > kótoka `gnaw', *-túbud- > tólebE `pierce', *dí̧mid- > dínema `get lost' (the last case involving also the nasalization of *d to [n] as if it came after the following [m]). As is well known, many of the NW Bantu languages also have severe constraints on suffixation, e.g. losing certain derivational suffixes, restricting their cooccurrence etc. The PI hypothesizes that prosodic constraints may also be at work in languages such as Koyo C.24, where one cannot combine a causative and a reciprocal suffix on a CVC- root, e.g. kór-a `tie', kór-is-a `cause to tie', kór-in-a `tie each other', but *kór-is-in-a, *kór-in-is-a `cause to tie e.o./cause e.o. to tie'. However, double suffixation is possible with a CV- root, e.g. -tá-a `see', -tá-s-a `show' (i.e. `make see'), -tá-n-a `see each other', tá-s-an-a `show each other'. If we extend the Tiene facts to say that Koyo derived stems are maximally trisyllabic, then causative + reciprocal cannot occur on CVC- roots because they would produce too many syllables (four) with the final vowel -a.
(Of course some causativized CV roots may be "lexicalized", e.g. dz-és-a `cause to eat' > `feed', but this too is at least in part due to the fact that the forms are short.) With the expansion and refinement of the CBOLD database over this second period, the PI and other researchers will be able to study prosodic constraints on stems in NW Bantu that are hypothesized to play an important role in the gradual dissolution of the extensive Bantu suffix system as one goes from East to West.
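
The Tiene pattern discussed in this section (C2 must be coronal, C3 non-coronal, so a coronal suffix consonant trades places with a non-coronal root-final consonant) can be sketched in Python as follows. The coronal inventory is an assumed subset, the function name is hypothetical, and the sketch deliberately covers only the metathesis case: it raises an error for the coronal+coronal and non-coronal+non-coronal combinations, where, as noted above, other processes are at work.

```python
CORONALS = set("tdsnl")  # assumed subset of Tiene coronal consonants


def suffix_with_metathesis(root, suffix):
    """Attach a VC suffix to a CVC root, metathesizing C2 and the suffix consonant.

    Covers only a non-coronal root-final consonant plus a coronal suffix
    consonant; other combinations are not modeled here.
    """
    root_c2 = root[-1]
    suffix_vowel, suffix_consonant = suffix[0], suffix[1]
    if root_c2 in CORONALS or suffix_consonant not in CORONALS:
        raise ValueError("combination not modeled: other processes apply")
    # The coronal suffix consonant fills C2; the root consonant becomes C3.
    return root[:-1] + suffix_consonant + suffix_vowel + root_c2


# Inputs follow the underlying forms in (16), after vowel shortening of sook-.
derived = [
    suffix_with_metathesis("lók", "es"),
    suffix_with_metathesis("yók", "el"),
    suffix_with_metathesis("kab", "el"),
    suffix_with_metathesis("sok", "el"),
]
```

The outputs correspond to the Metathesis line of (16), before the final vowel is added.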