A Computerized Database for Theoretical and Historical Bantu Phonology and Morphology

Funded by the National Science Foundation Grant no. 9x-xxxxxxx

Larry M. Hyman, P.I.
John B. Lowe, Programmer
Department of Linguistics
University of California, Berkeley

1. Project Summary

This three-year project aims to produce in Berkeley a Comparative Bantu On-Line Dictionary (CBOLD) designed to support and enhance the theoretical, descriptive and historical linguistic study of the languages in the important Bantu family. The computerized database will include upwards of 4,000 reconstructed Proto-Bantu roots, several thousand additional reconstructed regional roots, and reflexes of these roots for an initial 50 of the 500+ daughter languages. Published and unpublished dictionaries of selected Bantu languages will be scanned and also entered into the database. Working with colleagues and students from the United States, France, Belgium, the Netherlands, Cameroon, Tanzania and other countries, organized into the Bantu Working Group (BWG), the project will (i) unify computational efforts that have thus far been separately explored; (ii) share all linguistic materials; and (iii) establish separate and collaborative efforts to push the fields of theoretical and historical Bantu linguistics forward. Coordinating with other members of the BWG, the PI and co-workers will contribute in two major ways. First, computationally, Berkeley will: (i) set up the database for all of the researchers to input; (ii) establish a unified format for computational lexicographic work in the Bantu languages; (iii) input extensive (annotated) dictionaries of at least nine languages (Duala, Basaá, Fang, Luganda, Runyankore, Kinande, Chichewa, Ciyao, and Shona-Ikalanga); and (iv) input lexicons of 4,000+ words for another 15 languages. Second, linguistically, the PI will: (i) participate in evaluating "Narrow" and "Wide" Proto-Bantu reconstructions; (ii) conduct specific studies in the historical phonology of the Bantu consonant, vowel and tonal systems; and (iii) produce theoretical work based on the phonology and morphology of Bantu. 
As outlined in the proposal, the database will be designed specifically to facilitate detailed lexical studies of the different Bantu languages. Examples of studies made possible by CBOLD are enumerated there. For instance, the Berkeley project will continue the PI's research on the Bantu verb stem, providing rich documentation of the generalizations underlying (frozen and live) suffix ordering, the possible cyclic interleaving of morphology and phonology, the restructurings that take place over time in different Bantu languages, and so forth. Other members of the international team will provide both overlapping and complementary expertise as well as reinforcement in the research effort: The Lyon team under the direction of Dr. Jean-Marie Hombert will consult on computerization and provide regional reconstructions for Guthrie zone B (Gabon and environs). The Tervuren team under the direction of Dr. Claire Grégoire will provide up to 10,000 Proto-Bantu or regional reconstructed roots, as well as 7,000 lexical entries from zone A (Cameroon). Dr. Thilo Schadeberg of Leiden University will coordinate the efforts of the BWG to evaluate these roots and, where necessary, gather additional information from the international community. All participating linguists, their students, and other colleagues and contacts will participate in the preparation of extensive Bantu lexicons to be entered into the database. This includes a number of colleagues and former students in Africa and elsewhere. Initial attention of the project will be on languages from the three corners of the Bantu zone: NW Bantu, Lacustrine Bantu, and South-Central Bantu. The project will subsequently fill in representative languages from other Narrow Bantu zones and then turn to Wide Bantu and Bantoid (e.g. inputting languages documented by the Grassfields Bantu Working Group with N.S.F. support in the late 1970s). While the database will doubtless have other uses (e.g. for archeologists and historians to trace the history and movements of peoples), emphasis in Berkeley will be on producing a database for the purpose of general linguistic study. The Bantu and other African languages have figured prominently in much of the recent work in phonological and morphological theory. In fact, CBOLD has become a necessity to allow the PI and co-workers to expand and deepen their theoretical and historical research on general issues that can be resolved most efficiently--if not only--by means of an extensive computerized database containing detailed information from a wide range of Bantu languages. The database and associated tools will be made available on as many platforms as can be reasonably supported. When completed, the project will make CBOLD available to other researchers both in hard and soft copy. Given the simultaneous methodological and general linguistic concerns, the project will be able to produce a steady flow of published research from its initiation throughout the three-year period.

2. The Proposal

2.1. Objectives and significance of the project

The present proposal requests the funds required to establish a Comparative Bantu On-Line Dictionary (CBOLD) consisting of 4,000+ Proto-Bantu roots as well as reflexes of these and additional regional roots for an initial 50 of the 500+ daughter languages. As outlined below, the project is international in scope, with researchers from the United States, France, Belgium, the Netherlands and several African countries participating in the effort. During the research period, the Berkeley team will establish the database as the centerpiece of the newly formed Bantu Working Group (BWG) (see Appendix I), pooling information and ideas from a wide range of sources and scholars. The PI and graduate students will be responsible for inputting data from approximately 20 of the 50 languages that are initially targeted, starting with those for which the largest corpora are available, e.g. those for which dictionaries can be scanned or which are spoken natively and/or being studied by project researchers. The project will aim to do the following:

(i) Establish an extensive database for all researchers interested in historical, comparative, descriptive and theoretical Bantu linguistics. This will include annotated reconstructed Proto-Bantu roots, which now number over 4,000 (and, according to Tervuren, up to 10,000 when regional reconstructions are added). It will also include reflexes of these reconstructions from at least 50 of the daughter languages, as well as more extensive lexicographic information from two sources: (a) Appropriate dictionaries will be scanned and then converted to the database--e.g. Snoxall's 1967 Luganda-English Dictionary, which the PI is presently having scanned on U.C. Berkeley's Kurzweil with permission from O.U.P. (see Appendix II). (b) Graduate student researchers, several of whom are native speakers of specific Bantu languages, will contribute lexical data. In many cases both a scanned dictionary and a native speaker/linguist will be available. For example, Sanderson (1957) will be scanned, with tone marks and vowel lengths being added by Mr. Armindo Ngunga, a native speaker of Ciyao from Mozambique who worked with the PI in Berkeley in Summer 1992 and is entering the Berkeley Linguistics graduate program in Fall 1993.

(ii) Provide a unified format for computational lexicographic work in the Bantu languages. The BWG will establish the list of glosses and PB reconstructions that will constitute the core entries of the database, after which individual researchers will be able to receive print-outs or diskettes with these glosses. Based on our previous experience (e.g. the Grassfields Bantu Working Group of 1977-1981, co-organized with Jan Voorhoeve by the PI with support from N.S.F.), researchers, language pedagogues, and others will quite generally want to have access--and contribute--to the build-up of information. The desired result is that all students of Bantu languages will seek to collect the BWG's common core of 4000+ lexemes in each Bantu language. Detailed recommendations will also be provided for how such wordlists (and ultimately, Bantu dictionaries) should be formatted--so as to allow our program to optimally convert them to the centralized database.

(iii) Sponsor (and expand) significant theoretical and historical research on Comparative Bantu. Bastin (1978:168) suggests that in the U.S. Bantu linguistics has been subordinated to (generative) theoretical linguistics, while in Europe theoretical problems are "subordonnés à la description des faits et à la connaissance des langues". While there are exceptions in both directions, this project aims to bring together all parties and all concerns in order to solve outstanding problems of potential significance both to Bantu and to general linguistics. The focus of the N.S.F. project is both to prepare a database and to address the new and outstanding theoretical and historical issues in Bantu linguistics. As further elaborated below, the questions that will be investigated through the database are of both a synchronic and a diachronic nature. The PI's work has been both theoretical and historical at the same time, with extensive lexical documentation from one language informing the analysis of others. One hope for this project is that it will provide a major stimulus for the field of Bantu linguistics within this country, with the database enticing theoretical Bantuists (and/or their students) to address phonological and morphological questions from a comparative or historical point of view.

(iv) Serve as the point of departure for work on the relation between Bantu and its closest relatives. The PB reconstructions pertain to what is known as "Narrow Bantu", with most of the well-known daughter languages belonging to the Kongo branch--only one of the eight Bantu subgroups proposed by Heine et al (1977). The 50+ languages that will be initially inputted will include languages from all parts of the Bantu zone, with special care taken that representative languages from within Guthrie's (1967/71) zones A, B and C (so-called "Northwest Bantu") are included. Besides the PI's expertise in Cameroonian Bantu, we are fortunate that the French and Belgian teams (from Lyon and Tervuren, respectively) have been working extensively on Narrow Bantu languages from Cameroon, Gabon and Congo. In the third year, the project will expand to "Wide Bantu" to include the PI's and others' data from Grassfields Bantu and Bantoid languages from the Nigeria-Cameroon border area. As seen in Figure 1, Blench and Williamson have proposed a binary branching internal structure to Bantoid, with the enormous Bantu family deeply embedded at the lower right. Once the PB reconstructions and their reflexes are entered, it should be possible to go up this tree, node by node, e.g. updating and comparing Hyman's (1979) Proto-Grassfields reconstructions and their reflexes with Narrow Bantu (see Appendix III). In this sense, the project represents a natural continuation of the Grassfields Bantu Working Group and provides a springboard for future comparative work on the relation of Bantu to Bantoid and, ultimately, of Bantoid to the other branches of Benue-Congo.

Because of the quality of people involved, their good working relations, and their individual and collective expertise and experience, the project can be expected to be of major consequence in the field of African linguistics. Given also the PI's general and Bantu linguistic concerns, as well as the pilot studies that have been done or are in progress, the Berkeley project will be able to produce a steady flow of published research from its initiation throughout the three-year period.

2.2. Sampling of issues for which a database is needed

The comparative Bantu database has been necessitated in part as an outgrowth of the PI's theoretical, descriptive and historical research on Bantu languages. In order to test various hypotheses, it became increasingly necessary to spend inordinate amounts of time repeatedly going through the same dictionaries of Luganda, Kinande, Cibemba, Ciyao, Chichewa, Basaá etc. Some of the results that evolved from these efforts are Hyman (1991, 1992) and Katamba and Hyman (1991). It is clear that a computerized database that includes such scanned dictionaries, as well as extensive lexicons of a sizeable number of Bantu languages, would greatly facilitate this type of issue-oriented research. The following subsections indicate the nature and significance of some of the unresolved issues or comparative projects that will be undertaken by the PI and co-workers, if funding is available:

2.2.1. Proto-Bantu consonants

Up through the works of Meinhof (1910/1932), Meeussen (1967, 1969), Guthrie (1967/71) and others, a rather simple Proto-Bantu consonant system has been assumed with only slight variations, e.g. *p, *t, *c, *k, *b, *d, *j, *g, *m, *n, *ɲ, and possibly *ŋ (Meeussen 1967). In various writings, Stewart (1973, 1989) proposed that both the voiceless and voiced stop series of PB were further differentiated into a fortis vs. lenis series, i.e. that there were four series of proto stops. The basic argument for this position was that in numerous NW Bantu languages, the proto consonants have "double reflexes", e.g. in Tunen, PB *t corresponds to [t] in some roots, but to [l] in others, e.g. *-túg- `draw water' > -tók- vs. *-túd- `forge' > -lún-. Stewart's proposal would be to reconstruct the two PB roots as *-túg- and *-t'úd'- (where C' = lenis). A number of researchers have tested Stewart's fortis/lenis distinction against the double reflexes found in various Bantu and non-Bantu languages (Hyman 1979; Gerhardt 1980, 1986; Bancel 1988; Bachmann 1989; Botne 1992a). Most recently Bachmann (1989), Blanchon (1991) and Janssens (1993) have contested the validity of the fortis/lenis hypothesis in Tunen, Mpongwe, Ewondo and other NW Bantu languages (but cf. Botne 1992b, who argues for a glottalic/non-glottalic opposition in Proto-Bantu). Showing that PB *-táŋ- `count' has two reflexes in Ewondo, -láŋ `count' and -táŋ `pay a debt', Janssens (1993:261) argues that both are direct reflexes. PB *t generally weakens to [l] in Ewondo, e.g. -láŋ. There is a [t] in -táŋ, however, because this verb is denominal (from class 9 /n-táŋ/ `account', with the nasal noun prefix later deleting before the voiceless stop that it preserves from weakening to [l]). He presents possible, but less conclusive, alternatives to fortis/lenis to explain double reflexes in Bubi, Tunen and Bafia. It is quite fair to say that the interpretation of double reflexes in Bantu languages from all zones is unclear. It is possible that there were three, not four series of consonants, or that there were more consonantal oppositions only in particular places of articulation (e.g. *j vs. *y?). The database will allow a systematic study of correspondences between PB and consonants in the daughter languages.
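A systematic study of correspondences amounts, at its simplest, to tabulating the reflexes attested for each proto consonant in a given daughter language. The following is a minimal Python sketch of such a tally, not a description of the software the project will build. The function names are hypothetical, the root-initial consonant is naively taken to be the first letter of the form, and the only data are the two Tunen pairs cited above.

```python
from collections import Counter, defaultdict

def initial_consonant(form):
    """Return the first letter of a form, skipping '*' and '-' (a simplification)."""
    for ch in form:
        if ch.isalpha():
            return ch
    return ""

def tally_reflexes(pairs):
    """Map each proto initial consonant to a count of its attested reflexes."""
    table = defaultdict(Counter)
    for proto, reflex in pairs:
        table[initial_consonant(proto)][initial_consonant(reflex)] += 1
    return table

# The two Tunen pairs cited in the text: (PB form, Tunen reflex).
tunen = [("*-túg-", "-tók-"), ("*-túd-", "-lún-")]
table = tally_reflexes(tunen)
print(dict(table["t"]))
```

With a full lexicon loaded, a table of this kind would show at a glance which proto consonants have double reflexes and how the competing reflexes are distributed.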

2.2.2. Palatalization

Proto-Bantu is generally assumed to have had the seven vowel system *i̧, *i, *e, *u̧, *u, *o and *a (where *i̧ and *u̧ represent so-called "superclosed" or [+ATR] high vowels), as realized in present-day Kinande. (An alternative hypothesis is to recognize the seven vowels as *i, *e, *ɛ, *u, *o, *ɔ, *a.) Many Bantu languages palatalize *k and *g before *i and *e. (Velars usually spirantize (e.g. to s/z) before *i̧ in the same languages--see §2.2.3.) The extent of velar palatalization can be seen in Map 1, which shows the realization of the noun class 7 prefix *ki- throughout the Bantu area. While such sequences frequently yield the alveopalatal affricates transcribed c and j, the exact environment of palatalization depends on the language. Hyman (1992) [Appendix V] documented in great detail that palatalization of velars is found only in morpheme-initial position in Cibemba (also Chichewa, (ci-)Shona, (ci-)Tonga): *-kéka > (mu)-ceka `mat' vs. *tooke `banana' > (ci)-tooki `wild banana tree'. Palatalization is also not found across a morpheme boundary, e.g. -fik- `arrive', -fik-il- `arrive for/at'. This distribution accounts for the striking bifurcation of *k > k, c in the following examples from Tonga:

*-júki `bee, honey' > (n)-zuki `bee'

> (bu)-ci `honey' (< bu-uki < bu-yuki < *bu-juki)

With the class 9 nasal prefix n-, the *j is preserved as z, and the second syllable of the morpheme remains [ki]. With the class 14 prefix bu-, *j weakens to y and drops out. Since Tonga lost the historical vowel length contrast, bu-uki shortens to bu-ki. The syllable *ki, which was the second syllable in the reconstructed root, is now at the beginning of the morpheme--and undergoes palatalization. Finding the restriction of palatalization to morpheme-initial position puzzling--and not wanting to say that a morpheme boundary could condition a sound change in this way--the PI and a graduate student undertook and presented a pilot study of velar palatalization in Bantu (Hyman and Moxley 1993). It was found that in Chimwiini (also Kanyok, Luba-Kasayi), palatalization is more restricted, essentially limited to occurring in prefixes: class 7 *ki- > ci- vs. *-kída > (m)-kila `tail'. On the other hand, there are languages like Ciyao and Citumbuka that palatalize in all environments before *i and *e. To explain these variations, Hyman and Moxley arrived at the following hypothesis: Following Nurse and Hinnebusch (1993), palatalization is initiated by the change of kyV and gyV to cV and jV, as in Standard Swahili. Since almost all cases of kyV (and gyV) result from the gliding of a ki- (or gi-) prefix before a vowel prefix or vowel-initial root, the change was analogized to all instances of the prefixes ki- (and gi-), as in Chimwiini. (There are no prefixes with the vowel *e.) Being prefix-initial is necessarily morpheme-initial. The second analogy was thus to palatalize all morpheme-initial sequences of velar + front vowel, as in Cibemba. The last step was to generalize the palatalization process to non-morpheme-initial position, as in Ciyao. Evidence for this view comes from the roots *-ke- `to dawn' and *-joki- `to burn, roast'. Since these exceptionally end in a vowel (most roots end in a C), and since all verbs must end in an inflectional final vowel (FV) morpheme, e.g. -a, these will produce -ky-a and -joky-a. It turns out that these roots always palatalize if anything does (cf. Chimwiini -


a `to dawn'). Hyman and Moxley intend to use the database to conduct a more exhaustive and lexically detailed study of the rise of palatalization in Bantu and test their hypothesis.

2.2.3. "Superclosed" spirantization

Another study that the database will make possible concerns the spirantization of consonants in the context of the proto superclosed vowels *i̧ and *u̧. As documented by Bourquin (1955) and others, Bantu languages that merge *i̧/*i and *u̧/*u frequently show mutations of a preceding consonant before the historically superclosed member. As seen in Table 1 (above), there are a number of possible changes, including non-mutation. In Hyman (1975, 1977) it was proposed that this phonologized "obstruentization" derives from the noise created by the release of an obstruent into the relatively constricted tense high vowel (cf. Ohala 1978; Jaeger 1978). A chain of events may take place as follows: *pi̧ > pʰi̧ > {pfi̧ / psi̧} > {fi̧ / si̧}. With a voiceless consonant, one obtains aspiration before the superclosed vowels, followed by affrication either to the place of the consonant (pf) or to the "place" of the vowel (ps). Deaffrication then produces [f] or [s]. The same happens with sequences such as *ku̧, which becomes [fu] through the stage [kfu] (as attested, for example, in Fang). Table 1 represents the diachronic reflexes within morphemes. As with velar palatalization, some languages have a different realization before suffixes that begin with *i̧. In Luganda (also Haya, Shi and certain other nearby languages), *pi̧ and *bi̧ spirantize only within a morpheme: *-bí̧n- > -zín- `dance' and *-bí̧mb- > -zímb- `swell', but *-kúb-i̧- `beat+caus' > -kúb-i̧- `cause to beat' and *-du̧b-i̧ `to fish+agentive' > (mu)-vub-i̧ `fisherman'. Coronals and velars, on the other hand, show the same mutated reflexes both within and across morpheme boundaries. In order to investigate the synchronic and diachronic implications, a graduate student undertook and presented a pilot study of these mutations (Zoll 1993). The study has turned up some surprising results (e.g. the realization of all oral consonants as alveopalatal affricates in Haya (and Runyambo) seen in Table 1). Zoll is developing a feature-geometric account of productive alternations found before the causative suffix *-i̧-, the deverbal agentive suffix *-i̧, and the perfective ending *-i̧d-e. To account for the wide range of facts found in the 80+ languages for which there is data, she has extended Clements' (1991) vowel height feature [open] to consonantal stricture. Zoll notes that Bantu languages frequently block the fusion of labiality and coronality (cf. Kingston 1991). Thus, as seen in the table, Cibemba *pi̧, *bi̧ > fi (not si) and Ikalanga *tu̧, *du̧ > tʰu, du (not fu, vu). The database will allow a full and systematic search of the reflexes of PB *Ci̧ and *Cu̧, revealing the possible range of changes as well as explaining discrepancies between morpheme-internal vs. morpheme-external realizations.

2.2.4. Verb stem morphology

As already indicated, the database will facilitate the study of the morphology of the Bantu verb stem (= verb root + suffixes). Besides non-productive suffixes of limited distribution (e.g. *-am-, *-at-, *-uk- etc.), it will allow investigation of the ordering and mutual exclusivity of lexicalized and compositional uses of productive suffixes: applicative *-id-, short causative *-i̧-, long causative *-ic-i̧-, reciprocal *-an-, passive *-u-. Languages for which dictionaries can be scanned typically provide extensive information about suffixed verb forms. As an Appendix to his Berkeley dissertation, J. Rugemalira has computerized all of the suffix possibilities for 500+ verbs in Runyambo. Other graduate students have expressed interest in doing similar detailed work on their languages (Ciyao, Ikalanga). The PI has published overviews and detailed descriptions of the verb stem morphology of different Bantu languages which identify the richness of the data as well as the theoretical and historical significance of this line of research (Hyman 1993; Hyman 1991; Hyman and Katamba 1991; Hyman and Mchombo 1992). Although noting the limitations of "mirror principle" effects in Bantu (Baker 1988, Alsina 1990), the PI's work provides phonological evidence for morphological cyclicity in the Bantu verb stem. To facilitate this research program, morphological information, even if non-productive or frozen, will be tagged on entries in the database. This also will allow careful examination of morpheme-internal coarticulation constraints.

2.2.5. Other issues

Because of space limitations, other projects that will take place can be mentioned only briefly. Hubbard (1993a,b) has been working extensively on the realization of NC clusters in Bantu: their timing and their effect on preceding vowels and following consonants. The detailed study of nasality in Luganda roots and of Meinhof's Law in Katamba and Hyman (1991) needs to be done for other languages. Other areas that require close scrutiny within and across Bantu languages are Dahl's Law (a voicing dissimilation rule that has different properties in different languages--cf. Nurse and Davy 1979), vowel-initial roots (whose zero consonant frequently alternates with [y] and other consonants), and so-called "ghost consonants" that arise when proto consonants fall out--e.g. the loss of [l] in Kamba (Hinnebusch 1974) or in the Swahili verb stem recently studied by Moxley (1993); the loss of *g in Cibemba when not preceded by a nasal consonant, etc. Finally, there is the issue of imbrication (Bastin 1983), the "infixation" of the *-i̧d- morph of the *-i̧d-e perfective ending, as when Cibemba -kúngub- `gather' forms the perfective -kúngwiib-e instead of *-kúngub-il-e (Hyman, in press).

2.3. Comparative Bantu On-Line Dictionary

The Comparative Bantu On-Line Dictionary (CBOLD) will provide a consistent and unified source of data on Bantu languages along with tools for analysing and manipulating that data. Outlined here are 1) the database design, 2) the procedures for gathering and entering the data into the database, 3) the tools for accessing the database, 4) methods for sharing the database, and 5) a suggested computer environment for development.

2.3.1. Database design and data acquisition strategies: overview

The CBOLD consists of a number of parallel bilingual dictionaries with an overlaid semantic and phonological analysis. Functioning as the "backbone" of the database are the linked dictionaries of nine "core" languages. The lemmata in these dictionaries will be "aligned" etymologically as illustrated in Figure 2. First, the "core" dictionaries will be scanned and proofread, as also indicated in (1) in Figure 3. As seen in Map 2, Guthrie (1967/71) divided up the Bantu languages into geographic zones. Each zone is identified by a letter of the alphabet and is further divided up into "decades" (e.g. A.10, S.20 etc.). Each Bantu language has a specific Guthrie letter+number code. The dictionaries that will be scanned come from languages that are representative of three regions of "Narrow" Bantu which will receive special attention in the initial phases of the project: (i) Northwest Bantu: Duala (A.24), Basaá (A.43a), Fang (A.75); (ii) Lacustrine Bantu: Luganda (J.15), Runyambo-Haya (J.21-22), Kinande (J.42);[1] (iii) Southcentral Bantu: Chichewa (N.31b), Ciyao (P.21), Shona-Ikalanga (S.16). Software to parse these texts into "fields" which can be loaded into the database will be developed and is indicated in (2) in Figure 3. During or after the loading process, the "backbone" files will be searched and links created between the lemmata in the different dictionaries (3). Data for another 50+ Bantu languages will be added by researchers in Berkeley, Lyon, Tervuren and Leiden. In many cases native speakers will refer to the lemmata in the dictionary of the language most closely related to their own while entering their data (e.g. J. Rugemalira will adapt Runyankore to Runyambo and J. Mathangwane will adapt standard Shona to Ikalanga). These texts will be loaded into the database in the same way as the scanned texts. The CBOLD database will thus be a repository of both published and unpublished material in a consistent and easily accessible format.

In terms of the data structures, the lexicon of each language will be represented as a distinct file; each lemma in the dictionary will be linked to the analytic "backbone" of the database, which is depicted for convenience in Figure 2 as being the reconstructions proposed by Guthrie (1967/71). Tools for import/export, maintenance, and querying will be developed. Periodically, a "snapshot" of the database will be prepared for distribution to interested parties along with the latest version of the tools developed.
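The record layout just described might be sketched in Python as follows. This is only an illustration under stated assumptions: the class names, field names and the identifier "R1" are hypothetical, and the helper handles only simple Guthrie codes such as A.43a (a range such as J.21-22 is rejected). The sample pair is the Luganda reflex of `dance' cited earlier in the proposal.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Reconstruction:
    """One record in the etymological "backbone" (field names are hypothetical)."""
    recon_id: str
    form: str
    gloss: str

@dataclass
class Lemma:
    """One dictionary entry in a single language's lexicon file."""
    language: str
    guthrie: str                    # Guthrie letter+number code, e.g. "J.15"
    form: str
    gloss: str
    recon_id: Optional[str] = None  # link to the backbone; None if not yet linked

def parse_guthrie(code):
    """Split a simple code such as "A.43a" into (zone, decade, number, variety)."""
    match = re.fullmatch(r"([A-Z])\.(\d+)([a-z]?)", code)
    if match is None:
        raise ValueError(f"unrecognized Guthrie code: {code!r}")
    zone, digits, variety = match.groups()
    number = int(digits)
    return zone, (number // 10) * 10, number, variety

# Backbone keyed by identifier; the lemma points at it through recon_id.
backbone = {"R1": Reconstruction("R1", "*-bí̧n-", "dance")}
lemma = Lemma("Luganda", "J.15", "-zín-", "dance", recon_id="R1")

print(parse_guthrie("A.43a"))
print(backbone[lemma.recon_id].form, ">", lemma.form)
```

Keeping the link as an identifier, and not as a copy of the reconstruction, means that a revised reconstruction need only be changed in the backbone file.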

2.3.2. Data entry and preprocessing

The results of scanning one of the "core" dictionaries (the Luganda of Snoxall 1967) are exemplified in Figure 4. This version of the dictionary is merely a machine-readable version of the original printed work. A more complete sample of the results of scanning this dictionary and the methodology employed may be found in Appendix II.

Software to parse these texts into fields which can be loaded into the database will be developed. The results of this process are depicted in Figure 5. This example is displayed in Lexware format. In fact, the exact format used will be decided by the international group collecting and using the data and the availability of standards for the representation of lexical data.[2]
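As a rough indication of what such parsing involves, the Python sketch below splits a proofread line into named fields. The line layout (headword, abbreviated part of speech, gloss) and the field names are assumptions made for illustration. They are not the layout of Snoxall's dictionary, and the output is a plain dictionary, not Lexware format. Lines that do not fit the pattern are set aside for manual review.

```python
import re

# Hypothetical line layout: headword, abbreviated part of speech, gloss.
ENTRY = re.compile(r"^(?P<headword>\S+)\s+(?P<pos>[a-z]+\.)\s+(?P<gloss>.+)$")

def parse_entry(line):
    """Return a dict of fields, or None when the line does not fit the layout."""
    match = ENTRY.match(line.strip())
    return match.groupdict() if match else None

lines = [
    "-lás- v. shoot (arrow)",
    "-pákas- v. work for hire",
    "??? illegible",            # stands in for a badly scanned line
]
parsed = [parse_entry(line) for line in lines]
needs_review = [line for line, fields in zip(lines, parsed) if fields is None]

print(parsed[0])
print(needs_review)
```

A real dictionary will need one such pattern per entry type, so the review list is likely to be the main guide to where the patterns must be refined.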

As each dictionary is loaded, links between the individual lemmata and a set of phonological, semantic, and etymological standards will be created (only the "etymological backbone" is outlined here--cf. Figure 2). These links will make it possible to bring together related words from the various lexical sources. New records in the backbone files may be created as the semantic and phonological scope of the database expands.
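Once the links exist, bringing related words together is a matter of grouping lemmata by the backbone record they point to. A minimal Python sketch follows, assuming each lemma is a dictionary whose "recon" key (a hypothetical name) holds the linked reconstruction, or None when no link has been made yet. The Tonga forms are those cited in the discussion of palatalization; the Luganda entry is simply left unlinked in this toy data.

```python
from collections import defaultdict

lemmata = [
    {"language": "Tonga", "form": "(n)-zuki", "gloss": "bee", "recon": "*-júki"},
    {"language": "Tonga", "form": "(bu)-ci", "gloss": "honey", "recon": "*-júki"},
    {"language": "Luganda", "form": "-lás-", "gloss": "shoot (arrow)", "recon": None},
]

def cognate_sets(entries):
    """Group linked lemmata under their reconstruction; skip unlinked ones."""
    sets = defaultdict(list)
    for entry in entries:
        if entry["recon"] is not None:
            sets[entry["recon"]].append((entry["language"], entry["form"]))
    return dict(sets)

for recon, members in cognate_sets(lemmata).items():
    print(recon, members)
```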

2.3.3. Tools for database access

Software tools for bringing linked entries together will be developed; so, for example, it will be a simple matter to bring together all the supporting forms for a reconstruction into a single cognate set. Other tools for more general searching on the basis of morphological and phonological structure will be created. As an aside, the database and the tools for using it will be developed in an environment which will allow researchers access on most of the popular computing platforms, at least Apple Macintosh and IBM compatibles. Several types of tools will be developed:

* Publishing and document preparation tools--programs will be developed which will create high-quality, publishable versions of the data in the database. For example, one "product" which will be realized from the database will be an up-to-date, revised, and expanded etymological dictionary of Proto-Bantu. Tervuren estimates the number of Bantu lexical reconstructions at approx. 10,000: "mais y compris des reconstructions régionales non protobantoues dont nous ne pouvons pas à l'heure actuelle déterminer en quelles proportions elles participent à l'ensemble" (Dr. Claire Grégoire, pers.comm.). Other types of documents which might be useful are synonym lists, phonological inventories with supporting forms, thesauruses, and multi-lingual dictionaries. In general, these tools will operate on the database as a whole and produce sizable documents. Thus, the users of CBOLD will be able to support a certain amount of "demand publishing", providing interim and final versions of their analyses in a timely fashion.

* Database query and retrieval tools--programs will be developed to support ad hoc queries against the database. Researchers will be able to perform multi-faceted searches. Prototypical software for searching the database based on phonological criteria, for example, has already been developed: colleagues in Lyon have created a Filemaker Pro(TM) version of Guthrie's list of common Bantu forms (Hombert et al 1991). This application has been converted to Hypercard by John Lowe, and is illustrated in Figure 6. It implements a "query-by-example" strategy: the user fills in those fields for which matching forms are to be sought. In this example, the program retrieves all reconstructions having an initial /po/ and a high tone in the first syllable. This rudimentary search engine can answer a few of the types of questions proposed as research topics earlier, but to address the majority of the problems, a much more sophisticated system will have to be developed which can address the database in terms of morphological and phonological features.[3] The sketch of data structures and query functions below outlines how these capabilities may be implemented.

* Database maintenance tools, including programs to import and export data into other useful data formats, backup and data validation tools, and other housekeeping software as needed. To the extent possible, existing software will be used to reduce the amount of development required and improve functionality.
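The "query-by-example" strategy described for the Hypercard prototype can be illustrated with a minimal Python sketch. It is not the prototype's actual code: the field names (c1, v1, tone1, rest) are assumptions made for illustration, and the records are invented placeholders, not actual reconstructions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reconstruction:
    # Hypothetical field layout; the prototype's real fields are not given here.
    c1: str     # initial consonant
    v1: str     # first vowel
    tone1: str  # tone of the first syllable: "H" or "L"
    rest: str   # remainder of the form
    gloss: str


def query_by_example(records, **template):
    """Return the records that match every field filled in on the template.

    Fields left out of the template are unconstrained.
    """
    return [
        r for r in records
        if all(getattr(r, name) == value for name, value in template.items())
    ]


# Invented placeholder records, NOT attested reconstructions.
RECORDS = [
    Reconstruction("p", "o", "H", "n", "placeholder 1"),
    Reconstruction("p", "o", "L", "d", "placeholder 2"),
    Reconstruction("p", "a", "H", "k", "placeholder 3"),
    Reconstruction("t", "o", "H", "m", "placeholder 4"),
]

# "Initial /po/ and a high tone in the first syllable."
hits = query_by_example(RECORDS, c1="p", v1="o", tone1="H")
```

A template of this kind can only test whole fields for equality, which is why queries stated in terms of features or of relations between slots call for the richer representation sketched in section 2.3.4.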

2.3.4. Sketch of data structures and query functions

The database will need to answer queries of considerable complexity. Such queries may refer to specific segments or to broad classes of segments, and to adjacency or boundary conditions defined by morphological or phonological criteria. They may also combine elements from different "levels" of representation. Consider some specific queries which are of interest:

* List forms in which two succeeding consonants share phonological feature x (where x might be `voicing' or `labiality' or `+stop').

* List forms which have a Cf = /*d/, i.e. in which the final consonant was (historically) /d/, whether or not it is realized in modern forms. An example is Swahili -ingi- `enter', whose latent [l] < *d surfaces when the applicative suffix is added (-ingil-i-) (but see Moxley 1993 for a different interpretation).

* List forms which have a Cf = /s/, i.e. in which the final consonant is /s/. This query would match Luganda -lás- `shoot (arrow)' where /s/ is in position C2 and -pákas- `work for hire', where it is in C3.

Supporting queries of these types requires (1) that an analysis of lexemes into morphological and phonological constituents be available, which in turn implies (2) that (at least within Bantu) a universal morpheme and syllable structure be defined which determines the analysis of each lexeme into constituents and their membership in phonotactic and morphotactic classes (i.e. syllable slots and morphological categories); and (3) that each phonological constituent be assigned not only to morpho-/phonotactic classes but to phonological classes as well, i.e. analyzed into features as well as into hierarchical subconstituents or nodes, as in feature geometry. (For a discussion of practical considerations in performing semi-automatic feature analysis, see Mazaudon and Lowe 1991 and Lowe and Mazaudon 1993.)
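The three requirements can be made concrete with a small Python sketch, offered as an illustration under stated assumptions rather than as the project's design. It assumes that every segment is a single character (no digraphs), that slots are simply numbered in order of appearance, and that "succeeding consonants" means consonants adjacent on the consonant tier. The function names and the feature table layout are hypothetical; the two Luganda forms are those cited above.

```python
import unicodedata

VOWELS = set("aeiouáéíóú")

# Standard articulatory values for the few segments used in this sketch.
FEATURES = {
    "p": {"manner": "stop", "place": "labial", "voicing": "voiceless"},
    "k": {"manner": "stop", "place": "velar", "voicing": "voiceless"},
    "s": {"manner": "fricative", "place": "alveolar", "voicing": "voiceless"},
    "l": {"manner": "lateral", "place": "alveolar", "voicing": "voiced"},
}


def parse_slots(form):
    """Requirements (1)-(2): label each segment C1, C2, ... or V1, V2, ..."""
    # NFC keeps a tone-marked vowel such as "á" as a single character.
    segments = unicodedata.normalize("NFC", form).strip("-")
    slots, c, v = [], 0, 0
    for seg in segments:
        if seg in VOWELS:
            v += 1
            slots.append((f"V{v}", seg))
        else:
            c += 1
            slots.append((f"C{c}", seg))
    return slots


def consonants(form):
    return [seg for name, seg in parse_slots(form) if name.startswith("C")]


def metaslot(form, name):
    """Resolve Ci (initial), Cf (final) or Cx (any consonant) to segments."""
    cons = consonants(form)
    if name == "Ci":
        return cons[:1]
    if name == "Cf":
        return cons[-1:]
    if name == "Cx":
        return cons
    raise ValueError(f"unknown metaslot: {name}")


def share_feature(form, feature):
    """Requirement (3): do two succeeding consonants agree in `feature`?"""
    cons = consonants(form)
    return any(
        FEATURES[a][feature] == FEATURES[b][feature]
        for a, b in zip(cons, cons[1:])
    )


FORMS = ["-lás-", "-pákas-"]  # Luganda `shoot (arrow)', `work for hire'
cf_is_s = [f for f in FORMS if metaslot(f, "Cf") == ["s"]]
```

Here cf_is_s contains both forms, although /s/ occupies C2 in one and C3 in the other, while share_feature("-pákas-", "voicing") holds because /p/ and /k/ are both voiceless.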

As an initial proposal for a computable representation and tentative algorithms to realize these goals,[4] each lexeme (i.e. headword of a lemma) in the database could be parsed and the tokens classified at each of the levels of representation required for retrieval. Initially, it will be sufficient to propose a morphological parsing, a syllabic parsing, and a feature analysis for each form in the database. Each of these levels or tiers would be searchable as a separate indexed entity within the database. "Superclasses" of these entities would be created, so that C could stand for C1, C2, ... Cn; S could stand for any syllable, and so on. Consider the representation in Figure 7. The legend by row is as follows: "L600a" refers to the original lexical entry in the dictionary, and identifies a full lexeme. "M226" means morpheme #226 (i.e. each morpheme would be individually identified and distinguished). These would be subcategorized by part of speech, semantic field, noun or verb class, and so on in other parts of the database record. "S120" and "S213" are numbered syllables. Like morphemes, each syllable in each language would be uniquely identified. H and L represent High and Low tone (on a separate tier, as it were). C1, V11, etc. are syllable slot identifiers. Additionally, elements like "Ci" (= initial consonant), "Cf" (= final consonant), and "Cx" (= any consonant) would be defined as "metaslots." The tier descriptions are as follows:

1 `Morphology'

2 `Prosodic' 2a Syllable, 2b Tone

3 `Segmental' 3a Syllable Slot, 3b Grapheme

4 `Feature' 4a Manner, 4b Place, 4c Laryngeal, 4d Height

Such a representation would make it possible to answer queries such as:

"Cases of reduplication" Mi = Mj for all i, for all j = i + 1

"Initial /d/ followed immediately by /i/" Ci = /d/ ^ V11 = /i/

"lexemes containing dental stops" Tier 4a = [+stop] ^ Tier 4b = [+dental]

"Ci is a labial and Ci+1 is a velar" Tier 4b of Ci = [labial] ^ Tier 4b of Ci+1 = [velar], for all i
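The four queries can be written as predicates over a tiered record, as in the following Python sketch. It rests on several assumptions: the class and function names are hypothetical, the identifiers ("L-a", "M-a", etc.) and the slot name "V21" are invented, and the quantifiers are read existentially (a lexeme matches if some adjacent pair satisfies the condition). The second lexeme is an invented placeholder with arbitrary feature values, not an attested form.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Slot:
    name: str          # tier 3a: syllable slot, e.g. "C1", "V11"
    grapheme: str      # tier 3b
    manner: str = ""   # tier 4a
    place: str = ""    # tier 4b


@dataclass
class Lexeme:
    ident: str             # lexeme identifier (invented here)
    morphemes: List[str]   # tier 1: morpheme identifiers, in order
    slots: List[Slot]      # tiers 3 and 4, in order

    def consonants(self):
        return [s for s in self.slots if s.name.startswith("C")]


def is_reduplicated(lex):
    """Mi = Mj with j = i + 1, for some i."""
    m = lex.morphemes
    return any(a == b for a, b in zip(m, m[1:]))


def initial_d_then_i(lex):
    """Ci = /d/ ^ V11 = /i/."""
    s = lex.slots
    return (len(s) >= 2
            and s[0].name == "C1" and s[0].grapheme == "d"
            and s[1].name == "V11" and s[1].grapheme == "i")


def has_dental_stop(lex):
    """Tier 4a = [+stop] ^ Tier 4b = [+dental] on the same slot."""
    return any(s.manner == "stop" and s.place == "dental" for s in lex.slots)


def labial_then_velar(lex):
    """Tier 4b of Ci = [labial] ^ Tier 4b of Ci+1 = [velar], for some i."""
    c = lex.consonants()
    return any(a.place == "labial" and b.place == "velar"
               for a, b in zip(c, c[1:]))


# Luganda -pákas- `work for hire', cited earlier in this section.
PAKAS = Lexeme("L-a", ["M-a"], [
    Slot("C1", "p", "stop", "labial"), Slot("V11", "á"),
    Slot("C2", "k", "stop", "velar"), Slot("V21", "a"),
    Slot("C3", "s", "fricative", "alveolar"),
])

# Invented placeholder; the feature values are arbitrary.
PLACEHOLDER = Lexeme("L-b", ["M-b", "M-b"], [
    Slot("C1", "d", "stop", "dental"), Slot("V11", "i"),
    Slot("C2", "d", "stop", "dental"), Slot("V21", "i"),
])
```

In a full implementation each tier would presumably be indexed separately, so that predicates like these become index lookups rather than scans over every lexeme.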

2.3.5. Data sharing and standardization

As previously mentioned, the database and associated tools would be available on as many platforms as can reasonably be supported. Choosing a development environment and database system which is already supported on different computer systems (such as FoxBase or Oracle) would make this task easier. At any rate, the choice of design and initial implementation should be flexible enough to allow for migration to a new environment if conditions warrant. Software capabilities are evolving rapidly and while it would be unfortunate to spend too much time in the chase, it might be of great benefit to be able to take advantage of some advances. Standardization would be achieved in two ways:

* Standard representations--these would allow researchers to share data regardless of the computer system they are using. Standards would encompass character sets, transcription, tagging and delimiting of fields, relationships between records and files of different types and so on. Such standards are becoming available now (such as SGML/TEI and Unicode).

* Multi-platform tools--providing programs which would operate on different computer systems is more difficult. As noted above, the "normal" method to use the database would be to use the multi-platform software suite developed for this project. However, it may eventually prove useful to provide some tools in source form in a well-known computer language such as C, which would allow computer-literate researchers to build special-purpose programs for manipulating the database.

2.4. Timetable

The requested starting date for NSF support is January 1, 1994. The following represents the preliminary timetable for the project.

2.4.1. Prior to funding

In a sense, several important steps have already been taken to get the project into gear. Extensive e-mail and FAX exchanges began in Fall 1992 between the PI, Jean-Marie Hombert (Lyon), Claire Grégoire (Tervuren) and Thilo Schadeberg (Leiden), which continue to the present. In February 1993, Schadeberg volunteered to take the responsibility for sending out an announcement of the proposed formation of a Bantu Working Group (Appendix I) to potentially interested Bantuists worldwide. Having received permission from Tervuren (Appendix IV), the PI scanned, proofread, and realphabetized Meeussen's (1969/80) Bantu Lexical Reconstructions, which is now prepared in MS Word for current use and future loading into the database. On March 16, a meeting was held in Brussels and attended by Hyman, Grégoire, Schadeberg, Yvonne Bastin and Baudouin Janssens (the last two also of Tervuren) to discuss the idea of a BWG further (Hombert could not attend, but was contacted the same evening by phone). Since all the principals (and roughly 150 other people) already had plans to attend the 23rd Colloquium on African Languages and Linguistics in Leiden on Aug. 30-Sept. 1, it was agreed that a two-day workshop involving a slightly larger group of historically oriented Bantuists would be held in Brussels on Sept. 2-3, 1993. Following this meeting, the PI will make a brief trip to the Laboratoire de Phonétique et Linguistique Africaine (LAPHOLIA) in Lyon to visit their facilities, consult with their programmer, Joel Brogniart, and exchange ideas on the proposed CBOLD with Hombert, Jean Blanchon, Gilbert Puech, François Nsuka and other Bantuist researchers in Lyon.

In Fall 1993, the PI's annual Bantu seminar will focus on phonological and morphological issues deriving from the Bantu lexicon. In preparation for the project, J. Moxley will complete the proofreading and formatting of the scanned Snoxall (1967) Luganda-English dictionary, and three additional dictionaries will be scanned: (i) Taylor's (1959) Runyankore-Rukiga-English Dictionary for J. Rugemalira to "convert" to Runyambo; (ii) Sanderson's (1957) Dictionary of the Yao Language for research by A. Ngunga; (iii) Scott & Hetherwick's (1957) Dictionary of the Nyanja [Chichewa] Language for future research by J. Moxley (in consultation with Prof. Sam Mchombo and Dr. Al Mtenje, who will be a visiting Fulbright scholar at Berkeley in Fall 1993).

2.4.2. During funded period

During the first year, initial emphasis will be on both preparation for and actual development of the database, as well as on those languages for which extensive materials are readily available. From January to March 1994, work will continue on the above scanned dictionaries, as well as on dictionaries of Duala, Basaa, Kinande, Cibemba and Shona. The team will carefully evaluate these and other sources to be used in the database. It is hoped that Drs. Hombert, Grégoire, and Schadeberg (and others from their institutions) will be able to attend the Special Session of the Berkeley Linguistics Society meeting, Feb. 18, 1994, entitled "Historical issues in African linguistics". The project will be announced and introduced at this meeting. By this time Tervuren will have provided a corrected outline map of the Bantu languages (vs. Map 1, which is inaccurate in a few details), which can be scanned and converted for use in a drawing program to map out isoglosses in color and in black and white. The project anticipates a number of regional maps as well since, as Map 1 shows, some of the languages are too crowded together to allow careful identification.

From April to June 1994, the database design and specification will be worked out and a prototype developed. This will include decisions on how to represent various "complications" that arise in the daughter languages, e.g. "latent" consonants (similar to French liaison consonants), "empty C slots" (similar to French h-aspiré), floating tones, etc. From July to October 1994, the scanning and proofreading will be completed, and there will be a revised prototype and/or a "Mark I" version of CBOLD. By this period a tentative lexicon will have been established through consultations with the other members of the team as well as the broader Bantuist community, whose input on such matters will be sought. By Dec. 1994 the loading of dictionaries into the database will have been completed and the first distribution of the database and tools prepared.

In Winter 1995, we will plan an evaluation by a "user community" of the first version of the database. This will be followed in Spring 1995 by a revision of the database and tools, incorporating new data from other languages. The entry of such data will have been going on since the beginning of the year and will continue throughout the project, ending with at least 20 languages having been entered by the American team, and another 30 by other American, European and African scholars.

If not before, the third year of the project will include additional Bantu languages of Cameroon (Bakweri (A.22), Tunen (A.44), Bafia (A.53), Tuki (A.64)) and representative Wide Bantu/Bantoid languages from the Nigeria-Cameroon borderland: Grassfields Bantu, Ekoid Bantu, Tivoid, Beboid etc. Some of this work will be coordinated with linguists at the University of Yaounde (Cameroon), several of whom were participants in the Grassfields Bantu Working Group in the late 1970s (see §6 Biographical Sketches).

During the entire period the PI and co-workers will be producing progress reports for distribution among interested parties. As mentioned earlier, theoretical and historical research will go hand in hand with the development of the database. Projects will be organized in part in terms of what can best be accomplished at what stage of the work, with the PI and co-workers expected to produce a steady published output on comparative Bantu. In addition, all concerned hope that CBOLD will encourage a wide range of linguists (including Bantuists who may not have had a comparative or historical orientation in their work) to use the database to engage with the significant issues that it will allow to be studied in a systematic way. The goal is an overall increase in high-quality activity in comparative Bantu linguistics--especially in the United States.

2.5. Relation to Berkeley and the training of graduate students

It is particularly fitting that such a major lexical database project take place at the University of California at Berkeley. From the pioneering work of Prof. William S.Y. Wang's Chinese dialect Dictionary on Computer (D.O.C.), through Prof. James Matisoff's Macintosh-based Sino-Tibetan Etymological Dictionary and Thesaurus (S.T.E.D.T.), Berkeley has been a center for computational work on large lexical databases. Several other colleagues have invested heavily in lexical and other large database projects (e.g. Prof. Charles Fillmore on English, Prof. Gary Holland on Sanskrit, Prof. Richard Rhodes on Ojibwe). In addition, the restructured and refurbished Phonology Laboratory (PL), under the direction of Dr. Steven Greenberg and housing a team of researchers including Prof. John Ohala, programmer/analyst Michael Ward, postdocs and graduate students, provides an intellectual setting--as well as additional technical and material support--for CBOLD, which will be located within the PL area. Concerning Bantu, since the PI and Prof. Sam Mchombo joined the Department of Linguistics at U.C. Berkeley, the department has become a serious center for the study of African (especially Bantu) linguistics. As summarized in the short biographical sketches (§6), seven graduate students will participate in the project: Kathleen Hubbard, Joyce Mathangwane (from Botswana), Jeri Moxley, Armindo Ngunga (from Mozambique), David Peterson, Josephat Rugemalira (from Tanzania) and Cheryl Zoll. Each of the seven has already published, or is about to publish, in the area of Bantu linguistics. In Fall 1993, the PI will offer a Bantu seminar which during the pre-grant period will serve as a focus for the work that the team intends to approach from a running start on January 1. The above graduate students also have exclusive responsibility for organizing a day-long Special Session entitled "Historical issues in African linguistics" during the Berkeley Linguistics Society meeting on February 18, 1994.

It is vitally important to the PI that the project contribute to the training of graduate students, and that they be especially encouraged to develop as independent researchers in the fields of general and Bantu linguistics. In addition to the Berkeley students, the PI is frequently contacted by students working on Bantu at other universities and provides them with verbal and written comments on their work. It is expected that some of these students will also want to participate in one way or another--and this will be encouraged as part of the general openness of the project.

2.6. Current and pending support

There is no other current or pending support available for this project.

[1]The Tervuren School has replaced Guthrie's D or E designation of Lacustrine languages by a J (e.g. Luganda J.15).

[2]Lexware was developed by Bob Hsu of the University of Hawaii. The Text Encoding Initiative, a national effort to develop standards for the encoding of textual data, includes a proposal for the encoding of dictionary data. It may ultimately prove most efficacious to use this representation since high-powered tools for manipulating documents in this format will soon be available.

[3]Tervuren also independently began a modest computerization project using Quatrième Dimension whose completion date is uncertain and whose retrieval capabilities are even more restricted (cf. second letter in Appendix V).

[4]The representation suggested here is based loosely on theoretical concepts familiar from autosegmental theory and particularly feature geometry (Clements 1985, Sagey 1986, McCarthy 1990 etc.), and which have been implemented (again, in a loose sense) in the Delta phonological programming language (by Sue Hertz, Cornell University).