How much is a language
A Quantification of the Digital Industry for the Spanish Language
José Antonio Millán
(with support from Jesús González Barahona, Miquel Vidal, Javier Candeira and José Luis de Vicente)
In this article I focus on a hidden sector of the industry: the linguistic added value inherent in digital products and services, that is, the share of the digital economy contributed by the language industries.
My thesis is that these hidden linguistic services (that is, those that do not appear as independent products) have great economic importance. The present article attempts to quantify this value in the case of the Spanish language, and then draws a series of economic and political consequences from the analysis of the quantitative data.
Linguistic technologies are basically enablers. In some cases they are communication enablers for general transactions: they act, or rather they will act, as an interface with the user's PC, phone or other device, and through them as an interface with databases, purchasing systems, and so forth. In other cases they are linguistic enablers for operations such as comprehension, translation, abstraction and composition of pieces of written or spoken language.
Linguistic technologies are important because they will allow for the development of innumerable services: spoken interaction with the computer (and other devices) will open the use of the net to all sorts of people and for any purpose (from learning to leisure, from shopping to scholarly pursuits), and they will notably lower the barriers between languages. Moreover, they will play an efficient part in many language-based professional tasks, such as journalism, publicity, tourism information, editing, publishing and other forms of info-mediation.
In order to develop these linguistic technologies we must start by procuring the basic elements, the building blocks that must exist before a software program can serve any linguistic function. These basic elements are neither useful by themselves, nor appropriate to be sold as a consumer product:
On a second level, there are task-oriented modules. These components are usually part of a commercial product, though typically they do not work by themselves, but inside a bigger software program (here noted within parentheses) which serves another, more general, function.
Spell checking (text processors)
Grammar checking (text processors)
Style checking (text processors)
Disambiguation (search engines)
Indexing (search engines)
Hypertext crawling (search engines)
Document abstracting (search engines)
Text input and display for roman and non-roman languages (operating systems)
Text-speech/speech-text conversion (operating systems, dictation programs)
Handwriting recognition (operating systems, mainly in hand-held devices)
Lastly, we have products, which are the software programs that reach the final user (private or corporate).
I have tried to distinguish three types of final products that share basic technologies. What I call a searcher is the descendant of today's indexing search engines such as Altavista, empowered with semantic and morphological intelligence; an agent is a program that is also capable of searching by parameters, of comparing and abstracting the results, and of learning from the user's patterns of behaviour; by management tool I mean more powerful programs, geared towards work with less formal language registers and towards the extraction of subtler meanings.
My economic assessment has two axes. On the one hand, I will calculate the weight in linguistic technology (WLT) of each of the products or services under scrutiny. An example will explain this: a text processor is not only a program that sets one letter after another. It also has some knowledge of typography (word-breaking rules) and linguistics (spell checker, style checker, thesauri, etc.), components that nowadays are not independent products (though they were at one time, and they could become so again).
How can one assess this weight in linguistic technology? By an index, which I am calling WLT. Broadly, the weight depends upon the type of product: a translation program will make greater use of linguistic technologies than an advanced net search engine. In the translation program, however, the weight will never reach 100%, for a translator also uses some technologies that are purely computational in nature; a search engine's strengths will rest more heavily on database sorting algorithms, and so forth.
My proposals on the weight of linguistic technologies are summarized in Table 1:
Table 1. Weight of Linguistic Technologies by product or service, in ascending order
In order to undertake a complete quantification we need to estimate the total amount of use that these final products will have in a given span of time. We make these estimates for a moment in the near future, around the year 2004. The data for each type of program are summarized in Table 3. The sources for these data can be found in an article (in Spanish) that forms the basis and preliminary version of the present document. This article can be found at http://jamillan.com/tesoro.htm
We will try to assess a total number of users for each program (column C) by estimating a percentage of the total number of PC users. For this we assume a figure of 61 million Spanish speakers with Internet access in 2003/2004. The percentage of users of each program with regard to the total number of computer users (B) will be high for general-interest programs (text processors) and will fall as we reach more specialized programs. The greatest risk we are taking is in assuming a high penetration of info-management programs (30% of all PC users).
For the sale prices (D) we have made an estimate based on present prices (converted from Spanish pesetas), plus our own estimate for products not yet on sale (info-management programs). The user totals (C) are multiplied by these prices to arrive at gross sales (E), which, multiplied by the WLT index (A), give us the economic weight of the language industry for each product (F). We will consider a three-year cycle, with only one purchase of each of these products, so these are also the totals for the cycle (G).
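The chain of columns described above (B to C, C times D to E, E times A to F, and F carried over to G) can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the spreadsheet behind Table 3: the 61 million users and the 30% penetration come from the text, while the class name, the price of 100 euros and the WLT value of 0.5 are hypothetical values chosen for the example.

```python
from dataclasses import dataclass

# From the text: Spanish speakers with Internet access in 2003/2004.
TOTAL_USERS = 61_000_000


@dataclass
class Product:
    name: str
    wlt: float          # column A: weight in linguistic technology, between 0 and 1
    penetration: float  # column B: share of all users who buy the product
    price_eur: float    # column D: sale price in euros

    def users(self) -> float:
        """Column C: estimated number of users."""
        return TOTAL_USERS * self.penetration

    def gross_sales(self) -> float:
        """Column E: users (C) multiplied by price (D)."""
        return self.users() * self.price_eur

    def linguistic_value(self) -> float:
        """Column F: gross sales (E) multiplied by the WLT index (A)."""
        return self.gross_sales() * self.wlt

    def cycle_total(self) -> float:
        """Column G: one purchase per three-year cycle, so G equals F."""
        return self.linguistic_value()


# Hypothetical row: only the 30% penetration comes from the text;
# the price and the WLT value are invented for illustration.
info_manager = Product("info-management program", wlt=0.5,
                       penetration=0.30, price_eur=100.0)

print(f"C = {info_manager.users():,.0f} users")
print(f"E = {info_manager.gross_sales():,.0f} euros")
print(f"F = G = {info_manager.cycle_total():,.0f} euros")
```

With these invented inputs the linguistic share comes to half of the gross sales; changing the WLT value shows how sensitive the final figure is to that single index.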
In the case of enterprises and institutions, we will count them within the 61 million general users as regards the purchase of programs, but we will estimate their professional equipment separately. For that estimate we will count 7 million units in the Spanish-speaking world. We have also assigned a percentage of user institutions to each type of program.
Products and services
As we are contemplating a 3-year period, the totals (F) will be multiplied by 3 (we will not consider the annual increase) to fill the Total column for the cycle (G).
Net searchers. Searchers do not charge for their services (nor is it likely that they will in the near future), but they obtain indirect income from advertising, or subcontract their technology to other branded net searchers. We have estimated here an annual turnover.
Intelligent agents. Again, only a forecast of adoption and price can be made.
Distance learning and training. I have not found any quantification of this sector. I have only been able to make an estimate.
Teaching of Spanish as a foreign language. We estimate 43,000,000 students of Spanish worldwide, with a minimum disbursement of 50 dollars in course materials per student per year, which gives us a total of 2,194 million euros. If we add courses taught directly online, I estimate that amount at another 13 million per year.
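The course-materials figure can be checked with a short calculation. The sketch below uses only the three numbers given in the paragraph; the exchange rate is not stated in the text, so it is derived here from the dollar and euro totals rather than assumed.

```python
STUDENTS = 43_000_000        # from the text
SPEND_USD_PER_YEAR = 50      # from the text: minimum per student per year
TOTAL_EUR_MILLION = 2_194    # from the text

# Yearly spending on course materials, in millions of dollars.
total_usd_million = STUDENTS * SPEND_USD_PER_YEAR / 1_000_000

# Conversion rate implied by the two totals (not stated in the text).
implied_eur_per_usd = TOTAL_EUR_MILLION / total_usd_million

print(f"{total_usd_million:,.0f} million dollars per year")      # 2,150
print(f"implied rate: {implied_eur_per_usd:.4f} euros per dollar")  # 1.0205
```

The euro total is therefore consistent with a rate of roughly 1.02 euros per dollar.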
Tourist information. We start from the number of visitors to the Spanish-speaking countries with the largest tourism trade: Spain, Mexico and Argentina. We consider that 10% of the trips are contracted through the net, at 3 euros per transaction.
Professional information. Starting from the data in MSStudy II Spain: 1997/1998, which stated a turnover of 478 million euros for that period, we are assuming 100% growth from that date until 2004 (the horizon of our speculation), and that the rest of the Spanish-speaking world will have the same level of activity as Spain.
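The two assumptions in this paragraph translate into a simple projection. The following is a minimal sketch under one reading of the text, namely that "the same level of activity" means the rest of the Spanish-speaking world adds an amount equal to Spain's; the variable names are illustrative, and the result is the gross turnover, before any WLT weighting is applied.

```python
SPAIN_1997_98_MEUR = 478   # from the text (MSStudy II), in millions of euros
GROWTH_TO_2004 = 1.00      # from the text: 100% growth up to 2004
REGIONS = 2                # assumption: Spain plus an equal amount for the rest

# Spain's projected turnover in 2004, in millions of euros.
spain_2004_meur = SPAIN_1997_98_MEUR * (1 + GROWTH_TO_2004)

# Projected turnover for the whole Spanish-speaking world.
total_2004_meur = spain_2004_meur * REGIONS

print(f"Spain 2004: {spain_2004_meur:,.0f} million euros")
print(f"Spanish-speaking world 2004: {total_2004_meur:,.0f} million euros")
```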
Copyright Industry. According to research by the SGAE (Spanish Authors General Society), "Copyright Industries" (culture and leisure) amount to about 3.5% of GDP. (Our estimate of the Spanish GDP for 1999 is 543,597,418,052 euros. For the whole of South America, we assume the same figure as the Spanish GDP.)
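As a rough check on the order of magnitude, the copyright-industry base can be computed from the figures above. This sketch takes the 3.5% share and the GDP estimate from the text and applies the same share to South America, which is an assumption of the example; it gives the size of the sector itself, not its linguistic share, which would still need to be weighted by a WLT value.

```python
from decimal import Decimal

SPAIN_GDP_1999_EUR = Decimal("543597418052")  # from the text
COPYRIGHT_SHARE = Decimal("0.035")            # from the text: about 3.5% of GDP

# Copyright industries in Spain.
spain_copyright_eur = SPAIN_GDP_1999_EUR * COPYRIGHT_SHARE

# South America is taken to have the same GDP as Spain (from the text);
# applying the same 3.5% share there is an assumption of this sketch.
total_copyright_eur = spain_copyright_eur * 2

print(f"Spain: {spain_copyright_eur:,.0f} euros")
print(f"Spain plus South America: {total_copyright_eur:,.0f} euros")
```

Exact decimal arithmetic is used here only so that the twelve-digit GDP figure is not rounded along the way.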
E-Commerce. To estimate this we need to consider the total amount of electronic transactions within the Spanish-speaking world. This amount will be a part of world electronic commerce. According to Dataquest, commerce on the Net in 2003 will amount to almost 138,232,784,008 euros in the United States, and close to 108,182,178,789 euros in Europe. We can only project an estimate, which we believe to be 7% of the whole.
This quantification effort allows us to provide the following amount (Table 3) for a three-year period: more than 9,200 million euros, that is, more than 3,066 million euros per year. (At least three important fields for language technologies remain outside this estimate: hand-held devices as electronic agents and organizers, voice-powered information retrieval and transactions by phone, and voice-powered home and car appliances. Any attempt to specify their economic weight would be premature, but it is bound to be huge.)
This amount can be considered from different perspectives. One could compare different scenarios, depending on whether the companies providing these goods and services are located in Spanish-speaking countries or not.
Table 2: Differential balance for Spanish-speaking countries: ownership or not of Spanish linguistic technologies.
I'll provide an example: if Spanish-speaking countries don't invest in the development of linguistic technologies, they can undoubtedly keep on developing reference works, tourist databases, etc. They can also keep on creating goods and services that can be marketed through the networks. But if a Spanish or a Mexican company wants to make those products available to its natural public, it will have to pay royalties for the use of linguistic technologies it doesn't own.
If we consider that the real value of owning these technologies lies in not having to pay royalties for them, and in charging for their use, we could say that for the Spanish-speaking business community this really amounts to 6,000,000,000 euros per year (if M=N).
This is equivalent, for instance, to the total annual turnover of the Spanish-language publishing sector (Spain and Latin America). Therefore, we could well say that what is at stake in the Spanish language industries is a business at least as big as their own publishing sector, and of great strategic importance. My hypothesis is that the same will happen with other languages, in proportion to their population and to the weight of their cultural production.
But maybe, in a globalized world with transnational companies, the perspective of the nationality (let alone language) of companies is not really meaningful, although it is useful to give us a concept of scale.
We could then try a different perspective: service to Spanish speakers who are going to use the Net to communicate, learn, buy or research. In the Non-Digital World the skill of knowing how to use a language properly (a useful, necessary skill both for the social environment and for the job market) is transmitted through education, public speech and normative and social sanction, and reinforced with dictionaries and other reference works. In the Digital World these tools are even more important: either because they are going to provide concealed services to users in the networks, or because they are going to be the basis that allows different social agents (companies, teaching institutions) to provide services to third parties.
Under which regime must the production of these linguistic digital tools be administered? They should be shared by several businesses and social agents, public and private, since if they stay under the monopoly of big corporations: 1) their price (and therefore the frequently hidden tax for many linguistic services) will be higher; 2) the possible failure of a monopoly controlling them would bury all the efforts put into their development, and they would have to be created again from scratch.
Which is the best model of development for linguistic tools? Probably free software, promoted by the Free Software Foundation (http://www.fsf.org). To introduce the operational framework of free software is not the aim of this intervention (see, in Spanish, Miquel Vidal, "Cooperación sin mando: una introducción al software libre", http://www.sindominio.net/biblioweb/telematica/softlibre; for a different document in English, http://eu.conecta.it; documents in French at http://www.april.org). Free software is essentially a co-operative system for the creation of software where all the programs produced can be freely re-used by anyone (since their source code is open), on the sole condition that the code produced by that work is also kept free and reusable. If a company wants to use this code in its products it can do so with no problem whatsoever, and its improvements will also be used by third parties in the future. We are speaking of a positive snowball effect.
Is a free software system for the creation of linguistic tools feasible? It has already proved useful for the creation of complex software (like the GNU/Linux operating system, with 20 million users, or the Apache server program, used in 62% of Internet servers). But linguistic software needs not only programming but also morphological, lexicographic and semantic linguistic data (which we called basic elements at the beginning of this text). For Spanish, many of these basic elements (or platforms for their development, such as corpora) already exist in official institutions (universities) or in historical institutions which have received public funding (such as the Academy).
Opening these basic elements completely and without restrictions to anyone wanting to pursue linguistic technology developments could lead many companies in and outside Spanish-speaking countries to develop programs, preventing big monopolistic corporations alone from (also) owning our language.
It is quite plausible that, in the future, computing will not be dominated, as it is now, by really huge applications (called fatware in English and obésiciel in French), but by small, recombinable and modular applications that allow users to use only what they need. Companies and groups of volunteer developers would be able to improve and complete any of their elements, a flexibility especially valuable in a field like linguistics. If the basic elements are open, we can foresee a future where big companies do not own the digital use of our language, and where users are better served.
Language is, in the end, a model or a metaphor for the operational principle of free software. Language codes work only because they are shared, created by all (as the poet Pedro Salinas liked to remind us), open to use and improvement (from academic jargon or creative writing to popular expressions). Successful innovations in language can reach everyone, down to the last of its speakers. Maybe language itself will give us an example of what to do with it in the digital century.
And we may free ourselves of being charged for using our language in the networks.
URL of this document: http://jamillan.com/worth.htm
English, modified version, of "La lengua que era un tesoro":
Thanks to all the people who contributed to the preliminary draft (http://jamillan.com/tesoro.htm): Tomás Baiget, Jesús González Barahona, Rafael Millán, Javier Candeira, José Luis de Vicente, Álvaro del Castillo, Daniel Prado, Lidia Cámara, Rodolfo González, Héctor Piccoli, Chimo Soler and Xosé Castro. Thanks to Esteban González Pons and Política Exterior for the invitation to write that paper. Special thanks for final remarks to Susana Narotzky. Warning: research such as this, which combines many types of data with projections and estimates and deals with a methodologically complex subject, can only be improved. The author will be grateful for all observations and comments. Write to portada at jamillan.com.
Last version: March 7, 2001