Wikipedia for Language Research

Wikipedia is a valuable resource whose usefulness goes beyond the encyclopedia itself. This work proposes using Wikipedia as a large source of text suitable for language research. It explains the procedure followed to turn raw Spanish Wikipedia data into such a text source, covering the format of the source data (wiki syntax), the splitting of written text into individual sentences, and the conversion of acronyms and numbers into the way they are spoken. Some parts of the case described here are specific to the Spanish Wikipedia, but the ideas and some steps of the procedure can be generalised to any language or text source.
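As a rough illustration of the three steps named above (removing wiki syntax, splitting into sentences, and writing numbers as they are spoken), the following Python sketch may help. It is not the project's implementation: the function names, the regular expressions, and the tiny digit table are all assumptions made for this example. It handles only internal links, bold/italic quotes, and single-digit numbers, and it ignores Spanish gender and apocope (for instance "1 libro" should read "un libro", which this sketch does not produce).

```python
import re

# Hypothetical lookup table: only single digits, for illustration.
UNITS = ["cero", "uno", "dos", "tres", "cuatro",
         "cinco", "seis", "siete", "ocho", "nueve"]


def strip_wiki_markup(text):
    """Remove a small subset of wiki syntax (assumed, not exhaustive)."""
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # '''bold''' and ''italic'' markers -> plain text
    text = re.sub(r"'{2,}", "", text)
    return text


def split_sentences(text):
    """Naive split on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def spell_digits(sentence):
    """Replace isolated single digits with their Spanish word."""
    return re.sub(r"\b\d\b", lambda m: UNITS[int(m.group())], sentence)


def to_sentences(raw):
    """Run the three steps in order on a raw wiki-syntax string."""
    plain = strip_wiki_markup(raw)
    return [spell_digits(s) for s in split_sentences(plain)]


if __name__ == "__main__":
    raw = ("'''Madrid''' es la [[capital (política)|capital]] de "
           "[[España]]. Tiene 2 aeropuertos.")
    for sentence in to_sentences(raw):
        print(sentence)
```

A real pipeline would need far more than this: templates, tables, and references in the wiki syntax, abbreviations that end in a period when splitting sentences, and multi-digit numbers, dates, and acronyms when normalising to spoken form.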

A more detailed explanation of the project's overall goal and implementation will be published soon.

