Modeling Language Without Language: A ChatGPT Lesson for Language Research*

 

 

In recent years, the term token has gained widespread use among linguists and non-linguists alike, largely due to the rise of large language models (LLMs) such as ChatGPT. However, this term does not mean what most linguists assume it does. In computational contexts, token refers to a subword unit—often a sequence of characters—that is neither a word nor a morpheme in the linguistic sense. ChatGPT processes only token ID numbers and operates without representations of linguistic units, which is why some computer scientists describe it as modeling language without language. This paper examines the terminological confusion surrounding token and its consequences for linguistic research on LLMs, drawing parallels to the earlier misinterpretation of the term tree in formal syntax. I argue that these misunderstandings stem from long-standing disciplinary divides and a reluctance among linguists to engage directly with computer science (CS) literature. Using mathematical reasoning, I show why language data—being neither ordered nor regular—require vast amounts of input for effective modeling, unlike systems based on predictable (ordered and regular) data. Finally, I reflect on the unresolved theoretical tensions between generative and usage-based linguistics in light of a functioning CS solution to language generation that aligns with neither. The paper calls for terminological clarity and increased cross-disciplinary literacy as necessary steps for future research on language.

*Proofread and edited using ChatGPT.
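The point that tokens are subword units mapped to integer IDs can be illustrated with a minimal sketch. The tiny vocabulary and the greedy longest-match rule below are hypothetical simplifications, not the procedure of any actual LLM; real systems learn vocabularies of tens of thousands of entries with algorithms such as byte-pair encoding.

```python
# Toy illustration: a tokenizer segments text into subword pieces, then maps
# each piece to an integer ID. Note that a piece like "happi" is neither a
# word nor a morpheme -- it is simply a character sequence in the vocabulary.
# The vocabulary below is invented for this example.

VOCAB = {"un": 0, "happi": 1, "ness": 2, "token": 3, "iz": 4, "ation": 5}

def tokenize(word):
    """Greedily match the longest vocabulary entry at each position."""
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try longest match first
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry covers {word[i:]!r}")
    return pieces

pieces = tokenize("unhappiness")
ids = [VOCAB[p] for p in pieces]
print(pieces)  # ['un', 'happi', 'ness'] -- subword pieces, not morphemes
print(ids)     # [0, 1, 2] -- the model operates on these integers only
```

The model itself never sees the strings: only the ID sequence enters the network, which is the sense in which it models language without linguistic units.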

 

Full text at: https://lingbuzz.net/lingbuzz/008998

 

Stela Manova for Gauss:AI