A Response to Noam Chomsky on Machine Learning and Knowledge
On machine learning, the nature of knowledge, associations, predictions, and explanations.
1. Noam Chomsky on Machine Learning
Recent machine learning advances have reignited the criticism that these systems do not possess genuine knowledge. Noam Chomsky has voiced doubts of this kind before (“Noam Chomsky on Where Artificial Intelligence Went Wrong”). More recently, he co-authored an essay (“Noam Chomsky: The False Promise of ChatGPT”) in which he predicted that machine learning “will degrade our science and debase our ethics by incorporating into our technology a fundamentally flawed conception of language and knowledge”.
Chomsky’s main concern is with the statistical nature of machine learning methods. Machine learning systems operate without explicit programming, relying instead on identifying regularities in data and making predictions based on them. This, Chomsky contends, is not how the human mind works. The human mind is not “a lumbering statistical engine for pattern matching”. It does not seek “to infer brute correlations among data points” but “to create explanations”.
This skepticism follows naturally from Chomsky’s linguistic and philosophical commitments. Referring to Plato’s Meno, where Socrates leads an uneducated slave to exhibit knowledge of geometric principles with only a few relevant prompts, Chomsky asks how a child is able to acquire so much knowledge so rapidly and effortlessly despite limited evidence. Like Immanuel Kant, he recognizes that the selectivity applied to the shower of data to which we are exposed must originate from within us. Like Plato and Kant, he answers the question of how knowledge is possible by appealing to something intrinsic to human nature.
Chomsky views the human mind as consisting of a set of inborn, interacting faculties. Each faculty operates according to a distinct, domain-specific set of rules, producing various mental phenomena.
Language is one such faculty. Chomsky explains the language faculty through a framework of principles and parameters. Principles are invariant rules common to all natural languages, while parameters are options set upon exposure to linguistic data that account for differences across languages. To use his analogy: although growth would not occur without eating, it is not the food but the child’s inner nature that determines how the growth will occur. Similarly, it is not the linguistic data but the child’s biological endowment that determines how a language is acquired.
Opposed to this nativist conception of knowledge and language is the kind of empiricism that denies innate rules for knowledge. Where Chomsky sees the mind as directed by inner principles that selectively use data as part of a fixed program of development, the empiricist account sees it as forming associations based on data. In the absence of rigid inner procedures that could impose themselves, the empiricist view assigns experience a far more determining role. Chomsky criticizes this approach for assuming about the mind what is not assumed for the other systems of the body.
Chomsky argues that an approach to inquiry that focuses solely on associations from observed data would fail to discover the underlying principles from which certain developments, and only those developments, follow. He sees scientific inquiry as the pursuit of such principles and emphasizes the use of abstraction and idealization, as with thought experiments (“Noam Chomsky Speaks on What ChatGPT Is Really Good For”):
More generally in the sciences, for millennia, conclusions have been reached by experiments–often thought experiments–each a radical abstraction from phenomena. Experiments are theory-driven, seeking to discard the innumerable irrelevant factors that enter into observed phenomena . . . the basic distinction goes back to Aristotle’s distinction between possession of knowledge and use of knowledge. The former is the central object of study.
It is evident that in machine learning, where associations dominate rather than rules, Chomsky sees echoes of the empiricist approach. He believes it is unlikely that the use of statistical techniques to find regularities in data and make predictions would yield explanatory insights. A notion of success involving predictions but not explanatory insights, he remarks, has little precedent in the history of science. He refers to such predictions as pseudoscience.
2. What is Claimed
In his comments, Chomsky makes two distinct claims that it would be helpful to differentiate.
The first concerns the workings of the human mind compared to machine learning systems. “The human mind,” Chomsky says, “is a surprisingly efficient and even elegant system that operates with small amounts of information.” Referring to the innate language faculty “that limits the languages we can learn to those with a certain kind of almost mathematical elegance”, he points to a lack of similar constraints in machine learning systems. The same applies to knowledge, where “humans are limited in the kinds of explanations we can rationally conjecture”, while “machine learning systems can learn both that the earth is flat and that the earth is round”.
It is unsurprising that the human mind—a part of the human organism shaped by an evolutionary process over countless generations—differs from machine learning systems, which are non-biological creations developed with methods and data supplied by humans. If Chomsky’s only claim were that the workings of the human mind differ from those of machine learning systems, as would follow from the differences in their underlying architectures, it would be a rather trivial observation.
However, Chomsky also appears to be making a more general claim about the nature of knowledge. He identifies the deepest flaw of machine learning programs as:
the absence of the most critical capacity of any intelligence: to say not only what is the case, what was the case and what will be the case—that’s description and prediction—but also what is not the case and what could and could not be the case. Those are the ingredients of explanation, the mark of true intelligence . . . The crux of machine learning is description and prediction; it does not posit any causal mechanisms or physical laws.
To define epistemic success thus, one must rely on a universal conception of what constitutes knowledge—a conception that could be applied across animals, humans, and other systems, free from human metaphysical biases. Once we have such a conception, we are able to separate what is essential to knowledge from what is merely instrumental. Chomsky, however, seems to be using a particular way of acquiring knowledge, with its unique set of constraints, to conclude that a path that differs from this would, for that reason, fail to yield genuine knowledge.
To make this more concrete, consider an analogy with locomotion. We observe that there are different ways through which it can occur. For humans, a natural method, walking, is enabled by a particular architecture, with a particular set of constraints. A wheeled vehicle, in contrast, has a different architecture, with a different set of constraints. If we compared the two, we would conclude that they differ in such and such ways, that the operation of one is properly described as walking while that of the other as rolling, and that they are therefore not identical. Yet if we were aware of the idea of locomotion, if we could conceive it independently of the particular means through which it occurs, then we would be right to use that term to describe the actions of both. We could raise legitimate questions about which is better on a specific metric, such as energy efficiency, but there would be no question as to which instantiates genuine locomotion and which merely simulates locomotion.
What should concern us is the nature of knowledge itself. What are its core features that allow us to identify it independently of the means through which it is instantiated? I submit that, when properly understood, associations and predictions are sufficient to capture the essential features of knowledge. Everything that aids in achieving them—whether biological endowments, algorithms, epistemic techniques, or other means—holds only instrumental value.
3. Associations and Predictions
Associations, Platonic forms, causal relations, and physical laws are instances of what we can call “propositions”. A proposition emerges from accounting for, or linking, data.
Suppose we traverse an area and develop an internal arrangement of its physical features. We can say that we linked some material available to us, material which increased after we had traversed the area. If we externalized this arrangement using pencil and paper, it would take the form of a map, where points are linked in a particular way. What we did internally can be viewed as analogous to linking points on the paper.
As Chomsky believes about cognitive operations, the process of developing propositions is independent of individual awareness.
The process is also independent of the ability to externalize a proposition. An internal arrangement exists within us before we translate it into lines on paper and would exist even if we did not externalize it. The fact that an animal lacks the capacity to create diagrams or utter sentences like humans does not prevent it from discovering propositions.
All that aids in accounting for data contributes to one’s capacity for knowledge. Contributions such as memory can be described as internal, while inventions like the mechanical calculator and techniques such as the scientific method can be described as external. Since it is possible to experience memory deterioration with age or be exposed to new inventions and techniques through cultural diffusion, it follows that one’s capacity for knowledge is subject to variation.
What is the material that is accounted for when one discovers a proposition? If we define data as whatever it is that gets accounted for when a proposition is discovered, what properties do we find it to have? By focusing on these properties, we can gain a better understanding of how different types of propositions are possible, which would help answer some of Chomsky’s concerns.
In examining the properties of the material that forms propositions, we find parallels with the material that constitutes physical objects. Just as a physical object is made of a certain quantity of material that can increase or decrease, so too can the amount of data available to us fluctuate over time. Data, like the materials of an object, comes from diverse sources. The divisibility of a proposition allows us to appreciate the granularity of its constituent data, much like dividing a physical object into smaller parts reveals the granularity of the material of which it is made. This granularity enables propositions to be about anything discoverable in data, just as different combinations of particles yield distinct objects. Let us consider each of these properties in turn.
As illustrated in the example of a proposition discovered after traversing an area, data can become available to us after previously being unavailable. New data is also constantly being added, even while we sleep. Thus, data that was once unavailable becomes accessible at each moment, contributing to the totality of data available to us.
In traversing an area, we encounter one source of data: direct observations. Other sources could include pictures of the area captured by a mechanical device that we see or descriptions of the area that we read. The sources of data, therefore, can be diverse. Some sources may provide more value by allowing us to account for more data through less. For example, instead of physically traversing the area, we might rely on a picture of it to discover the proposition that we externalized as a map. To the extent that the picture possesses representative value, the features in our proposition would align with any actual visit we make. The distinction in the representative value of data is something we discover through data itself. A subsequent ability to discriminate among different sources of data may emerge as a consequence of this discovery.
Philosophers have long drawn an analogy between the way particles combine to form physical objects and how simple ideas obtained from experience combine to form complex ideas. We can think of this in terms of divisibility: any part of a physical object can be distinguished as a constituent, which can, in turn, have its own divisible parts. Similarly, a proposition contains other propositions presupposed within it. By focusing on them, we can discover the granularity of the material of which the proposition is made. For example, consider the proposition “whenever one ball strikes another, the other ball moves”. Within it, we would find presupposed such propositions as “a ball is a round body”, “to move is to go from one place to another”, and even “a thing cannot both be and not be at the same time and in the same respect”. When we account for certain data and form a proposition, we account for all the data accounted for by those propositions that we presuppose within it.
This granularity of data makes possible the different kinds of propositions that we discover. Some of these pertain to what we can observe, including both what we have observed and what we have yet to observe. If we take the above proposition and ask through what data we arrived at it, we might say that we had a series of experiences in which we directly observed one ball striking another and the subsequent movement of the other ball. Based on these experiences, we formed an association between one ball striking another and the second ball moving. While this is one possible path to that conclusion, it is certainly not the only one. Suppose we have never directly observed one ball striking another but have observed objects colliding, seen pictures of balls, and read what occurs when one ball strikes another. In that case, we could form the proposition “whenever one ball strikes another, the other ball moves” without having directly observed such an event even once.
Beyond propositions about what we can observe, we can also identify propositions about what we cannot observe. Suppose we hear a sound coming from a tape recorder. We cannot observe the original source of the sound, as it is from the past and not present before us. However, by synthesizing the data becoming available through the tape recorder and existing data, we can infer, for example, that the original source of the sound is a woman, that she has a quiet disposition, that she felt lonely the night before she spoke, and so forth. It is possible that we could have reached these same judgments if analogous data had become available through her speaking while standing right in front of us.
We never find, in the ingredients or products of our imagination, anything that is not present, in some form, within the data available to us. This fact helps us understand how propositions can also be about what does not exist in the world. In nature we may observe lions and men, but a lion-man—a figure with the head of a lion and body of a man—exists only in our imagination. But it is clear that this figure could have never occurred to us if we had not observed certain things that provided the elements we used to create it.
While propositions are discovered based on the data available to us, the data a proposition accounts for can also include data yet to become available. When we account for certain data currently available, they become different instances of a single state of affairs. When we treat some data yet to become available as instances of a state of affairs, we form predictions. If these predictions agree with the data when they eventually become available, it means the proposition that allowed those predictions accounted for that data.
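As a minimal sketch of this point (all names and numbers here are invented for illustration, not drawn from the essay), a proposition can be modeled as an association fit to the data currently available, with predictions then checked against data that becomes available later:

```python
# Illustrative sketch: a "proposition" as an association linking available data.
# All data below is invented for illustration.

observed = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # (input, outcome) pairs

# "Discover" a proposition by linking the data: outcome ≈ k * input.
k = sum(y for _, y in observed) / sum(x for x, _ in observed)

def predict(x):
    """Treat yet-unobserved cases as instances of the same state of affairs."""
    return k * x

# New data becomes available; if the predictions agree with it,
# the proposition accounted for that data as well.
new_data = [(4.0, 8.1), (5.0, 10.3)]
errors = [abs(predict(x) - y) for x, y in new_data]
agrees = all(e < 0.5 for e in errors)
```

The tolerance of 0.5 is an arbitrary choice for the sketch; the point is only that agreement between predictions and newly available data is what shows the proposition accounted for that data.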
We can compare one proposition to another if one accounts for at least all the data that the other accounts for. In such cases, we can say the proposition that accounts for more data is truer. Continuing with our analogy between the material of propositions and physical objects, we see that truth is a gradable property, much like physical largeness. We can describe an object as large when viewed in relation to a certain range, or larger when compared to another object, or, more precisely, as being of a specific length. Similarly, with the truth value of a proposition, we can describe it as true relative to a certain range, truer in comparison to another proposition, or, more precisely, as accounting for specific data.
A problem posed in Plato’s Meno concerns the possibility of inquiry: how is any inquiry possible if it is impossible to inquire into what we don’t know, since we couldn’t search for it and wouldn’t recognize it even if we found it, or into what we do know, as we already know it? This problem can be solved by recognizing that knowledge consists in accounting for data, with truth value referring to the amount of data accounted for by propositions, and that data becoming available to a system is merely one of the transformations that occur in the world.
From this account of knowledge, we observe that animals, humans, and machine learning systems differ in their capacities for knowledge in part because of the differences in their respective architectures. These architectures—more malleable and improvable in machine learning systems than in animals or humans—create tendencies for systems to account for data in specific ways. Moreover, they constrain the kind of data each system can access. For example, we cannot claim that a human and a bat are accounting for the same data even if exposed to the same environment for the same period of time. These architectures also determine the extent to which different external contributions can enhance a system’s capacity for knowledge, as when education increases a human’s capacity for knowledge but not that of an animal. Granting these differences, a universal constraint across all systems is the kind of data available to build propositions. Consequently, corrupt data misleading a machine learning system is not dissimilar from a Cartesian deceiver misleading a human. The journey from error to knowledge can be understood as the discovery of propositions that are truer. In this process, unsuccessful predictions made by a machine learning system are no different from those made by a human.
The ultimate reference for truth and falsity is what occurs in the world. Propositions about what does not occur, whether possible or impossible, can only be true insofar as they point to what does occur. The use of propositions about possibilities to arrive at propositions about what actually occurs can be understood as an epistemic technique. We can conjecture that other, yet undiscovered epistemic techniques—additional ways of accounting for data—could exist. If it is possible to reach a true proposition without relying on the properties of the human system or its epistemic techniques, then the constraints of the human system become merely instrumental in the process.
The endowments of biological entities can also be considered independently of knowledge, as pure manifestations of the body’s creative processes, like the pumping of the heart. In that case, they, as Kant said of the senses, would not form any judgment, correct or incorrect. In other words, they would not be involved in accounting for data, which is the basis for arriving at truth and falsity. The faculties of the human mind, when they contribute to knowledge, do so by contributing to a person’s capacity for knowledge, as when the language faculty helps us read a textbook. We can compare the contribution of the human mind to human knowledge with the contribution of human limbs to human locomotion.
This account of knowledge also clarifies the role of predictions in making the internal process of accounting for data concrete. When someone claims to understand something, we assess this by asking questions or presenting problems, and then comparing their predictions against a benchmark. We consider past actions as a record of previous predictions. Since the capacity for externalization is independent of the capacity for knowledge, it is possible to demonstrate knowledge without being able to articulate it. Thus, predictions help us distinguish knowledge from its absence in any system—whether animal, human, or machine learning system. While we can raise legitimate questions about which system performs better on a specific metric, such as the amount of data required to arrive at a particular prediction, there can be no question as to which possesses genuine knowledge and which merely simulates having knowledge. Whatever demarcations we invoke will be arbitrary, like saying a system only instantiates genuine locomotion once it crosses a certain speed or energy efficiency threshold.
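A small sketch can make this system-neutrality concrete (the scoring function and all answers below are invented for illustration): the same comparison of predictions against a benchmark applies regardless of what kind of system produced the predictions.

```python
# Illustrative: assessing claimed knowledge by comparing predictions to a
# benchmark, the same way for any system. All data is invented.

def agreement(predictions, benchmark):
    """Fraction of predictions that match the benchmark answers."""
    matches = sum(p == b for p, b in zip(predictions, benchmark))
    return matches / len(benchmark)

benchmark = ["moves", "moves", "stays", "moves"]
system_a = ["moves", "moves", "stays", "moves"]   # e.g., a person's answers
system_b = ["moves", "stays", "stays", "moves"]   # e.g., a model's outputs

score_a = agreement(system_a, benchmark)  # 1.0
score_b = agreement(system_b, benchmark)  # 0.75
```

The metric treats the predictor as a black box: nothing in the comparison depends on whether the answers came from an animal, a human, or a machine learning system.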
4. Explanations
Chomsky’s contention that knowledge requires something more than mere description and prediction is reminiscent of the argument of Socrates that even if we formed beliefs that are true, we would still need something more in order to have genuine knowledge. In Plato’s Theaetetus, Socrates illustrates this using the example of lawyers who persuade jurymen to form beliefs about criminal acts. If a group of jurymen concludes that a defendant is innocent based on the compelling arguments of a lawyer, it cannot be said that they possess knowledge of the defendant’s innocence, even if he is indeed innocent, since we can imagine them finding the defendant guilty had the persuasive lawyer argued otherwise. We can form all kinds of beliefs, some of which could turn out to be true by mere happenstance. Socrates suggests that a true belief would have to be fastened by explanation and made reliable to become knowledge.
What is an explanation? In Aristotle’s Posterior Analytics, we find the following:
We suppose ourselves to possess unqualified scientific knowledge of a thing, as opposed to knowing it in the accidental way in which the sophist knows, whenever we think we are aware both that the explanation because of which the object is is its explanation, and that it is not possible for this to be otherwise.
Accordingly, we can describe an explanation as an answer to the question of why a particular fact is the way it is and not otherwise. In an explanation, the fact to be explained is demonstrated to be an instance of a general proposition—a causal relation or a universal law. A particular ball moving after being struck by another can thus be explained by the general proposition “whenever one ball strikes another, it causes the other ball to move”.
Let us take this example to understand the nature of causal relations. Our intuitive conception of causation suggests that the first ball acts on the second ball upon contact, causing the second ball to move out of necessity in each instance. However, as David Hume argued, what we discover in causal relations is merely a constant union between pairs of events. The necessity we attribute to this connection, which makes us claim that whenever one event occurs another must follow, is not something we actually discover. Without the obscure notion of necessity, which adds no value to the relationships we identify despite any significance we attach to it, all that remains in our understanding of causation are associations as discovered by a system with certain capacities. More accurately, causal relations refer to propositions in which relations of precedence and succession between states of affairs are presupposed, with some propositions being truer than others.
In explaining why, when one ball strikes another, the other ball moves, we can also invoke a general proposition recognized as a law of nature, such as “the total momentum of a system remains constant”. If we compare this law to the proposition “whenever one ball strikes another, the other ball moves”, the law is revealed to be a truer proposition. That is, it accounts for all the data the other proposition does while also accounting for more. Thus, a law of nature is a type of proposition, distinguished only by the large amount of data it accounts for.
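The momentum-conservation law’s greater reach can be sketched numerically. One assumption here that the essay does not make: the collision is treated as perfectly elastic, so the post-collision speeds follow from conserving both momentum and kinetic energy. The numbers are invented for illustration.

```python
# Illustrative: the conservation law accounts for the common case (an equal
# struck ball moves off at full speed) and for cases the simpler association
# misses (a much heavier struck ball barely moves at all).
# Assumption not in the essay: the collision is perfectly elastic.

def struck_ball_speed(m1, m2, v1):
    """Post-collision speed of a resting ball of mass m2 struck by a ball
    of mass m1 moving at v1, from conservation of momentum and energy."""
    return 2 * m1 * v1 / (m1 + m2)

equal = struck_ball_speed(1.0, 1.0, 3.0)      # equal masses: full transfer
heavier = struck_ball_speed(1.0, 100.0, 3.0)  # heavy struck ball: barely moves
```

Where the bare association “whenever one ball strikes another, the other ball moves” says nothing about how much the struck ball moves, the law yields a definite prediction for every mass ratio, which is what makes it the truer proposition.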
If finding causal relations and laws of nature consists of identifying propositions of certain kinds, then the notion of explanation, where it is taken to demonstrate the possession of knowledge, involves having a sufficiently truer proposition than the proposition to be explained. Such a truer proposition would contain within it the limits that rendered the other proposition falser, in the same way that a map amended with a pencil contains within it the constraints of the older map. This would be evident in the predictions we form based on the truer proposition.
There is, however, a risk with explanations when they are viewed as the goal of all inquiry. Because of the internal nature of the process of accounting for data, our ignorance can become obscured in the pursuit of explanations, and we may mistake a satisfactory subjective state—such as the feeling of clarity or simplicity generated by a particular explanation—for actual knowledge. Reliance on predictions helps mitigate this risk, as demonstrated by the success of modern natural science, which offers predictions that pre-modern natural philosophy, despite its elaborate explanations, failed to offer. It is crucial to recognize that anything involved in what we mean by developing explanations is relevant to knowledge only to the extent that it aids in accounting for more data.
Once we understand how predictions are made, we can acknowledge that a machine learning system making predictions must have based its conclusions on something it discovered, even if it cannot externalize it. While this presents its own risks—such as hindering the cumulative process of knowledge-building in humans that externalization facilitates—as a fact about the possession of knowledge it is similar to the case of a person who cannot articulate a proposition he has discovered but demonstrates it through the successful predictions he makes. Each successful prediction reflects the data accounted for by a proposition, and it is on the basis of the data accounted for that a distinction can be made between beliefs that turn out true accidentally and those that are more reliable.
In the case of the jurymen of Socrates, it is the identification of a truer proposition than, say, “the accused is innocent of the crime because the lawyer would not be persuasive otherwise”, such as “the accused is innocent of the crime because of such and such evidence, and the lawyer’s persuasiveness is independent of this”, that yields something more than mere true belief. Or, to use an example we have been considering, we may observe in a new experience that one ball strikes another but the other ball does not move, in which case we would realize that any explanation we had that satisfied us or made us certain, whatever its other uses, had limits attributable to nothing other than the limited data the proposition accounted for. This realization could prompt us to identify a truer proposition, such as “whenever one ball strikes another, the other ball moves, except when the other ball is significantly heavier”, or, eventually, “the total momentum of a system remains constant”. It is an observed consequence of a proposition being sufficiently true that it leads us toward the discovery that a particular outcome, rather than another, would occur or would not occur under particular circumstances.
5. Conclusion
When we understand the nature of knowledge, we recognize that it is aided by the endowments of a system—such as the endowments innate to humans—but is not solely defined by them. Consequently, we can imagine these endowments independently of knowledge, while also conceiving of entities that lack the same endowments yet remain capable of acquiring knowledge.
Thank you for reading Vatsal’s Newsletter. Have feedback? Write me at vatsal@readvatsal.com.