In recent years, “deep learning,” a technology based on artificial neural networks, has gained popularity and success. We will describe it through an imaginary method for identifying students in distress: those at risk of dropping out of their studies.
One way to define machine learning is as a machine’s ability to learn and characterize a system’s properties from examples. A key element in this process is an "evaluation function", which gives the computer a way to assess the presence and importance of different features during learning. For instance, an evaluation function in image recognition answers the question, "What is the probability that a picture on the web contains your photo?"
Artificial neural networks [1] are a popular implementation method, inspired by brain research. In this approach, the evaluation function is implemented as a network containing many layers of computational elements. Each element in an inner layer receives data from elements in the previous layer, performs calculations using parameters called "weights", and passes the results to elements in the next layer. In this way, the information is distilled until the final layer produces a single outcome—the evaluation. Sound complicated? It is. Feeling overwhelmed? To help, we will use an analogy.
At the University of Arizona, researchers built a data-driven system to identify students who might drop out. In addition to grades, the system tracks where students spend their time and what they buy. The goal is to create an alert about students in distress so that university advisors can help.
Suppose we want to build a system that estimates whether a first-year student will succeed in completing a degree. The system receives plenty of data about the student (grades, social behavior, how much beer they drink, and so on) and performs calculations using weights to estimate the student’s chances of surviving. By using data from many known students—those who graduated and those who dropped out—we can "train" the system. Training is done by finding weight values that minimize the evaluation error.
Imagine that the university staff set up a huge office filled with many clerks ("computational elements") sitting in rows. Each clerk receives data from several clerks in the previous row and performs calculations. When done, the clerk stands up and passes the result to several clerks in the next row, and so on. Suppose a clerk in a certain row receives information on how much beer a student drinks per week. The clerk multiplies it by a parameter ("weight") and adds the amount of vodka the student drinks multiplied by another weight. On the result, the clerk performs an additional computation called an “activation function,” which we’ll explain shortly. Finally, the clerk passes the outcome to several clerks in the next row. They will use it in a similar way—for example, to estimate to what extent the student is a partygoer, how wasteful they are, how depressed they are, and so on. The results are passed to clerks in the following layers, who combine the student’s personal life with their grades and produce the evaluation.
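The clerk’s arithmetic can be sketched in a few lines of Python; the drink quantities and the weights here are invented purely for illustration:

```python
# One "clerk" computing a weighted sum of its two inputs.
# All the numbers are made up for the sake of the example.
def weighted_sum(beer_per_week, vodka_per_week, w_beer, w_vodka):
    return beer_per_week * w_beer + vodka_per_week * w_vodka

# 5 beers and 1 vodka per week, with weights 0.4 and 0.8
print(weighted_sum(5.0, 1.0, 0.4, 0.8))  # 2.8
```

In a real network each element receives many inputs, not just two, but the computation is the same kind of weighted sum.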
What is the "activation function"? The idea, inspired by neural-network research, is to let each clerk take more responsibility: instead of forwarding various numbers to clerks in the following rows, which would only burden them, each clerk gets an independent judgment capability—say, to decide "the student has bad drinking habits". The activation function translates the value computed by the clerk into values like "1" (certain this is true) or "0" (certain this is not true). For intermediate values, where there is uncertainty, the function produces numbers between 0 and 1.
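One common choice of activation function is the sigmoid, which squashes any raw sum into a value between 0 and 1, i.e., a degree of certainty:

```python
import math

# The sigmoid activation function: maps the clerk's raw sum
# to a "certainty" value between 0 and 1.
def sigmoid(x):
    return 1 / (1 + math.exp(-x))

print(sigmoid(2.8))   # about 0.94: quite certain the habits are bad
print(sigmoid(0.0))   # 0.5: complete uncertainty
print(sigmoid(-2.8))  # about 0.06: quite certain they are not
```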
At first, every clerk chooses the weights randomly. When the gong sounds, the data on the first student are fed into the system. The clerk receives the amount of beer the student consumed and multiplies it by the initial weight (let’s assume it was a large number). Next, the clerk performs a similar calculation for vodka and sums the results. Finally, after computing the activation function, the clerk declares, "90% chance that the student has bad drinking habits!” The information passed to the next row of clerks trickles through the network, and eventually the chief clerk outputs a final evaluation: "40% chance the student will graduate". Suppose it is known that the student actually did graduate. In this case, the manager can compute the squared error of the evaluation: (1 − 0.4)² = 0.36. The process continues with the rest of the students, and the manager sums all the squared errors. Eventually, the manager announces, "You’re a bunch of incompetent clerks. The error is large; we need to improve!"
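The manager’s bookkeeping can be sketched as follows; the two students and their evaluations are made up for the example:

```python
# The manager sums the squared error over all known students.
# Each pair is (system's evaluation, actual outcome), where the
# outcome is 1 for a student who graduated and 0 for one who dropped out.
students = [
    (0.4, 1),  # system said 40%, student actually graduated
    (0.7, 0),  # system said 70%, student actually dropped out
]

total_error = sum((outcome - evaluation) ** 2
                  for evaluation, outcome in students)
print(round(total_error, 2))  # (1 - 0.4)^2 + (0 - 0.7)^2 = 0.85
```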
How can our clerk reduce the error function by changing the weight used? For that, math is required. For those who did not study calculus, or forgot: the standard way to minimize is to find the weight value where the derivative of the error function with respect to that weight is zero. However, because the error function is very complex, the equation "derivative = 0" cannot be solved algebraically. Instead, we use an algorithm called "gradient descent" [3], which, by computing the derivative at the current point, indicates how to gradually adjust the weight to approach the minimum. Through several rounds of gradual changes, one can usually get close to the minimum.
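Gradient descent for a single weight can be sketched like this. We shrink the office down to one clerk with one weight, and the student data, starting weight, and learning rate are all invented for the example:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

beer, outcome = 5.0, 1   # this student drinks 5 beers/week and graduated
w = -0.3                 # random initial weight
learning_rate = 0.5

for step in range(100):
    evaluation = sigmoid(beer * w)
    error = (outcome - evaluation) ** 2
    # Derivative of the error with respect to w (chain rule):
    # d/dw (outcome - sigmoid(beer*w))^2
    gradient = -2 * (outcome - evaluation) * evaluation * (1 - evaluation) * beer
    w -= learning_rate * gradient   # step against the gradient

print(round(error, 4))  # the error has shrunk toward 0
```

Each step moves the weight a little in the direction that lowers the error; the learning rate controls how big each step is.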
After our clerk and the other clerks update their weights according to the algorithm, the data on all the students are fed into the network again, the new error is computed, and the weights are updated once more according to the algorithm. This process repeats until the error becomes small. At that point, the manager declares, "Training is over! Our system is ready to accept data from new students and identify those in distress".
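Putting the pieces together, the whole training loop looks roughly like this toy version with two weights and a handful of students; the data, starting weights, and stopping threshold are all invented:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# (beer/week, vodka/week, graduated?) for a handful of known students
students = [(1.0, 0.0, 1), (2.0, 1.0, 1), (6.0, 3.0, 0), (7.0, 2.0, 0)]

w_beer, w_vodka, bias = 0.1, -0.1, 0.0   # arbitrary starting weights
learning_rate = 0.1

for epoch in range(10000):
    total_error = 0.0
    for beer, vodka, outcome in students:
        evaluation = sigmoid(bias + beer * w_beer + vodka * w_vodka)
        diff = outcome - evaluation
        total_error += diff ** 2
        # Gradient-descent update for each weight (chain rule)
        grad_common = -2 * diff * evaluation * (1 - evaluation)
        w_beer  -= learning_rate * grad_common * beer
        w_vodka -= learning_rate * grad_common * vodka
        bias    -= learning_rate * grad_common
    if total_error < 0.05:
        break   # "Training is over!"

print(round(total_error, 3))  # small: the system now separates the students
```

Real systems differ in scale, not in kind: millions of weights, many layers, and far more data, but the same cycle of evaluate, measure the error, and nudge the weights.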
Quite surprisingly, it has turned out in recent years that these methods work very well. Combining large amounts of hardware, a vast collection of training examples, careful tuning of the network, and the use of suitable activation functions yields impressive results. Moreover, it turns out there is no need to assign specific roles to the clerks. One simply places them, and through gradual learning, the system figures out how to use them efficiently.
English editing: Elee Shimshoni
Sources and further reading: