Learning With Dr Neal — Dr Neal Aggarwal

🎧 Listen to this post: Three AI entities discuss what neural networks actually are — the history, the mathematics, and why it matters for your career.

The transformation under way is not a future scenario — it is the current state of every field that processes information. Learners who invest time in foundations now gain access to capabilities that currently require expensive specialist consultants.

Personal productivity tools. Fine-tuned language models for your specific domain. Image classifiers trained on your own data. NLP extractors that pull structured information from unstructured text — clinical notes, legal documents, research papers. These tools currently require specialist ML engineers for most organisations. Practitioners who complete fast.ai can build them independently.

Professional differentiation. In medicine, law, finance, engineering, and research — every field that processes data — practitioners who can build and evaluate AI tools will command significant advantages over those who can only consume them.

Research acceleration. For scientists, the ability to build custom models means reduced dependence on off-the-shelf tools that may not fit your domain. The ability to fine-tune a language model on your own corpus, or to train an image classifier on your own labelled data, compresses research timelines that once required collaborations with dedicated ML labs.

Agency in an AI world. Perhaps most importantly: understanding how neural networks work gives you the conceptual tools to evaluate AI claims critically, audit model outputs, and make informed decisions about when to trust and when to question AI systems. This is the difference between using AI as a tool and being used by it.

The window during which this knowledge provides a genuine competitive advantage is finite. It will close. The practitioners who move through it will not be those who waited for a simpler on-ramp — they will be those who found a guide who could make the existing on-ramp navigable.

This article is based on Jeremy Howard's fast.ai Practical Deep Learning for Coders 2022 · Lesson 3: Neural Net Foundations. This is an example of the course materials produced by Dr Neal and used to teach students through his one-to-one AI Learners Course.

▶ Watch Lesson 3 on YouTube · ▶ Supplementary Swadia Tutorial

Why Study with Dr. Neal Aggarwal?

Forty years of teaching information technology and artificial intelligence across academic, corporate, and individual-mentorship contexts confer a particular kind of understanding that no amount of self-study can replicate: the ability to recognise where a given student is stuck, why they're stuck there, and what angle of re-entry will unstick them.

The fast.ai curriculum is excellent. But it was designed for a specific learner archetype, and most learners are not that archetype. They have domain knowledge that the curriculum doesn't assume. They have cognitive habits developed in adjacent fields. They have time constraints the standard schedule doesn't accommodate. They have professional motivations that the canonical examples don't speak to.

Working through fast.ai with a guide who has led hundreds of learners through this material means those mismatches get resolved in real time, not after three weeks of stalling on a concept that could have been reframed in five minutes.

The practical outcome: students who study with Dr. Neal complete the fast.ai curriculum in half the median time, with substantially deeper practical understanding — and they leave with a project, not just a certificate.

→ Contact Dr. Neal Aggarwal for 1-to-1 sessions, group workshops, and curriculum design

Jeremy Howard: The Man Who Decided to Democratise AI

Jeremy Howard did not follow a conventional path into AI research. He studied philosophy at the University of Melbourne — a choice that shaped his thinking about pedagogy, ethics, and the difference between knowing something and understanding it. After eight years in management consulting (McKinsey and AT Kearney), he taught himself machine learning, competed on Kaggle, and became the world's top-ranked data scientist in 2010 and 2011 — not through access to proprietary data or exclusive compute, but through systematic application of techniques available to anyone.

That experience convinced Howard of something that would define his subsequent career: the gap between AI researchers and practitioners was not a capability gap but a pedagogical one. The knowledge existed; the on-ramps did not.

The fast.ai Philosophy

In 2016, Howard co-founded fast.ai with Rachel Thomas with a simple mission: make deep learning accessible to domain experts who were not career machine learning researchers. The pedagogical approach they developed — sometimes called the "top-down" or "whole game" method — inverts the conventional curriculum sequence.

Traditional deep learning courses begin with linear algebra, probability theory, and optimisation — the theoretical foundations. Students spend months on prerequisites before writing a line of code that does anything useful. Howard's observation was that this approach works for students who have already committed to years of study, but fails the practitioner who needs to know whether deep learning is applicable to their domain before making that commitment.

fast.ai begins with a working model. Lesson 1 trains a state-of-the-art image classifier in four lines of code. Students see results before they understand the mechanism. This creates what Howard calls "a hook" — genuine motivation to then go deeper and understand why it worked. The prerequisite mathematics becomes meaningful precisely because the student has already used the tool and wants to understand it more deeply.

This is not a shortcut. Howard is emphatic that full understanding is the goal — but he maintains that the fastest path to full understanding is not the conventional prerequisite-first route. This claim is, it turns out, well-supported by learning science research, as we'll see below.

Howard on the Future

Howard's public writings and interviews reveal a consistent thread: he believes that the dominant risk from AI is not malevolence but concentration — the scenario in which powerful AI tools are accessible only to the largest institutions, further entrenching existing inequalities. fast.ai is explicitly a counter-measure to this scenario.

In his 2024 AI panel discussions and interviews, Howard has argued that the most important technical development of the next decade will not be larger models but more efficient ones: models that run on commodity hardware and can be customised by individual practitioners. His founding of Answer.AI in December 2023, described as a "results-focused AI lab," is a direct expression of this philosophy.

What AI Will Do for Learners Who Act Now

The Learning Science Behind Howard's Approach

The success of fast.ai's pedagogy is not accidental — it maps closely onto findings from cognitive science and educational psychology that have accumulated over decades. Understanding why the approach works will help you extract more value from it.

Desirable Difficulty

Howard's top-down method creates what educational psychologists call "desirable difficulty." Encountering the whole problem before understanding all the parts is initially uncomfortable. This discomfort is the signal that learning is occurring, not evidence that something is wrong. Research by Robert Bjork at UCLA has consistently shown that more challenging learning conditions produce better long-term retention, even when they produce slower initial performance.

Worked Examples and Cognitive Load Theory

The fast.ai notebooks are worked examples. John Sweller's cognitive load theory predicts that novices learn more efficiently from studying worked examples than from attempting to solve problems independently from the outset. Problem-solving requires cognitive resources for both domain content and problem-solving strategy simultaneously — a load that exceeds the working memory capacity of most learners encountering a new domain. Worked examples free cognitive resources for the content itself.

Spaced Repetition and Interleaving

The fast.ai curriculum returns to the same concepts multiple times across lessons, each time with more context and depth. This is spaced repetition in practice: revisiting material at spaced intervals is among the best-supported interventions for long-term retention (Ebbinghaus, 1885; Cepeda et al., 2006). Interleaving different types of practice (gradient descent, then matrix multiplication, then the spreadsheet) further improves retention compared to blocking, that is, practising one topic exhaustively before moving to the next.

The Feynman Technique at Scale

fast.ai's forums are central to the pedagogy. With 18,214 topics across 21 categories, they bring together thousands of students, from complete novices to seasoned AI professionals, who are able and willing to help make your learning smooth and productive. Students are encouraged to explain concepts to each other, which is a structured implementation of the Feynman Technique. Every point of confusion in an explanation to another person reveals a gap in the explainer's own understanding. The forums create a low-stakes environment for this kind of public retrieval practice at scale.

Swadia's Supplementary Tutorial: 10 Key Points

▶ Watch the Swadia tutorial on YouTube

Swadia — an MIT graduate and former CEO turned learning strategist — distilled a decade of high-performance learning into a single 12-minute framework: the Three C Protocol (Compress, Compile, Consolidate). His central argument maps directly onto Howard's pedagogy: intelligence is no longer the differentiating variable — speed of acquisition is. Where Howard gives you the technical on-ramp, Swadia gives you the cognitive architecture to retain what you build there. Watching this before, or alongside, the fast.ai lectures will change how you take notes, how you schedule review, and how quickly the material moves from working memory into durable understanding.

The following points synthesise the core pedagogical contributions of the Swadia tutorial, structured as an ordered learning sequence that complements Howard's top-down approach:

Point 1 — Context First, Mechanics Second. Before any equation is introduced, establish why it matters. Neural networks exist to solve the function-approximation problem. Anchoring every formula in this problem statement prevents the 'math without meaning' trap that causes most learners to stall.

Point 2 — The Biological Metaphor Is a Scaffold, Not a Specification. The McCulloch-Pitts neuron (1943) gave us the vocabulary. Modern artificial neurons borrow its template of weighted inputs, a sum, and an activation, but the analogy is a scaffold for intuition, not a literal description of how brains compute. Don't over-invest in it.

Point 3 — Parameters Are the Entire Knowledge of the Model. After training, a model IS its weight matrices. There is no other store of information. Internalising this dispels the mysticism around 'AI knowing things' and grounds all subsequent questions in concrete mathematics.

Point 4 — Loss Is a Design Choice. MSE, cross-entropy, Huber loss — engineering choices with mathematical consequences. Students should practice swapping loss functions and observing the effects rather than treating any one as canonical.

Point 5 — The Learning Rate Is the Most Important Hyperparameter. Too large: the optimiser overshoots. Too small: training takes forever. Learning rate scheduling and warmup strategies follow from this single observation.

Point 6 — One Hidden Layer Is Enough; Depth Adds Efficiency. With enough units, a single hidden layer can approximate any continuous function on a compact domain to arbitrary precision. What depth buys you is efficiency: for many functions, a deeper network reaches the same approximation with far fewer parameters. This is the practical motivation for 'deep' learning.

Point 7 — Overfitting Means the Model Learned the Data, Not the Task. Train loss down, validation loss up — that is the diagnostic. Regularisation, dropout, and data augmentation are interventions at this diagnostic point.

Point 8 — GPUs Accelerate Matrix Multiplication, Not Magic. A forward pass is a sequence of matrix multiplications interleaved with elementwise nonlinearities. GPUs are designed precisely for this pattern of computation.

Point 9 — Feature Engineering Is Still Required. Log-transforming skewed features, normalising continuous inputs, and dummy-encoding categoricals are not optional. Scaling and log transforms directly affect the conditioning of the loss surface, and therefore how well a single learning rate works across all parameters.

Point 10 — Transfer Learning Is the Practical Default. Training from scratch is rarely necessary. Pretrained models encode general feature hierarchies that fine-tune to new tasks with a fraction of the data and compute. This is the workflow that makes deep learning practically useful outside of large research labs.
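Point 4 is easy to try without any framework. The following is a minimal sketch in plain Python, assuming a made-up list of residuals with and without one outlier; the function names and the Huber threshold `delta=1.0` are illustrative choices, not taken from the tutorial.

```python
def mse(residuals):
    """Mean squared error: large residuals are penalised quadratically."""
    return sum(r * r for r in residuals) / len(residuals)

def mae(residuals):
    """Mean absolute error: every residual is penalised linearly."""
    return sum(abs(r) for r in residuals) / len(residuals)

def huber(residuals, delta=1.0):
    """Huber loss: quadratic for small residuals, linear beyond delta."""
    def one(r):
        a = abs(r)
        return 0.5 * r * r if a <= delta else delta * (a - 0.5 * delta)
    return sum(one(r) for r in residuals) / len(residuals)

# Made-up residuals (prediction minus target); the second list has one outlier.
clean = [0.5, -0.5, 0.5, -0.5]
with_outlier = [0.5, -0.5, 0.5, 10.0]

for name, fn in [("mse", mse), ("mae", mae), ("huber", huber)]:
    print(name, fn(clean), fn(with_outlier))
```

On these numbers the single outlier inflates MSE far more than it inflates Huber or MAE, which is the kind of consequence the choice of loss carries.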

We've Been Building This Since 1943

In 1943, Warren McCulloch and Walter Pitts published "A Logical Calculus of the Ideas Immanent in Nervous Activity" — a mathematical model of how neurons fire. Their artificial neuron took weighted binary inputs, summed them, and fired if the total exceeded a threshold. It could not learn. It had no notion of optimisation. But it was the first time anyone had formalised the idea of computation in biological terms, and it planted a seed that eighty years of mathematics, hardware, and data have now grown into something extraordinary.
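The neuron described above fits in a few lines. This is a rough sketch rather than a transcription of the 1943 paper: the function name, the weights, and the thresholds are chosen here purely for illustration.

```python
def mcculloch_pitts(inputs, weights, threshold):
    """Fire (return 1) when the weighted sum of binary inputs exceeds the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0

def logical_and(a, b):
    # Both inputs must be on for the sum (2) to exceed 1.5.
    return mcculloch_pitts([a, b], weights=[1, 1], threshold=1.5)

def logical_or(a, b):
    # A single active input (sum 1) already exceeds 0.5.
    return mcculloch_pitts([a, b], weights=[1, 1], threshold=0.5)

pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]
print([logical_and(a, b) for a, b in pairs])
print([logical_or(a, b) for a, b in pairs])
```

The weights and thresholds are set by hand. Nothing in this unit adjusts them from data, which is exactly the limitation noted above.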

Jeremy Howard's fast.ai Lesson 3 is, in one sense, the story of what happened between 1943 and now — told from the bottom up, using nothing more than arithmetic and a few lines of Python. This post is a full reconstruction of that lesson and what it reveals about how we learn, how machines learn, and why the two might not be as different as we habitually assume.

↑ Key conceptual structure of fast.ai Lesson 3. The full interactive mind map is available in the NotebookLM notebook.

🎧 Audio Overviews — Listen Directly
AI-narrated audio overviews of the fast.ai Lesson 3 materials, generated from the full slide decks and course notes. Ideal for commute listening before or after a session.

Course Overview

Lesson 1 Deep Dive

📥 Download the Learning Guide
A structured two-week study guide synthesising the Swadia tutorial, Howard's lesson, and learning science research:
PDF Version EPUB Version

The Central Question

Howard opens the computational core of Lesson 3 with a question that is more consequential than it first appears: what are the weights in a trained model, and how can numbers figure out something important about the world? This is the right question to anchor around, because it forces you to unpack three distinct concerns the deep learning literature often muddles:

The function family. What class of mathematical function is expressive enough to represent the mapping you care about?
The loss. How do you quantify the gap between predictions and ground truth in a differentiable way?
The optimisation algorithm. Given the loss and its gradient, how do you search the parameter space efficiently?

Howard's pedagogical move is to answer all three questions concretely, building each from scratch before introducing the abstractions that fast.ai and PyTorch provide. This is the top-down, whole-game approach that defines his teaching philosophy — and we'll return to why it works so well when we examine the learning science behind it later in this post.

Step 1 — Fitting a Function by Hand

Howard begins with a general quadratic:

f(x) = a·x² + b·x + c

The point is not the specific function — it is the parameterisation. By exposing three free parameters (a, b, c), a single function template can realise any specific quadratic by fixing those parameters. Howard demonstrates this with an interactive Jupyter widget, dragging sliders to visually minimise the distance between the curve and a randomly generated dataset.

This surfaces the core insight immediately: fitting a model is a search problem in parameter space. The slider widget works for three parameters on a toy dataset, but is obviously unscalable to ten million parameters. This deliberate setup carries the rest of the lesson.
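To make "search in parameter space" concrete, here is a small sketch that replaces the sliders with brute force. It assumes noise-free toy data generated from a known quadratic; `make_quadratic`, the grid, and the data are hypothetical and not from the lesson notebook.

```python
from itertools import product

def make_quadratic(a, b, c):
    """Fix the three parameters and return the resulting function of x."""
    def f(x):
        return a * x**2 + b * x + c
    return f

# Toy data from a known quadratic (a=3, b=2, c=1), an assumption of this sketch.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
target = make_quadratic(3.0, 2.0, 1.0)
ys = [target(x) for x in xs]

def total_squared_error(params):
    f = make_quadratic(*params)
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys))

# Try every combination of five candidate values per parameter.
grid = [0.0, 1.0, 2.0, 3.0, 4.0]
candidates = list(product(grid, repeat=3))
best = min(candidates, key=total_squared_error)

print(len(candidates))  # 5**3 combinations
print(best)
```

Five values for each of three parameters already means 125 evaluations. The count grows exponentially with the number of parameters, which is why exhaustive search is not an option for real networks.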

Step 2 — Introducing Loss (MSE)

Rather than relying on visual intuition, Howard introduces a scalar summary of "how wrong the current parameters are." For a regression problem:

L = (1/n) · Σ (ŷᵢ − yᵢ)²

MSE has two properties that matter: it is everywhere differentiable, and squaring penalises large residuals super-linearly, making the optimiser concentrate on the worst-fitting examples. Howard presents the loss as a design choice, not a revealed truth — later lessons swap in cross-entropy for classification tasks.

Core insight: The loss collapses the entire dataset's prediction quality into a single number. Optimisation then becomes: decrease this number by adjusting the parameters. Everything else in training is mechanics around that one idea.
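A worked example of the formula on three made-up predictions may help. The numbers are illustrative only, and `mse` here is a plain-Python stand-in for the tensor version used in the lesson.

```python
def mse(preds, targets):
    """Mean of the squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

# Made-up values: residuals are 1, 0 and 4.
preds = [2.0, 4.0, 9.0]
targets = [1.0, 4.0, 5.0]

squared = [(p - t) ** 2 for p, t in zip(preds, targets)]
loss = mse(preds, targets)
worst_share = max(squared) / sum(squared)

print(squared)      # per-example squared errors
print(loss)         # the single number the optimiser sees
print(worst_share)  # fraction of the loss owed to the worst example
```

The residual of 4 is only four times the residual of 1, yet it contributes sixteen times as much to the loss. That is the super-linear penalty in action.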

Step 3 — Gradients and the Chain Rule

Howard's treatment of derivatives is deliberately minimal. For a given set of parameter values, PyTorch can tell you the slope of the loss with respect to each parameter — i.e., if you nudge parameter a upward by one unit, does the loss go up or down, and by approximately how much?

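import torch

# Toy data from a known quadratic (an assumption for illustration)
x = torch.linspace(-2, 2, 20)
y = 3*x**2 + 2*x + 1

# Initial guesses for a, b, c; floats, so gradients can be tracked
params = torch.tensor([1.0, 1.0, 1.0], requires_grad=True)

preds = params[0]*x**2 + params[1]*x + params[2]
loss  = ((preds - y)**2).mean()

loss.backward()     # fills params.grad
print(params.grad)  # tensor([dL/da, dL/db, dL/dc])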

params.grad[0] answers: "by how much does the loss increase per unit increase in a?" A positive gradient means increasing a makes things worse; a negative gradient means increasing a helps. This is the entire information content you need for gradient descent.
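The "nudge a parameter and watch the loss" reading of a gradient can be checked directly. The sketch below uses plain Python rather than PyTorch and compares a finite-difference estimate with the hand-derived gradient of the MSE. The toy data and the helper names are assumptions made for this example.

```python
def quadratic(params, x):
    a, b, c = params
    return a * x**2 + b * x + c

def mse_loss(params, xs, ys):
    return sum((quadratic(params, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def analytic_grad(params, xs, ys):
    """Gradient of the MSE with respect to a, b, c via the chain rule."""
    n = len(xs)
    residuals = [quadratic(params, x) - y for x, y in zip(xs, ys)]
    da = sum(2 * r * x**2 for r, x in zip(residuals, xs)) / n
    db = sum(2 * r * x for r, x in zip(residuals, xs)) / n
    dc = sum(2 * r for r in residuals) / n
    return [da, db, dc]

def numeric_grad(params, xs, ys, eps=1e-6):
    """Nudge each parameter up and down by eps and measure the change in loss."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        up[i] += eps
        down = list(params)
        down[i] -= eps
        grads.append((mse_loss(up, xs, ys) - mse_loss(down, xs, ys)) / (2 * eps))
    return grads

# Toy data generated from a=3, b=2, c=1; the current guess is a=b=c=1.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [3 * x**2 + 2 * x + 1 for x in xs]
params = [1.0, 1.0, 1.0]

print(analytic_grad(params, xs, ys))
print(numeric_grad(params, xs, ys))
```

The two estimates agree closely. All three gradients are negative here, so increasing a, b, or c from this starting guess lowers the loss.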

Step 4 — Gradient Descent

With gradients in hand, the update rule is straightforward:

θ ← θ − lr · ∇L(θ)

Subtract a small multiple of the gradient from each parameter. Howard implements the manual loop to make it concrete:

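# Assumes torch is imported and x, y, params are defined as in the snippet above
def quadratic(p, x):     return p[0]*x**2 + p[1]*x + p[2]
def mse(preds, targets): return ((preds - targets)**2).mean()

lr = 1e-3
if params.grad is not None: params.grad.zero_()  # start from clean gradients

for _ in range(100):
    loss = mse(quadratic(params, x), y)
    loss.backward()
    with torch.no_grad():
        params -= lr * params.grad
        params.grad.zero_()   # must clear or gradients accumulate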

⚠️ The most commonly forgotten line: params.grad.zero_(). PyTorch accumulates gradients by default: each call to backward() adds to .grad rather than overwriting it. That is a feature for gradient accumulation across mini-batches and a footgun for a simple training loop. Howard flags this explicitly, and experienced practitioners still trip on it when switching contexts.
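The effect of a forgotten zero_() can be simulated without PyTorch. The sketch below is an assumption-laden toy: it minimises L(w) = w² with a hand-written gradient buffer that, like `.grad`, is added to on every "backward" step. `train`, `grad_buffer`, and the chosen learning rate are illustrative names and values.

```python
def train(zero_grad, lr=0.1, steps=100):
    """Minimise L(w) = w**2 using a gradient buffer that accumulates."""
    w, grad_buffer = 1.0, 0.0
    history = []
    for _ in range(steps):
        grad_buffer += 2 * w      # "backward": ADD the new gradient to the buffer
        w -= lr * grad_buffer     # the update uses whatever is in the buffer
        if zero_grad:
            grad_buffer = 0.0     # the equivalent of params.grad.zero_()
        history.append(w)
    return history

with_zeroing = train(zero_grad=True)
without_zeroing = train(zero_grad=False)

print(abs(with_zeroing[-1]))                        # has shrunk towards the minimum at 0
print(max(abs(w) for w in without_zeroing[-20:]))   # still swinging around the minimum
```

With the buffer cleared, each step shrinks w by a constant factor. Without clearing, stale gradients keep pushing in directions that are no longer downhill, and w keeps oscillating around the minimum instead of settling.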

Learning Rate: Why Size Matters

Howard visualises the loss landscape as a bowl (a quadratic approximation is locally valid near any smooth minimum). A large learning rate causes the optimiser to overshoot the bottom of the bowl, potentially diverging. A learning rate that's too small means thousands of iterations to converge. Howard's practical heuristic at this stage: start around 1e-3, watch the loss curve, and halve it if it oscillates.
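The bowl picture can be made numerical. This sketch assumes the simplest possible bowl, L(w) = w² with gradient 2w, and counts how many updates each learning rate needs. The function name, tolerance, and divergence cut-off are choices made here for illustration.

```python
def run_gradient_descent(lr, w=1.0, tol=1e-3, max_steps=10_000):
    """Return (status, steps) for gradient descent on L(w) = w**2."""
    for step in range(max_steps):
        if abs(w) < tol:
            return "converged", step
        if abs(w) > 1e6:
            return "diverged", step
        w -= lr * 2 * w   # gradient of w**2 is 2w
    return "too slow", max_steps

for lr in (1.1, 0.4, 0.001):
    print(lr, run_gradient_descent(lr))
```

On this bowl a learning rate of 1.1 makes every step overshoot by more than it corrects, so the iterate grows without bound. A rate of 0.4 converges in a handful of steps, and 0.001 gets there only after a few thousand.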

Gradient interpretation in units: Howard makes a point experienced engineers often skip. The gradient is in units of (loss units / parameter units), and the gradient for a weight scales with the magnitude of the feature that weight multiplies. If your "Fare" column is in pounds and the loss is in squared probability-of-survival, the raw gradient magnitude is not comparable across columns. This motivates normalisation.

Step 5 — Why Quadratics Are Not Enough

Having established the full gradient descent loop, Howard pivots to the expressiveness problem. A quadratic can model one hump. Real-world mappings — image pixels to class labels, text tokens to sentiment scores — require functions of effectively arbitrary complexity. Higher-degree polynomials are numerically unstable. What's needed is a simple building block that can be composed into arbitrarily complex functions.

Step 6 — The Rectified Linear Unit (ReLU)

Howard's answer is the ReLU:

ReLU(x) = max(0, x)

This function is about as simple as a nonlinear function can be. Two parameters control a shifted, scaled version: f(x) = max(0, w·x + b). A single ReLU is nearly useless. But the universal approximation theorem guarantees that a sum of enough shifted, scaled ReLUs can approximate any continuous function on a compact domain to arbitrary precision. Howard demonstrates this interactively by adding two ReLUs together and showing how four parameters produce a more complex shape than any single parabola could.
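One way to see why sums of ReLUs are so expressive is to build a localised "bump" from three of them. The construction below is a standard textbook one rather than Howard's own demo, and the name `hat` is just a label for this sketch.

```python
def relu(x):
    return max(0.0, x)

def hat(x):
    """A triangular bump on [0, 2] with its peak at x = 1, built from three ReLUs."""
    return relu(x) - 2 * relu(x - 1) + relu(x - 2)

for x in (-1.0, 0.5, 1.0, 1.5, 2.0, 3.0):
    print(x, hat(x))
```

The function is zero outside [0, 2], rises linearly to 1, and falls back again. Shifted and scaled copies of such bumps can be added to trace out any piecewise-linear curve, which gives an intuition for how enough ReLUs can follow a continuous function as closely as needed.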

Howard's compressed summary of deep learning: (1) parameterised family of functions (ReLU stacks); (2) define a loss; (3) compute gradients; (4) take a small step downhill; (5) repeat. Everything else — batch normalisation, residual connections, attention — is engineering that makes this core loop work better in practice.

Step 7 — Matrix Multiplication as the Computational Substrate

When your network has millions of ReLUs, computing predictions one at a time is prohibitively slow. Howard introduces the key insight: a layer of neurons computing w₁x₁ + w₂x₂ + ... + wₙxₙ + b for each of m neurons simultaneously is exactly a matrix multiplication.

If X is the input matrix (batch × features) and W is the weight matrix (features × neurons):

Z = X · W + b

This is a single BLAS call. GPUs are designed to execute billions of these floating-point multiply-accumulate operations in parallel, which is a large part of why deep learning became practical on commodity hardware.

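import torch

n_inputs, n_hidden = 3, 8   # illustrative sizes

# Each linear layer needs its own weights and biases
W1 = torch.randn(n_inputs, n_hidden, requires_grad=True)
b1 = torch.zeros(n_hidden,           requires_grad=True)
W2 = torch.randn(n_hidden, 1,        requires_grad=True)
b2 = torch.zeros(1,                  requires_grad=True)

def linear(x, W, b): return x @ W + b
def relu(x):         return x.clamp(min=0)

# One hidden layer: linear -> ReLU -> linear
def model(x): return linear(relu(linear(x, W1, b1)), W2, b2)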
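The claim that a layer of neurons is a matrix multiplication can be verified on a tiny example. The sketch below uses NumPy as a stand-in for PyTorch tensors, with arbitrary random values and small illustrative shapes. It computes Z = X · W + b once with explicit per-neuron loops and once with a single matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_features, n_neurons = 4, 3, 2

X = rng.normal(size=(batch, n_features))      # one row per example
W = rng.normal(size=(n_features, n_neurons))  # one column per neuron
b = rng.normal(size=n_neurons)                # one bias per neuron

# Neuron by neuron: w1*x1 + w2*x2 + ... + wn*xn + b for every example
Z_loop = np.zeros((batch, n_neurons))
for i in range(batch):
    for j in range(n_neurons):
        Z_loop[i, j] = sum(X[i, k] * W[k, j] for k in range(n_features)) + b[j]

# The same computation as one matrix multiplication plus a broadcast bias
Z_matmul = X @ W + b

print(np.allclose(Z_loop, Z_matmul))
```

Both routes give the same numbers. The difference is that the second hands the whole job to an optimised routine instead of running a Python loop per neuron.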

Step 8 — Building a Neural Network in a Spreadsheet (Titanic)

To cement these concepts Howard walks through a spreadsheet implementation of gradient descent on the Kaggle Titanic dataset. Every operation is visible without library abstractions hiding it.

The key preprocessing decisions are worth cataloguing:

1. Dummy-encode categoricals. Sex, Pclass, and Embarked become binary columns. You only need k−1 dummy variables for a k-category feature (the dropped category becomes the reference level). Getting this wrong is a common source of rank-deficient design matrices.

2. Normalise continuous features. Age and Fare are on very different scales. The same learning rate can simultaneously be too large for one feature and too small for another. Fix: subtract the mean, divide by standard deviation.

3. Apply log to Fare. Fare is right-skewed: a few first-class tickets cost vastly more than the median. A log transform compresses the tail, so a handful of extreme fares no longer dominate the gradient and the feature's range becomes comparable to the others.

4. Build two linear layers with ReLU. First linear → ReLU → second linear → scalar prediction. That's a minimal two-layer neural network.
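The first three preprocessing steps above can be sketched in plain Python. The values below are made up for illustration and are not rows from the Kaggle file. The helper names are hypothetical, and using log(1 + fare) rather than log(fare) is an assumption of this sketch so that a zero fare does not break the transform.

```python
import math
from statistics import mean, pstdev

def dummy_encode(values):
    """Encode k categories as k-1 binary columns; the first category is the reference."""
    categories = sorted(set(values))
    kept = categories[1:]
    rows = [[1.0 if v == c else 0.0 for c in kept] for v in values]
    return kept, rows

def standardise(values):
    """Subtract the mean and divide by the standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def log_transform(values):
    """Compress a right-skewed feature with log(1 + x)."""
    return [math.log1p(v) for v in values]

# Illustrative values only
embarked = ["S", "C", "Q", "S"]
ages = [22.0, 38.0, 26.0, 35.0]
fares = [7.25, 71.28, 8.05, 512.33]

columns, encoded = dummy_encode(embarked)
print(columns, encoded)
print(standardise(ages))
print(max(fares) / min(fares), max(log_transform(fares)) / min(log_transform(fares)))
```

Three embarkation categories become two columns, with the dropped one acting as the reference level. After the log transform, the ratio between the largest and smallest fare shrinks from roughly 70 to roughly 3.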

The critical conceptual point: a neural network with one hidden layer is just two regression models stacked, with the output of the first passed through a nonlinearity before being fed to the second. It is function composition over a parameterised function family. There is no mysticism here.

Step 9 — What's Actually Stored in a Trained Model

Howard returns to the original question: what are the weights? After gradient descent converges, the parameter tensors W₁, b₁, W₂, b₂ contain numbers that, when matrix-multiplied against an input, produce a useful output. There is no explicit rule stored. The model's "knowledge" is entirely encoded in the geometry of those high-dimensional parameter matrices.

from fastai.vision.all import *

# dls is assumed to be an image DataLoaders object built earlier
learn = vision_learner(dls, resnet34, metrics=error_rate)

# Access the underlying PyTorch module
learn.model

# Inspect the first convolution layer's weights
# Shape: (out_channels, in_channels, kernel_h, kernel_w)
learn.model[0][0].weight.shape

# Numbers that started as Gaussian noise, evolved into edge detectors
learn.model[0][0].weight.data[:2]

Those numbers started as Gaussian noise and evolved into detectors for edges, textures, and eventually semantic concepts — all through the gradient descent loop Howard just walked through from scratch.

Step 10 — Model Selection and the Pareto Frontier

The lesson opens with a survey of image model architectures benchmarked on top-1 accuracy, inference speed, and parameter count. Howard's framework for using that chart:

The wrong approach is to always pick the highest-accuracy model. The right approach is to identify the Pareto frontier (the models for which no other model is both faster and more accurate) and then pick from it based on deployment constraints. For a latency-sensitive API endpoint, a smaller EfficientNet variant may be a better choice than a more accurate but slower ViT. For an offline batch pipeline, the ranking flips.
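The selection procedure described above can be sketched as a small filter. The candidate names, latencies, and accuracies below are invented placeholders rather than figures from the benchmark chart, and `Candidate`, `dominates`, and `pareto_frontier` are hypothetical helpers.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    name: str
    latency_ms: float   # lower is better
    accuracy: float     # higher is better

def dominates(a, b):
    """a dominates b if it is no worse on both axes and strictly better on one."""
    no_worse = a.latency_ms <= b.latency_ms and a.accuracy >= b.accuracy
    strictly_better = a.latency_ms < b.latency_ms or a.accuracy > b.accuracy
    return no_worse and strictly_better

def pareto_frontier(candidates):
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]

# Invented numbers for illustration
models = [
    Candidate("small", 5.0, 0.78),
    Candidate("medium", 12.0, 0.83),
    Candidate("medium-slow", 20.0, 0.81),  # slower AND less accurate than "medium"
    Candidate("large", 45.0, 0.87),
]

frontier = pareto_frontier(models)
latency_budget_ms = 15.0
choice = max((c for c in frontier if c.latency_ms <= latency_budget_ms),
             key=lambda c: c.accuracy)

print([c.name for c in frontier])
print(choice.name)
```

The frontier drops the one model that is beaten on both axes. The deployment constraint, here a latency budget, then picks among what remains.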

Howard also makes a point about data quantity that the industry consistently gets wrong: the dominant mistake is collecting more labelled data when the model is already bottlenecked on something else. Before commissioning a labelling campaign, fit a model on what you have, examine the failure modes, and decide whether the errors are data-limited or architecture-limited.

Download the Learning Guide

A structured two-week study plan synthesising Howard's Lesson 3, the Swadia tutorial's 10 key points, and learning science research:

📥 Fast.ai Lesson 3: Structured Learning Guide
Includes: 10 Swadia key points · Two-week study plan · Checkpoint questions · Code patterns to memorise · Learning science strategies
Download PDF Download EPUB

Testimonials

"I'd worked through three different deep learning courses before finding Dr. Neal's guided approach to fast.ai. What distinguished this experience was his insistence on understanding before application — every time I thought I had the right answer, he'd ask a question that revealed I'd memorised a pattern rather than grasped the principle. It took longer, but I came out the other side genuinely able to build things, not just run notebooks."

— Dr. M., Computational Biologist, UCL

"The fast.ai curriculum is exceptional, but navigating it alone meant I kept stopping at the same points. Dr. Neal's capacity to identify precisely which conceptual link was missing — and to supply exactly the right analogy or worked example to repair it — is something I've rarely encountered in 20 years of postgraduate education. He manages the rare combination of technical rigour and genuine patience for the learner's pace."

— Prof. P., Department of Computer Science, University of Edinburgh

"Genuinely transformed how I work. Three months after completing the fast.ai course under Dr. Neal's guidance, I had deployed a custom NLP pipeline that is saving my research group approximately 15 hours per week on literature screening. I couldn't have built that from the course alone — the targeted mentorship was what converted understanding into deployment."

— Dr. K., Senior Research Fellow, Pharmacology

"Excellent teaching. Clear, direct, technically uncompromising."

— Ms. R., Lead Data Analyst, NHS Digital

"What I will remember most about studying with Dr. Neal is his uncommon willingness to let you stay confused for exactly as long as is productive, and then to step in with precisely the right intervention. He does not rush to resolve discomfort — he understands that struggle is often where the real learning is happening. At the same time, he has an acute sense of when confusion has become discouraging rather than productive, and pivots accordingly. I've studied with a number of excellent teachers over a long academic career; Dr. Neal's attunement to the individual learner's state is genuinely exceptional."

— Prof. A., Emeritus Professor of Statistics, Imperial College London

Key Takeaways

Gradient descent is parameter search guided by local slope information. The loss surface is high-dimensional and non-convex, but local gradient information is sufficient to make progress because the surface is smooth enough in practice that short steps in the downhill direction rarely get stuck catastrophically.

ReLUs are not the only activation function, but they are the canonical example of why nonlinearity is necessary. A stack of purely linear layers collapses to a single linear transformation regardless of depth.

Matrix multiplication is not an implementation detail — it is the reason deep learning runs on GPUs and scales to billion-parameter models.

Normalisation and log-transforming skewed features are not preprocessing niceties. They directly affect the conditioning of the loss surface, and with it how quickly gradient descent converges.
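The takeaway about linear layers collapsing can be checked with one-dimensional "layers". This is a minimal sketch with arbitrary weights chosen for the example, and `make_linear` is a hypothetical helper.

```python
def make_linear(w, b):
    """A one-dimensional linear layer: x -> w*x + b."""
    def layer(x):
        return w * x + b
    return layer

def relu(x):
    return max(0.0, x)

layer1 = make_linear(2.0, -1.0)
layer2 = make_linear(-3.0, 0.5)

def stacked_linear(x):
    return layer2(layer1(x))

# -3*(2x - 1) + 0.5 simplifies to a single layer with w = -6, b = 3.5
collapsed = make_linear(-3.0 * 2.0, -3.0 * -1.0 + 0.5)

def with_relu(x):
    return layer2(relu(layer1(x)))

xs = [-2.0, 0.0, 1.0, 2.0]
print([stacked_linear(x) == collapsed(x) for x in xs])

# A single linear layer has one constant slope; the ReLU version does not
slopes = [with_relu(x + 1.0) - with_relu(x) for x in (0.0, 1.0)]
print(slopes)
```

Two stacked linear layers match one linear layer at every input. Putting a ReLU between them produces a function whose slope changes, which no single linear layer can reproduce.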

What comes next: Lesson 3 closes with a preview of NLP via Hugging Face Transformers — demonstrating that the same machinery (parameterised function family, loss, gradient descent) scales to sequence-to-sequence tasks with no fundamental change in the algorithm. The architecture changes; the optimisation loop does not.

References and Resources

Primary fast.ai Resources

The Book

Deep Learning for Coders with fastai and PyTorch — Amazon (Howard & Gugger, O'Reilly 2020)
Full book available free on GitHub — the entire text, including all notebooks, is openly available for those who cannot afford to purchase it.

Historical Context

McCulloch, W.S. & Pitts, W. (1943). A Logical Calculus of the Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics, 5(4), 115–133.
Rosenblatt, F. (1958). The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, 65(6), 386–408.
Rumelhart, D.E., Hinton, G.E. & Williams, R.J. (1986). Learning Representations by Back-propagating Errors. Nature, 323, 533–536.

Learning Science

Bjork, R.A. (1994). Memory and Metamemory Considerations in the Training of Human Beings. In J. Metcalfe & A. Shimamura (Eds.), Metacognition. MIT Press.
Sweller, J. (1988). Cognitive Load During Problem Solving. Cognitive Science, 12(2), 257–285.
Cepeda, N.J. et al. (2006). Distributed Practice in Verbal Recall Tasks. Psychological Bulletin, 132(3), 354–380.


Post by Dr. Neal Aggarwal · drnealaggarwal.info · 40+ years teaching IT and AI

📚 Next in the series: Lesson 02 — Build a Neural Network in a Spreadsheet. Theory into practice: a working neural network with no code and no libraries, just arithmetic you can see and touch. All lessons, in order, on the Lessons page.