My team is hiring a Research Scientist (recent PhD grad or with a couple of years' experience)
Mission: develop fundamental, state-of-the-art tech at the intersection of imaging, vision, and machine learning. Come work w/ @2ptmvd, @hossTale & me
Apply:
Years ago when my wife and I were planning to buy our home, my dad stunned me with a quick mental calculation of loan payments. I asked him how - he said he'd learned the strange formula for compound interest from his father, who was a merchant in 19th century Iran.
🧵 1/4
I published this in a 1-pager:
P. Milanfar, “A Persian Folk Method of Figuring Interest”, Mathematics Magazine, vol. 69, no. 5, Dec. 1996
My late dad refused to be a co-author. But when it appeared, he printed it out, framed it, and hung it on the wall of the house. 🙂
4/4
(1/5) One of the most surprising and little-known results in classical statistics is the relationship between the mean, median, and standard deviation. If the distribution has finite variance, then the distance between the median and the mean is bounded by one standard deviation.
Almost every technical person knows about least-squares (LS), but most don't know about *total* least-squares (TLS).
These measure fitting error differently: LS minimizes the sum of squared vertical distances, whereas TLS minimizes the sum of squared orthogonal distances from the data to the fitted line
1/2
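A minimal NumPy sketch of the two fits (the data and noise level here are made up for illustration): LS regresses y on x, while TLS takes the principal direction of the centered data.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=1.0, size=x.size)  # noisy line (toy data)

# Ordinary least squares: minimize sum of squared *vertical* distances.
A = np.column_stack([x, np.ones_like(x)])
slope_ls, intercept_ls = np.linalg.lstsq(A, y, rcond=None)[0]

# Total least squares: minimize sum of squared *orthogonal* distances.
# The TLS line passes through the centroid; the smallest singular
# vector of the centered data matrix is the line's normal.
X = np.column_stack([x - x.mean(), y - y.mean()])
_, _, Vt = np.linalg.svd(X, full_matrices=False)
nx, ny = Vt[-1]                       # normal to the TLS line
slope_tls = -nx / ny
intercept_tls = y.mean() - slope_tls * x.mean()

print(f"LS : y = {slope_ls:.3f} x + {intercept_ls:.3f}")
print(f"TLS: y = {slope_tls:.3f} x + {intercept_tls:.3f}")
```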
I’m just an average researcher. But something I’ve learned in 30 years of being a researcher is that if you’ve convinced yourself at age 25 that you can teach others how to be great researchers, you’ve still got a lot to learn.
Enjoyed visiting UC Berkeley’s Machine Learning Club yesterday, where I gave a talk on doing AI research. Slides:
In the past few years I’ve worked with and observed some extremely talented researchers, and these are the trends I’ve noticed:
1. When
The retina is arguably the most impressive part of the brain.
It’s the only part of the brain that faces the world directly - it’s a sensor and processor in one
It consumes 50% more energy per gram than the rest of the brain.
1000:1 compression from retina to optic nerve
There’s a single formula that makes all of your diffusion models possible: Tweedie's
Say 𝐱 is a noisy version of 𝐮 with 𝐞 ∼ 𝒩(𝟎, σ² 𝐈)
𝐱 = 𝐮 + 𝐞
MMSE estimate of 𝐮 is 𝔼[𝐮 | 𝐱] and would seem to require P(𝐮|𝐱). Yet Tweedie says P(𝐱) is all you need
1/3
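A tiny sanity check (my own illustration): with a Gaussian prior the score of the marginal is analytic, and Tweedie's formula 𝔼[𝐮 | 𝐱] = 𝐱 + σ² ∇ₓ log P(𝐱) reproduces the classical posterior mean.

```python
import numpy as np

# Gaussian prior u ~ N(mu0, tau^2), noise e ~ N(0, sigma^2);
# all numbers are illustrative.
mu0, tau, sigma = 1.0, 2.0, 0.5
x = 3.0

# Marginal: x ~ N(mu0, tau^2 + sigma^2), so its score is analytic.
score = -(x - mu0) / (tau**2 + sigma**2)
tweedie = x + sigma**2 * score           # E[u|x] via Tweedie

# Classical Gaussian posterior mean, for comparison.
posterior_mean = (tau**2 * x + sigma**2 * mu0) / (tau**2 + sigma**2)

print(tweedie, posterior_mean)           # both ~2.882
```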
The Kalman Filter was once a core topic in EECS curricula. Given its relevance to ML, RL, Ctrl/Robotics, I'm surprised that most researchers don't know much about it, and many papers just rediscover it. KF seems messy & complicated, but the intuition behind it is invaluable
1/4
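A minimal 1-D sketch of the predict/update loop (a toy constant-velocity model; all parameters are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
dt, q, r = 1.0, 1e-3, 0.25            # time step, process var, meas var
F = np.array([[1.0, dt], [0.0, 1.0]]) # constant-velocity dynamics
H = np.array([[1.0, 0.0]])            # we observe position only
Q = q * np.eye(2)
x = np.zeros(2)                       # state estimate (pos, vel)
P = np.eye(2)                        # estimate covariance

true_pos = 0.1 * np.arange(50)        # truth: velocity 0.1 per step
for z in true_pos + rng.normal(scale=0.5, size=50):
    # Predict through the dynamics.
    x = F @ x
    P = F @ P @ F.T + Q
    # Update with the new measurement.
    S = H @ P @ H.T + r               # innovation variance
    K = P @ H.T / S                   # Kalman gain
    x = x + (K * (z - H @ x)).ravel()
    P = (np.eye(2) - K @ H) @ P

print("estimated velocity:", x[1])    # should land near 0.1
```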
Think you understand Maximum Likelihood? Think again.
It seems like such a natural idea, but there’s an epic and turbulent history with numerous assaults on the core idea, culminating in a beautiful and complicated theory.
A highly entertaining account:
I'm releasing all the lectures and notes for an introductory course on Statistical Detection and Estimation I used to teach. The core material hasn't changed - it was an EE course, but it's as relevant today to AI researchers as ever before. Hope you find it useful.
Covers:
*
For a generation of Iranians who came of age during & just after the Islamic Revolution of 1979, current events in Afghanistan are not just heartbreaking, but also deeply personal & resonant with our own experiences.
I was smuggled alone out of Iran. I was 15. Here's my story.
One of the most surprising & little-known results in statistics is that the mean (μ) and median (m) are within a standard deviation (σ) of each other
|μ−m| ≤ σ
For unimodal densities the bound is even tighter
|μ−m| ≤ √(3/5) σ ≈ 0.7746 σ
This beautiful result first appeared in a 1932 paper by Hotelling & Solomons
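A quick check (my own example): for the unit-rate exponential, μ = 1, m = ln 2 ≈ 0.693, σ = 1, so |μ−m| ≈ 0.307 ≤ σ.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)   # a skewed distribution
mu, med, sigma = x.mean(), np.median(x), x.std()
print(abs(mu - med), "<=", sigma)                # ~0.307 <= ~1.0
```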
How to read a paper online:
1: open PDF in browser
2: skim abstract & figures
3: leave tab open for weeks
4: accidentally close tab
5: search for paper again
6: go to step 1
The original PageRank paper, the algorithm powering Google search, published in 1998, has been cited 17,139 times to date.
The original ResNet paper, published in 2016, has been cited 146,746 times to date.
To me, this seems extremely weird.
It’s been 20 years since I submitted my first paper with Nhat Nguyen and the late great Gene Golub on multi-frame super-res (SR). Here’s a thread, a personal story of SR as I’ve experienced it. It won’t be exhaustive or fully historical. Apologies to colleagues for any omissions
1/6 Iterating (i.e. repeatedly composing) a function is tricky business. You can get wild (even chaotic) behavior with simple functions like r(x) = cx(1-x)
That's why it's important to choose the nonlinear activation functions very carefully in neural networks.
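A sketch of that sensitivity (toy numbers of my own choosing): the same quadratic map is tame at c = 2.5, periodic at c = 3.2, and chaotic at c = 4.

```python
def iterate(c, x0=0.2, n=50):
    """Apply r(x) = c*x*(1-x) to x0, n times."""
    x = x0
    for _ in range(n):
        x = c * x * (1 - x)
    return x

for c in (2.5, 3.2, 4.0):
    # Two starting points 1e-9 apart: identical fates for small c,
    # wildly different ones at c = 4 (sensitive dependence).
    print(c, iterate(c, 0.2), iterate(c, 0.2 + 1e-9))
```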
Even technical people get this wrong:
Sample Standard Deviation (SD) vs Standard Error (SE)
You want an estimate m̂ of m=𝔼(x) from N independent samples xᵢ. Typical choice is the average or "sample" mean
How stable is this estimate? The Standard Error (SE) tells you
1/6
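A Monte Carlo sketch (illustrative numbers): the sample mean of N standard-normal draws fluctuates with spread SD/√N, not SD.

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 100, 10_000

# Repeat the experiment many times: each row is one dataset of N samples.
means = rng.normal(size=(trials, N)).mean(axis=1)

print("spread of the sample mean:", means.std())      # ~0.1
print("SE = SD / sqrt(N):       ", 1 / np.sqrt(N))    # 0.1
```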
The perpetually undervalued least-squares:
minₓ‖y−Ax‖²
can teach a lot about some complex ideas in modern machine learning, including overfitting & double-descent.
Let's assume A is n-by-p. So we have n data points and p parameters
1/n
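A minimal sketch of the two regimes (my own toy data): for n > p the fit leaves residual error; for n < p the pseudoinverse interpolates the data exactly and returns the minimum-norm solution, the object double-descent analyses study.

```python
import numpy as np

rng = np.random.default_rng(0)

for n, p in [(100, 10), (10, 100)]:
    A = rng.normal(size=(n, p))
    y = rng.normal(size=n)
    x = np.linalg.pinv(A) @ y          # min-norm least-squares solution
    print(f"n={n:3d} p={p:3d}  residual={np.linalg.norm(A @ x - y):.3f}"
          f"  ||x||={np.linalg.norm(x):.3f}")
```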
Why don’t more people know about the gem that is Tweedie's formula?
Say 𝐱 is a noisy measurement of 𝐮
𝐱 = 𝐮 + 𝐞
w/ 𝐞 ∼ 𝒩(𝟎, σ² 𝐈)
The minimum mean-square (MMSE) estimate of 𝐮 is 𝔼[𝐮 | 𝐱]. Obviously we need the density P(𝐮|𝐱), right?
No! Tweedie says P(𝐱) is all you need!
1/2
Image-to-image models have been called 'filters' since the early days of comp vision/imaging. But what does it mean to filter an image?
If we choose some set of weights and apply them to the input image, what loss/objective function does this process optimize (if any)?
1/8
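One standard answer, sketched below (this is the textbook view, not necessarily the thread's full argument): a row-normalized weighted average is, pixel by pixel, the minimizer of a weighted quadratic loss. The weights and bandwidth here are my own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=20)                       # toy 1-D "image"

# Similarity weights, row-normalized to sum to 1.
W = np.exp(-(g[:, None] - g[None, :]) ** 2 / 0.5)
W /= W.sum(axis=1, keepdims=True)

y_filt = W @ g                                # "filtering" the image

# For each pixel i, y_filt[i] = argmin_z sum_j W[i,j] * (z - g[j])**2,
# since a weighted quadratic is minimized by the weighted mean. Check i=7:
zs = np.linspace(g.min(), g.max(), 10_001)
loss = (W[7] * (zs[:, None] - g[None, :]) ** 2).sum(axis=1)
print(y_filt[7], zs[loss.argmin()])           # agree to grid precision
```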
“Mathematical rigor is like clothing: in its style it ought to suit the occasion, and it diminishes comfort and restricts freedom of movement if it is either too loose or too tight”
-G.F. Simmons
Physicists/Engineers know this well - too much rigor induces a fear of making
the only people I’ve seen dismissing this insightful article are ones who don’t seem to understand compression very well
the only surprise(?) is that it took a skilled writer and thinker, instead of one of us researchers, to make the case crystal clear
"Non-parametric" regression is often misunderstood. Are there no parameters? Hardly
It's just that non-parametric methods don't fit an explicit global model.
The overall fit is a patchwork of many local fits, whose total # of parameters may even exceed the # of data points.
1/n
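A minimal sketch with Nadaraya-Watson kernel regression (toy data and bandwidth of my own choosing): every prediction is a local weighted average, so the training points themselves play the role of parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 2 * np.pi, 100))
y_train = np.sin(x_train) + rng.normal(scale=0.2, size=100)

def nw_predict(x_query, h=0.3):
    # Gaussian weights from each query point to every training point.
    w = np.exp(-((x_query[:, None] - x_train[None, :]) ** 2) / (2 * h**2))
    w /= w.sum(axis=1, keepdims=True)
    return w @ y_train          # local weighted average of the data

x_query = np.linspace(0, 2 * np.pi, 5)
print(nw_predict(x_query))      # tracks sin(x) with no global model
```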
What do polar coordinates, polar matrix factorization, & Helmholtz decomposition of a vector field have in common? They’re all implied by Brenier’s Theorem: a cornerstone of Optimal Transport theory. It’s a fundamental decomposition result & deserves to be better known.
1/5
Here's a neat trick to impress your friends
Let's say you have some curve with a random shape, possibly even self-intersecting. Can you measure its length?
This isn't just a parlor trick -- it has many practical applications. For example, the curve could be a strand of DNA
1/n
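The trick itself isn't in this excerpt; one classical route (my assumption about where the thread goes) is the Cauchy-Crofton / Buffon's-noodle fact that a curve of length L dropped at random on parallel lines spaced d apart crosses them 2L/(πd) times on average, so counting crossings estimates length.

```python
import numpy as np

rng = np.random.default_rng(0)

# A wiggly test curve as a fine polyline; true length by summation.
t = np.linspace(0, 2 * np.pi, 2000)
curve = np.column_stack([t, np.sin(3 * t)])
seg = np.diff(curve, axis=0)
true_len = np.sqrt((seg**2).sum(axis=1)).sum()

# Drop the curve at a random angle and offset onto lines y = k*d and
# count crossings; E[#crossings] = 2L/(pi*d), so L ~ pi*d*n/2.
d, trials, crossings = 0.5, 2000, 0
for _ in range(trials):
    a = rng.uniform(0, np.pi)
    R = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
    y = (curve @ R.T)[:, 1] + rng.uniform(0, d)
    crossings += np.sum(np.floor(y[1:] / d) != np.floor(y[:-1] / d))

print(true_len, np.pi * d * crossings / (2 * trials))
```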
Take pixels gᵢ=g(xᵢ,yᵢ) of an image as nodes in a weighted, undirected graph. The weights on each edge are the similarity between pixels, measured w/ a sym pos def kernel
k(i,j) = exp[−d(gᵢ, gⱼ)]
g is encoded in K. What can we learn about g from K? Can we get g back from K?
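A concrete toy version of the setup (the distance and bandwidth are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.uniform(size=16)                # 16 pixel values (toy "image")

d = (g[:, None] - g[None, :]) ** 2      # squared distance between pixels
K = np.exp(-d / 0.1)                    # symmetric positive definite kernel

# g is encoded in K: e.g. its spectrum reflects the structure of g.
evals = np.linalg.eigvalsh(K)
print(evals[-3:])                       # a few dominant eigenvalues
```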
This isn’t the abstract of the paper. It’s the whole paper.
“Equilibrium points in n-person games”
by John F. Nash Jr., PNAS January 1, 1950 36 (1) 48-49
It's a privilege to work with so many talented folks in @GoogleAI. The body of work @JeffDean describes in this blog is so broad and deep - I'm learning something new every day and fortunate to be able to contribute with my team on the vision/imaging side.
It’s been >20 years since I published my first work on multi-frame super-res (SR) w/ Nhat Nguyen and the late great Gene Golub. Here’s my personal story of SR as I’ve experienced it from theory, to practical algorithms, to deployment in product. In a way it’s been my life’s work
Congratulations to the authors of this lovely work that just won a best paper award at @siggraph #siggraph2023.
They achieve state-of-the-art visual quality with real-time (≥ 100 fps) novel-view synthesis at 1080p resolution, far exceeding NeRF approaches on both quality and
This is not a scene from Inception. The sorcery is real: this photo was taken with a very long focal-length lens. When the focal length is long, the field of view becomes very small and the resulting image appears flatter.
1/4
Motion blur is often misunderstood, because people think of it in terms of a single imperfect image captured at some instant in time.
But motion blur is in fact an inherently temporal phenomenon. It is a temporal convolution of pixels (at the same location) across time.
1/4
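A minimal sketch (a synthetic moving dot, my own construction): averaging frames over time, pixel by pixel, produces exactly the familiar streak.

```python
import numpy as np

# Motion blur as *temporal* convolution: average each pixel across frames.
T, W = 16, 32
frames = np.zeros((T, W))
for t in range(T):
    frames[t, 4 + t] = 1.0        # a dot moving one pixel per frame

blurred = frames.mean(axis=0)     # box filter across time, per pixel
print(np.nonzero(blurred)[0])     # the dot is smeared along its path
```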
The Gaussian is a nice bumpy shape, but sometimes we hope for a smooth (i.e. C∞) function like the Gaussian that is 𝒂𝒍𝒔𝒐 compactly supported.
One such class of functions is called "Bump functions"
1/6
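The classic example, as a quick sketch: smooth everywhere, yet exactly zero outside [−1, 1].

```python
import numpy as np

def bump(x):
    # C-infinity everywhere, identically zero for |x| >= 1.
    out = np.zeros_like(x, dtype=float)
    inside = np.abs(x) < 1
    out[inside] = np.exp(-1.0 / (1.0 - x[inside] ** 2))
    return out

x = np.linspace(-2, 2, 9)
print(bump(x))   # zero for |x| >= 1, positive and smooth inside
```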
1/7 The choice of nonlinear activation functions in neural networks can make a big difference. Why?
Because iterating (i.e. repeatedly composing) even simple nonlinear functions can be tricky. Wild, or chaotic behavior can emerge even with something as simple as a quadratic.
I figured out how the two formulae relate: the historical formula is the Taylor series of the exact formula around r=0.
But the crazy thing is that the old Persian formula goes back 100s (maybe 1000s) of years before Taylor's, having been passed down for generations
3/4
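A sketch of that relationship (the folk formula isn't reproduced in this excerpt, so I take the first-order expansion as its form, per the claim above; principal, rate, and term are illustrative):

```python
# Exact per-period payment on principal P at rate r over n periods:
#   A(r) = P * r * (1 + r)**n / ((1 + r)**n - 1)
# First-order Taylor expansion around r = 0 (my reconstruction of the
# folk formula's form, per the claim above):
#   A(r) ~ P/n + P * r * (n + 1) / (2 * n)
P, n = 100_000, 240

def exact(r):
    return P * r * (1 + r) ** n / ((1 + r) ** n - 1)

def folk(r):
    return P / n + P * r * (n + 1) / (2 * n)

for r in (0.001, 0.005):
    print(r, exact(r), folk(r))        # close for small r
```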
A quick thread on a ‘simple’ problem: denoising
x = u + e
Noisy signal = x
Clean signal = u
Gaussian iid noise = e
Two well-known approaches:
MAP: Maximum a-posteriori
MMSE: Minimum mean square error
They don’t coincide, but share several interesting properties
1/4
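A minimal numerical sketch of the gap (the Laplacian prior and all numbers are my own illustrative choices): MAP gives soft-thresholding, MMSE gives the posterior mean, and they visibly differ.

```python
import numpy as np

s, b = 1.0, 1.0                          # noise std, prior scale
x = 1.5                                  # one noisy observation

# MAP: argmax of p(u|x) under a Laplacian prior = soft-thresholding.
u_map = np.sign(x) * max(abs(x) - s**2 / b, 0.0)

# MMSE: the posterior mean E[u|x], by brute-force numerical integration.
u = np.linspace(-20, 20, 200_001)
post = np.exp(-((x - u) ** 2) / (2 * s**2) - np.abs(u) / b)
u_mmse = (u * post).sum() / post.sum()

print(u_map, u_mmse)                     # they don't coincide
```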