Collection of Subjectively Interesting Papers via Pinterest

 

<shortrandomblogpost>

I understand that Pinterest is not a very popular service among academics, but I’ve found it to be useful for keeping track of papers that have made an impression on me. I also wish others took some time to curate their own lists, as it would help me build reading lists.

Find the board here: http://pinterest.com/karpathy/research/

Have I missed some awesome papers?

</shortrandomblogpost>

On Expediting the Discovery of Relevant Academic Literature

I wanted to share a few quick thoughts and analysis about the NIPS 2012 papers visualization page I put together a few weeks ago and maybe get a bit of discussion going about the future. For those not familiar, very briefly, the page displays a list of all accepted papers to the conference, but also shows small paper thumbnails, a list of the top 100 words in each paper color-coded based on topics, and offers functionality to sort all accepted papers based on a topic or similarity to any paper. This allows one to sift through the huge number of papers and quickly find the ones that are most relevant to them. The page went on to collect a few thousand hits over a period of a few weeks (details in figures, for those interested).
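
To make the "sort by similarity to any paper" idea concrete, here is a minimal sketch. It assumes each paper is reduced to plain word counts and compared with cosine similarity; the actual page may compute similarity differently (it also uses LDA topics), and the toy `papers` dictionary and function names are hypothetical.

```python
import math
from collections import Counter

def word_counts(text):
    """Lowercased bag-of-words counts for one paper's text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_by_similarity(query_title, papers):
    """Return the other titles sorted from most to least similar to the query."""
    vectors = {title: word_counts(text) for title, text in papers.items()}
    query = vectors[query_title]
    others = [t for t in papers if t != query_title]
    return sorted(others, key=lambda t: cosine(query, vectors[t]), reverse=True)

# Hypothetical toy "papers"; real input would be the text extracted from each PDF.
papers = {
    "A": "deep neural network image classification convolution",
    "B": "convolution neural network object detection image",
    "C": "bandit regret bound online learning",
}
ranking = rank_by_similarity("A", papers)
```

Here paper B shares four words with A and paper C shares none, so B is ranked first.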

The project started very innocently with my frustration of going through the accepted papers on the official NIPS page on a late Friday. It took only a few minutes before I threw my hands up, closed the tab and spent the next few AM hours putting together version 0.1, then thought I should release it because I found it personally useful, and then I added the LDA and other bells and whistles on top as it somehow became popular over the weekend.

My takeaway from the popularity of the page is that there is demand for these kinds of visualizations and interactive ranking schemes and I’ve become quite excited about possible future directions as a result. I already had a few suggestions from people at NIPS ranging from personalized recommendations for particular authors to various other fancy visualizations and embeddings, or more social comment/voting features. I’m soliciting more thoughts.

But more generally, something I’m also excited about and already started to build is to extend this into a full-blown academic papers search engine. Because Google Scholar is … uh… okay, but I think one can go way beyond what’s done there in terms of presentation, and that presentation should not be underestimated. I wrote large chunks of the client/backend for it already in Python but I’m currently stuck on parsing papers from NIPS/ICML/etc. for previous years and creating the structured database out of the unstructured mess that exists out there spread across several pages and papers that change their format every year. That’s the bottleneck that requires a lot of manual and tedious effort and I’m not quite sure how to deal with it. I will probably end up spending the time to do a few conferences/years, and then if it turns out that the service is at all interesting or useful, see if I can put it up on github, and get others to help crowdsource more data? (I already tried once a year ago as it turns out, but that was mostly a failure due to some good reasons). The details are fuzzy, but at least the idea is that what you see on the page would be more of a special case of a search for NIPS 2012, which one could go on to refine interactively based on additional queries that bias ranking towards certain topics, authors, keywords, etc.

In the short term, I’d like to continue producing a similar page for conferences when I can spare the time. I also released the code on Github under my favorite licence (WTFPL licence: “Do What The Fuck You Want To Public License” :D ), and I welcome and encourage anyone to build on it and release their own versions. In the longer term, my eyes are set more generally on visually nice, user-friendly (these two are very important in my mind) and competent academic search across conferences, perhaps starting with a particular niche first. As I mentioned, I am a little stuck on that one for now.

I welcome any thoughts on this page, its short-term use for upcoming conferences, and more generally on how we could go about building pages that help us expedite discovery and analysis of relevant academic literature.

Renewable Energy, Climate Change

I’ve recently become interested in sustainable energy, climate, green technologies, electric cars, etc. I’ve been reading random blogs and articles about these topics for a while, but only recently have I decided to investigate these issues more exhaustively after discovering that David MacKay (yes, the awesome Machine Learning / Physicist one) has written a (free PDF!) book about these topics called “Sustainable Energy: Without the hot air” in 2008. A few months after he published the book, he was appointed Chief Scientific Advisor to the Department of Energy and Climate Change in the UK, where he now works 80% of his time (and 20% back at Cambridge). The book is an interesting read and I thought it would be fun to dedicate a blog post to my own (shorter) notes, interpretations and conclusions. Who knows, maybe I’ll sway a few of my readers to become just as obsessed :)

Problem statement: Climate and Energy. So here’s the problem. As humans, we do a lot of stuff (travel, eat, heat, build, etc.) and doing stuff requires energy. Presently, about 80% of that energy can be traced back as coming from burning fossil fuels (coal, oil, gas mined from Earth). Now, the problem is that fossil fuels are a finite resource and we are consuming them at alarming rates. In fact, at predicted rates we may run out of these precious resources in 50-150 years. The problem gets worse: fossil fuels are a very useful form of matter that takes hundreds of millions of years to form underground and can be used in all kinds of interesting ways (to create plastics, for example) other than simply burning them for energy.

But the problem gets even worse: burning fossil fuels releases large amounts of Carbon Dioxide into the atmosphere and this is very worrying. CO2 is a greenhouse gas and a surplus of this gas in the atmosphere above what is nominally produced by Earth leads to rising temperatures, which among other things melts ice, and just generally causes a whole cascade of events that upset the balance of the entire ecosystem. The problem is that nobody is really certain just how fragile our ecosystem is when faced with this sudden rise of CO2 levels in these last ~200 years since the industrial revolution and there are a lot of scary disaster scenarios involving feedback loops that end with irreversible damage done to the planet and its life. The bottom line is that Earth’s climate is a complex system that is infinitely precious, and by burning fossil fuels we are really stretching its limits and playing with fire. The conclusion is inescapable and clear: we need to significantly reduce the rate at which we burn fossil fuel, we need to do it very quickly, and we need to do it in the face of ever-increasing demand for energy from the fast-paced society we live in. So, what are our options?

Renewable Energy options. Let’s first consider harvesting sustainable energy from the most preferable energy source: the sun. We are under constant bombardment at a rate of about 174 petawatts (this is a LOT, by the way, human energy consumption is at about 0.01% of this) of FREE energy from the sun. About 30% is reflected back to space by Earth, but the other 70% is pumped into clouds, oceans and the land mass in various ways.
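
As a quick sanity check on those figures, the arithmetic can be written out. This is only a back-of-the-envelope sketch using the rounded numbers quoted above (174 PW, 30% reflected, 0.01% human share); the variable names are mine.

```python
SOLAR_INPUT_W = 174e15        # 174 petawatts reaching Earth (figure from the text)
REFLECTED_FRACTION = 0.30     # roughly 30% reflected back to space
HUMAN_SHARE = 0.0001          # about 0.01% of the solar input

# Power that actually ends up in clouds, oceans and land.
absorbed_w = SOLAR_INPUT_W * (1 - REFLECTED_FRACTION)
# Human consumption implied by the 0.01% figure.
human_use_w = SOLAR_INPUT_W * HUMAN_SHARE

print(f"absorbed by Earth: {absorbed_w / 1e15:.0f} PW")
print(f"implied human consumption: {human_use_w / 1e12:.1f} TW")
```

That works out to roughly 122 PW absorbed, against an implied human consumption of about 17.4 TW.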

- Our first chance to harvest this energy is most directly through solar panels.
- A part of this energy goes into warming air, which rises and causes convection, wind, cyclones, etc. This 2nd grade sun’s energy can be harvested with wind turbines. 
- Wind blows over oceans and causes waves. Waves are 3rd grade sun’s energy and can be harvested on ocean surfaces using wave energy converters.
- Heated water also evaporates and rains back down on Earth. Flowing water in rivers can be harvested as 2nd grade sun’s energy through hydro plants.
- Earth’s biomass absorbs sunlight through photosynthesis to create plants and animals that feed on plants. Plants, animals and their byproducts (for example, ethanol) can be harvested for energy, but this is not always considered renewable because a lot of other nutrients are consumed in the process.

All of the above options have their pros and cons that I will go into shortly. However, we are not done with Earth’s sources of energy! Earth has more energy stored in it that can be for all practical purposes considered renewable. Mainly:

- Earth has a huge amount of molten hot material under surface. This heat can be harvested by digging tunnels deep into our crust. That is, we can harvest geothermal energy.
- The Moon exerts its gravitational influence on our planet and causes tides. Energy from all this water rising and falling can be harvested through tidal pools and similar technologies. Exactly where does this energy come from, you may ask? :) What is being used up? It’s not obvious but the energy in tides can be traced back as coming from Earth’s rotational energy. Tides are actually “using up” Earth’s rotational energy and Earth is actually slowing down its rotation as a result (see Tidal acceleration)! For example, it turns out that around 600 million years ago, a day was about 22 hours.

Debatably-renewable Energy options. Two other contenders for energy harness the energy stored in the nuclei of atoms found on our planet. Mainly, I’m referring to energy stored in certain heavy elements such as Uranium, Plutonium and Thorium that can be harvested through Nuclear Fission, and binding energy due to the strong nuclear force that can one day (maybe) be harnessed by fusing light nuclei (such as the deuterium in heavy water, or lithium) in Nuclear Fusion.

Pros/Cons, Economics. The short story is this. Photovoltaics is a young technology but shows the most long-term promise for clean, renewable energy and is my personal favorite by far. The main limitation right now is the cost of the technology. We simply haven’t yet figured out how we can build cheap solar panels and manufacture them at scale. Wind power is also a promising and clean resource that can be used alongside solar panels. However, both of these sources suffer from being intermittent because they depend on cloud cover and the amount of wind. This is a big problem because it is difficult, lossy and expensive to store energy for release at will. Ideally, we would be able to extract exactly as much energy as is needed at any point in time and no more. Our best options for more reliable and stable energy are hydro (as water’s potential energy can be cheaply stored in dams) and geothermal energy. However, both of these are not very scalable so should be used in limited quantities to supplement wind and solar when supply is not meeting demand.

Nuclear fission is a controversial source of power due to worries about storage of radioactive byproducts that take on the order of thousands of years to decay and must be carefully stored deep underground. In addition, people are worried about nuclear material leaks through error/terrorism and the potential of these plants to be used as an excuse to build nuclear weapons (see this TED talk discussion on whether or not we need Nuclear, from people who actually know what they’re talking about). From what I understand though, it is also not clear if we can go on using fission forever because we are consuming minable Uranium that will run out on a scale of a hundred years or so. We may be able to use so-called fast breeder reactors while extracting Uranium from the ocean, which would allow the technology to yield a lot of energy over very long time-scales, but these are mostly conjectures at the moment. Similarly, Nuclear Fusion is presently the stuff of dreams and no one is certain if it will ever work. Projections for working reactors are currently decades out. However, if we were able to get nuclear fusion to work, it would completely and utterly dwarf all other renewable energy sources put together and provide clean energy for millions of years. Something to keep your eyes on! :)

Final opinion: My uneducated novice opinion on what government should do about this crisis, based on what I read so far: Crank up production of mostly solar and a little more wind. Start reducing contribution from fossil fuels and more slowly that of nuclear power. (I oppose Nuclear power but I am also worried we may need it. For now, I choose to trust some research reports that suggest that we don’t need it and I also choose to believe that through research we can significantly improve solar technology and its scalability.) Next, build a few fewer 2-billion-dollar carrier ships and throw the first half of that money into photovoltaics research: Incentivise use of solar power through tax cuts to create additional demand and support startups and technology companies that enter this sector. Throw the other half into programs that support purely electric and self-driving vehicles. I also don’t think the popular opinion among ordinary people should be underestimated as a catalyst for change. Spend a last percent or two on making green technologies cool to ordinary people through propaganda programs: YouTube channels, viral videos, interactive sites, and getting popular media figures to endorse these technologies and educate their followers.

What can you do? Based on David MacKay’s analysis, the biggest energy sinks of an average person that can be influenced are Transportation (car) and Heating/Cooling in your house. So here’s what you should do: Buy a Tesla Model S all-electric vehicle (or one of its descendants in the near future). These cars can now also be charged on the Supercharger network for free, and the Supercharger network gets power from solar, so you can be riding for free on pure sunlight! Next, work on reducing the consumption of the power-hungry heating system in your house. For this, consider getting Nest, the learning thermostat, and also consider upgrading the insulation in your house. Replace all your light bulbs with new, significantly more efficient LED lights. Finally, for extra cool points cover your roof with solar panels using, for example, Solar City.

Future plans. I dream of a future in which we consume 100% renewable energy (mostly solar, wind, some hydro) and ride around exclusively in self-driving, fully electric vehicles. I’ve read through a few reports (like this one from Stanford) that outline plans to transition to 100% renewable energy usually by around 2050. Obama called for 80% renewable energy by 2035, but naturally some proponents think it is too ambitious. Meanwhile, I think Denmark is in the lead, as it has passed legislation that commits the country to 100% renewable by 2050. I hope to see more countries follow!

BONUS: some notes on my future home :)

The state of Computer Vision and AI: we are really, really far.

The picture on the left is funny.

But for me it is also one of those examples that make me sad about the outlook for AI and for Computer Vision. What would it take for a computer to understand this image as you or I do? I challenge you to think explicitly of all the pieces of knowledge that have to fall in place for it to make sense. Here is my short attempt:

- You recognize it is an image of a bunch of people and you understand they are in a hallway
- You recognize that there are 3 mirrors in the scene so some of those people are “fake” replicas from different viewpoints.
- You recognize Obama from the few pixels that make up his face. It helps that he is in his suit and that he is surrounded by other people with suits.
- You recognize that there’s a person standing on a scale, even though the scale occupies only very few white pixels that blend with the background. But, you’ve used the person’s pose and knowledge of how people interact with objects to figure it out.
- You recognize that Obama has his foot positioned just slightly on top of the scale. Notice the language I’m using: It is in terms of the 3D structure of the scene, not the position of the leg in the 2D coordinate system of the image.
- You know how physics works: Obama is leaning in on the scale, which applies a force on it. Scale measures force that is applied on it, that’s how it works => it will over-estimate the weight of the person standing on it.
- The person measuring his weight is not aware of Obama doing this. You derive this because you know his pose, you understand that the field of view of a person is finite, and you understand that he is not very likely to sense the slight push of Obama’s foot.
- You understand that people are self-conscious about their weight. You also understand that he is reading off the scale measurement, and that shortly the over-estimated weight will confuse him because it will probably be much higher than what he expects. In other words, you reason about implications of the events that are about to unfold seconds after this photo was taken, and especially about the thoughts and how they will develop inside people’s heads. You also reason about what pieces of information are available to people.
- There are people in the back who find the person’s imminent confusion funny. In other words you are reasoning about state of mind of people, and their view of the state of mind of another person. That’s getting frighteningly meta.
- Finally, the fact that the perpetrator here is the president makes it maybe even a little funnier. You understand what actions are more or less likely to be undertaken by different people based on their status and identity.

I could go on, but the point here is that you’ve used a HUGE amount of information in that half second when you look at the picture and laugh. Information about the 3D structure of the scene, confounding visual elements like mirrors, identities of people, affordances and how people interact with objects, physics (how a particular instrument works,  leaning and what that does), people, their tendency to be insecure about weight, you’ve reasoned about the situation from the point of view of the person on the scale, what he is aware of, what his intents are and what information is available to him, and you’ve reasoned about people reasoning about people. You’ve also thought about the dynamics of the scene and made guesses about how the situation will unfold in the next few seconds visually, how it will unfold in the thoughts of people involved, and you reasoned about how likely or unlikely it is for people of particular identity/status to carry out some action. Somehow all these things come together to “make sense” of the scene.

It is mind-boggling that all of the above inferences unfold from a brief glance at a 2D array of R,G,B values. The core issue is that the pixel values are just the tip of a huge iceberg and deriving the entire shape and size of the iceberg from prior knowledge is the most difficult task ahead of us. How can we even begin to go about writing an algorithm that can reason about the scene like I did? Forget for a moment the inference algorithm that is capable of putting all of this together; how do we even begin to gather data that can support these inferences (for example how a scale works)? How do we go about even giving the computer a chance?

Now consider that the state of the art techniques in Computer Vision are tested on things like Imagenet (task of assigning 1-of-k labels for entire images), or Pascal VOC detection challenge (+ include bounding boxes). There is also quite a bit of work on pose estimation, action recognition, etc., but it is all specific, disconnected, and only half works. I hate to say it but the state of CV and AI is pathetic when we consider the task ahead, and when we think about how we can ever go from here to there. The road ahead is long, uncertain and unclear.  I’ve seen some arguments that all we need is lots more data from images, video, maybe text and run some clever learning algorithm: maybe a better objective function, run SGD, maybe anneal the step size, use adagrad, or slap an L1 here and there and everything will just pop out. If we only had a few more tricks up our sleeves! But to me, examples like this illustrate that we are missing many crucial pieces of the puzzle and that a central problem will be as much about obtaining the right training data in the right form to support these inferences as it will be about making them. Thinking about the complexity and scale of the problem further, a seemingly inescapable conclusion for me is that we may also need embodiment, and that the only way to build computers that can interpret scenes like we do is to allow them to get exposed to all the years  of (structured, temporally coherent) experience we have,  ability to interact with the world, and some magical active learning/inference architecture that I can barely even imagine when I think backwards about what it should be capable of.

In any case, we are very, very far and this depresses me. What is the way forward? :( Maybe I should just do a startup. I have a really cool idea for a mobile social local iPhone app.

EDIT: A friend pointed me to an awesome, relevant presentation by Josh Tenenbaum from AAAI 2012, “How to Grow a Mind: Statistics, Structure and Abstraction“.  I think we’re on the same page, except he’s probably at least 100x ahead of me.

Khan Academy + Computer Science

Exciting developments– Khan Academy recently revealed a neat interactive, live programming sandbox running Javascript on their website. I like that they went with Javascript + Processing library combo for this purpose. The idea is that the best way to get children interested in Computer Science is not to start getting them to write Hello World and Binary search, but to have them write cool interactive, visual demos and games. This has been my philosophy for a long time, and I’ve even tried to get my feet wet in this area by putting together a set of tutorials for making games in Python. Instead of me trying to motivate this, I recommend you read their blog post announcing this new initiative. If you’re interested in this topic, I would further recommend this neat lecture that inspired them to develop this in the first place (minutes 2-23 are most interesting and related).

Go ahead and check out the demos and starter code they’ve put together to demonstrate the power of the sandbox. For example, here are some animation demos. You write code on the left, and it is immediately executed and results are shown on the right. You can also do nifty things such as hold down the button over any number, slide mouse left or right to change it, and see the results right away on the right. Awesome!

Wasting no time, I jumped to create a few cool programs. For example,

- Here is a Mandelbrot set solver I put together in a few minutes
- Here is an N-body physical simulation of gravity, though admittedly it has a few numerical issues. Maybe I’ll try to upgrade it to Runge-Kutta integration
- Here is a heart-drawing animation for fun :)
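
For readers who have not written one, the core of a Mandelbrot renderer is tiny. The sketch below is not the Khan Academy program (that one is JavaScript with Processing); it is a plain-Python illustration of the usual escape-time test, drawn as text, with the viewport bounds and the `max_iter` default chosen arbitrarily.

```python
def escape_time(c, max_iter=50):
    """Iterations of z -> z*z + c before |z| exceeds 2 (max_iter if it never does)."""
    z = 0j
    for i in range(max_iter):
        if abs(z) > 2.0:
            return i
        z = z * z + c
    return max_iter

def render(width=60, height=24, max_iter=50):
    """Return the set as text rows: '#' for points that never escape."""
    rows = []
    for row in range(height):
        y = -1.2 + 2.4 * row / (height - 1)
        line = ""
        for col in range(width):
            x = -2.0 + 2.6 * col / (width - 1)
            inside = escape_time(complex(x, y), max_iter) == max_iter
            line += "#" if inside else "."
        rows.append(line)
    return rows

if __name__ == "__main__":
    print("\n".join(render()))
```

A graphical version only swaps the character for a pixel color picked from the escape count.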

I also ported a few fun Canvas demos you can find on the internet into their API:

- Lorenz Attractor . Go ahead and change the parameters to see how the attractor behaves! (Original attractor code taken from a gist)
- Tetris!! With the actual tetris code taken from a canvas coding blog post.
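
If you want to play with the Lorenz attractor outside the sandbox, here is a minimal Python sketch. It is not the ported gist: it assumes the standard textbook parameters (sigma = 10, rho = 28, beta = 8/3) and uses a classic fourth-order Runge-Kutta step, the same kind of integrator mentioned above as a possible upgrade for the N-body demo. The function names, step size and starting point are my own choices.

```python
def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Time derivative (dx/dt, dy/dt, dz/dt) of the Lorenz system."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

def rk4_step(f, state, dt):
    """Advance state by dt with one classic fourth-order Runge-Kutta step."""
    def shifted(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))

    k1 = f(state)
    k2 = f(shifted(state, k1, dt / 2))
    k3 = f(shifted(state, k2, dt / 2))
    k4 = f(shifted(state, k3, dt))
    return tuple(
        s + dt / 6.0 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )

def trajectory(start=(1.0, 1.0, 1.0), dt=0.01, steps=2000):
    """List of (x, y, z) points along the attractor, starting point included."""
    points = [start]
    for _ in range(steps):
        points.append(rk4_step(lorenz, points[-1], dt))
    return points
```

Changing `rho` or `sigma` in `lorenz` is the offline equivalent of dragging the numbers in the sandbox.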

Anyway, the idea is that this sandbox allows for rapid prototyping of cool visualizations, and very easy sharing of code across people to make cool things. For example, someone took the Lorenz attractor and modified it so that it is animated within a few minutes. Awesome! Anyway, I think it will be a great tool for younglings who want to learn how to think like a programmer. I am also slightly envious as I had no such fun tools to draw on when I was young. Instead, I had to write PASCAL and program projection matrices in OpenGL to get things to move :(

I am looking forward to developments in this area! I hope they implement a way to explore all these cool programs, and that they provide a nicer and more comprehensive hand-held and well-documented and explained introduction through these demos, not just a few comments. But I’m sure that’s all coming.

CVPR 2012 Highlights

CVPR 2012 just ended in Providence and I wanted to quickly summarize some of my personal highlights, lessons and thoughts.

FREAK: Fast Retina Keypoint
FREAK is a new orientation-invariant binary descriptor proposed by Alexandre Alahi et al. It can be extracted on a patch by comparing two values in the Gaussian pyramid to get every bit, similar to BRIEF. They show impressive results for discriminative power, speed, and also draw interesting connections to a model of early visual processing in the retina. Major bonus points are awarded for a beautiful C++ Open Source OpenCV compatible implementation on Github.
Philosophically and more generally, I have become a big fan of binary descriptors because they are not wasteful, in the sense that every single bit is utilized to its full potential and nothing is wasted describing the 10th decimal place. They also enable lightning-fast computation on some architectures. I’m looking forward to running a few experiments with this!
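
To show why these descriptors are so cheap, here is a generic sketch of the idea: one bit per pairwise intensity comparison, then matching by Hamming distance. It is not FREAK itself (which compares smoothed samples in a retina-like pattern); the comparison pairs and patch values are made up for illustration.

```python
def binary_descriptor(values, pairs):
    """Pack one bit per (i, j) pair: 1 if values[i] > values[j], else 0."""
    bits = 0
    for k, (i, j) in enumerate(pairs):
        if values[i] > values[j]:
            bits |= 1 << k
    return bits

def hamming(a, b):
    """Number of differing bits: XOR, then count the ones."""
    return bin(a ^ b).count("1")

# Hypothetical sampled intensities for two patches and a fixed comparison pattern.
pairs = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2), (1, 3)]
patch_a = [10, 40, 20, 30]
patch_b = [12, 38, 35, 30]

desc_a = binary_descriptor(patch_a, pairs)
desc_b = binary_descriptor(patch_b, pairs)
distance = hamming(desc_a, desc_b)
```

The two patches differ in a single comparison, so their descriptors are at Hamming distance 1. On real hardware the XOR and bit count are a couple of instructions per machine word.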
Tracking & SVMs for binary descriptors
Suppose you want to track a known object over time in an image stream. A standard way to do this would be to compute feature keypoints on the object image (using SIFT-like keypoints, say), and use RANSAC with some distance metric to robustly estimate the homography to the keypoints in the scene. Simply doing this per frame can do a decent job of detecting the object, but in the tracking scenario you can do much better by training a discriminative model (an SVM, for example) for every keypoint, where you mine negative examples from patches in the scene that are everywhere around the keypoint. This has been established in the past, for example in the Predator system.
But now suppose you have binary descriptors, such as FREAK above. Normally it is lightning fast to compute hamming distances on these, but suddenly you have an SVM with float weights, so we’re back to slow dot products, right? Well, not necessarily, thanks to this trick I noticed in this [Efficient Online Structured Output Learning for Keypoint-Based Object Tracking [pdf]] paper. The idea is to train the SVM as normal on the bit vectors, but then approximate the trained weights as a linear combination of some binary basis vectors. In practice, you can use around 2 or 3 bit vectors that, when appropriately combined with (fast) bitwise operations and linear combinations thereafter, produce a result that approximates the full dot product. The end result is that you can use a discriminative model with binary vectors and enjoy all the benefits of fast binary operations. All this comes at a small cost of accuracy due to the approximation, but in practice it looks like this works!
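
Here is a small sketch of that trick as I understand it. The weights are approximated as a sum of scaled {-1, +1} vectors using a simple greedy fit to the residual (my assumption; the paper's fitting procedure may differ), and the score for a binary descriptor is then computed with only AND and bit counts. The weights and descriptor are made up.

```python
def popcount(n):
    """Number of set bits in a non-negative integer."""
    return bin(n).count("1")

def binary_basis(weights, num_bases):
    """Greedily approximate weights as sum(beta_k * b_k), each b_k in {-1, +1}^D.

    Returns (beta, mask) pairs; bit i of mask is set where b_k[i] == +1.
    """
    residual = list(weights)
    bases = []
    for _ in range(num_bases):
        # For b = sign(residual), the least-squares scale is the mean |residual|.
        beta = sum(abs(r) for r in residual) / len(residual)
        mask = 0
        for i, r in enumerate(residual):
            if r >= 0:
                mask |= 1 << i
        residual = [r - (beta if r >= 0 else -beta) for r in residual]
        bases.append((beta, mask))
    return bases

def reconstruct(bases, dim):
    """Expand the binary bases back into an approximate float weight vector."""
    return [
        sum(beta if (mask >> i) & 1 else -beta for beta, mask in bases)
        for i in range(dim)
    ]

def fast_score(bases, x_bits):
    """Approximate w . x for a binary descriptor x using only AND and popcount."""
    ones = popcount(x_bits)
    return sum(beta * (2 * popcount(mask & x_bits) - ones) for beta, mask in bases)

# Hypothetical trained SVM weights over an 8-bit descriptor.
weights = [0.9, -0.4, 0.1, -1.2, 0.5, 0.3, -0.7, 0.2]
x_bits = 0b10110101  # bit i holds descriptor element i

def sq_error(bases):
    approx = reconstruct(bases, len(weights))
    return sum((w - a) ** 2 for w, a in zip(weights, approx))

# Squared approximation error using 1, 2 and 3 binary basis vectors.
errors = [sq_error(binary_basis(weights, k)) for k in (1, 2, 3)]
```

Each extra basis vector can only shrink the residual, which matches the observation that 2 or 3 of them are already enough in practice.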
Hedging your bets: richer outputs
I’d like to see more papers such as “Optimizing Accuracy-Specificity Trade-offs in Large Scale Visual Recognition” from Jia Deng. Jia works a lot with ImageNet, and working with these large datasets demands more interesting treatment of object categories than 1-of-K labels. The problem is that in almost all recognition tasks we work with rather arbitrarily chosen concepts that lazily slice through an entire rich, complex object hierarchy that contains a lot of compositional structure, attributes, etc. I’d like to see more work that acknowledges this aspect of the real world.
In this work, an image recognition system is described that can analyze an input image at various levels of confidence and layers of abstraction. For example if you provide an image of a car, it may tell you that it is 60% sure it’s a Smart Car, 90% sure it is a car, 95% that it is a vehicle, and 99% sure that it is an entity (the root node in the ImageNet hierarchy). I like this quite a lot philosophically, and I hope to see other algorithms that strive for richer outputs and predictions.
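
A toy version of this kind of hedged output is easy to write down. The sketch below is a simplification and not the paper's algorithm: it assumes a classifier already gives probabilities over leaf classes, sums them up a hand-made hierarchy, and returns the most specific node that still clears a confidence threshold. The hierarchy, labels and numbers are invented.

```python
# Hypothetical toy hierarchy: each node maps to its children (leaves map to []).
HIERARCHY = {
    "entity": ["vehicle", "animal"],
    "vehicle": ["car", "bicycle"],
    "car": ["smart car", "sedan"],
    "animal": ["dog"],
    "smart car": [], "sedan": [], "bicycle": [], "dog": [],
}

def node_probability(node, leaf_probs):
    """Probability of a node: its own leaf score, or the sum over its children."""
    children = HIERARCHY[node]
    if not children:
        return leaf_probs.get(node, 0.0)
    return sum(node_probability(child, leaf_probs) for child in children)

def hedged_label(leaf_probs, threshold, root="entity"):
    """Walk down from the root while the best child still clears the threshold."""
    node = root
    while HIERARCHY[node]:
        best = max(HIERARCHY[node], key=lambda c: node_probability(c, leaf_probs))
        if node_probability(best, leaf_probs) < threshold:
            break
        node = best
    return node

# Made-up leaf posteriors for one image of a small car.
leaf_probs = {"smart car": 0.60, "sedan": 0.30, "bicycle": 0.05, "dog": 0.05}
```

With these numbers, asking for at least 50% confidence returns "smart car", 80% backs off to "car", and 99% backs off all the way to "entity".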
Teaching 3D Geometry to Deformable Part Models [pdf]
Speaking of rich outputs, I was pleased to see a few papers (such as the one above, from Pepik et al.) that try to go beyond bounding boxes, or even pixel-wise labelings. If we hope to build models of scenes in all their complexity, we will have to reason about all the contents of a scene and their spatial relationships in the true, 3D world. It should not be enough to stop at a bounding box. This particular paper improves only a tiny bit on previous state of the art though, so my immediate reaction (since I didn’t fully read the paper) is that there is more room for improvement here. However, I still like the philosophy.

100Hz pedestrian detection
This paper [pdf] by Rodrigo Benenson presented a very fast pedestrian detection algorithm. The author claimed at the oral that they can run the detector at 170Hz today. The detector is based on a simple HOG model, and the reason they are able to run at such incredibly high speeds is that they use a trick from Piotr Dollar’s paper [The fastest detector in the west [pdf]] that shows how you can closely approximate features between scales. This allows them to train only a small set of SVM models at different scales, but crucially they can get away with only computing the HOG features on a single scale.

Steerable Part Models
Here’s the problem: the DPM model has all these part filters that you have to slide through your images, and it can get expensive as you get more and more parts for different objects, etc. The idea presented in this paper by Hamed Pirsiavash is to express all parts as a linear combination of a few basis parts. At test time, simply slide the basis parts through the image and compute the outputs for all parts using the appropriately learned coefficients. The basis learning is very similar to sparse coding, where you iteratively solve convex problems holding some variables fixed. The authors are currently looking into using Sparse Coding as an alternative as well.
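
The reason this works is just linearity of filtering: the response of a part that is a weighted sum of basis filters equals the same weighted sum of the basis responses. Here is a 1D sketch of that identity with made-up numbers; it says nothing about how the paper learns the basis or coefficients.

```python
def correlate(signal, filt):
    """Slide filt across signal (valid positions only) and take dot products."""
    n = len(signal) - len(filt) + 1
    return [sum(signal[i + j] * f for j, f in enumerate(filt)) for i in range(n)]

def combine(vectors, coeffs):
    """Weighted sum of equal-length vectors."""
    length = len(vectors[0])
    return [sum(c * v[i] for c, v in zip(coeffs, vectors)) for i in range(length)]

# Hypothetical 1D "image", two basis filters, and three parts given by coefficients.
signal = [0.0, 1.0, 3.0, 2.0, -1.0, 0.5, 4.0, 1.5]
basis = [[1.0, 0.0, -1.0], [0.5, 1.0, 0.5]]
part_coeffs = [[1.0, 0.0], [0.3, 0.7], [-2.0, 1.5]]

# Test time: filter with each basis once...
basis_responses = [correlate(signal, b) for b in basis]
# ...then every part response is a cheap weighted sum of those basis responses.
steered = [combine(basis_responses, c) for c in part_coeffs]

# Reference: build each part filter explicitly and slide it directly.
direct = [correlate(signal, combine(basis, c)) for c in part_coeffs]

max_gap = max(
    abs(s - d) for srow, drow in zip(steered, direct) for s, d in zip(srow, drow)
)
```

The two routes agree up to floating-point rounding, but the first one runs the expensive sliding step only once per basis filter instead of once per part.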
I liked this paper because it has strong connections to Deep Learning methods.  In fact, I think I can express this model’s feed forward computation as something like a Yann LeCun style convolutional network, which is rather interesting. The steps are always the same: filtering (AND), concatenation, normalizing & pooling (OR), alternating. For example, a single HOG cell is equivalent to filtering with gabors of 9 different directions, followed by normalization and average pooling.
Neural Networks: denoising and misc thoughts
This paper [Image denoising: Can plain Neural Networks compete with BM3D? [pdf]] by Harold C. Burger shows that you can train a Multi Layer Perceptron to do image denoising (including, more interestingly, JPEG artifact “noise” when using high compression) and it will work well if your MLP is large enough, if you have a LOT of data, and if you are willing to train for a month. What was interesting to me was not the denoising, but my brief meditation after I saw the paper on strengths and weaknesses of MLPs in general. This might be obvious, but it seems to me that MLPs excel at tasks where N >> D (i.e. much much more data than dimension) and especially when you can afford the training time. In these scenarios, an MLP essentially parametrically encodes the right answer for every possible input. In other words, in this limit an MLP becomes almost like a nearest neighbor regressor, except it is parametric. I think. Purely a speculation :)
Neural Networks and Averaging
Here’s a fun paper: “Multi-column Deep Neural Networks for Image Classification“. What happens when you train a Yann LeCun style NN on CIFAR-10? You get about 16% error. If you retrain the network 8 times from different initializations, you consistently get about the same 16% result. But if you take these 8 networks and average their output you get 11%. You have to love model averaging…
Basically what I think is going on is that every network by itself covers the data with the right label, but also casts projections over the entire space outside of training instances that are all essentially of random label. However, if you have 8 such networks that all cast different random labels outside of the data, averaging their outputs washes out this effect and regularizes the final prediction. I suppose this is all just a complicated way of thinking about overfitting. There must be some interesting theory surrounding this.
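Here is a toy simulation of that intuition in Python. It is not the paper's setup: I simply assume every "network" outputs the correct one-hot signal plus its own independent Gaussian noise, which is a much stronger independence assumption than real networks would satisfy. The numbers of models, samples and classes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
num_models, num_samples, num_classes = 8, 2000, 10
labels = rng.integers(0, num_classes, size=num_samples)

# The "right answer" that every model partially captures.
signal = np.eye(num_classes)[labels]

# Assumption: each model adds its own independent noise to the scores.
noise = rng.standard_normal((num_models, num_samples, num_classes))
scores = signal[None, :, :] + noise

# Error rate of each model on its own.
individual_errors = (scores.argmax(axis=2) != labels).mean(axis=1)

# Error rate after averaging the scores of all models.
ensemble_error = (scores.mean(axis=0).argmax(axis=1) != labels).mean()
```

Averaging shrinks the independent noise while leaving the shared signal intact, so the ensemble makes fewer mistakes than any single model in this toy. Real networks have correlated errors, so the gain will be smaller.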
Conditional Regression Forests for Human Pose Estimation
This is just an obligatory mention of this new Random forests paper, where Microsoft improves on their prior work in pose estimation from a depth image. The reason for this is that I’m currently in love with Random Forests philosophically, and I wish they were more popular as they are beautiful, elegant, super efficient and flexible models. They are certainly very popular among data scientists (for example, they are now used as _the_ blackbox baseline for many competitions at Kaggle), but they don’t get mentioned very often in academia or courses and I’m trying to figure out why. Anyway, in this paper they look at how one can better account for the variation in height and size of people.
The Role of Image Understanding in Contour Detection [pdf]
This paper shows that patch-level segmentation is basically a solved problem, in the sense that our best algorithms perform on par with humans. It is only when you give people the context of the entire image that they start to outperform the algorithms. I thought this was rather obvious, but the paper offers some quantitative support and they had a cool demo at the poster.
Accidental pinhole and pinspeck cameras: revealing the scene outside the picture [pdf]
This was an oral by Antonio Torralba that demonstrated a cool CSI style image analysis. Basically, if you have a video of a scene and someone occludes the source of light, they can become an accidental anti-pinhole, which lets you reconstruct the image of the scene behind the camera. Okay, this description doesn’t make sense but it’s a fun effect.
Towards good practice in large-scale learning for image classification [pdf]
This paper has a lot of very interesting tips and tricks for dealing with large datasets. Definitely worth at least a skim. I plan to go through this myself in detail when I get time.
Misc notes
- There was a Nao robot that danced around in one of the demo rooms. I got the impression that they are targeting more of the K-12 education market with the robot, though.
- We had lobster for dinner. I felt bad eating it because it looked too close to alive.
- Sebastian Thrun gave a good talk on self-driving cars. I had seen most of it before in previous talks on the same subject, and I can now reliably predict most of his jokes :)
- They kept running out of coffee throughout the day. I am of the opinion that coffee should never be sacrificed :(
I invite thoughts, and let me know if I missed something cool!

Musings on Intelligence: thought experiments

Isn’t intelligence just unbelievably annoying? How does it work? I spend many hours pondering this question. In this post I outline two of my more interesting thought experiments that aim to probe the answers. As I go through these in my head, I always think about how a robot could achieve these same “thoughts” or inferences. What kind of algorithms are required to at least approximately match my thinking process?

Thought Experiment #1. Try this: fully introspect your thinking process while doing a random, routine task. Suppose you’re sitting at your desk in the office and suddenly decide to get some coffee from the coffee shop across the street. Think about every detail of your thought process as you go along: You form a plan to go down to the street. The plan is hierarchical in nature: overall goal, waypoints, immediate plans of getting from A to B, all of your muscle contractions that get executed to meet each tiny goal on the way… Just before you walk out of your office you slow down a bit in front of the door because the hallway can be full of people who may be walking quickly and are unaware of you. In other words, you’re considering the possible dangers and planning ahead, minimizing the risk of undesirable outcomes. As you walk forwards, a person is coming toward you. You immediately infer the goal of that person: They are most likely trying to pass you and continue on their way down the hallway. You steer slightly to the right and you anticipate them moving slightly to the left. You walk down the steps and you’re about to open the door, but suddenly you notice a person coming in from the outside. Again, you understand that they want to come into the building. You immediately infer that they are likely to open the door. You also notice that the other person is not looking at you but slightly down at their feet while walking, so you infer that they are probably unaware of you. You step aside and wait for them to open the door and pass. Finally, you get to the shop and you see a line. You understand how a line works: people line up and wait for their turn to order things. You line up at the end because that is the right thing to do. You don’t stand too far back and face elsewhere because other people who want to line up will be confused about whether or not you’re waiting in line…

I feel like I’m doing an injustice to this exercise, but in general it is overwhelming to think about all the tiny inferences my brain is automatically making at any time. Now, how could a robot match similar processes or inferences? How could it ever learn what a line is at a coffee shop? How is it represented as a data structure in its memory? Or the fact that the rule is to “line up at the end of the line”? How could it ever understand that the person on the other side of the glass door had their own goal, and that in that particular moment their goal was to get into the building? How could it ever understand that the other person also has their own knowledge base, and that since they were not looking at the robot they did not know it was there? And how could it ever come to decide that a particularly efficient way to handle the scenario was for it to step aside and wait for them to pass?

Thought Experiment #2. For my second thought experiment consider a slightly different setup. It is so ordinary and so boring, and yet of all my experiments, I believe it reveals the most about intelligence. It is inspired by a real-world situation: I was talking to a friend of mine at a party, when after a brief pause of us both taking a casual sip of our beverages, my friend suddenly asked: “Did you see John?”. The question is mundane. The inferences that unfolded during my tiny state of confusion, on the other hand, are extraordinary if you try to enumerate them explicitly:

- John is probably a person. It probably isn’t a movie, or a thousand other things.
- I can’t think of a John I know at the moment. I know a few Johns, but I don’t think my friend knows them.
- My friend would not ask me the question if he thought I did not know John. So he thinks I know a John.
- What is the set of people that we both know? Maybe I know John but only from seeing him? Maybe my friend doesn’t know that I don’t know him by name.
- What were we talking about just seconds before? We were talking about an assignment for a class. Is John in the class as well? Is there a person in the class who we sometimes hang out with and whose name I don’t know, but should?
- Why is my friend asking this question? How does it fit with what we’ve just been talking about? How does it fit with what my friend would want to know at this moment?
- Is he merely thinking out loud, and does not really expect me to know John?
- Is he asking about the past? Did we ever talk about some John? Or is John a guest at the party that my friend is merely trying to find?
- Did I not hear my friend correctly? Maybe he meant Jen? We both know a Jen, but she doesn’t fit too well into context of the conversation moments ago. Is my friend trying to change the topic? Is there something interesting that happened with Jen in the last few days and maybe I don’t know about it?
- Did my friend ever ask this question or a similar one before?

It feels like my brain went through hundreds of immediate hypotheses like the ones above, racing to make sense of the situation, striving to make it consistent. It felt like in a millisecond it tried to fit every hypothesis to the available data, and it felt like it retrieved vast amounts of past knowledge not only about the context of the situation at that time, but also the context of my entire past relationship with my friend, and the events that unfolded moments ago. It felt like it was trying to find a hypothesis that “clicked”. It considered not only my knowledge, but a model of my friend’s knowledge of me, and even my guess at his immediate intentions. In other words, somehow I maintain a model of what every person I know knows about me and the world, their attitudes toward me and the world, and the experiences and contexts we share. I also have an understanding of their personalities, and the kinds of things they are likely to talk to me about. Interestingly, I would also argue that I maintain a degree of certainty on every such piece of knowledge, sometimes only as a summary, and sometimes with pointers to events that led me to believe them.

It is quite amazing that our brains are capable of doing all this in fractions of a second, and they do it thousands and thousands of times a day. I believe that the process outlined above is at the heart of intelligence, in that it is just a single example of more general reasoning machinery that is used at any moment in time. The brain is, as best as I can describe it, a Hypothesis Generating Bayesian Scoring Machine. And don’t get excited, by Bayesian I only mean the very simple idea that we have priors and assign likelihoods for every possible hypothesis, and we combine them in some way to get a winner: the hypothesis that “clicks” the best. And as far as I can tell, the inference is most similar to a kind of hybrid Loopy BP / MCMC scheme, where proposals that are based on experience are used to initialize hypotheses, and where a belief propagation-like procedure derives their consequences before scoring them.
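To make the "very simple idea" concrete, here is a tiny Python sketch of scoring hypotheses for the “Did you see John?” moment. Everything in it is invented for illustration: the hypotheses are a handful from my list above, and the priors and likelihoods are numbers I made up. It says nothing about how hypotheses get generated, which is the hard part.

```python
# Each hypothesis maps to (prior, likelihood of the question given the hypothesis).
# All numbers are made up for illustration.
hypotheses = {
    "John is a classmate whose name I never learned": (0.30, 0.60),
    "John is a party guest my friend is looking for": (0.20, 0.50),
    "I misheard and my friend said Jen": (0.10, 0.20),
    "John is a movie": (0.01, 0.05),
}

def score(hypotheses):
    """Combine prior and likelihood, then normalize into a posterior."""
    unnormalized = {h: prior * lik for h, (prior, lik) in hypotheses.items()}
    total = sum(unnormalized.values())
    return {h: value / total for h, value in unnormalized.items()}

posterior = score(hypotheses)

# The hypothesis that "clicks" is the one with the highest posterior.
winner = max(posterior, key=posterior.get)
```

With these made-up numbers the classmate hypothesis wins, because it has both the highest prior and the highest likelihood.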

In conclusion, these depressing thought experiments tell me that we are, indeed, very very (very!!) far from Artificial Intelligence. How can we write algorithms that can automatically explain data by generating and scoring hypotheses, while considering the full context? How do we write algorithms that understand and model intent, knowledge and goals of other agents? I don’t have the answers, but one thing I do know is that there is no single machine learning system that I’ve heard of that I consider to be on the right path. I’m being harsh and my expectations are high, but my main concern is that our algorithms for the most part don’t think, they compute boring feed-forward functions that depend on a fixed set of conveniently chosen parameters. An algorithm that attempts to model a mind must have a certain scent of meta… a scent that I have yet to feel.

 

My Last quarter: projects, courses, endeavors

First quarter at Stanford was extremely busy but a lot of fun. Here is the list of endeavors that kept me entertained:

1. I took two courses: Machine Learning with Andrew Ng, and Computer Vision with Fei-Fei Li. Both courses were fun, even though they contained mostly information I had already learned at UBC. Regardless, it was nice to hear it all again and get to practice it more.

2. I rotated in Daphne Koller’s lab and worked on the Latent Structural Support Vector Machine. The optimization for LSSVMs is done in a coordinate-descent fashion: latent variables h are inferred given the weights of the SVM w, and then w is inferred given h. I worked on an extension to the first step: instead of inferring a fixed value of h, one tries to maintain a probability distribution over h. When inferring w in the second step, an expectation is calculated over h instead of simply using a fixed value. The intuition is that the algorithm should not be too hasty to commit to a bad h, or it can get stuck in a bad minimum. Of course, one pays a computational price for this, but the question was: is it worth it? As far as my experiments went with my specific data, the answer seems to be no. This general meta-issue is one that keeps coming up over and over again: Do you spend computational effort doing the right thing, or do you compute the wrong thing many times faster? In practice, the latter can be surprisingly effective.

Most importantly though, I reaffirmed during this rotation that this kind of work is not something I find personally appealing. I don’t get excited about mathematically reformulating a problem in some slightly different form, and seeing it perform 1% better than state of the art on my favorite dataset. What motivates me best are more tangible projects that address large conceptual challenges. Projects that have the goal of AI in mind, or the goal of getting robots to live among us. Projects that have meta in them. Projects that can make me say embarrassing things, such as “This must be how the brain works”.

3. For my course project for both Computer Vision and Machine Learning, I was advised by Gary Bradski from Willow Garage and I worked on Object Detection. More specifically, I worked on extensions to the recently published (ICCV 2011) LINEMOD Object detector by Stefan Hinterstoisser. Stefan’s work is essentially a super-fast, optimized implementation of template matching that can be applied to RGBD images (such as those coming from the Kinect) for object instance detection. I chose this project because it had all the tags necessary to get me excited: Kinect, Willow Garage, Object Detection, Super-fast, Vision, and Robots. In addition, I have this strange feeling that despite all the efforts that go into building clever systems for object detection, it will be common in 20 years to solve practical problems with template matching, naive Bayes, and bag of words models. In fact, I’m not entirely convinced that this is unlike what the visual cortex does in humans, at least for a large portion of the low-level processing.

However, clearly it is not practical to have a separate template for every possible view and for every possible object, so there must be mechanisms in place to scale the naive object-centric template matching strategy. I investigated two ways of scaling the algorithm. 1: The simple intuition that not all parts of the image should receive the same amount of attention in terms of matching, because it is possible to reject boring regions of images as candidates for objects based on very coarse matching at low resolution. I was able to use this (trivial) intuition to speed up the algorithm 20x without any loss in recall. And 2: It would be nice if we didn’t have to have a separate, large template for entire objects. Instead, I explored a Hough-voting approach where I detected little parts of objects, and had them vote for the object center. The intuition is that, for example, if you detect a bottle cap with high certainty, then a bottle center should be somewhere below. This turned out not to work too well but I was so puzzled by it that I kept searching, and indeed, shortly after the report was due I uncovered a severe bug in the code base I was using as a black box for matching that would directly lead to bad performance in these part-based experiments. Unfortunate!

I liked working on this project a lot! You can read my final report here. [PDF]

4. Those of you who know me also know that I get very easily excited about anything Education. And since Andrew Ng’s Machine Learning class was offered to the public online for free last quarter, I did not hesitate and volunteered almost 10-15 hours a week helping to prepare the programming assignments for the class. Looking back, it was probably not the best choice considering my career as a researcher, but I do not regret my choice. It was a lot of fun being involved in something I consider to be so ground-breaking, and I really hope that all the new initiatives that seek to revolutionize online education, such as Coursera, Udacity, and MITx go on to become very successful. And I hope I earned some bragging rights, because I’ll be able to say that I was there, involved and at the heart of it when it all began.
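As a footnote to the rotation project in item 2 above, here is a minimal Python sketch of the difference between committing to a single h and keeping a distribution over h. It is not the code from the rotation. I am assuming a softmax over the scores w·φ(x, h) as the distribution, which is just one natural choice, and the feature matrix, weights and function names are all made up.

```python
import numpy as np

def hard_latent_features(phi, w):
    """Commit to the single best-scoring h and return its feature vector."""
    scores = phi @ w
    return phi[np.argmax(scores)]

def soft_latent_features(phi, w, temperature=1.0):
    """Keep a softmax distribution over h (an assumption) and return E[phi]."""
    scores = phi @ w / temperature
    scores = scores - scores.max()  # subtract the max for numerical stability
    probs = np.exp(scores)
    probs /= probs.sum()
    return probs @ phi, probs

# Hypothetical features: one row per candidate value of h.
phi = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0],
                [1.0, 1.0, 0.0]])
w = np.array([0.5, 1.0, -0.2])

hard = hard_latent_features(phi, w)
soft, probs = soft_latent_features(phi, w)

# As the temperature goes to zero the soft version collapses to the hard one.
cold, _ = soft_latent_features(phi, w, temperature=1e-3)
```

The w update would then use `soft` in place of `hard`. The extra cost is that every candidate h contributes, instead of just the winner.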

 

This quarter I am rotating with Andrew Ng’s group working with Adam Coates, and I am taking Convex Optimization with Stephen Boyd and Probabilistic Graphical Models with Daphne Koller and Kevin Murphy. More on this later! :)

My “Values and Assumptions about Teaching and Learning”

Those of you who know me well may also know that I get very passionate about education. I could write a whole other 10-page post on some of my thoughts on Khan Academy, and more recently the MLclass, AIclass, DBclass, etc. offered at Stanford. (By the way, update: I’ve volunteered to help make assignments for the ML class, and I LOVE being a part of it. My name is on the “About us” page, and will go down in educational history! OK, just kidding, but I’m proud of it anyway :p)

For now, however, I wanted to share this writeup that I just randomly discovered hidden deep inside my Dropbox. It is my “Values and Assumptions about Teaching and Learning” that I submitted with my application for one of the top TA awards at the University of British Columbia last year. My application was rejected (which, by the way, I am bitter about because I think my application was overall very strong and there is no other student I know who worked even close to as hard as I did on my TA duties, who volunteered to TA more courses than was required, who volunteered many many more hours than he should have spent, who received identically near-perfect student evaluations every time… I am normally a fairly modest person, but here I refuse. Ah well, hard work not recognized, fine with me.) Regardless, the writeup has some of my thoughts on what I learned while teaching (most of my experience was in teaching Tutorials – i.e. ~5-30 people per class with a mean of around 20, and helping out students who worked on assignments in the learning center). Forgive the slight cheesiness of it at times :)

————————————————–

When I sometimes help a group of students along as they try to complete some problem, I wonder if they realize that I, as a teacher, am also in a process of solving an extremely difficult problem: that of teaching. It is very hard to over-estimate the difficulty of being an effective teacher. Even a simple question from a struggling student is often just a tip of an iceberg: a brief manifestation of a deeper misunderstanding. The task of the teacher is not to simply answer the question (that’s easy!), but to first infer the exact shape and size of this iceberg, and then to address the source of the confusion. Over the last few years, I came to realize that teaching is one of the most intellectually demanding problems that I can hope to work on, and solving it correctly for some students, in some cases, is a great source of satisfaction.

I have accumulated many tips and tricks of teaching over the last two years, during which I conducted a tutorial almost every other day. In an effort to make my essay concrete, I will attempt to justify from experience a few of my core teaching principles. One of the first surprising discoveries I made when I started out was that being very comfortable with the content of the course was, paradoxically, detrimental to my ability to teach it. As I was trying to explain the material, I would frequently catch myself skipping over details in a problem derivation, simply because certain leaps of logic were obvious to me. For this reason, I volunteered to undertake the universally most hated task that a TA can have: marking assignments. Students are generally bad at conveying their misunderstanding, and are often even reluctant to admit it. A commonly occurring situation is that they aren’t even aware of it in the first place. Overall, getting my hands dirty and poring over students’ work in detail enabled me to more clearly understand the kinds of problems that often come up, it reminded me of all the little pieces of knowledge that I now take for granted, and ultimately led me to become a more effective instructor.

One of my other core principles was also strongly reinforced through personal experience. When I first started teaching, I felt very comfortable with the course material. After all, the course I taught only involved simple mathematics that I carried out many times since my first year. To my surprise, however, once I actually started teaching I realized that my understanding of these elementary concepts was only superficial, and often simply rule-driven. Forcing myself to make sense of it as I was explaining it to others led me directly toward a deeper understanding of all concepts and their relations. Similarly, as teachers we should encourage our students to not only passively absorb information, but to actively try to make sense of it through interaction, collaboration, and teaching.

My process of improvement as a teacher is not unlike the one that my students go through. We gradually learn to become better through long periods of sustained practice. I don’t pretend to have anything figured out, but eagerly look forward to learning more.

Isaac Asimov’s I, Robot: thoughts

I finally had a chance to read *Isaac Asimov’s I, Robot*.
It was certainly an interesting experience, given that the short stories in it were written around 1940-1950, but the events in the book take place around 2070. (i.e. right now in 2011 we are almost exactly halfway there)

The book contains 9 short stories, of which the ones I would most recommend are *Reason* and *Evidence*.

What strikes me as most interesting is the nature of predictions in the book. Some predictions are too pessimistic and some are too optimistic, but in funny ways. Here are examples:

- The robots in 2070 are described to be *heavy, metallic, and have diaphragms*. More likely, we’d now think that robots at that time will be made of super light-weight carbon fibers, and they certainly won’t have diaphragms when we can just use speakers?

- Most interestingly, in charge of the hardcore theory of robots are … *mathematicians*. In fact, the positronic brains are seen to yield *behavior based on solutions of differential equations*. These days, we would most likely not think of including (pure) mathematicians in robotics, and we rarely ever think of algorithms in AI/Machine Learning in terms of differential equations. (wait, should we? :) )

- One story mentions that the protagonist recorded a *video*, and that he *had to get it developed*. Interesting that it was not obvious that this limitation would be overcome by 2070, and that we wouldn’t be using film.

- Even though some of the above contain severely pessimistic views of the world, Isaac imagines us to have *hyperatomic drives* in 2070, that allow for easy interstellar travel. It is strange to think that we can conquer space, but still need to “develop” a video.

Anyway, overall I liked the stories. Many of them essentially come down to an almost detective-like story, where there is something wrong with the robots, and the protagonist has to figure out how the observed behavior has come about from the 3 laws and logical inferences. In general I like the idea that sufficiently advanced robots will become so complicated that we will lose the ability to fully interpret their behavior. There will simply be too many moving parts, and what we observe in terms of the behavior will only ever be the tip of the iceberg. The underlying, perfectly deterministic and individually understandable complexity will simply collapse all together into one term, and we will call it *personality*. I look forward to these times, at some point around 2070 (sounds reasonable to me).