How we are teaching computers to understand pictures

At the beginning of this year, I discovered a new area of computer science that I fell in love with: deep learning in the field of computer vision. To understand this field better, I took a course at my university called Introduction to Computer Vision.

I also followed an online course taught by Fei-Fei Li, CS231n: Convolutional Neural Networks for Visual Recognition, offered by Stanford University, and I have to admit that it was amazing. The course focuses on deep learning architectures, particularly for image classification tasks. A detailed syllabus can be found in the links below.

A few days ago, I came across a TED Talk by Fei-Fei Li that serves as an introduction to computer vision, in which she explains her area of research and her quest in this field. Afterwards, I decided to highlight some parts of the talk to give insights to those who might be interested in computer vision research.

In this talk, Fei-Fei Li explains that we have built smart machines, but that these machines are still blind. This is mainly because they perceive objects only as pixels: bunches of numbers that carry no meaning on their own. In other words, these machines have the ability to see, but we also want them to look at those objects and understand the meaning that lies behind them.
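To make the "pixels are just numbers" point concrete, here is a minimal sketch (using NumPy, a common choice for this kind of illustration) of what an image actually looks like to a computer:

```python
import numpy as np

# A tiny 2x2 grayscale "image": to a computer it is nothing but
# numbers, one brightness value (0-255) per pixel, carrying no
# inherent meaning on their own.
image = np.array([[ 12, 200],
                  [255,  34]], dtype=np.uint8)

print(image.shape)  # (2, 2)
print(image.dtype)  # uint8

# A color image simply adds a third axis for the red, green, and
# blue channels -- still just a grid of numbers.
color_image = np.zeros((2, 2, 3), dtype=np.uint8)
print(color_image.shape)  # (2, 2, 3)
```

Everything a vision system does, from detecting a cat to describing a scene, has to start from arrays like these.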

Fei-Fei Li also mentions that a simple observation changed her way of thinking. She says, “By age three, a child would have seen hundreds of millions of pictures of the real world. That’s a lot of training examples. So, instead of focusing solely on better and better algorithms, my insight was to give the algorithms the kind of training data that a child was given through experiences in both quantity and quality.” She points out that even a cat can appear in countless forms, just like in the examples below. Therefore, a huge dataset is needed to provide as many training examples as a three-year-old has seen.


Guided by this observation, she started the ImageNet project together with Professor Kai Li at Princeton University in 2007. Two years later, the project had delivered a database of 15 million images across 22,000 classes of objects and things, organized by everyday English words, including 62,000 different pictures of cats like the ones mentioned above.

Now, computers are able to determine whether a picture contains a cat and tell where in the picture the cat is. Although this is a great achievement, Fei-Fei Li says that another milestone will be hit: computers will not only see a picture but also generate sentences about it, learning from both pictures and natural language. For now, computers can generate human-like sentences, but we still have a long way to go.
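To give a flavor of what "determining whether a picture contains a cat" means as a computation, here is a toy sketch of the simplest possible image classifier: label a new image by finding its nearest neighbor (smallest pixel-wise distance) among labeled training images. The data and labels here are made up for illustration; real systems like the ones in the talk use convolutional neural networks trained on millions of images, but the task of mapping raw pixels to a label is the same.

```python
import numpy as np

# Two hand-made 2x2 "training images" with made-up labels.
train_images = np.array([
    [[0,   0],   [0,   0]],    # a mostly dark image, labeled "cat"
    [[255, 255], [255, 255]],  # a mostly bright image, labeled "not cat"
], dtype=np.float32)
train_labels = ["cat", "not cat"]

def classify(image):
    # L2 (Euclidean) distance between the query and each training image,
    # computed over all pixels; pick the label of the closest one.
    distances = np.sqrt(((train_images - image) ** 2).sum(axis=(1, 2)))
    return train_labels[int(distances.argmin())]

query = np.array([[10, 5], [0, 20]], dtype=np.float32)
print(classify(query))  # "cat" -- the query is closest to the dark image
```

This nearest-neighbor approach is often used as a pedagogical baseline (it is, for instance, the starting point of the CS231n course mentioned above) precisely because it shows how far raw pixel comparison is from real visual understanding.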

At the end of the talk, Fei-Fei Li notes that these machines will not be used only for their intelligence: we will collaborate with them in many ways to build a better world for our children. This, she says, is her quest: to give computers visual intelligence and to create a better future for her son, Leo, and for the world.

References

CS231n: Convolutional Neural Networks for Visual Recognition. (n.d.). Retrieved from http://cs231n.stanford.edu/.

Li, F. (n.d.). How we’re teaching computers to understand pictures. Retrieved from ted.com.