Tuesday, June 24, 2008

Cloud Robot

I was reading about cloud computing.
So here's an idea:
A robot relying on vision can upload its real-time sensor input to the cloud.
The cloud can extract whatever it likes and present it to humans.
But the cloud would also do all the image processing and send the results back down to the robot.
The result could be motor instructions or descriptors for the robot to work with.
So that the robot itself doesn't need a high-powered processor.
All that's needed is a really fast wireless link:
640*480*4*60=73,728,000 bytes/sec!
(640x480 pixels, each 4 bytes (RGBI) @ 60 frames/sec)
You could strip that down to 3 bytes/pixel @ 30 frames/sec: 640*480*3*30 = 27,648,000 bytes/sec
OK, a 320x240 image: 320*240*3*30 = 6,912,000 bytes/sec
OK, 256-shade grey scale: 320*240*1*30 = 2,304,000 bytes/sec
That's possible!
And that would still leave roughly 500 KB/s to send stuff back.
Probably been done. But maybe not so real-time?
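For my own reference, here's the same arithmetic as a little Python sketch (the frame sizes, bytes per pixel and frame rates are just the assumptions above, not measurements):

```python
# Rough uplink bandwidth estimates for streaming raw robot vision to the cloud.
# Frame sizes, bytes per pixel and frame rates are assumptions, not measurements.

def uplink_bytes_per_sec(width, height, bytes_per_pixel, fps):
    """Raw (uncompressed) video bandwidth in bytes per second."""
    return width * height * bytes_per_pixel * fps

scenarios = [
    ("640x480 RGBI @ 60 fps", 640, 480, 4, 60),
    ("640x480 RGB  @ 30 fps", 640, 480, 3, 30),
    ("320x240 RGB  @ 30 fps", 320, 240, 3, 30),
    ("320x240 grey @ 30 fps", 320, 240, 1, 30),
]

for name, w, h, bpp, fps in scenarios:
    bps = uplink_bytes_per_sec(w, h, bpp, fps)
    print(f"{name}: {bps:,} bytes/sec ({bps / 1e6:.1f} MB/s)")
```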

Sunday, June 22, 2008

2008 Canadian Robot Vision Conference

This is my report on the Canadian Intelligent Systems Collaborative (AI/GI/CRV/IS) 2008 Conference. This was five conferences in one event. My interest was in the CRV (Computer and Robot Vision) conference held by the Canadian Image Processing and Pattern Recognition Society (CIPPRS).
The buildings in which the conference was held were undergoing construction work, and the noise was distracting. But with that inconvenience tolerated, the conference was very enlightening for a beginner like me.
The conference was essentially three days, with a keynote each morning and papers or talks throughout each day.
The only keynote I heard was the first, which was by Peter Carbone from Nortel. This was interesting to me (with my history in telecommunications), but off-target for the AI people who were the majority of conference attendees. Mr. Carbone made several predictions that I think will be significant in telecom: widespread broadband wireless penetration by 2010, telephony using SOA with mashup potential, and 100 MB/s real-time encryption capability.
Following the keynote the CAIAC Precarn Intelligent Systems Challenge was announced. This offers a $10K prize to the student submitting the best method of detecting ships meeting at sea using satellite and radar tracking data. I think the poor data makes this a challenging problem.

The highlight of the conference for me was a talk by Dr. Steven Zucker from Yale. He was the only presenter who seemed interested in doing AI and computer vision to emulate biology, as I am. I think artificial systems should understand what they're working on; be a part of their world, as biological systems are. A biological goal Zucker identified is guiding animal movement, such as monkeys jumping to tree branches. This objective is the same as Arathorn's example of goats jumping to rocky ledges. Zucker's talk was mainly on stereo vision. He confirmed my assessment that Canny edge detectors suck. He implemented a nice curve detector based on tangents. He showed how he used spatial and orientation disparity to get a better matching of the image pair. A nice point was about self-referential calibration: a system that can move can identify its own parts (e.g. in a mirror) as the ones that move when it moves them.
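I want to play with disparity myself. This isn't Zucker's tangent-based method at all, just the standard block-matching baseline in OpenCV that I'd start from (the image file names are placeholders):

```python
# Baseline stereo disparity via block matching (OpenCV). This is NOT Zucker's
# tangent-based method -- just the usual starting point for experimenting with
# disparity. The image file names are placeholders.
import cv2

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# numDisparities must be a multiple of 16; blockSize must be odd.
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right)  # fixed-point disparity map (int16)

# Scale to 0-255 just so the result is viewable as an image.
vis = cv2.normalize(disparity, None, 0, 255, cv2.NORM_MINMAX).astype("uint8")
cv2.imwrite("disparity.png", vis)
```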
I missed the talk by Dr. James Crowley, to my regret. I gathered that his points included that intelligence requires embodiment and autonomy. This confirms my subscription to the philosophy of Spinoza, who states that the mind is the entire body. Any organism's mental reality would not be what it is without all of the sensory input and motor feedback provided by the body.
The talk by Dr. Greg Dudek about his AQUA robot was interesting because of the focus and completeness of the project. It's another very specialized machine, although you can program its actions with a visual language. Apparently they discarded a visual system that recognized human hand gestures.

I attended all of the CRV paper presentations. These seemed to be arranged in ascending order of complexity and accomplishment. I was surprised that I could understand much of the work. Some of the papers were not amazing to me at all. Some were incremental improvements on previous work. Most were applications of existing work. This may be a survey of the state of the art, or it may just be a sampling of people who are trying to get attention (who didn't go to other conferences). I'm not going to summarize all of the papers - just give criticisms of the ones I found useful.
'An Efficient Region-Based Background Subtraction Technique' and 'Ray-based Color Image Segmentation' presented image segmentation optimizations based on iterative deduction. This is a good and intuitive idea and I think I can implement it using layers of neural networks. The ray-based segmentation idea was clever, but had problems finding all segments and was slow. I still don't know if colour-based segmentation is natural.
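For comparison, a minimal per-pixel background subtraction baseline is easy to write down. This is nothing like the region-based optimization in the paper, just the naive NumPy version I would measure against ('background' and 'frame' are assumed greyscale images of the same shape):

```python
# Naive per-pixel background subtraction baseline (NumPy). The paper's
# region-based optimization is much smarter; this is just the version to
# compare against. 'background' and 'frame' are assumed 0-255 greyscale
# arrays of the same shape.
import numpy as np

def foreground_mask(background, frame, threshold=25):
    """Mark pixels whose intensity differs noticeably from the background model."""
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return diff > threshold  # boolean mask: True where something changed

def update_background(background, frame, alpha=0.05):
    """Running-average background model: slowly absorb scene changes."""
    return ((1 - alpha) * background + alpha * frame).astype(background.dtype)
```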
The methods used in 'A Cue to Shading - Elongations Near Intensity Maxima' to differentiate shadows from textures got me confused, but Gipsman's point that knowledge of shading detection is still primitive surprised me - I think analysis of shading would be fundamental to determining shape and orientation of 3D objects. I agree with her that feedback from higher layers will be essential. But I think the feedback will loop: the shape of the shadow will help in recognizing the object and the shape of the object will help in recognizing the shadow.
'Fast Normal Map Acquisition Using an LCD Screen Emitting Gradient Patterns' presents an innovative method for lighting objects to get 3D information. An interesting point is their use of the polarized LCD light and a filter to remove specular reflection. I found later that the human eye can actually perceive linearly polarized light (see Haidinger's brush). Perhaps the brain can use this information in determining where the light is really coming from?
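To get a feel for how normals can come out of controlled lighting, here's a sketch of classic photometric stereo, the textbook idea the paper builds on (not their LCD gradient-pattern method; the image stack and light directions are assumed inputs):

```python
# Classic photometric stereo: recover per-pixel surface normals from images
# taken under known lighting directions. This is NOT the paper's LCD
# gradient-pattern method, just the textbook idea behind it.
# 'images' is a (k, H, W) array of greyscale intensities and 'lights' is a
# (k, 3) array of unit lighting directions -- both assumed inputs.
import numpy as np

def normals_from_photometric_stereo(images, lights):
    k, h, w = images.shape
    intensities = images.reshape(k, -1)                        # (k, H*W)
    # Least-squares solve lights @ g = intensities per pixel; g = albedo * normal.
    g, *_ = np.linalg.lstsq(lights, intensities, rcond=None)   # (3, H*W)
    albedo = np.linalg.norm(g, axis=0) + 1e-8
    normals = (g / albedo).T.reshape(h, w, 3)                  # unit normal per pixel
    return normals, albedo.reshape(h, w)
```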
'Realtime visualization of monocular data for 3D reconstruction' was a treat for me because it relates so well to my planned measurement-with-a-camera project. To me, this paper is like an instruction book on how to model 3D space from a single camera. I must look into its Simultaneous Localization and Mapping (SLAM) methods and other tricks. Monocular is cheap; stereo is more accurate? Again, the system doesn't have a clue what it's looking at, but it may be a good start for a more complex system. I must analyze it in more depth.
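As a starting point for that project, this is roughly the core two-view step as I understand it: estimate the relative camera motion from matched points, then triangulate. It is not the paper's SLAM pipeline, and K, pts1 and pts2 are assumed to already exist:

```python
# Core two-view step of monocular 3D reconstruction: estimate relative camera
# motion from matched image points, then triangulate them. Not the paper's SLAM
# pipeline. K (3x3 intrinsics) and the matched point arrays pts1, pts2
# (Nx2, float32) are assumed inputs.
import cv2
import numpy as np

def two_view_points(K, pts1, pts2):
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)

    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])   # first camera at the origin
    P2 = K @ np.hstack([R, t])                           # second camera pose
    pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
    return (pts4d[:3] / pts4d[3]).T   # Nx3 points, only up to scale (monocular)
```

The tell-tale monocular limitation shows up in the last line: the reconstruction is only known up to an overall scale, which is one reason stereo is more accurate.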
'Object Class Recognition using Quadrangles' is a general-purpose implementation of edge-based object recognition which also considers colour-uniform regions. On top of this the authors implemented a structural descriptor (the paper describes quadrangles only, but the speaker described the use of ellipse descriptors in their newer work) and a template-based spatial relationship matching system, much simpler (but less capable) than that used by Sinisa Todorovic's self-learning segment-based system. 'Geometrical Primitives for the Classification of Images Containing Structural Cartographic Objects' is another system based on edge/region/structural descriptors, but focused on the single problem of finding roads, bridges and such in satellite imagery. The software seems to be more capable, handling higher-level primitives such as blobs, polygons, arcs and junctions. It uses AdaBoost binary classification. Results were good except for detecting bridges. I suggest looking for the bridge shadows.
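Since AdaBoost keeps coming up, here's a minimal binary-classification sketch with scikit-learn (the feature vectors are random placeholders, not the paper's cartographic descriptors):

```python
# Minimal AdaBoost binary classification (scikit-learn). The feature vectors
# here are random placeholders; the paper uses descriptors of blobs, polygons,
# arcs, junctions, etc.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))               # 200 samples, 8 fake features
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # fake "road / not road" labels

clf = AdaBoostClassifier(n_estimators=50).fit(X[:150], y[:150])
print("held-out accuracy:", clf.score(X[150:], y[150:]))
```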
Most of the motion tracking papers used very simple recognition techniques or did not describe them. '3D Human Motion Tracking Using Dynamic Probabilistic Latent Semantic Analysis' presents a highly mathematical approach that seems to be another form of template matching. It works well but it will take quite an effort for me to understand it. 'Visual-Model Based Spatial Tracking in the Presence of Occlusions' presents a pre-processing trick to mask occlusions from a template/visual-model based 3D tracking system. While the system is highly performant, using the GPU, it is highly specialized to a single object. 'Automatically Detecting and Tracking People Walking Through Transparent Door with Vision' tracks Harris corners through time. It can be taught to subtract expected movements from new ones, by simple geometric trajectory comparison. This can serve many applications, but the use of just corners means it can't tell you what is moving through the scene. But could specializations like this be used as keys to brain behaviour? e.g. does the brain just use the moving corners of a door to perceive it? 'Invariant Classification of Gait Types' classifies body movements by comparison to a database of shape contexts derived from template silhouettes. This is an efficient and accurate method used in handwriting recognition, and I think I'll look into it more, because the bin concept applied to pattern matching lends itself to implementation using neural networks.
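The corner-tracking idea is simple enough to sketch: detect corners in the first frame and follow them with optical flow. This uses OpenCV's Harris corners and Lucas-Kanade tracking, not the paper's trajectory-comparison logic, and "video.avi" is a placeholder:

```python
# Track corner features across video frames: detect Harris corners in the first
# frame, then follow them with Lucas-Kanade optical flow. "video.avi" is a
# placeholder; the door paper's trajectory comparison is not reproduced here.
import cv2

cap = cv2.VideoCapture("video.avi")
ok, frame = cap.read()
prev_grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
corners = cv2.goodFeaturesToTrack(prev_grey, maxCorners=200,
                                  qualityLevel=0.01, minDistance=7,
                                  useHarrisDetector=True)

while True:
    ok, frame = cap.read()
    if not ok or corners is None or len(corners) == 0:
        break
    grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    corners, status, _ = cv2.calcOpticalFlowPyrLK(prev_grey, grey, corners, None)
    corners = corners[status.ravel() == 1].reshape(-1, 1, 2)  # keep tracked corners
    prev_grey = grey
```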
'Active Vision for Door Localization and Door Opening using Playbot' is another specialization - for doorframes and handles. Its advance is active vision. The robot solves its position geometrically after it detects the door by using a pre-programmed door size - meaning it will only work with one size of door. The active vision part is that the robot takes pictures at multiple angles and positions, solves its position using the camera angles and the detected door edges and corners, then calculates a move to a new position. 'Automatic Pyramidal Intensity-based Laser Scan Matcher for 3D Modeling of Large Scale Unstructured Environments' tackles the incredibly hard problem of mosaicing adjacent spherical laser images without feature, location or rotation information by matching their overlapping depth values. This is useful for other mosaicing problems, but I don't think its complicated methods will be required in most computer vision applications, which will have shorter intervals between images and can rely on feature detection and a sense of place. '6D Vision Goes Fisheye for Intersection Assistance' shows that fisheye lenses provide a wider angle of view with only small hits to the low processing-time budget and relatively loose accuracy requirements of a real-time stereo mobile object tracking application. 'Challenges of Vision for Real-Time Sensor Based Control' explains how additional sensor input to an extended Kalman filter can be used to supplement poor video data caused by bad camera angles.
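I want to understand the Kalman filter idea better, so here is a minimal 1-D constant-velocity version (the plain linear filter, not the extended Kalman filter from the paper; the noise values and measurements are made up):

```python
# Minimal 1-D constant-velocity Kalman filter, to get a feel for how noisy
# measurements can correct a predicted estimate. This is the plain linear
# filter, not the extended Kalman filter used in the paper; the noise values
# and measurements below are made up.
import numpy as np

dt = 1.0
F = np.array([[1, dt], [0, 1]])      # state transition: [position, velocity]
H = np.array([[1.0, 0.0]])           # we only measure position
Q = np.eye(2) * 1e-3                 # process noise (assumed)
R = np.array([[0.5]])                # measurement noise (assumed)

x = np.zeros((2, 1))                 # initial state estimate
P = np.eye(2)                        # initial estimate covariance

def kalman_step(x, P, z):
    # Predict forward one time step.
    x = F @ x
    P = F @ P @ F.T + Q
    # Update with the measurement z.
    y = z - H @ x                    # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)   # Kalman gain
    x = x + K @ y
    P = (np.eye(2) - K @ H) @ P
    return x, P

for z in [0.9, 2.1, 2.8, 4.2, 5.1]:  # noisy position readings (made up)
    x, P = kalman_step(x, P, np.array([[z]]))
    print("position estimate:", float(x[0, 0]))
```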

One thing I brought away from this conference is that although there is a large amount of existing work and many new efforts in the computer vision field, the presented applications are not trying to understand or duplicate biology. They're using mathematical methods to solve specific problems. Well perhaps the concepts can be implemented in neural networks. And the solutions are so specific! I guess it'll be a long time until there is general purpose vision. And not surprisingly so, because that will require general purpose concept representation. Too bad I didn't hear the AI papers too.
This was my first academic conference, and I learned that what to look for in papers is what is new, or what can be adapted to my purposes. Attending has motivated me to get an IEEE membership so I can access more research papers. Poster presentations seem pretty valueless to me: either they don't present enough information, or I am forced to stand while reading an entire paper.
Another thing is that there is a lot of existing technology out there that can be used to solve problems. A counterpoint to this, and a kind of semi-corollary to the first point, is that a lot of the existing technology is highly focused, inaccurate and slow, so there is still a lot of research and development needed.
From a business point of view, I got no leads on paying work. Some people at the conference believe that contracting in this field can be viable, but I think I'll have to prove I'm capable by example before anyone will hire me. Since most researchers only solve special cases, another opportunity is to take a project further and make it useful in lots of situations.