Better Living Through Crowdsourcing

Christian Kreutz explores the many technologies the world is using to make sense of real-world data in the digital domain. These technologies, separately and collectively, enable computers to interpret the world more accurately, as we understand it, in the hope that they’ll be able to tell us more about our reality than we are able to infer unaided.

Our relationship with these technologies is self-reinforcing: it’s both driven by, and the cause of, an explosion in the ‘sharing’ of content. In other words, the more data we have, the more we want to understand and contextualize it. The more we understand, the greater the motivation to create and share even more.

The Information Age, Amplified

Eric Schmidt, CEO of Google, recently talked about just how fast humans are creating content:

Thanks to the Internet, we now double every two days all stored information. The estimated amount is 5 exabytes according to Eric Schmidt (Google) and it took human kind 2000 years to get a similar amount of archived information.

So how are machines able to parse all this data from the real world? Well, there are a few ways…

  • Text Recognition and Natural Language Processing
  • Voice Recognition
  • Mobile Data Collection
  • Image Processing and Computer Vision

That’s a few, but there are a number of other technologies to consider too: programs for mining the social graph, mapping, checking in, active learning… too many to list. The point is, the sum of these parts allows for platforms that attempt to understand media as closely as possible to the way humans do. Of course, the benefit of computing is that algorithms work faster and more efficiently than we do. Despite the number of technologies listed above, artificial intelligence isn’t quite where it needs to be to completely automate managing it all.

Just today there were reports that Cuil, a search engine that relied upon semantic parsing algorithms to mine the dark web, might be shutting down. I’m sure their technology was sound and some of the brightest minds in the business started Cuil, but there are real difficulties in relying on machines to do complex tasks where context is the variable.

Crowdsource the Filter

Our approach is to address the problem from a different angle, where humans can distribute work to many, use machines to aggregate the output of that productivity, and then work with smart tools that learn from the users’ needs and expectations. If our code isn’t smart enough to make sense of data on its own (it’s not) but humans are (yet they aren’t as fast or organized), then perhaps part of the solution lies in optimizing human efforts at filtering content, adding context and using the result as the base for improving future algorithmic decisions. This is called active learning, where the interactions of a human operator improve the algorithms assigned to perform certain functions.
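As a rough illustration of the "many humans, one machine aggregator" idea, here is a minimal Python sketch. It is not SwiftRiver code: the function name aggregate_votes, the labels, and the 0.6 agreement threshold are all assumptions made up for this example. It takes the labels several reviewers gave to one content item and only accepts the majority label when enough of them agree.

```python
from collections import Counter

def aggregate_votes(votes, min_agreement=0.6):
    """Return the majority label if enough reviewers agree, else None.

    `min_agreement` is a hypothetical threshold chosen for illustration.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None

# Three reviewers looked at one item; two of them agree.
decided = aggregate_votes(["relevant", "relevant", "spam"])

# An even split falls below the threshold, so no label is accepted.
undecided = aggregate_votes(["relevant", "spam"])
```

Items that come back undecided are the interesting ones: they are the cases worth sending to more people, or feeding back as training examples.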

My colleague Patrick Meier refers to this as Crowdsourcing the Filter. I think, at least in the near term, this is the future of intelligent computing, where smart machines assist humans, helping us to accomplish the tasks we need to accomplish more efficaciously.

At CrowdConf next month on October 4th, SwiftRiver will be onsite demonstrating some of the applications we’ve built from this understanding. This is part of our approach to solving the problem of ‘too much data’. We’ll let the big guys like Google, Microsoft and IBM figure out the secrets to scalable a.i. In the meantime, our goal at SwiftRiver is to democratize access to tools that help people make sense of data, on their terms.

Natural Language Processing with Swift River

One of the core features of Swift River is the Language Computation Core, or SiLCC as we like to call it (Swift Language Computation Component). Users send feeds to SiLCC, which uses a number of machine learning techniques to parse the incoming text and extract relevant keywords. The idea is that these keywords (tags) can then be used to infer taxonomic relationships between content items. Some camps refer to this as semantic programming, others refer to it as artificial intelligence, but the general concept remains the same: helping programs to perform tasks based on a growing series of complex conditions. In this case, that means ‘auto-tagging’ or ‘predictive tagging’ based on conditions learned from user behavior and preset rules.
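To make "text in, tags out" a little more concrete, here is a deliberately naive Python sketch. It is not how SiLCC works internally: it swaps the machine learning techniques for a plain word count with a stopword filter, and the names extract_tags and STOPWORDS, the tiny stopword list, and the sample feed item are all assumptions invented for illustration.

```python
import re
from collections import Counter

# Hypothetical, very small stopword list; a real tagger would need far more.
STOPWORDS = {"a", "an", "and", "the", "in", "of", "on", "to", "is", "was",
             "near", "has", "have", "after", "for", "at"}

def extract_tags(text, max_tags=5):
    """Return up to `max_tags` of the most frequent non-stopword tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [word for word, _ in counts.most_common(max_tags)]

# Made-up feed item, used only to exercise the function.
feed_item = ("Earthquake reported near Concepcion. "
             "Chile earthquake damage reported in Concepcion.")
tags = extract_tags(feed_item, max_tags=3)
```

Frequency alone picks "reported" as readily as "earthquake", which is exactly the kind of mistake that learned conditions and user feedback are meant to correct.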

The diagrams below illustrate how this dataflow works. Text passing through SiLCC is parsed, tags are extracted, and those tags are then reapplied in the Swift River UI. There, Swift attempts to build relationships between tags (e.g. items tagged with ‘chile’ and ‘earthquake’ are likely related, whereas items tagged ‘chili’ and ‘earthquake’ likely are not). Of course, other factors are considered too, like date, time, the point of origin and the location of the content creator.
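One simple way to think about tag relationships is co-occurrence: tags that keep showing up on the same items are probably related. The Python sketch below counts tag pairs across a handful of made-up items. It is an assumption about how such a relationship could be measured, not a description of Swift River's actual logic, and it ignores the date, time, origin and location factors mentioned above. The function name cooccurrence and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(items):
    """Count how often each unordered pair of tags appears on the same item."""
    pairs = Counter()
    for tags in items:
        # Sort so that ("chile", "earthquake") and ("earthquake", "chile")
        # are counted under the same key.
        for a, b in combinations(sorted(set(tags)), 2):
            pairs[(a, b)] += 1
    return pairs

# Made-up tagged items for illustration only.
items = [
    ["chile", "earthquake", "concepcion"],
    ["chile", "earthquake"],
    ["chili", "recipe"],
    ["earthquake", "haiti"],
]
pairs = cooccurrence(items)
chile_quake = pairs[("chile", "earthquake")]
chili_quake = pairs[("chili", "earthquake")]
```

With this data, 'chile' and 'earthquake' co-occur on two items while 'chili' and 'earthquake' never do, which matches the intuition in the example above.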



One of the services running within SiLCC is another service called SLISa, which we like to call Lisa (because the ‘s’ is silent, hehe). SLISa is the Swift Language Improvement Service App, and it trains SiLCC to learn from user interaction. When users of Swift edit tags or flag them as inaccurate, SLISa is the service that creates all the conditions that help SiLCC to learn from its mistakes and improve for the future.
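The feedback loop can be pictured with a toy example. The Python sketch below is not SLISa: it replaces real model training with a simple lookup of remembered corrections, and the class name TagFeedback and its methods are hypothetical. It records when a user edits or flags a tag and applies those corrections to the next batch of suggested tags.

```python
class TagFeedback:
    """Hypothetical store of user corrections applied to future suggestions."""

    def __init__(self):
        self.rejected = set()    # tags users flagged as inaccurate
        self.replacements = {}   # tags users edited, mapped to the fix

    def flag(self, tag):
        """Remember that users marked this tag as inaccurate."""
        self.rejected.add(tag)

    def edit(self, old, new):
        """Remember that users changed `old` to `new`."""
        self.replacements[old] = new

    def apply(self, suggested):
        """Rewrite edited tags, drop flagged ones, and remove duplicates."""
        result = []
        for tag in suggested:
            tag = self.replacements.get(tag, tag)
            if tag not in self.rejected and tag not in result:
                result.append(tag)
        return result

feedback = TagFeedback()
feedback.edit("chili", "chile")   # a user fixed a misspelled tag
feedback.flag("reported")         # a user flagged an unhelpful tag
cleaned = feedback.apply(["chili", "earthquake", "reported", "chile"])
```

A lookup table like this only repeats corrections it has already seen. A learning service would need to generalize from them, which is the harder part of the problem.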



SiLCC is an open source project being developed in Python using the pyNLP toolkit. There are several additional layers of text parsing that I haven’t touched upon, including how SiLCC deals with SMS txtspk and Twitter picoformats like hashtags, but more on that in a future post!

More on SiLCC at http://swift.ushahidi.com/extend/silcc/. If you have a passion for machine learning, large data sets, and intricate algorithms you might also consider joining the Swift River Google Group or our public Skype Chat.

The Alpha release of Swift River, Version 0.0.9 Rumba, will be available to the public on March 31, 2010. Developers can always find the latest working build and issue tracker at http://github.com/ushahidi/Swiftriver.