Some Reflections on Sweeper from N.E.A.T. Nigeria

In April we were contacted by a group from Georgia Tech, M.I.T. and students on the ground in Nigeria about the then-upcoming elections. This group, working together as N.E.A.T. (the Nigerian Election Aggregation Team), wanted to run a campaign that mashed up data from several different Ushahidi deployments, Twitter and other sources, displaying them in their own Ushahidi deployment. They ended up writing a lot of custom code, but this was the first ‘stress test’ of the SwiftRiver platform and our Sweeper application to date.

The following is a review of the N.E.A.T. team’s experiences with Sweeper. It was written by Thomas Smyth from Georgia Tech just after their election project was complete on May 2:

What Sweeper Did Well

  • Quick setup: Jon had our instance up and running in what seemed like a heartbeat. This was much appreciated.
  • Reliability: Sweeper stayed up pretty reliably as long as I didn’t break it!
  • Auto-Tagging: This feature was pretty neat and our system used Sweeper’s tags for meta-analysis.
  • Support: Matt was available consistently for in-depth help and scheming. We appreciated this.

Issues With Sweeper

  • Bugginess: Several major bugs were encountered, e.g. the duplication service. But this is to be expected for a young project.
  • Twitter lag: Twitter updates weren’t showing up for many minutes. Since Twitter was our main source of timely information, this was a big problem. We ended up implementing our own scraper using Twitter’s stream API, which has worked brilliantly. Matt and I have discussed this.
  • Searching: Sweeper currently doesn’t allow searching of reports, and this was a desired feature which we implemented. We also implemented a ‘saved search’ feature, which turned out to be quite useful. It allows the user to specify a search string (such as “guns or bombs or knives”) to be “tracked”. The system then searches all incoming reports and maintains a time series visualization. This allows a user to see what topics are ‘spiking’. Something like this would fit nicely in the analytics panel in Sweeper.
  • Analytics panel: There are a few good things here but the interface could be a lot denser, so that more useful analytics could be added. For instance, top tags could be represented with a compact table rather than a bar chart. Charts should only be used in cases where the visual representation provides a clear benefit. Pie charts are usually unnecessary, etc.
  • Geolocation problems: The automatic geolocation service was quite dodgy. I didn’t do any actual counting but I’d say upwards of half the results were wrong. I think it’s a difficult thing to do automatically. So much ambiguity, etc. We ended up building a custom solution for geolocation, incorporating polling booth data (120k of them!) from INEC. The system could automatically recognize a polling unit code like 03/04/12/013 in a tweet, and translate that into a geolocation.
  • Scanning interface: The main interface of Sweeper, where users quickly scan through reports and categorize them, could be more efficient. It’s not clear why each report needs to take up so much space, and why the interface doesn’t scale to fit the whole screen. The animations were also somewhat disorienting. In our system, we tried an approach where users ‘checked out’ a batch of 10 reports and quickly scanned them in a compact table format, marking relevant ones with a checkbox. This seemed to work nicely, and didn’t require (I think) as many requests to the server. In general, I think Sweeper’s interface could be tightened a lot. Users are more likely to be experienced, frequent visitors, rather than occasional ones (I think). Therefore you can make it a little more efficient and specialized than a general purpose website. I think users would appreciate this. I’d be happy to consult further here if there is interest.
  • Code and documentation: Much of the functionality described above could perhaps have been added to Sweeper. However, we found it hard to get started on adding plugins. The codebase could be better organized so that it is clear where code for different components should go. The code itself could also be cleaner in places. Also, documentation needs to be available. But again, we realize Sweeper is a young project and these things are surely on the TODO list!
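
Two of the custom pieces described in the list above, recognizing polling unit codes and tracking a saved search over time, can be sketched in a few lines of Python. This is only an illustration and not the N.E.A.T. team’s code: the function names, the lookup table and the coordinates are hypothetical, and the code pattern is inferred from the single example 03/04/12/013.

```python
import re
from collections import Counter
from datetime import datetime

# Four slash-separated numeric groups, e.g. 03/04/12/013. The group widths
# are an assumption inferred from that one example.
POLLING_UNIT_RE = re.compile(r"(?<!\d)\d{2}/\d{2}/\d{2}/\d{3}(?!\d)")


def locate_polling_units(text, polling_units):
    """Return (code, (lat, lon)) for each known polling unit code found in text.

    polling_units is a dict mapping code -> (lat, lon), standing in for the
    INEC polling booth data mentioned above.
    """
    hits = []
    for code in POLLING_UNIT_RE.findall(text):
        if code in polling_units:
            hits.append((code, polling_units[code]))
    return hits


def track_saved_search(reports, terms):
    """Count, per hour, the reports whose text mentions any of the terms.

    reports is an iterable of (datetime, text) pairs; terms is a list of
    keywords such as ["guns", "bombs", "knives"].
    """
    patterns = [re.compile(r"\b" + re.escape(t) + r"\b", re.IGNORECASE) for t in terms]
    series = Counter()
    for when, text in reports:
        if any(p.search(text) for p in patterns):
            # Truncate the timestamp to the hour to form the time series bucket.
            series[when.replace(minute=0, second=0, microsecond=0)] += 1
    return dict(sorted(series.items()))


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    units = {"03/04/12/013": (9.0, 7.5)}
    print(locate_polling_units("Long queue at PU 03/04/12/013", units))
```

A bucket whose count jumps relative to its neighbours is what the review calls a ‘spiking’ topic; plotting the returned dictionary gives the time series view.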

That’s all I have for now guys. Let me know if you have any questions. Many thanks for everything. Let’s keep talking!

This is great feedback and some of it we’ve already begun working on, while the rest (both the code and the suggestions) has been added to our roadmap.

(Photo from http://www.uiowa.edu)

Knight News Challenge Grant!

It’s truly an honor to accept a $250,000 grant from the Knight Foundation for the SwiftRiver project! It’s the culmination of a long journey that began in 2008 but evolved in 2010 when I joined the project (as product designer), followed later by Matthew Griffiths (as lead developer).

Swift is an open-source initiative whose goal is to make the process of vetting information more efficient. The project to date has progressed well thanks in no small part to the following people: Matthew Griffiths (so important to this project I mentioned him twice), Ahmed Maawy, Charl Van Neikerk, Heather Ford, Vladimir Ermakov, The Ushahidi team, Omidyar Networks, Chris Blow, Ed Bice, Kaushal Jhalla, Neville Newey, Edmar Ferreira, Pete Warden, Patrick Meier, Anahi Ayala, Ethan Zuckerman, the TED staff, Google’s 2010 Summer of Code Participants (Mang-Git, Soe, Nishith Rastogi), the Guardian’s Activate staff, Product(RED) and many others. This project would be nowhere without you all, so thanks for making it happen.

For many of us, this project represents a new way of democratizing access to the tools for understanding and vetting information which is needed by Ushahidi, journalists, and many others.

Open Source Bookmark Curation

With the latest release of Sweeper, you can roll your own bookmarking service. This is really powerful when you start activating plugins like our auto-tagger SiLCC or our Push plugins, which can output all of your bookmarked content as a feed that can be consumed by other applications.

We call this little plugin Quiver. It’s where you manually collect and store information using Sweeper. Essentially it turns Sweeper into your own free and open-source Delicious clone, with all the contextualization and aggregation features that people have come to love it for.

So how does it work? It’s simple! Just download and install any version of Sweeper following the current release of v0.3.2 which can be found here.

Once you’ve done that, go to the ‘sources panel’.

Select ‘Quiver’ from the list.

Drag the bookmarklet to your browser bar.

Done! Sweeper is a tool for the curation of real-time media. Now the things you find interesting can be mashed up with the content you’re aggregating from the web, Twitter, email and other feeds! It’s particularly useful for journalists or researchers who need the real-time content, but who want to augment that with their personalized interests and findings.

Get it from Swiftly.org

Algorithms Augmenting Human Decisions

Here’s an update about the SwiftRiver platform from PDF11, where I had the pleasure of speaking yesterday. My slides are below, and here you can find video of my presentation.

Crowdsourcing 102: Mining Real-Time Data  

The summation of the talk is that the Swift project has been assigned a very complex and incredibly difficult task: to verify and contextualize data from the mobile and social web. How do we do this? Well, this seems to be the part that confuses people. It’s not any of our apps, and it’s not any of our individual APIs that we rely upon to do this. It’s the combination of all these things together, as part of one robust algorithm that tries to digitally reconstruct the real-world context, using the features extracted from the content to prioritize and de-prioritize information relevant to that context.

I like to refer to this as folksonomic triage, where layers of historic, social, temporal, geospatial and other types of information are layered on one another to perform a function, and the system (through a process called active learning) then learns how to improve from the user’s interaction. What this attempts to do is allow the human to give the machine algorithms some insight into the types of content they prefer, and the types of content they don’t. A statistical profile of the content features of each type is recorded, with varying degrees of nuance in between, including accounting for bias, crosstalk, irrelevancy and falsehoods.

Some of this happens on the application side, and some of it happens on the logic/cloud side of things. This is because it’s very important that users understand that the platform is there to serve them, and not the other way around: algorithms augmenting human decision making. This means we’ve abstracted some elements of the system logic (the elements that everyone needs to re-use over and over again), while the things specific to the use of the platform are defined in the UI.

Use Cases

We’re really excited to have had a number of really amazing partners new and old using the platform. This includes groups like Newsti.ps who are building a ‘people’s newswire’ using the Swift products.

There are also some really big uses that are occurring. For instance this BBC article profiles the PAX system that is using our platform to power a conflict early warning system. They want to index massive amounts of data from around the world and then use that data to spot historic patterns and trends that then can be used to demonstrate confidence in future patterns.

One of our favorite uses of the Swift platform to date was Product (RED)’s use last year to mash up large quantities of social media activity to power their Turn The World (RED) campaign.

There have been many more uses that we can’t talk about yet, but hopefully those become public soon.

Some Numbers

There are currently eight different code repositories housing the greater Swift project. Each of these API elements is tackled as if it were a single problem. This includes code for location disambiguation, natural language processing, influence detection, reputation monitoring and duplication filtering. You can find more about them here - http://blog.swiftly.org/post/5788873594/resources-for-developers

  • These combined repos contain around 150,000 lines of code (not including frameworks like Kohana)
  • Over 7,000 downloads of Sweeper to date
  • Which theoretically means at least 7,000 users of our APIs
  • Sweeper users tend to aggregate thousands of items of content over the life of a deployment, which means we’ve taken around 70,000,000 items of unstructured data and done things to them like adding location and tags or filtering out the duplicates. That’s a very liberal extrapolation, but it gives you a sense of the amount of data we’re dealing with.
  • As the project moves forward, and all our APIs are finally completed, this number will grow exponentially. With RiverID alone (which tracks the reputation of content and people online) we expect to be indexing over half a billion items of content and actions from the social web by the end of the year. That’s just one API; the others will also need to scale on equal terms.
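
As a back-of-the-envelope check on the extrapolation above, the short Python sketch below works out the average number of items per deployment that the 70,000,000 figure implies. It assumes, as the list does, that every download corresponds to one deployment; the function name is made up for illustration.

```python
def implied_items_per_deployment(total_items, deployments):
    """Average items per deployment implied by a total and a deployment count."""
    return total_items // deployments


if __name__ == "__main__":
    downloads = 7_000          # "Over 7,000 downloads of Sweeper to date"
    total_items = 70_000_000   # the extrapolated total quoted above
    print(implied_items_per_deployment(total_items, downloads))
```

In other words, the estimate treats each download as an active deployment that aggregates on the order of ten thousand items, which is why it is described as a very liberal extrapolation.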

Photo by Fabrice Florin

Building a people’s newswire with Newsti.ps

[Guest blog post by Jenka Soderberg, a 2011 Knight Fellow at Stanford University and Evening News Director at KBOO Community Radio in Portland, Oregon. She can be reached at jenka [at] stanford [dot] edu]. This is a cross-post from the Ushahidi blog.

When I first started working on www.Indymedia.org in 2000, I was really excited about the platform it provided: a way for people who witnessed news events to immediately publish text, audio, video and photos to an OPEN newswire.  This was unprecedented on the web at that time, and led to an explosion of open multimedia content-posting sites.  Since its inception at the World Trade Organization protests in Seattle in 1999, the Independent Media Center expanded into over 200 local sites worldwide, all funneling featured content into the main (global) site www.indymedia.org.  In many ways, this could represent the way news organizations operate in the future – but most of the major news companies haven’t caught on to this trend just yet.

I got into the world of journalism because I didn’t trust the media.  Time and again, I’d read, hear or watch news stories that were grossly inaccurate, one-sided and oversimplified.  So I took seriously the slogan, “Don’t hate the media, BE the media”, and helped launch a bunch of indymedia centers and microradio stations all over the world, always with the hope of giving voice to the voiceless, allowing people to tell their own stories and to share in the narrative that was developing about them without the often-damaging involvement of advertising dollars and managing editors who presume to dumb things down for audiences they believe they have to entertain as well as inform.

Now, with more and more people turning away from traditional media to get their news online (see chart), it seems those audiences, about whom so many assumptions were made by the management of media corporations, are trying to find their own way in the new media world and find the news that they think is important and valuable.

Unfortunately, this often means that people seek out only news sources that confirm and uphold their existing points of view, and may be just as full of inaccuracies, speculation and oversimplification as the news media that they were trying to escape.

How can we get through the mess of misinformation to find the real tips of breaking news events, as they’re happening, and get this information out to as broad an audience as possible?

I’ve been working with a team at Stanford this year to use Ushahidi’s Swiftriver platform, and specifically Sweeper (one of the multiple tools in the Swiftriver toolbox) to try to extract real newstips from the deluge of 140-character texts and tweets, and try to figure out which newstips are real and accurate.  Our project description and current newswire is at www.newsti.ps

We’re implementing this in the Occupied Palestinian Territories, an area where many news incidents are under-reported in the US, and others are over-reported, giving US audiences a skewed perspective of the reality on the ground. We’re using the Swiftriver platform to skim the web and Twitter for content that is then filtered by keyword, location, reputation and duplication and organized into a database. Our reporters in different parts of the Palestinian Territories (the West Bank, Gaza and Jerusalem) can follow up on the most poignant of these tips and verify their accuracy. These reporters have created the International Middle East Media Center (www.imemc.org), currently the most widely-read English-language news site based in the Palestinian Territories.

We’re also working on a way to allow people who witness news events but don’t have the luxury of a smart phone yet (only 2% of cellphone users in the Palestinian Territories have smart phones, and 3G is extremely spotty), to send texts and photos directly into our system as well.  For translation of Arabic texts, we’ve solicited the help of the crowdsourced translation team of www.meedan.net.

As with Indymedia, we think that this work can be an alternative to the mainstream media. As always, they are free to use these news stories, although it seems unlikely that many will. When news corporations are focused on selling advertising instead of providing accurate news for their audiences, they will continue to go the way of the dinosaurs, as they are doing. Unfortunately what we’re losing right now are lots of good, investigative news reporters who held politicians’ feet to the fire, reported on breaking news events and local issues, investigated wrongdoing by large companies, and connected audience members with the stories of people in different circumstances far across the globe, but with whom they could relate due to the strength of the writing and storytelling. What we’re left with right now, to a large extent, are cable news channels whose focus is on entertainment and advertising, and vitriolic talk radio that exuberantly embraces speculation, rumor and misinformation over fact-checked, accurate news reports. On the local news front, AOL’s newest brainchild, patch.com, threatens to replace real local reporting with half-hearted, badly-written reports that are unapologetically inaccurate.

Can we get a ‘people’s newswire’ based on eyewitness reports of newsworthy events?  I believe we can – if we combine the automation of systems like Swiftriver, the data visualization possibilities of tools like Ushahidi, and the insight of trained reporters who can follow up on potential leads.  Heck, if we can do it in the Palestinian Territories, then we can do it anywhere!

The video below is a short presentation about this project. Be sure to check out our website www.newsti.ps for real-time updates during the upcoming humanitarian flotilla to break the siege on the Gaza Strip.

Video of my presentation earlier today at Personal Democracy Forum


Introducing Push Plugins

Anyone pulling from the nightly repo may have noticed a cool new feature for the Swift Core that Ahmed wrote last month: our Push Plugin architecture. This, as well as a number of other features, will be released with the next release of Sweeper and the Swift PHP Core.


How Push Plugins Work

Push Plugins allow SwiftRiver applications to acquire content via push (versus pull) commands. For instance, if a user needs an SMS gateway to submit to a Swift app, you no longer need to poll the server for that content. Instead, the gateway can tell your app when there’s content by pushing to the application.
 
This plugin architecture currently supports receiving data through the standard HTTP methods GET and POST.
 
In addition, this architecture can be extended through Push Plugins to support the injection of any kind of data into the system. For example, the uploading of content from files, or the use of bookmarklets such as the Quiver extension.
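
To make the push model concrete, here is a minimal Python sketch of what a gateway might do when a message arrives: build an HTTP POST request addressed to the Swift application. The endpoint URL and the field names are hypothetical (this post does not specify them), and the request is only constructed here, not sent.

```python
import urllib.parse
import urllib.request


def build_push_request(endpoint_url, fields):
    """Build a form-encoded HTTP POST request that pushes fields to an app.

    endpoint_url and the keys of fields are illustrative assumptions; use
    whatever your own push parser expects.
    """
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_push_request(
        "https://example.org/swift/push",  # hypothetical endpoint
        {"from": "+2340000000000", "text": "Long queue at the polling unit"},
    )
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```

The point of the design is that the sender initiates the exchange, so the application does no polling and content arrives as soon as the gateway has it.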

How To Develop Push Plugins

Locate the Modules/SiSPS/PushParsers folder. This is where you’ll find push parsers. To develop a push parser you will need to do the following:

  1. Create a file named <parsername>PushParser.php (the class name needs to be the same as the file name).
  2. Place the class in the namespace Swiftriver\Core\Modules\SiSPS\PushParsers.
  3. Implement the following methods:
     • PushAndParser($raw_content = null, $post_content = null, $get_content = null)
     • GetDescription() - This is what gets displayed in the Sweeper UI that describes how the parser works
     • ReturnType() - Returns the type that describes what the push parser is all about

The second and third methods are implemented for display purposes so that your parser can be displayed correctly in Swift applications.

The first method is where you need to write code to convert the content being received by your parser into the Swift object model. This function should also return the content back to the rest of the SwiftRiver Core once it’s finished.

Depending on the type of resource your parser is listening out for, it will receive the content in one of the three variables $raw_content, $post_content and $get_content.
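
The shape of that contract can be sketched as follows. Note that this is a Python stand-in written purely for illustration: a real push parser must be a PHP class in the namespace given above, and the plain dictionary returned here is a hypothetical placeholder for the Swift object model, whose actual classes are not described in this post. The class name and the returned strings are also made up.

```python
class EchoPushParser:
    """Python stand-in mirroring the three methods a push parser implements."""

    def PushAndParser(self, raw_content=None, post_content=None, get_content=None):
        # Content arrives in exactly one of the three arguments, depending on
        # the kind of resource the parser listens for.
        if post_content is not None:
            fields = dict(post_content)
        elif get_content is not None:
            fields = dict(get_content)
        elif raw_content is not None:
            fields = {"text": raw_content}
        else:
            return []
        # Hypothetical placeholder for a Swift content item.
        item = {"title": fields.get("title", ""), "text": fields.get("text", "")}
        # Hand the converted content back to the caller (the Core, in PHP).
        return [item]

    def GetDescription(self):
        # Shown in the UI to explain what the parser does.
        return "Accepts pushed title/text pairs and echoes them as content items."

    def ReturnType(self):
        # Short label describing the kind of parser this is.
        return "Echo"


if __name__ == "__main__":
    parser = EchoPushParser()
    print(parser.PushAndParser(post_content={"title": "Tip", "text": "Road closed"}))
```

Keeping the conversion in one method and the two descriptive methods separate means an application can list and describe available parsers without ever feeding them content.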

Jon’s talk from TED Global 2010 about the evolution of the SwiftRiver platform and learning from mistakes in crowd-sourcing.