Knight News Challenge Grant!

It’s truly an honor to accept a $250,000 grant from the Knight Foundation for the SwiftRiver project!  It’s the culmination of a long journey that began in 2008 but evolved in 2010 when I joined the project as product designer, followed later by Matthew Griffiths as lead developer.

Swift is an open-source initiative whose goal is to make the process of vetting information more efficient.  The project to date has progressed well thanks in no small part to the following people: Matthew Griffiths (so important to this project I mentioned him twice), Ahmed Maawy, Charl Van Neikerk, Heather Ford, Vladimir Ermakov, The Ushahidi team, Omidyar Networks, Chris Blow, Ed Bice, Kaushal Jhalla, Neville Newey, Edmar Ferreira, Pete Warden, Patrick Meier, Anahi Ayala, Ethan Zuckerman, the TED staff, Google’s 2010 Summer of Code Participants (Mang-Git, Soe, Nishith Rastogi), the Guardian’s Activate staff, Product(RED) and many others. This project would be nowhere without you all so thanks for making it happen.

For many of us, this project represents a new way of democratizing access to the tools for understanding and vetting information, which are needed by Ushahidi, journalists, and many others.

Open Source Bookmark Curation

With the latest release of Sweeper, you can roll your own bookmarking service. This is really powerful when you start activating plugins like our auto-tagger SiLCC or our Push plugins, which can output all of your bookmarked content as a feed that can be consumed by other applications.

We call this little plugin Quiver. It’s where you manually collect and store information using Sweeper. Essentially it turns Sweeper into your free and open-source Delicious clone, with all the contextualization and aggregation features that people have come to love it for.

So how does it work? It’s simple! Just download and install any version of Sweeper following the current release of v0.3.2 which can be found here.

Once you’ve done that, go to the ‘sources panel’.

Select ‘Quiver’ from the list.

Drag the bookmarklet to your browser bar.

Done! Sweeper is a tool for the curation of real-time media. Now the things you find interesting can be mashed up with the content you’re aggregating from the web, Twitter, email and other feeds! It’s particularly useful for journalists or researchers who need the real-time content, but who want to augment that with their personalized interests and findings.

Get it from Swiftly.org

Algorithms Augmenting Human Decisions

Here’s an update about the SwiftRiver platform from PDF11, where I had the pleasure of speaking yesterday. My slides are below, and you can find video of my presentation here.

Crowdsourcing 102: Mining Real-Time Data  

The summation of the talk is that the Swift project has been assigned a very complex and incredibly difficult task: to verify and contextualize data from the mobile and social web. How do we do this? This seems to be the part that confuses people. It’s not any of our apps, and it’s not any of our individual APIs that we rely upon to do this. It’s the combination of all these things together, as part of one robust algorithm that tries to digitally reconstruct the real-world context, using the features extracted from the content to prioritize and de-prioritize information relevant to that context.

I like to refer to this as folksonomic triage, where layers of historic, social, temporal, geospatial and other types of information are layered on one another to perform a function, and the system (through a process called active learning) then learns how to improve from the user’s interaction. What this attempts to do is allow the human to give the machine algorithms some insight into the types of content they prefer, and the types of content they don’t. A statistical profile of the content features of each type is recorded, with varying degrees of nuance in between, including accounting for bias, crosstalk, irrelevancy and falsehoods.
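The feedback loop described above can be pictured with a small sketch. This is not SwiftRiver's actual code: the class name `PreferenceProfile`, the word-level features and the add-one smoothing are all assumptions made for illustration. It keeps a count of the features seen in content the user kept versus content the user dismissed, and scores new items by a smoothed log-odds ratio.

```python
import math
from collections import Counter


class PreferenceProfile:
    """Hypothetical per-user profile of kept vs. dismissed content features."""

    def __init__(self):
        self.counts = {"keep": Counter(), "dismiss": Counter()}

    @staticmethod
    def features(text):
        # Assumption: lowercase word tokens stand in for real content features.
        return [w.strip(".,!?").lower() for w in text.split() if w.strip(".,!?")]

    def feedback(self, text, label):
        """Record one user decision; label is 'keep' or 'dismiss'."""
        self.counts[label].update(self.features(text))

    def score(self, text):
        """Sum of smoothed log-odds; positive leans 'keep', negative 'dismiss'."""
        keep, dismiss = self.counts["keep"], self.counts["dismiss"]
        vocab = len(set(keep) | set(dismiss)) or 1
        keep_total = sum(keep.values()) + vocab
        dismiss_total = sum(dismiss.values()) + vocab
        total = 0.0
        for f in self.features(text):
            p_keep = (keep[f] + 1) / keep_total
            p_dismiss = (dismiss[f] + 1) / dismiss_total
            total += math.log(p_keep / p_dismiss)
        return total


profile = PreferenceProfile()
profile.feedback("Road blocked near the checkpoint", "keep")
profile.feedback("Win a free phone today", "dismiss")
ranked = sorted(
    ["Free phone giveaway", "Checkpoint road reopened"],
    key=profile.score,
    reverse=True,
)
print(ranked)
```

Each call to `feedback` nudges the profile, so the ranking improves as the user keeps interacting, which is the essence of the active learning idea.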

Some of this happens on the application side, some of it happens on the logic/cloud side of things. This is because it’s very important that users understand that the platform is there to serve them, and not the other way around: algorithms augmenting human decision making. This means we’ve abstracted some elements of the system logic (the elements that everyone needs to re-use over and over again), while the things specific to the use of the platform are defined in the UI.

Use Cases

We’re really excited to have had a number of really amazing partners new and old using the platform. This includes groups like Newsti.ps who are building a ‘people’s newswire’ using the Swift products.

There are also some really big uses that are occurring. For instance this BBC article profiles the PAX system that is using our platform to power a conflict early warning system. They want to index massive amounts of data from around the world and then use that data to spot historic patterns and trends that then can be used to demonstrate confidence in future patterns.

One of our favorite uses of the Swift platform to date was Product (RED)’s use last year to mashup large quantities of social media activity to power their Turn The World (RED) campaign.

There have been many more uses that we can’t talk about yet, but hopefully those become public soon.

Some Numbers

There are currently eight different code repositories housing the greater Swift project. Each of these API elements is tackled as if it were a single problem. This includes code for location disambiguation, natural language processing, influence detection, reputation monitoring and duplication filtering. You can find more about them here - http://blog.swiftly.org/post/5788873594/resources-for-developers

  • These combined repos contain around 150,000 lines of code (not including frameworks like Kohana)
  • Over 7,000 downloads of Sweeper to date
  • Which theoretically means at least 7,000 users of our APIs
  • Sweeper users tend to aggregate thousands of items of content over the life of a deployment which means we’ve taken around 70,000,000 items of unstructured data and done things to it like add location, tags or filtered the duplicates. That’s a very liberal extrapolation, but it gives you a sense of the amount of data we’re dealing with.
  • As the project moves forward, and all our APIs are finally completed, this number will grow exponentially. With RiverID alone (which tracks the reputation of content and people online) we expect to be indexing over half a billion items of content and actions from the social web by the end of the year. That’s just one API; the others will also need to scale on equal terms.
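
The back-of-the-envelope figure in the list above can be checked with a few lines of arithmetic. The per-deployment average below is not a measured number; it is simply what the 70,000,000 estimate implies when divided across 7,000 downloads, assuming every download became an active deployment.

```python
downloads = 7_000              # Sweeper downloads quoted above
estimated_items = 70_000_000   # total items quoted above

# Implied average number of items processed per deployment
# (an inference from the two figures, not a measured statistic).
items_per_deployment = estimated_items // downloads
print(items_per_deployment)
```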

Photo by Fabrice Florin

Building a people’s newswire with Newsti.ps

[Guest blog post by Jenka Soderberg, a 2011 Knight Fellow at Stanford University and Evening News Director at KBOO Community Radio in Portland, Oregon. She can be reached at jenka [at] stanford [dot] edu]. This is a cross-post from the Ushahidi blog.

When I first started working on www.Indymedia.org in 2000, I was really excited about the platform it provided: a way for people who witnessed news events to immediately publish text, audio, video and photos to an OPEN newswire.  This was unprecedented on the web at that time, and led to an explosion of open multimedia content-posting sites.  Since its inception at the World Trade Organization protests in Seattle in 1999, the Independent Media Center expanded into over 200 local sites worldwide, all funneling featured content into the main (global) site www.indymedia.org.  In many ways, this could represent the way news organizations operate in the future – but most of the major news companies haven’t caught on to this trend just yet.

I got into the world of journalism because I didn’t trust the media.  Time and again, I’d read, hear or watch news stories that were grossly inaccurate, one-sided and oversimplified.  So I took seriously the slogan, “Don’t hate the media, BE the media”, and helped launch a bunch of indymedia centers and microradio stations all over the world, always with the hope of giving voice to the voiceless, allowing people to tell their own stories and to share in the narrative that was developing about them without the often-damaging involvement of advertising dollars and managing editors who presume to dumb things down for audiences they believe they have to entertain as well as inform.

Now, with more and more people turning away from traditional media to get their news online (see chart), it seems those audiences, about whom so many assumptions were made by the management of media corporations, are trying to find their own way in the new media world and find the news that they think is important and valuable.

Unfortunately, this often means that people seek out only news sources that confirm and uphold their existing points of view, and may be just as full of inaccuracies, speculation and oversimplification as the news media that they were trying to escape.

How can we get through the mess of misinformation to find the real tips of breaking news events, as they’re happening, and get this information out to as broad an audience as possible?

I’ve been working with a team at Stanford this year to use Ushahidi’s Swiftriver platform, and specifically Sweeper (one of the multiple tools in the Swiftriver toolbox) to try to extract real newstips from the deluge of 140-character texts and tweets, and try to figure out which newstips are real and accurate.  Our project description and current newswire are at www.newsti.ps.

We’re implementing this in the Occupied Palestinian Territories, an area where many news incidents are under-reported in the US, and others are over-reported, giving US audiences a skewed perspective of the reality on the ground.  We’re using the Swiftriver platform to skim the web and twitter for keywords that are then filtered by keyword, location, reputation and duplication and organized into a database.  Our reporters in different parts of the Palestinian Territories (the West Bank, Gaza and Jerusalem), can follow up on the most poignant of these tips and verify their accuracy.  These reporters have created the International Middle East Media Center (www.imemc.org), currently the most widely-read English-language news site based in the Palestinian Territories.
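
As a rough illustration of the filtering step, here is a minimal sketch of a pipeline that applies keyword, location, reputation and duplicate filters in turn. It does not use the Swiftriver APIs; the `Item` fields, the reputation threshold and the text normalisation used for duplicate detection are assumptions chosen to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass
class Item:
    text: str
    location: str       # hypothetical: place name already attached to the item
    reputation: float   # hypothetical: source score between 0 and 1


def filter_items(items, keywords, locations, min_reputation=0.5):
    """Keep items that match a keyword and a location, come from a
    sufficiently reputable source, and are not duplicates."""
    seen = set()
    kept = []
    for item in items:
        text = item.text.lower()
        if not any(k.lower() in text for k in keywords):
            continue
        if item.location not in locations:
            continue
        if item.reputation < min_reputation:
            continue
        # Treat items with identical normalised text as duplicates.
        fingerprint = " ".join(text.split())
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        kept.append(item)
    return kept


items = [
    Item("Checkpoint closed this morning", "West Bank", 0.8),
    Item("checkpoint  closed this morning", "West Bank", 0.9),  # duplicate
    Item("Checkpoint rumour, unconfirmed", "West Bank", 0.1),   # low reputation
    Item("Concert tickets on sale", "Gaza", 0.9),               # no keyword
]
tips = filter_items(items, ["checkpoint"], {"West Bank", "Gaza", "Jerusalem"})
print([t.text for t in tips])
```

Whatever survives the filters is what a reporter on the ground would then follow up and verify.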

We’re also working on a way to allow people who witness news events but don’t have the luxury of a smart phone yet (only 2% of cellphone users in the Palestinian Territories have smart phones, and 3G is extremely spotty), to send texts and photos directly into our system as well.  For translation of Arabic texts, we’ve solicited the help of the crowdsourced translation team of www.meedan.net.

Like with Indymedia, we think that this work can be an alternative to the mainstream media – although, as always, they are free to use these news stories, it seems unlikely that many will.  When news corporations are focused on selling advertising dollars instead of providing accurate news for their audiences, they will continue to go the way of the dinosaurs, as they are doing.  Unfortunately what we’re losing right now are lots of good, investigative news reporters who held politicians’ feet to the fire, reported on breaking news events and local issues, investigated wrongdoing by large companies, connected audience members with the stories of people in different circumstances far across the globe, but with whom they could relate due to the strength of the writing and storytelling.  What we’re left with right now, to a large extent, are cable news channels whose focus is on entertainment and advertising, and vitriolic talk radio that exuberantly embraces speculation, rumor and misinformation over fact-checked, accurate news reports.  On the local news front, AOL’s newest brainchild, patch.com, threatens to replace real local reporting with half-hearted, badly-written reports that are unapologetically inaccurate.

Can we get a ‘people’s newswire’ based on eyewitness reports of newsworthy events?  I believe we can – if we combine the automation of systems like Swiftriver, the data visualization possibilities of tools like Ushahidi, and the insight of trained reporters who can follow up on potential leads.  Heck, if we can do it in the Palestinian Territories, then we can do it anywhere!

The video below is a short presentation about this project. Be sure to check out our website www.newsti.ps for real-time updates during the upcoming humanitarian flotilla to break the siege on the Gaza Strip.

Localizing News

The following post was written by a volunteer developer, Vladimir G. Ermakov, a Master’s student at Carnegie Mellon University in Pennsylvania. Over the past few months he took on an ambitious project: to contribute code that would allow us to parse news articles and attempt to auto-detect the primary location that is the subject of any given text.


Localizing News by Vladimir Ermakov

The amount of information available in electronic format is rapidly increasing. It is becoming possible to find out in real time about current events in a particular part of the world based on electronic data such as news articles, blog entries, Twitter feeds and SMS messages. Even though the data is available, there is an overwhelming amount of it and it is hard to stay on top of events that are of relevance. Getting informed about recent developments is particularly important in times of crisis, when lives could depend on timely response. In this project I am exploring ways to pinpoint the location discussed in text documents. I am able to achieve good results by combining location keywords extracted by the Yahoo! Placemaker service with state-of-the-art machine learning and natural language processing techniques.

The basic approach that I’ve embarked upon is to extract location keywords from a document using the Yahoo Placemaker service, and then apply classification techniques to disambiguate which of these locations is most relevant to the document at hand. I’ve conducted experiments with Naïve Bayes and Fisher classifiers using a bag-of-words model for feature extraction, but these did not give good results. I explored an alternative approach: use the count and position of location keywords extracted by Placemaker and feed them into an SVM. This proved to be a very effective way of determining the country that is the focus of the document. Applying lemmatization to location adjectives such as Russian and converting them to nouns such as Russia helped improve the results even further.
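
To make the feature idea concrete, here is a minimal sketch of the count-and-position features with adjective normalisation. It does not call Yahoo! Placemaker or train an SVM: the `ADJECTIVE_TO_COUNTRY` table and the `KNOWN_COUNTRIES` set are tiny hand-written stand-ins, and the final `max` is only a placeholder for the trained classifier.

```python
import re

# Hypothetical stand-ins for Placemaker output and a lemmatizer.
ADJECTIVE_TO_COUNTRY = {"russian": "russia", "french": "france", "kenyan": "kenya"}
KNOWN_COUNTRIES = {"russia", "france", "kenya"}


def location_features(text):
    """Return {country: (normalized_count, relative_first_position)}."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Map location adjectives to their noun form, e.g. "russian" -> "russia".
    tokens = [ADJECTIVE_TO_COUNTRY.get(t, t) for t in tokens]
    hits = {}
    for index, token in enumerate(tokens):
        if token in KNOWN_COUNTRIES:
            count, first = hits.get(token, (0, index))
            hits[token] = (count + 1, first)
    total = sum(count for count, _ in hits.values())
    return {
        country: (count / total, first / len(tokens))
        for country, (count, first) in hits.items()
    }


article = (
    "Russian officials met in Moscow on Monday. Russia will host talks "
    "with France next month, a Russian spokesman said."
)
features = location_features(article)
# Placeholder for the classifier: pick the most frequent, earliest location.
focus = max(features, key=lambda c: (features[c][0], -features[c][1]))
print(focus, features[focus])
```

In a real pipeline the per-location feature pairs would be the input vectors for the classifier rather than being ranked by a fixed rule.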

While Reuters-21578 was a great dataset to use for training classifiers and experimenting with the data, the articles there were collected 20 years ago. What made this project interesting for me is the possibility of visualizing the news around the world on a map, and seeing whether a sudden rise in the number of articles published can be an indicator of some important events.

To make this possible I had to obtain a recent dataset. Reuters has archived articles from the last several years on their website. I developed a simple crawler that visited news articles from this archive, downloaded them to my server, and extracted the news article text content. I then passed this content off to the Yahoo Placemaker service, and output the data with the location labels into XML files. I then could use my scripts to run the experiments on this new dataset, just like I did with the original data.

I limited my data collection to the most recent articles. The archive contained over 400,000 news articles for 2010, which was too many to download. I restricted the crawler to randomly pick 10% of the articles from each day of the year. This was still a significant amount of data, 80,000 articles, and fairly representative of the whole archive.
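
The per-day sampling can be sketched as follows. The `articles_by_day` mapping, the placeholder article names and the fixed seed are assumptions for illustration; the original crawler's code is not shown in this post.

```python
import random


def sample_per_day(articles_by_day, fraction=0.10, seed=0):
    """Randomly keep `fraction` of the article URLs from each day."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = {}
    for day, urls in articles_by_day.items():
        # Keep at least one article for any day that has articles.
        k = max(1, round(len(urls) * fraction)) if urls else 0
        sample[day] = rng.sample(urls, k)
    return sample


# Hypothetical archive: two days with 50 and 30 articles.
archive = {
    "2010-01-01": [f"article-{i}" for i in range(50)],
    "2010-01-02": [f"article-{i}" for i in range(50, 80)],
}
picked = sample_per_day(archive)
print({day: len(urls) for day, urls in picked.items()})
```

Sampling within each day, rather than across the whole archive at once, keeps the selection spread evenly over the year.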

After all the experiments I was able to settle on a working solution for mapping news articles - extract location information from the article using the Yahoo Placemaker service, making sure to lemmatize location adjectives, extract the normalized count and position of location keywords within the article, and apply an SVM classifier to decide which of these locations is most important to the article. The results were encouraging, and I believe this solution is ready to deploy into a real world application. I am hoping to implement an extension to the Swiftriver platform in the near future that uses this method to classify news articles by country.


Vladimir’s paper is a much longer, and much more fascinating, read than I could share here. If you’d like to read it, he can be reached by emailing vermakov [at] emu [dot] edu.

We’re working on folding this and other contributions into the next release.  Thanks for the awesome work Vladimir!  Other developers interested in contributing to the Swift platform can find out more here.

Vote in the Knight News Challenge

Every year the Knight Foundation rewards innovation in technology primarily targeting professional and citizen journalists. The rewards are grants that help projects scale and improve their platforms.  We just entered and wanted to take some time to explain our vision and what we think makes us a worthy applicant.

What is SwiftRiver’s mission?  To democratize access to tools that can be used to filter and make sense of realtime information from SMS, Twitter, Email and the Web.

Where do we add value to news? SwiftRiver is free and open source. This includes APIs for natural language processing, location detection, reputation & trust, duplication filtering and influence detection.

We make these tools open for two reasons. Firstly, in large newsrooms, staff want complete control over their platforms and they need to be able to modify and customize workflows as needed.  This tends to mean they develop similar tools in-house, which is great for organizations with those types of resources, not so great for organizations without them.  Secondly, our goal is to make these advanced intelligence tools available to journalists in even the most remote, unconnected places.

Who needs our products?  The strongest demand for SwiftRiver is actually from journalists who are increasingly overwhelmed by the task of sorting through vast streams of data.  We’re actually working with several different groups from around the world who want to use applications like Twitter and Facebook to gather news, who share the problem of identifying the kernels of reliable information amidst a sea of ‘noise’.

Why should you vote for us? SwiftRiver has gone from merely a concept that was laid out two years ago, to a tangible product over the last year on very limited resources.  

Although we’re part of the Ushahidi family (still a small company in its own right), we don’t have access to the same financial resources or staff.  They all have their hands full making Ushahidi the great product that it is.  Because we’re a small team, we can’t develop things as quickly as we might like.  Demand is way out-pacing our ability to deliver and scale.

We’re a very small team: one full-time person, one part-time developer and we’ve only this month added a third.

Who are you targeting? Swift is for people overwhelmed by data.  That’s a very broad problem that essentially affects everyone with a computer and connection to the internet.  This makes a singular audience difficult to suss out.  I like to say this: We built a platform and we’re using our platform to target different industries, primarily data journalists.

There are many other uses of the SwiftRiver platform, many that people are discovering without our guidance and hopefully that means what we’re doing is powerful, adaptable, relevant in different scenarios, easy to use and most importantly accessible to all.

Vote or ask questions about SwiftRiver in the Knight News Challenge.

Our Week at the Guardian

This was perhaps one of the busiest weeks in the history of the Guardian newspaper after it was thrown into a tailspin on Monday following some small organization publishing a few secret documents. It was incredibly convenient timing that it coincided with a friendly visit from Ushahidi who had long been scheduled to spend some time with the Guardian staff. Jonathan Gosier (Director of Product, SwiftRiver) and Brian Herbert (Lead Software Developer, Ushahidi) have spent the past week with the Guardian staff as part of their Guardian Activate program.

Daithi O Crualaoic explains the Guardian’s decisions in customizing Ushahidi.

What is Guardian Activate? A Guardian platform aimed at world-changers who have proven that through the use of technology and the Internet, we can make the world a better place. Past speakers at Guardian Activate Summit have included Katrin Verclas (Mobile Active), Rose Shuman (QuestionBox.org), Eric Schmidt (Google) and Ethan Zuckerman (Global Voices).

Our own discussions with the Guardian staff spanned a number of topics:

  • Lessons learned from Guardian’s uses and modifications of Ushahidi
  • The role open source software like Ushahidi plays in investigative journalism
  • Data Visualizing and Informatic Cartography (mapping)
  • Exploring the SwiftRiver platform and products
  • Ideas for new open source products for newsrooms and journalists

Guardian champions data journalism

We also had a great tour of the massive four executive floors of the Guardian’s operations in London and the opportunity to sit in on a few non-sensitive meetings and editorial discussions. In the past, Ushahidi has collaborated with newsgroups like Al Jazeera, the Guardian, BBC, Thomson Reuters and others. So it was good to finally have an intense week of discussion with one of the world’s foremost leaders in news, to get some insight as to how our products can be improved to aid the journalistic process.

At the 2010 Guardian Activate Summit our very own Juliana Rotich (Program Director, Ushahidi) gave this talk:

This article also appears on the Ushahidi Blog here.

SwiftRiver Releases Plugins for Wordpress



For all you Wordpress publishers out there interested in SwiftRiver there are two official plugins we’re releasing today that bring Swift to your platform of choice: WP-SiLCC and WP-Veracity.

WP-SiLCC



WP-SiLCC is an auto tagging plug-in. Users who run news sites or aggregators should consider using this to add a basic level of taxonomy to all posts. WP-SiLCC also allows users to tag their own posts for sites that prefer a more folksonomic approach. WP-SiLCC uses active learning techniques to improve how it parses text over time.

Download WP-SiLCC from Wordpress.org

WP-Veracity



WP-Veracity applies Bayesian algorithms to your content to help surface posts based on “interestingness”, influence and time-published rather than popularity alone. From SwiftRiver’s perspective, popularity is only an indicator of influence, not necessarily an indicator of authority. This plug-in calculates popularity (number of hits, trackbacks, comments), a Bayes score and time (older content falls off organically) to offer a better picture of the most interesting posts on your blog at any given time.
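
One way to picture how these three signals might combine is the sketch below. It is not WP-Veracity's actual formula: the weights, the logarithmic damping of popularity and the seven-day half-life are all assumptions, and `bayes_score` is taken as a given number between 0 and 1 rather than computed.

```python
import math


def interestingness(hits, trackbacks, comments, bayes_score, age_days,
                    half_life_days=7.0):
    """Hypothetical blend of popularity, a Bayes score and time decay."""
    # Dampen raw popularity so a viral post does not dominate everything.
    popularity = math.log1p(hits + 5 * trackbacks + 3 * comments)
    # Exponential decay: the score halves every `half_life_days`.
    decay = 0.5 ** (age_days / half_life_days)
    return popularity * bayes_score * decay


posts = {
    "old viral post": interestingness(10_000, 20, 150, 0.4, age_days=60),
    "fresh relevant post": interestingness(300, 2, 10, 0.9, age_days=1),
}
for title, score in sorted(posts.items(), key=lambda p: p[1], reverse=True):
    print(f"{title}: {score:.3f}")
```

With these assumed weights, a recent post with a high Bayes score outranks an older post that has far more raw traffic, which matches the stated goal of not ranking on popularity alone.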

Download WP-Veracity from Wordpress.org




For developers interested in creating their own plugins using Swift Web Services, visit our documentation wiki.

Variations on a Theme

In a past life, before developing software, I was a musician. The two have a lot in common actually: recursive pattern, rhythm, syntax, meter. I suppose most developers don’t think of code this way, but I do. It needs to look as good to humans as it does to machines. When I took over development of Swiftriver and I was looking for a theme to weave through all of our releases, it was natural to default to what I love: music.

Drummer in Ouagadougou


Each release of Swift River will carry the name of a style of African music. The release schedule appears below. I think it’s fitting to be able to pay homage to the music around me in this way, and it actually serves as a starting point for people looking to be exposed to new styles of music that they may not already know. The first version of Swift, an early Alpha, will be available on March 31st, 2010 and there will be regular updates and iterations to follow. If you’re interested in how Swift River verifies and filters the crowd, visit us here.

Alpha Releases
0.0.0 Rumba (Release Date: March 31st, 2010)
0.1.0 Apala
0.2.0 Batuque
0.3.0 Benga
0.4.0 Bikutsi
0.5.0 Cape Jazz
0.6.0 Chimurenga
0.7.0 Fuji
0.8.0 Harare
0.9.0 Jit

Beta Releases
1.0.0 Jùjú (August 1st, 2010)
1.1.0 Kizomba
1.2.0 Kuduro
1.3.0 Kwaito
1.4.0 Kwela
1.5.0 Makossa
1.6.0 Malouf
1.7.0 Maloya
1.8.0 Marrabenta
1.9.0 Museve
2.0.0 Mbalax

Non-Beta and Beyond
2.1.0 Mbaqanga
2.2.0 Mbube
2.3.0 Morna
2.4.0 Palm
2.5.0 Raï
2.6.0 Sakara
2.7.0 Sega
2.8.0 Soukous
2.9.0 Taarab
3.0.0 Zouk

Ouagadougou Drummer Photo by Babasteve