Open Source Bookmark Curation

With the latest release of Sweeper, you can roll your own bookmarking service. This is really powerful when you start activating plugins like our auto-tagger SiLCC or our Push plugins, which can output all of your bookmarked content as a feed that can be consumed by other applications.

We call this little plugin Quiver. It’s where you manually collect and store information using Sweeper. Essentially it turns Sweeper into your free and open source Delicious clone, with all the contextualization and aggregation features that people have come to love it for.

So how does it work? It’s simple! Just download and install any version of Sweeper following the current release of v0.3.2 which can be found here.

Once you’ve done that, go to the ‘sources panel’.

Select ‘Quiver’ from the list.

Drag the bookmarklet to your browser bar.

Done! Sweeper is a tool for the curation of real-time media. Now the things you find interesting can be mashed up with the content you’re aggregating from the web, Twitter, email and other feeds! It’s particularly useful for journalists or researchers who need real-time content but want to augment it with their personalized interests and findings.

Get it from Swiftly.org

Algorithms Augmenting Human Decisions

Here’s an update about the SwiftRiver platform from PDF11, which I had the pleasure of speaking at yesterday. My slides are below, and here you can find video of my presentation.

Crowdsourcing 102: Mining Real-Time Data  

The summation of the talk is that the Swift project has been assigned a very complex and incredibly difficult task: to verify and contextualize data from the mobile and social web. How do we do this? Well, this seems to be the part that confuses people. It’s not any one of our apps, and it’s not any individual API that we rely upon to do this. It’s the combination of all these things together, as part of one robust algorithm that tries to digitally reconstruct the real-world context, using the features extracted from the content to prioritize and de-prioritize information relevant to that context.

I like to refer to this as folksonomic triage, where layers of historic, social, temporal, geospatial and other types of information are layered on one another to perform a function, and the system (through a process called active learning) then learns how to improve from the user’s interaction. What this attempts to do is allow the human to give the machine algorithms some insight into the types of content they prefer, and the types of content they don’t. A statistical profile of the content features of each type is recorded, with varying degrees of nuance in between, including accounting for bias, crosstalk, irrelevancy and falsehoods.
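To make the idea of a statistical profile concrete, here is a minimal sketch in Python. It is not the Swift implementation: the class name, the feature strings and the smoothed log-odds scoring are all assumptions chosen for illustration. It only shows how user decisions about content could be turned into a profile that scores new items.

```python
import math
from collections import Counter


class PreferenceProfile:
    """Toy profile of which content features a user tends to keep or discard.

    Hypothetical illustration only; not part of any Swift API.
    """

    def __init__(self):
        self.kept = Counter()       # feature -> times seen in kept items
        self.discarded = Counter()  # feature -> times seen in discarded items

    def record(self, features, kept):
        """Update the profile from one user decision."""
        target = self.kept if kept else self.discarded
        target.update(set(features))

    def score(self, features):
        """Sum of smoothed log-odds; positive means 'resembles kept content'."""
        total = 0.0
        for feature in set(features):
            k = self.kept[feature] + 1       # add-one smoothing
            d = self.discarded[feature] + 1
            total += math.log(k / d)
        return total


# Invented sample decisions, standing in for a user's interaction history.
profile = PreferenceProfile()
profile.record(["source:trusted_reporter", "place:nairobi"], kept=True)
profile.record(["source:trusted_reporter", "tag:election"], kept=True)
profile.record(["source:spam_account", "tag:election"], kept=False)

likely = profile.score(["source:trusted_reporter", "place:nairobi"])
unlikely = profile.score(["source:spam_account"])
```

Here `likely` comes out positive and `unlikely` negative, while a feature the profile has never seen contributes nothing either way. A real system would presumably weigh many more signals, such as time, location and source reputation.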

Some of this happens on the application side, and some of it happens on the logic/cloud side of things. This is because it’s very important that users understand that the platform is there to serve them, and not the other way around: algorithms augmenting human decision making. This means we’ve abstracted some elements of the system logic (the elements that everyone needs to re-use over and over again), while the things specific to the use of the platform are defined in the UI.

Use Cases

We’re really excited to have had a number of really amazing partners new and old using the platform. This includes groups like Newsti.ps who are building a ‘people’s newswire’ using the Swift products.

There are also some really big uses that are occurring. For instance this BBC article profiles the PAX system that is using our platform to power a conflict early warning system. They want to index massive amounts of data from around the world and then use that data to spot historic patterns and trends that then can be used to demonstrate confidence in future patterns.

One of our favorite uses of the Swift platform to date was Product (RED)’s use last year to mashup large quantities of social media activity to power their Turn The World (RED) campaign.

There have been many more uses that we can’t talk about yet, but hopefully those become public soon.

Some Numbers

There are currently eight different code repositories housing the greater Swift project. Each of these API elements is tackled as if it were a single problem. This includes code for location disambiguation, natural language processing, influence detection, reputation monitoring and duplication filtering. You can find more about them here - http://blog.swiftly.org/post/5788873594/resources-for-developers

  • These combined repos contain around 150,000 lines of code (not including frameworks like Kohana)
  • Over 7,000 downloads of Sweeper to date
  • Which theoretically means at least 7,000 users of our APIs
  • Sweeper users tend to aggregate thousands of items of content over the life of a deployment, which means we’ve taken around 70,000,000 items of unstructured data and done things to them like add location, add tags or filter out the duplicates. That’s a very liberal extrapolation, but it gives you a sense of the amount of data we’re dealing with.
  • As the project moves forward, and all our APIs are finally completed, this number will grow exponentially. With RiverID alone (which tracks the reputation of content and people online) we expect to be indexing over half a billion items of content and actions from the social web by the end of the year. That’s just one API; the others will also need to scale on equal terms.
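As a quick check on the extrapolation in the list above, the 70,000,000 figure can be worked backwards. The sketch assumes, as the list does, that every download corresponds to one active deployment. The per-deployment average is derived here and is not a number we measured.

```python
downloads = 7_000           # Sweeper downloads to date (from the list above)
items_total = 70_000_000    # extrapolated items processed (from the list above)

# Assumption: one download equals one deployment.
items_per_deployment = items_total // downloads
```

That works out to roughly 10,000 items per deployment, which sits at the upper end of "thousands of items" and is why the total should be read as a liberal estimate.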

Photo by Fabrice Florin

Building a people’s newswire with Newsti.ps

[Guest blog post by Jenka Soderberg, a 2011 Knight Fellow at Stanford University and Evening News Director at KBOO Community Radio in Portland, Oregon. She can be reached at jenka [at] stanford [dot] edu]. This is a cross-post from the Ushahidi blog.

When I first started working on www.Indymedia.org in 2000, I was really excited about the platform it provided: a way for people who witnessed news events to immediately publish text, audio, video and photos to an OPEN newswire.  This was unprecedented on the web at that time, and led to an explosion of open multimedia content-posting sites.  Since its inception at the World Trade Organization protests in Seattle in 1999, the Independent Media Center expanded into over 200 local sites worldwide, all funneling featured content into the main (global) site www.indymedia.org.  In many ways, this could represent the way news organizations operate in the future – but most of the major news companies haven’t caught on to this trend just yet.

I got into the world of journalism because I didn’t trust the media.  Time and again, I’d read, hear or watch news stories that were grossly inaccurate, one-sided and oversimplified.  So I took seriously the slogan, “Don’t hate the media, BE the media”, and helped launch a bunch of indymedia centers and microradio stations all over the world, always with the hope of giving voice to the voiceless, allowing people to tell their own stories and to share in the narrative that was developing about them without the often-damaging involvement of advertising dollars and managing editors who presume to dumb things down for audiences they believe they have to entertain as well as inform.

Now, with more and more people turning away from traditional media to get their news online (see chart), it seems those audiences, about whom so many assumptions were made by the management of media corporations, are trying to find their own way in the new media world and find the news that they think is important and valuable.

Unfortunately, this often means that people seek out only news sources that confirm and uphold their existing points of view, and may be just as full of inaccuracies, speculation and oversimplification as the news media that they were trying to escape.

How can we get through the mess of misinformation to find the real tips of breaking news events, as they’re happening, and get this information out to as broad an audience as possible?

I’ve been working with a team at Stanford this year to use Ushahidi’s Swiftriver platform, and specifically Sweeper (one of the multiple tools in the Swiftriver toolbox) to try to extract real newstips from the deluge of 140-character texts and tweets, and try to figure out which newstips are real and accurate.  Our project description and current newswire is at www.newsti.ps

We’re implementing this in the Occupied Palestinian Territories, an area where many news incidents are under-reported in the US, and others are over-reported, giving US audiences a skewed perspective of the reality on the ground.  We’re using the Swiftriver platform to skim the web and Twitter for items that are then filtered by keyword, location, reputation and duplication and organized into a database.  Our reporters in different parts of the Palestinian Territories (the West Bank, Gaza and Jerusalem), can follow up on the most poignant of these tips and verify their accuracy.  These reporters have created the International Middle East Media Center (www.imemc.org), currently the most widely-read English-language news site based in the Palestinian Territories.
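For readers who want a feel for what filtering by keyword, location, reputation and duplication involves, here is a rough sketch in Python. It is not Swiftriver’s code: the `Tip` fields, the 0.0 to 1.0 reputation scale, the whitespace-based duplicate check and the sample tips are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Tip:
    text: str
    location: str      # place name attached to the item; "" if unknown
    reputation: float  # hypothetical 0.0-1.0 score for the source


def filter_tips(tips, keywords, locations, min_reputation):
    """Keep tips that match a keyword and a location, clear the reputation
    bar, and are not duplicates of a tip already kept."""
    seen = set()
    kept = []
    for tip in tips:
        text = tip.text.lower()
        if not any(word in text for word in keywords):
            continue
        if tip.location not in locations:
            continue
        if tip.reputation < min_reputation:
            continue
        fingerprint = " ".join(text.split())  # normalise case and whitespace
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        kept.append(tip)
    return kept


# Invented sample data, for illustration only.
tips = [
    Tip("Road closed near the market", "West Bank", 0.8),
    Tip("road  closed near the market", "West Bank", 0.9),  # duplicate
    Tip("Road closed, traffic backed up", "Gaza", 0.2),     # low reputation
    Tip("Concert tonight", "Jerusalem", 0.9),               # no keyword match
]
result = filter_tips(tips, ["road closed"], {"West Bank", "Gaza", "Jerusalem"}, 0.5)
```

Only the first tip survives all four checks. Real duplicate detection would need to be fuzzier than an exact match on normalised text, since retweets and rewordings rarely match character for character.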

We’re also working on a way to allow people who witness news events but don’t have the luxury of a smart phone yet (only 2% of cellphone users in the Palestinian Territories have smart phones, and 3G is extremely spotty), to send texts and photos directly into our system as well.  For translation of Arabic texts, we’ve solicited the help of the crowdsourced translation team of www.meedan.net.

Like with Indymedia, we think that this work can be an alternative to the mainstream media – although, as always, they are free to use these news stories, it seems unlikely that many will.  When news corporations are focused on selling advertising dollars instead of providing accurate news for their audiences, they will continue to go the way of the dinosaurs, as they are doing.  Unfortunately what we’re losing right now are lots of good, investigative news reporters who held politicians’ feet to the fire, reported on breaking news events and local issues, investigated wrongdoing by large companies, connected audience members with the stories of people in different circumstances far across the globe, but with whom they could relate due to the strength of the writing and storytelling.  What we’re left with right now, to a large extent, are cable news channels whose focus is on entertainment and advertising, and vitriolic talk radio that exuberantly embraces speculation, rumor and misinformation over fact-checked, accurate news reports.  On the local news front, AOL’s newest brainchild, patch.com, threatens to replace real local reporting with half-hearted, badly-written reports that are unapologetically inaccurate.

Can we get a ‘people’s newswire’ based on eyewitness reports of newsworthy events?  I believe we can – if we combine the automation of systems like Swiftriver, the data visualization possibilities of tools like Ushahidi, and the insight of trained reporters who can follow up on potential leads.  Heck, if we can do it in the Palestinian Territories, then we can do it anywhere!

The video below is a short presentation about this project. Be sure to check out our website www.newsti.ps for real-time updates during the upcoming humanitarian flotilla to break the siege on the Gaza Strip.

Subjectivity, Veracity and Truth

SwiftRiver is constructed from the viewpoint that there are no absolute truths and that what is considered to be factual by most is still highly subjective or biased depending upon context.

Thus, we build tools which allow users to curate their own depiction of a perspective, in the same way that there is more than one newspaper, more than one political party in most countries, more than one religion, and even more than one ‘official’ source for occurrences like earthquakes or climate change. We build tools that enable people to convey their confidence in datasets. This in no way implies that data is unbiased.

Swift apps add many layers of context to data as metadata to make processing that data faster, which in turn gives our own systems more ammo with which to attempt to understand and automate its processing.

What does veracity mean?

Veracity is simply the term we use to represent the baseline of trust that our users have conveyed about content, sources and events.  This baseline allows us to do things like recommend related content that is likely to be relevant to their view; or in the case of an organization, the collective view.

A number of factors go into creating this profile: how content is organized, how it’s interacted with, how people have behaved in the past, how certain communities feel about their members and vice-versa, calculations for real-world phenomena like time and location of an event and so on.

When it comes to verifying data, our tools serve two purposes.

  1. For a public-facing deployment of our apps (including Ushahidi), we offer tools that allow the user to make a case to the public about a particular view. For example, these are the people in the crowd whom they trust, and what those people had to say about an event.
  2. For a non-public facing deployment, some of our apps (like Sweeper) can be used to structure data, conditionally filter and view it. This is useful for setting up automated workflows like ‘pass only approved content, tagged with location and these tags, and pass that data over to Ushahidi or some other application’.
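The kind of automated workflow described in the second case might look something like the Python sketch below. The item fields (`approved`, `location`, `tags`) and the `send` callback are hypothetical stand-ins, not Sweeper’s or Ushahidi’s actual data model or API.

```python
def passes_rule(item, required_tags):
    """True if the item is approved, has a location, and carries every required tag."""
    return (
        item.get("approved") is True
        and bool(item.get("location"))
        and set(required_tags) <= set(item.get("tags", []))
    )


def forward_approved(items, required_tags, send):
    """Hand each matching item to `send`, a stand-in for the downstream app."""
    forwarded = 0
    for item in items:
        if passes_rule(item, required_tags):
            send(item)
            forwarded += 1
    return forwarded


# Invented sample items; `outbox.append` plays the role of the receiving app.
outbox = []
items = [
    {"id": 1, "approved": True, "location": "Nairobi", "tags": ["flood", "roads"]},
    {"id": 2, "approved": False, "location": "Nairobi", "tags": ["flood", "roads"]},
    {"id": 3, "approved": True, "location": "", "tags": ["flood", "roads"]},
    {"id": 4, "approved": True, "location": "Nairobi", "tags": ["flood"]},
]
count = forward_approved(items, ["flood", "roads"], outbox.append)
```

Only item 1 is forwarded: item 2 is not approved, item 3 has no location and item 4 is missing one of the required tags. Keeping the rule in its own function means the deployer’s criteria can change without touching the forwarding step.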

In both cases the users, the people behind the deployment, are creating their unique baseline for trust, and therefore are putting forth what they consider to be accurate, or favored, content.

Isn’t this bias?

Yes. Any system operated by a human, and I would go further to say any machine created by a human, is subject to some sort of bias.

What does it mean to verify data?

In the context of most Ushahidi applications ‘verified’ means corroborated or confirmed by a human. This means on the receiving end, the person ‘verifying’ the data is essentially saying “I’m taking the onus to approve this report because something, or someone, has indicated that this is true.”

Does that mean untrue information can be ‘verified’ either intentionally or accidentally? Yes. All of the terms are highly subjective and people have a number of preconceptions about what these things mean. They are mere abstractions that represent user behavior and intended use.

Verification Levels

It is a mistake to assume that because something is ‘verified’ or has a high veracity score, it is a fact. What these indications are actually telling viewers and/or the deployer is that this is the baseline for accuracy set forth by an editing body (the deployer). Even verifying reports multiple times by independent participants will not account for human bias or fallibility.

The numbers are there because they are an additional layer of context (readable by machines and humans) allowing the deployer(s) to curate information based on the trust profile they’ve set forth through their interactions.

Likewise, viewing the same data geo-spatially simply implies that this is one community’s understanding of the collected data and what they perceive it to represent. It’s simply a faster way to view data to build up a baseline of ‘favor’ and then use the scores to filter out the content that is less likely to fit that profile.

- Jon Gosier, Director of Product

Our Week at the Guardian

This was perhaps one of the busiest weeks in the history of the Guardian newspaper after it was thrown into a tailspin on Monday following some small organization publishing a few secret documents. It was incredibly convenient timing that it coincided with a friendly visit from Ushahidi who had long been scheduled to spend some time with the Guardian staff. Jonathan Gosier (Director of Product, SwiftRiver) and Brian Herbert (Lead Software Developer, Ushahidi) have spent the past week with the Guardian staff as part of their Guardian Activate program.

Daithi O Crualaoic explains the Guardian’s decisions in customizing Ushahidi.

What is Guardian Activate? A Guardian platform aimed at world-changers who have proven that through the use of technology and the Internet, we can make the world a better place. Past speakers at Guardian Activate Summit have included Katrin Verclas (Mobile Active), Rose Shuman (QuestionBox.org), Eric Schmidt (Google) and Ethan Zuckerman (Global Voices).

Our own discussions with the Guardian staff spanned a number of topics:

  • Lessons learned from Guardian’s uses and modifications of Ushahidi
  • The role open source software like Ushahidi plays in investigative journalism
  • Data Visualizing and Informatic Cartography (mapping)
  • Exploring the SwiftRiver platform and products
  • Ideas for new open source products for newsrooms and journalists

Guardian champions data journalism

We also had a great tour of the massive four executive floors of the Guardian’s operations in London and the opportunity to sit in on a few non-sensitive meetings and editorial discussions. In the past, Ushahidi has collaborated with news organizations like Al Jazeera, the Guardian, BBC, Thomson Reuters and others. So it was good to finally have an intense week of discussion with one of the world’s foremost leaders in news, to get some insight as to how our products can be improved to aid the journalistic process.

At the 2010 Guardian Activate Summit our very own Juliana Rotich (Program Director, Ushahidi) gave this talk:

This article also appears on the Ushahidi Blog here.