Summer of Swift: Soe

The Google Summer of Code has ended. This year SwiftRiver was a mentoring organization and we wanted to give our GSoCers some ‘face’ time on the blog by interviewing them. Soe is a developer who worked on our distributed reputation product, River ID.

Interview with Soe

Q: What is your educational (or professional) background?

Bachelor’s in Mechanical Engineering. Starting a Master’s in Sustainable Engineering this academic year. I like to code for fun and for social wellness.

Q: The project you were working on was called ____________. Why did you select that as your GSoC project and what did you learn from working on it?

RiverID. I find it challenging, as I am required to build a scalable platform that interacts with remote Swift instances from scratch. The project involves the technical challenges of working with NoSQL for scalability, and with OAuth and REST for interacting with remote Swift instances. Additionally, it involves working with human psychology to determine and reward useful contributions towards Swift instances.

Q: What challenges did you run into during the project, and how did you overcome them?

Working with cutting-edge technologies, for which not many reading materials and examples are available. Hence, it involves a lot of trying things out and experimenting.

Q: GSoCers get to choose the organizations they work with. Why did you choose to work with SwiftRiver?

I am much impressed by the social impact that Ushahidi has. As SwiftRiver would add more value to Ushahidi, I wanted to contribute something to this upcoming platform.

Q: Any closing remarks?

I will continue contributing to RiverID till it is functional. As I have keen interest in developing journalism tools for rural areas, my experience working with RiverID would be very useful for my future project.

On Triage and Verifying Crowdsourced Reports

SwiftRiver is a platform consisting of a number of unique products and technologies. The goal is to aggregate information from multiple media channels (SMS, Twitter, Email, RSS feeds from the web) and to add context: the ‘who’, ‘what’ and ‘where’ of each message. That is, who the message is about, what it’s about, and where it originated. Swift then uses these details to help predict the relevancy of the information coming to the user. This allows us to promote content the user cares about while suppressing content they are less likely to care about (spam, inaccuracies, falsehoods, and crosstalk).

One of the technologies in the works for the Swift platform is RiverID, a distributed reputation system. It works through a process we call ‘triage’, where two or more (usually three) types of data are compared to draw insights that aren’t possible when looking at any one type of data alone.

Let’s use the recent earthquakes in Haiti as an example of how this works. Let’s say we get a message that says “People trapped in a severely unstable building in Neighborhood X.” Our question becomes, who is telling us this? Can they be trusted, and is the information accurate? Traditionally all these questions have to be asked and answered on the fly. That creates a bottleneck on how much information an organization can process: they either put trusted people in the field or they work with vetted organizations on the ground. Neither is possible for organizations that want to gather crowd-sourced reports. The problem still exists, and it’s now amplified because there are even more anonymous people who need to be vetted.

With the above message there are a few ways to attempt verification of what’s being reported. We might start with location. If we know the text message originated from someone in Haiti (there are ways to do this; looking at the country code is one), that location information can then inform our triage data set.
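As a rough illustration of the country-code check, here is a minimal Python sketch. It is not how Swift does it; the `CALLING_CODES` table and the `country_from_number` helper are hypothetical names made up for this example, and the sketch assumes the sender's number arrives in international format (a leading `+` followed by the calling code).

```python
# Hypothetical, deliberately tiny lookup table. A real deployment would need
# the full list of international calling codes.
CALLING_CODES = {
    "509": "Haiti",
    "254": "Kenya",
    "1": "North American Numbering Plan",
}


def country_from_number(number):
    """Guess the origin of a phone number written in international format.

    Returns the matching entry from CALLING_CODES, or None if nothing matches.
    """
    digits = "".join(ch for ch in number if ch.isdigit())
    # Calling codes are one to three digits long, so try the longest prefix first.
    for length in (3, 2, 1):
        country = CALLING_CODES.get(digits[:length])
        if country is not None:
            return country
    return None


if __name__ == "__main__":
    print(country_from_number("+509 3412 3456"))
```

A calling code only tells us where the SIM card was registered, not where the phone is right now, which is one reason location on its own isn't a perfect indicator.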

The second form of context we can attempt to add is corroboration. Are there other reports coming from the same general location and time that corroborate what this message is telling us? If everyone in Neighborhood X is saying that it’s a perfectly sunny day and the kids are playing outside, we have a conflict. Either the crowd is lying or the text message is. So we compare one message with others to see if the stories align, and that becomes an addition to our data set. This used to take a lot of human hours. We want to speed up that process by using algorithms and natural language processing.
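To make the corroboration idea concrete, here is a small Python sketch. It is only an assumption about how such a check could look: it uses plain keyword overlap as a crude stand-in for real natural language processing, and the report fields (`text`, `place`, `time`), the two-hour window and the overlap threshold are all invented for the example.

```python
from datetime import datetime, timedelta


def keywords(text):
    """Lowercased words longer than three letters; a crude stand-in for NLP."""
    words = (w.strip(".,!?\"'").lower() for w in text.split())
    return {w for w in words if len(w) > 3}


def corroboration(report, others, window=timedelta(hours=2), min_overlap=2):
    """Count reports from the same place and time window that share keywords."""
    target = keywords(report["text"])
    count = 0
    for other in others:
        same_place = other["place"] == report["place"]
        close_in_time = abs(other["time"] - report["time"]) <= window
        shared = len(target & keywords(other["text"]))
        if same_place and close_in_time and shared >= min_overlap:
            count += 1
    return count


if __name__ == "__main__":
    now = datetime(2010, 1, 13, 9, 0)
    report = {
        "text": "People trapped in a severely unstable building in Neighborhood X.",
        "place": "Neighborhood X",
        "time": now,
    }
    others = [
        {"text": "Building collapsed, people trapped inside",
         "place": "Neighborhood X", "time": now + timedelta(minutes=20)},
        {"text": "Perfectly sunny day, kids playing outside",
         "place": "Neighborhood X", "time": now + timedelta(minutes=30)},
    ]
    print(corroboration(report, others))
```

A count of zero doesn't prove the message is false, it just means the crowd hasn't backed it up yet, so the number is best treated as one more input to triage rather than a verdict.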

The third data set (the last mile) is where this all becomes fun, because location and corroboration can tell us a lot but they aren’t always perfect indicators. So we attempt to look at history. Has this person reported anything before? If so, were they reliable then? Do we know their telephone number? In other words, can we use history as context? This is where RiverID comes in. RiverID allows a user or organization to form a profile of a source’s communication graph. If I (as a user of Swift) know someone’s name and have their phone number, email address, blog URL, and social network profiles, I can store all that data as a profile of the source. Then in the future, if I get a text message out of the blue from Haiti, it just may end up being from someone I have a profile on.
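Here is one way the profile lookup could be sketched in Python. This is not the RiverID API; `SourceDirectory`, `add_profile`, `find` and `normalize` are hypothetical names, and the sketch assumes every identifier (phone number, email address, URL) can be used as a key pointing at the same profile.

```python
def normalize(identifier):
    """Put an identifier into a canonical form so lookups are forgiving."""
    value = identifier.strip().lower()
    # Assumption: anything starting with "+" is a phone number, so keep only digits.
    if value.startswith("+"):
        return "+" + "".join(ch for ch in value if ch.isdigit())
    return value


class SourceDirectory:
    """In-memory index from any known identifier to a single source profile."""

    def __init__(self):
        self._by_identifier = {}

    def add_profile(self, name, identifiers):
        profile = {"name": name, "identifiers": set(identifiers), "history": []}
        for identifier in identifiers:
            self._by_identifier[normalize(identifier)] = profile
        return profile

    def find(self, identifier):
        """Return the profile for this identifier, or None if the source is unknown."""
        return self._by_identifier.get(normalize(identifier))


if __name__ == "__main__":
    directory = SourceDirectory()
    directory.add_profile("Marie", ["+509 3412 3456", "marie@example.org"])
    sender = directory.find("+50934123456")
    print(sender["name"] if sender else "anonymous")
```

Because every identifier points at the same profile object, a report that arrives by SMS and another that arrives by email both land in the same history.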

The text message is no longer coming from anonymous sources in the crowd; it’s now coming from identifiable sources with unique histories. From that point it’s just a matter of looking back at that user’s history to try to make a decision. If they tend to be reliable and accurate, their RiverID profile will give the statistical advantage to actions they take to verify other reports.
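What “statistical advantage” could mean in practice is easiest to show with a toy example. The Python sketch below is an assumption of ours for illustration, not RiverID’s actual scoring: it estimates a source’s reliability as a smoothed fraction of past reports that turned out accurate, then weights each source’s vote on a new report by that estimate. The function names and the smoothing constants are hypothetical.

```python
def reliability(history):
    """Smoothed share of accurate past reports.

    history is a list of booleans (True = the past report proved accurate).
    Adding 1 to the hits and 2 to the total means an unknown source starts at 0.5.
    """
    return (sum(history) + 1) / (len(history) + 2)


def weighted_verdict(votes):
    """Share of reliability-weighted votes that confirm a report.

    votes is a list of (confirms, history) pairs, one per source.
    Returns a value between 0 and 1; 0.5 means no evidence either way.
    """
    total = sum(reliability(history) for _, history in votes)
    if total == 0:
        return 0.5
    support = sum(reliability(history) for confirms, history in votes if confirms)
    return support / total


if __name__ == "__main__":
    trusted = [True] * 8        # eight accurate reports in the past
    shaky = [False, False]      # two inaccurate reports in the past
    print(weighted_verdict([(True, trusted), (False, shaky)]))
```

With one vote each, the source with the stronger track record pulls the verdict toward its side, which is the kind of advantage described above.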

Now, I should note that we (at SwiftRiver) never have access to all that user data. Only the organizations using our platform do; it all happens on their servers or behind their firewalls. We never touch their data, nor would we ever need to, as every use of SwiftRiver is going to have a different context and, subsequently, differing needs. RiverID data might only be relevant in specific contexts. Essentially we’ve taken the idea of something like Facebook Connect and we’re making it completely opt-in and completely decentralized (the organization stores the user profiles; we just reference their database). This allows the organization to access reputation profiles unique to its group’s needs.

On a final note, I should say that triage may not always consist of the same data types. In this case it was location, corroboration and user history; in other cases it might include things like the time of the report or accuracy (as determined by the user).