Wikidata can change the way citizen scientists contribute

If you’ve been following discussions on citizen science, you’ve probably realized that researchers are generating so much data that they need extensive help parsing it and making it more useful. For many projects, citizen scientists have answered the call for help, making enormous contributions. Sure, a recent study found that “Most participants in citizen science projects give up almost immediately”, but as Caren Cooper pointed out:

Just by trying, citizen scientists made important contributions regardless of whether or not they chose to continue.

But I digress…

What does citizen science have to do with Wikidata? For that matter, what the heck is Wikidata?

Most citizen science contributions come in the form of data collection (observations; sample collection; taking measurements, pictures, coordinates, etc.) or classification (identification, data entry, etc.), but few citizen scientists participate in analyzing the data.

From ‘Surveying the Citizen Science Landscape’ by Andrea Wiggins and Kevin Crowston (click the figure to read the paper, it’s open access)

Wikidata (a linked, structured database for open data) may serve to change that. Naturally, Wikidata relies on the contributions of volunteers; however, the data incorporated into Wikidata is open for anyone to use. In fact, Wikidata is begging to be used, and citizen scientists and citizen data scientists are welcome to use it. An international group of researchers has already put together a grant proposal (open/crowdsourced in the true spirit of Wikipedia) to make Wikidata an open virtual research environment. Dubbed ‘Wikidata for Research’, the proposal aims to establish “Wikidata as a central hub for linked open research data more generally, so that it can facilitate fruitful interactions at scale between professional research institutions and citizen science and knowledge initiatives.”
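For a taste of what “using Wikidata” can look like in practice, here is a minimal Python sketch that builds a SPARQL query for the public query.wikidata.org endpoint. The property P2175 (“medical condition treated”) is a real Wikidata property, but the query shape and the helper names here are illustrative assumptions; you would look up the actual item ID for your topic of interest on wikidata.org first.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def build_query(disease_qid):
    """Build a SPARQL query listing drugs whose 'medical condition
    treated' (P2175) statement points at the given disease item."""
    return f"""
    SELECT ?drug ?drugLabel WHERE {{
      ?drug wdt:P2175 wd:{disease_qid} .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
    }}
    """

def run_query(query):
    """Send the query to the public endpoint and return parsed JSON."""
    url = WIKIDATA_SPARQL + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"})
    req = urllib.request.Request(url, headers={"User-Agent": "demo/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Pass `build_query` the Q-identifier of a disease item and hand the result to `run_query` to get back structured, reusable open data rather than free text.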

As exciting as this all is, a lot of work still needs to be done to make Wikidata more successful. Although it’s open access, it’s still a bit inaccessible due to the lack of clear documentation for new users. It’s not that the information doesn’t exist: there is a ton of information on Wikidata, along with a lot of neat tools already available and in development. You just have to look really hard for it. Fortunately, the Wikidata community is already aware of the key issues that need to be addressed for it to become more successful.

Researchers have already made considerable efforts to make science more accessible by contributing to science-related articles. There are over 10,000 genes already in Wikipedia thanks (in part) to the Gene Wiki initiative! It makes sense that Wikidata is next. A lot of progress has been made in this arena, but I’ll save that for later.

If you can READ, you CAN HELP

As mentioned in Andrew’s TEDx Talk in this post, one of the grand challenges in scientific research is creating a system where all scientists can access the knowledge embedded within the entirety of the research literature. With such a system, scientists would be able to build the bridges they need to connect ideas across disparate fields of research faster. But how could such a system be built? While computation-based text miners can be fast and automatic, they are rather error-prone, as reading with comprehension remains a uniquely human skill.
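To see why mining text by machine alone falls short, consider a toy dictionary-based tagger (invented for illustration; this is not Mark2Cure’s actual pipeline). The word “cold” matches whether it names an illness or a temperature, so a human reader is still needed to tell the two apart:

```python
# A deliberately naive disease tagger: flag any sentence containing a
# term from a fixed dictionary, with no understanding of context.
DISEASE_TERMS = {"cold", "raynaud's syndrome"}

def naive_tag(sentence):
    """Return every dictionary term found in the sentence (case-insensitive)."""
    text = sentence.lower()
    return [term for term in DISEASE_TERMS if term in text]

# Both sentences trigger the "cold" tag, but only the first is about illness.
print(naive_tag("The patient caught a cold last winter."))
print(naive_tag("Symptoms worsen in cold weather."))
```

A human annotator marks only the first sentence as disease-related; a dictionary matcher cannot make that call, which is exactly the gap citizen-scientist annotation fills.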

Mark2Cure enlists the help of citizen scientists to annotate biomedical literature–in essence allowing biomedical texts to be mined for useful information.

Beta testing starts soon. Join here now to help.

This post was originally written for Mark2Cure and can be viewed in its entirety here.

How Would Mark2Cure Expedite Scientific Discovery?

How would thorough annotations improve information extraction from the biomedical research literature? To illustrate one of the issues Mark2Cure aims to address, we’ll start with an example drawn from history: the ‘undiscovered public knowledge’ that an information scientist named Don Swanson successfully mined in the 1980s.

At that time, there were 2,000 research articles on Raynaud’s syndrome, a disease characterized by feelings of numbness and cold in some parts of the body in response to cold temperatures or stress. There were also about 1,000 research articles on fish oils, but there were no articles that spanned, joined, or linked both subjects.

In the case of Mark2Cure, citizen scientists will annotate the abstracts of research articles, since the pace of research publication is much higher now than it was in the 1980s.
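Swanson-style linking can be sketched in a few lines: find the intermediate terms that co-occur with both literatures, even though the two literatures themselves never overlap. The mini-corpus below is invented for illustration (though reduced blood viscosity was in fact one of Swanson’s actual fish oil–Raynaud’s bridges):

```python
# Each "paper" is reduced to the set of annotated topics it mentions.
corpus = {
    "paper1": {"raynaud's syndrome", "blood viscosity"},
    "paper2": {"raynaud's syndrome", "vasoconstriction"},
    "paper3": {"fish oil", "blood viscosity"},
    "paper4": {"fish oil", "platelet aggregation"},
}

def linking_terms(corpus, a, c):
    """Terms co-occurring with both a and c, the hidden A-B-C bridges."""
    with_a = set().union(*(t for t in corpus.values() if a in t))
    with_c = set().union(*(t for t in corpus.values() if c in t))
    return (with_a & with_c) - {a, c}

print(linking_terms(corpus, "raynaud's syndrome", "fish oil"))
# → {'blood viscosity'}: a bridge between two literatures that never cite each other
```

The catch is the input: this only works if the papers’ topics have been accurately annotated first, which is precisely the task Mark2Cure asks citizen scientists to do.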

This post was originally written for Mark2Cure and can be viewed in its entirety here.

Neat Science Thursday – Social Experiments over the Internet

As a fan of video games with a keen interest in human behavior, I was fascinated by the Twitch Plays Pokemon social experiment set up on the video streaming site Twitch. The programmer who designed this experiment streamed a game of Pokemon and parsed actionable comments from the channel’s chatroom. The actionable comments (or commands) were then executed in the game, allowing the crowd to essentially play the game. Although participation varied at times, the number of participants reached a whopping 1,165,140, earning the Pokemon Red run recognition from Guinness World Records for having “the most participants on a single-player online videogame”.
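The core mechanic can be sketched roughly as follows (a toy under my own assumptions, not the stream’s actual code): filter chat messages down to a whitelist of game commands, then either execute them in arrival order (“anarchy” mode) or tally votes and execute the winner (“democracy” mode):

```python
from collections import Counter

# Button presses the game accepts; everything else in chat is ignored.
VALID = {"up", "down", "left", "right", "a", "b", "start", "select"}

def parse_commands(chat_lines):
    """Keep only chat messages that are actionable game commands."""
    return [line.strip().lower() for line in chat_lines
            if line.strip().lower() in VALID]

chat = ["UP", "praise helix!", "a", "LEFT", "kappa", "a"]
print(parse_commands(chat))                           # anarchy: run them all in order
print(Counter(parse_commands(chat)).most_common(1))   # democracy: pick the top vote
```

With a million people feeding one input stream, anarchy mode is what produced all the corner-walking and ledge-jumping.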

In spite of speculation that the players would never reach sufficient consensus at each decision point, and that trolls would never allow any progress to be made, it took the participants about 16.5 days to finish the game. Red may have spent a lot of time walking into corners or jumping off ledges, but eventually he made it to the finish line. The Twitch Plays Pokemon series offers a fascinating look at how users organize themselves, contribute to, and alter the landscape surrounding the game. Memes (like ‘Consult the Helix’) were born, and several Pokemon became religious icons.


Of course, in the case of Twitch Plays Pokemon, the players were actively engaging in that social experiment; Twitch users were not automatically included unless they engaged with the series. In many other cases, though, anyone using the internet is unwittingly taking part in social experiments.

As pointed out in a recent MIT Technology Review article:

    “When doing things online, there’s a very large probability you’re going to be involved in multiple experiments every day,” Sinan Aral, a professor at MIT’s Sloan School of Management, said during a break at a conference for practitioners of large-scale user experiments last weekend in Cambridge, Massachusetts. “Look at Google, Amazon, eBay, Airbnb, Facebook—all of these businesses run hundreds of experiments, and they also account for a large proportion of Web traffic.”

To no one’s surprise, there was outrage when Facebook users discovered they were unwittingly taking part in a social experiment, demonstrating:

    how few people realize they are being prodded and probed at all.

    “I found the backlash quite paradoxical,” said Alessandro Acquisti, a professor at Carnegie Mellon University who studies attitudes towards privacy and security. Although he thinks user experiments need to be conducted with care and oversight, he felt the Facebook study was actually remarkably transparent. “If you’re really worried about experimentation, you should look at how it’s being used opaquely every day you go online,” Acquisti said.

The internet has made it much easier to conduct and take part in social experiments: so easy that we’re participating just by going online, whether we like it or not! To be fair, the experimenters aren’t limited to nameless faces behind big companies; anyone can get in on the action! In one case, a photographer sent her picture to 40 different amateur Photoshop retouchers from 25 different countries using an online task crowdsourcing site called Fiverr. The result revealed interesting variations in how individual retouchers from across the globe defined beauty; however, it would be hard to draw any conclusions, given that few studies have been done on the population that participates on Fiverr.

Microtask sites like Amazon Mechanical Turk are also getting into the game, and fortunately there is an effort to learn more about the turkers in this case. In fact, there’s an entire blog dedicated to social science experiments conducted on Amazon Mechanical Turk, which led to a very interesting post about Amazon Mechanical Turk as the new face of behavioral science.

I wonder how that Photoshop experiment would have gone if it were done using Amazon Mechanical Turkers, especially since AMT allows for prequalification testing of the potential worker pool.

Should people browsing the internet harbor any hope of privacy, or of exclusion from being unwittingly used, or is that already an illusion of the past?

Neat Science Thursday – Too Much Information

Personalized medicine has been a goal for a growing number of biomedical researchers over the last twenty years. Considering that the biomedical research literature on personalized medicine has grown from 5–10 articles per year in the 1980s to over 2,500 articles per year since 2013, incredible progress has been made towards this extremely challenging goal. For personalized medicine to happen, at least two elements are necessary: 1. a means of acquiring personalized data, and 2. a means of integrating, analyzing, and applying that data.

The explosive improvements in the amount, quality, efficiency, and cost-effectiveness of obtaining personalized data create a huge challenge in the integration, analysis, and application of that exponentially growing body of data. Thus, the challenge of personalized medicine primarily lies in the integration and analysis of ‘Big Data.’ Yes, there is always room for improvement in data acquisition, but all the growing data is problematic if it cannot be effectively utilized. Researchers from all over the world have been working on analysis tools to better extract useful information from the growing available sets of omics data, and [Warning: shameless plug alert] the Su lab’s Omics Pipe is one attempt to automate best-practice multi-omics data analysis pipelines. [End shameless plug]

Barbour Analytics published a fascinating post on the N-of-1 problem in Big Genomics. It is a great read for anyone interested in personalized medicine, rare diseases, big data, and bioinformatics. Here is a teaser to encourage you to take a look at the original post:

    How do we assess the impact of a single novel mutation, or a set of novel mutations, unique to an individual? This is the N of 1 problem in Big Genomics. Statistics, and statistical genetics rely on summary, on binning the patterns of populations of individuals into categories of adequate size that we can compare groups using standard metrics like mean, median, mode, standard error, and in more elaborate frameworks use more sophisticated metrics like moments, edges, vectors and ridges.

    The N of 1 problem in Big Genomics will require modeling approaches, to construct models of the genome, and make projections on the likely function of single de novo mutations, and suites of these private mutations. Robust modeling efforts in this area will be a major challenge in the era of genomic medicine, and personalized medicine. At present we are effectively constrained to study mutations that have recurred throughout evolution. As our population grows, as the number of persons under care, and participating in genomic medicine increases, we will need to address the private mutation issue head on.

    We can look to cancer genomics for some guidance. Cancers are a genomic disease, with both inherited and de novo elements, and direct sequencing of genomes often reveal unique mutations that lead to unique cancer profiles. This field has the advantage of seeing a clear disease manifestation in the form of tumor growth, often restricted to a tissue or cell type. This helps make more direct inferences about the likely function of the novel mutation.

    That said, we face the inherent limitation that a mutation may be unique, or at least rare, and for this reason it is difficult to use traditional statistical approaches, approaches that rely on summary, on the behavior of groups of instances. While the genotype information may be limited to one person in these instances, we can assist ourselves in this effort by capturing more information about clinical and biological phenotype. Detailed phenotypic characterizations of a tumor or affected tissue – extending to the transcriptome, kinome, metabolome, cytokine profiles, cell morphology and indeed clinical status itself, can help us perform a sort of reverse interpolation to infer the function of the single N of 1 de novo mutation. While the mutation may be unique or rare, the disease manifestation itself may be common, or at least share key features with other maladies.

Now go read the original post here: ‘Private Mutations’: The N of 1 in Big Genomics — Barbour Analytics.