Results of the Binary Battle

We are happy to announce that we won the Mendeley / PLoS Binary Battle. This came unexpectedly, although we worked hard to achieve it. As we see openSNP as a community-driven project, we really want to thank all of you: for voting for openSNP in the final round of the Battle, for sharing your data, for finding bugs, for your critique, for your ideas and feature suggestions, and for all of your support. This is a great source of motivation, especially when I think of implementing all the upcoming feature ideas we have.

We also want to send our congratulations to PaperCritic, which came in second, and rOpenSci, which is the second runner-up. Both are definitely worth a look (as are all the other entries of the Binary Battle). It is really great to see what creative minds can build with open data and open APIs.

As Philipp is currently writing his master’s thesis (and I’m also studying for the last exam of this year) there hasn’t been much new in terms of features in the last weeks. But this should end in a week or two and we already have some plans. We are also applying for some small funding via the German Wikimedia Foundation and their WissensWert contest, which funds projects that support open knowledge. We are trying to get the funding in order to genotype more people who would like to share their data but may lack the financial resources to get it done. This could lead to some more data sets on openSNP.

Thanks a lot! And if you have any questions, just contact us. We are really looking forward to getting in touch with you.

You can vote for us

The Mendeley/PLoS Binary Battle now also features a public vote where you can vote for the Top 10+1 submissions. The result of the public vote will count as one point to be added to the expert judges’ votes.

If you want to help us, give us your vote and spread the word. Thanks a lot!

First Results of the Survey on Sharing Genetic Information

General Information

We have finally taken the time to analyze some of the results of the survey on sharing genetic information we did before we started working on openSNP.

Some general information: Overall, 229 people participated in this survey. About 25% of participants gave their chromosomal sex as XX and 74% as XY, and there are no differences in the usage of DTC companies between those groups. The mean age of the participants is ~33, the youngest being 15 and the oldest 70. Over 80% of participants gave their ethnicity as Caucasian.

Nearly 40% of all participants have already used a DTC company to get themselves genotyped, a further 30% plan to do so, and 30% don’t plan to get genotyped. This high share of participants who got themselves genotyped seems to be the result of the ways we spread the survey: We posted it in the 23andMe community, sent it to the DIYBio mailing list, and some bloggers from the fields of genetics/personal genomics also wrote posts on the survey (again: Thanks a lot for your support). We also spread the survey using Twitter, Facebook and Google+. We chose this approach as our goal was not to survey a representative sample, but to assess the demand for a service like openSNP.

68% of all participants said they would agree to share data with their DTC company, no matter if it shared the data with others; 26% would agree to share, given that the company didn’t distribute the data to others; and about 7% were not willing to share at all. No real surprise here: Those who have already been genotyped or are planning to get genotyped are more willing to share than those who don’t plan to. It would be interesting to know if people don’t want to get genotyped because they don’t want to share their data with a company (e.g. because they don’t trust DTC companies).

General reasons (not) to share

My girlfriend says I'm not allowed to display the mean of scaled answers, but then again, she also objects to the bars having shadows, so I wouldn't listen to her.

We also asked a few questions on why people would or wouldn’t share their data with others. Each question could be answered by making a selection on a five-point scale, ranging from 1 (strongly disagree) to 5 (strongly agree). There are quite large differences in the reasons why people would like to share. The most agreed-upon answer is to help scientists with their work (mean = 4.53, median = 5), followed by personal benefits (mean = 3.64, median = 4) and curiosity (mean = 3.5, median = 4). Over half of all people strongly disagree with personalized advertising as a motivation for sharing data (mean = 1.72, median = 1).

There is less diversity in the reasons not to publish the results, although the medians of “fear of discrimination” and targeted advertising show that over half of all participants at least agree on those questions (medians = 4), while the medians of the questions about consequences for close relatives and privacy breaches are in general more neutral (medians = 3).

Differences between customers/non-customers?

We also used an ANOVA and Tukey’s range test to see if there are any differences in agreeing/disagreeing on those questions between survey participants who have already gotten genotyped, those who plan to get genotyped and those who don’t plan to get genotyped. On the topics why people would share their data we found significant differences for the questions regarding helping scientists, having personal benefits and curiosity. Participants who have already gotten genotyped agree more on those questions than those who don’t plan to get genotyped. For curiosity and helping scientists this is even true when comparing the don’t-plan-to group with the plan-to group, with the latter agreeing more.

Regarding reasons not to share genetic information we find similar results: Those who don’t plan to get genotyped agree significantly more on all four questions, compared to those who have gotten themselves genotyped.
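For readers who want to try this kind of comparison themselves, here is a minimal sketch of the F statistic behind a one-way ANOVA in plain Python. The answer lists are made-up illustrative values, not our survey data, and the function name is our own invention. It stops at the F statistic; the p-value and Tukey’s range test would in practice come from a statistics package.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the individual answers around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up answers on the 1-5 scale for one question (NOT survey data).
genotyped = [5, 5, 4, 5, 4]
planning = [4, 5, 4, 4, 3]
not_planning = [3, 2, 4, 3, 3]

f_stat, df_b, df_w = one_way_anova_f([genotyped, planning, not_planning])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F value means the group means differ more than the spread inside the groups would suggest; a post-hoc test such as Tukey’s then tells you which pairs of groups differ.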

Summary (tl;dr)

Although there are no big surprises in those statistics, it is great to get some results regarding our own guesses:

  • People who are already customers of Direct-To-Consumer testing companies (or at least plan on becoming a customer) are more likely to share their data with the company, even if the company allows others to use the data.
  • Customers of DTC testing companies agree more on questions regarding reasons to share genetic information than those participants who don’t plan to become customers.
  • Those who don’t plan to get genotyped do agree more on topics regarding reasons not to share their data than those who are already genotyped.

It seems that participants who (plan to) get genotyped feel more optimistic about the benefits of sharing their data with the DTC company as well as the public, and see fewer problems in the possible reasons not to share their data with others, compared to those who don’t plan to get themselves genotyped. And the same seems true vice versa, of course: those who do not plan to get themselves genotyped agree more with questions concerning the risks of sharing, while scoring lower on questions concerning the possible benefits of doing so.

It’s too bad that we can’t find out (given the current survey) if this is more than correlation. Do people feel more optimistic and lose some of their fears about sharing their data, after they’ve gotten genotyped? Or do they get themselves genotyped because they feel more optimistic about it in the first place (which seems more likely to me)?

We will explore the data set a bit more in the future. Do you have ideas about what we should take a look at?

Binary Battle, Wissenswert-Contest and planned features

Binary Battle & WissensWert-Contest

We are participating in the Mendeley/PLoS Binary Battle. Over 40 applications that make use of the Mendeley and PLoS APIs were submitted. A selection of 11 submissions will be reviewed by some great judges and get the chance to win US$10,001. We’re happy that we made it into this final selection. But you should also check out the other applications, there are some great tools.

We are also participating in the WissensWert contest of the German Wikimedia Foundation. They fund ideas that make use of open licenses and try to support Free Knowledge with up to 5000 €. We applied for the funding to get some people genotyped who would like to make their results freely available, but lack the financial resources to pay for it themselves (this is something we have encountered quite often). With the money we could get over 30 people genotyped by a DTC company. Making those results available to the public would provide a great resource for people who are interested in personal genetics.


There is not much new at openSNP in terms of features, but we are currently working on implementing the Distributed Annotation System (DAS) in openSNP. DAS is a protocol that has been around for ~10 years. It delivers genetic information in a way that makes the data easy to reuse and remix. For example the UCSC Genome Browser and Ensembl make heavy use of it to display sequences, along with their annotation (SNPs, genes, diseases etc.). Rafael Jimenez and Manuel Corpas also use the DAS protocol for their MyKaryoView, which is a genome browser that is meant to be used with genotyping data.
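To give a feeling for what talking to a DAS server looks like, here is a small Python sketch. It builds a `features` request for one genomic segment and reads the features out of a DASGFF-style XML answer. The server address, the source name `demo` and the sample response (including its coordinates) are placeholders written by hand for illustration; they follow the DAS 1.x layout as we understand it and are not output from a real server or from openSNP.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def das_features_url(base, source, chromosome, start, stop):
    """Build a DAS 'features' request for one genomic segment."""
    # Keep ':' and ',' readable; DAS segments look like "10:1000,2000".
    query = urlencode({"segment": f"{chromosome}:{start},{stop}"}, safe=":,")
    return f"{base}/das/{source}/features?{query}"


# Hand-written placeholder response, NOT taken from a real DAS server.
SAMPLE_RESPONSE = """<DASGFF>
  <GFF version="1.0" href="http://das.example.org/das/demo/features">
    <SEGMENT id="10" start="1000" stop="2000">
      <FEATURE id="rs7903146" label="rs7903146">
        <TYPE id="SNP">SNP</TYPE>
        <START>1500</START>
        <END>1500</END>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>"""


def parse_features(xml_text):
    """Return one dict per FEATURE element found in a DASGFF document."""
    root = ET.fromstring(xml_text)
    features = []
    for segment in root.iter("SEGMENT"):
        for feature in segment.iter("FEATURE"):
            features.append({
                "segment": segment.get("id"),
                "id": feature.get("id"),
                "type": feature.findtext("TYPE"),
                "start": int(feature.findtext("START")),
                "end": int(feature.findtext("END")),
            })
    return features


url = das_features_url("http://das.example.org", "demo", "10", 1000, 2000)
features = parse_features(SAMPLE_RESPONSE)
print(url)
print(features)
```

Because every DAS source answers in the same XML layout, a client written once can display annotations from many different servers side by side.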

We are also working on adding support for zipped files, but currently Philipp and I are facing a high workload at our universities. If you are interested in helping out and doing some coding in Ruby on Rails: Feel free to do so. All of our code can be found at GitHub and we have a mailing list.

On Crawling Efforts and Requesting Data

Some Statistics

We would love to share some more data on openSNP with all of you and now seems like a good time to do so.

  • Up to now our database stores a total number of 34 977 228 polymorphisms of 39 different users. Those are divided into 1 933 962 different SNPs.
  • Users have entered a total number of 412 phenotypes, split into 28 different categories.
  • Due to the great support of Mendeley (they relaxed the API-limit for us) we already finished crawling all papers on those SNPs we know of from their database. Those add up to 5940 papers. 698 of those are published as Open Access-papers, so they can be freely accessed by everybody.
  • We also finished crawling SNPedia and were able to find 7760 different pages that contain information on SNPs that we have listed. This includes links to primary literature as well as summaries on the effects of specific SNPs.
  • While we have not finished crawling the Public Library of Science yet (259098 SNPs still need to be checked), we have already found 1135 publications that deal with SNPs listed on openSNP.

On Navigation

All this makes a nice source of information for everyone who is interested in SNPs (and their possible effects), as well as for everybody who likes to play around with personal genomics data. Today we changed the URL layout a little to make it a bit easier for those of you who frequently want to find out about a specific SNP:

The old URLs just used the internal database ID of the SNP to deliver the page you were looking for. So if you were interested in rs7903146 you had to visit a URL that contained only that ID, which was not that nice: the URL was not informative and you always had to perform a search on openSNP to find the page of interest.

The new URL layout uses the name of the SNP, so you can go straight to the page for rs7903146 and find all the information you were looking for. But don’t panic if you bookmarked some of the old URLs: they still work, so you don’t have to change a thing.
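For the curious, supporting both kinds of URL boils down to deciding whether the last part of the address is an rs-name or a plain number. The following Python sketch shows the idea only; openSNP itself is written in Ruby on Rails, and the function name, the two lookup tables and the database ID 42 are hypothetical.

```python
import re

RS_NAME = re.compile(r"rs[0-9]+")
DATABASE_ID = re.compile(r"[0-9]+")


def resolve_snp(identifier, snps_by_id, snps_by_name):
    """Find a SNP record from either its rs-name or its old database ID."""
    if RS_NAME.fullmatch(identifier):
        return snps_by_name.get(identifier)      # new, readable URLs
    if DATABASE_ID.fullmatch(identifier):
        return snps_by_id.get(int(identifier))   # old bookmarks keep working
    return None


# Hypothetical lookup tables standing in for the real database.
record = {"id": 42, "name": "rs7903146"}
snps_by_id = {42: record}
snps_by_name = {"rs7903146": record}

print(resolve_snp("rs7903146", snps_by_id, snps_by_name))
print(resolve_snp("42", snps_by_id, snps_by_name))
```

Both calls return the same record, which is why old bookmarks and new links can live side by side.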

Enjoy playing around!

28c3 Ticket Sale

The presale dates for the 28c3 have just been announced. Tickets will be sold on these occasions:

  1. Sunday, November 06, 10:00PM CET (UTC+1) (½ of all tickets)
  2. Monday, November 14, 4:00PM CET (UTC+1) (¼ of all tickets)
  3. Tuesday, November 29, 10:00AM CET (UTC+1) (¼ of all tickets)

If you would really like to participate, here is a tip: As tickets are always in short supply, you shouldn’t wait but be in front of your internet device when the sale starts. In order to be able to buy yourself a ticket you need an account for the presale system. The standard fee for tickets is 80€.

At the 28th Chaos Communication Congress

Some weeks ago we submitted a talk named Crowdsourcing Genome Wide Association Studies during the call for participation for the 28th Chaos Communication Congress (28c3 for short) and we are happy to announce that our talk was accepted. The Chaos Communication Congress is an annual congress of the international hacker scene. It is organized by the German Chaos Computer Club and attracts between 2,500 and 3,500 visitors each year. Philipp and I are going to speak about Personal Genomics, how Association Studies work, what the goal of openSNP is and, of course, about privacy problems that arise in a world of cheap DTC genomics.

So if you are interested in our talk (and the many other fabulous talks that will take place during those 4 days) you might want to visit Berlin. The 28c3 takes place from December 27th to December 30th 2011, but the ticket sale has not started yet. We’ll try to keep you posted when tickets can be bought. But don’t worry if you can’t make it (although you will miss the opportunity to have a drink with us): There will be recordings of all talks.

Server Migration and New Features

We are happy to tell you that we upgraded the server that hosts openSNP: We started off with a small, single-core machine with 1 GB of RAM, but we (and many of the users) quickly found out that this machine was definitely running at its limits. To change this we moved the openSNP project some hours ago to a new machine that offers 4 CPU cores and 24 GB of RAM. We are quite optimistic that this should be enough to host us for the time being. As those questions came up lately: We are mostly paying for the hosting ourselves, although Kai Werthwein was so kind as to donate a month of hosting costs. Thanks for the support!

We also used this occasion to optimize the code and to implement some of the features we mentioned in the last posting: First of all we optimized our database tables and our file-parsing script to further speed things up. Then we also implemented support for customers of FamilyTreeDNA. If you were genotyped by them and got results in the Illumina format, you can now upload the results to openSNP. Afterwards we started implementing a basic email notification system. Users can now get notified via email about replies to their comments, about new messages and (good news for all achievement hunters) even about new phenotypes.

And now enjoy using openSNP. New users can easily join here. And of course: Keep the feature requests coming. We are looking forward to your ideas!

5 days after launch – Time for some more information

A good five days have passed since we started openSNP and it has been a lot of fun, even more work and not that much sleep since then. But it’s time to answer some questions and give you some feedback on what we have done so far and what we are up to:

Who is behind openSNP?

Whoops, this is definitely something we could have made clearer in the first posting. But here we go: Bastian, Fabian and Philipp did their undergraduate studies in Life Sciences and are currently doing their master’s programmes. Bastian currently studies Ecology & Evolution, Fabian studies Biology and Philipp studies Computer Science. Helge is the only “real” web developer on the team and has helped us a lot in testing many of the things we did. We are not working full time on this project, this is more of a hobby. Please give us some time to answer your questions, fix bugs and so on, as we are doing this in our free time besides our studies and day jobs.

Why all this?

OpenSNP is a non-profit, open-source project that is about sharing genetic and phenotypic information. The idea for this project came to Bastian after he was genotyped by 23andMe in May and started playing around with his data. During his research he became frustrated, because it was not that easy to find more data. He started working on openSNP to fix this. To be clear: This project is not about making money or selling data. Or, to quote Google: “We don’t wanna be evil”. We are just interested in making science more open and accessible.

Some numbers

Up to now 20 people have registered with openSNP and eight of them have uploaded their genotyping files. All genotyping files are now parsed into the openSNP database. Together, this already accounts for 1327142 different SNPs and in total we have 7672504 SNPs in the database. Given our bumpy start those numbers are great.


Many of you have found bugs in openSNP and we have tried hard to fix them all. For example: There were some bugs in the commenting/messaging system which could break the display of those pages. There was also a bad usability bug on the settings page. Those bugs should all be fixed by now. Thanks to all of you, especially to Nash, who totally deserves his “extremly high” on finding openSNP bugs. If you discover any other bugs: Just let us know, we will start hunting them down right away.


The performance, especially in the first days and regarding the parsing of your genotyping files, was horrible. Two factors caused this trouble: #1: our bad job of writing a performant import script. #2: our limited server capacities. We worked hard on the first issue and have largely solved it (in tech-speak: we drastically reduced the number of database transactions). Now it should take a maximum of 3 hours to parse a file with 1.5 million SNPs. This is as fast as it gets, given our current server capacities.
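To illustrate what fewer database transactions means, here is a minimal Python sketch that parses a tab-separated genotyping file (rsid, chromosome, position, genotype, with `#` comment lines, as in the 23andMe raw data format) and writes all rows in a single transaction. This is not our actual Rails import script: the table name `user_snps`, its columns, the SQLite database and the three sample lines are assumptions made up for this example.

```python
import sqlite3


def parse_genotype_lines(lines):
    """Yield (rsid, chromosome, position, genotype) for each data line."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        rsid, chromosome, position, genotype = line.split("\t")
        yield rsid, chromosome, int(position), genotype


def import_genotypes(conn, user_id, lines):
    """Insert all SNPs of one file inside a single transaction."""
    rows = ((user_id, *fields) for fields in parse_genotype_lines(lines))
    with conn:  # commits once at the end, rolls back if anything fails
        cursor = conn.executemany(
            "INSERT INTO user_snps "
            "(user_id, rsid, chromosome, position, genotype) "
            "VALUES (?, ?, ?, ?, ?)",
            rows,
        )
    return cursor.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_snps (user_id INTEGER, rsid TEXT, "
    "chromosome TEXT, position INTEGER, genotype TEXT)"
)

# Made-up sample lines, not real genotype data.
sample = [
    "# rsid\tchromosome\tposition\tgenotype",
    "rs1\t1\t100\tAA",
    "rs2\t1\t200\tCT",
    "rs3\tX\t300\tG",
]
inserted = import_genotypes(conn, user_id=1, lines=sample)
print(inserted, "rows imported")
```

Committing once per file instead of once per SNP saves the database from doing its bookkeeping over a million times for a single upload.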

For the tech-savvy people: Right now, openSNP runs on a single-core machine with only 1 GB of RAM. Even now this is not enough power to deliver a good experience. But we are already looking for a larger machine (with more cores and much more RAM) to give you a better time using openSNP.

New Features

We already have a number of ideas and new features we want to implement into openSNP and we would like to present some of them to you:

  • Adding support for Family Tree DNA, which is another service that provides DTC-testing. Nash was kind enough to provide us a file which we can use to implement the file upload for this provider.
  • Mail Notifications: Right now users don’t get notified about new content they may be interested in: New messages, new phenotypes, new replies to their comments. In the near future, we will implement those important mail notifications (Of course: you will be able to easily disable/enable them in the settings).
  • Implementing Social Media. We know: People love to socialize and share, so we are probably going to implement support for Facebook, Twitter et al., so you can easily share the latest phenotypes you entered or the latest achievements you unlocked (if you want to).
  • Making downloading genotyping files of users even easier. Right now, there is no easy way to download an annotated data-set of a single user. This will be fixed.
  • Would you like to see a “following” feature for phenotypes? The idea is that you would be notified by mail about all changes to a phenotype you follow: if there are new comments, new variations or new genotyping files available, you would get an email. Is this something you’d like?
  • Would you like to use openSNP to annotate papers? Say you read a paper which was linked on openSNP, would you like to link your comments to this paper and make it available for others? Of course, this could be linked using the PLoS or Mendeley API.

We really appreciate getting your feedback on the feature ideas we already have. And as usual: You have an idea for a feature that is missing and not on this list? Please let us know.

Welcome to openSNP

What is Personal Genomics?

Welcome to the openSNP-project. We’d like to use this first blogpost to give you a general introduction to the project, which data we’d like to use and what the possible benefits & use cases of openSNP may be.

Companies that perform Direct-To-Consumer (DTC) genetic tests have now been around for about six years, with 23andMe (founded in 2006) and deCODEme being two of the oldest companies on the market. Their customers receive a test tube via mail, spit into this tube and send it back to their DTC company to get their genetic information analyzed. The tests that such DTC companies perform do not utilize the more famous DNA sequencing but rely on faster and still cheaper DNA microarrays instead.

Those microarrays screen for around 1 million genetic markers, called Single Nucleotide Polymorphisms (SNPs). A SNP is a genomic variation, where a single base is changed at one site between members of a population. Usually a SNP has only two alleles (variants) and occurs with a frequency of at least 1% in the population. Spread over the whole human genome, each of us carries around 10 million variable sites, of which about 10% are covered by DTC companies. Many of those markers are known to be associated with certain conditions. For example, there are variations of SNPs that are associated with elevated risks for breast cancer or Alzheimer’s. Other SNPs can be used to predict how a person metabolizes chemicals or drugs.

The Rise of Personal Genomics

The company 23andMe released an overview of their customers in June 2011. At this time they had genotyped (as the kind of testing they perform is also called) over 100,000 customers, of whom over 70% were willing to allow 23andMe to use their genotyping data for research purposes, and over 50% of all customers participated in different surveys on medical conditions, drug metabolism etc.
23andMe uses the results to perform their own genome wide association studies (GWAS). Those studies check for statistical differences between different groups. In a simple example one could have a group that is known to have Alzheimer’s and a control group that does not have Alzheimer’s. Given enough participants, one can then look for genetic variants that are over- or underrepresented in one of the groups. The variants that are found by this method can then be used as predictors for Alzheimer’s.
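As a toy illustration of such a comparison, the following Python sketch runs a Pearson chi-square test on the allele counts of one SNP in a case group and a control group. The counts are invented for this example, and the function name is ours. A real GWAS repeats a test like this for every SNP and then has to correct for the huge number of tests performed.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one count")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Invented allele counts for a single SNP (NOT real data):
#                 allele A   allele G
# cases               60         40
# controls            40         60
chi2 = chi_square_2x2(60, 40, 40, 60)
print(f"chi-square = {chi2:.1f}")

# With one degree of freedom, values above 3.841 are significant at the
# 5% level for a single test.
print("significant for a single test:", chi2 > 3.841)
```

Here allele A is overrepresented among the cases, so the test flags the SNP as a candidate predictor, at least before any correction for multiple testing.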

23andMe published a couple of papers in 2011 that show how they use their datasets (with up to 30,000 individuals) to reproduce already known associations and find new predictors for Parkinson’s. The sheer amount of datasets they can utilize, combined with customers that are willing to take surveys on different things, from diseases to the metabolization of coffee, gives them a great opportunity to perform a lot of meaningful research. Unfortunately, this great dataset is not made available to other researchers outside of 23andMe and their collaborators.

An Open Alternative?

While there may be many valid reasons not to publish those datasets, we feel that research projects all over the world and science in general would benefit from such a rich source of linked, genetic data that is freely available. And although genome wide association studies need a minimum number of participants to be able to find significant variations, it is not necessary to have 30,000 participants in your study. There are many publications that find lots of SNPs that can be used as significant predictors for certain conditions, from obesity to asthma. And many of those have fewer than 5000 participants in total.

Let’s transfer this to 23andMe: Given their more than 100,000 customers, one only needs 5% of them (around 5000 people) to freely share their genetic information together with basic information on some medical conditions or other variations to reach the critical mass needed to perform simple association studies! To our knowledge there are currently a few individuals worldwide that already share their 23andMe results freely (nearly all of them without any linked data).

We set up a small survey on how many customers of 23andMe would be willing to share both kinds of data with the general public. Out of 88 people who are already customers of a DTC company, 15% had already shared their information and in total 75% would be willing to do so. Additionally, out of 72 individuals who are planning to take a DTC test in the future, 61% would be willing to share their results and some linked data. Given those results, there should be enough customers of DTC companies that would be willing to share data, enabling genome wide association studies (granted, our sample size is small; we will publish the full results of the survey as soon as possible). Due to those results we started working on such an open alternative.

The Idea of openSNP

OpenSNP wants to be a repository and an open platform to collect this kind of data. The vision is to enable everybody to perform crowd-sourced association studies to create new knowledge about our genes. Additionally we would like to enable everyone to find out more about their own results.

Up to now, people who wanted to share their genotyping data had to find a solution on their own: Some put the data on their own webspace, others on GitHub, others on some FTP servers. But not only was phenotypic data missing, there was also no way of easily finding and downloading this data. On openSNP, users and especially customers of personal genomics companies have the chance to easily upload their genotyping data and publish details on their phenotypes.

What’s in it for the Users?

(Citizen) scientists get the option to easily add new conditions and phenotypes they are interested in, to find DTC customers that are willing to answer questions on those. openSNP also allows for an easy mass download of all data or of data partitioned into groups (like A: all users that have Alzheimer’s, B: all users that don’t have Alzheimer’s), so the data is already in a basic shape for a GWAS. Additionally we provide simple RSS feeds that deliver the latest data, either all data sets or split by the condition of interest. This should make it really easy to get data out of openSNP.

Customers of DTC tests on the other hand can also benefit from using openSNP. One of the main reasons why people would want to freely publish their genotyping data, according to our survey, is that people like to help and open up science and like to participate in an approach such as crowd-sourced GWAS. This is definitely something that openSNP supports and that was one of the main reasons for building the platform.
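Partitioning users into groups by their answer to one phenotype is a simple operation, sketched here in Python. The record layout (a name plus a dictionary of phenotype answers) and the example users are invented for this illustration; this is not the format of the actual openSNP downloads.

```python
from collections import defaultdict


def partition_by_phenotype(users, phenotype):
    """Group user names by their answer for one phenotype.

    Users who have not answered the phenotype are left out.
    """
    groups = defaultdict(list)
    for user in users:
        answer = user["phenotypes"].get(phenotype)
        if answer is not None:
            groups[answer].append(user["name"])
    return dict(groups)


# Invented example users (NOT real openSNP data).
users = [
    {"name": "user_a", "phenotypes": {"Lactose intolerance": "yes"}},
    {"name": "user_b", "phenotypes": {"Lactose intolerance": "no"}},
    {"name": "user_c", "phenotypes": {"Eye color": "brown"}},
    {"name": "user_d", "phenotypes": {"Lactose intolerance": "yes"}},
]

groups = partition_by_phenotype(users, "Lactose intolerance")
print(groups)
```

The resulting groups can then serve as the case and control sets of a simple association study.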

Another big reason why people like sharing data is the hope to get some personal benefit from it, for example finding others to chat with about their personal results or finding some primary literature on their results. openSNP tries to deliver this experience as it offers to find other users that share variations and conditions, as well as comment options that enable sharing personal experiences on conditions and variations.

We also implemented the APIs of the Public Library of Science (PLoS) and Mendeley. Those are used to find the latest publications on the genetic variations that are covered by 23andMe and deCODEme. We rate those publications according to the number of readers and if they are Open Access publications or not. We also crawl the SNPedia to deliver links to the user generated content there. By combining Mendeley, the PLoS and SNPedia we can deliver lots of great, curated content on the genetic variations to the users of openSNP.

To give openSNP a try, you can simply start browsing the phenotypes as well as the genetic variations. If you are interested in additional functions (mass downloading genotyping files, creating new phenotypes or just commenting) you can now easily create an account. Please bear in mind that openSNP right now is in its early beta stage, so you might encounter bugs.

If you need some help using openSNP, want to tell us about some bugs or just have some questions, you can read the FAQ, comment on this post or reach us via email or via IRC in #openSNP @ freenode.

Your openSNP-team.

Bastian, Fabian, Helge and Philipp
