Wednesday, December 2, 2009
I’m into this thing called the Semantic Web. I’ve been doing some projects at work, but I’ve also decided to start doing some stuff on my own behalf. That means I have to keep everything separate, including the apps I use to achieve the results. At work I have access to some great software, for example Pipeline Pilot, but it comes at a cost. It’s true that this software does not understand anything about RDF and semantics, but I can easily cobble together something that gives me RDF as an end result, and its native ease of use makes processing and manipulating files and data trivial.
Over the next couple of months I will be RDF-ising collections of data available from the FDA website. They have huge collections of data in various silos available for download in mainly flat file format.
My thoughts on this are: it’s great that the data is there, but each of us has to do something with each file (normally load it into an RDBMS), and that doesn’t help create meaning from the data because nothing is linked.
My idea is to start creating RDF versions of the datasets and show how each dataset may give rise to new insights. More importantly, I want to start linking the datasets to allow more thought-provoking questions to be asked across them. Here are a couple of examples.
1. Given the published adverse events, can I use statistical methods (possibly disproportionality analysis) to spot drugs that have an unusually high incidence rate in a given month, and is this rate increasing over time?
2. Given an adverse event for a drug, you may then want to find all the other related drugs that could be affected. So I intend to link the drugs in the adverse events to the Orange Book (the list of all approved drugs). Each drug has a list of ingredients, so any other drug with common or overlapping ingredients may also be affected.
Now this may seem silly, but if a company waits for the FDA to spot this, it is likely that the drug will be withdrawn. Wouldn’t it be great if the company could see the trend and intervene to find out what is happening? It might be that the drug is being prescribed in the wrong way, and to remedy this all they need to do is send out new guidelines and some notes to the doctors. The cost would be negligible compared to losing the drug from the market.
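A minimal sketch of the trend-spotting idea in the first example, assuming each adverse event report has already been reduced to a (month, drug, event) tuple. It uses the proportional reporting ratio (PRR), which is one common disproportionality measure and is picked here purely for illustration. The function name and the sample reports are invented, not FDA data.

```python
from collections import Counter


def monthly_prr(reports, drug, event):
    """Return {month: PRR} for one drug/event pair.

    reports is an iterable of (month, drug, event) tuples, one per report.
    PRR = [a / (a + b)] / [c / (c + d)], where
      a = reports for this drug with this event
      b = reports for this drug with any other event
      c = reports for other drugs with this event
      d = reports for other drugs with any other event
    """
    per_month = {}
    for month, reported_drug, reported_event in reports:
        cell = (reported_drug == drug, reported_event == event)
        per_month.setdefault(month, Counter())[cell] += 1

    ratios = {}
    for month in sorted(per_month):
        counts = per_month[month]
        a = counts[(True, True)]
        b = counts[(True, False)]
        c = counts[(False, True)]
        d = counts[(False, False)]
        if a + b == 0 or c == 0:
            continue  # PRR is undefined for this month
        ratios[month] = (a / (a + b)) / (c / (c + d))
    return ratios


# Invented reports, purely for illustration.
sample_reports = (
    [("2009-01", "DrugA", "nausea")] * 1
    + [("2009-01", "DrugA", "rash")] * 1
    + [("2009-01", "DrugB", "nausea")] * 1
    + [("2009-01", "DrugB", "rash")] * 3
    + [("2009-02", "DrugA", "nausea")] * 3
    + [("2009-02", "DrugA", "rash")] * 1
    + [("2009-02", "DrugB", "nausea")] * 1
    + [("2009-02", "DrugB", "rash")] * 7
)

prr_by_month = monthly_prr(sample_reports, "DrugA", "nausea")
# True when the ratio never drops from one month to the next.
rising = list(prr_by_month.values()) == sorted(prr_by_month.values())
```

On the invented data the ratio goes from 2.0 in January to 6.0 in February. A real analysis would also want minimum report counts and confidence intervals, which this sketch leaves out.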
There is also the opportunity to start linking this data into and/or with the Linking Open Data project, which already includes some great information, e.g. DrugBank, MedDRA etc.
I’ve started converting the Orange Book data. It seems relatively simple, with just three flat text files. However, on inspection of the products.txt file you find that it describes not only the product but also the drug and its multiple ingredients, each with a dose. All of this is in a single line of data with multiple, and different, delimiters. So how do I manipulate this text into a better format from which to create RDF? Well, without my expensive tools it’s not been easy. KNIME and Talend have helped, but I don’t always want to think of my data as simple tables. I want to create more detailed relationships, to split, pivot and recombine data, and I want to do it without writing lots of Java, Perl or Python. Quite a bit has been done in our old friend Excel, and all of this is just preparing the data ready to be RDF-ised! You might expect the published data to be in tip-top condition, but it’s amazing how much data cleaning needs to be done.
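For readers who do not mind a little scripting, here is a rough sketch of the split-and-recombine step described above. Everything specific in it is an assumption: the four-field layout, the tilde between fields, the semicolons inside fields, the sample line and the example.org namespace are all made up for illustration, and the real products.txt has more columns than this.

```python
from urllib.parse import quote

BASE = "http://example.org/orangebook/"  # hypothetical namespace


def split_product_line(line):
    """Split one line (assumed layout: ingredients~form;route~name~strengths)."""
    ingredient_field, form_route, trade_name, strength_field = line.strip().split("~")
    dosage_form, route = [part.strip() for part in form_route.split(";", 1)]
    ingredients = [part.strip() for part in ingredient_field.split(";")]
    strengths = [part.strip() for part in strength_field.split(";")]
    if len(ingredients) != len(strengths):
        raise ValueError("ingredient and strength counts differ: " + line)
    return {
        "trade_name": trade_name.strip(),
        "dosage_form": dosage_form,
        "route": route,
        "ingredients": list(zip(ingredients, strengths)),
    }


def iri(kind, name):
    """Build an IRI in the made-up namespace from a human-readable name."""
    return f"<{BASE}{kind}/{quote(name.lower().replace(' ', '-'))}>"


def literal(text):
    """Quote a plain string literal for N-Triples."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(product):
    """Emit one node per ingredient dose so each strength stays with its ingredient."""
    subject = iri("product", product["trade_name"])
    lines = [
        f"{subject} <{BASE}dosageForm> {literal(product['dosage_form'])} .",
        f"{subject} <{BASE}route> {literal(product['route'])} .",
    ]
    for name, strength in product["ingredients"]:
        dose = iri("dose", f"{product['trade_name']} {name}")
        lines.append(f"{subject} <{BASE}hasDose> {dose} .")
        lines.append(f"{dose} <{BASE}ingredient> {iri('ingredient', name)} .")
        lines.append(f"{dose} <{BASE}strength> {literal(strength)} .")
    return lines


# Invented line in the assumed layout.
sample_line = "ASPIRIN; CAFFEINE~TABLET;ORAL~EXAMPLE BRAND~325MG;50MG"
product = split_product_line(sample_line)
triples = to_ntriples(product)
```

Giving each ingredient its own dose node is one possible modelling choice. It keeps the pairing between an ingredient and its strength, which a flat table row tends to lose.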
A friend recently had a similar experience. He had left a semantic company to work on his own. Then came the questions. “What’s the easiest way to manipulate data?” My answers were similar to the above. “How do I create RDF?” “That’s another good question,” I replied. “How do I show the data?” “Exhibit is a good start, but it won’t deal with much data, and you have to have it in JSON format.” “Ah, that could be a problem,” he said. But of course there is Babel. “But I can’t,” he said, “it’s private data.”
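For private data that cannot be sent to an online converter, the JSON step can be done locally. The sketch below assumes the usual Exhibit layout of a top-level "items" list in which every item carries a "label" and a "type"; check that against the Exhibit documentation before relying on it. The function name, column names and sample rows are invented.

```python
import csv
import io
import json


def rows_to_exhibit(csv_text, label_column, item_type):
    """Turn CSV text into a dict shaped like an Exhibit data file."""
    items = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        item = {"label": row[label_column], "type": item_type}
        for column, value in row.items():
            if column != label_column and value:
                item[column] = value  # skip empty cells
        items.append(item)
    return {"items": items}


# Invented rows, purely for illustration.
sample_csv = "name,route,form\nEXAMPLE BRAND,ORAL,TABLET\nOTHER BRAND,TOPICAL,\n"
exhibit = rows_to_exhibit(sample_csv, "name", "Product")
exhibit_json = json.dumps(exhibit, indent=2)
```

Nothing leaves the machine, so the privacy objection goes away, although Exhibit's limit on how much data it can handle still applies.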
Now my friend is a techie and was able to produce scripts / apps to do the stuff he needed. I’m not, so I continue to struggle.
However, there is some light. TopQuadrant have released a free version of their TopBraid Composer software. It’s a “MUST HAVE” app for your desktop if you intend to do anything with semantic data. I’m not going to talk about it here, as there is so much in it that I’m taking a while to get to grips with it all.
At the end of the day I just want to do something with the data and not spend most of my time creating it. It’s hard to show people the hoops I have had to jump through and expect them to think anything other than, “This semantic web stuff seems a lot of hassle”. It’s hard to “cross that chasm” unless you have tools, and it’s impossible to expect someone to be able to do this while trying to understand what the semantic web is all about in the first place.
Wednesday, November 25, 2009
I've been fortunate enough to work with clinical data this year. It's been a new experience with new challenges.
I was a bit scared at first, since clinical data is always thought of as the holy grail of Pharma data. I was expecting something very complicated, but in reality it's quite straightforward. There were two main challenges. The first was getting around a preconception that, as this is regulated data, you can't do anything with it. To me that meant you can't touch the original data, but it shouldn't stop you duplicating the data and using it elsewhere to feed back into discovery research or to look at other questions that the trial may not have focussed on. But all that's another story. The second, after winning the first battle, was actually getting someone to agree to give me some data. Even within an organisation this proved tricky, for reasons that lead to yet another story, but that's for another time.
With those resolved I started to look at the data. The data was in a format aligned with the SDTM standard published by CDISC. The first thing that struck me was actually the simplicity of the data. I was expecting something a lot scarier, but what was scary was how badly put together the data structures were. They seemed to be aligned with the physical case report forms (CRFs) used to submit data, which made the data a bit powerless rather than powerful. However, with linked data we can change all of that.
So all we need now is the SDTM ontology to align and format the data against. Asking around I found some partial bits and pieces but nothing substantial to use with all the data I had, so there was no choice other than to write my own. Having studied the SDTM standard it didn't take long, but there were several choices to make along the way. The version I came up with isn't the way I would have designed it from a true ontology approach. There was an existing standard and I didn't want to stray too far away from it so that the concepts would remain familiar to people. Yet I also had to enable the data.
Those familiar with clinical data and SDTM will see the usual concepts in the ontology, such as demographics, subject visits, adverse events and medical history, along with all the others. What I did do was extract some of the major linking elements, which were not in fact separately described in the standard. So you will see concepts such as "unique subject", study and a few others. I've also added the properties with names matching the standard. In some cases I've added constraints, but not in all. Concepts are of course linked with object properties, and I've used the "unique subject" class as the centre of attention, linking to just about everything. That's on purpose, as in most cases you want to correlate the subject with all the other data.
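The hub idea can be illustrated with a small sketch. It is not the published ontology: the class names (UniqueSubject, Demographics, AdverseEvent), the object properties (hasDemographics, hasAdverseEvent), the example.org instance namespace and the sample rows are all assumptions made up here. Only the variable names USUBJID, SEX, AESEQ and AETERM come from SDTM itself.

```python
SDTM = "http://cdisc.org/CDISC/Ontology/SDTM#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DATA = "http://example.org/study/"  # hypothetical instance namespace


def build_graph(dm_rows, ae_rows):
    """Return a set of (subject, predicate, object) tuples.

    Literals are kept as plain strings, which is enough for a sketch.
    """
    triples = set()
    for row in dm_rows:
        subject = DATA + "subject/" + row["USUBJID"]
        dm = DATA + "dm/" + row["USUBJID"]
        triples.add((subject, RDF_TYPE, SDTM + "UniqueSubject"))
        triples.add((subject, SDTM + "hasDemographics", dm))
        triples.add((dm, RDF_TYPE, SDTM + "Demographics"))
        triples.add((dm, SDTM + "SEX", row["SEX"]))
    for row in ae_rows:
        subject = DATA + "subject/" + row["USUBJID"]
        ae = DATA + "ae/" + row["USUBJID"] + "/" + row["AESEQ"]
        triples.add((subject, SDTM + "hasAdverseEvent", ae))
        triples.add((ae, RDF_TYPE, SDTM + "AdverseEvent"))
        triples.add((ae, SDTM + "AETERM", row["AETERM"]))
    return triples


def objects(triples, subject, predicate):
    """All objects for a given subject and predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def ae_terms_by_sex(triples, sex):
    """Correlate two domains by walking through the unique subject hub."""
    terms = set()
    hubs = {s for s, p, o in triples if p == RDF_TYPE and o == SDTM + "UniqueSubject"}
    for hub in hubs:
        records = objects(triples, hub, SDTM + "hasDemographics")
        if any(sex in objects(triples, dm, SDTM + "SEX") for dm in records):
            for ae in objects(triples, hub, SDTM + "hasAdverseEvent"):
                terms |= objects(triples, ae, SDTM + "AETERM")
    return terms


# Invented records, purely for illustration.
dm_rows = [{"USUBJID": "S1-001", "SEX": "F"}, {"USUBJID": "S1-002", "SEX": "M"}]
ae_rows = [
    {"USUBJID": "S1-001", "AESEQ": "1", "AETERM": "HEADACHE"},
    {"USUBJID": "S1-001", "AESEQ": "2", "AETERM": "NAUSEA"},
    {"USUBJID": "S1-002", "AESEQ": "1", "AETERM": "RASH"},
]
graph = build_graph(dm_rows, ae_rows)
female_terms = ae_terms_by_sex(graph, "F")
```

Because every domain hangs off the subject node, a question that spans demographics and adverse events is just two hops out from the same hub.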
I've used the ontology in several pieces of software and it's worked pretty well. Asking questions of the data has been pretty straightforward and has produced some pretty good demos.
I'm not saying the ontology is brilliant. I'm sure you'll find mistakes, omissions and areas where I've not gone into enough detail. However I'd like to do my bit for the community and put the ontology out there for people to use. Feel free to alter it for your needs and please give feedback on your experiences with it.
You can get it here. Have fun.
p.s. I've used the namespace "http://cdisc.org/CDISC/Ontology/SDTM#" which is not a valid one.
Tuesday, November 24, 2009
It's been a while since I last posted. I started the blog with good intentions but due to some controversy I laid off for a while and didn't get back into it.
Hopefully I'll keep going this time!
It's been an eventful year, with some highlights being:
The CSHALS conference was great, with some fabulous talks on SW stuff really being applied to life science areas. It really made me think. Thanks to Eric Neumann for inviting me to speak on the InChI stuff I have previously blogged about. I'm just sorry I made a bit of a hash of it in the allotted time.
SEMTech was always going to be a highlight, especially as I was due to take part in two talks. Unfortunately travel budgets didn't allow my attendance, but the talks went ahead with others stepping in to do my bits as well as their own. Special thanks to Dean Allemang and Alan Greenblatt.
I've worked in some different areas this year, most notably with clinical data. This was a real eye opener if not for all the right reasons. I'll blog about this later and publish my version of an ontology I have put together to represent the SDTM standard used in the clinical arena.
I'm currently doing some projects working on FDA published data and I hope to publish some demos on this in the near future.
Back soon with lots to write about.... Hopefully.