Wednesday

The DVCS that is not a DVCS

Today, I found out about Veracity. The opening paragraphs set off my Skepticism Alert: it claims to do things that git and Mercurial cannot. It is right that DVCS has changed the landscape. The two articles with the deepest insights on these changes are "Forking, The Future of Open Source, and Github", written by @sogrady, an analyst over at Redmonk, and "Distributed Version Control is here to stay, baby", written by Joel Spolsky, a self-proclaimed holdout from the days of CVS.

However, the first feature listed for Veracity already tells me it will open up a lot of possibilities (assuming it does what it claims to do):
Veracity goes beyond versioning of directories and files to provide management of records and fields, with full support for pushing, pulling and merging database changesets, just like source tree changesets.


Years ago, right around when YouTube got popular and we saw the rapid rise of competing offerings such as Vimeo, there was an obscure site called tvlinks.co.uk. Its maintainer had painstakingly compiled links to TV episodes on YouTube, Vimeo, and other streaming sites. Many TV series had complete episode listings. I was able to watch much of Star Trek: DS9 and Stargate (yeah, I had a limited childhood).

tvlinks.co.uk was eventually shut down. Hulu didn't show up until three years later, and it was constrained by broadcast rights. That was the year git came to sweep out the centralized version control systems, and it occurred to me that had this set of links been decentralized, it would have been much more difficult to take down. Further, instead of relying on a single maintainer, I could have subscribed to several different sources and done my own merging. A knowledge base of links to TV shows doesn't change the world, but there are a number of datasets that would. The attempt to figure out how to implement this with git stopped my brain cold. I filed it away for later.

A year later, I stumbled over CouchDB. At the time, it promised many things. Its unstructured JSON format is platform-agnostic. Its very architecture, which assumes replication and unreliable nodes, implied that maybe, just maybe, this could form the basis of a distributed, decentralized knowledge base. Sadly, the documentation made it seem as if all of this was implemented, but it wasn't. CouchDB only went 1.0 today. Further, I would have had to implement the multi-version merging myself.
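
To make "multi-version merging" a bit more concrete, here is a minimal sketch in Python of a field-by-field three-way merge over flat JSON-style records. This is not how CouchDB or Veracity does it; the function name `three_way_merge` and the sample record are made up purely for illustration, and real documents (nested fields, lists, deletions with history) would need more care.

```python
# Hypothetical sketch: merge two edited copies of a flat record against
# the common ancestor they were both derived from.
_MISSING = object()  # sentinel for "field not present in this version"


def three_way_merge(base, ours, theirs):
    """Return (merged, conflicts).

    conflicts maps a field name to the pair (our value, their value)
    when both sides changed the same field in different ways.
    """
    merged = {}
    conflicts = {}
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b = base.get(key, _MISSING)
        o = ours.get(key, _MISSING)
        t = theirs.get(key, _MISSING)
        if o == t:        # both sides agree (or both deleted the field)
            value = o
        elif o == b:      # only their side changed it
            value = t
        elif t == b:      # only our side changed it
            value = o
        else:             # both changed it differently: record a conflict
            conflicts[key] = (
                None if o is _MISSING else o,
                None if t is _MISSING else t,
            )
            value = o     # keep our value provisionally
        if value is not _MISSING:
            merged[key] = value
    return merged, conflicts


# Made-up example record: one link entry maintained by two people.
base = {"title": "Episode 1", "url": "http://a.example/1", "host": "youtube"}
ours = {"title": "Episode 1", "url": "http://b.example/1", "host": "youtube"}
theirs = {"title": "Episode 1", "url": "http://a.example/1", "host": "vimeo",
          "notes": "good quality"}

merged, conflicts = three_way_merge(base, ours, theirs)
print(merged)
print(conflicts)
```

Non-overlapping edits merge cleanly; only a field that both sides changed differently is reported as a conflict, which is the part a human (or a policy) would still have to resolve.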

In the years since, we have seen data.gov and the Amazon Public Datasets. Tim Berners-Lee has been talking about the semantic web for years, and in 2010 he was finally able to give the TED Talk, 2009: The Year Open Data Went Worldwide.

In fact, @sogrady took his ideas about how Github and DVCS changed the world of open-source software and extended them to datasets:
In the open source world, forking used to be an option of last resort, a sort of “Break Glass in Case of Emergency” button for open source projects. What developers would do if all else failed. Github, however, and platforms with decentralized version control infrastructures such as Launchpad or, yes, Gitorius, actively encourage forking (coverage). They do so primarily by minimizing the logistical implications of creating and maintaining separate, differentiated codebases. The advantages of multiple codebases are similar to the advantages of mutation: they can dramatically accelerate the evolutionary process by parallelizing the development path.

The question to me is: why should data be treated any different than code? Apart from the fact that the source code management tools at work here weren’t built for data, I mean. The answer is that it shouldn’t [Emphasis mine] (@sogrady, "The Future of Open Data Looks Like ... Github?")
... but wait ... doesn't Veracity claim to do decentralized, versioned data?

According to the announcement, Veracity uses this particular feature for decentralized user accounts, as well as tags, commits, etc. It has a pluggable storage engine, so we can theoretically use filesystems, SQL, and NoSQL solutions. But again, to call this a DVCS for versioning mere source code misses its most significant feature -- the ability to decentralize data.

We'll see what the full capability of Veracity is when it comes out.

Update: Eric Sink comments on the decentralized database in Veracity -- http://news.ycombinator.com/item?id=1515827.

1 comment:

  1. Whether Veracity can handle datasets the size of Data.gov remains to be seen. One of the obvious applications I can see is syncing data between your smartphone, your laptop, and the cloud. Another is a decentralized 'Like' button -- one that is not only no longer beholden to an organization such as Facebook, but that also lets you choose who you share your database of "Likes" with.

    One less obvious application: annotations for books. Ancient books such as Sun Tzu's Art of War and the even older I-Ching actually come with a core received text and a large selection of "accepted" commentaries; those attached commentaries are now published as part of the work. The scholars had conventions for how they wrote these annotations and for who quoted whom.

    Another example: multiple teams of scientists pushing and pulling data for the same experimental protocol in order to aggregate it. The data can, of course, be annotated with confidence values based on the criteria set by each team. There is room for multiple world views.
