Towards a Decentralized Data Marketplace — Part Two
In our previous blog post, we explained the overarching purpose of Enigma beyond Catalyst, our platform that allows anyone to build a crypto hedge fund by leveraging our financial data marketplace. As we approach our token sale on September 11th, we will be releasing more blog posts explaining our vision and technology, so follow us on Medium to make sure you don’t miss a thing.
To quickly recap — at Enigma we are staying true to our vision of creating a decentralized data marketplace protocol for the web. With Catalyst and our token sale, we are creating the first application to utilize the protocol and speed up its adoption (hence the name Catalyst).
In this post, we will provide a (high-level) overview of the data marketplace and its internal workings. We’ll present some design choices, and walk you through the main components required in building the protocol. This is the second part of a two-part post explaining our extended vision.
Data Marketplace Technology
At a high level, Enigma creates a decentralized data marketplace that allows people, companies and organizations to contribute data (we call these data curators), which users of the system can then subscribe to and consume. In other words, the data marketplace acts as a single gateway to a network of databases, owned collectively by the data curators who supply the content. The marketplace is decentralized and not owned by any single party. Instead, listing and subscribing to a data source is managed on-chain, including rewards and penalties, which are exchanged using our token.
The data itself, its storage and transmission, lives off-chain. In that sense, the blockchain acts as the controller of the network (as was illustrated in the Enigma whitepaper), while the off-chain network handles everything else. In particular, the off-chain component handles everything from routing requests, verifying permissions, and parsing and forwarding queries to the source (e.g., a database, a CSV file), to ultimately routing the result back through the peer-to-peer network to the requesting client (encrypted with the client’s pubkey).
The figure above provides more details on how the data marketplace functions. Before we go into detail, let us provide a 30,000-foot overview.
Registering a new data-set starts by submitting a data curation script, which then sends a transaction to the blockchain that all nodes track, announcing that a new data-set has been made available. This new resource has a permanent address that clients can subscribe to on-chain, while paying the specified fee. Once subscribed, clients can consume the data in their applications by signing a query request and broadcasting it over the off-chain network of peers.
Fairness in the system is ensured mostly through economic incentives — good datasets will have more subscribers, who pay more to their data curators and are ranked higher in the system. Data sources can have a ‘try before you buy’ option, which helps attract new subscribers. If a data-set is then shown to not be useful, most people would unsubscribe and the source would be further diminished in ranking. In addition, some provable offenses (e.g., a data curator going offline) can be penalized.
Below, we will walk you through an end-to-end example of how to use the data marketplace, while pointing out significant technical aspects. We assume for simplicity that all participants run an Enigma node — a full peer-to-peer node that connects to the off-chain Enigma network, forming the data marketplace, as well as to the host blockchain where the on-chain management is done. For the foreseeable future, our plan is to connect our network to Ethereum. At the time of writing, it is the most secure chain that also provides us with the smart-contract functionality that we require.
Naturally, the first step in the process is to upload a data-set to the marketplace and register it. Under the hood, this is a somewhat involved procedure, which is a mix of on-chain and off-chain operations:
- The data curator, through their off-chain client, submits a cron-like job with the data curation script and details on how frequently to execute the job (e.g., minutely, hourly, daily). This ensures the data always remains up-to-date. In addition, the client needs to specify some metadata such as the name of the database, where it is physically stored, and which nodes execute the job (more on this later).
- A register transaction is sent to the blockchain, signed by the data curator. The transaction needs to include in its payload the name of the data-set; a price to subscribe (we’ll assume for simplicity these are monthly subscriptions); a deposit the data curator locks in as collateral; payout addresses of nodes that assist the curator (e.g., in providing computing and storage resources); and a hash of all metadata that is stored off-chain.
- All metadata concerning the new data-set is broadcast as a signed message to all nodes in the Enigma off-chain network, and stored in a shared, distributed hash table (DHT). To ensure data integrity and network robustness, we plan to use some revision of PBFT and S/Kademlia. This should provide a good enough defense against malicious coalitions of a certain size (between ⅓ and ½ of the nodes, depending on the implementation). We expect this to be sufficient given that, as in Bitcoin, nodes have an implicit incentive to remain honest in order to earn data mining rewards.
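The registration steps above can be sketched in code. The field names, the canonical-JSON hashing, and the example values below are our illustrative assumptions, not the actual Enigma transaction format:

```python
import hashlib
import json
from dataclasses import dataclass

def metadata_hash(metadata: dict) -> str:
    """Hash the off-chain metadata deterministically (canonical JSON)."""
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

@dataclass
class RegisterTx:
    """On-chain payload for listing a new data-set (illustrative fields)."""
    name: str
    monthly_price: int        # subscription price, in tokens
    deposit: int              # curator's locked collateral
    payout_addresses: list    # worker / persistence nodes to reward
    metadata_hash: str        # commits to the metadata stored in the DHT

# Off-chain metadata, broadcast to the DHT; only its hash goes on-chain.
metadata = {
    "name": "coinbase_data",
    "storage": "persistence-node-42",
    "schedule": "hourly",     # cron-like refresh frequency
}
tx = RegisterTx(
    name="coinbase_data",
    monthly_price=10,
    deposit=1000,
    payout_addresses=["0xWORKER", "0xPERSIST"],
    metadata_hash=metadata_hash(metadata),
)
# Any node can later recompute the hash to verify DHT metadata integrity.
print(tx.metadata_hash == metadata_hash(metadata))  # True
```

Committing only a hash on-chain keeps the transaction small while still letting every node detect tampering with the off-chain metadata.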
While all Enigma nodes (excluding light clients) must be part of the Enigma network — meaning they are connected both to the host blockchain and to the off-chain network — there are two other (optional) roles they can serve. The first is acting as a worker node, which executes cron jobs that aggregate the data. The second is serving as a storage and query engine for the data; this is known as a persistence node.
Persistence nodes are meant to be especially lean and modular. They are simply wrappers that can parse data queries coming from the Enigma network and route them to a more traditional database engine and storage unit. To begin with, we plan to build support for a single common underlying database. Over time, we hope to enable a fully distributed database that lives inside our network.
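To illustrate how thin this wrapper can be, here is a minimal sketch of a persistence node using SQLite as a stand-in backend. The request format and table layout are assumptions for illustration:

```python
import sqlite3

class PersistenceNode:
    """Thin wrapper: parses marketplace queries, delegates storage to a DB."""

    def __init__(self):
        # Stand-in for the "traditional database engine" behind the wrapper.
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE coinbase_data (ts INTEGER, price REAL)")

    def ingest(self, rows):
        """Called by the cron-like worker job to keep the data up-to-date."""
        self.db.executemany("INSERT INTO coinbase_data VALUES (?, ?)", rows)

    def handle_query(self, dataset: str, since: int):
        # Parse the network-level request and route it to the engine.
        if dataset != "coinbase_data":
            raise KeyError(f"unknown data-set: {dataset}")
        cur = self.db.execute(
            "SELECT ts, price FROM coinbase_data WHERE ts >= ? ORDER BY ts",
            (since,),
        )
        return cur.fetchall()

node = PersistenceNode()
node.ingest([(1, 2750.0), (2, 2762.5)])
print(node.handle_query("coinbase_data", since=2))  # [(2, 2762.5)]
```

Because the node only translates requests, swapping SQLite for another engine — or eventually a distributed database — would not change the network-facing interface.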
Subscribing (or unsubscribing) to a data-set happens entirely on-chain. To make discovery easy, data sources are ranked according to a simple public metric.
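Purely as an illustration of how such a metric could work (the formula below is our assumption, not Enigma’s actual metric), a ranking might weight the revenue a source generates by the curator’s locked stake:

```python
def rank(subscribers: int, monthly_price: int, deposit: int) -> float:
    """Illustrative ranking: subscription revenue, weighted by curator stake.

    This formula is an assumption for illustration only; the marketplace's
    actual public metric may differ.
    """
    return subscribers * monthly_price * (1 + deposit / 10_000)

sources = {
    "coinbase_data": rank(subscribers=120, monthly_price=10, deposit=5000),
    "stale_feed": rank(subscribers=3, monthly_price=10, deposit=100),
}
best = max(sources, key=sources.get)
print(best)  # coinbase_data
```

Any metric computed solely from on-chain state (subscriber counts, prices, deposits) can be verified by every node, which is what makes the ranking "public."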
Once a user finds a data-set that they like and wish to subscribe to, they send a subscription transaction specifying the unique path of the data-set, as well as the required payment for it. This transaction can be automated in the client so that it is repeated every month. In fact, this will likely be the default behavior. If the payment is valid, the smart-contract adds the signer’s address to the list of subscribers.
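The on-chain side of subscription can be modeled as a toy contract. The method names and the 30-day period below are assumptions for illustration (a real deployment would be an Ethereum smart contract, not Python):

```python
import time

class MarketplaceContract:
    """Toy model of the on-chain subscription logic."""

    PERIOD = 30 * 24 * 3600  # assumed monthly subscription window, in seconds

    def __init__(self):
        self.listings = {}      # data-set path -> monthly price in tokens
        self.subscribers = {}   # data-set path -> {address: expiry timestamp}

    def register(self, path: str, price: int):
        self.listings[path] = price
        self.subscribers[path] = {}

    def subscribe(self, path: str, signer: str, payment: int, now=None):
        now = time.time() if now is None else now
        if payment < self.listings[path]:
            raise ValueError("insufficient payment")
        # Valid payment: add the signer's address to the subscriber list.
        self.subscribers[path][signer] = now + self.PERIOD

    def is_subscribed(self, path: str, signer: str, now=None) -> bool:
        now = time.time() if now is None else now
        return self.subscribers.get(path, {}).get(signer, 0) > now

c = MarketplaceContract()
c.register("<curator_address>/coinbase_data", price=10)
c.subscribe("<curator_address>/coinbase_data", signer="0xALICE", payment=10)
print(c.is_subscribed("<curator_address>/coinbase_data", "0xALICE"))  # True
```

The client-side automation mentioned above would simply resend the `subscribe` transaction each month before the expiry timestamp passes.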
As opposed to subscription, consumption happens completely off-chain. A user who wishes to query a specific data-set, for example — <curator_address>/coinbase_data, can broadcast a message requesting it. Nodes in the off-chain network propagate the message until it reaches a persistence node having direct access to the data. The persistence node then parses the request, verifying on-chain that the user is subscribed to the data. If the user has the appropriate permissions, the node obtains the data (potentially after issuing the parsed query on behalf of the user), encrypts it with the public key of the user, and propagates the message back through the network until it reaches its destination.
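The persistence node’s request-handling flow might look like the sketch below. Both the on-chain lookup and the encryption here are stand-ins for illustration: a real node would read the smart contract’s subscriber state and use proper public-key encryption, not the toy XOR cipher shown:

```python
import hashlib

# Stand-in for the on-chain subscriber list the node consults.
ON_CHAIN_SUBSCRIBERS = {"<curator_address>/coinbase_data": {"0xALICE"}}

# Stand-in for the data behind the persistence node.
DATASETS = {"<curator_address>/coinbase_data": b"ts,price\n1,2750.0\n"}

def toy_encrypt(pubkey: str, plaintext: bytes) -> bytes:
    """Placeholder for real public-key encryption. NOT secure; it just
    XORs against a key stream derived from the key string."""
    stream = hashlib.sha256(pubkey.encode()).digest()
    keystream = (stream * (len(plaintext) // len(stream) + 1))[:len(plaintext)]
    return bytes(p ^ k for p, k in zip(plaintext, keystream))

def handle_request(path: str, requester: str, pubkey: str) -> bytes:
    # 1. Verify on-chain that the requester is subscribed.
    if requester not in ON_CHAIN_SUBSCRIBERS.get(path, set()):
        raise PermissionError("not subscribed")
    # 2. Obtain the data (potentially by issuing the parsed query).
    data = DATASETS[path]
    # 3. Encrypt with the requester's public key before routing it back.
    return toy_encrypt(pubkey, data)

ct = handle_request("<curator_address>/coinbase_data", "0xALICE", "alice-pk")
# XOR is symmetric, so the same toy function "decrypts" on the client side.
print(toy_encrypt("alice-pk", ct))  # b'ts,price\n1,2750.0\n'
```

The key point is that permission checks read blockchain state, while the data itself never touches the chain.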
A leading principle in the marketplace is that data curators are in charge of the quality and availability of their own data. This greatly simplifies the incentive mechanism and ensures users of the system enjoy better service. As mentioned before, data curators need to stake tokens on every data-set they share. If the data goes offline, or is otherwise corrupt (addressing this is beyond the scope of this post), they will have to pay for it from their deposit. Such failures are likely to reduce the number of subscribers to the data, further decreasing the payoff of the data curator. Under the assumption that curators are rational agents who wish to maximize their utility, this ensures their best strategy is to provide quality data and ensure its availability.
While data curators can construct and serve their data through a local node acting as a worker and persistence node, they may choose to outsource these operations to other nodes. This allows data curators to go offline after registering a new data-set. Worker and persistence nodes are specified as additional payout addresses in the registration transaction (which can be updated by the data curator at any time). This ensures that these nodes remain incentivized to do good work. If the curator wishes to keep all the rewards to herself, she can specify her own node(s) as the worker/persistence address.
In some cases, data that is sold in the system would need to remain confidential even in use. At Enigma, we have done a great deal of research on computing over encrypted data, and our aim is to gradually introduce these ideas into our protocol.
Initially, we plan to introduce relatively fast deterministic and order-preserving encryption mechanisms that will enable data curators to encrypt their data at the source. This presents a fairly good trade-off in practice, providing adequate, albeit not perfect, security guarantees, as these schemes do not satisfy indistinguishability against adaptive chosen-plaintext attacks. Later on, we plan to gradually introduce more complex techniques that provide stronger security. We refer you to our whitepaper to learn more.
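To give a feel for why determinism enables querying at the source: equal plaintexts map to equal ciphertexts, so a node can match values it cannot read. The sketch below uses an HMAC-derived token as a stand-in for a real deterministic cipher (HMAC is not invertible, so this only supports equality matching; it is illustrative, not our actual scheme):

```python
import hashlib
import hmac

CURATOR_KEY = b"curator-secret"  # held by the curator/subscribers, never by nodes

def det_token(value: str) -> str:
    """Deterministic keyed token: equal plaintexts -> equal tokens."""
    return hmac.new(CURATOR_KEY, value.encode(), hashlib.sha256).hexdigest()

# The curator uploads rows with sensitive columns tokenized at the source.
encrypted_rows = [
    {"user": det_token("alice"), "trades": 14},
    {"user": det_token("bob"), "trades": 3},
    {"user": det_token("alice"), "trades": 7},
]

# A subscribed client (who shares the key) queries for "alice"; the
# persistence node matches tokens without ever seeing the plaintext.
target = det_token("alice")
total = sum(r["trades"] for r in encrypted_rows if r["user"] == target)
print(total)  # 21
```

This also makes the security trade-off concrete: the node learns which rows share a value (equality patterns leak), which is exactly why such schemes fall short of indistinguishability under chosen-plaintext attacks.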
Open research questions
There are many open research questions that we’re still thinking about, and we welcome everyone to help us make progress on them. Some of the questions present practical problems that we will deal with in the (far) future; others are interesting, related intellectual questions.
- Off-chain consensus — blockchains are not yet scalable enough to run lengthy validation scripts, for example, a script that verifies that a data-set adheres to specific validation rules. For that, some kind of weak-consensus model, or a quorum-based approach, could be applicable.
- Rewards/penalties model for reducing trust in outsourced nodes.
- Preventing data leakage — developing a client that connects to and serves queries in the Enigma network through SGX. This would prevent exfiltration of data by persistence nodes.
- Improving robustness and security of the off-chain layer based on newer, more secure BFT consensus and routing protocols.
We hope this post was helpful in explaining the details of our long-term vision of creating a sustainable decentralized data marketplace — one that is curated by the community, operated by an off-chain network, and facilitated through a blockchain. If you haven’t done so already, and you wish to better understand why we’re building a data marketplace for the web, please read the first part of this post.
If you have any questions, please join our community on Slack or Telegram. We look forward to working with you to achieve our vision for the future of data and crypto investing!