How to use Ethereum, web3, and your own database for performant and cryptoassured shared data

Buzz Moschetti
Apr 27, 2023
5 min read

Crypto-assured Data Sharing

We know that Ethereum provides a great platform for decentralized maintenance of cryptoassured data -- data that cannot be modified without invalidating hashes, nor can new states of data be "inserted in between" 2 consecutive changes to state. The web3 frameworks (webj, web3js, web3.py, etc.) also make it very easy and type-safe to engage with an Ethereum provider without using the low level JSON-RPC API. But there are a number of practical issues:

Data storage on Ethereum is (comparatively) very expensive. Assume storing a 256-bit word costs 20000 gas; thus, 1KB costs 640K gas. At market price of $1864.69/ETH on 24-Apr-2023, this is nearly $60/KB.
Querying and analytics on the data is very poor to say the least
It is very difficult to integrate on-chain data with off-chain data in a way that does not break the cryptoassurances of the way on-chain data is consumed.

Given that almost all consumers of on-chain data will also have off-chain data needs and that off-chain data is nearly 100% likely to be residing in a popular database such as Postgres, MongoDB, Oracle, MySQL, or SQLite, an ideal solution is to bridge the two worlds by separating the two concerns:

The database will be used to store the data because it has to anyway to support integration, query, and analytics requirements.
The blockchain will store only the hash of the new material -- not the material itself -- and furthermore will emit the new material plus the hash on the transaction log.

Here is the workflow:

1. A new function and associated event is added to a contract Foo e.g.:

contract Foo {
     bytes32 last_update_dgst;
	 event NewDataPosted(bytes memory data, bytes32 dgst);
	 ...
     function dataBridge(bytes memory data, bytes32 k256_hash) public {
	   bytes32 chk = keccak256(data);

	   if(chk != k256_hash) {
	     revert("keccak256 mismatch between supplied and computed values");
	  }

	  last_update_dgst = chk; // change state on bytes32
	  emit NewDataPosted(data,last_update_dgst);
    }
}

Assume we create a new Foo at address 0x8 (we shorten the address for simplicity here).

2. A client side application creates data, persists it in "unverifed" state or similar, and then posts to the blockchain. Note the importance of the ability to convert the SomeData object into a byte array representation. We will use a combination of python3 and Java for the pseudocode; Java tends to provide a little more color on types:

{
	   SomeData my_data;

	   my_data.set_business_key("K1"); // the unique logical key for this
	                   // data, not necessarily the DATABASE
					   // unique key e.g. cid or_id

	   byte[] bb = convert_to_bytes(my_data);
	   byte[] my_hash = Keccak256(bb);

	   # Representative example of things we would want to save:
       persist_to_DB(my_data, my_hash, '0x8', "unverified");

	   # Now push to the blockchain. We do not show the ABI setup
	   # and web3 provider connectivity here for simplicity; the
	   # important part is the call to method "dataBridge" with
	   # arguments bb and my_hash:
	   tx_receipt = web3j.contracts['0x8'].dataBridge(bb, my_hash);
}

Note that calling a state change function on contracts does not return data in-line but rather a transaction receipt.

3. Next, a peer application to the poster (or perhaps the same program in a different thread) sets up a filter on the ethereum transaction log for the contract address and specific event name that we emitted in the smart contract:

current_block_num = fetch last block processed from database
contract = w3.eth.contract(address='0x8')
transfer_filter = contract.events['NewDataPosted'].create_filter(fromBlock=current_block_num)

logs = transfer_filter.get_all_entries()

4. For each log, extract the content, deserialize, match IDs, and update the database:

# web3.py as an example sets event data into 'args'
byte[] bb = log['args']['data']
byte[] khash = log['args']['dgst']
long blockNum = log.get_block_number()
 
# Recompute hash to ensure incoming is not corrupted:
assert(khash == Keccak256(bb))

SomeData d2 = SomeData.fromBytes(bb)

Map<String,Object> record = lookup_in_DB(d2.get_business_key())

if record is None:
  print("error; cannot find matching record to verify")
else:
  if record['hash'] == khash:
    # Our stored record for some business key has a hash that
    # matches the incoming event; mark the record as verified!
    update_in_DB(d2.get_business_key(), "VERIFIED", blockNum)
  else:
    print("error: record found but hashes do not match")

That's it! The database now has a cryptoassured copy of the data and the blockchain holds the "fingerprint" to the data. A slightly more complex but more useful implementation might also include a timestamp on the blockchain along with the hash.

Features:

This serves as a "public redistribution" system because the data will end up in all the TX logs, guaranteeing at least 1 Ethereum node can have its TX log scanned for a particular contract and event topic which will yield our emitted data. Individual copies of data in local databases are of course physically mutable because a DBA/root can change the data -- and the hash as well -- but a rehash of data compared against the blockchain will fail. Even a sophisticated hack involving local manipulation of an ethereum node storage trie (to create a "local lie") in combination with a hack on the local database will fail if a comparison is made to any other node that has captured the TX log.
There is (comparatively) little gas cost because very little information is changing in blockchain storage.
Clients can efficiently wait for changes in the filter framework; the last seen block is saved with the "VERIFIED" flag to eliminate reprocessing.

Considerations:

The practical upper limit of transmitting data through the transaction log appears to be about 41Kb. A common misunderstanding is that gas is only consumed by state changing functions and is proportional to the amount of data being changed. In fact, both pure logic (especially for() loops!) and emitting data on the log costs a very small amount of gas but for bigger tasks this can add up and you will eventually run out of gas.
All data on the blockchain including logs is public; therefore, care must be taken not to expose sensitive/personal data. However, this overall architecture can in fact help to solve this problem by publishing anonymized identifiers in the log which can be easily and performantly linked back to "real" identifiers in the host database which can be properly secured, controlled, and managed.
The storage of the new hash into the blockchain is "transactionally consistent" with the storage of the data in the TX log after consensus but the capture of the new data in the TX log will lag, and the lag may vary from listener to listener. If authoritative state is required at any point in time, then any local query function must always interrogate the blockchain for the stored hash in a read-only fashion at the same time as the local DB to ensure that the hashes match, indicating the local database copy is up to date with the blockchain. If the hashes do not match, the local environment must wait until the update appears on the log and can be captured. Read-only requests of just 32 bytes of data (the keccak256 hash) are essentially zero cost and very fast -- although certainly not as fast as a SELECT against the local DB.

Susynct

How to use Ethereum, web3, and your own database for performant and cryptoassured shared data

Recent Posts

Comments

Contact us

Susynct