I read the Apache ShardingSphere documentation several years ago and thought it was the best client-side database sharding library. After trying it in a real-world application, problems appeared. First, the ecosystem has grown very large: even a demo Spring Boot application pulls in a lot of dependencies. Second, it does not use multiple threads when loading a large data set from multiple shards, so I still had to implement that myself to improve load time.
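A hand-rolled parallel load can look roughly like the sketch below. It is only an illustration of the idea, not what ShardingSphere does internally: `ParallelShardLoader`, `loadAll`, and the `loadShard` callback are hypothetical names, and a real `loadShard` would run a query against shard `i`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.IntFunction;

/** Hypothetical helper: loads rows from every shard in parallel and merges them. */
class ParallelShardLoader {

    /**
     * Calls loadShard.apply(i) for each i in [0, shardCount) on a fixed thread pool
     * (one thread per shard) and concatenates the results in shard order.
     */
    static <T> List<T> loadAll(int shardCount, IntFunction<List<T>> loadShard) {
        ExecutorService pool = Executors.newFixedThreadPool(shardCount);
        try {
            // Submit one task per shard so all shards are queried concurrently.
            List<CompletableFuture<List<T>>> futures = new ArrayList<>();
            for (int i = 0; i < shardCount; i++) {
                final int shard = i;
                futures.add(CompletableFuture.supplyAsync(() -> loadShard.apply(shard), pool));
            }
            // Collect results in shard order; join() rethrows a shard failure
            // wrapped in a CompletionException.
            List<T> merged = new ArrayList<>();
            for (CompletableFuture<List<T>> future : futures) {
                merged.addAll(future.join());
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }
}
```

A call such as `ParallelShardLoader.loadAll(8, i -> queryShard(i))` would then fan out over eight shards, where `queryShard` stands for whatever per-shard query the application already has.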

What I actually need is the ability to select a database shard implicitly: when I write select ... from t_user, it is rewritten to select ... from t_user[0-7]. Here are some alternatives I found:

1. Hibernate interceptor

Refer to the Javadoc of the StatementInspector class.

2. datasource-proxy

See: https://jdbc-observations.github.io/datasource-proxy/docs/current/user-guide/#built-in-support

3. Spring Boot 3

See: https://spring.io/blog/2022/05/02/ever-wanted-to-rewrite-a-query-in-spring-data-jpa

However, Spring Boot 3 requires Java 17, and this approach only applies to JPA repositories.
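To make the first option more concrete, here is a minimal sketch of the table-name rewrite. Hibernate's StatementInspector declares a single method, `String inspect(String sql)`; the class below only mirrors that method so that it compiles without Hibernate on the classpath. Everything else is an assumption for illustration: the class name, the thread-local shard holder, the modulo routing, and the `t_user0` to `t_user7` naming.

```java
import java.util.regex.Pattern;

/**
 * Hypothetical shard-aware SQL rewriter. To plug it into Hibernate, the class
 * would additionally declare
 * "implements org.hibernate.resource.jdbc.spi.StatementInspector".
 */
class ShardTableInspector {

    static final int SHARD_COUNT = 8; // assumption: shard tables t_user0 .. t_user7

    // Shard chosen for the current thread; null means "do not rewrite".
    private static final ThreadLocal<Integer> CURRENT_SHARD = new ThreadLocal<>();

    // Matches the logical table name as a whole word, so t_user_role is left alone.
    private static final Pattern LOGICAL_TABLE = Pattern.compile("\\bt_user\\b");

    /** Assumed routing rule: user id modulo shard count, always non-negative. */
    static int shardFor(long userId) {
        return (int) Math.floorMod(userId, (long) SHARD_COUNT);
    }

    static void useShard(int shard) {
        CURRENT_SHARD.set(shard);
    }

    static void clear() {
        CURRENT_SHARD.remove();
    }

    /** Same shape as StatementInspector#inspect: returns the SQL to execute. */
    public String inspect(String sql) {
        Integer shard = CURRENT_SHARD.get();
        if (shard == null) {
            return sql; // no shard selected: pass the statement through unchanged
        }
        return LOGICAL_TABLE.matcher(sql).replaceAll("t_user" + shard);
    }
}
```

The caller would set the shard before running a query, for example `ShardTableInspector.useShard(ShardTableInspector.shardFor(userId))`, and call `clear()` afterwards. The regex is deliberately naive: it would also rewrite `t_user` inside a string literal, so a real implementation would need a proper SQL parser or stricter conventions.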

Feature matrix of NoSQL databases, from the appendix of Seven Databases in Seven Weeks:

| | MongoDB | CouchDB | Riak | Redis | PostgreSQL | Neo4j | HBase |
|---|---|---|---|---|---|---|---|
| Genre | Document | Document | Key-value | Key-value | Relational | Graph | Columnar |
| Version | 2.0 | 1.1 | 1.0 | 2.4 | 9.1 | 1.7 | 0.90.3 |
| Datatypes | Typed | Typed | Blob | Semi-typed | Predefined and typed | Untyped | Predefined and typed |
| Data Relations | None | None | Ad hoc (Links) | None | Predefined | Ad hoc (Edges) | None |
| Standard Object | JSON | JSON | Text | String | Table | Hash | Columns |
| Written in Language | C++ | Erlang | Erlang | C/C++ | C | Java | Java |
| Interface Protocol | Custom over TCP | HTTP | HTTP, protobuf | Simple text over TCP | Custom over TCP | HTTP | Thrift, HTTP |
| HTTP/REST | Simple | Yes | Yes | No | No | Yes | Yes |
| Ad Hoc Query | Commands, mapreduce | Temporary views | Weak support, Lucene | Commands | SQL | Graph walking, Cypher, search | Weak |
| Mapreduce | JavaScript | JavaScript | JavaScript, Erlang | No | No | No (in the distributed sense) | Hadoop |
| Scalable | Datacenter | Datacenter (via BigCouch) | Datacenter | Cluster (via master-slave) | Cluster (via add-ons) | Cluster (via HA) | Datacenter |
| Durability | Write-ahead journaling, Safe mode | Crash-only | Durable write quorum | Append-only log | ACID | ACID | Write-ahead logging |
| Secondary Indexes | Yes | Yes | Yes | No | Yes | Yes (via Lucene) | No |
| Versioning | No | Yes | Yes | No | No | No | Yes |
| Bulk Load | mongoimport | Bulk Doc API | No | No | COPY command | No | No |
| Very Large Files | GridFS | Attachments | Luwak (deprecated) | No | BLOBs | No | No |
| Requires Compaction | No | File rewrite | No | Snapshot | No | No | No |
| Replication | Master-slave (via replica sets) | Master-master | Peer-based, master-master | Master-slave | Master-slave | Master-slave (in Enterprise Edition) | Master-slave |
| Sharding | Yes | Yes (with filters in BigCouch) | Yes | Add-ons (e.g., client) | Add-ons (e.g., PL/Proxy) | No | Yes via HDFS |
| Concurrency | Write lock | Lock-free MVCC | Vector-clocks | None | Table/row writer lock | Write lock | Consistent per row |
| Transactions | No | No | No | Multi operation queues | ACID | ACID | Yes (when enabled) |
| Triggers | No | Update validation or Changes API | Pre/post-commits | No | Yes | Transaction event handlers | No |
| Security | Users | Users | None | Passwords | Users/groups | None | Kerberos via Hadoop security |
| Multitenancy | Yes | Yes | No | No | Yes | No | No |
| Main Differentiator | Easily query Big Data | Durable and embeddable clusters | Highly available | Very, very fast | Best of OSS RDBMS model | Flexible graph | Very large-scale, Hadoop infrastructure |
| Weaknesses | Embed-ability | Query-ability | Query-ability | Complex data | Distributed availability | BLOBs or terabyte scale | Flexible growth, query-ability |