Read documents of Apache shardingsphere several years ago, and used to think it is the best database sharding library in client side. After trying to use it in a real-world application, problems reveal. First, the ecosystem has grown so large. Even a demo spring boot application can reference lots of dependencies. Second, when loading large data set from multiple shards, multi-threading is not used. I still have to manually implement it myself to improve load time.
Actually, what I need is the ability for selecting a database shard implicitly. When I write select t_user from..., it is rewritten to select t_user[0-7] from.... Here’s some alternative options I found:
Adopting to using Spring Data JPA these day, there is a post saying: IDENTITY generator disables JDBC batch inserts. To figure out the impact, create a table with 10 data fields and an auto-increment id for testing. I am using MySQL 5.7.20 / MariaDB 10.3.3 / Spring Data JPA 1.11.8 / Hibernate 5.0.12.
MySQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
CREATETABLE`t_user`(
`id`int(11)NOT NULLAUTO_INCREMENT,
`field1`varchar(255)DEFAULTNULL,
`field2`varchar(255)DEFAULTNULL,
`field3`varchar(255)DEFAULTNULL,
`field4`varchar(255)DEFAULTNULL,
`field5`varchar(255)DEFAULTNULL,
`field6`varchar(255)DEFAULTNULL,
`field7`varchar(255)DEFAULTNULL,
`field8`varchar(255)DEFAULTNULL,
`field9`varchar(255)DEFAULTNULL,
`field10`varchar(255)DEFAULTNULL,
PRIMARY KEY(`id`)
)ENGINE=InnoDBDEFAULTCHARSET=utf8;
And generate the persistence entity, add @GeneratedValue annotation:
As mentioned, Hibernate/JPA disables batch insert when using IDENTITY. Look into org.hibernate.event.internal.AbstractSaveEventListener#saveWithGeneratedId() for details. To make it clear, it DOES run faster when insert multiple entities in one transaction than in separated transactions. It saves transaction overhead, not round-trip overhead.
The generated key is eventually retrieved from java.sql.Statement#getGeneratedKeys(). And datasource-proxy is used to display the underlining SQL generated.
2. TABLE
Now switch to GenerationType.TABLE. Just uncomment the corresponding @GeneratedValue and @TableGenerator annotation. Result looks like:
I began to think that was the whole story for batch, and the datasource-proxy interceptor also traced down the batch SQL. But after I looked into dumped TCP packages using wireshark, I found the final SQL was still not in batch format. Say, they were in:
The latter one saves client/server round-trips and is recommended by MySQL. After adding rewriteBatchedStatements=true to my connection string, MySQL generated batch statements and result was much improved:
Last switch to GenerationType.SEQUENCE. Sequence is a new feature added in MariaDB 10.3 series. Create a sequence in MariaDB with:
MySQL
1
CREATESEQUENCE`s_user`STARTWITH1INCREMENTBY100;
Generally, the increment should match the one specified in @SequenceGenerator, at least >= allocationSize. See org.hibernate.id.enhanced.PooledOptimizer#generate().
Hibernate apparently does not support the new feature, I dealt with it by adding a new dialect:
Java
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
packagecom.gonwan.spring;
importorg.hibernate.dialect.MySQL5Dialect;
/*
* Copied from org.hibernate.dialect.PostgreSQL81Dialect.
supportsSequences() adds the sequence support. supportsPooledSequences() adds some pool-like optimization both supported by MariaDB and Hibernate. Otherwise, Hibernate uses tables to mimic sequences. Refer to org.hibernate.id.enhanced.SequenceStyleGenerator#buildDatabaseStructure(). Result with and without batch:
Dramatically improved when compared to the table generator. A sequence generator uses cache in memory(default 1000), and is optimized to eliminate lock when generating IDs.
4. Summary
1 thread
2 threads
4 threads
8 threads
16 threads
32 threads
IDENTITY
823
609
1188
2329
4577
9579
TABLE
830
854
1775
3479
6542
13768
TABLE with batch
433
409
708
1566
2926
6388
SEQUENCE
723
615
1147
2195
4687
9312
SEQUENCE with batch
298
155
186
356
695
1545
From the summary table, IDENTITY is simplest. TABLE is a compromise to support batch insert. And SEQUENCE yields the best performance. Find the entire project in Github.