Seeding Elasticsearch with test data using zefaker and esbulk
In this short article, we will use zefaker to generate 5 million records of random data and then index them in Elasticsearch using esbulk.
Prerequisites:
- zefaker (version 0.6 at the time of writing)
- JDK 15+
- Go 1.16+
- esbulk
- Elasticsearch 7+ / OpenSearch 1.0.0
- curl (optional, but really you should have this already)
Creating the zefaker file
First, we need a Groovy script that tells zefaker what shape our random data should take.
Copy the snippet below into a file named data.groovy, which we will pass to zefaker to generate our data.
// in data.groovy
import com.google.gson.JsonObject

firstName = column(index= 0, name= "firstName")
lastName = column(index= 1, name= "lastName")
age = column(index= 2, name= "age")
accountStatus = column(index= 3, name= "accountStatus")
accountMeta = column(index= 4, name= "accountMeta")

generateFrom([
    (firstName): { faker -> faker.name().firstName() },
    (lastName): { faker -> faker.name().lastName() },
    (age): { faker -> faker.number().numberBetween(18, 70) },
    (accountStatus): { faker -> faker.options().option("Open", "Closed") },
    // You can nest objects like this
    (accountMeta): { faker ->
        def meta = new JsonObject()
        meta.addProperty("totalTokens", faker.number().numberBetween(5000, 10000))
        meta.addProperty("activityStatus", faker.options().option("Active", "Dormant"))
        return meta
    }
])
Generating the data
zefaker needs Java to run, so I'm assuming you have the java command on your PATH.
With that in place, we can run the following to generate 5 million rows of random data exported in JSON Lines format (a plain text file where each line is a JSON object).
$ java -jar zefaker-all.jar -f data.groovy -jsonl -output elasticdata.jsonl -rows 5000000
You can also try the zefaker web instance I have running here, which saves you from having to install Java or zefaker on your machine. Make sure you select JSON Lines as the export option.
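Before indexing, it can be worth a quick sanity check on the generated file. The sketch below assumes a POSIX-style shell with `wc` and `head` available, and that every record ends with a newline. The helper name `check_jsonl` is hypothetical and not part of zefaker. It compares the number of lines in the file with the row count you asked for and prints the first record so you can eyeball its shape.

```shell
# check_jsonl FILE EXPECTED_ROWS
# Hypothetical helper (not part of zefaker): verifies that FILE has
# EXPECTED_ROWS lines and prints the first record for a visual check.
check_jsonl() {
  file="$1"
  expected="$2"
  # wc -l counts newline characters, i.e. one per JSON Lines record
  # (assumes the last record also ends with a newline)
  actual=$(wc -l < "$file" | tr -d '[:space:]')
  if [ "$actual" -ne "$expected" ]; then
    echo "expected $expected rows, found $actual" >&2
    return 1
  fi
  echo "rows: $actual"
  head -n 1 "$file"
}
```

With the file generated above, you would call it as `check_jsonl elasticdata.jsonl 5000000`.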
Getting the esbulk utility
We will use esbulk, a nifty little command-line program written in Go, to perform the indexing. We have to build it first.
$ git clone https://github.com/miku/esbulk
$ cd esbulk
$ go build
This will create an executable named esbulk (esbulk.exe on Windows). You can add it to your PATH.
Indexing the data in Elasticsearch
Again, I'm assuming you have Elasticsearch or OpenSearch installed and running. We can use the following command to index our data:
$ esbulk -index "people-2021.07.07" -optype create -server http://localhost:9200 < elasticdata.jsonl
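For context, Elasticsearch ingests batches through its `_bulk` endpoint, whose body alternates an action line with a document line. The sketch below builds such a body by hand with `awk`, to illustrate the kind of request a bulk indexer sends. That esbulk's own requests look exactly like this is an assumption, and `to_bulk_body` is a hypothetical helper name, not an esbulk feature.

```shell
# to_bulk_body INDEX < docs.jsonl
# Hypothetical helper: prefixes every non-empty JSON Lines record read
# from stdin with a "create" action line, producing a _bulk request body.
to_bulk_body() {
  awk -v idx="$1" 'NF > 0 {
    printf "{\"create\":{\"_index\":\"%s\"}}\n", idx  # action line
    print                                             # document line
  }'
}
```

Running `head -n 2 elasticdata.jsonl | to_bulk_body "people-2021.07.07"` prints four lines: two action lines, each followed by its document. For a small sample like this, the output could be posted to `http://localhost:9200/_bulk` with curl using `--data-binary @-` and a `Content-Type: application/x-ndjson` header. For the full 5 million rows, esbulk remains the better tool since it batches the requests for you.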
After esbulk completes (silently), you can check that the operation succeeded by visiting http://localhost:9200/people-2021.07.07/_search in your browser or by using curl like so:
$ curl -G http://localhost:9200/people-2021.07.07/_search
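A search only returns the first few hits, so to confirm that every row arrived, the `_count` endpoint is a better fit. The sketch below assumes the same server and index name as above. `extract_count` and `index_count` are hypothetical helper names; the extraction relies on `grep` and `cut`, which is good enough for a quick check but is not a general JSON parser.

```shell
# extract_count: reads a _count response on stdin and prints the number.
# Hypothetical helper; a quick text match, not a real JSON parser.
extract_count() {
  grep -o '"count" *: *[0-9]*' | head -n 1 | cut -d: -f2 | tr -d '[:space:]'
}

# index_count SERVER INDEX
# Asks the server how many documents INDEX holds and prints the number.
index_count() {
  curl -s "$1/$2/_count" | extract_count
}
```

Calling `index_count http://localhost:9200 people-2021.07.07` should print the number of rows you generated if every record was indexed. If the number is lower right after esbulk finishes, give the index a moment to refresh and try again.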
And that's all, folks. Hope you found this useful.
Show some love and star zefaker :)