REDIS: Iterating through database with SCAN

It has recently come to my attention that iterating through stuff is quite important, be it lists, sets or hashes. This can generally be accomplished using the generic SCAN method in Redis (and its cousins HSCAN, SSCAN and ZSCAN). However convenient, it is a little confusing at first, so let’s analyze it a bit closer.

The first thing to understand is the cursor: each call to SCAN returns a new cursor together with a batch of results, and you feed that cursor back into the next call until it comes back as 0. Another option you will notice is COUNT, which defaults to 10. Now you might come up with the same idea as I did: surely setting COUNT to 1 will make SCAN act like a normal iterator, returning one value at a time.

This is not the case as I had to learn the hard way.

Strangely enough, COUNT is only a hint: it tells Redis roughly how many elements to return per call. Even if set to 1, you will very often receive two or more entries at a time, so you will always need to iterate through the returned values if you want to deal with entries one by one.

Python code example

So say you have a hash called ‘hash_database’, because you have no imagination and it’s late, and you want to iterate through its members using redis-py. You would probably want to write something like this:


import redis

database = redis.StrictRedis('localhost')
cursor = 0
while True:

    cursor, entries = database.hscan('hash_database', cursor)

    # Just printing entries one by one
    for name, content in entries.items():
        print("Name: {} -- Content: {}".format(name, content))

    if cursor == 0:
        break

This is a pretty simple way to handle this odd iteration method. If you’re concerned about the number of round trips to the server, you can set COUNT to some higher number and receive larger chunks of your data at each call. Once again, those chunks will vary in size with each iteration, but generally stay close to the defined number.
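If you want to see the shape of this iteration without a live server, here is a pure-Python simulation of the cursor protocol. The cursor values and batch sizes below are made up, but the contract is the same as Redis’s: each call returns a new cursor plus an unpredictably sized batch, and a returned cursor of 0 means the scan is complete.

```python
# Hypothetical scan responses: cursor -> (next_cursor, entries)
BATCHES = {0: (17, ['a', 'b']), 17: (0, ['c', 'd', 'e'])}

def fake_scan(cursor):
    """Return (next_cursor, entries), mimicking what SCAN does."""
    return BATCHES[cursor]

collected = []
cursor = 0
while True:
    cursor, entries = fake_scan(cursor)
    for entry in entries:   # batches vary in size, so always iterate
        collected.append(entry)
    if cursor == 0:         # Redis signals completion with cursor 0
        break

print(collected)  # ['a', 'b', 'c', 'd', 'e']
```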

As much as this is not extremely complex, redis-py makes your life even simpler by offering the convenience wrapper hscan_iter (there are matching scan_iter, sscan_iter and zscan_iter wrappers for the other types). This will simplify the above code thusly:

import redis

database = redis.StrictRedis('localhost')

for entry in database.hscan_iter('hash_database'):
    print("Name: {} -- Content: {}".format(entry[0], entry[1]))

How much simpler is this?! Note that in this case the entries will be returned one by one and you don’t have to worry about the cursor value. You can still specify the COUNT value, which determines roughly how many database entries are fetched in each call to the database; under the hood the wrapper still issues HSCAN with the cursor and simply flattens the batches for you.
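For what it’s worth, the wrapper is conceptually just a generator around the same cursor loop as before. This is not redis-py’s actual implementation, only a sketch of the idea:

```python
def hscan_iter_sketch(client, name, count=None):
    """Yield (field, value) pairs one by one, hiding the cursor entirely."""
    cursor = 0
    while True:
        cursor, entries = client.hscan(name, cursor, count=count)
        for item in entries.items():
            yield item
        if cursor == 0:
            break
```

Used with a real client object it behaves like hscan_iter above: the caller just sees a flat stream of pairs.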

So why am I even bothering explaining the original SCAN method if redis-py makes it so much simpler? Well, mostly because pure Redis doesn’t have any SCAN_ITER command, so when accessing it directly or writing Lua scripts you’ll still have to rely on the cursor. Partially also because I find it quite amusing. It reminds me of The Twelve Tasks of Asterix, where they needed to fetch a permit A38 from ‘The Place That Sends You Mad’, sending Asterix and Obelix from one cubicle to another. If you’ve seen the movie and/or ever dealt with bureaucracy, you would know what I’m referring to.


REDIS: 3 ways to clone a database (in python)

It has recently come to my attention that REDIS is actually a very useful tool for managing your database. It’s very fast and supposedly reliable. At least Twitter and Pinterest seem to think so, and if Pinterest says it, who am I to oppose?

As much as REDIS is on the whole quite intuitive, one might still run into trouble, like I did when I was trying to clone a database and maintain an ‘original’ version alongside a version that had been edited. Apparently there are multiple ways to go about it, but as is often the case, some ways are better than others.

One REDIS instance with multiple databases

This might look like the most straightforward way to deal with multiple databases, as suggested here. REDIS is able to operate with multiple databases in one instance. If not specified, REDIS will use database 0, but if you so desire, you can select a different one by its numeric index. In order to achieve that you’ll need to do the following.

Start a single Redis server in your command line, or however you decide to do that, and proceed to Python:

import redis

database1 = redis.StrictRedis('localhost', port=6379, db=0)
database2 = redis.StrictRedis('localhost', port=6379, db=1)

database1.set('key1', 'value1')
database2.set('key2', 'value2')
...

Now you have two database objects, each pointing to the same instance but to a different logical database. You can access the two databases through their respective objects, as described in the redis-py documentation.

This might look convenient, but it’s definitely not recommended. Let’s hear from Salvatore Sanfilippo himself as stated here:

I understand how this can be useful, but unfortunately I consider Redis multiple database errors my worst decision in Redis design at all… without any kind of real gain, it makes the internals a lot more complex.

(Salvatore Sanfilippo, 15.5.2010)

So what is the correct way to do this?

Mark your keys

Instead of separating two databases, you might come up with a system of key names that reflects which dataset each key belongs to. There is nothing inherently wrong with that, and it might end up being the easiest way. One could imagine a system like this:

database1_key1
database1_key2
database2_key1
...
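In redis-py you can then pull out one ‘database’ at a time by giving scan_iter a glob-style MATCH pattern, e.g. database.scan_iter(match='database1_*'). The matching itself is ordinary glob matching on the key name, which you can illustrate without any server at all (the key names below are just the example scheme from above):

```python
from fnmatch import fnmatch

# The example key scheme from above
keys = ['database1_key1', 'database1_key2', 'database2_key1']

# MATCH 'database1_*' does glob-style filtering much like fnmatch
dataset1 = [key for key in keys if fnmatch(key, 'database1_*')]
print(dataset1)  # ['database1_key1', 'database1_key2']
```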

Use one REDIS instance per database

If you don’t like the idea of mixing up the keys from two databases, having two separate instances of REDIS running should do the work. In this case you’ll need to specify a port for each one of them as follows:

redis-server --port 6379
redis-server --port 6380

This way you can specify the port value for each REDIS instance separately, and this is the port you will use in your Python code.

import redis

database = redis.StrictRedis('localhost', port=6379)

database.set('key1', 'value1')

Copying data between databases

Last but not least, here is a little tip on how to clone a database. Again there are a few ways to do that.

Dump the database and reload it

This is suggested over here. It feels a little bit more like a hack than a proper procedure, and I haven’t tested it, but I don’t see a reason why it wouldn’t work. You can dump the whole database into a file asynchronously using:


database.bgsave()

This will create a dump.rdb file, which contains a copy of your database. You’ll need to copy it into a different folder and initiate the second database from there. As the second database starts, it will look for a dump.rdb file in its working directory to initialize from. This will effectively create a new clone.
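In shell terms, the whole dance might look roughly like this. The paths are made up for illustration; the important part is that the clone’s working directory, set with --dir, contains the copied dump.rdb:

```shell
# Ask the running instance to write its dataset to dump.rdb in the background
redis-cli -p 6379 BGSAVE

# Copy the dump into a fresh working directory for the clone
# (wait for the BGSAVE to finish first, e.g. by watching LASTSAVE)
mkdir -p /tmp/redis-clone
cp /var/lib/redis/dump.rdb /tmp/redis-clone/

# Start the clone; on boot it loads dump.rdb from its working directory
redis-server --port 6380 --dir /tmp/redis-clone
```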

Enslave and liberate method

This is not exactly an official name, but it’s quite descriptive of what’s going on. You will need to create two instances and make one a slave of the other:


database1 = redis.StrictRedis('localhost', port=6379)
database2 = redis.StrictRedis('localhost', port=6380)

database2.slaveof('localhost', port=6379)

The property of a slave database is that it receives a copy of all the entries in its master database. This transfer happens in the background, so you don’t have to worry about it.

Now, because we want to have a clone of the original database, we need to ‘un-slave’ the second database, or liberate it if you will. And a free database is a slave to whom? Nobody! And that’s precisely how it’s done.


database2.slaveof()

This breaks the bond between the databases and stops the data from being transferred along with future edits in database1. And there you have it – a clone of the database is done. Just a side note though: you probably want to check that the data transfer is finished, so it might be a good idea to compare, for instance, the database sizes.

import time

# Wait until the slave has caught up with the master before detaching it
while database1.dbsize() != database2.dbsize():
    print('Waiting for the data transfer to finish')
    time.sleep(1)

database2.slaveof()

There is probably a better way to do this. I reckon there is a way to ask for a confirmation of the transfer, as REDIS is quite generous with its status information, but I didn’t have time to look into that. If you know any better, please leave me a comment. Thanks!
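For the record, a slave does report its synchronisation state in the INFO replication section: the master_link_status field reads ‘up’ once the link to the master is established. That field name is standard Redis; the helper around it is just my sketch, and you would feed it the dict returned by database2.info('replication') instead of comparing dbsize():

```python
def replica_synced(info):
    """Check the dict returned by client.info('replication') on a slave."""
    return info.get('role') == 'slave' and info.get('master_link_status') == 'up'

# Hypothetical INFO output of a synced slave (real output has many more fields)
sample = {'role': 'slave', 'master_link_status': 'up'}
print(replica_synced(sample))  # True
```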