How to implement "auto-increment" on Google AppEngine

Date: 2022-11-22 23:45:15

I have to label things in a "strictly monotonically increasing" fashion, be it invoice numbers, shipping label numbers, or the like.

  1. A number MUST NOT be used twice.
  2. Every number SHOULD be used when exactly all smaller numbers have been used (no holes).

Fancy way of saying: I need to count 1, 2, 3, 4, ... The number space I have available is typically 100,000 numbers, and I need perhaps 1,000 a day.

I know this is a hard problem in distributed systems, and often we are much better off with GUIDs. But in this case, for legal reasons, I need "traditional numbering".

Can this be implemented on Google AppEngine (preferably in Python)?

9 Answers

#1


24  

If you absolutely have to have sequentially increasing numbers with no gaps, you'll need to use a single entity, which you update in a transaction to 'consume' each new number. You'll be limited, in practice, to about 1-5 numbers generated per second - which sounds like it'll be fine for your requirements.
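
For illustration, a minimal sketch of this single-entity approach using the old db API; the Counter model and the 'invoice_counter' key name are assumptions chosen here, not part of the original answer:

from google.appengine.ext import db

class Counter(db.Model):
    count = db.IntegerProperty(default=0)

def _consume(key):
    counter = db.get(key)
    if counter is None:
        counter = Counter(key_name='invoice_counter')
    counter.count += 1
    counter.put()
    return counter.count

def next_number():
    key = db.Key.from_path('Counter', 'invoice_counter')
    # The read-modify-write runs in a single transaction, so concurrent
    # requests serialize on this one entity and no number is issued twice.
    return db.run_in_transaction(_consume, key)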

#2


7  

If you drop the requirement that IDs must be strictly sequential, you can use a hierarchical allocation scheme. The basic idea/limitation is that transactions must not span multiple entity groups.

For example, assuming you have the notion of "users", you can allocate an entity group for each user (creating some global object per user). Each user has a list of reserved IDs. When allocating an ID for a user, pick a reserved one (in a transaction). If no IDs are left, run a transaction that allocates, say, 100 IDs from the global pool, then run another transaction that adds them to the user and simultaneously withdraws one. Assuming each user interacts with the application only sequentially, there will be no contention on the user objects.
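
A hedged sketch of that scheme, again with the old db API; GlobalPool, UserAccount and the batch size of 100 are hypothetical names and values for illustration:

from google.appengine.ext import db

BATCH = 100

class GlobalPool(db.Model):      # a single entity, its own entity group
    next_id = db.IntegerProperty(default=1)

class UserAccount(db.Model):     # one per user, each its own entity group
    reserved = db.ListProperty(int)

def _draw_batch(pool_key):
    pool = db.get(pool_key) or GlobalPool(key_name='pool')
    start = pool.next_id
    pool.next_id = start + BATCH
    pool.put()
    return list(range(start, start + BATCH))

def _withdraw(user_key, new_batch):
    user = db.get(user_key) or UserAccount(key_name=user_key.name())
    if new_batch:
        user.reserved.extend(new_batch)
    allocated = user.reserved.pop(0)   # take the lowest reserved ID
    user.put()
    return allocated

def allocate_id(user_name):
    user_key = db.Key.from_path('UserAccount', user_name)
    user = db.get(user_key)
    new_batch = None
    if user is None or not user.reserved:
        # First transaction: reserve a block from the global pool.
        pool_key = db.Key.from_path('GlobalPool', 'pool')
        new_batch = db.run_in_transaction(_draw_batch, pool_key)
    # Second transaction: store the block (if any) and withdraw one ID.
    return db.run_in_transaction(_withdraw, user_key, new_batch)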

#3


7  

The gaetk - Google AppEngine Toolkit now comes with a simple library function to get a number in a sequence. It is based on Nick Johnson's transactional approach and can be used quite easily as a foundation for Martin von Löwis' sharding approach:

>>> from gaetk.sequences import *
>>> init_sequence('invoice_number', start=1, end=0xffffffff)
>>> get_numbers('invoice_number', 2)
[1, 2]

The functionality is basically implemented like this:

from google.appengine.ext import db

def _get_numbers_helper(keys, needed):
  results = []

  for key in keys:
    seq = db.get(key)   # a gaetkSequence entity with start/end/current/active
    start = seq.current or seq.start
    end = seq.end
    avail = end - start
    consumed = needed
    if avail <= needed:
      seq.active = False   # this range is exhausted after this allocation
      consumed = avail
    seq.current = start + consumed
    seq.put()
    results += range(start, start + consumed)
    needed -= consumed
    if needed == 0:
      return results
  raise RuntimeError('Not enough sequence space to allocate %d numbers.' % needed)

def get_numbers(needed):
  query = gaetkSequence.all(keys_only=True).filter('active =', True)
  return db.run_in_transaction(_get_numbers_helper, query.fetch(5), needed)

#4


5  

If you aren't too strict about sequentiality, you can "shard" your incrementer. This could be thought of as an "eventually sequential" counter.

Basically, you have one entity that holds the "master" count. Then you have a number of entities (based on the load you need to handle) that have their own counters. These shards reserve chunks of IDs from the master and serve out of their range until they run out of values.

Quick algorithm:

  1. You need to get an ID.
  2. Pick a shard at random.
  3. If the shard's start is less than its end, take its start and increment it.
  4. If the shard's start is equal to (or, uh-oh, greater than) its end, go to the master, take its value and add an amount n to it. Set the shard's start to the retrieved value plus one and its end to the retrieved value plus n.

This can scale quite well; however, the amount you can be off by is the number of shards multiplied by your n value. If you just want your records to appear to count upward, this will probably work, but if you need them to represent strict order it won't be accurate. It is also important to note that the latest values may have holes, so if you are scanning them for some reason you will have to mind the gaps.
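
A rough sketch of this algorithm under the same old-db assumptions as above; Master, Shard, NUM_SHARDS and RESERVE (the n above) are illustrative names, not from the original answer:

import random
from google.appengine.ext import db

NUM_SHARDS = 10
RESERVE = 100    # the "n" above: IDs taken from the master per refill

class Master(db.Model):
    next_id = db.IntegerProperty(default=1)

class Shard(db.Model):
    start = db.IntegerProperty(default=0)
    end = db.IntegerProperty(default=0)    # exclusive

def _take(shard_key):
    shard = db.get(shard_key) or Shard(key_name=shard_key.name())
    if shard.start >= shard.end:
        return None                        # exhausted; caller refills
    value = shard.start
    shard.start += 1
    shard.put()
    return value

def _reserve(master_key):
    master = db.get(master_key) or Master(key_name='master')
    start = master.next_id
    master.next_id = start + RESERVE
    master.put()
    return start

def _refill(shard_key, start):
    shard = db.get(shard_key) or Shard(key_name=shard_key.name())
    shard.start = start
    shard.end = start + RESERVE
    shard.put()

def next_id():
    shard_key = db.Key.from_path('Shard',
                                 'shard-%d' % random.randrange(NUM_SHARDS))
    value = db.run_in_transaction(_take, shard_key)
    while value is None:
        # Refill from the master, then try this shard again. A concurrent
        # refill of the same shard can discard part of a block, which is
        # one source of the holes mentioned above.
        master_key = db.Key.from_path('Master', 'master')
        start = db.run_in_transaction(_reserve, master_key)
        db.run_in_transaction(_refill, shard_key, start)
        value = db.run_in_transaction(_take, shard_key)
    return value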

Edit

I needed this for my app (that was why I was searching for the question :P ), so I have implemented my solution. It can grab single IDs as well as efficiently grab batches. I have tested it in a controlled environment (on App Engine) and it performed very well. You can find the code on github.

#5


4  

Take a look at how the sharded counters are made. It may help you. Also, do you really need them to be numeric? If uniqueness is sufficient, just use the entity keys.

#6


0  

Remember: sharding increases the probability that you will get a unique, auto-incremented value, but does not guarantee it. Please take Nick's advice if you MUST have a unique auto-increment.

#7


0  

I implemented something very simplistic for my blog, which increments an IntegerProperty, iden, rather than the Key ID.

I define max_iden() to find the maximum iden integer currently being used. This function scans through all existing blog posts.

def max_iden():
    # Fetch the post with the highest iden (a single indexed query).
    max_entity = Post.gql("order by iden desc").get()
    if max_entity:
        return max_entity.iden
    return 1000    # If this is the very first entry, start at number 1000

Then, when creating a new blog post, I assign it an iden property of max_iden() + 1:

new_iden = max_iden() + 1
# Note: two concurrent requests can read the same maximum and collide.
p = Post(parent=blog_key(), header=header, body=body, iden=new_iden)
p.put()

I wonder if you might also want to add some sort of verification function after this, i.e. to ensure that max_iden() has now incremented, before moving on to the next invoice.
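
One speculative way to do that verification (hypothetical code, and still racy under the Datastore's eventually consistent queries):

def create_post_checked(header, body, retries=3):
    for _ in range(retries):
        new_iden = max_iden() + 1
        p = Post(parent=blog_key(), header=header, body=body, iden=new_iden)
        p.put()
        # Check that no concurrent writer produced the same iden.
        if Post.gql("where iden = :1", new_iden).count(2) == 1:
            return p
        p.delete()    # duplicate detected; back off and retry
    raise RuntimeError('could not allocate a unique iden')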

Altogether: fragile, inefficient code.

#8


0  

Alternatively, you could use allocate_ids(), as people have suggested, then create these entities up front (i.e. with placeholder property values).

from google.appengine.ext.ndb import Key   # assuming the ndb API here

first, last = MyModel.allocate_ids(1000000)
keys = [Key(MyModel, i) for i in range(first, last + 1)]

Then, when creating a new invoice, your code could run through these entries to find the one with the lowest ID whose placeholder properties have not yet been overwritten with real data.

I haven't put that into practice, but it seems like it should work in theory, most likely with the same limitations people have already mentioned.
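
A speculative sketch of that scan, assuming ndb and a hypothetical claimed flag standing in for "placeholder properties overwritten":

from google.appengine.ext import ndb

class MyModel(ndb.Model):
    claimed = ndb.BooleanProperty(default=False)   # placeholder flag

@ndb.transactional
def _claim(key):
    entity = key.get()
    if entity.claimed:
        return None            # lost the race; caller retries
    entity.claimed = True
    entity.put()
    return key.id()

def claim_lowest_id():
    while True:
        # Smallest-key entity whose placeholder is still untouched.
        key = (MyModel.query(MyModel.claimed == False)
               .order(MyModel._key).get(keys_only=True))
        if key is None:
            return None        # pre-allocated pool exhausted
        allocated = _claim(key)
        if allocated is not None:
            return allocated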

#9


0  

I'm thinking of using the following solution: use CloudSQL (MySQL) to insert the records and assign the sequential ID (maybe with a Task Queue), then later (using a cron task) move the records from CloudSQL back to the Datastore.

The entities can also have a UUID, so we can map the Datastore entities to their rows in CloudSQL, and still have the sequential ID (for legal reasons).
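
A minimal sketch of that idea, given a table like invoice(id INT AUTO_INCREMENT PRIMARY KEY, uuid CHAR(36) UNIQUE); the instance path and the database/table names are placeholders. Note that AUTO_INCREMENT guarantees uniqueness and monotonic growth, but a rolled-back insert can still burn a value, so strict gaplessness would need extra care:

import uuid
import MySQLdb   # App Engine's CloudSQL support exposes the MySQLdb API

def create_invoice():
    # '/cloudsql/PROJECT:INSTANCE' is a placeholder, not real configuration.
    conn = MySQLdb.connect(unix_socket='/cloudsql/PROJECT:INSTANCE',
                           db='billing')
    cur = conn.cursor()
    entity_uuid = str(uuid.uuid4())    # also stored on the Datastore entity
    cur.execute("INSERT INTO invoice (uuid) VALUES (%s)", (entity_uuid,))
    conn.commit()
    return cur.lastrowid, entity_uuid  # MySQL assigns the sequential ID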
