Line management in a text editor

Time: 2022-05-17 14:14:08

I've been working on a text editor for some time. I wrote a custom edit control from scratch, and I've got the basics down now. The problem I'm facing is line management. Since my program relies on dividing the input text into lines (the text is drawn line by line), line management is pretty important. I was using std::vector to store the line positions. I am using a Piece Table for my text processing, but for the sake of simplicity, let's say that I have an array of characters. I insert an element into the line vector every time the user presses Enter. The issue is that every time the user inserts a character, the line positions after it become stale. For example:

         0   1   2   3   4   5    6   7   8   9   10
text = ['h','e','l','l','o','\n','W','o','r','l','d']
state of line vector : 
line[0] = 0 
line[1] = 6

Let's say the user inserts a character ('x') after text[2]:

         0   1   2   3   4   5    6   7   8   9   10  11
text = ['h','e','l','x','l','o','\n','W','o','r','l','d'] 
state of line vector (not yet updated, so now stale):
line[0] = 0
line[1] = 6   (should be 7, since 'W' moved to text[7])

Because of the insertion, I need to update every element of the line vector after the current line. The same goes for deletion. If there are 1000 lines in a file and the user edits the first line, it would be quite inefficient to update the other 999 elements on every edit.
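
To make the cost concrete, here is a minimal sketch of the update described above, assuming the line vector holds the buffer offset at which each line starts. The name `shift_line_starts` is made up for illustration; it is not part of any library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: line_starts[i] is the buffer offset where line i begins
// (sorted ascending). After `count` non-newline characters are inserted at
// offset `pos`, every line that starts after `pos` moves right by `count`.
void shift_line_starts(std::vector<std::size_t>& line_starts,
                       std::size_t pos, std::size_t count) {
    assert(std::is_sorted(line_starts.begin(), line_starts.end()));
    // Binary search finds the first affected line quickly...
    auto first = std::upper_bound(line_starts.begin(), line_starts.end(), pos);
    // ...but every later entry still has to be touched: O(lines after pos).
    for (auto it = first; it != line_starts.end(); ++it) {
        *it += count;
    }
}
```

With the example above, calling `shift_line_starts(lines, 3, 1)` on `{0, 6}` leaves line[0] at 0 and moves line[1] to 7. Finding the first affected entry is cheap; the loop over the rest is the part that grows with the number of lines.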

What I was thinking of was to keep each line independent of the others. But that would lead to complications when an existing line is split into two lines. So I'd like to know: what's a good way to approach this problem?

Edit: Just to clarify, I am using a data structure called a Piece Table, not an array of characters. The piece table data structure is described here: http://www.cs.unm.edu/~crowley/papers/sds.pdf

3 Solutions

#1


3  

The classic data structure used by many editors is the "Gap Buffer". It keeps a block of free space (the gap) at the cursor, where the activity happens, so that local insertions and deletions are fast. When the cursor moves and a change then happens, the gap moves with it.
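
A minimal sketch of the idea, assuming a single contiguous `std::vector<char>` with the gap tracked by two indices. The class and member names are invented for illustration, and a real editor would move the gap with block copies rather than one character at a time.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Illustrative gap buffer: text lives in buf_[0, gap_start_) and
// buf_[gap_end_, buf_.size()); the range in between is free space.
class GapBuffer {
public:
    explicit GapBuffer(std::size_t capacity = 16)
        : buf_(capacity), gap_start_(0), gap_end_(capacity) {}

    std::size_t size() const { return buf_.size() - (gap_end_ - gap_start_); }

    // Insert c so that it ends up at logical position pos.
    void insert(std::size_t pos, char c) {
        if (gap_start_ == gap_end_) grow();
        move_gap(pos);
        buf_[gap_start_++] = c;
    }

    // Remove the character at logical position pos.
    void erase(std::size_t pos) {
        assert(pos < size());
        move_gap(pos);
        ++gap_end_;  // the character right after the gap is swallowed by it
    }

    std::string str() const {
        return std::string(buf_.begin(), buf_.begin() + gap_start_) +
               std::string(buf_.begin() + gap_end_, buf_.end());
    }

private:
    // Slide the gap so it starts at pos. Cost is proportional to the distance
    // moved, so repeated edits at the same spot are cheap.
    void move_gap(std::size_t pos) {
        assert(pos <= size());
        while (gap_start_ > pos) buf_[--gap_end_] = buf_[--gap_start_];
        while (gap_start_ < pos) buf_[gap_start_++] = buf_[gap_end_++];
    }

    // Double the storage, keeping the text on both sides of the gap.
    void grow() {
        const std::size_t old_cap = buf_.size();
        const std::size_t tail = old_cap - gap_end_;
        const std::size_t new_cap = old_cap == 0 ? 16 : old_cap * 2;
        std::vector<char> bigger(new_cap);
        std::copy(buf_.begin(), buf_.begin() + gap_start_, bigger.begin());
        std::copy(buf_.begin() + gap_end_, buf_.end(), bigger.end() - tail);
        gap_end_ = new_cap - tail;
        buf_ = std::move(bigger);
    }

    std::vector<char> buf_;
    std::size_t gap_start_;
    std::size_t gap_end_;
};
```

Typing at the cursor only writes into the gap; the copying cost is paid when the edit position jumps somewhere else.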

As for line calculations, modern systems are fast enough that you can pretty much just scan the buffer and look for line breaks. The nice thing is that most operations don't need this, so you don't have to do it all the time. Also, there's a difference between physical lines in the buffer (i.e. runs of characters ending with an EOL marker) and soft lines (à la word wrap, etc.). Consider a modern word processor, where a paragraph is routinely a single "line" that wraps at the page margins. Of course, you can handle this either way.
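
The brute-force scan can be sketched in a few lines. This assumes the text is available as one contiguous string and that '\n' is the only EOL marker; with a piece table or gap buffer you would iterate over the pieces instead. `find_line_starts` is a made-up name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Rebuild the whole line index with one linear pass over the text.
// Returns the offset at which each physical line starts.
std::vector<std::size_t> find_line_starts(const std::string& text) {
    std::vector<std::size_t> starts;
    starts.push_back(0);  // line 0 always starts at offset 0
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == '\n') {
            starts.push_back(i + 1);  // next line begins right after the EOL
        }
    }
    return starts;
}
```

For "hello\nWorld" this yields the starts 0 and 6, matching the line vector in the question.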

Finally, for most keyboard operations you can simply use relative positions (i.e. if you insert a new line, it's straightforward to add a new line marker to a line array, since you already know where you are in the buffer). But when you do, say, a large paste of several lines, it's likely faster to just cram it all in and recalculate the line array for the entire buffer (as an alternative, you could always break the paste up into lines and insert them one by one behind the scenes, just like normally typed lines).
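
For the single-keystroke case, a sketch of the incremental update might look like this. It assumes the line array stores sorted line-start offsets and that a '\n' has just been inserted at buffer offset `pos`; the function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Call after a '\n' has been inserted at buffer offset pos.
// Lines starting after pos shift right by one, and a new line
// starting at pos + 1 is added to the array.
void on_newline_inserted(std::vector<std::size_t>& line_starts, std::size_t pos) {
    assert(!line_starts.empty() && line_starts.front() == 0);
    // First line that starts strictly after the insertion point.
    auto it = std::upper_bound(line_starts.begin(), line_starts.end(), pos);
    for (auto j = it; j != line_starts.end(); ++j) {
        ++*j;  // everything after the new character moved by one
    }
    line_starts.insert(it, pos + 1);  // the text after the '\n' is a new line
}
```

For a multi-line paste, calling this once per newline repeats the shifting loop each time, which is why a single full rescan can come out ahead.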

For really huge buffers, or really slow computers, you may want to consider not worrying so much about keeping the global state (exactly how many lines are in the buffer, exactly what line you are on, etc.) exact at every moment, and instead kick that kind of recalculation off into the background. Most likely the pause will be minor (but annoying if you're typing), and the recalculation will catch up as soon as the human pauses to collect their thoughts. Clearly this complicates the design, and you'll likely be OK using brute force on modern hardware for the time being.

#2


1  

A vector will work fine.

Consider allocating each line dynamically and having the vector store a pointer to it. Moving a bunch of pointers to lines is much cheaper than moving the lines themselves.
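
One way this could look, assuming each line is stored as its own `std::string` owned through a `std::unique_ptr` (the `Lines` alias and `split_line` are names made up for this sketch). It also shows the line split the question was worried about.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Each line is a separately allocated string; the vector holds only pointers.
using Lines = std::vector<std::unique_ptr<std::string>>;

// Split line `line` at column `col` (what pressing Enter mid-line does).
// Only pointers after `line` are shifted; no other line's text is copied.
void split_line(Lines& lines, std::size_t line, std::size_t col) {
    assert(line < lines.size());
    assert(col <= lines[line]->size());
    auto tail = std::make_unique<std::string>(lines[line]->substr(col));
    lines[line]->erase(col);  // keep only the part before the cursor
    lines.insert(lines.begin() + line + 1, std::move(tail));
}
```

Typing inside a line then touches only that one string. Note that `std::string` already keeps long text on the heap, so a plain `std::vector<std::string>` gets most of the same benefit when elements are moved.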

You also might want to consider some sort of Gap Buffer technique.

#3


0  

If I understand the question, you're keeping track of the positions of the lines with an auxiliary data structure along these lines:

line  offset  length
   0       0      65
   1      65      30
   2      95      50
   3     145       1
   4     146      13
 ...

If the length of line n changes by d, then you have to update the offsets of all the following lines by d. That's slow when there are a lot of lines.

You could keep track of landmarks. Instead of measuring offsets from the beginning of the sequence, you make them relative to some landmark.

Suppose you create a landmark every 100 lines. The first hundred lines are tracked just as before, since the first landmark is at the beginning of the file. But the next hundred lines store only offsets relative to the second landmark, and that landmark stores the absolute offset of line 100 from the beginning of the file.

So when you change the length of a line, you only need to update the offsets of the remaining lines under that landmark, plus the offsets of the landmarks that follow. That's still O(n), but there's a pretty big divisor, which makes it faster.
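
Here is a rough sketch of the landmark scheme, assuming a fixed number of lines per landmark and an index built from a list of line lengths. `Block`, `LandmarkIndex` and `grow_line` are illustrative names, and inserting or removing whole lines is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One landmark: an absolute base offset plus line starts relative to it.
struct Block {
    std::size_t base;
    std::vector<std::size_t> rel;
};

struct LandmarkIndex {
    std::vector<Block> blocks;
    std::size_t lines_per_block;

    LandmarkIndex(const std::vector<std::size_t>& lengths, std::size_t per_block)
        : lines_per_block(per_block) {
        assert(per_block > 0);
        std::size_t abs = 0;  // absolute offset of the current line
        for (std::size_t i = 0; i < lengths.size(); ++i) {
            if (i % per_block == 0) {
                Block b;
                b.base = abs;
                blocks.push_back(b);
            }
            blocks.back().rel.push_back(abs - blocks.back().base);
            abs += lengths[i];
        }
    }

    // Absolute offset of a line: landmark base + relative offset.
    std::size_t offset(std::size_t line) const {
        const Block& b = blocks[line / lines_per_block];
        return b.base + b.rel[line % lines_per_block];
    }

    // Line `line` grew (or shrank, for negative delta) by delta characters.
    void grow_line(std::size_t line, std::ptrdiff_t delta) {
        const std::size_t bi = line / lines_per_block;
        assert(bi < blocks.size());
        // Unsigned arithmetic wraps, so adding a converted negative delta
        // subtracts as intended.
        const std::size_t d = static_cast<std::size_t>(delta);
        Block& b = blocks[bi];
        for (std::size_t i = line % lines_per_block + 1; i < b.rel.size(); ++i) {
            b.rel[i] += d;       // later lines under the same landmark
        }
        for (std::size_t j = bi + 1; j < blocks.size(); ++j) {
            blocks[j].base += d; // later landmarks; their lines are untouched
        }
    }
};
```

With k lines per landmark, an edit touches at most k relative offsets plus roughly n/k landmark bases instead of n line entries.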

But we can do better. Instead of just maintaining a list of landmarks, suppose we put them in a tree, where the leaves of the tree are your lines and the root represents the entire file. To find the offset of a given line, you add the offsets of all its ancestors together. And if a line changes, you simply update one node and its ancestors. This gives O(log n), at the cost of some bookkeeping. The space overhead is not significantly worse than the doubly-linked list you're already using.
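
As one possible concrete form of this, here is a sketch using a Fenwick tree (binary indexed tree) over line lengths. This is a swapped-in structure, not necessarily the tree described above: it gives O(log n) offset lookup and O(log n) length updates, but it does not support inserting or deleting lines, for which a balanced tree storing subtree sums would be needed. `LineOffsets` is a made-up name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fenwick tree over line lengths. tree_[i] (1-based) holds the sum of a
// power-of-two sized range of lengths ending at line i - 1.
class LineOffsets {
public:
    explicit LineOffsets(const std::vector<std::size_t>& lengths)
        : tree_(lengths.size() + 1, 0) {
        for (std::size_t i = 0; i < lengths.size(); ++i) {
            add(i, static_cast<std::ptrdiff_t>(lengths[i]));
        }
    }

    // Change the length of `line` by delta (may be negative). O(log n).
    void add(std::size_t line, std::ptrdiff_t delta) {
        assert(line + 1 < tree_.size());
        // i & (~i + 1) isolates the lowest set bit of i.
        for (std::size_t i = line + 1; i < tree_.size(); i += i & (~i + 1)) {
            tree_[i] += delta;
        }
    }

    // Offset of `line` = total length of all lines before it. O(log n).
    // offset(number_of_lines) gives the total length of the text.
    std::size_t offset(std::size_t line) const {
        assert(line < tree_.size());
        std::ptrdiff_t sum = 0;
        for (std::size_t i = line; i > 0; i -= i & (~i + 1)) {
            sum += tree_[i];
        }
        return static_cast<std::size_t>(sum);
    }

private:
    std::vector<std::ptrdiff_t> tree_;
};
```

The trade-off against the flat vector is that reading an offset is no longer a single array access; both reads and updates walk about log2(n) nodes.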
