How can I use Excel to search for a large number of short text strings within larger strings?

Date: 2023-01-25 07:07:47

I am working with a dataset containing 763508 text strings about 8 characters in length and want to locate these strings in a set containing 277 strings each 500 characters long. It is important that my return values indicate which of the short strings occur, how many times they occur, and where they occur in the 500 character strings. I understand that this is a fairly complex task so any pointers in the right direction are greatly appreciated!


Just to add a bit of context to this question I am working with expression data and am looking at TF binding sites present in a set of differentially expressed genes. Although it would theoretically be easier to just do MEME analysis on MEME-suite, MEME data is challenging to export, format, and analyze in a way that is useful to me. Thanks for any help!

1 Answer

#1



Pretty basic and may be slowish...

Sub Tester()
    Dim needles, haystacks, h, n, i As Long, j As Long, p As Long
    Dim rDest As Range

    'short sequences in sheet 1 ColA (no gaps)
    needles = Sheets(1).Range("A1").CurrentRegion.Columns(1).Value

    'longer sequences in sheet 2 ColA (no gaps)
    haystacks = Sheets(2).Range("A1").CurrentRegion.Columns(1).Value

    'start recording hits here
    Set rDest = Sheets(3).Cells(Rows.Count, 1).End(xlUp).Offset(1, 0)

    For i = 1 To UBound(haystacks, 1)
        h = haystacks(i, 1)
        For j = 1 To UBound(needles, 1)
            n = needles(j, 1)
            p = InStr(1, h, n)
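            'loop while there is a hit; each hit is written to sheet 3 as
            '(haystack row number, needle text, 1-based position in the haystack)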
            Do While p > 0
                rDest.Resize(1, 3).Value = Array(i, n, p)
                Set rDest = rDest.Offset(1, 0)
                p = InStr(p + 1, h, n)
            Loop
        Next
    Next i

End Sub

If you expect a lot of hits then writing them line-by-line may slow you down and that part could be optimized to be faster.
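
One way that optimization might look, as a rough sketch rather than a tested solution: collect hits in an in-memory array and write them to the sheet in blocks instead of one row at a time. The names `TesterBuffered`, `CHUNK`, `buf` and `k` are made up for this illustration, and the block size of 10000 is an arbitrary assumption. It uses the same sheet layout as `Tester` above and, like the original, does not guard against running past the last row of the worksheet.

```vba
Sub TesterBuffered()
    'Hypothetical variant of Tester: same search, but hits are buffered
    'in an array and written to the sheet in blocks of CHUNK rows.
    Const CHUNK As Long = 10000   'assumed block size, tune as needed
    Dim needles, haystacks, h, n
    Dim i As Long, j As Long, p As Long, k As Long
    Dim buf() As Variant
    Dim rDest As Range

    'same inputs and output location as in Tester
    needles = Sheets(1).Range("A1").CurrentRegion.Columns(1).Value
    haystacks = Sheets(2).Range("A1").CurrentRegion.Columns(1).Value
    Set rDest = Sheets(3).Cells(Rows.Count, 1).End(xlUp).Offset(1, 0)

    ReDim buf(1 To CHUNK, 1 To 3)
    k = 0   'number of hits currently held in buf

    For i = 1 To UBound(haystacks, 1)
        h = haystacks(i, 1)
        For j = 1 To UBound(needles, 1)
            n = needles(j, 1)
            p = InStr(1, h, n)
            Do While p > 0
                k = k + 1
                buf(k, 1) = i   'haystack row
                buf(k, 2) = n   'needle text
                buf(k, 3) = p   'position of the hit
                If k = CHUNK Then
                    'buffer is full: write the whole block in one assignment
                    rDest.Resize(CHUNK, 3).Value = buf
                    Set rDest = rDest.Offset(CHUNK, 0)
                    ReDim buf(1 To CHUNK, 1 To 3)
                    k = 0
                End If
                p = InStr(p + 1, h, n)
            Loop
        Next j
    Next i

    'write whatever is left; only the first k rows of buf fit the range
    If k > 0 Then rDest.Resize(k, 3).Value = buf
End Sub
```

The search loop is unchanged, so any speed-up comes only from making far fewer writes to the worksheet; with few hits the difference is probably negligible.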
