Алгоритм поиска Кнута Морриса Пратта для поиска и замены в потоке

Я хочу реализовать алгоритм поиска и замены в потоке, где я бы создал класс потока, который будет принимать исходный поток в качестве входных данных, а метод чтения будет читать из исходного потока по мере необходимости и выполнять поиск/замену как поток читается. Это предотвратит одновременную загрузку всего набора данных в память и позволит выполнять поиск и замену в очень больших наборах данных.

Чтобы сделать это, я подумал, что начну с существующего алгоритма поиска, адаптирую его к потоковому подходу, а затем адаптирую функцию замены.

Ниже приведена реализация Кнута Морриса Пратта, которую я адаптировал из онлайн-примера. Кто-нибудь адаптировал что-то подобное к потоковому подходу? Мне нужно будет учитывать поиск через границы чтения, что я пока не знаю, как бы я это сделал.

/// <summary>
/// The KMP matching algorithm uses degenerating property (SearchPattern having same sub-SearchPatterns appearing more than once in the SearchPattern) 
/// of the SearchPattern and improves the worst case complexity to O(n). The basic idea behind KMP’s algorithm is: whenever we 
/// detect a mismatch (after some matches), we already know some of the characters in the text of the next window. We take 
/// advantage of this information to avoid matching the characters that we know will anyway match. 
/// </summary>
/// <seealso cref="https://www.geeksforgeeks.org/kmp-algorithm-for-pattern-searching/"/>
public class KMPSearch
{
    /// <summary>
    /// Pattern we are looking for
    /// </summary>
    readonly string sSearchPattern_m;

    /// <summary>
    /// Text we are looking in
    /// </summary>
    readonly string sData_m;

    /// <summary>
    /// Index for text
    /// </summary>
    private int iDataPosition_m;
    private int iDataLength_m;

    /// <summary>
    /// Index for search pattern
    /// </summary>
    private int iSearchPatternPosition_m;

    /// <summary>
    /// A proper prefix is prefix with whole string not allowed. For example, prefixes of “ABC” are “”, “A”, “AB” and “ABC”. 
    /// Proper prefixes are “”, “A” and “AB”. Suffixes of the string are “”, “C”, “BC” and “ABC”.
    /// </summary>
    readonly int[] lstLongestProperPrefix_m;

    public KMPSearch(string sSearchPattern, string sData)
    {
        this.sSearchPattern_m = sSearchPattern;
        this.sData_m = sData;
        this.iDataLength_m = sData.Length;

        // create lps that will hold the longest prefix suffix values for SearchPattern             
        this.lstLongestProperPrefix_m = new int[sSearchPattern.Length];

        // Pre-process the SearchPattern (calculate lps array) 
        this.ComputeLPSArray();

        this.iDataPosition_m = 0; // index for txt 
        this.iSearchPatternPosition_m = 0; // index for pat
    }

    /// <summary>
    /// Find next match
    /// </summary>
    /// <returns></returns>
    public int Next()
    {
        int iMatchIndex = -1;

        //We start comparison of pat[iSearchPatternPosition_m] with iSearchPatternPosition_m = 0 with characters of current window of text.
        //We keep matching characters txt[iDataPosition_m] and pat[iSearchPatternPosition_m] and keep incrementing iDataPosition_m and iSearchPatternPosition_m while 
        //pat[iSearchPatternPosition_m] and txt[iDataPosition_m] keep matching.
        while (iDataPosition_m < this.iDataLength_m)
        {
            if (this.sSearchPattern_m[iSearchPatternPosition_m] == this.sData_m[iDataPosition_m])
            {
                iSearchPatternPosition_m++;
                iDataPosition_m++;
            }

            if (iSearchPatternPosition_m == sSearchPattern_m.Length)
            {
                //Console.WriteLine("Found SearchPattern at index %d ", iDataPosition_m - iSearchPatternPosition_m);
                iMatchIndex = iDataPosition_m - iSearchPatternPosition_m;

                iSearchPatternPosition_m = this.lstLongestProperPrefix_m[iSearchPatternPosition_m - 1];
            }

            // mismatch after j matches 
            else if (iDataPosition_m < this.iDataLength_m && this.sSearchPattern_m[iSearchPatternPosition_m] != this.sData_m[iDataPosition_m])
            {
                //When we see a mismatch
                //* We know that characters pat[0..iSearchPatternPosition_m - 1] match with txt[iDataPosition_m-iSearchPatternPosition_m..iDataPosition_m - 1] 
                //  (Note that iSearchPatternPosition_m starts with 0 and increment it only when there is a match).
                //* We also know (from above definition) that lps[iSearchPatternPosition_m - 1] is count of characters of pat[0…iSearchPatternPosition_m - 1] 
                //  that are both proper prefix and suffix.
                //* From above two points, we can conclude that we do not need to match these lps[iSearchPatternPosition_m - 1] characters with 
                //  txt[iDataPosition_m  -iSearchPatternPosition_m..iDataPosition_m - 1] because we know that 
                //  these characters will anyway match. Let us consider above example to understand this.

                // Do not match lps[0..lps[iSearchPatternPosition_m - 1]] characters, 
                // they will match anyway 
                if (iSearchPatternPosition_m != 0)
                {
                    iSearchPatternPosition_m = this.lstLongestProperPrefix_m[iSearchPatternPosition_m - 1];
                }
                else
                {
                    iDataPosition_m = iDataPosition_m + 1;
                }
            }

            if (iMatchIndex > -1)
            {
                return iMatchIndex;
            }
        }
        return iMatchIndex;
    }

    /// <summary>
    /// A proper prefix is prefix with whole string not allowed. For example, prefixes of “ABC” are “”, “A”, “AB” and “ABC”. 
    /// Proper prefixes are “”, “A” and “AB”. Suffixes of the string are “”, “C”, “BC” and “ABC”.
    /// Fills lps for given pattern pat[0..M-1] 
    /// lps[i] = the longest proper prefix of pat[0..i] which is also a suffix of pat[0..i]. 
    /// </summary>
    private void ComputeLPSArray()
    {
        // length of the previous longest prefix suffix 
        int len = 0;

        this.lstLongestProperPrefix_m[0] = 0; // lps[0] is always 0 

        // the loop calculates lps[i] for i = 1 to M-1 
        int i = 1;
        while (i < this.sSearchPattern_m.Length)
        {
            if (this.sSearchPattern_m[i] == this.sSearchPattern_m[len])
            {
                len++;
                this.lstLongestProperPrefix_m[i] = len;
                i++;
            }
            else // (pat[i] != pat[len]) 
            {
                // This is tricky. Consider the example. 
                // AAACAAAA and i = 7. The idea is similar 
                // to search step. 
                if (len != 0)
                {
                    len = this.lstLongestProperPrefix_m[len - 1];

                    // Also, note that we do not increment 
                    // i here 
                }
                else // if (len == 0) 
                {
                    this.lstLongestProperPrefix_m[i] = 0;
                    i++;
                }
            }
        }
    }
}

Затем я взял описанный выше алгоритм и модифицировал его в Stream. В качестве доказательства концепции поток будет вызывать событие во время своего метода чтения каждый раз, когда находит шаблон поиска. В настоящее время у него есть ограничение, заключающееся в невозможности поиска за границами чтения. Таким образом, если за раз считывается 1024 байта, а длина исходного потока составляет 2048 байт, для чтения всего потока выполняется два чтения. Проблема в том, что если шаблон поиска начинается с индекса 1000 и имеет длину 40 байт, он не будет найден. Я думаю, что как только эта проблема будет решена, реальная функциональность замены не будет такой сложной. Я ищу предложения о том, как реализовать поиск через границы чтения. Вероятно, это связано с кэшированием части предыдущего чтения. Кто-нибудь знает о потоковой реализации, подобной этой, или предложениях?

public class KMPSearchStream : Stream
{
    public override bool CanRead { get { return true; } }

    public override bool CanSeek => throw new NotImplementedException();

    public override bool CanWrite => throw new NotSupportedException();

    public override long Length => throw new NotSupportedException();

    public override long Position { get => throw new NotImplementedException(); set => throw new NotImplementedException(); }

    public class PatternFoundEventArgs
    {
        public int Index { get; internal set; }
    }

    public delegate void PatternFoundEvent(PatternFoundEventArgs e);

    private Stream strSource_m;

    /// <summary>
    /// Pattern we are looking for
    /// </summary>
    private byte[] searchPattern_m;

    /// <summary>
    /// Text we are looking in
    /// </summary>
    private byte[] data_m;

    /// <summary>
    /// Index for text
    /// </summary>
    private int iDataPosition_m;
    private int iDataLength_m;

    /// <summary>
    /// Index for search pattern
    /// </summary>
    private int iSearchPatternPosition_m;

    /// <summary>
    /// A proper prefix is prefix with whole string not allowed. For example, prefixes of “ABC” are “”, “A”, “AB” and “ABC”. 
    /// Proper prefixes are “”, “A” and “AB”. Suffixes of the string are “”, “C”, “BC” and “ABC”.
    /// </summary>
    readonly int[] lstLongestProperPrefix_m;

    public KMPSearchStream(Stream strSource, byte[] searchPattern)
    {
        if (strSource == null)
        {
            throw new ArgumentNullException(nameof(strSource), "Source stream is null.");
        }
        if (searchPattern == null || searchPattern.Length == 0)
        {
            throw new ArgumentNullException(nameof(searchPattern), "Pattern to find is null or empty.");
        }

        this.strSource_m = strSource;
        this.searchPattern_m = searchPattern;

        // create lps that will hold the longest prefix suffix values for SearchPattern             
        this.lstLongestProperPrefix_m = new int[searchPattern.Length];

        // Pre-process the SearchPattern (calculate lps array) 
        this.ComputeLPSArray();

        this.iDataPosition_m = 0; // index for txt 
        this.iSearchPatternPosition_m = 0; // index for pat
    }

    public event PatternFoundEvent OnPatternFound;

    public override void Flush()
    {
        throw new NotSupportedException();
    }

    public override long Seek(long offset, SeekOrigin origin)
    {
        throw new NotSupportedException();
    }

    public override void SetLength(long value)
    {
        throw new NotSupportedException();
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        throw new NotSupportedException();
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        int iRead = this.strSource_m.Read(buffer, offset, count);

        this.iDataPosition_m = 0; // index for txt 
        this.iSearchPatternPosition_m = 0; // index for pat

        this.data_m = buffer;
        this.iDataPosition_m = offset;
        this.iDataLength_m = iRead;

        int iIndex;

        while ((iIndex = this.Next()) > -1)
        {
            this.OnPatternFound(new PatternFoundEventArgs()
            {
                Index = iIndex
            });
        }
        return iRead;
    }

    /// <summary>
    /// A proper prefix is prefix with whole string not allowed. For example, prefixes of “ABC” are “”, “A”, “AB” and “ABC”. 
    /// Proper prefixes are “”, “A” and “AB”. Suffixes of the string are “”, “C”, “BC” and “ABC”.
    /// Fills lps for given pattern pat[0..M-1] 
    /// lps[i] = the longest proper prefix of pat[0..i] which is also a suffix of pat[0..i]. 
    /// </summary>
    private void ComputeLPSArray()
    {
        // length of the previous longest prefix suffix 
        int len = 0;

        this.lstLongestProperPrefix_m[0] = 0; // lps[0] is always 0 

        // the loop calculates lps[i] for i = 1 to M-1 
        int i = 1;
        while (i < this.searchPattern_m.Length)
        {
            if (this.searchPattern_m[i] == this.searchPattern_m[len])
            {
                len++;
                this.lstLongestProperPrefix_m[i] = len;
                i++;
            }
            else // (pat[i] != pat[len]) 
            {
                // This is tricky. Consider the example. 
                // AAACAAAA and i = 7. The idea is similar 
                // to search step. 
                if (len != 0)
                {
                    len = this.lstLongestProperPrefix_m[len - 1];

                    // Also, note that we do not increment 
                    // i here 
                }
                else // if (len == 0) 
                {
                    this.lstLongestProperPrefix_m[i] = 0;
                    i++;
                }
            }
        }
    }

    /// <summary>
    /// Find next match
    /// </summary>
    /// <returns></returns>
    public int Next()
    {
        int iMatchIndex = -1;

        //We start comparison of pat[iSearchPatternPosition_m] with iSearchPatternPosition_m = 0 with characters of current window of text.
        //We keep matching characters txt[iDataPosition_m] and pat[iSearchPatternPosition_m] and keep incrementing iDataPosition_m and iSearchPatternPosition_m while 
        //pat[iSearchPatternPosition_m] and txt[iDataPosition_m] keep matching.
        while (iDataPosition_m < this.iDataLength_m)
        {
            if (this.searchPattern_m[iSearchPatternPosition_m] == this.data_m[iDataPosition_m])
            {
                iSearchPatternPosition_m++;
                iDataPosition_m++;
            }

            if (iSearchPatternPosition_m == searchPattern_m.Length)
            {
                //Console.WriteLine("Found SearchPattern at index %d ", iDataPosition_m - iSearchPatternPosition_m);
                iMatchIndex = iDataPosition_m - iSearchPatternPosition_m;

                iSearchPatternPosition_m = this.lstLongestProperPrefix_m[iSearchPatternPosition_m - 1];
            }

            // mismatch after j matches 
            else if (iDataPosition_m < this.iDataLength_m && this.searchPattern_m[iSearchPatternPosition_m] != this.data_m[iDataPosition_m])
            {
                //When we see a mismatch
                //* We know that characters pat[0..iSearchPatternPosition_m - 1] match with txt[iDataPosition_m-iSearchPatternPosition_m..iDataPosition_m - 1] 
                //  (Note that iSearchPatternPosition_m starts with 0 and increment it only when there is a match).
                //* We also know (from above definition) that lps[iSearchPatternPosition_m - 1] is count of characters of pat[0…iSearchPatternPosition_m - 1] 
                //  that are both proper prefix and suffix.
                //* From above two points, we can conclude that we do not need to match these lps[iSearchPatternPosition_m - 1] characters with 
                //  txt[iDataPosition_m  -iSearchPatternPosition_m..iDataPosition_m - 1] because we know that 
                //  these characters will anyway match. Let us consider above example to understand this.

                // Do not match lps[0..lps[iSearchPatternPosition_m - 1]] characters, 
                // they will match anyway 
                if (iSearchPatternPosition_m != 0)
                {
                    iSearchPatternPosition_m = this.lstLongestProperPrefix_m[iSearchPatternPosition_m - 1];
                }
                else
                {
                    iDataPosition_m = iDataPosition_m + 1;
                }
            }

            if (iMatchIndex > -1)
            {
                return iMatchIndex;
            }
        }
        return iMatchIndex;
    }

}

23.07.2019

Ответы:

Если я вас правильно понял, цель состоит в том, чтобы сканировать текущий поток и заменить все вхождения маски заменяющим значением, прежде чем перенаправить его в выходной поток.

Если да, то к системе есть несколько требований:

Пересылаться могут только те данные, для которых можно доказать, что они не являются частью полной маски.
Данные должны быть перенаправлены как можно скорее, чтобы свести перегрузку к минимуму.

Абстракция
По сути, вы буферизуете все данные перед их пересылкой.
После каждого символа вы проверяете, может ли буфер быть частью маски.
Если да, то не быть частью маски, вы освобождаете (записываете в вывод) один символ за другим из буфера, пока либо буфер не станет пустым, либо буфер снова не будет соответствовать маске.

Код:
Я создал вспомогательный класс, который по существу делает это.
Я не знаю, реализует ли он ваш алгоритм, это просто то, что я придумал.
Согласно моему тестов, это работает, но я ничего не могу гарантировать.
Вы можете использовать его в своем классе потока и просто перенаправлять вызовы и получать обратные вызовы.
Пересылать входные данные через Write([...]) методов, получить вывод через обратный вызов вывода и определить маску и строку замены. Он поддерживает очистку и сброс.

using System;
using System.Collections.Generic;

/// <summary>
/// Class that handles replacement of string occurrences that match a mask in a string stream. 
/// </summary>
public class StreamReplacement
{
    private Action<char> output;

    private string mask;

    private string replacement;

    // Buffer for unverified characters
    private readonly List<char> buffer;

    // Current index on the mask for the next character added
    private int maskPosition;

    /// <summary>
    /// All verified characters will be passed to the output
    /// </summary>
    /// <exception cref="System.ArgumentNullException">The output cannot be null.</exception>
    public Action<char> Output
    {
        get => output;
        set => output = value ?? throw new ArgumentNullException();
    }

    /// <summary>
    /// Gets or sets the mask to test for.
    /// </summary>
    /// <exception cref="System.ArgumentException">The mask must contain at least one character.</exception>
    public string Mask
    {
        get => this.mask;
        set
        {
            if (string.IsNullOrEmpty(mask))
            {
                throw new ArgumentException("The mask must contain at least one character.");
            }

            this.mask = value;

            // Clean buffer and throw out characters that can't be part of the mask anymore.
            WriteVerifiedCharactersFromBuffer();
        }
    }

    /// <summary>
    /// Gets or sets the replacement string to output when the mask has been encountered.
    /// </summary>
    public string Replacement
    {
        get => replacement;
        set => replacement = value ?? string.Empty;
    }

    /// <summary>
    /// Initializes a new instance of the <see cref="StreamReplacement"/> class.
    /// </summary>
    /// <param name="output">The output.</param>
    /// <param name="mask">The mask.</param>
    /// <param name="replacement">The replacement string for mask encounters.</param>
    /// <exception cref="System.ArgumentNullException">Output cannot be null.</exception>
    /// <exception cref="System.ArgumentException">The mask must contain at least one character.</exception>
    public StreamReplacement(Action<char> output, string mask, string replacement)
    {
        this.output = output ?? throw new ArgumentNullException(nameof(output));
        this.mask = string.IsNullOrEmpty(mask) ? throw new ArgumentException("The mask must contain at least one character.") : mask;
        this.replacement = replacement ?? string.Empty;
        this.buffer = new List<char>(mask.Length);
    }

    /// <summary>
    /// Writes a single character to the stream.
    /// </summary>
    public void Write(char c)
    {
        WriteCharacter(c);
    }

    /// <summary>
    /// Writes the specified text to the stream.
    /// </summary>
    public void Write(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return;
        }

        foreach (var c in text)
        {
            WriteCharacter(c);
        }
    }

    /// <summary>
    /// Writes the specified characters to the stream.
    /// </summary>
    /// <param name="characters">The characters.</param>
    public void Write(params char[] characters)
    {
        if (characters == null)
        {
            return;
        }

        foreach (var c in characters)
        {
            WriteCharacter(c);
        }
    }

    /// <summary>
    /// Flushes all buffered characters, even if they could be part of the mask.<br/>
    /// Starts from scratch after flushing.
    /// </summary>
    public void Flush()
    {
        foreach (var c in buffer)
        {
            output(c);
        }

        buffer.Clear();
        maskPosition = 0;
    }

    /// <summary>
    /// Clears the buffer without writing any buffered data to the output stream.
    /// </summary>
    public void Reset()
    {
        buffer.Clear();
        maskPosition = 0;
    }

    /// <summary>
    /// Writes the character either to the buffer or output stream and handles masking.
    /// </summary>
    private void WriteCharacter(char c)
    {
        if (mask[maskPosition] == c)
        {
            AddUnverifiedCharacter(c);
        }
        else
        {
            buffer.Add(c);
            WriteVerifiedCharactersFromBuffer();
        }
    }

    /// <summary>
    /// Stores a character in the buffer that could be part of the mask.
    /// </summary>
    private void AddUnverifiedCharacter(char c)
    {
        buffer.Add(c);
        maskPosition++;

        if (maskPosition == mask.Length)
        {
            buffer.Clear();
            maskPosition = 0;

            foreach (var repChar in replacement)
            {
                output(repChar);
            }
        }
    }

    /// <summary>
    /// Writes characters that cannot be part of the mask to the output.
    /// </summary>
    private void WriteVerifiedCharactersFromBuffer()
    {
        // Find possible new start position in buffer
        var newStart = 0;

        for (; newStart < buffer.Count; newStart++)
        {
            if (IsBufferPartOfMask(newStart))
            {
                WritePartOfBuffer(newStart);
                buffer.RemoveRange(0, newStart);
                maskPosition = buffer.Count;
                return;
            }
        }

        WritePartOfBuffer(buffer.Count);
        buffer.Clear();
        maskPosition = 0;
    }

    /// <summary>
    /// Determines whether the buffer content can be part of the mask, starting at the given index.
    /// </summary>
    private bool IsBufferPartOfMask(int offset)
    {
        for (var i = 0; i < buffer.Count - offset; i++)
        {
            var c = buffer[offset + i];
            if (c != mask[i])
            {
                return false;
            }
        }

        return true;
    }

    /// <summary>
    /// Writes the beginning part of buffer to the output
    /// </summary>
    private void WritePartOfBuffer(int count)
    {
        for (var i = 0; i < count; i++)
        {
            output(buffer[i]);
        }
    }
}

23.07.2019

Новые материалы

Кластеризация: более глубокий взгляд

Кластеризация — это метод обучения без учителя, в котором мы пытаемся найти группы в наборе данных на основе некоторых известных или неизвестных свойств, которые могут существовать. Независимо от..

Как написать эффективное резюме

Предложения по дизайну и макету, чтобы представить себя профессионально Вам не позвонили на собеседование после того, как вы несколько раз подали заявку на работу своей мечты? У вас может..

Частный метод Python: улучшение инкапсуляции и безопасности

Введение Python — универсальный и мощный язык программирования, известный своей простотой и удобством использования. Одной из ключевых особенностей, отличающих Python от других языков, является..

Как я автоматизирую тестирование с помощью Jest

Шутка для победы, когда дело касается автоматизации тестирования Одной очень важной частью разработки программного обеспечения является автоматизация тестирования, поскольку она создает..

Работа с векторными символическими архитектурами, часть 4 (искусственный интеллект)

Hyperseed: неконтролируемое обучение с векторными символическими архитектурами (arXiv) Автор: Евгений Осипов , Сачин Кахавала , Диланта Хапутантри , Тимал Кемпития , Дасвин Де Сильва ,..

Понимание расстояния Вассерштейна: мощная метрика в машинном обучении

В обширной области машинного обучения часто возникает необходимость сравнивать и измерять различия между распределениями вероятностей. Традиционные метрики расстояния, такие как евклидово..

Обеспечение масштабируемости LLM: облачный анализ с помощью AWS Fargate и Copilot

В динамичной области искусственного интеллекта все большее распространение получают модели больших языков (LLM). Они жизненно важны для различных приложений, таких как интеллектуальные..

Machine Learning JavaScript Blockchain Artificial Intelligence Data Science Cryptocurrency Software Development Python Web Development Coding Deep Learning AI Bitcoin React Software Engineering Ethereum Web3 Business Crypto Nodejs Solidity Development Front End Development Data Finance Money Java Trading Typescript Smart Contracts Productivity Tech Startup Investing Neural Networks Developer Computer Science NLP