INTERNET-DRAFT
EXPIRATION 6 April 2018
Independent Submission A. Cheney
Category: Informational prettydiff.com
October 2017
Simple Diff Algorithm
draft-cheney-simplediff-00.txt
Abstract
This document describes a simplified algorithm for performing
computational comparisons. The design goal is to achieve an approach
that uses the smallest amount of logic. Necessary secondary goals
include an approach that is language agnostic, fast, extensible, and
more predictable than other similar algorithms.
Copyright Notice
Copyright (c) 2017 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document.
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF). Note that
other groups may also distribute working documents as
Internet-Drafts. The list of current Internet-Drafts is at
http://datatracker.ietf.org/drafts/current.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
Comments
Comments are solicited and should be addressed to
info@prettydiff.com.
Cheney Informational [Page 1]
RFC XXXX Simple Diff Algorithm October 2017
Table of Contents
1. Introduction ....................................................3
2. Understanding how to compare ....................................3
2.1. It is more about the equality than the differences .........3
2.2. Algorithm quality ..........................................4
2.3. Speed ......................................................4
2.4. Output format ..............................................6
3. How the algorithm works at a high level .........................7
3.1. Getting by with a hash map and some counts .................7
3.2. The third and final pass through the data ..................8
3.3. The "fix" function ........................................11
3.4. Done ......................................................16
4. Extensibility ..................................................23
5. Security considerations ........................................23
6. IANA considerations ............................................23
7. References .....................................................23
8. Author's address ...............................................23
1. Introduction
There are many comparison algorithms available, commonly referred to
as "diff" algorithms after the similarly named Unix application. Few
of them, however, achieve speed, precision, and simplicity in an
easily predictable manner.
The most common fault of many diff algorithms is that they solve for
the incorrect problem. Instead of accomplishing the goal, finding how
two samples differ, they try to solve a common computer science
problem called "Longest Common Subsequence (LCS)". The LCS problem
seeks to find the longest blocks of equality in the compared samples,
which sounds ideal. Unfortunately, the execution of this approach is
not predictable because it is bound to the commonality of the
compared samples. Furthermore, it is neither simple nor fast. Worse
still, this approach fails on precision in situations where
differences are interspersed across frequent repetition.
2. Understanding how to compare
2.1. It is more about the equality than the differences
A good diff algorithm will attempt to identify as much equality as
possible. Everything else qualifies as differences. The metric for
quality and precision is a smaller count of captured differences.
The smaller this number the better, presuming differences aren't
escaping undetected.
False negatives, allowing differences through without detection, are
the worst outcome and constitute absolute failure in a diff
algorithm. False positives, identifying more differences than
actually exist, are also bad, but a false positive is much better
than a false negative. This means it is safer to report more
differences than are actually present, which is why a higher number
of reported differences is less precise. Maximum precision is
reporting differences without any false negatives or false positives.
One way to achieve higher precision is to first capture as much
equality as possible. For everything else that is left over
prioritize unique items first and report these items as differences.
Finally determine the next bit of equality or uniqueness and
everything in between is either a change, an insertion, or a
deletion.
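The precision metric can be illustrated with two hand-constructed
outputs for the same comparison, using the output format defined
later in this document. The helper function below is illustrative
only and is not part of the Simple Diff algorithm.

```javascript
// Two hypothetical outputs describing the same comparison, in which
// only one line actually changed. Each entry is
// [type, startOne, endOne, startTwo, endTwo].
var preciseOutput = [
    ["equal", 0, 3, 0, 3],
    ["replace", 3, 4, 3, 4],
    ["equal", 4, 6, 4, 6]
];
var impreciseOutput = [
    ["replace", 0, 4, 0, 4],
    ["equal", 4, 6, 4, 6]
];
// Illustrative helper: counts first-sample lines covered by
// non-"equal" entries. (Insert entries, which carry -1 markers on
// the first-sample side, would need the second-sample span instead.)
var countChangedLines = function (codes) {
    return codes.reduce(function (sum, code) {
        return (code[0] === "equal")
            ? sum
            : sum + (code[2] - code[1]);
    }, 0);
};
// countChangedLines(preciseOutput) is 1
// countChangedLines(impreciseOutput) is 4
```

The precise output reports a single changed line, while the
imprecise output sweeps three unchanged lines into its reported
difference, even though neither output lets a difference escape.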
2.2. Algorithm quality
The primary priorities when writing this kind of code are execution
speed, precision (as previously described), and code simplicity. In
most cases precision is the most important factor in code design,
but in some cases speed is more important, such as when processing
large input or a batch of thousands of files. Simplicity is
necessary so that other people can understand the code and modify it
as necessary to improve upon the design or add features. No
algorithm is ever capable of meeting all needs for all use cases, so
it is important that other people can understand the code with
minimal effort.
2.3. Speed
Faster execution is the result of a couple of things. The most
important consideration for making an algorithm faster is to reduce
the number of passes through data where possible. After that
eliminate all costly operations. In most programming languages
simple arithmetic and static string comparisons are cheap to execute
particularly if the examined strings aren't changing. The
theoretical minimum number of data passes is two, as you have to read
the contents of each submitted sample. This proposed algorithm
achieves speed by taking only three complete passes through the data
and making all possible effort to never repeat a step or loop
iteration. This approach is linear and predictable: the number of
iterations passing through data is computed from the sum of:
* number of iterations from the first sample
* number of iterations from the second sample
* the number of iterations from the smallest of those samples.
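As a worked example of this sum, assuming hypothetical samples of 120
and 80 lines:

```javascript
// Iteration budget for hypothetical samples of 120 and 80 lines:
// one pass per sample, plus one pass bounded by the smaller sample.
var lena = 120;
var lenb = 80;
var passes = lena + lenb + Math.min(lena, lenb);
// passes is 280
```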
Performance of the mentioned approach is heavily based upon key
access in a hash map, which will likely vary by programming language.
Until a way to perform a comparison without reading from the samples
is discovered, two data passes can be safely assumed to be the
fastest possible approach. Between that minimum and the Simple Diff
approach there are two possibilities for increased performance. The
first possibility is to make decisions and write output immediately
upon reading from the second sample so as to have only two data
passes. The challenge with this approach is that analysis occurs in
the moment of the current loop iteration without knowledge of what
comes later in the second sample. A more practical second
performance possibility is to write a smaller hash map. Writing a
smaller hash map means moving some of the decision tree up front,
before a separate formal analysis phase. For this approach to be
viable the early step in logic must be tiny and non-replicating, and
a different means of iteration must be devised to account for a data
store of unpredictable size.
It is tempting to claim the proposed algorithm is faster than the
popular Myers O(ND) approach, but no evidence is offered either way
(conjecture is not evidence). In terms of experimental criteria the
algorithms themselves are largely irrelevant; what matters is the
implementation of those algorithms. If an implementation of the
Myers approach makes fewer decisions and has fewer total data
passes, as described in the previous paragraph, then it likely is
faster. Pronounced emphasis is applied to the word "total" because
this makes all the difference.
Many prior diff applications don't provide a complete third data
pass. Instead they provide the minimum two complete data passes over
the samples and various smaller passes as calculated by block moves
and edit distances. If these various fractional passes are non-
linear, which means starting and stopping in different places
respectively, their performance is less predictable. If these
fractional passes are non-linear and touch any given data index more
than once, they contain repetitive logic and are likely not as fast.
To affirmatively guarantee superior performance over this Simple
Diff approach there must be fewer passes over data, which means no
repetition and a smaller number of iterations. Predictability
ensures the performance of an application scales roughly
proportionately to the size of the provided samples, where an
unpredictable approach would scale disproportionately (slower over
time).
The approach taken here is fast. It cannot be argued scientifically
that it is the fastest (or slowest) approach for its level of
accuracy without also writing alternate algorithms into applications
with identical constraints. One could argue, however, that this
approach is the fastest comparative algorithm for its level of
predictability and precision.
2.4. Output format
This Simple Diff algorithm inherits its output format from the
application jsdifflib, https://github.com/cemerick/jsdifflib. The
output is a big array containing child arrays at each index. The
child arrays each contain 5 indexes in the format:
* type
* start in first sample
* end in first sample
* start in second sample
* end in second sample
There are 4 types defined in the output:
* "equal" - both array index items are identical at the indexes
compared
* "replace" - changes are present at indexes compared
* "insert" - the index of the second sample is unique to the
second sample
* "delete" - the index of the first sample is unique to the first
sample
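As a hypothetical illustration of this format, a line comparison of
the samples "a\nb\nc" and "a\nx\nc" could be expected to yield the
following (hand-constructed, not captured from an actual run):

```javascript
// Hand-constructed example output in the jsdifflib format.
// Each child array is [type, startOne, endOne, startTwo, endTwo];
// -1 marks a side that does not participate in the entry.
var output = [
    ["equal", 0, 1, 0, 1],   // line "a" is identical in both samples
    ["replace", 1, 2, 1, 2], // line "b" changed to line "x"
    ["equal", 2, 3, 2, 3]    // line "c" is identical in both samples
];
```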
3. How the algorithm works at a high level
3.1. Getting by with a hash map and some counts
This algorithm takes after the Paul Heckel algorithm. The first step
is to transform the string input into arrays. The most primitive way
to do this is to split the input by lines, so that each line of
input becomes an index in an array. Once the two input samples are
converted to arrays they will each have to be examined.
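A minimal line splitter for this step might look like the following
sketch, which normalizes Windows and legacy Mac line endings before
splitting (the assembled listing in section 3.4 takes the same
approach):

```javascript
// Normalize CRLF and lone CR to LF, then split into an array of
// lines so each line becomes one comparable index.
var lines = function (str) {
    return str
        .replace(/\r\n/g, "\n")
        .replace(/\r/g, "\n")
        .split("\n");
};
var one = lines("first\r\nsecond\rthird");
// one is ["first", "second", "third"]
```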
The first loop will walk through an array representing one of the
code samples and populate a hash map, which is named "table" in the
following code sample. The two arrays are simply named "one" and
"two". Each index of the array, by default a line of code, serves as
a key in the hash map named "table". If this key does not exist then
it is added. The value for each key is an object with two properties
named "one" and "two". These properties are simply counts indicating
the number of times the key is encountered in the two arrays. If the
key is already present then the value of property "one" or "two" is
simply increased, according to the current array.
Here is an example of this step in code:
do {
if (table[two[b]] === undefined) {
table[two[b]] = {
one: 0,
two: 1
};
} else {
table[two[b]].two += 1;
}
b += 1;
} while (b < lenb);
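Run against a small hypothetical sample, this first pass produces
counts such as the following self-contained sketch shows:

```javascript
// First pass over a tiny second sample, assuming the input has
// already been split into the array "two".
var two = ["alpha", "beta", "alpha"];
var lenb = two.length;
var table = {};
var b = 0;
do {
    if (table[two[b]] === undefined) {
        table[two[b]] = {
            one: 0,
            two: 1
        };
    } else {
        table[two[b]].two += 1;
    }
    b += 1;
} while (b < lenb);
// table is { alpha: { one: 0, two: 2 }, beta: { one: 0, two: 1 } }
```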
3.2. The third and final pass through the data
Once the table is fully populated with all indexes from both arrays,
"one" and "two", we need a third and final loop to walk across the
data and make some simple decisions. Here is this complete step in
code.
do {
    c = a;
    d = b;
    if (one[a] === two[b]) {
        equality();
    } else if (table[one[a]].two < 1 && table[two[b]].one < 1) {
        replaceUniques();
    } else if (table[one[a]].two < 1 && one[a + 1] !== two[b + 2]) {
        deletion();
    } else if (table[two[b]].one < 1 && one[a + 2] !== two[b + 1]) {
        insertion();
    } else if (
        table[one[a]].one - table[one[a]].two === 1 &&
        one[a + 1] !== two[b + 2]
    ) {
        deletionStatic();
    } else if (
        table[two[b]].two - table[two[b]].one === 1 &&
        one[a + 2] !== two[b + 1]
    ) {
        insertionStatic();
    } else if (one[a + 1] === two[b]) {
        deletion();
    } else if (one[a] === two[b + 1]) {
        insertion();
    } else {
        replacement();
    }
    a = a + 1;
    b = b + 1;
} while (a < lena && b < lenb);
Before moving any further, please understand this logic is fast and
simple, but it is not precise at all. We will fix this later with a
child function named "fix". Corrections for precision are a
secondary step so as not to disrupt performance and to isolate
complexity in a single separate location.
In the above code references "a" and "b" are simply positive integer
incrementors. The "a" reference is always associated with the "one"
array while the "b" reference is always associated with the "two"
array. This is necessary so that each array may increment
independently without collision.
The "c" and "d" references are secondary incrementors used as
closures in the child functions. These secondary incrementors allow
for child loops to occur without mutation to the primary
incrementors.
The first thing that happens in this loop is an attempt to discover
equality. Everything else is less important and so happens later in
the decision tree.
The second rule is to identify items that occur only one more time in
each array. If the next item to occur in the arrays is not equal but
occurs just once more then we know the item is a unique change.
The third rule is to determine if the current item is a deletion. If
the current item, and possibly additional following items,
is present in the first sample but no longer exists in the second
sample then it is a dynamic deletion.
The fourth rule is the same as the third rule but in reverse to
determine dynamic insertions.
The fifth rule is to determine if the current index occurs exactly
one more time in the first sample than in the second sample and yet
is not a match for the next item in the first sample with two items
up in the second sample. This is a static deletion, or rather a
deletion of a fixed number of items.
The sixth rule is the same as the fifth rule but in reverse to
determine static insertions.
The seventh and final rule determines that the current items in the
samples are non-unique changes.
The child functions are as follows:
equality = function simpleDiff_equality() {
do {
table[one[c]].one -= 1;
table[one[c]].two -= 1;
c += 1;
d += 1;
} while (
c < lena &&
d < lenb &&
one[c] === two[d]
);
fix(["equal", a, c, b, d]);
b = d - 1;
a = c - 1;
},
deletion = function simpleDiff_deletion() {
do {
table[one[c]].one -= 1;
c += 1;
} while (
c < lena &&
table[one[c]].two < 1
);
fix(["delete", a, c, -1, -1]);
a = c - 1;
b = d - 1;
},
deletionStatic = function simpleDiff_deletionStatic() {
table[one[a]].one -= 1;
fix([
"delete", a, a + 1,
-1,
-1
]);
a = c;
b = d - 1;
},
insertion = function simpleDiff_insertion() {
do {
table[two[d]].two -= 1;
d += 1;
} while (
d < lenb &&
table[two[d]].one < 1
);
fix(["insert", -1, -1, b, d]);
a = c - 1;
b = d - 1;
},
insertionStatic = function simpleDiff_insertionStatic() {
table[two[b]].two -= 1;
fix([
"insert", -1, -1, b, b + 1
]);
a = c - 1;
b = d;
},
replacement = function simpleDiff_replacement() {
do {
table[one[c]].one -= 1;
table[two[d]].two -= 1;
c += 1;
d += 1;
} while (
c < lena &&
d < lenb &&
table[one[c]].two > 0 &&
table[two[d]].one > 0
);
fix(["replace", a, c, b, d]);
a = c - 1;
b = d - 1;
},
replaceUniques = function simpleDiff_replaceUniques() {
do {
table[one[c]].one -= 1;
c += 1;
d += 1;
} while (
c < lena &&
d < lenb &&
table[one[c]].two < 1 &&
table[two[d]].one < 1
);
fix(["replace", a, c, b, d]);
a = c - 1;
b = d - 1;
}
It must be noted that most of these child functions contain their
own loops, but these loops are not duplicates. The primary parent
loop, the so-called third and final loop, is adjusted forward by the
distance these smaller loops increment so that there is no
repetition or loss of execution speed.
3.3. The "fix" function
We can see from the child function code that arrays are generated
which match the format described for the child arrays of the output.
Instead of pushing this data to the output array directly, it is
passed through a function named "fix" which serves as a sort of
quality-control filter. This function checks for things like:
* A diff type immediately following the same diff type, which can be
combined into a single child array for the output
* An insertion immediately following a deletion, or vice versa, which
should likely be a replacement
* Whether false positives generated from replacements immediately
following an insertion or deletion can be eliminated
The general idea is to only accept a child array of the output as an
argument and make a decision after comparing it against the previous
child array of the output. Aside from two narrow edge cases there is
no analysis of the original data. Think of it like an appeals court
in that no new evidence is allowed, because only the laws are under
scrutiny.
The code for the fix function is as follows:
fix = function simpleDiff_fix(code) {
var prior = codes[codes.length - 1];
if (prior !== undefined) {
if (prior[0] === code[0]) {
if (code[0] === "replace" || code[0] === "equal") {
prior[2] = code[2];
prior[4] = code[4];
} else if (code[0] === "delete") {
prior[2] = code[2];
} else if (code[0] === "insert") {
prior[4] = code[4];
}
return;
}
if (prior[0] === "insert" && prior[4] - prior[3] === 1) {
if (code[2] - code[1] === 1) {
if (code[0] === "replace") {
prior[0] = "replace";
prior[1] = code[1];
prior[2] = code[2];
code[0] = "insert";
code[1] = -1;
code[2] = -1;
} else if (code[0] === "delete") {
code[0] = "replace";
code[3] = prior[3];
code[4] = prior[4];
codes.pop();
prior = codes[codes.length - 1];
if (prior[0] === "replace") {
prior[2] = code[2];
prior[4] = code[4];
return;
}
}
} else if (code[0] === "delete") {
prior[0] = "replace";
prior[1] = code[1];
prior[2] = code[1] + 1;
code[1] = code[1] + 1;
} else if (code[0] === "replace") {
prior[0] = "replace";
prior[1] = code[1];
prior[2] = code[1] + 1;
c = prior[2];
d = prior[4];
return;
}
} else if (
prior[0] === "insert" &&
code[0] === "delete" &&
code[2] - code[1] === 1
) {
prior[4] = prior[4] - 1;
code[0] = "replace";
code[3] = prior[4];
code[4] = prior[4] + 1;
} else if (
prior[0] === "delete" &&
prior[2] - prior[1] === 1
) {
if (code[4] - code[3] === 1) {
if (code[0] === "replace") {
prior[0] = "replace";
prior[3] = code[3];
prior[4] = code[4];
code[0] = "delete";
code[3] = -1;
code[4] = -1;
} else if (code[0] === "insert") {
code[0] = "replace";
code[1] = prior[1];
code[2] = prior[2];
codes.pop();
prior = codes[codes.length - 1];
if (prior[0] === "replace") {
prior[2] = code[2];
prior[4] = code[4];
return;
}
}
} else if (code[0] === "insert") {
prior[0] = "replace";
prior[3] = code[3];
prior[4] = code[3] + 1;
code[3] = code[3] + 1;
} else if (code[0] === "replace") {
prior[0] = "replace";
prior[3] = code[3];
prior[4] = code[3] + 1;
c = prior[2];
d = prior[4];
return;
}
} else if (
prior[0] === "delete" &&
code[0] === "insert" &&
code[4] - code[3] === 1
) {
prior[2] = prior[2] - 1;
code[0] = "replace";
code[1] = prior[2];
code[2] = prior[2] + 1;
} else if (prior[0] === "replace") {
if (code[0] === "delete") {
if (one[code[2] - 1] === two[prior[4] - 1]) {
if (prior[2] - prior[1] > 1) {
prior[4] = prior[4] - 1;
}
c = c - 1;
d = d - 1;
return;
}
if (one[code[2]] === two[prior[4] - 1]) {
if (prior[2] - prior[1] > 1) {
prior[2] = prior[2] - 1;
prior[4] = prior[4] - 1;
table[one[c - 1]].one =
    table[one[c - 1]].one - 1;
}
}
} else if (code[0] === "insert") {
if (one[prior[2] - 1] === two[code[4] - 1]) {
if (prior[2] - prior[1] > 1) {
prior[2] = prior[2] - 1;
}
c = c - 1;
d = d - 1;
return;
}
if (one[code[2] - 1] === two[prior[4]]) {
if (prior[4] - prior[3] > 1) {
prior[2] = prior[2] - 1;
prior[4] = prior[4] - 1;
table[two[d - 1]].two =
    table[two[d - 1]].two - 1;
}
}
}
}
}
codes.push(code);
};
3.4. Done!
The following is the complete algorithm with all parts assembled:
function simpleDiff (options) {
var table = {},
lines = function simpleDiff_lines(str) {
str = str
.replace(/\r\n/g, "\n")
.replace(/\r/g, "\n");
return str.split("\n");
},
one = (typeof options.source === "string")
? lines(options.source)
: options.source,
two = (typeof options.diff === "string")
? lines(options.diff)
: options.diff,
lena = one.length,
lenb = two.length,
a = 0,
b = 0,
c = 0,
d = 0,
codes = [],
fix = function simpleDiff_fix(code) {
var prior = codes[codes.length - 1];
if (prior !== undefined) {
if (prior[0] === code[0]) {
if (
code[0] === "replace" ||
code[0] === "equal"
) {
prior[2] = code[2];
prior[4] = code[4];
} else if (code[0] === "delete") {
prior[2] = code[2];
} else if (code[0] === "insert") {
prior[4] = code[4];
}
return;
}
if (
prior[0] === "insert" &&
prior[4] - prior[3] === 1
) {
if (code[2] - code[1] === 1) {
if (code[0] === "replace") {
prior[0] = "replace";
prior[1] = code[1];
prior[2] = code[2];
code[0] = "insert";
code[1] = -1;
code[2] = -1;
} else if (code[0] === "delete") {
code[0] = "replace";
code[3] = prior[3];
code[4] = prior[4];
codes.pop();
prior = codes[codes.length - 1];
if (prior[0] === "replace") {
prior[2] = code[2];
prior[4] = code[4];
return;
}
}
} else if (code[0] === "delete") {
prior[0] = "replace";
prior[1] = code[1];
prior[2] = code[1] + 1;
code[1] = code[1] + 1;
} else if (code[0] === "replace") {
prior[0] = "replace";
prior[1] = code[1];
prior[2] = code[1] + 1;
c = prior[2];
d = prior[4];
return;
}
} else if (
prior[0] === "insert" &&
code[0] === "delete" &&
code[2] - code[1] === 1
) {
prior[4] = prior[4] - 1;
code[0] = "replace";
code[3] = prior[4];
code[4] = prior[4] + 1;
} else if (
prior[0] === "delete" &&
prior[2] - prior[1] === 1
) {
if (code[4] - code[3] === 1) {
if (code[0] === "replace") {
prior[0] = "replace";
prior[3] = code[3];
prior[4] = code[4];
code[0] = "delete";
code[3] = -1;
code[4] = -1;
} else if (code[0] === "insert") {
code[0] = "replace";
code[1] = prior[1];
code[2] = prior[2];
codes.pop();
prior = codes[codes.length - 1];
if (prior[0] === "replace") {
prior[2] = code[2];
prior[4] = code[4];
return;
}
}
} else if (code[0] === "insert") {
prior[0] = "replace";
prior[3] = code[3];
prior[4] = code[3] + 1;
code[3] = code[3] + 1;
} else if (code[0] === "replace") {
prior[0] = "replace";
prior[3] = code[3];
prior[4] = code[3] + 1;
c = prior[2];
d = prior[4];
return;
}
} else if (
prior[0] === "delete" &&
code[0] === "insert" &&
code[4] - code[3] === 1
) {
prior[2] = prior[2] - 1;
code[0] = "replace";
code[1] = prior[2];
code[2] = prior[2] + 1;
} else if (prior[0] === "replace") {
if (code[0] === "delete") {
if (one[code[2] - 1] === two[prior[4] - 1]) {
if (prior[2] - prior[1] > 1) {
prior[4] = prior[4] - 1;
}
c = c - 1;
d = d - 1;
return;
}
if (one[code[2]] === two[prior[4] - 1]) {
if (prior[2] - prior[1] > 1) {
prior[2] = prior[2] - 1;
prior[4] = prior[4] - 1;
table[one[c - 1]][0] =
table[one[c - 1]][0] - 1;
}
}
} else if (code[0] === "insert") {
if (one[prior[2] - 1] === two[code[4] - 1]) {
if (prior[2] - prior[1] > 1) {
prior[2] = prior[2] - 1;
}
c = c - 1;
d = d - 1;
return;
}
if (one[code[2] - 1] === two[prior[4]]) {
if (prior[4] - prior[3] > 1) {
prior[2] = prior[2] - 1;
prior[4] = prior[4] - 1;
table[two[d - 1]][1] =
table[two[d - 1]][1] - 1;
}
}
}
}
}
codes.push(code);
},
equality = function simpleDiff_equality() {
do {
table[one[c]][0] = table[one[c]][0] - 1;
table[one[c]][1] = table[one[c]][1] - 1;
c = c + 1;
d = d + 1;
} while (c < lena && d < lenb && one[c] === two[d]);
fix(["equal", a, c, b, d]);
b = d - 1;
a = c - 1;
},
deletion = function simpleDiff_deletion() {
do {
table[one[c]][0] = table[one[c]][0] - 1;
c = c + 1;
} while (c < lena && table[one[c]][1] < 1);
fix(["delete", a, c, -1, -1]);
a = c - 1;
b = d - 1;
},
deletionStatic = function simpleDiff_deletionStatic() {
table[one[a]][0] = table[one[a]][0] - 1;
fix([
"delete", a, a + 1,
-1,
-1
]);
a = c;
b = d - 1;
},
insertion = function simpleDiff_insertion() {
do {
table[two[d]][1] = table[two[d]][1] - 1;
d = d + 1;
} while (d < lenb && table[two[d]][0] < 1);
fix(["insert", -1, -1, b, d]);
a = c - 1;
b = d - 1;
},
insertionStatic = function simpleDiff_insertionStatic() {
table[two[b]][1] = table[two[b]][1] - 1;
fix([
"insert", -1, -1, b, b + 1
]);
a = c - 1;
b = d;
},
replacement = function simpleDiff_replacement() {
do {
table[one[c]][0] = table[one[c]][0] - 1;
table[two[d]][1] = table[two[d]][1] - 1;
c = c + 1;
d = d + 1;
} while (
c < lena &&
d < lenb &&
table[one[c]][1] > 0 &&
table[two[d]][0] > 0
);
fix(["replace", a, c, b, d]);
a = c - 1;
b = d - 1;
},
replaceUniques = function simpleDiff_replaceUniques() {
do {
table[one[c]][0] = table[one[c]][0] - 1;
c = c + 1;
d = d + 1;
} while (
c < lena &&
d < lenb &&
table[one[c]][1] < 1 &&
table[two[d]][0] < 1
);
fix(["replace", a, c, b, d]);
a = c - 1;
b = d - 1;
};
// * First Pass, account for lines from the second file and
// * build the table counts for the second sample
do {
if (options.diffspaceignore === true) {
two[b] = two[b].replace(/\s+/g, "");
}
if (table[two[b]] === undefined) {
table[two[b]] = [0, 1];
} else {
table[two[b]][1] = table[two[b]][1] + 1;
}
b = b + 1;
} while (b < lenb);
// * Second Pass, account for lines from the first file and
// * build the table counts for the first sample
lena = one.length;
a = 0;
do {
if (options.diffspaceignore === true) {
one[a] = one[a].replace(/\s+/g, "");
}
if (table[one[a]] === undefined) {
table[one[a]] = [1, 0];
} else {
table[one[a]][0] = table[one[a]][0] + 1;
}
a = a + 1;
} while (a < lena);
a = 0;
b = 0;
// Third pass: find all equality; differences are what is left
// over
do {
c = a;
d = b;
if (one[a] === two[b]) {
equality();
} else if (table[one[a]][1] < 1 && table[two[b]][0] < 1) {
replaceUniques();
} else if (
table[one[a]][1] < 1 &&
one[a + 1] !== two[b + 2]
) {
deletion();
} else if (
table[two[b]][0] < 1 &&
one[a + 2] !== two[b + 1]
) {
insertion();
} else if (
table[one[a]][0] - table[one[a]][1] === 1 &&
one[a + 1] !== two[b + 2]
) {
deletionStatic();
} else if (
table[two[b]][1] - table[two[b]][0] === 1 &&
one[a + 2] !== two[b + 1]
) {
insertionStatic();
} else if (one[a + 1] === two[b]) {
deletion();
} else if (one[a] === two[b + 1]) {
insertion();
} else {
replacement();
}
a = a + 1;
b = b + 1;
} while (a < lena && b < lenb);
if (lena - a === lenb - b) {
if (one[a] === two[b]) {
fix(["equal", a, lena, b, lenb]);
} else {
fix(["replace", a, lena, b, lenb]);
}
} else if (a < lena) {
fix(["delete", a, lena, -1, -1]);
} else if (b < lenb) {
fix(["insert", -1, -1, b, lenb]);
}
return codes;
}
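The listing consumes an "options" object supplied by the surrounding
application. A hypothetical invocation, with illustrative sample
text, might look like:

```javascript
// Hypothetical options object; "source" and "diff" may be strings
// (split into lines internally) or pre-split arrays.
var options = {
    source: "apples\noranges\npears",
    diff: "apples\ngrapes\npears",
    diffspaceignore: false
};
// var codes = simpleDiff(options);
// codes could then be expected to resemble:
// [["equal",0,1,0,1], ["replace",1,2,1,2], ["equal",2,3,2,3]]
```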
4. Extensibility
Since the code is simple and its execution is predictable, the
approach is open to modification. In some contexts the most precise
comparison of text is not the most desirable objective. One example
is ignoring certain differences that might be considered
unimportant. This degree of flexibility and extensibility allows for
custom precision.
Custom flexibility allows for a variety of advanced concerns. Such
concerns may include nuances in accounting for language-specific
criteria, whether programming, spoken, or written. Because of
flexible rules, whether colloquialisms or syntax variations,
differences in the literal text may commonly exist that are not
represented as differences in the parsed, or interpreted, result. In
artificially intelligent systems it is necessary to account for
these variations as comparisons occur.
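As one sketch of such an extension, lines could be normalized before
comparison so that case and spacing differences are not reported.
The "normalize" helper below is hypothetical and not part of the
listing above:

```javascript
// Hypothetical pre-processing hook: lowercase each line and collapse
// runs of whitespace so cosmetic variations compare as equal.
var normalize = function (lineArray) {
    return lineArray.map(function (line) {
        return line
            .toLowerCase()
            .replace(/\s+/g, " ")
            .trim();
    });
};
var sample = normalize(["Hello   World", "hello world"]);
// both entries normalize to "hello world"
```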
5. Security considerations
None.
6. IANA considerations
None.
7. References
Heckel, P. (April 1978) "A technique for isolating differences
between files". Communications of the ACM. 21 (4): 264-268.
doi:10.1145/359460.359467
Myers, E. (1986) "An O(ND) Difference Algorithm and Its Variations".
Retrieved from http://xmailserver.org/diff2.pdf on 3 October 2017.
8. Author's address
Austin Cheney
http://prettydiff.com
info@prettydiff.com
EXPIRATION 6 April 2018