mferreira's Journal · Archives for January 2021

Proposal: algorithms for measuring the reliability of identifiers and for deciding when an ID gets Research Grade

In a recent exchange of comments about one observation of a lichen, some users expressed their discontent about
(1) users who provide preliminary IDs for their observations with insufficient knowledge and confidence,
(2) users who carelessly validate other users' IDs that are either erroneous or uncertain, inappropriately raising those IDs to Research Grade.
I don't regard (1) as a problem, as long as preliminary non-RG IDs are regarded as nothing more than mere suggestions for the forthcoming IDers. In fact, providing an educated "best guess" makes it a lot easier for an expert to validate the ID if it is correct: pressing the "confirm" button is all it takes.
As a solution to the second problem, some changes in the ID algorithms were vaguely proposed:
(A) measuring the reliability of each user's IDs for each taxon;
(B) considering the reliability scores of the many identifiers of a specimen in the requirements for Research Grade.

I will now make some concrete suggestions for tasks (A) and (B) mentioned above.

A procedure to calculate a reliability score (R-score) for each user and for each taxon identified by that user
1) Consider all eligible IDs provided by user X that match the following criteria:
a) the ID belongs to the taxon under consideration or to a more specific taxon contained in it;
b) it was the first ID provided by that user X for that observation;
c) it was one of the first 3 IDs for that observation (to ensure some level of independence and to discourage endless ID queues);
d) it was not removed, either by that user X or by a curator
(for example: if user X replaces an ID at species level by a broader ID at genus level compatible with the previous one, the ID at species level was removed by user X, but the implicit ID at genus level contained in that first ID was not removed by user X).
Let N be the number of such eligible observations.
2) Also consider the ineligible IDs provided by user X that match the following criteria:
a) the ID belongs to the taxon under consideration or to a more specific taxon contained in it;
b) it was later removed or contested by the same user X (either by simply removing the ID or by providing an incompatible ID).
Let X be the number of such removed/contested observations (a count, not to be confused with the label of user X).
3) For each observation k among the N eligible observations, consider the first ID provided by that user X and the subsequent IDs by other users (only the last ID from each user for that observation). Let pk be the number of such IDs that are compatible with the first ID provided by user X and let nk be the number of subsequent IDs that are incompatible with the ID provided by user X.
4) Compute the R-score for user X, for the taxon under consideration:
N>2: R = (p1+p2+...+pN)/(p1+n1+p2+n2+...+pN+nN) x N/(N+2X)
N=2: R = (p1+p2+...+pN)/(p1+n1+p2+n2+...+pN+nN) x N/(N+2X) x (2/3)
N=1: R = (p1+p2+...+pN)/(p1+n1+p2+n2+...+pN+nN) x N/(N+2X) x (1/3)
N=0: R=0
(The N/(N+2X) factor penalizes users who systematically make bad IDs and then remove or revise them. Without this factor, a user could simply remove bad IDs to make them ineligible for the calculation of R (criterion 1d), keeping only the accurate IDs to inflate the R-score.)
(The 2/3 or 1/3 factors reduce the R-scores of users with fewer than 3 eligible IDs for that taxon.)
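As a rough illustration, the formula in step 4 could be sketched in Python as follows. The function name r_score and its argument names are not part of the proposal; they are assumptions made for this sketch. Exact fractions are used so that the results can be compared with the worked example further down.

```python
from fractions import Fraction


def r_score(pos_counts, neg_counts, removed):
    """Reliability score for one user and one taxon (hypothetical helper).

    pos_counts[k] and neg_counts[k] are p_k and n_k for the k-th eligible
    observation; `removed` is the count X of removed/contested IDs.
    """
    if len(pos_counts) != len(neg_counts):
        raise ValueError("pos_counts and neg_counts must have the same length")
    n_eligible = len(pos_counts)  # N in the text
    if n_eligible == 0:
        return Fraction(0)
    p_total = sum(pos_counts)
    n_total = sum(neg_counts)
    # p_k counts the user's own first ID, so p_total >= N >= 1 here.
    agreement = Fraction(p_total, p_total + n_total)
    removal_penalty = Fraction(n_eligible, n_eligible + 2 * removed)
    small_sample = Fraction(min(n_eligible, 3), 3)  # 1/3, 2/3 or 1
    return agreement * removal_penalty * small_sample


# User A, Acacia dealbata in the example: one eligible ID (p=3, n=0), two removed IDs.
print(r_score([3], [0], removed=2))        # 1/15
# User B, Acacia mearnsii: observations V (p=2, n=0) and W (p=1, n=3), nothing removed.
print(r_score([2, 1], [0, 3], removed=0))  # 1/3
```

Writing the small-sample factor as min(N, 3)/3 is just a compact way of expressing the three cases N=1, N=2 and N>2.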
A procedure to choose a most likely ID, possibly a Community ID and possibly a Research Grade ID for each observation
1) For each taxon provided as a possible ID (considering only the last ID from each user) and for each of the broader taxa that contain that taxon: let T be that taxon under evaluation.
a) Calculate the sum p of the R-scores (for that taxon T) of all users who provided an ID compatible with that taxon T.
b) Calculate the sum n of the R-scores of all users who provided an ID incompatible with that taxon T. For each user X who suggested a taxon A that does not contain T, consider the R-score of that user X for the alternative taxon A; otherwise, if A contains T (user X indicated "I know it is not T but I am sure it is A") then consider the R-score of user X for the contested taxon T.
c) If p=n=0 (if all IDers have null R-scores) then consider n=1 to avoid division by zero in the next step.
d) Calculate the ratio r = p/(p+n).
2) The most likely ID will be the most specific ID (finest taxonomic level) for which r>1/2.
3) The Community ID, if any, will be the most specific ID for which r>2/3 and p>1.
4) The Research Grade ID, if any, will be the most specific ID for which r>3/4 and p>7/4, but only if that ID is at species level or finer.
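Steps 1c to 4 can be sketched as a small selection routine. This is only one possible reading of the rules: the names support_ratio and choose_ids, the numeric depth (larger means more specific) and the species_depth parameter are all assumptions made for this sketch. The text does not say what to do if two taxa of the same depth both pass a threshold, so the sketch simply keeps the first one it meets.

```python
from fractions import Fraction


def support_ratio(p, n):
    """r = p/(p+n), treating n as 1 when p = n = 0 (step 1c)."""
    if p == 0 and n == 0:
        n = 1
    return Fraction(p) / (Fraction(p) + Fraction(n))


def choose_ids(candidates, species_depth):
    """candidates: list of (name, depth, p, n); a larger depth is more specific."""

    def most_specific(passes):
        ok = [c for c in candidates if passes(c)]
        return max(ok, key=lambda c: c[1])[0] if ok else None

    def r(c):
        return support_ratio(c[2], c[3])

    return {
        "most_likely": most_specific(lambda c: r(c) > Fraction(1, 2)),
        "community": most_specific(
            lambda c: r(c) > Fraction(2, 3) and c[2] > 1
        ),
        "research_grade": most_specific(
            lambda c: r(c) > Fraction(3, 4)
            and c[2] > Fraction(7, 4)
            and c[1] >= species_depth
        ),
    }


# p and n values for observation Z, taken from the worked example.
z = [
    ("Acacia dealbata", 2, Fraction(37, 24), Fraction(0)),
    ("Acacia", 1, Fraction(37, 24), Fraction(0)),
]
print(choose_ids(z, species_depth=2))
# {'most_likely': 'Acacia dealbata', 'community': 'Acacia dealbata', 'research_grade': None}
```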
A practical example
Suppose that users A, B, C, D and E identified 5 and only 5 observations of organisms in the subfamily Mimosoideae. These were their IDs, listed in the order in which they were provided (the lowercase letter identifies the user who provided the ID):

Observation V:
V1a1: Acacia dealbata [replaced later]
V2b: Acacia mearnsii
V3c: Acacia (any species)
V4d: Acacia mearnsii
V5a2: Acacia (any species)

Observation W:
W1b: Acacia mearnsii
W2a1: Acacia dealbata [replaced later]
W3d: Leucaena (any species)
W4c: Acacia (not mearnsii or dealbata)
W5a2: Leucaena (any species)

Observation X:
X1a1: Acacia (any species) [replaced later]
X2b: Acacia dealbata
X3c: Acacia dealbata
X4d: Acacia mearnsii
X5e: Acacia dealbata
X6a2: Acacia dealbata

Observation Y:
Y1e: Acacia dealbata
Y2a: Acacia dealbata
Y3b: Acacia dealbata
Y4c: Acacia dealbata

Observation Z:
Z1e: Acacia dealbata
Z2b: Acacia dealbata

Let's calculate the R-scores for all users and relevant taxa:

A, Acacia dealbata:
1 eligible ID by user A: Y2a
2 removed/contested IDs by user A: V1a1, W2a1
3 positive IDs: Y2a, Y3b, Y4c
0 negative IDs after A's eligible IDs
RaAd = 3/(3+0) x 1/(1+2x2) x 1/3 = 1/15

A, Acacia:
3 eligible IDs by user A: V1a1, X1a1, Y2a
1 removed/contested ID by user A: W2a1
12 positive IDs: V1a1, V2b, V3c, V4d, X1a1, X2b, X3c, X4d, X5e, Y2a, Y3b, Y4c
0 negative IDs after A's eligible IDs
RaA = 12/(12+0) x 3/(3+2x1) = 3/5

A, Mimosoideae:
4 eligible IDs by user A: V1a1, W2a1, X1a1, Y2a
0 removed/contested IDs by user A
All positive IDs
0 negative IDs
RaM = 1 x 4/(4+2x0) = 1

B, Acacia dealbata:
3 eligible IDs by user B: X2b, Y3b, Z2b
0 removed/contested IDs by user B
7 positive IDs: X2b, X3c, X5e, X6a2, Y3b, Y4c, Z2b
1 negative ID after B's eligible IDs: X4d
RbAd = 7/(7+1) x 3/(3+2x0) = 7/8

B, Acacia mearnsii:
2 eligible IDs by user B: V2b, W1b
0 removed/contested IDs by user B
3 positive IDs: V2b, V4d, W1b
3 negative IDs after B's eligible IDs: W3d, W4c, W5a2
RbAm = 3/(3+3) x 2/(2+2x0) x 2/3 = 1/3

B, Acacia:
5 eligible IDs by user B: V2b, W1b, X2b, Y3b, Z2b
0 removed/contested IDs by user B
14 positive IDs: V2b, V3c, V4d, V5a2, W1b, W4c, X2b, X3c, X4d, X5e, X6a2, Y3b, Y4c, Z2b
2 negative IDs after B's eligible IDs: W3d, W5a2
RbA = 14/(14+2) x 5/(5+2x0) = 7/8

B, Mimosoideae:
5 eligible IDs by user B: V2b, W1b, X2b, Y3b, Z2b
0 removed/contested IDs by user B
All positive IDs
0 negative IDs
RbM = 1 x 5/(5+2x0) = 1

C, Acacia dealbata:
1 eligible ID by user C: X3c
0 removed/contested IDs by user C
3 positive IDs: X3c, X5e, X6a2
1 negative ID after C's eligible IDs: X4d
RcAd = 3/(3+1) x 1/(1+2x0) x 1/3 = 1/4

C, Acacia:
2 eligible IDs by user C: V3c, X3c
0 removed/contested IDs by user C
All positive IDs
0 negative IDs after C's eligible IDs
RcA = 1 x 2/(2+2x0) x 2/3 = 2/3

C, Mimosoideae:
Same as for Acacia.
RcM = 2/3

D, Leucaena:
1 eligible ID by user D: W3d
0 removed/contested IDs by user D
2 positive IDs: W3d, W5a2
1 negative ID after D's eligible IDs: W4c
RdL = 2/(2+1) x 1/(1+2x0) x 1/3 = 2/9

D, Mimosoideae:
1 eligible ID by user D: W3d
0 removed/contested IDs by user D
All positive IDs
0 negative IDs after D's eligible IDs
RdM = 1 x 1/(1+2x0) x 1/3 = 1/3

E, Acacia dealbata:
2 eligible IDs by user E: Y1e, Z1e
0 removed/contested IDs by user E
All positive IDs
0 negative IDs
ReAd = 1 x 2/(2+2x0) x 2/3 = 2/3

E, Acacia:
Same as for Acacia dealbata.
ReA = 2/3

E, Mimosoideae:
Same as for Acacia dealbata.
ReM = 2/3

Now let's calculate the p scores and r ratios for all observations and relevant taxa:

V, Acacia mearnsii:
Positive IDs: V2b, V4d
Negative IDs: None
p = RbAm + RdAm = 1/3 + 0 = 1/3 (0.33)
n = 0
r = 1

V, Acacia:
Positive IDs: V2b, V3c, V4d, V5a2
Negative IDs: None
p = RbA + RcA + RdA + RaA = 7/8 + 2/3 + 0 + 3/5 = 257/120 (2.14)
n = 0
r = 1

Most likely ID for V: Acacia mearnsii
Community ID for V: Acacia
RG ID for V: (None)

W, Acacia mearnsii:
Positive ID: W1b
Negative IDs: W3d, W4c, W5a2
p = RbAm = 1/3 (0.33)
n = RdL + RcAm + RaL = 2/9 + 0 + 0 = 2/9 (0.22)
r = 3/5 (0.6)

W, Acacia:
Positive IDs: W1b, W4c
Negative IDs: W3d, W5a2
p = RbA + RcA = 7/8 + 2/3 = 37/24 (1.54)
n = RdL + RaL = 2/9 + 0 = 2/9 (0.22)
r = 111/127 (0.87)

W, Leucaena:
Positive IDs: W3d, W5a2
Negative IDs: W1b, W4c
p = RdL + RaL = 2/9 + 0 = 2/9 (0.22)
n = RbAm + RcA = 1/3 + 2/3 = 1
r = 2/11 (0.18)

Most likely ID for W: Acacia mearnsii
Community ID for W: Acacia
RG ID for W: (None)
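The three calculations for observation W depend on which R-score is picked for each negative ID (rule 1b), so a short sketch may help to check them. Everything structural here is an assumption made for illustration: the PARENT map, the helper names contains, p_and_n and ratio, and the representation of an ID such as "Acacia (not mearnsii or dealbata)" as a taxon plus a set of explicitly rejected taxa. A broader ID that does not reject T is treated as neutral, as in the calculation for observation V.

```python
from fractions import Fraction as F

# Hypothetical parent map covering only the taxa of the example.
PARENT = {
    "Acacia mearnsii": "Acacia",
    "Acacia dealbata": "Acacia",
    "Acacia": "Mimosoideae",
    "Leucaena": "Mimosoideae",
    "Mimosoideae": None,
}


def contains(ancestor, taxon):
    """True if `taxon` is `ancestor` itself or one of its descendants."""
    while taxon is not None:
        if taxon == ancestor:
            return True
        taxon = PARENT[taxon]
    return False


# R-scores from the example, keyed by (user, taxon); missing pairs count as 0.
R = {
    ("B", "Acacia mearnsii"): F(1, 3),
    ("B", "Acacia"): F(7, 8),
    ("C", "Acacia"): F(2, 3),
    ("D", "Leucaena"): F(2, 9),
}

# Last ID from each user for observation W: (user, taxon, rejected taxa).
IDS_W = [
    ("B", "Acacia mearnsii", set()),
    ("D", "Leucaena", set()),
    ("C", "Acacia", {"Acacia mearnsii", "Acacia dealbata"}),
    ("A", "Leucaena", set()),
]


def p_and_n(target, ids, scores):
    p = n = F(0)
    for user, taxon, rejected in ids:
        if contains(target, taxon):
            # Compatible ID: use the user's R-score for the target taxon.
            p += scores.get((user, target), F(0))
        elif contains(taxon, target):
            # Broader ID: negative only if it explicitly rejects the target.
            if target in rejected:
                n += scores.get((user, target), F(0))
        else:
            # Disjoint ID: use the R-score for the alternative taxon.
            n += scores.get((user, taxon), F(0))
    return p, n


def ratio(p, n):
    if p == 0 and n == 0:
        n = F(1)
    return p / (p + n)


for target in ("Acacia mearnsii", "Acacia", "Leucaena"):
    p, n = p_and_n(target, IDS_W, R)
    print(target, p, n, ratio(p, n))
# Acacia mearnsii 1/3 2/9 3/5
# Acacia 37/24 2/9 111/127
# Leucaena 2/9 1 2/11
```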

X, Acacia dealbata:
Positive IDs: X2b, X3c, X5e, X6a2
Negative IDs: X4d
p = RbAd + RcAd + ReAd + RaAd = 7/8 + 1/4 + 2/3 + 1/15 = 223/120 (1.86)
n = RdAm = 0
r = 1

X, Acacia mearnsii:
Positive IDs: X4d
Negative IDs: X2b, X3c, X5e, X6a2
p = RdAm = 0
n = RbAd + RcAd + ReAd + RaAd = 7/8 + 1/4 + 2/3 + 1/15 = 223/120 (1.86)
r = 0

X, Acacia:
Positive IDs: X2b, X3c, X4d, X5e, X6a2
Negative IDs: None
p = RbA + RcA + RdA + ReA + RaA = 7/8 + 2/3 + 0 + 2/3 + 3/5 = 337/120 (2.81)
n = 0
r = 1

Most likely ID for X: Acacia dealbata
Community ID for X: Acacia dealbata
RG ID for X: Acacia dealbata

Y, Acacia dealbata:
Positive IDs: Y1e, Y2a, Y3b, Y4c
Negative IDs: None
p = ReAd + RaAd + RbAd + RcAd = 2/3 + 1/15 + 7/8 + 1/4 = 223/120 (1.86)
n = 0
r = 1

Y, Acacia:
Positive IDs: Y1e, Y2a, Y3b, Y4c
Negative IDs: None
p = ReA + RaA + RbA + RcA = 2/3 + 3/5 + 7/8 + 2/3 = 337/120 (2.81)
n = 0
r = 1

Most likely ID for Y: Acacia dealbata
Community ID for Y: Acacia dealbata
RG ID for Y: Acacia dealbata

Z, Acacia dealbata:
Positive IDs: Z1e, Z2b
Negative IDs: None
p = ReAd + RbAd = 2/3 + 7/8 = 37/24 (1.54)
n = 0
r = 1

Z, Acacia:
Positive IDs: Z1e, Z2b
Negative IDs: None
p = ReA + RbA = 2/3 + 7/8 = 37/24 (1.54)
n = 0
r = 1

Most likely ID for Z: Acacia dealbata
Community ID for Z: Acacia dealbata
RG ID for Z: (None)

Summary of the IDs for the example above, with some comments

Observation V:
V1a1: Acacia dealbata [replaced later]
V2b: Acacia mearnsii
V3c: Acacia (any species)
V4d: Acacia mearnsii
V5a2: Acacia (any species)

Most likely ID: Acacia mearnsii (r=1, p=0.33), not RG
Community ID: Acacia

Observation W:
W1b: Acacia mearnsii
W2a1: Acacia dealbata [replaced later]
W3d: Leucaena (any species)
W4c: Acacia (not mearnsii or dealbata)
W5a2: Leucaena (any species)

Most likely ID: Acacia mearnsii (r=0.6, p=0.33), not RG
Community ID: Acacia (r=0.87, p=1.54)
Although Acacia mearnsii got only one positive ID against three negative IDs, the relevant R-scores of the users who provided negative IDs were rather low. Therefore, Acacia mearnsii remained the most likely ID. For similar reasons, Acacia became the Community ID with a rather high r ratio.

Observation X:
X1a1: Acacia (any species) [replaced later]
X2b: Acacia dealbata
X3c: Acacia dealbata
X4d: Acacia mearnsii
X5e: Acacia dealbata
X6a2: Acacia dealbata

Most likely ID: Acacia dealbata (r=1, p=1.86)
Community ID: Acacia dealbata
RG ID: Acacia dealbata
With one fewer positive ID for Acacia dealbata, this observation would not reach Research Grade.
It's also worth noting that an r value of 1 was obtained, even with one ID against Acacia dealbata, because user D has an R-score of 0 for the alternative taxon Acacia mearnsii (none of D's IDs for Acacia mearnsii are among the first 3 for their observations).

Observation Y:
Y1e: Acacia dealbata
Y2a: Acacia dealbata
Y3b: Acacia dealbata
Y4c: Acacia dealbata

Most likely ID: Acacia dealbata (r=1, p=1.86)
Community ID: Acacia dealbata
RG ID: Acacia dealbata
Again, with one fewer positive ID for Acacia dealbata, this observation would not reach Research Grade.

Observation Z:
Z1e: Acacia dealbata
Z2b: Acacia dealbata

Most likely ID: Acacia dealbata (r=1, p=1.54), not RG
Community ID: Acacia dealbata
The 2 IDs from users E and B were not enough to raise this observation to Research Grade: although both users have rather high R-scores for Acacia dealbata (0.67 and 0.88), they are not high enough. However, if both users were "experts" with R-scores of 0.9 or above, then the p value would become 1.8 or higher, above the 7/4 threshold, and this observation would reach Research Grade with only 2 positive IDs.
If C had also identified this specimen as Acacia dealbata, the p value would rise to at least 1.54+0.25 = 1.79 > 7/4 (it would actually rise to a higher value, because that additional ID would increase user C's R-score for Acacia dealbata above the current value of 0.25). With those 3 positive IDs for Acacia dealbata, this observation would have reached Research Grade.
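The margins in these two what-if scenarios are narrow, so it may be worth checking them with exact fractions. The snippet below is only an arithmetic check; the variable names are made up for this sketch, and the "with C" case uses C's current R-score of 1/4 as a lower bound, as in the text.

```python
from fractions import Fraction as F

RG_THRESHOLD = F(7, 4)      # p must exceed this value for Research Grade

p_now = F(2, 3) + F(7, 8)   # ReAd + RbAd for observation Z
p_experts = F(9, 10) * 2    # two hypothetical identifiers with R = 0.9 each
p_with_c = p_now + F(1, 4)  # adding C's current RcAd as a lower bound

for label, p in [("current", p_now), ("two experts", p_experts), ("with C", p_with_c)]:
    print(f"{label}: p = {p} ({float(p):.2f}), exceeds 7/4: {p > RG_THRESHOLD}")
# current: p = 37/24 (1.54), exceeds 7/4: False
# two experts: p = 9/5 (1.80), exceeds 7/4: True
# with C: p = 43/24 (1.79), exceeds 7/4: True
```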

Posted on January 6, 2021 at 05:52 PM by mferreira
