Appearances + Hero(es): Goofy, Mickey Mouse, Mickey Mouse
YOU HAVE MICKEY'S NAME TWICE
DBI has [hero:M+G] [xapp:MM,GO]. Is there a bug in DIZNI?
I've also seen this in my indices and in other stories. The hero's name appears twice
I also have problems with the advanced search function. I get "no results" where there should be a lot (for example, I entered only Kind:story, Characters:DD,HDL, Date:before 1996 and got no results).
There are extra (unofficial) characters with codes "MM ", "GO " (including the space). This confuses the software.
I guess this is in gr.dbi's Fraply's indexes? apparently, these have been corrected already.
It's not only in Fraply's indexes. As I tried to explain in my own error report (which was subsequently labeled as "duplicate"), MM and DD appear twice almost everywhere. Just do a search for all YD and YM items (where there are, I presume, no extra spaces) and you'll see what we mean.
But, Kaya, the problems in gr.dbi have spread everywhere because those characters were automatically added to the legend file. Let's see tomorrow whether this is solved.
The space should not have been a problem, by the way: Dizni removes spaces. But maybe there was a non-printable character? That still doesn't explain why Dizni/COA treat it as the same as MM in position 1 while treating it as different in position 2.
Kaya: yes, we know. MM, GO, and DD appear double everywhere. That is why *this* error report is flagged as "important".
I tried to remove some records manually, but the SQL PHP script no longer accepts non-SELECT statements.
the problem of double MM's and double DD's seems to have been solved as of now, at least in YD and YM items; cheers
I found some bugs in gr.dbi: some lines used wrong numbers of spaces, so when DIZNI read these lines, some ISV fields ended up wrong, often as invalid UTF-8. Maybe COA ignored the problematic bytes, resulting in "DD + invalid byte combination" being treated as identical to the normal "DD".
I corrected lots of errors in gr.dbi, including several problems related to wrong number of spaces around the hero field. Let's see what happens tomorrow.
I see that my changes to gr.dbi didn't solve this -- MM and DD (and possibly also other characters) are currently listed twice everywhere on COA. Maybe I didn't find all errors in gr.dbi, or maybe the reason for this was something else.
So my temporary deletion of the records in the database worked, even though it gave me an error message. And of course, this morning the wrong records came back.
It seems that GO is not mentioned twice anymore, but MM and DD still are.
However, I found lots of " MM" and " DD" in es.dbi, which didn't cause any problem. Maybe " DD" is OK and "DD " is not?
For info, here is the complete diff for gr.dbi edits on Oct 27:
http://inducks.org/cg...&r2=1.1159
While performing:
SELECT c.charactercode,RIGHT(c.charactercode,1),ASCII(RIGHT(c.charactercode,1))
FROM inducks_appearance a, inducks_character c
WHERE a.storyversioncode='IC AR 334'
AND c.charactercode=a.charactercode
The result comes out as:
charactercode^RIGHT(c.charactercode,1)^ASCII(RIGHT(c.charactercode,1))^
GO^O^79^
MM^M^77^
MM ^ ^32^
So there is a second "MM " charactercode with a space at the end. Apparently, something in a DBI file makes Dizni produce it. The change occurred on Oct 27.
Is there a way to see if that is really a space, or something else?
(Dizni should always remove trailing spaces.)
> So there is a second "MM " charactercode with a space at the end. Apparently, something in a DBI file makes Dizni produce it. The change occurred on Oct 27.
That's what you see on COA. But if you look at the data in the file inducks_character.isv, you see "MM<space><one extra byte>". This is an illegal UTF-8 byte sequence, so maybe COA (MySQL) ignores the last byte, making it a valid byte sequence.
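To check whether a stored code really ends in a space or in something else, one can look at the raw bytes instead of the rendered text. The following is only a sketch in Python: the byte value 0xCE is an assumed example of a stray lead byte (the first byte of a two-byte Greek letter), and the lenient decoding merely imitates the behaviour guessed at above. It is not a statement about what MySQL actually does.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def lenient_decode(raw: bytes) -> str:
    """Decode while silently dropping invalid bytes (an assumed analogue
    of a consumer that ignores the problematic byte)."""
    return raw.decode("utf-8", errors="ignore")


# Assumed content of the broken field: "MM", a space, one stray lead byte.
stored = b"MM \xce"

print(stored.hex(" "))                  # raw bytes in hex
print(is_valid_utf8(stored))            # strict check
print(repr(lenient_decode(stored)))     # what a lenient reader would see
```

If the lenient reader also ignores trailing spaces, the broken value becomes indistinguishable from a plain "MM", which would match the duplicated hero names.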
Each UTF-8 character is 1, 2, 3 or 4 bytes. Specifically, a Greek letter is two bytes. Yesterday when I checked gr.dbi, I saw that some lines had a wrong number of spaces. This had a problematic result in the ISV files.
The title starts with a Greek letter (i.e. 2 bytes), but there's one space too few before the title. As a result, the first byte of the Greek letter ends up in the fourth position of the hero field, while the second byte of the same Greek letter ends up as the first byte of the title.
I tried to fix these space problems yesterday, hoping that they would be the only cause of the problem. However, although GO is no longer duplicated, MM and DD still are. Maybe I missed some errors, or maybe someone added new errors after my fix.
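A small Python sketch of that effect, under loud assumptions: the real DBI column layout is not given here, so HERO_WIDTH = 4 and the sample lines are hypothetical, and only the hero and title fields are modelled. The point is just what happens when fixed columns are cut by byte position.

```python
HERO_WIDTH = 4  # hypothetical width of the hero column, in bytes


def split_by_bytes(raw: bytes):
    """Cut the line at a fixed byte offset, ignoring character boundaries."""
    return raw[:HERO_WIDTH], raw[HERO_WIDTH:]


def is_valid_utf8(raw: bytes) -> bool:
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# "\u039f" is GREEK CAPITAL LETTER OMICRON, encoded in UTF-8 as CE 9F.
good_line = "MM  \u039f title".encode("utf-8")  # two spaces after the hero
bad_line = "MM \u039f title".encode("utf-8")    # one space missing

for line in (good_line, bad_line):
    hero, title = split_by_bytes(line)
    print(hero, is_valid_utf8(hero), title[:2], is_valid_utf8(title))
```

With the missing space, the hero field comes out as "MM", a space and the byte CE, and the title starts with the orphaned byte 9F. Both fields are then invalid UTF-8 even though the line as a whole is valid.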
Revision 1.1247 specifically deals with this space issue.
Revision 1.1245 indirectly deals with the space issue, since I removed two issues with wrong issuecodes (with space errors). These issues were already listed (without errors) under the correct issuecodes.
gr.log2 says:
gr/KL 50b @I TL 711-AP #295 hero <MM �> is not in the legend file
gr/KL 50c @I TL 637-A #295 hero <US �> is not in the legend file
gr/KL 50d @I TL 736-A #295 hero <MM �> is not in the legend file
gr/KL 50e @I TL 259-B #295 hero <DD �> is not in the legend file
So the KL 50 index needs to be corrected.
Seems that FWi changed the KL 50 index. Let's see what's happening tomorrow.
Yep, I've also changed COA so that it will refuse to commit any user input text that is not UTF-8.
That UTF-8 test wouldn't have helped, since the input *is* valid UTF-8. It's just that DIZNI reads a UTF-8 character byte by byte and treats its first byte as part of a different field from its second byte.
That wouldn't have caused a global problem, though, if PHP/MySQL hadn't been so "nice" as to ignore trailing spaces and weird characters in their SQL queries! (%^&)
Maybe it would be a good idea to make a DIZNI fix here? As far as I know, DIZNI checks that each line is valid UTF-8 in UTF-8 files, and *after* that splits up the line into several fields. Maybe it would be better to first split up the line into several fields, and after that check each field instead? And if a field is invalid UTF-8, clear the data of that field and complain to log1. And only check for other errors (heroes not in heroes.dbl etc.) after the charset check.
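Roughly, the per-field check proposed here could look like the Python sketch below. Everything in it is hypothetical: the function name, the field names, the wording of the log message and the sample bytes are invented for illustration and say nothing about how DIZNI is actually written.

```python
from typing import Dict, List, Tuple


def validate_fields(record_id: str,
                    fields: Dict[str, bytes]) -> Tuple[Dict[str, str], List[str]]:
    """Check each already-split field on its own.

    A field that is not valid UTF-8 is cleared and reported; valid fields
    are passed on as text for the later checks (legend lookup etc.).
    """
    cleaned: Dict[str, str] = {}
    log1: List[str] = []
    for name, raw in fields.items():
        try:
            cleaned[name] = raw.decode("utf-8")
        except UnicodeDecodeError:
            cleaned[name] = ""
            log1.append(f"{record_id}: field <{name}> is not valid UTF-8, cleared")
    return cleaned, log1


# Assumed sample record: a Greek letter cut in half between hero and title.
fields = {"code": b"I TL 711-AP", "hero": b"MM \xce", "title": b"\x9f title"}
cleaned, log1 = validate_fields("gr/KL 50b", fields)
for message in log1:
    print(message)
```

The charset complaint would then reach log1 before the "not in the legend file" check ever sees the broken hero.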
Well, I suppose I can set an option on MySQL so that it will only import lines without errors, and produce warnings when it finds errors.
However, I'd say it's a Dizni bug. Dizni should read the entire files in a given format, and split texts according to that format.
SPe:
>Maybe it would be better to first split up the line into several fields, and
>after that check each field instead? And if a field is invalid UTF-8, clear
>the data of that field and complain to log1
Actually, with this method, wouldn't it be possible that a field content is valid UTF-8 by coincidence? I think it's better that Dizni splits according to the encoding.
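Splitting according to the encoding would mean decoding the whole line first and then cutting by character position rather than by byte position. A minimal Python sketch, again with a hypothetical HERO_WIDTH and invented sample lines, since the real column layout is not given in this thread:

```python
HERO_WIDTH = 4  # hypothetical width of the hero column, in characters


def split_by_characters(raw: bytes):
    """Decode the full line as UTF-8 first, then cut by character count.

    An invalid line raises UnicodeDecodeError here, before any splitting.
    """
    text = raw.decode("utf-8")
    return text[:HERO_WIDTH].rstrip(), text[HERO_WIDTH:]


# "\u039f" is GREEK CAPITAL LETTER OMICRON (two bytes in UTF-8).
good_line = "MM  \u039f title".encode("utf-8")
bad_line = "MM \u039f title".encode("utf-8")  # one space missing

print(split_by_characters(good_line))
print(split_by_characters(bad_line))
```

The line with the missing space still yields a wrong hero ("MM" followed by a space and the Greek letter), but it is a whole, visible character rather than half of one, so it cannot be mistaken for "MM" and would simply fail the legend lookup.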
This is getting to look too much like a discussion (probably not of interest to the original poster)...
It is not a DIZNI but. DIZNI should forward all wrong character names to the ISV files (to avoid broken links on COA), and write log messages about them. And that is exactly what it does.
It is a PHP bug to regard 2 clearly different contents of a string as being the same.
The discussion was moved to our internal mailing list. Still, the bug should be solved (now or tomorrow).
"the bug should be solved" meaning "the bug should not be there anymore" (and not "we should still solve the bug").
I also see that I wrote "It is not a DIZNI but". I am right, of course. It is not a DIZNI but. Whatever that means. 8-)
Lots of (but not all) hero names appear twice on the website. COA or Dizni problem? Adding FWi as maintainer.