Well, as promised, here is the rant about PHP's DOMDocument component.
Some background: as part of my usual "have to study for a test, will have solo hackathons instead" methodology, I was re-rewriting my university syllabus fetching script to properly parse the syllabus as what it is - a (very badly formatted) set of HTML pages. The first version was hacked together in another solo hackathon earlier this year (and what a coincidence - that was also around an exam period!) and was basically preg_match'ing all the way, with a bit of explode'ing. It did the job, but I was ashamed of it, so I decided I'd pick it up again, as it's the core of my recent big project - "Classix" (probably a temporary name; more about the project some other time).
Anyways, back to the point.
I picked DOMDocument for this task, as it seemed to be properly documented (nope) and widely covered around the web (ok, maybe). Basically, it's a component you feed an HTML/XML file to; it parses it into its own data structure and lets you run DOM-style queries on it. That is great, especially since text manipulation isn't the way to go when dealing with HTML: HTML isn't a strict or well-defined enough language for such tricks. In short - don't try to manipulate it without actually DOM-parsing it. You can't (efficiently) rely on string hacks, as there are many edge cases, before even talking about the quality of the HTML you are dealing with.
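For anyone who hasn't used it, the basic flow looks like this - a minimal sketch with made-up sample markup (the tag names and content here are illustrative, not from the actual syllabus pages):

```php
<?php
// Minimal DOMDocument usage: parse an HTML snippet and query it DOM-style
// instead of regexing the raw text.
$html = '<html><body><table><tr><td>Calculus 2</td></tr></table></body></html>';

$doc = new DOMDocument();
// Real-world pages rarely validate; collect the parser warnings quietly.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Walk the parsed tree.
foreach ($doc->getElementsByTagName('td') as $cell) {
    echo trim($cell->textContent), "\n";
}
```

The libxml_use_internal_errors() dance matters with badly formatted pages like the ones I was dealing with - without it, loadHTML() spews a warning for every unclosed tag.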
In my case, the HTML was HORRIBLE.
Wait, did I say horrible? I meant HORRIBLE. (insert dripping blood effects here)
Probably a mix of auto-generated pages from FrontPage-like software and a bit of 2002-style HTML design (all-uppercase tags, missing closing tags inside tables, random line feeds, and more...).
But that barely scratched the surface of it.
The encoding specified in the response headers was ISO-8859-8-i - a logically ordered variant of ISO-8859-8, the encoding informally referred to as Latin/Hebrew.
Everything would have been great if the only issue were converting this to UTF-8. But it didn't make sense from the start - the text came out as weird characters instead of UTF-8 after being read by DOMDocument. I messed with it for a bit, and then decided that the core of the hackathon was re-coding the fetching scripts, not dealing with minor encoding issues; I'd handle it later, as I guessed it would just be a simple iconv operation, maybe with a bit of getting my hands dirty understanding the differences between the encodings and writing my own converter.
I finished quite quickly with changing my scripts to use DOMDocument instead of ugly hackish regexes, and started working on the scripts to normalize the data in the fetching database, preparing it for insert into the Classix database.
That is where the mess started.
Simple iconv'ing didn't work. It only left me with a different set of weird characters, where an iconv("ISO-8859-8-I", "UTF-8", $string); should have been enough.
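And on the raw, un-mangled bytes, it really would have been enough. A sketch, with illustrative bytes (the Hebrew letters occupy the same byte range in ISO-8859-8 and Windows-1255, which is part of why this mess is so confusing - note that iconv builds don't always accept the "-I" suffix, so the plain name is used here):

```php
<?php
// "\xF9\xEC\xE5\xED" is the word "shalom" in the single-byte Hebrew encodings.
$raw = "\xF9\xEC\xE5\xED";

// On raw bytes, a single conversion is all it takes:
$utf8  = iconv('ISO-8859-8', 'UTF-8', $raw);
// The letter range is shared with Windows-1255, so for plain letters
// both conversions yield the same UTF-8 string:
$utf8b = iconv('WINDOWS-1255', 'UTF-8', $raw);

var_dump($utf8 === $utf8b); // bool(true) for letter-only text
```

The two encodings diverge in the punctuation/vowel-point ranges, not in the letters - so text can look fine under the wrong label right up until it doesn't.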
I started debugging the HTML fetching and parsing process, inspecting the headers received by curl, looking at the characters one by one, writing custom conversion functions and what-not.
Eventually, and not surprisingly, I ended up at 7AM trying to figure out what the hell was wrong so I could finally go sleep (and maybe wake up the next day to study for the Calculus 2 test I have tomorrow (as in actually this post's tomorrow, not that day's tomorrow)). I decided my actual contributions at that hour were so inefficient that I should just get a good night's (morning's?) sleep and get to it right when I woke up.
I woke up at 2PM, and it didn't take more than a few minutes to have me troubled at this issue again.
I discovered something interesting - while Chrome said the page was ISO-8859-8-i encoded, and so did the headers read by curl, the HTML page itself was specifying Windows-1255, the encoding used to write Hebrew on Windows.
Again, that wouldn't be too much of a problem - but why do the headers contradict the page?! God, who the HELL was responsible for this mess? (Apparently the guy named as the author in the head meta tags of BGU's syllabus has no Facebook account or any match on Google - except as the source of an HTML request on one of the BGU web pages.)
Ok, with that discovery I was quite sure it would be a piece of cake to work out; I replaced the declared encoding with the correct one and sent it to DOMDocument.
The characters were still weird. The database was still ugly. I was still wasting my precious study time on annoying issues.
I searched for similar issues. I went through some stackoverflow questions, forums and the DOMDocument documentation in the PHP manual, before finding a short answer from someone (can't find the source now) that led me to the conclusion:
DOMDocument does a
horrible, terrible, FUCKING BAD job detecting the encoding of the HTML you feed it with loadHTML(), and defining the encoding when creating the DOMDocument object apparently does nothing. Either it detects it properly and you should be happy it works, or it doesn't. Nothing can fix that. No amount of converting will help.
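To be concrete about what *doesn't* work - a sketch of the naive attempt (sample bytes are illustrative; "\xF9\xEC\xE5\xED" is Hebrew in Windows-1255):

```php
<?php
// Windows-1255-encoded Hebrew with no charset hint inside the markup.
$win1255 = "<html><body><p>\xF9\xEC\xE5\xED</p></body></html>";

// Wishful thinking: the constructor's second parameter has no effect
// on how loadHTML() interprets the bytes it's given.
$doc = new DOMDocument('1.0', 'windows-1255');
libxml_use_internal_errors(true);
$doc->loadHTML($win1255);
libxml_clear_errors();

// The parser guesses the encoding on its own and gets it wrong,
// so the extracted text comes out as mojibake, not Hebrew.
echo $doc->getElementsByTagName('p')->item(0)->textContent, "\n";
</pre>
```

The second constructor argument only sets the encoding declared on the output document; it says nothing about the input.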
In its defense, detecting the right encoding isn't easy; it's a messy field, and the huge number of encodings that exist for Hebrew alone is ridiculous.
That said, IF I TELL YOU IT'S WINDOWS-1255 THEN THAT'S WHAT I MEAN. IT'S NOT "HEY DOMDOCUMENT, THIS MIGHT BE WINDOWS-1255, NOW GO DO YOUR THING AND SEE IF YOU AGREE".
No. It's not that easy.
You will be excited to know that it actually derives the encoding from the head meta tag inside the loaded file. Which, if you remember, we changed to ISO-8859-8-i to match the actual page headers we received. So I happily changed everything back to Windows-1255 and was excited by the idea that I'd finally see Hebrew after such a long time full of random weird characters (just like on my trip to Central America last summer).
So here is the thing - DOMDocument completely ignores your HTML file. There's no encoding detection - or if there is, it's a really poor one. It didn't matter what encoding the file was actually in, or what the headers or the head meta tag said. Nothing.
So the fun began: how do I convince the oh-so-kind DOMDocument that this is not UTF-8?
Using the proper way to define the encoding it should use - the second parameter of the DOMDocument constructor - yielded no happy moments.
After an extensive search, I ended up with this method:
$result = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1255" /></head><body>'.$result.'</body></html>';
Ugly as fuck, but this convinces DOMDocument that your file is encoded in Windows-1255 (even though a few of the HTML files stated ISO-8859-8-i, as did the actual file headers, and Chrome).
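Spelled out end to end, the workaround looks like this - a sketch where $result stands in for the fetched page body (the sample bytes "\xF9\xEC\xE5\xED" are "shalom" in Windows-1255, for illustration):

```php
<?php
// Stand-in for the fetched fragment, in Windows-1255.
$result = "<p>\xF9\xEC\xE5\xED</p>";

// Wrap it so a meta charset declaration sits inside the markup -
// the one place loadHTML() will actually believe.
$result = '<html><head><meta http-equiv="Content-Type" '
        . 'content="text/html; charset=windows-1255" /></head><body>'
        . $result . '</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($result);
libxml_clear_errors();

// DOMDocument holds text internally as UTF-8, so this prints real Hebrew.
echo $doc->getElementsByTagName('p')->item(0)->textContent, "\n";
```

Once the text is inside the DOM it's UTF-8 everywhere, so everything downstream - queries, database inserts - just works.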
After that, everything was just fine. Just like that. So much time wasted on unimportant issues - while in the same time span I fetched and normalized the whole university course syllabus into a database, along with additional data and the relations between everything.
Doesn't that suck?
After high-fiving my sister (who had no idea what we were celebrating but went along with it), I sat down and smiled. Just smiled for a whole minute.
I smiled because I realized that this is exactly what I want to do my entire life. Code, fail, code some more, fail some more, and then get it right.
I learned that failing only empowers the great moments after it, and hell, they're worth it.
Oh, and FUCK YOU DOMDOCUMENT! DIAF, etc.