Well, as promised, here is the rant about PHP's DOMDocument component.

Some background: As part of my usual "have to study for a test, will have solo hackathons instead" methodology, I was re-rewriting my university syllabus fetching script to properly parse the syllabus as what it is - a (very badly formatted) set of HTML pages. The first version was hacked together in another solo hackathon earlier this year (and what a coincidence - that was also around an exam period!) and was basically preg_match-ing all the way, with a bit of explode-ing. It did the job, but I was ashamed of it, so I decided to pick it up again, as it's the core of my recent big project - "Classix" (probably a temporary name; more about the project some other time).

Anyways, back to the point.
I picked up DOMDocument for this task, as it seemed to be properly documented (nope) and widely covered around the web (ok, maybe). Basically, it's a component you can feed an HTML/XML file to; it parses the file into its own data structure and lets you run DOM-style queries on it. That is great, especially since plain text manipulation isn't the way to go when dealing with HTML documents; HTML isn't a strict or well-defined enough language for that. In short - don't try to manipulate it without actually DOM parsing it. You can't (efficiently) rely on text matching, as there are many edge cases, before even talking about the quality of the HTML you are dealing with.

In my case, the HTML was HORRIBLE.
Wait, did I say horrible? I meant HORRIBLE. (insert dripping blood effects here)
Probably a mix of auto-generated pages from FrontPage-like software and circa-2002 HTML design (all upper case tags, missing closing tags inside tables, random line feeds, and more...).

But that didn't scratch the surface of it.
The encoding specified in the headers sent was ISO-8859-8-i, which is a logically ordered ISO-8859-8, which is informally referred to as Latin/Hebrew.
Everything would have been great if the only issue had been converting this to UTF-8. No, it didn't make sense from the start - DOMDocument was producing weird characters instead of UTF-8 after reading the pages. I messed with it for a bit, and then decided that the core part of the hackathon was to re-code the fetching scripts and not to deal with minor encoding issues; I decided I'd handle it later, as I guessed it would just be a simple iconv operation, maybe with a bit of getting my hands dirty understanding the differences between the encodings and writing my own converter.

I finished quite quickly with changing my scripts to use DOMDocument instead of ugly hackish regexes, and started working on the scripts to normalize the data in the fetching database, preparing it for insert into the Classix database.
That is where the mess started.

Simple iconv'ing didn't work. It only left me with a different set of weird characters, where an "iconv("ISO-8859-8-I", "UTF-8", $string);" should have been enough.
I started debugging the HTML fetching and parsing process, the headers received by curl, looking at the characters one by one, writing custom conversion functions and what-not.
Eventually, and not surprisingly, I ended up at 7AM trying to figure out what the hell was wrong so I could finally go to sleep (and maybe wake up the next day to study for the Calculus 2 test I have tomorrow (as in tomorrow relative to this post, not to that day)). I decided my actual contributions at that hour were so inefficient that I should just get a good night's (morning's?) sleep and get back to it right when I woke up.
I woke up at 2PM, and it didn't take more than a few minutes for this issue to trouble me again.
I've discovered something interesting - while Chrome said the page was ISO-8859-8-i encoded, and so did the headers read by curl, the HTML page was specifying Windows-1255 encoding, which is used to write Hebrew in Windows.

Again, that wouldn't be too much of a problem - but why are the headers contradicting themselves?! God, who the HELL was responsible for this mess? (Apparently the guy stated as the author in the head meta tags in BGU's syllabus doesn't have a Facebook account or any match in Google - except a source of an HTML request of one of the BGU web pages.)

Ok, with that discovery I was quite sure it would be a piece of cake to work out; I replaced the headers with the correct encoding and sent it to DOMDocument.

Nothing changed.

The characters were still weird. The database was still ugly. I was still wasting my precious study time on annoying issues.
I searched for similar issues. I went through some Stack Overflow questions, forums and the DOMDocument documentation in the PHP manual, before finding a short answer from someone (can't find the source now) that led me to the conclusion.
DOMDocument does a horrible terrible FUCKING BAD job detecting the encoding of the HTML you feed it with loadHTML(), and defining the encoding when creating the DOMDocument object apparently does nothing. Either it detects it properly and you should be happy it works, or it doesn't. Nothing can fix that. No converting will help.
In its defense, detecting the right encoding isn't easy; it's a messy field, and the huge number of encodings that exist for just Hebrew is ridiculous.
No. It's not that easy.
You will be excited to know that it actually derives the encoding from the head meta tag inside the loaded file. Which, if you remember, we changed to ISO-8859-8-i to match the actual page headers we received. So I happily changed everything back to Windows-1255, excited by the idea that I would finally see Hebrew after such a long time full of random weird characters (just like on my trip to Central America last summer).

So here is the thing - DOMDocument completely ignores your HTML file. No encoding detection, and if there is, a really poor one. It didn't matter which encoding the file was encoded in or what the encoding in the headers or the head meta tag was. Nothing.

Just UTF-8.

So the fun began. How do I convince the oh-so-kind DOMDocument that this is not UTF-8?
Using the proper way to define the encoding it should use, the second parameter of the DOMDocument constructor, yielded no happy moments.
After an extensive search, I ended up with this method:

$result = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1255" /></head><body>'.$result.'</body></html>';

Ugly as fuck, but this convinces DOMDocument that your file is encoded in Windows-1255 (even though a few of the HTML files stated ISO-8859-8-i, as did the actual file headers, and Chrome).
After that, everything was just fine. Just like that. So much time wasted on unimportant issues, while in the same time span I fetched and normalized into a database the whole university course syllabus, along with additional data and relations between everything.
Doesn't that suck?
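
For the record, here is a minimal sketch of that trick wrapped in a function. The helper name loadHtmlAs and the sample Hebrew bytes are made up for illustration; only DOMDocument, loadHTML() and the libxml_* calls are real PHP. It also assumes your libxml build can actually convert Windows-1255, which as far as I can tell depends on iconv being available.

```php
<?php
// Hypothetical helper (name and signature are assumptions, not part of DOMDocument).
// It prepends a charset declaration so the parser sees it before any content.
function loadHtmlAs(string $html, string $charset): DOMDocument
{
    $wrapped = '<html><head><meta http-equiv="Content-Type" content="text/html; charset='
        . $charset . '" /></head><body>' . $html . '</body></html>';

    // Collect parser warnings instead of printing them; messy HTML produces plenty.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($wrapped);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// "shalom" in Hebrew, written as raw Windows-1255 bytes (sample data, not from the syllabus).
$fragment = "<table><tr><td>\xF9\xEC\xE5\xED</td></tr></table>";

$doc  = loadHtmlAs($fragment, 'windows-1255');
$cell = $doc->getElementsByTagName('td')->item(0);

// DOMDocument hands text back as UTF-8, so this can go straight into a UTF-8 database.
echo $cell->textContent, "\n";
```

If the charset in the meta tag is wrong or missing, the same call still succeeds and you silently get the weird characters back, which is exactly what made this so annoying to debug.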

After high-fiving my sister (who had no idea what we were celebrating but went along with it), I sat down and smiled. Just smiled for a whole minute.

I smiled because I realized that this is exactly what I want to do my entire life. Code, fail, code some more, fail some more, and then get it right.
I learned that failing only empowers the great moments after it, and hell, they're worth it.


Facebook Hacker Cup, end of qualification round


Just got the email about being promoted to the next Facebook Hacker Cup round, which will be held this Saturday (22nd January). Also, I found a really minor mistake in my Peg Game solution that made me miss the correct number by about 0.02 in roughly 5% of the test cases. That was damn hard to notice when verifying output, but I believe comparing md5 sums is what I should stick to next time.

Round 1 is made up of 3 sub-rounds; from each of them, 1,000 enthusiastic hackers may advance to round 2, for a total of 3,000 attending that round.

I really hope to qualify at least to the second round (which is actually third in number), and will be trying my best this Saturday.

Good luck everyone and have fun!

Best alarm clock ever

So, I needed some alarm clock software, so I could make my PC play one of my favorite songs when I wake up for work. Instead of downloading one, I just opened Stairway to Heaven by Led Zeppelin on YouTube, paused the video and typed this JavaScript code into the URL bar:

javascript:void setTimeout(function () { window.location.reload(); }, 6.5 * 60 * 60 * 1000);

Bam, 6.5 hours later Stairway to Heaven is rocking my speakers :). Being a programmer means you usually have creative solutions for a lot of problems...

Facebook Hacker Cup

// don't worry, there is no actual algorithm explained in here!

Lately I've been playing around with Facebook's Hacker Cup. Ignoring the problems with input files and expired timers, it's a lot of fun.

I also noticed a lot of people are getting stuck at the Peg Game problem, not quite understanding how the first example input, 5 4 0 1 2 2, is supposed to produce 0 0.375. So, here's the explanation I've been giving lately:

0 x.x.x.x
1  x.x.x
2 x.X...x
3  x.x.x
4 x.x.x.x
ignore the first line, start straight from second one.
so starting at slot 0, row 1 we hit a peg. since it's a side one, we don't have a choice where to go, and go to row 2 peg slot 1 (marked with a big X)
there we have a choice of either row3 peg slot 0, or row3 peg slot 1.
we have to sum the probabilities of these two options.
option1 (row3 peg slot 0):
again no choice. we go to row4 peg1, and then have a 0.5 probability of going to slot 0 (the goal).
so total for this option to hit the goal: 0.5*0.5 = 0.25.
option2 (row3 peg slot 1):
again a crossroad. (0.5 chance to go either)
option2.1 (row4 peg slot 1):
0.5 of hitting the goal.
option2.2 (row4 peg slot 2):
0 of hitting the goal.
so total for this option to hit the goal: 0.5*0.5*0.5 = 0.125.
sum of both options: 0.25 + 0.125 = 0.375

Hope this helps anyone.

I have a lot more to write about this contest from my point of view, but I'll hold off until the round is over.

Edit: Here's my solution to Peg Game: http://dvirazulay.com/phps/hc-peggame.phps
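
Now that the solution is out anyway, here is a minimal sketch of the walk-through above in PHP. It is not the linked solution, just the same reasoning written as code. The function name, the column numbering (slot k sits at column 2k + 1 of the drawing) and the way a missing peg index is mapped to a column on odd rows are my own assumptions, and it assumes at least 3 columns.

```php
<?php
// Hypothetical helper, not the linked solution: chance that a ball dropped
// into top slot $start ends up in bottom slot $goal.
// Columns follow the drawing: even rows have pegs at columns 0, 2, 4, ...,
// odd rows at 1, 3, 5, ..., and slot k sits at column 2k + 1.
function pegGameProbability(int $rows, int $cols, int $start, int $goal, array $missing): float
{
    // Translate "row, peg index" pairs into "row:column" keys.
    $gone = [];
    foreach ($missing as [$row, $index]) {
        $column = ($row % 2 === 0) ? 2 * $index : 2 * $index + 1;
        $gone["$row:$column"] = true;
    }

    $lastPegColumn = 2 * $cols - 3;      // right-most peg of an odd row
    $current = [2 * $start + 1 => 1.0];  // column => probability

    // Row 0 is ignored: the ball starts in one of its gaps.
    for ($r = 1; $r < $rows; $r++) {
        $next = [];
        foreach ($current as $c => $p) {
            $isPeg = ($c % 2 === $r % 2) && !isset($gone["$r:$c"]);
            if (!$isPeg) {
                $moves = [$c => 1.0];        // nothing to hit, keep falling
            } elseif ($r % 2 === 1 && $c === 1) {
                $moves = [$c + 1 => 1.0];    // side peg: forced to the right
            } elseif ($r % 2 === 1 && $c === $lastPegColumn) {
                $moves = [$c - 1 => 1.0];    // side peg: forced to the left
            } else {
                $moves = [$c - 1 => 0.5, $c + 1 => 0.5];
            }
            foreach ($moves as $to => $share) {
                $next[$to] = ($next[$to] ?? 0.0) + $p * $share;
            }
        }
        $current = $next;
    }

    return $current[2 * $goal + 1] ?? 0.0;
}

// First example input: 5 rows, 4 columns, goal slot 0, one missing peg (row 2, peg 2).
echo pegGameProbability(5, 4, 0, 0, [[2, 2]]), "\n"; // prints 0.375
```

Running the same function for every starting slot and keeping the largest result gives the first number of the expected output (slot 0 in this example).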

Enjoy and good luck on the next round!

Budapest & A new job & The 3-weeks-marathon


The weekend in Budapest was a-m-a-z-i-n-g!

I really loved the city, and the people, well mostly the girls I've met. People in Hungary seem so warm and welcoming that you just have to like going out there. I've met some really interesting people there! I spent about 1-2 hours with a half-Jewish/bit Hungarian/bit German girl named Eniko who lives in England, so that can be one of my stops when travelling to Europe sometime this year or next year. She was the only one who actually spoke correct English, so that was really fun.

I really like meeting people from other countries and learning about their culture. That is mainly my trip idea for next year. Lots of countries, lots of different cultures, lots of interesting people. I met a few more girls there, but none of them are worth mentioning, as they were either stupid or spoke really horrible English that annoyed me like hell.

A new job

Well, I'm no longer working for Activated Group; I've started working for a start-up company named Clear Applications, on a product named Zerrem, which is a job finding social network website (my profile on Zerrem). I'm really excited about it and hope I'll contribute a lot to the team and make this product a great thing. Oh, and I'll be getting my new car next week, a rented Mazda 2 from the company. Woohoo 🙂

And I've already chosen a name for her. Angelina, or just Angi. 🙂

The 3-weeks-marathon

Ok, it's mid-June, and the Psychometry test date is coming up. (1st July 2010)

I have truly A LOT of pressure on me now, with the new job and the upcoming test. I feel like I'm about to crash, hanging on by a very thin rope. I hope I'll do well enough in the test despite this enormous amount of pressure, so I won't have to redo it.

Wish me luck!

I have decided to fix the world

I have finally decided to take a step into fixing the world.

Oh, yeah? How are you gonna fix the world?

I'm gonna redirect every Internet Explorer user to a Chrome/Firefox download page.

Okay, and how will that stupid action fix the world?

Every web developer knows Internet Explorer is our worst enemy. We spend lots of time adjusting our perfectly made Chrome/Firefox modern websites to show properly even on the old absent-minded Internet Explorer. With that time (and headaches) saved, we can actually feed some babies in Africa.

And there you go, the world is fixed. kkthxbye.

Setting myself some goals to reach

I'm a very, very goal-minded person, and as one, I always have in mind a few things:

  • How to be the best at what I'm doing.
  • How to make people wonder "how the hell did he pull that off".
  • How to be the first to do anything.

That's why every time I start working on something that makes me excited, I'm so into it that I won't stop until I know for sure that I'm "the best", at least until something changes or there are not many people interested in what I've been doing. That's why I'm sure that at some point, when I start working as what I really want to be (a lead Software Engineer), I'll be a workaholic.

This has a few good and bad sides, but as for now I see the good ones only, as I don't have a family or anything other than my own future to worry about, so I'm fine with that.

Only one big thing really bothers me about myself - I lose interest really fast if there are no more goals to achieve, new places to find, new targets to conquer. This includes everything - school, work, and even girls. I've yet to find one that didn't bore me after 2 months, or made it too difficult to continue. Those who don't bore me fast, I keep as friends. Did someone say endless cycle?

So, in order not to completely freak out at home now that I've finished my 3 years of service in the army (as a commander in the Israeli artillery force), I've decided to put up a few 'to do's for myself to achieve in the next month:

  1. Check what the hell happened to my credit card and why it stopped working. March 02: My Psychometry school, Yoel Geva, decided they can do whatever they want and bill me for the wrong payments at the wrong dates, causing my credit card cancellation. Good job.
  2. First and most important - take a weekend off, go to Eilat with a couple of friends, drink more beer than I can imagine. (Carlsberg. Definitely not any of the other crappy brands)
  3. Finish learning the words in the dictionary for the Psychometry exams. (Israel's equivalent to the SAT exam taken in the US) - current progress: 40% - Last update: March 08 / Update: May 08, 91% ~ complete.
  4. Make this kind of to-do list for my little piece of the internet, this website. Must happen before the next Sunday.
  5. Finish going through my TV shows & movies collection, and convert the data into a database, allowing me to manipulate it easily. (Incoming - IMDB query tool)
  6. Get a working weight calculator. It's time to make sure I'm not just wasting my time in the gym. March 02: 77kg... new goal: 72kg. Hopefully with the help of the gym's nutritionist. May 08: 74kg. 19% fat, a decrease of 7% since March. 2kg to hit the goal, and hopefully -7% fat to go.
  7. Get a job for the next 2 months. March 02: Starting to work, hopefully this Sunday, at a factory nearby. Will probably be a lot of storage work, moving boxes around, etc., but that is temporary & will fit my gym workouts 😉 March 07: Dropped that idea, I want something much more serious. Looking for a job in my best field - web programming. / May 08: I have been working for 2 months at Activated Group as a Lead Web Developer. Currently working on a few projects, some are websites and some are Facebook applications. Will update on that in my next blog post.
  8. Plan the next week perfectly so I won't be missing things I wanted to do.
  9. Get an extra 1-2TB external HD. / First one is 100GB short to be full :p  I don't download as much as I used to, so will wait with that move, maybe until my birthday (4th July)
  10. Last but not least, make sure I write here at least once a week. I really like writing, and it really helps me clear my mind and decide about the things I want to do next. Sort-of replacing the need of a person who understands me or someone I can actually tell stuff to all the time. Someone that isn't affected by random mood changes, or continuous break-downs.

So whoever celebrates Purim, I wish you a happy holiday, and for those who don't, have a great week.

Over and out.

Dream Theater concert @ Israel, 16/06/2009

Not under your command, I know where I stand, I won't change to fit your plan, TAKE ME AS I AM!

As you can guess, I actually WAS THERE, and it was truly AMAZING!

Out of about 10,000 fans 'living the dream' that evening, I was in the second row! I mean, if you can call them rows, as it was basically huge masses of people squeezed into a few meters right near the stage. Add in the pogos (circles of people running into each other), the drunk Russians, and the impressive number of girls attending - I swear some of them would not have survived the concert if they hadn't come with their boyfriends to protect them. Those of us at the front lines received a bit of water thanks to the security guys near the stage, but I have absolutely no idea what the rest of the people a little further back did...

Some may blame this behavior on the Israeli type of people, but I really doubt it's any different in other countries, especially at metal concerts. On another sad note, at the end of the concert Petrucci threw his pick just over to the guy standing next to me, and both of us searched around the grass for about fifteen minutes looking for it, and found nothing but a sharp metal object someone designed just the size and shape of a pick... :'(

On the happier side, I bought a new Dream Theater shirt! I'll try to take a picture of it and upload tomorrow for all of you to see...

Keep on living the dream!

Dream Theater

