Wednesday, January 20, 2010

Django, Solr and a few interesting things, Part 1: Solr power!!!!


****Long story alert, tech related, jump to the lessons learned at the end*****


The Beginning


One of the things I am working on now involves Solr, Django and a few other things. Along the way we discovered Haystack.


The story started when we found a project that generates cash. After talking to the client and looking at their system, we decided to give it a try, and to build it using Django and Solr, among a few other components. We had not confirmed that we would offer search in the first place, though...


To keep it short: we began to build it.


To Serve, Just java -jar start.jar


We started with Solr because it is easy to install. Just grab Solr from http://lucene.apache.org/solr/
What is really cool about Solr is the example folder inside the Solr folder. That is a fully functional Solr project.
Just cd into the example folder and run java -jar start.jar. Follow the tutorial to use the example folder.


Since we are a lazy bunch, we just copied the example folder into our project folder and modified schema.xml in example/solr/conf/. The schema.xml in that folder is usable as it is, so we only changed the fields.


Roadblock


The project's original database has quite a lot of tables, so the FIRST STEP is to denormalize the data. Flatten the whole thing.
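
To make "flatten the whole thing" concrete, here is a minimal sketch in plain Python. The customer and phone tables, their column names and the denormalize function are all made up for illustration; they are not from the actual project. The idea is only that each parent row becomes one flat document and the repeated child values are collected into a list.

```python
from collections import defaultdict


def denormalize(customers, phones):
    """Flatten a one-to-many relation into one flat document per customer.

    `customers` and `phones` are hypothetical tables, given as lists of dicts.
    """
    # Group the child rows by their foreign key first.
    phones_by_customer = defaultdict(list)
    for phone in phones:
        phones_by_customer[phone["customer_id"]].append(phone["number"])

    docs = []
    for customer in customers:
        docs.append({
            "id": customer["id"],
            "name": customer["name"],
            # Repeated values end up in a single list-valued field.
            "phone": phones_by_customer[customer["id"]],
        })
    return docs


customers = [{"id": 1, "name": "Ali"}, {"id": 2, "name": "Siti"}]
phones = [
    {"customer_id": 1, "number": "012-1111"},
    {"customer_id": 1, "number": "012-2222"},
    {"customer_id": 2, "number": "013-3333"},
]
docs = denormalize(customers, phones)
```

Each dict in docs is the kind of flat record a loading script could then hand to Solr.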


We decided to use solrpy for the script that loads the data from the database into Solr. That looked OK until we hit a one-to-many relation in the database. In solrpy, each repeated field is put into a list and added to Solr.
This is not the worst issue. When we pulled the data back from Solr, the field values were not in order, so the structure of the original records was not maintained.
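
One way this kind of structure loss can happen is sketched below in plain Python, with no Solr calls. It is an assumption about the cause, not something we verified: if child records are split into one list per field and empty values are skipped, the lists no longer line up by position. The contact records and the to_parallel_fields function are invented for the example.

```python
def to_parallel_fields(contacts):
    """Split child records into one list per field, skipping empty values."""
    doc = {"contact_name": [], "contact_phone": []}
    for contact in contacts:
        if contact["name"]:
            doc["contact_name"].append(contact["name"])
        if contact["phone"]:
            doc["contact_phone"].append(contact["phone"])
    return doc


contacts = [
    {"name": "Ali", "phone": ""},          # no phone on record
    {"name": "Siti", "phone": "013-3333"},
]
doc = to_parallel_fields(contacts)

# Pairing the lists back up by position now gives the wrong answer:
# Siti's number ends up next to Ali's name.
rebuilt = list(zip(doc["contact_name"], doc["contact_phone"]))
```

Once the lists are misaligned like this, there is no way to tell from the Solr document alone which phone belonged to which contact.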


Thus we learned: JUST store the ID or primary key in Solr so that a result can be referred back to the database, and use the KEY to pull the data from the database.
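
A rough sketch of the "use the KEY" half of that idea, using sqlite3 from the standard library so it runs on its own. The customer table, the fetch_by_ids helper and the hit_ids list are all hypothetical; in a real setup the ids would come back from a Solr query rather than being typed in.

```python
import sqlite3


def fetch_by_ids(conn, ids):
    """Load rows for the given primary keys, keeping the search-result order."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        "SELECT id, name FROM customer WHERE id IN (%s)" % placeholders,
        list(ids),
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # Skip ids that are in the index but no longer in the database.
    return [by_id[i] for i in ids if i in by_id]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, "Ali"), (2, "Siti"), (3, "Chong")],
)

hit_ids = [3, 1]  # hypothetical ids, in the order a search might rank them
results = fetch_by_ids(conn, hit_ids)
```

The database stays the source of truth for structure, and the search index only has to answer "which records match, and in what order".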


Then came a lot of trial and error, trying to detect empty fields etc.


We gave up. We discovered Haystack!!!

Lessons Learned in Using Solr


1) Denormalize all fields to be stored in Solr.
2) Sometimes Solr output does not reflect the structure of the database (OK, we are very new to Solr).
3) So if the data is from a database, index everything and store only the key (probably a bad idea, because we are not supposed to hit the db a lot, I don't really know).
4) Don't reinvent the wheel. It turns out that Haystack has solved our problem and more.
5) Partially related to number 4: the client just dumped a file exported from their database on us. We tried to be heroes, but it turned out better to load it into a db first... because processing the data directly is a mess.
