Joseph M. Joy, Software Architect, Microsoft Research India
April 24
Digitization of the JBNHS: Getting Started
This is Part 2 of the story of the Digitization of the Journal of the Bombay Natural History Society (JBNHS). My introductory post, “Digitizing 100 years of the Journal of the Bombay Natural History Society”, contains the background of this project and also links to subsequent posts. The project kicked off in earnest in 2001. Kumaran met with Mr. J. C. Daniel, Honorary Secretary of the BNHS, to convey our intentions. Kumaran knew Mr. Daniel personally, and he was very encouraging. Mr. Daniel is one of the many truly remarkable people in BNHS’s long history, and continues to be active after over 50 years of association with the Society!
I had the good fortune of being able to chat with Dr. Gary Starkweather, inventor of the laser printer, who was working at Microsoft Research in Redmond at the time. Dr. Starkweather gave me very good advice on the kind of scanning equipment I would need, what DPI to scan at, and so forth. He strongly suggested cutting off the spines of the journals if that could be done, because everything goes smoothly if you have flat pages. He told me he used a band saw. I performed one, mercifully short, experiment trying to use my radial arm saw to saw through the spine of a discarded book – it shredded that poor book to pieces! I eventually found that the local Kinkos™ had a book-slicing machine (“the Titan”, as it was appropriately called) and used them for the entire project. Gary also suggested I get a Fujitsu Duplex Scanner with ADF (automatic document feeder), as he’d had good experience with that brand. I ended up buying a Fujitsu M4097D scanner, a real workhorse. You could feed up to 100 pages at a time, and it scanned both sides of each page in one pass (hence the term “Duplex”). Here is a picture of it, just after unpacking:
I realized very early on that to get full control over the project, I had to get a copy of the entire Journal for myself, and get it shipped over to Redmond, WA, USA, where I was living at the time. I did not mind sacrificing the “collector's value” of the Journal (by slicing off the spines) in order to get it into digital form. I also wanted full control over the scanning process, in particular the quality control and the handling of the volumes, especially the earlier volumes of the Journal, which are quite hard to find these days. In fact, the BNHS’s own copies of the earliest volumes of the Journal are not in particularly good shape, because of the weather in Mumbai and no doubt due to the amount of handling those volumes have received over the years. Serendipitously, Kumaran’s friend, Andrew Robertson, knew of a book dealer, Dieter Schierenberg b.v, based out of Amsterdam. This dealer specialized in older scientific journals, and had the entire set from Volume 1 to 72 for sale! I purchased that set and had it shipped over to Redmond. I remember hiring a station wagon and driving to the cargo area of SeaTac Airport, and picking up 2 pallets shrouded in black plastic, which were loaded onto the station wagon with a forklift! I don’t have pictures from that memorable ride, but here is a picture of the books, both inside and outside their packaging (there were many more of them, of course):
Here’s a picture of me with some of the books (the enormity of the task seems to be sinking in…):
The BNHS provided me with the more recent journals – from 1973 through 2000. I had those shipped too (by my friend Deepak Amin) from Mumbai to Redmond. By November 2001 I had the entire 100 volumes of the Journal, and all the equipment. At that time, I wrote up a status report on the project, which we sent to the BNHS. A PDF version of the report is here.
The journals themselves were marvelous, and it is easy to get lost in them. Here is the first page of the very first issue, dating from January 1886! I was surprised by the high quality color plates in even the earlier volumes of the Journal. Here’s a sample from Volume 17, which was published before 1910! I didn’t realize one could do that kind of high quality printing at the turn of the century (NOTE: this is a color photograph of one of the pages; all the scanning I did was in black & white, i.e., 1 bit per pixel):
A detail from the above page is here:
In my next post I will go through a sample of some of the earlier content from the Journal. The content and style of writing of the earlier articles, mostly by British authors, is fascinating. Many of these authors straddled the roles of Naturalist and Hunter as if it were the most natural thing to do. There were many articles whose very titles speak of eras gone by, such as a series of articles on the “Poisonous Plants of Bombay!” [Back to my introductory post on the Digitization of the JBNHS]
March 28
Digitizing 100 years of the Journal of the Bombay Natural History Society
[This is the first of a series of posts on the Digitization of the Journal of the Bombay Natural History Society.] On February 18th, 2009, at the Indian Institute of Science’s J N Tata Auditorium in Bangalore, the DVD of the first hundred years of the Journal of the Bombay Natural History Society (JBNHS) was officially released. Here’s a small picture from the DVD release ceremony that I pulled from the www.bnhs.org site:
From left to right are: Dr. Asad Rahmani (Director of the BNHS), Prof. C.S. Yogananda (H.O.D., Dept. of Mathematics, SJCE, Mysore), Prof. D.N. Rao (Chairman, Division of Biological Sciences, IISc), Kumaran Sathasivam, myself, Mr. J C Daniel (Hon. Secretary of the BNHS), and Ms. Vibhuti (Publications department, BNHS).
What am I doing in this picture? That’ll become clear as you read on. It is an interesting story of a wonderful publication (the JBNHS) and of the value of persistence and collaboration in getting something significant achieved (the digitization of the entire journal, all 80,000 pages of it). Besides myself, Kumaran Sathasivam, Prof. Yogananda and Diane Lancaster are key contributors to the project.
The release was on the occasion of the “International Conference on Conserving Nature in a Globalising India”, 17-19 Feb. 2009. The conference itself was wonderful. Much of it was a sobering-bordering-on-depressing account of the decline of bio-diversity worldwide, but there were inspiring tales of heroic dedication and a few success stories as well. I’ve saved a copy of the program schedule here, in case you are interested in the talks. I wish the content covered in the conference were part of mainstream dialog, but that is wishful thinking. At some point I will summarize my takeaways from the conference.
The conference was organized by the Bombay Natural History Society (www.bnhs.org). From their site:
The Bombay Natural History Society is today the largest non-government organisation (NGO) in the Indian sub-continent engaged in nature conservation research. In the 125 years of its existence, its commitment has been, and continues to be, the conservation of India's natural wealth, protection of the environment and sustainable use of natural resources for a balanced and healthy development for future generations.
The Society's guiding principle has always been that conservation must be based on scientific research - a tradition exemplified by its late president, Dr. Sálim Ali.
You can find a bit about the fascinating origins of the BNHS here, and on their current research focus and collaborations here. The Society has many activities and publications, but their most influential publication is the Journal of the Bombay Natural History Society (JBNHS). The first issue came out in 1886. Since then, every year, including during WW-I and WW-II, they have put out 3 to 4 issues, continuing to this day.
Now, as to why I was in that picture, in the kind words of Dr. Asad Rahmani, Director of BNHS:
After more than ten years of persistence by two of our members, Mr. Joseph Joy and Mr. Kumaran Sathasivam, spending long hours in various libraries, and scanning of more than 80,000 pages, a DVD covering 100 volumes of JBNHS was released during our International Conference. The final stages of the DVD production and other technical aspects were taken care of by Prof. Yoganand[a] of the Mysore University [from Dr. Rahmani’s March–April 2009 BNHS Newsletter].
The project goes back over a decade, and in subsequent posts, I will give a roughly chronological summary of the events that led to the eventual publication of the 100 volumes as a DVD.
Origins
My friend, Kumaran Sathasivam, had the original idea of digitizing the Journal. Kumaran and I were classmates at IIT-Madras (1981-85). We were both avid birdwatchers and nature lovers during our time at IIT-M. Incidentally, Kumaran wrote an award-winning short children’s book on our exploits while at IIT-M, called “A Forest in the City”. It is unfortunately out of print, though I may put a scanned copy of it online at some point – it does bring back memories. I clearly remember one trip we made to the Bombay Natural History Society’s head office in Mumbai (then Bombay), called “Hornbill House,” sometime in the early 80s.
It was amazing to be there in person, to let some of the history of the society seep in, and to look through their enormous collections of specimens of fauna -- some 26,000 birds, 20,000 mammals, 7500 amphibians and reptiles and 50,000 insects. (See here for some information on these collections.) Later, after we graduated from IIT-M, Kumaran stayed closely in touch with wildlife studies (he has published a book on Marine Mammals of South India – more on that here). Kumaran had been perusing certain early issues of the Journal (which are very hard to come by, and usually in poor condition), looking for references to a particular species of mammal. He realized what a wealth of information lay in the pages of the journal, and at the same time how painfully slow it was to wade through the dusty volumes in the corner of some university library. So Kumaran brought up the idea of digitizing the journal, and I was immediately caught up by his vision. It was a large and ambitious project to take on (especially before 2000, when large scale digitization of books had not yet begun), and at the time I was looking for something challenging to take on (outside of my work) that clearly had some benefit to humanity and all that. What a journey it has been! [Continued in my next post, "Digitization of the JBNHS: Getting Started"]
June 02
Jim Gray Tribute
I attended the tribute to Jim Gray, May 31st 2008, at UC Berkeley. It was inspiring. The way he conducted his life and interacted with people across the industry and academia, the impact he’s had on many, many individuals – these are things that inspired me most, and I think that we can learn from his example. I think it was Ed Lazowska who said, during the Tribute, that while one cannot hope to become the intellectual giant that Jim was -- to go to sleep at night, hoping to wake up the next morning intellectually stronger – one can always learn from Jim’s example and strive to be a better human.
I’ve written down some things that really struck me. I encourage you to read the “proceedings” from the Tribute, which are published as Volume 37, Number 2 (June 2008) of the SIGMOD Record (the online version of the issue is not up there as I write this).
I did not know Jim Gray personally. I had attended a few talks by him in Redmond, and knew of his pioneering work on Transaction Processing, of his work with Tom Barclay and others on TerraServer, and of his more recent work with the Astronomy community. I do know some folks who interacted with him personally, and they always had good things to say about him. It was at the tribute that I realized just how great a person he was in so many respects, and I am glad I attended this event, to see and hear in person accounts from people he worked with over the years.
Working across the industry. Jim really was at the very center of defining the fundamental properties of Transactions and efficient ways to implement them, including defining several levels of consistency, efficient locking protocols and innumerable other implementation guidelines. He actually wrote a lot of code that went into IBM’s System-R at the time. Bruce Lindsay, who worked with Jim back at IBM Research during the 70s, talked about his contributions there. What I didn’t know is that it was Jim who also moved the entire database industry to adopt standardized transaction processing benchmarks. He defined the original benchmark and encouraged the formation of the TPC council. How he went about this provides a glimpse into his method of working and how wide ranging his impact has been. As David DeWitt described, Jim wrote a paper in 1984 called “A measure of Transaction Processing Power” (an old TR version; a .DOC version is here). This paper was deliberately published in a trade magazine, Datamation, and was authored, tongue in cheek, by “Anon Et Al.” Jim had worked with some 24 folk from industry and academia, whom he anonymously credited because of the controversial nature of the results, which for the first time showed the relative performance of several of the database systems of the day (presented anonymously in the paper with monikers ranging from “Lean and Mean” through “Funny”) in the stark glare of the new benchmark metrics. In his industry-spanning manner that is classic Jim Gray, he had worked with these individuals from fiercely competing companies to get the data. He had the singular authority and professional integrity to pull this off, and to get the companies to move away from their own proprietary benchmarks to the establishment of the TPx benchmarks, which in turn have had far reaching impact in driving innovation beyond databases – in storage, CPU architectures and software.
As Tom Barclay mentioned in a later talk, Jim was really of the industry at large – it’s just that at any point of time some company signed up to pay his salary :-).
500 special relationships. Ed Lazowska gave a talk he titled “500 Special Relationships: Jim as A Mentor to Faculty and Students,” and many of the things he mentioned were echoed by other speakers, either explicitly or implicitly, as they recounted personal stories of interacting with Jim. The essence of what they were saying was that Jim had these incredibly close relationships with hundreds of people, many outside the database community. Ed himself is not a database person. He talked about how “Jim provided extraordinary guidance to me, my students, and our colleagues in the ocean sciences community…” and how Jim anonymously endowed almost $500K in undergraduate scholarships at the University of Washington (not his alma mater). Professor and astronomer Alexander Szalay of Johns Hopkins University talked of how he met Jim in the late 90s, and of Jim “rolling up his sleeves” in 2000, working multiple 20 hour days (!) converting the Sloan Digital Sky Survey (SDSS) into SQL. He talked about Jim “going native” and becoming much of an astronomer himself and a much-loved member of the astronomy community, so much so that an asteroid is to be named after him. He says “my friendship and collaborations with Jim took my career in new, entirely different directions… He impacted lives of many others around him on the same way.” Others, such as Pat Helland (see his blog), Michael Stonebraker and Gordon Bell, had similar stories to tell. They talked about his great technical insights, his mentorship, especially of students and upcoming researchers, and his habit of freely giving credit. Curtis Wong, when explaining Jim’s role in inspiring the World Wide Telescope, talked about Jim’s almost embarrassing generosity in giving credit, and how it was instrumental in motivating Curtis to get the WWT project off the ground. As you can imagine, his reputation for integrity, combined with having so many personal relationships across the industry, enabled him to serve a powerful networking role.
Rick Rashid called Jim a “Gap Bridger … someone who could connect people, groups, companies and disciplines.” David Vaskevitch talked about Jim’s role as a sort of “transaction coordinator” for people who were making major career change decisions, especially between companies – they would always get his advice and, as David put it humorously in TP terms, Jim would ensure that these transactions went through smoothly, and people were not left “in flight” midway through transitions.
Write it down! Another common theme was the importance Jim gave to writing and presenting ideas. He said in a 2002 interview (re-published in the (27,2) SIGMOD Record) that he learned this from his PhD Advisor, Michael Harrison at UCB (who also gave a talk, about Jim’s student days at UCB). Jim would often do the lion’s share of paper writing. Many commented on his ability to write crisply about complex topics. He often encouraged (and goaded) his colleagues to write and present. Andreas Reuter, who with Jim Gray wrote the famous book “Transaction Processing – Concepts and Techniques”, gave a talk about the experience of working with Jim on that book. There are many interesting anecdotes you can read in the SIGMOD Record of the tribute, but Jim’s approach is best summarized in Andreas’s words: “I got the impression that for Jim the whole exercise served as an “upload” of all the things he had learned and thought about for decades, so that other people could pick them up and he was free to take off in to new territories.”
Tidbits. Some other things I noted, from the Tribute as well as from reading the transcript of Jim’s 2002 interview, are jotted down here.
· Alex Szalay mentioned how Jim inquired about the “20 queries” – what are the most important 20 queries that (in this case) the astronomy community would want to make of the SDSS data. This really brought out the underlying requirements and constraints (and precipitated the right kind of dialog and debate about what was really necessary). Alex said that this method of dialog turned out to be very effective in getting to the bottom of requirements when they span multiple disciplines.
· In Jim’s 2002 interview, he talked about how paradigm-changing ideas are often rejected for (first) publication. He says: “The original B-Tree paper was bounced; the data cube paper was bounced. The original transaction paper was bounced. Any paper that is non-linear is going to get bounced”. His advice remains to “go for the home run”, and be persistent.
Jim Gray was a great person. We would do well to emulate even a fraction of his qualities.
February 16
Microsoft Puzzle Hunt and Puzzle Safari
In this post I talk about a series of puzzle contests that my friends and I have participated in, and how they inspired some of us to hold our own contests, including for children.
I have always been interested in puzzles, and have amassed quite a large collection of puzzle books and physical puzzles over the years. Several years ago (1999), my friends and I learned that some puzzle enthusiasts within Microsoft (we were in Redmond at the time) were organizing something called Puzzle Hunt. Teams of up to 12 were invited to participate in a 2 day event that involved solving a series of puzzles, given to us in packets in several stages. We formed a team of some 5 (all IITians, I recall, and somewhat cocky about our expectation of doing rather well in the competition, perhaps even coming in the top three). Some 30 or 40 teams participated.
So early Saturday morning, a whole bunch of puzzle enthusiasts assembled on the Microsoft campus and received our first collection of packets. We had reserved a conference room to serve as a sort of control room (and noticed that many other conference rooms were taken up by other teams, some of these teams quite large – 8 to 12 members). Well guess what? We were completely blindsided by the difficulty of the puzzles. The first thing that flummoxed us was the fact that there were no instructions! A typical puzzle would have a title, and some pictures which didn’t seem to have any rhyme or reason to them, and nothing else. An example from the 2nd Puzzle Hunt is shown below.
Here is another one from one of the later puzzle hunts: we got a small plastic packet with some Jelly Belly® jelly beans in it. That’s it! There was an online system for reporting the status of various teams as they solved. As we stared uncomprehendingly at our set of puzzles, and the hours passed, we saw in the online system various teams solving puzzles, one after another. This was a 2 day event, and on the 2nd day we were pretty dispirited. I think we solved 2 or 3 puzzles out of perhaps 20, and came in the bottom third. The top three teams, meanwhile, were on a completely different level, having solved pretty much all the puzzles and the meta puzzle as well. We were astounded.
So after participating in the first Microsoft Puzzle Hunt, we were bruised, but also hooked, and we have participated in several since. We got to know some of the “tricks of the trade”: chiefly, to try out lots of things very fast, look for encodings of any kind, and not be fazed by ambiguity. The Internet gets used a lot. We generally fall in the middle of the pack, still far below the few teams at the very top, but at least having a lot of fun. You can find out more about the Microsoft Puzzle Hunt on Wikipedia here: http://en.wikipedia.org/wiki/Microsoft_Puzzle_Hunt. Here's a picture I took at the beginning of the 2002 Puzzle Hunt, and another one of our team hunkering down, solving puzzles...
In 2001, another set of enthusiasts started another version of puzzle solving, called Puzzle Safari. These tended to involve many more and easier puzzles, teams could have at most 4 members, and it involved a lot of running around campus. The answer to most puzzles was some location on campus (say a conference room identified by its number), to which one of us had to rush. To confirm that you had found the location, you had to locate a unique stamp pad hidden in that location, and mark your little booklet. This was a one-day affair, and at the end of it, we were pretty exhausted, mentally and physically, but it was loads of fun and we participated every year, as long as we were in Redmond (until 2004). Our team would generally come in the middle, sometimes in the top third of the participants. As with Puzzle Hunt, the very top teams were in a league by themselves, and having gone through the gauntlet ourselves, we just couldn’t believe how they could get so much done in the same amount of time. A humbling and at the same time inspiring experience. You can find out more about Puzzle Safari on Wikipedia here: http://en.wikipedia.org/wiki/Microsoft_Puzzle_Safari. A picture I took of the 2002 puzzle safari beginning is below. Unfortunately I couldn't find any pictures of all the mad running around and letterboxing activity.
Our experiences have inspired my wife and me to hold a mini puzzle safari kind of event for our elder daughter’s birthday. We started this when she was very young: just 3 years old(!). At that age the children mostly ran around confused about what was going on. But we were extremely gratified a year later, when many of these same children, now 4ish years old, were eagerly anticipating the next puzzle (treasure) hunt! In this day of eye candy, TV, and videos, “reverse gifts” and short attention spans, these children had remembered something that had happened a year back and wanted more of it!
So we’ve been having “puzzle hunts” every year; the last 3 have been in Bangalore, India. My colleagues and I have also organized a puzzle hunt during our annual Microsoft Research India retreat, and that was a big hit. I will add more information on each event, and hopefully it will entice others to hold their own mini puzzle hunts for children (and grownups too). The following pictures are from four different puzzle solving events I was involved in organizing, covering diverse age groups, as you can see!
February 14
Mining Software Repositories
Software projects can be extremely large. For example, the Microsoft Windows codebase is estimated to be over 100 million lines of code. However, these projects did not spring up overnight. They have evolved over a long period, with many people contributing over the years. The question is, how much information of value to the current developers of the system can be gleaned by examining the various repositories that constitute the software project? Quite a lot, as it turns out.
Dr. Prem Devanbu from U C Davis is one of the authorities in the field of mining software repositories. Dr. Devanbu recently (in Jan 2008) visited MSR India for a few days, and gave a fascinating talk, “Babes in woods”, on the challenges new developers face when taking on complex software projects. Prem went on to describe work he and his collaborators at UC Irvine did on a system that recommends a list of “related functions” that a programmer should look at when she is planning to use a particular function. Relationships amongst functions are useful in other contexts too. For example, they can be used to study the impact of a particular code change in order to estimate the risk of a regression and to decide which tests to run. Dr. Devanbu described their FRAN tool that implements their HEAR algorithm. HEAR analyzes call graphs to come up with related-function recommendations, using linear algebra techniques that simulate infinite random walks in the neighborhood of the query function in the call graph. It was found to be an effective and fast technique. If you are at all interested (as I am) in the general area of mining software repositories for information that can be valuable to software development, I strongly recommend you read their related paper Recommending Random Walks by Saul, Z. M., Filkov, V., Devanbu, P. and Bird, C. In Proc. 15th ACM SIGSOFT International Symposium on the Foundations of Software Engineering (FSE-15), September 2007.
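To make the random-walk idea a little more concrete, here is a minimal sketch in Python. It is not the HEAR algorithm from the paper, and it is not how FRAN is implemented; it is a generic "random walk with restart" computed by simple repeated iteration, which I am using only as an illustration of how walking around a query function's neighborhood in a call graph can rank nearby functions. The function name `related_functions`, the `restart` probability, and the toy call graph are all made up for this example, and treating calls as undirected edges is my own simplifying assumption.

```python
def related_functions(call_graph, query, restart=0.15, iterations=100, top_k=3):
    """Rank functions near `query` by a random walk with restart.

    call_graph maps each caller name to a list of the functions it calls.
    This is an illustrative sketch, not the algorithm from the paper.
    """
    # Assumption: treat calls as undirected edges, so that both callers
    # and callees of a function count as its neighbours.
    neighbours = {}
    for caller, callees in call_graph.items():
        neighbours.setdefault(caller, set())
        for callee in callees:
            neighbours[caller].add(callee)
            neighbours.setdefault(callee, set()).add(caller)
    if query not in neighbours:
        raise KeyError(f"unknown function: {query}")

    # All of the walker's probability mass starts at the query function.
    scores = {node: 0.0 for node in neighbours}
    scores[query] = 1.0

    for _ in range(iterations):
        updated = {node: 0.0 for node in neighbours}
        # With probability `restart` the walker jumps back to the query.
        updated[query] = restart
        for node, score in scores.items():
            adjacent = neighbours[node]
            if not adjacent:
                # A function with no edges sends its mass back to the query.
                updated[query] += (1.0 - restart) * score
                continue
            # Otherwise the walker moves to a uniformly chosen neighbour.
            share = (1.0 - restart) * score / len(adjacent)
            for other in adjacent:
                updated[other] += share
        scores = updated

    # Highest score first; ties are broken alphabetically for stability.
    ranked = sorted(
        (node for node in scores if node != query and scores[node] > 0.0),
        key=lambda node: (-scores[node], node),
    )
    return ranked[:top_k]


if __name__ == "__main__":
    # A made-up call graph, purely for illustration.
    toy_graph = {
        "open_file": ["check_path", "alloc_buffer"],
        "read_file": ["open_file", "alloc_buffer"],
        "close_file": ["free_buffer"],
        "write_log": ["format_msg"],
    }
    print(related_functions(toy_graph, "open_file"))
```

In this toy graph, functions that share several connections with the query (here `read_file` and `alloc_buffer`) should score above one that is reachable by a single edge only (`check_path`), while functions in unrelated parts of the graph never receive any score at all. A real tool would work on call graphs with many thousands of functions, where solving the equivalent linear system directly is likely a better fit than this simple loop.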
The references in the paper point to other fascinating work in the general area of mining software repositories, for example:
1. D. Cubranic, G. Murphy, J. Singer, and K. Booth. Hipikat: a project memory for software development. IEEE Transactions on Software Engineering, 31(6):446–465, 2005.
2. Z. Li and Y. Zhou. PR-Miner: automatically extracting implicit programming rules and detecting violations in large software code. In Proc. ACM FSE-13, 2005.
3. V. B. Livshits and T. Zimmermann. Dynamine: Finding common error patterns by mining software revision histories. In Proc. ACM FSE-13, 2005.
4. T. Zimmermann, A. Zeller, P. Weissgerber, and S. Diehl. Mining version histories to guide software changes (earlier paper here). IEEE Transactions on Software Engineering, 31(6):429–445, 2005.
These references are all relatively recent, and to me this indicates that the potential of mining software repositories remains largely untapped, with many interesting results still to come.