The Dan Melamed User Guide:
How to do research with Dan Melamed

Keeping Informed
Resources: My Time, Systems Support, Hardware, Software, Data, Other
Reading
Listening
Coding
Experimentation
Writing
Publishing: Authorship, Conferences, Presentations
Vacations
Expenses
Other

Keeping Informed

One of the first things you should do when you start to work with me is to sign up for the NLP-group and NYCNLP mailing lists. If you work on the 7th floor of 715/719 Broadway, then you should also sign up for the Broadway7 list.

Another thing you should do is create a web page for yourself, and add a link to it from http://nlp.cs.nyu.edu/people/. It need not be fancy, but it should exist. I put this in the "Keeping Informed" section for a reason. If the world knows you exist, knowledge and opportunities will come your way. Eventually, you should link your publications, software, and other goodies from your homepage, to encourage people to visit.

Take everything I say as opinion, not as gospel. If something doesn't make sense to you, ask for clarification. Don't hesitate to challenge me, or to suggest alternatives. I like it when people keep me on my toes. I like to engage in intellectually stimulating debate. If I end up losing a debate, so much the better, because that means I learned something. Learning is ultimately what I'm here for. These days I'm learning quite a lot, which means that I lose debates on a regular basis :)

Resources

My Time

Time is my most valuable resource. Please don't waste it. Please let me know as far in advance as possible if you need to cancel or postpone a scheduled meeting. On the other hand, you don't necessarily need to schedule a meeting in order to speak with me. I'm happy to chat informally, as time permits. Feel free to interrupt me. The worst thing that can happen is that I might tell you to come back later.

When we meet, you should take notes, so that you don't forget stuff. (See "Writing" below.)

Systems Support

When you need help with hardware, software, connectivity, etc., you should try the following, in order:
  1. Ask the person sitting next to you, if any.
  2. Ask your advisor, if he's around.
  3. Check our FAQ list.
  4. Send email to helpdesk@cims.nyu.edu, with a cc: to your advisor.

Hardware

Here is a list of hardware available to people doing research with me. If you need more, ask me.

Your home directory is cross-mounted, so that you have access to it from all the boxes in the lab. We achieve this by actually storing it on an off-site server called NFS.CIMS.NYU.EDU. Unfortunately, this means that access to data in your home directory is slower than access to local disks. Therefore, data that does not need to be backed up should live in the /data partition of your desktop machine, rather than in your home directory.

Software

Here is a list of the software that is standardly installed on our workstations. If you will be doing intensive computation, you will want to learn about our Condor job distribution system, starting here. If you're planning to compile large C++ codebases (such as GenPar), then you'll want to learn about distcc.

If you want your C/C++ programs to be able to use more than 2GB of RAM (e.g. on s1), then you have to compile them with the -m64 switch for gcc.

If you are doing memory-intensive computing in Java, then you should understand the -Xms and -Xmx switches.

We have lots of useful software on s1 in /s1/software/. The stuff we're actively developing is in the CVS and/or Subversion repository. Oh, this means you have to learn Subversion and maybe CVS. They're pretty simple, though, and once you learn them you will love them. To learn Subversion, start with Chapters 2 and 3 of this excellent and free online book. A very short intro to CVS is here.

Data

Most of the useful data that we have is on s1 under /s1/data/. Some of it is read-protected due to license restrictions. If you need something that you can't read or can't find, please ask. Please pay attention to the license conditions, usually described in files like README, COPYRIGHT, and LICENSE. Some of the data may not leave the disks of s1. Much of it may not leave NYU. Breaking these rules can severely compromise my ability to do research, and therefore also your ability to do it with me.

Other

If you need some other resource to help you work effectively, whether it's common stuff or something very unusual, please don't hesitate to ask. No matter how outrageous the request, the worst possible outcome is that I'll say no. On the other hand, you might be surprised how far I'm willing to go to make your life easier. And even if I say no, I might keep your request in mind, and I might find a way to say yes later. Typical "other" resources include books, furniture, and computer hardware and software. Untypical resources include travel allowances, broadband service to your home, and exclusive use of computer hardware.

Reading

In research, one must be constantly vigilant against reinventing the wheel. Unoriginal work is not only a waste of time but, possibly worse, it might offend those whom you should have cited but didn't. So read anything you find interesting, but read everything that's directly relevant to your research. Learn to use the library, both the online and traditional varieties. Make an effort to obtain and read publications that are hard to find but likely to be relevant. If you find interesting ideas, write them down together with their source. If you develop original ideas while reading somebody else's, write down the source of your inspiration --- a stronger connection might crystallize later.

Some specific suggestions on how to read papers are here.

Let me know what you read. I might be able to offer relevant insights that increase your rate of knowledge gained per unit time invested. Knowing what you read also benefits me, because you will sometimes find and read something that I should read too but didn't know about. In that case, I will appreciate you alerting me to the source.

Listening

In addition to reading papers, you should attend presentations in your field. This applies not only to presentations that are directly relevant to your current research focus, but also to presentations that are tangentially relevant, as well as all presentations given by your colleagues and/or respected scientists. A good scholar should have breadth as well as depth. For example, I recommend regular attendance at the NYCNLP Forum. The topics that are chosen tend to be of the kind that will be relevant to you eventually, even if they are not relevant immediately.

Listening to presentations need not be a passive activity. To get the most out of your listening experience, it is sometimes a good idea to ask questions. Some people aren't sure when it's appropriate to ask questions about a formal presentation. During? After? It is almost never appropriate to ask questions in the middle of a research presentation. First, it is impolite to impose on other members of the audience with questions that they might not care about. Second, most presentations have time constraints, and you want to give the speaker time to say what s/he wants to say. Therefore, questions that are appropriate in the middle of a presentation are only those that are likely to help the speaker get their message across. To my knowledge, the only kind of question that fits this description is a very specific clarification question, like "What does the term X represent in your slide," which the speaker is likely to answer in just a few words.

Many research presentations have a formal question period at the end, i.e. a time when the speaker is expected to field questions from the audience. What kinds of questions are appropriate to ask during this time? Remember that you can usually chat with the speaker informally later, either in person or by email. To avoid wasting other people's time, the questions that you ask in a public forum should be the kind that are interesting to the public. Even then, there are two types of questions to avoid.

First, you should avoid asking any questions when the speaker is one of your academic "allies," such as a research collaborator or a fellow student. The reason is that answering questions from the audience is one of the hardest parts of giving a presentation. You can't anticipate what the questions will be. Consequently, every "live" question runs the risk of embarrassing the speaker in a public forum, which benefits nobody. This risk is usually outweighed by the benefit of gaining a deeper understanding of the topic of the presentation. However, the risk of embarrassing somebody that you really care about outweighs any potential benefit to people not in that category. Of course, there are ways to ask questions that make your allies look good, but those are tricky, and rarely worth the effort.

Second, you should avoid public confrontations. The academic establishment puts a premium on "collegiality." Bad science will almost always be seen as such sooner or later. Don't publicly criticize somebody just because they're doing bad work. Do that privately and constructively. An exception might be justified if their bad work adversely affects you in some way. For example, if somebody misrepresents your work in an unflattering way, then you have a right to point this out publicly. Even then, I would not recommend public confrontations until you are an expert in the relevant field. Then, if you see a serious flaw in the presentation, and you don't mind embarrassing the speaker, aim for the head. If you misfire, you will look like a fool, so don't try this until you've seen others do it many times, and you are sure of your aim, and you can predict the political consequences with confidence. Definitely don't do this if you are emotionally upset. The academic community does not respect emotional outbursts.

Coding

You should code under the assumption that your code will be distributed far and wide. This means that you should practice good software engineering, to make sure your code is portable, modular, and well-documented. Even if your code is never distributed, you yourself will benefit from coding this way, because you will often reuse your own code in other projects, sometimes years later, after you forget how it all fits together.

There is a trade-off between coding for the short term and coding for the long term. When you absolutely must finish some piece of code before a deadline, you might cut a few corners. However, I would like most of the code produced by my students and staff to be useful in the long term, even by people who may not have been around when the code was written. If you cut corners, I expect you to eventually return to your quick-and-dirty code, and make it long-lasting.

There are several kinds of software engineering tools that you should start using from day one:

This document is not the place for a treatise on software engineering. However, whenever I find myself wishing that somebody's code was better, I will add guidelines to this list.

Experimentation

The first law of data is that there are no laws for data. Anything and everything can and does happen in real-world data. In data-intensive research such as empirical NLP, you should never trust data to follow any rules, even if its documentation says that it does. E.g., you should not expect English text to be purely English, or even purely text! This means that your data-processing software should be robust to aberrations, and should fail gracefully if its input is unexpected. Given that it's infeasible to write perfectly robust software, you should often check-point and spot-check your processes, to make sure that the output looks reasonable, before it's fed into the next process. As you get more experience in NLP, you will develop an intuition for what "reasonable" looks like. However, some problems will be obvious even to novices.
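As a rough illustration of failing gracefully and spot-checking, here is a minimal Python sketch. The function names (clean_lines, check_reject_rate), the rejection categories, and the thresholds are all hypothetical choices made for this example, not part of any software our group uses. The idea is to tally unexpected input instead of crashing on it, and to stop with a readable message when too much of the input looks wrong.

```python
from collections import Counter


def clean_lines(raw_lines, max_chars=10_000):
    """Split raw byte lines into usable text lines and a tally of rejects."""
    kept = []
    stats = Counter()
    for raw in raw_lines:
        stats["total"] += 1
        try:
            text = raw.decode("utf-8").strip()
        except UnicodeDecodeError:
            # "Text" that is not text at all: count it and move on.
            stats["not_utf8"] += 1
            continue
        if not text:
            stats["empty"] += 1
        elif len(text) > max_chars:
            stats["too_long"] += 1
        elif any(ch < " " and ch != "\t" for ch in text):
            # Control characters usually mean binary data slipped in.
            stats["control_chars"] += 1
        else:
            kept.append(text)
            stats["kept"] += 1
    return kept, stats


def check_reject_rate(stats, max_reject_rate=0.05):
    """Stop with a readable message if too much of the input was rejected."""
    total = stats["total"]
    rejected = total - stats["kept"]
    if total and rejected / total > max_reject_rate:
        raise ValueError(f"rejected {rejected} of {total} lines: {dict(stats)}")


# Toy input (made up): two good lines, one binary line, one empty line.
sample = [b"The cat sat.\n", b"\xff\xfe\x00binary junk\n", b"\n", b"Le chat dort.\n"]
kept, stats = clean_lines(sample)
print(dict(stats))
# A deliberately loose threshold so that this tiny demo passes the check.
check_reject_rate(stats, max_reject_rate=0.5)
```

Printing the tally after each stage is a cheap spot check: a sudden jump in any reject category is a sign that something upstream deserves a closer look before the output is fed to the next process.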

In empirical NLP, we often deal with very large data sets and expensive algorithms. It is fairly common for experiments to run for weeks or even months. Unfortunately, computer systems are not as reliable as we'd like them to be. Therefore, if you are writing software to manage long-running experiments, you should write it with the expectation that it will crash before it finishes. In particular, you should save intermediate results to disk. You should also make it easy to restart from various places in the middle of the process. The latter is always easier if your code is written as a hierarchy of little programs, rather than as one monolithic program.
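One way to get both saved intermediate results and easy restarts is to give every stage its own checkpoint file, and to skip any stage whose checkpoint already exists. The Python sketch below is only an illustration under assumptions: run_stage, the two toy stages, and the choice of JSON files in a temporary directory are hypothetical, and a real experiment would more likely be split into separate programs, as suggested above.

```python
import json
import os
import tempfile


def run_stage(name, func, inputs, workdir):
    """Run one stage unless its checkpoint already exists; return its result."""
    path = os.path.join(workdir, name + ".json")
    if os.path.exists(path):
        # A previous run got this far: reuse its output instead of recomputing.
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    result = func(inputs)
    # Write to a temporary file, then rename, so that a crash cannot leave
    # a half-written checkpoint that looks complete.
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(result, f)
    os.replace(tmp, path)
    return result


def tokenize(sentences):
    """Toy stage 1: split each sentence on whitespace."""
    return [s.split() for s in sentences]


def count_tokens(tokenized):
    """Toy stage 2: count how often each token occurs."""
    counts = {}
    for sentence in tokenized:
        for token in sentence:
            counts[token] = counts.get(token, 0) + 1
    return counts


workdir = tempfile.mkdtemp()
sentences = ["the cat sat", "the dog sat"]
tokens = run_stage("tokens", tokenize, sentences, workdir)
counts = run_stage("counts", count_tokens, tokens, workdir)
print(counts)
```

To restart from the middle, rerun the same script: finished stages are read back from disk. To force a stage to rerun, delete its checkpoint file.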

When experimenting with machine learning on large datasets, remember that some theories can be effectively tested with just a tiny subset of all your data. Although you should usually use all of the available data to generate final results, you can save much time by doing exploratory work on small samples.

When you do exploratory data analysis (EDA), resist the temptation to "collect butterflies." When people collect butterflies, they focus on the most beautiful/exotic ones. But Zipf's law says that exotic events are rare. Empirical NLP systems should work well as often as possible, so it's more important to analyze the frequent cases than the rare/unusual cases. All this is another way of saying that when you need to pick some data to stare at, you should use random sampling, not sampling by (real or apparent) importance.
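If the data is too large to load into memory, a uniform random sample can still be drawn in a single pass with reservoir sampling (the classic "Algorithm R"). The Python sketch below is one possible way to do it; the function name reservoir_sample and the fixed default seed are choices made for this example. The fixed seed makes the sample repeatable, so that you and your colleagues can stare at the same examples.

```python
import random


def reservoir_sample(items, k, seed=0):
    """Return k items chosen uniformly at random from an iterable of any length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i replaces a random reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Toy stream (made up) standing in for a corpus that does not fit in memory.
lines = (f"sentence {n}" for n in range(100_000))
for line in reservoir_sample(lines, k=5):
    print(line)
```

The same function works for picking a small subset for exploratory experiments: if the stream has fewer than k items, it simply returns all of them.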

Writing

Ideas and Work in Progress

If your ideas and results are not described in prose text, then they aren't worth sh*t. Keep a research journal handy. Write down everything. Guard it with your life. It doesn't have to be well thought-out or well-edited to be written down. The point is to record useful stuff in a medium that doesn't disappear as easily as memory (yours or RAM). If you're like most academics, then you have too much valuable information to keep in your head. To make it easy to write things down on the train or in the middle of the night, don't rely on a computer -- just use a pen. Ask to see my journal for more specific suggestions.

Technical Reports

When your ideas are more fully developed, or when you have a coherent set of results, it's time to start putting together a more organized manuscript. Click here for specific advice on how to write up research, including advice on content, format, and the software tools that you should use. When you have written something that approximates a complete and coherent document (of any length), it's time to circulate it in the Paper Network. One of the best ways to improve your papers (and the research that they are based on) is to get feedback from your peers. If your peers can't understand what you wrote, then neither will your reviewers or anybody else. In any case, you'll be amazed at what a second pair of eyes can catch. Every paper should circulate through the paper network at least once during its development. The benefit is bidirectional, since reading your peers' papers keeps you informed and stimulates the flow of ideas.

When you want my feedback on something you've written, please give it to me single-sided, double-spaced, in a font that's at least as large as the standard 11pt font in LaTeX. Please proofread it yourself before giving it to me, to catch obvious errors and typos. In particular, please use a spell-checker every time. These simple measures make it easier for me to read your work and to give you useful advice. I will usually mark up your document with suggestions. Some of them might involve the editing notation listed here.

Publishing

In research, the old saying about "publish or perish" is pretty close to the truth. When you have some research results that others might benefit from, it's time to think about publishing. If you don't have a PhD, then you must arrange for me to read your article before submitting it for publication. I recommend it even if you have a PhD, but then it's not mandatory.

Authorship

Everybody who directly and significantly contributes to a publication, in terms of ideas, design, implementation, experimentation, writing, and editing, should usually be listed as one of the authors. Work on research infrastructure that benefits multiple projects does not count as a direct contribution.

Authorship carries responsibilities. If your name is on a publication, then the research community will hold you responsible for the work's quality and originality. Therefore, every author has the right to read the final version of a publication before it is submitted, and to request modifications. An author who disagrees with the content of a publication has the right to withdraw their name from the list of authors. Never include somebody as an author without their informed consent.

The default order of authors is alphabetical by last name. Modifications to this default might be appropriate when some author(s) contributed far more than others. People can be very sensitive on this issue, so consult with me if you're not sure.

Conferences

In fast-moving fields like Empirical NLP, the most prestigious publication venues are conferences, not journals. This fact of life has the advantage that you can develop a reputation more quickly. It has the disadvantage that the conference paper reviewing process is rather imprecise. You should get used to the idea that even the very best papers are sometimes rejected by short-sighted reviewers. E.g., when Mitch Marcus first tried to publish a paper about the Penn Treebank, it was rejected with the question "What possible relevance could this have to NLP?"

Before you submit the final version of a conference paper for publication, be sure to review these guidelines. Pay special attention to the required acknowledgments.

Most academic conferences require the authors of accepted papers to orally present their paper at the conference. As a rule of thumb, if you are working in my group, and you are the first author of a conference paper, then I will sponsor you to attend that conference to present your work. This simple rule has several non-obvious implications:

Before you attend your first conference, read up to section 6 of Networking on the Network: A Guide to Professional Skills for PhD Students by Phil Agre. Then, consult with me about effective conferencing strategies.

Presentations

The quality of oral presentations makes a huge difference to whether people will pay attention to you and your work. The key ingredients of a good presentation are planning and sufficient practice. You should prepare conference presentations far enough in advance to schedule a practice talk in front of our research group, and to make revisions afterwards. When you prepare your first few talks, you will be amazed at how long it can take to prepare good visual aids. Allocate at least a week just for that. Practice talks are mandatory for my students and staff. Before your practice talk, you should practice by yourself, in an empty room, until your presentation sounds fluent and natural. It may seem strange to talk out loud in an empty room, but you'll get used to it. Ask me for assistance if you get stuck.

Also, please note that the purpose of an oral or poster presentation at a conference is not to summarize the paper. The purpose is to advertise the paper, to get people to read it. If you're not sure about the difference, come talk to me about it.

Here is some specific advice on how to prepare and deliver oral presentations on technical material such as research results.

Vacations

For better or for worse, the major NLP conferences take place between May and August each year. This means that the paper submission deadlines for these conferences are mostly between December and February, inclusive. Since you will want to submit your latest and greatest results, you will typically want to write papers during these winter months. An advantage of doing this at the same time as everybody else is that you can participate in the paper exchange network (see "Writing" above). Therefore, it would behoove you to schedule vacations during some other time of the year. When I was in grad school, I usually attached vacations to conference trips, in order to save money on airfare.

Expenses

If you need reimbursement for an expense that I approved, please get a reimbursement form from my administrative assistant, fill it out, and give it to me with all your receipts. I will sign it, and send it to the right place. Depending on the amount, you'll get your money in a few days to a couple of weeks.

Most of the money that I use to cover travel expenses comes from grants that impose certain restrictions on how that money can be spent. In particular, most of these grants require that any airfare must be purchased from a U.S.-based airline. See here for more details. Do not break these rules without consulting me first.

Other Useful Tips

If you can think of things you've learned about working with me that other people would benefit from knowing, please email them to me so that I can add them here!
Dan Melamed (melamed at cs dot nyu dot edu)
Last modified: Tue Sep 12 15:57:30 EDT 2006