
I am trying to read and parse the data dictionary for the Census Bureau's American Community Survey Public Use Microdata Sample (PUMS), as found here. In other words: how do I write a regex to parse a reasonably well-formed, multi-line data dictionary?

It is reasonably well formed, though with a handful of glitches where explanatory notes are inserted.

I think my preferred result is a data frame with one row per variable, with all the value labels for a given variable serialized into a single dictionary stored in a value-dictionary field on that row (a hierarchical, JSON-like format would not be bad either, just more complex).

I have the following code:

import pandas as pd
import re
import urllib2

# Download the data dictionary as a single string (.read() so string methods work on it).
data = urllib2.urlopen('http://www.census.gov/acs/www/Downloads/data_documentation/pums/DataDict/PUMSDataDict13.txt').read()

# Replace newline characters so we can use dots and find everything until a double
# carriage return (replaced to ||) with a lookahead assertion.
data = data.replace('\n', '|')

datadict = pd.DataFrame(
    re.findall("([A-Z]{2,8})\s{2,9}([0-9]{1})\s{2,6}\|\s{2,4}([A-Za-z\-\(\) ]{3,85})",
               data, re.MULTILINE),
    columns=['variable', 'width', 'description'])
datadict.head(5)

+----+----------+-------+------------------------------------------------+ 
| | variable | width | description         | 
+----+----------+-------+------------------------------------------------+ 
| 0 | RT  | 1  | Record Type         | 
+----+----------+-------+------------------------------------------------+ 
| 1 | SERIALNO | 7  | Housing unit         | 
+----+----------+-------+------------------------------------------------+ 
| 2 | DIVISION | 1  | Division code         | 
+----+----------+-------+------------------------------------------------+ 
| 3 | PUMA  | 5  | Public use microdata area code (PUMA) based on | 
+----+----------+-------+------------------------------------------------+ 
| 4 | REGION | 1  | Region code         | 
+----+----------+-------+------------------------------------------------+ 
| 5 | ST  | 2  | State Code          | 
+----+----------+-------+------------------------------------------------+ 

So far so good: the list of variables is there, along with the width in characters of each.

I can expand this and capture the additional lines (where the value labels live), like so:

datadict_exp = pd.DataFrame(
    re.findall("([A-Z]{2,9})\s{2,9}([0-9]{1})\s{2,6}\|\s{4}([A-Za-z\-\(\)\;\<\> 0-9]{2,85})\|\s{11,15}([a-z0-9]{0,2})[ ]\.([A-Za-z/\-\(\) ]{2,120})",
               data, re.MULTILINE),
    columns=['variable', 'width', 'description', 'value_1', 'label_1'])
datadict_exp.head(5)

+----+----------+-------+---------------------------------------------------+---------+--------------+ 
| id | variable | width | description          | value_1 | label_1  | 
+----+----------+-------+---------------------------------------------------+---------+--------------+ 
| 0 | DIVISION | 1  | Division code          | 0  | Puerto Rico | 
+----+----------+-------+---------------------------------------------------+---------+--------------+ 
| 1 | REGION | 1  | Region code          | 1  | Northeast | 
+----+----------+-------+---------------------------------------------------+---------+--------------+ 
| 2 | ST  | 2  | State Code          | 1  | Alabama/AL | 
+----+----------+-------+---------------------------------------------------+---------+--------------+ 
| 3 | NP  | 2  | Number of person records following this housin... | 0  | Vacant unit | 
+----+----------+-------+---------------------------------------------------+---------+--------------+ 
| 4 | TYPE  | 1  | Type of unit          | 1  | Housing unit | 
+----+----------+-------+---------------------------------------------------+---------+--------------+ 

So that captures the first value and its associated label. My problem with the regex here is how to repeat the multi-line match starting at \s{11,15} and running to the end, since some variables have a large number of distinct values (ST, the state code, is followed by about 50 lines giving the value and label for each state).

Early on I replaced the carriage returns in the source file with pipes, figuring I could then shamelessly rely on the dot to match everything up to the double carriage return (now ||) that marks the end of that particular variable's block, and that is where I am stuck.

So: how do I repeat a multi-line pattern an arbitrary number of times?
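
One workaround I have considered is to do it in two passes instead of repeating a group in one pattern: first capture each variable's whole block (everything up to the || that marks the original blank line), then run a second regex over that block to pull out every value/label pair. This is only a rough sketch against the piped string `data` built above; the character classes are approximate, and it does not yet handle wrapped label lines or the range notation described below:

# Rough two-pass sketch, reusing `data` and the imports from above.
# One block per variable: name, width, then everything up to the '||' that
# marks the blank line after the last value label.
block_re = re.compile(r"([A-Z0-9]{2,9})\s{2,9}([0-9])\s*\|(.*?)\|\|")

# Within a block, ' .' separates each value from its label, e.g. "0 .Puerto Rico".
pair_re = re.compile(r"([0-9A-Za-z-]+)\s+\.([^|]+)")

records = []
for name, width, body in block_re.findall(data):
    description = body.split('|')[0].strip()          # first line of the block
    value_labels = {v.strip(): lab.strip() for v, lab in pair_re.findall(body)}
    records.append({'variable': name, 'width': width,
                    'description': description, 'value_labels': value_labels})

datadict_full = pd.DataFrame(records)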

(A subsequent complication is that some variables are not fully enumerated in the dictionary but are instead shown with valid value ranges. For instance NP (the number of persons associated with the same household) is denoted by `02..20` followed by a description. If I don't account for this, my parse will of course miss such entries.)
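
To cover that range notation, the value token in the second pattern above could be allowed to be either a single code or a pair joined by `..`; again, just a sketch:

# Accept either a single code ("01") or a range written as "02..20" / "-9999..09999".
value_token = r"[0-9A-Za-z-]+(?:\.\.[0-9A-Za-z-]+)?"
pair_re = re.compile(r"({tok})\s+\.([^|]+)".format(tok=value_token))

pair_re.findall("   02..20 .Number of person records|")
# [('02..20', 'Number of person records')]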


possible duplicate of [re.search multiple lines Python](http://stackoverflow.com/questions/18521319/re-search-multiple-lines-python) –

Answer


This isn't a regex, but I parsed PUMSDataDict2013.txt and PUMS_Data_Dictionary_2009-2013.txt (Census ACS 2013 documentation, FTP server) with the Python 3.x script below. I used pandas.DataFrame.from_dict and pandas.concat to build a hierarchical data structure, also shown below.

Function

Python 3.x function to parse PUMSDataDict2013.txt and PUMS_Data_Dictionary_2009-2013.txt:

import collections 
import os 


def parse_pumsdatadict(path:str) -> collections.OrderedDict: 
    r"""Parse ACS PUMS Data Dictionaries. 

    Args: 
     path (str): Path to downloaded data dictionary. 

    Returns: 
     ddict (collections.OrderedDict): Parsed data dictionary with original 
      key order preserved. 

    Raises: 
     FileNotFoundError: Raised if `path` does not exist. 

    Notes: 
     * Only some data dictionaries have been tested.[^urls] 
     * Values are all strings. No data types are inferred from the 
      original file. 
     * Example structure of returned `ddict`: 
      ddict['title'] = '2013 ACS PUMS DATA DICTIONARY' 
      ddict['date'] = 'August 7, 2015' 
      ddict['record_types']['HOUSING RECORD']['RT']\ 
       ['length'] = '1' 
       ['description'] = 'Record Type' 
       ['var_codes']['H'] = 'Housing Record or Group Quarters Unit' 
      ddict['record_types']['HOUSING RECORD'][...] 
      ddict['record_types']['PERSON RECORD'][...] 
      ddict['notes'] = 
       ['Note for both Industry and Occupation lists...', 
       '* In cases where the SOC occupation code ends...', 
       ...] 

    References: 
     [^urls]: http://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/ 
      PUMSDataDict2013.txt 
      PUMS_Data_Dictionary_2009-2013.txt 

    """ 
    # Check arguments. 
    if not os.path.exists(path): 
     raise FileNotFoundError(
      "Path does not exist:\n{path}".format(path=path)) 
    # Parse data dictionary. 
    # Note: 
    # * Data dictionary keys and values are "codes for variables", 
    # using the ACS terminology, 
    # https://www.census.gov/programs-surveys/acs/technical-documentation/pums/documentation.html 
    # * The data dictionary is not all encoded in UTF-8. Replace encoding 
    # errors when found. 
    # * Catch instances of inconsistently formatted data. 
    ddict = collections.OrderedDict() 
    with open(path, encoding='utf-8', errors='replace') as fobj: 
     # Data dictionary name is line 1. 
     ddict['title'] = fobj.readline().strip() 
     # Data dictionary date is line 2. 
     ddict['date'] = fobj.readline().strip()  
     # Initialize flags to catch lines. 
     (catch_var_name, catch_var_desc, 
     catch_var_code, catch_var_note) = (None,)*4 
     var_name = None 
     var_name_last = 'PWGTP80' # Necessary for unformatted end-of-file notes. 
     for line in fobj: 
      # Replace tabs with 4 spaces 
      line = line.replace('\t', ' '*4).rstrip() 
      # Record type is section header 'HOUSING RECORD' or 'PERSON RECORD'. 
      if (line.strip() == 'HOUSING RECORD' 
       or line.strip() == 'PERSON RECORD'): 
       record_type = line.strip() 
       if 'record_types' not in ddict: 
        ddict['record_types'] = collections.OrderedDict() 
       ddict['record_types'][record_type] = collections.OrderedDict() 
      # A newline precedes a variable name. 
      # A newline follows the last variable code. 
      elif line == '': 
       # Example inconsistent format case: 
       # WGTP54  5 
       #  Housing Weight replicate 54 
       # 
       #   -9999..09999 .Integer weight of housing unit 
       if (catch_var_code 
        and 'var_codes' not in ddict['record_types'][record_type][var_name]): 
        pass 
       # Terminate the previous variable block and look for the next 
       # variable name, unless past last variable name. 
       else: 
        catch_var_code = False 
        catch_var_note = False 
        if var_name != var_name_last: 
         catch_var_name = True 
      # Variable name is 1 line with 0 space indent. 
      # Variable name is followed by variable description. 
      # Variable note is optional. 
      # Variable note is preceded by newline. 
      # Variable note is 1+ lines. 
      # Variable note is followed by newline. 
      elif (catch_var_name and not line.startswith(' ') 
       and var_name != var_name_last): 
       # Example: "Note: Public use microdata areas (PUMAs) ..." 
       if line.lower().startswith('note:'): 
        var_note = line.strip() # type(var_note) == str 
        if 'notes' not in ddict['record_types'][record_type][var_name]: 
         ddict['record_types'][record_type][var_name]['notes'] = list() 
        # Append a new note. 
        ddict['record_types'][record_type][var_name]['notes'].append(var_note) 
        catch_var_note = True 
       # Example: """ 
       # Note: Public Use Microdata Areas (PUMAs) designate areas ... 
       # population. Use with ST for unique code. PUMA00 applies ... 
       # ... 
       # """ 
       elif catch_var_note: 
        var_note = line.strip() # type(var_note) == str 
        if 'notes' not in ddict['record_types'][record_type][var_name]: 
         ddict['record_types'][record_type][var_name]['notes'] = list() 
        # Concatenate to most recent note. 
        ddict['record_types'][record_type][var_name]['notes'][-1] += ' '+var_note 
       # Example: "NWAB  1 (UNEDITED - See 'Employment Status Recode' (ESR))" 
       else: 
        # type(var_note) == list 
        (var_name, var_len, *var_note) = line.strip().split(maxsplit=2) 
        ddict['record_types'][record_type][var_name] = collections.OrderedDict() 
        ddict['record_types'][record_type][var_name]['length'] = var_len 
        # Append a new note if exists. 
        if len(var_note) > 0: 
         if 'notes' not in ddict['record_types'][record_type][var_name]: 
          ddict['record_types'][record_type][var_name]['notes'] = list() 
         ddict['record_types'][record_type][var_name]['notes'].append(var_note[0]) 
        catch_var_name = False 
        catch_var_desc = True 
        var_desc_indent = None 
      # Variable description is 1+ lines with 1+ space indent. 
      # Variable description is followed by variable code(s). 
      # Variable code(s) is 1+ line with larger whitespace indent 
      # than variable description. Example:""" 
      # PUMA00  5  
      #  Public use microdata area code (PUMA) based on Census 2000 definition for data 
      #  collected prior to 2012. Use in combination with PUMA10.   
      #   00100..08200 .Public use microdata area codes 
      #     77777 .Combination of 01801, 01802, and 01905 in Louisiana 
      #    -0009 .Code classification is Not Applicable because data 
      #       .collected in 2012 or later    
      # """ 
      # The last variable code is followed by a newline. 
      elif (catch_var_desc or catch_var_code) and line.startswith(' '): 
       indent = len(line) - len(line.lstrip()) 
       # For line 1 of variable description. 
       if catch_var_desc and var_desc_indent is None: 
        var_desc_indent = indent 
        var_desc = line.strip() 
        ddict['record_types'][record_type][var_name]['description'] = var_desc 
       # For lines 2+ of variable description. 
       elif catch_var_desc and indent <= var_desc_indent: 
        var_desc = line.strip() 
        ddict['record_types'][record_type][var_name]['description'] += ' '+var_desc 
       # For lines 1+ of variable codes. 
       else: 
        catch_var_desc = False 
        catch_var_code = True 
        is_valid_code = None 
        if not line.strip().startswith('.'): 
         # Example case: "01 .One person record (one person in household or" 
         if ' .' in line: 
          (var_code, var_code_desc) = line.strip().split(
           sep=' .', maxsplit=1) 
          is_valid_code = True 
         # Example inconsistent format case:""" 
         #   bbbb. N/A (age less than 15 years; never married) 
         # """ 
         elif '. ' in line: 
          (var_code, var_code_desc) = line.strip().split(
           sep='. ', maxsplit=1) 
          is_valid_code = True 
         else: 
          raise AssertionError(
           "Program error. Line unaccounted for:\n" + 
           "{line}".format(line=line)) 
         if is_valid_code: 
          if 'var_codes' not in ddict['record_types'][record_type][var_name]: 
           ddict['record_types'][record_type][var_name]['var_codes'] = collections.OrderedDict() 
          ddict['record_types'][record_type][var_name]['var_codes'][var_code] = var_code_desc 
        # Example case: ".any person in group quarters)" 
        else: 
         var_code_desc = line.strip().lstrip('.') 
         ddict['record_types'][record_type][var_name]['var_codes'][var_code] += ' '+var_code_desc 
      # Example inconsistent format case:""" 
      # ADJHSG  7  
      # Adjustment factor for housing dollar amounts (6 implied decimal places) 
      # """ 
      elif (catch_var_desc and 
       'description' not in ddict['record_types'][record_type][var_name]): 
       var_desc = line.strip() 
       ddict['record_types'][record_type][var_name]['description'] = var_desc 
       catch_var_desc = False 
       catch_var_code = True 
      # Example inconsistent format case:""" 
      # WGTP10  5 
      #  Housing Weight replicate 10 
      #   -9999..09999 .Integer weight of housing unit 
      # WGTP11  5 
      #  Housing Weight replicate 11 
      #   -9999..09999 .Integer weight of housing unit 
      # """ 
      elif ((var_name == 'WGTP10' and 'WGTP11' in line) 
       or (var_name == 'YOEP12' and 'ANC' in line)): 
       # type(var_note) == list 
       (var_name, var_len, *var_note) = line.strip().split(maxsplit=2) 
       ddict['record_types'][record_type][var_name] = collections.OrderedDict() 
       ddict['record_types'][record_type][var_name]['length'] = var_len 
       if len(var_note) > 0: 
        if 'notes' not in ddict['record_types'][record_type][var_name]: 
         ddict['record_types'][record_type][var_name]['notes'] = list() 
        ddict['record_types'][record_type][var_name]['notes'].append(var_note[0]) 
       catch_var_name = False 
       catch_var_desc = True 
       var_desc_indent = None 
      else: 
       if (catch_var_name, catch_var_desc, 
        catch_var_code, catch_var_note) != (False,)*4: 
        raise AssertionError(
         "Program error. All flags to catch lines should be set " + 
         "to `False` by end-of-file.") 
       if var_name != var_name_last: 
        raise AssertionError(
         "Program error. End-of-file notes should only be read "+ 
         "after `var_name_last` has been processed.") 
       if 'notes' not in ddict: 
        ddict['notes'] = list() 
       ddict['notes'].append(line) 
    return ddict 

Create the hierarchical dataframe (formatted below as Jupyter Notebook cells):

In [ ]: 
import pandas as pd 
ddict = parse_pumsdatadict(path=r'/path/to/PUMSDataDict2013.txt') 
tmp = dict() 
for record_type in ddict['record_types']: 
    tmp[record_type] = pd.DataFrame.from_dict(ddict['record_types'][record_type], orient='index') 
df_ddict = pd.concat(tmp, names=['record_type', 'var_name']) 
df_ddict.head() 

Out[ ]: 
+----------------+----------+--------+----------------------------------------------------+----------------------------------------------------+----------------------------------------------------+ 
| record_type    | var_name | length | description                                        | var_codes                                          | notes                                              | 
+----------------+----------+--------+----------------------------------------------------+----------------------------------------------------+----------------------------------------------------+ 
| HOUSING RECORD | ACCESS   | 1      | Access to the Internet                             | {'b': 'N/A (GQ)', '1': 'Yes, with subscription...  | NaN                                                | 
+----------------+----------+--------+----------------------------------------------------+----------------------------------------------------+----------------------------------------------------+ 
|                | ACR      | 1      | Lot size                                           | {'b': 'N/A (GQ/not a one-family house or mobil...  | NaN                                                | 
+----------------+----------+--------+----------------------------------------------------+----------------------------------------------------+----------------------------------------------------+ 
|                | ADJHSG   | 7      | Adjustment factor for housing dollar amounts (...  | {'1000000': '2013 factor (1.000000)'}              | [Note: The value of ADJHSG inflation-adjusts r...  | 
+----------------+----------+--------+----------------------------------------------------+----------------------------------------------------+----------------------------------------------------+ 
|                | ADJINC   | 7      | Adjustment factor for income and earnings doll...  | {'1007549': '2013 factor (1.007549)'}              | [Note: The value of ADJINC inflation-adjusts r...  | 
+----------------+----------+--------+----------------------------------------------------+----------------------------------------------------+----------------------------------------------------+ 
|                | AGS      | 1      | Sales of Agriculture Products (Yearly sales)       | {'b': 'N/A (GQ/vacant/not a one family house o...  | [Note: no adjustment factor is applied to AGS.]    | 
+----------------+----------+--------+----------------------------------------------------+----------------------------------------------------+----------------------------------------------------+ 
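
If, as the question asks, you want one row per variable with the value labels serialized into a single dictionary, `df_ddict` above already carries that in its `var_codes` column. As a small follow-up sketch (assuming `df_ddict` from the cell above; the helper name `df_codes` is just illustrative), the codes can also be exploded into a long-format table with one row per (variable, code) pair:

import pandas as pd

rows = []
for (record_type, var_name), row in df_ddict.iterrows():
    var_codes = row['var_codes']
    if isinstance(var_codes, dict):               # skip variables without value codes (NaN)
        for code, label in var_codes.items():
            rows.append({'record_type': record_type, 'var_name': var_name,
                         'code': code, 'label': label})

df_codes = pd.DataFrame(rows)
df_codes.head()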


Package with `parse_pumsdatadict`: https://github.com/stharrold/dsdemos/blob/d15e9d3b661e2d432a7396e4db60a12931eb07f0/dsdemos/census.py#L27-L257 –


Related blog post: https://stharrold.github.io/20160110-etl-census-with-python.html –
